# Linux Commands for Data Scientists

In this notebook, I cover some important Linux commands that data scientists should know and demonstrate how they can help with daily work.

## Simple commands

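```ls``` is the Linux command that lists a folder's contents, including files and sub-folders. The ```-l``` flag switches to the long format, which adds permissions, owner, size, and modification time.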

In [3]:
!ls -l ../datasets/

total 14964
-rwxrwxrwx 1 ivan ivan 4270916 Jul 23 23:41 1_Log.csv
-rwxrwxrwx 1 ivan ivan 4017296 Jul 23 23:41 2_Log.csv
-rwxrwxrwx 1 ivan ivan 6038837 Jul 23 23:41 3_Log.csv
drwxrwxrwx 1 ivan ivan    4096 Jul 23 23:41 Excel
-rwxrwxrwx 1 ivan ivan    1083 Jul 23 23:41 forecasting_data_RAV4_sales.csv
-rwxrwxrwx 1 ivan ivan  981391 Jan 13 12:10 jobs.csv
-rwxrwxrwx 1 ivan ivan     796 Jul 23 23:41 test.csv


```pwd``` is the Linux command that prints the current working directory.

In [4]:
!pwd

/mnt/c/users/ychen/Desktop/GitHub/Data-Analytics-Machine-Learning-Project/Linux_Data_Science


```find``` is the Linux command that searches for files or directories by criteria such as name, type, size, or modification time.

In [2]:
# quote the pattern so the shell passes it to find unexpanded
!find . -name '*.ipynb'

./Linux_Data_Scientist.ipynb


In [5]:
# find every file named test.txt under the current folder and delete it
!find ./ -name test.txt -exec rm -f {} \;
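
Because ```-exec rm``` deletes without asking, I find it safer to preview the matches first. The sketch below is one way to do that; it works in a scratch directory made with ```mktemp```, and the file names in it are made up for illustration. Run it in a terminal or a ```%%bash``` cell.

```shell
# Work in a throwaway directory so nothing real is deleted.
workdir=$(mktemp -d)
mkdir -p "$workdir/a/b"
touch "$workdir/test.txt" "$workdir/a/b/test.txt" "$workdir/a/keep.txt"

# Step 1: preview what would be removed.
find "$workdir" -type f -name 'test.txt'

# Step 2: run the same search again, this time deleting each match.
find "$workdir" -type f -name 'test.txt' -exec rm -f {} \;

# Count the files that are left (tr strips padding some wc versions add).
remaining=$(find "$workdir" -type f | wc -l | tr -d ' ')
echo "files remaining: $remaining"

rm -rf "$workdir"
```

Adding ```-type f``` restricts the match to regular files, so a directory that happens to be called test.txt is left alone.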

## Commands for disk and memory

When you want a very high-level view of the memory situation on a Linux machine, use ```free```.

In [5]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           7.9G        7.0G        719M         17M        223M        812M
Swap:           24G        1.1G         22G


Short for disk usage, ```du``` is extremely useful for estimating the size of directories.

In [6]:
!du -sh ../datasets/

30M	../datasets/
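
A common follow-up is to rank sub-folders by size. The following is a minimal sketch under assumed conditions: the two folders and their random contents are created only for the demo, and sizes are printed in KiB (```-k```) so that a plain numeric sort can compare them.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/small" "$workdir/big"
head -c 4096 /dev/urandom > "$workdir/small/a.bin"       # about 4 KiB
head -c 1048576 /dev/urandom > "$workdir/big/b.bin"      # about 1 MiB

# -s: one total per argument, -k: report sizes in KiB.
du -sk "$workdir"/* | sort -n

# du separates size and path with a tab, so cut -f 2 keeps the path.
largest=$(du -sk "$workdir"/* | sort -n | tail -n 1 | cut -f 2)
echo "largest: $largest"

rm -rf "$workdir"
```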


```df``` stands for disk free; it reports the used and available space of each mounted file system.

In [7]:
!df -h

Filesystem      Size  Used Avail Use% Mounted on
rootfs          235G  130G  106G  56% /
none            235G  130G  106G  56% /dev
none            235G  130G  106G  56% /run
none            235G  130G  106G  56% /run/lock
none            235G  130G  106G  56% /run/shm
none            235G  130G  106G  56% /run/user
C:              235G  130G  106G  56% /mnt/c


Want to look at only a specific folder or drive? Pipe the output of ```df``` into ```grep```.

In [19]:
!df -h | grep C

C:              235G  130G  106G  56% /mnt/c
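
As an alternative to filtering with ```grep```, ```df``` also accepts a path and then reports only the file system that holds it (for example ```df -h /mnt/c``` on the machine above). The sketch below uses the current folder so it runs anywhere; pulling the percentage out with ```awk``` assumes the file system name contains no spaces.

```shell
# Passing a path limits the report to the file system containing it.
df -h .

# -P forces the one-line-per-file-system POSIX layout, which is easier to parse.
# Row 2 is the data row; column 5 is the capacity used.
use_pct=$(df -P . | awk 'NR == 2 {print $5}')
echo "current file system is $use_pct full"
```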


## Commands for simple data processing

While it's tempting to stay within the seemingly-friendly realm of GUI-based applications, becoming skilled at operating directly in the terminal can make you massively faster at typical tasks.

The simplest command in text processing is ```wc```. Its name stands for word count, but it also counts lines and bytes. Invoke it with ```wc <filename>```; the three numbers in the output are lines, words, and bytes, in that order.

In [3]:
!wc ../datasets/test.csv

 10  42 796 ../datasets/test.csv
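
Each count can also be requested on its own. Here is a small sketch on a three-line sample built from the first columns of the file above; the temporary file is created just for the demo.

```shell
sample=$(mktemp)
printf 'ID,Date,Country\n30,1/11/2008,1\n31,1/14/2008,1\n' > "$sample"

# Reading from stdin (<) keeps the file name out of the output.
lines=$(wc -l < "$sample" | tr -d ' ')   # -l counts newline characters
words=$(wc -w < "$sample" | tr -d ' ')   # -w counts whitespace-separated words
bytes=$(wc -c < "$sample" | tr -d ' ')   # -c counts bytes
echo "lines=$lines words=$words bytes=$bytes"

# Number of data rows = lines minus the header line.
echo "data rows: $((lines - 1))"

rm -f "$sample"
```

Note that ```wc -l``` counts newline characters, so a file whose last line has no trailing newline reports one line fewer than it visibly contains.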


```grep``` might be the most famous command line tool. It is a super-powered text-searching utility that prints every line matching a pattern. Invoke it with ```grep <pattern> <filename>```.

In [None]:
!grep 'Nairobi' ../datasets/test.csv
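
A few flags cover most day-to-day searches. The rows in this sketch are hypothetical and only serve to illustrate the flags; they are not the contents of test.csv.

```shell
sample=$(mktemp)
# Hypothetical rows for illustration only.
printf 'ID,City\n1,Nairobi\n2,Lagos\n3,nairobi\n4,Cairo\n' > "$sample"

grep -n 'Nairobi' "$sample"                   # -n prefixes matches with line numbers

matches=$(grep -c 'Nairobi' "$sample")        # -c counts matching lines
matches_ci=$(grep -ci 'nairobi' "$sample")    # -i ignores case
others=$(grep -vci 'nairobi' "$sample")       # -v keeps the lines that do NOT match
echo "exact=$matches any-case=$matches_ci non-matching=$others"

rm -f "$sample"
```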

```awk``` bears the distinction of being one of the least intuitively named command line tools. The name is just an abbreviation of its authors' surnames: Aho, Weinberger, and Kernighan. ```awk``` is essentially a very simple programming language optimized for dealing with delimited files (e.g. CSVs, TSVs). Invoke it with ```awk <command> <filename>```. Use ```-F``` to specify the delimiter character; inside the command, ```$1```, ```$2```, and so on refer to the fields of the current line.

In [6]:
!awk -F ',' '{print $1, $2}' ../datasets/test.csv

ID Date
30 1/11/2008
31 1/14/2008
32 1/14/2008
33 1/17/2008
34 1/19/2008
36 1/22/2008
37 1/23/2008
38 1/24/2008
39 1/26/2008
74 1/6/2008
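
Beyond printing columns, ```awk``` can filter rows and accumulate values. The sketch below uses three rows taken from the output above; the threshold of 35 is an arbitrary choice for the demo.

```shell
sample=$(mktemp)
printf 'ID,Date,Country\n30,1/11/2008,1\n36,1/22/2008,1\n74,1/6/2008,1\n' > "$sample"

# NR > 1 skips the header; $1 > 35 keeps rows whose first field exceeds 35.
filtered=$(awk -F ',' 'NR > 1 && $1 > 35 {print $1, $2}' "$sample")
echo "$filtered"

# Count the data rows and track the largest ID; END runs once after the last line.
summary=$(awk -F ',' 'NR > 1 {n++; if ($1 > max) max = $1} END {print n, max}' "$sample")
echo "rows and largest ID: $summary"

rm -f "$sample"
```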


```sed``` stands for Stream EDitor. Like ```awk```, it is Turing complete and extremely powerful. The basic syntax is ```sed 's/<find_pattern>/<replace_pattern>/g' <filename>```. The ```s``` stands for substitute, a more formal term for find-and-replace. The ```g``` stands for global; without it, ```sed``` replaces only the first occurrence of the find pattern on each line.

In [None]:
!sed 's/ /,/g' ../datasets/test.csv
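
To make the effect of the ```g``` flag concrete, here is a tiny sketch on a made-up one-line input:

```shell
line='a b c'

first_only=$(printf '%s\n' "$line" | sed 's/ /,/')    # no g: first space only
all=$(printf '%s\n' "$line" | sed 's/ /,/g')          # g: every space on the line

echo "$first_only"
echo "$all"
```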

## Commands for pipes (combining commands)

Pipes allow you to chain together as many commands as you’d like, using each command to transform the output of the last. This unlocks an enormous amount of power without requiring you to import data into a specialized tool.

The following example shows the first 3 columns of a CSV file.

In [9]:
!cat ../datasets/test.csv | awk -F ',' '{print $1, $2, $3}'

ID Date Country
30 1/11/2008 1
31 1/14/2008 1
32 1/14/2008 1
33 1/17/2008 1
34 1/19/2008 1
36 1/22/2008 1
37 1/23/2008 1
38 1/24/2008 1
39 1/26/2008 1
74 1/6/2008 1


The following example shows the first 3 columns of a CSV file and re-joins them with commas.

In [10]:
!cat ../datasets/test.csv | awk -F ',' '{print $1, $2, $3}' | sed 's/ /,/g'

ID,Date,Country
30,1/11/2008,1
31,1/14/2008,1
32,1/14/2008,1
33,1/17/2008,1
34,1/19/2008,1
36,1/22/2008,1
37,1/23/2008,1
38,1/24/2008,1
39,1/26/2008,1
74,1/6/2008,1
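
One caveat with the ```sed``` step: it turns every space into a comma, including spaces inside a field. The sketch below compares it with two alternatives that keep the comma delimiter directly, ```awk``` with the output field separator ```OFS``` set, and ```cut```. The row with a space in its third field is hypothetical and not taken from test.csv.

```shell
sample=$(mktemp)
# Hypothetical row whose third field contains a space.
printf 'ID,Date,Country,Note\n30,1/11/2008,South Africa,ok\n' > "$sample"

# awk + sed: the space inside "South Africa" is also replaced.
via_sed=$(awk -F ',' '{print $1, $2, $3}' "$sample" | sed 's/ /,/g' | tail -n 1)

# awk with OFS: fields are joined with commas, field contents are untouched.
via_ofs=$(awk -F ',' -v OFS=',' '{print $1, $2, $3}' "$sample" | tail -n 1)

# cut: select fields 1 to 3 and keep the input delimiter.
via_cut=$(cut -d ',' -f 1-3 "$sample" | tail -n 1)

echo "$via_sed"
echo "$via_ofs"
echo "$via_cut"

rm -f "$sample"
```

For the file used in this notebook the three approaches give the same result, since none of the first three columns contains a space.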


The following example displays the first 5 lines of a CSV file with line numbers.

In [None]:
!cat -n ../datasets/test.csv | head -n 5
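
Another pipeline I find handy is a quick frequency count with ```sort``` and ```uniq -c```. This is a sketch on four rows taken from the output earlier in this section; the temporary file exists only for the demo.

```shell
sample=$(mktemp)
printf 'ID,Date,Country\n30,1/11/2008,1\n31,1/14/2008,1\n32,1/14/2008,1\n33,1/17/2008,1\n' > "$sample"

# tail -n +2  : drop the header line
# cut         : keep only the Date column
# sort        : bring equal values together so uniq can count them
# uniq -c     : prefix each distinct value with its count
# sort -rn    : largest count first
# awk         : print "value count" without uniq's padding
top=$(tail -n +2 "$sample" | cut -d ',' -f 2 | sort | uniq -c |
      sort -rn | head -n 1 | awk '{print $2, $1}')
echo "most frequent date: $top"

rm -f "$sample"
```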

## Commands for redirection

```>``` is used for output redirection: sending what usually displays on screen into a file instead.

In [20]:
!free -h > file.txt

In [21]:
!cat file.txt

              total        used        free      shared  buff/cache   available
Mem:           7.9G        7.0G        633M         17M        223M        726M
Swap:           24G        1.0G         22G
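
Two related operators are worth knowing: ```>>``` appends instead of overwriting, and ```2>``` redirects error messages. The sketch below uses temporary files, and the path ```/no/such/path``` is deliberately nonexistent so that ```ls``` produces an error.

```shell
out=$(mktemp)
echo "first"  > "$out"      # > truncates the file, then writes
echo "second" > "$out"      # so "first" is gone after this line
echo "third" >> "$out"      # >> appends to what is already there
contents=$(cat "$out")
echo "$contents"

# 2> captures error messages (stderr) separately from normal output.
err=$(mktemp)
ls /no/such/path 2> "$err" || true      # || true: ignore the failing exit status
err_bytes=$(wc -c < "$err" | tr -d ' ')
echo "bytes of error text captured: $err_bytes"

rm -f "$out" "$err"
```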


Combine the pipe ```|``` and redirection ```>``` to build more complex operations.

In [12]:
!cat ../datasets/test.csv | awk -F ',' '{print $1, $2, $3}' | sed 's/ /,/g' > new_test.csv

In [13]:
!cat new_test.csv

ID,Date,Country
30,1/11/2008,1
31,1/14/2008,1
32,1/14/2008,1
33,1/17/2008,1
34,1/19/2008,1
36,1/22/2008,1
37,1/23/2008,1
38,1/24/2008,1
39,1/26/2008,1
74,1/6/2008,1
