## Topic:  GREP, Unix
#### By:  Reshama

#### Dataset:  movies, election, etc.

----

**grep** = **G**lobally search for a **R**egular **E**xpression and **P**rint (from the `ed` editor command `g/re/p`)

References:

[Grep Commands in Unix Examples](http://www.folkstalk.com/2012/01/grep-command-in-unix-examples.html)

[15 Practical Grep Command Examples](http://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/)

#### The basic syntax of the grep command is
 
>#### $ grep [options] pattern [list of files]
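
To make the three parts of that syntax concrete, here is a minimal sketch. The files `one.txt` and `two.txt` are hypothetical and are created in a temporary directory only so the snippet runs on its own.

```shell
# Hypothetical sample files, created only for this illustration.
workdir=$(mktemp -d)
printf 'Comedy\nDrama\n' > "$workdir/one.txt"
printf 'drama\ncomedy\nCOMEDY\n' > "$workdir/two.txt"
cd "$workdir" || exit 1

# options: -i (ignore case) and -n (show line numbers)
# pattern: "comedy"
# files:   one.txt two.txt
result=$(grep -i -n "comedy" one.txt two.txt)
echo "$result"
```

When more than one file is listed, grep prefixes each matching line with the name of the file it came from.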

### Ones I Most Commonly Use

<br>

1)  look for text within a directory and sub-directories (R = recursive)

**`$ grep -R "_django1" ./*`**

<br>

2)  count the lines that contain a word (i = ignore case, c = count of matching lines, not of individual occurrences)

**`$ grep -i -c "comedy" movies.csv`**

<br>

3)  count the lines in each file in the current directory (l = lines)

**`$ wc -l *`**


<br>

4)  remove header from a data file (start at second row of current file and output it to new file)

**`$ tail -n +2 election2012.csv > election2012_noheader.csv`**

<br>

5)  create a subset of the data (easier to work with while doing preliminary coding); take first 10 lines of a file and output them to a new file

**`$ head -10 election2012_noheader.csv > election10.csv`**

---

[MovieLens Latest Datasets](http://grouplens.org/datasets/movielens/)

Small: 100,000 ratings and 2,200 tag applications applied to 9,000 movies by 700 users. Last updated **8/2015**.

[README.html](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html)

### Election Data
[Full US 2012 election county-level results to download](http://www.theguardian.com/news/datablog/2012/nov/07/us-2012-election-county-results-download#data)

---

In [27]:
# print working directory
!pwd

/Users/reshamashaikh/_ds/metis/metisgh/pygotham-2016/grep_tutorial


In [28]:
# how many lines are in the files
!wc -l *

      10 election10.csv
    4640 election2012.csv
    4639 election2012_noheader.csv
     480 grep_unix_abbreviated.ipynb
    1213 grep_unix_full.ipynb
     667 metis.html
    8928 movies.csv
     165 movies1415.csv
     129 movies_2014.csv
      36 movies_2015.csv
   20907 total


In [98]:
# head, by default, prints first 10 lines of a file
!head movies.csv

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action


In [99]:
# look at first 5 lines of a file
!head -5 movies.csv

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance


In [100]:
# pipe all records that have '2015' in each row of data to a new file
# (strictly, `>` redirects output to a file rather than piping it; the pattern
# matches '2015' anywhere in the row, including the movieId column)
!grep "2015" movies.csv > movies_2015.csv

In [26]:
# preview a file

# 1 way to do it
#!more movies_2015.csv

# prints first 10 lines of file
!head movies_2015.csv

#!cat movies_2015.csv

2015,"Absent-Minded Professor, The (1961)",Children|Comedy|Fantasy
42015,Casanova (2005),Action|Adventure|Comedy|Drama|Romance
115713,Ex Machina (2015),Drama|Sci-Fi|Thriller
117529,Jurassic World (2015),Action|Adventure|Sci-Fi|Thriller
119145,Kingsman: The Secret Service (2015),Action|Adventure|Comedy|Crime
120466,Chappie (2015),Action|Thriller
120635,Taken 3 (2015),Action|Crime|Thriller
122882,Mad Max: Fury Road (2015),Action|Adventure
122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi
125916,Fifty Shades of Grey (2015),Drama
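
The first two rows above (movieId 2015 and 42015) match only because `2015` appears in the ID column, not in the release year. Here is a minimal sketch of a tighter match, assuming the release year always sits in parentheses at the end of the title, as it does in the rows shown. The file `sample_movies.csv` is a hypothetical stand-in built from a few of those rows.

```shell
# Build a tiny stand-in for movies.csv (rows copied from the output above).
workdir=$(mktemp -d)
cat > "$workdir/sample_movies.csv" <<'EOF'
movieId,title,genres
2015,"Absent-Minded Professor, The (1961)",Children|Comedy|Fantasy
42015,Casanova (2005),Action|Adventure|Comedy|Drama|Romance
115713,Ex Machina (2015),Drama|Sci-Fi|Thriller
120466,Chappie (2015),Action|Thriller
EOF

# Loose match: '2015' anywhere in the line, including the movieId column.
loose=$(grep -c "2015" "$workdir/sample_movies.csv")

# Tighter match: -F treats the pattern as a fixed string, so the parentheses
# are literal and only rows containing "(2015)" are counted.
strict=$(grep -c -F "(2015)" "$workdir/sample_movies.csv")

echo "loose: $loose, strict: $strict"
```

On this sample the loose pattern counts all four data rows, while the fixed-string pattern counts only the two 2015 releases.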


----------

In [34]:
#1)  look for text within a directory and sub-directories (R = recursive)

!grep -R "pipe"  ./*

./grep_unix_abbreviated.ipynb:    "# pipe all records that have '2015' in each row of data to a new file\n",
./grep_unix_full.ipynb:    "# pipe all records that have '2015' in each row of data to a new file\n",
./grep_unix_full.ipynb:    "# pipe all records that have '2014' in each row of data to a new file\n",
./grep_unix_full.ipynb:    "# let's combine the two files; pipe 2 files into a new file\n",
./movies.csv:3073,"Sandpiper, The (1965)",Drama|Romance
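
The last line of that output matched because `pipe` occurs inside "Sandpiper". Here is a small sketch of two options that help with recursive searches: `-l` prints only the names of matching files, and `-w` restricts the match to whole words. Both are available in GNU and BSD grep. The directory tree below is hypothetical and exists only for this illustration.

```shell
# Hypothetical directory tree, created only for this illustration.
workdir=$(mktemp -d)
mkdir -p "$workdir/notes/sub"
printf 'pipe the output\n'       > "$workdir/notes/a.txt"
printf 'Sandpiper, The (1965)\n' > "$workdir/notes/b.txt"
printf 'another pipe\n'          > "$workdir/notes/sub/c.txt"
cd "$workdir/notes" || exit 1

# -R descends into sub-directories; -l lists matching file names only.
# "pipe" also matches as a substring of "Sandpiper".
any_match=$(grep -R -l "pipe" . | sort)

# -w requires "pipe" to appear as a whole word, so b.txt drops out.
whole_word=$(grep -R -l -w "pipe" . | sort)

echo "substring matches:"
echo "$any_match"
echo "whole-word matches:"
echo "$whole_word"
```

Searching `.` instead of `./*` also includes hidden files at the top level, which the shell glob would skip.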


### GREP:  how many lines in a file contain a particular word?

In [19]:
#2)  count the lines that contain a word (i = ignore case, c = count of matching lines)

!grep -c "The" movies.csv

!grep -i -c "the" movies.csv

2102
2848
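
The two numbers above are counts of matching lines: `grep -c` counts a line once even if the word appears in it several times. Here is a minimal sketch of the difference, using `-o` (supported by GNU and BSD grep) to print each match on its own line so that `wc -l` can count occurrences. The file `titles.txt` is hypothetical.

```shell
# Hypothetical sample file, created only for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/titles.txt" <<'EOF'
The Good, the Bad and the Ugly
The Matrix
Heat
EOF

# Number of lines that contain "the" in any case.
line_count=$(grep -i -c "the" "$workdir/titles.txt")

# Number of individual occurrences: -o emits one line per match.
# tr strips the leading spaces that some wc implementations print.
occurrences=$(grep -i -o "the" "$workdir/titles.txt" | wc -l | tr -d ' ')

echo "lines: $line_count, occurrences: $occurrences"
```

The first title contains the word three times, so the sample has two matching lines but four occurrences.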


In [9]:
# 3)  count the lines in each file in the current directory (l = lines)

# check number of lines in each file
!wc -l *

      10 election10.csv
    4640 election2012.csv
    4639 election2012_noheader.csv
    1081 grep_unix_abbreviated.ipynb
    1213 grep_unix_full.ipynb
     667 metis.html
    8928 movies.csv
     165 movies1415.csv
     129 movies_2014.csv
      36 movies_2015.csv
   21508 total


In [7]:
# 4)  remove header from a data file (start at second row of current file and output it to new file)

# Use Case:   remove the header row, first line
!tail -n +2 election2012.csv > election2012_noheader.csv

# Check line count for original file and new file; noheader file should have one fewer line
!wc -l *election*

      10 election10.csv
    4640 election2012.csv
    4639 election2012_noheader.csv
    9289 total
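
Besides comparing line counts, it can help to confirm that the first row of the new file is data rather than the header. Here is a minimal sketch on a hypothetical three-line CSV. The column names and values are invented for illustration and are not taken from `election2012.csv`.

```shell
# Hypothetical CSV with a header row, created only for this illustration.
workdir=$(mktemp -d)
printf 'state,county,votes\nAK,Alaska,100\nAL,Autauga,200\n' > "$workdir/sample.csv"

# tail -n +2 starts output at line 2, which drops the header.
tail -n +2 "$workdir/sample.csv" > "$workdir/sample_noheader.csv"

# Redirecting into wc prints only the number, without the file name.
orig_lines=$(wc -l < "$workdir/sample.csv" | tr -d ' ')
new_lines=$(wc -l < "$workdir/sample_noheader.csv" | tr -d ' ')
first_row=$(head -n 1 "$workdir/sample_noheader.csv")

echo "original: $orig_lines lines, no header: $new_lines lines"
echo "first row is now: $first_row"
```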


In [5]:
#  5)  create a subset of the data (easier to work with while doing preliminary coding); 
#  take first 10 lines of a file and output them to a new file

# Use Case:  Create a subset of the data.  Work with a smaller subset before running
# all our programs on the full dataset, to save time
# Take the first 10 lines and save them to a new dataset

!head -10 election2012_noheader.csv > election10.csv

# check that the file has 10 rows (records)
!wc -l election10.csv

      10 election10.csv
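
`head -10` is the older spelling of `head -n 10`, and the `-n` form is the one specified by POSIX. Here is a minimal sketch of the same subset step, run on a hypothetical 25-row file generated on the spot.

```shell
# Generate a hypothetical 25-row file, only for this illustration.
workdir=$(mktemp -d)
i=1
while [ "$i" -le 25 ]; do
    echo "row$i,value$i"
    i=$((i + 1))
done > "$workdir/rows.csv"

# Keep the first 10 rows in a new file.
head -n 10 "$workdir/rows.csv" > "$workdir/rows10.csv"

# Confirm the size of the subset and where it stops.
subset_lines=$(wc -l < "$workdir/rows10.csv" | tr -d ' ')
last_row=$(tail -n 1 "$workdir/rows10.csv")

echo "subset has $subset_lines lines; last row: $last_row"
```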
