#### Topic:  GREP, Unix
#### Date:  12/5/15
#### By:  Reshama

#### Organization:  WiMLDS

#### Dataset:  movies, election, etc.

----

**grep** = **G**lobal search for **R**egular **E**xpressions and **P**rint

References:

[Grep Commands in Unix Examples](http://www.folkstalk.com/2012/01/grep-command-in-unix-examples.html)

[15 Practical Grep Command Examples](http://www.thegeekstuff.com/2009/03/15-practical-unix-grep-command-examples/)

#### The basic syntax of the grep command is
 
>#### $ grep [options] pattern [list of files]

### Ones I Most Commonly Use

<br>

1)  look for text within a directory and sub-directories<br>
    R = recursive

**$ grep -R "_django1" ./* **

<br>

2)  count the lines that contain a word (c = count matching lines, i = ignore case)

**$ grep -i -c "comedy" movies.csv**

<br>

3)  count the lines in every file in the current directory

**$ wc -l \***


<br>

4)  remove header from a data file (start at second row of current file and output it to new file)

**$ tail -n +2 election2012.csv > election2012_noheader.csv**

<br>

5)  create a subset of the data (easier to work with while doing preliminary coding); take first 10 lines of a file and output them to a new file

**$ head -10 election2012_noheader.csv > election10.csv**

---

[MovieLens Latest Datasets](http://grouplens.org/datasets/movielens/)

Small: 100,000 ratings and 2,200 tag applications applied to 9,000 movies by 700 users. Last updated **8/2015**.

[README.html](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html)

---

In [96]:
# how many lines are in the files
!wc -l *

    4640 election2012.csv
    8928 movies.csv
    1208 using_grep_unix_wimlds.ipynb
   14776 total


In [97]:
# print working directory
!pwd

/Users/reshamashaikh/_ds/wimlds_grep


In [98]:
# head, by default, prints first 10 lines of a file
!head movies.csv

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action


In [99]:
# look at first 5 lines of a file
!head -5 movies.csv

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance


In [100]:
# redirect every row that contains '2015' to a new file
!grep "2015" movies.csv > movies_2015.csv

In [101]:
# list number of lines in a file
!wc -l *

    4640 election2012.csv
    8928 movies.csv
      36 movies_2015.csv
    1208 using_grep_unix_wimlds.ipynb
   14812 total


In [102]:
# look at file
#!more movies_2015.csv

In [103]:
# print the file; notice the matches include rows where '2015' is part of the movieId, not only the release year
!cat movies_2015.csv

2015,"Absent-Minded Professor, The (1961)",Children|Comedy|Fantasy
42015,Casanova (2005),Action|Adventure|Comedy|Drama|Romance
115713,Ex Machina (2015),Drama|Sci-Fi|Thriller
117529,Jurassic World (2015),Action|Adventure|Sci-Fi|Thriller
119145,Kingsman: The Secret Service (2015),Action|Adventure|Comedy|Crime
120466,Chappie (2015),Action|Thriller
120635,Taken 3 (2015),Action|Crime|Thriller
122882,Mad Max: Fury Road (2015),Action|Adventure
122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi
125916,Fifty Shades of Grey (2015),Drama
127098,Louis C.K.: Live at The Comedy Store (2015),Comedy
127146,Kurt Cobain: Montage of Heck (2015),Documentary
127152,Going Clear: Scientology and the Prison of Belief (2015),Documentary
127188,Advantageous (2015),Children|Drama|Sci-Fi
128842,Dragonheart 3: The Sorcerer's Curse (2015),Action|Adventure|Fantasy
129354,Focus (2015),Comedy|Crime|Drama|Romance
129937,Run All Night (2015),Action|Crime|Drama|Mystery|Thriller
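
The search above matches '2015' anywhere on the line, so it also picks up movieIds such as 2015 and 42015. One way to restrict the match to the release year is to search for the literal text `(2015)`. The following is a minimal sketch on a three-row sample copied from the output above; the file name `sample.csv` and the temporary directory are illustrative assumptions.

```shell
# Build a tiny sample from three rows shown above (sample.csv is a made-up name)
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.csv" <<'EOF'
2015,"Absent-Minded Professor, The (1961)",Children|Comedy|Fantasy
42015,Casanova (2005),Action|Adventure|Comedy|Drama|Romance
115713,Ex Machina (2015),Drama|Sci-Fi|Thriller
EOF

# Plain substring search: counts every row, because movieIds match too
loose=$(grep -c "2015" "$tmpdir/sample.csv")

# -F treats the pattern as a fixed string, so the parentheses are literal
strict=$(grep -c -F "(2015)" "$tmpdir/sample.csv")

echo "loose=$loose strict=$strict"
```

On this sample the loose search counts all three rows, while the fixed-string search counts only the one film released in 2015.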

In [104]:
# redirect every row that contains '2014' to a new file
!grep "2014" movies.csv > movies_2014.csv

In [105]:
# how many lines in each file
!wc  -l *

    4640 election2012.csv
    8928 movies.csv
     129 movies_2014.csv
      36 movies_2015.csv
    1208 using_grep_unix_wimlds.ipynb
   14941 total


### CAT:  combine two files

In [106]:
# let's combine the two files; '>>' appends their contents to movies1415.csv (re-running this cell would append the rows again)
!cat movies_2014.csv movies_2015.csv >> movies1415.csv

In [107]:
# check lines in file, see that they add up
!wc -l movies*.csv

    8928 movies.csv
     165 movies1415.csv
     129 movies_2014.csv
      36 movies_2015.csv
    9258 total
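
When a notebook cell may be re-run, it helps to know how the two redirection operators differ: `>` overwrites the target file, while `>>` appends to it. A small sketch with throwaway files (all file names here are hypothetical) shows the effect of running each form twice.

```shell
# Work in a temporary directory with two small made-up input files
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/part1.txt"
printf 'c\n' > "$tmpdir/part2.txt"

# '>' truncates the target first, so running it twice gives the same result
cat "$tmpdir/part1.txt" "$tmpdir/part2.txt" > "$tmpdir/combined_overwrite.txt"
cat "$tmpdir/part1.txt" "$tmpdir/part2.txt" > "$tmpdir/combined_overwrite.txt"

# '>>' appends, so running it twice doubles the content
cat "$tmpdir/part1.txt" "$tmpdir/part2.txt" >> "$tmpdir/combined_append.txt"
cat "$tmpdir/part1.txt" "$tmpdir/part2.txt" >> "$tmpdir/combined_append.txt"

# tr strips the padding that some wc implementations add
overwrite_lines=$(wc -l < "$tmpdir/combined_overwrite.txt" | tr -d ' ')
append_lines=$(wc -l < "$tmpdir/combined_append.txt" | tr -d ' ')
echo "overwrite=$overwrite_lines append=$append_lines"
```

If the line counts of a combined file stop adding up, an accidental second append is a likely cause.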


### GREP:  how many lines contain a particular word?

#### How many lines contain 'love' in the full movie dataset?

In [108]:
!grep -c "love" movies.csv

6


#### How many movies in the full dataset again?  (how many lines in the file?)

In [109]:
!wc -l movies.csv

    8928 movies.csv


#### How about the word 'comedy'?

In [110]:
!grep -c "comedy" movies.csv

0


#### that's interesting:  count is 0

#### That doesn't seem right.  Maybe search 'Comedy'?

In [111]:
!grep -c "Comedy" movies.csv

3262


#### There must be a way to ignore case, right?
#### Yes, there is!

In [112]:
!grep -i -c "comedy" movies.csv

3262


In [113]:
# let's count the lines where 'drama' appears as a genre in the combined 2014 and 2015 file
!grep -c "drama" movies1415.csv

0


In [114]:
# add the '-i' option, which ignores case
!grep -c -i "drama" movies1415.csv

65


#### Let's try the word 'love' again, ignoring case

In [115]:
!grep -i -c "love" movies.csv

137
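
Note that `grep -c` reports the number of matching lines, not the number of times the word occurs; a line containing the word twice is counted once. To count every occurrence, one common approach is to print each match on its own line with `-o` and count those lines. A minimal sketch on a made-up file (`words.txt` is hypothetical):

```shell
# Three lines: the first contains the word twice, the second not at all
tmpdir=$(mktemp -d)
printf 'love love\nno match\nLove\n' > "$tmpdir/words.txt"

# -c counts matching lines (case-insensitive because of -i)
line_count=$(grep -i -c "love" "$tmpdir/words.txt")

# -o prints each match on its own line; wc -l then counts the matches
occurrence_count=$(grep -i -o "love" "$tmpdir/words.txt" | wc -l | tr -d ' ')

echo "lines=$line_count occurrences=$occurrence_count"
```

Here two lines match, but the word occurs three times.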


In [116]:
# notice that without the -c option, grep prints the matching lines themselves
!grep -i "jennifer" movies.csv

3557,Jennifer 8 (1992),Mystery|Thriller
71205,Jennifer's Body (2009),Comedy|Horror|Sci-Fi|Thriller


---

### Use 'curl' to read in a webpage and save to an html file

In [117]:
!curl http://www.thisismetis.com/ > metis.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42029  100 42029    0     0   107k      0 --:--:-- --:--:-- --:--:--  107k


In [118]:
# check number of lines in each file
!wc -l *

    4640 election2012.csv
     667 metis.html
    8928 movies.csv
     165 movies1415.csv
     129 movies_2014.csv
      36 movies_2015.csv
    1208 using_grep_unix_wimlds.ipynb
   15773 total


In [119]:
# let's also see how many lines, words and bytes are in each file
# wc = word count
# columns:  lines, words, bytes (bytes equal characters for plain ASCII text)
!wc  *.*

    4640   15793 1899176 election2012.csv
     667    4214   42029 metis.html
    8928   37684  442596 movies.csv
     165     664    8247 movies1415.csv
     129     525    6385 movies_2014.csv
      36     139    1862 movies_2015.csv
    1208    3035   29989 using_grep_unix_wimlds.ipynb
   15773   62054 2430284 total
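
The three columns can also be requested one at a time with `-l` (lines), `-w` (words) and `-c` (bytes). The sketch below uses a throwaway file (`sample.txt` is a hypothetical name) and reads it through `<` so that `wc` prints only the number, without the file name.

```shell
# Two lines, three words, fourteen bytes including the newlines
tmpdir=$(mktemp -d)
printf 'one two\nthree\n' > "$tmpdir/sample.txt"

# tr strips the leading padding that some wc implementations print
lines=$(wc -l < "$tmpdir/sample.txt" | tr -d ' ')
words=$(wc -w < "$tmpdir/sample.txt" | tr -d ' ')
bytes=$(wc -c < "$tmpdir/sample.txt" | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"
```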


In [120]:
!head *.html

<!-- 

  _____   _____ _____ __ ______
 /     \_/ __ \   __\  |/  ___/
|  Y Y  \  ___/|  | |  |\___ \ 
|__|_|  /\___  >__| |__/____  >
      \/     \/             \/ 

             ( )
         _____|_____       


In [121]:
# count the lines in the html file that contain 'jpg'
!grep -c "jpg" metis.html

36


In [122]:
# how many lines of the html file contain this word?
!grep -c "metis" metis.html

48


----------

### Election Data
[Full US 2012 election county-level results to download](http://www.theguardian.com/news/datablog/2012/nov/07/us-2012-election-county-results-download#data)

In [123]:
!wc -l *

    4640 election2012.csv
     667 metis.html
    8928 movies.csv
     165 movies1415.csv
     129 movies_2014.csv
      36 movies_2015.csv
    1208 using_grep_unix_wimlds.ipynb
   15773 total


In [124]:
# print the first 2 lines of the file
!head -2 election2012.csv

State Postal,,County Number,FIPS Code,County Name,,Office Description,Precincts Reporting,Total Precincts,State Candidate Number (varies between state),TOTAL VOTES CAST,Order,Party,First name,Middle name,Last name,Junior?,Use Junior,Incumbent,Votes,Winner,National Politician ID (NPID),State Candidate Number (varies between state),Order,Party,First name,Middle name,Last name,Junior?,Use Junior,Incumbent,Votes,Winner,National Politician ID (NPID),State Candidate Number (varies between state),Order,Party,First name,Middle name,Last name,Junior?,Use Junior,Incumbent,Votes,Winner,National Politician ID (NPID),State Candidate Number (varies between state),Order,Party,First name,Middle name,Last name,Junior?,Use Junior,Incumbent,Votes,Winner,National Politician ID (NPID),State Candidate Number (varies between state),Order,Party,First name,Middle name,Last name,Junior?,Use Junior,Incumbent,Votes,Winner,National Politician ID (NPID),State Candidate Number (varies between state),Order,Party,Firs

In [125]:
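# Regular expressions are a huge topic; this is only an introductory example of their usage.
# Note: [DE] is a character class, so "^[DE].*" matches lines that start with 'D' or 'E',
# not only lines that start with 'DE'. The count below therefore covers every state code
# beginning with D or E, not just Delaware. To count Delaware rows alone, anchor the
# literal prefix instead, e.g. "^DE,".

# how many rows of data (by county) start with 'D' or 'E'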
!grep -c "^[DE].*" election*.*

6


In [126]:
# return count of how many records begin with 'C' or 'A' ([CA] is a character class; use "^CA," for California only)
!grep -c "^[CA].*" election2012.csv

456
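
The difference between a character class such as `[DE]` and a literal prefix such as `DE` is easy to check on a small file. The rows below are made up for illustration (they only mimic the leading state-code column of the election file), and `states.csv` is a hypothetical name.

```shell
# Made-up rows: a header, one DC row, two DE rows, one CA row
tmpdir=$(mktemp -d)
cat > "$tmpdir/states.csv" <<'EOF'
State Postal,County Name
DC,District of Columbia
DE,Kent
DE,Sussex
CA,Alameda
EOF

# Character class: lines starting with 'D' or 'E' (DC and both DE rows)
class_count=$(grep -c "^[DE]" "$tmpdir/states.csv")

# Literal prefix followed by the field separator: only the DE rows
literal_count=$(grep -c "^DE," "$tmpdir/states.csv")

echo "class=$class_count literal=$literal_count"
```

Including the comma in the pattern keeps the match confined to the first field.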


### Let's say we want to remove the header row, first line

In [127]:
!tail -n +2 election2012.csv > election2012_noheader.csv

In [128]:
# let's compare line counts for each file.  the noheader file should have one fewer line
!wc -l *election*

    4640 election2012.csv
    4639 election2012_noheader.csv
    9279 total


### Create a subset of the data.  Working with a smaller subset before running our programs on the full dataset saves time

#### Take the first 10 lines and save them to a new file

In [129]:
!head -10 election2012_noheader.csv > election10.csv

In [130]:
# check that new file has 10 records 
!wc -l *election*

      10 election10.csv
    4640 election2012.csv
    4639 election2012_noheader.csv
    9289 total
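
The two steps above (drop the header, then keep 10 rows) can also be chained with a pipe (`|`), which passes one command's output directly to the next command instead of writing an intermediate file. This is a sketch on a generated stand-in file; `data.csv` and `sample.csv` are hypothetical names, not part of the election data.

```shell
# Generate a stand-in file: one header line plus 25 data rows
tmpdir=$(mktemp -d)
{
  echo "id,value"
  i=1
  while [ "$i" -le 25 ]; do
    echo "$i,x"
    i=$((i + 1))
  done
} > "$tmpdir/data.csv"

# tail skips the header (starts at line 2); head keeps the first 10 remaining rows
tail -n +2 "$tmpdir/data.csv" | head -n 10 > "$tmpdir/sample.csv"

sample_lines=$(wc -l < "$tmpdir/sample.csv" | tr -d ' ')
first_row=$(head -n 1 "$tmpdir/sample.csv")
echo "lines=$sample_lines first_row=$first_row"
```

The sketch uses `head -n 10`, the standard spelling of the `head -10` shorthand used in the cells above.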


In [131]:
# looking at first 2 rows of data
!head -2 election10.csv

AK,AK Alaska,1,0,Alaska,AK Alaska,President,437,438,6017,220596,1,Dem,Barack,,Obama,,0,1,91696,,1918,6018,2,GOP,Mitt,,Romney,,0,0,121234,X,893,6028,3,Lib,Gary,,Johnson,,0,0,5539,,31708,6142,4,Grn,Jill,,Stein,,0,0,2127,,895,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,220596
AK,AK Alaska,2001,2000,Alaska,AK Alaska,President,437,438,6017,220596,1,Dem,Barack,,Obama,,0,1,91696,,1918,6018,2,GOP,Mitt,,Romney,,0,0,121234,X,893,6028,3,Lib,Gary,,Johnson,,0,0,5539,,31708,6142,4,Grn,Jill,,Stein,,0,0,2127,,895,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,220596


-----

### GREP - Regular Expressions

In [132]:
# This searches for lines that start with a digit.
# The example only illustrates the usage of regular expressions.

#!grep -r "^[0-9].*" "/Users/reshamashaikh/_ds/pydata2015/d3-jupyter-tutorial/lib/"

!grep -r "^[0-9].*" "/Users/reshamashaikh/_ds/_02_book_machinel_python/chap2_classify/"


Binary file /Users/reshamashaikh/_ds/_02_book_machinel_python/chap2_classify//1400_02_04.png matches
Binary file /Users/reshamashaikh/_ds/_02_book_machinel_python/chap2_classify//1400_02_05.png matches
Binary file /Users/reshamashaikh/_ds/_02_book_machinel_python/chap2_classify//figure_1_1400_02_01.png matches
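
The recursive search above only reports binary files (the PNG images). When the goal is to search text, the `-I` option, available in both GNU and BSD grep, tells grep to treat binary files as non-matching. The sketch below builds a throwaway directory with one text file and one file containing a NUL byte; both file names are hypothetical.

```shell
# One text file and one "binary" file (it contains a NUL byte)
tmpdir=$(mktemp -d)
printf '1 first line\nsecond line\n' > "$tmpdir/notes.txt"
printf '1\000binary\n' > "$tmpdir/image.bin"

# -l lists the files that contain a match; both files start with a digit
with_binary=$(grep -r -l "^[0-9]" "$tmpdir" | wc -l | tr -d ' ')

# -I skips binary files, leaving only the text file
text_only=$(grep -r -I -l "^[0-9]" "$tmpdir" | wc -l | tr -d ' ')

echo "with_binary=$with_binary text_only=$text_only"
```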


### Really Miscellaneous - More Unix Commands
http://www.bath.ac.uk/bucs/tools/unix/basicunixcommands/moreunix.html


In [133]:
!date

Fri Dec 11 13:33:28 EST 2015


In [134]:
!cal

   December 2015
Su Mo Tu We Th Fr Sa
       1  2  3  4  5
 6  7  8  9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31



In [135]:
# I wonder what day of the week Valentine's Day was in the year 1952?
!cal february 1952

   February 1952
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29

