# grep

#### First we'll use cat to create a simple text file of entries that are supposed to be phone numbers.

In [None]:
%%bash
cat > data_to_grep.txt
303-456-7545
303-7585-5156
303-45-84d5
720-54885543
7205489456
7204585214
30545224475
3034587548
7205485425
7201212125
3030145487
30345445458
3132548
3-32-dd2-*d
asdf
PHONE NUMBER
87201458549
97201414141
454584
303
3038748787
7204545455
7201234564
3035455454
721212548
720-545-8789
720-8754569
3-03-12-21458
30345-545656982
3034567845
303-125-4578


#### Suppose we're only interested in valid numbers with area codes 303 and 720.  Below we pipe the file to tr and delete (-d) all punctuation characters, then pipe the result to grep with extended regular expressions enabled (-E).  Our regex matches lines that consist of 303 or 720 followed by exactly seven digits.

In [None]:
%%bash 
cat data_to_grep.txt | tr -d '[:punct:]' | grep -E "^303[[:digit:]]{7}$|^720[[:digit:]]{7}$"
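The two stages of this pipeline are easier to follow on a tiny input. The sketch below uses three made-up lines (they are not from data_to_grep.txt) and hypothetical variable names. It also shows an equivalent pattern that groups the two area codes so they share one pair of anchors.

```shell
# Three hypothetical sample lines, used only for illustration.
sample='303-456-7545
720-54885543
asdf-303'

# Stage 1: tr -d '[:punct:]' strips hyphens and any other punctuation.
cleaned=$(printf '%s\n' "$sample" | tr -d '[:punct:]')
printf '%s\n' "$cleaned"

# Stage 2: keep only lines that are exactly 303 or 720 plus seven digits.
# Grouping the alternatives lets both share the ^ and $ anchors.
valid=$(printf '%s\n' "$cleaned" | grep -E '^(303|720)[[:digit:]]{7}$')
printf '%s\n' "$valid"
```

Only the first sample line survives: the second has eleven digits once the hyphen is removed, and the third still contains letters.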

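#### Notice what happens when we remove the ^ anchor from the second alternative of the or expression.  Alternation binds more loosely than the anchors, so 720 plus seven digits now only has to sit at the end of the line, and entries such as 87201458549 and 97201414141 are counted as well.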

In [None]:
%%bash 
cat data_to_grep.txt | tr -d '[:punct:]' | grep -E "^303[[:digit:]]{7}$|720[[:digit:]]{7}$" | wc -l
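To isolate the effect of the missing anchor, here is a minimal sketch that counts matches (grep -c) for both patterns on two entries copied from the file above. The variable names are illustrative.

```shell
# Two entries from data_to_grep.txt: one valid, one with a stray leading digit.
input='7201234564
87201458549'

# Both alternatives anchored at the start: only the ten-digit line matches.
anchored=$(printf '%s\n' "$input" | grep -Ec '^303[[:digit:]]{7}$|^720[[:digit:]]{7}$')

# Second alternative not anchored at the start: the eleven-digit line matches
# too, because its last ten characters are 720 followed by seven digits.
unanchored=$(printf '%s\n' "$input" | grep -Ec '^303[[:digit:]]{7}$|720[[:digit:]]{7}$')

echo "anchored: $anchored, unanchored: $unanchored"
```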

#### Let's experiment with grep's recursive search option (-R), which searches a directory and all of its subdirectories.  Running the following cell will make some nested directories.  You can delete these when you're finished with them.

In [None]:
%%bash
mkdir a_folder
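# This second mkdir fails because a_folder already exists; 2> writes the error message to error.txt
mkdir a_folder 2> ./a_folder/error.txt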
cd a_folder
echo -e  "foo bar\nnorf spam\neggs ham\nfoo\n" > a_data.txt
mkdir b_folder
cp a_data.txt ./b_folder/b_data.txt
cd b_folder
echo -e 'SPAM\n' >> b_data.txt
cd ../..
# -R searches a_folder recursively; -i ignores case, so both spam and SPAM match
grep -Ri 'SPAM' a_folder
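Two flags that combine well with -R are -l (print only the names of matching files) and -n (prefix each match with its line number). The sketch below builds its own throwaway directory with hypothetical file names, so it does not touch a_folder.

```shell
# Build a small temporary tree; all names here are made up for the example.
tmp=$(mktemp -d)
mkdir -p "$tmp/top/nested"
printf 'eggs ham\nnorf spam\n' > "$tmp/top/a.txt"
printf 'SPAM\n' > "$tmp/top/nested/b.txt"
printf 'nothing here\n' > "$tmp/top/nested/c.txt"

# -R recurses, -i ignores case, -l lists only the files that contain a match.
files=$(cd "$tmp" && grep -Ril 'spam' top | sort)
printf '%s\n' "$files"

# -n shows file name, line number, and the matching line.
numbered=$(cd "$tmp" && grep -Rin 'spam' top | sort)
printf '%s\n' "$numbered"

# Clean up the temporary tree.
rm -rf "$tmp"
```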

#### Let's add some duplicate entries to b_data.txt in the b_folder directory and try to detect repeated words.

In [None]:
%%bash
echo -e "foo foo bar\nbar\nhello hello hello\nfoo bar\n" >> ./a_folder/b_folder/b_data.txt

In [None]:
%%bash
grep -E "([[:alpha:]]+) \1" < ./a_folder/b_folder/b_data.txt

#### Above we're enclosing the pattern we want to recall in () and using \1 to refer back to the text it matched.  To detect triplicates, we can simply recall it again.

In [None]:
%%bash
grep -E "([[:alpha:]]+) \1 \1" < ./a_folder/b_folder/b_data.txt
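One caveat with this pattern is that it can also fire inside longer words. The sketch below uses made-up lines to show the false positives and one possible remedy, the -w flag, which requires the whole match to start and end on word boundaries. It assumes a grep that supports back-references with -E (GNU and BSD grep do, although POSIX does not require it).

```shell
# Made-up lines: only the second contains a genuinely repeated word.
lines='this is fine
foo foo bar
foo food'

# The bare pattern also matches "is is" inside "this is"
# and "foo foo" inside "foo food", so all three lines count.
loose=$(printf '%s\n' "$lines" | grep -Ec '([[:alpha:]]+) \1')

# With -w the match must be bounded by non-word characters or line edges.
strict=$(printf '%s\n' "$lines" | grep -Ew '([[:alpha:]]+) \1')

echo "loose matches: $loose"
echo "strict match:  $strict"
```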

## awk
#### .........and some more grep

#### Let's download another file using curl.  This is data on baby names from the UCI ML repository.

In [None]:
%%bash
# Note: -O saves the file under the name given in the URL; -o lets us choose the output file name ourselves.

curl -o 'names.csv' 'https://archive.ics.uci.edu/ml/machine-learning-databases/00591/name_gender_dataset.csv'

#### Let's examine the names that begin with 'Ben' or 'Jen'.

In [None]:
%%bash
grep -Ei "^(Ben)|^(Jen)" names.csv

#### Now we'll print just the names instead of all four fields.

In [None]:
%%bash
grep -Ei "^(Ben)|^(Jen)" names.csv | awk 'BEGIN {FS = ","} ; {print $1}'
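The filtering and the field selection can also be done by awk alone. This is a sketch on made-up rows. The Name,Gender,Count,Probability layout is an assumption about names.csv, and -F, is simply the command-line equivalent of setting FS in a BEGIN block.

```shell
# Hypothetical rows; the column layout is assumed, the values are invented.
rows='Name,Gender,Count,Probability
Ben,M,100,0.01
Jenna,F,50,0.005
Alice,F,70,0.007
jenny,F,30,0.003'

# tolower() makes the match case-insensitive, like grep -i.
# The header row is skipped naturally because "name" does not match.
names=$(printf '%s\n' "$rows" | awk -F, 'tolower($1) ~ /^(ben|jen)/ {print $1}')
printf '%s\n' "$names"
```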

#### Now let's count those names and total the number of occurrences.

In [None]:
%%bash
grep -Ei "^(Ben)|^(Jen)" names.csv | 
awk 'BEGIN {names=0; occurrences=0; FS=","} {names+=1; occurrences+=$3} END {print "names: " names "\n" "occurrences: " occurrences}'
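A quick way to sanity-check this kind of aggregation is to run it on a few rows small enough to total by hand. The rows below are invented and only mimic the assumed layout of names.csv. Every line that reaches awk has already passed the filter, so each one is counted.

```shell
# Three invented rows standing in for the output of the grep filter.
rows='Ben,M,100,0.01
Jenna,F,50,0.005
jenny,F,30,0.003'

# n counts the rows; occurrences sums the third field (the count column).
summary=$(printf '%s\n' "$rows" |
  awk -F, '{ n += 1; occurrences += $3 }
           END { print "names: " n; print "occurrences: " occurrences }')
printf '%s\n' "$summary"
```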

#### What's the cumulative probability of having a name starting with 'Ben' or 'Jen'?

In [None]:
%%bash
grep -Ei "^(Ben)|^(Jen)" names.csv | 
awk 'BEGIN {total=0; FS=","} {total+=$4; print total} END {print "\n" "total: " total}'
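A running total is easier to read when each value is printed beside the name that produced it, with a fixed number of decimals. The following sketch does this with awk's printf on invented rows. The three-decimal format is an arbitrary choice for the example.

```shell
# Invented rows in the assumed Name,Gender,Count,Probability layout.
rows='Ben,M,100,0.01
Jenna,F,50,0.005
jenny,F,30,0.003'

# Add the fourth field to a running total and print it next to each name.
running=$(printf '%s\n' "$rows" |
  awk -F, '{ total += $4; printf "%s %.3f\n", $1, total }')
printf '%s\n' "$running"
```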

In [None]:
#### Let's curl another file.

In [None]:
!curl -o "hotel.csv" "https://archive.ics.uci.edu/ml/machine-learning-databases/00398/dataset-CalheirosMoroRita-2017.csv"

#### Now let's count the words

In [None]:
%%bash
cat hotel.csv | grep -Eo "[[:alpha:]]+" | sort | uniq -c | sort -nr

#### Note the difference between the regular expression above and the one below.  The first, [[:alpha:]]+, matches only runs of letters.  The second, \w+, matches runs of word characters, which include digits and underscores as well as letters.

In [None]:
%%bash
cat hotel.csv | grep -Eo "\w+" | sort | uniq -c
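The difference shows up clearly on a single made-up string that mixes letters, digits, and an underscore. The sketch assumes a grep that supports \w, as GNU and BSD grep do.

```shell
# A made-up string, not taken from hotel.csv.
text='room 101, pool_area'

# [[:alpha:]]+ keeps letters only, so the underscore splits a token in two.
letters=$(printf '%s\n' "$text" | grep -Eo '[[:alpha:]]+')
printf '%s\n' "$letters"

# \w+ also accepts digits and underscores.
wordchars=$(printf '%s\n' "$text" | grep -Eo '\w+')
printf '%s\n' "$wordchars"
```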

#### OK, let's compute a total word count for the hotel data by summing the per-word counts from uniq -c.

In [None]:
%%bash
cat hotel.csv | grep -Eo "\w+" | sort | uniq -c | sort -nr | awk 'BEGIN {count=0} {count+=$1} END{print count}'

#### Let's double-check that result using the wc command with the words flag (-w).

In [None]:
%%bash
cat hotel.csv | wc -w

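#### ..........pretty close.  The two totals need not agree exactly: wc -w splits only on whitespace, while \w+ also splits on punctuation such as commas and hyphens.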
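The two counting rules can be compared on one invented line. Because grep -o prints one match per line, piping it to wc -l is a shorter way to get the total than summing the uniq -c counts. The tr -d ' ' step only strips the padding that some wc implementations add.

```shell
# An invented line with commas, a semicolon, and a hyphen.
line='Las Vegas,5 stars;free-wifi'

# wc -w counts whitespace-separated tokens.
by_space=$(printf '%s\n' "$line" | wc -w | tr -d ' ')

# grep -Eo '\w+' emits one word-character run per line; wc -l counts them.
by_regex=$(printf '%s\n' "$line" | grep -Eo '\w+' | wc -l | tr -d ' ')

echo "wc -w: $by_space, regex: $by_regex"
```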

## sed
https://www.geeksforgeeks.org/sed-command-linux-set-2/

sed is a stream editor: it transforms text as the text streams through it.  Its most commonly used feature is search and replace, but it also handles finding, filtering, insertion, and deletion, and it works with regular expressions.  See [here](https://www.grymoire.com/Unix/Sed.html#uh-0) for a good sed overview.

#### In sed, s is for substitution.  It replaces text that matches a regular expression with a new value.  Let's make some data to experiment with...

In [None]:
%%bash
cat > some_data_to_sed.txt
Colorado is a state. Colorado!
What is the shape of Colorado?
Is Colorado a big state?
Colorado


#### Use the s command with the global (g) flag to replace all instances of Colorado with Indiana.  Without the g flag, sed replaces only the first instance of Colorado on each line.

In [None]:
%%bash
sed 's/Colorado/Indiana/g' < some_data_to_sed.txt > sed_out.txt
cat sed_out.txt
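The effect of the g flag is easiest to see on a line that contains the word twice. This small sketch runs both forms on the first line of the sample data. The variable names are illustrative.

```shell
# First line of some_data_to_sed.txt, which contains Colorado twice.
line='Colorado is a state. Colorado!'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/Colorado/Indiana/')
printf '%s\n' "$first_only"

# With g: every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/Colorado/Indiana/g')
printf '%s\n' "$all"
```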

#### Note that the parts of sed's s command are conventionally delimited by forward slashes, but almost any other character works too (a backslash or newline does not).  Whichever delimiter you choose, the command needs three of them.  Here we'll use a colon.

In [None]:
%%bash
sed 's:Indiana:California:g' < sed_out.txt > sed_out2.txt
cat sed_out2.txt
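An alternative delimiter pays off most when the pattern itself contains slashes, as file paths do. The sketch below rewrites a hypothetical path both ways. The two commands are equivalent, but the colon version needs no escaping.

```shell
# A hypothetical path used only for illustration.
path='/usr/local/bin'

# With / as the delimiter, every slash in the pattern must be escaped.
escaped=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With : as the delimiter, the slashes can be written as-is.
readable=$(printf '%s\n' "$path" | sed 's:/usr/local:/opt:')

printf '%s\n%s\n' "$escaped" "$readable"
```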

#### Replacing a pattern with a modified version of that pattern.  In the above examples we've hard-coded our replacement string.  What if we want to replace a match with a modified version of itself?  We can use the '&' character as a reference to the matched text.  Note that -E (required on macOS, also accepted by GNU sed) or -r (GNU sed) enables extended regular expressions, which the unescaped {2,} quantifier below needs.

In [None]:
%%bash
sed -r 's/[aeiou]{2,}/(&)/g' < sed_out.txt > sed_out3.txt
cat sed_out3.txt
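Where & recalls the whole match, numbered groups (\1, \2, ...) recall parts of it, much like the grep back-references earlier. The sketch below shows both on made-up input. It uses -E rather than -r on the assumption that the installed sed accepts it, as current GNU and BSD versions do.

```shell
# & stands for the entire match: wrap each run of two or more vowels in ().
marked=$(printf 'Indiana California\n' | sed -E 's/[aeiou]{2,}/(&)/g')
printf '%s\n' "$marked"

# \1, \2, \3 stand for the parenthesised groups: insert hyphens into
# a ten-digit number (an entry from the phone data earlier).
formatted=$(printf '7205458789\n' |
  sed -E 's/^([0-9]{3})([0-9]{3})([0-9]{4})$/\1-\2-\3/')
printf '%s\n' "$formatted"
```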