The `grep` (global regular expression print) command helps us search for specific text within files. We can use `grep` to find words, phrases or patterns in one or multiple files, which makes it easier to locate and manage information across the file system and in databases. For example, it can be used to locate specific error messages in log files. In bioinformatics, it is useful for identifying sequences with specific motifs or patterns in FASTA/FASTQ files.
It is also useful for retrieving records related to specific samples or conditions from large metadata files or from tables containing results.

Let us learn how to use `grep` with the files in our `tutorials` folder.

In [2]:
cd /Users/debangana/linux_basics/tutorials;ls #to execute one bash command after another, use a semicolon


1					GSE111629Annotation1Filtered.csv
DMP.csv					GSE111629Annotation1Filtered1.csv
DMP1.csv				calculator.sh
GSE111629Annotation1.csv		student_data.txt


Suppose we want to find out whether a particular CpG probe is present in the files DMP.csv and DMP1.csv.
First, let us copy a file into the directory and rename it to DMP1.csv.

In [3]:
cp /Users/debangana/Desktop/Thesis/Parkinson/annotation/DesplatesCS.csv .
mv DesplatesCS.csv DMP1.csv
ls -l

total 296
-rw-r--r--@ 1 debangana  staff      0 Jun 21 13:04 1
-rw-r--r--@ 1 debangana  staff   5237 Jun 15 14:58 DMP.csv
-rw-r--r--@ 1 debangana  staff   2445 Jun 21 22:50 DMP1.csv
-rw-r--r--@ 1 debangana  staff  83120 Jun 15 14:59 GSE111629Annotation1.csv
-rw-r--r--@ 1 debangana  staff  18896 Jun 19 15:03 GSE111629Annotation1Filtered.csv
-rw-r--r--@ 1 debangana  staff  26922 Jun 19 20:44 GSE111629Annotation1Filtered1.csv
-rwxrwxrw-@ 1 debangana  staff    147 Jun 15 18:37 calculator.sh
-rw-r--r--@ 1 debangana  staff      0 Jun 15 14:57 student_data.txt


Let us look at the first few lines of both files before searching them. The first `CpG_ID` in DMP.csv is `cg25939728`.

In [55]:
head -n 5 DMP1.csv
head -n 5 DMP.csv


"Name","Chr","Position","Relation to TSS","Gene ID","logFC","P.Value","FDR"
"cg06889422","chr22","24627294","Body","N/A","−0.28384","1.28E-08","0.003907"
"cg16133681","chr12","25801621","TSS200","IFLTD1","−0.38619","1.54E-08","0.003907"
"cg26524067","chr12","133003928","Open Sea","N/A","−0.49670","1.55E-08","0.003907"
"cg09994891","chr10","2173024","Open Sea","N/A","0.31875","1.31E-07","0.024816"
"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
"1","cg25939728",0.0628479544242751,-0.417093190689641,-0.380458701660897,0.0343760690519294,"KCNS3","Body",69
"2","cg25614726",0.135924174806047,-0.719481141920014,-0.681351758471092,0.0142492107228919,"PXDNL","Body",10
"3","cg02350320",-0.122782503874819,-0.441199672480944,-0.509810621730303,0.0152057103004108,"","",6
"4","cg03263197",-0.203956429722215,-0.570542118526064,-0.658132792699776,0.00155475053089534,"","",22

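To check whether a probe is present without printing the matching line, one option is `grep -q`, which prints nothing and reports the result only through its exit status. The sketch below is a minimal example under stated assumptions: it builds a throwaway copy of the first rows of DMP.csv in a temporary directory, and `check_probe`, `workdir`, `found` and `missing` are hypothetical names used only for illustration.

```shell
# Assumption: a temporary directory stands in for the tutorials folder,
# and the sample file holds only the first rows of DMP.csv.
workdir=$(mktemp -d)
cat > "$workdir/DMP.csv" <<'EOF'
"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
"1","cg25939728",0.0628479544242751,-0.417093190689641,-0.380458701660897,0.0343760690519294,"KCNS3","Body",69
"2","cg25614726",0.135924174806047,-0.719481141920014,-0.681351758471092,0.0142492107228919,"PXDNL","Body",10
EOF

# check_probe is a hypothetical helper: grep -q stays silent and only
# sets the exit status (0 if the pattern was found, non-zero otherwise).
check_probe() {
  if grep -q "$1" "$2"; then
    echo "present"
  else
    echo "absent"
  fi
}

found=$(check_probe "cg25939728" "$workdir/DMP.csv")
missing=$(check_probe "cg06889422" "$workdir/DMP.csv")
echo "cg25939728: $found"
echo "cg06889422: $missing"
```

This form is convenient inside scripts, where only a yes/no answer is needed.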

Suppose we want to display the number of CpG sites located on chromosome 12 in DMP1.csv.

In [18]:
grep chr12 DMP1.csv | wc -l

       7


In [21]:
grep -c chr12 DMP1.csv # alternative using the -c parameter, which counts the matching lines

7


Let us display the CpG IDs associated with chromosome 12.

In [20]:
grep chr12 DMP1.csv | awk -F ',' '{print $1}' # selects the lines containing chr12, then prints the first column (the CpG IDs)

"cg16133681"
"cg26524067"
"cg23979954"
"cg04741728"
"cg03681383"
"cg13211181"
"cg01181415"

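Because `grep chr12` matches the text anywhere on the line, a stricter variant is to test the chromosome column directly in `awk`. The following is a minimal sketch, assuming a small sample built from the first rows of DMP1.csv and assuming that no quoted field contains a comma; `workdir` and `ids` are illustrative names.

```shell
# Assumption: a temporary directory with a 5-line sample of DMP1.csv.
workdir=$(mktemp -d)
cat > "$workdir/DMP1.csv" <<'EOF'
"Name","Chr","Position","Relation to TSS","Gene ID","logFC","P.Value","FDR"
"cg06889422","chr22","24627294","Body","N/A","−0.28384","1.28E-08","0.003907"
"cg16133681","chr12","25801621","TSS200","IFLTD1","−0.38619","1.54E-08","0.003907"
"cg26524067","chr12","133003928","Open Sea","N/A","−0.49670","1.55E-08","0.003907"
"cg09994891","chr10","2173024","Open Sea","N/A","0.31875","1.31E-07","0.024816"
EOF

# Print column 1 only for rows whose second column is exactly "chr12".
# The field still carries its double quotes, so they are part of the comparison.
ids=$(awk -F ',' '$2 == "\"chr12\"" {print $1}' "$workdir/DMP1.csv")
echo "$ids"
```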

Now let us search in multiple files. Suppose we want to find the occurrences of `chr12` in all DMP files.

In [37]:
grep 'chr12' DMP.csv DMP1.csv


DMP1.csv:"cg16133681","chr12","25801621","TSS200","IFLTD1","−0.38619","1.54E-08","0.003907"
DMP1.csv:"cg26524067","chr12","133003928","Open Sea","N/A","−0.49670","1.55E-08","0.003907"
DMP1.csv:"cg23979954","chr12","25801601","TSS200","IFLTD1","−0.30554","6.65E-07","0.062331"
DMP1.csv:"cg04741728","chr12","133003907","Open Sea","N/A","−0.59444","1.17E-06","0.076787"
DMP1.csv:"cg03681383","chr12","25801522","TSS200","IFLTD1","−0.30440","1.56E-06","0.090741"
DMP1.csv:"cg13211181","chr12","25801455","1stExon","IFLTD1","−0.26773","2.52E-06","0.108120"
DMP1.csv:"cg01181415","chr12","16757954","5ʹUTR","LMO3","−0.13991","3.68E-06","0.112632"


Suppose we want to find the number of occurrences of `chr12` in each DMP file.

In [34]:
#The logic below is useful when there are multiple files that we want to search
for file in DMP*.csv; do 
  count=$(grep -c "chr12" "$file")
  echo "$file: $count occurrence"  #The echo command is used to display the desired message 
done


DMP.csv: 0 occurrence
DMP1.csv: 7 occurrence
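
When several files are given, `grep -c` itself prints one `file:count` line per file, so the loop is mainly needed when we want a custom message. The sketch below illustrates this under stated assumptions: the two sample files are small stand-ins created in a temporary directory, and `workdir` and `counts` are illustrative names.

```shell
# Assumption: small stand-ins for DMP.csv and DMP1.csv in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/DMP.csv" <<'EOF'
"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
"1","cg25939728",0.0628479544242751,-0.417093190689641,-0.380458701660897,0.0343760690519294,"KCNS3","Body",69
EOF
cat > "$workdir/DMP1.csv" <<'EOF'
"Name","Chr","Position","Relation to TSS","Gene ID","logFC","P.Value","FDR"
"cg06889422","chr22","24627294","Body","N/A","−0.28384","1.28E-08","0.003907"
"cg16133681","chr12","25801621","TSS200","IFLTD1","−0.38619","1.54E-08","0.003907"
"cg26524067","chr12","133003928","Open Sea","N/A","−0.49670","1.55E-08","0.003907"
EOF

# With more than one file, grep -c prefixes each count with the file name.
counts=$(cd "$workdir" && grep -c 'chr12' DMP*.csv)
echo "$counts"
```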


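We can also use `grep -v` to select the lines that do not contain `chr12`, that is, the CpG IDs not associated with chromosome 12. Note that the header line does not contain `chr12` either, so it is included in the output and in the count.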

In [40]:
grep -v chr12 DMP1.csv | wc -l   # counts the lines that do not contain chr12 (header line included)
grep -v chr12 DMP1.csv | awk -F ',' '{print $1}' | head -n 4 # prints the first column of the first 4 such lines

      23
"Name"
"cg06889422"
"cg09994891"
"cg11408952"

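Since the header line is counted by `grep -v`, one way to count only data rows is to drop the first line with `tail -n +2` before filtering. This is a minimal sketch, assuming a 5-line sample of DMP1.csv in a temporary directory; `workdir`, `with_header` and `without_header` are illustrative names.

```shell
# Assumption: a temporary directory with a 5-line sample of DMP1.csv.
workdir=$(mktemp -d)
cat > "$workdir/DMP1.csv" <<'EOF'
"Name","Chr","Position","Relation to TSS","Gene ID","logFC","P.Value","FDR"
"cg06889422","chr22","24627294","Body","N/A","−0.28384","1.28E-08","0.003907"
"cg16133681","chr12","25801621","TSS200","IFLTD1","−0.38619","1.54E-08","0.003907"
"cg26524067","chr12","133003928","Open Sea","N/A","−0.49670","1.55E-08","0.003907"
"cg09994891","chr10","2173024","Open Sea","N/A","0.31875","1.31E-07","0.024816"
EOF

# -v inverts the match and -c counts the selected lines; the header is counted too.
with_header=$(grep -vc 'chr12' "$workdir/DMP1.csv")

# tail -n +2 starts printing at line 2, so the header never reaches grep.
without_header=$(tail -n +2 "$workdir/DMP1.csv" | grep -vc 'chr12')

echo "non-chr12 lines, header included: $with_header"
echo "non-chr12 lines, header excluded: $without_header"
```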

Alternatively, the output of the `grep -v` command from the above example can be restricted to a desired number of lines using the `-m` parameter.

In [38]:
grep -vm3 chr12 DMP1.csv 

"Name","Chr","Position","Relation to TSS","Gene ID","logFC","P.Value","FDR"
"cg06889422","chr22","24627294","Body","N/A","−0.28384","1.28E-08","0.003907"
"cg09994891","chr10","2173024","Open Sea","N/A","0.31875","1.31E-07","0.024816"


The `grep` command can also search for text patterns recursively within directories. Let us first move up to the parent directory.

In [57]:
cd ..
pwd

/Users/debangana/linux_basics


Sometimes a column is named differently in different files even though it carries the same kind of information. In such cases `grep -i` is useful for finding the name and location of the column, as it ignores the case of the string being searched for. The `grep -R` command searches all files under the given directory for the specified pattern.

In [63]:
grep -Ri "p.val" tutorials 


tutorials/DMP.csv:"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
tutorials/DMP1.csv:"Name","Chr","Position","Relation to TSS","Gene ID","logFC","P.Value","FDR"
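
One detail worth keeping in mind is that `grep` treats the pattern as a regular expression, so the `.` in "p.val" matches any single character. If a literal dot is intended, the `-F` parameter makes `grep` treat the pattern as a fixed string. The sketch below shows the difference; the file `columns.txt` and the column name `p_value` are hypothetical and used only for illustration.

```shell
# Assumption: columns.txt is a made-up list of column names, one per line.
workdir=$(mktemp -d)
cat > "$workdir/columns.txt" <<'EOF'
adj.P.Val
P.Value
p_value
EOF

# As a regular expression, "." matches any character, including "_".
regex_hits=$(grep -ci 'p.val' "$workdir/columns.txt")

# With -F the pattern is a literal string, so only names with a real dot match.
fixed_hits=$(grep -Fci 'p.val' "$workdir/columns.txt")

echo "regex matches: $regex_hits"
echo "fixed-string matches: $fixed_hits"
```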


We can also list only the names of the files that contain a match for a specific pattern.

In [87]:
pwd
grep -Ril "p.val" tutorials # the -l parameter displays only the names of the files with matches, not the matching lines


/Users/debangana/linux_basics
tutorials/DMP.csv
tutorials/DMP1.csv


Suppose we want to locate exactly where a string pattern, say "p.val", occurs in the files. The `grep -n` command is useful here, as it displays the line number of each match.

In [89]:
grep -Rin "p.val" tutorials # combines -R, -i and -n: recursive search inside tutorials, case-insensitive, with line numbers


tutorials/DMP.csv:1:"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
tutorials/DMP1.csv:1:"Name","Chr","Position","Relation to TSS","Gene ID","logFC","P.Value","FDR"


In [100]:
grep -A 2 "P.Val" DMP.csv #prints matched line and 2 lines after the matched line


"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
"1","cg25939728",0.0628479544242751,-0.417093190689641,-0.380458701660897,0.0343760690519294,"KCNS3","Body",69
"2","cg25614726",0.135924174806047,-0.719481141920014,-0.681351758471092,0.0142492107228919,"PXDNL","Body",10


In [101]:
grep -B 2 "P.Val" DMP.csv #prints the matched line and up to 2 lines before it; here the match is on the first line, so nothing precedes it



"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"


In [102]:
grep -C 2 "P.Val" DMP.csv #prints 2 lines before and after the matched line

"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
"1","cg25939728",0.0628479544242751,-0.417093190689641,-0.380458701660897,0.0343760690519294,"KCNS3","Body",69
"2","cg25614726",0.135924174806047,-0.719481141920014,-0.681351758471092,0.0142492107228919,"PXDNL","Body",10

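Because "P.Val" matches the header on the first line, the `-B` and `-C` examples above have no preceding lines to show. The sketch below repeats the idea with a pattern that matches in the middle of the file, assuming a small sample of DMP.csv in a temporary directory; `workdir`, `before`, `after` and `around` are illustrative names.

```shell
# Assumption: a temporary directory with the first 5 lines of DMP.csv.
workdir=$(mktemp -d)
cat > "$workdir/DMP.csv" <<'EOF'
"","CpG_ID","late_cont_post_conv","Change_Controls","Change_Converters","adj.P.Val","Gene","Gene_group","Rank"
"1","cg25939728",0.0628479544242751,-0.417093190689641,-0.380458701660897,0.0343760690519294,"KCNS3","Body",69
"2","cg25614726",0.135924174806047,-0.719481141920014,-0.681351758471092,0.0142492107228919,"PXDNL","Body",10
"3","cg02350320",-0.122782503874819,-0.441199672480944,-0.509810621730303,0.0152057103004108,"","",6
"4","cg03263197",-0.203956429722215,-0.570542118526064,-0.658132792699776,0.00155475053089534,"","",22
EOF

# -B 1: the matched line plus the one line before it
before=$(grep -B 1 'cg25614726' "$workdir/DMP.csv")
# -A 1: the matched line plus the one line after it
after=$(grep -A 1 'cg25614726' "$workdir/DMP.csv")
# -C 1: one line of context on both sides of the match
around=$(grep -C 1 'cg25614726' "$workdir/DMP.csv")

echo "$around"
```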