---
title: "PDS: Filtering and processing input with bash scripts"
subtitle: "Examples and exercises from linux2_all_191017.pdf"
author: "Alfredo Hernández"
date: "19 October 2017"
output:
html_notebook:
highlight: pygments
theme: cosmo
toc: yes
pdf_document:
toc: yes
---
# Filtering and processing
## Work (page 64)
- First we do a `head` to see the structure of the file
```{bash}
FILE=data/articles-large/articles-large.csv
head -n2 $FILE
```
- How many unique authors are on that list?
```{bash}
# Count unique authors: drop the header line and empty fields first
cut -d ',' -f3 data/articles-large/articles-large.csv | grep -v "Author" | grep . | sort | uniq -c | wc -l
```
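Since `wc -l` just counts lines, the `-c` flag of `uniq` is redundant here; an equivalent, slightly shorter pipeline uses `sort -u`:
```{bash}
# Equivalent count: deduplicate and count in one step
cut -d ',' -f3 data/articles-large/articles-large.csv | grep -v "Author" | grep . | sort -u | wc -l
```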
- How many articles did each author write?
```{bash}
# Normal way
grep -i "Article" data/articles-large/articles-large.csv | cut -d ',' -f3 | sort | uniq -c
```
```{bash}
# Slightly faster way (not needed in this case)
cut -d ',' -f2,3 data/articles-large/articles-large.csv | grep -i "Article" | cut -d ',' -f2 | sort | uniq -c
```
## Work (page 72)
- Create a new list of January 20 articles
```{bash}
grep -i "20 Jan 2017" data/articles-large/articles-large.csv
```
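To actually create the new list as a file rather than just printing it, redirect the output (the filename below is an arbitrary choice):
```{bash}
# Save the filtered articles to a new file (filename is illustrative)
grep -i "20 Jan 2017" data/articles-large/articles-large.csv > data/articles-20jan2017.csv
wc -l data/articles-20jan2017.csv
```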
- Extract the authors of those articles with the number of words each author wrote
```{bash}
# Using grep
grep -i "20 Jan 2017" data/articles-large/articles-large.csv | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' | cut -d ',' -f3,8 | sort -u
```
Notice that we call `awk` to remove the commas (`,`) that appear between quotation marks (`"`), so the comma-separated fields line up again. In other words, black magic.
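To demystify it a little, here is the same one-liner on a made-up sample line: splitting on `"` makes every even-numbered field the inside of a quoted string, so `gsub` strips commas only there:
```{bash}
# Even-numbered "-separated fields are inside quotes; strip commas only there
echo 'a,"b,c",d' | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1'
```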
```{bash}
# Using awk: filter on the date field directly (assumes the date is field 1)
# Caveat: a plain -F',' miscounts fields on lines where a quoted field
# contains commas, which is why the grep + gsub approach above is safer
awk -F',' '$1 == "20 Jan 2017" { print $3 "\t" $8 }' data/articles-large/articles-large.csv
```
# Loops and conditionals
## Work (page 75)
```{bash}
# Positional arguments ($1, $2) only exist when a script is invoked with
# arguments, so we simulate them here with `set --` to make the chunk runnable
set -- "foo" "bar"
firstArg=$1
secondArg=$2
echo "You have entered \"$firstArg\" and \"$secondArg\""
```
## Example (page 77)
```{bash}
# Assumes there are some .txt files in the working directory
files=$(ls *.txt)    # no spaces around '='; $(...) is preferred over backticks
for file in $files
do
    cat "$file" >> Output.txt    # append to ./Output.txt, not to /Output.txt
done
```
## Work (page 79)
- Now modify the script to print the first 10 lines of each file in the compressed tarball
```{bash}
cd data
for filename in *.tar.gz
do
    echo "$filename"
    # Print the first 10 lines of each member, extracting to stdout (-O)
    # so nothing is unpacked to disk
    for member in $(tar tzf "$filename")
    do
        echo "== $member =="
        tar xzOf "$filename" "$member" | head -n10
    done
    echo
done
```
## Work: Plant gene data set (page 80)
- Which plant systems contain a Smell gene?
```{bash}
grep -l "Smell" data/Plants/*.genes
```
- How many plant systems contain a Color gene?
```{bash}
grep -l "Color" data/Plants/*.genes | wc -l
```
```{bash}
# Toni's answer: list each matching system (each with a count of 1)
grep -l "Color" data/Plants/*.genes | cut -f1 -d "." | uniq -c
```
- What genes are in common between apple and pear? Which are specific to each of them?
```{bash}
cd data/Plants
# In common: genes that appear in both files
cat apple.genes pear.genes | cut -f2 | sort | uniq -d
echo
# Specific to one of the two files (though this does not say which one;
# see the comm sketch below)
cat apple.genes pear.genes | cut -f2 | sort | uniq -u
```
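To see which genes are specific to which plant, `comm` is handy; a minimal sketch, assuming the gene name sits in field 2 of each `.genes` file:
```{bash}
cd data/Plants
# comm -3 hides the common column: column 1 = apple-only,
# column 2 (tab-indented) = pear-only
comm -3 <(cut -f2 apple.genes | sort -u) <(cut -f2 pear.genes | sort -u)
```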
- How many genes are in common to all three plant systems?
```{bash}
# Complicated way
cd data/Plants
cut -f2 apple.genes | sort | uniq -u > apple-uniq.txt
cut -f2 pear.genes | sort | uniq -u > pear-uniq.txt
cut -f2 peach.genes | sort | uniq -u > peach-uniq.txt
cat *-uniq.txt | sort | uniq -dc | grep "3 "
rm *-uniq.txt
```
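A cleaner alternative along the same lines chains two `comm -12` calls, each printing only the lines common to both of its sorted inputs:
```{bash}
cd data/Plants
# Intersect apple with pear, then intersect the result with peach
comm -12 <(cut -f2 apple.genes | sort -u) <(cut -f2 pear.genes | sort -u) \
  | comm -12 - <(cut -f2 peach.genes | sort -u) | wc -l
```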
```{bash}
# Toni's answer (not totally correct: it compares whole lines, not genes)
cd data/Plants
cat *.genes | sort | uniq -c | grep "3 " | wc -l
# Column correction (still not correct: a gene listed twice in one file and
# once in another also reaches a count of 3, and `grep "3 "` would also
# match counts such as 13 or 23)
cat *.genes | cut -f2 | sort | uniq -c | grep "3 " | wc -l
```