---
title: "PDS: Filtering and processing input with bash scripts"
subtitle: "Examples and exercises from linux2_all_191017.pdf"
author: "Alfredo Hernández"
date: "19 October 2017"
output:
html_notebook:
highlight: pygments
theme: cosmo
toc: yes
pdf_document:
toc: yes
---
# Filtering and processing
## Work (page 64)
- First we do a `head` to see the structure of the file
```{bash}
FILE=data/articles-large/articles-large.csv
head -n2 $FILE
```
- How many unique authors are on that list?
```{bash}
# Count unique authors: drop the header line and empty fields first
cut -d ',' -f3 data/articles-large/articles-large.csv | grep -v "Author" | grep . | sort | uniq -c | wc -l
```
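Since `wc -l` just counts lines, the `-c` flag of `uniq` is redundant here; an equivalent, slightly shorter pipeline uses `sort -u`:
```{bash}
# Equivalent count: deduplicate and count in one step
cut -d ',' -f3 data/articles-large/articles-large.csv | grep -v "Author" | grep . | sort -u | wc -l
```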
- How many articles did each author write?
```{bash}
# Normal way
grep -i "Article" data/articles-large/articles-large.csv | cut -d ',' -f3 | sort | uniq -c
```
```{bash}
# Slightly faster way (not needed in this case)
cut -d ',' -f2,3 data/articles-large/articles-large.csv | grep -i "Article" | cut -d ',' -f2 | sort | uniq -c
```
## Work (page 72)
- Create a new list of January 20 articles
```{bash}
grep -i "20 Jan 2017" data/articles-large/articles-large.csv
```
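To actually create the new list as a file rather than just printing it, redirect the output (the filename below is an arbitrary choice):
```{bash}
# Save the filtered articles to a new file (filename is illustrative)
grep -i "20 Jan 2017" data/articles-large/articles-large.csv > data/articles-20jan2017.csv
wc -l data/articles-20jan2017.csv
```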
- Extract the authors of those articles with the number of words each author wrote
```{bash}
# Using grep
grep -i "20 Jan 2017" data/articles-large/articles-large.csv | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' | cut -d ',' -f3,8 | sort -u
```
Notice that we call `awk` to remove the commas (`,`) that appear between quotation marks (`"`), so the comma-separated fields line up again. In other words, black magic.
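To demystify it a little, here is the same one-liner on a made-up sample line: splitting on `"` makes every even-numbered field the inside of a quoted string, so `gsub` strips commas only there:
```{bash}
# Even-numbered "-separated fields are inside quotes; strip commas only there
echo 'a,"b,c",d' | awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1'
```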
```{bash}
# Using awk: filter on the date field directly (assumes the date is field 1)
# Caveat: a plain -F',' miscounts fields on lines where a quoted field
# contains commas, which is why the grep + gsub approach above is safer
awk -F',' '$1 == "20 Jan 2017" { print $3 "\t" $8 }' data/articles-large/articles-large.csv
```
# Loops and conditionals
## Work (page 75)
```{bash}
# Positional arguments ($1, $2) only exist when a script is invoked with
# arguments, so we simulate them here with `set --` to make the chunk runnable
set -- "foo" "bar"
firstArg=$1
secondArg=$2
echo "You have entered \"$firstArg\" and \"$secondArg\""
```
## Example (page 77)
```{bash}
# Assumes there are some .txt files in the working directory
files=$(ls *.txt)    # no spaces around '='; $(...) is preferred over backticks
for file in $files
do
    cat "$file" >> Output.txt    # append to ./Output.txt, not to /Output.txt
done
```
## Work (page 79)
- Now modify the script to print the first 10 lines of each file in the compressed tarball
```{bash}
cd data
for filename in *.tar.gz
do
    echo "$filename"
    # Print the first 10 lines of each member, extracting to stdout (-O)
    # so nothing is unpacked to disk
    for member in $(tar tzf "$filename")
    do
        echo "== $member =="
        tar xzOf "$filename" "$member" | head -n10
    done
    echo
done
```
## Work: Plant gene data set (page 80)
- Which plant systems contain a Smell gene?
```{bash}
grep -l "Smell" data/Plants/*.genes
```
- How many plant systems contain a Color gene?
```{bash}
grep -l "Color" data/Plants/*.genes | wc -l
```
```{bash}
# Toni's answer: list each matching system (each with a count of 1)
grep -l "Color" data/Plants/*.genes | cut -f1 -d "." | uniq -c
```
- What genes are in common between apple and pear? Which are specific to each of them?
```{bash}
cd data/Plants
# In common: genes that appear in both files
cat apple.genes pear.genes | cut -f2 | sort | uniq -d
echo
# Specific to one of the two files (though this does not say which one;
# see the comm sketch below)
cat apple.genes pear.genes | cut -f2 | sort | uniq -u
```
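To see which genes are specific to which plant, `comm` is handy; a minimal sketch, assuming the gene name sits in field 2 of each `.genes` file:
```{bash}
cd data/Plants
# comm -3 hides the common column: column 1 = apple-only,
# column 2 (tab-indented) = pear-only
comm -3 <(cut -f2 apple.genes | sort -u) <(cut -f2 pear.genes | sort -u)
```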
- How many genes are in common to all three plant systems?
```{bash}
# Complicated way
cd data/Plants
cut -f2 apple.genes | sort | uniq -u > apple-uniq.txt
cut -f2 pear.genes | sort | uniq -u > pear-uniq.txt
cut -f2 peach.genes | sort | uniq -u > peach-uniq.txt
cat *-uniq.txt | sort | uniq -dc | grep "3 "
rm *-uniq.txt
```
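A cleaner alternative along the same lines chains two `comm -12` calls, each printing only the lines common to both of its sorted inputs:
```{bash}
cd data/Plants
# Intersect apple with pear, then intersect the result with peach
comm -12 <(cut -f2 apple.genes | sort -u) <(cut -f2 pear.genes | sort -u) \
  | comm -12 - <(cut -f2 peach.genes | sort -u) | wc -l
```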
```{bash}
# Toni's answer (not totally correct: it compares whole lines, not genes)
cd data/Plants
cat *.genes | sort | uniq -c | grep "3 " | wc -l
# Column correction (still not correct: a gene listed twice in one file and
# once in another also reaches a count of 3, and `grep "3 "` would also
# match counts such as 13 or 23)
cat *.genes | cut -f2 | sort | uniq -c | grep "3 " | wc -l
```