# Working with text

## Create a FASTA file to play with

In [None]:
cat > bgp.fasta << EOF
>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACC
ATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGTGCCCC
CACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCCCTTTG
GCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGCAGAGG
GAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGGTGAAC
CAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGATGGACC
TGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCTGTCTG
GCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTCTGCTC
CACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGGCTGGG
GTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGGGAGTG
GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGGAGAGG
AGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCGGATCC
CCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTTCAGGA
GGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTGCTCCT
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCCCCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG
EOF

In [None]:
wc bgp.fasta
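
`wc` reports lines, words, and bytes, so its byte count for a FASTA file includes the header line and every newline. A minimal sketch of counting only the nucleotides, using a toy file (`demo.fasta` is a made-up name, not part of this notebook's data):

```shell
# Toy FASTA record; demo.fasta is a hypothetical file name used only here.
printf '>seq1 demo\nACGT\nAC\n' > demo.fasta

# Drop the header line, strip newlines, then count the remaining characters.
seq_len=$(grep -v '^>' demo.fasta | tr -d '\n' | wc -c)
echo "sequence length: $((seq_len))"
```

The same pipeline should work on `bgp.fasta` by swapping the file name.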

## Using regular expressions

In [None]:
cat bgp.fasta | 
grep "CCCCC"

In [None]:
cat bgp.fasta | 
grep -n "CCCCC"

In [None]:
cat bgp.fasta |
grep -nv "CCCCC"

In [None]:
cat bgp.fasta |
grep "GA*TT.*CA"

In [None]:
cat bgp.fasta |
grep -E "^C"

In [None]:
cat bgp.fasta |
grep -E "G$"

In [None]:
cat bgp.fasta |
grep -E "^C.*G$"

In [None]:
cat bgp.fasta |
grep -o "GA*TT.*CA"

In [None]:
cat bgp.fasta | 
grep -E "(GCAT)+"

In [None]:
cat bgp.fasta | 
grep -Eon "(GCA){2,}"
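
When counting matches, note that `grep -c` counts matching lines, while `grep -o` prints each match on its own line, so piping it to `wc -l` counts occurrences. A small sketch on toy data (`motifs.txt` is a hypothetical file):

```shell
# Toy input: the first line contains the motif twice, the second once.
printf 'GCAGCA\nTTGCATT\nAAAA\n' > motifs.txt

# Number of lines that contain the motif at least once.
matching_lines=$(grep -c "GCA" motifs.txt)

# Number of individual (non-overlapping) occurrences of the motif.
total_matches=$(grep -o "GCA" motifs.txt | wc -l)

echo "lines: $matching_lines, matches: $((total_matches))"
```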

## Transliteration

In [None]:
cat bgp.fasta |
grep -E "^C.*G$" 

### Complement

In [None]:
cat bgp.fasta |
grep -E "^C.*G$" |
tr ACGT TGCA

### Reverse complement

In [None]:
cat bgp.fasta |
grep -E "^C.*G$" |
tr ACGT TGCA |
rev
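
`rev` reverses each line separately, so for a sequence that spans several lines the line order stays unchanged. One possible way to reverse-complement a whole record is to join the lines first. A sketch on a toy file (`toy.fasta` is a hypothetical name):

```shell
# Toy two-line sequence: AAAC + GGTT = AAACGGTT
printf '>toy\nAAAC\nGGTT\n' > toy.fasta

# Join sequence lines into one, add a final newline, complement, then reverse.
revcomp=$({ grep -v '^>' toy.fasta | tr -d '\n'; echo; } | tr ACGT TGCA | rev)
echo "$revcomp"
```

The output is a single unwrapped line; re-wrapping it to a fixed width is left out of this sketch.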

## Sorting

In [None]:
cat bgp.fasta |
grep -v "^>" |
sort

### Sort by default uses lexicographic order

In [None]:
cat bgp.fasta |
grep -nv "^>" |
sort

### Use the `-n` flag for numeric order

In [None]:
cat bgp.fasta |
grep -nv "^>" |
sort -n

### Sort descending

In [None]:
cat bgp.fasta |
grep -nv "^>" |
sort -rn
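
The difference between lexicographic and numeric order is easier to see on a tiny input. A sketch with made-up data (`numbered.txt` is hypothetical; `LC_ALL=C` is set only to make the lexicographic result independent of the locale):

```shell
# Toy lines in the same "number:text" shape that grep -n produces.
printf '10:x\n9:y\n2:z\n' > numbered.txt

# Lexicographic: compared character by character, so "10" sorts before "2".
lexical_first=$(LC_ALL=C sort numbered.txt | head -n 1)

# Numeric: the leading number is compared by value, so 2 sorts before 10.
numeric_first=$(sort -n numbered.txt | head -n 1)

echo "lexicographic first: $lexical_first, numeric first: $numeric_first"
```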

## Downloading files

In [None]:
wget ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

In [None]:
ls

## File compression/decompression

In [None]:
ls -lh Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

In [None]:
gunzip Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

In [None]:
ls -lh
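
`gunzip` replaces the `.gz` file with the uncompressed one. If you only want to look at the contents, `gunzip -c` writes the decompressed data to standard output and leaves the compressed file in place. A sketch with a toy file (`notes.txt` is a hypothetical name):

```shell
# Create and compress a toy file; gzip replaces notes.txt with notes.txt.gz.
printf 'line1\nline2\n' > notes.txt
gzip -f notes.txt

# Peek at the first line without removing the compressed file.
first_line=$(gunzip -c notes.txt.gz | head -n 1)
echo "$first_line"
```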

## Inspecting the GTF file

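A GTF file has some header lines (starting with `#`), followed by tab-separated data in 9 columns. The example values below are from a human annotation; chromosome names and annotation sources differ for other organisms: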

```
chromosome name > chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}
annotation source > {ENSEMBL,HAVANA}
feature-type > {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
genomic start location > integer-value (1-based)
genomic end location > integer-value
score (not used) > .
genomic strand > {+,-}
genomic phase (for CDS features) > {0,1,2,.}
additional information as key-value pairs > (format: key "value";)
```
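
Because the start and end coordinates are 1-based and both inclusive, the length of a feature is `end - start + 1`. A sketch with a single made-up record (the file `one_record.gtf` and the gene ID `TOY_0001` are hypothetical):

```shell
# One toy GTF record with the 9 tab-separated columns described above.
printf '1\tensembl\texon\t100\t250\t.\t+\t.\tgene_id "TOY_0001";\n' > one_record.gtf

# Column 4 is the start and column 5 is the end of the feature.
feature_len=$(awk -F'\t' '{print $5 - $4 + 1}' one_record.gtf)
echo "feature length: $feature_len"
```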

In [None]:
head Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf

In [None]:
tail Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf

## Remove comment lines

#### If you know the number of lines

In [None]:
tail -n +6 Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf | head -3

#### Using regular expressions (advanced)

In [None]:
cat Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf |
grep -v '^#' |
head -3
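
`tail -n +N` starts printing at line N, so skipping k header lines means using N = k + 1. A sketch comparing both approaches on a toy file with two comment lines (`commented.txt` is a hypothetical name):

```shell
# Toy file: two comment lines followed by two data rows.
printf '#!version 1\n#!date today\nrow1\nrow2\n' > commented.txt

# Skip a known number of header lines (2), so start at line 3.
by_count=$(tail -n +3 commented.txt)

# Skip every line that starts with '#', however many there are.
by_pattern=$(grep -v '^#' commented.txt)

echo "$by_pattern"
```

The pattern-based version keeps working if the number of header lines changes.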

## Splitting columns

In [None]:
cat Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf |
grep -v '^#' |
cut -f3 |
head -3

In [None]:
cat Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf |
grep -v '^#' |
cut -f4-5 |
head -3

In [None]:
cat Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf |
grep -v '^#' |
cut -f2,4-5 |
head -3
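
A column extracted with `cut` can be summarized by combining the `sort` shown earlier with `uniq -c`, which counts adjacent identical lines. A sketch that tallies feature types in a made-up file (`features.gtf` and its contents are hypothetical):

```shell
# Toy GTF-like file: one header line, then one gene and two exons.
printf '#!header\n1\tsrc\tgene\t1\t9\n1\tsrc\texon\t1\t4\n1\tsrc\texon\t6\t9\n' > features.gtf

# Extract column 3, sort so identical values are adjacent, then count them.
counts=$(grep -v '^#' features.gtf | cut -f3 | sort | uniq -c)
echo "$counts"
```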

## Exercises

1. What is the mRNA version of bgp.fasta?

In [None]:
cat bgp.fasta |
grep -v '^>' |
tr T U

2. Extract the nucleotides in positions 5, 10, and 15 of each line of bgp.fasta.

In [None]:
cat bgp.fasta |
grep -v '^>' |
cut -c5,10,15

3. Find the number of mitochondrial exons in the GTF file.

In [None]:
cat Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf |
grep '^M' |
cut -f3 |
grep "exon" | 
wc -l
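
An alternative sketch uses `awk` to test whole columns for equality, which avoids accidental partial matches. The chromosome name `Mt` and the file `mito.gtf` are assumptions made for this toy example; check which name the real file uses before adapting it:

```shell
# Toy data: two exons and one gene on "Mt", plus one exon on chromosome 1.
printf '#!h\nMt\tsrc\texon\t1\t5\nMt\tsrc\tgene\t1\t5\n1\tsrc\texon\t1\t5\nMt\tsrc\texon\t7\t9\n' > mito.gtf

# Count rows whose column 1 is exactly "Mt" and column 3 is exactly "exon".
mito_exons=$(awk -F'\t' '$1 == "Mt" && $3 == "exon" {n++} END {print n+0}' mito.gtf)
echo "mitochondrial exons: $mito_exons"
```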