# Welcome to Day 3! 

## More sequence modification and data extraction

### Section 1: Filtering for sequences matching certain conditions

### Section 2: Storing filtered sequences to files

### Section 3: Reformatting and storing sequences

---

## Session summary


Today things start to get a little serious. Building on concepts we learned in Day 1 and Day 2, we can filter for sequences of a certain length or containing a motif of interest, remove undesired characters, and rename headers. We then go over how to store our filtered and reformatted sequences in new files.

---
## For Google Colab users only

Run the following commands:

In [None]:
pip install Biopython

In [None]:
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/to_reformat.fasta
!wget https://raw.githubusercontent.com/agmcfarland/biopython_workshop/master/mixed_args.fasta

---

# Section 1: Filtering for sequences matching certain conditions

Let's say a collaborator gives you a fasta file containing many sequences of two types of antibiotic resistance genes: one gene type, ermA, confers erythromycin resistance and the other gene type, blaC, imparts beta-lactam antibiotic resistance. How would you separate the two antibiotic resistance gene classes into two files?

In [None]:
# import the Biopython module (SeqIO) and class (Seq) that we will be using
from Bio import SeqIO
from Bio.Seq import Seq

One way to distinguish between the different gene types is their nucleotide lengths. A useful way to examine their lengths is to parse a fasta file and print the length (using the `len()` function), and the gene name in a single line. 

In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    print(len(record.seq), ' ',record.id)



From the above output, we can see that ermA genes are 729 nucleotides long while blaC genes are at least 921 nucleotides long.

However, some gene names are not especially clear. We can access additional information stored in each gene's header using `record.description`.

---
### Exercise 1

Using the above code as an example, use `SeqIO.parse()` to parse `mixed_args.fasta` and `print()` the `record.description` of each record followed by the length of the `record.seq`

In [None]:
for record in ___.parse('mixed_args.fasta', '___'):
    print(record.___, ' ', len(___))

---

The output now clearly shows that the difference in nucleotide length between ermA and blaC is large enough to separate the two gene types.

Now we will filter for sequences that match the ermA length and print them.

In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if len(record.seq) == 729:
        print(record.description)

The blaC genes have lengths that range from 921 to 1143 nucleotides. What if we want to filter for only blaC genes? We can do this in a couple of different ways.

The simplest is to search for anything longer than 900 nucleotides:


In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if len(record.seq) > 900:
        print(record.description)

Another is to set a lower length boundary of 921 and an upper length boundary of 1143 nucleotides:

In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if len(record.seq) >= 921:
        if len(record.seq) <= 1143:
            print(record.description)
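
The two nested `if` statements above can also be written as a single condition. The sketch below uses a made-up list of lengths (not the real contents of `mixed_args.fasta`) to show Python's chained comparison, which keeps a value only when it falls within both boundaries.

```python
# hypothetical sequence lengths, standing in for len(record.seq)
example_lengths = [729, 921, 1000, 1143, 1200]

in_range = []
for length in example_lengths:
    # a chained comparison checks the lower and upper boundary in one line
    if 921 <= length <= 1143:
        in_range.append(length)

print(in_range)
```

Inside the loop over `SeqIO.parse()`, the same condition would read `if 921 <= len(record.seq) <= 1143:`.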

---
### Exercise 2

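Show only sequences shorter than 1143 nucleotides. This removes the longest blaC gene, but note that the 729-nucleotide ermA genes also pass this filter.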

`print()` the `record.description` only if the `record.seq` has a `len()` **less** than `1143` nucleotides.

In [None]:
for ___ in SeqIO.___('mixed_args.fasta','___'):
    if len(___) < ___:
        print(record.___)
        print('\n')

---

Another way to filter for genes of interest is to select only records that contain keywords in their headers, such as all sequences that have the word Beta-lactamase in `record.description`. We can use the Python string method `find()` for filtering.

`find()` returns the position where the string of interest starts. If the string is not found, it returns -1.


In [None]:
# where starts at position 0 of the string 'where am I?'
'where am I?'.find('where')

In [None]:
# ? starts at position 10 of the string 'where am I?'
'where am I?'.find('?')

In [None]:
# the first occurrence of e is at position 2 of the string 'where am I?'
'where am I?'.find('e')

In [None]:
# q is not in the string 'where am I?' and so it returns a -1 position
'where am I?'.find('q')

We will filter for sequences that have the word Beta-lactamase in `record.description`. 

In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    # if the record.description does have Beta-lactamase in it, then the character position will be greater than -1.
    if record.description.find('Beta-lactamase') > -1:
        print(record.description)
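
As a side note, Python's `in` operator answers the same yes/no question as `find() > -1` and is often easier to read. The headers below are invented for illustration and are not taken from `mixed_args.fasta`. Both approaches are case-sensitive, so the last lines show one possible way to ignore case.

```python
# hypothetical header text, standing in for record.description
example_descriptions = [
    'gene_1 Beta-lactamase class C',
    'gene_2 Erythromycin resistance protein',
    'gene_3 beta-lactamase (lowercase b)',
]

for description in example_descriptions:
    found_with_find = description.find('Beta-lactamase') > -1
    found_with_in = 'Beta-lactamase' in description
    # both checks always agree with each other
    print(description, '|', found_with_find, found_with_in)

# a case-insensitive check: lowercase the description before searching
matches_any_case = [d for d in example_descriptions if 'beta-lactamase' in d.lower()]
print(matches_any_case)
```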

---

### Exercise 3a

Filter for sequences with the term `'Erythromycin resistance'` in their `record.description` using `find()`



In [None]:
for ___ in SeqIO.___('mixed_args.fasta','___'):
    if record.description.___('___') > __:
        print(record.description)
        print('\n')

### Exercise 3b

Filter for sequences that do **not** contain `'Erythromycin resistance'` in their `record.description` using `find()`


In [None]:
for ___ in SeqIO.___('mixed_args.fasta','___'):
    if record.description.___('___') ___ -1:
        print(record.description)
        print('\n')

___

From exercise 3, we can tell that gene names and descriptions are not always useful for filtering.

We can also try filtering by searching for sequence characters (nucleotide or amino acid) of interest.

For example, we can look for all sequences that contain a DNA motif:




In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if record.seq.find('GCTGTAAATC') > -1:
        print(record.description)

Now let's try filtering for sequences using protein motifs. 

This is a two-step challenge: first we must translate the DNA sequence, and then we search the translation for the motif.

We will filter for sequences that carry the blaC motif, GAKR.

In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    translated_record = record.translate()
    if translated_record.seq.find('GAKR') > -1:
        print(record.description)
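
To see the two steps on something small, here is a sketch using a short, made-up DNA sequence (not from `mixed_args.fasta`) that was chosen so that its translation contains GAKR.

```python
from Bio.Seq import Seq

# hypothetical 18-nucleotide sequence: ATG GGT GCT AAA CGT TAA
toy_dna = Seq('ATGGGTGCTAAACGTTAA')

# step 1: translate the DNA into protein (the stop codon becomes '*')
toy_protein = toy_dna.translate()
print(toy_protein)

# step 2: search the protein for the motif; find() returns the start position or -1
motif_position = toy_protein.find('GAKR')
print(motif_position)
```

The motif is found at position 1 because position 0 is taken by the methionine (M) encoded by the start codon.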

What if we want to use more than one motif to more clearly differentiate between sequences?



In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    translated_record = record.translate()
    if translated_record.seq.find('GAKR') > -1:
        if translated_record.seq.find('DGTA') > -1:
            print(record.description)
# compare result to search with only GAKR motif

With the double motif filter we end up losing the blaC gene belonging to *Citrobacter freundii*, which seems to be more phylogenetically distant than the other blaC genes (and is 1143 nucleotides long compared to 921).
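
When the number of motifs grows, nesting one `if` per motif gets hard to read. One possible alternative, sketched here with invented protein strings and a hypothetical helper named `has_all_motifs`, is to keep the motifs in a list and check them all with `all()`.

```python
def has_all_motifs(protein, motifs):
    """Return True only if every motif occurs somewhere in the protein string."""
    return all(motif in protein for motif in motifs)

# hypothetical translated sequences, for illustration only
toy_proteins = {
    'both_motifs': 'MSGAKRLLDGTAW',
    'only_gakr': 'MSGAKRLLPPPPW',
    'neither': 'MSLLLLPPPPW',
}

for name, protein in toy_proteins.items():
    print(name, has_all_motifs(protein, ['GAKR', 'DGTA']))
```

With a translated record from the loop above, the call would be `has_all_motifs(str(translated_record.seq), ['GAKR', 'DGTA'])`.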


---

### Exercise 4

`translate()` the `record` object and assign it to the variable `translated_record`. Afterwards, filter for sequences that contain the ermA sequence motif `'QRALGL'` using `find()`

In [None]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    translated_record = record.___
    if translated_record.___.___('___') > ___:
        print(record.description)

---

# Section 2: Storing filtered sequences to files

After we parse files and look for features of interest, we typically want to store them in a new location, such as a new fasta file. 

We will first store the sequences we have filtered for in a list (like we learned to do in Day 2). Afterwards, we will write that list to a file using `SeqIO.write()`, which we also learned to do in Day 2.



In [None]:
# empty list to store modified sequence records
filtered_records_list = []

for record in SeqIO.parse('mixed_args.fasta', 'fasta'):
    # translate the sequence
    translated_seq = record.seq.translate()
    # search for the ermA motif
    if translated_seq.find('QRALGL') > -1:
        # if the sequence contains the motif, append it to our list
        filtered_records_list.append(record)

# examine the contents (will look a little messy)
print(filtered_records_list)


Now that we have the list ready, we can write it to a new fasta file using `SeqIO.write()`.


In [None]:
SeqIO.write(filtered_records_list, 'erma.fasta','fasta')
# open and inspect!
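
`SeqIO.write()` returns the number of records it wrote, which is a quick way to confirm that the filter kept as many sequences as you expected. The sketch below builds two made-up records and writes them to an in-memory text buffer (`io.StringIO`) instead of a real file, so it can be run without `mixed_args.fasta`.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# two hypothetical records, for illustration only
toy_records = [
    SeqRecord(Seq('ATGGGTGCTAAACGTTAA'), id='toy_gene_1', description='first toy gene'),
    SeqRecord(Seq('ATGAAACGTTAA'), id='toy_gene_2', description='second toy gene'),
]

# write to an in-memory buffer; a file name such as 'toy.fasta' works the same way
buffer = StringIO()
records_written = SeqIO.write(toy_records, buffer, 'fasta')

print(records_written)
print(buffer.getvalue())
```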

---

### Exercise 5a

For each sequence, keep only those with a blaC motif. `append()` each record that passes this condition to the list `extracted_seqs`. Use `SeqIO.write()` to write them to a new file called `'day3_5a_blac.fasta'`






In [None]:
# empty list to store modified sequence records
extracted_seqs = ___

for ___ in ___('mixed_args.fasta', 'fasta'):
    # translate the sequence
    translated_seq = record.seq.translate()
    # search for the blaC motif
    if translated_seq.find('GAKR') > -1:
        # if the sequence contains the motif, append it to our list
        extracted_seqs.___(record)

# examine the contents
print(extracted_seqs)

#write to file
SeqIO.___(___, 'day3_5a_blac.fasta','___')

### Exercise 5b

For each sequence with a blaC motif in its translated `record.seq`, select only those with a `record.seq` `len()` shorter than 1143 nucleotides. `append()` each record that passes these conditions to the list `extracted_seqs`. Use `SeqIO.write()` to write them to a new file called `'day3_5b.fasta'`

In [None]:
# empty list to store modified sequence records
extracted_seqs = ___

for ___ in ___('mixed_args.fasta', 'fasta'):
    # translate the sequence
    translated_seq = record.seq.translate()
    # search for the blaC motif
    if translated_seq.find('GAKR') > -1:
        # check if the nucleotide sequence is less than 1143 nucleotides
        if ___(___.seq) < 1143:
            # append the record to our list
            extracted_seqs.___(record)

# write to file
SeqIO.___(___, '___.fasta','fasta')

___

# Section 3: Reformatting and storing sequences

It is sometimes useful to rename or reformat the headers of sequences.

You can use string manipulation to modify `record.id` or `record.description`.

Let's look at the unsightly headers in a file called `'to_reformat.fasta'`

In [None]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    print('record id:',' ',record.id)
    print('record description:',' ',record.description)
    print('\n')

Let's say we don't want our record id to say UgLyGenE_ because we already have a nice gene name next to it. We can use the `replace()` string method we learned in Day 1.

In [None]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    record.id = record.id.replace('UgLyGenE_', '')
    
    print('record id:',' ',record.id)
    print('record description:',' ',record.description)
    print('\n')

Now we also want to change the description. Maybe all these genes are from *E. coli* and so we want to include that in the description

In [None]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    record.id = record.id.replace('UgLyGenE_', '')
    record.description = 'this gene is from Escherichia coli'

    print('record id:',' ',record.id)
    print('record description:',' ',record.description)
    print('\n') # for clarity
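
It helps to know how these two attributes end up in a fasta header. When Biopython parses a fasta file, `record.id` is the first word of the header line and `record.description` is the whole header line, id included. The sketch below uses a made-up one-record fasta string (not the contents of `to_reformat.fasta`) and an in-memory buffer to show what the header looks like after both attributes are changed and the record is written out.

```python
from io import StringIO

from Bio import SeqIO

# hypothetical fasta text, for illustration only
toy_fasta = '>UgLyGenE_abc some old description\nATGAAACGTTAA\n'

record = next(SeqIO.parse(StringIO(toy_fasta), 'fasta'))
print(record.id)           # first word of the header
print(record.description)  # the whole header line, including the id

record.id = record.id.replace('UgLyGenE_', '')
record.description = 'this gene is from Escherichia coli'

# when written as fasta, the header is the id followed by the description
buffer = StringIO()
SeqIO.write([record], buffer, 'fasta')
header_line = buffer.getvalue().splitlines()[0]
print(header_line)
```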

We also notice one more thing: all of our sequences have an asterisk at the end. Let's take a look.

In [None]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    print(record.seq)
    print('\n') #for clarity

Let's get rid of the asterisk in the sequences.

In [None]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    # create a string version of record.seq so that we can use the string method replace()
    sequence_string = str(record.seq)
    sequence_string = sequence_string.replace('*','')
    # afterwards, convert the modified string into a Seq object
    record.seq = Seq(sequence_string)
    print(record.seq)
    print('\n') #for clarity
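
Keep in mind that `replace('*', '')` removes every asterisk, including any in the middle of a sequence. If only a trailing asterisk should go, the string method `rstrip('*')` is one alternative. The protein strings below are made up for illustration.

```python
# hypothetical protein strings; '*' marks a stop codon
toy_sequences = ['MKTAYIAK*', 'MKT*AYIAK*']

results = []
for sequence_string in toy_sequences:
    all_removed = sequence_string.replace('*', '')  # removes every asterisk
    end_removed = sequence_string.rstrip('*')       # removes asterisks at the end only
    results.append((all_removed, end_removed))
    print(sequence_string, all_removed, end_removed)
```

For the first string both methods give the same result; for the second, only `rstrip('*')` keeps the internal asterisk.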

Putting it all together:

In [None]:
# empty list to store modified sequence records
modified_records = []

for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    # change record.id
    record.id = record.id.replace('UgLyGenE_', '')
    # change record.description
    record.description = 'this gene is from Escherichia coli'
    # change sequence
    sequence_string = str(record.seq)
    sequence_string = sequence_string.replace('*','')
    record.seq = Seq(sequence_string)
    # store
    modified_records.append(record)

# examine the contents
print(modified_records)
# write to file
SeqIO.write(modified_records,'reformatted_sequences.fasta','fasta')
# open and view file


---

### Exercise 6a

Modify `record.id` so that you `replace()` the string `'UgLyGenE_'` found in `record.id` with the empty string `''`.

Assign `record.description` to equal an uppercased `record.id` using `upper()`.

Afterward, add the string `'_Escherichia coli'` to `record.description`.

In [None]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    ___.id = ___.id.___('___', '')
    ___.description = record.___.upper()
    ___.description = ___.description + '___'

    print(record.description)
    print('\n')

### Exercise 6b

Make an empty list called `store_records`

Modify `record.id` so that you `replace()` the string `'UgLyGenE_'` found in `record.id` with the empty string `''`.

Assign `record.description` to equal an uppercased `record.id` using `upper()`.

Afterward, add the string `'_Escherichia coli'` to `record.description`.

`append()` the modified record to `store_records`

Use `SeqIO.write()` to write the contents to the fasta file `'day3_6b.fasta'`


In [None]:
___ = []

for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    ___.id = ___.id.___('___', '')
    ___.description = record.___.upper()
    ___.description = ___.description + '___'

    store_records.___(___)

    print(record.description)
    print('\n')

SeqIO.write(store_records, '___.fasta', 'fasta')

---

That's it for Day 3! 