# Welcome to Day 3! 

## More sequence modification and data extraction

### Section 1: Filtering for sequences matching certain conditions

### Section 2: Storing filtered sequences to files

### Section 3: Reformatting and storing sequences

---

## Session summary


Today things start to get a little serious. Building on the concepts we learned in Day 1 and Day 2, we will filter for sequences of a certain length or containing a motif of interest, remove undesired characters, and rename headers. We then go over how to store the modified and reformatted sequences in new files.

---


# Section 1: Filtering for sequences matching certain conditions

Let's say a collaborator gives you a fasta file containing many sequences of two types of antibiotic resistance genes: one gene type, ermA, confers erythromycin resistance and the other gene type, blaC, imparts beta-lactam antibiotic resistance. How would you separate the two classes of antibiotic resistance genes into two files?

In [68]:
# import our two Biopython modules that we will be using
from Bio import SeqIO
from Bio.Seq import Seq

One way to distinguish between the different gene types is their nucleotide lengths. A useful way to examine their lengths is to parse a fasta file and print the length and the gene name in a single line. 

In [109]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    print(len(record.seq), ' ',record.id)

729   ERMA_STAAR
729   ERMA_STAAM
729   ERMA_STAAU
921   BLAC_MYCTO
921   BLAC_MYCBO
729   A0A133QE03_STASI
1143   AMPC_CITFR
729   D4REG1_ENTFC
729   Q7DIC2_STAAU
921   BLAC_MYCTA
729   A0A564SEJ2_ENTAV
729   A0A7V6HLW4_9FIRM
921   BLAC_MYCTU




From the above output, we can see that ermA genes are 729 nucleotides long, while the beta-lactamase (bla) genes are at least 921 nucleotides long.
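
When a file holds many records, reading the lengths line by line gets tedious. The sketch below tallies how many records share each length. It uses plain Python with the lengths copied from the output above; in the notebook you would instead collect `len(record.seq)` inside the `SeqIO.parse` loop. The variable names are only illustrative.

```python
from collections import Counter

# Lengths copied from the output above (one per record in mixed_args.fasta).
lengths = [729, 729, 729, 921, 921, 729, 1143, 729, 729, 921, 729, 729, 921]

# Count how many records share each length.
length_counts = Counter(lengths)

for length, count in sorted(length_counts.items()):
    print(length, count)
```

A tally like this makes it easy to spot the length classes (here 729, 921 and 1143) before choosing a cutoff.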

However, some gene names are not especially clear. We can access additional information stored in each gene's header using `record.description`.

---
### Exercise 1

Using the above code as an example, parse `mixed_args.fasta` and print the `record.description` of each record followed by the length of the `record.seq`

In [None]:
for record in ___.parse('mixed_args.fasta', '___'):
    print(record.___, ' ', len(___))

---

The output now clearly shows that the difference in nucleotide length between ermA and blaC is large enough to separate the two gene types.

Now we will filter for sequences that match the ermA length and print them.

In [13]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if len(record.seq) == 729:
        print(record.description)

ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1
ERMA_STAAM P0A0H1 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain Mu50 / ATCC 700699) OX=158878 GN=ermA1 PE=3 SV=1
ERMA_STAAU P0A0H3 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus OX=1280 GN=ermA PE=3 SV=1
A0A133QE03_STASI Erythromycin resistance protein OS=Staphylococcus simulans OX=1286 GN=HMPREF3215_02510 PE=3 SV=1
D4REG1_ENTFC D4REG1 Macrolide-lincosamide-streptogramin B resistance protein OS=Enterococcus faecium E1679 OX=546340 GN=EfmE1679_0214 PE=3 SV=1
Q7DIC2_STAAU Q7DIC2 Erythromycin resistance protein OS=Staphylococcus aureus OX=1280 GN=ermA PE=3 SV=1
A0A564SEJ2_ENTAV Macrolide-lincosamide-streptogramin B resistance protein OS=Enterococcus avium OX=33945 GN=erm(A) PE=3 SV=1
A0A7V6HLW4_9FIRM 23S rRNA (Adenine(2058)-N(6))-methyltransferase Erm(A) OS=Bacilli bacterium OX=1903720 GN=erm(A) PE=4 SV=1


The beta-lactamase genes (blaC and ampC) have lengths that range from 921 to 1143 nucleotides. What if we want to filter for only these genes? We can do this in a couple of different ways.

The simplest is to search for anything longer than 900 nucleotides.


In [15]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if len(record.seq) > 900:
        print(record.description)

BLAC_MYCTO P9WKD2 Beta-lactamase OS=Mycobacterium tuberculosis (strain CDC 1551 / Oshkosh) OX=83331 GN=blaC PE=3 SV=1
BLAC_MYCBO P0A5I7 Beta-lactamase OS=Mycobacterium bovis (strain ATCC BAA-935 / AF2122/97) OX=233413 GN=blaC PE=3 SV=1
AMPC_CITFR P05193 Beta-lactamase OS=Citrobacter freundii OX=546 GN=ampC PE=1 SV=1
BLAC_MYCTA A5U493 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) OX=419947 GN=blaC PE=1 SV=1
BLAC_MYCTU P9WKD3 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=blaC PE=1 SV=1


Another is to set lower and upper length boundaries of 921 and 1143.

In [18]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if len(record.seq) >= 921:
        if len(record.seq) <= 1143:
            print(record.description)

BLAC_MYCTO P9WKD2 Beta-lactamase OS=Mycobacterium tuberculosis (strain CDC 1551 / Oshkosh) OX=83331 GN=blaC PE=3 SV=1
BLAC_MYCBO P0A5I7 Beta-lactamase OS=Mycobacterium bovis (strain ATCC BAA-935 / AF2122/97) OX=233413 GN=blaC PE=3 SV=1
AMPC_CITFR P05193 Beta-lactamase OS=Citrobacter freundii OX=546 GN=ampC PE=1 SV=1
BLAC_MYCTA A5U493 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) OX=419947 GN=blaC PE=1 SV=1
BLAC_MYCTU P9WKD3 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=blaC PE=1 SV=1
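
The two nested `if` statements can also be written as one condition, because Python allows chained comparisons. The sketch below uses plain integers in place of `len(record.seq)`; the helper name `in_length_range` and the value 1200 are made up for illustration.

```python
def in_length_range(length, minimum=921, maximum=1143):
    """Return True when minimum <= length <= maximum (both ends included)."""
    return minimum <= length <= maximum

# 729, 921 and 1143 appear in the output above; 1200 is a hypothetical extra.
lengths = [729, 921, 1143, 1200]

kept = [n for n in lengths if in_length_range(n)]
print(kept)
```

Inside the parsing loop, the same idea would read `if 921 <= len(record.seq) <= 1143:`.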


---
### Exercise 2

Print only the sequences that are **less** than 1143 nucleotides long by using `len()`. Note that a single length cutoff like this also lets the 729-nucleotide ermA genes through.

In [None]:
for ___ in SeqIO.___('mixed_args.fasta','___'):
    if len(___) < ___:
        print(record.description)

---

Another way to filter for genes of interest is to select only records that contain keywords in their headers, such as all sequences that have the word Beta-lactamase in `record.description`. We can use Python's built-in string method `find()` for filtering.

`find()` returns the position at which the search string first matches. If the string is not found, it returns -1.


In [19]:
# where starts at position 0 of the string 'where am I?'
'where am I?'.find('where')

0

In [20]:
# ? starts at position 10 of the string 'where am I?'
'where am I?'.find('?')

10

In [22]:
# the first occurrence of e is at position 2 of the string 'where am I?'
'where am I?'.find('e')

2

In [23]:
# q is not in the string 'where am I?' and so it returns a -1 position
'where am I?'.find('q')

-1
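
A common slip is to use the result of `find()` directly as a condition. The sketch below, using the same example string, shows why the comparison with -1 matters and how the `in` operator answers the yes/no question more directly. The variable names are only illustrative.

```python
text = 'where am I?'

# `in` answers the yes/no question directly.
has_where = 'where' in text            # True
has_q = 'q' in text                    # False

# find() answers "at which position?", so compare it with -1 explicitly.
where_position = text.find('where')    # 0
q_position = text.find('q')            # -1

# Pitfall: using the position itself as a condition gives the wrong answer,
# because 0 counts as False and -1 counts as True in an `if` statement.
wrong_for_where = bool(where_position)  # False, although 'where' is present
wrong_for_q = bool(q_position)          # True, although 'q' is absent
```

This is why the filters in this section always write `find(...) > -1` rather than `if find(...)`.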

We will filter for sequences that have the word Beta-lactamase in `record.description`. 

In [24]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    # if the record.description does have Beta-lactamase in it, then the character position will be greater than -1.
    if record.description.find('Beta-lactamase') > -1:
        print(record.description)

BLAC_MYCTO P9WKD2 Beta-lactamase OS=Mycobacterium tuberculosis (strain CDC 1551 / Oshkosh) OX=83331 GN=blaC PE=3 SV=1
BLAC_MYCBO P0A5I7 Beta-lactamase OS=Mycobacterium bovis (strain ATCC BAA-935 / AF2122/97) OX=233413 GN=blaC PE=3 SV=1
AMPC_CITFR P05193 Beta-lactamase OS=Citrobacter freundii OX=546 GN=ampC PE=1 SV=1
BLAC_MYCTA A5U493 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) OX=419947 GN=blaC PE=1 SV=1
BLAC_MYCTU P9WKD3 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=blaC PE=1 SV=1


---

### Exercise 3a

Filter for sequences with 'Erythromycin resistance' in the record.description using `find()`



In [None]:
for ___ in SeqIO.___('mixed_args.fasta','___'):
    if record.description.___('___') > __:
        print(record.description)

### Exercise 3b

Filter for sequences that do NOT have 'Erythromycin resistance' in record.description using `find()`


In [None]:
for ___ in SeqIO.___('mixed_args.fasta','___'):
    if record.description.___('___') ___ -1:
        print(record.description)

___

From Exercise 3, we can tell that gene names and descriptions are not always useful for filtering.

We can also try filtering by searching for sequence characters (nucleotide or amino acid) of interest.

For example, we can look for all sequences that contain a DNA motif:




In [78]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    if record.seq.find('GCTGTAAATC') > -1:
        print(record.description)

ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1
ERMA_STAAM P0A0H1 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain Mu50 / ATCC 700699) OX=158878 GN=ermA1 PE=3 SV=1
ERMA_STAAU P0A0H3 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus OX=1280 GN=ermA PE=3 SV=1
A0A133QE03_STASI Erythromycin resistance protein OS=Staphylococcus simulans OX=1286 GN=HMPREF3215_02510 PE=3 SV=1
D4REG1_ENTFC D4REG1 Macrolide-lincosamide-streptogramin B resistance protein OS=Enterococcus faecium E1679 OX=546340 GN=EfmE1679_0214 PE=3 SV=1
Q7DIC2_STAAU Q7DIC2 Erythromycin resistance protein OS=Staphylococcus aureus OX=1280 GN=ermA PE=3 SV=1
A0A564SEJ2_ENTAV Macrolide-lincosamide-streptogramin B resistance protein OS=Enterococcus avium OX=33945 GN=erm(A) PE=3 SV=1
A0A7V6HLW4_9FIRM 23S rRNA (Adenine(2058)-N(6))-methyltransferase Erm(A) OS=Bacilli bacterium OX=1903720 GN=erm(A) PE=4 SV=1
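
`find()` reports only the first match. If you also want to know where every copy of a motif sits, you can call `find()` repeatedly with a start position. This is a minimal sketch on a plain Python string; the helper name `find_all` and the DNA string are made up for illustration.

```python
def find_all(sequence, motif):
    """Return the start position of every (possibly overlapping) match."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume the search one character after the previous match.
        start = sequence.find(motif, start + 1)
    return positions

# Hypothetical sequence made up for this example.
dna = 'ATGGCTGTAAATCTTGCTGTAAATCA'
print(find_all(dna, 'GCTGTAAATC'))
```

To try this on a record from the file, one option is to pass `str(record.seq)` as the first argument.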


Now let's try filtering for sequences using protein motifs. 

This is a two-step challenge: first we must translate the DNA sequence, and then we need to search for the motif.

We will filter for sequences that carry the blaC motif, GAKR.

In [32]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    translated_record = record.translate()
    if translated_record.seq.find('GAKR') > -1:
        print(record.description)

BLAC_MYCTO P9WKD2 Beta-lactamase OS=Mycobacterium tuberculosis (strain CDC 1551 / Oshkosh) OX=83331 GN=blaC PE=3 SV=1
BLAC_MYCBO P0A5I7 Beta-lactamase OS=Mycobacterium bovis (strain ATCC BAA-935 / AF2122/97) OX=233413 GN=blaC PE=3 SV=1
AMPC_CITFR P05193 Beta-lactamase OS=Citrobacter freundii OX=546 GN=ampC PE=1 SV=1
BLAC_MYCTA A5U493 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) OX=419947 GN=blaC PE=1 SV=1
BLAC_MYCTU P9WKD3 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=blaC PE=1 SV=1


What if we want to use more than one motif to more clearly differentiate between sequences?



In [34]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    translated_record = record.translate()
    if translated_record.seq.find('GAKR') > -1:
        if translated_record.seq.find('DGTA') > -1:
            print(record.description)
# compare result to search with only GAKR motif

BLAC_MYCTO P9WKD2 Beta-lactamase OS=Mycobacterium tuberculosis (strain CDC 1551 / Oshkosh) OX=83331 GN=blaC PE=3 SV=1
BLAC_MYCBO P0A5I7 Beta-lactamase OS=Mycobacterium bovis (strain ATCC BAA-935 / AF2122/97) OX=233413 GN=blaC PE=3 SV=1
BLAC_MYCTA A5U493 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) OX=419947 GN=blaC PE=1 SV=1
BLAC_MYCTU P9WKD3 Beta-lactamase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) OX=83332 GN=blaC PE=1 SV=1


With the double-motif filter, we lose the beta-lactamase gene from *Citrobacter freundii* (ampC), which seems to be more phylogenetically distant than the blaC genes (it is also 1143 nucleotides long, compared with 921).
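
Nesting one `if` per motif becomes awkward as the list of motifs grows. One alternative is to check all motifs in a single condition. The sketch below works on plain strings standing in for translated sequences; the helper name `has_all_motifs` and the protein fragments are made up for illustration.

```python
def has_all_motifs(protein, motifs):
    """Return True only if every motif occurs somewhere in the protein string."""
    return all(protein.find(motif) > -1 for motif in motifs)

# Hypothetical protein fragments made up for this example.
proteins = {
    'both_motifs': 'MRGAKRLLDGTAVK',
    'only_gakr': 'MRGAKRLLSSTAVK',
    'neither': 'MRLLSSTAVK',
}
motifs = ['GAKR', 'DGTA']

matches = [name for name, protein in proteins.items()
           if has_all_motifs(protein, motifs)]
print(matches)
```

For just two motifs, joining the two `find()` tests with `and` in one `if` statement does the same job.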


---

### Exercise 4

Use `translate()` on the record and assign the result to the variable `translated_record`. Afterwards, filter for sequences that contain the ermA sequence motif `'QRALGL'` using `find()`

In [None]:

for record in SeqIO.parse('mixed_args.fasta','fasta'):
    translated_record = ___
    if translated_record.___.___('___') > ___:
        print(record.description)

---

# Section 2: Storing filtered sequences to files

After we parse files and look for features of interest, we typically want to store them in a new location, such as a new fasta file. 

We will first store the sequences we have filtered for in a list (like we learned to do in Day 2). Afterwards, we will write that list to a file using `SeqIO.write()`, which we also learned to do in Day 2.



In [94]:
# empty list to store modified sequence records
filtered_records_list = []

for record in SeqIO.parse('mixed_args.fasta', 'fasta'):
    # translate the sequence
    translated_seq = record.seq.translate()
    # search for the ermA motif
    if translated_seq.find('QRALGL') > -1:
        # if the sequence contains the motif, append it to our list
        filtered_records_list.append(record)

# examine the contents (will look a little messy)
print(filtered_records_list)


[SeqRecord(seq=Seq('ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACAT...CAT', SingleLetterAlphabet()), id='ERMA_STAAR', name='ERMA_STAAR', description='ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1', dbxrefs=[]), SeqRecord(seq=Seq('ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACAT...CAT', SingleLetterAlphabet()), id='ERMA_STAAM', name='ERMA_STAAM', description='ERMA_STAAM P0A0H1 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain Mu50 / ATCC 700699) OX=158878 GN=ermA1 PE=3 SV=1', dbxrefs=[]), SeqRecord(seq=Seq('ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACAT...CAT', SingleLetterAlphabet()), id='ERMA_STAAU', name='ERMA_STAAU', description='ERMA_STAAU P0A0H3 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus OX=1280 GN=ermA PE=3 SV=1', dbxrefs=[]), SeqRecord(seq=Seq('ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACAT...CAT', SingleLetterAlphabet()), id='A0A133QE03_S

Now that we have the list ready, we can write it to a new fasta file using `SeqIO.write()`.


In [102]:
SeqIO.write(filtered_records_list, 'erma.fasta','fasta')
# open and inspect!

8
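
The number printed by `SeqIO.write()` is the count of records written. One quick way to double-check a fasta file without opening it by hand is to count its header lines, since each record starts with `>`. The sketch below is plain Python; the helper name `count_fasta_records` and the tiny example file are made up so that the code runs on its own.

```python
import os
import tempfile

def count_fasta_records(path):
    """Count records in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith('>'))

# Write a tiny, made-up FASTA file so the example runs on its own.
example_fasta = '>seq1 first example\nATGAAA\n>seq2 second example\nATGCCC\nGGGTAA\n'

with tempfile.TemporaryDirectory() as folder:
    example_path = os.path.join(folder, 'example.fasta')
    with open(example_path, 'w') as handle:
        handle.write(example_fasta)
    n_records = count_fasta_records(example_path)

print(n_records)
```

In the notebook, calling `count_fasta_records('erma.fasta')` should agree with the number returned by `SeqIO.write()`.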

---

### Exercise 5a

Extract all sequences with the blaC motif to the list `extracted_seqs`. Use `SeqIO.write()` to write them to a new file called `'day3_5a_blac.fasta'`






In [46]:
# empty list to store the extracted sequence records
extracted_seqs = ___

for ___ in ___('mixed_args.fasta', 'fasta'):
    # translate the sequence
    translated_seq = record.seq.translate()
    # search for the blaC motif
    if translated_seq.find('GAKR') > -1:
        # if the sequence contains the motif, append it to our list
        extracted_seqs.append(record)

# examine the contents
print(extracted_seqs)

SeqIO.___(___, 'day3_5a_blac.fasta','___')


### Exercise 5b

Extract all sequences that contain the blaC motif and are also shorter than 1143 nucleotides. Use `SeqIO.write()` to write them to a new file called `'day3_5b_blac.fasta'`

In [None]:
# empty list to store the extracted sequence records
extracted_seqs = ___

for ___ in ___('mixed_args.fasta', 'fasta'):
    # translate the sequence
    translated_seq = record.seq.translate()
    # search for the blaC motif
    if translated_seq.find('GAKR') > -1:
        # check if the nucleotide sequence is shorter than 1143 nucleotides
        if len(___.seq) < 1143:
            # append the record to our list
            extracted_seqs.___(record)

# write to file
SeqIO.___(___, 'day3_5b_blac.fasta','fasta')

___

# Section 3: Reformatting and storing sequences

It is sometimes useful to rename or reformat the headers of sequences.

You can use string manipulation to modify record.id or record.description.

Let's look at the unsightly headers in a file called `'to_reformat.fasta'`

In [80]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    print('record id:',' ',record.id)
    print('record description:',' ',record.description)
    print('\n')

record id:   UgLyGenE_fabV
record description:   UgLyGenE_fabV with_UgLy_dEscRiptOr fOr for fabV


record id:   UgLyGenE_fabI
record description:   UgLyGenE_fabI with_UgLy_dEscRiptOr fOr for fabV


record id:   UgLyGenE_fabk
record description:   UgLyGenE_fabk with_UgLy_dEscRiptOr fOr fabV




Let's say we don't want our record id to say UgLyGenE_ because we already have a nice gene name next to it. We can use the `replace()` method we learned in Day 1.

In [83]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    record.id = record.id.replace('UgLyGenE_', '')
    
    print('record id:',' ',record.id)
    print('record description:',' ',record.description)
    print('\n')

record id:   fabV
record description:   UgLyGenE_fabV with_UgLy_dEscRiptOr fOr for fabV


record id:   fabI
record description:   UgLyGenE_fabI with_UgLy_dEscRiptOr fOr for fabI


record id:   fabK
record description:   UgLyGenE_fabK with_UgLy_dEscRiptOr fOr fabK
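
Keep in mind that `replace()` removes every occurrence of the text, wherever it appears in the string. If you only want to drop it from the start of an id, a small helper built on `startswith()` is one option. This is a minimal sketch on plain strings; the helper name `strip_prefix` and the second id are made up for illustration.

```python
def strip_prefix(text, prefix):
    """Remove prefix from the start of text; leave text unchanged otherwise."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# The second ID is hypothetical: it repeats the unwanted text in the middle.
ids = ['UgLyGenE_fabV', 'UgLyGenE_fab_UgLyGenE_X']

for seq_id in ids:
    print(seq_id.replace('UgLyGenE_', ''), strip_prefix(seq_id, 'UgLyGenE_'))
```

For the ids in `to_reformat.fasta` both approaches give the same result, because the unwanted text occurs only once, at the start.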




Now we also want to change the description. Maybe all these genes are from *E. coli*, so we want to include that in the description.

In [84]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    record.id = record.id.replace('UgLyGenE_', '')
    record.description = 'this gene is from Escherichia coli'

    print('record id:',' ',record.id)
    print('record description:',' ',record.description)
    print('\n') # for clarity

record id:   fabV
record description:   this gene is from Escherichia coli


record id:   fabI
record description:   this gene is from Escherichia coli


record id:   fabK
record description:   this gene is from Escherichia coli




We also notice one more thing: all of our sequences have an asterisk (`*`) at the end, which marks a stop codon in a translated sequence. Let's print the sequences to see it.

In [90]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    print(record.seq)
    print('\n') #for clarity

MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRSVTAIEIDGGLCQVTKEAVNPSENIKVIQTDILKFSFPKHINYKIYGNIPYNISTDIVKRITFESQAKYSYLIVEKGFAKRLQNLQRALGLLLMVEMDIKMLKKVPPLYFHPKPSVDSVLIVLERHQPLISKKDYKKYRSFVYKWVNREYRVLFTKNQFRQALKHANVTNINKLSKEQFLSIFNSYKLFH*


MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRSVTAIEIDGGLCQVTKEAVNPSENIKVIQTDILKFSFPKHINYKIYGNIPYNISTDIVKRITFESQAKYSYLIVEKGFAKRLQNLQRALGLLLMVEMDIKMLKKVPPLYFHPKPSVDSVLIVLERHQPLISKKDYKKYRSFVYKWVNREYRVLFTKNQFRQALKHANVTNINKLSKEQFLSIFNSYKLFH*


MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRSVTAIEIDGGLCQVTKEAVNPSENIKVIQTDILKFSFPKHINYKIYGNIPYNISTDIVKRITFESQAKYSYLIVEKGFAKRLQNLQRALGLLLMVEMDIKMLKKVPPLYFHPKPSVDSVLIVLERHQPLISKKDYKKYRSFVYKWVNREYRVLFTKNQFRQALKHANVTNINKLSKEQFLSIFNSYKLFH*




Let's get rid of the asterisk in the sequences.

In [113]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    # create a string version of record.seq so that we can use the string method replace()
    sequence_string = str(record.seq)
    sequence_string = sequence_string.replace('*','')
    # afterwards, convert the modified string into a Seq object
    record.seq = Seq(sequence_string)
    print(record.seq)
    print('\n') #for clarity

MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRSVTAIEIDGGLCQVTKEAVNPSENIKVIQTDILKFSFPKHINYKIYGNIPYNISTDIVKRITFESQAKYSYLIVEKGFAKRLQNLQRALGLLLMVEMDIKMLKKVPPLYFHPKPSVDSVLIVLERHQPLISKKDYKKYRSFVYKWVNREYRVLFTKNQFRQALKHANVTNINKLSKEQFLSIFNSYKLFH


MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRSVTAIEIDGGLCQVTKEAVNPSENIKVIQTDILKFSFPKHINYKIYGNIPYNISTDIVKRITFESQAKYSYLIVEKGFAKRLQNLQRALGLLLMVEMDIKMLKKVPPLYFHPKPSVDSVLIVLERHQPLISKKDYKKYRSFVYKWVNREYRVLFTKNQFRQALKHANVTNINKLSKEQFLSIFNSYKLFH


MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRSVTAIEIDGGLCQVTKEAVNPSENIKVIQTDILKFSFPKHINYKIYGNIPYNISTDIVKRITFESQAKYSYLIVEKGFAKRLQNLQRALGLLLMVEMDIKMLKKVPPLYFHPKPSVDSVLIVLERHQPLISKKDYKKYRSFVYKWVNREYRVLFTKNQFRQALKHANVTNINKLSKEQFLSIFNSYKLFH
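
Note that `replace('*', '')` removes every asterisk, including any in the middle of a sequence. If you only want to drop the trailing one, `rstrip('*')` is an alternative. The sketch below compares the two on a plain string; the short protein is made up for illustration and contains an internal stop on purpose.

```python
# Hypothetical translated sequence with an internal stop (*) and a final stop.
protein = 'MNQK*NPKDTQ*'

all_removed = protein.replace('*', '')   # removes every asterisk
end_removed = protein.rstrip('*')        # removes asterisks only at the end

print(all_removed)
print(end_removed)
```

For the sequences in `to_reformat.fasta`, which carry a single asterisk at the end, both calls give the same result.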




Putting it all together:

In [114]:
store_modified_records = []

for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    # change record.id
    record.id = record.id.replace('UgLyGenE_', '')
    # change record.description
    record.description = 'this gene is from Escherichia coli'
    # change sequence
    sequence_string = str(record.seq)
    sequence_string = sequence_string.replace('*','')
    record.seq = Seq(sequence_string)
    # store 
    store_modified_records.append(record)

# write to file
SeqIO.write(store_modified_records,'reformatted_sequences.fasta','fasta')
# open and view file


3

---

### Exercise 6: 

Replace the string `'UgLyGenE_'` found in `record.id` with an empty string `''` by using `replace()`.

Set `record.description` to an uppercased `record.id` using `upper()`.

Afterward, add the string `'_Escherichia coli'` to `record.description`.

In [None]:
for record in SeqIO.parse('to_reformat.fasta', 'fasta'):
    ___.id = ___.id.___('___', '')
    ___.description = record.___.upper()
    ___.description = ___.description + '___'

    print(record.description)
    print('\n')

---

That's it for Day 3! 