In [19]:
from fasta import *
import alignment
import utilities
from checkGenome import *
from ipywidgets import widgets
from ipywidgets import *
from traitlets import *
from IPython.display import display

# Loading the files

The first thing we want to do is load our files. `utilities.load_sequences` reads the FASTA file into a dictionary where the keys are the record IDs and the values are the full records.

In [20]:
# Read in the FASTA file
full_record = utilities.load_sequences("files/example_curation.fasta")
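
As a rough illustration of that dictionary structure, the sketch below parses FASTA text into a dictionary keyed by record ID using only the standard library. The `parse_fasta` helper and the `Record` tuple are hypothetical stand-ins (`utilities.load_sequences` may be implemented quite differently), and the sequences are made-up placeholders.

```python
from collections import namedtuple

# Hypothetical record type; the real records may carry more fields.
Record = namedtuple("Record", ["id", "description", "seq"])

def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record ID: Record}."""
    records = {}
    header, chunks = None, []

    def store():
        # The ID is taken to be the first whitespace-delimited token of the header
        rec_id = header.split()[0]
        records[rec_id] = Record(rec_id, header, "".join(chunks))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                store()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        store()
    return records

# Placeholder sequences, not real CYP2U1 data
example = """>NP_898898.1 cytochrome P450 2U1 [Homo sapiens]
MSSPGP
LVFFAR
>NP_001069518.1 cytochrome P450 2U1 [Bos taurus]
MASLGL
"""

toy_records = parse_fasta(example)
for rec_id, rec in toy_records.items():
    print(rec_id, len(rec.seq), rec.description)
```

Keying by ID makes it cheap to look a record up later, which is what the mapping steps further down rely on.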

#### How many sequences do we have in the full records?

In [21]:
print ("The total length of CYP2U1 hits before cleaning them up is %d \n " % len(full_record))

print ("And the sequences are")

for seq in full_record.values():
    print (seq.description)

The total length of CYP2U1 hits before cleaning them up is 12 
 
And the sequences are
ARO89866.1 cytochrome P450 Cyp2u1 [Andrias davidianus]
NP_001106471.1 cytochrome P450 family 2 subfamily U member 1 [Xenopus tropicalis]
XP_018106696.1 PREDICTED: cytochrome P450 2U1-like [Xenopus laevis]
XP_018409984.1 PREDICTED: cytochrome P450 2U1 [Nanorana parkeri]
XP_006787161.1 PREDICTED: cytochrome P450 2U1-like [Neolamprologus brichardi]
BAF82691.1 unnamed protein product [Homo sapiens]
NP_898898.1 cytochrome P450 2U1 [Homo sapiens]
XP_004040295.1 PREDICTED: cytochrome P450 2U1 isoform X1 [Gorilla gorilla gorilla]
BAG64362.1 unnamed protein product [Homo sapiens]
BAG65487.1 unnamed protein product [Homo sapiens]
NP_001069518.1 cytochrome P450 2U1 [Bos taurus]
DAA28880.1 TPA: cytochrome P450 2U1 [Bos taurus]


# Creating subsets of the records
### Including / excluding based on header annotations

Now the workflow moves to including and excluding certain sequences. `subset_records` allows us to provide a list of terms which we either want or don't want in the header description. We can also give a minimum length for sequences to meet for inclusion.

We never alter the original `full_record`; we just create new dictionary objects that are subsets.

We can either provide arguments directly to the function or we can pass in a list variable, such as `header_terms`.

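In the example below, `only_2U1_records` is set to include only sequences that have either '2U1' or '2U1-like' in the header and that meet a minimum length of 400 (`length=400`). `filtered_records` will contain the full set of sequences, as we are not providing a length minimum (the default is 0) and we are not passing in `header_terms`. Finally, `exclude_character` gives us `x_removed_records`, the sequences that do not contain the character X.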

In [22]:
# A blank list to hold terms we want to exclude or include
header_terms = []

In [23]:
only_2U1_records = subset_records("2U1", "2U1-like", records=full_record, length=400, mode="include")
filtered_records = subset_records(records=full_record)
x_removed_records = exclude_character(filtered_records, "X")

print ("The number of sequences with either 2U1 or 2U1-like in the header is %s " % (len(only_2U1_records)))
print ("The number of sequences we've filtered is %s which should be equal to %s" % (len(filtered_records), len(full_record)))
print ("The number of sequences without X character is %s" % (len(x_removed_records)))

The number of sequences with either 2U1 or 2U1-like in the header is 8 
The number of sequences we've filtered is 12 which should be equal to 12
The number of sequences without X character is 11
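
To make the include / exclude logic concrete, here is a minimal sketch of header-based subsetting on plain dictionaries. `subset_by_terms` is a hypothetical helper written for illustration, not the notebook's `subset_records`; the substring matching, the `"exclude"` default, and the toy records are all assumptions.

```python
def subset_by_terms(records, terms=(), min_length=0, mode="exclude"):
    """Return a new dict filtered by header terms and minimum length.

    records maps ID -> (description, sequence); the input is never modified.
    """
    subset = {}
    for rec_id, (description, seq) in records.items():
        if len(seq) < min_length:
            continue
        has_term = any(term in description for term in terms)
        if mode == "include" and not has_term:
            continue
        if mode == "exclude" and has_term:
            continue
        subset[rec_id] = (description, seq)
    return subset

# Made-up records: the sequences are placeholders of a chosen length
toy = {
    "A1": ("A1 cytochrome P450 2U1 [Homo sapiens]", "M" * 500),
    "A2": ("A2 cytochrome P450 2U1-like [Xenopus laevis]", "M" * 450),
    "A3": ("A3 unnamed protein product [Homo sapiens]", "M" * 520),
    "A4": ("A4 cytochrome P450 2U1 [Bos taurus]", "M" * 120),
}

only_2u1 = subset_by_terms(toy, terms=["2U1"], min_length=400, mode="include")
everything = subset_by_terms(toy)
no_xenopus = subset_by_terms(toy, terms=["2U1-like", "Xenopus"], mode="exclude")

print(sorted(only_2u1))    # A4 is dropped for being too short
print(sorted(everything))  # no terms and no minimum length keeps all records
print(sorted(no_xenopus))
```

With plain substring matching, the term "2U1" already matches "2U1-like" headers, so listing both is only needed if the matching is stricter than in this sketch.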


### Try running the previous cell again, but with `mode="exclude"`

### Adding terms to the `header_terms` variable
The following section makes it easy to add in terms to the `header_terms` variable and to save these files for later use.

Let's first print out the terms in our variable and how many there are. As you add to the list, you can always come back and rerun this cell to peek inside the `header_terms` variable.

In [24]:
print (header_terms)
print (len(header_terms))

[]
0


The first thing we might be interested in doing is to print out the header information of the sequences we currently have.

In [25]:
for record in filtered_records:
    print (filtered_records[record].description)

ARO89866.1 cytochrome P450 Cyp2u1 [Andrias davidianus]
NP_001106471.1 cytochrome P450 family 2 subfamily U member 1 [Xenopus tropicalis]
XP_018106696.1 PREDICTED: cytochrome P450 2U1-like [Xenopus laevis]
XP_018409984.1 PREDICTED: cytochrome P450 2U1 [Nanorana parkeri]
XP_006787161.1 PREDICTED: cytochrome P450 2U1-like [Neolamprologus brichardi]
BAF82691.1 unnamed protein product [Homo sapiens]
NP_898898.1 cytochrome P450 2U1 [Homo sapiens]
XP_004040295.1 PREDICTED: cytochrome P450 2U1 isoform X1 [Gorilla gorilla gorilla]
BAG64362.1 unnamed protein product [Homo sapiens]
BAG65487.1 unnamed protein product [Homo sapiens]
NP_001069518.1 cytochrome P450 2U1 [Bos taurus]
DAA28880.1 TPA: cytochrome P450 2U1 [Bos taurus]


The cell below will add items to our `header_terms` variable. Run the cell and you'll see an input box; type the words you want to add, separated by spaces, and press Enter.

In [26]:
add = widgets.Text()
display(add)

def handle_submit(sender):
    for item in add.value.split():
        header_terms.append(item)
    print (header_terms)
add.on_submit(handle_submit)

And then we can also remove terms in the same way.

In [27]:
remove = widgets.Text()
display(remove)

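def handle_submit(sender):
    for item in remove.value.split():
        # Only remove terms that are present; list.remove raises ValueError otherwise
        if item in header_terms:
            header_terms.remove(item)
    print (header_terms)

remove.on_submit(handle_submit)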

Below is the same cell as before, which lets us check all the words in `header_terms` so far.

In [28]:
print (header_terms)
print (len(header_terms))

[]
0


Have a play around with adding and removing words to the `header_terms` list and then the following cells illustrate how it can be used.

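Make the `header_terms` list contain just the terms "2U1-like" and "Xenopus", and then we'll create a new record called `no_xenopus_records` that doesn't contain any 2U1-like sequences or any Xenopus sequences. If `header_terms` is still empty when you run the cell, nothing is excluded and all 12 records are listed, as in the output shown here.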

In [29]:
no_xenopus_records = subset_records(*header_terms, records=full_record, mode='exclude')
for record in no_xenopus_records:
    print (no_xenopus_records[record].description)

ARO89866.1 cytochrome P450 Cyp2u1 [Andrias davidianus]
NP_001106471.1 cytochrome P450 family 2 subfamily U member 1 [Xenopus tropicalis]
XP_018106696.1 PREDICTED: cytochrome P450 2U1-like [Xenopus laevis]
XP_018409984.1 PREDICTED: cytochrome P450 2U1 [Nanorana parkeri]
XP_006787161.1 PREDICTED: cytochrome P450 2U1-like [Neolamprologus brichardi]
BAF82691.1 unnamed protein product [Homo sapiens]
NP_898898.1 cytochrome P450 2U1 [Homo sapiens]
XP_004040295.1 PREDICTED: cytochrome P450 2U1 isoform X1 [Gorilla gorilla gorilla]
BAG64362.1 unnamed protein product [Homo sapiens]
BAG65487.1 unnamed protein product [Homo sapiens]
NP_001069518.1 cytochrome P450 2U1 [Bos taurus]
DAA28880.1 TPA: cytochrome P450 2U1 [Bos taurus]


### Subsetting record files using regular expressions
Typing all of the particular items we want to include or exclude can be time-consuming, and often we want to include or exclude all of the members of a family. So we can use regular expressions in `subset_records_with_regex` and only supply the first part of the family name and have it automatically match to all headers that contain text starting with that first part.

For example, excluding "2J" would exclude "2J6", "2J2", and "2J2-like" (as well as others).
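
A minimal sketch of that prefix matching with Python's `re` module is shown below. `subset_by_regex` is a hypothetical helper, not the notebook's `subset_records_with_regex`, whose signature is not shown here; the headers are made up for illustration.

```python
import re

def subset_by_regex(records, pattern, mode="exclude"):
    """Keep or drop records whose header matches the regular expression.

    records maps ID -> header description; a new dict is returned.
    """
    regex = re.compile(pattern)
    subset = {}
    for rec_id, description in records.items():
        matched = regex.search(description) is not None
        keep = matched if mode == "include" else not matched
        if keep:
            subset[rec_id] = description
    return subset

# Made-up headers for illustration
toy_headers = {
    "X1": "X1 cytochrome P450 2J6 [Mus musculus]",
    "X2": "X2 cytochrome P450 2J2 [Homo sapiens]",
    "X3": "X3 PREDICTED: cytochrome P450 2J2-like [Bos taurus]",
    "X4": "X4 cytochrome P450 2U1 [Homo sapiens]",
}

# \b anchors the match at the start of a word, so "2J" must begin a token
without_2j = subset_by_regex(toy_headers, r"\b2J", mode="exclude")
only_2j = subset_by_regex(toy_headers, r"\b2J", mode="include")

print(sorted(without_2j))
print(sorted(only_2j))
```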

# Evaluating how many hits per species

`build_species_count` builds a dictionary which has the set of unique species as its keys and a list of the sequence IDs that belong to each unique species as its values. We can use it to easily see how many unique species we have and which species are over-represented.

In [30]:
species_counts = build_species_count(records=full_record)
print("There are %s unique species in our dataset." % (len(species_counts)))

There are 8 unique species in our dataset.
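
The idea can be sketched in a few lines, assuming the species is the text inside the final pair of square brackets of each header. `count_species` is a hypothetical stand-in, not the notebook's `build_species_count`, which may work differently; the headers are the 12 listed earlier in this notebook.

```python
import re
from collections import defaultdict

headers = [
    "ARO89866.1 cytochrome P450 Cyp2u1 [Andrias davidianus]",
    "NP_001106471.1 cytochrome P450 family 2 subfamily U member 1 [Xenopus tropicalis]",
    "XP_018106696.1 PREDICTED: cytochrome P450 2U1-like [Xenopus laevis]",
    "XP_018409984.1 PREDICTED: cytochrome P450 2U1 [Nanorana parkeri]",
    "XP_006787161.1 PREDICTED: cytochrome P450 2U1-like [Neolamprologus brichardi]",
    "BAF82691.1 unnamed protein product [Homo sapiens]",
    "NP_898898.1 cytochrome P450 2U1 [Homo sapiens]",
    "XP_004040295.1 PREDICTED: cytochrome P450 2U1 isoform X1 [Gorilla gorilla gorilla]",
    "BAG64362.1 unnamed protein product [Homo sapiens]",
    "BAG65487.1 unnamed protein product [Homo sapiens]",
    "NP_001069518.1 cytochrome P450 2U1 [Bos taurus]",
    "DAA28880.1 TPA: cytochrome P450 2U1 [Bos taurus]",
]

def count_species(headers):
    """Map each species name to the list of record IDs found for it."""
    counts = defaultdict(list)
    for header in headers:
        # Assumption: the species is the bracketed text at the end of the header
        match = re.search(r"\[([^\[\]]+)\]\s*$", header)
        if match:
            counts[match.group(1)].append(header.split()[0])
    return dict(counts)

toy_counts = count_species(headers)
print(len(toy_counts))  # 8 unique species
for species, ids in toy_counts.items():
    if len(ids) >= 2:
        print(species, len(ids))
```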


### Plotting the frequency of proteins per species
`plot_record_number` is a function that plots the numbers of IDs per species. We can set a minimum number of IDs that a species must have in order to be plotted.

In [31]:
plotthis = plot_record_number(species_counts, "Bar", min_length=3)
py.iplot(plotthis, filename='inline_bar')

PlotlyRequestError: No message

In [32]:
plotthis = plot_record_number(species_counts, "Bar", min_length=2)
py.iplot(plotthis, filename='inline_bar')

PlotlyRequestError: No message

#### We can also just extract the names using `get_species_names`, which also accepts a minimum number of IDs required and can print out the count for each species

In [33]:
species_names_with_counts = get_species_names(species_counts, min_length=2, counts=True)
for name in species_names_with_counts:
    print (name)

Homo sapiens 4
Bos taurus 2


### Counting the total number of sequences with multiple hits
`count_ids` is a function that counts the total number of sequences in a species count dictionary, not just the number of unique species.

As before, it can also take a minimum number of IDs required per species.

In [34]:
min_num = 2
print ("There are %s total sequences in our filtered dataset." % (count_ids(species_counts)))
print ("There are %s total sequences in our filtered dataset that have %d or more IDs per species." % (count_ids(species_counts, min_length=min_num), min_num))

There are 12 total sequences in our filtered dataset.
There are 6 total sequences in our filtered dataset that have 2 or more IDs per species.
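
The difference between counting species and counting sequences can be sketched as follows. `count_total_ids` is a hypothetical helper, not the notebook's `count_ids`; the dictionary reproduces the species and IDs from the headers listed earlier.

```python
def count_total_ids(species_to_ids, min_length=0):
    """Sum the IDs over all species that have at least min_length IDs."""
    return sum(
        len(ids) for ids in species_to_ids.values() if len(ids) >= min_length
    )

toy_species_counts = {
    "Andrias davidianus": ["ARO89866.1"],
    "Xenopus tropicalis": ["NP_001106471.1"],
    "Xenopus laevis": ["XP_018106696.1"],
    "Nanorana parkeri": ["XP_018409984.1"],
    "Neolamprologus brichardi": ["XP_006787161.1"],
    "Homo sapiens": ["BAF82691.1", "NP_898898.1", "BAG64362.1", "BAG65487.1"],
    "Gorilla gorilla gorilla": ["XP_004040295.1"],
    "Bos taurus": ["NP_001069518.1", "DAA28880.1"],
}

print(len(toy_species_counts))                            # 8 species
print(count_total_ids(toy_species_counts))                # 12 sequences
print(count_total_ids(toy_species_counts, min_length=2))  # 6 sequences
```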


# Generating datasets containing information about species with multiple hits
For each species that has more than the given number of hits, we create:
1. A FASTA file of the protein sequences from that species
2. An alignment of the protein sequences
3. An information file telling us where in the genome each protein maps to
4. A diagram showing where the proteins map in the genome

In [37]:
def generate_multiple_hit_data(species_names, species_counts, full_record, file_path):
    # Save the genomic location of each protein to disk
    check_genomic_location(species_counts, min_length=1, file_path=file_path + " gene locations ")
    # Draw a linear diagram of where the proteins map in the genome
    check_genomic_location(species_counts, min_length=1, visualise="linear")


species_names = get_species_names(species_counts, min_length=1)
generate_multiple_hit_data(species_names, species_counts, full_record, "files/multiple_hits/")

URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

Or we could just use parts of this function. The cell below just prints out the locations of the proteins in the genome. We could save this to disk by passing a `file_path` argument, or visualise it by passing either 'linear' or 'circular' as the `visualise` argument.

In [38]:
check_genomic_location(species_counts, min_length=5)

# Saving the records to FASTA files
Because `species_counts` just contains the species names and the IDs belonging to each species, we need to map these IDs back to their full records. We can use the function `map_list_to_records`, which lets us select all the records in `species_counts` or just the unique species.

In [46]:
filtered_records = map_list_to_records(species_counts, full_record)
filtered_records_unique = map_list_to_records(species_counts, full_record, unique=True)

# Check that the numbers are correct
print (len(filtered_records))
print (len(filtered_records_unique))

0
0
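
For intuition, mapping IDs back to full records can be sketched as a dictionary lookup. `map_ids_to_full_records` is a hypothetical helper, not the notebook's function, and the meaning given to `unique=True` here (keep only the first ID of each species) is an assumption; the record values are placeholders.

```python
def map_ids_to_full_records(species_to_ids, records, unique=False):
    """Collect the full records for the IDs listed under each species."""
    selected = {}
    for species, ids in species_to_ids.items():
        # Assumption: unique=True keeps one representative per species
        chosen = ids[:1] if unique else ids
        for rec_id in chosen:
            # IDs that are missing from records are silently skipped
            if rec_id in records:
                selected[rec_id] = records[rec_id]
    return selected

toy_species_ids = {
    "Homo sapiens": ["NP_898898.1", "BAF82691.1"],
    "Bos taurus": ["NP_001069518.1"],
}
# Placeholder values stand in for full sequence records
toy_full = {
    "NP_898898.1": "record A",
    "BAF82691.1": "record B",
    "NP_001069518.1": "record C",
    "ARO89866.1": "record D",
}

all_mapped = map_ids_to_full_records(toy_species_ids, toy_full)
one_per_species = map_ids_to_full_records(toy_species_ids, toy_full, unique=True)
print(len(all_mapped), len(one_per_species))
```

In a lookup like this, a result of length 0 would mean that none of the IDs matched a key in the record dictionary, so comparing the two sets of keys is a quick first check.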


And now we can save these records to a new FASTA file using `write_fasta`.

In [None]:
write_fasta(filtered_records, "files/2U1_BLAST_smaller_records.fasta")
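
For reference, writing records out in FASTA format amounts to a header line starting with `>` followed by the sequence wrapped to a fixed width. The sketch below writes to an in-memory buffer; `write_fasta_text` is a hypothetical helper, not the notebook's `write_fasta`, and the 60-character line width and the sequence are assumptions.

```python
import io

def write_fasta_text(records, handle, width=60):
    """Write {ID: (description, sequence)} records to a file-like object."""
    for rec_id, (description, seq) in records.items():
        handle.write(">" + description + "\n")
        # Wrap the sequence into lines of at most `width` characters
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")

# Placeholder sequence of 130 characters, not real CYP2U1 data
toy_out = {
    "NP_898898.1": (
        "NP_898898.1 cytochrome P450 2U1 [Homo sapiens]",
        "ACDEFGHIKL" * 13,
    ),
}

buffer = io.StringIO()
write_fasta_text(toy_out, buffer)
fasta_lines = buffer.getvalue().splitlines()
print(buffer.getvalue())
```

Passing an open file instead of the `StringIO` buffer would write the same text to disk.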

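We can also use the function `map_species_to_records` to map just one particular species to a FASTA file. The species name must be a key in `species_counts`; 'Priapulus caudatus' is not among the 12 example records listed above, so substitute a species that is present in your dataset.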

In [None]:
priapulus_caudatus = map_species_to_records(species_counts['Priapulus caudatus'], full_record)
write_fasta(priapulus_caudatus, "files/priapulus_caudatus.fasta")