In [1]:
from Bio import SeqIO  # imported explicitly because SeqIO is called directly below
from fasta import *
import alignment
import utilities
from checkGenome import *
from ipywidgets import widgets
from ipywidgets import *
from traitlets import *
from IPython.display import display

# Loading the files

The first thing we want to do is load our files. `SeqIO.to_dict` reads the FASTA file into a dictionary where the keys are the record IDs and the values are the full records.
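To make the ID-keyed dictionary concrete, here is a minimal sketch that uses only the standard library. It is not how `SeqIO.to_dict` is implemented; the function `parse_fasta_text` and the two toy records are made up for illustration, and each value is a plain `(description, sequence)` tuple rather than a full record object.

```python
def parse_fasta_text(text):
    """Parse FASTA-formatted text into {record_id: (description, sequence)}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            description = line[1:]
            # Assumption: the record ID is the first word of the header line
            current_id = description.split()[0]
            if current_id in records:
                raise ValueError("Duplicate record ID: %s" % current_id)
            records[current_id] = [description, ""]
        elif current_id is not None:
            # Sequence lines are concatenated until the next header
            records[current_id][1] += line
    return {rid: (desc, seq) for rid, (desc, seq) in records.items()}


# Hypothetical toy input, not taken from the real files
toy_fasta = """>SEQ_A toy protein one [Species one]
MKTAYIAK
QRQISFVK
>SEQ_B toy protein two [Species two]
MSLLTEVE
"""

toy_records = parse_fasta_text(toy_fasta)
print(sorted(toy_records))
print(toy_records["SEQ_A"][1])
```

Keying by ID makes lookups such as `toy_records["SEQ_A"]` immediate, which is what the later subsetting and mapping steps rely on.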

In [2]:
# Load the 2U1 files
# full_record = SeqIO.to_dict(SeqIO.parse("files/2U1_all_candidates_PSI_BLAST_unique.fasta", "fasta"))
# full_record = SeqIO.to_dict(SeqIO.parse("files/Python_bivittaus_full_PSI.fasta", "fasta"))
# full_record = SeqIO.to_dict(SeqIO.parse("files/2U1_and_2U1_like_candidates_BLAST_results_unique_X_seqs_removed.fasta", "fasta"))
full_record = SeqIO.to_dict(SeqIO.parse("files/candidates/2U1_BLAST_unique.fasta", "fasta"))
# full_record = SeqIO.to_dict(SeqIO.parse("files/71AV.fasta", "fasta"))
#full_record = SeqIO.to_dict(SeqIO.parse("files/homo_sapiens.fasta", "fasta"))

# full_record = SeqIO.to_dict(SeqIO.parse("files/candidates/regextest.fasta", "fasta"))

#### How many sequences do we have in the full records?

In [3]:
print (len(full_record))

24665


# Creating subsets of the records
### Including / excluding based on header annotations

Now the workflow moves to including and excluding certain sequences. `subset_records` lets us provide a list of terms that we either want or don't want in the header description. We can also give a minimum length that sequences must meet to be included.

We never alter the original `full_record`; we just create new dictionary objects that are subsets of it.

We can either provide the terms directly as arguments to the function or pass in a list variable, such as `header_terms`.

In the example below, `only_2U1_records` includes only sequences that have either '2U1' or '2U1-like' in the header and that meet the minimum length of 400. `filtered_records` is built from the currently empty `header_terms`, so no sequence is excluded by its header, but the minimum length of 500 still removes shorter sequences. Without a length minimum (the default is 0), `filtered_records` would contain the full set of sequences.

In [4]:
# A blank list to hold terms we want to exclude or include
header_terms = []

In [5]:
only_2U1_records = subset_records("2U1", "2U1-like", records=full_record, length=400, mode="include")
filtered_records = subset_records(*header_terms, records=full_record, length=500, mode='exclude')

print ("The number of sequences with either 2U1 or 2U1-like in the header is %s " % (len(only_2U1_records)))
print ("The number of sequences we've filtered is %s which should be equal to %s" % (len(filtered_records), len(full_record)))

The number of sequences with either 2U1 or 2U1-like in the header is 437 
The number of sequences we've filtered is 12420 which should be equal to 24665
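The include / exclude logic can be sketched on toy data as follows. This is a hypothetical stand-in, not the notebook's `subset_records`: the name `subset_by_terms`, the whole-word matching rule, and the `(description, sequence)` tuples are all assumptions made for illustration.

```python
def subset_by_terms(*terms, records, length=0, mode="exclude"):
    """Return a new dict of records filtered by header terms and minimum length."""
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")
    subset = {}
    for record_id, (description, sequence) in records.items():
        # Drop sequences shorter than the minimum length
        if len(sequence) < length:
            continue
        # Assumption: a term matches when it equals a whole word of the header
        words = set(description.split())
        has_term = any(term in words for term in terms)
        # include keeps records with a term; exclude keeps records without one
        if has_term == (mode == "include"):
            subset[record_id] = (description, sequence)
    return subset


# Hypothetical toy records
toy = {
    "A1": ("A1 cytochrome P450 2U1 [Toy species]", "M" * 10),
    "A2": ("A2 cytochrome P450 2U1-like [Toy species]", "M" * 4),
    "A3": ("A3 cytochrome P450 2J2 [Other species]", "M" * 10),
}

only_2u1 = subset_by_terms("2U1", "2U1-like", records=toy, length=5, mode="include")
no_terms = subset_by_terms(*[], records=toy, length=5, mode="exclude")
print(sorted(only_2u1))
print(sorted(no_terms))
print(len(toy))  # the original dictionary is left untouched
```

With an empty term list in `exclude` mode, only the length filter has any effect, which mirrors why the filtered count above is smaller than the full record count.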


### Adding terms to the `header_terms` variable
The following section makes it easy to add in terms to the `header_terms` variable and to save these files for later use.

Let's first print out the terms in our variable and its length. As you add to the list, you can always come back and rerun this cell to peek inside the `header_terms` variable.

In [6]:
print (header_terms)
print (len(header_terms))

[]
0


The first thing we might be interested in doing is to print out the header information of the sequences we currently have.

In [7]:
for record in filtered_records:
    print (filtered_records[record].description)

ARO89866.1 cytochrome P450 Cyp2u1 [Andrias davidianus]
NP_001106471.1 cytochrome P450 family 2 subfamily U member 1 [Xenopus tropicalis] >AAI54093.1 LOC100127656 protein [Xenopus tropicalis]
XP_018106696.1 PREDICTED: cytochrome P450 2U1-like [Xenopus laevis] >OCT99840.1 hypothetical protein XELAEV_18005623mg [Xenopus laevis]
XP_018409984.1 PREDICTED: cytochrome P450 2U1 [Nanorana parkeri]
XP_005287999.1 PREDICTED: cytochrome P450 2U1 [Chrysemys picta bellii]
XP_019272345.1 PREDICTED: cytochrome P450 2U1 isoform X1 [Panthera pardus]
XP_019652744.1 PREDICTED: cytochrome P450 2U1 [Ailuropoda melanoleuca]
XP_022370877.1 cytochrome P450 2U1 [Enhydra lutris kenyoni]
XP_006272957.2 PREDICTED: cytochrome P450 2U1 isoform X1 [Alligator mississippiensis]
XP_006881166.1 PREDICTED: cytochrome P450 2U1-like [Elephantulus edwardii]
XP_004411289.1 PREDICTED: cytochrome P450 2U1 [Odobenus rosmarus divergens]
XP_013220884.1 cytochrome P450 2U1 [Ictidomys tridecemlineatus] >XP_021576219.1 cytochrome P45

XP_012142657.1 PREDICTED: cytochrome P450 18a1 isoform X4 [Megachile rotundata]
AJQ25351.1 cytochrome P450 family 17 polypeptide 2 [Anguilla japonica]
XP_013217590.1 cytochrome P450 2C18-like isoform X2 [Ictidomys tridecemlineatus]
XP_019109555.1 PREDICTED: cytochrome P450 2K1-like [Larimichthys crocea]
XP_002605316.1 hypothetical protein BRAFLDRAFT_89041 [Branchiostoma floridae] >EEN61326.1 hypothetical protein BRAFLDRAFT_89041 [Branchiostoma floridae]
XP_018121720.1 PREDICTED: uncharacterized protein LOC379464 isoform X1 [Xenopus laevis]
AJQ25354.1 cytochrome P450 family 17 polypeptide 2 [Anguilla japonica]
XP_013091701.1 PREDICTED: cytochrome P450 2U1-like [Biomphalaria glabrata]
XP_019213447.1 PREDICTED: cytochrome P450 2K1-like [Oreochromis niloticus]
XP_015801058.1 PREDICTED: cytochrome P450 2K1-like [Nothobranchius furzeri]
KFP15311.1 Steroid 17-alpha-hydroxylase/17,20 lyase, partial [Egretta garzetta]
XP_017282896.1 PREDICTED: cytochrome P450 2K1-like [Kryptolebias marmoratus]


KXJ22358.1 Steroid 17-alpha-hydroxylase/17,20 lyase [Exaiptasia pallida]
XP_015781174.1 PREDICTED: cytochrome P450 2J6-like [Tetranychus urticae]
XP_002636032.1 Hypothetical protein CBG01270 [Caenorhabditis briggsae]
XP_014616131.1 PREDICTED: methyl farnesoate epoxidase-like [Polistes canadensis]
XP_017005121.1 PREDICTED: probable cytochrome P450 304a1 [Drosophila takahashii]
XP_017464534.1 PREDICTED: probable cytochrome P450 304a1 [Rhagoletis zephyria]
ADA85878.1 flavonoid 3'-hydroxylase [Chrysanthemum x morifolium]
XP_007256560.3 steroid 21-hydroxylase isoform X1 [Astyanax mexicanus]
NP_001024903.1 Cytochrome P450 daf-9 [Caenorhabditis elegans] >AAL65132.1 DAF-9 isoform A [Caenorhabditis elegans] >CCD64553.1 Cytochrome P450 daf-9 [Caenorhabditis elegans]
XP_021189628.1 cytochrome P450 18a1-like [Helicoverpa armigera]
XP_018973376.1 PREDICTED: steroid 21-hydroxylase-like isoform X1 [Cyprinus carpio]
BAV60916.1 flavonoid 3'-hydroxylase protein [Chrysanthemum x morifolium]
XP_006427642.

XP_015613041.1 PREDICTED: flavonoid 3'-monooxygenase [Oryza sativa Japonica Group] >Q7G602.1 RecName: Full=Flavonoid 3'-monooxygenase CYP75B3; AltName: Full=Cytochrome P450 75B3; AltName: Full=Flavonoid 3'-hydroxylase; Short=OsF3'H >AAM00948.1 Putative flavonoid 3'-hydroxylase [Oryza sativa Japonica Group] >AAN04937.1 Putative chalcone flavonoid 3' - hydroxylase [Oryza sativa Japonica Group] >AAP52914.1 Flavonoid 3'-monooxygenase, putative, expressed [Oryza sativa Japonica Group] >BAF26252.1 Os10g0320100 [Oryza sativa Japonica Group] >EAY78007.1 hypothetical protein OsI_33047 [Oryza sativa Indica Group] >EAZ15637.1 hypothetical protein OsJ_31048 [Oryza sativa Japonica Group] >BAG89180.1 unnamed protein product [Oryza sativa Japonica Group] >AEK31169.1 flavonoid 3'-hydroxylase [Oryza sativa Japonica Group] >BAT10309.1 Os10g0320100 [Oryza sativa Japonica Group]
XP_020092516.1 cytochrome P450 71A1-like [Ananas comosus]
AEE60886.1 flavonoid 3'-hydroxylase, partial [Fragaria x ananassa]
XP_

XP_007845935.1 cytochrome p450 [Moniliophthora roreri MCA 2997] >ESK94724.1 cytochrome p450 [Moniliophthora roreri MCA 2997]
OHX00730.1 cytochrome p450 [Colletotrichum incanum]
BBB04707.1 cytochrome P450 [Perilla frutescens]
XP_007849707.1 cytochrome p450 [Moniliophthora roreri MCA 2997] >ESK90967.1 cytochrome p450 [Moniliophthora roreri MCA 2997]
XP_015949118.1 trans-cinnamate 4-monooxygenase [Arachis duranensis]
XP_015939110.1 cytochrome P450 82C4-like [Arachis duranensis]
XP_001824369.2 cytochrome P450 monooxygenase [Aspergillus oryzae RIB40]
XP_010056841.1 PREDICTED: cytochrome P450 76A1 [Eucalyptus grandis] >KCW73724.1 hypothetical protein EUGRSUZ_E02332 [Eucalyptus grandis]
KZT21922.1 cytochrome P450 [Neolentinus lepideus HHB14362 ss-1]
KZV47376.1 hypothetical protein F511_07790 [Dorcoceras hygrometricum]
XP_006491339.1 PREDICTED: cytochrome P450 78A9-like [Citrus sinensis]
XP_018077582.1 cytochrome P450 [Phialocephala scopiformis] >KUJ23227.1 cytochrome P450 [Phialocephala scopi

BAL05085.1 cytochrome P450 [Phanerochaete chrysosporium]
XP_008378426.1 PREDICTED: cytochrome P450 CYP82D47-like [Malus domestica]
XP_006457266.1 hypothetical protein AGABI2DRAFT_212609 [Agaricus bisporus var. bisporus H97] >EKV42014.1 hypothetical protein AGABI2DRAFT_212609 [Agaricus bisporus var. bisporus H97]
XP_019457331.1 PREDICTED: cytochrome P450 71A1-like [Lupinus angustifolius]
KZS88871.1 cytochrome P450 [Sistotremastrum niveocremeum HHB9708]
XP_022769823.1 cytochrome P450 71A1-like [Durio zibethinus]
ADM67353.1 flavone synthase II [Dahlia pinnata]
XP_020398428.1 cytochrome P450 71A1-like [Zea mays] >AQK91132.1 Putative cytochrome P450 superfamily protein [Zea mays]
PHT87675.1 Cytochrome 89A2 [Capsicum annuum]
GAT52448.1 cytochrome P450 [Mycena chlorophos]
XP_010450431.1 PREDICTED: cytochrome P450 81D1 [Camelina sativa]
PIL25684.1 cytochrome P450 [Ganoderma sinense ZZ0214-1]
XP_021815024.1 cytochrome P450 78A9-like [Prunus avium]
XP_023005007.1 isoflavone 2'-hydroxylase-like [

The cell below will add items to our `header_terms` variable. Run the cell and you'll see an input box; type the terms you want to add, separated by spaces, and press Enter.

In [12]:
add = widgets.Text()
display(add)

def handle_submit(sender):
    for item in add.value.split():
        header_terms.append(item)
    print (header_terms)
add.on_submit(handle_submit)

['2J2', '2J2-like']


We can also remove terms from `header_terms` in the same way.

In [9]:
remove = widgets.Text()
display(remove)

def handle_submit(sender):
    for item in remove.value.split():
        # Skip terms that are not in the list; list.remove would raise ValueError
        if item in header_terms:
            header_terms.remove(item)
    print (header_terms)

remove.on_submit(handle_submit)

The cell below again lets us check all the terms in `header_terms` so far.

In [14]:
print (header_terms)
print (len(header_terms))

['2J2', '2J2-like']
2


Have a play around with adding and removing terms in the `header_terms` list. The following cells illustrate how the list can be used.

Make the `header_terms` list contain just the terms "2B4" and "2B4-like", and then we'll create a new record called `only_2B4_records`. (The output shown below was produced while `header_terms` still held "2J2" and "2J2-like".)

In [15]:
only_2B4_records = subset_records(*header_terms, records=full_record, mode='include')
for record in only_2B4_records:
    print (only_2B4_records[record].description)

XP_017543757.1 PREDICTED: cytochrome P450 2J2-like [Pygocentrus nattereri]
XP_022059510.1 cytochrome P450 2J2-like [Acanthochromis polyacanthus]
XP_022059511.1 cytochrome P450 2J2-like [Acanthochromis polyacanthus]
XP_020505680.1 cytochrome P450 2J2-like [Labrus bergylta]
XP_020505675.1 cytochrome P450 2J2-like [Labrus bergylta]
XP_016404252.1 PREDICTED: cytochrome P450 2J2-like isoform X1 [Sinocyclocheilus rhinocerous]
XP_016094677.1 PREDICTED: cytochrome P450 2J2-like isoform X1 [Sinocyclocheilus grahami]
XP_020505678.1 cytochrome P450 2J2-like [Labrus bergylta]
XP_010728521.2 PREDICTED: cytochrome P450 2J6-like [Larimichthys crocea] >KKF12110.1 Cytochrome P450 2J2 [Larimichthys crocea]
XP_020462694.1 cytochrome P450 2J2-like [Monopterus albus]
XP_003770699.2 PREDICTED: cytochrome P450 2J2-like [Sarcophilus harrisii]
XP_020505723.1 cytochrome P450 2J2-like [Labrus bergylta]
XP_007259570.2 cytochrome P450 2J2-like [Astyanax mexicanus]
XP_003767346.1 PREDICTED: cytochrome P450 2J2-like

### Saving and loading the header terms variable

In [16]:
utilities.saveHeaderTerms(header_terms, "files/headerterms.txt")

2J2 2J2-like
<class 'str'>


In [8]:
header_terms = utilities.loadHeaderTerms("files/headerterms.txt")
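As a rough picture of what saving and loading involves, the sketch below round-trips a list of terms through a text file. It assumes the terms are stored space-separated on one line, which matches the printed output above but is otherwise a guess; `save_terms` and `load_terms` are hypothetical names, not the functions in `utilities`.

```python
import os
import tempfile


def save_terms(terms, path):
    """Write the terms to a text file, separated by single spaces."""
    with open(path, "w") as handle:
        handle.write(" ".join(terms))


def load_terms(path):
    """Read whitespace-separated terms back into a list."""
    with open(path) as handle:
        return handle.read().split()


# Round trip through a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "headerterms.txt")
    save_terms(["2J2", "2J2-like"], path)
    reloaded = load_terms(path)

print(reloaded)
```

A space-separated format only works because the terms themselves contain no spaces; multi-word terms would need a different separator.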

### Subsetting record files using regular expressions
Typing out every term we want to include or exclude can be time-consuming, and often we want to include or exclude all of the members of a family. `subset_records_with_regex` uses regular expressions so that we only need to supply the first part of the family name; it then matches every header that contains text starting with that first part.

For example, excluding "2J" would exclude "2J6", "2J2", and "2J2-like" (as well as others).
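One plausible way to express this prefix matching with Python's `re` module is sketched below. The helper `matches_family` and its word-boundary pattern are assumptions for illustration; the actual pattern used by `subset_records_with_regex` may differ.

```python
import re


def matches_family(description, prefixes):
    """Return True if any word in the header starts with one of the prefixes."""
    if not prefixes:
        return False
    # \b anchors the prefix to the start of a word; re.escape keeps the
    # prefixes literal so characters such as '.' are not treated as regex syntax
    pattern = re.compile(r"\b(?:%s)" % "|".join(re.escape(p) for p in prefixes))
    return pattern.search(description) is not None


headers = [
    "XP_015781174.1 PREDICTED: cytochrome P450 2J6-like [Tetranychus urticae]",
    "XP_020462694.1 cytochrome P450 2J2-like [Monopterus albus]",
    "XP_019109555.1 PREDICTED: cytochrome P450 2K1-like [Larimichthys crocea]",
]

kept = [h for h in headers if not matches_family(h, ["2J"])]
for header in kept:
    print(header)
```

Anchoring at a word boundary keeps a short prefix such as "2J" from matching in the middle of an unrelated token.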

In [9]:
# Family prefixes to exclude (duplicate entries removed)
test_records = subset_records_with_regex(
    "2U", "2D", "2J", "2B", "2A", "2C", "2E", "2F", "2G", "2H", "2I", "2K", "2L",
    "2M", "2N", "2O", "2P", "2Q", "2R", "2S", "2T", "2V", "2W", "2X", "2Y", "2Z",
    "1A", "1B", "76C", "84A", "98A", "304a1", "305a1", "306a1", "CYP17A", "303a1",
    "307a1", "83B", "81E", "81d1", "81D1", "18",
    records=full_record, mode="exclude")
# test_records = subset_records_with_regex("2U", records=full_record, mode="include")
# test_records = subset_records("uncharacterized", "25-hydroxylase", "partial", "hypothetical", "unnamed", "Cyp2r1", records=test_records, mode='exclude')

print (len(test_records)) 

10325


In [12]:
species_counts = build_species_count(records=test_records)


In [10]:
for item in test_records:
    print (test_records[item].description)

ARO89866.1 cytochrome P450 Cyp2u1 [Andrias davidianus]
NP_001106471.1 cytochrome P450 family 2 subfamily U member 1 [Xenopus tropicalis] >AAI54093.1 LOC100127656 protein [Xenopus tropicalis]
EFB16740.1 hypothetical protein PANDA_005136, partial [Ailuropoda melanoleuca]
BAG64362.1 unnamed protein product [Homo sapiens]
BAF82691.1 unnamed protein product [Homo sapiens]
BAG65487.1 unnamed protein product [Homo sapiens]
OCA48700.1 hypothetical protein XENTR_v900010652mg, partial [Xenopus tropicalis]
EHH53889.1 hypothetical protein EGM_14598, partial [Macaca fascicularis]
EHH26107.1 hypothetical protein EGK_15996, partial [Macaca mulatta]
KTF87224.1 hypothetical protein cypCar_00019558 [Cyprinus carpio]
AGN04284.1 cytochrome P450 [Oryzias melastigma]
BAB31223.1 unnamed protein product [Mus musculus] >EDL12201.1 mCG10210, isoform CRA_a [Mus musculus]
ETE67196.1 Cytochrome protein, partial [Ophiophagus hannah]
CAG11477.1 unnamed protein product, partial [Tetraodon nigroviridis]
OXB76790.1 hyp

XP_020917709.1 steroid 17-alpha-hydroxylase/17,20 lyase-like isoform X1 [Exaiptasia pallida]
AAH99352.1 LOC100036775 protein, partial [Xenopus laevis]
EGT38304.1 hypothetical protein CAEBREN_16222 [Caenorhabditis brenneri]
NP_001004777.1 MGC69416 protein [Xenopus tropicalis] >AAH74508.1 MGC69416 protein [Xenopus tropicalis] >ATO94628.1 hypothetical protein, partial [synthetic construct]
PFX15416.1 Steroid 17-alpha-hydroxylase/17,20 lyase [Stylophora pistillata]
ADV17351.1 methyl farnesoate epoxidase [Schistocerca gregaria]
KTF78369.1 hypothetical protein cypCar_00043767 [Cyprinus carpio]
XP_015316161.1 PREDICTED: cytochrome P450, family 2, subfamily C, polypeptide 87 isoform X1 [Bos taurus]
AKH03513.1 cytochrome P450 3075B2 [Paracyclopina nana]
OXB82427.1 hypothetical protein H355_000685 [Colinus virginianus]
XP_781556.2 PREDICTED: steroid 17-alpha-hydroxylase/17,20 lyase [Strongylocentrotus purpuratus] >XP_011673213.1 PREDICTED: steroid 17-alpha-hydroxylase/17,20 lyase [Strongylocentr

ADW66160.1 flavonoid 3' 5' hydroxylase [Pisum sativum] >ADW66161.1 flavonoid 3' 5' hydroxylase [Pisum sativum]
BAJ93256.1 predicted protein, partial [Hordeum vulgare subsp. vulgare]
XP_013459330.1 flavonoid hydroxylase [Medicago truncatula] >KEH33361.1 flavonoid hydroxylase [Medicago truncatula]
XP_009386727.1 PREDICTED: flavonoid 3',5'-hydroxylase 2-like [Musa acuminata subsp. malaccensis]
SJL01482.1 related to cytochrome P450 CYP2 subfamily [Armillaria ostoyae]
O81973.1 RecName: Full=Cytochrome P450 93A3; AltName: Full=Cytochrome P450 CP5 >CAA71516.1 putative cytochrome P450 [Glycine max]
XP_007864037.1 cytochrome P450 [Gloeophyllum trabeum ATCC 11539] >EPQ56835.1 cytochrome P450 [Gloeophyllum trabeum ATCC 11539]
KFZ56210.1 Vitamin D 25-hydroxylase, partial [Podiceps cristatus]
OAK96951.1 putative cytochrome P450 [Stagonospora sp. SRC1lsM3a]
XP_008364814.1 PREDICTED: cytochrome P450 CYP736A12-like [Malus domestica]
KHF97502.1 Flavonoid 3',5'-hydroxylase 2 [Gossypium arboreum] >KHG074

PHU14756.1 hypothetical protein BC332_15961 [Capsicum chinense]
GAT42577.1 cytochrome P450 [Mycena chlorophos]
EYU22203.1 hypothetical protein MIMGU_mgv1a021448mg [Erythranthe guttata]
XP_004514899.1 PREDICTED: geraniol 8-hydroxylase-like [Cicer arietinum]
BAJ98052.1 predicted protein [Hordeum vulgare subsp. vulgare] >BAK02967.1 predicted protein [Hordeum vulgare subsp. vulgare]
XP_017973878.1 PREDICTED: geraniol 8-hydroxylase-like [Theobroma cacao]
KDO36667.1 hypothetical protein CISIN_1g011040mg [Citrus sinensis]
KTB30982.1 hypothetical protein WG66_16439 [Moniliophthora roreri]
CBI19823.3 unnamed protein product, partial [Vitis vinifera]
XP_016507825.1 PREDICTED: ferruginol synthase-like [Nicotiana tabacum]
XP_007209906.1 flavonoid 3'-monooxygenase [Prunus persica] >ONI06742.1 hypothetical protein PRUPE_5G078300 [Prunus persica]
BAL05079.1 cytochrome P450 [Phanerochaete chrysosporium]
XP_006423653.1 hypothetical protein CICLE_v10028270mg [Citrus clementina] >ESR36893.1 hypothetical 

XP_006411901.1 hypothetical protein EUTSA_v10024999mg [Eutrema salsugineum] >ESQ53354.1 hypothetical protein EUTSA_v10024999mg [Eutrema salsugineum]
PHT27530.1 hypothetical protein CQW23_32866 [Capsicum baccatum]
XP_007342800.1 cytochrome P450, partial [Auricularia subglabra TFB-10046 SS5] >EJD48900.1 cytochrome P450, partial [Auricularia subglabra TFB-10046 SS5]
XP_007766087.1 cytochrome P450 [Coniophora puteana RWD-64-598 SS2] >EIW84375.1 cytochrome P450 [Coniophora puteana RWD-64-598 SS2]
ABD97102.1 cytochrome P450 monooxygenase CYP83G2 [Medicago truncatula]
XP_017221758.1 PREDICTED: cytochrome P450 76A1-like [Daucus carota subsp. sativus]
PIA50742.1 hypothetical protein AQUCO_01200167v1 [Aquilegia coerulea]
XP_007325099.1 hypothetical protein AGABI1DRAFT_65950 [Agaricus bisporus var. burnettii JB137-S8] >EKM83397.1 hypothetical protein AGABI1DRAFT_65950 [Agaricus bisporus var. burnettii JB137-S8]
OAY50847.1 hypothetical protein MANES_05G166900 [Manihot esculenta]
AJD25173.1 cytochr

# Evaluating how many hits per species

`build_species_count` builds a dictionary which has the set of unique species as its keys and, as its values, a list of the sequence IDs that belong to each species. We can use it to easily see how many unique species we have and which species are overrepresented.

In [18]:
species_counts = build_species_count(records=only_2U1_records)
print("There are %s unique species in our dataset." % (len(species_counts)))

There are 212 unique species in our dataset.
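The shape of this species dictionary can be illustrated with a small sketch. It assumes the species name is the first bracketed field in the header description, as in the headers printed earlier; `group_ids_by_species` is a hypothetical helper, not the notebook's `build_species_count`.

```python
import re
from collections import defaultdict

# Assumption: the species is the first "[...]" field in the description
SPECIES_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def group_ids_by_species(descriptions):
    """Map each species name to the list of record IDs found for it."""
    species_to_ids = defaultdict(list)
    for record_id, description in descriptions.items():
        match = SPECIES_PATTERN.search(description)
        if match:
            species_to_ids[match.group(1)].append(record_id)
    return dict(species_to_ids)


descriptions = {
    "ARO89866.1": "ARO89866.1 cytochrome P450 Cyp2u1 [Andrias davidianus]",
    "XP_018106696.1": "XP_018106696.1 PREDICTED: cytochrome P450 2U1-like [Xenopus laevis]",
    "XP_018121720.1": "XP_018121720.1 PREDICTED: uncharacterized protein LOC379464 isoform X1 [Xenopus laevis]",
}

toy_species_counts = group_ids_by_species(descriptions)
print(len(toy_species_counts))
print(toy_species_counts["Xenopus laevis"])
```

The length of the dictionary gives the number of unique species, and the length of each list shows how heavily a species is represented.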


### Plotting the frequency of proteins per species
`plot_record_number` is a function that plots the numbers of IDs per species. We can set a minimum number of IDs that a species must have in order to be plotted.

In [None]:
plotthis = plot_record_number(species_counts, "Bar", min=0)
py.iplot(plotthis, filename='inline_bar')

In [None]:
plotthis = plot_record_number(species_counts, "Bar", min=2)
py.iplot(plotthis, filename='inline_bar')

In [21]:
plotthis = plot_record_number(species_counts, "Bar", min=5)
py.iplot(plotthis, filename='inline_bar')

#### We can also just extract the names using `get_species_names`, which also accepts a minimum number of IDs required and can print out the number of IDs for each species

In [16]:
species_names = get_species_names(species_counts, min=5)
for name in species_names:
    print (name)

In [None]:
species_names_with_counts = get_species_names(species_counts, min=5, counts=True)
for name in species_names_with_counts:
    print (name)
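A minimal sketch of this kind of name extraction is shown below. The function `species_with_min_ids`, the inclusive threshold, and the `(name, count)` tuple format are assumptions; the real `get_species_names` may format its results differently.

```python
def species_with_min_ids(species_to_ids, min_ids=0, counts=False):
    """List species that have at least min_ids IDs, optionally with their counts."""
    names = []
    for name, ids in species_to_ids.items():
        # Assumption: the minimum is inclusive (min_ids or more)
        if len(ids) >= min_ids:
            names.append((name, len(ids)) if counts else name)
    return names


# Hypothetical toy data in the same shape as species_counts
toy_counts = {
    "Species one": ["ID1", "ID2", "ID3"],
    "Species two": ["ID4"],
    "Species three": ["ID5", "ID6"],
}

names_only = species_with_min_ids(toy_counts, min_ids=2)
names_and_counts = species_with_min_ids(toy_counts, min_ids=2, counts=True)
print(names_only)
print(names_and_counts)
```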

### Counting the total number of sequences with multiple hits
`count_ids` is a function that counts the total number of sequences in a species count dictionary, not just the number of unique species.

As before, it can also take a minimum number of IDs that a species must have to be counted.

In [None]:
min_num = 5
print ("There are %s total sequences in our filtered dataset." % (count_ids(species_counts)))
print ("There are %s total sequences in our filtered dataset that have %d or more IDs per species." % (count_ids(species_counts, min=min_num), min_num))
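The difference between counting species and counting sequences can be sketched as follows. `total_ids` is a hypothetical stand-in for `count_ids`, and the inclusive threshold follows the "or more" wording in the print statement above.

```python
def total_ids(species_to_ids, min_ids=0):
    """Sum the number of IDs over all species with at least min_ids IDs."""
    return sum(len(ids) for ids in species_to_ids.values() if len(ids) >= min_ids)


# Hypothetical toy data in the same shape as species_counts
toy_counts = {
    "Species one": ["ID1", "ID2", "ID3"],
    "Species two": ["ID4"],
    "Species three": ["ID5", "ID6"],
}

print(len(toy_counts))               # number of unique species
print(total_ids(toy_counts))         # number of sequences overall
print(total_ids(toy_counts, min_ids=2))
```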

# Generating datasets containing information about species with multiple hits
For each species that has more than the given number of hits, we create:
1. A FASTA file of the protein sequences from that species
2. An alignment of the protein sequences
3. An information file telling us where in the genome each protein maps to
4. A visual diagram mapping the proteins to the genome

In [22]:
def generate_multiple_hit_data(species_names, species_counts, full_record, file_path):
    # For each species, write its sequences to a FASTA file and align them with MAFFT
    for name in species_names:
        seqs = map_species_to_records(species_counts[name], full_record)
        write_fasta(seqs, file_path + name + " sequences")
        alignmentFile = alignment.alignWithMAFFT(file_path + name + " sequences")
        alignment.writeAlignment(alignmentFile, file_path + name + ".aln", "fasta")

    # Save the genomic locations to disk, then draw them as a linear diagram
    check_genomic_location(species_counts, min=5, file_path=file_path +" gene locations ")
    check_genomic_location(species_counts, min=5, visualise="linear")


species_names = get_species_names(species_counts, min=5)
generate_multiple_hit_data(species_names, species_counts, full_record, "files/multiple_hits/")

{'XP_786956.3': '581885', 'XP_786300.2': '581193', 'XP_003727710.1': '100888078', 'XP_785753.3': '580612', 'XP_003728047.1': '576998', 'XP_782351.2': '576998', 'XP_003724748.2': '100893396', 'XP_782513.1': '577176', 'XP_782458.3': '577116', 'XP_011664220.1': '575452', 'XP_787466.1': '582422', 'XP_011677992.1': '582422', 'XP_796217.3': '591567', 'XP_001197257.2': '757026', 'XP_003724341.1': '100888399'}
{'XP_022091223.1': '110979585', 'XP_022084337.1': '110975828', 'XP_022090540.1': '110979233', 'XP_022103601.1': '110986213', 'XP_022099104.1': '110983833', 'XP_022095211.1': '110981708', 'XP_022087533.1': '110977588', 'XP_022094290.1': '110981220', 'XP_022103600.1': '110986212', 'XP_022081995.1': '110974561', 'XP_022083647.1': '110975423', 'XP_022099105.1': '110983833', 'XP_022081701.1': '110974401', 'XP_022093163.1': '110980619', 'XP_022098029.1': '110983243', 'XP_022082638.1': '110974947'}
{'XP_002740620.1': '100371843', 'XP_006822041.1': '100376031', 'XP_006811947.1': '100374257', 'XP

{'XP_013091703.1': '106075284', 'XP_013091704.1': '106075284', 'XP_013091702.1': '106075284', 'XP_013091701.1': '106075282', 'XP_013091316.1': '106074959', 'XP_013065833.1': '106054494', 'XP_013065834.1': '106054495', 'XP_013065835.1': '106054496', 'XP_013075120.1': '106061501'}
{}


Or we could just use parts of this function. The cell below will just print out the locations of the proteins in the genome. We could save this to disk by passing a path to the `file_path` parameter, or visualise it by passing either 'linear' or 'circular' to the `visualise` parameter.

In [None]:
check_genomic_location(species_counts, min=5)

# Saving the records to FASTA files
Because `species_counts` contains just the species names and the IDs belonging to each species, we need to map these IDs back to their full records. The function `map_ids_to_records` does this, and lets us select either all of the records or just the unique species (`unique=True`).

In [13]:
filtered_records = map_ids_to_records(species_counts, full_record)
filtered_records_unique = map_ids_to_records(species_counts, full_record, unique=True)

# Check that the numbers are correct
print (len(filtered_records))
print (len(filtered_records_unique))

10325
1312
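The mapping step can be pictured with the sketch below. It assumes that `unique=True` keeps a single record per species (here, simply the first ID in each list); that choice, and the name `ids_to_records`, are illustrative guesses rather than the behaviour of `map_ids_to_records`.

```python
def ids_to_records(species_to_ids, records, unique=False):
    """Look up the full record for each ID, optionally one record per species."""
    selected = {}
    for ids in species_to_ids.values():
        # Assumption: unique=True keeps only the first ID listed for a species
        chosen = ids[:1] if unique else ids
        for record_id in chosen:
            selected[record_id] = records[record_id]
    return selected


# Hypothetical toy data
toy_counts = {
    "Species one": ["ID1", "ID2"],
    "Species two": ["ID3"],
}
toy_full = {
    "ID1": "record one",
    "ID2": "record two",
    "ID3": "record three",
    "ID4": "record not referenced by any species",
}

all_mapped = ids_to_records(toy_counts, toy_full)
unique_mapped = ids_to_records(toy_counts, toy_full, unique=True)
print(len(all_mapped))
print(len(unique_mapped))
```

In this sketch the unique count equals the number of species, while the full count equals the total number of IDs.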


Now we can save these records to a new FASTA file using `write_fasta`.

In [14]:
write_fasta(filtered_records, "files/2U1_BLAST_smaller_records.fasta")
# write_fasta(filtered_records_unique, "files/2U1_BLAST_filtered_records_unique.fasta")

We can also use the function `map_species_to_records` to just map a particular species to a FASTA file.

In [None]:
priapulus_caudatus = map_species_to_records(species_counts['Priapulus caudatus'], full_record)
write_fasta(priapulus_caudatus, "files/priapulus_caudatus.fasta")