# Bloodmeal Calling

In this notebook, we analyze the contigs from each bloodfed mosquito sample whose lowest common ancestor (LCA) assignment falls within *Vertebrata*. The potential bloodmeal call is the lowest taxonomic group consistent with the LCAs of all such contigs in a sample.

In [1]:
import pandas as pd
import numpy as np
from ete3 import NCBITaxa
import boto3
import tempfile
import subprocess
import os
import io
import re
import time
import json
ncbi = NCBITaxa()

In [2]:
df = pd.read_csv('../../figures/fig3/all_contigs_df.tsv', sep='\t',
                 dtype={'taxid': int})  # np.int was removed in NumPy 1.24; the builtin int is equivalent
df = df[df['group'] == 'Metazoa']

In [3]:
def taxid2name(taxid):
    return ncbi.get_taxid_translator([taxid])[taxid]

There is a partial order on taxa: $a < b$ if $a$ is an ancestor of $b$. A taxon $t$ is admissible as a bloodmeal call for a given sample if it is comparable with every *Vertebrata* LCA taxon $b$ in that sample: $t \le b$ or $b \le t$ for all $b$. Equivalently, $t$ is admissible if $t \in \mathrm{lineage}(b)$ or $b \in \mathrm{lineage}(t)$ for all $b$, where the lineage of a taxon includes the taxon itself.

We will report the lowest admissible taxon (the one farthest from the root) for each sample.
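The definition can be checked by brute force on a small tree. The sketch below is only an illustration: `TOY_LINEAGES`, `is_admissible`, and `lowest_admissible` are hypothetical names, and the lineages are simplified stand-ins for what `ncbi.get_lineage` would return, not real NCBI lineages.

```python
# Simplified, hypothetical lineages (root -> taxon). Each lineage ends with
# the taxon itself, mirroring the convention of ncbi.get_lineage.
TOY_LINEAGES = {
    "Vertebrata": ["Vertebrata"],
    "Mammalia": ["Vertebrata", "Mammalia"],
    "Carnivora": ["Vertebrata", "Mammalia", "Carnivora"],
    "Caniformia": ["Vertebrata", "Mammalia", "Carnivora", "Caniformia"],
    "Pecora": ["Vertebrata", "Mammalia", "Pecora"],
}


def is_admissible(t, observed):
    """True if t is an ancestor of, a descendant of, or equal to every observed taxon."""
    return all(t in TOY_LINEAGES[b] or b in TOY_LINEAGES[t] for b in observed)


def lowest_admissible(observed):
    """Deepest admissible taxon among all taxa in the toy tree."""
    candidates = [t for t in TOY_LINEAGES if is_admissible(t, observed)]
    return max(candidates, key=lambda t: len(TOY_LINEAGES[t]))


# Observed taxa on a single root-to-leaf chain: the deepest one is the call.
chain = lowest_admissible(["Mammalia", "Carnivora", "Caniformia"])
# Observed taxa on two branches: the call backs off to their branch point.
branch = lowest_admissible(["Caniformia", "Pecora"])
print(chain, branch)
```

When all observed taxa lie on one chain, the call is the most specific of them; as soon as two observed taxa disagree, the call is their common ancestor.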

In [4]:
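def get_lowest_admissable_taxon(taxa):
    """Return the lowest taxid consistent with every taxid in `taxa` (0 if `taxa` is empty)."""
    lineages = [ncbi.get_lineage(taxid) for taxid in taxa]

    if len(lineages) == 0:
        return 0

    # A "leaf" is a taxon in the observed lineages that is not a proper
    # ancestor of any observed taxon.
    all_taxa = np.unique([taxid for lineage in lineages for taxid in lineage])
    non_leaf_taxa = np.unique([taxid for lineage in lineages for taxid in lineage[:-1]])
    leaf_taxa = [taxid for taxid in all_taxa if taxid not in non_leaf_taxa]

    # The lowest admissible taxon is the LCA of the leaves: the deepest
    # taxon shared by all leaf lineages.
    leaf_lineages = [ncbi.get_lineage(taxid) for taxid in leaf_taxa]
    leaf_common_ancestors = set.intersection(*[set(l) for l in leaf_lineages])
    lca = [taxid for taxid in leaf_lineages[0] if taxid in leaf_common_ancestors][-1]

    return lca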

In [5]:
def filter_taxon(taxid,
                 exclude=(),           # drop exactly these taxa
                 exclude_children=(),  # drop these taxa and all of their descendants
                 parent=None           # only keep descendants of the parent
                 ):
    if taxid in exclude:
        return False

    # The lineage runs from the root down to taxid, including taxid itself
    lineage = ncbi.get_lineage(taxid)

    if set(lineage) & set(exclude_children):
        return False

    if parent and parent not in lineage:
        return False

    return True

In [6]:
vertebrate_taxid = 7742
primate_taxid = 9443

In [7]:
euarchontoglires_taxid = 314146

In [8]:
df['filter_taxon'] = df['taxid'].apply(lambda x: filter_taxon(x, 
                                           exclude = [euarchontoglires_taxid],
                                           exclude_children = [primate_taxid],
                                           parent = vertebrate_taxid))

How many non-primate vertebrate contigs are there per sample? Between 1 and 11, among samples with at least one such contig.

In [9]:
%pprint
sorted(df[df['filter_taxon']].groupby('sample').count()['taxid'])

Pretty printing has been turned OFF


[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 8, 8, 8, 8, 9, 9, 10, 10, 11]

In [10]:
lowest_admissable_taxa = []
for sample in df['sample'].unique():
    taxid = get_lowest_admissable_taxon(df[(df['sample'] == sample) & df['filter_taxon']]['taxid'])
    name = taxid2name(taxid) if taxid else "NA"
    lowest_admissable_taxa.append({'sample': sample, 'name': name, 'taxid': taxid})
lowest_admissable_taxa = pd.DataFrame(lowest_admissable_taxa).sort_values('sample')
lowest_admissable_taxa = lowest_admissable_taxa[['sample', 'taxid', 'name']]

In [11]:
lowest_admissable_taxa.head()

              sample    taxid           name
39  CMS001_001_Ra_S1    35500         Pecora
0   CMS001_003_Ra_S2    35500         Pecora
21  CMS001_004_Ra_S2   379584     Caniformia
6   CMS001_005_Ra_S3  1437010  Boreoeutheria
3   CMS001_008_Ra_S3    35500         Pecora


In [12]:
partition = "Pecora Carnivora Homininae Rodentia Leporidae Aves".split()
partition = ncbi.get_name_translator(partition)
partition = {v[0]: k for k, v in partition.items()}

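def get_category(taxid):
    # No call was made for this sample
    if not taxid:
        return None
    lineage = ncbi.get_lineage(taxid)
    # Return the first category whose taxid appears in the call's lineage
    for k in partition:
        if k in lineage:
            return partition[k]
    # The call does not fall within any category
    return 'NA'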
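As a rough illustration of how this assignment behaves, the sketch below repeats the same lookup on a toy tree. `TOY_LINEAGES` and `toy_category` are hypothetical, and the lineages are heavily simplified rather than real NCBI lineages. A call within a category maps to that category, while a call that sits above every category (such as Boreoeutheria) maps to `'NA'`.

```python
# Simplified, hypothetical lineages (root -> taxon); the notebook obtains
# real lineages from ncbi.get_lineage.
TOY_LINEAGES = {
    "Caniformia": ["Vertebrata", "Mammalia", "Boreoeutheria", "Carnivora", "Caniformia"],
    "Pecora": ["Vertebrata", "Mammalia", "Boreoeutheria", "Pecora"],
    "Boreoeutheria": ["Vertebrata", "Mammalia", "Boreoeutheria"],
}
CATEGORIES = ["Pecora", "Carnivora", "Homininae", "Rodentia", "Leporidae", "Aves"]


def toy_category(call):
    """Return the category containing `call`, 'NA' if none does, None if there is no call."""
    if call is None:
        return None
    lineage = TOY_LINEAGES[call]
    for category in CATEGORIES:
        if category in lineage:
            return category
    return "NA"


results = {call: toy_category(call) for call in TOY_LINEAGES}
print(results)
```

Because the categories are disjoint clades, at most one of them can appear in any lineage, so the order in which they are tested does not matter.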

The ranks of the categories are:

In [13]:
ncbi.get_rank(partition.keys())

{207598: 'subfamily', 33554: 'order', 9989: 'order', 9979: 'family', 35500: 'infraorder', 8782: 'class'}

In [14]:
bloodmeal_calls = lowest_admissable_taxa.copy()  # copy so the new column is not added to the original table

bloodmeal_calls['category'] = bloodmeal_calls['taxid'].apply(get_category)

bloodmeal_calls = bloodmeal_calls[bloodmeal_calls['category'] != 'NA']
bloodmeal_calls = bloodmeal_calls[bloodmeal_calls['name'] != 'NA']

bloodmeal_calls = bloodmeal_calls[['sample', 'category', 'name']]
bloodmeal_calls = bloodmeal_calls.sort_values('sample')
bloodmeal_calls = bloodmeal_calls.rename(columns={'sample': 'Sample',
                                                  'category': 'Bloodmeal Category',
                                                  'name': 'Bloodmeal Call'})

In [15]:
metadata = pd.read_csv('../../data/metadata/CMS001_CMS002_MergedAnnotations.csv')
metadata = metadata[['NewIDseqName', 'Habitat', 'collection_lat', 'collection_long', 'ska_genus', 'ska_species']].rename(
    columns={'NewIDseqName': 'Sample',
             'ska_genus': 'Genus',
             'ska_species': 'Species',
             'collection_lat': 'Lat',
             'collection_long': 'Long'})

In [16]:
metadata['Sample']

0             CMS001_001_Ra_S1
1             CMS001_002_Ra_S1
2             CMS001_003_Ra_S2
3             CMS001_004_Ra_S2
4             CMS001_005_Ra_S3
                ...           
143     CMS002_051a_Rb_S6_L004
144     CMS002_053a_Rb_S7_L004
145     CMS002_054a_Rb_S8_L004
146     CMS002_056a_Rb_S9_L004
147    CMS002_057a_Rb_S10_L004
Name: Sample, Length: 148, dtype: object

In [19]:
bloodmeal_calls = bloodmeal_calls.merge(metadata, on='Sample', how='left')
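A left merge keeps every bloodmeal call even if its sample is missing from the metadata, in which case the metadata columns are filled with NaN. One way to make such mismatches visible is sketched below on made-up data; the frames `calls` and `meta` and their values are hypothetical. The `indicator` and `validate` arguments of `DataFrame.merge` flag unmatched rows and guard against duplicated metadata keys.

```python
import pandas as pd

# Hypothetical stand-ins for bloodmeal_calls and metadata
calls = pd.DataFrame({"Sample": ["s1", "s2"], "Bloodmeal Call": ["Pecora", "Aves"]})
meta = pd.DataFrame({"Sample": ["s1", "s3"], "Habitat": ["urban", "rural"]})

# validate raises MergeError if a Sample appears more than once in meta;
# indicator adds a _merge column recording where each row came from.
merged = calls.merge(meta, on="Sample", how="left",
                     indicator=True, validate="many_to_one")

# Samples with a call but no metadata row
unmatched = merged.loc[merged["_merge"] == "left_only", "Sample"].tolist()
merged = merged.drop(columns="_merge")
print(unmatched)
```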

In [18]:
bloodmeal_calls.to_csv(
    '../../figures/fig4/bloodmeal_calls.csv', index=False)