# Purpose: This notebook reads phage fasta files from the [MVP database](https://doi.org/10.1093/nar/gkx1124) and sorts them by host species for downstream analysis

In [1]:
from Bio import SeqIO
import pandas as pd
import os

# The main file containing interacting viral clusters

**Just exploring some of the MVP files (E. coli taxid is 562). Note that these are cluster representatives, so the set has already been made at least somewhat less redundant.**

In [8]:
interactions_df = pd.read_csv('../../Data/MVP_data/MVP_DB/mvp_interactions.txt.gz',\
                              compression='gzip',\
                              sep='\t')
print(interactions_df.shape)
interactions_df = interactions_df[interactions_df['host_rank']=='species']
print(interactions_df.shape)
interactions_df = interactions_df[interactions_df['host_superkingdom']=='Bacteria']
print(interactions_df.shape)


###Visualize E. coli-infecting phages in the database
interactions_df[interactions_df['host_taxon_id']==562]

(27854, 5)
(26944, 5)
(25572, 5)


Unnamed: 0,interaction_uid,host_taxon_id,host_rank,host_superkingdom,viral_cluster_id
1,2,562,species,Bacteria,177
58,59,562,species,Bacteria,229
154,155,562,species,Bacteria,335
156,157,562,species,Bacteria,337
355,356,562,species,Bacteria,382
...,...,...,...,...,...
25960,25961,562,species,Bacteria,534
25961,25962,562,species,Bacteria,191
25964,25965,562,species,Bacteria,206
26146,26147,562,species,Bacteria,544


# Extracting names and lineages for the top n species with the highest number of phage representatives

Here "n" is fairly arbitrary; I am just looking manually for species with a reasonably large amount of available data.
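
One way to make the cutoff less arbitrary would be to keep every host with at least some minimum number of interaction rows, rather than fixing `n`. The sketch below runs on toy data; `hosts_with_min_count` and `min_count` are hypothetical names, and the threshold of 2 is only for illustration.

```python
import pandas as pd

def hosts_with_min_count(df, min_count):
    """Count interaction rows per host and keep hosts at or above min_count."""
    counts = df['host_taxon_id'].value_counts()  # sorted from most to least frequent
    return counts[counts >= min_count]

# Toy stand-in for interactions_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'host_taxon_id':    [562, 562, 562, 287, 287, 1280],
    'viral_cluster_id': [177, 229, 335, 401, 402, 900],
})
kept = hosts_with_min_count(toy_df, min_count=2)
print(kept)
```

Like the `value_counts()[:n]` call in the next cell, this counts rows rather than unique clusters, so a cluster listed twice for the same host would be counted twice.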

In [9]:
n = 35
topN = interactions_df[(interactions_df['host_rank']=='species')&\
                      (interactions_df['host_superkingdom']=='Bacteria')]['host_taxon_id'].value_counts()[:n]
topN

562       424
287       185
1280      150
573       107
470        99
657318     92
657321     78
83334      73
357276     63
1428       62
1639       60
657319     55
1314       44
665950     42
90371      41
1423       41
28450      40
435591     40
72407      40
1590       40
411479     39
1396       35
411477     33
748224     32
411476     31
411901     31
718255     31
571        31
46170      30
492670     30
592028     30
657309     28
36809      28
717959     27
305        27
Name: host_taxon_id, dtype: int64

**Grabbing full taxonomies from this data**

I used these taxonomies to **manually** select species to move forward with. 
1. Starting at the top, I took every species for which I could find a complete genome (not broken into contigs). 

2. I then skipped any taxon whose genus-level lineage was identical to one already selected.

I used these taxon IDs to **manually** download `.fasta` and `.gff3` files for the hosts, since the numbers were fairly small.
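
The second selection step could be sketched in code as follows. This is only an approximation of the manual procedure: it treats two taxa as redundant when their lineage strings match exactly, and it does not cover the check for a complete genome. `select_one_per_lineage` is a hypothetical helper, and the entry labeled `'hypothetical_taxon'` is made up to show a skipped case.

```python
def select_one_per_lineage(ranked_lineages):
    """Keep the first taxon seen for each lineage string, preserving rank order."""
    seen = set()
    selected = []
    for taxon_id, lineage in ranked_lineages:
        if lineage in seen:
            continue  # a higher-ranked taxon already covers this lineage
        seen.add(lineage)
        selected.append(taxon_id)
    return selected

escherichia = ('cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; '
               'Enterobacterales; Enterobacteriaceae; Escherichia')
klebsiella = ('cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; '
              'Enterobacterales; Enterobacteriaceae; Klebsiella')

# Ordered from most to fewest phage representatives
ranked = [
    (562, escherichia),
    ('hypothetical_taxon', escherichia),  # made-up entry sharing a lineage with 562
    (573, klebsiella),
]
selected = select_one_per_lineage(ranked)
print(selected)
```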

In [10]:
from Bio import Entrez
import time
Entrez.email = 'adam.hockenberry@utexas.edu'
for top in topN.keys():
    handle = Entrez.efetch(db='Taxonomy', id=str(top), retmode='xml')
    records = Entrez.read(handle)
    assert len(records) == 1
    record = records[0]
    print(top, record['Lineage'])
    print()
    time.sleep(5)
    

562 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia

287 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Pseudomonadaceae; Pseudomonas; Pseudomonas aeruginosa group

1280 cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus

573 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Klebsiella

470 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Moraxellaceae; Acinetobacter; Acinetobacter calcoaceticus/baumannii complex

657318 cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; unclassified Lachnospiraceae; [Eubacterium] rectale

657321 cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales; Ruminococcaceae; Ruminococcus; Ruminococcus bromii

833

# Extracting and writing viral fastas for a test set

**As noted above, I might want to change this to be stricter/looser with inclusion criteria. For instance:** 
1. Perhaps we should only consider analyzing viruses that *only* infect a certain species (i.e. exclude promiscuous viruses) 
2. Only analyzing/restricting prophages?
3. Going deeper taxonomically to analyze viruses that infect a genus/family/etc.
4. Length cutoffs are probably particularly important. Are really short genomes trustworthy at all? Maybe? Really long ones?
5. ...

**All things to keep in mind**
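
As a sketch of the first criterion, the snippet below keeps only viral clusters linked to exactly one host taxon. It runs on toy data; `single_host_clusters` is a hypothetical helper. If it were applied to the filtered `interactions_df`, only species-level bacterial hosts would be counted, so a cluster with additional hosts at other ranks would still pass.

```python
import pandas as pd

def single_host_clusters(df):
    """Return rows whose viral cluster is linked to exactly one host taxon."""
    hosts_per_cluster = df.groupby('viral_cluster_id')['host_taxon_id'].transform('nunique')
    return df[hosts_per_cluster == 1]

# Toy stand-in for interactions_df: cluster 177 infects two hosts
toy_df = pd.DataFrame({
    'host_taxon_id':    [562, 573, 562, 287],
    'viral_cluster_id': [177, 177, 229, 401],
})
specific = single_host_clusters(toy_df)
print(specific)
```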

**Load the sequence dataframe (this is pretty big and will take a minute or two)**

In [12]:
reps_df = pd.read_csv('../../Data/MVP_data/MVP_DB/mvp_viral_cluster_representative_seqs.txt.gz',\
                      compression='gzip',\
                      sep='\t')

In [13]:
reps_df.head()

Unnamed: 0,seq_uid,cluster_id,is_representative,mapping_to_representative,seq_name,ncbi_taxon_id,seq_length,seq_source,seq_str
0,1,0,1,,NC_014637,693272,617453,ncbi_ref_viral_genome_database,TTAAATGTGTTAAACTTTATAGGTAAACATTCTTTCTGTCATCATT...
1,2,1,1,,NC_023423,1450746,610033,ncbi_ref_viral_genome_database,TTATAATATATCAAAGATACTTGAAATTTAGAACTTCGGTTAAAGT...
2,3,2,1,,NC_023719,1084719,497513,ncbi_ref_viral_genome_database,ATGTTTGAATTATCAAAAATACAAAGCGATACAAAGGCTTTGCAAA...
3,4,3,1,,NC_021312,251749,459984,ncbi_ref_viral_genome_database,AGGTCGCCAGAGGCTAAGAAGACCGCTAAGAAGACCGCTGAGGGCA...
4,5,4,1,,NC_007346,181082,407339,ncbi_ref_viral_genome_database,TATATTTAACGCGAATGATTTAAGGATTTTTATGGTTTTAACCAAA...


## Writing fasta files
1. Write individual phage `.fasta` files to individual host-linked directories
2. Write one large `.fasta` file for each host (to run prodigal with)


In [81]:
host_id = xxxx ###Manual specification at this stage
length_cutoff = 20000
save_dir = '../../Data/MVP_data/host_linked_genomes/{}_rep_viruses/'.format(host_id)

if not os.path.exists(save_dir):
    os.makedirs(save_dir)

for index in interactions_df[interactions_df['host_taxon_id']==host_id].index:
    cluster_id = interactions_df.at[index, 'viral_cluster_id']
    tempy = reps_df[reps_df['cluster_id'] == cluster_id]
    assert len(tempy.index) <= 1
    if len(tempy.index) == 0:
        print('Strangely can not find this cluster id', cluster_id)
        continue
    sequence = tempy.iloc[0]['seq_str']
    sequence_name = tempy.iloc[0]['seq_name']
    taxon_id = tempy.iloc[0]['ncbi_taxon_id']
    if tempy.iloc[0]['seq_length'] < length_cutoff:
        continue
    with open(save_dir+'{}.fasta'.format(cluster_id), 'w') as outfile:
        outfile.write('>{}|{}\n'.format(sequence_name, taxon_id))
        outfile.write('{}\n'.format(sequence))
        
        
###Next stage: writing a temporary combined file. This could have been merged with the loop above
###to be a bit cleaner.
cluster_id_dict = {}
with open(save_dir+'all_temp.fasta', 'w') as outfile:
    for index in interactions_df[interactions_df['host_taxon_id']==host_id].index:
        cluster_id = interactions_df.at[index, 'viral_cluster_id']
        tempy = reps_df[reps_df['cluster_id'] == cluster_id]
        assert len(tempy.index) <= 1
        if len(tempy.index) == 0:
            print("Strangely can't find this cluster id", cluster_id)
            continue
        sequence = tempy.iloc[0]['seq_str']
        sequence_name = tempy.iloc[0]['seq_name']
        taxon_id = tempy.iloc[0]['ncbi_taxon_id']
        if tempy.iloc[0]['seq_length'] < length_cutoff:
            continue

        outfile.write('>{}|{}\n'.format(sequence_name, taxon_id))
        outfile.write('{}\n'.format(sequence))
        
        cluster_id_dict['{}|{}'.format(sequence_name, taxon_id)] = cluster_id
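
The two loops above could be merged into a single pass along the following lines. This is a sketch on toy data rather than a drop-in replacement: `write_host_fastas` is a hypothetical helper, it selects representatives with `isin` instead of looping over interaction rows (so each cluster is written once), and it silently skips cluster ids that have no representative instead of printing a message.

```python
import os
import tempfile
import pandas as pd

def write_host_fastas(interactions, reps, host_id, length_cutoff, out_dir):
    """Write one fasta per cluster plus a combined fasta; return header -> cluster_id."""
    os.makedirs(out_dir, exist_ok=True)
    cluster_ids = interactions.loc[interactions['host_taxon_id'] == host_id, 'viral_cluster_id']
    hits = reps[reps['cluster_id'].isin(cluster_ids) & (reps['seq_length'] >= length_cutoff)]
    header_to_cluster = {}
    with open(os.path.join(out_dir, 'all_temp.fasta'), 'w') as combined:
        for row in hits.itertuples(index=False):
            header = '{}|{}'.format(row.seq_name, row.ncbi_taxon_id)
            record = '>{}\n{}\n'.format(header, row.seq_str)
            # Per-phage file, named by cluster id as in the loops above
            with open(os.path.join(out_dir, '{}.fasta'.format(row.cluster_id)), 'w') as single:
                single.write(record)
            combined.write(record)
            header_to_cluster[header] = row.cluster_id
    return header_to_cluster

# Toy stand-ins for interactions_df and reps_df (all values are made up)
toy_interactions = pd.DataFrame({'host_taxon_id': [562, 562, 287],
                                 'viral_cluster_id': [1, 2, 3]})
toy_reps = pd.DataFrame({
    'cluster_id':    [1, 2, 3],
    'seq_name':      ['toyA', 'toyB', 'toyC'],
    'ncbi_taxon_id': [11, 22, 33],
    'seq_length':    [8, 4, 8],
    'seq_str':       ['ACGTACGT', 'ACGT', 'TTTTAAAA'],
})
out_dir = tempfile.mkdtemp()
mapping = write_host_fastas(toy_interactions, toy_reps, host_id=562,
                            length_cutoff=6, out_dir=out_dir)
print(mapping)
```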

## Run Prodigal offline in normal mode

Example code:
```
/home/adhock/Workspace/Prodigal/prodigal.linux -i ../../Data/MVP_data/host_linked_genomes/562_rep_viruses/all_temp.fasta -f gff -o ../../Data/MVP_data/host_linked_genomes/562_rep_viruses/all_temp.gff
```
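
Because the split step below reads `all_temp.gff` from `save_dir`, it can help to build the input and output paths from the same directory. A minimal sketch, assuming a Prodigal executable is reachable under the name given by `prodigal_bin` (the default `'prodigal'` is an assumption); `prodigal_command` is a hypothetical helper, and only the `-i`, `-f` and `-o` flags from the example above are used.

```python
import os

def prodigal_command(save_dir, prodigal_bin='prodigal'):
    """Build a Prodigal argument list that reads and writes inside save_dir."""
    return [
        prodigal_bin,
        '-i', os.path.join(save_dir, 'all_temp.fasta'),  # combined fasta written earlier
        '-f', 'gff',                                      # GFF output format
        '-o', os.path.join(save_dir, 'all_temp.gff'),     # file the split step reads
    ]

cmd = prodigal_command('../../Data/MVP_data/host_linked_genomes/562_rep_viruses/')
print(' '.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Prodigal is installed.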

**Come back online to:**
1. Split the combined gff file into one file per phage
2. Delete the temp files

In [83]:
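gff_path = save_dir+'all_temp.gff'

def write_block(block_lines):
    ###The first column of the last feature line is the sequence ID, i.e. the fasta header
    cluster_id = cluster_id_dict[block_lines[-1].split('\t')[0]]
    with open(save_dir+'{}.gff'.format(cluster_id), 'w') as outfile:
        outfile.writelines(block_lines)

comment = True
write_lines = []
with open(gff_path, 'r') as infile:
    for line in infile:
        if line[0]=='#':
            ###A comment line directly after feature lines starts the next sequence's block
            if comment == False:
                write_block(write_lines)
                comment = True
                write_lines = [line]
            else:
                write_lines.append(line)
        else:
            comment = False
            write_lines.append(line)
###Flush the final block, which has no following comment line to trigger a write
if comment == False:
    write_block(write_lines)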

In [84]:
os.remove(save_dir+'all_temp.gff')
os.remove(save_dir+'all_temp.fasta')

**Et voilà. REPEAT for all manually identified `host_id`s of interest**

# Scratch

What follows is just some playing around with the data, giving thought to possible future uses.

## Interaction evidence

**Could be useful to one day limit to certain interaction types**

However, note that many of the viruses (but not all) contained in `pmid:26200428` are also prophages, which will make this quite difficult
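
A sketch of what limiting to certain interaction types might look like, using the `interaction_uid` and `interaction_source` columns of the evidence table loaded below. It runs on toy data; `interactions_with_source` is a hypothetical helper. Matching is on the full source string, so compound prophage sources such as `prophage_in_NCBI_complete_prokaryotic_genomes,phage_finder` would have to be listed in full.

```python
import pandas as pd

def interactions_with_source(interactions, evidence, allowed_sources):
    """Keep interactions that have at least one evidence row from an allowed source."""
    is_allowed = evidence['interaction_source'].isin(allowed_sources)
    allowed_uids = evidence.loc[is_allowed, 'interaction_uid']
    return interactions[interactions['interaction_uid'].isin(allowed_uids)]

# Toy stand-ins for interactions_df and evidence_df (ids are made up)
toy_interactions = pd.DataFrame({
    'interaction_uid':  [1, 2, 3],
    'host_taxon_id':    [562, 562, 287],
    'viral_cluster_id': [10, 11, 12],
})
toy_evidence = pd.DataFrame({
    'interaction_uid':    [1, 2, 3, 3],
    'interaction_source': ['ICTV', 'pmid:26200428', 'pmid:26200428',
                           'ncbi_ref_viral_genome_database'],
})
curated = interactions_with_source(
    toy_interactions, toy_evidence,
    allowed_sources=['ICTV', 'ncbi_ref_viral_genome_database'])
print(curated)
```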

In [14]:
evidence_df = pd.read_csv('../../Data/MVP_data/MVP_DB/mvp_interaction_evidence.txt.gz',\
                          compression='gzip',\
                          sep='\t')
evidence_df.head()

Unnamed: 0,evidence_uid,interaction_uid,interaction_source,interaction_remarks,phage_cluster_id,phage_seq_uid
0,1,1,"prophage_in_NCBI_complete_prokaryotic_genomes,...",,175,181
1,2,2,"prophage_in_NCBI_complete_prokaryotic_genomes,...",,177,183
2,3,3,pmid:26200428,,229,238
3,4,4,pmid:26200428,,229,239
4,5,5,pmid:26200428,,229,240


In [15]:
evidence_df['interaction_source'].value_counts()

pmid:26200428                                                       12443
prophage_in_NCBI_complete_prokaryotic_genomes,phage_finder          10059
prophage_in_EMBL_progenomes_representative_genomes,phage_finder      3322
prophage_in_metagenomic_assembled_contigs,phage_finder,human_gut     2585
ncbi_ref_viral_genome_database                                       1989
ICTV                                                                  668
pmid:27533034                                                         537
Name: interaction_source, dtype: int64

## Prophages
**As noted, I might want to analyze prophages separately? But personally I find them a bit strange to even consider at all.**

Also note that in this case we're considering all viruses regardless of clustering (which is dangerous)

In [17]:
prophage_df = pd.read_csv('../../Data/MVP_data/MVP_DB/mvp_prophages_and_their_genomic_locations.txt.gz',\
                          compression='gzip',\
                          sep='\t')
prophage_df[prophage_df['host_taxon_id']==562]

Unnamed: 0,host_contig_id,host_genome_id,prophage_start,prophage_end,data_source,host_scientific_name,host_taxon_id
1,CP015229.1,CP015229.1,1789353,1960089,NCBI complete prokaryotic genomes,Escherichia coli,562
2,CP007392.1,CP007392.1,2084500,2246945,NCBI complete prokaryotic genomes,Escherichia coli,562
10,CP012693.1,CP012693.1,2757790,2906285,NCBI complete prokaryotic genomes,Escherichia coli,562
11,CP007393.1,CP007393.1,2091722,2240084,NCBI complete prokaryotic genomes,Escherichia coli,562
31,CP017249.1,CP017249.1,2305199,2348482,NCBI complete prokaryotic genomes,Escherichia coli,562
...,...,...,...,...,...,...,...
12855,CP013031.1,CP013031.1,4651953,4664018,NCBI complete prokaryotic genomes,Escherichia coli,562
12874,CP016546.1,CP016546.1,2328896,2340864,NCBI complete prokaryotic genomes,Escherichia coli,562
13113,CP009104.1,CP009104.1,3079856,3090553,NCBI complete prokaryotic genomes,Escherichia coli,562
13114,CP011018.1,CP011018.1,288287,298985,NCBI complete prokaryotic genomes,Escherichia coli,562


**As noted above, the viruses from `pmid:26200428` are not labeled as prophages in this table, even though many of them are prophages**

In [18]:
prophage_df['data_source'].value_counts()

NCBI complete prokaryotic genomes                             10007
representative prokaryotic genomes from proGenome database     3302
assembled metagenomic contigs; human gut                       2583
complete prokaryotic genomes from EMBL proGenomes database        2
Name: data_source, dtype: int64

## Taxonomy 
**My current strategy is to limit my analysis to viral clusters with a defined species-level host. Expanding out to the genus level might make more data available**

In [19]:
taxon_df = pd.read_csv('../../Data/MVP_data/MVP_DB/superkingdom2descendents.txt.gz',\
                       compression='gzip',\
                       sep='\t')

In [20]:
taxon_df[taxon_df['scientific_name'].str.contains('Caulobacter vibri')]

Unnamed: 0,scientific_name,node_rank,ncbi_taxon_id,taxon_id,parent_taxon_id,superkingdom
121392,Caulobacter vibrioides,species,155892,123335,185,Bacteria
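
Expanding to the genus level would require walking from a species up to its genus. The sketch below does this on toy data using the column names shown above. It assumes that `parent_taxon_id` refers to the table's own `taxon_id` column rather than to `ncbi_taxon_id`, which I have not verified; `find_ancestor_at_rank` is a hypothetical helper.

```python
import pandas as pd

def find_ancestor_at_rank(taxa, ncbi_taxon_id, rank='genus'):
    """Follow parent links upward and return the name of the first node at `rank`."""
    by_internal_id = taxa.set_index('taxon_id')
    start = taxa.loc[taxa['ncbi_taxon_id'] == ncbi_taxon_id, 'taxon_id']
    if start.empty:
        return None
    current = start.iloc[0]
    visited = set()  # guards against cycles in the parent links
    while current in by_internal_id.index and current not in visited:
        visited.add(current)
        row = by_internal_id.loc[current]
        if row['node_rank'] == rank:
            return row['scientific_name']
        current = row['parent_taxon_id']  # assumption: points at the taxon_id column
    return None

# Toy stand-in for taxon_df (names and ids are made up)
toy_taxa = pd.DataFrame({
    'scientific_name': ['Toy genus', 'Toy species'],
    'node_rank':       ['genus', 'species'],
    'ncbi_taxon_id':   [9001, 9002],
    'taxon_id':        [10, 11],
    'parent_taxon_id': [1, 10],
})
genus_name = find_ancestor_at_rank(toy_taxa, 9002)
print(genus_name)
```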


## Identifying the names of some common phages to use as examples

In [21]:
phage_lambda = 'NC_001416' 
t7 = 'NC_001604'
t4 = 'NC_000866'
phi = 'NC_001422'

In [23]:
cluster_id_dict = {}
host_id = 562
length_cutoff=20000
for index in interactions_df[interactions_df['host_taxon_id']==host_id].index:
    cluster_id = interactions_df.at[index, 'viral_cluster_id']
    tempy = reps_df[reps_df['cluster_id'] == cluster_id]
    assert len(tempy.index) <= 1
    if len(tempy.index) == 0:
        print("Strangely can't find this cluster id", cluster_id)
        continue
    sequence = tempy.iloc[0]['seq_str']
    sequence_name = tempy.iloc[0]['seq_name']
    taxon_id = tempy.iloc[0]['ncbi_taxon_id']
    if tempy.iloc[0]['seq_length'] < length_cutoff:
        continue

    cluster_id_dict['{}|{}'.format(sequence_name, taxon_id)] = cluster_id

Strangely can't find this cluster id 32825
Strangely can't find this cluster id 32826
Strangely can't find this cluster id 32827
Strangely can't find this cluster id 32828


In [29]:
for i,j in cluster_id_dict.items():
    if phage_lambda in i:
        print('Lambda', i, '\t', j)
    elif t7 in i:
        print('T7', i, '\t', j)
    elif t4 in i:
        print('T4', i, '\t', j)
    elif phi in i:
        print('PhiX', i, '\t', j)

T4 NC_000866|10665 	 188
T7 NC_001604|10760 	 7841
Lambda NC_001416|10710 	 3868
