# Notes:

This notebook compiles and processes the phage genomes that will be used as ground-truth benchmarks for training and testing a lifestyle prediction model. Step through it from top to bottom. It assumes that you have cloned this repository in its entirety, including the data folders, with the directory structure unchanged. No promises are made about how it will run on Windows machines.

Note that this notebook is specific to this one phage training dataset; the code will need refactoring to match the format (data tables, column names, etc.) of any future training datasets.

# Imports

In [1]:
import pandas as pd
from Bio import SeqIO, Entrez
import glob

import time
import os

# Constants

In [2]:
base_dir = '../../Data/model_data/phage_data_nmicro2017/'
annotations_file = base_dir + 'original_data/41564_2017_BFnmicrobiol2017112_MOESM2_ESM.csv'
phagesdb_fasta_file = base_dir + 'original_data/Actinobacteriophages-All.fasta'
fasta_dir = base_dir + 'phage_fasta_files/'
processed_df_outfile = base_dir + 'processed_benchmark_set.csv'

##############################################
#If you're re-running this notebook with the intention of downloading or re-downloading
#genome data in bulk from NCBI, please be sure to provide your email address for them.
#If you have cloned the full repository for this project, all files should already be
#present and will not be re-downloaded by default, making this superfluous.
Entrez.email = 'ENTER YOUR EMAIL ADDRESS HERE'
##############################################

In [4]:
###Make sure that all the directories and file paths are valid
for directory in [base_dir, fasta_dir]:
    assert os.path.exists(directory)

for file_of_interest in [annotations_file, phagesdb_fasta_file]:
    assert os.path.exists(file_of_interest)
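
A bare `assert` stops at the first missing path without saying which one it was. As a minimal sketch (not part of the original notebook), the same check can collect every missing path first; `missing_paths` is a hypothetical helper name, and the temporary directory below only stands in for the real data folders.

```python
import os
import tempfile

def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [path for path in paths if not os.path.exists(path)]

# Demonstration with a temporary directory standing in for the real data folders
with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, 'present.csv')
    absent = os.path.join(tmp_dir, 'absent.fasta')
    with open(present, 'w') as handle:
        handle.write('placeholder\n')
    missing = missing_paths([tmp_dir, present, absent])
    missing_names = [os.path.basename(path) for path in missing]

print(missing_names)
```

In the notebook itself, the list passed in would be the directories and files defined under Constants.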

# Select phages of interest from the starting file

In [5]:
df = pd.read_csv(annotations_file)
print('Starting dataframe shape:', df.shape)
df = df[df['Temperate (empirical)'] != 'Unspecified']
print('Ground truth dataframe shape:', df.shape)
df.head()

Starting dataframe shape: (2333, 22)
Ground truth dataframe shape: (1059, 22)


| Unnamed: 0 | Virus identifier used for the analysis | Database source | RefSeq header source description | RefSeq accession number | Genome type | Order | Family | Host domain | Host phylum | Host class | ... | Host genus | Cluster | Subcluster | Genome size | Number of genes used by Phamerator for creating phams | Used to optimize Mash parameters | Known to encode toxins or virulence factors | Temperate (empirical) | Temperate (bioinformatically predicted) | Evolutionary mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0507-kn2-1__nc_022343 | NCBI RefSeq | Klebsiella phage 0507-KN2-1 | NC_022343 | dsDNA | Caudovirales | Myoviridae | Bacteria | Proteobacteria | Gammaproteobacteria | ... | Klebsiella | Unspecified | Unspecified | 159991 | 154 | no | Unspecified | no | no | LGCF |
| 14 | 11__nc_004615 | NCBI RefSeq | Staphylococcus phage 11 | NC_004615 | dsDNA | Caudovirales | Siphoviridae | Bacteria | Firmicutes | Bacilli | ... | Staphylococcus | Unspecified | Unspecified | 43604 | 53 | no | Unspecified | yes | yes | HGCF |
| 19 | 13__nc_004617 | NCBI RefSeq | Staphylococcus phage 13 | NC_004617 | dsDNA | Caudovirales | Siphoviridae | Bacteria | Firmicutes | Bacilli | ... | Staphylococcus | Unspecified | Unspecified | 42722 | 49 | no | yes | yes | yes | HGCF |
| 22 | 13a__nc_011045 | NCBI RefSeq | Enterobacteria phage 13a | NC_011045 | dsDNA | Caudovirales | Podoviridae | Bacteria | Proteobacteria | Gammaproteobacteria | ... | Escherichia | Unspecified | Unspecified | 38841 | 55 | no | Unspecified | no | no | LGCF |
| 27 | 1717__nc_011357 | NCBI RefSeq | Stx2-converting phage 1717 | NC_011357 | dsDNA | Caudovirales | Siphoviridae | Bacteria | Proteobacteria | Gammaproteobacteria | ... | Escherichia | Unspecified | Unspecified | 62147 | 77 | no | yes | yes | yes | HGCF |


# Scan through bulk fasta file from PhagesDB to isolate and write individual records of interest

Note that this file was retrieved in March 2020 from https://phagesdb.org/data/

The relevant data file was listed under the heading:
"Download Multifasta of All Actinobacteriophage Genomes"

In [6]:
fastas = list(SeqIO.parse(phagesdb_fasta_file, 'fasta'))
print('Contains {} separate fasta sequences'.format(len(fastas)))
print()
print('########## Here is an example entry:')
print(fastas[0])

Contains 3359 separate fasta sequences

########## Here is an example entry:
ID: Mycobacterium
Name: Mycobacterium
Description: Mycobacterium phage 20ES complete sequence, 53124 bp including 10-base 3' overhang (CGGTCGGTTA), Cluster A2
Number of features: 0
Seq('TGCGGCCGCCCCATCCTGTACGGGTTTCCAAGTCGATCGGAGGGCAGAGTCGGC...TTA', SingleLetterAlphabet())


**Pattern matching data to pair the `csv` file entries with `fasta` records**

In [7]:
temp_df = df[df['Database source'] == 'Actinobacteriophage_785']
problem_cases = []
for index in temp_df.index[:]:
    found = []
    identifier = temp_df.loc[index]['Virus identifier used for the analysis'].split('_')[0].lower()
    for fasta in fastas:
        if identifier in fasta.description.lower().split(' '):
            found.append(fasta)
    if len(found) != 1:
        problem_cases.append(identifier)
        print(identifier, len(found))
        continue
    SeqIO.write(found[0], fasta_dir + '{}.fasta'.format(identifier), 'fasta')

bennie 0
guacamole 0
harvey 0
kamiyu 0
schwabeltier 0
theloniousmonk 0
wayne 0


In [8]:
###Cleaning up and dealing with some of those problem cases
for index in temp_df.index[:]:
    found = []
    identifier = temp_df.loc[index]['Virus identifier used for the analysis'].split('_')[0].lower()
    
    ########
    if identifier not in problem_cases:
        continue
    ########
    
    for fasta in fastas:
        if identifier in fasta.description.lower().replace(',', ' ').replace('_', ' ').split(' '):
            found.append(fasta)
    if len(found) != 1:
        # Still unresolved; the identifier is already in problem_cases, so only report it
        print(identifier, len(found))
        continue
    SeqIO.write(found[0], fasta_dir + '{}.fasta'.format(identifier), 'fasta')

schwabeltier 0
theloniousmonk 0
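
The matching logic depends entirely on how each FASTA description is split into tokens: an identifier only matches when it appears as a whole token, so a trailing comma or a joining underscore hides it from a plain split on spaces. The sketch below illustrates that idea with a regular-expression split, which is similar in spirit to (but not identical to) the `replace` and `split` chain used in the notebook. The function names and the example descriptions are hypothetical and made up for illustration; they are not records from the PhagesDB file.

```python
import re

def description_tokens(description):
    """Lower-case a FASTA description and split it on whitespace, commas and underscores."""
    return [tok for tok in re.split(r'[\s,_]+', description.lower()) if tok]

def find_matches(identifier, descriptions):
    """Return the descriptions that contain `identifier` as a whole token."""
    identifier = identifier.lower()
    return [d for d in descriptions if identifier in description_tokens(d)]

# Hypothetical descriptions, invented for illustration only
descriptions = [
    'Mycobacterium phage Examplephage, complete genome',
    'Mycobacterium phage Example_Draft complete sequence',
    'Mycobacterium phage Other complete sequence',
]

# A plain split on spaces keeps the comma attached, so the identifier is not found
naive_tokens = descriptions[0].lower().split(' ')
print('examplephage' in naive_tokens)

# Splitting on commas and underscores as well recovers exactly one match
matches = find_matches('examplephage', descriptions)
print(len(matches))
```

Requiring exactly one match, as the notebook does, guards against an identifier that is also an ordinary word in several descriptions.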


In [9]:
###Remove the problematic cases that for some reason or another I can't easily find
df = df[df['Virus identifier used for the analysis'].str.contains('schwabeltier', case=False) == False]
df = df[df['Virus identifier used for the analysis'].str.contains('theloniousmonk', case=False) == False]
print(df.shape)

(1057, 22)


**Write the new dataframe containing all of the phages for which we have lifestyle annotations, excluding the two actinobacteriophages that I was unable to find**

In [12]:
df.to_csv(processed_df_outfile)

# Download RefSeq data from NCBI

This section downloads full-genome fasta files from NCBI, pausing 10 seconds between requests, so it may take some time to run and should probably be run at night. In reality, the number of genomes is fairly small, so it does not place a heavy burden on NCBI servers, but be conscious of bulk downloading in this manner.

In [13]:
temp_df = df[df['Database source'] == 'NCBI RefSeq']
print(temp_df.shape)

(506, 22)


In [14]:
for index in temp_df.index[:]:
    print(index)
    accession = temp_df.loc[index]['RefSeq accession number']
    
    #####################################################################
    ###Don't download if I already have the file downloaded and ready to go
    if os.path.exists(fasta_dir + '{}.fasta'.format(accession)):
        continue
    #####################################################################

    handle = Entrez.efetch(db='nuccore', id=accession, rettype='fasta', retmode='text')
    text = handle.read()
    handle.close()
    with open(fasta_dir + '{}.fasta'.format(accession), 'w') as outfile:
        outfile.write(text)
    time.sleep(10)
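
The skip-if-present pattern in this loop can be separated from the network call, which makes it possible to try out without contacting NCBI. The following is a minimal sketch under that assumption: `download_if_missing` and its `fetch` parameter are hypothetical names, and `fake_fetch` is a stand-in that returns placeholder text rather than real genome data.

```python
import os
import tempfile
import time

def download_if_missing(accessions, out_dir, fetch, pause_seconds=10):
    """Fetch each accession not already on disk; return the accessions actually fetched.

    `fetch` is any callable that maps an accession to FASTA text.
    """
    downloaded = []
    for accession in accessions:
        out_path = os.path.join(out_dir, '{}.fasta'.format(accession))
        if os.path.exists(out_path):
            continue  # already cached locally, so no request is made
        text = fetch(accession)
        with open(out_path, 'w') as outfile:
            outfile.write(text)
        downloaded.append(accession)
        time.sleep(pause_seconds)  # pause between requests
    return downloaded

def fake_fetch(accession):
    """Stand-in for a real download; returns placeholder FASTA text."""
    return '>{}\nACGT\n'.format(accession)

with tempfile.TemporaryDirectory() as tmp_dir:
    first = download_if_missing(['A1', 'A2'], tmp_dir, fake_fetch, pause_seconds=0)
    second = download_if_missing(['A1', 'A2', 'A3'], tmp_dir, fake_fetch, pause_seconds=0)

print(first, second)
```

In the notebook, `fetch` would wrap the `Entrez.efetch` call shown in the cell above, and the second run illustrates that only files not yet on disk trigger a request.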

**Ensuring that everything was downloaded by checking that each genome sequence is parseable and at least 1000 nts long**

In [15]:
problem_cases = []
for index in temp_df.index[:]:
    accession = temp_df.loc[index]['RefSeq accession number']
    fasta_path = fasta_dir + '{}.fasta'.format(accession)
    if not os.path.exists(fasta_path):
        # A file that was never downloaded is also a problem case
        problem_cases.append(accession)
        continue
    record = SeqIO.read(fasta_path, format='fasta')
    if len(record.seq) < 1000:
        problem_cases.append(accession)
print(len(problem_cases))

0
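
A check of this kind can also report why a file failed rather than only counting failures. The sketch below is a standard-library alternative to the Biopython-based cell, not the notebook's own method; it assumes one record per file, and `fasta_problem` and `min_length` are hypothetical names. The files it inspects are small placeholders written to a temporary directory.

```python
import os
import tempfile

def fasta_problem(path, min_length=1000):
    """Return a short reason string if the file fails the check, else None."""
    if not os.path.exists(path):
        return 'missing'
    n_records = 0
    seq_length = 0
    with open(path) as infile:
        for line in infile:
            line = line.strip()
            if line.startswith('>'):
                n_records += 1          # header line starts a new record
            elif n_records > 0:
                seq_length += len(line)  # sequence line of the current record
    if n_records != 1:
        return 'expected 1 record, found {}'.format(n_records)
    if seq_length < min_length:
        return 'too short ({} nt)'.format(seq_length)
    return None

# Placeholder files, written only to exercise the three outcomes
with tempfile.TemporaryDirectory() as tmp_dir:
    good = os.path.join(tmp_dir, 'good.fasta')
    short = os.path.join(tmp_dir, 'short.fasta')
    with open(good, 'w') as handle:
        handle.write('>good\n' + 'ACGT' * 300 + '\n')
    with open(short, 'w') as handle:
        handle.write('>short\nACGT\n')
    results = {
        'good': fasta_problem(good),
        'short': fasta_problem(short),
        'absent': fasta_problem(os.path.join(tmp_dir, 'absent.fasta')),
    }

print(results)
```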


**fin.**