# Notes:

This notebook compiles and processes the phage genomes that will be used as ground-truth benchmarks for training/testing a lifestyle prediction model. The notebook should be stepped through top to bottom and assumes that you have cloned this repository in its entirety, including data folders with identical directory structures. No promises are made about how it will run on Windows machines.

Note in particular that this notebook is **very specific** to **this particular phage** training dataset; the code will need to be refactored to match the format (data tables, column names, etc.) of any future training dataset.

# Imports

In [1]:
import pandas as pd
from Bio import SeqIO, Entrez
import glob

import time
import os

# Constants

In [2]:
base_dir = '../Data/phage_data_nmicro2017/'
annotations_file = base_dir + 'original_data/41564_2017_BFnmicrobiol2017112_MOESM2_ESM.csv'
phagesdb_fasta_file = base_dir + 'original_data/Actinobacteriophages-All.fasta'
fasta_dir = base_dir + 'phage_fasta_files/'
processed_df_outfile = base_dir + 'processed_benchmark_set.csv'

##############################################
# If you are re-running this notebook with the intention of downloading or re-downloading
# genome data in bulk from NCBI, please be sure to provide your email address for them.
# If you have cloned the full repository for this project, all files should already be
# accessible and will not be re-downloaded by default, which makes this superfluous.
Entrez.email = 'ENTER YOUR EMAIL ADDRESS HERE'
##############################################

In [3]:
###Make sure that all the directories and file paths are valid
for directory in [base_dir, fasta_dir]:
    assert os.path.exists(directory)

for file_of_interest in [annotations_file, phagesdb_fasta_file]:
    assert os.path.exists(file_of_interest)
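
A bare `assert` does not say which path is missing, and assertions are skipped entirely when Python runs with `-O`. The following is a minimal sketch of a more talkative check; the helper names `missing_paths` and `require_paths` are hypothetical and not part of this repository, and a temporary directory stands in for the real data folders.

```python
import os
import tempfile


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [path for path in paths if not os.path.exists(path)]


def require_paths(paths):
    """Raise FileNotFoundError naming every missing path (hypothetical helper)."""
    missing = missing_paths(paths)
    if missing:
        raise FileNotFoundError('Missing required paths: ' + ', '.join(missing))


# Demo: a throwaway directory stands in for base_dir / fasta_dir.
with tempfile.TemporaryDirectory() as demo_dir:
    present = os.path.join(demo_dir, 'present.csv')
    open(present, 'w').close()
    absent = os.path.join(demo_dir, 'absent.fasta')
    missing = missing_paths([demo_dir, present, absent])
    demo_ok = missing == [absent]

print(demo_ok)
```

In the notebook this would be called as `require_paths([base_dir, fasta_dir, annotations_file, phagesdb_fasta_file])`.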

# Select phages of interest from the starting file

In [3]:
df = pd.read_csv(annotations_file)
print('Starting dataframe shape:', df.shape)
df = df[df['Temperate (empirical)'] != 'Unspecified']
print('Ground truth dataframe shape:', df.shape)
df.head()

Starting dataframe shape: (2333, 22)
Ground truth dataframe shape: (1059, 22)


| Unnamed: 0 | Virus identifier used for the analysis | Database source | RefSeq header source description | RefSeq accession number | Genome type | Order | Family | Host domain | Host phylum | Host class | ... | Host genus | Cluster | Subcluster | Genome size | Number of genes used by Phamerator for creating phams | Used to optimize Mash parameters | Known to encode toxins or virulence factors | Temperate (empirical) | Temperate (bioinformatically predicted) | Evolutionary mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0507-kn2-1__nc_022343 | NCBI RefSeq | Klebsiella phage 0507-KN2-1 | NC_022343 | dsDNA | Caudovirales | Myoviridae | Bacteria | Proteobacteria | Gammaproteobacteria | ... | Klebsiella | Unspecified | Unspecified | 159991 | 154 | no | Unspecified | no | no | LGCF |
| 14 | 11__nc_004615 | NCBI RefSeq | Staphylococcus phage 11 | NC_004615 | dsDNA | Caudovirales | Siphoviridae | Bacteria | Firmicutes | Bacilli | ... | Staphylococcus | Unspecified | Unspecified | 43604 | 53 | no | Unspecified | yes | yes | HGCF |
| 19 | 13__nc_004617 | NCBI RefSeq | Staphylococcus phage 13 | NC_004617 | dsDNA | Caudovirales | Siphoviridae | Bacteria | Firmicutes | Bacilli | ... | Staphylococcus | Unspecified | Unspecified | 42722 | 49 | no | yes | yes | yes | HGCF |
| 22 | 13a__nc_011045 | NCBI RefSeq | Enterobacteria phage 13a | NC_011045 | dsDNA | Caudovirales | Podoviridae | Bacteria | Proteobacteria | Gammaproteobacteria | ... | Escherichia | Unspecified | Unspecified | 38841 | 55 | no | Unspecified | no | no | LGCF |
| 27 | 1717__nc_011357 | NCBI RefSeq | Stx2-converting phage 1717 | NC_011357 | dsDNA | Caudovirales | Siphoviridae | Bacteria | Proteobacteria | Gammaproteobacteria | ... | Escherichia | Unspecified | Unspecified | 62147 | 77 | no | yes | yes | yes | HGCF |


In [7]:
df.columns

Index(['Virus identifier used for the analysis', 'Database source',
       'RefSeq header source description', 'RefSeq accession number',
       'Genome type', 'Order', 'Family', 'Host domain', 'Host phylum',
       'Host class', 'Host order', 'Host family', 'Host genus', 'Cluster',
       'Subcluster', 'Genome size',
       'Number of genes used by Phamerator for creating phams',
       'Used to optimize Mash parameters',
       'Known to encode toxins or virulence factors', 'Temperate (empirical)',
       'Temperate (bioinformatically predicted)', 'Evolutionary mode'],
      dtype='object')

In [5]:
df['Host phylum'].value_counts()

Actinobacteria    554
Proteobacteria    362
Firmicutes        131
Cyanobacteria       7
Tenericutes         4
Unspecified         1
Name: Host phylum, dtype: int64

In [6]:
df['Host class'].value_counts()

Actinobacteria         554
Gammaproteobacteria    337
Bacilli                117
Betaproteobacteria      17
Clostridia              14
Unspecified              8
Alphaproteobacteria      7
Mollicutes               4
Deltaproteobacteria      1
Name: Host class, dtype: int64

In [11]:
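# Fraction of the 1057 retained phages whose host class is one of the three
# most common (Actinobacteria, Gammaproteobacteria, Bacilli)
(552+337+117)/1057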

0.9517502365184485
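
The share covered by the three most common host classes can also be computed without typing the counts by hand. This is a minimal sketch on a small made-up series (the toy values are assumptions, not the real dataset); in the notebook the same expression would be applied to `df['Host class']`.

```python
import pandas as pd

# Toy stand-in for df['Host class']: 4 + 3 + 2 + 1 = 10 entries.
host_class = pd.Series(
    ['Actinobacteria'] * 4
    + ['Gammaproteobacteria'] * 3
    + ['Bacilli'] * 2
    + ['Clostridia'] * 1
)

# value_counts sorts in descending order, so head(3) picks the three largest
# classes; normalize=True turns the counts into fractions of the total.
top_three_share = host_class.value_counts(normalize=True).head(3).sum()
print(round(top_three_share, 3))
```

Computing the value from the dataframe keeps it in step with the data if cells are re-run in a different order.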

In [8]:
df['Host order'].value_counts()

Corynebacteriales    507
Enterobacterales     281
Bacillales            68
Lactobacillales       49
Micrococcales         46
Pseudomonadales       25
Burkholderiales       17
Clostridiales         14
Vibrionales           12
Xanthomonadales        7
Pasteurellales         6
Synechococcales        6
Aeromonadales          5
Rhodobacterales        5
Mycoplasmatales        3
Unspecified            2
Rhizobiales            1
Myxococcales           1
Acholeplasmatales      1
Streptomycetales       1
Oceanospirillales      1
Nostocales             1
Name: Host order, dtype: int64

# Scan through bulk fasta file from PhagesDB to isolate and write individual records of interest

Note that this file was retrieved in March 2020 from https://phagesdb.org/data/

The relevant data file was listed under the heading:
"Download Multifasta of All Actinobacteriophage Genomes"

In [5]:
fastas = list(SeqIO.parse(phagesdb_fasta_file, 'fasta'))
print('Contains {} separate fasta sequences'.format(len(fastas)))
print()
print('########## Here is an example entry:')
print(fastas[0])

Contains 3359 separate fasta sequences

########## Here is an example entry:
ID: Mycobacterium
Name: Mycobacterium
Description: Mycobacterium phage 20ES complete sequence, 53124 bp including 10-base 3' overhang (CGGTCGGTTA), Cluster A2
Number of features: 0
Seq('TGCGGCCGCCCCATCCTGTACGGGTTTCCAAGTCGATCGGAGGGCAGAGTCGGC...TTA', SingleLetterAlphabet())


**Pattern matching data to pair the `csv` file entries with `fasta` records**

In [6]:
temp_df = df[df['Database source'] == 'Actinobacteriophage_785']
problem_cases = []
for index in temp_df.index[:]:
    found = []
    identifier = temp_df.loc[index]['Virus identifier used for the analysis'].split('_')[0].lower()
    for fasta in fastas:
        if identifier in fasta.description.lower().split(' '):
            found.append(fasta)
    if len(found) != 1:
        problem_cases.append(identifier)
        print(identifier, len(found))
        continue
    SeqIO.write(found[0], fasta_dir + '{}.fasta'.format(identifier), 'fasta')

bennie 0
guacamole 0
harvey 0
kamiyu 0
schwabeltier 0
theloniousmonk 0
wayne 0


In [7]:
###Cleaning up and dealing with some of those problem cases
for index in temp_df.index[:]:
    found = []
    identifier = temp_df.loc[index]['Virus identifier used for the analysis'].split('_')[0].lower()
    
    ########
    if identifier not in problem_cases:
        continue
    ########
    
    for fasta in fastas:
        if identifier in fasta.description.lower().replace(',', ' ').replace('_', ' ').split(' '):
            found.append(fasta)
    if len(found) != 1:
        problem_cases.append(identifier)
        print(identifier, len(found))
        continue
    SeqIO.write(found[0], fasta_dir + '{}.fasta'.format(identifier), 'fasta')

schwabeltier 0
theloniousmonk 0


In [8]:
###Remove the problematic cases that, for one reason or another, I can't easily find
df = df[~df['Virus identifier used for the analysis'].str.contains('schwabeltier', case=False)]
df = df[~df['Virus identifier used for the analysis'].str.contains('theloniousmonk', case=False)]
print(df.shape)

(1057, 22)


**Write the new dataframe containing all of the actinobacteria phages that I was able to find and for which we have lifestyle annotations**

In [9]:
df.to_csv(processed_df_outfile)
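
Because `to_csv` writes the dataframe index as an unnamed first column, reading the file back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column. A small round-trip sketch on a made-up two-row frame (the column names are shortened stand-ins), using an in-memory buffer instead of `processed_df_outfile`:

```python
import io

import pandas as pd

# Toy frame with a non-contiguous index, like the filtered dataframe above.
demo = pd.DataFrame(
    {'Virus identifier': ['a__x', 'b__y'], 'Temperate (empirical)': ['yes', 'no']},
    index=[1, 14],
)

buffer = io.StringIO()
demo.to_csv(buffer)

# A plain read turns the saved index into a regular 'Unnamed: 0' column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the original row labels instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.index))
```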

# Download RefSeq data from NCBI

This section downloads full-genome fasta files from NCBI, so it may take some time to run and is probably best run overnight. In practice the number of genomes is fairly small, so it does not place a heavy burden on NCBI servers, but be mindful when bulk downloading in this manner.

In [10]:
temp_df = df[df['Database source'] == 'NCBI RefSeq']
print(temp_df.shape)

(506, 22)


In [11]:
for index in temp_df.index[:]:
    print(index)
    accession = temp_df.loc[index]['RefSeq accession number']
    
    #####################################################################
    ###Don't download if I already have the file downloaded and ready to go
    if os.path.exists(fasta_dir + '{}.fasta'.format(accession)):
        continue
    #####################################################################

    handle = Entrez.efetch(db='nuccore', id=accession, rettype='fasta', retmode='text')
    text = handle.read()
    handle.close()
    with open(fasta_dir + '{}.fasta'.format(accession), 'w') as outfile:
        outfile.write(text)
    time.sleep(10)

1
14
19
22
27
29
30
36
41
44
46
47
48
49
52
53
54
55
56
57
58
62
66
72
73
74
75
76
78
79
80
82
83
86
88
90
93
97
99
100
101
102
105
107
108
109
110
111
112
114
115
117
126
133
136
141
146
150
177
178
179
197
208
209
210
216
220
229
241
244
245
246
247
252
255
263
273
274
275
276
277
278
279
280
284
304
306
310
342
343
345
365
367
371
416
418
427
429
430
431
433
444
446
447
464
466
476
493
494
505
511
512
515
516
517
518
521
522
523
524
525
528
536
538
540
547
550
551
552
553
555
556
557
558
560
565
566
570
577
580
581
582
584
592
593
594
595
609
626
627
628
629
630
631
633
634
637
643
651
655
656
657
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
748
749
756
770
774
776
777
789
790
791
792
793
794
825
837
842
846
848
860
861
862
865
868
869
870
871
872
874
875
876
877
890
897
907
913
914
915
916
917
926
933
934
935
937
940
952
969
970
971
980
981
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1019
1022
1042
1046
1048
1059
1060
1061
1062
1063
1064
1065
1066
1067
1078
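
The download loop above stops at the first network error. The following is one possible sketch of a skip-if-present download with a few retries. The function `download_if_missing` and its `fetch` argument are hypothetical; the fetch is passed in so that the sketch runs without network access, and a deliberately flaky fake stands in for NCBI. It catches `OSError` because `urllib`'s `URLError` and `HTTPError` are subclasses of it.

```python
import os
import tempfile
import time


def download_if_missing(accession, out_dir, fetch, retries=3, pause=0.0):
    """Write `fetch(accession)` to <out_dir>/<accession>.fasta unless it already exists."""
    path = os.path.join(out_dir, '{}.fasta'.format(accession))
    if os.path.exists(path):
        return 'skipped'
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            text = fetch(accession)
        except OSError as error:
            last_error = error
            time.sleep(pause * attempt)  # wait a little longer after each failure
            continue
        with open(path, 'w') as outfile:
            outfile.write(text)
        return 'downloaded'
    raise RuntimeError('Giving up on {}: {}'.format(accession, last_error))


# Fake fetch that fails once and then succeeds (no network involved).
calls = {'count': 0}


def flaky_fetch(accession):
    calls['count'] += 1
    if calls['count'] == 1:
        raise OSError('simulated network hiccup')
    return '>{} demo record\nACGT\n'.format(accession)


with tempfile.TemporaryDirectory() as demo_dir:
    first = download_if_missing('DEMO_0001', demo_dir, flaky_fetch)
    second = download_if_missing('DEMO_0001', demo_dir, flaky_fetch)

print(first, second)
```

In the notebook, `fetch` would be a small function that wraps the `Entrez.efetch(...)`, `handle.read()` and `handle.close()` calls from the cell above, and `pause` would be set to a polite value such as the 10 seconds used there.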

**Ensuring that everything was downloaded by checking that each genome sequence is parseable and at least 1000 nts long**

In [12]:
problem_cases = []
for index in temp_df.index[:]:
    accession = temp_df.loc[index]['RefSeq accession number']
    if os.path.exists(fasta_dir + '{}.fasta'.format(accession)):
        record = SeqIO.read(fasta_dir + '{}.fasta'.format(accession), format='fasta')
        if len(record.seq) < 1000:
            problem_cases.append(accession)
print(len(problem_cases))

0
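
Note that the check above only looks at files that exist, so a genome that was never downloaded is not counted as a problem. The following sketch also flags missing and malformed files. To stay dependency-free it uses a tiny hand-rolled FASTA reader rather than `SeqIO.read`; `fasta_length` and `find_problem_genomes` are hypothetical helpers and the accessions are made up.

```python
import os
import tempfile


def fasta_length(path):
    """Return the sequence length of a FASTA file holding exactly one record."""
    headers = 0
    length = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                headers += 1
            elif line:
                length += len(line)
    if headers != 1:
        raise ValueError('expected exactly one record, found {}'.format(headers))
    return length


def find_problem_genomes(accessions, fasta_dir, min_length=1000):
    """Map each problematic accession to a short reason string."""
    problems = {}
    for accession in accessions:
        path = os.path.join(fasta_dir, '{}.fasta'.format(accession))
        if not os.path.exists(path):
            problems[accession] = 'missing'
            continue
        try:
            length = fasta_length(path)
        except ValueError as error:
            problems[accession] = 'unparseable: {}'.format(error)
            continue
        if length < min_length:
            problems[accession] = 'too short ({} nt)'.format(length)
    return problems


# Demo with made-up accessions in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'GOOD_1.fasta'), 'w') as out:
        out.write('>GOOD_1 demo\n' + 'ACGT' * 300 + '\n')
    with open(os.path.join(demo_dir, 'SHORT_1.fasta'), 'w') as out:
        out.write('>SHORT_1 demo\nACGT\n')
    demo_problems = find_problem_genomes(['GOOD_1', 'SHORT_1', 'ABSENT_1'], demo_dir)

print(demo_problems)
```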


# Write six-frame translations for every fasta file in the dataset

Note that this code will fail on phages with non-standard genetic codes!
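
To make the caveat concrete, here is a small, dependency-free sketch of a six-frame translation using the standard genetic code (NCBI table 1). It is not the `bacphlip` implementation, only an illustration of the technique. The optional `overrides` argument is a hypothetical way to express a non-standard code, for example `{'TGA': 'W'}` for the Mycoplasma/Spiroplasma code (NCBI table 4), in which TGA encodes tryptophan instead of a stop.

```python
from itertools import product

# Standard genetic code in NCBI order: bases T, C, A, G with the third
# position varying fastest.
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
STANDARD_TABLE = {
    ''.join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def translate(seq, overrides=None):
    """Translate complete codons of `seq`; unknown codons (e.g. with N) become 'X'."""
    table = dict(STANDARD_TABLE)
    if overrides:
        table.update(overrides)
    return ''.join(table.get(seq[i:i + 3], 'X') for i in range(0, len(seq) - 2, 3))


def six_frame_translate(seq, overrides=None):
    """Return translations of three forward frames, then three reverse-complement frames."""
    seq = seq.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:], overrides)
        for strand in (seq, reverse_complement)
        for offset in range(3)
    ]


frames = six_frame_translate('ATGGCCTAA')
print(frames[0])   # forward frame 0: ATG GCC TAA
print(frames[3])   # reverse-complement frame 0: TTA GGC CAT
print(translate('ATGTGA', overrides={'TGA': 'W'}))
```

Under the standard table every TGA is read as a stop, so a phage that uses it as a sense codon would have its proteins broken into fragments in the translated output.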

In [13]:
import sys
sys.path.append('../../bacphlip-py/bacphlip/')
import bacphlip

In [15]:
for fasta_file in glob.glob(fasta_dir+'*.fasta')[:]:
    print(fasta_file)
    bacphlip.six_frame_translate(fasta_file, fasta_file.replace('.fasta', '.fasta.6frame'),
                                force_overwrite=True)

../Data/phage_data_nmicro2017/phage_fasta_files/baxterfox.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/scottmcg.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/validus.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/cooper.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_003309.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/gordon.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_014467.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_011201.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/lucky10.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/kita.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_019505.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_009815.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_001884.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/bpbiebs31.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_001416.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_00694

../Data/phage_data_nmicro2017/phage_fasta_files/arv1.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_022754.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/optimus.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/kazan.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_003216.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/seagreen.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_007019.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/lebron.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_007734.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_015585.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/klucky39.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/phlei.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/quinnkiro.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_028905.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_004814.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/zoej.fasta
.

../Data/phage_data_nmicro2017/phage_fasta_files/NC_022343.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_007022.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_005354.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_019720.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_009236.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_010325.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/cabrinians.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/vix.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/marcell.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_002486.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/annihilator.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/odin.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/deadp.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/puhltonio.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_003356.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_003

../Data/phage_data_nmicro2017/phage_fasta_files/NC_029000.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/hamulus.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/slarp.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/packman.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/dlane.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_004348.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_011038.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_001697.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/vortex.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_023693.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_024383.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_019520.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_003444.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_004585.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_004679.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/sisi.fa

../Data/phage_data_nmicro2017/phage_fasta_files/NC_007064.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_005357.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_007021.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_019422.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_015719.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/rosebush.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_027353.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/maverick.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/kayacho.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/vista.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/lockley.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/alma.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/serendipity.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/bricole.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/phaedrus.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_02178

../Data/phage_data_nmicro2017/phage_fasta_files/NC_013693.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_009821.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_024392.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_011421.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/zonia.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_019709.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/littlecherry.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_005887.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/soilassassin.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/annal29.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_011048.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_021774.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/vivaldi.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_012223.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_005260.fasta
../Data/phage_data_nmicro2017/phage_fasta_

../Data/phage_data_nmicro2017/phage_fasta_files/NC_023561.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/pops.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/museum.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/gizmo.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_004167.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/willis.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/wile.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/loser.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_017974.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/xerxes.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_021772.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_007048.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/ovechkin.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_004456.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/captaintrips.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/batstarr.fasta
..

../Data/phage_data_nmicro2017/phage_fasta_files/NC_020083.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_008515.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/violet.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_010393.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_002167.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/blueberry.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/zapner.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/jolie2.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_001418.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_004617.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/et08.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_012749.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/dorothy.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_008201.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/kbg.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/archie.fasta
.

../Data/phage_data_nmicro2017/phage_fasta_files/pmc.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_019715.fasta
../Data/phage_data_nmicro2017/phage_fasta_files/NC_021344.fasta


**fin.**