# Notes:

This notebook is meant to be stepped through **from top to bottom, in order**. The code can be used both to compile a dataset initially and to update an existing one. The project scans protein domain annotations in order to select and download protein domain files that are hypothesized to be useful for **identifying temperate phages**. The same code can be rerun to update these selections as the source files it draws on change (most notably the `cddid.tbl` file, addressed below).

The main point here is to avoid searching every phage against more than 50,000 protein domains, most of which would be irrelevant clutter. Ideally, we can come up with a "short list" of domains via some rational means, download them, and later test them for discriminatory potential in classifying phage lifestyles, with the expectation that some will turn out to be useless for this purpose.


Note that if you change any search parameters, or step through this code with updated data, you may want to manually delete the existing `.afa` (and `.hmm`) files first. Otherwise, older `.afa`/`.hmm` files will continue to be included in your analysis.

# Imports

In [1]:
import pandas as pd
import datetime
import time
import requests ###Will be downloading files in bulk, potentially
import os

# Notebook wide constants

In [2]:
base_dir = '../Data/protein_domain_data/'
assert os.path.exists(base_dir)

full_table_file = base_dir + 'cddid.tbl' ###The backbone of the notebook
assert os.path.exists(full_table_file)
#
###Whether or not to re-download alignments that already exist. Note that although an alignment
###may already exist it may currently be larger/better in newer releases so should probably be
###redownloaded from time-to-time by setting this flag to "True"
redownload_alignments = False 
#
min_domain_length = 30
#
today = datetime.date.today()
savedate = '{}_{}_{}'.format(today.year, today.month, today.day)
#
save_file = base_dir+'cddid_selected_{}.tsv'.format(savedate)

# Read in conserved domain database datatable

**Original CDD table was downloaded on 03/30/2020 from https://ftp.ncbi.nih.gov/pub/mmdb/cdd/ and should be regularly downloaded as a part of updating/re-running this code base**

In [3]:
df = pd.read_csv(full_table_file, sep='\t', header=None, index_col=0)
print(df.shape)
###Filter tiny little families to make life easier
df = df[df[4]>min_domain_length]
print(df.shape)
df.head()

(59807, 4)
(54793, 4)


0,1,2,3,4
214330,CHL00001,rpoB,RNA polymerase beta subunit,1070
214331,CHL00002,matK,maturase K,504
176948,CHL00003,psbA,photosystem II protein D1,338
176949,CHL00004,psbD,photosystem II protein D2,353
176950,CHL00005,rps16,ribosomal protein S16,82
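
The integer column labels above are hard to read at a glance. As a sketch only, assuming the five tab-separated fields of `cddid.tbl` are the PSSM ID, accession, short name, description, and PSSM length (which is consistent with the rows shown above), the table could be loaded with named columns. The column names and the helper `read_cdd_table` are my own hypothetical labels; the rest of this notebook keeps the integer labels. The first two sample rows are copied from the output above and the third is invented to show the length filter.

```python
import io
import pandas as pd

# Hypothetical column labels; the notebook itself uses the integer labels 1-4
CDD_COLUMNS = ['pssm_id', 'accession', 'short_name', 'description', 'pssm_length']

def read_cdd_table(handle, min_length=30):
    """Read a cddid.tbl-style table and keep domains longer than min_length."""
    table = pd.read_csv(handle, sep='\t', header=None, names=CDD_COLUMNS,
                        index_col='pssm_id')
    return table[table['pssm_length'] > min_length]

sample = io.StringIO(
    '214330\tCHL00001\trpoB\tRNA polymerase beta subunit\t1070\n'
    '176950\tCHL00005\trps16\tribosomal protein S16\t82\n'
    '999999\tFAKE0001\ttiny\tmade-up short domain\t25\n'  # invented row
)
named_df = read_cdd_table(sample)
print(named_df)
```

With named columns, a later filter such as `named_df['description'].str.contains(...)` states its intent more directly than `df[3]`.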


# Start working through several conserved domain search strategies that seem *reasonable*

The only search term that immediately comes to mind but that I'm not considering is "prophage". It might also be a good idea to include more specific protein names, the way I am doing for parA/B (e.g. xerC/D).

In [4]:
###Identify the domains related to the following search terms
case_insensitive_search_terms = ['integrase', 'excisionase', 'recombinase',
                                 'transposase', 'lysogen', 'temperate']
case_sensitive_search_terms = ['parA|ParA|parB|ParB']

###Flag each domain with 1 if its description (column 3) matches the term, else 0.
###na=False counts a missing description as a non-match.
for search_term in case_insensitive_search_terms:
    df[search_term] = df[3].str.contains(search_term, case=False, na=False).astype(int)

for search_term in case_sensitive_search_terms:
    df[search_term] = df[3].str.contains(search_term, case=True, na=False).astype(int)
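
The note before this cell suggests adding more specific protein names such as xerC/D. One caveat with a plain alternation like `parA|ParA|parB|ParB` is that it also matches inside longer words. The sketch below uses only Python's `re` module and made-up descriptions to show how word boundaries tighten the match. The helper name `build_name_pattern` and the xerC/xerD entries are illustrative assumptions, not part of the search actually run in this notebook.

```python
import re

def build_name_pattern(names):
    """Build a regex matching any of the names as whole words,
    accepting either case for the first letter (e.g. parA or ParA)."""
    alternatives = []
    for name in names:
        first, rest = name[0], name[1:]
        alternatives.append('[{}{}]{}'.format(first.lower(), first.upper(),
                                              re.escape(rest)))
    # Non-capturing group keeps the pattern free of match groups
    return r'\b(?:' + '|'.join(alternatives) + r')\b'

pattern = build_name_pattern(['parA', 'parB', 'xerC', 'xerD'])

# Made-up descriptions, for illustration only
descriptions = [
    'Chromosome partitioning protein ParA',
    'Tyrosine recombinase xerC',
    'Domain found in separase family',    # no whole-word match
    'parABC-like hypothetical protein',   # 'parA' is only a prefix here
]
flags = [int(bool(re.search(pattern, text))) for text in descriptions]
print(flags)
```

The same pattern string could be passed to `str.contains(pattern, case=True, na=False)` in place of the plain alternation, at the cost of missing names that are fused to other text in a description.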

In [5]:
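###Sum the per-term indicator columns to count how many search terms each domain matched
search_columns = case_insensitive_search_terms + case_sensitive_search_terms
df['search_hits'] = df[search_columns].sum(axis=1)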
interesting_df = df[df['search_hits']>0]
print(interesting_df.shape)

(371, 12)


**Out of curiosity, view how many search terms each domain in this final set matched, since some of these descriptions are very long. Others, unfortunately for this search strategy, are very short, which is a methodological limitation of this approach.**

In [6]:
interesting_df['search_hits'].value_counts()

1    330
2     34
3      6
4      1
Name: search_hits, dtype: int64
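
To see which search terms overlap in the multi-hit domains, one option is to collapse the 0/1 indicator columns into a single label per row. This is a sketch on a toy frame with invented rows; it assumes indicator columns named after the search terms, as built earlier, and the helper name `hit_labels` is hypothetical.

```python
import pandas as pd

term_columns = ['integrase', 'recombinase', 'transposase']

# Toy stand-in for interesting_df; the rows are invented for illustration
toy = pd.DataFrame(
    {'integrase': [1, 0, 1], 'recombinase': [1, 1, 0], 'transposase': [0, 0, 0]},
    index=[101, 102, 103],
)

def hit_labels(frame, columns):
    """Return, for each row, a '+'-joined label of the terms that matched."""
    return frame[columns].apply(
        lambda row: '+'.join(col for col in columns if row[col] == 1), axis=1
    )

labels = hit_labels(toy, term_columns)
print(labels.value_counts())
```

Applied to the real table, the label counts would show whether, for example, "integrase" and "recombinase" tend to appear in the same descriptions.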

**Write a new and much smaller file containing only the selected domains**

In [7]:
interesting_df.to_csv(save_file, sep='\t', index=True, header=True)

# Download relevant alignments for a given dataframe

I haven't spent much time on error handling or on making the requests more robust, but that may come later depending on how things go.
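
As one possible step toward that error handling, the manual HTML parsing used in the download loop further down could be made more defensive. The sketch below assumes, as that loop does, that the alignment text sits inside a single `<pre>` element. The helper name `extract_alignment` is hypothetical, and the sample strings are made up rather than real server responses.

```python
import re

# Matches the contents of the first <pre>...</pre> element, across lines
PRE_BLOCK = re.compile(r'<pre>\s*(.*?)</pre>', re.DOTALL | re.IGNORECASE)

def extract_alignment(html_text):
    """Return the text inside the first <pre> block, or None if the block
    is missing or empty (for example, when an error page comes back)."""
    match = PRE_BLOCK.search(html_text)
    if match is None:
        return None
    body = match.group(1)
    if not body.strip():
        return None
    return body

# Made-up responses for illustration
good = '<!DOCTYPE html><html><body><pre>\n>seq1\nMKT--A\n</pre></body></html>'
bad = '<html><body>Something went wrong</body></html>'

print(repr(extract_alignment(good)))
print(extract_alignment(bad))
```

Returning `None` instead of raising lets the caller append the ID to `error_ids` in the same way it already does for a non-200 status code.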

In [8]:
dl_df = pd.read_csv(save_file, sep='\t', index_col=0)
print(dl_df.shape)
dl_df.head()

(371, 12)


0,1,2,3,4,integrase,excisionase,recombinase,transposase,lysogen,temperate,parA|ParA|parB|ParB,search_hits
223544,COG0468,RecA,"RecA/RadA recombinase [Replication, recombinat...",279,0,0,1,0,0,0,0,1
223655,COG0582,XerC,"Integrase [Replication, recombination and repa...",309,1,0,0,0,0,0,0,1
223747,COG0675,InsQ,"Transposase [Mobilome: prophages, transposons].",364,0,0,0,1,0,0,0,1
224392,COG1475,Spo0J,"Chromosome segregation protein Spo0J, contains...",240,0,0,0,0,0,0,1,1
224396,COG1479,COG1479,"Uncharacterized conserved protein, contains Pa...",409,0,0,0,0,0,0,1,1


In [9]:
alignment_dir = base_dir + 'domain_alignments_and_hmms/'
error_ids = []
for index in dl_df.index:
    print('####')
    print(index)
    uid = dl_df.loc[index, '1']
    print(uid)
    afa_path = alignment_dir + '{}.afa'.format(uid)

    ########################################################
    ###As noted above, this parameter prevents you from re-downloading data constantly
    ###so if a file already exists it gets skipped.
    if not redownload_alignments and os.path.exists(afa_path):
        continue
    ########################################################

    ###Easiest way that I found to navigate NCBI for this particular database, unfortunately.
    ###There may well be a better API.
    dl_link = 'https://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid={}&seqout=1&maxaln=-1'.format(uid)
    r = requests.get(dl_link)
    if r.status_code == 200:
        tempy = r.text
        ###The HTML is super simple so I am just manually parsing it
        tempy = tempy.split('<!DOCTYPE html><html><body><pre>\n')[1]
        tempy = tempy.split('</pre></body></html>')[0]
        ###Write to the same location that the existence check above inspects
        with open(afa_path, 'w') as outfile:
            outfile.write(tempy)
    else:
        error_ids.append(uid)

    ###Used to prevent overwhelming servers / getting locked out
    time.sleep(10)

####
223544
COG0468
####
223655
COG0582
####
223747
COG0675
####
224392
COG1475
####
224396
COG1479
####
224576
COG1662
####
224854
COG1943
####
224872
COG1961
####
225296
COG2452
####
225361
COG2801
####
225382
COG2826
####
225467
COG2915
####
225511
COG2963
####
225581
COG3039
####
225830
COG3293
####
225853
COG3316
####
225865
COG3328
####
225868
COG3331
####
225872
COG3335
####
225920
COG3385
####
225949
COG3415
####
225970
COG3436
####
225995
COG3464
####
226077
COG3547
####
226192
COG3666
####
226201
COG3676
####
226202
COG3677
####
226792
COG4342
####
226824
COG4389
####
226950
COG4584
####
226991
COG4644
####
227307
COG4973
####
227308
COG4974
####
227449
COG5119
####
227708
COG5421
####
227720
COG5433
####
227751
COG5464
####
227758
COG5471
####
227845
COG5558
####
227946
COG5659
####
380186
NF033179
####
222813
PHA00730
####
222853
PHA02517
####
222854
PHA02518
####
222904
PHA02601
####
177485
PHA02731
####
165252
PHA02942
####
234690
PRK00218
####
234698
PRK00236
####
234713

In [10]:
print(len(error_ids))

0


**Ensure that the only IDs in the directory are the ones that I wanted**

As noted above, if you change any search parameters, or rerun this notebook against a future release, you may want to manually delete existing `.afa` files first. Otherwise, older `.afa` files will continue to be included.

In [11]:
###Ensure that the only IDs in that list are the ones I meant to download
import glob
###Build the set of expected IDs once rather than on every iteration
expected_ids = set(dl_df['1'].str.lower())
###Look in the same directory that the download cell checks for existing files
for afa_file in glob.glob(base_dir + 'domain_alignments_and_hmms/*.afa'):
    temp_id = os.path.basename(afa_file).split('.afa')[0]
    if temp_id.lower() not in expected_ids:
        print('Found something here that I did not expect: {}'.format(afa_file))
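
If leftover files do turn up, it may help to collect them in a list rather than only printing a message. The sketch below is one way to do that with `pathlib`; the helper name `find_unexpected_alignments` is hypothetical, and the demonstration runs against a temporary directory with placeholder files rather than the real alignment directory.

```python
import tempfile
from pathlib import Path

def find_unexpected_alignments(directory, expected_ids, suffix='.afa'):
    """Return sorted paths in `directory` whose file stem (case-insensitive)
    is not among `expected_ids`."""
    expected = {str(uid).lower() for uid in expected_ids}
    return sorted(path for path in Path(directory).glob('*' + suffix)
                  if path.stem.lower() not in expected)

# Demonstration on a throwaway directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ['COG0468.afa', 'COG0582.afa', 'OLD0001.afa']:
        (Path(tmp) / name).write_text('>placeholder\nAAA\n')
    leftovers = find_unexpected_alignments(tmp, ['COG0468', 'COG0582'])
    leftover_names = [path.name for path in leftovers]

print(leftover_names)
```

The function only reports files. Whether to delete them (for instance with `path.unlink()`) is left as a manual decision, in keeping with the note above.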

**fin.**