# Protein embeddings improve phage-host interaction prediction

**Mark Edward M. Gonzales<sup>1, 2</sup>, Jennifer C. Ureta<sup>1, 2</sup> & Anish M.S. Shrestha<sup>1, 2</sup>**

<sup>1</sup> Bioinformatics Laboratory, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila, Philippines <br>
<sup>2</sup> Department of Software Technology, College of Computer Studies, De La Salle University, Manila, Philippines 

{mark_gonzales, jennifer.ureta, anish.shrestha}@dlsu.edu.ph

<hr>

## 💡 Dataset

**To download the dataset via INPHARED:**

1. Clone the latest version of INPHARED from its official [repository](http://github.com/RyanCook94/inphared), and install the necessary [dependencies](https://github.com/RyanCook94/inphared#dependencies) to run INPHARED.
2. Download the PHROG HMMs (`all_phrogs.hmm`) from [Millard Lab](http://s3.climb.ac.uk/ADM_share/all_phrogs.hmm.gz) or from [Google Drive](https://drive.google.com/file/d/1oX-oOJzIynC6XKWWAvnd57CKqEV3t3kc/view?usp=sharing), and save it in the same folder as `inphared.pl`. 
3. Download the GenomesDB directory of INPHARED from this [link](https://millardlab-inphared.s3.climb.ac.uk/GenomesDB_20201412.tar.gz), and save it inside the same folder as `inphared.pl`.
4. Run the following command to execute the INPHARED pipeline (refer to the [documentation](https://github.com/RyanCook94/inphared#usage) for details): 
   - `perl inphared.pl -e exclusion_list.txt -P all_phrogs.hmm -c 0`
   - Notes:
      - The goal of running the pipeline is to finish populating the `GenomesDB` directory. This may take some time since genome annotation is performed.
      - Running the INPHARED pipeline loads the most recent data, whereas our work uses the data from 16 September 2022. For reproducibility, we provide the following files to load that version: 
        - `16Sep2022_data_excluding_refseq.tsv` (should have already been included when the repository was cloned) <br>
        - `16Sep2022_phages_downloaded_from_genbank.gb` (can be downloaded from [Google Drive](https://drive.google.com/file/d/14LG1iGa1CqPbAjofZT1EY8VKnE8Iy45Q/view?usp=sharing)) <br>
        
**After INPHARED finishes running:**

1. Save `16Sep2022_data_excluding_refseq.tsv` and `16Sep2022_phages_downloaded_from_genbank.gb` inside the `inphared` directory located in the same folder as this notebook. The folder structure should look like this:<br> <br>
   `experiments` (parent folder of this notebook) <br> 
     ↳ `inphared` <br>
     &nbsp; &nbsp; ↳ `16Sep2022_data_excluding_refseq.tsv` <br>
     &nbsp; &nbsp; ↳ `16Sep2022_phages_downloaded_from_genbank.gb`  <br>
     ↳ `1. Sequence Preprocessing.ipynb` (this notebook) <br> <br>
        
2. Create a directory named `datasets` in the **root** of this project.
   - Transfer `GenomesDB` to `datasets/inphared/inphared`.
   - The folder structure should look like this: <br> <br>
        `phage-host-prediction` (root) <br>
        ↳ `datasets` <br>
        &nbsp; &nbsp; ↳ `inphared` <br> 
        &nbsp; &nbsp;&nbsp; &nbsp; ↳ `inphared`  <br>
        &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `GenomesDB`  <br>
        &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ `AB002632`  <br>
        &nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; ↳ ... <br>
        ↳ `experiments` <br>
        &nbsp; &nbsp; ↳ `1. Sequence Preprocessing.ipynb` (this notebook)
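
Before running the rest of the notebook, it may help to confirm that the files are where the notebook expects them. The following is a minimal sketch, not part of the original pipeline: the helper `find_missing_paths` and the constant `EXPECTED_PATHS` are hypothetical names, and the paths simply mirror the two folder structures described above.

```python
from pathlib import Path

# Paths expected by this notebook, relative to the project root.
# They mirror the folder structures described above.
EXPECTED_PATHS = [
    "datasets/inphared/inphared/GenomesDB",
    "experiments/inphared/16Sep2022_data_excluding_refseq.tsv",
    "experiments/inphared/16Sep2022_phages_downloaded_from_genbank.gb",
]


def find_missing_paths(root):
    """Return the expected paths (as strings) that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

If the notebook is run from the `experiments` folder, the project root is one level up, so `find_missing_paths("..")` should return an empty list once everything is in place.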

<hr>

## 📁 Output Files
If you would like to skip running this notebook, you may download the output FASTA files from [Google Drive](https://drive.google.com/drive/folders/16ZBXZCpC0OmldtPPIy5sEBtS4EVohorT?usp=sharing). Save the results inside the `inphared` directory located in the same folder as this notebook. The folder structure should look like this:

`experiments` (parent folder of this notebook) <br> 
↳ `inphared` <br>
&nbsp; &nbsp;↳ `fasta` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `hypothetical` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `nucleotide` <br>
&nbsp; &nbsp;&nbsp; &nbsp; ↳ `rbp` <br>
↳ `1. Sequence Preprocessing.ipynb` (this notebook) <br>

Intermediate output files (i.e., those saved in `temp/preprocessing`) should have already been included when the repository was cloned.

<hr>

# Part I: Preliminaries

Import the necessary libraries and modules.

In [1]:
import os
import shutil
import pickle

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from ConstantsUtil import ConstantsUtil
from SequenceParsingUtil import SequenceParsingUtil

%load_ext autoreload
%autoreload 2

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

pd.options.mode.chained_assignment = None

The parameter of `ConstantsUtil` corresponds to the download date of the dataset. 

In particular, when [INPHARED](https://github.com/RyanCook94/inphared) is run, among the files generated are:
- `<download_date>_data_excluding_refseq.tsv` <br>
- `<download_date>_phages_downloaded_from_genbank.gb` <br>

Here, `<download_date>` is the date on which the dataset was downloaded (e.g., `16Sep2022`).
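
As a small illustration of this naming scheme, the sketch below builds the two file names from a download date. The function `inphared_filenames` is a hypothetical helper written for this example; how `ConstantsUtil` actually constructs its paths is not shown in this notebook.

```python
def inphared_filenames(download_date):
    """Build the two INPHARED output file names for a given download date."""
    return (
        f"{download_date}_data_excluding_refseq.tsv",
        f"{download_date}_phages_downloaded_from_genbank.gb",
    )


tsv_name, gb_name = inphared_filenames("16Sep2022")
print(tsv_name)  # 16Sep2022_data_excluding_refseq.tsv
print(gb_name)   # 16Sep2022_phages_downloaded_from_genbank.gb
```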

In [3]:
constants = ConstantsUtil('16Sep2022')
util = SequenceParsingUtil(constants.DISPLAY_PROGRESS, constants.MISSPELLING_THRESHOLD, constants.MIN_LEN_KEYWORD)

In [4]:
inphared_raw = pd.read_csv(f'{constants.INPHARED_TSV}', sep='\t')
inphared_raw = inphared_raw.drop_duplicates(ignore_index = True)
inphared_raw.head()

Unnamed: 0,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values)
0,LR756511,uncultured phage,uncultured phage environmental samples Viruses,595163,True,42.11,DNA,26-MAR-2020,1080,20.648148,79.351852,93.248236,,62,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
1,LR756508,uncultured phage,uncultured phage environmental samples Viruses,484177,True,39.553,DNA,26-MAR-2020,683,98.389458,1.610542,89.636022,,31,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
2,LR756504,uncultured phage,uncultured phage environmental samples Viruses,636363,True,26.393,DNA,26-MAR-2020,920,51.956522,48.043478,92.786193,,34,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
3,LR756503,uncultured phage,uncultured phage environmental samples Viruses,735411,True,32.225,DNA,26-MAR-2020,1014,30.276134,69.723866,93.113239,,56,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
4,LR756502,uncultured phage,uncultured phage environmental samples Viruses,642428,True,31.472,DNA,26-MAR-2020,971,54.582904,45.417096,94.525145,,45,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified


In [5]:
inphared_raw.shape

(18389, 27)

<hr>

# Part II: Preprocessing of Host Information

For entries whose host is `Unspecified` in INPHARED, obtain the isolation host from GenBank, provided that the GenBank value is a sensible host name (this field is known to contain inconsistent and nonsense values).

In [6]:
util.set_inphared_gb(constants.INPHARED_GB)

In [7]:
inphared_unspec_host = inphared_raw.loc[inphared_raw['Host'] == 'Unspecified']
inphared_unspec_host.reset_index(inplace = True, drop = True)
inphared_unspec_host.head()

Unnamed: 0,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values)
0,LR756511,uncultured phage,uncultured phage environmental samples Viruses,595163,True,42.11,DNA,26-MAR-2020,1080,20.648148,79.351852,93.248236,,62,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
1,LR756508,uncultured phage,uncultured phage environmental samples Viruses,484177,True,39.553,DNA,26-MAR-2020,683,98.389458,1.610542,89.636022,,31,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
2,LR756504,uncultured phage,uncultured phage environmental samples Viruses,636363,True,26.393,DNA,26-MAR-2020,920,51.956522,48.043478,92.786193,,34,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
3,LR756503,uncultured phage,uncultured phage environmental samples Viruses,735411,True,32.225,DNA,26-MAR-2020,1014,30.276134,69.723866,93.113239,,56,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified
4,LR756502,uncultured phage,uncultured phage environmental samples Viruses,642428,True,31.472,DNA,26-MAR-2020,971,54.582904,45.417096,94.525145,,45,Unspecified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,Unclassified,ENV,Unspecified


In [8]:
phages_unspec_host = util.get_phages_unspec_host(inphared_unspec_host)

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records
Processed 26000 records
Processed 27000 records
Processed 28000 records
Processed 29000 records
Processed 30000 records
Processed 31000 records


In [9]:
unfiltered_hosts = util.get_unfiltered_hosts()
genus_typo = util.get_genus_typo(f'{constants.GENUS_TYPO}')
unfiltered_suspected_genera = util.get_unfiltered_suspected_genera(constants.CANDIDATE_REGEX)

excluded_hosts = util.get_excluded_hosts(constants.BACTERIA_NOT_GENUS, constants.EXCLUDED_HOSTS)
valid_hosts = util.get_valid_hosts()

Add another layer of checking to identify typos in GenBank annotations, as well as oversights during manual filtering:
- Recheck genera in `valid_hosts` that are not in INPHARED's `Host` column.
- Identify pairs of genera whose edit distance is &leq; 2 (likely misspellings).
<br>

Adjust the exclusion and typo correction files in the `preprocessing` directory accordingly, and rerun the previous cell, as needed.
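
The edit-distance check can be sketched as follows. This is an illustration under assumptions: it uses the standard Levenshtein distance, and the names `edit_distance` and `looks_like_misspelling` are hypothetical. The actual logic of `util.is_possible_misspelling` is not shown in this notebook and may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def looks_like_misspelling(a, b, threshold=2):
    """Flag two distinct names whose edit distance is at most `threshold`."""
    return a != b and edit_distance(a, b) <= threshold


print(edit_distance("microbacterium", "mycobacterium"))           # 2
print(looks_like_misspelling("microbacterium", "mycobacterium"))  # True
```

Microbacterium and Mycobacterium are both legitimate genera, so a flagged pair is only a candidate for review. This is why the flagged pairs are checked manually rather than merged automatically.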

In [10]:
inphared_hosts = set(inphared_raw['Host'].str.lower())
valid_hosts - inphared_hosts

{'silvanigrella'}

In [11]:
valid_hosts_list = list(valid_hosts)
for i in range(len(valid_hosts_list)):
    for j in range(i + 1, len(valid_hosts_list)):
        if util.is_possible_misspelling(valid_hosts_list[i], valid_hosts_list[j]):
            print(valid_hosts_list[i], valid_hosts_list[j])

microbacterium mycobacterium


Update the `Host` column based on the manually filtered information from GenBank.

In [12]:
inphared_augmented = inphared_raw.copy(deep = True)
util.update_host_column(constants.CANDIDATE_REGEX, inphared_unspec_host, inphared_augmented)

inphared_augmented['Host'] = inphared_augmented['Host'].str.lower()
inphared_augmented['Host'].value_counts()

Index: 424 | Name: ON970599 | Host: gordonia
Processed 1000 records
Processed 2000 records
Index: 2105 | Name: MN940411 | Host: staphylococcus
Processed 3000 records
Index: 3471 | Name: AF125163 | Host: vibrio
Index: 3477 | Name: AF063097 | Host: escherichia
Processed 4000 records
Index: 3576 | Name: OK499975 | Host: proteus
Index: 3705 | Name: OK649958 | Host: haloferax
Index: 3761 | Name: MW822148 | Host: mycobacterium
Processed 5000 records
Index: 4552 | Name: MH590600 | Host: microbacterium
Index: 4628 | Name: MH271321 | Host: microbacterium
Index: 4643 | Name: MH155873 | Host: microbacterium
Index: 5392 | Name: LC625742 | Host: enterococcus
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Index: 5662 | Name: MT588083 | Host: salmonella
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 1800

unspecified                                     2566
mycobacterium                                   2189
escherichia                                     1627
vibrio                                           852
pseudomonas                                      826
salmonella                                       744
streptococcus                                    730
klebsiella                                       628
gordonia                                         579
staphylococcus                                   508
enterobacteria                                   498
microbacterium                                   459
bacillus                                         377
lactococcus                                      364
synechococcus                                    362
arthrobacter                                     340
streptomyces                                     310
flavobacterium                                   247
acinetobacter                                 

### Polyvalent Phages

There are only a few polyvalent (multi-host) phages in our database. Hence, for simplicity, we map each polyvalent phage only to its host with the highest number of interacting phages in the dataset.

In [13]:
multiple_hosts = set()

for host, count in dict(inphared_augmented['Host'].value_counts()).items():
    if '|' in host:
        multiple_hosts.add(host)
        print(host, ':', count)

escherichia | chryseobacterium | pseudomonas : 4
phormidium | plectonema : 1


In [14]:
for host in multiple_hosts:
    # We have confirmed that the first host is the host with the highest number of interacting phages in the dataset.
    inphared_augmented.loc[inphared_augmented['Host'] == host, 'Host'] = host.split(' | ')[0]
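
The cell above relies on a manual check that the first listed host is the most frequent one. A minimal sketch of how that choice could be made programmatically is shown below. The function `pick_dominant_host` is a hypothetical helper, and the counts are toy values rather than the notebook's actual numbers.

```python
from collections import Counter


def pick_dominant_host(multi_host, host_counts, sep=" | "):
    """Return the candidate host with the most phages in host_counts."""
    candidates = [h.strip() for h in multi_host.split(sep)]
    # max() keeps the first candidate in case of a tie.
    return max(candidates, key=lambda h: host_counts.get(h, 0))


# Toy counts (hypothetical, for illustration only).
toy_hosts = ["escherichia"] * 5 + ["pseudomonas"] * 3 + ["chryseobacterium"]
host_counts = Counter(toy_hosts)

print(pick_dominant_host("escherichia | chryseobacterium | pseudomonas", host_counts))
# escherichia
```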

In [15]:
inphared = inphared_augmented.loc[inphared_augmented['Host'] != 'unspecified']
inphared.reset_index(inplace = True, drop = True)
inphared.head()

Unnamed: 0,Accession,Description,Classification,Genome Length (bp),Jumbophage,molGC (%),Molecule,Modification Date,Number CDS,Positive Strand (%),Negative Strand (%),Coding Capacity(%),Low Coding Capacity Warning,tRNAs,Host,Lowest Taxa,Genus,Sub-family,Family,Order,Class,Phylum,Kingdom,Realm,Baltimore Group,Genbank Division,Isolation Host (beware inconsistent and nonsense values)
0,MK250029,Prevotella phage Lak-C1,Prevotella phage Lak-C1 Myoviridae Caudovirice...,540217,True,25.796,DNA,13-JAN-2019,830,47.108434,52.891566,68.324951,,30,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
1,MK250028,Prevotella phage Lak-B9,Prevotella phage Lak-B9 Myoviridae Caudovirice...,550053,True,26.012,DNA,13-JAN-2019,859,52.270081,47.729919,69.188424,,29,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
2,MK250027,Prevotella phage Lak-B8,Prevotella phage Lak-B8 Myoviridae Caudovirice...,551627,True,26.022,DNA,13-JAN-2019,860,53.023256,46.976744,69.318761,,33,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
3,MK250026,Prevotella phage Lak-B7,Prevotella phage Lak-B7 Myoviridae Caudovirice...,550702,True,26.02,DNA,13-JAN-2019,859,53.201397,46.798603,69.363285,,33,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.
4,MK250025,Prevotella phage Lak-B6,Prevotella phage Lak-B6 Myoviridae Caudovirice...,546689,True,26.029,DNA,13-JAN-2019,847,52.656434,47.343566,69.118274,,30,prevotella,Myoviridae,Unclassified,Unclassified,Myoviridae,Unclassified,Caudoviricetes,Uroviricota,Heunggongvirae,Duplodnaviria,Group I,ENV,Prevotella sp.


Replace the genera of some hosts with standard NCBI Taxonomy nomenclature.

In [16]:
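# Assign the result back instead of using inplace=True on a column selection,
# which is not guaranteed to modify the DataFrame itself.
inphared['Host'] = inphared['Host'].replace(util.get_ncbi_standard_nomenclature(constants.NCBI_STANDARD_NOMENCLATURE))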
inphared['Host'].value_counts()

mycobacterium                2189
escherichia                  1632
vibrio                        852
pseudomonas                   826
salmonella                    744
streptococcus                 730
klebsiella                    628
gordonia                      579
enterobacter                  558
staphylococcus                508
microbacterium                459
bacillus                      377
lactococcus                   364
synechococcus                 362
arthrobacter                  340
streptomyces                  310
flavobacterium                247
acinetobacter                 196
enterococcus                  185
rhizobium                     154
aeromonas                     140
erwinia                       137
shigella                      128
propionibacterium             125
yersinia                      120
rheinheimera                  110
campylobacter                  95
pectobacterium                 92
xanthomonas                    91
lactobacillus 

Verify that all entries are unique (i.e., no accession number is repeated).

In [17]:
inphared.shape[0] == inphared['Accession'].nunique()

True

### Some Statistics

Statistics on phages

In [18]:
print("Original INPHARED:\t\t", inphared_raw.shape[0])
print("With Specified Hosts:\t\t", inphared_raw.shape[0] - inphared_unspec_host.shape[0])
print("Plus Manually Filtered Hosts:\t", inphared.shape[0])

Original INPHARED:		 18389
With Specified Hosts:		 15739
Plus Manually Filtered Hosts:	 15823
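
The three figures above can be cross-checked with simple arithmetic. The sketch below only reuses numbers already printed in this notebook; the variable names are chosen for illustration.

```python
total = 18389      # entries in the original INPHARED table
with_host = 15739  # entries whose Host is not 'Unspecified' in INPHARED
final = 15823      # entries kept after adding GenBank-derived hosts

unspecified = total - with_host             # entries initially lacking a host
recovered = final - with_host               # hosts recovered from GenBank
still_unspecified = unspecified - recovered # entries dropped for lack of a host

print(unspecified, recovered, still_unspecified)  # 2650 84 2566
```

The last value agrees with the `unspecified` count (2566) reported by `value_counts()` after the `Host` column was updated.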


Statistics on host genera

In [19]:
print("Num. of Unique Host Genera:\t", inphared['Host'].nunique())

Num. of Unique Host Genera:	 279


Save the entries with host information to a CSV file.

In [20]:
inphared.to_csv(os.path.join(constants.TEMP_PREPROCESSING, constants.INPHARED_WITH_HOSTS), index = False)

<hr>

# Part III: Selection of Annotated Receptor-Binding Proteins (RBPs)

Load only the phage entries with host data.

In [21]:
inphared = pd.read_csv(f'{constants.TEMP_PREPROCESSING}/{constants.INPHARED_WITH_HOSTS}')
util.set_inphared(inphared)

inphared.shape

(15823, 27)

Identify which phage entries have gene annotations and which do not.

In [22]:
no_cds_annot = util.get_no_cds_annot()

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records
Processed 26000 records
Processed 27000 records
Processed 28000 records
Processed 29000 records
Processed 30000 records
Processed 31000 records


In [23]:
with open(os.path.join(constants.TEMP_PREPROCESSING, constants.NO_CDS_ANNOT),'wb') as no_cds_annot_file:
    pickle.dump(no_cds_annot, no_cds_annot_file)

In [24]:
with open(f'{constants.TEMP_PREPROCESSING}/{constants.NO_CDS_ANNOT}','rb') as no_cds_annot_file:
    no_cds_annot = pickle.load(no_cds_annot_file)
    
util.set_no_cds_annot(no_cds_annot)

### Some statistics

In [25]:
print("Without gene annot:\t", len(no_cds_annot))
print("With gene annot:\t", inphared.shape[0] - len(no_cds_annot))

Without gene annot:	 665
With gene annot:	 15158


## *A. Process entries with gene annotation*

- Use the regex from this [RBP prediction study](https://www.mdpi.com/1999-4915/14/6/1329) by Boeckaerts <i>et al.</i> (2022) to select the annotated RBPs
   - We modify this regex to:
     - Make it more tolerant of typos (e.g., look for `bind` instead of `binding`)
     - Allow multiple spaces between tokens
   - For comparison:
     - Original regex: `tail.?(?:spike|fib(?:er|re))|^recept(?:o|e)r.?(?:binding|recognizing).*(?:protein)?|^RBP`
     - Modified regex: `tail?(.?|\s*)(?:spike?|fib(?:er|re))|recept(?:o|e)r(.?|\s*)(?:bind|recogn).*(?:protein)?|(?<!\w)RBP(?!a)`
     - Annotations captured by modified regex but not by old regex:
       - `receptor recognition protein`
   - The same [study](https://www.mdpi.com/1999-4915/14/6/1329) has an exclusion list that covers proteins that are related to RBPs but are not RBPs themselves (e.g., assembly and portal proteins).
       <br><br>

- If the product does not match the RBP regex, check whether it is a hypothetical protein
   - The same [study](https://www.mdpi.com/1999-4915/14/6/1329) has a list of keywords associated with hypothetical proteins
      - We remove `putative`, `probable`, and `probably` since they are indicative of a product with a putative (rather than an unknown) function.
   - Make the matching more tolerant of typos by also accepting words whose edit distance from a keyword is ≤ 2 (likely misspellings)
   - Create a manual list to handle cases where the keyword `hypothetical` is present but the putative function is already given, as in the case of:
       - `conserved hypothetical lipoprotein`
       - `hypothetical host-like ribonucleoside diphosphate reductase`       <br><br>
      
- Predict whether these hypothetical proteins are RBPs using the XGBoost model from the [RBP prediction study](https://www.mdpi.com/1999-4915/14/6/1329) by Boeckaerts <i>et al.</i> (2022).
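
To make the difference between the two regexes concrete, the sketch below applies both patterns, copied verbatim from the list above, to a few product annotations. It is an illustration only: the helper `is_rbp_annotation` is a hypothetical name, no regex flags are set (whether the pipeline uses any is not shown here), and the exclusion list is not applied.

```python
import re

# Patterns copied verbatim from the comparison above.
ORIGINAL = re.compile(
    r"tail.?(?:spike|fib(?:er|re))"
    r"|^recept(?:o|e)r.?(?:binding|recognizing).*(?:protein)?"
    r"|^RBP"
)
MODIFIED = re.compile(
    r"tail?(.?|\s*)(?:spike?|fib(?:er|re))"
    r"|recept(?:o|e)r(.?|\s*)(?:bind|recogn).*(?:protein)?"
    r"|(?<!\w)RBP(?!a)"
)


def is_rbp_annotation(product, pattern=MODIFIED):
    """Return True if the product annotation matches the given RBP regex."""
    return pattern.search(product) is not None


examples = [
    "tail fiber protein",            # matched by both patterns
    "tail  fiber protein",           # two spaces: modified pattern only
    "receptor recognition protein",  # modified pattern only
    "portal protein",                # matched by neither
]

for product in examples:
    print(f"{product!r}: original={is_rbp_annotation(product, ORIGINAL)}, "
          f"modified={is_rbp_annotation(product, MODIFIED)}")
```

The `receptor recognition protein` case reproduces the example listed above: the original pattern expects `binding` or `recognizing`, whereas the modified one only requires the stem `recogn`.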

In [26]:
util.set_token_delimiter(constants.TOKEN_DELIMITER)
util.construct_keyword_list(constants.HYPOTHETICAL_KEYWORDS, constants.RBP_RELATED_NOT_RBP, constants.PUTATIVE_FUNCTIONS)

In [27]:
annot_products = util.get_annot_products()

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records
Processed 26000 records
Processed 27000 records
Processed 28000 records
Processed 29000 records
Processed 30000 records
Processed 31000 records


In [28]:
with open(os.path.join(constants.TEMP_PREPROCESSING, constants.ANNOT_PRODUCTS),'wb') as annot_products_file:
    pickle.dump(annot_products, annot_products_file)

In [29]:
with open(f'{constants.TEMP_PREPROCESSING}/{constants.ANNOT_PRODUCTS}','rb') as annot_products_file:
    annot_products = pickle.load(annot_products_file)
    
util.set_annot_products(annot_products)
for annot_product in sorted(list(annot_products)):
    print(annot_product)

(2Fe-2S)-binding protein
(E)-4-hydroxy-3-methylbut-2-enyl-diphosphate synthase
(NAD(+)) DNA ligase
(R,S)-reticuline 7-O-methyltransferase-like isoform X1
(RecA-like) recombination and repair protein
(p)ppGpp synthase/hydrolase
(p)ppGpp synthetase, RelA/SpoT family
(pyro)phosphatase or 5'(3')-deoxyribonucleotidase
(thymidylate) synthase
0.2
0.3 protein
0.3B protein
0.4 protein
0.6
0.6A protein
0.6B protein
1
1, 4-beta-N-acetylmuramidase
1,4-alpha-glucan (glycogen) branching enzyme
1,4-alpha-glucan (glycogen) branching enzyme, GH-13-type
1,4-alpha-glucan (glycogen) branching enzyme, GH-13-type (EC
1,4-beta-N-acetylmuramidase
1,4-beta-N-acetylmuramidase or lysozyme
1,4-dihydroxy-6-naphthoate synthase
1,6-anhydro-N-acetylmuramyl-L-alanine amidase
1-(5-phosphoribosyl)-5-amino-4-imidazole- carboxylate carboxylase
1-(5-phosphoribosyl)-5-amino-4-imidazole- carboxylatecarboxylase
1-acyl-sn-glycerol-3-phosphate acyltransferase
1-aminocyclopropane-1-carboxylate deaminase
1-deoxy-D-xylulose-5-phos

Beta glucosyl transferase
Beta-lactamase SHV-2
BhlA
Bifunctional DNA primase/polymerase
Bifunctional DNA primase/polymerase, N-terminal
Bifunctional NMN adenylyltransferase/Nudix hydrolase
Big-1 domain-containing protein
Bis(5'-nucleosyl)-tetraphosphatase PrpE
Bla
BlaI/MecI/CopY family transcriptional regulator
BmpA
BmpB
BmpC
BofL
BofR
Bor
Bor lipoprotein
Bor protein
Bor protein precursor
BplA
BplB
BppU domain protein
BppU family baseplate upper protein
BppU family phage baseplate upper protein
BppU-like baseplate protein
BppU_N domain-containing protein
BrnA-like antitoxin
BrnT family toxin
BrnT-like antitoxin
BrnT-like toxin
Bro family antirepressor
Bro family toxin component
Bro-N domain protein
Bro-N family protein
Bro-N/ORF6C domain-containing antirepressor
BroN DNA binding protein
BspA family leucine rich repeat surface protein
BspA family leucine-rich repeat surface protein
BsuMI modification methylase subunit
BtrN protein
BtuR/CobO ATP:corrinoid adenosyltransferase
C
C (capsid 

D/B-b-hydroxylase DiO
D1
D10 protein
D108-specific protein
D11 protein
D11-like protein
D12 class N6 adenine-specific DNA methyltransferase
D12 class N6 adenine-specific DNA methyltransferase family protein
D12 class N6 adenine-specific DNAmethyltransferase
D13; putative exonuclease SbcCD, C subunit (ACLAME 108)
D14 Protein
D14 protein
D14 protein/putative Holiday junction resolvase (ACLAME 1108)
D14 protein/putative holiday junction resolvase
D14 protein/putative holiday junction resolvase (ACLAME 1108)
D14 protein/putative resolvase
D14-like protein
D152
D154
D2
D2 protein
D3 protein
D3-like protein
D383
D5 protein
D5 protein/DNA-binding protein
D5 transcription factor
D5-like protein
D6 protein
D60
DAN primase/helicase
DBP
DCM methylase
DCMP deaminase
DCTP deaminase
DCTP pyrophosphatase
DD-transpeptidase
DDE endonuclease
DDE superendonuclease family protein
DDE transposase
DDE-type integrase/transposase
DDE-type integrase/transposase/recombinase
DEAD box family helicase
DEAD box hel

DNA-directer RNA polymerase beta subunit
DNA-endonuclease-like protein
DNA-gyrase A-subunit
DNA-gyrase B-subunit
DNA-gyrase IV A-subunit
DNA-invertase
DNA-invertase hin
DNA-methylase
DNA-methyltransferase
DNA-methyltransferase subunit M
DNA-methyltransferase type II restriction modification system
DNA-packaging protein
DNA-packaging protein A
DNA-packaging protein B
DNA-packaging protein FI
DNA-packaging protein small subunit
DNA-packing protein small subunit
DNA-polymerase
DNA-polymerase catalytic subunit
DNA-primase
DNA-sulfur modification-associated
DNA-topoisomerase II B-subunit
DNA/RNA binding domain-containing protein
DNA/RNA binding protein
DNA/RNA endonuclease G
DNA/RNA helicase
DNA/RNA helicase of superfamily II
DNA/RNA helicase protein
DNA/RNA helicase superfamily II
DNA/RNA helicase-like protein
DNA/RNA non-specific endonuclease
DNA/RNA polymerase
DNA/RNA repair helicase
DNA/RNA-binding 3-helical bundle
DNA/RNA-binding protein
DNA/protein translocase
DNAB helicase
DNAB-like 

DenB DNA endonuclease IV
DenB-like DNA endonuclease IV
DenB-like endonuclease
DenB.1 hypothetical protein
DenV
DenV DNA endonuclease V
DenV Endonuclease V
DenV endonuclease
DenV endonuclease V
DenV endonuclease V N-glycosylase UV repair enzyme
DenV endonuclease V, N-glycosylase UV repair enzyme
DenV-like UV repair enzyme
DeoR family transcriptional regulator
DeoR/GlpR transcriptional regulator
Deoxyadenosine kinase (EC / Deoxyguanosine kinase (EC
Deoxyadenosine kinase DnK
Deoxyadenosine/Deoxyguanosine kinase
Deoxycytidine triphosphate deaminase
Deoxycytidine triphosphate deaminase (dUMP-forming)
Deoxycytidylate 5-hydroxymethyltransferase
Deoxycytidylate 5-hydroxymethyltransferase (EC
Deoxycytidylate deaminase
Deoxyguanosine kinase
Deoxynucleotide monophosphate kinase
Deoxynucleotide monophosphate kinase #T4-like phage gp1 #T4 GC1586
Deoxynucleotide monophosphate kinase (EC
Deoxynucleotide monophosphate kinase (EC #T4-like phage gp1 #T4 GC1586
Deoxyuridine 5'-triphosphate nucleotidohydr

Gp40
Gp40 protein
Gp41
Gp41 DNA primase-helicase subunit
Gp41 protein
Gp42
Gp42 protein
Gp42.1
Gp43
Gp43 DNA polymerase
Gp43 protein
Gp44
Gp44 clamp loader subunit DNA polymerase accessory protein
Gp44 protein
Gp44-sliding clamp holder
Gp45
Gp45 protein
Gp45 sliding clamp DNA polymerase accessory protein
Gp46
Gp46 protein
Gp46 recombination endonuclease subunit
Gp47
Gp47 protein
Gp47 recombination protein subunit
Gp48
Gp48 protein
Gp48 putative tail tube associated base plate protein
Gp49
Gp49 protein
Gp5
Gp5 baseplate hub subunit and tail lysozyme
Gp5 baseplate hub subunit and tail lysozyme (Shigella phage phiSboM-AG3)
Gp50
Gp50 protein
Gp51
Gp51 protein
Gp52 protein
Gp53
Gp53 baseplate wedge subunit
Gp53 protein
Gp54
Gp54 baseplate tail tube initiator
Gp54 protein
Gp55
Gp55 T4-like sigma factor involved in late transcription
Gp55 protein
Gp55-like protein
Gp56
Gp57
Gp58
Gp59
Gp59 T4-like loader of gp41 DNA helicase
Gp6
Gp6 baseplate wedge subunit
Gp6 protein
Gp6-like protein
Gp60
Gp6

JK_16P
JK_17P
JK_18P
JK_19P
JK_1P
JK_20P
JK_21P
JK_22P
JK_23P
JK_24P
JK_25P
JK_26P
JK_27P
JK_28P
JK_29P
JK_2P
JK_30P
JK_31P
JK_32P
JK_33P
JK_34P
JK_35P
JK_36P
JK_37P
JK_38P
JK_39P
JK_3P
JK_40P
JK_41P
JK_42P
JK_43P
JK_44P
JK_45P
JK_46P
JK_47P
JK_48P
JK_49P
JK_4P
JK_50P
JK_51P
JK_52P
JK_53P
JK_54P
JK_55P
JK_56P
JK_57P
JK_58P
JK_59P
JK_5P
JK_60P
JK_61P
JK_62P
JK_63P
JK_64P
JK_65P
JK_66P
JK_67P
JK_68P
JK_69P
JK_6P
JK_70P
JK_71P
JK_72P
JK_73P
JK_74P
JK_75P
JK_76P
JK_77P
JK_78P
JK_79P
JK_7P
JK_80P
JK_81P
JK_82P
JK_8P
JK_9P
JNK kinase domain-containing protein
K
K (tail component;199)
K protein
K1 endosialidase
K1E endosialidase adaptor protein
K1E myramoyl peptidase
K5 lyase
KAP family P-loop domain protein
KH domain protein
KID repeat family protein
KID repeat-containing family protein
KOW motif containing protein
KR domain-containing protein
KTSC domain containing protein
KTSC domain protein
KTSC domain-containing protein
KacT Acetyltransferase-type toxin
KaiC
Kelch motif family protein
Ki

N4 gp54-like protein
N4 gp55-like protein
N4 gp56-like protein
N4 gp57-like protein
N4 gp59 protein
N4 gp59-like protein
N4 gp67-like protein
N4 gp68 like protein
N4 gp68-like protein
N4 gp69 like protein
N4 gp69-like protein
N4 rIIA-like protein
N4 rIIB-like protein
N4 v-RNAP-like protein
N4 vRNAP-like protein
N4-cytosine Mtase
N4-cytosine methyltransferase
N4-gp56 family major capsid protein
N5 gp16-like protein
N5 gp53-like protein
N5 gp57-like protein
N6 adenine-specific DNA methyltransferase
N6-adenine DNA methyltransferase
N6-adenine methyltransferase
N6-adenine specific DNA methyltransferase
NA ligase 1 and tail fiber attachment catalyst
NA+/K+ ATPase
NA-directed RNA polymerase 3 subunit
NACHT, LRR and PYD domains-containing protein 12-like
NAD (FAD)-utilizing dehydrogenase
NAD dependent DNA ligase subunit A
NAD dependent DNA ligase subunit B
NAD dependent deacetylase
NAD dependent epimerase
NAD dependent epimerase/dehydratase
NAD kinase
NAD protein ADP-ribosyltransferase
NAD sy

ORF20
ORF200
ORF201
ORF202
ORF203
ORF204
ORF205
ORF206
ORF207
ORF209
ORF21
ORF211
ORF212
ORF213
ORF215
ORF216
ORF217
ORF218
ORF219
ORF22
ORF221
ORF222
ORF224
ORF225
ORF226
ORF227
ORF228
ORF229
ORF23
ORF230
ORF231
ORF232
ORF233
ORF234
ORF235
ORF236
ORF237
ORF24
ORF240
ORF241
ORF245
ORF247
ORF25
ORF252
ORF253
ORF256
ORF259
ORF26
ORF262
ORF263
ORF27
ORF277
ORF28
ORF285
ORF286
ORF29
ORF293
ORF297
ORF3
ORF30
ORF30/ORF32
ORF300
ORF309
ORF31
ORF310
ORF311
ORF312
ORF319
ORF32
ORF326
ORF33
ORF333
ORF34
ORF340
ORF35
ORF36
ORF362
ORF366
ORF37
ORF371
ORF38
ORF381
ORF388
ORF39
ORF3; putative transposase
ORF4
ORF40
ORF401
ORF404
ORF407
ORF408
ORF41
ORF42
ORF43
ORF437
ORF44
ORF445
ORF45
ORF450
ORF46
ORF47
ORF48
ORF49
ORF5
ORF50
ORF51
ORF52
ORF53
ORF54
ORF55
ORF56
ORF57
ORF58
ORF59
ORF6
ORF60
ORF61
ORF62
ORF63
ORF64
ORF65
ORF66
ORF67
ORF68
ORF69
ORF6C domain-containing protein
ORF6N domain protein
ORF6N domain-containing protein
ORF6a
ORF6b
ORF7
ORF70
ORF71
ORF72
ORF73
ORF74
ORF75
ORF76
ORF77
ORF78
OR

PHIKZ263
PHIKZ264
PHIKZ265
PHIKZ266
PHIKZ267
PHIKZ267.1
PHIKZ268
PHIKZ269
PHIKZ270
PHIKZ271
PHIKZ272
PHIKZ273
PHIKZ274
PHIKZ275
PHIKZ276
PHIKZ277
PHIKZ278
PHIKZ279
PHIKZ280
PHIKZ281
PHIKZ282
PHIKZ283
PHIKZ283.1
PHIKZ284
PHIKZ285
PHIKZ286
PHIKZ286.1
PHIKZ287
PHIKZ287.1
PHIKZ288
PHIKZ289
PHIKZ290
PHIKZ290.1
PHIKZ291
PHIKZ292
PHIKZ293
PHIKZ293.1
PHIKZ293.2
PHIKZ294
PHIKZ294.1
PHIKZ294.2
PHIKZ295
PHIKZ295.1
PHIKZ296
PHIKZ297
PHIKZ298
PHIKZ299
PHIKZ299.1
PHIKZ300
PHIKZ301
PHIKZ302
PHIKZ303
PHIKZ304
PHIKZ305
PHIKZ306
PHP domain-containing protein
PIG-L family deacetylase
PIN domain protein
PIN domain-containing protein
PIN domain-like protein
PIN-like domain superfamily protein
PKD domain
PKD domain containing protein
PKD domain protein
PKD domain-containing protein
PLA-2 like domain protein
PLA-2 like protein
PLA2-like domain protein
PLP-dependent aminotransferase family protein
PLxRFG domain-containing protein
PLxRFG protein
PNK protein
POLAc domain-containing protein
PPE family protein
PP

Phi92_gp137
Phi92_gp138
Phi92_gp139
Phi92_gp139 [Enterobacteria phage phi92]
Phi92_gp140
Phi92_gp141
Phi92_gp142
Phi92_gp143
Phi92_gp144
Phi92_gp144 [Enterobacteria phage phi92]
Phi92_gp145
Phi92_gp146
Phi92_gp147
Phi92_gp148
Phi92_gp149
Phi92_gp150
Phi92_gp151
Phi92_gp152
Phi92_gp153
Phi92_gp154
Phi92_gp154 [Enterobacteria phage phi92]
Phi92_gp155
Phi92_gp156
Phi92_gp157
Phi92_gp157 [Enterobacteria phage phi92]
Phi92_gp158
Phi92_gp159
Phi92_gp160
Phi92_gp161
Phi92_gp162
Phi92_gp162 [Enterobacteria phage phi92]
Phi92_gp163
Phi92_gp164
Phi92_gp165
Phi92_gp165 [Enterobacteria phage phi92]
Phi92_gp166
Phi92_gp167
Phi92_gp167 [Enterobacteria phage phi92]
Phi92_gp168
Phi92_gp169
Phi92_gp169 [Enterobacteria phage phi92]
Phi92_gp170
Phi92_gp171
Phi92_gp172
Phi92_gp173
Phi92_gp174
Phi92_gp175
Phi92_gp175 [Enterobacteria phage phi92]
Phi92_gp176
Phi92_gp177
Phi92_gp178
Phi92_gp179
Phi92_gp179 [Enterobacteria phage phi92]
Phi92_gp180
Phi92_gp181
Phi92_gp181 [Enterobacteria phage phi92]
Phi92_gp1

RF-1 peptide chain release factor
RGL3
RHA family transcriptional regulator
RHS repeat-associated core domain protein
RHS repeat-associated core domain-containing protein
RI lysis inhibition regulator
RI membrane protein
RI membrane protein antiholin
RI membrane protein/RI lysis inhibition protein
RI.-1
RIB-like protein
RIIA
RIIA lysis inhibitor
RIIA lysis inhibitor protein
RIIA membrane-associated protein
RIIA phage protein
RIIA protector from prophage induced early lysis
RIIA protector from prophage-induced early lysis
RIIA protein
RIIA protein [Escherichia phage wV7]
RIIA-RIIB membrane associated protein/rIIA lysis inhibitor
RIIA-RIIB membrane-associated
RIIA-RIIB membrane-associated protein
RIIA-like protein
RIIA.1
RIIB
RIIB Protector from prophage-induced early lysis
RIIB domain protein
RIIB early lysis inhibitor
RIIB lysis inhibitor
RIIB phage protein
RIIB protector from prophage induced early lysis
RIIB protector from prophage-induced early lysis
RIIB protein
RIIB protein [Esche

SAM hydrolase / restriction inhibitor
SAM methytransferase
SAM-depedendent methyltransferase
SAM-dependent DNA methyltransferase
SAM-dependent methyl transferase
SAM-dependent methyltransferase
SAM-dependent methyltransferase related to tRNA (uracil-5-)-methyltransferase
SAP DNA binding domain protein
SAP domain protein
SAR domain lysozyme
SAR endolysin
SAR endolysin N-acetylmuramidase
SAR endolysin glycoside hydrolase
SAR endolysin glycosyl hydrolase
SAR endolysin transglycosylase
SAR-endolysin
SASA family carbohydrate esterase
SCIN
SCO family protein
SDR family NAD(P)-dependent oxidoreductase
SDR family oxidoreductase
SDR family oxidoreductase/putative virion structural protein
SEC-C motif protein
SEFIR domain protein
SEL1-like repeat protein
SET domain protein
SET domain-containing protein
SF4 helicase domain-containing protein
SFP
SFPH_like superfamily domain-containing protein, putative membrane protein
SGNH esterase
SGNH hydrolase
SGNH hydrolase domain containing tail fiber prote

T7-like phage ssDNA-binding protein
T7-like primase/helicase
T7-like ssDNA binding protein
T7-like tail fiber
T7-like tail tubular protein A
T7-like tail tubular protein B
T7-like tubular tail B family-like protein
T9SS C-terminal target domain-containing protein
TA system inhibitor protein
TAC 4 superfamily protein
TAR DNA-binding protein
TATA-box-binding protein
TAXI family TRAP transporter solute receptor
TC3 transposase
TCP-1/cpn60 chaperonin family protein
TDP-4-keto-6-deoxy-D-glucose transaminase
TETR-family transcription regulator
TF1
TGT domain-containing protein
TIGR02218 family protein
TIGR02594 family protein
TIR domain-containing protein
TIR protein
TIR-like domain-containing protein
TM helix containing protein
TM helix-containing protein
TM2 domain containing protein
TM2 domain protein
TM2 domain-containing protein
TMP
TMP chaperone
TMP domain-containing protein
TMP kinase
TMP protein
TMP repeat containing protein
TMP repeat family protein
TMP repeat protein
TMP repeat-con

VHS1008 protein
VHS1009 protein
VHS1010 protein
VHS1011 protein
VHS1012 protein
VHS1013 protein
VHS1014 protein
VHS1015 protein
VHS1016 protein
VHS1017 protein
VHS1018 protein
VHS1019 protein
VHS1020 protein
VHS1021 protein
VHS1022 protein
VHS1023 protein
VHS1024 protein
VHS1025 protein
VHS1026 protein
VHS1027 protein
VHS1028 protein
VHS1029 protein
VHS1030 protein
VHS1031 protein
VHS1032 protein
VHS1033 protein
VHS1034 protein
VHS1035 protein
VHS1036 protein
VHS1037 protein
VHS1038 protein
VHS1039 protein
VHS1040 protein
VHS1041 protein
VHS1042 protein
VHS1043 protein
VHS1044 protein
VHS1045 protein
VHS1046 protein
VHS1047 protein
VHS1048 protein
VHS1049 protein
VHS1050 protein
VHS1051 protein
VHS1052 protein
VHS1053 protein
VHS1054 protein
VHS1055 protein
VHS1056 protein
VHS1057 protein
VHS1058 protein
VHS1059 protein
VHS1060 protein
VHS1061 protein
VHS1062 protein
VHS1063 protein
VHS1064 protein
VHS1065 protein
VHS1066 protein
VHS1067 protein
VHS1068 protein
VHS1069 protein
VHS1070 

Zinc-binding domain containing protein
Zinc-binding domain of primase-helicase
Zinc-finger domain protein DNA primase
ZipA protein
Zn carboxypeptidase
Zn finger
Zn finger domain containing protein
Zn finger protein
Zn finger protein of DnaJ family
Zn peptidase
Zn ribbon
Zn-binding Pro-Ala-Ala-Arg (PAAR) domain-containing protein
Zn-binding domain containing protein
Zn-dependent hydrolase
Zn-dependent protease
Zn-finger DNA binding domain protein
Zn-finger DNA binding protein
Zn-finger domain protein
Zn-finger protein
Zn-finger protein fused to HTH domain
Zn-peptidase
Zn-ribbon containing protein
Zn-ribbon domain containing protein
Zn2+ binding domain protein
Zn2+ binding protein
Zot
Zot protein
Zot-like protein
Zot-like putative assembly protein
Zwf
[Enterobacteria phage IME08.]
a-gt alpha glucosyl transferase
a-gt alpha glucosyl transferase [Enterobacteria phage T4]
a-gt.2 conserved hypothetical protein
a-gt.2 hypothetical protein
a-gt.3 conserved hypothetical protein
a-gt.3 hypotheti

bacterial RNA polymerase
bacterial RNA polymerase inhibidor
bacterial RNA polymerase inhibitor
bacterial RNAP inhibitor
bacterial SH3 domain protein
bacterial conjugation repressor protein
bacterial helix-turn-helix protein
bacterial nucleoid DNA-binding protein
bacterial pH domain protein
bacterial regulatory protein, luxR family
bacterial seryl-tRNA synthetase related protein
bacterial surface protein
bacterial surface protein containing Ig-like domain
bacterial toxin
bacterial toxin 44
bacterial transferase hexapeptide repeat domain protein
bacterial tryptophan halogenase
bacterial type single-stranded DNA-binding protein
bacteriocin
bacteriocin UviB
bacteriocin UviB precursor
bacteriocin biosynthesis protein
bacteriocin-like protein
bacterioferritin-associated ferredoxin
bacteriolytic protein
bacteriophage 186 Tum95.5
bacteriophage Mu I protein gp32
bacteriophage SPbeta N-acetylmuramoyl-L-alanine amidase
bacteriophage baseplate assembly protein J
bacteriophage control infection pro

cobalt-zinc-cadmium resistance protein
coenzyme F420 hydrogenase domain protein
coenzyme PQQ synthesis protein D
cohesin domain-containing protein
coil containing protein
coil containing protein [Vibrio phage 1.264.O._10N.286.51.F2]
coiled coil segment-containing protein
coiled stalk of trimeric autotransporter adhesin family protein
coiled-coil and C2 domain-containing protein 1-like isoform X2
coiled-coil domain-containing protein
coiled-coil domain-containing protein 90B
coiled-coil domain-containing protein 90b
coiled-coil structural protein
colanic acid biosynthesis glycosyltransferase
colanic acid biosynthesis protein
colanic acid biosynthesis protein wcaM
colanic acid degrading protein
colanic acid-degrading protein
colanidase tailspike
colanidase tailspike [Enterobacteria phage ECGD1]
cold shock CspD protein
cold shock protein
cold shock-like protein
cold-shock DNA-binding domain protein
cold-shock protein
colicin-like ion channel
collagen alpha-2(i) chain
collagen domain prote

endoribonuclease toxin
endoribonuclease translational repressor
endoribonuclease translational repressor of early genes
endoribonulcease
endoribonulcease translational repressor of early genes
endoribonulcease translational repressor of early genes RegA
endoribonulcease translational repressor of early genes regA
endosialidase
endosialidase chaperone
endosialidase tailspike
endosialidase tailspike protein
endouclease subunit
endoysin
energy-coupling factor transporter ATPase domain containing protein
energy-coupling factor transporter ATPase domaincontaining protein
enolase
enolase-like protein
enoyl-ACP reductase
enoyl-CoA hydratase
enoyl-CoA hydratase/carnithine racemase-like protein
enoyl-CoA hydratase/isomerase family protein
enoyl-[acyl-carrier-protein] reductase [NADH]
enterobacterial exodeoxyribonuclease VIII
enterohemolysin
enterohemolysin 1
enterotoxin
enterotoxin type A precursor
enterotoxin type A/P
enterotoxin/peptidoglycan binding protein
enterotoxin_gp300
envelope glycopr

gluconate kinase
gluconolaconase
gluconolactonase family protein
glucosamine N-acyltransferase
glucosamine-fructose-6-phosphate aminotransferase
glucosaminidase
glucosaminidase /N-acetylmuramoyl-L-alanine amidase
glucosaminidase domain-containing protein
glucosaminyl deacetylase
glucose 6 phosphate dehydrogenase
glucose 6-phosphate dehydrogenase
glucose ABC transport system, periplasmic sugar-binding protein
glucose-1-phosphate adenylyltransferase related protein
glucose-1-phosphate thymidylyltransferase
glucose-1-phosphate thymidylyltransferase[Escherichia phage vB_vPM_PD114]
glucose-6-phosphate 1-dehydrogenase
glucose-6-phosphate dehydrogenase
glucose-6-phosphate isomerase
glucoside hydrolase
glucosyl transferase
glucosyl transferase [Shigella phage SfPhi01]
glucosyltransferase
glucosyltransferase domain-containing protein
glutamate 5-kinase
glutamate dehydrogenase
glutamate synthase [NADPH] large chain
glutamate--cysteine ligase
glutamate--tRNA ligase
glutamine amidotransferase
glut

gp30.2 hypothetical protein
gp30.3
gp30.3 conserved hypothetical protein
gp30.3 hypothetical protein
gp30.3 protein
gp30.3' hypothetical protein
gp30.4 conserved hypothetical protein
gp30.4 hypothetical protein
gp30.5 conserved hypothetical protein
gp30.5 hypothetical protein
gp30.5 protein
gp30.6 conserved hypothetical protein
gp30.6 hypothetical protein
gp30.7 conserved hypothetical protein
gp30.7 hypothetical protein
gp30.8 conserved hypothetical protein
gp30.8 hypothetical protein
gp30.9 conserved hypothetical protein
gp30.9 hypothetical protein
gp300
gp301
gp302
gp303
gp304
gp305
gp306
gp307
gp308
gp309
gp31
gp31 co-chaperonin for GroEL
gp31 head assembly co-chaperonin for GroEL
gp31 head assembly co-chaperonin for GroEL [Enterobacteria phage RB14]
gp31 head assembly cochaperone with GroEL
gp31 protein
gp31, bacteriophage-acquired protein
gp31, non-glycosylated membrane-associated protein
gp31, phage tail protein E
gp31.1
gp31.1 conserved hypothetical protein
gp31.1 conserverd hyp

gp9 base plate wedge completion tail fiber socket protein
gp9 base plate wedge component
gp9 baseplate tail fiber connector
gp9 baseplate wedge completion tail fiber socket
gp9 baseplate wedge completion tail fiber socket protein
gp9 baseplate wedge subunit
gp9 baseplate wedge tail fiber connector
gp9 protein
gp9, Cpp15
gp9, bacteriophage protein
gp9, phage head-tail adaptor, putative
gp9.1
gp9.5 protein
gp90
gp91
gp92
gp93
gp94
gp95
gp96
gp97
gp98
gp99
gp9plus10 baseplate wedge tail fiber connector and baseplate wedge subunit and tail pin
gpA
gpA*
gpB
gpC
gpD
gpE
gpE+E'
gpE; Major coat protein
gpF
gpF-like protein
gpFI
gpFII
gpG
gpG-T
gpH
gpH domain protein
gpI
gpJ
gpK
gpL
gpM
gpN
gpO
gpORF005
gpORF006
gpORF008
gpORF009
gpORF010
gpORF011
gpORF013
gpORF014
gpORF015
gpORF016
gpORF017
gpORF018
gpORF019
gpORF020
gpORF020_phage A5W
gpORF021
gpORF022
gpORF023
gpORF024
gpORF025
gpORF028
gpORF029
gpORF030
gpORF031
gpORF034
gpORF035
gpORF036
gpORF038
gpORF039
gpORF041
gpORF044
gpORF046
gpORF04

hypothetical protein AVU03_gp29
hypothetical protein Aeh1ORF298c-like protein
hypothetical protein Aeh1ORF302c-like protein
hypothetical protein B
hypothetical protein CPT_Marfa_271[Klebsiellaphage Marfa]
hypothetical protein CPT_phageK_gp001
hypothetical protein CPT_phageK_gp022
hypothetical protein CPT_phageK_gp057
hypothetical protein CPT_phageK_gp058
hypothetical protein CPT_phageK_gp063
hypothetical protein CPT_phageK_gp079
hypothetical protein CPT_phageK_gp174
hypothetical protein CR3_gp151 [Cronobacter phageCR3]
hypothetical protein D
hypothetical protein D5505_00079[Escherichiaphage D5505]
hypothetical protein D5505_00080[Escherichiaphage D5505]
hypothetical protein DDB_G0286989
hypothetical protein DexA.2
hypothetical protein ECGD1_004 [Enterobacteriaphage ECGD1]
hypothetical protein ECGD1_021 [Enterobacteriaphage ECGD1]
hypothetical protein ECGD1_038 [Enterobacteriaphage ECGD1]
hypothetical protein ECGD1_044 [Enterobacteriaphage ECGD1]
hypothetical protein ECGD1_057 [Enteroba

iron-uptake factor
iron/magnesium ABC transporter periplasmic binding protein
iron/manganese ABC transporter substrate-binding protein SitA
iron/manganese ABC transporter substrate-binding protein SitB
iron/manganese ABC transporter substrate-binding protein SitC
iron/manganese ABC transporter substrate-binding protein SitD
isocitrate dehydrogenase [NADP]
isoleucyl-tRNA synthetase
isoniazid inducible gene protein
isopentenyl-diphosphate delta-isomerase
isovaleryl-CoA dehydrogenase
istB-like ATP binding family protein
junction endodeoxyribonuclease
kappa-carrageenase
kappa-carrageenase precursor
kelch repeat protein
kelch repeat-containing protein
kelch-like protein
kelch-motif containing protein
ketol-acid reductoisomerase (NADP(+))
ketoreductase or glucose dehydrogenase
kil
kil protein
kil(host-killing;54)
kilA anti-repressor protein
kilA protein
kilA-N domain protein
kilR
kinase
kinase GTPase
kinase domain protein
kinase inhibitor
kinase inhibitor-like protein, UPF0098 family
kinase 

minor tail protein G
minor tail protein H
minor tail protein K
minor tail protein K-like protein
minor tail protein L
minor tail protein L.
minor tail protein M
minor tail protein M.
minor tail protein T
minor tail protein U
minor tail protein V
minor tail protein Z
minor tail protein Z-like protein
minor tail protein gp12
minor tail protein gp14
minor tail protein gp16
minor tail protein gp24-like protein
minor tail protein gp26-like
minor tail protein gp26-like protein
minor tail protein l
minor tail protein precursor H
minor tail protein/D-ala-D-ala carboxypeptidase
minor tail structural protein
minor tail structural protein L
minor tail subunit
minor virion protein
minor virion protein VP1
minor virion structural protein
minor_capsid_2 domain protein
minor_capsid_3 domain protein
mismatch repair ATPase
mismatch repair protein
mitochondrial carrier-like protein 2
mitochondrial chaperone
mitochondrial chaperone BCS1
mitochondrial ribosomal protein subunit L20-like protein
mitogen-act

p39
p4
p40
p41
p42
p42.1
p43
p44
p45
p46
p47
p48
p49
p50
p51
p52
p53
p54
p55
p55.1
p56
p57
p57.1
p58
p59
p6
p60
p61
p62
p63
p64
p65
p66
p67
p68
p69
p7
p70
p71
p72
p73
p74
p75
p76
p77
p78
p79
p8
p80
p81
p82
p83
p84
p85
p9
pANL56
pIII
pIII protein
pIII-CTX
pXl
pXu
paar protein
packaged DNA stabilization family protein
packaged DNA stabilization protein
packaging ATPase
packaging and recombination endonuclease
packaging and recombination endonuclease VII
packaging and recombination endonucleaseVII
packaging protein
packaging protein 1
packaging protein 3
packaging terminase large subunit gpA
palmitoyltransferase
pantetheine-phosphate adenylyltransferase
panti-restriction nuclease
panton-valentine leukocidin F precursor
panton-valentine leukocidin S precursor
panton-valentine leukocidin chain F precursor
panton-valentine leukocidin chain S precursor
papain family cysteine protease
papain-like cysteine peptidase
papain-like cysteine protease
parA domain protein
parA protein
parB-like nuclea

phage-encoded membrane protein
phage-encoded peptidoglycan binding protein
phage-like element PBSX protein
phage-like protein
phage-related DNA helicase
phage-related DNA-binding protein
phage-related Mu protein F-like protein
phage-related adenine-specific DNA methyltransferase
phage-related amidase
phage-related antirepressor
phage-related baseplate assembly
phage-related baseplate assembly protein gp45
phage-related capsid packaging protein
phage-related conserved hypothetical protein
phage-related exonuclease
phage-related holin
phage-related hydrogenase
phage-related integrase
phage-related lysis protein
phage-related major structural protein
phage-related minor tail protein
phage-related protein
phage-related protein HI1409
phage-related protein, ribonucleoside-diphosphate reductase
phage-related replication protein
phage-related tail protein
phage-related terminase small subunit-like protein
phage-specific RNA polymerase
phage-type endonuclease
phage/conjugal plasmid C-4 type zi

protease inhibitor
protease inhibitor cIII
protease or head maturation protease
protease protein
protease regulator
protease subunit
protease subunit of ATP-dependent Clp protease
protease(I) and scaffold(Z) protein
protease, ATP dependent, HslV-like
protease-like protein
protease-scaffold-major head protein
protease/major capsid protein
protease/scaffold
protease/scaffold protein
protease/scaffold protein gp4
proteasome subunit
proteasome subunit alpha type-2
proteasome subunit alpha/beta
proteasome subunit beta
proteasome subunit-like protein
proteasome-associated ATPase
proteasome-like hydrolase
proteasome-like protein
protector from phage-induced early lysis
protector from prophage induced early lysis
protector from prophage-induced early lysis
protector from prophage-induced early lysis rIIA
protector from prophage-induced early lysis rIIA-like protein
protector from prophage-induced early lysis rIIB
protector from prophage-induced early lysis rIIB-like protein
protein
protein 0.3

putative AgrD
putative Alc inhibitor of host transcription
putative Alc inhibitor of host transcription [Shigella phage Shfl2]
putative AlpA
putative AlpA family regulatory protein
putative AlpA family transcriptional regulator
putative AntA domain-containing protein
putative AntA/AntB antirepressor
putative Appr-1-P processing domain-containing protein
putative Appr-1-p processing domain protein
putative Appr-1-p processing domain-containing protein
putative Appr-1-p processing enzyme
putative Appr-1-p processing enzyme family protein
putative Appr-1-p processing protein
putative AraC family trancriptional regulator
putative AraC family transcriptional regulator
putative Arc family DNA-binding protein
putative Arc protein
putative Arc-like DNA binding domain
putative Arc-like DNA binding domain protein
putative ArdC-like antirestriction protein
putative Arm DNA-binding domain protein
putative Arn.3
putative ArpR DNA binding protein
putative ArpU family transcriptional activator
putati

putative GtrA family protein
putative H-N-H endonuclease
putative H-N-H endonuclease [Escherichia phage vB_EcoM_FFH2]
putative H-N-H-endonuclease
putative H-N-H-endonuclease P-TflIX
putative H-N-H-endonuclease P-TflVII
putative H-N-H-endonuclease P-TflVIII
putative H-N-H-endonuclease P-TflX
putative H-T-H transcriptional regulator
putative HAD domain-containing protein
putative HAD hydrolase
putative HAD superfamily polynucleotide kinase
putative HAD-like protein
putative HAD-like superfamily domain containing protein
putative HAD-like superfamily protein
putative HAMP domain-containing protein
putative HD domain protein
putative HD domain-containing protein
putative HD domain-like protein
putative HD phosphohydrolase
putative HD superfamily hydrolase
putative HD-domain protein
putative HD-domain/PDEase-like protein
putative HD/PDEase-like protein
putative HIRAN domain-containing protein
putative HK97 family phage prohead protease
putative HKD family nuclease
putative HMH homimg endonu

putative acetyl transferase
putative acetyl-CoA acetyltransferase
putative acetylesterase protein
putative acetylmuramoyl-L-alanine amidase
putative acetyltransferase
putative acetyltransferase family protein
putative acetyltransferase-like protein
putative acetyltransferase-related protein
putative acridine resistance protein
putative acriflavin resistance protein
putative actin-like protein
putative activating signal cointegrator protein
putative activator of host endonuclease
putative activator of late transcription
putative activator of middle period
putative activator of middle period transcription
putative activator of middle period transcription [Shigella phage Shfl2]
putative activator of middle transcription
putative activator of tail terminator
putative acyl CoA N-acyltransferase
putative acyl carrier protein
putative acyl-CoA N-acyltransferase
putative acyl-CoA N-acyltransferase domain-containing protein
putative acylphosphatase
putative acyltransferase
putative acyltransfer

putative class I ribonucleotide reductase (RNR2 subunit)
putative class I ribonucleotide reductase alpha subunit
putative class I ribonucleotide reductase beta subunit
putative class II holin
putative class II holin-like protein
putative class III anaerobic ribunucleotide reductase
putative class III ribonucleotide reductase
putative class Ib ribonucleoside-diphosphate reductase assembly flavoprotein NrdI
putative class lb ribonucleoside-diphosphate reductase assembly flavoprotein Nrdl
putative closticin
putative clp-protease
putative co-chaperone GroES
putative co-chaperonin GroES
putative co-chaperonin for GroEL
putative coagulation factor 5/8 domain-containing protein
putative coagulation factor 5/8 type domain protein
putative coat protein
putative cobalamin adenosyltransferase
putative cobalamin biosynthesis protein
putative cobalamin biosynthesis protein CobS
putative cobalamin biosynthesis protein CobT
putative cobalt chelatase subunit
putative cobalt chelatase subunit CobS
puta

putative hinge long tail fiber protein proximal connector
putative hinge long tail fiber proximal connector
putative histidine kinase
putative histidine kinase-like ATPase
putative histone deacetylase protein
putative histone family DNA-binding protein
putative histone family protein DNA-binding protein
putative histone like protein
putative histone protein
putative histone-like DNA-binding protein
putative histone-like protein
putative histone-lysine N-methyltransferase
putative hmC-arabinosyltransferase
putative hnh endonuclease
putative hoc protein
putative hoin
putative hol protein
putative holiday junction resolvase
putative holin
putative holin 1
putative holin 2
putative holin 8
putative holin [Shigella phage Shfl2]
putative holin class II
putative holin family protein
putative holin lysin mediator
putative holin lysis mediator
putative holin lysis protein
putative holin or anitholin
putative holin or anti-holin
putative holin or antiholin
putative holin protein
putative holin, 

putative phage coat protein
putative phage collar protein
putative phage collar protein (head-tail connector)
putative phage control protein D
putative phage core tail protein
putative phage encoded transcriptional regulator ArpU family
putative phage encoded transcriptional regulator, ArpU family
putative phage endonuclease
putative phage endopeptidase
putative phage essential recombination function protein
putative phage excisionase protein
putative phage gene
putative phage glutaredoxin
putative phage gp6-like head-tail connector protein
putative phage head assembly protein
putative phage head fiber protein
putative phage head morphogenesis protein
putative phage head portal protein
putative phage head protein
putative phage head tail adapter
putative phage head-binding domain-containing protein
putative phage head-tail adaptor
putative phage head-tail connector protein
putative phage head-tail joining protein
putative phage helicase
putative phage holin
putative phage holin protein

putative replication factor C small subunit
putative replication factor C small subunit / DNA polymerase clamp loader subunit
putative replication function protein
putative replication gene B protein
putative replication helicase
putative replication initiation factor
putative replication initiation protein
putative replication initiation protein P12
putative replication initiator
putative replication initiator protein
putative replication initiator protein A
putative replication origin binding protein
putative replication origin-binding protein
putative replication protein
putative replication protein 18
putative replication protein DnaC
putative replication protein O
putative replication protein P
putative replication protein RepA
putative replication protein RepB
putative replication protein large subunit
putative replication protein p
putative replication termination protein
putative replication terminator protein
putative replication-associated protein
putative replicative DNA hel

putative transmembrane region domain containing protein
putative transpeptidase family protein
putative transport protein
putative transporter
putative transporter, ABC binding casette-like
putative transposase
putative transposase A
putative transposase A subunit
putative transposase B
putative transposase B subunit
putative transposase IS200-family protein
putative transposase ISCaje3 family
putative transposase ISCaje4 family
putative transposase OrfA protein of IS629
putative transposase OrfB protein of IS629
putative transposase domain-containing protein
putative transposase fusion protein
putative transposase like protein
putative transposase, IS607 family
putative transposase, ISCaje3 family
putative transposase, Tn7_Tnp_TnsA_N superfamily
putative transposase-like protein
putative transposase-like protein [Escherichiaphage vB_EcoM_PHB05]
putative transposon-related DNA-binding protein
putative transposon-related dna-binding protein
putative trimeric spike protein
putative tripa

ribonucleoside-diosate reductase subunit beta
ribonucleoside-diphosphatase reductase small subunit
ribonucleoside-diphosphate alpha subunit
ribonucleoside-diphosphate beta subunit
ribonucleoside-diphosphate reductase
ribonucleoside-diphosphate reductase (alpha)-like protein
ribonucleoside-diphosphate reductase 1 alpha chain
ribonucleoside-diphosphate reductase 1 alpha subunit
ribonucleoside-diphosphate reductase 1 subunit
ribonucleoside-diphosphate reductase 1 subunit alpha
ribonucleoside-diphosphate reductase 1 subunit beta
ribonucleoside-diphosphate reductase 1subunitalpha
ribonucleoside-diphosphate reductase 2 subunit alpha
ribonucleoside-diphosphate reductase 2 subunit beta
ribonucleoside-diphosphate reductase I subunit alpha
ribonucleoside-diphosphate reductase I subunit beta
ribonucleoside-diphosphate reductase R2
ribonucleoside-diphosphate reductase R2/beta subunit
ribonucleoside-diphosphate reductase alpha
ribonucleoside-diphosphate reductase alpha (large) subunit
ribonucleosid

sulfotransferase-like protein
sulfurtransferase
super infection exclusion protein
super infection immunity protein
super-infection exclusion protein
super-infection exclustion protein
superantigen A
superantigen-encoding pathogenicity island protein
superfamily I DNA and RNA helicase
superfamily I DNA or RNA helicase
superfamily I DNA/RNA helicase
superfamily II DNA or RNA helicase
superfamily II DNA/RNA helicase
superfamily II DNA/RNA helicase [Enterobacteria phage 9g]
superfamily II DNA/RNA helicase, SNF2 family
superfamily II helicase
superfamily II helicase restriction enzyme
superfamily II helicase/restriction enzyme
superfamily member TIGR01575
superfamily protein
superinfection exclusion
superinfection exclusion lipoprotein
superinfection exclusion protein
superinfection exclusion protein A
superinfection exclusion protein B
superinfection exclusion protein Cor-like protein
superinfection exclusion protein[Escherichiaphage PMBT57]
superinfection immunity protein
superoxide dismu

topoisomerase IA
topoisomerase IB
topoisomerase II
topoisomerase II domain-containing protein
topoisomerase II large subunit
topoisomerase II large subunit N-terminal region
topoisomerase II medium subunit
topoisomerase II medium subunit [Escherichia phage e11/2]
topoisomerase II medium subunit [Escherichia phage wV7]
topoisomerase II small subunit
topoisomerase II subunit
topoisomerase II, large subunit
topoisomerase II, large subunit, N-terminal region
topoisomerase II, medium subunit
topoisomerase IIM protein
topoisomerase IV subunit
topoisomerase IV subunit A
topoisomerase IV subunit B
topoisomerase IV subunit B protein
topoisomerase medium subunit
topoisomerase primase
topoisomerase subunit
topoisomerase-primase
topoisomerase-primase domain containing protein
topoisomerase-primase domain protein
topoisomerase-primase domain-containing protein
toprim domain containing protein
toprim domain protein
toprim domain-containing protein
toprim protein
toprim-domain containing protein
topr

In [30]:
rbp_products, hypothetical_proteins = util.get_rbp_hypothetical_proteins(constants.RBP_REGEX)

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records
Processed 26000 records
Processed 27000 records
Processed 28000 records
Processed 29000 records
Processed 30000 records
Processed 31000 records
Processed 32000 records
Processed 33000 records
Processed 34000 records
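
The implementation of `util.get_rbp_hypothetical_proteins` and the value of `constants.RBP_REGEX` are not shown in this notebook. As a rough illustration only, the sketch below shows one way a regex-based split of product names into receptor-binding protein (RBP) names and hypothetical-protein names could work. Both patterns and the function `split_products` are hypothetical stand-ins, not the repository's actual code. The RBP pattern is checked first, and names are lowercased, because the set printed later in the notebook is lowercase and includes entries such as `'conserved hypothetical tail fiber protein'`.

```python
import re

# Hypothetical patterns for illustration; the real constants.RBP_REGEX may differ.
RBP_PATTERN = re.compile(r"tail[\s\-]?(fiber|fibre|spike)|receptor[\s\-]binding")
HYPOTHETICAL_PATTERN = re.compile(r"hypothetical|unknown function")


def split_products(product_names):
    """Return (rbp_products, hypothetical_proteins) as sets of lowercased names."""
    rbp_products = set()
    hypothetical_proteins = set()
    for name in product_names:
        normalized = name.strip().lower()
        # Check the RBP pattern first so that a name such as
        # "conserved hypothetical tail fiber protein" is treated as an RBP.
        if RBP_PATTERN.search(normalized):
            rbp_products.add(normalized)
        elif HYPOTHETICAL_PATTERN.search(normalized):
            hypothetical_proteins.add(normalized)
    return rbp_products, hypothetical_proteins


example_names = [
    "Colanidase tailspike",
    "gp30.2 hypothetical protein",
    "minor tail protein G",
    "conserved hypothetical tail fiber protein",
]
example_rbps, example_hypotheticals = split_products(example_names)
```

Under these assumed patterns, `minor tail protein G` lands in neither set, since it matches neither a tail fiber/spike keyword nor a hypothetical-protein keyword.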


In [31]:
# Save both collections to the temporary preprocessing directory.
with open(os.path.join(constants.TEMP_PREPROCESSING, constants.RBP_PRODUCTS), 'wb') as rbp_products_file:
    pickle.dump(rbp_products, rbp_products_file)

with open(os.path.join(constants.TEMP_PREPROCESSING, constants.HYPOTHETICAL_PRODUCTS), 'wb') as hypothetical_proteins_file:
    pickle.dump(hypothetical_proteins, hypothetical_proteins_file)

In [32]:
with open(f'{constants.TEMP_PREPROCESSING}/{constants.RBP_PRODUCTS}','rb') as rbp_products_file:
    rbp_products = pickle.load(rbp_products_file)
    
util.set_rbp_products(rbp_products)
rbp_products

{'adhesin tip for distal long tail fiber',
 'bacteriophage tail fiber protein',
 'central straight tail fiber',
 'central tail fiber',
 'central tail fiber j',
 'central tail fiber receptor binding protein',
 'central tail spike',
 'colanidase tailspike',
 'colanidase tailspike [enterobacteria phage ecgd1]',
 'conserved hypothetical tail fiber protein',
 'conserved tail fiber protein',
 'defective tail fiber protein',
 'depolymerase tail fiber protein',
 'distal long tail fiber adhesin',
 'distal long tail fiber subunit',
 'distal tail fiber protein',
 'distal/tail fiber protein',
 'duf1983 containing putative tail fiber protein',
 'endolysin tail fiber',
 'endolysin tail fiber hydrolase',
 'endosialidase tailspike',
 'endosialidase tailspike protein',
 'fhua receptor-binding tail protein',
 'fusion long tail fiber distal subunit',
 'gp12 short tail fiber protein',
 'gp12 short tail fibers',
 'gp12 short tail fibers protein',
 'gp21, tail fiber protein',
 'gp25, tail fiber',
 'gp36 put

In [33]:
with open(f'{constants.TEMP_PREPROCESSING}/{constants.HYPOTHETICAL_PRODUCTS}','rb') as hypothetical_proteins_file:
    hypothetical_proteins = pickle.load(hypothetical_proteins_file)
    
util.set_hypothetical_proteins(hypothetical_proteins)
for protein in hypothetical_proteins:
    print(protein)

gp30.2 conserved hypothetical protein
protein of unknown function duf1493
rb69orf033c hypothetical protein
arn.3 hypothetical protein
hypothetical-protein | belonging to t4-like gc: 798
gp30.3 conserved hypothetical protein
hypothetical-protein | belonging to t4-like gc: 313
hypothetical helix-turn-helix domain protein
hypothetical protein containing coiled-coil segments
protein of unknown function duf1523
hypothetical protein_gp147
hypothetical protein [escherichia phage olb145]
cd.3 hypothetical protein
hypothetical protein_gp019
hypothetical protein [escherichia phagevb_ecom_phb05
hypothetical protein rdjlphi1_gp14
hypothetical protein jb75_0167 [escherichia phage vb_ecom_jb75]
hypothetical protein vn4_19 [vibrio phage n4]
nrdc.2 hypothetical protein
hyphothetical protein
hypothetical protein a368
gp44, hypothetical protein
hypothetical protein_gp031
protein of unknown function duf2163
alt.1 conserved hypothetical protein
hypothetical protein, duf3310
hypothetical protein g53_00272 

rb69orf053c hypothetical protein
vs.4 hypothetical protein
uncharacterized phage protein
protein of unknown function duf1360
hypothetical-protein | belonging to t4-like gc: 751
gp30.7 hypothetical protein
gp61.4 conserved hypothetical protein
gp55.2 hypothetical protein
cog1683: uncharacterized conserved protein / fig143828: hypothetical protein ybga
protein of unknown function duf4060
gp33, hypothetical protein
gp30.5 hypothetical protein
hypothetical protein_gp024
hypothetical protein_gp044
hypothetical protein_gp137
gp39.1 conserved hypothetical protein
protein of unknown function duf2774
dda.1 hypothetical protein
fig00639352: hypothetical protein
protein of unknown function duf1408
pset.2 hypothetical protein
gp46.2 conserved hypothetical protein
rb32orf217c hypothetical protein
hypothetical bacteriophage protein
hypothetical protein mrh.1
rb69orf123c hypothetical protein
hypothetical protein_gp177
hypothetical-protein | belonging to t4-like gc: 749
hypothetical protein, ninb homo

### Get distribution of RBP lengths

This preliminary analysis of the RBP length distribution covers both the sequences with gene annotations and those whose CDS coordinates and annotations are predicted via Prokka. We carry it out at this point because the lower and upper bounds for RBP lengths are needed to generate the FASTA files.

In [34]:
len_distribution = util.generate_rbp_len_distribution()
util.generate_rbp_len_distribution_prokka(len_distribution, constants.INPHARED_GENOME)

lengths = []
for length, freq in len_distribution.items():
    for _ in range(freq):
        lengths.append(length)

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records
Processed 26000 records
Processed 27000 records
Processed 28000 records
Processed 29000 records
Processed 30000 records
Processed 31000 records
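
The loop in the cell above expands a length-to-frequency table into a flat list with one entry per protein, which is the form that percentile functions expect. A minimal sketch of the same expansion with `collections.Counter` is given below; it assumes that `len_distribution` behaves like a mapping from length to count, and the toy values are invented.

```python
from collections import Counter

# Toy frequency table (invented values): two proteins of length 300, one of length 650.
toy_len_distribution = Counter({300: 2, 650: 1})

# elements() repeats each key as many times as its count.
toy_lengths = sorted(toy_len_distribution.elements())
```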


In [35]:
with open(os.path.join(constants.TEMP_PREPROCESSING, constants.RBP_LENGTHS),'wb') as rbp_lengths_file:
    pickle.dump(lengths, rbp_lengths_file)

Compute the lower and upper bounds for RBP lengths using the interquartile range (IQR): `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`.

In [36]:
with open(f'{constants.TEMP_PREPROCESSING}/{constants.RBP_LENGTHS}','rb') as rbp_lengths_file:
    lengths = pickle.load(rbp_lengths_file)

Q1 = np.percentile(lengths, 25, interpolation = 'midpoint')
Q3 = np.percentile(lengths, 75, interpolation = 'midpoint')
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("Lower bound:", lower_bound)
print("Upper bound:", upper_bound)

Lower bound: -533.0
Upper bound: 1587.0
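
Since a sequence length cannot be negative, the lower bound of -533.0 excludes nothing; in practice only the upper bound restricts the lengths. The sketch below reproduces the same IQR rule on invented toy data using only the standard library. The helper names are hypothetical, and `percentile_midpoint` is intended to mirror the `'midpoint'` option of `np.percentile`.

```python
import math


def percentile_midpoint(values, q):
    """Percentile that averages the two nearest ranks when q falls between them."""
    ordered = sorted(values)
    position = (q / 100) * (len(ordered) - 1)
    lower = ordered[math.floor(position)]
    upper = ordered[math.ceil(position)]
    return (lower + upper) / 2


def iqr_bounds(values, k=1.5):
    """Return (Q1 - k * IQR, Q3 + k * IQR)."""
    q1 = percentile_midpoint(values, 25)
    q3 = percentile_midpoint(values, 75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented lengths with one extreme value.
toy_lengths = [100, 200, 300, 400, 500, 600, 700, 5000]
lower, upper = iqr_bounds(toy_lengths)

# A negative lower bound is equivalent to a lower bound of zero for lengths.
effective_lower = max(lower, 0)
```

Here Q1 is 250 and Q3 is 650, so the bounds are -350 and 1250, and the entry of length 5000 falls outside them.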


### Generate FASTA for entries with gene annotations

Generate FASTA files containing the protein sequences of the RBPs and hypothetical proteins.

In [37]:
RBP_NEW_DIR = f'{constants.INPHARED}/{constants.FASTA}/{constants.RBP}/{constants.GENBANK}'
HYPOTHETICAL_NEW_DIR = f'{constants.INPHARED}/{constants.FASTA}/{constants.HYPOTHETICAL}/{constants.GENBANK}'

if not os.path.exists(RBP_NEW_DIR):
    os.makedirs(RBP_NEW_DIR)
    
if not os.path.exists(HYPOTHETICAL_NEW_DIR):
    os.makedirs(HYPOTHETICAL_NEW_DIR)

In [38]:
util.generate_rbp_hypothetical_fasta(RBP_NEW_DIR, HYPOTHETICAL_NEW_DIR, constants.LOWER_BOUND_RBP_LENGTH, 
                                     constants.UPPER_BOUND_RBP_LENGTH)

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records
Processed 26000 records
Processed 27000 records
Processed 28000 records
Processed 29000 records
Processed 30000 records
Processed 31000 records


The FASTA files are stored in the `inphared/fasta` directory, specifically in the `genbank` subdirectories.
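
For readers unfamiliar with the format, the sketch below shows one way to write length-filtered protein sequences as FASTA. The helper `write_fasta` is hypothetical and is not the `util` implementation; treating the bounds as inclusive is an assumption, and the bound values are the ones printed earlier, used here only as an example.

```python
import io


def write_fasta(records, handle, lower_bound, upper_bound, width=60):
    """Write (identifier, sequence) pairs whose length lies within the bounds.

    Returns the number of records written. Inclusive bounds are an assumption.
    """
    written = 0
    for identifier, sequence in records:
        if not (lower_bound <= len(sequence) <= upper_bound):
            continue
        handle.write(f">{identifier}\n")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        written += 1
    return written


# Invented records: one within the bounds, one longer than the upper bound.
records = [("toy_protein_1", "M" * 70), ("toy_protein_2", "M" * 2000)]
buffer = io.StringIO()
count = write_fasta(records, buffer, lower_bound=-533.0, upper_bound=1587.0)
fasta_text = buffer.getvalue()
```

Any file-like object can replace the `io.StringIO` buffer, for example a handle returned by `open(path, 'w')`.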

## *B. Process entries without gene annotations*

- Annotate genes using [Prokka](https://academic.oup.com/bioinformatics/article/30/14/2068/2390517), which calls [Prodigal](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119) to predict coordinates of ORFs.
- Configure Prokka to perform functional annotation using [PHROGs](https://academic.oup.com/nargab/article/3/3/lqab067/6342220).
- Process in the same manner as the entries with gene annotations.
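
As a hedged sketch of the first two steps, the snippet below only assembles a Prokka command line that supplies the PHROG HMMs through Prokka's `--hmms` option. The helper name, the file paths, and the choice of `--kingdom Viruses` are assumptions for illustration and may differ from the configuration used in this work; verify the options against `prokka --help` for the installed version.

```python
import shlex


def build_prokka_command(genome_fasta, output_dir, prefix, phrog_hmm="all_phrogs.hmm"):
    """Assemble (but do not run) a Prokka call that annotates with a user-supplied HMM file."""
    return [
        "prokka",
        "--kingdom", "Viruses",   # assumption: viral annotation mode
        "--hmms", phrog_hmm,      # user-supplied HMM database (here, the PHROG HMMs)
        "--outdir", output_dir,
        "--prefix", prefix,
        genome_fasta,
    ]


# Hypothetical paths for illustration.
command = build_prokka_command(
    "genomes/toy_phage.fasta", "prokka_out/toy_phage", "toy_phage"
)
printable = shlex.join(command)
# To execute (requires Prokka on PATH): subprocess.run(command, check=True)
```

Building the command as a list avoids shell quoting issues when genome paths contain spaces.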

In [39]:
RBP_NEW_DIR = f'{constants.INPHARED}/{constants.FASTA}/{constants.RBP}/{constants.PROKKA}'
HYPOTHETICAL_NEW_DIR = f'{constants.INPHARED}/{constants.FASTA}/{constants.HYPOTHETICAL}/{constants.PROKKA}'

if not os.path.exists(RBP_NEW_DIR):
    os.makedirs(RBP_NEW_DIR)    # also creates missing parent directories
    
if not os.path.exists(HYPOTHETICAL_NEW_DIR):
    os.makedirs(HYPOTHETICAL_NEW_DIR)

Analyze the gene product annotations returned by PHROG.

In [40]:
annot_products_prokka = util.get_annot_products_prokka(constants.INPHARED_GENOME)

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records


In [41]:
x, y = util.get_rbp_hypothetical_proteins_prokka(constants.RBP_REGEX, annot_products_prokka)

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records


Combine the keywords for RBPs and hypothetical proteins from GenBank and PHROG annotations.

In [42]:
util.set_rbp_products(rbp_products.union(x))
util.set_hypothetical_proteins(hypothetical_proteins.union(y))

In [43]:
rbp_products = rbp_products.union(x)
hypothetical_proteins = hypothetical_proteins.union(y)

In [44]:
with open(os.path.join(constants.TEMP_PREPROCESSING, constants.RBP_PRODUCTS),'wb') as rbp_products_file:
    pickle.dump(rbp_products, rbp_products_file)
    
with open(os.path.join(constants.TEMP_PREPROCESSING, constants.HYPOTHETICAL_PRODUCTS),'wb') as hypothetical_proteins_file:
    pickle.dump(hypothetical_proteins, hypothetical_proteins_file)
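
The effect of the cells above can be checked on toy data: a set union merges the two keyword collections without duplicates, and a pickle round trip restores the same set. The names and values below are invented for illustration.

```python
import os
import pickle
import tempfile

# Invented keyword sets standing in for the GenBank- and PHROG-derived names.
genbank_rbp = {"central tail fiber", "tail fiber protein"}
phrog_rbp = {"tail fiber protein", "tail spike protein"}

# Names present in both sets appear only once in the union.
combined = genbank_rbp.union(phrog_rbp)

with tempfile.TemporaryDirectory() as temp_dir:
    path = os.path.join(temp_dir, "rbp_products.pickle")
    with open(path, "wb") as handle:
        pickle.dump(combined, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```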

### Generate FASTA for entries without gene annotations

Generate FASTA files containing the protein sequences of the RBPs and hypothetical proteins.

In [45]:
util.generate_rbp_hypothetical_fasta_prokka(constants.INPHARED_GENOME, RBP_NEW_DIR, HYPOTHETICAL_NEW_DIR, 
                                            lower_bound, upper_bound)

The FASTA files are stored in the `inphared/fasta` directory, specifically in the `prokka` subdirectories.

## *C. Generate FFN files containing the nucleotide sequences of RBPs and hypothetical proteins*

In [46]:
if not os.path.exists(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}/{constants.GENBANK}'):
    os.makedirs(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}/{constants.GENBANK}')
    
if not os.path.exists(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}/{constants.GENBANK}'):
    os.makedirs(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}/{constants.GENBANK}')

In [47]:
util.generate_rbp_hypothetical_nucleotide(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}/{constants.GENBANK}',
                                          f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}/{constants.GENBANK}', 
                                          lower_bound, upper_bound)

Processed 1000 records
Processed 2000 records
Processed 3000 records
Processed 4000 records
Processed 5000 records
Processed 6000 records
Processed 7000 records
Processed 8000 records
Processed 9000 records
Processed 10000 records
Processed 11000 records
Processed 12000 records
Processed 13000 records
Processed 14000 records
Processed 15000 records
Processed 16000 records
Processed 17000 records
Processed 18000 records
Processed 19000 records
Processed 20000 records
Processed 21000 records
Processed 22000 records
Processed 23000 records
Processed 24000 records
Processed 25000 records
Processed 26000 records
Processed 27000 records
Processed 28000 records
Processed 29000 records
Processed 30000 records
Processed 31000 records


The FFN files are stored in the `inphared/fasta/nucleotide` directory, specifically in the `genbank` subdirectories.

In [48]:
if not os.path.exists(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}/{constants.PROKKA}'):
    os.makedirs(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}/{constants.PROKKA}')
    
if not os.path.exists(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}/{constants.PROKKA}'):
    os.makedirs(f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}/{constants.PROKKA}')

In [49]:
util.generate_rbp_hypothetical_nucleotide_prokka(constants.INPHARED_GENOME,
                                                 f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.RBP}/{constants.PROKKA}',
                                                 f'{constants.INPHARED}/{constants.FASTA}/{constants.NUCLEOTIDE}/{constants.HYPOTHETICAL}/{constants.PROKKA}', 
                                                 lower_bound, upper_bound)

The FFN files are stored in the `inphared/fasta/nucleotide` directory, specifically in the `prokka` subdirectories.