# 5. Reannotate rhodopsin neighborhood

We'll define rhodopsin's "neighborhood" as the genes up to five positions downstream (+1 to +5 at furthest) and up to five positions upstream (-1 to -5 at furthest) of rhodopsin.
This limits the search window for gene annotations to the vicinity of rhodopsin, rather than the entire contig.
This notebook has two steps:
* (5a). Integrating all the annotations (from DRAM, HMMsearch, and Foldseek) into a single table (output A)
* (5b). Transforming that table into a column called 'flanking_genes' that takes neighborhood info into account

### 5a:

In [None]:
import pandas as pd
from pathlib import Path
import re

#### Output 
PATH_output_A = './output/A_dram_reannot_with_hmmscan_and_foldseek.csv'
PATH_output_B = "./output/B_rhodopsin_neighborhood.csv"

#### Inputs
DIR_hmmscan = '../2_tabulate_hmmscan_hits/output/'
PATH_foldseek = '../4_reannotate_dramAAs_based_on_foldseek_annot/rankE_reannotations.csv'
PATH_dram = "../0_get_DRAM_AAs_for_SAGs_w_rhodopsins/dram_926_SAGs_combined.csv"

####### I. ##### Combine annots from DRAM, hmmscan, & Foldseek into 1 table  -> (DF3) ######

# Import the hmmscan annotation tables
# Get all files matching the pattern '_hmmsearch_minbit10.csv'
LIST_hmmscan_files = list(Path(DIR_hmmscan).glob('*_hmmsearch_minbit10.csv'))

# Read and combine all matching files
LIST_hmmscan_DFs = [pd.read_csv(file) for file in LIST_hmmscan_files]
DF_hmmscan = pd.concat(LIST_hmmscan_DFs, ignore_index=True)

# Select and rename columns
DF_hmmscan = DF_hmmscan.rename(columns={
    'query_name': 'cds_id',
    'hmmsearching_for': 'hmmsearch_gene',
    'score': 'hmmsearch_score',
    'target_accession': 'hmmsearch_acc'
})[['cds_id', 'hmmsearch_gene', 'hmmsearch_score', 'hmmsearch_acc']]


# Remove prefix from cds_id (everything up to and including first underscore)
DF_hmmscan['cds_id'] = DF_hmmscan['cds_id'].str.replace(r'^[^_]+_', '', regex=True)

# Import the foldseek annotation table
DF_foldseek = pd.read_csv(PATH_foldseek)[['gene_id', 'description']].rename(
    columns={'description': 'foldseek_hit'}
)

# Import dram annotation table and perform joins
DF_dram = pd.read_csv(PATH_dram)

# Left join with DF_hmmscan
DF2 = DF_dram.merge(DF_hmmscan, left_on='...1', right_on='cds_id', how='left')

# Create gene_id by removing prefix up to '_contigs_'
DF2['gene_id'] = DF2['...1'].str.replace(r'^.*_contigs_', '', regex=True)

# Left join with foldseek
DF3 = DF2.merge(DF_foldseek, on='gene_id', how='left')

# Create annot_multisrc column: use foldseek_hit if available, otherwise pfam_hits
DF3['annot_multisrc'] = DF3['foldseek_hit'].fillna(DF3['pfam_hits'])

# Drop gene_id column
DF3 = DF3.drop(columns=['gene_id'])

# Write to CSV
DF3.to_csv(PATH_output_A, index=False)
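
To make the ID clean-up and the two left joins in 5a easier to follow, here is a minimal sketch on toy data. The IDs, hits, and the `batch1_` prefix are made up for illustration; only the column names and regular expressions are taken from the cell above.

```python
import pandas as pd

# Toy stand-ins for the three input tables (all values are hypothetical).
toy_dram = pd.DataFrame({
    '...1': ['SAG1_contigs_SAG1_NODE_1_1', 'SAG1_contigs_SAG1_NODE_1_2'],
    'pfam_hits': ['Pfam domain A [PF00001.1]', None],
})
toy_hmmscan = pd.DataFrame({
    'cds_id': ['batch1_SAG1_contigs_SAG1_NODE_1_2'],  # hypothetical prefix
    'hmmsearch_gene': ['CrtI'],
})
toy_foldseek = pd.DataFrame({
    'gene_id': ['SAG1_NODE_1_2'],
    'foldseek_hit': ['structural hit B'],
})

# Strip everything up to and including the first underscore.
toy_hmmscan['cds_id'] = toy_hmmscan['cds_id'].str.replace(r'^[^_]+_', '', regex=True)

# Left join: every DRAM row is kept, with or without an hmmsearch hit.
merged = toy_dram.merge(toy_hmmscan, left_on='...1', right_on='cds_id', how='left')

# Derive the shorter gene_id used by the Foldseek table.
merged['gene_id'] = merged['...1'].str.replace(r'^.*_contigs_', '', regex=True)
merged = merged.merge(toy_foldseek, on='gene_id', how='left')

# Foldseek hit takes priority; pfam_hits fills in where it is missing.
merged['annot_multisrc'] = merged['foldseek_hit'].fillna(merged['pfam_hits'])

# A left join repeats a DRAM row if its key matches several rows on the right,
# so comparing row counts is a cheap sanity check.
assert len(merged) == len(toy_dram)
print(merged[['...1', 'hmmsearch_gene', 'annot_multisrc']])
```

On the real tables, the same row-count comparison between `DF_dram` and `DF3` would show whether any gene picked up more than one hmmsearch or Foldseek hit.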



### 5b:

In [39]:
####### II. ##### Get the rhodopsin neighborhood (i.e. annotations for 5 genes upstream and 5 genes downstream of rhodopsin)



> python **5b_rhodopsin_neighborhood.py** --input_file "./output/A_dram_reannot_with_hmmscan_and_foldseek.csv" --output="./output/B_rhodopsin_neighborhood"
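
The script itself is not shown in this notebook, so the following is only a rough sketch of the windowing idea, not the contents of `5b_rhodopsin_neighborhood.py`. The helper name `sketch_flanking_genes` and the column names `start`, `end`, and `strand` are assumptions. It also assumes that "upstream" and "downstream" follow gene order along the scaffold rather than rhodopsin's own strand.

```python
import pandas as pd

def sketch_flanking_genes(df_scaffold, rhodopsin_idx, window=5):
    """Collect annotations for up to `window` genes on each side of rhodopsin.

    Hypothetical helper: assumes df_scaffold holds the genes of one scaffold,
    sorted by position, and that rhodopsin_idx is rhodopsin's row position.
    """
    flanking = {}
    # Clip the window at the scaffold edges, so short scaffolds give fewer genes.
    first = max(rhodopsin_idx - window, 0)
    last = min(rhodopsin_idx + window, len(df_scaffold) - 1)
    for i in range(first, last + 1):
        offset = i - rhodopsin_idx
        if offset == 0:
            key = 'annot_multisrc of rhodopsin gene'
        else:
            key = f'annot_multisrc of gene {offset:+d}'  # e.g. '-5' or '+2'
        row = df_scaffold.iloc[i]
        flanking[key] = [row['annot_multisrc'], row['hmmsearch_gene'],
                         int(row['start']), int(row['end']), int(row['strand'])]
    return flanking

# Toy scaffold with three genes (values are made up); rhodopsin is the middle one.
toy = pd.DataFrame({
    'annot_multisrc': ['gene A', 'Bacteriorhodopsin-like protein', 'gene C'],
    'hmmsearch_gene': ['None', 'None', 'CrtI'],
    'start': [100, 500, 900],
    'end': [400, 800, 1200],
    'strand': [1, -1, 1],
})
result = sketch_flanking_genes(toy, rhodopsin_idx=1)
print(list(result))
```

Clipping at the scaffold edges would explain rows where the first key is 'gene -1' rather than 'gene -5': rhodopsin sits near the start of that scaffold.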

# Understanding 'flanking_genes' format...

In [42]:
DF_neighborhood = pd.read_csv(PATH_output_B)
DF_neighborhood.head()

| Unnamed: 0 | rhodopsin_gene | rhodopsin_scaffold | sag | rhodopsin_count_in_sag | carotenoid_gene_cluster | flotillin | neither | flanking_genes |
|---|---|---|---|---|---|---|---|---|
| 0 | AG-538-J23_contigs_AG-538-J23_NODE_9_13 | AG-538-J23_NODE_9 | AG-538-J23_contigs | 1 | 0 | 0 | 0 | {'annot_multisrc of gene -5': ['Glycosyl trans... |
| 1 | AG-538-C05_contigs_AG-538-C05_NODE_1_19 | AG-538-C05_NODE_1 | AG-538-C05_contigs | 1 | 0 | 1 | 0 | {'annot_multisrc of gene -5': ['Periplasmic bi... |
| 2 | AG-538-N13_contigs_AG-538-N13_NODE_14_10 | AG-538-N13_NODE_14 | AG-538-N13_contigs | 1 | 0 | 1 | 0 | {'annot_multisrc of gene -5': ['Domain of unkn... |
| 3 | AH-302-C16_contigs_AH-302-C16_NODE_2_56 | AH-302-C16_NODE_2 | AH-302-C16_contigs | 1 | 0 | 0 | 0 | {'annot_multisrc of gene -5': ['Dolichyl-phosp... |
| 4 | AH-302-J23_contigs_AH-302-J23_NODE_61_2 | AH-302-J23_NODE_61 | AH-302-J23_contigs | 1 | 0 | 0 | 0 | {'annot_multisrc of gene -1': ['Domain of unkn... |


#### Let's look at an example of this 'flanking_genes' column...

In [44]:
DF_neighborhood['flanking_genes'][0]

"{'annot_multisrc of gene -5': ['Glycosyl transferase family 2 [PF00535.29]', 'None', 8230, 9054, 1], 'annot_multisrc of gene -4': ['Glucose / Sorbosone dehydrogenase [PF07995.14]', 'None', 9062, 10432, 1], 'annot_multisrc of gene -3': ['GDP-mannose 4,6 dehydratase [PF16363.8]; NAD dependent epimerase/dehydratase family [PF01370.24]; 3-beta hydroxysteroid dehydrogenase/isomerase family [PF01073.22]; RmlD substrate binding domain [PF04321.20]; Polysaccharide biosynthesis protein [PF02719.18]; Male sterility protein [PF07993.15]; NmrA-like family [PF05368.16]', 'None', 10366, 11406, -1], 'annot_multisrc of gene -2': ['Small Multidrug Resistance protein [PF00893.22]', 'None', 11410, 11742, -1], 'annot_multisrc of gene -1': ['AMP-binding enzyme [PF00501.31]; Acetyl-coenzyme A synthetase N-terminus [PF16177.8]; AMP-binding enzyme C-terminal domain [PF13193.9]', 'None', 11742, 13637, -1], 'annot_multisrc of rhodopsin gene': ['Bacteriorhodopsin-like protein [PF01036.21]', 'None', 13834, 14601

In [46]:
import ast
DICT_neighborhood_of_rhodA = ast.literal_eval(DF_neighborhood['flanking_genes'][0])
DICT_neighborhood_of_rhodA

{'annot_multisrc of gene -5': ['Glycosyl transferase family 2 [PF00535.29]',
  'None',
  8230,
  9054,
  1],
 'annot_multisrc of gene -4': ['Glucose / Sorbosone dehydrogenase [PF07995.14]',
  'None',
  9062,
  10432,
  1],
 'annot_multisrc of gene -3': ['GDP-mannose 4,6 dehydratase [PF16363.8]; NAD dependent epimerase/dehydratase family [PF01370.24]; 3-beta hydroxysteroid dehydrogenase/isomerase family [PF01073.22]; RmlD substrate binding domain [PF04321.20]; Polysaccharide biosynthesis protein [PF02719.18]; Male sterility protein [PF07993.15]; NmrA-like family [PF05368.16]',
  'None',
  10366,
  11406,
  -1],
 'annot_multisrc of gene -2': ['Small Multidrug Resistance protein [PF00893.22]',
  'None',
  11410,
  11742,
  -1],
 'annot_multisrc of gene -1': ['AMP-binding enzyme [PF00501.31]; Acetyl-coenzyme A synthetase N-terminus [PF16177.8]; AMP-binding enzyme C-terminal domain [PF13193.9]',
  'None',
  11742,
  13637,
  -1],
 'annot_multisrc of rhodopsin gene': ['Bacteriorhodopsin-like

#### So, for instance: this LIST describes the protein 2 positions downstream of rhodopsin...

In [47]:
DICT_neighborhood_of_rhodA['annot_multisrc of gene +2']

['Pyridine nucleotide-disulphide oxidoreductase [PF07992.17]; Flavin-binding monooxygenase-like [PF00743.22]; HI0933-like protein [PF03486.17]; NAD(P)-binding Rossmann-like domain [PF13450.9]',
 'CrtI',
 15000,
 16022,
 1]

#### Where LIST[0] is the 'annot_multisrc' annotation (the Foldseek hit if one was found, otherwise the DRAM pfam_hits):

In [48]:
DICT_neighborhood_of_rhodA['annot_multisrc of gene +2'][0]

'Pyridine nucleotide-disulphide oxidoreductase [PF07992.17]; Flavin-binding monooxygenase-like [PF00743.22]; HI0933-like protein [PF03486.17]; NAD(P)-binding Rossmann-like domain [PF13450.9]'

#### LIST[1] is the annots from HMMsearch:

In [49]:
DICT_neighborhood_of_rhodA['annot_multisrc of gene +2'][1]

'CrtI'

#### LIST[2] and LIST[3] are the start and end nucleotide coordinates of that gene:

In [51]:
DICT_neighborhood_of_rhodA['annot_multisrc of gene +2'][2:4]

[15000, 16022]
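
As a small worked example, the two coordinates give the gene length. This assumes the coordinates are 1-based and inclusive, which the notebook does not state. The resulting length being a multiple of three is at least consistent with that assumption.

```python
# Coordinates copied from the output above.
start, end = 15000, 16022

# Assumption: 1-based, inclusive coordinates, so both endpoints are counted.
length_nt = end - start + 1

print(length_nt)      # 1023
print(length_nt % 3)  # 0, i.e. a whole number of codons
```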

#### And LIST[4] is the strandedness (1 for pos, -1 for neg):

In [53]:
DICT_neighborhood_of_rhodA['annot_multisrc of gene +2'][4]

1
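
Putting the pieces together, one possible way to work with a 'flanking_genes' entry is to unpack it into one row per gene. This is a minimal sketch: the helper `flanking_to_frame` and the labels in `FIELDS` are names chosen here for illustration, and the toy string only mimics the format shown above.

```python
import ast
import re
import pandas as pd

# Labels for the five list positions described above (names are assumptions).
FIELDS = ['annot_multisrc', 'hmmsearch_gene', 'start', 'end', 'strand']

def flanking_to_frame(flanking_str):
    """Parse one 'flanking_genes' string into a DataFrame with one row per gene."""
    flanking = ast.literal_eval(flanking_str)
    rows = []
    for key, values in flanking.items():
        # Keys end in a signed offset such as '-5' or '+2'; the rhodopsin key has none.
        match = re.search(r'gene ([+-]\d+)$', key)
        offset = int(match.group(1)) if match else 0
        rows.append({'offset': offset, **dict(zip(FIELDS, values))})
    return pd.DataFrame(rows).sort_values('offset', ignore_index=True)

# Toy string in the same format as the column (values are made up).
toy_str = (
    "{'annot_multisrc of gene -1': ['gene A', 'None', 100, 400, 1], "
    "'annot_multisrc of rhodopsin gene': ['Bacteriorhodopsin-like protein', 'None', 500, 800, -1], "
    "'annot_multisrc of gene +1': ['gene C', 'CrtI', 900, 1200, 1]}"
)
tidy = flanking_to_frame(toy_str)
print(tidy[['offset', 'hmmsearch_gene', 'strand']])
```

Applied to a full row, e.g. `flanking_to_frame(DF_neighborhood['flanking_genes'][0])`, this should give up to eleven rows, with offsets from -5 to +5.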