# Creating the Gene Table
This notebook is copied from the [Pymodulon GitHub repository](https://github.com/SBRG/pymodulon/blob/master/docs/tutorials/creating_the_gene_table.ipynb)

## Get information from GFF files

In [3]:
from pymodulon.gene_util import *
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from os import path
from pathlib import Path  

First, download the FASTA and GFF files for your organism and its plasmids from NCBI.

Enter the location of all your GFF files here:

In [24]:
gff_files = [os.path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/genome.gff3')]
gene_file = path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/gene.csv') # Enter metadata filename here
Kegg_file = path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/Kegg_ID.csv') # Enter metadata filename here


The following cell will convert all the GFF files into a single Pandas DataFrame for easy manipulation. Pseudogenes have multiple rows in a GFF file (one for each fragment), but only the first fragment will be kept.

In [5]:
keep_cols = ['accession','start','end','strand','gene_name','old_locus_tag','gene_product','ncbi_protein']

DF_annot = gff2pandas(gff_files,index='locus_tag')
DF_annot = DF_annot[keep_cols]

DF_annot



locus_tag,accession,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein
CYL77_RS00005,NZ_CP025533.1,1.0,1575.0,+,dnaA,CYL77_00005,chromosomal replication initiator protein DnaA,WP_011013309.1
CYL77_RS00015,NZ_CP025533.1,2292.0,3476.0,+,dnaN,CYL77_00015,DNA polymerase III subunit beta,WP_003855336.1
CYL77_RS00020,NZ_CP025533.1,3585.0,4769.0,+,recF,CYL77_00020,DNA replication/repair protein RecF,WP_011013310.1
CYL77_RS00025,NZ_CP025533.1,4766.0,5302.0,+,,CYL77_00025,DUF721 domain-containing protein,WP_003855338.1
CYL77_RS00030,NZ_CP025533.1,5435.0,7489.0,+,gyrB,CYL77_00030,DNA topoisomerase (ATP-hydrolyzing) subunit B,WP_011013311.1
...,...,...,...,...,...,...,...,...
CYL77_RS15665,NZ_CP025533.1,3313085.0,3313714.0,-,rsmG,CYL77_15665,16S rRNA (guanine(527)-N(7))-methyltransferase...,WP_011266076.1
CYL77_RS15670,NZ_CP025533.1,3313903.0,3314856.0,-,yidC,CYL77_15670,membrane protein insertase YidC,WP_003855313.1
CYL77_RS15675,NZ_CP025533.1,3314912.0,3315211.0,-,yidD,CYL77_15675,membrane protein insertion efficiency factor YidD,WP_011266077.1
CYL77_RS15680,NZ_CP025533.1,3315192.0,3315593.0,-,rnpA,CYL77_15680,ribonuclease P protein component,WP_003860977.1
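
The "keep only the first fragment" rule described above can be illustrated with plain pandas. This is a minimal sketch on made-up rows (the locus tags and coordinates are hypothetical), and it is not necessarily how `gff2pandas` implements the rule internally.

```python
import pandas as pd

# Toy GFF-like rows; the second pseudogene fragment repeats its locus tag.
# All locus tags and coordinates here are made up for illustration.
rows = pd.DataFrame({
    'locus_tag': ['TAG_0001', 'TAG_0002', 'TAG_0002', 'TAG_0003'],
    'start': [1, 2000, 2600, 4000],
    'end': [1575, 2500, 3100, 5200],
    'strand': ['+', '-', '-', '+'],
})

# Keep only the first row seen for each locus tag, then index by it
first_fragments = rows.drop_duplicates('locus_tag', keep='first').set_index('locus_tag')
print(first_fragments)
```

Note that the retained `end` coordinate then describes only the first fragment, not the full extent of the pseudogene.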


Since the microarray data use a different annotation, it may be necessary to add a column with the gene IDs used by the microarray so that they can be matched to these locus tags.
1. In gene_ID, the order of the start and end positions also encodes the direction of the gene. The solution is to add an explicit strand column and swap the coordinates where needed, so that the start position is always smaller than the end position.

In [39]:
gene_ID = pd.read_csv(gene_file, index_col=0).fillna(-1)
Kegg_ID = pd.read_csv(Kegg_file, index_col=0,sep='\t')

In [40]:
gene_start = gene_ID["Start"].tolist()
gene_stop = gene_ID["Stop"].tolist()
strand = []
for i in range(len(gene_start)):
    if gene_start[i] < gene_stop[i]:
        strand.append("+")
    else:
        strand.append("-")
        temp = gene_start[i]
        gene_start[i] = gene_stop[i]
        gene_stop[i] = temp

In [41]:
gene_ID = gene_ID.drop(["Start","Stop"], axis = 1)
gene_ID['Start'] = gene_start
gene_ID['Stop'] = gene_stop
gene_ID['Strand'] = strand
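
The loop above can also be written with vectorized operations. The following is a minimal sketch on a toy table (the IDs and coordinates are made up) that assumes the same `Start` and `Stop` column names as `gene_ID`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for gene_ID; the IDs and coordinates are made up
toy = pd.DataFrame(
    {'Start': [1.0, 3476.0, 4766.0], 'Stop': [1575.0, 2292.0, 5302.0]},
    index=['g1', 'g2', 'g3'],
)

# Start < Stop means forward strand, otherwise reverse strand
toy['Strand'] = np.where(toy['Start'] < toy['Stop'], '+', '-')

# Reorder coordinates so Start is always the smaller value
coords = toy[['Start', 'Stop']].to_numpy()
toy['Start'] = coords.min(axis=1)
toy['Stop'] = coords.max(axis=1)
print(toy)
```

As in the loop, rows where both coordinates are equal (for example missing values filled with -1) are labelled `-`, so those rows may need a separate check.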

In [17]:
from pathlib import Path  
filepath = Path('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/gene.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
gene_ID.to_csv(filepath)  

In [59]:
gene_ID

ID,Annotation,Gene_name,Genome_ACC,ORF,Strand,Start,Stop
cg0001,chromosomal replication initiation protein,dnaA,BX927147,cg0001,+,1.0,1575.0
cg0002,hypothetical protein,0,BX927147,cg0002,+,1594.0,1920.0
cg0004,DNA polymerase III subunit ? (EC:2.7.7.7),dnaN,BX927147,cg0004,+,2292.0,3476.0
cg0005,recombination protein RecF,recF,BX927147,cg0005,+,3585.0,4769.0
cg0006,"hypothetical protein, conserved",0,BX927147,cg0006,+,4766.0,5302.0
...,...,...,...,...,...,...,...
cgtRNA_3583,Asp tRNA,0,BX927147,cgtRNA_3583,+,2698749.0,2698859.0
cgtRNA_3584,Glu tRNA,0,BX927147,cgtRNA_3584,+,2698861.0,2698954.0
cgtRNA_3585,Lys tRNA,0,BX927147,cgtRNA_3585,+,2700806.0,2700878.0
cgtRNA_3586,Thr tRNA,0,BX927147,cgtRNA_3586,+,2764068.0,2764146.0


In [58]:
# Strip surrounding whitespace from gene names; non-string values are left unchanged
gene_ID['Gene_name'] = gene_ID['Gene_name'].map(lambda x: x.strip() if isinstance(x, str) else x)


In [48]:
DF_annot

locus_tag,accession,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein
CYL77_RS00005,NZ_CP025533.1,1.0,1575.0,+,dnaA,CYL77_00005,chromosomal replication initiator protein DnaA,WP_011013309.1
CYL77_RS00015,NZ_CP025533.1,2292.0,3476.0,+,dnaN,CYL77_00015,DNA polymerase III subunit beta,WP_003855336.1
CYL77_RS00020,NZ_CP025533.1,3585.0,4769.0,+,recF,CYL77_00020,DNA replication/repair protein RecF,WP_011013310.1
CYL77_RS00025,NZ_CP025533.1,4766.0,5302.0,+,,CYL77_00025,DUF721 domain-containing protein,WP_003855338.1
CYL77_RS00030,NZ_CP025533.1,5435.0,7489.0,+,gyrB,CYL77_00030,DNA topoisomerase (ATP-hydrolyzing) subunit B,WP_011013311.1
...,...,...,...,...,...,...,...,...
CYL77_RS15665,NZ_CP025533.1,3313085.0,3313714.0,-,rsmG,CYL77_15665,16S rRNA (guanine(527)-N(7))-methyltransferase...,WP_011266076.1
CYL77_RS15670,NZ_CP025533.1,3313903.0,3314856.0,-,yidC,CYL77_15670,membrane protein insertase YidC,WP_003855313.1
CYL77_RS15675,NZ_CP025533.1,3314912.0,3315211.0,-,yidD,CYL77_15675,membrane protein insertion efficiency factor YidD,WP_011266077.1
CYL77_RS15680,NZ_CP025533.1,3315192.0,3315593.0,-,rnpA,CYL77_15680,ribonuclease P protein component,WP_003860977.1


In [62]:
gene_ID.Gene_name['cg0001']

'dnaA'

In [63]:
DF_annot.gene_name['CYL77_RS00005']

'dnaA'

In [69]:
'dnaA' in gene_ID.Gene_name.values

True

In [70]:
count = 0
for i, row in DF_annot.iterrows():
    if row.gene_name in gene_ID.Gene_name.values:
        print(row.gene_name, 
              gene_ID.index[gene_ID.Gene_name == row.gene_name][0], 
              i)
        count += 1

dnaA cg0001 CYL77_RS00005
dnaN cg0004 CYL77_RS00015
recF cg0005 CYL77_RS00020
gyrB cg0007 CYL77_RS00030
ssuR cg0012 CYL77_RS00050
gyrA cg0015 CYL77_RS00065
crgA cg0055 CYL77_RS00235
pknB cg0057 CYL77_RS00240
bioB cg0095 CYL77_RS00405
ureC cg0115 CYL77_RS00500
ureE cg0116 CYL77_RS00505
ureG cg0118 CYL77_RS00515
amn cg0124 CYL77_RS00545
xylB cg0147 CYL77_RS00635
panC cg0148 CYL77_RS00640
panB cg0149 CYL77_RS00645
iolR cg0196 CYL77_RS00855
iolC cg0197 CYL77_RS00860
iolB cg0201 CYL77_RS00875
iolD cg0202 CYL77_RS00880
gltB cg0229 CYL77_RS00990
moaC cg0260 CYL77_RS01135
modA cg0263 CYL77_RS01150
hisC cg2304 CYL77_RS01170
tgt cg0285 CYL77_RS01260
leuA cg0303 CYL77_RS01355
brnF cg0314 CYL77_RS01400
brnE cg0315 CYL77_RS01405
nth cg0353 CYL77_RS01575
topA cg0373 CYL77_RS01655
murB cg0423 CYL77_RS01880
lpdA cg0790 CYL77_RS01925
ramB cg0444 CYL77_RS01940
purU cg0457 CYL77_RS02005
deoC cg0458 CYL77_RS02010
mshA cg0481 CYL77_RS02095
proC cg0490 CYL77_RS02140
hemC cg0498 CYL77_RS02170
qsuB cg0502 CYL

In [71]:
count

437

In [49]:
anno_name = DF_annot['gene_name'].tolist()
mine_name = gene_ID['Gene_name'].tolist()
counter = 0
for i in range(len(mine_name)):
    if mine_name[i] not in anno_name:
        counter = counter +1
print(counter)

2682
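
The name-based matching explored in the cells above could be turned into an extra column on the annotation table. This is a rough sketch on toy data: the frames `annot` and `array_ids` and the column name `array_id` are hypothetical stand-ins, and it assumes that matching on gene name alone is acceptable, which leaves unnamed genes unmatched.

```python
import pandas as pd

# Toy stand-ins for DF_annot and gene_ID (values are illustrative only)
annot = pd.DataFrame(
    {'gene_name': ['dnaA', 'dnaN', None, 'gyrB']},
    index=pd.Index(['TAG_1', 'TAG_2', 'TAG_3', 'TAG_4'], name='locus_tag'),
)
array_ids = pd.DataFrame(
    {'Gene_name': ['dnaA', 'dnaN', '0', 'gyrB ']},
    index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='ID'),
)

# Strip stray whitespace so 'gyrB ' matches 'gyrB'
names = array_ids['Gene_name'].str.strip()

# Build a name -> microarray ID lookup, keeping the first ID per name
lookup = (names.reset_index()
               .drop_duplicates('Gene_name')
               .set_index('Gene_name')['ID'])

# Attach the microarray ID to each locus tag; unmatched names become NaN
annot['array_id'] = annot['gene_name'].map(lookup)
matched = int(annot['array_id'].notna().sum())
print(annot)
print(matched)
```

Keeping the first ID per name mirrors the `[0]` used in the loop above, so a gene name that occurs more than once in the microarray table is resolved to its first entry.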


In [45]:
type(Kegg_ID_l[0])

int

To ensure that the gene index used is identical to the expression matrix, load in your data.

Check that the genes are the same in the expression dataset as in the annotation dataframe. Mismatched genes are listed below.

In [None]:
test = DF_annot.sort_index().index == DF_log_tpm.sort_index().index
DF_annot[~test]
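
An element-wise comparison of two indexes only works when both have the same length. As a hedged alternative, set differences list the mismatched genes in either direction. The indexes below are toy stand-ins for `DF_annot.index` and the expression matrix index.

```python
import pandas as pd

# Toy gene indexes; the tags are made up for illustration
annot_index = pd.Index(['TAG_1', 'TAG_2', 'TAG_3'])
expr_index = pd.Index(['TAG_2', 'TAG_3', 'TAG_4'])

# Genes present in one table but missing from the other
only_in_annot = annot_index.difference(expr_index)
only_in_expr = expr_index.difference(annot_index)
print(list(only_in_annot), list(only_in_expr))
```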

In [None]:
from pathlib import Path  
filepath = Path('/Users/louxuwen/Desktop/Documents/GitHub/BENG212_S_aureus/5_characterize_iModulons/Data/gene_info.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
DF_annot.to_csv(filepath)  

## (Optional) KEGG and COGs

### Generate nucleotide fasta files for CDS

Enter the location of all your fasta files here:
https://www.genome.jp/brite/br08601+cgb
https://rest.kegg.jp/link/cgl/pathway

In [None]:
fasta_files = [os.path.join('saureus.fna')]

The following code generates CDS files using your FASTA and GFF3 files

In [None]:
from Bio import SeqIO

cds_list = []
for fasta in fasta_files:
    seq = SeqIO.read(fasta,'fasta')

    # Get gene information for genes in this fasta file
    df_genes = DF_annot[DF_annot.accession == seq.id]
    
    for i,row in df_genes.iterrows():
        # Cast coordinates to int, since they are stored as floats in DF_annot
        cds = seq[int(row.start)-1:int(row.end)]
        if row.strand == '-':
            cds = cds.reverse_complement()
        cds.id = row.name
        cds.description = row.gene_name if pd.notnull(row.gene_name) else row.name
        cds_list.append(cds)

In [None]:
cds_list[:5]
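
The coordinate handling in the loop above can be shown without Biopython. This is a minimal sketch on a made-up sequence: `extract_cds` is a hypothetical helper, and it only handles the four unambiguous DNA bases.

```python
# Complement table for unambiguous DNA bases (ambiguity codes are not handled)
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')

def extract_cds(genome, start, end, strand):
    """Return the coding sequence for 1-based, inclusive GFF coordinates."""
    # GFF start is 1-based and end is inclusive; Python slices are 0-based
    # with an exclusive end, so only the start shifts by one
    seq = genome[int(start) - 1:int(end)]
    if strand == '-':
        # Reverse strand: complement every base, then reverse the order
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

genome = 'AAATGCCCTAGGG'  # made-up sequence
forward = extract_cds(genome, 3.0, 11.0, '+')
reverse = extract_cds(genome, 3.0, 11.0, '-')
print(forward, reverse)
```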

Save the CDS file

In [None]:
cds_file = os.path.join('CDS_files','CDS.fna')
SeqIO.write(cds_list, cds_file, 'fasta')

### Run EggNOG Mapper
1. Go to http://eggnog-mapper.embl.de/.
1. Upload the CDS.fna file saved above (in the CDS_files folder)
1. Make sure to limit the taxonomy to the correct level
1. After the job is submitted, you must follow the link in your email to run the job.
1. Once the job completes (after ~4 hrs), download the annotations file.
1. Save the annotation file

### Get KEGG IDs

Once you have the EggNOG annotations, load the annotation file

In [None]:
eggnog_file = os.path.join('eggNOG','MM_pqq12tbj.emapper.annotations.txt')

In [None]:
DF_eggnog = pd.read_csv(eggnog_file,sep='\t',skiprows=4,header=None)
eggnog_cols = ['query_name','seed eggNOG ortholog','seed ortholog evalue','seed ortholog score',
               'Predicted taxonomic group','Predicted protein name','Gene Ontology terms',
               'EC number','KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction',
               'KEGG_rclass','BRITE','KEGG_TC','CAZy','BiGG Reaction','tax_scope',
               'eggNOG OGs','bestOG_deprecated','COG']  #deleted 'eggNOG free text description' column at the end due to bug

DF_eggnog.columns = eggnog_cols

# Strip last three rows as they are comments
DF_eggnog = DF_eggnog.iloc[:-3]

# Set locus tag as index
DF_eggnog = DF_eggnog.set_index('query_name')
DF_eggnog.index.name = 'locus_tag'

DF_eggnog.head()
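
The cell above relies on a fixed number of comment rows at the top and bottom of the file. As a hedged alternative, `pd.read_csv` can skip every line that starts with `#` through its `comment` parameter. The sketch below uses a tiny made-up file with only two columns, so the column list is illustrative rather than the real eggNOG layout.

```python
import io

import pandas as pd

# Made-up miniature of an eggNOG annotation file: comment lines start with '#'
text = (
    "# emapper version: hypothetical\n"
    "#query_name\tCOG\n"
    "TAG_1\tJ\n"
    "TAG_2\tK\n"
    "# 2 queries scanned\n"
)

# comment='#' drops fully commented lines wherever they occur in the file
toy_eggnog = pd.read_csv(io.StringIO(text), sep='\t', comment='#',
                         header=None, names=['query_name', 'COG'])
toy_eggnog = toy_eggnog.set_index('query_name')
toy_eggnog.index.name = 'locus_tag'
print(toy_eggnog)
```

One caveat: `comment='#'` also truncates a data line at any `#` inside a field, so this only suits files whose annotation text never contains that character.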

Now we will pull the KEGG information from the eggNOG file, including orthology, pathway, module, and reactions for each gene.

In [None]:
DF_kegg = DF_eggnog[['KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction']]

# Melt dataframe
DF_kegg = DF_kegg.reset_index().melt(id_vars='locus_tag') 

# Remove null values
DF_kegg = DF_kegg[DF_kegg.value.notnull()]

# Split comma-separated values into their own rows
list2struct = []
for name,row in DF_kegg.iterrows():
    for val in row.value.split(','):
        list2struct.append([row.locus_tag,row.variable,val])

DF_kegg = pd.DataFrame(list2struct,columns=['gene_id','database','kegg_id'])

# Remove ko entries, as only map entries are searchable in KEGG pathway
DF_kegg = DF_kegg[~DF_kegg.kegg_id.str.startswith('ko')]

DF_kegg.head()
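
The row-by-row split above can also be sketched with `str.split` and `DataFrame.explode`. The table below is a toy stand-in for the melted `DF_kegg`, with illustrative locus tags.

```python
import pandas as pd

# Toy long-format table like DF_kegg after melting; IDs are illustrative
long_df = pd.DataFrame({
    'locus_tag': ['TAG_1', 'TAG_2'],
    'variable': ['KEGG_pathway', 'KEGG_module'],
    'value': ['ko00010,map00010', 'M00001'],
})

# Turn each comma-separated string into a list, then one row per element
exploded = (long_df.assign(value=long_df['value'].str.split(','))
                   .explode('value')
                   .reset_index(drop=True))
exploded.columns = ['gene_id', 'database', 'kegg_id']
print(exploded)
```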

### Save KEGG information

In [None]:
DF_kegg.to_csv(os.path.join('KEGG','kegg_mapping.csv'))

### Save COGs to annotation dataframe

In [None]:
DF_annot['COG'] = DF_eggnog.COG

# Make sure COG only has one entry per gene
DF_annot['COG'] = [item[0] if isinstance(item,str) else item for item in DF_annot['COG']]

## Uniprot ID mapping

The ``uniprot_id_mapping`` function is a python wrapper for the [Uniprot ID mapping tool](https://www.uniprot.org/uploadlists/). Use ``input_id=P_REFSEQ_AC`` if the FASTA/GFF files are from RefSeq, and ``input_id=EMBL`` if the files are from Genbank.

In [None]:
#mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='EMBL',output_id='ACC',
#                             input_name='ncbi_protein',output_name='uniprot')
#mapping_uniprot.head()


#mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='P_REFSEQ_AC',output_id='ACC',
#                                     input_name='ncbi_protein',output_name='uniprot')
#mapping_uniprot.head()

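# Query the UniProt ID mapping REST API directly
import time

import requests

URL = 'https://rest.uniprot.org/idmapping'

# Submit only real protein IDs (drop missing values and duplicates)
IDS = DF_annot.ncbi_protein.dropna().unique().tolist()

# Database names assume RefSeq protein accessions (WP_...) mapped to UniProtKB
params = {
    'from': 'RefSeq_Protein',
    'to': 'UniProtKB',
    'ids': ','.join(IDS),
}

response = requests.post(f'{URL}/run', data=params)
response.raise_for_status()
job_id = response.json()['jobId']

# Poll the job status until it is no longer queued or running
for attempt in range(60):
    job_status = requests.get(f'{URL}/status/{job_id}').json()
    if job_status.get('jobStatus') not in ('NEW', 'RUNNING'):
        break
    time.sleep(3)

# Collect the results page by page
pairs = []
next_url = f'{URL}/results/{job_id}?size=500'
while next_url:
    page = requests.get(next_url)
    page.raise_for_status()
    for obj in page.json().get('results', []):
        # 'to' may be a plain accession or a full entry record
        to = obj['to']
        acc = to.get('primaryAccession') if isinstance(to, dict) else to
        pairs.append((obj['from'], acc))
    next_url = page.links.get('next', {}).get('url')

# One UniProt accession per protein, ready for the merge below
mapping_uniprot = (pd.DataFrame(pairs, columns=['ncbi_protein', 'uniprot'])
                   .drop_duplicates('ncbi_protein'))
mapping_uniprot.head()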

In [None]:
# Merge with current annotation
DF_annot = pd.merge(DF_annot.reset_index(),mapping_uniprot,how='left',on='ncbi_protein')
DF_annot.set_index('locus_tag',inplace=True)
DF_annot.head()

## Add Biocyc Operon information

To obtain operon information from Biocyc, follow the steps below

1. Go to [Biocyc.org](https://biocyc.org/) (you may need to create an account and/or log in)
2. Change the organism database to your organism/strain
3. Select **SmartTables** -> **Special SmartTables**
4. Select **"All genes of \<organism\>"**
5. Select the **"Gene Name"** column
6. Under **"ADD TRANSFORM COLUMN"** select **"Genes in same transcription unit"**
7. Select the **"Genes in same transcription unit"** column
8. Under **"ADD PROPERTY COLUMN"** select **"Accession-1"**
9. Under **OPERATIONS**, select **"Export"** -> **"to Spreadsheet File..."**
10. Select **"common names"** and click **"Export smarttable"**
11. Add file location below and run the code cell

In [None]:
biocyc_file = os.path.join('..','data','external','biocyc_annotations.txt')

DF_biocyc = pd.read_csv(biocyc_file,sep='\t')

# Remove genes with no accession
DF_biocyc = DF_biocyc[DF_biocyc['Accession-1'].notnull()]

# Set the accession (i.e. locus tag) as index
DF_biocyc = DF_biocyc.set_index('Accession-1').sort_values('Left-End-Position')

# Specific for B. subtilis: Fix locus tags
DF_biocyc.index = DF_biocyc.index.str.replace('BSU','BSU_')

# Only keep genes in the final annotation file
DF_biocyc = DF_biocyc.reindex(DF_annot.index)

# Reformat transcription units
DF_biocyc['operon_list'] = DF_biocyc['Accession-1.1'].apply(reformat_biocyc_tu)

# Fill None with locus tags
DF_biocyc['operon_list'] = DF_biocyc['operon_list'].fillna(DF_biocyc.index.to_series())

DF_biocyc.head()

### Assign unique IDs to operons

The following code assigns unique names to each operon

In [None]:
# Get all operons
operons = DF_biocyc['operon_list'].unique()

# Map each operon to a unique string
operon_dict = {operon: "Op"+str(i) for i, operon in enumerate(operons)}

# Add names to dataframe
DF_biocyc['operon'] = [operon_dict[op] for op in DF_biocyc["operon_list"]]

DF_biocyc.head()

Finally, merge the Biocyc information with the main annotation DataFrame

In [None]:
DF_annot['operon'] = DF_biocyc['operon']

## Clean up and save annotation

First, we will re-order the annotation columns

In [None]:
if 'old_locus_tag' in DF_annot.columns:
    order = ['gene_name','accession','old_locus_tag','start','end','strand','gene_product','COG','uniprot','operon']
else:
    order = ['gene_name','accession','start','end','strand','gene_product','COG','uniprot','operon']
    
DF_annot = DF_annot[order]

In [None]:
DF_annot.head()

## Final statistics

The following graphs show how much information is available for the organism.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('ticks')

In [None]:
fig,ax = plt.subplots()
DF_annot.count().plot(kind='bar',ax=ax)
ax.set_ylabel('# of Values',fontsize=18)
ax.tick_params(labelsize=16)

## Fill missing values

Some organisms are missing gene names, so these will be filled with locus tag gene names.

In [None]:
# Fill in missing gene names with locus tag names
DF_annot['gene_name'] = DF_annot['gene_name'].fillna(DF_annot.index.to_series())

COG letters will also be converted to their full names.

In [None]:
# Fill missing COGs with X
DF_annot['COG'] = DF_annot['COG'].fillna('X')

# Change single letter COG annotation to full description
DF_annot['COG'] = DF_annot.COG.apply(cog2str)

counts = DF_annot.COG.value_counts()
plt.pie(counts.values,labels=counts.index);

Run the following cell to save the gene annotation dataset

In [None]:
DF_annot.to_csv(os.path.join('..','data','processed_data','gene_info.csv'))

## GO Annotations

To start, download the GO Annotations for your organism from AmiGO 2

1. Go to [AmiGO 2](http://amigo.geneontology.org/amigo/search/annotation)
1. Filter for your organism
1. Click ``CustomDL``
1. Drag ``GO class (direct)`` to the end of your Selected Fields
1. Enter the location of your GO annotation file below and run the following code block

In [None]:
go_file = os.path.join('..','data','external','GO_annotations.txt')

In [None]:
DF_GO = pd.read_csv(go_file,sep='\t',header=None,usecols=[2,17])
DF_GO.columns = ['gene_name','gene_ontology']
DF_GO.head()

Convert the gene names to gene locus tags, and drop gene names that cannot be converted

In [None]:
name2num = {v:k for k,v in DF_annot.gene_name.to_dict().items()}

In [None]:
DF_GO['gene_id'] = [name2num[x] if x in name2num.keys() else None for x in DF_GO.gene_name]

In [None]:
DF_GO.head()

Now we remove null entries

In [None]:
DF_GO = DF_GO[DF_GO.gene_id.notnull()]

In [None]:
DF_GO.head()
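
One detail of the name-to-locus-tag conversion above is worth checking: inverting a dictionary silently keeps a single locus tag when a gene name occurs more than once. The sketch below uses a made-up mapping (all tags are hypothetical) to show that behaviour and a simple way to list ambiguous names.

```python
# Toy locus_tag -> gene_name mapping; the tags are made up
gene_names = {'TAG_1': 'dnaA', 'TAG_2': 'dnaN', 'TAG_3': 'dnaA'}

# Inverting keeps only the last locus tag seen for a repeated name
name2tag = {name: tag for tag, name in gene_names.items()}

# Collect every locus tag per name to spot ambiguous conversions
tags_by_name = {}
for tag, name in gene_names.items():
    tags_by_name.setdefault(name, []).append(tag)
ambiguous = {n: t for n, t in tags_by_name.items() if len(t) > 1}

# Look up GO gene names, using None for names that cannot be converted
go_names = ['dnaA', 'recF']
converted = [name2tag.get(name) for name in go_names]
print(ambiguous, converted)
```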

Run the cell below to save the annotations

In [None]:
DF_GO[['gene_id','gene_name','gene_ontology']].to_csv(os.path.join('..','data','external','GO_annotations_curated.csv'))