# Creating the Gene Table
This notebook is copied from the [Pymodulon GitHub repository](https://github.com/SBRG/pymodulon/blob/master/docs/tutorials/creating_the_gene_table.ipynb)

## Get information from GFF files

In [1]:
from pymodulon.gene_util import *
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from os import path
from pathlib import Path  

First, download the FASTA and GFF files for your organism and its plasmids from NCBI.

Enter the location of all your GFF files here:

In [2]:
#gff_files = [os.path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/genomic.gff')]
gff_files = [os.path.join('/Users/louxuwen/Desktop/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/genomic.gff')]

#gene_file = path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/gene.csv') # Enter metadata filename here
#Kegg_file = path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/Kegg_ID.csv') # Enter metadata filename here


The following cell will convert all the GFF files into a single Pandas DataFrame for easy manipulation. Pseudogenes have multiple rows in a GFF file (one for each fragment), but only the first fragment will be kept.

In [3]:
keep_cols = ['accession','start','end','strand','gene_name','old_locus_tag','gene_product','ncbi_protein']

DF_annot = gff2pandas(gff_files,index='locus_tag')
DF_annot = DF_annot[keep_cols]

DF_annot

locus_tag,accession,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein
cg0001,BX927147.1,1.0,1575.0,+,dnaA,,CHROMOSOMAL REPLICATION INITIATOR PROTEIN,CAF18566.1
cg0002,BX927147.1,1594.0,1920.0,-,,,hypothetical protein predicted by Glimmer,CAF18567.1
cg0004,BX927147.1,2292.0,3476.0,+,dnaN,,DNA POLYMERASE III%2C BETA SUBUNIT,CAF18568.1
cg0005,BX927147.1,3585.0,4769.0,+,recF,,DNA REPAIR AND GENETIC RECOMBINATION PROTEIN,CAF18569.1
cg0006,BX927147.1,4814.0,5302.0,+,,,CONSERVED HYPOTHETICAL PROTEIN,CAF18570.1
...,...,...,...,...,...,...,...,...
cg3430,BX927147.1,3280996.0,3281295.0,-,,,conserved hypothetical protein,CAF19037.1
cg3431,BX927147.1,3281276.0,3281677.0,-,rnpA,,RNase P protein component,CAF19038.1
cg3432,BX927147.1,3281717.0,3281860.0,-,rpmH,,50S RIBOSOMAL PROTEIN L34,CAF19039.1
cg3433,BX927147.1,3282127.0,3282348.0,-,,,hypothetical protein predicted by Glimmer/Critica,CAF19040.1
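
The GFF handling described above can be illustrated with a small standard-library sketch. This is not the implementation of `gff2pandas`; the helper names `parse_gff_attributes` and `keep_first_fragment` are hypothetical, and the fragment rows are made up. It shows two ideas: GFF3 attribute values are URL-encoded (which is why `%2C` appears in `gene_product` above), and only the first row per locus tag is kept when a gene spans several rows.

```python
from urllib.parse import unquote


def parse_gff_attributes(attr_field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in attr_field.strip().split(';'):
        if '=' in pair:
            key, value = pair.split('=', 1)
            attrs[key] = unquote(value)  # '%2C' is decoded to ','
    return attrs


def keep_first_fragment(rows):
    """Keep only the first row seen for each locus_tag."""
    first = {}
    for row in rows:
        first.setdefault(row['locus_tag'], row)
    return list(first.values())


attrs = parse_gff_attributes(
    'locus_tag=cg0004;gene=dnaN;product=DNA POLYMERASE III%2C BETA SUBUNIT'
)

# Made-up rows: 'cgX' stands for a pseudogene split into two fragments
rows = [
    {'locus_tag': 'cgX', 'start': 10, 'end': 50},
    {'locus_tag': 'cgX', 'start': 60, 'end': 90},
    {'locus_tag': 'cgY', 'start': 100, 'end': 200},
]
unique_rows = keep_first_fragment(rows)
```

Decoding the attribute values this way would turn `DNA POLYMERASE III%2C BETA SUBUNIT` into a readable product name before it reaches the gene table.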


06/30 update: KEGG and NCBI each list two C. glutamicum entries. Using the `cgb` organism gives the correct locus tags and genes.

Since the microarray data use a different annotation, it may be necessary to add a column with the gene IDs they use, so that those IDs can be matched to the locus tags here.
1. In gene_ID, the order of the start and end positions also encodes the direction. The solution is to record the direction in a strand column, so that the start position is always smaller than the end position.
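
A minimal sketch of that normalization, assuming that gene_ID lists minus-strand genes with the larger coordinate first. The helper name `normalize_coordinates` is hypothetical.

```python
def normalize_coordinates(pos_a, pos_b):
    """Return (start, end, strand) so that start <= end.

    Assumption: when the first position is larger than the second,
    the gene lies on the minus strand.
    """
    if pos_a <= pos_b:
        return pos_a, pos_b, '+'
    return pos_b, pos_a, '-'


forward = normalize_coordinates(1, 1575)
reverse = normalize_coordinates(1920, 1594)
```

After this step the coordinates follow the same convention as the GFF-derived table, where `start` is smaller than `end` and the direction lives in `strand`.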

In [4]:
#gene_ID = pd.read_csv(gene_file, index_col=0).fillna(-1)
#Kegg_ID = pd.read_csv(Kegg_file, index_col=0,sep='\t')

In [5]:
#Kegg_ID

To ensure that the gene index used is identical to the expression matrix, load in your data.

In [6]:
log_tpm_file = os.path.join('/Users/louxuwen/Desktop/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/log_tpm.csv')

#log_tpm_file = os.path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/log_tpm.csv')
DF_log_tpm = pd.read_csv(log_tpm_file,index_col=0)
DF_log_tpm.head()

Unnamed: 0,GSM5197020_1,GSM5197020_2,GSM5197021_1,GSM5197021_2,GSM5197022_1,GSM5197022_2,GSM5197023_1,GSM5197023_2,GSM5197024_2,GSM5197025_2,...,GSM5197943_1,GSM5197943_2,GSM5197944_1,GSM5197944_2,GSM5197945_1,GSM5197945_2,GSM5197946_1,GSM5197946_2,GSM5197947_1,GSM5197947_2
cg0001,0.0,0.0,0.0,0.0,1540.986,2186.654,839.0,1872.0,1723.954,2059.085,...,1296.8247,826.9054,1150.1929,848.4191,362.9568,326.9718,366.9057,550.2436,416.097,444.4621
cg0002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,176.9562,227.258,238.8252,243.9302,0.0,0.0,0.0,0.0,0.0,0.0
cg0004,4711.377,6665.822,6738.429,6371.215,5917.581,6109.358,4897.064,8340.172,6469.821,6613.652,...,1994.0833,1810.2925,2091.9613,2049.7641,1391.4229,665.8254,1790.5143,1346.9563,1324.0,1327.0
cg0005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,457.0,714.0,...,363.3212,309.6552,402.8218,417.6308,0.0,0.0,0.0,0.0,178.0,192.0
cg0006,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,216.0,224.0,239.0,267.0,0.0,0.0,0.0,0.0,0.0,0.0


Check that the genes are the same in the expression dataset as in the annotation dataframe. Mismatched genes are listed below.

In [7]:
DF_gene = DF_annot.sort_index().index.tolist()
tpm_gene = DF_log_tpm.sort_index().index.tolist()

In [8]:
missing = []
for i in DF_gene:
    if i not in tpm_gene:
        missing.append(i)

In [9]:
#test = DF_annot.sort_index().index == DF_log_tpm.sort_index().index
# Some genes in DF_annot do not exist in DF_log_tpm, so drop them
DF_annot = DF_annot.drop(missing, axis=0)
DF_annot

locus_tag,accession,start,end,strand,gene_name,old_locus_tag,gene_product,ncbi_protein
cg0001,BX927147.1,1.0,1575.0,+,dnaA,,CHROMOSOMAL REPLICATION INITIATOR PROTEIN,CAF18566.1
cg0002,BX927147.1,1594.0,1920.0,-,,,hypothetical protein predicted by Glimmer,CAF18567.1
cg0004,BX927147.1,2292.0,3476.0,+,dnaN,,DNA POLYMERASE III%2C BETA SUBUNIT,CAF18568.1
cg0005,BX927147.1,3585.0,4769.0,+,recF,,DNA REPAIR AND GENETIC RECOMBINATION PROTEIN,CAF18569.1
cg0006,BX927147.1,4814.0,5302.0,+,,,CONSERVED HYPOTHETICAL PROTEIN,CAF18570.1
...,...,...,...,...,...,...,...,...
cg3430,BX927147.1,3280996.0,3281295.0,-,,,conserved hypothetical protein,CAF19037.1
cg3431,BX927147.1,3281276.0,3281677.0,-,rnpA,,RNase P protein component,CAF19038.1
cg3432,BX927147.1,3281717.0,3281860.0,-,rpmH,,50S RIBOSOMAL PROTEIN L34,CAF19039.1
cg3433,BX927147.1,3282127.0,3282348.0,-,,,hypothetical protein predicted by Glimmer/Critica,CAF19040.1
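
The comparison above only looks for annotated genes that are absent from the expression matrix. A set-based sketch can check both directions at once and avoids scanning a list for every gene. The gene lists below are toy values, not the real indices.

```python
# Toy gene lists standing in for DF_annot.index and DF_log_tpm.index
annot_genes = ['cg0001', 'cg0002', 'cg0003', 'cg0004']
expr_genes = ['cg0001', 'cg0002', 'cg0004', 'cg0005']

# Annotated genes that have no expression data
missing_from_expr = sorted(set(annot_genes) - set(expr_genes))

# Expression rows that have no annotation
missing_from_annot = sorted(set(expr_genes) - set(annot_genes))
```

With pandas indices the same result is available through `DF_annot.index.difference(DF_log_tpm.index)` and its reverse. If the second list is not empty, the expression matrix contains genes that will be missing from the gene table.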


In [11]:
from pathlib import Path  
filepath = Path('/Users/louxuwen/Desktop//Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene_info.csv')  

#filepath = Path('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene_info.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
DF_annot.to_csv(filepath)  

## (Optional) KEGG and COGs

### Generate nucleotide fasta files for CDS

Enter the location of all your FASTA files in the cell below.

KEGG links for this organism:
https://www.genome.jp/brite/br08601+cgb
https://rest.kegg.jp/link/cgl/pathway

In [None]:
fasta_files = [os.path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene/genomic.fna')]
#fasta_files = Path('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/Gene/genomic.fna')

fasta_files


The following code generates CDS records from your FASTA and GFF files.

In [None]:
from Bio import SeqIO

cds_list = []
for fasta in fasta_files:
    seq = SeqIO.read(fasta,'fasta')

    # Get gene information for genes in this fasta file
    df_genes = DF_annot[DF_annot.accession == seq.id]
    
    for i,row in df_genes.iterrows():
        # Cast to int because the coordinates are stored as floats
        cds = seq[int(row.start)-1:int(row.end)]
        if row.strand == '-':
            cds = cds.reverse_complement()
        cds.id = row.name
        cds.description = row.gene_name if pd.notnull(row.gene_name) else row.name
        cds_list.append(cds)

In [None]:
cds_list[:5]
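
The coordinate handling in the CDS loop can be checked on a toy sequence without Biopython. GFF intervals are 1-based and inclusive, so `start - 1` converts to Python's 0-based slicing while `end` is used unchanged. The helper name `extract_cds` and the toy genome are made up for illustration.

```python
# Translation table used to complement DNA bases
_COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def extract_cds(genome, start, end, strand):
    """Slice a 1-based, inclusive interval out of a genome string."""
    start, end = int(start), int(end)  # coordinates may arrive as floats
    fragment = genome[start - 1:end]
    if strand == '-':
        # Reverse complement for genes on the minus strand
        fragment = fragment.translate(_COMPLEMENT)[::-1]
    return fragment


genome = 'AAATGCCCTTT'
plus = extract_cds(genome, 3.0, 6.0, '+')
minus = extract_cds(genome, 3.0, 6.0, '-')
```

Here positions 3 to 6 cover `ATGC`, and the minus-strand version of the same interval is `GCAT`.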

Save the CDS file

In [None]:
cds_file = os.path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/gene','CDS.fna')
SeqIO.write(cds_list, cds_file, 'fasta')

### Run EggNOG Mapper
1. Go to http://eggnog-mapper.embl.de/.
1. Upload the CDS.fna file from your organism directory (within the sequence_files folder)
1. Make sure to limit the taxonomy to the correct level
1. After the job is submitted, you must follow the link in your email to run the job.
1. Once the job completes (after ~4 hrs), download the annotations file.
1. Save the annotation file

06/30 update: done and complete

### Get KEGG IDs

Once you have the EggNOG annotations, load the annotation file

In [12]:
eggnog_file = os.path.join('/Users/louxuwen/Desktop/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/annotations.tsv')
#eggnog_file = os.path.join('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/annotations.tsv')

In [13]:
DF_eggnog = pd.read_csv(eggnog_file,sep='\t',skiprows=5,header=None) #changed 4 to 5 due to version update
eggnog_cols = ['query_name','seed eggNOG ortholog','seed ortholog evalue','seed ortholog score',
               'eggNOG OGs','Max Annotation Level','COG',
               'Description', 'Gene Name', 'GOs','EC number',
               'KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction',
               'KEGG_rclass','BRITE','KEGG_TC','CAZy','BiGG Reaction','PFAMs']  
# Column changes relative to the previous eggNOG output format:
# "predicted taxonomic group" -> "eggNOG OGs" (orthologous groups)
# "predicted protein name" -> "Max Annotation Level"
# "Gene Ontology terms" -> "COG" (COG category)
# "EC number" -> "Description"
# added "Gene Name"
# added "GOs"
# "tax_scope" -> "PFAMs"; the PFAMs database is now InterPro. Not sure how useful this column is
# deleted "eggNOG OGs" (it is in a previous column) and "bestOF _deprecated"
# deleted "COG" (it has been added in other places)

# Deleted the 'eggNOG free text description' column at the end due to a bug

DF_eggnog.columns = eggnog_cols

# Strip last three rows as they are comments
DF_eggnog = DF_eggnog.iloc[:-3]

# Set locus tag as index
DF_eggnog = DF_eggnog.set_index('query_name')
DF_eggnog.index.name = 'locus_tag'

DF_eggnog.head()

locus_tag,seed eggNOG ortholog,seed ortholog evalue,seed ortholog score,eggNOG OGs,Max Annotation Level,COG,Description,Gene Name,GOs,EC number,KEGG_orth,KEGG_pathway,KEGG_module,KEGG_reaction,KEGG_rclass,BRITE,KEGG_TC,CAZy,BiGG Reaction,PFAMs
cg0001,196627.cg0001,0.0,998.0,"COG0593@1|root,COG0593@2|Bacteria,2GJKI@201174...",201174|Actinobacteria,L,it binds specifically double-stranded DNA at a...,dnaA,"GO:0000166,GO:0003674,GO:0003676,GO:0003677,GO...",-,ko:K02313,"ko02020,ko04112,map02020,map04112",-,-,-,"ko00000,ko00001,ko03032,ko03036",-,-,-,"Bac_DnaA,Bac_DnaA_C"
cg0004,196627.cg0004,5.28e-282,771.0,"COG0592@1|root,COG0592@2|Bacteria,2GJK3@201174...",201174|Actinobacteria,L,Confers DNA tethering and processivity to DNA ...,dnaN,"GO:0005575,GO:0005576,GO:0005618,GO:0005623,GO...",2.7.7.7,ko:K02338,"ko00230,ko00240,ko01100,ko03030,ko03430,ko0344...",M00260,"R00375,R00376,R00377,R00378",RC02795,"ko00000,ko00001,ko00002,ko01000,ko03032,ko03400",-,-,-,"DNA_pol3_beta,DNA_pol3_beta_2,DNA_pol3_beta_3"
cg0005,196627.cg0005,2.06e-279,764.0,"COG1195@1|root,COG1195@2|Bacteria,2GJCS@201174...",201174|Actinobacteria,L,it is required for DNA replication and normal ...,recF,"GO:0000731,GO:0005575,GO:0005622,GO:0005623,GO...",-,ko:K03629,"ko03440,map03440",-,-,-,"ko00000,ko00001,ko03400",-,-,-,SMC_N
cg0006,196627.cg0006,1.9e-115,331.0,"COG5512@1|root,COG5512@2|Bacteria,2GNQ4@201174...",201174|Actinobacteria,S,"Zn-ribbon-containing, possibly RNA-binding pro...",-,-,-,-,-,-,-,-,-,-,-,-,DUF721
cg0007,196627.cg0007,0.0,1348.0,"COG0187@1|root,COG0187@2|Bacteria,2GKGP@201174...",201174|Actinobacteria,L,A type II topoisomerase that negatively superc...,gyrB,"GO:0003674,GO:0003824,GO:0003916,GO:0003918,GO...",5.99.1.3,ko:K02470,-,-,-,-,"ko00000,ko01000,ko03032,ko03400",-,-,-,"DNA_gyraseB,DNA_gyraseB_C,HATPase_c,Toprim"
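
The hard-coded `skiprows=5` and `iloc[:-3]` above depend on how many comment lines a given eggNOG-mapper version writes. A version-independent sketch could skip comments by their prefix instead. It assumes that metadata lines start with `##` and that the header row starts with a single `#`, which should be checked against your own annotation file. The function name `read_eggnog` and the sample text are hypothetical.

```python
import io


def read_eggnog(handle):
    """Return (columns, rows) from an eggNOG-mapper style annotation stream."""
    columns, rows = None, []
    for line in handle:
        line = line.rstrip('\n')
        if not line or line.startswith('##'):
            continue  # blank line or metadata comment
        if line.startswith('#'):
            # Header row: drop the leading '#' and split on tabs
            columns = line.lstrip('#').split('\t')
            continue
        rows.append(line.split('\t'))
    return columns, rows


# Made-up three-column excerpt in the assumed layout
sample = (
    '## made-up metadata line\n'
    '#query\tseed_ortholog\tCOG_category\n'
    'cg0001\t196627.cg0001\tL\n'
    'cg0004\t196627.cg0004\tL\n'
    '## 2 queries scanned\n'
)
columns, rows = read_eggnog(io.StringIO(sample))
```

Reading the column names from the file itself also removes the need to maintain `eggnog_cols` by hand when the output format changes.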


Now we will pull the KEGG information from the eggNOG file, including orthology, pathway, module, and reactions for each gene.

In [14]:
DF_kegg = DF_eggnog[['KEGG_orth','KEGG_pathway','KEGG_module','KEGG_reaction']]

# Melt dataframe
DF_kegg = DF_kegg.reset_index().melt(id_vars='locus_tag') 

# Remove null values
DF_kegg = DF_kegg[DF_kegg.value.notnull()]

# Split comma-separated values into their own rows
list2struct = []
for name,row in DF_kegg.iterrows():
    for val in row.value.split(','):
        list2struct.append([row.locus_tag,row.variable,val])

DF_kegg = pd.DataFrame(list2struct,columns=['gene_id','database','kegg_id'])

# Remove ko entries, as only map entries are searchable in KEGG pathway
DF_kegg = DF_kegg[~DF_kegg.kegg_id.str.startswith('ko')]

DF_kegg.head()

Unnamed: 0,gene_id,database,kegg_id
3,cg0006,KEGG_orth,-
5,cg0008,KEGG_orth,-
7,cg0010,KEGG_orth,-
8,cg0012,KEGG_orth,-
9,cg0013,KEGG_orth,-
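
The preview above contains only `-` values, which points to two possible issues in the cell. First, this eggNOG version writes `-` rather than a null for missing annotations, so `notnull()` does not remove them. Second, `startswith('ko')` also matches orthology IDs such as `ko:K02313`, not just `ko` pathway IDs. The sketch below is one way to handle both, assuming the orthology IDs are meant to be kept. The function name `tidy_kegg` is hypothetical, and the two records are taken from the eggNOG preview shown earlier.

```python
KEGG_COLUMNS = ('KEGG_orth', 'KEGG_pathway', 'KEGG_module', 'KEGG_reaction')


def tidy_kegg(records, columns=KEGG_COLUMNS):
    """Return (gene_id, database, kegg_id) tuples from per-gene records."""
    out = []
    for gene_id, fields in records.items():
        for database in columns:
            value = fields.get(database)
            if not value or value == '-':
                continue  # '-' marks a missing annotation
            for kegg_id in value.split(','):
                # Only pathways are restricted to map IDs
                if database == 'KEGG_pathway' and kegg_id.startswith('ko'):
                    continue
                out.append((gene_id, database, kegg_id))
    return out


records = {
    'cg0001': {'KEGG_orth': 'ko:K02313',
               'KEGG_pathway': 'ko02020,ko04112,map02020,map04112',
               'KEGG_module': '-', 'KEGG_reaction': '-'},
    'cg0006': {'KEGG_orth': '-', 'KEGG_pathway': '-',
               'KEGG_module': '-', 'KEGG_reaction': '-'},
}
tidy = tidy_kegg(records)
```

With these inputs, `cg0006` contributes no rows and `cg0001` keeps its orthology ID plus the two `map` pathway IDs.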


### Save KEGG information

In [15]:
from pathlib import Path
filepath = Path('/Users/louxuwen/Desktop/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/kegg_mapping.csv')  

#filepath = Path('/home/amy/Documents/GitHub/modulome-C_Glutamicum_Microarray_clean/7_characterizing_imodulons/Data/kegg_mapping.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
DF_kegg.to_csv(filepath)  

### Save COGs to annotation dataframe

In [16]:
DF_annot['COG'] = DF_eggnog.COG

# Make sure COG only has one entry per gene
DF_annot['COG'] = [item[0] if isinstance(item,str) else item for item in DF_annot['COG']]

## Uniprot ID mapping

The ``uniprot_id_mapping`` function is a Python wrapper for the [UniProt ID mapping tool](https://www.uniprot.org/uploadlists/). Use ``input_id=P_REFSEQ_AC`` if the FASTA/GFF files are from RefSeq, and ``input_id=EMBL`` if the files are from GenBank.

06/30: The mapping ran to completion but returned no results.

07/03: The service could not be reached through `uniprot_id_mapping`. Querying the UniProt REST API directly by URL returned ~20 hits.

In [None]:

mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='P_REFSEQ_AC',output_id='ACC', input_name='ncbi_protein',output_name='uniprot')
mapping_uniprot.head()

In [34]:
#mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='EMBL',output_id='ACC',
#                             input_name='ncbi_protein',output_name='uniprot')
#mapping_uniprot.head()


#mapping_uniprot = uniprot_id_mapping(DF_annot.ncbi_protein.fillna(''),input_id='P_REFSEQ_AC',output_id='ACC',
#                                     input_name='ncbi_protein',output_name='uniprot')
#mapping_uniprot.head()

import json
import requests
import time

URL = 'https://rest.uniprot.org/idmapping'
#IDS = DF_annot.index.fillna('').values.tolist()
IDS = DF_annot.ncbi_protein.fillna('').values.tolist()


params = {
   'from': 'EMBL-GenBank-DDBJ_CDS',
   'to': 'UniProtKB',
   'ids': ' '.join(IDS)
}

response = requests.post(f'{URL}/run', params)

job_id = response.json()['jobId']

# Make up to three attempts to get the results, re-querying the status each time
for i in range(3):
    job_status = requests.get(f'{URL}/status/{job_id}')
    d = job_status.json()
    if d.get('jobStatus') == 'FINISHED' or d.get('results'):
        job_results = requests.get(f'{URL}/results/{job_id}')
        results = job_results.json()
        for obj in results['results']:
            print(f'{obj["from"]}\t{obj["to"]}')
        break
    time.sleep(1)
print(IDS)
print(response.json())
print(job_status.json())

CAF18566.1	Q8NUD8
CAF18569.1	Q6M8X7
CAF18570.1	Q8NUD4
CAF18580.1	Q6M8W7
CAF18608.1	Q8NU99
CAF18609.1	Q8NU98
CAF18610.1	Q8NU97
CAF18640.1	P46396
CAF18652.1	Q9RHM6
CAF18653.1	Q79VJ4
CAF18654.1	Q79VJ3
CAF18655.1	Q9RHM3
CAF18656.1	Q79VJ2
CAF18657.1	Q79VJ1
CAF18658.1	Q79VJ0
CAF18681.1	Q9X713
CAF18682.1	Q9X712
CAF18684.1	Q8NU33
CAF18702.1	Q9X4N0
CAF18731.1	Q8NTY7
CAF18758.1	Q8NTW4
CAF18782.1	Q8NTU1
CAF18789.1	Q8NTT4
CAF18807.1	Q8NTR6
CAF18814.1	Q8NTQ9
['CAF18566.1', 'CAF18567.1', 'CAF18568.1', 'CAF18569.1', 'CAF18570.1', 'CAF18571.1', 'CAF18572.1', 'CAF18573.1', 'CAF18574.1', 'CAF18575.1', 'CAF18576.1', 'CAF18577.1', 'CAF18578.1', 'CAF18579.1', 'CAF18580.1', 'CAF18581.1', 'CAF18582.1', 'CAF18583.1', 'CAF18584.1', 'CAF18585.1', 'CAF18586.1', 'CAF18587.1', 'CAF18588.1', 'CAF18589.1', 'CAF18590.1', 'CAF18591.1', 'CAF18592.1', 'CAF18593.1', 'CAF18594.1', 'CAF18595.1', 'CAF18596.1', 'CAF18597.1', 'CAF18598.1', 'CAF18599.1', 'CAF18600.1', 'CAF18601.1', 'CAF18602.1', 'CAF18603.1', 'CAF18605.1', 'CA
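
For larger ID lists, three one-second attempts may not be enough for the job to finish. A slightly more general polling helper is sketched below. It takes the status call as a function so that it can be tried without network access. The `jobStatus` key reflects my understanding of the UniProt REST response and should be verified against the current API documentation. The name `wait_for_job` is hypothetical.

```python
import time


def wait_for_job(fetch_status, max_attempts=10, delay=1.0, sleep=time.sleep):
    """Call fetch_status() until the job is done.

    Returns the final payload, or None if the job did not finish in time.
    """
    for _ in range(max_attempts):
        payload = fetch_status()
        if payload.get('jobStatus') == 'FINISHED' or payload.get('results'):
            return payload
        sleep(delay)
    return None


# Stand-in for requests.get(f'{URL}/status/{job_id}').json()
_replies = iter([
    {'jobStatus': 'RUNNING'},
    {'jobStatus': 'RUNNING'},
    {'jobStatus': 'FINISHED'},
])
final = wait_for_job(lambda: next(_replies), sleep=lambda seconds: None)

# A job that never finishes gives None after max_attempts
timed_out = wait_for_job(lambda: {'jobStatus': 'RUNNING'},
                         max_attempts=2, sleep=lambda seconds: None)
```

In the notebook, `fetch_status` could be `lambda: requests.get(f'{URL}/status/{job_id}').json()`.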

In [33]:
# Merge with current annotation
DF_annot = pd.merge(DF_annot.reset_index(),mapping_uniprot,how='left',on='ncbi_protein')
DF_annot.set_index('locus_tag',inplace=True)
DF_annot.head()

NameError: name 'mapping_uniprot' is not defined
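
The `NameError` occurs because every call that assigns `mapping_uniprot` is commented out in the REST-based cell, so the merge has nothing to work with. One way to close the gap is to turn the REST results into a lookup first. The sketch below also follows result pages, since the results endpoint appears to be paginated and the 25 hits printed above may be only the first page; that is an assumption to verify. The function name `collect_mapping` and the two fake pages are hypothetical.

```python
def collect_mapping(fetch_page, first_url):
    """Follow paginated results and return {from_id: to_id}.

    fetch_page(url) must return (list_of_result_dicts, next_url_or_None).
    """
    mapping, url = {}, first_url
    while url:
        results, url = fetch_page(url)
        for item in results:
            # Keep the first hit when an ID maps to several accessions
            mapping.setdefault(item['from'], item['to'])
    return mapping


# Fake two-page response standing in for the REST API
_pages = {
    'page1': ([{'from': 'CAF18566.1', 'to': 'Q8NUD8'}], 'page2'),
    'page2': ([{'from': 'CAF18569.1', 'to': 'Q6M8X7'}], None),
}
mapping = collect_mapping(_pages.get, 'page1')
```

With `requests`, `fetch_page` could return `(r.json()['results'], r.links.get('next', {}).get('url'))` for a response `r`. The table expected by the merge cell could then be built as `mapping_uniprot = pd.DataFrame(list(mapping.items()), columns=['ncbi_protein', 'uniprot'])`.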

## Add Biocyc Operon information

To obtain operon information from Biocyc, follow the steps below

1. Go to [Biocyc.org](https://biocyc.org/) (you may need to create an account and/or login)
2. Change the organism database to your organism/strain
3. Select **SmartTables** -> **Special SmartTables**
4. Select **"All genes of \<organism\>"**
5. Select the **"Gene Name"** column
6. Under **"ADD TRANSFORM COLUMN"** select **"Genes in same transcription unit"**
7. Select the **"Genes in same transcription unit"** column
8. Under **"ADD PROPERTY COLUMN"** select **"Accession-1"**
9. Under **OPERATIONS**, select **"Export"** -> **"to Spreadsheet File..."**
10. Select **"common names"** and click **"Export smarttable"**
11. Add file location below and run the code cell

Database used here: Corynebacterium glutamicum DSM 20300 = ATCC 13032, version 27.0 (Tier 3 Uncurated Database).

In [None]:
biocyc_file = os.path.join('..','data','external','biocyc_annotations.txt')

DF_biocyc = pd.read_csv(biocyc_file,sep='\t')

# Remove genes with no accession
DF_biocyc = DF_biocyc[DF_biocyc['Accession-1'].notnull()]

# Set the accession (i.e. locus tag) as index
DF_biocyc = DF_biocyc.set_index('Accession-1').sort_values('Left-End-Position')

# Specific for B. subtilis: Fix locus tags
DF_biocyc.index = DF_biocyc.index.str.replace('BSU','BSU_')

# Only keep genes in the final annotation file
DF_biocyc = DF_biocyc.reindex(DF_annot.index)

# Reformat transcription units
DF_biocyc['operon_list'] = DF_biocyc['Accession-1.1'].apply(reformat_biocyc_tu)

# Fill None with locus tags
DF_biocyc['operon_list'] = DF_biocyc['operon_list'].fillna(DF_biocyc.index.to_series())

DF_biocyc.head()

### Assign unique IDs to operons

The following code assigns unique names to each operon

In [None]:
# Get all operons
operons = DF_biocyc['operon_list'].unique()

# Map each operon to a unique string
operon_dict = {operon: "Op"+str(i) for i, operon in enumerate(operons)}

# Add names to dataframe
DF_biocyc['operon'] = [operon_dict[op] for op in DF_biocyc["operon_list"]]

DF_biocyc.head()

Finally, merge the Biocyc information with the main annotation DataFrame

In [None]:
DF_annot['operon'] = DF_biocyc['operon']

## Clean up and save annotation

First, we will re-order the annotation columns

In [None]:
if 'old_locus_tag' in DF_annot.columns:
    order = ['gene_name','accession','old_locus_tag','start','end','strand','gene_product','COG','uniprot','operon']
else:
    order = ['gene_name','accession','start','end','strand','gene_product','COG','uniprot','operon']
    
DF_annot = DF_annot[order]

In [None]:
DF_annot.head()

## Final statistics

The following graphs show how much information is available for the organism.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('ticks')

In [None]:
fig,ax = plt.subplots()
DF_annot.count().plot(kind='bar',ax=ax)
ax.set_ylabel('# of Values',fontsize=18)
ax.tick_params(labelsize=16)

## Fill missing values

Some genes have no gene name, so their names will be filled in with the locus tag.

In [None]:
# Fill in missing gene names with locus tag names
DF_annot['gene_name'] = DF_annot['gene_name'].fillna(DF_annot.index.to_series())

COG letters will also be converted to the full name.

In [None]:
# Fill missing COGs with X
DF_annot['COG'] = DF_annot['COG'].fillna('X')

# Change single letter COG annotation to full description
DF_annot['COG'] = DF_annot.COG.apply(cog2str)

counts = DF_annot.COG.value_counts()
plt.pie(counts.values,labels=counts.index);

Run the following line to save the gene annotation dataset.

In [None]:
DF_annot.to_csv(os.path.join('..','data','processed_data','gene_info.csv'))

## GO Annotations

To start, download the GO Annotations for your organism from AmiGO 2

1. Go to [AmiGO 2](http://amigo.geneontology.org/amigo/search/annotation)
1. Filter for your organism
1. Click ``CustomDL``
1. Drag ``GO class (direct)`` to the end of your Selected Fields
1. Enter the location of your GO annotation file below and run the following code block

In [None]:
go_file = os.path.join('..','data','external','GO_annotations.txt')

In [None]:
DF_GO = pd.read_csv(go_file,sep='\t',header=None,usecols=[2,17])
DF_GO.columns = ['gene_name','gene_ontology']
DF_GO.head()

Convert the gene names to gene locus tags, and drop gene names that cannot be converted

In [None]:
name2num = {v:k for k,v in DF_annot.gene_name.to_dict().items()}

In [None]:
DF_GO['gene_id'] = [name2num.get(x) for x in DF_GO.gene_name]

In [None]:
DF_GO.head()
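
Inverting `gene_name` into a name-to-locus-tag dictionary silently keeps only one locus tag when two genes share a name. The sketch below separates unambiguous names from ambiguous ones so that the latter can be reviewed. The function name `invert_unique` is hypothetical, and the `tnp` entries with tags `cg9998` and `cg9999` are made up.

```python
from collections import defaultdict


def invert_unique(tag_to_name):
    """Split names into a one-to-one lookup and a dict of ambiguous names."""
    groups = defaultdict(list)
    for tag, name in tag_to_name.items():
        groups[name].append(tag)
    unique = {name: tags[0] for name, tags in groups.items() if len(tags) == 1}
    ambiguous = {name: tags for name, tags in groups.items() if len(tags) > 1}
    return unique, ambiguous


tag_to_name = {
    'cg0001': 'dnaA',
    'cg0004': 'dnaN',
    'cg9998': 'tnp',  # made-up duplicate name
    'cg9999': 'tnp',  # made-up duplicate name
}
unique, ambiguous = invert_unique(tag_to_name)
```

Using `unique` in place of `name2num` would leave GO terms for ambiguous names unassigned rather than attaching them to an arbitrary locus tag.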

Now we remove null entries

In [None]:
DF_GO = DF_GO[DF_GO.gene_id.notnull()]

In [None]:
DF_GO.head()

Run the line below to save the annotations.

In [None]:
DF_GO[['gene_id','gene_name','gene_ontology']].to_csv(os.path.join('..','data','external','GO_annotations_curated.csv'))