# Programmatic Gene Ontology Guide
This guide uses GOAtools, which you need to install first through either pip (`pip install goatools`) or conda. With conda:
```bash
conda install -c bioconda goatools
```

I'll admit this isn't perfect; I'm still wrapping my head around how best to use it. But it should help if you need it!

## Libraries to import
Part one imports the libraries you need and sets up the databases: it downloads the ontology and the gene-to-GO associations, then loads both.

To get the taxonomy ID, search the [NCBI Taxonomy Browser](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=83332&lvl=3&lin=f&keep=1&srchmode=1&unlock). I already included MTb H37Rv (taxonomy ID 83332) for starters.

In [None]:
from __future__ import print_function
from goatools.base import download_go_basic_obo
from goatools.base import download_ncbi_associations
from goatools.obo_parser import GODag
from goatools.anno.genetogo_reader import Gene2GoReader

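# Download the ontology and the NCBI gene2go associations if they are not already present
obo_fname = download_go_basic_obo()
fin_gene2go = download_ncbi_associations()
# Load the ontology from the file that was just downloaded
obodag = GODag(obo_fname)
# Set taxids to the taxonomy ID of the organism being tested (83332 = MTb H37Rv)
objanno = Gene2GoReader(fin_gene2go, taxids=[83332])
# Associations grouped by namespace (BP, MF, CC)
ns2assoc = objanno.get_ns2assc()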
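
For orientation, `ns2assoc` is a dictionary keyed by namespace (`BP`, `MF`, `CC`), where each value maps a GeneID to a set of GO IDs. The sketch below is a quick sanity check and not part of GOAtools: `summarize_ns2assoc` is a hypothetical helper, and the toy dictionary uses made-up IDs so the block runs without downloading anything.

```python
def summarize_ns2assoc(ns2assoc):
    """Return {namespace: (number of annotated genes, number of distinct GO terms)}."""
    summary = {}
    for namespace, gene2gos in ns2assoc.items():
        go_terms = set()
        for gos in gene2gos.values():
            go_terms.update(gos)
        summary[namespace] = (len(gene2gos), len(go_terms))
    return summary


# Toy stand-in with made-up IDs, shaped like the real ns2assoc
toy_ns2assoc = {
    "BP": {1: {"GO:0000001", "GO:0000002"}, 2: {"GO:0000002"}},
    "MF": {1: {"GO:0000003"}},
    "CC": {},
}

for namespace, (n_genes, n_terms) in sorted(summarize_ns2assoc(toy_ns2assoc).items()):
    print(f"{namespace}: {n_genes} annotated genes, {n_terms} distinct GO terms")
```

With the real data you would call `summarize_ns2assoc(ns2assoc)`. If every count comes back as zero, the taxonomy ID is the first thing to double-check.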

## Worst. Step. Ever.
Now this next step is frankly a royal pain. It took me far too long to figure out, since their documentation is a mess to dig through. You'll need to download the protein-coding gene names for the associations. To start with:

Go to [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene/)
```
Text in 'Search':
genetype protein coding[Properties] AND "83332"[Taxonomy ID] AND alive[property] 
```
Download the output as a tab delimited txt file.

Then you'll have to use the script included in GOAtools to convert that txt file into a Python file with the appropriate structure.
```bash
python scripts/ncbi_gene_results_to_python.py -i gene_result.txt -o gene_result.py
```
I wrote the path as it appears in the GOAtools repo, so change it to wherever the script actually lives on your machine. You might also need to install another dependency or two.
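
If you want to check the download before converting it, something like the following can pull the gene IDs straight out of the tab-delimited export. This is only a sketch: `read_gene_ids` is a hypothetical helper, the `GeneID` column name is an assumption about the export's header row, and the demo rows are made up.

```python
import csv
import io


def read_gene_ids(handle, column="GeneID"):
    """Collect integer gene IDs from an open tab-delimited NCBI Gene export."""
    gene_ids = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        value = (row.get(column) or "").strip()
        if value.isdigit():  # skip blank or malformed cells
            gene_ids.add(int(value))
    return gene_ids


# Made-up two-row export, only to show the assumed shape of the file
demo_export = io.StringIO(
    "tax_id\tOrg_name\tGeneID\tSymbol\n"
    "83332\tExample organism\t111\tgeneA\n"
    "83332\tExample organism\t222\tgeneB\n"
)
demo_ids = read_gene_ids(demo_export)
print(sorted(demo_ids))
```

For the real file, open it with `open("gene_result.txt", newline="")` and pass the handle in. An empty result usually means the header differs from what is assumed here.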

Royal. Pain. In. The. Tookus. 

In [None]:
from gene_result import GENEID2NT as GeneID2nt

Now that you have the file, you can import it as shown above.

## Structuring the GOEA test
This step is where you actually build the test object, which you can reuse for any gene list from the same organism. The default significance cutoff is 0.05 (`alpha = 0.05`), and the p-values are then corrected for multiple testing with Benjamini-Hochberg FDR (`fdr_bh`).

In [None]:
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
goeaobj = GOEnrichmentStudyNS(
        GeneID2nt.keys(), # List of protein-coding genes
        ns2assoc, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # default multipletest correction method
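
To make the `fdr_bh` setting less of a black box, here is a plain-Python illustration of the Benjamini-Hochberg adjustment. It is a sketch of the procedure, not the code GOAtools runs internally; `benjamini_hochberg` is a hypothetical helper and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value
    # never exceeds the adjusted value of a larger p-value.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


toy_pvalues = [0.01, 0.04, 0.03, 0.005]  # made-up values
toy_adjusted = benjamini_hochberg(toy_pvalues)
toy_significant = [p_adj < 0.05 for p_adj in toy_adjusted]
print(toy_adjusted)
print(toy_significant)
```

Each raw p-value is scaled by (number of tests / rank), so a GO term has to clear a stricter bar the more terms are tested. The adjusted values play the same role as `p_fdr_bh` in the results.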

## Running the GOEA test
Beautiful! You can now reuse it as often as needed. You just need to run whatever gene list you have for enrichment. It's labelled `geneids_list` below and has to be defined before this cell runs, but you can rename the variable as needed. The cell first runs the study, then keeps only the significant results, and finally writes them out as a tab-delimited file.

In [None]:
goea_results_all = goeaobj.run_study(geneids_list, prt=None)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]
goeaobj.wr_txt("output_GO.txt", goea_results_sig)
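
Significant results can be either over-represented or under-represented in your list, so it can help to split them. The sketch below assumes each result record carries `enrichment` (`"e"` for enriched, `"p"` for purified), `p_fdr_bh`, and `GO` attributes, which is my understanding of the GOAtools records; check against your installed version. `split_by_direction` and `ToyRecord` are hypothetical stand-ins with made-up values so the block runs on its own.

```python
from collections import namedtuple


def split_by_direction(records, alpha=0.05):
    """Split significant records into (enriched, purified) lists."""
    enriched = [r for r in records if r.p_fdr_bh < alpha and r.enrichment == "e"]
    purified = [r for r in records if r.p_fdr_bh < alpha and r.enrichment == "p"]
    return enriched, purified


# Stand-in for real result records, made-up values
ToyRecord = namedtuple("ToyRecord", "GO enrichment p_fdr_bh")
toy_results = [
    ToyRecord("GO:0000001", "e", 0.001),
    ToyRecord("GO:0000002", "p", 0.020),
    ToyRecord("GO:0000003", "e", 0.300),
]

toy_enriched, toy_purified = split_by_direction(toy_results)
print([r.GO for r in toy_enriched])
print([r.GO for r in toy_purified])
```

With real output you would pass `goea_results_all` in place of `toy_results` and write each list out separately.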

For making your lists, you can use something like the genes common to two runs, or all significant genes for a sample, and so on.
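
As a minimal sketch of the "common genes between two runs" case, plain Python sets are enough. The IDs below are made up; the one real constraint is that they are integer NCBI GeneIDs, so they match the keys used in the associations.

```python
# Made-up GeneIDs standing in for the significant genes from two runs
run_a_hits = {101, 102, 103, 104}
run_b_hits = {103, 104, 105}

common = run_a_hits & run_b_hits   # genes found in both runs
only_a = run_a_hits - run_b_hits   # genes unique to run A

geneids_list = sorted(common)
print(geneids_list)
print(sorted(only_a))
```

It may also be worth dropping any ID that is not in `GeneID2nt` before running the study, so the study list stays a subset of the population.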