Aidan Coyle, afcoyle@uw.edu

2021/01/20

Roberts lab at SAFS

### Turning UniProt accessions into GO terms

This notebook calls a Bash script that Sam wrote, available [here](https://github.com/RobertsLab/code/blob/master/script-box/uniprot2go.sh). I copied it into my /scripts/ directory, where it's named 04_uniprot2go.sh.

04_uniprot2go.sh takes a newline-separated file of UniProt accessions and outputs a file in the following format: `UniProt_accession<tab>GOID1;GOID2;...;GOIDn`
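To make the output format concrete, here is a minimal Python sketch that reads lines of that shape into a dictionary. The function name `parse_uniprot2go` and the example rows are made up for illustration, and it assumes the GO IDs are written like `GO:0005737`; it is not part of Sam's script.

```python
from typing import Dict, Iterable, List


def parse_uniprot2go(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse 'accession<TAB>GOID1;GOID2' lines into {accession: [GO IDs]}."""
    table: Dict[str, List[str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Everything before the first tab is the accession,
        # everything after it is the semicolon-separated GO field.
        accession, _, go_field = line.partition("\t")
        table[accession] = [g.strip() for g in go_field.split(";") if g.strip()]
    return table


# Made-up rows in the expected format (not real annotations)
example = [
    "Q9Y6K9\tGO:0005737;GO:0006955\n",
    "P12345\tGO:0005739\n",
]
parsed = parse_uniprot2go(example)
```

In practice you would pass an open file handle instead of the `example` list, e.g. one of the `*_All_GOIDs.txt` files produced below.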

This is one of two methods I set up to produce a two-column, tab-separated file of UniProt accession IDs and GO terms. The other is shown in 03_uniprot_to_GO_method1.R.

Upside of this method: 
- Does not require a database to be downloaded manually
- GO terms are always the most up-to-date

Upside of using 03_uniprot_to_GO_method1.R instead:
- Much faster. It gets GO terms in a matter of minutes, whereas this method takes days for a 65,000-line file

I suggest using this script for smaller lists of accessions, and the R script when processing large files.

### If you use the R script, you will need to finish processing in this script. The place to begin final processing is marked below with "IF USING 03_UNIPROT_TO_GO_METHOD1.R, START HERE"


In [1]:
pwd

'/mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/scripts'

### First, we'll get all GO terms for all genes. This will take a long time: on a laptop, each run takes 1-2 days

In [4]:
# Get all GO terms for all genes in ambient vs. low treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/Amb_vsLow_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/Amb_vsLow_All_GOIDs.txt

6 accessions were not processed.
Please see: failed_accessions.txt


In [None]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/Amb_vsLow_AllGOIDs_failed_accessions.txt

In [12]:
# Get all GO terms for all genes in day 0 vs. day 17 ambient treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/day0_day17_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/day0_day17_All_GOIDs.txt

3 accessions were not processed.
Please see: failed_accessions.txt


In [None]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/day0_day17_AllGOIDs_failed_accessions.txt

In [1]:
# Get all GO terms for all genes in elevated vs. ambient treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/Elev_vsAmb_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/Elev_vsAmb_All_GOIDs.txt

5 accessions were not processed.
Please see: failed_accessions.txt


In [None]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/Elev_vsAmb_AllGOIDs_failed_accessions.txt

In [5]:
# Get all GO terms for all genes in elevated vs. low treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/Elev_vsLow_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/Elev_vsLow_All_GOIDs.txt

4 accessions were not processed.
Please see: failed_accessions.txt


In [6]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/Elev_vsLow_AllGOIDs_failed_accessions.txt
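The run-then-move pattern used for the four comparisons above could also be wrapped in one helper, so that failed_accessions.txt is never overwritten by the next run. This is only a sketch: `plan_paths` and `run_comparison` are hypothetical names, the file-name pattern is copied from the cells above, and I'm assuming nothing about the script's exit status (so it is not checked).

```python
from pathlib import Path
import shutil
import subprocess

# Directory and comparison names taken from the cells above
OUT_DIR = Path("../output/accession_n_GOids/allgenes_IDs")
COMPARISONS = ["Amb_vsLow", "day0_day17", "Elev_vsAmb", "Elev_vsLow"]


def plan_paths(comparison, out_dir=OUT_DIR):
    """Build the input, output, and failed-accession paths for one comparison."""
    return {
        "input": out_dir / f"{comparison}_All_GeneIDs.txt",
        "output": out_dir / f"{comparison}_All_GOIDs.txt",
        "failed": out_dir / f"{comparison}_AllGOIDs_failed_accessions.txt",
    }


def run_comparison(comparison, script="./04_uniprot2go.sh"):
    """Run the script for one comparison, then move failed_accessions.txt."""
    paths = plan_paths(comparison)
    # Redirect stdout to the output file, like `>` in the shell cells
    with open(paths["output"], "w") as out:
        subprocess.run([script, str(paths["input"])], stdout=out, check=False)
    # The script leaves failed_accessions.txt in the working directory
    failed = Path("failed_accessions.txt")
    if failed.exists():
        shutil.move(str(failed), str(paths["failed"]))
    return paths
```

Calling `run_comparison(name)` for each name in `COMPARISONS` would reproduce the eight cells above, but each call still takes as long as the original run.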

### Then, we get all GO terms for only differentially-expressed genes (padj <= 0.005)

In [1]:
# Get all GO terms for DEGs in ambient vs. low treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/Amb_vsLow_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/Amb_vsLow_DEG_GOIDs.txt

In [2]:
# Get all GO terms for DEGs in day 0 vs. day 17 ambient treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/day0_day17_amb_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/day0_day17_amb_DEG_GOIDs.txt

In [3]:
# Get all GO terms for DEGs in elevated vs. ambient treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/Elev_vsAmb_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/Elev_vsAmb_DEG_GOIDs.txt

In [4]:
# Get all GO terms for DEGs in elevated vs. low treatment group
!./04_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/Elev_vsLow_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/Elev_vsLow_DEG_GOIDs.txt

# IF USING 03_UNIPROT_TO_GO_METHOD1.R, START HERE

### Completed getting all our GO terms! Now, time to get input for GO-MWU.
We need 2 tables:
- A 2-column table of genes and GO terms
- A 2-column table of genes and unadjusted p-values, without repeated genes

The table of genes and unadjusted p-values is created using R in the script 05_GO-MWU_prep.R
In getting GO terms, we made a 2-col tab-separated table of genes and GO terms.
Now we need to eliminate all repeated genes. To do this, we'll use the nrify_GOtable.pl script. 
This can be found in the [GitHub repo for GO-MWU](https://github.com/z0on/GO_MWU)

We'll put these files in a subdirectory within the scripts directory, as this is what GO-MWU calls for. All GO-MWU files are also within that directory
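To show what "eliminating repeated genes" means here, below is a small Python sketch that collapses repeated gene rows and joins their GO terms. It is my own illustration of the idea, not a port of nrify_GOtable.pl, so details such as term ordering may differ from what the Perl script does. The name `merge_repeated_genes` and the example rows are made up.

```python
from collections import OrderedDict


def merge_repeated_genes(rows):
    """Collapse repeated genes into one row each, joining their unique GO terms.

    rows: iterable of (gene, 'GOID1;GOID2;...') pairs.
    Returns a list of (gene, 'GOID1;GOID2;...') with one entry per gene.
    """
    merged = OrderedDict()
    for gene, go_field in rows:
        terms = merged.setdefault(gene, [])
        for term in go_field.split(";"):
            term = term.strip()
            # Keep each GO term once per gene, in first-seen order
            if term and term not in terms:
                terms.append(term)
    return [(gene, ";".join(terms)) for gene, terms in merged.items()]


# Made-up rows: geneA appears twice with overlapping GO terms
rows = [
    ("geneA", "GO:0000001;GO:0000002"),
    ("geneB", "GO:0000003"),
    ("geneA", "GO:0000002;GO:0000004"),
]
deduped = merge_repeated_genes(rows)
```

Here `deduped` has one row for geneA carrying all three of its GO terms, plus the unchanged geneB row.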

In [1]:
# Eliminate gene repeats, concatenate GO terms for Elevated Day 2 vs. Ambient Day 0+2, indiv. libraries only
!06_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/allgenes_IDs/elev2_vs_amb02_indiv_only_All_GOIDs.txt > 06_running_GO-MWU/elev2_vs_amb02_indiv_only_GOIDs_norepeats.txt

In [2]:
# See length of file without repeats
!wc -l 06_running_GO-MWU/elev2_vs_amb02_indiv_only_GOIDs_norepeats.txt

41479 06_running_GO-MWU/elev2_vs_amb02_indiv_only_GOIDs_norepeats.txt


In [4]:
# Manually count unique lines in the original file - looks identical, script running properly
!sort ../output/accession_n_GOids/allgenes_IDs/elev2_vs_amb02_indiv_only_All_GOIDs.txt | uniq | wc -l

41479
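One caveat with the check above: `sort | uniq | wc -l` counts distinct whole lines, so a gene that appears on two lines with different GO lists would still be counted twice. A stricter check looks only at the first column. A minimal sketch, with the hypothetical helper name `repeated_genes` and made-up rows:

```python
from collections import Counter


def repeated_genes(lines):
    """Return {gene: count} for every gene ID that appears on more than one line."""
    counts = Counter(
        line.rstrip("\n").split("\t", 1)[0]  # first tab-separated column
        for line in lines
        if line.strip()
    )
    return {gene: n for gene, n in counts.items() if n > 1}


# Made-up rows: three distinct lines, but geneA is repeated
lines = [
    "geneA\tGO:0000001\n",
    "geneB\tGO:0000003\n",
    "geneA\tGO:0000002\n",
]
dupes = repeated_genes(lines)
distinct_lines = len(set(lines))
```

Running `repeated_genes` on an open `*_norepeats.txt` file should return an empty dictionary if every gene appears exactly once.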


In [5]:
# Continue with comparison of Day 2 Elevated vs. Ambient Day 0+2+17 + Elev. Day 0 + Lowered Day 0
!06_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/allgenes_IDs/amb0217_elev0_low0_vs_elev2_All_GOIDs.txt > 06_running_GO-MWU/amb0217_elev0_low0_vs_elev2_GOIDs_norepeats.txt

In [1]:
# Continue with comparison of Day 0 Elevated vs. Day 2 Elevated, individual libraries only
!06_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/allgenes_IDs/elev0_vs_elev2_indiv_All_GOIDs.txt > 06_running_GO-MWU/elev0_vs_elev2_indiv_GOIDs_norepeats.txt

In [1]:
# Continue with comparison of Day 0 Ambient vs. Day 2 Ambient, individual libraries only
!06_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/allgenes_IDs/amb0_vs_amb2_indiv_All_GOIDs.txt > 06_running_GO-MWU/amb0_vs_amb2_indiv_GOIDs_norepeats.txt

In [2]:
# Continue with comparison of Day 0 Ambient vs. Day 17 Ambient, individual libraries only
!06_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/allgenes_IDs/amb0_vs_amb17_indiv_All_GOIDs.txt > 06_running_GO-MWU/amb0_vs_amb17_indiv_GOIDs_norepeats.txt

In [3]:
# Continue with comparison of Day 2 Ambient vs. Day 17 Ambient, individual libraries only
!06_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/allgenes_IDs/amb2_vs_amb17_indiv_All_GOIDs.txt > 06_running_GO-MWU/amb2_vs_amb17_indiv_GOIDs_norepeats.txt

In [1]:
# Continue with comparison of Day 2 Ambient vs. Day 2 Elevated, individual libraries only
!06_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/allgenes_IDs/amb2_vs_elev2_indiv_All_GOIDs.txt > 06_running_GO-MWU/amb2_vs_elev2_indiv_GOIDs_norepeats.txt

Well done! Each comparison should now have one of our two inputs needed for GO-MWU. Continue to 05_GO-MWU_prep.R to create the other input.