Aidan Coyle, afcoyle@uw.edu

2021/01/20

Roberts lab at SAFS

### Turning UniProt accessions into GO terms

This notebook calls a Bash script that Sam wrote, available [here](https://github.com/RobertsLab/code/blob/master/script-box/uniprot2go.sh). I copied it into my /scripts/ directory, where it's named 05_uniprot2go.sh.

05_uniprot2go.sh takes a newline-separated file of UniProt accessions and outputs a file with one line per accession, in the following format: UniProt_accession [tab] GOID1;GOID2;...;GOIDn
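
To make the output format concrete, here is a minimal Python sketch of how one such line could be read back in later steps. The function name `parse_uniprot2go_line` and the example accession and GO IDs are hypothetical, chosen only for illustration. The sketch assumes exactly one tab between the accession and a semicolon-separated list of GO IDs, as described above.

```python
def parse_uniprot2go_line(line):
    """Split one 'accession<TAB>GOID1;GOID2;...' line into (accession, [GO IDs]).

    Hypothetical helper: assumes the tab/semicolon layout described above.
    """
    accession, _, go_field = line.rstrip("\n").partition("\t")
    # Drop empty pieces so an accession with no GO terms yields an empty list
    go_ids = [term.strip() for term in go_field.split(";") if term.strip()]
    return accession, go_ids


# Made-up example line, not taken from the real output files
example = "P12345\tGO:0005739;GO:0006412\n"
print(parse_uniprot2go_line(example))
```

An accession followed by an empty GO field is returned with an empty list rather than raising an error, which may be convenient when checking how many genes received no annotation.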

In [1]:
pwd

'/mnt/c/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/scripts'

### First, we'll get all GO terms for all genes. This will take a long time - on a laptop, each run takes 1-2 days

In [4]:
# Get all GO terms for all genes in ambient vs. low treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/Amb_vsLow_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/Amb_vsLow_All_GOIDs.txt

6 accessions were not processed.
Please see: failed_accessions.txt


In [None]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/Amb_vsLow_AllGOIDs_failed_accessions.txt

In [12]:
# Get all GO terms for all genes in day 0 vs. day 17 ambient treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/day0_day17_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/day0_day17_All_GOIDs.txt

3 accessions were not processed.
Please see: failed_accessions.txt


In [None]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/day0_day17_AllGOIDs_failed_accessions.txt

In [1]:
# Get all GO terms for all genes in elevated vs. ambient treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/Elev_vsAmb_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/Elev_vsAmb_All_GOIDs.txt

5 accessions were not processed.
Please see: failed_accessions.txt


In [None]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/Elev_vsAmb_AllGOIDs_failed_accessions.txt

In [5]:
# Get all GO terms for all genes in elevated vs. low treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/allgenes_IDs/Elev_vsLow_All_GeneIDs.txt > ../output/accession_n_GOids/allgenes_IDs/Elev_vsLow_All_GOIDs.txt

4 accessions were not processed.
Please see: failed_accessions.txt


In [6]:
# A file called failed_accessions.txt is auto-placed in the directory you run the script from. Move to the output folder
!mv failed_accessions.txt ../output/accession_n_GOids/allgenes_IDs/Elev_vsLow_AllGOIDs_failed_accessions.txt
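
The four all-genes runs above follow one pattern: run the script on a comparison's accession file, then rename failed_accessions.txt with the comparison's prefix. As a rough sketch, the commands could be generated from the comparison names instead of being typed out by hand. The helper `uniprot2go_commands` is hypothetical, and the sketch only builds and prints the command strings (it does not run them), assuming the file-naming pattern used in the cells above.

```python
ALLGENES_DIR = "../output/accession_n_GOids/allgenes_IDs"


def uniprot2go_commands(comparison, out_dir=ALLGENES_DIR):
    """Return the two shell commands used above for one comparison.

    Hypothetical helper: assumes the <comparison>_All_GeneIDs.txt naming pattern.
    """
    gene_ids = f"{out_dir}/{comparison}_All_GeneIDs.txt"
    go_ids = f"{out_dir}/{comparison}_All_GOIDs.txt"
    failed = f"{out_dir}/{comparison}_AllGOIDs_failed_accessions.txt"
    return [
        f"./05_uniprot2go.sh {gene_ids} > {go_ids}",
        f"mv failed_accessions.txt {failed}",
    ]


# Comparison prefixes as they appear in the file names above
for name in ["Amb_vsLow", "day0_day17", "Elev_vsAmb", "Elev_vsLow"]:
    for command in uniprot2go_commands(name):
        print(command)
```

Printing the commands first makes it easy to check the paths before committing to a run that takes 1-2 days.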

### Then, we get all GO terms for only differentially-expressed genes (padj <= 0.005)

In [1]:
# Get all GO terms for DEGs in ambient vs. low treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/Amb_vsLow_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/Amb_vsLow_DEG_GOIDs.txt

In [2]:
# Get all GO terms for DEGs in day 0 vs. day 17 ambient treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/day0_day17_amb_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/day0_day17_amb_DEG_GOIDs.txt

In [3]:
# Get all GO terms for DEGs in elevated vs. ambient treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/Elev_vsAmb_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/Elev_vsAmb_DEG_GOIDs.txt

In [4]:
# Get all GO terms for DEGs in elevated vs. low treatment group
!./05_uniprot2go.sh ../output/accession_n_GOids/DEG_IDs/Elev_vsLow_DEG_IDs.txt > ../output/accession_n_GOids/DEG_IDs/Elev_vsLow_DEG_GOIDs.txt

### Completed getting all our GO terms! Now, time to get input for GO-MWU.
We need 2 tables:
- A 2-column table of genes and GO terms
- A 2-column table of genes and unadjusted p-values, with no repeated genes

The table of genes and unadjusted p-values is created in R, in the script 06_GO-MWU_prep.R.
When getting GO terms above, we made a 2-column tab-separated table of genes and GO terms.
Now we need to eliminate all repeated genes. To do this, we'll use the nrify_GOtable.pl script,
which can be found in the [GitHub repo for GO-MWU](https://github.com/z0on/GO_MWU).

We'll put the resulting files in a subdirectory within the scripts directory (07_running_GO-MWU), as this is what GO-MWU calls for. All GO-MWU files are also within that directory.
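
To illustrate what "eliminating repeated genes" means for this table, here is a small Python sketch that collapses repeated gene rows and joins their GO terms. This is not the implementation of nrify_GOtable.pl. It only mirrors the behavior described here (one row per gene, GO terms concatenated), and the function name `merge_repeated_genes` and the toy gene and GO IDs are made up for the example.

```python
def merge_repeated_genes(lines):
    """Collapse repeated genes into one row each, joining their GO terms with ';'.

    Illustrative only: an assumption about the intended result, not a port
    of nrify_GOtable.pl.
    """
    merged = {}  # gene -> list of GO terms, in first-seen order
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        gene, _, go_field = line.partition("\t")
        terms = merged.setdefault(gene, [])
        for term in go_field.split(";"):
            term = term.strip()
            if term and term not in terms:
                terms.append(term)
    return [f"{gene}\t{';'.join(terms)}" for gene, terms in merged.items()]


# Toy input with one repeated gene
rows = ["geneA\tGO:1;GO:2\n", "geneB\tGO:3\n", "geneA\tGO:2;GO:4\n"]
for row in merge_repeated_genes(rows):
    print(row)
```

Genes are kept in the order they first appear, and a GO term listed twice for the same gene is written only once.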

In [1]:
# Eliminate gene repeats, concatenate GO terms for Amb/Low comparison
!07_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/DEG_IDs/Amb_vsLow_DEG_GOIDs.txt > 07_running_GO-MWU/Amb_vsLow_DEG_GOIDs_norepeats.txt

In [3]:
# See length of file without repeats
!wc -l 07_running_GO-MWU/Amb_vsLow_DEG_GOIDs_norepeats.txt

336 07_running_GO-MWU/Amb_vsLow_DEG_GOIDs_norepeats.txt


In [8]:
# Manually count the unique lines in the original file - looks identical, so the script is running properly
!sort ../output/accession_n_GOids/DEG_IDs/Amb_vsLow_DEG_GOIDs.txt | uniq | wc -l

336
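
One caveat with the check above: sort | uniq | wc -l counts distinct whole lines, which matches the number of distinct genes only if a repeated gene always appears with an identical GO string. A slightly stricter check, sketched below in Python, counts unique values in the first column as well. The function name `count_unique` and the toy rows are hypothetical. The sketch assumes the tab-separated layout described earlier.

```python
def count_unique(lines):
    """Return (number of unique lines, number of unique first-column genes)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    unique_lines = len(set(rows))
    # Gene ID is everything before the first tab
    unique_genes = len({row.split("\t", 1)[0] for row in rows})
    return unique_lines, unique_genes


# Toy rows: gene 'a' appears with two different GO strings
toy = ["a\tGO:1\n", "a\tGO:1\n", "a\tGO:2\n", "b\tGO:3\n"]
print(count_unique(toy))
```

If the two counts agree for the real file (read in with `open(path)`), the line-based check and the gene-based check are equivalent for that file.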


In [9]:
# Continue with day 0/day 17 comparison
!07_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/DEG_IDs/day0_day17_amb_DEG_GOIDs.txt > 07_running_GO-MWU/day0_day17_amb_DEG_GOIDs_norepeats.txt

In [10]:
# Continue with Elev/Amb comparison
!07_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/DEG_IDs/Elev_vsAmb_DEG_GOIDs.txt > 07_running_GO-MWU/Elev_vsAmb_DEG_GOIDs_norepeats.txt

In [11]:
# Continue with Elev/Low comparison
!07_running_GO-MWU/./nrify_GOtable.pl ../output/accession_n_GOids/DEG_IDs/Elev_vsLow_DEG_GOIDs.txt > 07_running_GO-MWU/Elev_vsLow_DEG_GOIDs_norepeats.txt