# Assigning GO Slim Terms

In this notebook, I'll assign GO Slim terms to differentially expressed genes for *Zostera marina* and *Labyrinthula zosterae*. I will only do this for files with GO-MWU Biological Process output. This will help with downstream interpretation of biological processes impacted by infection.

## 0. Set working directory

In [1]:
pwd

'/Users/yaaminivenkataraman/Documents/project-EWD-transcriptomics/scripts'

In [2]:
cd ../analyses/GO-MWU/DE-GO-MWU/

/Users/yaaminivenkataraman/Documents/project-EWD-transcriptomics/analyses/GO-MWU/DE-GO-MWU


## 1. Format differentially expressed gene lists

### *Z. marina*

In [117]:
#Check current file formats for BP GOterms
!head 2019-07-15-Zostera-DEG-GOterms.tab
!wc -l 2019-07-15-Zostera-DEG-GOterms.tab

seq	term
TRINITY_DN299263_c0_g1	GO:0002758;GO:0002757;GO:0002218;GO:0002253;GO:0002764;GO:0045089;GO:0050778;GO:0050776;GO:0031349;GO:0045088;GO:0002684;GO:0002682;GO:0031347;GO:0080134
TRINITY_DN299263_c0_g2	GO:0002758;GO:0002757;GO:0002218;GO:0002253;GO:0002764;GO:0045089;GO:0050778;GO:0050776;GO:0031349;GO:0045088;GO:0002684;GO:0002682;GO:0031347;GO:0080134
TRINITY_DN312745_c1_g2	GO:0002758;GO:0002757;GO:0002218;GO:0002253;GO:0002764;GO:0045089;GO:0050778;GO:0050776;GO:0031349;GO:0045088;GO:0002684;GO:0002682;GO:0031347;GO:0080134
TRINITY_DN295269_c0_g3	GO:0000187;GO:0032147;GO:0043406;GO:0045860;GO:0001934;GO:0033674;GO:0031401;GO:0042327;GO:0051347;GO:0032270;GO:0045937;GO:0043085;GO:0031325;GO:0051247;GO:0010562;GO:0044093;GO:0009893;GO:0010604;GO:0051173;GO:0043405;GO:0043410;GO:0071902;GO:0043408;GO:1902533;GO:1902531;GO:0009967;GO:0010647;GO:0023056
TRINITY_DN294767_c5_g1	GO:0002181;GO:0006412;GO:0043043;GO:0006518;GO:0043604
TRINITY_DN294767_c5_g2	GO:0002181;GO:0006412;GO:004

In [118]:
#Remove header line
# Sort file and save as new file
!tail -n +2 2019-07-15-Zostera-DEG-GOterms.tab \
| sort > 2019-07-15-Zostera-DEG-GOterms-noHead.tab

In [119]:
!head 2019-07-15-Zostera-DEG-GOterms-noHead.tab

TRINITY_DN111543_c0_g1	GO:0001676
TRINITY_DN111543_c0_g1	GO:0006631;GO:0032787
TRINITY_DN111543_c0_g1	GO:0019752;GO:0043436;GO:0006082
TRINITY_DN111543_c0_g1	GO:0044255;GO:0006629
TRINITY_DN111543_c0_g1	GO:0044281
TRINITY_DN139711_c0_g1	GO:0000377;GO:0000375;GO:0008380;GO:0000398;GO:0006397
TRINITY_DN139711_c0_g1	GO:0016071
TRINITY_DN172833_c0_g1	GO:0006721;GO:0006720
TRINITY_DN172833_c0_g1	GO:0019752;GO:0043436;GO:0006082
TRINITY_DN172833_c0_g1	GO:0044255;GO:0006629


In [120]:
#Add space after each ; delimiter and save as a new file
!sed 's/;/; /g' 2019-07-15-Zostera-DEG-GOterms-noHead.tab > Zostera_blast-annot.tab

In [121]:
#Check output
!head -20 Zostera_blast-annot.tab
!wc Zostera_blast-annot.tab

TRINITY_DN111543_c0_g1	GO:0001676
TRINITY_DN111543_c0_g1	GO:0006631; GO:0032787
TRINITY_DN111543_c0_g1	GO:0019752; GO:0043436; GO:0006082
TRINITY_DN111543_c0_g1	GO:0044255; GO:0006629
TRINITY_DN111543_c0_g1	GO:0044281
TRINITY_DN139711_c0_g1	GO:0000377; GO:0000375; GO:0008380; GO:0000398; GO:0006397
TRINITY_DN139711_c0_g1	GO:0016071
TRINITY_DN172833_c0_g1	GO:0006721; GO:0006720
TRINITY_DN172833_c0_g1	GO:0019752; GO:0043436; GO:0006082
TRINITY_DN172833_c0_g1	GO:0044255; GO:0006629
TRINITY_DN172833_c0_g1	GO:0044281
TRINITY_DN172833_c0_g1	GO:0065008
TRINITY_DN226686_c0_g1	GO:0002376
TRINITY_DN226686_c0_g1	GO:0006950; GO:0050896
TRINITY_DN228040_c0_g1	GO:0000079; GO:1904029
TRINITY_DN228040_c0_g1	GO:0051726
TRINITY_DN228040_c0_g1	GO:0060255; GO:0019222; GO:0031323; GO:0051171; GO:0080090
TRINITY_DN228040_c0_g1	GO:0071900; GO:0045859; GO:0001932; GO:0043549; GO:0031399; GO:0042325; GO:0051338; GO:0032268; GO:0019220; GO:0050790; GO:0051246; GO:0051174; GO:0065009
TRINITY_DN228042_c0_g1	GO:00
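
The three shell steps above (drop the header, sort, add a space after each `;`) can also be sketched in Python. This is only an illustration and not the method used in this notebook: the function name `format_goterm_lines` is hypothetical, and Python's default string sort may order lines differently from `sort` under some locales.

```python
def format_goterm_lines(lines):
    """Drop the header row, sort the remaining rows, and add a space after each ';'."""
    # Skip the first line (header) and ignore blank lines.
    body = [line.rstrip("\n") for line in lines[1:] if line.strip()]
    # Sort first, then reformat the delimiter, mirroring the order of the shell steps.
    return [row.replace(";", "; ") for row in sorted(body)]


example = [
    "seq\tterm\n",
    "TRINITY_DN139711_c0_g1\tGO:0016071\n",
    "TRINITY_DN111543_c0_g1\tGO:0006631;GO:0032787\n",
]
for row in format_goterm_lines(example):
    print(row)
```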

### *L. zosterae*

In [122]:
#Check current file formats for BP GOterms
!head 2019-07-15-nonZostera-DEG-GOterms.tab
!wc 2019-07-15-nonZostera-DEG-GOterms.tab

seq	term
TRINITY_DN312737_c2_g2	GO:0009653
TRINITY_DN271666_c1_g1	GO:0009056;GO:1901575
TRINITY_DN271666_c1_g1	GO:0006464;GO:0036211;GO:0044267;GO:0043412;GO:0019538
TRINITY_DN296708_c0_g1	GO:0006464;GO:0036211;GO:0044267;GO:0043412;GO:0019538
TRINITY_DN233011_c0_g1	GO:0000281;GO:0061640;GO:0000910
TRINITY_DN277328_c0_g1	GO:0000281;GO:0061640;GO:0000910
TRINITY_DN299789_c0_g2	GO:0000281;GO:0061640;GO:0000910
TRINITY_DN311731_c0_g1	GO:0007010
TRINITY_DN311493_c0_g1	GO:0042592
      55     110    2681 2019-07-15-nonZostera-DEG-GOterms.tab


In [123]:
#Remove header line and save as new file
!tail -n +2 2019-07-15-nonZostera-DEG-GOterms.tab > 2019-07-15-nonZostera-DEG-GOterms-noHead.tab

In [124]:
!head 2019-07-15-nonZostera-DEG-GOterms-noHead.tab

TRINITY_DN312737_c2_g2	GO:0009653
TRINITY_DN271666_c1_g1	GO:0009056;GO:1901575
TRINITY_DN271666_c1_g1	GO:0006464;GO:0036211;GO:0044267;GO:0043412;GO:0019538
TRINITY_DN296708_c0_g1	GO:0006464;GO:0036211;GO:0044267;GO:0043412;GO:0019538
TRINITY_DN233011_c0_g1	GO:0000281;GO:0061640;GO:0000910
TRINITY_DN277328_c0_g1	GO:0000281;GO:0061640;GO:0000910
TRINITY_DN299789_c0_g2	GO:0000281;GO:0061640;GO:0000910
TRINITY_DN311731_c0_g1	GO:0007010
TRINITY_DN311493_c0_g1	GO:0042592
TRINITY_DN271666_c1_g1	GO:0009057


In [125]:
#Add space after each ; delimiter and save as a new file
!sed 's/;/; /g' 2019-07-15-nonZostera-DEG-GOterms-noHead.tab > nonZostera_blast-annot.tab

In [126]:
#Check output
!head -20 nonZostera_blast-annot.tab
!wc nonZostera_blast-annot.tab

TRINITY_DN312737_c2_g2	GO:0009653
TRINITY_DN271666_c1_g1	GO:0009056; GO:1901575
TRINITY_DN271666_c1_g1	GO:0006464; GO:0036211; GO:0044267; GO:0043412; GO:0019538
TRINITY_DN296708_c0_g1	GO:0006464; GO:0036211; GO:0044267; GO:0043412; GO:0019538
TRINITY_DN233011_c0_g1	GO:0000281; GO:0061640; GO:0000910
TRINITY_DN277328_c0_g1	GO:0000281; GO:0061640; GO:0000910
TRINITY_DN299789_c0_g2	GO:0000281; GO:0061640; GO:0000910
TRINITY_DN311731_c0_g1	GO:0007010
TRINITY_DN311493_c0_g1	GO:0042592
TRINITY_DN271666_c1_g1	GO:0009057
TRINITY_DN311731_c0_g1	GO:0007017
TRINITY_DN233011_c0_g1	GO:1903047
TRINITY_DN277328_c0_g1	GO:1903047
TRINITY_DN299789_c0_g2	GO:1903047
TRINITY_DN256449_c0_g1	GO:0016071
TRINITY_DN296708_c0_g1	GO:0034470; GO:0034660
TRINITY_DN255273_c0_g1	GO:0010605; GO:0009892; GO:0048519
TRINITY_DN311716_c0_g2	GO:0010605; GO:0009892; GO:0048519
TRINITY_DN255273_c0_g1	GO:0031324; GO:0048523; GO:0051172
TRINITY_DN311716_c0_g2	GO:0031324; GO:0048523; GO:0051172
      54     184    2748 nonZost

## 2. Unfold GOterms

### *Z. marina*

In [127]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="Zostera_blast-annot.tab"
file="Zostera_intermediate.file"
output_file="Zostera_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number.
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-2: the sequence ID and the first GO term.
	# Save the results of the echo/cut pipe (i.e. fields 1-2) to the variable "fixed_fields".
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) from the first GO term field ("begin_goterms") to the last field in the current line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the fields stored in "fixed_fields" (sequence ID and first GO term) followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the intermediate file into the while loop and write the output to the file named in "output_file".
done < "$file" > "$output_file"

In [128]:
#GO terms were unfolded correctly (third column), but the second column just repeats the first GO term from each input line
!head -20 Zostera_blast-GO-unfolded.tab

TRINITY_DN111543_c0_g1	GO:0001676	GO:0001676
TRINITY_DN111543_c0_g1	GO:0006631	GO:0006631
TRINITY_DN111543_c0_g1	GO:0006631	GO:0032787
TRINITY_DN111543_c0_g1	GO:0019752	GO:0019752
TRINITY_DN111543_c0_g1	GO:0019752	GO:0043436
TRINITY_DN111543_c0_g1	GO:0019752	GO:0006082
TRINITY_DN111543_c0_g1	GO:0044255	GO:0044255
TRINITY_DN111543_c0_g1	GO:0044255	GO:0006629
TRINITY_DN111543_c0_g1	GO:0044281	GO:0044281
TRINITY_DN139711_c0_g1	GO:0000377	GO:0000377
TRINITY_DN139711_c0_g1	GO:0000377	GO:0000375
TRINITY_DN139711_c0_g1	GO:0000377	GO:0008380
TRINITY_DN139711_c0_g1	GO:0000377	GO:0000398
TRINITY_DN139711_c0_g1	GO:0000377	GO:0006397
TRINITY_DN139711_c0_g1	GO:0016071	GO:0016071
TRINITY_DN172833_c0_g1	GO:0006721	GO:0006721
TRINITY_DN172833_c0_g1	GO:0006721	GO:0006720
TRINITY_DN172833_c0_g1	GO:0019752	GO:0019752
TRINITY_DN172833_c0_g1	GO:0019752	GO:0043436
TRINITY_DN172833_c0_g1	GO:0019752	GO:0006082


In [129]:
#Only retain the third and first columns
#Sort
#Save as a new file
!awk '{print $3"\t"$1}' Zostera_blast-GO-unfolded.tab \
| sort \
> Zostera_blast-GO-unfolded-sorted.tab

In [130]:
#Confirm output
!head Zostera_blast-GO-unfolded-sorted.tab
!wc -l Zostera_blast-GO-unfolded-sorted.tab

GO:0000027	TRINITY_DN293394_c0_g1
GO:0000027	TRINITY_DN298832_c1_g1
GO:0000027	TRINITY_DN298848_c3_g2
GO:0000027	TRINITY_DN314236_c0_g1
GO:0000028	TRINITY_DN292998_c5_g6
GO:0000028	TRINITY_DN316936_c5_g1
GO:0000045	TRINITY_DN298449_c4_g2
GO:0000075	TRINITY_DN312745_c1_g1
GO:0000075	TRINITY_DN312745_c1_g3
GO:0000075	TRINITY_DN316221_c2_g2
     849 Zostera_blast-GO-unfolded-sorted.tab
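
For comparison, the unfolding and column reordering done above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the script used here: `unfold_goterms` is a hypothetical name, the input is assumed to have the sequence ID in the first tab-delimited field and `; `-delimited GO terms in the second, and rows without GO terms are skipped rather than printed as-is.

```python
def unfold_goterms(lines):
    """Turn 'gene<TAB>GO:a; GO:b' rows into sorted (GO term, gene) pairs."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1].strip():
            continue  # no GO terms on this row
        gene = fields[0]
        # One output pair per GO term; strip() tolerates ';' or '; ' delimiters.
        for term in fields[1].split(";"):
            term = term.strip()
            if term:
                pairs.append((term, gene))
    return sorted(pairs)


rows = [
    "TRINITY_DN111543_c0_g1\tGO:0006631; GO:0032787",
    "TRINITY_DN111543_c0_g1\tGO:0001676",
]
for term, gene in unfold_goterms(rows):
    print(f"{term}\t{gene}")
```

Because each pair is built directly from the sequence ID and one GO term, there is no duplicated second column to remove afterwards.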


### *L. zosterae*

In [131]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="nonZostera_blast-annot.tab"
file="nonZostera_intermediate.file"
output_file="nonZostera_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number.
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-2: the sequence ID and the first GO term.
	# Save the results of the echo/cut pipe (i.e. fields 1-2) to the variable "fixed_fields".
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) from the first GO term field ("begin_goterms") to the last field in the current line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the fields stored in "fixed_fields" (sequence ID and first GO term) followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the intermediate file into the while loop and write the output to the file named in "output_file".
done < "$file" > "$output_file"

In [132]:
#GO terms were unfolded correctly (third column), but the second column just repeats the first GO term from each input line
!head -20 nonZostera_blast-GO-unfolded.tab

TRINITY_DN312737_c2_g2	GO:0009653	GO:0009653
TRINITY_DN271666_c1_g1	GO:0009056	GO:0009056
TRINITY_DN271666_c1_g1	GO:0009056	GO:1901575
TRINITY_DN271666_c1_g1	GO:0006464	GO:0006464
TRINITY_DN271666_c1_g1	GO:0006464	GO:0036211
TRINITY_DN271666_c1_g1	GO:0006464	GO:0044267
TRINITY_DN271666_c1_g1	GO:0006464	GO:0043412
TRINITY_DN271666_c1_g1	GO:0006464	GO:0019538
TRINITY_DN296708_c0_g1	GO:0006464	GO:0006464
TRINITY_DN296708_c0_g1	GO:0006464	GO:0036211
TRINITY_DN296708_c0_g1	GO:0006464	GO:0044267
TRINITY_DN296708_c0_g1	GO:0006464	GO:0043412
TRINITY_DN296708_c0_g1	GO:0006464	GO:0019538
TRINITY_DN233011_c0_g1	GO:0000281	GO:0000281
TRINITY_DN233011_c0_g1	GO:0000281	GO:0061640
TRINITY_DN233011_c0_g1	GO:0000281	GO:0000910
TRINITY_DN277328_c0_g1	GO:0000281	GO:0000281
TRINITY_DN277328_c0_g1	GO:0000281	GO:0061640
TRINITY_DN277328_c0_g1	GO:0000281	GO:0000910
TRINITY_DN299789_c0_g2	GO:0000281	GO:0000281


In [133]:
#Only retain the third and first columns
#Sort
#Save as a new file
!awk '{print $3"\t"$1}' nonZostera_blast-GO-unfolded.tab \
| sort \
> nonZostera_blast-GO-unfolded-sorted.tab

In [145]:
#Confirm output
!head nonZostera_blast-GO-unfolded-sorted.tab
!wc -l nonZostera_blast-GO-unfolded-sorted.tab

GO:0000281	TRINITY_DN233011_c0_g1
GO:0000281	TRINITY_DN277328_c0_g1
GO:0000281	TRINITY_DN299789_c0_g2
GO:0000302	TRINITY_DN295214_c0_g1
GO:0000302	TRINITY_DN295214_c0_g2
GO:0000302	TRINITY_DN312005_c1_g1
GO:0000910	TRINITY_DN233011_c0_g1
GO:0000910	TRINITY_DN277328_c0_g1
GO:0000910	TRINITY_DN299789_c0_g2
GO:0001932	TRINITY_DN255903_c0_g1
     130 nonZostera_blast-GO-unfolded-sorted.tab


## 3. Match BP GOterms to GO Slim terms

In [135]:
#Download list of GO Slim and matching GOterms
!curl -O http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/GO-GOslim.sorted

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2314k  100 2314k    0     0  1115k      0  0:00:02  0:00:02 --:--:-- 1115k


In [136]:
!head GO-GOslim.sorted

GO:0000001	mitochondrion inheritance	cell organization and biogenesis	P
GO:0000002	mitochondrial genome maintenance	cell organization and biogenesis	P
GO:0000003	reproduction	other biological processes	P
GO:0000006	high affinity zinc uptake transmembrane transporter activity	transporter activity	F
GO:0000007	low-affinity zinc ion transmembrane transporter activity	transporter activity	F
GO:0000009	"alpha-1,6-mannosyltransferase activity"	other molecular function	F
GO:0000010	trans-hexaprenyltranstransferase activity	other molecular function	F
GO:0000011	vacuole inheritance	cell organization and biogenesis	P
GO:0000012	single strand break repair	DNA metabolism	P
GO:0000012	single strand break repair	stress response	P


### *Z. marina*

In [148]:
#Join files to get GOslim for each query
#Remove duplicate lines
#Sort files
#Only print gene ID, GO Slim, and original GOterm
!join -1 1 -2 1 -t $'\t' \
Zostera_blast-GO-unfolded-sorted.tab \
GO-GOslim.sorted \
| uniq | sort | awk -F'\t' -v OFS='\t' '{print $2, $4, $3}' \
> Zostera_blastquery-GOslim.tab

In [149]:
#Check output
!head Zostera_blastquery-GOslim.tab
!wc -l Zostera_blastquery-GOslim.tab

TRINITY_DN293394_c0_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN298832_c1_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN298848_c3_g2	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN314236_c0_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN292998_c5_g6	cell organization and biogenesis	ribosomal small subunit assembly
TRINITY_DN316936_c5_g1	cell organization and biogenesis	ribosomal small subunit assembly
TRINITY_DN298449_c4_g2	cell organization and biogenesis	autophagic vacuole formation
TRINITY_DN298449_c4_g2	other metabolic processes	autophagic vacuole formation
TRINITY_DN298449_c4_g2	stress response	autophagic vacuole formation
TRINITY_DN312745_c1_g1	cell cycle and proliferation	cell cycle checkpoint
     765 Zostera_blastquery-GOslim.tab
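
The join above can be pictured as a dictionary lookup from GO ID to GO Slim category. The following is a rough sketch, not the pipeline used in this notebook: `load_goslim` and `assign_goslim` are hypothetical names, `gene_A` is a made-up ID, and the GO Slim file is assumed to have exactly four tab-delimited columns (GO ID, term name, GO Slim, aspect) as in the `head` output shown earlier.

```python
from collections import defaultdict


def load_goslim(lines):
    """Map each GO ID to a list of (term name, GO Slim, aspect) tuples."""
    goslim = defaultdict(list)
    for line in lines:
        go_id, name, slim, aspect = line.rstrip("\n").split("\t")
        goslim[go_id].append((name, slim, aspect))
    return goslim


def assign_goslim(pairs, goslim):
    """Return unique, sorted (gene, GO Slim, term name) rows for (GO ID, gene) pairs."""
    rows = set()  # a set removes duplicate rows
    for go_id, gene in pairs:
        # A GO ID can map to more than one GO Slim category.
        for name, slim, _aspect in goslim.get(go_id, []):
            rows.add((gene, slim, name))
    return sorted(rows)


goslim_lines = [
    "GO:0000003\treproduction\tother biological processes\tP",
    "GO:0000012\tsingle strand break repair\tDNA metabolism\tP",
    "GO:0000012\tsingle strand break repair\tstress response\tP",
]
goslim = load_goslim(goslim_lines)
for row in assign_goslim([("GO:0000012", "gene_A")], goslim):
    print("\t".join(row))
```

Unlike the shell pipeline, which leaves rows in GO ID order, this sketch returns rows sorted by gene ID, which is the order `join` would need if the result is later joined on the gene ID column.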


In [150]:
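#Count lines after collapsing adjacent rows that share the same GO Slim and GOterm
#Note: uniq -f1 skips the gene ID field when comparing, so this is not a count of unique gene IDs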
!uniq -f1 Zostera_blastquery-GOslim.tab | wc -l

     258
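
`uniq -f1` skips the first field (the gene ID) when comparing adjacent lines, so the count it produces is not the same as the number of distinct gene IDs. The sketch below contrasts the two counts on a few rows taken from the output above. Both function names are hypothetical, and the second function only approximates what `uniq -f1 | wc -l` does.

```python
from itertools import groupby


def count_unique_genes(rows):
    """Number of distinct values in the first tab-delimited field."""
    return len({row.split("\t", 1)[0] for row in rows})


def count_runs_ignoring_gene(rows):
    """Number of runs of adjacent rows that are identical after the first field."""
    keys = [row.split("\t", 1)[1] for row in rows]
    return sum(1 for _ in groupby(keys))


rows = [
    "TRINITY_DN293394_c0_g1\tcell organization and biogenesis\tribosomal large subunit assembly",
    "TRINITY_DN298832_c1_g1\tcell organization and biogenesis\tribosomal large subunit assembly",
    "TRINITY_DN298449_c4_g2\tcell organization and biogenesis\tautophagic vacuole formation",
    "TRINITY_DN298449_c4_g2\tother metabolic processes\tautophagic vacuole formation",
    "TRINITY_DN298449_c4_g2\tstress response\tautophagic vacuole formation",
]
print(count_unique_genes(rows))        # distinct gene IDs
print(count_runs_ignoring_gene(rows))  # what a uniq -f1 style count sees
```

On these five rows the two counts differ (3 distinct gene IDs versus 4 runs), so the two approaches answer different questions.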


In [151]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" Zostera_blastquery-GOslim.tab \
> Zostera_blastquery-GOslim-BP.sorted.unique.noOther
!head Zostera_blastquery-GOslim-BP.sorted.unique.noOther
!wc -l Zostera_blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN293394_c0_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN298832_c1_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN298848_c3_g2	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN314236_c0_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN292998_c5_g6	cell organization and biogenesis	ribosomal small subunit assembly
TRINITY_DN316936_c5_g1	cell organization and biogenesis	ribosomal small subunit assembly
TRINITY_DN298449_c4_g2	cell organization and biogenesis	autophagic vacuole formation
TRINITY_DN298449_c4_g2	other metabolic processes	autophagic vacuole formation
TRINITY_DN298449_c4_g2	stress response	autophagic vacuole formation
TRINITY_DN312745_c1_g1	cell cycle and proliferation	cell cycle checkpoint
     601 Zostera_blastquery-GOslim-BP.sorted.unique.noOther
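
`grep --invert-match` drops any line containing the phrase anywhere, including in the GO term name column. A stricter variant filters on the GO Slim column only. This is a small sketch with a hypothetical function name, assuming the three-column layout shown above (gene ID, GO Slim, GO term name).

```python
def drop_other_bp(rows, label="other biological processes"):
    """Keep rows whose GO Slim column (second field) is not exactly `label`."""
    return [row for row in rows if row.split("\t")[1] != label]


rows = [
    "TRINITY_DN295214_c0_g1\tstress response\tresponse to reactive oxygen species",
    "TRINITY_DN233011_c0_g1\tother biological processes\tcytokinesis",
]
for row in drop_other_bp(rows):
    print(row)
```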


In [152]:
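#Count lines with defined GOSlim terms after collapsing adjacent rows that share the same GO Slim and GOterm
#Note: uniq -f1 skips the gene ID field when comparing, so this is not a count of unique gene IDs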
!uniq -f1 Zostera_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

     208


### *L. zosterae*

In [153]:
#Join files to get GOslim for each query
#Remove duplicate lines
#Sort files
#Only print gene ID, GO Slim, and original GOterm
!join -1 1 -2 1 -t $'\t' \
nonZostera_blast-GO-unfolded-sorted.tab \
GO-GOslim.sorted \
| uniq | sort | awk -F'\t' -v OFS='\t' '{print $2, $4, $3}' \
> nonZostera_blastquery-GOslim.tab

In [154]:
#Check output
!head nonZostera_blastquery-GOslim.tab
!wc -l nonZostera_blastquery-GOslim.tab

TRINITY_DN233011_c0_g1	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN277328_c0_g1	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN299789_c0_g2	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN295214_c0_g1	stress response	response to reactive oxygen species
TRINITY_DN295214_c0_g2	stress response	response to reactive oxygen species
TRINITY_DN312005_c1_g1	stress response	response to reactive oxygen species
TRINITY_DN233011_c0_g1	other biological processes	cytokinesis
TRINITY_DN277328_c0_g1	other biological processes	cytokinesis
TRINITY_DN299789_c0_g2	other biological processes	cytokinesis
TRINITY_DN255903_c0_g1	protein metabolism	regulation of protein amino acid phosphorylation
     110 nonZostera_blastquery-GOslim.tab


In [155]:
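#Count lines after collapsing adjacent rows that share the same GO Slim and GOterm
#Note: uniq -f1 skips the gene ID field when comparing, so this is not a count of unique gene IDs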
!uniq -f1 nonZostera_blastquery-GOslim.tab | wc -l

      53


In [156]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" nonZostera_blastquery-GOslim.tab \
> nonZostera_blastquery-GOslim-BP.sorted.unique.noOther
!head nonZostera_blastquery-GOslim-BP.sorted.unique.noOther
!wc -l nonZostera_blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN233011_c0_g1	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN277328_c0_g1	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN299789_c0_g2	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN295214_c0_g1	stress response	response to reactive oxygen species
TRINITY_DN295214_c0_g2	stress response	response to reactive oxygen species
TRINITY_DN312005_c1_g1	stress response	response to reactive oxygen species
TRINITY_DN255903_c0_g1	protein metabolism	regulation of protein amino acid phosphorylation
TRINITY_DN299755_c1_g3	protein metabolism	regulation of protein amino acid phosphorylation
TRINITY_DN256449_c0_g1	other metabolic processes	"nucleobase, nucleoside, nucleotide and nucleic acid metabolic process"
TRINITY_DN296708_c0_g1	other metabolic processes	"nucleobase, nucleoside, nucleotide and nucleic acid metabolic process"
      92 nonZostera_blastquery-GOslim-BP.sorted.unique.noOther


In [157]:
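#Count lines with defined GOSlim terms after collapsing adjacent rows that share the same GO Slim and GOterm
#Note: uniq -f1 skips the gene ID field when comparing, so this is not a count of unique gene IDs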
!uniq -f1 nonZostera_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

      43


## 4. Match GO Slim terms to original annotations

### *Z. marina*

In [19]:
#Check format of gene list with protein annotations
#Columns: seq ID, Uniprot, seq ID with isoform information, e-value, annotation, GO terms, reviewed, organism
!head -n2 ../../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab

TRINITY_DN100001_c0_g1	Q54EW8	TRINITY_DN100001_c0_g1_i1	1.2e-19	Dihydrolipoyl dehydrogenase, mitochondrial (EC 1.8.1.4) (Dihydrolipoamide dehydrogenase) (Glycine cleavage system L protein)	cell redox homeostasis [GO:0045454]; glycine catabolic process [GO:0006546]; isoleucine catabolic process [GO:0006550]; leucine catabolic process [GO:0006552]; L-serine biosynthetic process [GO:0006564]; valine catabolic process [GO:0006574]	extracellular matrix [GO:0031012]; mitochondrial matrix [GO:0005759]; mitochondrial pyruvate dehydrogenase complex [GO:0005967]; phagocytic vesicle [GO:0045335]	dihydrolipoyl dehydrogenase activity [GO:0004148]; electron transfer activity [GO:0009055]; flavin adenine dinucleotide binding [GO:0050660]	GO:0004148; GO:0005759; GO:0005967; GO:0006546; GO:0006550; GO:0006552; GO:0006564; GO:0006574; GO:0009055; GO:0031012; GO:0045335; GO:0045454; GO:0050660	reviewed	Dictyostelium discoideum (Slime mold)
TRINITY_DN100015_c0_g1	P16894	TRINITY_DN100015_c0_g1_i1	1.2e-21	

In [27]:
#Join protein annotation with GO Slim terms (unique, no other)
#Note: without -a1 this is an inner join, so GO Slim lines with no matching annotation are dropped
!join -1 1 -2 1 -t $'\t' \
Zostera_blastquery-GOslim-BP.sorted.unique.noOther \
../../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $4, $2, $3, $7, $8}' \
> Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

In [30]:
#Check output
#Count lines
!head Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot
!wc -l Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

TRINITY_DN293394_c0_g1	P47991	cell organization and biogenesis	ribosomal large subunit assembly	60S ribosomal protein L6	cytoplasmic translation [GO:0002181]; determination of adult lifespan [GO:0008340]; ribosomal large subunit assembly [GO:0000027]
TRINITY_DN298832_c1_g1	A4FV84	cell organization and biogenesis	ribosomal large subunit assembly	mRNA turnover protein 4 homolog (Ribosome assembly factor MRTO4)	nuclear-transcribed mRNA catabolic process [GO:0000956]; ribosomal large subunit assembly [GO:0000027]; ribosomal large subunit biogenesis [GO:0042273]; rRNA processing [GO:0006364]
TRINITY_DN298848_c3_g2	O04204	cell organization and biogenesis	ribosomal large subunit assembly	60S acidic ribosomal protein P0-1	cytoplasmic translation [GO:0002181]; ribosomal large subunit assembly [GO:0000027]
TRINITY_DN314236_c0_g1	Q12019	cell organization and biogenesis	ribosomal large subunit assembly	Midasin (Dynein-related AAA-ATPase REA1) (MIDAS-containing protein) (Ribosome export/assembly pr
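
`join` expects both inputs to be sorted on the join field. The GO Slim file used here is ordered by GO term rather than by gene ID, so some matching lines may be missed. A dictionary-based left join does not depend on input order. The sketch below is only an illustration with a hypothetical function name, and the example rows are a small subset chosen for demonstration.

```python
def left_join(left_rows, right_rows):
    """Left-join tab-delimited rows on their first field, regardless of input order."""
    right = {}
    for row in right_rows:
        key, _, rest = row.partition("\t")
        right.setdefault(key, []).append(rest)
    joined = []
    for row in left_rows:
        key = row.split("\t", 1)[0]
        matches = right.get(key)
        if matches:
            # One output row per matching right-hand row.
            joined.extend(f"{row}\t{rest}" for rest in matches)
        else:
            joined.append(row)  # unpaired left row is kept, like join -a1
    return joined


goslim_rows = [
    "TRINITY_DN295214_c0_g1\tstress response\tresponse to reactive oxygen species",
    "TRINITY_DN233011_c0_g1\tcell cycle and proliferation\tcytokinesis after mitosis",
]
annotation_rows = [
    "TRINITY_DN233011_c0_g1\tProbable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)",
]
for row in left_join(goslim_rows, annotation_rows):
    print(row)
```

In this example the left-hand rows are deliberately not sorted by gene ID, and the second row still picks up its annotation.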

### *L. zosterae*

In [47]:
#Check format of gene list with protein annotations
!head -n5 2019-07-15-nonZostera-DEG-ProteinN.tab

seq	ProteinN
TRINITY_DN312737_c2_g2	Jouberin (Abelson helper integration site 1 protein) (AHI-1)
TRINITY_DN271666_c1_g1	Acyl-protein thioesterase 1 (EC 3.1.2.-)
TRINITY_DN271666_c1_g1	Acyl-protein thioesterase 1 (EC 3.1.2.-)
TRINITY_DN296708_c0_g1	DPH4 homolog (DnaJ homolog subfamily C member 24)


In [48]:
#Remove header line
# Sort file and only keep unique entries
#Save output
!tail -n +2 2019-07-15-nonZostera-DEG-ProteinN.tab \
| sort | uniq \
> 2019-07-15-nonZostera-DEG-ProteinN.noHead.tab

In [49]:
#Check header was removed
!head -n5 2019-07-15-nonZostera-DEG-ProteinN.noHead.tab

TRINITY_DN233011_c0_g1	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN255273_c0_g1	NAD-dependent protein deacetylase sirtuin-2 (EC 3.5.1.-) (Regulatory protein SIR2 homolog 2) (SIR2-like protein 2)
TRINITY_DN255903_c0_g1	Protein BCCIP homolog
TRINITY_DN256449_c0_g1	Splicing factor U2AF 35 kDa subunit (U2 auxiliary factor 35 kDa subunit) (U2 snRNP auxiliary factor small subunit) (Fragment)
TRINITY_DN271666_c1_g1	Acyl-protein thioesterase 1 (EC 3.1.2.-)


In [50]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' -a1 \
nonZostera_blastquery-GOslim-BP.sorted.unique.noOther \
2019-07-15-nonZostera-DEG-ProteinN.noHead.tab \
> nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.ProteinN

In [51]:
!head nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.ProteinN
!wc -l nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.ProteinN

TRINITY_DN233011_c0_g1	cell cycle and proliferation	cytokinesis after mitosis	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN277328_c0_g1	cell cycle and proliferation	cytokinesis after mitosis	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN299789_c0_g2	cell cycle and proliferation	cytokinesis after mitosis	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN295214_c0_g1	stress response	response to reactive oxygen species
TRINITY_DN295214_c0_g2	stress response	response to reactive oxygen species
TRINITY_DN312005_c1_g1	stress response	response to reactive oxygen species	Cytochrome c peroxidase, mitochondrial (CCP) (EC 1.11.1.5)
TRINITY_DN255903_c0_g1	protein metabolism	regulation of protein amino acid phosphorylation
TRINITY_DN299755_c1_g3	protein metabolism	r