# Assigning GO Slim Terms

In this notebook, I'll assign GO Slim terms to differentially expressed genes for *Zostera marina* and *Labyrinthula zosterae*. I will only do this for files with GO-MWU Biological Process output. This will help with downstream interpretation of biological processes impacted by infection.

## 0. Set working directory

In [1]:
pwd

'/Users/yaaminivenkataraman/Documents/project-EWD-transcriptomics/scripts'

In [2]:
cd ../analyses/

/Users/yaaminivenkataraman/Documents/project-EWD-transcriptomics/analyses


In [3]:
#!mkdir Gene-Enrichment

In [4]:
cd Gene-Enrichment/

/Users/yaaminivenkataraman/Documents/project-EWD-transcriptomics/analyses/Gene-Enrichment


## 1. Format differentially expressed gene lists

### *Z. marina*

In [21]:
#Check DEG file
!head -n2 ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt
!wc -l ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt

	GeneID	logFC	logCPM	F	PValue	FDR	Accession	Isoform	E-value	ProteinN	GO_BP	GO_CC	GO_MF	GO	Status	Organism	S_10B	S_9A	S_13A	S_42A	S_46B	S_47B	S_48B	S_2A	S_2B	S_7B	S_8B	S_33A	S_36B	S_38A	S_40A
1	TRINITY_DN102431_c0_g1	7.87968942125049	1.5948412518851	9.62074706954095	0.00646939513530696	0.039492658666223	Q7ZX51	TRINITY_DN102431_c0_g1_i1	1.4e-40	Tyrosine--tRNA ligase, cytoplasmic (EC 6.1.1.1) (Tyrosyl-tRNA synthetase) (TyrRS)	tyrosyl-tRNA aminoacylation [GO:0006437]	cytoplasm [GO:0005737]	ATP binding [GO:0005524]; tRNA binding [GO:0000049]; tyrosine-tRNA ligase activity [GO:0004831]	GO:0000049; GO:0004831; GO:0005524; GO:0005737; GO:0006437	reviewed	Xenopus laevis (African clawed frog)	0	0	2	0	2	4	0	0	0	0	0	0	0	0	0
     541 ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt
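Since the `awk` step that follows picks columns by position, it helps to print each header name next to its index first. This is a minimal sketch on a toy table; the file `deg.txt` and its values are made up for illustration, and the real file has many more columns.

```shell
# Toy stand-in for the annotated DEG table (hypothetical file and values)
tmp=$(mktemp -d)
printf '\tGeneID\tlogFC\tAccession\tGO\n' > "$tmp/deg.txt"
printf '1\tgeneA\t2.5\tQ11111\tGO:0000001\n' >> "$tmp/deg.txt"

# Print "index<TAB>name" for every column in the header line.
# The first column (row number) has an empty name, so its line is just "1" and a tab.
header_index=$(awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i "\t" $i; exit }' "$tmp/deg.txt")
printf '%s\n' "$header_index"
```

Running the same `awk` line on the real table should confirm which positions hold the accession, gene ID, and GO columns.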


In [43]:
#Remove header
#Sort
#Only keep Uniprot accession ($8), gene ID ($2), and GOterms ($15)
#Save as new file
!tail -n +2 ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt \
| sort \
| awk -F'\t' -v OFS='\t' '{print $8, $2, $15}' \
> Zostera_blast-annot.tab

In [45]:
#Check output
#Count lines (original - 1)
!head Zostera_blast-annot.tab
!wc -l Zostera_blast-annot.tab

Q7ZX51	TRINITY_DN102431_c0_g1	GO:0000049; GO:0004831; GO:0005524; GO:0005737; GO:0006437
O49561	TRINITY_DN172833_c0_g1	GO:0009685; GO:0009686; GO:0046872; GO:0051213; GO:0052635
Q47UW0	TRINITY_DN276081_c0_g2	GO:0003677; GO:0003899; GO:0006351
Q5XHZ0	TRINITY_DN276264_c0_g1	GO:0003723; GO:0005524; GO:0005654; GO:0005739; GO:0005743; GO:0005758; GO:0005759; GO:0006457; GO:0009386; GO:0019901; GO:0051082; GO:1901856; GO:1903751
P0CH36	TRINITY_DN276264_c0_g3	GO:0008106; GO:0008270
P09444	TRINITY_DN276293_c0_g1	
Q9LJ97	TRINITY_DN276293_c0_g2	GO:0005634; GO:0005730; GO:0005829; GO:0006873; GO:0009845; GO:0010226
P00125	TRINITY_DN276409_c0_g1	GO:0005739; GO:0005743; GO:0005750; GO:0006122; GO:0016021; GO:0020037; GO:0042776; GO:0045153; GO:0046872
Q3IJK2	TRINITY_DN276418_c0_g1	GO:0003735; GO:0006412; GO:0015935; GO:0019843
Q9M8M3	TRINITY_DN276445_c0_g1	GO:0005739
     540 Zostera_blast-annot.tab
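The hard-coded positions (`$8`, `$2`, `$15`) break silently if the column order ever changes. One alternative, sketched here on a made-up table, is to look up columns by header name. The file names and values are hypothetical; only the header names `Accession`, `GeneID`, and `GO` come from the real file.

```shell
# Toy table with the same header style as the DEG file (hypothetical values)
tmp=$(mktemp -d)
printf '\tGeneID\tlogFC\tAccession\tGO\n' > "$tmp/deg.txt"
printf '1\tgeneA\t2.5\tQ11111\tGO:0000001; GO:0000002\n' >> "$tmp/deg.txt"
printf '2\tgeneB\t-1.2\tQ22222\t\n' >> "$tmp/deg.txt"

# On the header line, remember the index of each column name, then skip it.
# On every other line, print the three columns of interest by name.
awk -F'\t' -v OFS='\t' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
{ print $col["Accession"], $col["GeneID"], $col["GO"] }
' "$tmp/deg.txt" > "$tmp/annot.tab"

cat "$tmp/annot.tab"
```

This also drops the header line, so a separate `tail -n +2` is not needed in this variant.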


### *L. zosterae*

In [52]:
#Check DEG file
!head -n2 ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt
!wc -l ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt

	GeneID	logFC	logCPM	F	PValue	FDR	Accession	Isoform	E-value	ProteinN	GO_BP	GO_CC	GO_MF	GO	Status	Organism	S_10B	S_9A	S_13A	S_42A	S_46B	S_47B	S_48B	S_2A	S_2B	S_7B	S_8B	S_33A	S_36B	S_38A	S_40A
1	TRINITY_DN173970_c0_g1	19.7709273	14.04275312	103.1232144	7.15E-12	7.09E-10	Q54IP4	TRINITY_DN173970_c0_g1_i1	1.90E-06	Dual specificity protein kinase shkB (EC 2.7.11.1) (SH2 domain-containing protein 2) (SH2 domain-containing protein B)		membrane [GO:0016020]	ATP binding [GO:0005524]; protein serine/threonine kinase activity [GO:0004674]; protein tyrosine kinase activity [GO:0004713]	GO:0004674; GO:0004713; GO:0005524; GO:0016020	reviewed	Dictyostelium discoideum (Slime mold)	1149	2612	930	742	1004	757	488	0	0	0	2	1	0	2	1
     100 ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt


In [53]:
#Remove header
#Sort
#Only keep Uniprot accession ($8), gene ID ($2), and GOterms ($15)
#Save as new file
!tail -n +2 ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt \
| sort \
| awk -F'\t' -v OFS='\t' '{print $8, $2, $15}' \
> nonZostera_blast-annot.tab

In [55]:
#Check output
!head nonZostera_blast-annot.tab
!wc -l nonZostera_blast-annot.tab

Q54IP4	TRINITY_DN173970_c0_g1	GO:0004674; GO:0004713; GO:0005524; GO:0016020
Q8Y457	TRINITY_DN251191_c0_g2	GO:0003723; GO:0009982; GO:0031119; GO:0106029
Q0V6M5	TRINITY_DN66609_c0_g1	GO:0004310; GO:0006696; GO:0016021; GO:0016117; GO:0016767; GO:0016872; GO:0051996
Q8IXJ6	TRINITY_DN255273_c0_g1	GO:0000122; GO:0000183; GO:0000781; GO:0003682; GO:0004407; GO:0005634; GO:0005677; GO:0005694; GO:0005720; GO:0005730; GO:0005737; GO:0005739; GO:0005813; GO:0005814; GO:0005819; GO:0005829; GO:0005874; GO:0005886; GO:0006342; GO:0006348; GO:0006471; GO:0006476; GO:0006914; GO:0007096; GO:0008134; GO:0008270; GO:0008285; GO:0010507; GO:0010801; GO:0014065; GO:0016458; GO:0016575; GO:0017136; GO:0021762; GO:0022011; GO:0030426; GO:0030496; GO:0031641; GO:0032436; GO:0033010; GO:0033270; GO:0033558; GO:0034599; GO:0034979; GO:0034983; GO:0035035; GO:0035729; GO:0042177; GO:0042325; GO:0042826; GO:0042903; GO:0043130; GO:0043161; GO:0043204; GO:0043209; GO:0043219; GO:0043220; GO:0043388; GO:00434

## 2. Unfold GOterms

### *Z. marina*

In [46]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="Zostera_blast-annot.tab"
file="Zostera_intermediate.file"
output_file="Zostera_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number).
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-2 (Uniprot accession and gene ID).
	# Save the results of the echo/cut pipe (i.e. fields 1-2) to the variable "fixed_fields"
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) from the first GO term field ("begin_goterms") to the last field in the current line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the first 2 fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the intermediate file into the while loop and send the output to the file named in "output_file".
done < "$file" > "$output_file"
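The unfolding loop above can also be written as a single `awk` call. This is only a sketch of an alternative, not the script used here: it assumes the three-column layout (accession, gene ID, "; "-delimited GO terms) and skips lines with an empty GO field. The toy file and its values are made up.

```shell
# Toy input in the same layout as the *_blast-annot.tab files (hypothetical values)
tmp=$(mktemp -d)
printf 'Q11111\tgeneA\tGO:0000001; GO:0000002\n' > "$tmp/annot.tab"
printf 'Q22222\tgeneB\t\n' >> "$tmp/annot.tab"
printf 'Q33333\tgeneC\tGO:0000003\n' >> "$tmp/annot.tab"

# For each line with a non-empty GO field, split the field on "; "
# and print one line per GO term, repeating the first two columns.
awk -F'\t' -v OFS='\t' '
$3 != "" {
    n = split($3, terms, "; ")
    for (i = 1; i <= n; i++) print $1, $2, terms[i]
}' "$tmp/annot.tab" > "$tmp/unfolded.tab"

cat "$tmp/unfolded.tab"
```

Because `awk` reads the file once, this avoids spawning `echo`, `cut`, and `awk` for every input line.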

In [48]:
#It was unfolded correctly
!head Zostera_blast-GO-unfolded.tab

Q7ZX51	TRINITY_DN102431_c0_g1	GO:0000049
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0004831
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0005524
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0005737
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0006437
O49561	TRINITY_DN172833_c0_g1	GO:0009685
O49561	TRINITY_DN172833_c0_g1	GO:0009686
O49561	TRINITY_DN172833_c0_g1	GO:0046872
O49561	TRINITY_DN172833_c0_g1	GO:0051213
O49561	TRINITY_DN172833_c0_g1	GO:0052635


In [58]:
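#Reorder columns to GOterm, gene ID and sort by GOterm
#gsort is GNU sort (e.g. installed with coreutils on macOS)
!awk '{print $3"\t"$2}' Zostera_blast-GO-unfolded.tab | gsort -V > Zostera_blast-GO-unfolded.sorted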

In [59]:
#Extra space was removed and columns reorganized
!head Zostera_blast-GO-unfolded.sorted

GO:0000027	TRINITY_DN293394_c0_g1
GO:0000027	TRINITY_DN298832_c1_g1
GO:0000027	TRINITY_DN298848_c3_g2
GO:0000027	TRINITY_DN314236_c0_g1
GO:0000028	TRINITY_DN292998_c5_g6
GO:0000028	TRINITY_DN316936_c5_g1
GO:0000035	TRINITY_DN311987_c0_g1
GO:0000036	TRINITY_DN311987_c0_g1
GO:0000038	TRINITY_DN314641_c0_g1
GO:0000045	TRINITY_DN298449_c4_g2


### *L. zosterae*

In [56]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="nonZostera_blast-annot.tab"
file="nonZostera_intermediate.file"
output_file="nonZostera_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number).
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-2 (Uniprot accession and gene ID).
	# Save the results of the echo/cut pipe (i.e. fields 1-2) to the variable "fixed_fields"
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) from the first GO term field ("begin_goterms") to the last field in the current line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the first 2 fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the intermediate file into the while loop and send the output to the file named in "output_file".
done < "$file" > "$output_file"

In [57]:
#It was unfolded correctly
!head nonZostera_blast-GO-unfolded.tab

Q54IP4	TRINITY_DN173970_c0_g1	GO:0004674
Q54IP4	TRINITY_DN173970_c0_g1	GO:0004713
Q54IP4	TRINITY_DN173970_c0_g1	GO:0005524
Q54IP4	TRINITY_DN173970_c0_g1	GO:0016020
Q8Y457	TRINITY_DN251191_c0_g2	GO:0003723
Q8Y457	TRINITY_DN251191_c0_g2	GO:0009982
Q8Y457	TRINITY_DN251191_c0_g2	GO:0031119
Q8Y457	TRINITY_DN251191_c0_g2	GO:0106029
Q0V6M5	TRINITY_DN66609_c0_g1	GO:0004310
Q0V6M5	TRINITY_DN66609_c0_g1	GO:0006696


In [60]:
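#Reorder columns to GOterm, gene ID and sort by GOterm
#gsort is GNU sort (e.g. installed with coreutils on macOS)
!awk '{print $3"\t"$2}' nonZostera_blast-GO-unfolded.tab | gsort -V > nonZostera_blast-GO-unfolded.sorted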

In [61]:
#Extra space was removed and columns reorganized
!head nonZostera_blast-GO-unfolded.sorted

GO:0000079	TRINITY_DN255903_c0_g1
GO:0000079	TRINITY_DN299755_c1_g3
GO:0000122	TRINITY_DN255273_c0_g1
GO:0000122	TRINITY_DN311716_c0_g2
GO:0000132	TRINITY_DN255903_c0_g1
GO:0000139	TRINITY_DN294657_c0_g1
GO:0000183	TRINITY_DN255273_c0_g1
GO:0000188	TRINITY_DN299755_c1_g3
GO:0000212	TRINITY_DN311731_c0_g1
GO:0000226	TRINITY_DN255903_c0_g1


## 3. Match GOterms to GO Slim terms

In [135]:
#Download list of GO Slim and matching GOterms
!curl -O http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/GO-GOslim.sorted

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2314k  100 2314k    0     0  1115k      0  0:00:02  0:00:02 --:--:-- 1115k


In [35]:
!head GO-GOslim.sorted

GO:0000001	mitochondrion inheritance	cell organization and biogenesis	P
GO:0000002	mitochondrial genome maintenance	cell organization and biogenesis	P
GO:0000003	reproduction	other biological processes	P
GO:0000006	high affinity zinc uptake transmembrane transporter activity	transporter activity	F
GO:0000007	low-affinity zinc ion transmembrane transporter activity	transporter activity	F
GO:0000009	"alpha-1,6-mannosyltransferase activity"	other molecular function	F
GO:0000010	trans-hexaprenyltranstransferase activity	other molecular function	F
GO:0000011	vacuole inheritance	cell organization and biogenesis	P
GO:0000012	single strand break repair	DNA metabolism	P
GO:0000012	single strand break repair	stress response	P


### *Z. marina*

In [65]:
#Join files on GOterm to get GOslim for each query
#Both input files must already be sorted by GOterm
#Remove adjacent duplicate lines
#Only print gene ID, GO Slim term, and GO aspect (P, F, or C)
!join -1 1 -2 1 -t $'\t' \
Zostera_blast-GO-unfolded.sorted \
GO-GOslim.sorted \
| uniq | awk -F'\t' -v OFS='\t' '{print $2, $4, $5}' \
> Zostera_Blastquery-GOslim.tab

In [66]:
#Check output
!head Zostera_Blastquery-GOslim.tab
!wc -l Zostera_Blastquery-GOslim.tab

TRINITY_DN293394_c0_g1	cell organization and biogenesis	P
TRINITY_DN298832_c1_g1	cell organization and biogenesis	P
TRINITY_DN298848_c3_g2	cell organization and biogenesis	P
TRINITY_DN314236_c0_g1	cell organization and biogenesis	P
TRINITY_DN292998_c5_g6	cell organization and biogenesis	P
TRINITY_DN316936_c5_g1	cell organization and biogenesis	P
TRINITY_DN311987_c0_g1	other molecular function	F
TRINITY_DN311987_c0_g1	transporter activity	F
TRINITY_DN314641_c0_g1	other metabolic processes	P
TRINITY_DN298449_c4_g2	cell organization and biogenesis	P
    4437 Zostera_Blastquery-GOslim.tab
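To make the column positions after the join easier to follow, here is a small sketch on toy data. The GO Slim rows are copied from the `head` of `GO-GOslim.sorted` above; the gene IDs are made up. I set `LC_ALL=C` so `sort` and `join` agree on ordering, and I use `sort -u` instead of `uniq` because `uniq` only removes adjacent duplicates. Both choices are assumptions of this sketch, not what the cell above does.

```shell
export LC_ALL=C
TAB=$(printf '\t')
tmp=$(mktemp -d)

# File 1: GOterm <TAB> gene ID (hypothetical genes), sorted on the join field
printf 'GO:0000002\tgeneB\nGO:0000001\tgeneA\nGO:0000012\tgeneA\n' \
| sort -t "$TAB" -k1,1 > "$tmp/unfolded.sorted"

# File 2: GOterm, term name, GO Slim, aspect (rows taken from GO-GOslim.sorted)
{
printf 'GO:0000001\tmitochondrion inheritance\tcell organization and biogenesis\tP\n'
printf 'GO:0000002\tmitochondrial genome maintenance\tcell organization and biogenesis\tP\n'
printf 'GO:0000012\tsingle strand break repair\tDNA metabolism\tP\n'
printf 'GO:0000012\tsingle strand break repair\tstress response\tP\n'
} > "$tmp/goslim.sorted"

# Joined line layout: $1 GOterm, $2 gene ID, $3 term name, $4 GO Slim, $5 aspect
join -1 1 -2 1 -t "$TAB" "$tmp/unfolded.sorted" "$tmp/goslim.sorted" \
| awk -F'\t' -v OFS='\t' '{print $2, $4, $5}' \
| sort -u > "$tmp/query-goslim.tab"

cat "$tmp/query-goslim.tab"
```

A GOterm that maps to two GO Slim terms (here `GO:0000012`) produces two output lines for the same gene.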


In [67]:
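#Count the number of unique IDs with GOSlim terms
#Note: uniq -f1 skips the first field (gene ID) when comparing adjacent lines,
#so this counts runs of lines with the same GO Slim entry, not distinct gene IDs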
!uniq -f1 Zostera_Blastquery-GOslim.tab | wc -l

    1549
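For comparison, a count of distinct gene IDs can be taken from the first column alone. This sketch uses a made-up four-line file to show how that count can differ from `uniq -f1`, which ignores the first field when comparing adjacent lines.

```shell
# Toy gene / GO Slim / aspect table (hypothetical gene IDs)
tmp=$(mktemp -d)
{
printf 'geneA\tstress response\tP\n'
printf 'geneB\tstress response\tP\n'
printf 'geneC\tstress response\tP\n'
printf 'geneA\tDNA metabolism\tP\n'
} > "$tmp/goslim.tab"

# Distinct gene IDs: keep column 1, sort, drop duplicates, count
n_genes=$(cut -f1 "$tmp/goslim.tab" | sort -u | wc -l | tr -d ' ')

# uniq -f1 skips the gene ID, so the three adjacent "stress response" lines collapse into one
n_runs=$(uniq -f1 "$tmp/goslim.tab" | wc -l | tr -d ' ')

echo "distinct gene IDs: $n_genes"
echo "uniq -f1 count:    $n_runs"
```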


### *L. zosterae*

In [68]:
#Join files on GOterm to get GOslim for each query
#Both input files must already be sorted by GOterm
#Remove adjacent duplicate lines
#Only print gene ID, GO Slim term, and GO aspect (P, F, or C)
!join -1 1 -2 1 -t $'\t' \
nonZostera_blast-GO-unfolded.sorted \
GO-GOslim.sorted \
| uniq | awk -F'\t' -v OFS='\t' '{print $2, $4, $5}' \
> nonZostera_Blastquery-GOslim.tab

In [69]:
#Check output
!head nonZostera_Blastquery-GOslim.tab
!wc -l nonZostera_Blastquery-GOslim.tab

TRINITY_DN255903_c0_g1	cell cycle and proliferation	P
TRINITY_DN255903_c0_g1	other metabolic processes	P
TRINITY_DN299755_c1_g3	cell cycle and proliferation	P
TRINITY_DN299755_c1_g3	other metabolic processes	P
TRINITY_DN255273_c0_g1	RNA metabolism	P
TRINITY_DN311716_c0_g2	RNA metabolism	P
TRINITY_DN255903_c0_g1	cell cycle and proliferation	P
TRINITY_DN255903_c0_g1	cell organization and biogenesis	P
TRINITY_DN294657_c0_g1	ER/Golgi	C
TRINITY_DN294657_c0_g1	other membranes	C
     869 nonZostera_Blastquery-GOslim.tab


In [71]:
#Count the number of unique IDs with GOSlim terms
#Note: uniq -f1 ignores the gene ID when comparing adjacent lines
!uniq -f1 nonZostera_Blastquery-GOslim.tab | wc -l

     386


## 4. Obtain Biological Process GOterms

### *Z. marina*

In [151]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" Zostera_blastquery-GOslim.tab \
> Zostera_blastquery-GOslim-BP.sorted.unique.noOther
!head Zostera_blastquery-GOslim-BP.sorted.unique.noOther
!wc -l Zostera_blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN293394_c0_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN298832_c1_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN298848_c3_g2	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN314236_c0_g1	cell organization and biogenesis	ribosomal large subunit assembly
TRINITY_DN292998_c5_g6	cell organization and biogenesis	ribosomal small subunit assembly
TRINITY_DN316936_c5_g1	cell organization and biogenesis	ribosomal small subunit assembly
TRINITY_DN298449_c4_g2	cell organization and biogenesis	autophagic vacuole formation
TRINITY_DN298449_c4_g2	other metabolic processes	autophagic vacuole formation
TRINITY_DN298449_c4_g2	stress response	autophagic vacuole formation
TRINITY_DN312745_c1_g1	cell cycle and proliferation	cell cycle checkpoint
     601 Zostera_blastquery-GOslim-BP.sorted.unique.noOther
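`grep --invert-match` drops any line containing the phrase anywhere, and it does not by itself restrict the file to Biological Process terms. A stricter filter can be sketched with `awk` on exact field values. This assumes a three-column layout of gene ID, GO Slim term, and aspect, as in the output of step 3; the toy rows are made up.

```shell
# Toy gene / GO Slim / aspect table (hypothetical gene IDs)
tmp=$(mktemp -d)
{
printf 'geneA\tstress response\tP\n'
printf 'geneA\tother biological processes\tP\n'
printf 'geneB\tDNA metabolism\tP\n'
printf 'geneB\ttransporter activity\tF\n'
printf 'geneC\tother membranes\tC\n'
} > "$tmp/goslim.tab"

# Keep Biological Process rows (aspect "P") whose GO Slim term
# is not the catch-all "other biological processes"
awk -F'\t' '$3 == "P" && $2 != "other biological processes"' \
"$tmp/goslim.tab" > "$tmp/goslim-BP.noOther"

cat "$tmp/goslim-BP.noOther"
```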


In [152]:
#Count the number of unique gene IDs with defined GOSlim terms
#Note: uniq -f1 ignores the gene ID when comparing adjacent lines
!uniq -f1 Zostera_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

     208


### *L. zosterae*

In [156]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" nonZostera_blastquery-GOslim.tab \
> nonZostera_blastquery-GOslim-BP.sorted.unique.noOther
!head nonZostera_blastquery-GOslim-BP.sorted.unique.noOther
!wc -l nonZostera_blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN233011_c0_g1	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN277328_c0_g1	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN299789_c0_g2	cell cycle and proliferation	cytokinesis after mitosis
TRINITY_DN295214_c0_g1	stress response	response to reactive oxygen species
TRINITY_DN295214_c0_g2	stress response	response to reactive oxygen species
TRINITY_DN312005_c1_g1	stress response	response to reactive oxygen species
TRINITY_DN255903_c0_g1	protein metabolism	regulation of protein amino acid phosphorylation
TRINITY_DN299755_c1_g3	protein metabolism	regulation of protein amino acid phosphorylation
TRINITY_DN256449_c0_g1	other metabolic processes	"nucleobase, nucleoside, nucleotide and nucleic acid metabolic process"
TRINITY_DN296708_c0_g1	other metabolic processes	"nucleobase, nucleoside, nucleotide and nucleic acid metabolic process"
      92 nonZostera_blastquery-GOslim-BP.sorted.unique.noOther


In [157]:
#Count the number of unique gene IDs with defined GOSlim terms
#Note: uniq -f1 ignores the gene ID when comparing adjacent lines
!uniq -f1 nonZostera_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

      43


## 5. Match GO Slim terms to original annotations

### *Z. marina*

In [19]:
#Check format of gene list with protein annotations
#Columns: seq ID, Uniprot, seq ID with isoform information, e-value, annotation, GOterms, reviewed, organism
!head -n2 ../../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab

TRINITY_DN100001_c0_g1	Q54EW8	TRINITY_DN100001_c0_g1_i1	1.2e-19	Dihydrolipoyl dehydrogenase, mitochondrial (EC 1.8.1.4) (Dihydrolipoamide dehydrogenase) (Glycine cleavage system L protein)	cell redox homeostasis [GO:0045454]; glycine catabolic process [GO:0006546]; isoleucine catabolic process [GO:0006550]; leucine catabolic process [GO:0006552]; L-serine biosynthetic process [GO:0006564]; valine catabolic process [GO:0006574]	extracellular matrix [GO:0031012]; mitochondrial matrix [GO:0005759]; mitochondrial pyruvate dehydrogenase complex [GO:0005967]; phagocytic vesicle [GO:0045335]	dihydrolipoyl dehydrogenase activity [GO:0004148]; electron transfer activity [GO:0009055]; flavin adenine dinucleotide binding [GO:0050660]	GO:0004148; GO:0005759; GO:0005967; GO:0006546; GO:0006550; GO:0006552; GO:0006564; GO:0006574; GO:0009055; GO:0031012; GO:0045335; GO:0045454; GO:0050660	reviewed	Dictyostelium discoideum (Slime mold)
TRINITY_DN100015_c0_g1	P16894	TRINITY_DN100015_c0_g1_i1	1.2e-21	

In [35]:
#Join protein annotation with GO Slim terms (unique, no other)
#No -a1 option here, so only gene IDs found in both files are printed
#Print gene ID, Uniprot accession, GO Slim term, GOterm name, protein name, and GOterms
!join -1 1 -2 1 -t $'\t' \
Zostera_blastquery-GOslim-BP.sorted.unique.noOther \
../../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $4, $2, $3, $7, $11}' \
> Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

In [36]:
#Check output
#Count lines
!head Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot
!wc -l Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

TRINITY_DN293394_c0_g1	P47991	cell organization and biogenesis	ribosomal large subunit assembly	60S ribosomal protein L6	GO:0000027; GO:0002181; GO:0003723; GO:0003735; GO:0005791; GO:0008340; GO:0022625
TRINITY_DN298832_c1_g1	A4FV84	cell organization and biogenesis	ribosomal large subunit assembly	mRNA turnover protein 4 homolog (Ribosome assembly factor MRTO4)	GO:0000027; GO:0000956; GO:0005730; GO:0005737; GO:0006364; GO:0030687; GO:0042273
TRINITY_DN298848_c3_g2	O04204	cell organization and biogenesis	ribosomal large subunit assembly	60S acidic ribosomal protein P0-1	GO:0000027; GO:0002181; GO:0003735; GO:0022625; GO:0022626; GO:0070180
TRINITY_DN314236_c0_g1	Q12019	cell organization and biogenesis	ribosomal large subunit assembly	Midasin (Dynein-related AAA-ATPase REA1) (MIDAS-containing protein) (Ribosome export/assembly protein 1)	GO:0000027; GO:0005524; GO:0005634; GO:0005654; GO:0005730; GO:0005739; GO:0006364; GO:0016887; GO:0110136; GO:2000200
TRINITY_DN316936_c5_g1	Q9Y3A4	c

### *L. zosterae*

In [47]:
#Check format of gene list with protein annotations
!head -n5 2019-07-15-nonZostera-DEG-ProteinN.tab

seq	ProteinN
TRINITY_DN312737_c2_g2	Jouberin (Abelson helper integration site 1 protein) (AHI-1)
TRINITY_DN271666_c1_g1	Acyl-protein thioesterase 1 (EC 3.1.2.-)
TRINITY_DN271666_c1_g1	Acyl-protein thioesterase 1 (EC 3.1.2.-)
TRINITY_DN296708_c0_g1	DPH4 homolog (DnaJ homolog subfamily C member 24)


In [48]:
#Remove header line
# Sort file and only keep unique entries
#Save output
!tail -n +2 2019-07-15-nonZostera-DEG-ProteinN.tab \
| sort | uniq \
> 2019-07-15-nonZostera-DEG-ProteinN.noHead.tab

In [49]:
#Check header was removed
!head -n5 2019-07-15-nonZostera-DEG-ProteinN.noHead.tab

TRINITY_DN233011_c0_g1	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN255273_c0_g1	NAD-dependent protein deacetylase sirtuin-2 (EC 3.5.1.-) (Regulatory protein SIR2 homolog 2) (SIR2-like protein 2)
TRINITY_DN255903_c0_g1	Protein BCCIP homolog
TRINITY_DN256449_c0_g1	Splicing factor U2AF 35 kDa subunit (U2 auxiliary factor 35 kDa subunit) (U2 snRNP auxiliary factor small subunit) (Fragment)
TRINITY_DN271666_c1_g1	Acyl-protein thioesterase 1 (EC 3.1.2.-)


In [50]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' -a1 \
nonZostera_blastquery-GOslim-BP.sorted.unique.noOther \
2019-07-15-nonZostera-DEG-ProteinN.noHead.tab \
> nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.ProteinN
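`join` expects both inputs to be sorted on the join field. The GO Slim file here appears to be ordered by GOterm rather than by gene ID, so some pairs may be missed. A hedged sketch of the same left join with both inputs sorted by gene ID first follows; the file names, gene IDs, and protein names in it are made up for illustration.

```shell
export LC_ALL=C
TAB=$(printf '\t')
tmp=$(mktemp -d)

# Toy GO Slim table ordered by GOterm, not by gene ID (hypothetical genes)
{
printf 'geneB\tstress response\tresponse to reactive oxygen species\n'
printf 'geneA\tcell cycle and proliferation\tcytokinesis after mitosis\n'
printf 'geneC\tprotein metabolism\tregulation of protein amino acid phosphorylation\n'
} > "$tmp/goslim.tab"

# Toy protein-name table; geneC has no entry
printf 'geneA\tExample protein 1\ngeneB\tExample protein 2\n' > "$tmp/proteins.tab"

# Sort both inputs on column 1 so join sees matching keys in the same order
sort -t "$TAB" -k1,1 "$tmp/goslim.tab" > "$tmp/goslim.byGene"
sort -t "$TAB" -k1,1 "$tmp/proteins.tab" > "$tmp/proteins.byGene"

# -a1 keeps unpaired lines from the first file (left join)
join -1 1 -2 1 -t "$TAB" -a1 "$tmp/goslim.byGene" "$tmp/proteins.byGene" > "$tmp/joined.tab"

cat "$tmp/joined.tab"
```

With sorted inputs, a line that comes out without a protein name means the gene is absent from the annotation file.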

In [51]:
!head nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.ProteinN
!wc -l nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.ProteinN

TRINITY_DN233011_c0_g1	cell cycle and proliferation	cytokinesis after mitosis	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN277328_c0_g1	cell cycle and proliferation	cytokinesis after mitosis	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN299789_c0_g2	cell cycle and proliferation	cytokinesis after mitosis	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN295214_c0_g1	stress response	response to reactive oxygen species
TRINITY_DN295214_c0_g2	stress response	response to reactive oxygen species
TRINITY_DN312005_c1_g1	stress response	response to reactive oxygen species	Cytochrome c peroxidase, mitochondrial (CCP) (EC 1.11.1.5)
TRINITY_DN255903_c0_g1	protein metabolism	regulation of protein amino acid phosphorylation
TRINITY_DN299755_c1_g3	protein metabolism	r