# Assigning GO Slim Terms

In this notebook, I'll assign GO Slim terms to differentially expressed genes for *Zostera marina* and *Labyrinthula zosterae*. I will only do this for files with GO-MWU Biological Process output. This will help with downstream interpretation of biological processes impacted by infection.

## 0. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/project-EWD-transcriptomics/scripts'

In [2]:
cd ../analyses/

/Users/yaamini/Documents/project-EWD-transcriptomics/analyses


In [3]:
#!mkdir Gene-Enrichment

In [4]:
cd Gene-Enrichment/

/Users/yaamini/Documents/project-EWD-transcriptomics/analyses/Gene-Enrichment


## *Z. marina*

### 1. Format differentially expressed gene lists

In [21]:
#Check DEG file
!head -n2 ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt
!wc -l ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt

	GeneID	logFC	logCPM	F	PValue	FDR	Accession	Isoform	E-value	ProteinN	GO_BP	GO_CC	GO_MF	GO	Status	Organism	S_10B	S_9A	S_13A	S_42A	S_46B	S_47B	S_48B	S_2A	S_2B	S_7B	S_8B	S_33A	S_36B	S_38A	S_40A
1	TRINITY_DN102431_c0_g1	7.87968942125049	1.5948412518851	9.62074706954095	0.00646939513530696	0.039492658666223	Q7ZX51	TRINITY_DN102431_c0_g1_i1	1.4e-40	Tyrosine--tRNA ligase, cytoplasmic (EC 6.1.1.1) (Tyrosyl-tRNA synthetase) (TyrRS)	tyrosyl-tRNA aminoacylation [GO:0006437]	cytoplasm [GO:0005737]	ATP binding [GO:0005524]; tRNA binding [GO:0000049]; tyrosine-tRNA ligase activity [GO:0004831]	GO:0000049; GO:0004831; GO:0005524; GO:0005737; GO:0006437	reviewed	Xenopus laevis (African clawed frog)	0	0	2	0	2	4	0	0	0	0	0	0	0	0	0
     541 ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt


In [43]:
#Remove header
# Sort
#Only keep gene ID and GOterms
#Save as new file
!tail -n +2 ../EdgeR/DE.EXP.CON.FDR.Z.Annot.txt \
| sort \
| awk -F'\t' -v OFS='\t' '{print $8, $2, $15}' \
> Zostera_blast-annot.tab

In [45]:
#Check output
#Count lines (original - 1)
!head Zostera_blast-annot.tab
!wc -l Zostera_blast-annot.tab

Q7ZX51	TRINITY_DN102431_c0_g1	GO:0000049; GO:0004831; GO:0005524; GO:0005737; GO:0006437
O49561	TRINITY_DN172833_c0_g1	GO:0009685; GO:0009686; GO:0046872; GO:0051213; GO:0052635
Q47UW0	TRINITY_DN276081_c0_g2	GO:0003677; GO:0003899; GO:0006351
Q5XHZ0	TRINITY_DN276264_c0_g1	GO:0003723; GO:0005524; GO:0005654; GO:0005739; GO:0005743; GO:0005758; GO:0005759; GO:0006457; GO:0009386; GO:0019901; GO:0051082; GO:1901856; GO:1903751
P0CH36	TRINITY_DN276264_c0_g3	GO:0008106; GO:0008270
P09444	TRINITY_DN276293_c0_g1	
Q9LJ97	TRINITY_DN276293_c0_g2	GO:0005634; GO:0005730; GO:0005829; GO:0006873; GO:0009845; GO:0010226
P00125	TRINITY_DN276409_c0_g1	GO:0005739; GO:0005743; GO:0005750; GO:0006122; GO:0016021; GO:0020037; GO:0042776; GO:0045153; GO:0046872
Q3IJK2	TRINITY_DN276418_c0_g1	GO:0003735; GO:0006412; GO:0015935; GO:0019843
Q9M8M3	TRINITY_DN276445_c0_g1	GO:0005739
     540 Zostera_blast-annot.tab


### 2. Unfold GOterms

In [46]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="Zostera_blast-annot.tab"
file="Zostera_intermediate.file"
output_file="Zostera_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number).
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-12.
	# Save the results of the echo/cut pipe (i.e. fields 1-12) to the variable "fixed_fields"
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) 13 to whatever the last field is in the curent line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the first 12 fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the input file into the while loop and send the output to a file named "rhonda_fixed.txt".
done < "$file" > "$output_file"

In [48]:
#It was unfolded correctly
!head Zostera_blast-GO-unfolded.tab

Q7ZX51	TRINITY_DN102431_c0_g1	GO:0000049
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0004831
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0005524
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0005737
Q7ZX51	TRINITY_DN102431_c0_g1	GO:0006437
O49561	TRINITY_DN172833_c0_g1	GO:0009685
O49561	TRINITY_DN172833_c0_g1	GO:0009686
O49561	TRINITY_DN172833_c0_g1	GO:0046872
O49561	TRINITY_DN172833_c0_g1	GO:0051213
O49561	TRINITY_DN172833_c0_g1	GO:0052635


In [58]:
!awk '{print $3"\t"$2}' Zostera_blast-GO-unfolded.tab | gsort -V > Zostera_blast-GO-unfolded.sorted

In [59]:
#Extra space was removed and columns reorganized
!head Zostera_blast-GO-unfolded.sorted

GO:0000027	TRINITY_DN293394_c0_g1
GO:0000027	TRINITY_DN298832_c1_g1
GO:0000027	TRINITY_DN298848_c3_g2
GO:0000027	TRINITY_DN314236_c0_g1
GO:0000028	TRINITY_DN292998_c5_g6
GO:0000028	TRINITY_DN316936_c5_g1
GO:0000035	TRINITY_DN311987_c0_g1
GO:0000036	TRINITY_DN311987_c0_g1
GO:0000038	TRINITY_DN314641_c0_g1
GO:0000045	TRINITY_DN298449_c4_g2


### 3. Match GOterms to GO Slim terms

In [135]:
#Download list of GO Slim and matching GOterms
!curl -O http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/GO-GOslim.sorted

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2314k  100 2314k    0     0  1115k      0  0:00:02  0:00:02 --:--:-- 1115k


In [35]:
!head GO-GOslim.sorted

GO:0000001	mitochondrion inheritance	cell organization and biogenesis	P
GO:0000002	mitochondrial genome maintenance	cell organization and biogenesis	P
GO:0000003	reproduction	other biological processes	P
GO:0000006	high affinity zinc uptake transmembrane transporter activity	transporter activity	F
GO:0000007	low-affinity zinc ion transmembrane transporter activity	transporter activity	F
GO:0000009	"alpha-1,6-mannosyltransferase activity"	other molecular function	F
GO:0000010	trans-hexaprenyltranstransferase activity	other molecular function	F
GO:0000011	vacuole inheritance	cell organization and biogenesis	P
GO:0000012	single strand break repair	DNA metabolism	P
GO:0000012	single strand break repair	stress response	P


In [65]:
#Join files to get GOslim for each query
#Remove duplicate GOslim and queries removed
#Sort files
#Only print gene ID, GO Slim, and original GOterm
!join -1 1 -2 1 -t $'\t' \
Zostera_blast-GO-unfolded.sorted \
GO-GOslim.sorted \
| uniq | awk -F'\t' -v OFS='\t' '{print $2, $4, $5}' \
> Zostera_Blastquery-GOslim.tab

In [66]:
#Check output
!head Zostera_Blastquery-GOslim.tab
!wc -l Zostera_Blastquery-GOslim.tab

TRINITY_DN293394_c0_g1	cell organization and biogenesis	P
TRINITY_DN298832_c1_g1	cell organization and biogenesis	P
TRINITY_DN298848_c3_g2	cell organization and biogenesis	P
TRINITY_DN314236_c0_g1	cell organization and biogenesis	P
TRINITY_DN292998_c5_g6	cell organization and biogenesis	P
TRINITY_DN316936_c5_g1	cell organization and biogenesis	P
TRINITY_DN311987_c0_g1	other molecular function	F
TRINITY_DN311987_c0_g1	transporter activity	F
TRINITY_DN314641_c0_g1	other metabolic processes	P
TRINITY_DN298449_c4_g2	cell organization and biogenesis	P
    4437 Zostera_Blastquery-GOslim.tab


In [67]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 Zostera_Blastquery-GOslim.tab | wc -l

    1549


### 4. Obtain Biological Process GO Slim terms

In [72]:
#Get BP entries for each gene ID. Multiple GOslim terms may be matched with one ID.
#Confirm GOslim terms were obtained
#Count the number of entries
!awk -F"\t" '$3 == "P" { print $1"\t"$2 }' Zostera_Blastquery-GOslim.tab | sort > Zostera_Blastquery-GOslim-BP.sorted
!head Zostera_Blastquery-GOslim-BP.sorted
!wc -l Zostera_Blastquery-GOslim-BP.sorted

TRINITY_DN102431_c0_g1	RNA metabolism
TRINITY_DN102431_c0_g1	protein metabolism
TRINITY_DN111410_c0_g1	other biological processes
TRINITY_DN111543_c0_g1	other biological processes
TRINITY_DN111543_c0_g1	other biological processes
TRINITY_DN111543_c0_g1	other biological processes
TRINITY_DN111543_c0_g1	other metabolic processes
TRINITY_DN111543_c0_g1	other metabolic processes
TRINITY_DN111543_c0_g1	other metabolic processes
TRINITY_DN111543_c0_g1	other metabolic processes
    1611 Zostera_Blastquery-GOslim-BP.sorted


In [73]:
#Remove duplicate entries
#Count the number of unique entries
!uniq Zostera_Blastquery-GOslim-BP.sorted > Zostera_Blastquery-GOslim-BP.sorted.unique
!head Zostera_Blastquery-GOslim-BP.sorted.unique
!wc -l Zostera_Blastquery-GOslim-BP.sorted.unique

TRINITY_DN102431_c0_g1	RNA metabolism
TRINITY_DN102431_c0_g1	protein metabolism
TRINITY_DN111410_c0_g1	other biological processes
TRINITY_DN111543_c0_g1	other biological processes
TRINITY_DN111543_c0_g1	other metabolic processes
TRINITY_DN111543_c0_g1	stress response
TRINITY_DN111591_c0_g1	other metabolic processes
TRINITY_DN139711_c0_g1	RNA metabolism
TRINITY_DN139711_c0_g1	cell cycle and proliferation
TRINITY_DN139711_c0_g1	cell organization and biogenesis
     975 Zostera_Blastquery-GOslim-BP.sorted.unique


In [74]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 Zostera_Blastquery-GOslim-BP.sorted.unique | wc -l

     920


In [75]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" Zostera_Blastquery-GOslim-BP.sorted.unique \
> Zostera_Blastquery-GOslim-BP.sorted.unique.noOther
!head Zostera_Blastquery-GOslim-BP.sorted.unique.noOther
!wc -l Zostera_Blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN102431_c0_g1	RNA metabolism
TRINITY_DN102431_c0_g1	protein metabolism
TRINITY_DN111543_c0_g1	other metabolic processes
TRINITY_DN111543_c0_g1	stress response
TRINITY_DN111591_c0_g1	other metabolic processes
TRINITY_DN139711_c0_g1	RNA metabolism
TRINITY_DN139711_c0_g1	cell cycle and proliferation
TRINITY_DN139711_c0_g1	cell organization and biogenesis
TRINITY_DN139711_c0_g1	cell-cell signaling
TRINITY_DN139711_c0_g1	developmental processes
     848 Zostera_Blastquery-GOslim-BP.sorted.unique.noOther


In [76]:
#Count the number of unique CGI IDs with defined GOSlim terms
!uniq -f1 Zostera_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

     781


### 5. Match GO Slim terms to original annotations

In [84]:
#Check format of gene list with protein annotations
#Columns: seq ID, Uniprot, seq ID ith isoform information, e-value, annotation, GOerms, reviewed, organism
!head -n2 ../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab

TRINITY_DN100001_c0_g1	Q54EW8	TRINITY_DN100001_c0_g1_i1	1.2e-19	Dihydrolipoyl dehydrogenase, mitochondrial (EC 1.8.1.4) (Dihydrolipoamide dehydrogenase) (Glycine cleavage system L protein)	cell redox homeostasis [GO:0045454]; glycine catabolic process [GO:0006546]; isoleucine catabolic process [GO:0006550]; leucine catabolic process [GO:0006552]; L-serine biosynthetic process [GO:0006564]; valine catabolic process [GO:0006574]	extracellular matrix [GO:0031012]; mitochondrial matrix [GO:0005759]; mitochondrial pyruvate dehydrogenase complex [GO:0005967]; phagocytic vesicle [GO:0045335]	dihydrolipoyl dehydrogenase activity [GO:0004148]; electron transfer activity [GO:0009055]; flavin adenine dinucleotide binding [GO:0050660]	GO:0004148; GO:0005759; GO:0005967; GO:0006546; GO:0006550; GO:0006552; GO:0006564; GO:0006574; GO:0009055; GO:0031012; GO:0045335; GO:0045454; GO:0050660	reviewed	Dictyostelium discoideum (Slime mold)
TRINITY_DN100015_c0_g1	P16894	TRINITY_DN100015_c0_g1_i1	1.2e-21	

In [103]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' \
Zostera_blastquery-GOslim-BP.sorted.unique.noOther \
../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $6}' \
> Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

In [104]:
#Check output
#Count lines (same as original file)
!head Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot
!wc -l Zostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

TRINITY_DN102431_c0_g1	RNA metabolism	Tyrosine--tRNA ligase, cytoplasmic (EC 6.1.1.1) (Tyrosyl-tRNA synthetase) (TyrRS)
TRINITY_DN102431_c0_g1	protein metabolism	Tyrosine--tRNA ligase, cytoplasmic (EC 6.1.1.1) (Tyrosyl-tRNA synthetase) (TyrRS)
TRINITY_DN111543_c0_g1	other metabolic processes	Peroxisomal acyl-coenzyme A oxidase 1 (AOX 1) (EC 1.3.3.6) (Long-chain acyl-CoA oxidase) (AtCX1)
TRINITY_DN111543_c0_g1	stress response	Peroxisomal acyl-coenzyme A oxidase 1 (AOX 1) (EC 1.3.3.6) (Long-chain acyl-CoA oxidase) (AtCX1)
TRINITY_DN111591_c0_g1	other metabolic processes	Delta-aminolevulinic acid dehydratase (ALAD) (ALADH) (EC 4.2.1.24) (Porphobilinogen synthase)
TRINITY_DN139711_c0_g1	RNA metabolism	Polyadenylate-binding protein (PABP) (Poly(A)-binding protein)
TRINITY_DN139711_c0_g1	cell cycle and proliferation	Polyadenylate-binding protein (PABP) (Poly(A)-binding protein)
TRINITY_DN139711_c0_g1	cell organization and biogenesis	Polyadenylate-binding protein (PABP) (Poly(A)-binding prote

### 6. Match GOSlim terms with full `blastx` output

In addition to having GO Slim terms assigned to DEG, having that information for all detected genes (gene background) is helpful.

#### 6a. Isolate necessary columns

In [18]:
#Use file with gene counts so only detected genes (gene background) are used
#Only need Uniprot, gene ID, and GOterms
!awk -F'\t' -v OFS='\t' '{print $2, $1, $9}' ../../data/Zostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab \
> Zostera_full_blast-annot.tab

In [19]:
!head Zostera_full_blast-annot.tab
!wc -l Zostera_full_blast-annot.tab

Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0000139; GO:0005198; GO:0005643; GO:0005789; GO:0015031; GO:0030127; GO:0051028; GO:0090114; GO:1904263
Q54NS9	TRINITY_DN102006_c0_g1	GO:0004174; GO:0005737; GO:0005811; GO:0050660
Q56872	TRINITY_DN102012_c0_g1	GO:0008446; GO:0009243; GO:0019673; GO:0042351; GO:0070401
F4I460	TRINITY_DN102016_c0_g1	GO:0003774; GO:0005516; GO:0005524; GO:0005737; GO:0007015; GO:0016459; GO:0030048; GO:0048767; GO:0051015
Q9LK31	TRINITY_DN102024_c0_g1	GO:0005768; GO:0005794; GO:0005802; GO:0009061; GO:0016021
Q6DIF4	TRINITY_DN102050_c0_g1	GO:0003779; GO:0005634; GO:0005737; GO:0005884; GO:0005886; GO:0030042; GO:0030043; GO:0030836; GO:0030864; GO:0040011; GO:0042643; GO:0045214; GO:0051015
Q9H172	TRINITY_DN102051_c0_g1	GO:0005524; GO:0005886; GO:0016021; GO:0016887; GO:0033344; GO:0042626; GO:0042803; GO:0046982; GO:0055085; GO:1990830
Q8BJD1	TRINITY_DN102060_c0_g1	GO:0004867; GO:0030212; GO:0062023
P34673	TRINITY_DN102062_c0_g1	GO:0005739; GO:0016787; GO:0018773; GO:004

#### 6b. Unfold GOterms

In [20]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="Zostera_full_blast-annot.tab"
file="Zostera_full_intermediate.file"
output_file="Zostera_full_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number).
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-12.
	# Save the results of the echo/cut pipe (i.e. fields 1-12) to the variable "fixed_fields"
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) 13 to whatever the last field is in the curent line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the first 12 fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the input file into the while loop and send the output to a file named "rhonda_fixed.txt".
done < "$file" > "$output_file"

In [21]:
#See if it was unfolded correctly
!head Zostera_full_blast-GO-unfolded.tab

Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0000139
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0005198
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0005643
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0005789
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0015031
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0030127
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0051028
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:0090114
Q1DZQ0	TRINITY_DN102005_c0_g1	GO:1904263
Q54NS9	TRINITY_DN102006_c0_g1	GO:0004174


In [22]:
!awk '{print $3"\t"$2}' Zostera_full_blast-GO-unfolded.tab | gsort -V > Zostera_full_blast-GO-unfolded.sorted

In [23]:
#Extra space was removed and columns reorganized
!head Zostera_full_blast-GO-unfolded.sorted

GO:0000002	TRINITY_DN106385_c0_g1
GO:0000003	TRINITY_DN3354_c0_g1
GO:0000011	TRINITY_DN235307_c0_g1
GO:0000012	TRINITY_DN18_c1_g1
GO:0000012	TRINITY_DN11157_c0_g1
GO:0000012	TRINITY_DN294144_c2_g1
GO:0000014	TRINITY_DN40815_c0_g1
GO:0000018	TRINITY_DN311034_c1_g1
GO:0000019	TRINITY_DN274898_c0_g1
GO:0000022	TRINITY_DN293890_c1_g1


#### 6c. Match GOterms to GO Slim terms

In [24]:
#Join files to get GOslim for each query
#Remove duplicate GOslim and queries removed
#Sort files
#Only print gene ID, GO Slim, and original GOterm
!join -1 1 -2 1 -t $'\t' \
Zostera_full_blast-GO-unfolded.sorted \
GO-GOslim.sorted \
| uniq | awk -F'\t' -v OFS='\t' '{print $2, $4, $5}' \
> Zostera_full_Blastquery-GOslim.tab

In [25]:
#Check output
!head Zostera_full_Blastquery-GOslim.tab
!wc -l Zostera_full_Blastquery-GOslim.tab

TRINITY_DN106385_c0_g1	cell organization and biogenesis	P
TRINITY_DN3354_c0_g1	other biological processes	P
TRINITY_DN235307_c0_g1	cell organization and biogenesis	P
TRINITY_DN18_c1_g1	DNA metabolism	P
TRINITY_DN18_c1_g1	stress response	P
TRINITY_DN11157_c0_g1	DNA metabolism	P
TRINITY_DN11157_c0_g1	stress response	P
TRINITY_DN294144_c2_g1	DNA metabolism	P
TRINITY_DN294144_c2_g1	stress response	P
TRINITY_DN40815_c0_g1	other molecular function	F
   45025 Zostera_full_Blastquery-GOslim.tab


#### 6d. Obtain Biological Process GO Slim terms

In [26]:
#Get BP entries for each gene ID. Multiple GOslim terms may be matched with one ID.
#Confirm GOslim terms were obtained
#Count the number of entries
!awk -F"\t" '$3 == "P" { print $1"\t"$2 }' Zostera_full_Blastquery-GOslim.tab | sort > Zostera_full_Blastquery-GOslim-BP.sorted
!head Zostera_full_Blastquery-GOslim-BP.sorted
!wc -l Zostera_full_Blastquery-GOslim-BP.sorted

TRINITY_DN102005_c0_g1	transport
TRINITY_DN102005_c0_g1	transport
TRINITY_DN102012_c0_g1	other metabolic processes
TRINITY_DN102012_c0_g1	other metabolic processes
TRINITY_DN102012_c0_g1	other metabolic processes
TRINITY_DN102016_c0_g1	cell organization and biogenesis
TRINITY_DN102016_c0_g1	developmental processes
TRINITY_DN102016_c0_g1	transport
TRINITY_DN102024_c0_g1	other metabolic processes
TRINITY_DN102050_c0_g1	cell organization and biogenesis
   16552 Zostera_full_Blastquery-GOslim-BP.sorted


In [27]:
#Remove duplicate entries
#Count the number of unique entries
!uniq Zostera_full_Blastquery-GOslim-BP.sorted > Zostera_full_Blastquery-GOslim-BP.sorted.unique
!head Zostera_full_Blastquery-GOslim-BP.sorted.unique
!wc -l Zostera_full_Blastquery-GOslim-BP.sorted.unique

TRINITY_DN102005_c0_g1	transport
TRINITY_DN102012_c0_g1	other metabolic processes
TRINITY_DN102016_c0_g1	cell organization and biogenesis
TRINITY_DN102016_c0_g1	developmental processes
TRINITY_DN102016_c0_g1	transport
TRINITY_DN102024_c0_g1	other metabolic processes
TRINITY_DN102050_c0_g1	cell organization and biogenesis
TRINITY_DN102050_c0_g1	developmental processes
TRINITY_DN102050_c0_g1	other biological processes
TRINITY_DN102050_c0_g1	protein metabolism
    9337 Zostera_full_Blastquery-GOslim-BP.sorted.unique


In [28]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 Zostera_full_Blastquery-GOslim-BP.sorted.unique | wc -l

    8563


In [29]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" Zostera_full_Blastquery-GOslim-BP.sorted.unique \
> Zostera_full_Blastquery-GOslim-BP.sorted.unique.noOther
!head Zostera_full_Blastquery-GOslim-BP.sorted.unique.noOther
!wc -l Zostera_full_Blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN102005_c0_g1	transport
TRINITY_DN102012_c0_g1	other metabolic processes
TRINITY_DN102016_c0_g1	cell organization and biogenesis
TRINITY_DN102016_c0_g1	developmental processes
TRINITY_DN102016_c0_g1	transport
TRINITY_DN102024_c0_g1	other metabolic processes
TRINITY_DN102050_c0_g1	cell organization and biogenesis
TRINITY_DN102050_c0_g1	developmental processes
TRINITY_DN102050_c0_g1	protein metabolism
TRINITY_DN102051_c0_g1	transport
    8153 Zostera_full_Blastquery-GOslim-BP.sorted.unique.noOther


In [30]:
#Count the number of unique CGI IDs with defined GOSlim terms
!uniq -f1 Zostera_full_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

    7312


#### 6e. Match GO Slim terms to original annotations

In [31]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' \
Zostera_full_blastquery-GOslim-BP.sorted.unique.noOther \
../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $6}' \
> Zostera_full_blastquery-GOslim-BP.sorted.unique.noOther.annot

In [32]:
#Check output
#Count lines (same as original file)
!head Zostera_full_blastquery-GOslim-BP.sorted.unique.noOther.annot
!wc -l Zostera_full_blastquery-GOslim-BP.sorted.unique.noOther.annot

TRINITY_DN102005_c0_g1	transport	Protein transport protein SEC13
TRINITY_DN102012_c0_g1	other metabolic processes	GDP-mannose 4,6-dehydratase (EC 4.2.1.47) (GDP-D-mannose dehydratase)
TRINITY_DN102016_c0_g1	cell organization and biogenesis	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN102016_c0_g1	developmental processes	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN102016_c0_g1	transport	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN102024_c0_g1	other metabolic processes	Kelch repeat-containing protein At3g27220
TRINITY_DN102050_c0_g1	cell organization and biogenesis	WD repeat-containing protein 1 (Actin-interacting protein 1) (AIP1)
TRINITY_DN102050_c0_g1	developmental processes	WD repeat-containing protein 1 (Actin-interacting protein 1) (AIP1)
TRINITY_DN102050_c0_g1	protein metabolism	WD repeat-containing protein 1 (Actin-interacting protein 1) (AIP1)
TRINITY_DN102051_c0_g1	transport	ATP-binding cassette sub-family G member 4
    8153 Zostera_full_blastquery-GOslim-BP.sorted.unique.noOther.annot


### 7. Obtain Molecular Function GO Slim terms

In [7]:
#Get MF entries for each gene ID. Multiple GOslim terms may be matched with one ID.
#Confirm GOslim terms were obtained
#Count the number of entries
!awk -F"\t" '$3 == "F" { print $1"\t"$2 }' Zostera_Blastquery-GOslim.tab | sort > Zostera_Blastquery-GOslim-MF.sorted
!head Zostera_Blastquery-GOslim-MF.sorted
!wc -l Zostera_Blastquery-GOslim-MF.sorted

TRINITY_DN102431_c0_g1	nucleic acid binding activity
TRINITY_DN102431_c0_g1	other molecular function
TRINITY_DN102431_c0_g1	other molecular function
TRINITY_DN104822_c0_g1	nucleic acid binding activity
TRINITY_DN104822_c0_g1	other molecular function
TRINITY_DN111410_c0_g1	nucleic acid binding activity
TRINITY_DN111543_c0_g1	other molecular function
TRINITY_DN111543_c0_g1	other molecular function
TRINITY_DN111543_c0_g1	other molecular function
TRINITY_DN111591_c0_g1	other molecular function
    1212 Zostera_Blastquery-GOslim-MF.sorted


In [8]:
#Remove duplicate entries
#Count the number of unique entries
!uniq Zostera_Blastquery-GOslim-MF.sorted > Zostera_Blastquery-GOslim-MF.sorted.unique
!head Zostera_Blastquery-GOslim-MF.sorted.unique
!wc -l Zostera_Blastquery-GOslim-MF.sorted.unique

TRINITY_DN102431_c0_g1	nucleic acid binding activity
TRINITY_DN102431_c0_g1	other molecular function
TRINITY_DN104822_c0_g1	nucleic acid binding activity
TRINITY_DN104822_c0_g1	other molecular function
TRINITY_DN111410_c0_g1	nucleic acid binding activity
TRINITY_DN111543_c0_g1	other molecular function
TRINITY_DN111591_c0_g1	other molecular function
TRINITY_DN139711_c0_g1	nucleic acid binding activity
TRINITY_DN151645_c0_g1	other molecular function
TRINITY_DN151645_c0_g1	transporter activity
     666 Zostera_Blastquery-GOslim-MF.sorted.unique


In [9]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 Zostera_Blastquery-GOslim-MF.sorted.unique | wc -l

     446


In [10]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other molecular functions" Zostera_Blastquery-GOslim-MF.sorted.unique \
> Zostera_Blastquery-GOslim-MF.sorted.unique.noOther
!head Zostera_Blastquery-GOslim-MF.sorted.unique.noOther
!wc -l Zostera_Blastquery-GOslim-MF.sorted.unique.noOther

TRINITY_DN102431_c0_g1	nucleic acid binding activity
TRINITY_DN102431_c0_g1	other molecular function
TRINITY_DN104822_c0_g1	nucleic acid binding activity
TRINITY_DN104822_c0_g1	other molecular function
TRINITY_DN111410_c0_g1	nucleic acid binding activity
TRINITY_DN111543_c0_g1	other molecular function
TRINITY_DN111591_c0_g1	other molecular function
TRINITY_DN139711_c0_g1	nucleic acid binding activity
TRINITY_DN151645_c0_g1	other molecular function
TRINITY_DN151645_c0_g1	transporter activity
     666 Zostera_Blastquery-GOslim-MF.sorted.unique.noOther


In [11]:
#Count the number of unique CGI IDs with defined GOSlim terms
!uniq -f1 Zostera_blastquery-GOslim-MF.sorted.unique.noOther | wc -l

     446


### 8. Match GO Slim terms to original annotations

In [12]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' \
Zostera_blastquery-GOslim-MF.sorted.unique.noOther \
../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $6}' \
> Zostera_blastquery-GOslim-MF.sorted.unique.noOther.annot

In [13]:
#Check output
#Count lines (same as original file)
!head Zostera_blastquery-GOslim-MF.sorted.unique.noOther.annot
!wc -l Zostera_blastquery-GOslim-MF.sorted.unique.noOther.annot

TRINITY_DN102431_c0_g1	nucleic acid binding activity	Tyrosine--tRNA ligase, cytoplasmic (EC 6.1.1.1) (Tyrosyl-tRNA synthetase) (TyrRS)
TRINITY_DN102431_c0_g1	other molecular function	Tyrosine--tRNA ligase, cytoplasmic (EC 6.1.1.1) (Tyrosyl-tRNA synthetase) (TyrRS)
TRINITY_DN104822_c0_g1	nucleic acid binding activity	Zinc finger A20 and AN1 domain-containing stress-associated protein 10 (OsSAP10)
TRINITY_DN104822_c0_g1	other molecular function	Zinc finger A20 and AN1 domain-containing stress-associated protein 10 (OsSAP10)
TRINITY_DN111410_c0_g1	nucleic acid binding activity	Nucleolar MIF4G domain-containing protein 1 homolog
TRINITY_DN111543_c0_g1	other molecular function	Peroxisomal acyl-coenzyme A oxidase 1 (AOX 1) (EC 1.3.3.6) (Long-chain acyl-CoA oxidase) (AtCX1)
TRINITY_DN111591_c0_g1	other molecular function	Delta-aminolevulinic acid dehydratase (ALAD) (ALADH) (EC 4.2.1.24) (Porphobilinogen synthase)
TRINITY_DN139711_c0_g1	nucleic acid binding activity	Polyadenylate-binding prote

### 9. Match GOSlim terms with full `blastx` output

In addition to having GO Slim terms assigned to DEG, having that information for all detected genes (gene background) is helpful.

#### 9a. Obtain Molecular Function GO Slim terms

In [14]:
#Get MF entries for each gene ID. Multiple GOslim terms may be matched with one ID.
#Confirm GOslim terms were obtained
#Count the number of entries
!awk -F"\t" '$3 == "F" { print $1"\t"$2 }' Zostera_full_Blastquery-GOslim.tab | sort > Zostera_full_Blastquery-GOslim-MF.sorted
!head Zostera_full_Blastquery-GOslim-MF.sorted
!wc -l Zostera_full_Blastquery-GOslim-MF.sorted

TRINITY_DN102005_c0_g1	other molecular function
TRINITY_DN102006_c0_g1	other molecular function
TRINITY_DN102006_c0_g1	other molecular function
TRINITY_DN102012_c0_g1	other molecular function
TRINITY_DN102012_c0_g1	other molecular function
TRINITY_DN102016_c0_g1	cytoskeletal activity
TRINITY_DN102016_c0_g1	cytoskeletal activity
TRINITY_DN102016_c0_g1	other molecular function
TRINITY_DN102016_c0_g1	other molecular function
TRINITY_DN102050_c0_g1	cytoskeletal activity
   12169 Zostera_full_Blastquery-GOslim-MF.sorted


In [15]:
#Remove duplicate entries
#Count the number of unique entries
!uniq Zostera_full_Blastquery-GOslim-MF.sorted > Zostera_full_Blastquery-GOslim-MF.sorted.unique
!head Zostera_full_Blastquery-GOslim-MF.sorted.unique
!wc -l Zostera_full_Blastquery-GOslim-MF.sorted.unique

TRINITY_DN102005_c0_g1	other molecular function
TRINITY_DN102006_c0_g1	other molecular function
TRINITY_DN102012_c0_g1	other molecular function
TRINITY_DN102016_c0_g1	cytoskeletal activity
TRINITY_DN102016_c0_g1	other molecular function
TRINITY_DN102050_c0_g1	cytoskeletal activity
TRINITY_DN102051_c0_g1	other molecular function
TRINITY_DN102051_c0_g1	transporter activity
TRINITY_DN102060_c0_g1	enzyme regulator activity
TRINITY_DN102062_c0_g1	other molecular function
    6511 Zostera_full_Blastquery-GOslim-MF.sorted.unique


In [16]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 Zostera_full_Blastquery-GOslim-MF.sorted.unique | wc -l

    4070


In [17]:
#Remove all "other molecular functions"
#Confirm removal
#Count the number of entries
!grep --invert-match "other molecular functions" Zostera_full_Blastquery-GOslim-MF.sorted.unique \
> Zostera_full_Blastquery-GOslim-MF.sorted.unique.noOther
!head Zostera_full_Blastquery-GOslim-MF.sorted.unique.noOther
!wc -l Zostera_full_Blastquery-GOslim-MF.sorted.unique.noOther

TRINITY_DN102005_c0_g1	other molecular function
TRINITY_DN102006_c0_g1	other molecular function
TRINITY_DN102012_c0_g1	other molecular function
TRINITY_DN102016_c0_g1	cytoskeletal activity
TRINITY_DN102016_c0_g1	other molecular function
TRINITY_DN102050_c0_g1	cytoskeletal activity
TRINITY_DN102051_c0_g1	other molecular function
TRINITY_DN102051_c0_g1	transporter activity
TRINITY_DN102060_c0_g1	enzyme regulator activity
TRINITY_DN102062_c0_g1	other molecular function
    6511 Zostera_full_Blastquery-GOslim-MF.sorted.unique.noOther


In [18]:
#Count the number of unique CGI IDs with defined GOSlim terms
!uniq -f1 Zostera_full_blastquery-GOslim-MF.sorted.unique.noOther | wc -l

    4070


#### 6e. Match GO Slim terms to original annotations

In [19]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' \
Zostera_full_blastquery-GOslim-MF.sorted.unique.noOther \
../../data/Zostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $6}' \
> Zostera_full_blastquery-GOslim-MF.sorted.unique.noOther.annot

In [20]:
#Check output
#Count lines (same as original file)
!head Zostera_full_blastquery-GOslim-MF.sorted.unique.noOther.annot
!wc -l Zostera_full_blastquery-GOslim-MF.sorted.unique.noOther.annot

TRINITY_DN102005_c0_g1	other molecular function	Protein transport protein SEC13
TRINITY_DN102006_c0_g1	other molecular function	Apoptosis-inducing factor homolog A (EC 1.-.-.-)
TRINITY_DN102012_c0_g1	other molecular function	GDP-mannose 4,6-dehydratase (EC 4.2.1.47) (GDP-D-mannose dehydratase)
TRINITY_DN102016_c0_g1	cytoskeletal activity	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN102016_c0_g1	other molecular function	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN102050_c0_g1	cytoskeletal activity	WD repeat-containing protein 1 (Actin-interacting protein 1) (AIP1)
TRINITY_DN102051_c0_g1	other molecular function	ATP-binding cassette sub-family G member 4
TRINITY_DN102051_c0_g1	transporter activity	ATP-binding cassette sub-family G member 4
TRINITY_DN102060_c0_g1	enzyme regulator activity	Inter-alpha-trypsin inhibitor heavy chain H5 (ITI heavy chain H5) (ITI-HC5) (Inter-alpha-inhibitor heavy chain 5)
TRINITY_DN102062_c0_g1	other molecular function	Fumarylacetoacetate hydrolase domain-containing 

## *L. zosterae*

### 1. Format differentially expressed gene lists

In [52]:
#Check DEG file
!head -n2 ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt
!wc -l ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt

	GeneID	logFC	logCPM	F	PValue	FDR	Accession	Isoform	E-value	ProteinN	GO_BP	GO_CC	GO_MF	GO	Status	Organism	S_10B	S_9A	S_13A	S_42A	S_46B	S_47B	S_48B	S_2A	S_2B	S_7B	S_8B	S_33A	S_36B	S_38A	S_40A
1	TRINITY_DN173970_c0_g1	19.7709273	14.04275312	103.1232144	7.15E-12	7.09E-10	Q54IP4	TRINITY_DN173970_c0_g1_i1	1.90E-06	Dual specificity protein kinase shkB (EC 2.7.11.1) (SH2 domain-containing protein 2) (SH2 domain-containing protein B)		membrane [GO:0016020]	ATP binding [GO:0005524]; protein serine/threonine kinase activity [GO:0004674]; protein tyrosine kinase activity [GO:0004713]	GO:0004674; GO:0004713; GO:0005524; GO:0016020	reviewed	Dictyostelium discoideum (Slime mold)	1149	2612	930	742	1004	757	488	0	0	0	2	1	0	2	1
     100 ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt


In [53]:
#Remove header
# Sort
#Only keep gene ID and GOterms
#Save as new file
!tail -n +2 ../EdgeR/DE.EXP.CON.FDR.nZ.Annot.txt \
| sort \
| awk -F'\t' -v OFS='\t' '{print $8, $2, $15}' \
> nonZostera_blast-annot.tab

In [55]:
#Check output
!head nonZostera_blast-annot.tab
!wc -l nonZostera_blast-annot.tab

Q54IP4	TRINITY_DN173970_c0_g1	GO:0004674; GO:0004713; GO:0005524; GO:0016020
Q8Y457	TRINITY_DN251191_c0_g2	GO:0003723; GO:0009982; GO:0031119; GO:0106029
Q0V6M5	TRINITY_DN66609_c0_g1	GO:0004310; GO:0006696; GO:0016021; GO:0016117; GO:0016767; GO:0016872; GO:0051996
Q8IXJ6	TRINITY_DN255273_c0_g1	GO:0000122; GO:0000183; GO:0000781; GO:0003682; GO:0004407; GO:0005634; GO:0005677; GO:0005694; GO:0005720; GO:0005730; GO:0005737; GO:0005739; GO:0005813; GO:0005814; GO:0005819; GO:0005829; GO:0005874; GO:0005886; GO:0006342; GO:0006348; GO:0006471; GO:0006476; GO:0006914; GO:0007096; GO:0008134; GO:0008270; GO:0008285; GO:0010507; GO:0010801; GO:0014065; GO:0016458; GO:0016575; GO:0017136; GO:0021762; GO:0022011; GO:0030426; GO:0030496; GO:0031641; GO:0032436; GO:0033010; GO:0033270; GO:0033558; GO:0034599; GO:0034979; GO:0034983; GO:0035035; GO:0035729; GO:0042177; GO:0042325; GO:0042826; GO:0042903; GO:0043130; GO:0043161; GO:0043204; GO:0043209; GO:0043219; GO:0043220; GO:0043388; GO:00434

### 2. Unfold GOterms

In [56]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="nonZostera_blast-annot.tab"
file="nonZostera_intermediate.file"
output_file="nonZostera_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number).
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-12.
	# Save the results of the echo/cut pipe (i.e. fields 1-12) to the variable "fixed_fields"
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) 13 to whatever the last field is in the curent line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the first 12 fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the input file into the while loop and send the output to a file named "rhonda_fixed.txt".
done < "$file" > "$output_file"

In [57]:
#It was unfolded correctly
!head nonZostera_blast-GO-unfolded.tab

Q54IP4	TRINITY_DN173970_c0_g1	GO:0004674
Q54IP4	TRINITY_DN173970_c0_g1	GO:0004713
Q54IP4	TRINITY_DN173970_c0_g1	GO:0005524
Q54IP4	TRINITY_DN173970_c0_g1	GO:0016020
Q8Y457	TRINITY_DN251191_c0_g2	GO:0003723
Q8Y457	TRINITY_DN251191_c0_g2	GO:0009982
Q8Y457	TRINITY_DN251191_c0_g2	GO:0031119
Q8Y457	TRINITY_DN251191_c0_g2	GO:0106029
Q0V6M5	TRINITY_DN66609_c0_g1	GO:0004310
Q0V6M5	TRINITY_DN66609_c0_g1	GO:0006696


In [60]:
!awk '{print $3"\t"$2}' nonZostera_blast-GO-unfolded.tab | gsort -V > nonZostera_blast-GO-unfolded.sorted

In [61]:
#Extra space was removed and columns reorganized
!head nonZostera_blast-GO-unfolded.sorted

GO:0000079	TRINITY_DN255903_c0_g1
GO:0000079	TRINITY_DN299755_c1_g3
GO:0000122	TRINITY_DN255273_c0_g1
GO:0000122	TRINITY_DN311716_c0_g2
GO:0000132	TRINITY_DN255903_c0_g1
GO:0000139	TRINITY_DN294657_c0_g1
GO:0000183	TRINITY_DN255273_c0_g1
GO:0000188	TRINITY_DN299755_c1_g3
GO:0000212	TRINITY_DN311731_c0_g1
GO:0000226	TRINITY_DN255903_c0_g1


### 3. Match GOterms to GO Slim terms

In [68]:
#Join files to get GOslim for each query
#Remove duplicate GOslim and queries removed
#Sort files
#Only print gene ID, GO Slim, and original GOterm
!join -1 1 -2 1 -t $'\t' \
nonZostera_blast-GO-unfolded.sorted \
GO-GOslim.sorted \
| uniq | awk -F'\t' -v OFS='\t' '{print $2, $4, $5}' \
> nonZostera_Blastquery-GOslim.tab

In [69]:
#Check output
!head nonZostera_Blastquery-GOslim.tab
!wc -l nonZostera_Blastquery-GOslim.tab

TRINITY_DN255903_c0_g1	cell cycle and proliferation	P
TRINITY_DN255903_c0_g1	other metabolic processes	P
TRINITY_DN299755_c1_g3	cell cycle and proliferation	P
TRINITY_DN299755_c1_g3	other metabolic processes	P
TRINITY_DN255273_c0_g1	RNA metabolism	P
TRINITY_DN311716_c0_g2	RNA metabolism	P
TRINITY_DN255903_c0_g1	cell cycle and proliferation	P
TRINITY_DN255903_c0_g1	cell organization and biogenesis	P
TRINITY_DN294657_c0_g1	ER/Golgi	C
TRINITY_DN294657_c0_g1	other membranes	C
     869 nonZostera_Blastquery-GOslim.tab


In [71]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 nonZostera_Blastquery-GOslim.tab | wc -l

     386


### 4. Obtain Biological Process GO Slim terms

In [77]:
#Get BP entries for each gene ID. Multiple GOslim terms may be matched with one ID.
#Confirm GOslim terms were obtained
#Count the number of entries
!awk -F"\t" '$3 == "P" { print $1"\t"$2 }' nonZostera_Blastquery-GOslim.tab | sort > nonZostera_Blastquery-GOslim-BP.sorted
!head nonZostera_Blastquery-GOslim-BP.sorted
!wc -l nonZostera_Blastquery-GOslim-BP.sorted

TRINITY_DN2119_c0_g1	cell organization and biogenesis
TRINITY_DN2119_c0_g1	other metabolic processes
TRINITY_DN226644_c0_g1	other biological processes
TRINITY_DN226644_c0_g1	protein metabolism
TRINITY_DN226644_c0_g1	stress response
TRINITY_DN227396_c0_g1	DNA metabolism
TRINITY_DN227396_c0_g1	DNA metabolism
TRINITY_DN227396_c0_g1	other biological processes
TRINITY_DN233011_c0_g1	cell cycle and proliferation
TRINITY_DN233011_c0_g1	protein metabolism
     309 nonZostera_Blastquery-GOslim-BP.sorted


In [78]:
#Remove duplicate entries
#Count the number of unique entries
!uniq nonZostera_Blastquery-GOslim-BP.sorted > nonZostera_Blastquery-GOslim-BP.sorted.unique
!head nonZostera_Blastquery-GOslim-BP.sorted.unique
!wc -l nonZostera_Blastquery-GOslim-BP.sorted.unique

TRINITY_DN2119_c0_g1	cell organization and biogenesis
TRINITY_DN2119_c0_g1	other metabolic processes
TRINITY_DN226644_c0_g1	other biological processes
TRINITY_DN226644_c0_g1	protein metabolism
TRINITY_DN226644_c0_g1	stress response
TRINITY_DN227396_c0_g1	DNA metabolism
TRINITY_DN227396_c0_g1	other biological processes
TRINITY_DN233011_c0_g1	cell cycle and proliferation
TRINITY_DN233011_c0_g1	protein metabolism
TRINITY_DN251134_c0_g1	cell organization and biogenesis
     165 nonZostera_Blastquery-GOslim-BP.sorted.unique


In [79]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 nonZostera_Blastquery-GOslim-BP.sorted.unique | wc -l

     152


In [80]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" nonZostera_Blastquery-GOslim-BP.sorted.unique \
> nonZostera_Blastquery-GOslim-BP.sorted.unique.noOther
!head nonZostera_Blastquery-GOslim-BP.sorted.unique.noOther
!wc -l nonZostera_Blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN2119_c0_g1	cell organization and biogenesis
TRINITY_DN2119_c0_g1	other metabolic processes
TRINITY_DN226644_c0_g1	protein metabolism
TRINITY_DN226644_c0_g1	stress response
TRINITY_DN227396_c0_g1	DNA metabolism
TRINITY_DN233011_c0_g1	cell cycle and proliferation
TRINITY_DN233011_c0_g1	protein metabolism
TRINITY_DN251134_c0_g1	cell organization and biogenesis
TRINITY_DN251134_c0_g1	protein metabolism
TRINITY_DN251134_c0_g1	transport
     143 nonZostera_Blastquery-GOslim-BP.sorted.unique.noOther


In [81]:
#Count the number of unique CGI IDs with defined GOSlim terms
!uniq -f1 nonZostera_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

     129


### 5. Match GO Slim terms to original annotations

In [87]:
#Check format of gene list with protein annotations
#Columns: seq ID, Uniprot, seq ID ith isoform information, e-value, annotation, GOerms, reviewed, organism
!head -n2 ../../data/nonZostera-blast-annot-withGeneID-noIsoforms.tab

TRINITY_DN100016_c0_g1	Q8GYA6	TRINITY_DN100016_c0_g1_i1	1.2e-08	26S proteasome non-ATPase regulatory subunit 13 homolog B (26S proteasome regulatory subunit RPN9b) (AtRNP9b) (26S proteasome regulatory subunit S11 homolog B)	proteasome assembly [GO:0043248]; protein catabolic process [GO:0030163]; ubiquitin-dependent protein catabolic process [GO:0006511]	cytosol [GO:0005829]; nucleus [GO:0005634]; proteasome complex [GO:0000502]; proteasome regulatory particle, lid subcomplex [GO:0008541]	structural molecule activity [GO:0005198]	GO:0000502; GO:0005198; GO:0005634; GO:0005829; GO:0006511; GO:0008541; GO:0030163; GO:0043248	reviewed	Arabidopsis thaliana (Mouse-ear cress)
TRINITY_DN100076_c0_g1	Q59118	TRINITY_DN100076_c0_g1_i1	2.3e-09	Histamine oxidase (EC 1.4.3.22) (Copper amine oxidase)	amine metabolic process [GO:0009308]; cellular response to azide [GO:0097185]; oxidation-reduction process [GO:0055114]	cytoplasm [GO:0005737]	copper ion binding [GO:0005507]; diamine oxidase activity 

In [105]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' \
nonZostera_blastquery-GOslim-BP.sorted.unique.noOther \
../../data/nonZostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $6}' \
> nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

In [106]:
#Check output
#Count lines (same as original file)
!head nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.annot
!wc -l nonZostera_blastquery-GOslim-BP.sorted.unique.noOther.annot

TRINITY_DN2119_c0_g1	cell organization and biogenesis	Cytochrome c oxidase assembly protein COX11, mitochondrial
TRINITY_DN2119_c0_g1	other metabolic processes	Cytochrome c oxidase assembly protein COX11, mitochondrial
TRINITY_DN226644_c0_g1	protein metabolism	Protein disulfide-isomerase A4 (EC 5.3.4.1)
TRINITY_DN226644_c0_g1	stress response	Protein disulfide-isomerase A4 (EC 5.3.4.1)
TRINITY_DN227396_c0_g1	DNA metabolism	Transposon Ty3-I Gag-Pol polyprotein (Gag3-Pol3) (Transposon Ty3-2 TYA-TYB polyprotein) [Cleaved into: Capsid protein (CA) (p24); Spacer peptide p3; Nucleocapsid protein p11 (NC); Ty3 protease (PR) (EC 3.4.23.-) (p16); Spacer peptide J; Reverse transcriptase/ribonuclease H (RT) (RT-RH) (EC 2.7.7.49) (EC 2.7.7.7) (EC 3.1.26.4) (p55); Integrase p52 (IN); Integrase p49 (IN)]
TRINITY_DN233011_c0_g1	cell cycle and proliferation	Probable serine/threonine-protein kinase pats1 (EC 2.7.11.1) (Protein associated with the transduction of signal 1)
TRINITY_DN233011_c0_g1	protein 

### 6. Match GOSlim terms with full `blastx` output

#### 6a. Isolate necessary columns

In [33]:
#Use file with gene counts so only detected genes (gene background) are used
#Only need Uniprot, gene ID, and GOterms
!awk -F'\t' -v OFS='\t' '{print $2, $1, $9}' ../../data/nonZostera-blast-annot-withGeneID-noIsoforms-geneCounts.tab \
> nonZostera_full_blast-annot.tab

In [34]:
!head nonZostera_full_blast-annot.tab
!wc -l nonZostera_full_blast-annot.tab

Q54BM8	TRINITY_DN102004_c0_g1	GO:0008270
Q84M24	TRINITY_DN102011_c0_g1	GO:0005319; GO:0005524; GO:0005774; GO:0006869; GO:0016021; GO:0016887; GO:0042626; GO:0043231
A5DB51	TRINITY_DN102014_c0_g1	GO:0004746; GO:0009231
F4I460	TRINITY_DN102016_c0_g1	GO:0003774; GO:0005516; GO:0005524; GO:0005737; GO:0007015; GO:0016459; GO:0030048; GO:0048767; GO:0051015
Q19020	TRINITY_DN10203_c0_g1	GO:0000278; GO:0000776; GO:0000922; GO:0005524; GO:0005635; GO:0005737; GO:0005819; GO:0005868; GO:0005881; GO:0005938; GO:0007018; GO:0007097; GO:0008090; GO:0008569; GO:0016322; GO:0030286; GO:0031122; GO:0045503; GO:0045505; GO:0048489; GO:0048814; GO:0051293; GO:0051296; GO:0051661; GO:0051932; GO:0051959; GO:0072382; GO:1902473; GO:1904115; GO:1904811; GO:1990048
Q9H172	TRINITY_DN102051_c0_g1	GO:0005524; GO:0005886; GO:0016021; GO:0016887; GO:0033344; GO:0042626; GO:0042803; GO:0046982; GO:0055085; GO:1990830
P30349	TRINITY_DN102053_c0_g1	GO:0004177; GO:0004301; GO:0004463; GO:0005634; GO:0005737; GO:00

#### 6b. Unfold GOterms

In [35]:
%%bash 

# This script was originally written to address a specific problem that Rhonda was having

# input_file is the initial, "problem" file
# file is an intermediate file that most of the program works upon
# output_file is the final file produced by the script
input_file="nonZostera_full_blast-annot.tab"
file="nonZostera_full_intermediate.file"
output_file="nonZostera_full_blast-GO-unfolded.tab"

# sed command substitutes the "; " sequence to a tab and writes the new format to a new file.
# This character sequence is how the GO terms are delimited in their field.
sed $'s/; /\t/g' "$input_file" > "$file"

# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number).
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" "$file" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)

# While loop to process each line of the input file.
while read -r line
	do
	
	# Send contents of the current line to awk.
	# Set the field separator as a tab (-F'\t') and print the number of fields in that line.
	# Save the results of the echo/awk pipe (i.e. number of fields) to the variable "max_field".
	max_field=$(echo "$line" | awk -F'\t' '{print NF}')

	# Send contents of current line to cut.
	# Cut fields (i.e. retain those fields) 1-12.
	# Save the results of the echo/cut pipe (i.e. fields 1-12) to the variable "fixed_fields"
	fixed_fields=$(echo "$line" | cut -f1-2)

	# Since not all the lines contain the same number of fields (e.g. may not have GO terms),
	# evaluate the number of fields in each line to determine how to handle current line.

	# If the value in max_field is less than the field number where the GO terms begin,
	# then just print the current line (%s) followed by a newline (\n).
	if (( "$max_field" < "$begin_goterms" ))
		then printf "%s\n" "$line"
			else

			# Send contents of current line (which contains GO terms) to cut.
			# Cut fields (i.e. retain those fields) 13 to whatever the last field is in the curent line.
			# Save the results of the echo/cut pipe (i.e. all the GO terms fields) to the variable "goterms".
			goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
			
			# Assign values in the variable "goterms" to a new indexed array (called "array"), 
			# with tab delimiter (IFS=$'\t')
			IFS=$'\t' read -r -a array <<<"$goterms"
			
			# Iterate through each element of the array.
			# Print the first 12 fields (i.e. the fields stored in "fixed_fields") followed by a tab (%s\t).
			# Print the current element in the array (i.e. the current GO term) followed by a new line (%s\n).
			for element in "${!array[@]}"	
				do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
			done
	fi

# Send the input file into the while loop and send the output to a file named "rhonda_fixed.txt".
done < "$file" > "$output_file"

In [36]:
#See if it was unfolded correctly
!head nonZostera_full_blast-GO-unfolded.tab

Q54BM8	TRINITY_DN102004_c0_g1	GO:0008270
Q84M24	TRINITY_DN102011_c0_g1	GO:0005319
Q84M24	TRINITY_DN102011_c0_g1	GO:0005524
Q84M24	TRINITY_DN102011_c0_g1	GO:0005774
Q84M24	TRINITY_DN102011_c0_g1	GO:0006869
Q84M24	TRINITY_DN102011_c0_g1	GO:0016021
Q84M24	TRINITY_DN102011_c0_g1	GO:0016887
Q84M24	TRINITY_DN102011_c0_g1	GO:0042626
Q84M24	TRINITY_DN102011_c0_g1	GO:0043231
A5DB51	TRINITY_DN102014_c0_g1	GO:0004746


In [37]:
!awk '{print $3"\t"$2}' nonZostera_full_blast-GO-unfolded.tab | gsort -V > nonZostera_full_blast-GO-unfolded.sorted

In [38]:
#Extra space was removed and columns reorganized
!head nonZostera_full_blast-GO-unfolded.sorted

GO:0000001	TRINITY_DN540991_c0_g1
GO:0000002	TRINITY_DN21033_c0_g1
GO:0000003	TRINITY_DN106042_c0_g1
GO:0000003	TRINITY_DN684764_c0_g1
GO:0000009	TRINITY_DN67030_c0_g1
GO:0000011	TRINITY_DN278317_c0_g1
GO:0000023	TRINITY_DN552983_c0_g1
GO:0000025	TRINITY_DN552983_c0_g1
GO:0000027	TRINITY_DN293394_c0_g3
GO:0000027	TRINITY_DN298612_c2_g2


#### 6c. Match GOterms to GO Slim terms

In [39]:
#Join files to get GOslim for each query
#Remove duplicate GOslim and queries removed
#Sort files
#Only print gene ID, GO Slim, and original GOterm
!join -1 1 -2 1 -t $'\t' \
nonZostera_full_blast-GO-unfolded.sorted \
GO-GOslim.sorted \
| uniq | awk -F'\t' -v OFS='\t' '{print $2, $4, $5}' \
> nonZostera_full_Blastquery-GOslim.tab

In [40]:
#Check output
!head nonZostera_full_Blastquery-GOslim.tab
!wc -l nonZostera_full_Blastquery-GOslim.tab

TRINITY_DN540991_c0_g1	cell organization and biogenesis	P
TRINITY_DN21033_c0_g1	cell organization and biogenesis	P
TRINITY_DN106042_c0_g1	other biological processes	P
TRINITY_DN684764_c0_g1	other biological processes	P
TRINITY_DN67030_c0_g1	other molecular function	F
TRINITY_DN278317_c0_g1	cell organization and biogenesis	P
TRINITY_DN552983_c0_g1	other metabolic processes	P
TRINITY_DN552983_c0_g1	other metabolic processes	P
TRINITY_DN293394_c0_g3	cell organization and biogenesis	P
TRINITY_DN298612_c2_g2	cell organization and biogenesis	P
   20117 nonZostera_full_Blastquery-GOslim.tab


#### 6d. Obtain Biological Process GO Slim terms

In [41]:
#Get BP entries for each gene ID. Multiple GOslim terms may be matched with one ID.
#Confirm GOslim terms were obtained
#Count the number of entries
!awk -F"\t" '$3 == "P" { print $1"\t"$2 }' nonZostera_full_Blastquery-GOslim.tab | sort > nonZostera_full_Blastquery-GOslim-BP.sorted
!head nonZostera_full_Blastquery-GOslim-BP.sorted
!wc -l nonZostera_full_Blastquery-GOslim-BP.sorted

TRINITY_DN102011_c0_g1	transport
TRINITY_DN102014_c0_g1	other metabolic processes
TRINITY_DN102016_c0_g1	cell organization and biogenesis
TRINITY_DN102016_c0_g1	developmental processes
TRINITY_DN102016_c0_g1	transport
TRINITY_DN10203_c0_g1	cell cycle and proliferation
TRINITY_DN10203_c0_g1	cell cycle and proliferation
TRINITY_DN10203_c0_g1	cell organization and biogenesis
TRINITY_DN10203_c0_g1	cell organization and biogenesis
TRINITY_DN10203_c0_g1	cell organization and biogenesis
    7641 nonZostera_full_Blastquery-GOslim-BP.sorted


In [42]:
#Remove duplicate entries
#Count the number of unique entries
!uniq nonZostera_full_Blastquery-GOslim-BP.sorted > nonZostera_full_Blastquery-GOslim-BP.sorted.unique
!head nonZostera_full_Blastquery-GOslim-BP.sorted.unique
!wc -l nonZostera_full_Blastquery-GOslim-BP.sorted.unique

TRINITY_DN102011_c0_g1	transport
TRINITY_DN102014_c0_g1	other metabolic processes
TRINITY_DN102016_c0_g1	cell organization and biogenesis
TRINITY_DN102016_c0_g1	developmental processes
TRINITY_DN102016_c0_g1	transport
TRINITY_DN10203_c0_g1	cell cycle and proliferation
TRINITY_DN10203_c0_g1	cell organization and biogenesis
TRINITY_DN10203_c0_g1	cell-cell signaling
TRINITY_DN10203_c0_g1	developmental processes
TRINITY_DN10203_c0_g1	other biological processes
    4238 nonZostera_full_Blastquery-GOslim-BP.sorted.unique


In [43]:
#Count the number of unique IDs with GOSlim terms
!uniq -f1 nonZostera_full_Blastquery-GOslim-BP.sorted.unique | wc -l

    3994


In [44]:
#Remove all "other biological processes"
#Confirm removal
#Count the number of entries
!grep --invert-match "other biological processes" nonZostera_full_Blastquery-GOslim-BP.sorted.unique \
> nonZostera_full_Blastquery-GOslim-BP.sorted.unique.noOther
!head nonZostera_full_Blastquery-GOslim-BP.sorted.unique.noOther
!wc -l nonZostera_full_Blastquery-GOslim-BP.sorted.unique.noOther

TRINITY_DN102011_c0_g1	transport
TRINITY_DN102014_c0_g1	other metabolic processes
TRINITY_DN102016_c0_g1	cell organization and biogenesis
TRINITY_DN102016_c0_g1	developmental processes
TRINITY_DN102016_c0_g1	transport
TRINITY_DN10203_c0_g1	cell cycle and proliferation
TRINITY_DN10203_c0_g1	cell organization and biogenesis
TRINITY_DN10203_c0_g1	cell-cell signaling
TRINITY_DN10203_c0_g1	developmental processes
TRINITY_DN10203_c0_g1	transport
    3692 nonZostera_full_Blastquery-GOslim-BP.sorted.unique.noOther


In [45]:
#Count the number of unique CGI IDs with defined GOSlim terms
!uniq -f1 nonZostera_full_blastquery-GOslim-BP.sorted.unique.noOther | wc -l

    3415


#### 6e. Match GO Slim terms to original annotations

In [46]:
#Join protein annotation with GO Slim terms (unique, no other)
#Use a left join so unpaired lines from GOSlim terms are still printed
!join -1 1 -2 1 -t $'\t' \
nonZostera_full_blastquery-GOslim-BP.sorted.unique.noOther \
../../data/nonZostera-blast-annot-withGeneID-noIsoforms.tab \
| awk -F'\t' -v OFS='\t' '{print $1, $2, $6}' \
> nonZostera_full_blastquery-GOslim-BP.sorted.unique.noOther.annot

In [47]:
#Check output
#Count lines (same as original file)
!head nonZostera_full_blastquery-GOslim-BP.sorted.unique.noOther.annot
!wc -l nonZostera_full_blastquery-GOslim-BP.sorted.unique.noOther.annot

TRINITY_DN102011_c0_g1	transport	ABC transporter A family member 1 (ABC transporter ABCA.1) (AtABCA1) (ABC one homolog protein 1) (AtAOH1)
TRINITY_DN102014_c0_g1	other metabolic processes	Riboflavin synthase (RS) (EC 2.5.1.9)
TRINITY_DN102016_c0_g1	cell organization and biogenesis	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN102016_c0_g1	developmental processes	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN102016_c0_g1	transport	Myosin-8 (Myosin XI B) (AtXIB)
TRINITY_DN10203_c0_g1	cell cycle and proliferation	Dynein heavy chain, cytoplasmic (Dynein heavy chain, cytosolic) (DYHC)
TRINITY_DN10203_c0_g1	cell organization and biogenesis	Dynein heavy chain, cytoplasmic (Dynein heavy chain, cytosolic) (DYHC)
TRINITY_DN10203_c0_g1	cell-cell signaling	Dynein heavy chain, cytoplasmic (Dynein heavy chain, cytosolic) (DYHC)
TRINITY_DN10203_c0_g1	developmental processes	Dynein heavy chain, cytoplasmic (Dynein heavy chain, cytosolic) (DYHC)
TRINITY_DN10203_c0_g1	transport	Dynein heavy chain, cytoplasmic (Dy