## Top protein functions for sequences with more than 10 distinct taxonomic assignments
* Date: July 21, 2019
* Location on Bridges: /pylon5/mc5fr5p/hbagheri/09_Hadoop/protFuncFreq21July2019OUTPUT


## Step 1: Boa<sub>g</sub> query

* Tax IDs are counted once per sequence (distinct values only).

```

s: Sequence = input;
protOut: output sum[string] of int;
# Optional debug outputs:
# taxCount: output collection[string] of int;
# DefLinecount: output collection[string] of int;

# Number of distinct tax ids among a sequence's annotations.
distinctTax := function(seq: Sequence): int {
  taxSet: set of string;
  foreach (i: int; def(seq.annotation[i]))
    add(taxSet, seq.annotation[i].tax_id);
  return len(taxSet);
};

if (distinctTax(s) > 10) {
  # taxCount[s.seqid] << distinctTax(s);
  # DefLinecount[s.seqid] << len(s.annotation);

  # The protein function is the defline text before the first "[";
  # deflines without "[" (or starting with it) are skipped.
  foreach (i: int; def(s.annotation[i]))
    if (strfind("[", s.annotation[i].defline) > 0)
      protOut[trim(substring(s.annotation[i].defline, 0, strfind("[", s.annotation[i].defline)))] << 1;
}

```

* First few lines of the output file (`part-r-00000`):

```
protOut[#3R-hydroxymyristoyl-] = 2
protOut[&alpha-D-glucose-1-phosphatase] = 5
protOut[&alpha-xylosidase] = 2
protOut[&beta-D-glucoside glucohydrolase, periplasmic] = 4
protOut[&gamma-glutamyl:cysteine ligase YbdK] = 2
protOut[&gamma-glutamylputrescine synthetase] = 3
protOut['2,4-dienoyl-CoA reductase (NADPH)] = 1
protOut['Cold-shock' DNA-binding domain protein (plasmid)] = 1
protOut['Cold-shock' DNA-binding domain protein] = 3355
protOut['Cold-shock' DNA-binding domain, putative] = 4
protOut['Cold-shock' DNA-binding domain-containing protein] = 2
protOut['Cold-shock' DNA-binding domain] = 3
protOut['Not a Proline racemase, nor 4-hydroxyproline epimerase (missing catalytic residues)] = 1
protOut['Paired box' domain protein] = 5
protOut['Ppf3p'] = 1
protOut['Pyruvate oxidase (ubiquinone, cytochrome)] = 1
protOut['Sequence is part of a 1.65-kb fragment of the lactose plasmid from Lactococcus lactis subsp cremoris SK11 that facilitates integration when plasmid is in L.lactis subsp. lactis.'; ORF] = 1

```
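The Step 1 logic can be sketched in plain Python for checking on a small sample. This is a sketch only, not the Boa<sub>g</sub> implementation: the record layout (a list of `(tax_id, defline)` pairs per sequence) and the names `function_name` and `count_functions` are hypothetical.

```python
from collections import Counter

MIN_DISTINCT_TAX = 10  # threshold used by the Boa query


def function_name(defline):
    """Return the text before the first '[' of a defline, or None.

    Mirrors the Boa condition strfind("[", defline) > 0: deflines with no
    '[' (index -1) or with '[' as the first character (index 0) are skipped.
    """
    idx = defline.find("[")
    if idx <= 0:
        return None
    return defline[:idx].strip()


def count_functions(sequences):
    """Tally function names over sequences with > MIN_DISTINCT_TAX tax ids.

    `sequences` is an iterable of annotation lists; each annotation is a
    (tax_id, defline) pair. This layout is an assumption for illustration.
    """
    counts = Counter()
    for annotations in sequences:
        distinct_tax = {tax_id for tax_id, _ in annotations}
        if len(distinct_tax) <= MIN_DISTINCT_TAX:
            continue
        for _, defline in annotations:
            name = function_name(defline)
            if name is not None:
                counts[name] += 1
    return counts
```

Every annotation of a qualifying sequence adds 1, so a function name is counted once per defline, not once per sequence.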

## Step 2: Post Processing

* Copy the last column (the count) to the beginning of each line:  
```awk '{print $NF, $0}' part-r-00000 > part-r-00000_modified```


* Sort numerically on the first column, in descending order:
``` sort -nrk1 part-r-00000_modified > part-r-00000_modified_sorted_r ```


* Remove the temporary first column:
``` cut -d ' ' -f 2- part-r-00000_modified_sorted_r  > part-r-00000_modified_sorted_r_cut1 ```


* First few lines of the sorted output:

```
protOut[Uncharacterised protein] = 3476480
protOut[transcriptional regulator] = 2044629
protOut[membrane protein] = 1629015
protOut[LysR family transcriptional regulator] = 871201
protOut[ABC transporter ATP-binding protein] = 819880
protOut[transposase] = 658213
protOut[MFS transporter] = 652282
protOut[oxidoreductase] = 633541
protOut[cytochrome oxidase subunit 1, partial (mitochondrion)] = 559708
protOut[ABC transporter permease] = 551235
protOut[conserved hypothetical protein] = 486938
protOut[lipoprotein] = 485403
protOut[AraC family transcriptional regulator] = 392415
protOut[TetR family transcriptional regulator] = 391287
protOut[DNA-binding response regulator] = 380999
protOut[GntR family transcriptional regulator] = 377254
protOut[phage protein] = 365918
protOut[transporter] = 356273
protOut[DNA-binding protein] = 322177
protOut[transmembrane protein] = 319394
protOut[hydrolase] = 315531
protOut[acetyltransferase] = 293665
protOut[integral membrane protein] = 293034
```
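The Step 2 post-processing amounts to sorting the `protOut[...] = N` lines by `N` in descending order. A minimal Python sketch of the same idea, without the intermediate files, is below. The function names `trailing_count` and `sort_by_count` are hypothetical, and ties may come out in a different order than with `sort -nrk1`.

```python
def trailing_count(line):
    """Return the last whitespace-separated field as an int (awk's $NF)."""
    return int(line.split()[-1])


def sort_by_count(lines):
    """Sort `protOut[...] = N` lines by N, largest first.

    Blank lines are dropped. Lines with equal counts keep their input order.
    """
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]
    return sorted(records, key=trailing_count, reverse=True)
```

Usage would be along the lines of `sort_by_count(open("part-r-00000"))`. Taking the count from the last field means function names that themselves contain `=` or spaces are still handled.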




## Top protein annotations <span style="color:red"> 4 types?? </span>

### Highly conserved
* link: https://en.wikipedia.org/wiki/Conserved_sequence
* Examples of highly conserved sequences include the RNA components of ribosomes present in all domains of life, the homeobox sequences widespread amongst Eukaryotes, and the tmRNA in Bacteria.

### Highly mobile: viruses and plasmids (e.g. antibiotic resistance)

### Highly generic


## Top Annotations
* [<span style="color:blue"> UniProt categories</span>](https://www.uniprot.org/statistics/Swiss-Prot)  
* <span style="color:red"> TODO: what categories from here? </span>

##### membrane protein
* ref: https://www.sciencedirect.com/science/article/pii/S0022283614005130
* "they underlie virtually all physiological processes in cells including key metabolic pathways, such as the respiratory chain and the photosystems, as well as the transport of solutes and signals across membranes."

##### LysR family transcriptional regulator
* https://www.annualreviews.org/doi/pdf/10.1146/annurev.mi.47.100193.003121
*

##### ABC transporter ATP-binding protein
* ref: http://www.jlr.org/content/42/7/1007.short


##### oxidoreductase
*

##### conserved hypothetical protein
* https://academic.oup.com/nar/article/32/18/5452/998719
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1891709/

##### lipoprotein
* https://www.nejm.org/doi/pdf/10.1056/NEJM198904203201607
* "LIPOPROTEIN lipase, a hydrolytic enzyme produced by many tissues, is rate-limiting for the removal of lipoprotein triglycerides from the circulation"

##### family transcriptional regulator
* https://mic.microbiologyresearch.org/content/journal/micro/10.1099/mic.0.2008/022772-0?crawler=true
* "The LysR-type transcriptional regulator (LTTR) family is a well-characterized group of transcriptional regulators. They are highly conserved and ubiquitous amongst bacteria, with functional orthologues identified in archaea and eukaryotic organisms (Pérez-Rueda & Collado-Vides, 2001; Sun & Klein, 2004; Stec et al., 2006)."

##### transporter

##### acetyltransferase

##### Uncharacterized protein conserved in bacteria
* ```protOut[Uncharacterized protein conserved in bacteria] = 376245```




### Boa<sub>g</sub> query: number of clusters per selected protein function

```
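s: Sequence = input;
clstrCount: output sum[string][string] of int;

# Protein functions of interest.
pfuncs := {
  "transcriptional regulator", "membrane protein", "LysR family transcriptional regulator", "cytochrome oxidase subunit 1, partial (mitochondrion)", "transposase", "conserved hypothetical protein", "16S rRNA methyltransferase", "lipoprotein", "Uncharacterized protein conserved in bacteria"
};

for (j := 0; j < len(pfuncs); j++) {
  # strfind does a literal substring search. match() would read its first
  # argument as a regex, so the parentheses in "partial (mitochondrion)"
  # would act as a group and never match the literal text.
  exists (i: int; strfind(pfuncs[j], s.annotation[i].defline) >= 0) {
    # Only clusters at 95% similarity are counted.
    foreach (k: int; def(s.cluster[k])) {
      if (s.cluster[k].similarity == 95)
        clstrCount[pfuncs[j]][s.cluster[k].cid] << 1;
    }
  }
}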
```


* Boa<sub>g</sub> file name on XSEDE: `protFuncClstrJun132019.boa`
* Output location on Bridges: ```/pylon5/mc5fr5p/hbagheri/09_Hadoop/protFuncClstrJun132019 ```
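One way to turn this output into a per-function cluster count is to count the distinct cluster ids seen for each function. The sketch below assumes output lines of the form `clstrCount[function][cid] = N`, by analogy with the `protOut[...] = N` lines from Step 1. The name `clusters_per_function` is hypothetical, and whether the cluster counts in these notes were derived exactly this way is an assumption.

```python
import re
from collections import defaultdict

# Assumed line shape: clstrCount[<function>][<cluster id>] = <count>
LINE_RE = re.compile(r"^clstrCount\[(.*)\]\[([^\]]*)\] = (\d+)$")


def clusters_per_function(lines):
    """Map each protein function to its number of distinct cluster ids.

    Lines that do not match the assumed shape are ignored.
    """
    clusters = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            func, cid, _count = m.groups()
            clusters[func].add(cid)
    return {func: len(cids) for func, cids in clusters.items()}
```

The per-line count `N` (sequences per cluster) is parsed but not used here; only the number of distinct `(function, cid)` keys matters for this tally.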




#### rRNA protein annotations

* Filter the sorted output from Step 2: ```grep "rRNA" part-r-00000_modified_sorted_r_cut1 > rRNA_annotations```

* sample output:

  ```
    protOut[16S rRNA methyltransferase] = 45667
    protOut[23S rRNA methyltransferase] = 40755
    protOut[23S rRNA (pseudouridine(1915)-N(3))-methyltransferase RlmH] = 20202
    protOut[23S rRNA pseudouridylate synthase] = 20013
    protOut[23S rRNA (guanosine(2251)-2'-O)-methyltransferase RlmB] = 18832
    protOut[tRNA/rRNA methyltransferase] = 17298
    protOut[rRNA methyltransferase] = 16836
    protOut[16S rRNA (uracil(1498)-N(3))-methyltransferase] = 15866
    protOut[23S rRNA (uracil-5-)-methyltransferase RumA] = 13934
    protOut[rRNA large subunit methyltransferase] = 13582
    protOut[16S rRNA (guanine(966)-N(2))-methyltransferase RsmD] = 13431
    protOut[16S rRNA methyltransferase GidB] = 13199
    protOut[16S rRNA-processing protein RimM] = 13093
    protOut[16S rRNA pseudouridine(516) synthase] = 12613
    protOut[rRNA methylase] = 11955
    protOut[16S rRNA (cytidine(1402)-2'-O)-methyltransferase] = 11084
    protOut[16S rRNA processing protein RimM] = 10909
    protOut[rRNA maturation RNase YbeY] = 10063
    protOut[23S rRNA pseudouridine synthase F] = 9916
    protOut[23S rRNA (uracil(1939)-C(5))-methyltransferase RlmD] = 9690

   ```


## Final results based on the two previous Boa<sub>g</sub> queries



|Category|Protein function|# of appearances|# of clusters|Reference|
|--|--|--|--|--|
|Highly conserved|transcriptional regulator|2515980| |ref|
| |membrane protein|2032297|1194598|ref|
| |LysR family transcriptional regulator|1268891|301611|ref|
| |cytochrome oxidase subunit 1, partial (mitochondrion)|1165451| |ref|
| |transposase|936112|340654|ref|
|Highly conserved|conserved hypothetical protein|668349|792297|ref|
|Highly conserved|16S rRNA methyltransferase|58951|14639|ref|
|Highly generic|lipoprotein|607562|274952|ref|
|Highly conserved|Uncharacterized protein conserved in bacteria|376245|29004|ref|
