# ph0 - phage pan-proteome
```
Lead     : ababaian / AK / almeida / lcarmarillo
Issue    : 
start    : 2021 01 25
complete : 2021 01 XX
files    : ~/serratus/notebook/210125_ph0/
files    : http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/serratus_gpd/
s3 files : s3://serratus-public/notebook/210125_ph0/
```

## Introduction

AB and AK had a brief call with almeida/lcarmarillo/rfinn to discuss expanding the 'gubaphages' by applying Serratus to search metagenomes in the SRA. We need to think bigger, so we expanded the scope to all phage sequences, using terminase and primase as marker genes. The CAM team will provide a collection of terminase/primase sequences, and we will run the search and muAssembly.

### Objectives

- Create a pilot "phage" pan-proteome, `ph0`, to search SRA data
- Include a few controls for sanity checking the main body of Serratus work


### Key Literature

- [Gut Phage Database](https://www.biorxiv.org/content/10.1101/2020.09.03.280214v1.full)

## Materials and Methods


## Primase and Terminase Sequences

- Received from almeida/lcarmarillo. We will have to revisit this to make sure it was built systematically.

> Luis and I have put together the data we discussed for the Serratus screen. You can find all the files here: http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/serratus_gpd/
> - gpd_primases.faa -> multifasta protein file of all the primases sequences in the GPD
> - gpd_terminases.faa -> multifasta protein file of all the terminases sequences in the GPD
> ...
> - gpd_primases.tsv -> taxonomy affiliation of all the primases
> - gpd_terminases.tsv -> taxonomy affiliation of all the terminases


- We are missing GenBank sequences; these can be located with HMM searches using the following Pfam models (from LC):

> Large terminases:
>
> - PF04466
> - PF03237
> - PF06056
> - PF05876
> - PF03354
>
> Primases:
>
> - PF01896
> - PF04104
> - PF08273
> - PF08278
> - PF10410
> - PF01807
> - PF01751
> - PF08275
> - PF08706
> - PF09250
> - PF13010
> - PF16730
> - PF16793
> - PF18689
> - PF03324
> - PF08707
> - PF09329
> - PF05774
> - PF08401
> - PF08708
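
To use these accessions with an HMM tool, they first need to be in a plain one-accession-per-line file. The sketch below is one way to get there from pasted text. `clean_pfam_list` is a hypothetical helper (not part of Serratus), and it assumes every accession has the form `PF` plus five digits.

```shell
# Hypothetical helper: normalize a pasted list of Pfam accessions.
# Reads free text on stdin, prints one unique accession per line.
clean_pfam_list () {
  tr -d '>\r' \
    | tr -s ' \t' '\n' \
    | grep -E '^PF[0-9]{5}$' \
    | sort -u
}

# Usage (pfam_terminase.txt is an illustrative file name):
#   clean_pfam_list < pasted_terminases.txt > pfam_terminase.txt
```

Anything that is not an accession (quote markers, labels such as "Primases:", trailing tabs) is dropped rather than reported, so the line count of the output is worth comparing against the list above.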



## Constructing Pan-proteome Pilot


In [None]:
# Fired up an EC2 machine
# c5n.4xlarge

In [None]:
# Initialize data
mkdir -p ph0; cd ph0

wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/serratus_gpd/gpd_primases.faa
wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/serratus_gpd/gpd_terminases.faa

# Separate out RefSeq and GPD sequences ("vig_" marks GPD identifiers)
# Combine them RefSeq-first

seqkit grep -r -p "vig_"    gpd_terminases.faa \
  | seqkit rmdup - \
  | seqkit sort -lr - > terminase.gpd.fa
seqkit grep -r -v -p "vig_" gpd_terminases.faa \
  | seqkit rmdup - \
  | seqkit sort -lr - > terminase.rs.fa
  
cat terminase.rs.fa terminase.gpd.fa > terminase.all.fa
  

seqkit grep -r -p "vig_"    gpd_primases.faa \
  | seqkit rmdup - \
  | seqkit sort -lr - > primase.gpd.fa
seqkit grep -r -v -p "vig_" gpd_primases.faa \
  | seqkit rmdup - \
  | seqkit sort -lr - > primase.rs.fa


cat primase.rs.fa primase.gpd.fa > primase.all.fa

# Input fasta files folder
mkdir -p infa; mv *.fa *.faa infa/
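
A quick sanity check on the split above is to count how many records fall on each side of the `vig_` pattern. The following is a sketch using only `awk` rather than the `seqkit` commands used in this notebook. `count_sources` is a hypothetical helper name, and it matches on the first whitespace-delimited token of each header, which is assumed to correspond to the sequence ID.

```shell
# Hypothetical helper: count FASTA records whose ID contains "vig_"
# (treated as GPD here) versus all other records (treated as RefSeq).
count_sources () {
  awk '/^>/ { if ($1 ~ /vig_/) gpd++; else rs++ }
       END  { printf "gpd\t%d\nrefseq\t%d\n", gpd, rs }' "$1"
}

# Usage:
#   count_sources infa/gpd_terminases.faa
```

The two counts should add up to the total number of records in the input, before `rmdup` removes anything.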

In [None]:
# UCLUST Local Alignments - 90% "Species Level" clustering
# usearch v11.0.667_i86linux32

function uclust () {

  INPUT=$1
  OUTNAME=$2
  ID=$3
  
  # uclust
  OUTPUT="$OUTNAME.id$ID"
  usearch -cluster_smallmem $INPUT \
     -sortedby other \
     -id 0.$ID \
     -maxaccepts 4 \
     -maxrejects 64 \
     -maxhits 1 \
     -uc $OUTPUT.uc \
     -centroids $OUTPUT.fa

  mkdir -p id$ID; mv *.id$ID.* id$ID/

}

# Cluster at 90%
uclust infa/primase.all.fa primase.otu 90
#      Seqs  4950
#  Clusters  1098
#  Max size  175
#  Avg size  4.5
#  Min size  1
#Singletons  613, 12.4% of seqs, 55.8% of clusters

uclust infa/terminase.all.fa terminase.otu 90
#      Seqs  43185 (43.2k)
#  Clusters  6236
#  Max size  425
#  Avg size  6.9
#  Min size  1
#Singletons  2802, 6.5% of seqs, 44.9% of clusters
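
The summary figures in the comments above can be recomputed from the `.uc` files if they ever need to be checked. This is a sketch that assumes the standard UC layout, in which `C` records carry the cluster size in column 3. `uc_summary` is a hypothetical helper, not a usearch command.

```shell
# Hypothetical helper: recompute cluster statistics from a UC file.
uc_summary () {
  awk -F'\t' '
    $1 == "C" {                       # one C record per cluster
      n++; total += $3
      if ($3 > max) max = $3
      if (min == "" || $3 < min) min = $3
      if ($3 == 1) single++
    }
    END {
      if (n == 0) { print "no cluster records found" > "/dev/stderr"; exit 1 }
      printf "Seqs\t%d\nClusters\t%d\nMax size\t%d\nAvg size\t%.1f\nMin size\t%d\n", total, n, max, total / n, min
      printf "Singletons\t%d\t%.1f%% of seqs\t%.1f%% of clusters\n", single, 100 * single / total, 100 * single / n
    }' "$1"
}

# Usage:
#   uc_summary id90/primase.otu.id90.uc
```

The numbers recorded above are internally consistent with this arithmetic: 4950 / 1098 rounds to 4.5 and 43185 / 6236 rounds to 6.9.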


In [None]:
# DISTMAT - 40% "Family Level" clustering (single-linkage)

function uclustF () {
  INPUT=$1
  OUTNAME=$2
  ID=$3
  
  # uclust
  OUTPUT="$OUTNAME.id$ID"
  
  usearch -threads 14 -calc_distmx $INPUT \
  -maxdist 1 -termdist 1 \
  -tabbedout $OUTNAME.distmat
  
  usearch -cluster_aggd $OUTNAME.distmat \
  -treeout $OUTPUT.tree \
  -clusterout $OUTPUT.clust \
  -id 0.$ID -linkage min

  mkdir -p id$ID
  mv *.id$ID.* id$ID/
  mv $OUTNAME.distmat id$ID/

}

uclustF id90/primase.otu.id90.fa    primase.fOTU 40
uclustF id90/terminase.otu.id90.fa  terminase.fOTU 40
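
Single-linkage clustering (`-linkage min`) at a fixed cut-off amounts to finding connected components in a graph whose edges join sequences that are closer than the cut-off. The sketch below illustrates that idea on a three-column distance table (`label1`, `label2`, `distance`). It is not how usearch implements `-cluster_aggd`; the input layout, the `single_linkage` name, and the reading of `-id 0.40` as a maximum distance of 0.60 are assumptions made for illustration.

```shell
# Hypothetical helper: single-linkage clusters as connected components.
# $1 = tab-separated table: label1 <TAB> label2 <TAB> distance
# $2 = maximum distance for two sequences to be linked
# Output: representative <TAB> member, one line per sequence.
single_linkage () {
  awk -F'\t' -v maxd="$2" '
    function find(x) {                # follow parent links to the root
      while (parent[x] != x) x = parent[x]
      return x
    }
    {
      if (!($1 in parent)) parent[$1] = $1
      if (!($2 in parent)) parent[$2] = $2
      if ($3 + 0 <= maxd + 0) {       # close enough: merge the two clusters
        a = find($1); b = find($2)
        if (a != b) parent[b] = a
      }
    }
    END { for (x in parent) print find(x) "\t" x }
  ' "$1" | sort
}

# Usage (assumed mapping of 40% identity to distance 0.60):
#   single_linkage primase.fOTU.distmat 0.60
```

Because a single close pair is enough to merge two groups, long chains of loosely related sequences can end up in one family-level cluster, which is worth keeping in mind when reading the `.clust` files.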


In [None]:
# primase.fOTU.id40.clust and terminase.fOTU.id40.clust were
# imported into Excel and wrangled into "clean" headers,
# organizing sequences by gene, species-level OTU, and family-level OTU
# uvig_211770_8 
# -->
# prim.f1.na:uvig_211770_8
###
# YP_008857945.1 |terminase large subunit [Mycobacterium phage Conspiracy]|GCF_000913215.1 
# --> 
# term.f141.mycobacterium_phage_conspiracy:YP_008857945
#
# s3://serratus-public/notebook/210125_ph0/id40/header.swap
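
The header wrangling above was done by hand in a spreadsheet. For reference, the same naming convention can be sketched in a few lines of shell. `clean_header` is a hypothetical helper; the gene prefix and family number are passed in by the caller, and the rule for the organism field (text inside `[...]`, lower-cased, spaces to underscores, `na` if absent) is inferred from the two examples above rather than from a written specification.

```shell
# Hypothetical helper: build "gene.fN.organism:accession" from an old header.
# $1 = gene prefix (e.g. term, prim), $2 = family number, $3 = old header
clean_header () {
  local gene=$1 family=$2 old=$3
  local acc org
  acc=${old%% *}        # first space-delimited token of the header
  acc=${acc%.*}         # drop a trailing version suffix such as ".1", if present
  # organism name = text inside [...], lower-cased, spaces to underscores
  org=$(printf '%s\n' "$old" | sed -n 's/.*\[\(.*\)\].*/\1/p' | tr 'A-Z ' 'a-z_')
  printf '%s.f%s.%s:%s\n' "$gene" "$family" "${org:-na}" "$acc"
}

# Usage:
#   clean_header prim 1 'uvig_211770_8'
```

Organism names containing characters other than letters and spaces would pass through unchanged, so the output still needs a manual look before it is used as a FASTA header.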


In [None]:
# back on EC2
cat id90/primase.otu.id90.fa id90/terminase.otu.id90.fa \
  > ph0.oldheader.tmp
  
aws s3 cp s3://serratus-public/notebook/210125_ph0/id40/header.swap ./id40/

function reheader {
  INFA=$1  # Input Fasta File
  TSV=$2   # 2-field TSV == 1:Old_Fasta_header 2:New_Fasta_Header
  OUTFA=$3 # Output Fasta File
  
  rm -f $OUTFA
  
  # Ensure header input are unique
  sort -k1,1 $TSV | uniq - > tsv.tmp
  mv tsv.tmp $TSV
  
  while read -r line; do
   if [[ "$line" = ">"* ]]; then
    # header line; replace old with new
    oldheader=$(echo "$line" | sed 's/>//g' -)
    newheader=$(grep -Fw "$oldheader" "$TSV" | cut -f2 - )
    
    if [[ "$newheader" = "" ]]; then
      # no match in the TSV: report the header and abort
      echo "ERROR: no new header found for: $oldheader" >&2
      return 1
    fi
    echo ">$newheader" >> "$OUTFA"
    
   else
     echo "$line" >> "$OUTFA"
   fi    
  done < $INFA  
}

reheader ph0.oldheader.tmp id40/header.swap ph0.ALPHA.fa
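
The loop above runs one `grep` per header, which gets slow for tens of thousands of sequences, and `grep -Fw` can match an old header inside a longer one. A single-pass alternative is sketched below. `reheader_awk` is a hypothetical name, and it assumes column 1 of the TSV is exactly the FASTA header text after the `>`.

```shell
# Hypothetical single-pass alternative to the reheader loop.
# $1 = input FASTA, $2 = TSV (old header <TAB> new header), $3 = output FASTA
reheader_awk () {
  awk -F'\t' '
    NR == FNR { new[$1] = $2; next }   # first file: load old -> new mapping
    /^>/ {
      old = substr($0, 2)
      if (!(old in new)) {             # exact match required
        print "missing header: " old > "/dev/stderr"
        exit 1
      }
      print ">" new[old]; next
    }
    { print }                          # sequence lines pass through unchanged
  ' "$2" "$1" > "$3"
}

# Usage:
#   reheader_awk ph0.oldheader.tmp id40/header.swap ph0.ALPHA.fa
```

On a missing header the function returns non-zero and leaves a partial output file behind, so the exit status should be checked before the output is used.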

In [None]:
# Append ph0 pilot and control sequences (see end of notebook)
cat ph0.ALPHA.fa ctrl.seq.fa \
 | seqkit sort -n - \
 > ph0.fa
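
Since `ph0.fa` is assembled from two sources, a cheap check before building any index is that every sequence name in it is unique. A minimal sketch, with `fasta_dup_ids` as a hypothetical helper that compares only the first whitespace-delimited token of each header:

```shell
# Hypothetical helper: print each sequence ID that occurs more than once.
fasta_dup_ids () {
  awk '/^>/ { id = substr($1, 2); if (seen[id]++ == 1) print id }' "$1"
}

# Usage: empty output means all IDs are unique
#   fasta_dup_ids ph0.fa
```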

In [None]:
# Upload to S3 bucket
aws s3 sync ./ s3://serratus-public/notebook/210125_ph0/

In [None]:
# Create Serratus database resources
# create and login to serratus container
sudo service docker start

# Launch "Serratus-Production" container environment
git clone -b diamond-dev https://github.com/ababaian/serratus.git
cd serratus/containers

# Build `serratus-align` container
# NOTE: diamond v0.9.35.136 Installed
# export DOCKERHUB_USER='serratusbio' # optional
# sudo docker login # optional
./build_containers.sh

# no-build boot
sudo docker run --rm --entrypoint /bin/bash -it serratus-align:latest

In [None]:
# In serratus-align container
mkdir ph0; cd ph0

# Download seq-db
aws s3 cp s3://serratus-public/notebook/210125_ph0/ph0.fa ./

# Index seq-db
samtools faidx ph0.fa
mv ph0.fa.fai ph0.sumzer.tsv

# Create diamond index
diamond makedb --in ph0.fa -d ph0

# Create diamond memory-map (done in the next cell)
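
The `.fai` index that is renamed to `ph0.sumzer.tsv` above is a tab-separated table whose first two columns are the sequence name and its length. As a cross-check that does not need samtools, those two columns can be recomputed with `awk`. `fasta_lengths` is a hypothetical helper.

```shell
# Hypothetical helper: print "name <TAB> length" for every FASTA record.
fasta_lengths () {
  awk '
    /^>/ { if (name != "") print name "\t" len
           name = substr($1, 2); len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # ignore stray whitespace
    END  { if (name != "") print name "\t" len }
  ' "$1"
}

# Usage (bash): should print nothing if the index matches the FASTA
#   diff <(fasta_lengths ph0.fa) <(cut -f1,2 ph0.sumzer.tsv)
```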


In [None]:
# Diamond2 Memory Map File -- USE PRODUCTION CODE!

run_diamondPR () {
    QUERY="$1"
    GENOME="$2"
    OUTNAME="$3"
    MMAP="$4"
    
    if [ "$MMAP" = "" ]; then
      MMAP="--mmap-target-index"
    fi
    
    # Diamond blastx alignment "BUCHFINK OPTIMIZED CODE"
    time cat $QUERY |\
    diamond blastx \
      -d "$GENOME".dmnd \
      $MMAP \
      --target-indexed \
      --masking 0 \
      --mid-sensitive -s 1 \
      -c1 -p1 -k1 -b 0.75 \
      -f 6 qseqid  qstart qend qlen qstrand \
           sseqid  sstart send slen \
           pident evalue cigar \
           qseq_translated full_qseq full_qseq_mate \
      > "$OUTNAME".bam     
}

# set up Memory Mapping (requires test data)
aws s3 cp s3://serratus-public/test-data/fq-blocks/ERR2756788.fq.00 ./test_block.fq
run_diamondPR test_block.fq ph0 test_block "--save-target-index"

rm test_block*

md5sum * > ph0.md5sum

# 1270f0e65c08a99aa5dcd7b21e6a75fd  ph0.dmnd
# eb98c91b1d72925c8c4d0856375c9690  ph0.dmnd.0
# 15cb23837634e800b2a0df714c1c3998  ph0.fa
# d937385d0bf1df1ff5c1f1f17ba2b195  ph0.sumzer.tsv
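
The checksum file is mainly useful on the receiving end, after the database has been copied to a worker. A minimal sketch of that check, assuming GNU `md5sum` is available; `verify_db` is a hypothetical helper name.

```shell
# Hypothetical helper: verify database files against a checksum file.
# $1 = directory holding the files, $2 = checksum file name inside it
verify_db () {
  ( cd "$1" && md5sum -c --quiet "$2" )
}

# Usage: returns non-zero if any listed file is missing or altered
#   verify_db ./ph0 ph0.md5sum
```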


In [None]:
aws s3 sync ./ s3://serratus-public/seq/ph0/

## Control Sequence Injection

Don't tell RCE: I am including random and shuffled sequences in this search as a sanity check of E-values given random metagenome input. The control set is tiny and may be highly informative.
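
For anyone wanting to build similar controls, the sketch below shows one way to generate shuffled and random protein sequences. It is not the procedure used to make `ctrl.seq.fa`, which is not recorded in this notebook. The function names, the seed arguments, and the choice of a uniform draw over the 20 standard residues are all assumptions.

```shell
# Hypothetical helper: shuffle the residues of one protein sequence
# (Fisher-Yates), preserving its amino-acid composition.
# $1 = sequence, $2 = integer seed
shuffle_seq () {
  awk -v s="$1" -v seed="$2" 'BEGIN {
    srand(seed)
    n = length(s)
    for (i = 1; i <= n; i++) a[i] = substr(s, i, 1)
    for (i = n; i > 1; i--) {          # swap position i with a random j <= i
      j = int(rand() * i) + 1
      t = a[i]; a[i] = a[j]; a[j] = t
    }
    for (i = 1; i <= n; i++) printf "%s", a[i]
    print ""
  }'
}

# Hypothetical helper: random protein of a given length,
# uniform over the 20 standard residues.
# $1 = length, $2 = integer seed
random_seq () {
  awk -v n="$1" -v seed="$2" 'BEGIN {
    srand(seed)
    aa = "ACDEFGHIKLMNPQRSTVWY"
    for (i = 1; i <= n; i++) printf "%s", substr(aa, int(rand() * 20) + 1, 1)
    print ""
  }'
}
```

A shuffled control keeps the composition of a real protein while destroying its order, whereas a uniform random control tests the background directly; the two answer slightly different questions about spurious hits.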


`ctrl.seq.fa`

```
>ctrl.shuf.YP_009725307
GGKDRTATKYSDANCGYDVNYAYRVAHSDGATTVGDNDYKSSMDMGGTRMWYHMKTSSNA
VSNSKYMMNTTDYDDVDTRSTNSTAAYAMRSMMGVMRVYKTKTCSDTAVDHVNSCVCRRK
DRDVDRKGNNARAGNWNNYVDSCVKRRDVGWRDCGVGSHDVCWADAVYVHGYVSAGDSKD
TSANVRYATRHKKATADHCANVTYHWAKKTCVAATMSCSDRDRVVMKKYTGKMYVYANGC
TSYVMCAYVSTDNYKMYMTNVYSASKACGCSKMKSVTGYRTMANHYHSAGDKHDNTDAAA
SGHYVSTTRVNYVSHNTAMDNDTYSAKGRGHYNVRKKCNNSYSDDNGHGKAVDVNSYMYN
VGNYAAYKVYATVY
>ctrl.shuf.CAA24445
SDMNSKTGDSYSDDTSRGAKDKANKYANNSGGSVKGTSYKHNYNDYTYNDKSDRHMVKWA
ADDMKVAVDDRYGCYSGTVGMKVAKTTYSMSDTAMHVYANMSGCRDNVTAVNKRTKAYYM
NNGDKKYSAHKVGYVSWMTDKDMADVGGGGRSKRKDKVKKGVTYYVKTDRTSRWGTCGVH
KTARKTSAARAHSDSTVYDDKVSCDHVARSMHNAARGHAGKTYTWTATYTKGRASDWSTK
RMSHYYKKGVGVVRNNWVYSDACDSDMDAADDMNRKKYAGNHGAMMTKK
>ctrl.shuf.AAV48581
NTAASRNTDVKVRGTHTKHVVDGYSNVKSGNGDSDNRHRHSYAGSKHGNNYVDGAYCASN
NSSWGGHGRSATNAGCVGYKSRRCWCHVRSKGNSHKSDNMVTYSHDADAGMRHDDNSKYS
NSKTVVNWYRVVGAVTAVGTNDNYNSTKAGGNCMRRTGVKKNADMSWTKNRMDHTRWTDY
GDSNGACNGASNNDVYAVVSRGGTCYRCADAKNAGTKNTKNTHTTTSYANKSVASACSGH
THDNSSAVNKVAHYRGTDSRSSDTCYVMGYHAVNRTYNDVVKYVSSSTYGAKKGDATSSY
TGGSNAVRSRDDYKRGTYVSHVRYTASSYYAYHKVKAASDVSVMKVVCARASMVNVMWCK
GTKC
>ctrl.shuf.QED42946
VDSRGCDGDDASDAKVNKAGGTDTTVCMKTGAGDRVRNHMWVVRSGKAADHSVKDRVGGS
NTKYKDGKRVGSSSKRKSDRDGTKKGTGGTARHYDNVGRGVRYSRARSNVAHKYHVTGSS
KTCAVRKTGRSRDYTKVYCYDYSSYTYDGHVDDRGSSSYMAGRKTARTSVVYKDTNTTGR
HRSGWHGHYDYGYAHVRRGKDVSSSASYAAGSSHVYDRKVSKCAARRSHWNKSCSATWGT
ARRKVAATSVDGYGRSSKGGGRSDSS
>ctrl.shuf.QIM73896
DTHTAYGAGYVKRSVTCSRSYGAADTSTVMSRSAAVTASMKSAVRVTGVRNYGSSYADAD
MHKAGTVYDAHCRAAKGAAVVDKYHVAVCSACKVADHRDRCGKGSSSYKSVAYAGWKKWY
TTVRRSRKGVHRDAANRRARTDWMSRRTTRACDYTDWKRTVRARASRTDVVDTGGYADYS
GAVGGVVYACSGCKRYNDTAVGSRARGATGSVASAAAVKSRDYKGSVAVTMAVVDADRCG
GGAGHSMGVTTKHRRSKVVGRTVVSTDDARTSNNGTTAKDVVATRRTKATRHSKGNRTNA
MNAMGDDGCGYYRGTCCADNNV
>ctrl.shuf.NP_932306
DCKAKMDDNDHWKCGVAKSNSNSDDTRMHKTMSSYTTVSDTKVDMKANTDTYTTHADGKD
YTAATTCMVTKANTDKAAGDSVTADRRDDNWTDKGAVVDRRDTSRKDRANSRATHKAKTK
HVTSDDNWYAAVYRMVKTGVADNCVGCRKAKHAKNGTAKNHWGRNARASKGADDKTVGYH
KCAANRRRYRATRTVGAKNVKHHAHTKVMANTYGVMRYMYATARVKSKKMTYSSDDAYDS
TKSAATTWNSSGTNAKASSKTYVYAHSTHADHDVAKSRTNMRAVGNGTSDDRWRSGTTTK
DDADKHDV
>ctrl.shuf.AAC58759
YAGTASMNKVKSVDDGHNKVSHSVNSWSHAKMGHMTGRVYSRTTSYYDNGNCWMRWDTYA
CRKASSNSGAVGGGYWDMGVKRVVVARMDGDRHGWTDMGNWSVDRVMKGNKVYYDCHKTT
RWGGAMGRCRASRHGKDSSCNYDCVRWNAGSKRMRSSTGKNVDRKGGRGNAGGKAGATAK
WSAAAVRGSNVDSTDTGNAVWAKTHHTSARMMGDDARAVKGTGGDRGMCSMRVS
>ctrl.shuf.yaOV217orf12259
DCKSTYARAYADSMGMSGMNDHDNKGNGYRCYDNGRYTGSWWNRVHGDTRKASVRAAATG
NTVKHDKTVADMTWRVSSSGRNCYTGMAYNSRTKRADVDKTTCRVVSKMNGNHNSYKDRK
VRGTDADVTWHSDGTGRDSAKDYRKVDGYNMAKSKVNTGNMVRVYCNWSTKWVSRKKDKR
MDSRNAVASRVWYRGSCSYWTADSDKTNMKRKVADMAGHCDGSGADVGVAAKGKGCRSCA
ACGKKDKVTKYKNKVGSMWKMCKRSGTTHKNTYDAKDVDNGNDARCTNTAAADKHGDTSK
VYAGRVGHYVVAVKYVCNVYTRCVMATMWSMGTSNGN
>ctrl.shuf.YP_003104770
AGRAKTAGADRTTSRRRVDDWTDSRVDVKVRRRVVKYWVAHASWKRRSDWSASSRNRTVN
DAHCACSSAVMTDRTYSAGAHSGSGKHGARGHHNGKWTASHYCDRDGAVKAAGWNSVDHS
RYRDGRSNVVGRRYGHAWKGAGRNTAGCWKGVSRTDARCNKCTNSVVAVYTSNVRDDWSA
GSGGVNRKRDMMRRVKTCSRVVTSGSDHYRTARNVGGGGKVRASGNMSSTDYAADDATVR
STDVKGTGRRGCDHVGDNVMADVRGVAMRSKGVYMSKKCKNNGHGVACRRDKSRWWSTSN
YHTANVNSRVDHYHVRTGMCRMSSVAGSRS
>ctrl.shuf.SS0000002
WVKRKSAKGGHSRNVNDVSVGDSYGGKGRGKDKKMRKNKGVGRMRGRMGVSDSRVKDGRK
GRGWHRRRGGDGSGKRRKGRRSASGKSRGKGGNGTGGGKRNHGNATNNRRKGSKGGRRRR
>ctrl.rand.r1
QCLRIQRNQFSYGMQAYVPKKIMVWRVWPCINSTRYYKRTMHERHNQVFIMSIQMWMELM
GWMYRMKGNEAWRCYISHMTHPQKVIRNSQTDDLDQCWHCRYSNSRPPDVVNSWCPDVCI
HSCLWQANPGSHFKDRTIWWTESFPQNWQADQMVPMFNWHDCRAGKNFGMNNIQRSTAAR
RPIYLREKQEACIKSWVDVRKTMHDQLAVDGTAHAPSHRLNRIGPDLHLRNQPGMHDDGY
LAMSVKMWKPLPHHDCMMWCCPTCNLNVYTHCTDTAGKTGTKRCLHFDWCNTHKNMNCPD
QHTNCFFIVGYKWSMDSMLPELCLCSVGHPTKWYQFLNENWPYMAYWRGEARYNVEDKCR
QNIYCLWECMWQRHAGSCAWMWHNYTRWFCLKDWTMQQTVQSCKLCAGNRYDPDGWNAPS
YYPGKCNIPKHMELDFTIMINKKTMMVIYWPHAAIIQPQKMCIPRSDQAPGFYSNPNGEM
CASDNRSIIHRQMYWYTINV
>ctrl.rand.r2
PYLEICICWHSTDVGYDCQEMHHIIKGISGSPFAIMWSWCWPDIWPKYQGQWFPTVNMGN
APWMNWTKFCEIEVHASSQPCFGLAQNWFDPFLCGCKEFYEGILWWNWMDGGWRTTYKRA
VGFKFNSKTQNLTWSHQKKGPNQDPEVHQNYMTEMFFRPYVNGLYEETTNSPIFAKNMRC
GNTKRLWMEILTQEEICRAAFYWAVAHQKCCRSCSKMGIPLSCECAHSVCCQSNVIKKQL
HNMGLQLTHHNPARSYQVTYNAMCNWKCGPCPNSHFAMPCWINWWVRMIWWLWYRPCKCH
KKHCLPQEWTVNMWTYMNEDWRKDELSGDQQKSTQDVATIKNSPGICLMDCCRPAANRDI
QLWGARPPNDAYVWQPTKGGGRWCNDQMCILKQTECVWLVGVNSVHYRQRKFERLYCGYL
AFFDFAPCNAYPTADGENNLNHPAHEETIQCHGKQSPGEQTMCDKTNCQEGHLPAGGGIS
LVVTVYCRQVWHDKLFYPMK
>ctrl.rand.r3
SGFTMPMDARQDHNHHWEENQNQICCCGHEACHIISVILWPNNCRMPWHENFIPKDNVHP
DNRWDCIMSAPPPSGGRSAKDHQMVRERTKFCCSEKCKDVGHYCWKWDSSEMTRQQFQVA
LTIQQQTLETDIHPESYMTNDHDEEDPVYYFSDSYMWHPKIQIHQADRCFFYKYPMLFNS
TQVRTLDTNDCPYDGGICYYMGQKCCPKEHEIYAGVGCRADYRERNYTMHPEAFQKVTVL
YHCHGQKIWNFWHPNGVTRPVHLPMAHIRWHVWAFNGSFDKCKNIAIEGIDTCFNNWDSR
DEPWVAEVKWLHSFRGDHTISTGAFQRVSCATMQNVDLFGGPHFWPHRCPKFCFPGHILC
ETDYHTVNWCYCNPCVGNAASMSCYDDLGTHLSWCCPFNCSQHLDYSKKRGWGKPPRHTA
GKDFCKIFNTVTEMDVDGWIYCNTAGWMKWQNERETTMWAECDANINHLLEDTLDVAKPA
EKAPWADCYFEFCTCVNAMC
>ctrl.rand.r4
QPSLDSHKNPVPQFQSQIQQQAGYICQEFVCYKTLAYVLVLFHEPRPTYKRAQNWMEAKP
SHWMMFMCQYKDMTDHHFYFYTWYMCNYYFICRDLPAYATIGNYNTYQLERQSIWCSTGN
VRPCKWWGEYDVFIDWAIECFEDPREFQNALAQVANKRDHDKHYAMPHHLSFTRPSIGFE
SGWAVQFPSSSNGGTQRFTLRHFAEDGLTGAHADVMIAQDYKWMFSNMGWEGMQCGRREN
SGRFLWLFPYAGIELWMKFKLYGWHCHKQDTEYGKPKSMYKRCQILNWILFDALSLPELT
DDTKKWHTNNWKCPNHSGWYTMAICQGETARWTKMQRCEQAYIEISASPTMTKGHTELEE
ARFCHKSEACWSIQNGTCIVYGTNCALNGHDYMPYIIPNDNNIQCSTREPWCTAMEISGI
HSYCDRPFLCYQCEDTMSYQNKYSEWVWIWVGRIVWYNNDHNKTDWNTGQYAYTNESGGC
DLWPDVYWALYKILDWSRNI
>ctrl.rand.r5
ERIQPVRKEKDRNEGHAIFAAHDDKPGKNCGSPCCMGRYMKSPQAISYINTNERDTPWWF
DYCMMSRWNERWYYPFKHCLFPYTPIPPQCSGFFGAKKKQIKRVAICWDQYCNRWKWSCT
CLLCHYCYPQFPFDDTLTAPQVYYKSLKPCDIAASAFKKHLSKFITMRDDRWYFNETDTG
SEKDSNHWLMSPCKEMAEWNTMAHPSRMVASIVSPHMHWTGTWMSRSGAVVGRIINSQNI
VFTSKPANQVSFWPTYDCNVYIAMKWVCTGADNYTYCQTLYQHIRNYAYMVCAVGTERFD
YHQTPQEELAYWSCHKKLDTMAWEDTKQQNWQWPHQMPMGTMWELQGAEYITTSHEDKRC
YEYHVLKKCQIDGMMSMNTTKSMDWFTWHFERRQDEHYGTCYWKFIEYLPIRVMRDEMHL
VQVGFQFCMNPVVKMISSYCRKKDITPSLPQWFGGGPHMQQGMESACRYQETKPMYFRKC
NWHKPYPMTPIDVTPALWYE
>ctrl.rand.r6
DEHYSGDDECESTKFGALTPGYNEMCFWSPGTKTHFMLTAGPHCTPICIHQFLDWRPYTI
KITGGAYVFRRNLYQHTVMEIMNYFMRAMDNCFNELRMYYNDDSHVSPKQMVSNQIATSV
IHNGNRVNMNRMESPEPIDEMLYTYCMQFRWKNFMAFCNPVIELSIPFQEVWTNFFKWAT
WKLDMNWHSALCMKYAVFRAECTGRNCMVCNMLCGEMVGSIKLTIFWRWNSWRKRIEKLW
NSFRPGYPCTQGARVDMSSRDFATDLQFWWHDDENMKHHKVQGHNTWQFYNYGRDKTMEK
EPASQNLHRLIYDAPHRHLHAFTDLPTHMIPWDRMWWTNKERKTVNSQMRQREAGYTADV
DDQLAAGSLNQVDWKPFFAHYISIDWPLPCRESFPMNEEIFTGVDSHRFDLCPLIWPDTV
LGIPCMPKDGNHWHDNKKHKNKHLIGQQPFKYFLELVFTPNTRAYMNFRKKVAWYAVVHC
FVHTRTWCYRECPEWNFIYQ
>ctrl.rand.r7
TWQVTQNTWEHCKWSLSKGFFFDKQKLSRFMEYPQGKPLYEKILRYWVDDKCIGQWFAPL
PLGIEIIYWKNLAHGRPPRWNGFPMNESQNTRVEACELCASILSQIKYSHRWYPKGYSLW
QTWSPSRFDRGKNDARPGYAALNAHFLKLYEYVEDNKWSIQHHAHNDKDFMSVFGERQSH
RCGFIKSTLQDGVVPTRPELHVDNDHFWQDTCEYIYFLMGHEEANHNCMELNDMDYLKKH
KMIKAAHGVKLMEKQCNSHDTIEYPCEYFENWGNHKAVFYSVQEFHAQPDMADAGNQKSP
DMSGFHYMEICETWLLWWSSTIKKHWPAPYTNNVHAPNLWDVQVTCENAGDHGGVTTRLW
SKNQWMFKIFRKSENRKHPEPRFEEGNDWVPDSEFFVRTTDGSHRCRAMPACDWWHMMSK
SREDDQKTIGNPALSFCTWSEFHMDTMKLCPRKNRNDEDNAYLVYFRGVITRMSHCECGL
LSKHDFLTAVGMNRKWWWWD
>ctrl.rand.r8
KAIEHARHFRPSSLPWCNCYARWDPKEGSPYKGWCCYSRAGTVCIQGTWKQHEVSYGKAC
VWDWPYCFISFQVNPLWSMWDQCWNMSAHETCFPGDFVMRRASVCYHPAMNVMRYIPGST
TIGIHTTNNHLDKKGSHSMCYRDGEEVITFRECCMGGKMQHLFLIHQEWQADDCGDLTTE
CNVYTSRTWPAVQEPLDQQCAAGRYAKFPKTQQYPEHIHDMNSQYWECGMLSWAVTMNAA
HAHDDNATFTGMNLLHVGFRPNLNDSKGYLWKDCMEAEVAGTVFIEVHPQKGCNRVDTMN
HSDQYNLPCNITCPNAHKLLYKANFYGGFHICMLELSWNRIAWTKPGGWLEHCKIKYYWK
INPVVVCLFRAGTNPQCGPYDTFQALNCSMTSLTVMIIMCHFFFGVGLKMHACNVLAKCG
AKVFYQSGMMMKESAWQFRVQTADTYDFIITPDWTSDPKHFDARCHVFWQQVADFKDRIK
DRRSHIVNNRMFCINGCHPE
>ctrl.rand.r9
IQCHSIFVCHMIISKPNKDATACHRGGKLYYAHLCMCCQFMYRSRVYDFRRTSTEITTCC
HESYFGIIYFSFAMQYKTYHDIGCIEDHCRCMTIQWSWHTACAYGLTQQASRGYRMIHFH
QHIDLNVLELNHFATYSNMVAEDRGNVYQQQFALYRQWQSRMENNFITWDKKHIMSVREP
WGLFHLKNCNMFKSNEEATIFLHAMFAINDIMIAWHDPCNWKDGEKIRLQCEGDRLCEGV
FMDFRPQPAHQNTMWNPKAGLTKTYWCRYQDWCRQKDAMQGCGVTFEMERNLECGGNLIG
GKKLVRSIKETYSTCMYIYMMGCTHPNPGNQVCDTNIEPNGHGGQKNINPRRPQVSKCEI
GETWCSLIQMQYGPHAAVCADCWCQDFRSAAPQPVTRKEPGVYNMGTGMKWLQQAMGGGG
HGYPNYNQCMGIQWCWLWSGMKDKGHKQYLTNCACVALGWMYWLLIWGDCHEPGAHGYDK
CMCYSAIPTDSWDWSAWWPA
>ctrl.rand.r10
RKFSMLYKKAIDDQTLTYWVLICHHGCTITCPDAITTYMPMCTESRTAGPCEAFEAYQWK
SPFETEMDRGKARGTRTRGIMRGTCFYDTIFTKKVQNAWIGWLSPFDYIEHAGEPMLMDW
GQHVQRNRRSIFYQQHRPNRCCKPKQDAKPFFTWGMAPAPDMPHYPERYPMNPFCRWYEQ
CPKIQAWMCVCEHVIVWHRPDIWVDTMPGACYQGKTNVHVWDGWNPGNFTSRFIPRERTQ
HKFLAKPRWIAICQSLNRCYSSGSIQMTAGVQGNRGLWVNVGDSKGMKIVLMEFYIQAGC
SNPRMYFTWAKYAFKRGEAEVECKIISDGHFNPPKWAIGCGVHQQQPHWHRTTPVEPTCR
VCWDRMTCHPFSDGVKSLRDQPLEQGYTMHKSRQTWTMDTVCQYDILSINSCNDFDIMHK
FPDLMYPSDLHSAWEDTQIQQVKDCADHSVSIYVQEDRHSITHANSDHPDTLFTIGSHNI
EVMTPLTSKKTHFCELPYCR
>var.psid.tryptophan_decarboxylase:ASU62242
MQVLPACQSSALKTLCPSPEAFRKLGWLPTSDEVYNEFIDDLTGRTCNEKYSSQVTLLKPIQDFKTFIEN
DPIVYQEFISMFEGIEQSPTNYHELCNMFNDIFRKAPLYGDLGPPVYMIMARIMNTQAGFSAFTKESLNF
HFKKLFDTWGLFLSSKNSRNVLVADQFDDKHYGWFSERAKTAMMINYPGRTFEKVFICDEHVPYHGFTSY
DDFFNRRFRDKDTDRPVVGGVTDTTLIGAACESLSYNVSHNVQSLDTLVIKGEAYSLKHLLHNDPFTPQF
EHGSIIQGFLNVTAYHRWHSPVNGTIVKIVNVPGTYFAQAPYTIGSPIPDNDRDPPPYLKSLVYFSNIAA
RQIMFIEADNKDIGLIFLVFIGMTEISTCEATVCEGQHVNRGDDLGMFHFGGSSFALGLRKDSKAKILEK
FAKPGTVIRINELVASVRK
>var.psim.n_methyltransferase:ASU62241
MHIRNPYRDGVDYQALAEAFPALKPHVTVNSDNTTSIDFAVPEAQRLYTAALLHRDFGLTITLPEDRLCP
TVPNRLNYVLWVEDILKVTSDALGLPDNRQVKGIDIGTGASAIYPMLACSRFKTWSMVATEVDQKCIDTA
RLNVIANNLQERLAIIATSVDGPILVPLLQANSDFEYDFTMCNPPFYDGASDMQTSDAAKGFGFGVNAPH
TGTVLEMATEGGESAFVAQMVRESLNLQTRCRWFTSNLGKLKSLYEIVGLLREHQISNYAINEYVQGATR
RYAIAWSFIDVRLPDHLSRPSNPDLSSLF
>var.psih.monooxygenase:ASU62241
MIVLLVSLVLAGCIYYANARRVRRSRLPPGPPGIPLPFIGNMFDMPSESPWLRFLQWGRDYHTDILYLNA
GGTEIIILNTLDAITDLLEKRGSMYSGRLESTMVNELMGWEFDLGFITYGERWREERRMFAKEFSEKNIR
QFRHAQIKAANQLVRQLIKTPDRWSQHIRHQIAAMSLDIGYGIDLAEDDPWIAATQLANEGLAEASVPGS
FWVDSFPALKYLPSWLPGAGFKRKAKVWKEGADHMVNMPYETMKKLTVQGLARPSYASARLQAMDPDGDL
EHQEHVIRNTATEVNVGGGDTTVSAVSAFILAMVKYPEVQRQVQAELDALTSKGVVPNYDEEDDSLPYLT
ACVKEIFRWNQIAPLAIPHRLIKDDVYRGYLIPKNALVYANSWAVLNDPEEYPNPSEFRPERYLSSDGKP
DPTVRDPRKAAFGYGRRNCPGIHLAQSTVWIAGATLLSVFNIERPVDGNGKPIDIPATFTTGFFRHPEPF
QCRFVPRTQEILKSVSG
>var.psik.hydroxytryptamine_kinase:ASU62240
MTFDLKTEEGLLSYLTKHLSLDVAPNGVKRLSGGFVNVTWRVGLNAPYHGHTSIILKHAQPHLSSDIDFK
IGVERSAYEYQALKIVSANSSLLGSSDIRVSVPEGLHYDVVNNALIMQDVGTMKTLLDYVTAKPPISAEI
ASLVGSQIGAFIARLHNLGRENKDKDDFKFFSGNIVGRTTADQLYQTIIPNAAKYGIDDPILPIVVKELV
EEVMNSEETLIMADLWSGNILLQFDENSTELTRIWLVDWELCKYGPPSLDMGYFLGDCFLVARFQDQLVG
TSMRQAYLKSYARNVKEPINYAKATAGIGAHLVMWTDFMKWGNDEEREEFVKKGVEAFHEANEDNRNGEI
TSILVKEASRT
```
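
Once a search has run, the control check comes down to tallying alignments that land on `ctrl.*` entries. The sketch below assumes the tab-separated column order requested in `run_diamondPR` (`sseqid` in column 6, `evalue` in column 11). `ctrl_hits` is a hypothetical helper.

```shell
# Hypothetical helper: per control sequence, count hits and report the
# smallest E-value seen.
# $1 = tab-separated diamond output with the run_diamondPR column order
ctrl_hits () {
  awk -F'\t' '
    $6 ~ /^ctrl\./ {
      n[$6]++
      if (!($6 in best) || $11 + 0 < best[$6] + 0) best[$6] = $11
    }
    END { for (id in n) print id "\t" n[id] "\t" best[id] }
  ' "$1" | sort
}

# Usage: columns are control ID, number of hits, best E-value
#   ctrl_hits test_block.bam
```

If the E-values are well calibrated, hits to the shuffled and random controls should be rare and should carry weak E-values compared with hits to real terminase and primase entries.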