# BENCHMARKING: PatMatch use on the largest genome project to date

This notebook is meant to benchmark how well PatMatch runs in the CyVerse VICE app system.  
**DO NOT RE-RUN THIS UNLESS NECESSARY**.

*Originally carried out May 14, 2019, when the largest genome project to date was the sugar pine.*  
From [Crepeau et al. 2017, 'From Pine Cones to Read Clouds: Rescaffolding the Megagenome of Sugar Pine (Pinus lambertiana)'](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5427496/#bib15):
>"The largest genome project to date is that of the white pine Pinus lambertiana (sugar pine), with a genome size of 31 Gbp (Stevens et al. 2016)."

Note that the table currently [here](https://salzberg-lab.org/genome-projects/) lists the coast redwood as slightly larger (33 Gbp) than the 31 Gbp sugar pine. However, the April 2019 lab news there lists the sequenced portion at 26.5 Gb, so for now sugar pine remains a reasonable candidate for the largest assembly. (Another contender is the axolotl salamander at 28.4 billion bp; see [here](https://www.sfchronicle.com/science/article/California-scientists-unravel-genetic-mysteries-13786816.php).)

QUESTION: Will PatMatch running on CyVerse work for the largest genome assembly currently available?


Reference for sequence:
[From Pine Cones to Read Clouds: Rescaffolding the Megagenome of Sugar Pine (Pinus lambertiana).
Crepeau MW, Langley CH, Stevens KA. G3 (Bethesda). 2017 May 5;7(5):1563-1568. doi: 10.1534/g3.117.040055. PMID: 28341701](https://www.ncbi.nlm.nih.gov/pubmed/28341701) (Getting the sequence was not as easy as following the link in the paper, as described in the next section.)

----
----

## Preparation: Acquiring the genome of sugar pine

[From Pine Cones to Read Clouds: Rescaffolding the Megagenome of Sugar Pine (Pinus lambertiana)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5427496/) says:
>"The assembly is available at the Pine Reference Sequence project site http://dendrome.ucdavis.edu/ftp/Genome_Data/genome/pinerefseq/Pila/v1.5/."

However, despite the paper being published only in May 2017, that link is dead. (No automated forwarding by the university or department IT admin?) Searching "dendrome ucdavis.edu" failed to find anything; however, 
searching "Pine Reference Sequence" led to a top hit at https://pinerefseq.faculty.ucdavis.edu/ .  
Right at the top of that page are links to several genomes under 'Conifer Genome Sequences', one of which is Sugar Pine v1.5. Clicking that led to https://treegenesdb.org/FTP/Genomes/Pila/ , where I followed the v1.5 links to get the genome link:  

https://treegenesdb.org/FTP/Genomes/Pila/v1.5/genome/Pila.1_5.fa

So the line below should get that.  
**DO NOT START THIS UNLESS YOU ARE OKAY WITH IT RUNNING A LONG TIME AS THE FILE DOWNLOAD IS 21.5 GB IN SIZE.**

In [1]:
!curl -OL https://treegenesdb.org/FTP/Genomes/Pila/v1.5/genome/Pila.1_5.fa

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.0G  100 20.0G    0     0  19.7M      0  0:17:20  0:17:20 --:--:-- 27.1M


## Preparation: Getting the accessory script and importing main function

In [3]:
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/patmatch-utilities/patmatch_results_to_df.py
from patmatch_results_to_df import patmatch_results_to_df

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18722  100 18722    0     0  63897      0 --:--:-- --:--:-- --:--:-- 63897


## Preparation PART II: Putting each individual sequence of the genome on a single line

From the Yan et al. 2005 paper on PatMatch (PMID: 15980466):
>"The software available on the FTP site also includes a Perl script that is needed to unjustify FASTA files that are to be used by PatMatch. This simple script takes a FASTA file, with a single or multiple sequences, as input and outputs a file with each individual sequence on a single line"

This will time that step.

In [4]:
%%time
!perl ../patmatch_1.2/unjustify_fasta.pl Pila.1_5.fa

CPU times: user 14.8 s, sys: 5.5 s, total: 20.3 s
Wall time: 11min 20s
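
For readers unfamiliar with what "unjustifying" involves, the following is a minimal Python sketch of the idea described in the quote: the wrapped sequence lines of each record are joined onto a single line. It is not the Perl script distributed with PatMatch and makes no claim to reproduce its exact behavior; the function name `unjustify_fasta` and the tiny example records are made up for illustration.

```python
from typing import Iterable, Iterator


def unjustify_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line.

    Illustrative only: a sketch of the idea, not the logic of the
    `unjustify_fasta.pl` script distributed with PatMatch.
    """
    chunks = []  # wrapped sequence pieces of the current record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)  # flush the previous record
                chunks = []
            yield line  # header lines pass through unchanged
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)  # flush the final record


# Tiny made-up example with wrapped sequence lines.
wrapped = [">seq1", "ACGT", "TTGA", ">seq2", "GG", "CC", "A"]
unjustified = list(unjustify_fasta(wrapped))
print(unjustified)
```

Because the function is a generator, it could in principle be fed an open file handle and written out line by line without holding a 20 GB genome in memory.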


Let's see how large the 'prepared' file is:

In [5]:
ls -lah

total 40G
drwxr-sr-x 1 jovyan users 4.0K May 14 19:03  ./
drwxr-sr-x 1 jovyan users 4.0K May 14 18:27  ../
-rw-r--r-- 5 jovyan users  25K Mar  6 17:52  human_examples_bckup.txt
drwxr-sr-x 2 jovyan users 4.0K May 14 18:48  .ipynb_checkpoints/
-rw-r--r-- 1 jovyan users  16K May 14 18:45 'Iterating over genomes with PatMatch.ipynb'
-rw-r--r-- 5 jovyan users  12K Mar  6 17:52 'PatMatch initial demo and introduction.ipynb'
-rw-r--r-- 5 jovyan users  12K Mar  6 17:52 'PatMatch nucleic handling flags demystified.ipynb'
-rw-r--r-- 1 jovyan users  19K May 14 18:52  patmatch_results_to_df.py
-rw-r--r-- 1 jovyan users 7.7K May 14 19:03 'PatMatch Use on the largest genome project to date.ipynb'
-rw-r--r-- 1 jovyan users 8.9K May 14 18:50 'PatMatch with Genome and Python.ipynb'
-rw-r--r-- 5 jovyan users 5.6K Mar  6 17:52 'PatMatch with more Python.ipynb'
-rw-r--r-- 1 jovyan users  21G May 14 18:46  Pila.1_5.fa
-rw-r--r-- 1 jovyan users  20G May 14 19:03  Pila

Also around 20 GB. So there is currently about 40 GB of sequence on the session 'disk'.

In [7]:
!du -h

64K	./.ipynb_checkpoints
12K	./__pycache__
40G	.
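
Since the download and its prepared copy together take about 40 GB, it can be worth checking free space before starting. A minimal sketch using the standard library's `shutil.disk_usage`; the helper name `enough_space` and the 40 GiB threshold are illustrative assumptions based on the sizes listed above, not part of the original run.

```python
import shutil


def enough_space(path: str, needed_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes


# Roughly 40 GiB: the downloaded FASTA plus the prepared copy
# (an assumption based on the sizes reported by `ls` and `du`).
needed = 40 * 1024**3
print(enough_space(".", needed))
```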


Now that the prepared data file exists, you are ready to run the program to search for a pattern.

## Running PatMatch on the sugar pine MEGAgenome

This pattern is completely made up because I didn't have an idea of what to search for.

Note that in the output of the PatMatch script (accessible as `output.n` below, via the `.n` attribute of `IPython.utils.text.SList`), I was seeing several `Warning: recSearchFile: Record longer than buffer size (10000000) has been split\n` messages. This makes me think PatMatch works around records longer than the buffer size by splitting them. The warning is probably there so you know that a match may not be reported if it happens to fall where a split was made.  
Those warnings did cause `patmatch_results_to_df()` to choke, but otherwise everything ran. I edited the code below to remove those warnings before the output is passed to `patmatch_results_to_df()`.

In [22]:
%%time 
my_pattern = "DDWDWTAWAAGTARTADDDDCCA"
output = !perl ../patmatch_1.2/patmatch.pl -c {my_pattern} Pila.1_5.fa.prepared 
# Normally the output is sent to patmatch_results_to_df() via the `.n` attribute of
# IPython.utils.text.SList, but the output text contained buffer-size warnings,
# so those are removed before passing it to patmatch_results_to_df().
warning = "Warning: recSearchFile: Record longer than buffer size (10000000) has been split\n"
output_wo_warnings = output.n.replace(warning,'')
df = patmatch_results_to_df(output_wo_warnings, pattern=my_pattern, name="test_query")

Provided results read...

CPU times: user 78 ms, sys: 18.3 ms, total: 96.3 ms
Wall time: 3min 46s
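
The exact-string replacement in the cell above only removes warnings that are followed by a newline, so a warning on the very last line would survive. A slightly more general alternative is to drop every line that starts with `Warning:`. The sketch below is a hedged illustration: the helper name `drop_warning_lines` is made up, and the non-warning lines are placeholders rather than real PatMatch output.

```python
def drop_warning_lines(lines, prefix="Warning:"):
    """Return the lines that do not start with `prefix`, joined by newlines."""
    return "\n".join(line for line in lines if not line.startswith(prefix))


# Placeholder lines standing in for captured output; only the warning text
# is taken from the run above.
captured = [
    "placeholder result line 1",
    "Warning: recSearchFile: Record longer than buffer size (10000000) has been split",
    "placeholder result line 2",
]
cleaned = drop_warning_lines(captured)
print(cleaned)
```

In the notebook, passing the captured `output` directly should work, since an `SList` behaves like a list of strings; that is an assumption worth confirming in the session before relying on it.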



For documenting purposes, the following lists the parsed data:
                   FASTA_id  hit_number         hit_id     start       end  strand         matching pattern            query pattern
0   fragScaff_scaffold_1010           1   test_query-1   1160756   1160778      -1  TATGTTAAAAGTAATAGTTTCCA  DDWDWTAWAAGTARTADDDDCCA
1   fragScaff_scaffold_1010           2   test_query-2   1509451   1509473      -1  TATGTTAAAAGTAATAGTTTCCA  DDWDWTAWAAGTARTADDDDCCA
2   fragScaff_scaffold_1050           1   test_query-3    279695    279717       1  AGTGATATAAGTAGTAATAACCA  DDWDWTAWAAGTARTADDDDCCA
3   fragScaff_scaffold_1244           1   test_query-4   1364524   1364546      -1  GGTGATAAAAGTAGTAGATGCCA  DDWDWTAWAAGTARTADDDDCCA
4   fragScaff_scaffold_1352           1   test_query-5    782479    782501      -1  AATGTTAAAAGTAATAAATACCA  DDWDWTAWAAGTARTADDDDCCA
5   fragScaff_scaffold_1352           2   test_query-6    759402    759424       1  AAAGTTAAAAGTAATATGGACCA  DDWDWTAWAAGTARTADDDDCCA
6   f

In [23]:
df

| | FASTA_id | hit_number | hit_id | start | end | strand | matching pattern | query pattern |
|---|---|---|---|---|---|---|---|---|
| 0 | fragScaff_scaffold_1010 | 1 | test_query-1 | 1160756 | 1160778 | -1 | TATGTTAAAAGTAATAGTTTCCA | DDWDWTAWAAGTARTADDDDCCA |
| 1 | fragScaff_scaffold_1010 | 2 | test_query-2 | 1509451 | 1509473 | -1 | TATGTTAAAAGTAATAGTTTCCA | DDWDWTAWAAGTARTADDDDCCA |
| 2 | fragScaff_scaffold_1050 | 1 | test_query-3 | 279695 | 279717 | 1 | AGTGATATAAGTAGTAATAACCA | DDWDWTAWAAGTARTADDDDCCA |
| 3 | fragScaff_scaffold_1244 | 1 | test_query-4 | 1364524 | 1364546 | -1 | GGTGATAAAAGTAGTAGATGCCA | DDWDWTAWAAGTARTADDDDCCA |
| 4 | fragScaff_scaffold_1352 | 1 | test_query-5 | 782479 | 782501 | -1 | AATGTTAAAAGTAATAAATACCA | DDWDWTAWAAGTARTADDDDCCA |
| 5 | fragScaff_scaffold_1352 | 2 | test_query-6 | 759402 | 759424 | 1 | AAAGTTAAAAGTAATATGGACCA | DDWDWTAWAAGTARTADDDDCCA |
| 6 | fragScaff_scaffold_1376 | 1 | test_query-7 | 8344572 | 8344594 | 1 | AGTAATAAAAGTAATAAAATCCA | DDWDWTAWAAGTARTADDDDCCA |
| 7 | fragScaff_scaffold_1379 | 1 | test_query-8 | 735647 | 735669 | 1 | AGTAATAAAAGTAATAAAATCCA | DDWDWTAWAAGTARTADDDDCCA |
| 8 | fragScaff_scaffold_1471 | 1 | test_query-9 | 802146 | 802168 | 1 | TAAGATAAAAGTAGTAAAAACCA | DDWDWTAWAAGTARTADDDDCCA |
| 9 | fragScaff_scaffold_1475 | 1 | test_query-10 | 5063469 | 5063491 | 1 | AAAATTAAAAGTAGTAATTTCCA | DDWDWTAWAAGTARTADDDDCCA |
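
One inexpensive sanity check on the hits listed above is to confirm that each reported matching sequence really fits the degenerate query. The sketch below expands the standard IUPAC nucleotide codes into a regular expression and tests a few rows copied from the table. The helper `iupac_to_regex` is a made-up name for illustration and is unrelated to how PatMatch itself matches; the span check assumes `start` and `end` are inclusive coordinates.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Compile a degenerate IUPAC nucleotide pattern into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in pattern))


query = "DDWDWTAWAAGTARTADDDDCCA"
# (start, end, matching sequence) copied from rows 0, 2 and 3 of the table above.
hits = [
    (1160756, 1160778, "TATGTTAAAAGTAATAGTTTCCA"),
    (279695, 279717, "AGTGATATAAGTAGTAATAACCA"),
    (1364524, 1364546, "GGTGATAAAAGTAGTAGATGCCA"),
]
regex = iupac_to_regex(query)
for start, end, seq in hits:
    assert regex.fullmatch(seq), seq      # sequence fits the pattern
    assert end - start + 1 == len(query)  # span equals the pattern length
print("all listed hits are consistent with the query pattern")
```

This only shows that the reported hits are valid matches; it says nothing about hits that may have been missed, for example at the buffer split points mentioned in the warnings.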


With the warnings eliminated, all the steps work.  
I honestly don't know what to expect for a result, so it is hard to know for certain whether everything worked as it should. Clearly, though, it is **working on a large genome** to some extent.

Interestingly, the step of removing the line breaks with `unjustify_fasta.pl`, so that each sequence is on one line, is what takes the longest in the actual processing effort: about 11 minutes. In contrast, PatMatch scans the genome and a dataframe is made in less than 4 minutes.
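
To put the timings side by side, here is a small arithmetic sketch using the wall times reported in the cells above. The throughput figures are rough: they assume the "20.0G" reported by curl means GiB and that the prepared file is about the same size, neither of which was measured precisely.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall time to seconds."""
    return minutes * 60 + seconds


download_s = to_seconds(17, 20)   # curl "Time Spent"
unjustify_s = to_seconds(11, 20)  # wall time of the unjustify_fasta.pl cell
search_s = to_seconds(3, 46)      # wall time of the PatMatch + dataframe cell

# Assumption: curl's "20.0G" is GiB and both processing steps read about that much.
size_mib = 20.0 * 1024

for label, secs in [("download", download_s),
                    ("unjustify", unjustify_s),
                    ("search", search_s)]:
    print(f"{label}: {secs} s, roughly {size_mib / secs:.1f} MiB/s")

ratio = unjustify_s / search_s
print(f"unjustify takes about {ratio:.1f}x as long as the search")
```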


------