# PatMatch Use on Custom Sequences and Integrating with Python 

The [previous notebook](PatMatch initial demo and introduction.ipynb) used prepared data supplied by the software authors in a `test` directory. The web-based PatMatch offerings listed [here](https://github.com/fomightez/patmatch-binder/#usage) lock you into matching patterns to specific sequences. With the stand-alone, command-line PatMatch, you can run pattern matching on any sequence you'd like. Here, we will start at square one and work through such an example. 

This notebook retrieves example sequence data from an external source, prepares it, and then analyzes it. Subsequent steps parse the resulting data into something useful in Python and cover how to convert it to Excel. 

### Preparing to use PatMatch software on raw sequence data

First, an example nucleic acid sequence to work with is retrieved. According to the [USAGE](PatMatch initial demo and introduction.ipynb#Usage) information, the sequence files PatMatch works with are in FASTA format.

Click on the cell below and type `shift-enter` or press `Run` on the toolbar above to get an example file.  
(In Jupyter notebooks running in the Python kernel as indicated in the upper right of this notebook, commands to the shell are prefaced with exclamation points. You'll notice when we switch to dealing with Python directly, this will not be needed.)

In [None]:
!curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa

<font color="#999">(Alternative ways to import data using the Jupyter environment's graphical user interface will be covered below.)</font>

Check the file listing by executing the next cell to see that the FASTA-formatted file has been retrieved. Work through the following cells similarly.

In [None]:
!ls

Further following the PatMatch [USAGE](PatMatch initial demo and introduction.ipynb#Usage) information, sequences should be processed so that the lines of sequence data are formatted to one line for handling by PatMatch. The PatMatch authors have provided a utility script for doing that preparation step. The following cell will run that on the example data.

In [None]:
!perl ../patmatch_1.2/unjustify_fasta.pl chrmt.fsa

That will produce a file with `.prepared` appended to the end of the supplied file name. 
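To make concrete what "formatted to one line" means, here is a minimal Python sketch of the same idea. It is only an illustration and not the authors' `unjustify_fasta.pl` script; the function name `unjustify_fasta_text` and the tiny example records are made up for this demonstration, and the real script may differ in details.

```python
def unjustify_fasta_text(text):
    """Join the wrapped sequence lines of each FASTA record into a single line.

    Illustration only: a hypothetical stand-in for the idea behind the
    preparation step, not a port of the PatMatch utility script.
    """
    out_lines = []
    seq_chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record's sequence.
            if seq_chunks:
                out_lines.append("".join(seq_chunks))
                seq_chunks = []
            out_lines.append(line)
        else:
            seq_chunks.append(line)
    if seq_chunks:
        out_lines.append("".join(seq_chunks))
    return "\n".join(out_lines) + ("\n" if out_lines else "")


# Made-up two-record example with sequence wrapped over several lines.
wrapped = ">seq1 demo\nACGT\nTTGA\n>seq2\nGG\nCC\n"
print(unjustify_fasta_text(wrapped))
```

Each record ends up as one description line followed by one sequence line, which is the layout the prepared file is expected to have.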

Check that the file was produced by checking the file listing again using `!ls`.

In [None]:
!ls

Having verified the prepared data file exists, you are ready to run the program to search for a pattern.

### Running PatMatch

The PatMatch [USAGE](PatMatch initial demo and introduction.ipynb#Usage) information says `-n` is for nucleotide pattern match and `-c` is for complementary strand; however, [based on my tests](PatMatch nucleic handling flags demystified.ipynb) it seems that `-c` means it is for the complementary strand **in addition to** the strand in the dataset. 

**<font color="red">Therefore, if you want the pattern search to be performed on BOTH strands of the supplied sequence, as is the default of the web-based PatMatch tools, you actually want to use the `-c` flag when authoring the command.</font>**

If you are curious about this aspect further, I demonstrate it [here](PatMatch nucleic handling flags demystified.ipynb) and, in the course of that, cover how to replicate the three strand options typically offered by the web-based PatMatch tools. Feel free to examine and run that notebook, or simply use the `-c` flag if you are trying to scan both strands. 
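For intuition about what searching both strands involves, here is a small Python sketch. It is not how PatMatch is implemented; it assumes exact matching only, uses the standard IUPAC nucleotide codes (for example `D` = A, G, or T; `W` = A or T; `R` = A or G), and reports complementary-strand hits with the start coordinate larger than the end, which mirrors what the PatMatch output in this notebook appears to do. The function names and the short test sequences are invented for this illustration.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def iupac_to_regex(pattern):
    """Translate an IUPAC nucleotide pattern into a regular expression."""
    return "".join("[" + IUPAC[ch] + "]" for ch in pattern.upper())


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_both_strands(pattern, seq):
    """Exact-match search on the given strand and its reverse complement.

    Returns (strand, start, end, matched) tuples with 1-based coordinates
    relative to the supplied sequence. Hypothetical helper for illustration.
    """
    seq = seq.upper()
    # The lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    hits = []
    for m in regex.finditer(seq):
        start = m.start() + 1
        hits.append(("+", start, start + len(pattern) - 1, m.group(1)))
    n = len(seq)
    for m in regex.finditer(reverse_complement(seq)):
        # Index i in the reverse complement is position n - i on the given strand.
        first = n - m.start()
        hits.append(("-", first, first - len(pattern) + 1, m.group(1)))
    return hits


print(iupac_to_regex("DDWDWTAWAAGTARTADDDD"))
print(find_both_strands("AAGTR", "CCAAGTACC"))
print(find_both_strands("AAGTR", "GGTACTTGG"))
```

The second test sequence is the reverse complement of the first, so the same site is found, but only on the complementary strand and with descending coordinates.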


In [None]:
# !perl ../patmatch_1.2/patmatch.pl -n "DDWDWTAWAAGTARTADDDD" chrmt.fsa.prepared #dataset strand only
!perl ../patmatch_1.2/patmatch.pl -c "DDWDWTAWAAGTARTADDDD" chrmt.fsa.prepared

In [None]:
!ls

Those are the basics of running PatMatch. There are options you can add to control the number of mismatches allowed and whether insertions, deletions, or substitutions count toward those mismatches. Example:

In [36]:
!perl ../patmatch_1.2/patmatch.pl -c "DDWDWTAWAAGTARTADDDD" chrmt.fsa.prepared 1 ids

>ref|NC_001224|:[175,157]
AATGATAAAATAATAAATA 
>ref|NC_001224|:[713,695]
TATAATAAAATAATAAAAA 
>ref|NC_001224|:[970,952]
ATAATTATAATAATAATAA 
>ref|NC_001224|:[994,976]
ATAATTATAATAATAATTA 
>ref|NC_001224|:[1018,1000]
TTAATTATAATAATAATTA 
>ref|NC_001224|:[1195,1177]
TTTTATAAAAGAATATATA 
>ref|NC_001224|:[1478,1459]
GAAATTAAAAATAATAATAA 
>ref|NC_001224|:[2929,2910]
TTATTTATAATTAATAATTT 
>ref|NC_001224|:[3170,3151]
TTATATAAAAATAATATTAA 
>ref|NC_001224|:[3219,3200]
TTATATAAAAATAATATTAA 
>ref|NC_001224|:[3482,3463]
ATTATTAAAAATAATAATAT 
>ref|NC_001224|:[3821,3803]
AAAAAATAAGTAATAGATT 
>ref|NC_001224|:[4004,3986]
AAAATTAAAATAATAATTA 
>ref|NC_001224|:[4098,4080]
TTTAATAAAATAATAAATG 
>ref|NC_001224|:[4119,4100]
TAAATTAAAAATAATAATAA 
>ref|NC_001224|:[4224,4206]
TATATTATAAGAATATAAT 
>ref|NC_001224|:[5880,5861]
ATAATTATAAATAATAAATT 
>ref|NC_001224|:[5923,5904]
TAAAATAAAAATAATAATAA 
>ref|NC_001224|:[6270,6251]
AAAAATATAAATAATATTAA 
>ref|NC_001224|:[6538,6519]
T

>ref|NC_001224|:[48,67]
TATTATAAAAATAATATTTA 
>ref|NC_001224|:[289,308]
ATAATTATAAATAATATAAA 
>ref|NC_001224|:[1586,1604]
TTATATATAATAATATTAT 
>ref|NC_001224|:[1771,1790]
AAATATATAAATAATATAAT 
>ref|NC_001224|:[1798,1816]
AAAAATATAATAATAATAA 
>ref|NC_001224|:[1959,1977]
TAATATAAAATAATAATTA 
>ref|NC_001224|:[2171,2190]
ATTATTAAAAATAATAAAAA 
>ref|NC_001224|:[2199,2219]
TTTAATAAGAAGTAATATTTA 
>ref|NC_001224|:[2922,2941]
ATAAATAAAAATAATAATTT 
>ref|NC_001224|:[3023,3042]
AGTTTTAAAAGTGATAATAT 
>ref|NC_001224|:[4444,4462]
TTATATATAATAATAATAT 
>ref|NC_001224|:[4606,4624]
TATAATATAATAATAATAT 
>ref|NC_001224|:[5031,5050]
TTTAATAAAAATAATAATAT 
>ref|NC_001224|:[5084,5102]
TTATATAAAATAATAATAA 
>ref|NC_001224|:[5346,5364]
TTAAATATAATAATAATTA 
>ref|NC_001224|:[6042,6060]
ATTTATATAATAATAATAT 
>ref|NC_001224|:[6458,6477]
TATTATATAAGTAATAAATA 
>ref|NC_001224|:[6482,6500]
TTTTATATAATAATAATAA 
>ref|NC_001224|:[6551,6569]
AAATTTATAAGAATATGAT 
>ref|NC_001224|:[6996,7014]

See the [USAGE](PatMatch initial demo and introduction.ipynb#Usage) for more information about those options. However, that covers the basics.

With the basics in hand, and using the power of the command line, searching more sequences, or more sequences with more patterns, becomes possible. However, you'll quickly encounter problems handling all those results. As a simple case, we'll use the pattern search developed above as an example of integrating with Python for more efficient handling of the results, and to touch upon the advantages offered by combining PatMatch with a scripting language.
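As a rough sketch of what scaling up could look like, the following builds the same command line used in the cells above for every combination of pattern and prepared file and collects the output lines. The helper names `build_command` and `run_searches` are hypothetical, the script path is the one used in this notebook, and I am assuming the hits are written to standard output, as the cell outputs above suggest.

```python
import itertools
import subprocess

PATMATCH = "../patmatch_1.2/patmatch.pl"  # path used earlier in this notebook


def build_command(pattern, fasta, flag="-c", mismatches=None, mismatch_types=None):
    """Assemble the argument list for one PatMatch run (hypothetical helper)."""
    cmd = ["perl", PATMATCH, flag, pattern, fasta]
    if mismatches is not None:
        cmd.append(str(mismatches))
        if mismatch_types:
            cmd.append(mismatch_types)  # e.g. "ids", as in the example above
    return cmd


def run_searches(patterns, fasta_files, **options):
    """Run every pattern against every prepared file and collect output lines.

    Assumes PatMatch prints its hits to standard output.
    """
    results = {}
    for pattern, fasta in itertools.product(patterns, fasta_files):
        completed = subprocess.run(
            build_command(pattern, fasta, **options),
            capture_output=True,
            text=True,
        )
        results[(pattern, fasta)] = completed.stdout.splitlines()
    return results


# Show the command that would be run for the example search from above.
print(build_command("DDWDWTAWAAGTARTADDDD", "chrmt.fsa.prepared",
                    mismatches=1, mismatch_types="ids"))
```

Calling `run_searches(["DDWDWTAWAAGTARTADDDD"], ["chrmt.fsa.prepared"], mismatches=1, mismatch_types="ids")` in this environment should then return a dictionary keyed by `(pattern, file)` pairs, which is easier to process further than output printed to the screen.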

## Importing PatMatch Results into a Pandas Dataframe and Exporting to Excel

Now that you see what PatMatch returns as results, you'll probably note that while the output looks easy for a human to read, it isn't very computer friendly. Indeed, if you have used the web-based PatMatch offerings, you'll note that they return the results in a table form that is more useful.

Other ways to add data to the running Binder session are available through the Jupyter file browser (dashboard); for example, its `Upload` button lets you select a file from your local computer.

In [23]:
# from https://stackoverflow.com/a/42703609/8508004
import io
#import pandas as pd
# Assign the shell command's output to a Python variable (an IPython SList)
output = !perl ../patmatch_1.2/patmatch.pl -c "DDWDWTAWAAGTARTADDDD" chrmt.fsa.prepared 1 ids
#df = pd.read_table(io.StringIO(output.n))
print(type(output)) # see http://ipython.readthedocs.io/en/stable/api/generated/IPython.utils.text.html#IPython.utils.text.SList
print(output.n) # `.n` gives the captured lines joined into one newline-separated string

<class 'IPython.utils.text.SList'>
>ref|NC_001224|:[175,157]
AATGATAAAATAATAAATA 
>ref|NC_001224|:[713,695]
TATAATAAAATAATAAAAA 
>ref|NC_001224|:[970,952]
ATAATTATAATAATAATAA 
>ref|NC_001224|:[994,976]
ATAATTATAATAATAATTA 
>ref|NC_001224|:[1018,1000]
TTAATTATAATAATAATTA 
>ref|NC_001224|:[1195,1177]
TTTTATAAAAGAATATATA 
>ref|NC_001224|:[1478,1459]
GAAATTAAAAATAATAATAA 
>ref|NC_001224|:[2929,2910]
TTATTTATAATTAATAATTT 
>ref|NC_001224|:[3170,3151]
TTATATAAAAATAATATTAA 
>ref|NC_001224|:[3219,3200]
TTATATAAAAATAATATTAA 
>ref|NC_001224|:[3482,3463]
ATTATTAAAAATAATAATAT 
>ref|NC_001224|:[3821,3803]
AAAAAATAAGTAATAGATT 
>ref|NC_001224|:[4004,3986]
AAAATTAAAATAATAATTA 
>ref|NC_001224|:[4098,4080]
TTTAATAAAATAATAAATG 
>ref|NC_001224|:[4119,4100]
TAAATTAAAAATAATAATAA 
>ref|NC_001224|:[4224,4206]
TATATTATAAGAATATAAT 
>ref|NC_001224|:[5880,5861]
ATAATTATAAATAATAAATT 
>ref|NC_001224|:[5923,5904]
TAAAATAAAAATAATAATAA 
>ref|NC_001224|:[6270,6251]
AAAAATATAAATAATATTAA 
>ref|NC_001224|:[6538,6519]
TATTA
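One way the captured lines could be turned into table-like records is sketched below. It relies only on the layout visible in the output above (a `>` line ending in `:[start,end]`, followed by a line with the matched sequence). Treating a start larger than the end as a complementary-strand hit is my assumption based on that output, and the names `parse_patmatch_lines` and `to_dataframe` are hypothetical.

```python
import re

# Header lines look like ">ref|NC_001224|:[175,157]" in the output shown above.
HEADER_RE = re.compile(r"^>(?P<seq_id>.+):\[(?P<start>\d+),(?P<end>\d+)\]\s*$")


def parse_patmatch_lines(lines):
    """Turn PatMatch output lines into a list of dicts, one per hit."""
    records = []
    pending = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            pending = header
            continue
        if pending is None:
            continue  # sequence text without a preceding header; skip it
        start = int(pending.group("start"))
        end = int(pending.group("end"))
        records.append({
            "seq_id": pending.group("seq_id"),
            "start": start,
            "end": end,
            # Assumption: start > end marks a hit on the complementary strand.
            "strand": "-" if start > end else "+",
            "matched": line,
        })
        pending = None
    return records


def to_dataframe(records):
    """Build a pandas DataFrame; pandas is imported only when this is called."""
    import pandas as pd
    return pd.DataFrame(records)


# Lines copied from the PatMatch output above.
sample = [
    ">ref|NC_001224|:[175,157]",
    "AATGATAAAATAATAAATA ",
    ">ref|NC_001224|:[48,67]",
    "TATTATAAAAATAATATTTA ",
]
for record in parse_patmatch_lines(sample):
    print(record)
```

In the notebook, `df = to_dataframe(parse_patmatch_lines(output))` would give a dataframe with one row per hit, and `df.to_excel("patmatch_hits.xlsx", index=False)` would write it to an Excel file, provided an Excel writer engine such as `openpyxl` is installed. The file name is just an example.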

In [None]:
# %load https://gist.githubusercontent.com/fomightez/b012e51ebef6ec58c1515df3ee0c850a/raw/300da6c67ceeaf5384a3e500648b993345c361cb/run_every_eight_mins.py
import time

def executeSomething():
    #code here
    print ('.')
    time.sleep(480) #60 seconds times 8 minutes

while True:
    executeSomething()

.
.
.
.
.
