# Demo of script to check for a match to a sequence pattern in PatMatch syntax

This notebook is to demonstrate `matches_a_patmatch_pattern.py`.  
The script takes a pattern in [PatMatch syntax](https://www.yeastgenome.org/nph-patmatch#examples) and a sequence (either a FASTA sequence file or a text string) and reports whether a match is present. With the `match_over_entirety` option, it reports whether the sequence represents a specific example of the generalized pattern. For example, is the sequence `ATTGATATAAGTAATAGATA` a specific example matching the pattern `DDWDWTAWAAGTARTADDDD`?



This notebook presents snippets that could be adapted and placed in an actual series of steps to get something done.

If you are viewing this statically and want to run it actively instead, go [here](https://github.com/fomightez/patmatch-binder), press the `launch binder` badge, and then, when the session spins up, select `Demo of script to check for a match to a sequence pattern in PatMatch syntax` from the bottom section, 'Additional topics: Technical', of the list. Then you'll be able to actively run this notebook without needing to install anything. The demonstration parts below are written as if you are in an active session.

------

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them a <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

##  Preparation

Get the script for the demonstration

In [1]:
# Get the script
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/patmatch-utilities/matches_a_patmatch_pattern.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    15  100    15    0     0     85      0 --:--:-- --:--:-- --:--:--    85


#### Display USAGE block

In [2]:
!python matches_a_patmatch_pattern.py -h

usage: matches_a_patmatch_pattern.py [-h] (-n | -p) [-moe] PATTERN SEQUENCE

matches_a_patmatch_pattern.py Takes a sequence pattern in PatMatch syntax, and
checks if a provided sequence contains a match. It reports True or False
depending on that assessment. Optionally, it can be restricted to checking if
the provided sequence is a match to a sequence pattern. **** Script by Wayne
Decatur (fomightez @ github) ***

positional arguments:
  PATTERN               Sequence pattern in PatMatch syntax. For example, to
                        search for a S. cerevisiae mitochondrial promoter,
                        provide `DDWDWTAWAAGTARTADDDD`, without any quotes or
                        backticks, in the call to the script.
  SEQUENCE              Filename for FASTA sequence file or text of sequence
                        string to examine for presence of the pattern. **Only
                        the first sequence of a multi-entry FASTA file is
                        considered.**



This code uses some packages that might not already be installed, so the next cell ensures they are added here.

In [3]:
%pip install sh
%pip install pyfaidx

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In this case, no more preparation is needed as we'll use either sequences defined in the cell or in the PatMatch test folder.

##  Basics

This section shows the basics of using the script on the command line.  
(On the 'proper' command line you would replace the `%run` I put in front of these commands with `python`; `%run` is only there so they work in this notebook.)

In [4]:
%run matches_a_patmatch_pattern.py DDWDWTAWAAGTARTADDDD AAAATTGATATAAGTAATAGATACCCC --nucleic

Checking PatMatch software present...


True


Sending the PatMatch results to Python...
Reporting: True.



Besides the pattern and the sequence to compare it to, you need to specify whether the sequence is `--nucleic` or `--protein`.

Okay, the above sequences show a match.  
However, what we really wanted to know was **whether the short sequence represents a specific example of the general pattern**.  
In order to examine that, we add the `--match_over_entirety` flag, abbreviated `-moe`, and run the code.

In [5]:
%run matches_a_patmatch_pattern.py DDWDWTAWAAGTARTADDDD AAAATTGATATAAGTAATAGATACCCC --nucleic --match_over_entirety

False


****Error??*** You called the script with the `match_over_entirety` option;
however, your provided sequences are not equal in size and thus don't
match over their entirety. Feel free to ignore this concern.



Nope. Even though the sequence contains a match to the pattern, the pattern and the sequence don't match over their entire length, and so the sequence isn't a specific example matching the pattern.  
Note that you also get a message about the lengths not matching, which means they couldn't possibly match over the entire length. You can choose to ignore this if you knew it was a possibility because you weren't screening the lengths ahead of time.

Let's try with another sequence against the pattern. This will be the only thing changed relative to the last cell. 
*Is this one a specific example of the general pattern?*

In [6]:
%run matches_a_patmatch_pattern.py DDWDWTAWAAGTARTADDDD ATTGATATAAGTAATAGATA --nucleic --match_over_entirety

Checking PatMatch software present...


True


Sending the PatMatch results to Python...
Reporting: True.



Yes.

That covers most of what this script does. Nothing fancy really. The big thing about it is that, behind the scenes, it uses PatMatch so that I can use the same syntax for the patterns without needing to re-implement the nucleic and protein syntax PatMatch offers just to compare two sequences.
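To make the difference between 'contains a match' and 'matches over the entirety' concrete, here is a rough sketch that translates the single-letter IUPAC nucleotide codes into a regular expression. This is only an illustration and not how the script works (the script hands the job to PatMatch). It assumes a plain pattern with one code per position, checks the forward strand only, and ignores the rest of PatMatch's syntax. The names `iupac_to_regex`, `contains_match`, and `matches_over_entirety` are made up for this sketch.

```python
import re

# Single-letter IUPAC nucleotide codes mapped to regex character classes.
# Assumption: only these codes appear; PatMatch's fuller syntax (repeats,
# anchors, mismatches, protein patterns) is not handled here.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Translate a plain IUPAC nucleotide pattern into a compiled regex."""
    return re.compile("".join(IUPAC_TO_REGEX[code] for code in pattern.upper()))

def contains_match(pattern, sequence):
    """True if the pattern matches anywhere in the sequence (forward strand)."""
    return iupac_to_regex(pattern).search(sequence.upper()) is not None

def matches_over_entirety(pattern, sequence):
    """True only if the whole sequence is one instance of the pattern."""
    return iupac_to_regex(pattern).fullmatch(sequence.upper()) is not None

pattern = "DDWDWTAWAAGTARTADDDD"
longer = "AAAATTGATATAAGTAATAGATACCCC"
exact = "ATTGATATAAGTAATAGATA"

print(contains_match(pattern, longer))
print(matches_over_entirety(pattern, longer))
print(matches_over_entirety(pattern, exact))
```

The three calls mirror the three command-line examples above: the longer sequence contains a match, but only the 20-nucleotide sequence matches over its entire length.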

The sequence can be provided as a sequence file in FASTA format. In that case, you point the script at the file by providing the full path and file name. Let's show how that is done using the test sequences from PatMatch that are already in this repository.

In [7]:
%run matches_a_patmatch_pattern.py ACAGAGCAGG ../patmatch_1.2/test/ATH1_cdna_test --nucleic

Checking PatMatch software present...


True


Sending the PatMatch results to Python...
Reporting: True.



**NOTE THAT `ATH1_cdna_test` is actually a multi-sequence FASTA file; however, this script only deals with the first sequence in such a file if one is encountered.** The point is to get a 'yes' or 'no' answer for each sequence at which this script is pointed. If you need to check multiple sequences from within a multi-sequence FASTA file, you can loop over each sequence, getting the 'yes' or 'no' answer as you go. I'll illustrate a process similar to that in the next section using the main function of the script.

If you have used PatMatch on the command line before, you might have dealt with using the `unjustify.pl` script on a FASTA file to remove the line endings for each sequence. You don't need to do that here. It is handled behind-the-scenes as well (not using that particular Perl script, but instead via Python).
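For anyone wondering what 'removing the line endings for each sequence' amounts to, here is a minimal sketch in plain Python. It is not the script's actual code (the text above says only that the script handles this via Python); the function name `unjustify_fasta` and the example records are made up for illustration.

```python
def unjustify_fasta(fasta_text):
    """Return {record name: sequence joined onto one line} from FASTA text."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the name
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            # Sequence line belonging to the most recent header
            records[name].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}

# Made-up two-record example with wrapped sequence lines
example = """>seq1 hypothetical record
ATTGATATAA
GTAATAGATA
>seq2
ACAGAG
CAGG
"""
flat = unjustify_fasta(example)
print(flat["seq1"])
print(flat["seq2"])
```

Each record's wrapped lines end up as one continuous string, which is the form a pattern search needs so that matches spanning a line break are not missed.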



-------

## Use the main script in a Jupyter notebook (or IPython console)

This is meant to demonstrate using the main function of this script in a Jupyter notebook. The same would largely hold if you were using it in an IPython console.

We need to bring the main function of the script into the memory of this notebook's environment. This next, redundant-looking line will do that.

In [8]:
from matches_a_patmatch_pattern import matches_a_patmatch_pattern

Now the function can be used to check a sequence for a match to a pattern. The basics are shown here.

In [9]:
result1 = matches_a_patmatch_pattern("DDWDWTAWAAGTARTADDDD","AAAATTGATATAAGTAATAGATACCCC","nucleic");
result1

Checking PatMatch software present...
Sending the PatMatch results to Python...
Reporting: True.



True

Okay, the above sequences show a match.  
However, what we really wanted to know was **whether the short sequence represents a specific example of the general pattern**.  
In order to examine that, we set the `match_over_entirety` setting to `True` and run the code.

In [10]:
result2 = matches_a_patmatch_pattern("DDWDWTAWAAGTARTADDDD","AAAATTGATATAAGTAATAGATACCCC","nucleic",match_over_entirety=True);
result2

****Error??*** You called the script with the `match_over_entirety` option;
however, your provided sequences are not equal in size and thus don't
match over their entirety. Feel free to ignore this concern.



False

The script quickly short-circuits, returning `False`, if the lengths of the two don't match.
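The short-circuit idea can be sketched as follows. This is an illustration of the logic only, not the script's code: `entirety_check` and `stand_in_matcher` are hypothetical names, and the stand-in simply records that it was called in place of the real (slower) PatMatch-backed check. Comparing lengths this way only makes sense for plain patterns with one symbol per position.

```python
def entirety_check(pattern, sequence, matcher):
    """Return False immediately when lengths differ; otherwise defer to `matcher`."""
    if len(pattern) != len(sequence):
        return False  # short-circuit: the matcher is never called
    return matcher(pattern, sequence)

calls = []  # records every time the stand-in matcher actually runs

def stand_in_matcher(pattern, sequence):
    # Hypothetical placeholder for the real PatMatch-backed comparison
    calls.append((pattern, sequence))
    return True

pattern = "DDWDWTAWAAGTARTADDDD"

# Lengths differ (20 vs. 27), so the matcher is skipped entirely
print(entirety_check(pattern, "AAAATTGATATAAGTAATAGATACCCC", stand_in_matcher))
print(len(calls))

# Lengths agree, so the matcher is consulted
print(entirety_check(pattern, "ATTGATATAAGTAATAGATA", stand_in_matcher))
print(len(calls))
```

If you are screening many candidate sequences, checking lengths first like this avoids launching the more expensive comparison for candidates that cannot possibly match over their entirety.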

Let's try it with sequences that match in size.

In [11]:
matches_a_patmatch_pattern("DDWDWTAWAAGTARTADDDD","ATTGATATAAGTAATAGATA","nucleic", match_over_entirety=True);

Checking PatMatch software present...
Sending the PatMatch results to Python...
Reporting: True.



Finally, we'll use the main function of the script to step through analysis of each sequence in the test data for PatMatch and make a dataframe of the results.

In [12]:
import sys
import pandas as pd
from pyfaidx import Fasta

sequence_records = Fasta("../patmatch_1.2/test/ATH1_cdna_test")
results_dict = {}
pattern = "AGCAGG"
# check each record in the multi-sequence FASTA file for the pattern
for record in sequence_records:
    sys.stderr.write(f"Examining {record.name} ...\n")
    match_call = matches_a_patmatch_pattern(pattern,str(record),"nucleic")
    results_dict[record.name] = match_call
# make results into a dataframe
df = pd.DataFrame.from_dict(
    results_dict,orient='index').reset_index()
df.columns = ['sequence', 'contains_match_to_pattern']
df

Examining At1g01010.1 ...
Checking PatMatch software present...
Sending the PatMatch results to Python...
Reporting: True.

Examining At1g01020.1 ...
Checking PatMatch software present...
Sending the PatMatch results to Python...
Reporting: False.

Examining At1g01030.1 ...
Checking PatMatch software present...
Sending the PatMatch results to Python...
Reporting: False.

Examining At1g01040.1 ...
Checking PatMatch software present...
Sending the PatMatch results to Python...
Reporting: True.



| | sequence | contains_match_to_pattern |
|---|---|---|
| 0 | At1g01010.1 | True |
| 1 | At1g01020.1 | False |
| 2 | At1g01030.1 | False |
| 3 | At1g01040.1 | True |


That covers the basics of using the function in a notebook. 


-----

Feel free to substitute your data in here and run the script or function.

Be sure to download anything you make that is useful.
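One way to get results into a downloadable file is to write them out as tab-separated text. The sketch below uses only the standard library and a stand-in copy of the `results_dict` values shown in the output above; the file name `pattern_match_results.tsv` is just an example. If you already have the dataframe, pandas' `df.to_csv()` would do the same job.

```python
import csv

# Stand-in for the `results_dict` built in the loop above
# (values copied from that cell's output)
results_dict = {
    "At1g01010.1": True,
    "At1g01020.1": False,
    "At1g01030.1": False,
    "At1g01040.1": True,
}

output_file = "pattern_match_results.tsv"  # example file name
with open(output_file, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    # header row, then one row per sequence
    writer.writerow(["sequence", "contains_match_to_pattern"])
    for name, has_match in results_dict.items():
        writer.writerow([name, has_match])
```

In a Binder session, the saved file will then show up in the file browser, where you can select it and download it before the session ends.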

Enjoy!