# Advanced: Using brackets or other strange characters to make complex patterns on the command line or with Python

If you have used PatMatch on the web, you'll know that you can create complex search patterns and that they can involve brackets and other odd characters. For example, you can see some of these listed in the 'Examples' column under 'Supported Pattern Syntax and Examples' [here](https://www.yeastgenome.org/nph-patmatch#examples).

These odd characters can cause problems when you construct and submit patterns via the command line or a mix of shell and Python. This page presents some options I have found that work for brackets and presumably for other odd characters.

Luckily, I had worked out some of this when working on [this demo notebook](https://github.com/fomightez/sequencework/blob/master/Extract_from_FASTA/demo%20get_seq_following_seq_from_FASTA.ipynb). [This IPython issue](https://github.com/ipython/ipython/issues/10072) was also helpful.



## Preparing

To ensure everything is set, act as if this is a new session in this Jupyter environment and run the next cell. It steps through the preparation: getting a sequence file, preparing it, and scanning it for matches so that a results file is present. You'll also get the script that converts the results to a dataframe. Repeating these steps if you have already done them this session causes no harm, so go ahead and run this cell.


In [1]:
!curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chrmt.fsa
!perl ../patmatch_1.2/unjustify_fasta.pl chrmt.fsa
!perl ../patmatch_1.2/patmatch.pl -c "DDWDWTAWAAGTN{{1,55}}ARTADDDD" chrmt.fsa.prepared > test.out
!curl -O https://raw.githubusercontent.com/fomightez/sequencework/master/patmatch-utilities/patmatch_results_to_df.py
from patmatch_results_to_df import patmatch_results_to_df

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 87344  100 87344    0     0   264k      0 --:--:-- --:--:-- --:--:--  264k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18722  100 18722    0     0  70649      0 --:--:-- --:--:-- --:--:-- 70649


## On the command line directly

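When submitting a pattern on the command line directly, I found I needed to double up the brackets inside the quoted pattern in order to get them recognized as brackets. Otherwise, Jupyter/IPython treats the contents of single curly brackets in a `!` command as a Python expression to expand.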

In the next cell is an example where the pattern I want to search with can be expressed as `DDWDWTAWAAGTN{1,55}ARTADDDD`.  
Note that I submit `DDWDWTAWAAGTN{{1,55}}ARTADDDD` in the command.

In [2]:
!perl ../patmatch_1.2/unjustify_fasta.pl chrmt.fsa
!perl ../patmatch_1.2/patmatch.pl -c "DDWDWTAWAAGTN{{1,55}}ARTADDDD" chrmt.fsa.prepared > testing.out

Verify it ran:

In [3]:
!head testing.out

>ref|NC_001224|:[892,858]
ATAAATAAAAGTGTTCATTTGTAATGTAATAAAAT 
>ref|NC_001224|:[1243,1205]
ATAATTAAAAGTCCGCTCCCTTTTTAATTTTAATAAGAA 
>ref|NC_001224|:[1447,1408]
AAAATTATAAGTTTCTCCTTTCGGAACTTAAAAATAATAT 
>ref|NC_001224|:[4359,4305]
AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACGTAATAGGGG 
>ref|NC_001224|:[12827,12773]
AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACGTAATAGGGG 
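
One way to picture the doubling used above: Jupyter/IPython appears to expand `{...}` in `!` commands following the same brace conventions as Python's own string formatting, where `{{` and `}}` stand for literal brackets. The sketch below uses plain `str.format` as an analogy only. It is not IPython's actual expansion code, and the variable names are just for illustration.

```python
# Analogy only: plain str.format, not IPython's own expansion machinery.
escaped = "DDWDWTAWAAGTN{{1,55}}ARTADDDD"

# Calling format() with no arguments collapses each doubled bracket
# into a single literal bracket.
literal = escaped.format()
print(literal)

# With single brackets, the formatter treats `1,55` as a field to fill in,
# and with nothing supplied to fill it, the call fails.
try:
    "DDWDWTAWAAGTN{1,55}ARTADDDD".format()
except (IndexError, KeyError, ValueError) as err:
    print("single brackets failed with", type(err).__name__)
```

In the `!` command the same idea applies: the doubled brackets survive expansion as single brackets, so PatMatch receives the `{1,55}` repeat range.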


## When using Python

In the last of the introductory notebooks, [PatMatch with more Python](notebooks/PatMatch%20with%20more%20Python.ipynb), we used the following to take a PatMatch query result and then get it into Python.

```python
my_pattern= "DDWDWTAWAAGTARTADDDD"
df = patmatch_results_to_df("test.out", pattern=my_pattern, name="promoter")
```

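Fortunately, no modification is really necessary in that case, even when the pattern contains brackets, because the pattern is only being passed to a Python function.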

In [4]:
my_pattern= "DDWDWTAWAAGTN{1,55}ARTADDDD"
df = patmatch_results_to_df("test.out", pattern=my_pattern, name="promoter_region")

Provided results read...
For documenting purposes, the following lists the parsed data:
          FASTA_id  hit_number              hit_id  start    end  strand                                   matching pattern                query pattern
0   ref|NC_001224|           1   promoter_region-1    858    892      -1                ATAAATAAAAGTGTTCATTTGTAATGTAATAAAAT  DDWDWTAWAAGTN{1,55}ARTADDDD
1   ref|NC_001224|           2   promoter_region-2   1205   1243      -1            ATAATTAAAAGTCCGCTCCCTTTTTAATTTTAATAAGAA  DDWDWTAWAAGTN{1,55}ARTADDDD
2   ref|NC_001224|           3   promoter_region-3   1408   1447      -1           AAAATTATAAGTTTCTCCTTTCGGAACTTAAAAATAATAT  DDWDWTAWAAGTN{1,55}ARTADDDD
3   ref|NC_001224|           4   promoter_region-4   4305   4359      -1  AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...  DDWDWTAWAAGTN{1,55}ARTADDDD
4   ref|NC_001224|           5   promoter_region-5  12773  12827      -1  AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...  DDWDWTAWAAGTN{1,55}ART

As you see, the 'query pattern' text ends up in the output exactly as it looked in the input supplied as `pattern`.

In [5]:
df.head()

Unnamed: 0,FASTA_id,hit_number,hit_id,start,end,strand,matching pattern,query pattern
0,ref|NC_001224|,1,promoter_region-1,858,892,-1,ATAAATAAAAGTGTTCATTTGTAATGTAATAAAAT,"DDWDWTAWAAGTN{1,55}ARTADDDD"
1,ref|NC_001224|,2,promoter_region-2,1205,1243,-1,ATAATTAAAAGTCCGCTCCCTTTTTAATTTTAATAAGAA,"DDWDWTAWAAGTN{1,55}ARTADDDD"
2,ref|NC_001224|,3,promoter_region-3,1408,1447,-1,AAAATTATAAGTTTCTCCTTTCGGAACTTAAAAATAATAT,"DDWDWTAWAAGTN{1,55}ARTADDDD"
3,ref|NC_001224|,4,promoter_region-4,4305,4359,-1,AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...,"DDWDWTAWAAGTN{1,55}ARTADDDD"
4,ref|NC_001224|,5,promoter_region-5,12773,12827,-1,AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...,"DDWDWTAWAAGTN{1,55}ARTADDDD"


The issue arises when you start mixing running the command line-based PatMatch with Python variables.

The basis for this example is fully explained in the previous notebook, [Advanced: Sending PatMatch output directly to Python](notebooks/Sending%20PatMatch%20output%20directly%20to%20Python.ipynb); see that notebook if you aren't following what is going on.

In this example we use Python to define the pattern for the PatMatch query. This way we only have to define the pattern once, and we can change the pattern being searched by editing it in one place. The advantages of this ability become clear when you want to search a lot of sequences with the same pattern.

To use the defined pattern in the call that triggers PatMatch execution, curly brackets are used to signal to Jupyter/IPython that we are referring to the Python variable `my_pattern`. The next cell runs a basic pattern match with that approach:

In [6]:
my_pattern= "DDWDWTAWAAGTARTADDDD"
output = !perl ../patmatch_1.2/patmatch.pl -c {my_pattern} chrmt.fsa.prepared 
df2 = patmatch_results_to_df(output.n, pattern=my_pattern, name="promoterAGAIN")
df2.head()

Provided results read...
For documenting purposes, the following lists the parsed data:
          FASTA_id  hit_number            hit_id  start    end  strand      matching pattern         query pattern
0   ref|NC_001224|           1   promoterAGAIN-1  54833  54852      -1  AGATATATAAGTAATAGGGG  DDWDWTAWAAGTARTADDDD
1   ref|NC_001224|           2   promoterAGAIN-2  78291  78310      -1  ATTTTTATAAGTAGTATATT  DDWDWTAWAAGTARTADDDD
2   ref|NC_001224|           3   promoterAGAIN-3   6458   6477       1  TATTATATAAGTAATAAATA  DDWDWTAWAAGTARTADDDD
3   ref|NC_001224|           4   promoterAGAIN-4  13345  13364       1  ATTGATATAAGTAATAGATA  DDWDWTAWAAGTARTADDDD
4   ref|NC_001224|           5   promoterAGAIN-5  32205  32224       1  AAATATATAAGTAATAAATT  DDWDWTAWAAGTARTADDDD
5   ref|NC_001224|           6   promoterAGAIN-6  34874  34893       1  TATTATATAAGTAATATATA  DDWDWTAWAAGTARTADDDD
6   ref|NC_001224|           7   promoterAGAIN-7  46067  46086       1  ATTAATATAAGTAATATATA  DDWDWTAWAAGTA

Unnamed: 0,FASTA_id,hit_number,hit_id,start,end,strand,matching pattern,query pattern
0,ref|NC_001224|,1,promoterAGAIN-1,54833,54852,-1,AGATATATAAGTAATAGGGG,DDWDWTAWAAGTARTADDDD
1,ref|NC_001224|,2,promoterAGAIN-2,78291,78310,-1,ATTTTTATAAGTAGTATATT,DDWDWTAWAAGTARTADDDD
2,ref|NC_001224|,3,promoterAGAIN-3,6458,6477,1,TATTATATAAGTAATAAATA,DDWDWTAWAAGTARTADDDD
3,ref|NC_001224|,4,promoterAGAIN-4,13345,13364,1,ATTGATATAAGTAATAGATA,DDWDWTAWAAGTARTADDDD
4,ref|NC_001224|,5,promoterAGAIN-5,32205,32224,1,AAATATATAAGTAATAAATT,DDWDWTAWAAGTARTADDDD


Note that we didn't define the sequence file as a variable because there is just one, but if you were searching different sequences you could use a Python variable in place of `chrmt.fsa.prepared` in the above code. It too would need to be flanked by brackets in that case. This will be illustrated in the next advanced notebook in this series, [Iterating over genomes with PatMatch](Iterating%20over%20genomes%20with%20PatMatch.ipynb).
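
As a rough sketch of the variable-for-filename idea, the command strings can also be assembled in plain Python before anything is run. This assumes Python 3.8+ for `shlex.join`. The second file name, `another_genome.fsa.prepared`, is hypothetical; only `chrmt.fsa.prepared` is created in this notebook.

```python
import shlex

my_pattern = "DDWDWTAWAAGTARTADDDD"

# Hypothetical list of prepared FASTA files; only the first exists here.
fasta_files = ["chrmt.fsa.prepared", "another_genome.fsa.prepared"]

# Build one shell-ready command string per file. shlex.join quotes any
# argument that needs it, so the pattern and file name are defined once
# in Python and reused for every command.
commands = [
    shlex.join(["perl", "../patmatch_1.2/patmatch.pl", "-c", my_pattern, fasta])
    for fasta in fasta_files
]

for command in commands:
    print(command)
```

In a Jupyter cell, each string could then presumably be handed to the shell with `!{command}`, though that is untested here.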

So now you may see the impending issue.  
What if our search pattern had brackets inside it, such that in its basic form it would be expressed as `DDWDWTAWAAGTN{1,55}ARTADDDD`?  
How do we use that pattern as a Python variable and get it recognized as we want when the variable is provided in the call that executes PatMatch?

I have found that simply wrapping the pattern in quotes in the call to execute PatMatch suffices to allow using brackets within the pattern definition.

In [7]:
my_pattern = "DDWDWTAWAAGTN{1,55}ARTADDDD"
output = !perl ../patmatch_1.2/patmatch.pl -c "{my_pattern}" chrmt.fsa.prepared 
complexq_df = patmatch_results_to_df(output.n, pattern=my_pattern, name="promoter_region")
complexq_df.head()

Provided results read...
For documenting purposes, the following lists the parsed data:
          FASTA_id  hit_number              hit_id  start    end  strand                                   matching pattern                query pattern
0   ref|NC_001224|           1   promoter_region-1    858    892      -1                ATAAATAAAAGTGTTCATTTGTAATGTAATAAAAT  DDWDWTAWAAGTN{1,55}ARTADDDD
1   ref|NC_001224|           2   promoter_region-2   1205   1243      -1            ATAATTAAAAGTCCGCTCCCTTTTTAATTTTAATAAGAA  DDWDWTAWAAGTN{1,55}ARTADDDD
2   ref|NC_001224|           3   promoter_region-3   1408   1447      -1           AAAATTATAAGTTTCTCCTTTCGGAACTTAAAAATAATAT  DDWDWTAWAAGTN{1,55}ARTADDDD
3   ref|NC_001224|           4   promoter_region-4   4305   4359      -1  AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...  DDWDWTAWAAGTN{1,55}ARTADDDD
4   ref|NC_001224|           5   promoter_region-5  12773  12827      -1  AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...  DDWDWTAWAAGTN{1,55}ART

Unnamed: 0,FASTA_id,hit_number,hit_id,start,end,strand,matching pattern,query pattern
0,ref|NC_001224|,1,promoter_region-1,858,892,-1,ATAAATAAAAGTGTTCATTTGTAATGTAATAAAAT,"DDWDWTAWAAGTN{1,55}ARTADDDD"
1,ref|NC_001224|,2,promoter_region-2,1205,1243,-1,ATAATTAAAAGTCCGCTCCCTTTTTAATTTTAATAAGAA,"DDWDWTAWAAGTN{1,55}ARTADDDD"
2,ref|NC_001224|,3,promoter_region-3,1408,1447,-1,AAAATTATAAGTTTCTCCTTTCGGAACTTAAAAATAATAT,"DDWDWTAWAAGTN{1,55}ARTADDDD"
3,ref|NC_001224|,4,promoter_region-4,4305,4359,-1,AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...,"DDWDWTAWAAGTN{1,55}ARTADDDD"
4,ref|NC_001224|,5,promoter_region-5,12773,12827,-1,AAATATATAAGTCCCGGTTTCTTACGAAACCGGGACCTCGGAGACG...,"DDWDWTAWAAGTN{1,55}ARTADDDD"
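
If you would rather sidestep shell quoting entirely, one alternative is to call PatMatch through Python's `subprocess` module with an argument list, so no shell or IPython expansion ever touches the brackets. This is a minimal sketch and not something the notebooks in this series use. The helper `build_patmatch_command` is a hypothetical name, and the echo step uses a stand-in Python child process so that the example runs even without Perl or PatMatch present.

```python
import subprocess
import sys


def build_patmatch_command(pattern, fasta_path,
                           script="../patmatch_1.2/patmatch.pl"):
    """Hypothetical helper: argument list mirroring the `-c` call used above."""
    return ["perl", script, "-c", pattern, fasta_path]


my_pattern = "DDWDWTAWAAGTN{1,55}ARTADDDD"
cmd = build_patmatch_command(my_pattern, "chrmt.fsa.prepared")
print(cmd)

# Stand-in child process that just echoes its first argument, showing that
# brackets in a list-style argument arrive unchanged (no shell is involved).
echo = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", my_pattern],
    capture_output=True, text=True, check=True,
)
received = echo.stdout.strip()
print(received)

# To run the real search (needs perl and PatMatch at the path above):
# result = subprocess.run(cmd, capture_output=True, text=True, check=True)
# Assumption: result.stdout can be passed like `output.n` was above.
# df = patmatch_results_to_df(result.stdout, pattern=my_pattern,
#                             name="promoter_region")
```

Because each list element is passed to the program as a single argument, neither doubled brackets nor extra quotes are needed with this approach.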


------

The next advanced notebook, [Iterating over genomes with PatMatch](Iterating%20over%20genomes%20with%20PatMatch.ipynb), builds on one of the options presented in the previous notebook, along with topics covered in the [PatMatch with Python basics notebook](PatMatch%20with%20Python%20basics.ipynb) and subsequent intro-level notebooks, to show how to cycle through searching multiple genomes easily.