# Basic Functionality of EnGAGE: Searching for Clusters of Motifs 

**Author:** Michelle Franc Ragsac (mragsac@eng.ucsd.edu)

## Import Necessary Packages

In [1]:
# Reload edited modules automatically so code changes take effect without restarting the kernel
%load_ext autoreload
%autoreload 2

import engage as en

## Define Parameters for Genome Search

In the first code block below, we define the genome that we would like to search by specifying its file path, along with the binding-site patterns for the motifs that we are interested in studying. In this example, we search the *Ciona intestinalis* genome (i.e., the HT2019 assembly from the Ghost Database) for clusters of *Zic* and *ETS* binding sites.

To define our parameters, we use two classes from `engage/search.py`: `Cluster` and `TF`.

1. In Line 5, we define a `Cluster` object with `en.search.Cluster('Notochord-Grammar')`, where the `str` `'Notochord-Grammar'` is the name of the grammar cluster that we are interested in searching for
2. In Lines 8 through 14, we collect the different *Zic* binding sites found in the literature into a single pattern, then define the `"ZicL"` `TF` object in Line 15 with `en.search.TF('ZicL',ZicBinding)`
3. Next, in Line 16, we add the *Zic* binding site `TF` object to our grammar's `Cluster` object; the second argument, `1`, is reported as `1x Sites` in the printed summary
4. Finally, we repeat the same process for the *ETS* binding site in Lines 19 through 22, but this time allow flexibility around the *ETS* binding site core, as indicated by the `N` characters

We can then verify that we have the proper parameters in our `Cluster` object by printing its contents with `print()`! 

In [2]:
# Ciona intestinalis 2019 genome from GHOST
filename = '../datasets/ciona_2019_HT-All.fasta'

# Define search cluster
cluster = en.search.Cluster('Notochord-Grammar')

# Save regex pattern for Zic sites 
ZicBinding =  "CAGCTGTG|CACAGCTG"       # zic1/2/3 + zic1/2/3 reverse-complement 
ZicBinding += "|CCGCAGT|ACTGCGG"        # zic7/3/1 + zic7/3/1 reverse-complement
ZicBinding += "|CCGCAGTC|GACTGCGG"      # zic6 + zic6 reverse-complement
ZicBinding += "|CCCGCTGTG|CACAGCGGG"    # zic1 + zic1 reverse-complement
ZicBinding += "|CCAGCTGTG|CACAGCTGG"    # zic3 + zic3 reverse-complement
ZicBinding += "|CCGCTGTG|CACAGCGG"      # zic2/zicC + zic2/zicC reverse-complement
ZicBinding += "|CCCGCAGTC|GACTGCGGG"    # zic5 + zic5 reverse-complement
zic = en.search.TF('ZicL',ZicBinding)
cluster.add_member(zic,1) 

# Save regex pattern for Ets sites 
EtsBinding =  "NNGGAANN|NNTTCCNN"   # common core motif + reverse-complement
EtsBinding += "|NNGGATNN|NNATCCNN"  # slightly less common core motif + reverse-complement
ets = en.search.TF('Ets',EtsBinding)
cluster.add_member(ets,2)

print(cluster) # verify cluster parameters

[ Notochord-Grammar ] Cluster Parameters:
Window Size = 100
Transcription Factors =
* ZicL: 1x Sites
- Motifs: ['CAGCTGTG|CACAGCTG|CCGCAGT|ACTGCGG|CCGCAGTC|GACTGCGG|CCCGCTGTG|CACAGCGGG|CCAGCTGTG|CACAGCTGG|CCGCTGTG|CACAGCGG|CCCGCAGTC|GACTGCGGG']
* Ets: 2x Sites
- Motifs: ['NNGGAANN|NNTTCCNN|NNGGATNN|NNATCCNN']
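
The motif strings above list every site next to its hand-typed reverse complement. As a minimal sketch, the same strings could be generated programmatically so that the two strands cannot drift apart. The helpers `reverse_complement`, `both_strands`, and `expand_n` are hypothetical names introduced here for illustration and are not part of EnGAGE. How EnGAGE itself interprets the `N` characters is not shown in this notebook, so `expand_n` only illustrates one possible translation for Python's `re` module.

```python
import re

# Translation table for complementing DNA; N (any base) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(site: str) -> str:
    """Return the reverse complement of a DNA site."""
    return site.translate(COMPLEMENT)[::-1]

def both_strands(sites):
    """Join each site and its reverse complement into one alternation pattern."""
    parts = []
    for site in sites:
        for candidate in (site, reverse_complement(site)):
            if candidate not in parts:  # avoid duplicates from palindromic sites
                parts.append(candidate)
    return "|".join(parts)

def expand_n(pattern: str) -> str:
    """Assumption: treat N as 'any nucleotide' by mapping it to a regex class."""
    return pattern.replace("N", "[ACGT]")

# Forward-strand sites copied from the cell above.
zic_forward = ["CAGCTGTG", "CCGCAGT", "CCGCAGTC", "CCCGCTGTG",
               "CCAGCTGTG", "CCGCTGTG", "CCCGCAGTC"]
ets_forward = ["NNGGAANN", "NNGGATNN"]

zic_pattern = both_strands(zic_forward)
ets_pattern = both_strands(ets_forward)
ets_regex = re.compile(expand_n(ets_pattern))

print(zic_pattern)
print(ets_pattern)
print(bool(ets_regex.fullmatch("ACGGAATC")))
```

The generated `zic_pattern` and `ets_pattern` strings can be passed to `en.search.TF(...)` exactly as the hand-written ones are in the cell above.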



## Search for the Notochord Grammar Cluster Using `en.search.find_motif_cluster()`

In [3]:
df = en.search.find_motif_cluster(filename,cluster,num_processes=16) # default is 4 processes
print(f"There were {df.shape[0]} cluster regions found!")

Processing Search Parameters: 100%|██████████| 2/2 [00:00<00:00, 2836.86it/s]
Searching Chromosomes for Clusters: 100%|██████████| 67/67 [00:06<00:00, 10.20it/s]
Consolidating Overlapping Regions: 100%|██████████| 67/67 [10:42<00:00,  9.59s/it] 


There were 5246 cluster regions found!
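
To make the idea of a "cluster region" concrete, here is a toy sketch of a windowed search followed by the merging of overlapping windows, mirroring the "Searching" and "Consolidating Overlapping Regions" steps reported in the progress output. This is not EnGAGE's implementation: the functions `find_sites`, `cluster_windows`, and `merge_overlapping` are hypothetical, the sequence is made up, and the sketch assumes that a site counts toward a window when its start position falls inside that window.

```python
import re

def find_sites(pattern, seq):
    """Start positions of all matches, including overlapping ones (via lookahead)."""
    lookahead = re.compile(f"(?=({pattern}))")
    return [m.start() for m in lookahead.finditer(seq)]

def cluster_windows(seq, requirements, window=100):
    """Return (start, end) windows holding at least the required sites.

    requirements is a list of (regex_pattern, min_count) pairs.
    Assumption: a site belongs to a window if its start lies inside it.
    """
    hits = [(find_sites(pattern, seq), count) for pattern, count in requirements]
    windows = []
    for start in range(max(1, len(seq) - window + 1)):
        end = start + window
        if all(sum(start <= pos < end for pos in positions) >= count
               for positions, count in hits):
            windows.append((start, min(end, len(seq))))
    return windows

def merge_overlapping(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Made-up 50 bp sequence: one Zic site (position 10), two Ets sites (21 and 32).
seq = "A" * 10 + "CAGCTGTG" + "AAA" + "CCGGAAGT" + "AAA" + "TTGGATCA" + "A" * 10

zic = "CAGCTGTG|CACAGCTG"
ets = "NNGGAANN|NNTTCCNN|NNGGATNN|NNATCCNN".replace("N", "[ACGT]")

windows = cluster_windows(seq, [(zic, 1), (ets, 2)], window=30)
regions = merge_overlapping(windows)
print(regions)  # [(3, 40)]
```

Here, eight overlapping 30 bp windows satisfy the "1x Zic, 2x Ets" requirement, and merging collapses them into a single region. The sketch checks every window start, which is fine for a toy sequence but would be slow on a real genome.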
