<a href="https://colab.research.google.com/github/citizenphage/Tong-hsdR/blob/main/PatternHunter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Run this block first to install the required libraries.

In [None]:
!pip install biopython


Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85


Next, run this block to define the functions.

In [None]:
from Bio import SeqIO
import re
import pandas as pd
import glob


def find_motif(dna_sequence: str, target_name, motif_pattern=r"GATC.{6}GTC"):
    # Find all non-overlapping matches on the strand as given.
    # start is 1-based; end is match.end() + 1, i.e. one past the last matched base (1-based).
    matches = [
        (target_name, match.start() + 1, match.end() + 1, match.group())
        for match in re.finditer(motif_pattern, dna_sequence)
    ]
    return pd.DataFrame(matches, columns=["target", "start", "end", "Motif"])


def get_sequences(file):
    with open(file, 'r') as handle:
        sequences = [(s.id, str(s.seq)) for s in SeqIO.parse(handle, 'genbank')]
    return sequences

dna_sequence = "GATCAGCTTAGTCGATCCGTAAGTCGATCAGGTACGTC"

find_motif(dna_sequence, "test")

Unnamed: 0,target,start,end,Motif
0,test,1,14,GATCAGCTTAGTC
1,test,26,39,GATCAGGTACGTC
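
The coordinates in the table above can be mapped back onto the sequence. The sketch below uses only the standard `re` module and reproduces the `start`/`end` arithmetic from `find_motif`; the helper name `slice_from_reported` is made up for illustration and is not part of the notebook. It assumes the convention visible in the output: `start` is 1-based and `end` is one position past the last base of the motif.

```python
import re

MOTIF_PATTERN = r"GATC.{6}GTC"  # same default pattern as find_motif
dna_sequence = "GATCAGCTTAGTCGATCCGTAAGTCGATCAGGTACGTC"


def slice_from_reported(sequence, start, end):
    """Recover the matched bases from find_motif-style (start, end) values.

    Hypothetical helper: assumes start is 1-based and end is one past the
    last matched base in 1-based coordinates.
    """
    return sequence[start - 1:end - 1]


# Same arithmetic as find_motif, without building a DataFrame
reported = [
    (m.start() + 1, m.end() + 1, m.group())
    for m in re.finditer(MOTIF_PATTERN, dna_sequence)
]

# Slicing with the reported coordinates should give back each motif
recovered = [slice_from_reported(dna_sequence, s, e) for s, e, _ in reported]
print(reported)
print(recovered)
```

If you need the 1-based position of the last base of a motif, it is `end - 1` under this convention.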


Upload your GenBank files using the 'upload to session storage' button on the left. Change the file name in the code block below (`CPL00163.gbk`) to match the file you've just uploaded, or use a wildcard pattern such as `test_gbk/*.gbk` to read every GenBank file in an uploaded folder, then run the code block. This generates an output file called `matches.csv` that contains all of the matches.

In [None]:
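# Collect every file matching the pattern; use e.g. "test_gbk/*.gbk" for a whole folder
files = glob.glob("CPL00163.gbk")
results = []
for f in files:
    # A GenBank file may hold several records; search each one separately
    for seq_id, sequence in get_sequences(f):
        results.append(find_motif(sequence, target_name=seq_id))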

final_df = pd.concat(results, ignore_index=True)
final_df.to_csv("matches.csv", index=False)
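
If your files sit in a folder, a wildcard pattern lets `glob` pick them all up at once. This is a minimal sketch using only the standard library; the helper name `list_genbank_files` and the `.gbk` extension default are assumptions for illustration, and the demonstration creates throwaway empty files rather than real GenBank records.

```python
import glob
import os
import tempfile


def list_genbank_files(folder, extension=".gbk"):
    """Return the sorted paths of files in `folder` ending in `extension`.

    Hypothetical helper: adjust `extension` if your files end in .gb or .gbff.
    """
    return sorted(glob.glob(os.path.join(folder, f"*{extension}")))


# Demonstration with throwaway empty files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.gbk", "b.gbk", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_genbank_files(tmp)]

print(found)
```

An empty list here means the pattern matched nothing, in which case `pd.concat` in the block above would raise an error because there is nothing to concatenate.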

This block just prints the results to the screen.

In [None]:
final_df

Unnamed: 0,target,start,end,Motif
0,CPL00163,2924,2937,GATCGAGCCGGTC
1,CPL00163,3059,3072,GATCGACCAAGTC
2,CPL00163,10874,10887,GATCAGGGATGTC
3,CPL00163,11249,11262,GATCACCCAAGTC
4,CPL00163,20838,20851,GATCTTCCGCGTC
5,CPL00163,29256,29269,GATCGTCGGGGTC
6,CPL00163,38231,38244,GATCGAGGACGTC
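
Note that `re.finditer` reports non-overlapping matches only, so a motif that begins inside a previous match is skipped. If overlapping sites matter for your analysis, one option is to wrap the pattern in a lookahead. The function below is a sketch under that assumption, not part of the notebook; `find_overlapping` is a made-up name, and it keeps the same `start`/`end` convention as `find_motif`.

```python
import re


def find_overlapping(sequence, motif_pattern=r"GATC.{6}GTC"):
    """Return (start, end, motif) tuples, including overlapping matches.

    Hypothetical helper: the lookahead consumes no characters, so the
    search restarts one base after each match start instead of after
    the match end.
    """
    lookahead = re.compile(f"(?=({motif_pattern}))")
    return [
        (m.start(1) + 1, m.end(1) + 1, m.group(1))
        for m in lookahead.finditer(sequence)
    ]


# Toy sequence in which a second motif starts inside the first one
toy = "GATCGATCAAGTCAGTC"
plain = [m.group() for m in re.finditer(r"GATC.{6}GTC", toy)]
overlapping = find_overlapping(toy)
print(plain)
print(overlapping)
```

On the toy sequence the plain search finds one motif while the lookahead version finds two.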


Run this block to get a summary file of motif counts per genome

In [None]:
summary_df = final_df.groupby('target')['Motif'].count().reset_index()
summary_df = summary_df.rename(columns={'Motif': 'Motif_Count'})
# Write the per-target counts (not the full match table) to summary.csv
summary_df.to_csv("summary.csv", index=False)
summary_df
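
One thing to keep in mind is that a `groupby` count only lists targets that have at least one match, so a genome with no motifs will be absent from the summary rather than shown with a count of zero. The pandas-free sketch below illustrates one way to include zeros; the function name `count_motifs_per_target` and the target `hypothetical_genome` are invented for the example.

```python
from collections import Counter


def count_motifs_per_target(match_targets, all_targets):
    """Count matches per target, reporting 0 for targets with no matches.

    Hypothetical helper: `match_targets` is the 'target' column of the
    match table, `all_targets` is every record id that was searched.
    """
    counts = Counter(match_targets)
    return {target: counts.get(target, 0) for target in all_targets}


# Seven matches for CPL00163 (as in the table above) and none for a
# second, made-up record id
example = count_motifs_per_target(
    ["CPL00163"] * 7,
    ["CPL00163", "hypothetical_genome"],
)
print(example)
```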