# ReadMe.ipynb
- for AZ's BioE 134 Final Project

For starters, I want to admit that some small details do not yet work 100% of the time. It would be lovely to iron out those wrinkles down the line, but for now the project is mostly functional.

A lot is built behind the scenes to implement the few exposed methods. If you are interested in checking it out, you can find it all in the source code.

In [25]:
import pandas as pd # for easier visualization

## Gibson Assembly
Gibson Assembly is a widely used cloning method. It joins one or more fragments into a backbone; each fragment is amplified with a pair of oligos that carry a predesigned overhang.
The most common way to do Gibson is to use, as the backbone, a plasmid with a unique restriction site at the locus where the insert will go. A pair of primers with sufficient overlap with both the insert and the plasmid then needs to be designed and used to amplify the insert. To make that process easier and less error-prone, the following functions come into play.
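As a rough illustration of the primer design described above, here is a minimal standalone sketch. It is not the `GibsonAssembly` implementation from this project: the function name `design_gibson_primers`, its parameters, and the choice to drop the restriction site from the junction are all assumptions made for the example.

```python
# Hypothetical sketch of Gibson primer design; not the project's GibsonAssembly code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def design_gibson_primers(backbone, site, insert,
                          overhang_length=20, binding_length=20):
    """Design a primer pair that amplifies `insert` with backbone overhangs.

    Assumptions: the backbone is treated as a linear string, the site must
    occur exactly once, and the site itself is left out of the junction.
    """
    if overhang_length <= 0 or binding_length <= 0:
        raise ValueError("lengths must be positive")
    backbone, site, insert = backbone.upper(), site.upper(), insert.upper()
    if backbone.count(site) != 1:
        raise ValueError("the site must occur exactly once in the backbone")

    cut = backbone.index(site)
    left_flank = backbone[:cut]                # backbone upstream of the site
    right_flank = backbone[cut + len(site):]   # backbone downstream of the site

    # Forward primer: 5' overhang from the backbone, 3' end binds the insert start.
    forward = left_flank[-overhang_length:] + insert[:binding_length]
    # Reverse primer: reverse complement of insert end plus downstream backbone.
    reverse = reverse_complement(
        insert[-binding_length:] + right_flank[:overhang_length]
    )
    return forward, reverse


fwd, rev = design_gibson_primers(
    "AAAACCCCGGGG" + "GAATTC" + "TTTTGGGGCCCC",
    "GAATTC",
    "ATGGTGAGCAAGGGCGAGGAGCTGTTCACC",
    overhang_length=5,
    binding_length=6,
)
print(fwd, rev)
```

In this toy layout, the 5' part of each primer matches the backbone and the 3' part anneals to the insert, which is the same arrangement visible in the primer pairs printed by the demo cells.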

In [26]:
from helper.biology.GibsonAssembly import GibsonAssembly

# misc methods
from helper.io import readDict

demo_data = readDict('./demo.json')

In [27]:
demo_data.keys()

dict_keys(['GibsonBackbone', 'GibsonInsert', 'EGFPSequence', 'Ensembl'])

In [3]:
# creating an assembly object
assembly = GibsonAssembly(demo_data['GibsonBackbone'], 'GAATTC')

In [4]:
# calling the object directly invokes its __call__ method
assembly(demo_data['GibsonInsert'], preserve_cutting_site='none')

('TCTCTGCTCTTCACGCTACTTATTAAACAGCTGTTAATTT',
 'GAAGAACCGGTGATGAATTCTACTTGTACAGCTCGTCCA')

In [5]:
assembly(demo_data['GibsonInsert'], preserve_cutting_site='none', overhang_length=40)

overhang_length is possibly too short or too long, which might lead to insufficient Gibson Assembly. Use around 15 to 30 for optimal result


('GCGTTGTTGTCGGTGCTCGTTCTCTGCTCTTCACGCTACTTATTAAACAGCTGTTAATTT',
 'TGGCAACTAGAAGGCACAGCGAAGAACCGGTGATGAATTCTACTTGTACAGCTCGTCCA')

In [6]:
assembly(demo_data['GibsonInsert'], preserve_cutting_site='none', binding_length=40)

binding_length is possibly too short or too long, which might lead to insufficient PCR. Use around 15 to 30 for optimal result


('TCTCTGCTCTTCACGCTACTTATTAAACAGCTGTTAATTTTGGTTTCTTTTTGGTAACAG',
 'GAAGAACCGGTGATGAATTCTACTTGTACAGCTCGTCCATGCCGCCGGTGGAGTGGCGG')
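The length messages above suggest a soft range check that reports a problem but still returns primers. A minimal sketch of such a check follows; the helper name `check_length`, the use of the `warnings` module, and the message wording are assumptions, and only the 15 to 30 range is taken from the output shown here.

```python
import warnings


def check_length(name, value, low=15, high=30):
    """Warn, without failing, when a length is outside the suggested range.

    Hypothetical helper: the project may print its message some other way.
    """
    if not low <= value <= high:
        warnings.warn(
            f"{name}={value} is outside the suggested {low} to {high} range",
            stacklevel=2,
        )
    return value


check_length("overhang_length", 20)  # inside the range, so no warning
```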

In [7]:
try:
    assembly(demo_data['GibsonInsert'], preserve_cutting_site='1234')
except Exception:  # catch ordinary errors only, not KeyboardInterrupt/SystemExit
    print('*raising error due to unrecognized input*')

*raising error due to unrecognized input*


## Basic Biology
Here are a few examples of the basic biology tools in the module.

In [8]:
from helper.biology import transcribe, translate, reverseTranscribe

# an example DNA sequence of an engineered GFP protein
EGFP = demo_data['EGFPSequence']

In [9]:
# returns a peptide object
translate(EGFP)

Peptide: MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK*

In [10]:
transcribe(EGFP)

'AUGGUGAGCAAGGGCGAGGAGCUGUUCACCGGGGUGGUGCCCAUCCUGGUCGAGCUGGACGGCGACGUAAACGGCCACAAGUUCAGCGUGUCCGGCGAGGGCGAGGGCGAUGCCACCUACGGCAAGCUGACCCUGAAGUUCAUCUGCACCACCGGCAAGCUGCCCGUGCCCUGGCCCACCCUCGUGACCACCCUGACCUACGGCGUGCAGUGCUUCAGCCGCUACCCCGACCACAUGAAGCAGCACGACUUCUUCAAGUCCGCCAUGCCCGAAGGCUACGUCCAGGAGCGCACCAUCUUCUUCAAGGACGACGGCAACUACAAGACCCGCGCCGAGGUGAAGUUCGAGGGCGACACCCUGGUGAACCGCAUCGAGCUGAAGGGCAUCGACUUCAAGGAGGACGGCAACAUCCUGGGGCACAAGCUGGAGUACAACUACAACAGCCACAACGUCUAUAUCAUGGCCGACAAGCAGAAGAACGGCAUCAAGGUGAACUUCAAGAUCCGCCACAACAUCGAGGACGGCAGCGUGCAGCUCGCCGACCACUACCAGCAGAACACCCCCAUCGGCGACGGCCCCGUGCUGCUGCCCGACAACCACUACCUGAGCACCCAGUCCGCCCUGAGCAAAGACCCCAACGAGAAGCGCGAUCACAUGGUCCUGCUGGAGUUCGUGACCGCCGCCGGGAUCACUCUCGGCAUGGACGAGCUGUACAAGUAA'

In [11]:
# has no real effect on a DNA sequence (it only reverse-transcribes RNA),
# but can be used to standardize the DNA format
reverseTranscribe(EGFP)

'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA'
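For readers curious how such helpers can work, here is a minimal standalone sketch of the three operations using the standard genetic code. It is not the code in `helper.biology`: the snake_case names are assumptions, and this `translate` returns a plain string rather than a Peptide object.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_transcribe(seq):
    """Return the DNA form of a sequence (uppercase, U replaced by T)."""
    return seq.upper().replace("U", "T")


def transcribe(seq):
    """Return the RNA form of a coding-strand sequence (T replaced by U)."""
    return seq.upper().replace("T", "U")


def translate(seq):
    """Translate from the first base in frame; '*' marks a stop codon.

    Trailing bases that do not fill a codon are ignored. Ambiguous bases
    such as N are not handled and raise KeyError in this sketch.
    """
    dna = reverse_transcribe(seq)
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGGTGAGCAAGGGCGAGTAA"))
```

Because `reverse_transcribe` only uppercases and swaps U for T, applying it to DNA leaves the bases unchanged, which is why it can double as a format normalizer.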

## Ensembl?
The Ensembl genome database project provides a centralized resource for genomics. It contains various information one might find useful, such as information about a gene, the sequence of a particular transcript, etc.

It is indeed a useful tool, but it can be frustrating to use if one has many queries to run. Hence, here is an example of how to do a batch request for information.

In [12]:
from helper.ensembl.requests import LookUpRequest, BasePostRequest

ensembl_codes = demo_data['Ensembl']

In [31]:
# Ensembl has a limit of 1000 items per request, which can be a little irritating.
# Luckily, the request method handles that and accepts any number of queries at once.

result = LookUpRequest(ensembl_codes)
pd.DataFrame(result).T.head(5)

LookUpRequest: iter #0


| | strand | version | species | start | canonical_transcript | display_name | assembly_name | end | db_type | seq_region_name | description | object_type | id | logic_name | source | biotype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000067082 | -1 | 15 | homo_sapiens | 3775996 | ENST00000497571.6 | KLF6 | GRCh38 | 3785281 | core | 10 | KLF transcription factor 6 [Source:HGNC Symbol... | Gene | ENSG00000067082 | ensembl_havana_gene_homo_sapiens | ensembl_havana | protein_coding |
| ENSG00000001626 | 1 | 18 | homo_sapiens | 117287120 | ENST00000003084.11 | CFTR | GRCh38 | 117715971 | core | 7 | CF transmembrane conductance regulator [Source... | Gene | ENSG00000001626 | ensembl_havana_gene_homo_sapiens | ensembl_havana | protein_coding |
| ENSG00000000003 | -1 | 16 | homo_sapiens | 100627108 | ENST00000373020.9 | TSPAN6 | GRCh38 | 100639991 | core | X | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | Gene | ENSG00000000003 | ensembl_havana_gene_homo_sapiens | ensembl_havana | protein_coding |
| ENSG00000000005 | 1 | 6 | homo_sapiens | 100584936 | ENST00000373031.5 | TNMD | GRCh38 | 100599885 | core | X | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | Gene | ENSG00000000005 | ensembl_havana_gene_homo_sapiens | ensembl_havana | protein_coding |
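The batching idea mentioned in the cell above can be sketched as follows. This is not the project's `LookUpRequest`; the names `chunked`, `post_json`, and `batch_lookup` are hypothetical. The sketch assumes the public Ensembl REST `lookup/id` POST endpoint, which takes a JSON body of the form `{"ids": [...]}` and returns a dictionary keyed by ID.

```python
import json
from urllib import request as urlrequest

MAX_IDS_PER_REQUEST = 1000  # per-request limit noted in the cell above


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urlrequest.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )
    with urlrequest.urlopen(req) as response:
        return json.load(response)


def batch_lookup(ids, server="https://rest.ensembl.org", post=post_json):
    """Look up any number of IDs by sending one request per chunk.

    `post` is injectable so the chunking can be exercised without network.
    """
    results = {}
    for batch in chunked(list(ids), MAX_IDS_PER_REQUEST):
        results.update(post(f"{server}/lookup/id", {"ids": batch}))
    return results
```

Pointing `server` at `https://grch37.rest.ensembl.org` would, under the same assumptions, run the same lookup against the GRCh37 assembly. A fuller version would probably also handle rate limiting and retries, which this sketch leaves out.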


In [32]:
# alternatively, use the exposed request endpoint to query basically anything, with more
# room for customization
print("the same request but from grch37 (i.e. hg19) instead of grch38")
result = BasePostRequest("lookup/id", data={"ids": ensembl_codes}, assembly="grch37")
pd.DataFrame(result).T.head(5)

the same request but from grch37 (i.e. hg19) instead of grch38


| | id | assembly_name | description | end | logic_name | seq_region_name | version | species | db_type | biotype | source | strand | object_type | start | canonical_transcript | display_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000005 | ENSG00000000005 | GRCh37 | tenomodulin [Source:HGNC Symbol;Acc:17757] | 99854882 | ensembl_havana_gene_homo_sapiens_37 | X | 5 | homo_sapiens | core | protein_coding | ensembl_havana | 1 | Gene | 99839799 | ENST00000373031.4 | TNMD |
| ENSG00000001626 | ENSG00000001626 | GRCh37 | cystic fibrosis transmembrane conductance regu... | 117356025 | ensembl_havana_gene_homo_sapiens_37 | 7 | 10 | homo_sapiens | core | protein_coding | ensembl_havana | 1 | Gene | 117105838 | ENST00000003084.6 | CFTR |
| ENSG00000067082 | ENSG00000067082 | GRCh37 | Kruppel-like factor 6 [Source:HGNC Symbol;Acc:... | 3827473 | ensembl_havana_gene_homo_sapiens_37 | 10 | 10 | homo_sapiens | core | protein_coding | ensembl_havana | -1 | Gene | 3818188 | ENST00000497571.1 | KLF6 |
| ENSG00000000003 | ENSG00000000003 | GRCh37 | tetraspanin 6 [Source:HGNC Symbol;Acc:11858] | 99894988 | ensembl_havana_gene_homo_sapiens_37 | X | 10 | homo_sapiens | core | protein_coding | ensembl_havana | -1 | Gene | 99883667 | ENST00000373020.4 | TSPAN6 |


## Misc
There are also a few miscellaneous methods that can come in handy, like the io module used at the very beginning.

There are also methods for identifying sgRNAs for CRISPR and for finding unique cutters, double cutters, or N cutters in a sequence, as well as utility classes like Peptide, AminoAcids, etc.
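To give a flavor of what cutter counting and sgRNA searching involve, here is a small standalone sketch. It does not use this project's API, and every function name in it is hypothetical. It assumes a linear sequence, a tiny hand-written enzyme table, and SpCas9-style guides (20 nt followed by an NGG PAM) searched on the forward strand only.

```python
# Hypothetical sketch; the module's own methods may work differently.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Recognition sites for a few common enzymes (illustrative subset only).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_sites(seq, site):
    """Count occurrences of a recognition site, overlapping ones included.

    A non-palindromic site is also searched as its reverse complement,
    which covers the opposite strand.
    """
    seq, site = seq.upper(), site.upper()

    def occurrences(pattern):
        last = len(seq) - len(pattern) + 1
        return sum(1 for i in range(last) if seq.startswith(pattern, i))

    total = occurrences(site)
    rc = reverse_complement(site)
    if rc != site:
        total += occurrences(rc)
    return total


def n_cutters(seq, enzymes, n=1):
    """Names of enzymes that cut `seq` exactly `n` times (n=1: unique cutters)."""
    return sorted(name for name, site in enzymes.items()
                  if count_sites(seq, site) == n)


def find_sgrnas(seq, guide_length=20):
    """Return (position, guide, PAM) for each guide followed by an NGG PAM."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - guide_length - 2):
        pam = seq[i + guide_length:i + guide_length + 3]
        if pam[1:] == "GG":
            hits.append((i, seq[i:i + guide_length], pam))
    return hits


demo = "AAGAATTCAAGGATCCAAGGATCCAA"
print(n_cutters(demo, ENZYMES, n=1), n_cutters(demo, ENZYMES, n=2))
```

A practical sgRNA finder would also scan the reverse strand and score off-target risk, which this sketch leaves out.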

That's pretty much it! Thank you for reading this quick readme/demo!