# 23S rRNA Discovery

The purpose of this experiment is to extract all 23S rRNA-related data from NCBI GenBank files. The 23S rRNA gene is another conserved region in bacteria that we can attempt to use for designing detection assays: it is thought to contain both regions conserved across bacterial species and regions unique to individual species, which would allow us to detect and discriminate between different bacteria within a sample.

Biopython will be used to interact with the Entrez API to download records for a predetermined list of bacterial species. The dataset used for this experiment covers the vaginal microbiome; more than 900 species have been identified as residing there according to past studies, such as those analyzing the causative mechanism of bacterial vaginosis.
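
As a rough illustration of the download step, the sketch below searches the NCBI nucleotide database for one species and saves the matching records as GenBank files. The helper names (`build_query`, `download_genbank`), the search term, the `retmax` value and the e-mail argument are all assumptions made for illustration; the actual species list and query used for this experiment are not shown in this notebook.

```python
import os


def build_query(species):
    """Build an Entrez search term for one species (hypothetical query)."""
    # Assumption: restrict by organism and look for 23S in any field.
    return '"{}"[Organism] AND 23S[All Fields]'.format(species)


def download_genbank(species, out_dir, email, retmax=5):
    """Search NCBI for `species` and save each hit as a .gbk file.

    Requires Biopython and network access. Returns the written file paths.
    """
    # Imported here so build_query can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db='nucleotide', term=build_query(species),
                            retmax=retmax)
    ids = Entrez.read(handle)['IdList']
    handle.close()

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for uid in ids:
        # rettype='gbwithparts' asks for the full GenBank flat file
        handle = Entrez.efetch(db='nucleotide', id=uid,
                               rettype='gbwithparts', retmode='text')
        path = os.path.join(out_dir, uid + '.gbk')
        with open(path, 'w') as out:
            out.write(handle.read())
        handle.close()
        paths.append(path)
    return paths
```

Calling `download_genbank` once per species in the list would fill a directory such as `downloads/test` with files that the cells below can read.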

### Data Exploration

Let's read in the data.

In [None]:
import os

# SETTINGS
""" Directory Structure

../downloads/<date of data>

"""
DIRECTORY_MAIN = 'downloads'
DIRECTORY_DATE = 'test'
DIRECTORY_PATH = os.path.join(DIRECTORY_MAIN, DIRECTORY_DATE)

In [77]:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def parse_record(genbankfile):
    """ Reads a GenBank file and scans every record it contains.
    
    Args:
        genbankfile - path to a *.gbk file
    Returns:
        None; matching features are printed by parse_features
    """
    for record in SeqIO.parse(genbankfile, 'gb'):
        parse_features(record)

def parse_features(genbankrecord):
    """ Reads in the entire seq feature
    and extracts only the ones we want; in this case
    the 23S rRNA feature.
    
    Args:
        genbankrecord - individual genbank record
    """
    INTERESTED_FEATURE = 'rRNA'
    
    for feature in genbankrecord.features:
        if feature.type != INTERESTED_FEATURE:
            continue
        try:
            product = feature.qualifiers['product'][0]
        except KeyError:
            # Some rRNA features carry no /product qualifier
            print("Error, no rRNA product.")
            continue
        if '23s' in product.lower():
            print(product, ' | ', feature.location, genbankrecord.name)
            # Preview the first 10 bases of the extracted region
            print(feature.extract(genbankrecord.seq)[:10])


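Printing the hits is handy for exploration, but for later assay design it may be more convenient to return them. The following is a minimal sketch of that variation, not the method used in this notebook: `collect_23s` and `to_fasta` are hypothetical helpers, and they assume only the record and feature attributes already used in `parse_features` (`features`, `type`, `qualifiers`, `location`, `extract`, `name`, `seq`).

```python
def collect_23s(record):
    """Return (record name, location, sequence) for each 23S rRNA feature."""
    hits = []
    for feature in record.features:
        if feature.type != 'rRNA':
            continue
        # .get avoids a KeyError when the /product qualifier is missing
        product = feature.qualifiers.get('product', [''])[0]
        if '23s' in product.lower():
            sequence = str(feature.extract(record.seq))
            hits.append((record.name, str(feature.location), sequence))
    return hits


def to_fasta(hits):
    """Format hits as FASTA text, one entry per 23S copy."""
    entries = []
    for name, location, sequence in hits:
        entries.append('>{} {}\n{}\n'.format(name, location, sequence))
    return ''.join(entries)
```

With records read by `SeqIO.parse`, the text returned by `to_fasta(collect_23s(record))` could be written to a file and passed to an alignment tool.
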
In [78]:
def run(directory):
    """ Read all of the files within the directory.
    
    Args:
        directory - the directory containing the genbank files
    """
    
    for filename in os.listdir(directory):
        # Join against the argument, not the global, so any directory works
        file_path = os.path.join(directory, filename)
        parse_record(file_path)
        
run(DIRECTORY_PATH)

23S ribosomal RNA  |  [228:>724](+) AF342841
GATTAAAATA
Error, no rRNA product.
23S ribosomal RNA  |  [28210:31116](+) NZ_CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [83655:86561](+) NZ_CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [403427:406333](+) NZ_CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [456477:459383](+) NZ_CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [1564555:1567461](-) NZ_CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [28210:31116](+) CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [83655:86561](+) CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [403427:406333](+) CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [456477:459383](+) CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [1564555:1567461](-) CP010783
AGGTTAAGTT
23S ribosomal RNA  |  [0:2901](+) NR_103990
TTAAGTTAAT
23S ribosomal RNA  |  [25726:28597](+) CP007587
AGGTTAAGTT
23S ribosomal RNA  |  [359840:362711](+) CP007587
AGGTTAAGTT
23S ribosomal RNA  |  [411737:414608](+) CP007587
AGGTTAAGTT
23S ribosomal RNA  |  [1522188:1525060](-) CP007587
AGGTTAAGTT
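
The output above lists several 23S copies per genome, and some accessions appear twice (for example, `NZ_CP010783` and `CP010783` report identical coordinates). One way to summarise it is to parse the printed lines back into accession, strand and length. The sketch below assumes the exact line format printed by `parse_features`; `parse_hit` is a hypothetical helper, and a `<` or `>` in the location is taken to mark a partial feature.

```python
import re

# Matches lines such as "23S ribosomal RNA  |  [228:>724](+) AF342841"
HIT_PATTERN = re.compile(
    r'\|\s+\[(?P<start>[<>]?\d+):(?P<end>[<>]?\d+)\]'
    r'\((?P<strand>[+-])\)\s+(?P<name>\S+)'
)


def parse_hit(line):
    """Return a dict describing one printed hit, or None for other lines."""
    match = HIT_PATTERN.search(line)
    if match is None:
        return None
    start, end = match.group('start'), match.group('end')
    start_pos = int(start.lstrip('<>'))
    end_pos = int(end.lstrip('<>'))
    return {
        'name': match.group('name'),
        'strand': match.group('strand'),
        'length': end_pos - start_pos,
        # Fuzzy boundaries mean the annotated feature is incomplete
        'partial': start[0] in '<>' or end[0] in '<>',
    }


lines = [
    '23S ribosomal RNA  |  [228:>724](+) AF342841',
    'GATTAAAATA',
    'Error, no rRNA product.',
    '23S ribosomal RNA  |  [28210:31116](+) NZ_CP010783',
    '23S ribosomal RNA  |  [1522188:1525060](-) CP007587',
]
hits = [hit for hit in map(parse_hit, lines) if hit is not None]
for hit in hits:
    print(hit['name'], hit['strand'], hit['length'], hit['partial'])
```

On the three hit lines used here, this gives a 496 bp partial hit for `AF342841` and full-length hits of 2906 bp and 2872 bp for the other two accessions.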
