# Process NCBI Data Files

The purpose of this notebook is to gather the nucleotide sequences for genome features (e.g. proteins) of SARS-CoV-2, the virus that causes COVID-19. Feature coordinates are read from text files manually downloaded from NCBI (folder `ncbi_pages`). The FASTA file for each strain (folder `fasta_files`) is then used to extract the nucleotide range of each feature and translate it to an amino acid sequence for future analysis.

Notebook workflow:
1. Scan each strain's NCBI text file for the start-end coordinates of each listed "feature" type 
2. Use these coordinates to find the nucleotide sequences and translate them to amino acids
3. Store sequence results as structured data in `.pickle` files located in the `pickles` folder

Pickle files save Python objects to disk so that other Python scripts can load them as inputs.
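
As a minimal sketch of how a pickle file round-trips a Python object, the block below writes a small list of feature dicts and reads it back. The values and the file name `example.pickle` are illustrative only, and a temporary folder is used so the real `pickles` folder is not touched.

```python
import os
import pickle
import tempfile

# Illustrative feature records, shaped like the dicts this notebook builds
features = [
  {'type': 'CDS', 'start': 21563, 'end': 25384, 'gene': 'S'},
  {'type': 'mat_peptide', 'start': 266, 'end': 805, 'product': 'leader protein'},
]

# Write to a throwaway folder, then load the object back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
  path = os.path.join(tmp_dir, 'example.pickle')
  with open(path, 'wb') as file:
    pickle.dump(features, file)
  with open(path, 'rb') as file:
    restored = pickle.load(file)

# The restored list is an equal copy, not the same object
print(restored == features)
```

Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.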

### Imports

Begin workflow by importing the open source Python libraries to be used.

In [9]:
try:
  from Bio import SeqIO
except ImportError:
  %pip install biopython
  from Bio import SeqIO
import pickle
import os

print('Library imports complete')

Library imports complete


### Configs

Set the config variables as needed.

In [10]:
# folder relative locations
ncbi_page_folder = '../ncbi_pages/'
fasta_folder = '../fasta_files/'
pickles_folder = '../pickles/'

# temporary max limit on gene sequence length to avoid long computations
gene_length_max = 5000 

# feature type labels to search for in NCBI text pages
feature_types = ('CDS', 'mat_peptide')

# list of strain accession numbers matching downloaded NCBI and FASTA files
downloaded_strains = [
  'NC_045512.2',
  'OP733821.1',
  'OK341237.1',
  'OM251163.1',
  'OQ050563.1',
  'MW474188.1',
  'MW243586.1',
  'OL947440.1',
  'OQ253610.1'
]

print('Config variables set')

Config variables set
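
Because every accession number in the config is expected to have both an NCBI text file and a FASTA file, it can help to check for missing inputs before processing. The sketch below assumes the `<strain>.txt` and `<strain>.fasta` naming used later in this notebook; the helper name `find_missing_inputs` is hypothetical, and the demo runs against a temporary folder rather than the real data.

```python
import os
import tempfile

def find_missing_inputs(strains, ncbi_folder, fasta_folder):
  """Return the expected input paths that do not exist on disk."""
  missing = []
  for strain in strains:
    expected = (
      os.path.join(ncbi_folder, strain + '.txt'),
      os.path.join(fasta_folder, strain + '.fasta'),
    )
    for path in expected:
      if not os.path.isfile(path):
        missing.append(path)
  return missing

# Demo with a made-up strain 'A' that has a text file but no FASTA file
with tempfile.TemporaryDirectory() as tmp_dir:
  open(os.path.join(tmp_dir, 'A.txt'), 'w').close()
  demo_missing = find_missing_inputs(['A'], tmp_dir, tmp_dir)

print(demo_missing)
```

In this notebook the call would be `find_missing_inputs(downloaded_strains, ncbi_page_folder, fasta_folder)`, and an empty list means every strain has both files.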


### Process

The NCBI page text and FASTA file for each strain were downloaded beforehand. For each strain:
1. Open each NCBI page text file and gather the lines about DNA features
2. Scan through those lines to find the feature names and their start-end coordinates
3. Open the strain's FASTA file and find the sequences using the start-end coordinates
4. Use biopython to translate the nucleotide sequences to amino acids
5. Save the gathered data for each strain to its own .pickle file
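
Step 3 depends on a coordinate convention: NCBI feature ranges are 1-based and include both endpoints, while Python slices are 0-based and exclude the end index. The sketch below shows the conversion on a plain string standing in for a Biopython sequence; the helper name `slice_feature` and the 10-base sequence are made up for illustration.

```python
def slice_feature(sequence, start, end):
  """Return the bases covered by a 1-based, inclusive start..end range."""
  if start < 1 or end > len(sequence) or start > end:
    raise ValueError('range {}..{} is outside the sequence'.format(start, end))
  # subtract 1 from start only: the slice end is already exclusive
  return sequence[start - 1:end]

genome = 'ACGTACGTAC'  # made-up sequence, not real genome data
region = slice_feature(genome, 3, 8)

# a range start..end always covers end - start + 1 bases
print(region, len(region))
```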

In [13]:

for strain in downloaded_strains:

  # variables that change for each strain
  ncbi_page_path = ncbi_page_folder + strain + '.txt'
  fasta_path = fasta_folder + strain + '.fasta'
  pickles_file = strain + '.pickle'

  # find lines of text with info about a feature
  # put list of lines for each feature into a list
  features = []
  with open(ncbi_page_path) as file:
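    found = False
    feature_lines = []
    for line in file:
      line = line.strip()
      if found and line.startswith('/'):
        # qualifier line; skip the amino acid translation supplied by NCBI
        if not line.startswith('/translation'):
          feature_lines.append(line)
        continue
      if found:
        # the first non-qualifier line closes the current feature
        features.append(feature_lines)
        feature_lines = []
        found = False
      # str.startswith accepts a tuple of prefixes
      if line.startswith(feature_types):
        found = True
        feature_lines.append(line)
    # keep a feature that is still open when the file ends
    if found:
      features.append(feature_lines)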

  # go through gathered text lines about features
  # put useful data into a dict structure and put those in a list
  dna_features = []
  for feature_lines in features:
    # the first line holds the feature type and its coordinates,
    # separated by whitespace, e.g. 'CDS  21563..25384'
    line_parts = feature_lines[0].split()
    if len(line_parts) != 2:
      continue
    feature_type, coords_text = line_parts
    # TODO: handle join cases: mat_peptide join(13442..13468,13468..16236)
    coords = coords_text.split('..')
    if len(coords) != 2:
      continue
    start, end = coords
    # TODO: handle greater than cases: '>27123'
    if not start.isdigit() or not end.isdigit():
      continue
    # build a fresh dict per feature so skipped features leave nothing behind
    feature = {'type': feature_type, 'start': int(start), 'end': int(end)}
    for line in feature_lines[1:]:
      # qualifier lines look like /key="value"; split on the first '=' only
      key, sep, val = line.partition('=')
      if not sep:
        continue  # qualifier without a value
      feature[key[1:]] = val.strip('",')
    dna_features.append(feature)
  
  # use start and end properties found in text lines
  # to find protein coding sequences from fasta file
  # each FASTA file is expected to hold one genome, so take the first record
  dna = next(SeqIO.parse(fasta_path, "fasta")).seq
  for feature in dna_features:
    # NCBI coordinates are 1-based and inclusive;
    # Python slices are 0-based and exclude the end index
    start = feature['start'] - 1
    end = feature['end']
    # features at or above the length limit get no sequence keys
    if end - start < gene_length_max:
      coding_region = dna[start:end]
      feature['nucleotides'] = coding_region
      feature['translation'] = coding_region.translate()

  # store the list of found features in a .pickle file
  with open(pickles_folder + pickles_file, 'wb') as file:
    pickle.dump(dna_features, file)

  # print the count of features found for strain
  print('Features found for {} strain: {}'.format(strain, len(dna_features)))

print('\nProcessing and saving data complete')

Features found for NC_045512.2 strain: 36
Features found for OP733821.1 strain: 36
Features found for OK341237.1 strain: 36
Features found for OM251163.1 strain: 35
Features found for OQ050563.1 strain: 36
Features found for MW474188.1 strain: 36
Features found for MW243586.1 strain: 36
Features found for OL947440.1 strain: 36
Features found for OQ253610.1 strain: 35

Processing and saving data complete
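
The two TODO comments in the cell above mean that features with `join(...)` locations or partial markers such as `>27123` are currently skipped. One possible way to handle them is sketched below. This is not what the notebook does today: the names `parse_location` and `extract_segments` are hypothetical, and the sketch deliberately returns `None` for any other location form (for example `complement(...)`).

```python
import re

# One 'start..end' range; '<' or '>' marks a partial (open-ended) boundary
RANGE_PATTERN = re.compile(r'([<>]?)(\d+)\.\.([<>]?)(\d+)')

def parse_location(text):
  """Parse a location string into a list of (start, end, partial) tuples.

  Returns None when the string is in a form this sketch does not support.
  """
  text = text.strip()
  if text.startswith('join(') and text.endswith(')'):
    text = text[len('join('):-1]
  segments = []
  for part in text.split(','):
    match = RANGE_PATTERN.fullmatch(part.strip())
    if match is None:
      return None
    start_flag, start, end_flag, end = match.groups()
    segments.append((int(start), int(end), bool(start_flag or end_flag)))
  return segments

def extract_segments(sequence, segments):
  """Concatenate the 1-based, inclusive ranges in the order they are listed."""
  return ''.join(sequence[start - 1:end] for start, end, _ in segments)

joined = parse_location('join(13442..13468,13468..16236)')
partial = parse_location('100..>200')  # made-up partial range
demo = extract_segments('ACGTACGTAC', parse_location('join(1..3,3..5)'))
print(joined)
print(partial)
print(demo)
```

Segments are concatenated exactly as listed, so a base that appears in two segments (13468 in the TODO example) appears twice in the extracted sequence.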


### Results

Each strain's features are saved as a list object in its own .pickle file:

In [14]:

# Print the names of the pickle files created
for file in os.listdir(pickles_folder):
  if file.endswith('.pickle'):
    print(file)

MW243586.1.pickle
OP733821.1.pickle
OK341237.1.pickle
OM251163.1.pickle
OQ050563.1.pickle
MW474188.1.pickle
OL947440.1.pickle
NC_045512.2.pickle
OQ253610.1.pickle
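
To inspect the saved results, a downstream script could load each file and tally its features by type. The sketch below is one way to do that; the helper name `count_feature_types` is hypothetical, and the demo uses a temporary folder with made-up records instead of the real `pickles` folder. Loading the real files also requires Biopython to be installed, because the stored `nucleotides` and `translation` values are Biopython sequence objects.

```python
import os
import pickle
import tempfile
from collections import Counter

def count_feature_types(folder):
  """Map each .pickle file name in folder to a Counter of feature types."""
  summary = {}
  for name in sorted(os.listdir(folder)):
    if not name.endswith('.pickle'):
      continue
    with open(os.path.join(folder, name), 'rb') as file:
      features = pickle.load(file)
    summary[name] = Counter(feature['type'] for feature in features)
  return summary

# Demo with made-up records written to a throwaway folder
demo_features = [
  {'type': 'CDS', 'start': 1, 'end': 9},
  {'type': 'CDS', 'start': 10, 'end': 18},
  {'type': 'mat_peptide', 'start': 1, 'end': 6},
]
with tempfile.TemporaryDirectory() as tmp_dir:
  with open(os.path.join(tmp_dir, 'demo.pickle'), 'wb') as file:
    pickle.dump(demo_features, file)
  summary = count_feature_types(tmp_dir)

print(summary)
```

In this notebook the call would be `count_feature_types(pickles_folder)`.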
