# Make CSV

The purpose of this notebook is to generate a CSV file (in the `csvs` folder) from the mutations found and stored in the `pickles` folder.

CSV (comma-separated values) files are one of the most common file types used to share data. They can be opened as a spreadsheet (e.g. in Excel). Using spreadsheet software, we can make graphs and highlight rows that look interesting.

Notebook workflow:
1. Open the pickle folder and get the sequences for all selected strains.
2. Scan through the mutations for each feature of every strain.
3. Write a CSV row for each mutation, with a value for each of the column headers defined just below.
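
The step of opening the pickle files can be sketched in isolation. The sketch below assumes a pickle file may hold more than one pickled object, so it keeps calling `pickle.load` until `EOFError` is raised. The helper name `load_all` and the sample data are hypothetical and only illustrate the pattern.

```python
import os
import pickle
import tempfile


def load_all(path):
    """Return every object stored in the pickle file at `path`."""
    objects = []
    with open(path, 'rb') as f:
        while True:
            try:
                objects.append(pickle.load(f))
            except EOFError:
                # No more pickled objects in the file.
                break
    return objects


# Hypothetical sample data: two objects dumped into one temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.pickle')
    with open(sample_path, 'wb') as f:
        pickle.dump([{'product': 'nsp2', 'mutants': []}], f)
        pickle.dump('second object', f)
    loaded = load_all(sample_path)

print(len(loaded))
```

If a file only ever contains one object, a single `pickle.load` call would be enough; the loop simply makes no assumption about the count.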

Define the list of CSV fields to be column headers.

In [8]:
col_headers = [
  'strain',
  'protein id',
  'product',
  'gene',
  'type',
  'feat start',
  'feat end',
  'position',
  'org',
  'new',
  'is color same',
  'org color',
  'new color',
  'is conserved',
  'org type',
  'new type'
]
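
As an alternative way to keep each value lined up with its header, Python's `csv.DictWriter` accepts rows as dictionaries keyed by column name. This is only a sketch and not the approach this notebook takes; it uses a shortened header list and a made-up row, and writes to an in-memory buffer rather than a file.

```python
import csv
import io

# Shortened, illustrative header list (the notebook defines 16 columns).
headers = ['strain', 'position', 'org', 'new']

# Write to an in-memory buffer so the sketch is self-contained.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
# Made-up example row; the keys must match the header names.
writer.writerow({'strain': 'EXAMPLE_1', 'position': 42,
                 'org': 'Alanine', 'new': 'Valine'})

csv_text = buffer.getvalue()
print(csv_text)
```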

### Imports

Begin the workflow by importing the Python standard library modules used below.

In [9]:
import pickle
import os
import csv

### Configs

Set the config variables needed.

In [10]:
strains_to_skip = ['NC_045512.2']
pickles_path = '../pickles/'
csvs_path = '../csvs/'
csv_file = 'bio_club_mutations_data.csv'

Declare a dictionary that maps single-letter amino acid codes to their full names. This allows the code to read single-letter values from the pickle files and write full amino acid names to the CSV file.

In [11]:
amino_acid_names = {
  'T': 'Threonine',
  'Q': 'Glutamine',
  'S': 'Serine',
  'N': 'Asparagine',
  'F': 'Phenylalanine',
  'M': 'Methionine',
  'L': 'Leucine',
  'V': 'Valine',
  'W': 'Tryptophan',
  'A': 'Alanine',
  'I': 'Isoleucine',
  'E': 'Glutamic acid',
  'D': 'Aspartic acid',
  'K': 'Lysine',
  'R': 'Arginine',
  'C': 'Cysteine',
  'G': 'Glycine',
  'P': 'Proline',
  'H': 'Histidine',
  'Y': 'Tyrosine'
}
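
A lookup that falls back to `'unknown'` for any letter missing from the dictionary (such as `X`) can be written with `dict.get`. The helper name `full_name` and the two-entry sample table below are hypothetical; the notebook itself indexes the dictionary directly and only treats `X` as unknown.

```python
# Abbreviated copy of the lookup table, for illustration only.
amino_acid_names_sample = {
    'A': 'Alanine',
    'V': 'Valine',
}


def full_name(letter, names=amino_acid_names_sample):
    """Return the full amino acid name, or 'unknown' for unlisted letters."""
    return names.get(letter, 'unknown')


print(full_name('A'))
print(full_name('X'))
```

Using `get` avoids a `KeyError` if a pickle file contains a code that is not in the table.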

### Write CSV Rows
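
The cell in this section splits the color and type descriptions of each mutation on `' to '`. Here is a minimal sketch of that split, assuming a description is either a single word such as `same` or two values joined by `' to '`. The helper name `split_change` and the sample strings are hypothetical.

```python
def split_change(description):
    """Split 'a to b' into ('a', 'b'); return (None, None) otherwise."""
    if ' to ' in description:
        before, after = description.split(' to ', 1)
        return before, after
    return None, None


print(split_change('red to blue'))
print(split_change('same'))
```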

In [12]:
filepath = os.path.join(csvs_path, csv_file)
with open(filepath, 'w', newline='') as f:
  writer = csv.writer(f)
  writer.writerow(col_headers)
  for file_name in os.listdir(pickles_path):
    # Only process pickle files.
    if not file_name.endswith('.pickle'):
      continue
    strain = file_name[:-len('.pickle')]
    if strain in strains_to_skip:
      continue
    # Read every object stored in this strain's pickle file.
    pickle_filepath = os.path.join(pickles_path, file_name)
    objects = []
    with open(pickle_filepath, 'rb') as openfile:
      while True:
        try:
          objects.append(pickle.load(openfile))
        except EOFError:
          break
    # The first object is the list of features compared for this strain.
    comparison = objects[0]
    for feature in comparison:
      print('Writing mutation rows for {} feature: {}'.format(strain, feature.get('product')))
      # Each mutant is indexed as: 0 position, 1 original letter, 2 new letter,
      # 3 color change, 4 type change.
      for mutant in feature['mutants']:
        # Skip mutations where either amino acid is unknown ('X').
        if 'X' in (mutant[1], mutant[2]):
          continue
        org = amino_acid_names[mutant[1]]
        new = amino_acid_names[mutant[2]]
        # Color and type changes are written as 'old to new' when they differ.
        org_color = None
        new_color = None
        if ' to ' in mutant[3]:
          org_color, new_color = mutant[3].split(' to ')
        org_type = None
        new_type = None
        if ' to ' in mutant[4]:
          org_type, new_type = mutant[4].split(' to ')
        # Values must follow the same order as col_headers.
        row = [
          strain,
          feature.get('protein_id'),
          feature.get('product'),
          feature.get('gene'),
          feature.get('type'),
          feature.get('start'),
          feature.get('end'),
          mutant[0],
          org,
          new,
          mutant[3] == 'same',
          org_color,
          new_color,
          mutant[4] == 'conservative',
          org_type,
          new_type
        ]
        writer.writerow(row)

print('\nGenerating CSV complete')

Writing mutation rows for MW243586.1 feature: leader protein
Writing mutation rows for MW243586.1 feature: nsp2
Writing mutation rows for MW243586.1 feature: nsp3
Writing mutation rows for MW243586.1 feature: nsp4
Writing mutation rows for MW243586.1 feature: 3C-like proteinase
Writing mutation rows for MW243586.1 feature: nsp6
Writing mutation rows for MW243586.1 feature: nsp7
Writing mutation rows for MW243586.1 feature: nsp8
Writing mutation rows for MW243586.1 feature: nsp9
Writing mutation rows for MW243586.1 feature: nsp10
Writing mutation rows for MW243586.1 feature: helicase
Writing mutation rows for MW243586.1 feature: 3'-to-5' exonuclease
Writing mutation rows for MW243586.1 feature: endoRNAse
Writing mutation rows for MW243586.1 feature: 2'-O-ribose methyltransferase
Writing mutation rows for MW243586.1 feature: ORF1a polyprotein
Writing mutation rows for MW243586.1 feature: nsp11
Writing mutation rows for MW243586.1 feature: surface glycoprotein
Writing mutation rows for MW