# Generator

## Evan Huang, Swinburne Lab, UC Berkeley

This program helps organize DNA sequencing runs by generating two .csv files: one containing the data for every sample, built from each sample's metadata, and a renaming file meant to be used in conjunction with the Sequencing File Renamer.

In [24]:
import csv
import ipywidgets as widgets

The generate_csv() function generates the organizational .csv file that contains all data for each sample. You can put any metadata that applies to all samples, such as the date of sequencing, into the 'base' variable. Samples are organized by genes, amplicons, oligos, and clones. Each gene is split according to which amplicons were amplified out by PCR. For each amplicon, there is a set of sequencing oligos used for Sanger sequencing. Finally, you may add multiple clones for each sample, for example if you had multiple DNA templates. The examples at the end of the notebook show how to format the inputs, and the docstring gives a more technical description.

In [64]:
def generate_csv(base, genes, amp, oligos, clones, output_path, append=False): 
    """    
    Generates a csv at output_path to organize sequencing prep based on inputs.
    
    Args: 
        base: string of metadata shared by all samples (ex: date of genotyping, etc.)
        genes: ['gene1', 'gene2',...]
        amp: [['1_2', '3_4'],['5_6', '7_8'], ...] 2D list. len(amp) == len(genes). 
        oligos: [[[1, 2, 3], [4, 5, 6]], [[1, 3, 5], [2, 4, 6]], ...] 
            3D list. len(oligos) == len(amp) == len(genes). 
            len(oligos[i]) == len(amp[i]). 
            oligos[i][j] == list of oligos for amp[i][j] of genes[i]. 
        clones: int for how many clones for each oligo, aka how many templates used. 
        output_path: path of the csv file to write to.
        append: if True, append rows to an existing csv instead of overwriting. Default False. 
        
    Returns:
        list of [index, final file name] pairs, one per row written
    """
    
    assert len(amp) == len(genes)
    assert len(oligos)==len(amp)
    for i in range(len(oligos)):
        assert len(oligos[i]) == len(amp[i])
    
    header = ['base', 'index', 'gene', 'amp', 'oligos', 'clones']
    matrix = []
    index = 1
    for i in range(len(genes)): 
        for j in range(len(amp[i])): 
            for k in range(len(oligos[i][j])):
                for l in range(clones):
                    matrix.append([base, index, genes[i], amp[i][j], oligos[i][j][k], l+1])
                    index += 1
                
    # newline='' stops the csv module from writing blank lines between rows on Windows
    mode = "a" if append else "w"
    with open(output_path, mode, newline='') as csv_file:
        csvwriter = csv.writer(csv_file)
        if not append: 
            csvwriter.writerow(header)
        csvwriter.writerows(matrix)
    
    ret_mat = []
    for row in matrix:
        ret_mat.append([row[1], base+'_'+row[2]+'_amp_'+row[3]+'_seq_'+str(row[4])+'_clone_'+str(row[5])])
    
    return ret_mat
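
Each name in the returned matrix follows a fixed pattern. As a minimal sketch, the helper below reproduces the same string concatenation on its own so you can preview a name before generating any files. The name build_name is hypothetical and is not part of this notebook.

```python
def build_name(base, gene, amp, oligo, clone):
    """Hypothetical helper: mirrors the name pattern built inside generate_csv()."""
    return f"{base}_{gene}_amp_{amp}_seq_{oligo}_clone_{clone}"

# Preview the name of the first sample from Example 1
preview = build_name('Evan_09-20-22_sequencing_easy', 'gene1', '10_35', 1, 1)
print(preview)
```

Because the pieces are joined with underscores, avoiding patterns such as '_amp_' or '_seq_' inside the base string or gene names should keep the final names easy to split apart later.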
            

The write_naming_csv() function will use the returned matrix from generate_csv() to create a naming .csv file that will work with the Sequencing File Renamer. 

In [56]:
def write_naming_csv(csv_path, matrix, append=False): 
    """
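    Writes the naming matrix returned by generate_csv() to csv_path.
    append: if True, append rows to an existing csv instead of overwriting.
    """
    
    # newline='' stops the csv module from writing blank lines between rows on Windows
    with open(csv_path, "a" if append else "w", newline='') as csv_file:
        csvwriter = csv.writer(csv_file)
        if not append:  # when appending, do not repeat the header row
            csvwriter.writerow(["index", "name"])
        csvwriter.writerows(matrix)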
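
If you want to check a naming file after writing it, one option is to read it back into a dictionary. The sketch below is only an illustration: read_naming_csv is a hypothetical helper, the demo rows are made up, and it assumes the file has the "index" and "name" header row shown above. It does not reflect how the Sequencing File Renamer itself reads the file.

```python
import csv
import os
import tempfile

def read_naming_csv(csv_path):
    """Hypothetical helper: return {index: name} from a naming csv."""
    with open(csv_path, newline='') as csv_file:
        return {int(row['index']): row['name'] for row in csv.DictReader(csv_file)}

# Round trip with a temporary file and made-up rows
demo_rows = [[1, 'run_gene1_amp_10_35_seq_1_clone_1'],
             [2, 'run_gene1_amp_10_35_seq_1_clone_2']]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'name_ref.csv')
    with open(demo_path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['index', 'name'])
        writer.writerows(demo_rows)
    names = read_naming_csv(demo_path)

print(names)
```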

# Example Use Cases

## Example 1: Easy
The first example will be a relatively simple one. In this case, there is 1 gene with 1 amplicon being genotyped. The amplicon will be genotyped with 3 different sequencing oligos and 2 templates (clones). We begin by defining paths for our csv files:

In [66]:
sequencing_data_csv_path = 'Test Data/full_data.csv'
naming_csv_path = 'Test Data/name_ref.csv'

Then, we can start defining data for our samples. Note that in this example the $genes$ variable is a list, the $amplicons$ variable is a 2D list, and the $oligos$ variable is a 3D list, even though there is only one gene and one amplicon. This nesting is what lets the generator work for more complex use cases. Also note that the amplicons are labeled as strings with both bounding primers, such as '10_35'. This refers to the amplicon bounded by primers 10 and 35 (the primers that were used in PCR).

In [67]:
base_data = 'Evan_09-20-22_sequencing_easy'
genes = ['gene1']
amplicons = [['10_35']] 
oligos = [[[1, 2, 3]]]
clones = 2
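
A common mistake with these inputs is dropping one level of nesting. As a rough sketch, the hypothetical helper below (check_shapes is not part of this notebook) performs the same length checks as the assert statements in generate_csv(), but raises errors that say which gene is mismatched.

```python
def check_shapes(genes, amplicons, oligos):
    """Hypothetical helper: verify that the nested inputs line up."""
    if len(amplicons) != len(genes):
        raise ValueError(f"{len(genes)} genes but {len(amplicons)} amplicon lists")
    if len(oligos) != len(genes):
        raise ValueError(f"{len(genes)} genes but {len(oligos)} oligo lists")
    for gene, gene_amps, gene_oligos in zip(genes, amplicons, oligos):
        if len(gene_oligos) != len(gene_amps):
            raise ValueError(
                f"{gene}: {len(gene_amps)} amplicons but {len(gene_oligos)} oligo sets"
            )

# Correctly nested inputs from this example pass silently
check_shapes(['gene1'], [['10_35']], [[[1, 2, 3]]])

# Forgetting one level of nesting in oligos is caught
try:
    check_shapes(['gene1'], [['10_35']], [[1, 2, 3]])
except ValueError as err:
    print(err)
```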

Finally, we can run our functions to generate the csv files. Both functions take an optional $append$ argument, which defaults to False. You may set it to True if you would like the data to be appended to the end of the csv files (if they exist) instead of overwriting them. 

In [68]:
naming_data = generate_csv(base_data, genes, amplicons, oligos, clones, sequencing_data_csv_path)
write_naming_csv(naming_csv_path, naming_data)
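
To see the order in which samples are indexed, the sketch below mirrors the nested loops of generate_csv() without writing any files. The function enumerate_samples is a hypothetical stand-in, not part of this notebook. Clones vary fastest, then oligos, then amplicons, then genes.

```python
def enumerate_samples(genes, amplicons, oligos, clones):
    """Hypothetical stand-in: list (index, gene, amp, oligo, clone) in generate_csv() order."""
    rows = []
    index = 1
    for gene, gene_amps, gene_oligos in zip(genes, amplicons, oligos):
        for amp, amp_oligos in zip(gene_amps, gene_oligos):
            for oligo in amp_oligos:
                for clone in range(1, clones + 1):
                    rows.append((index, gene, amp, oligo, clone))
                    index += 1
    return rows

# Example 1: 1 gene x 1 amplicon x 3 oligos x 2 clones = 6 samples
rows = enumerate_samples(['gene1'], [['10_35']], [[[1, 2, 3]]], 2)
for row in rows:
    print(row)
```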

## Example 2: Hard
The next example will be more complex. In this case, there will be 3 genes: gene1, gene2, and gene3, with 1, 2, and 3 amplicons respectively. Each amplicon will have a different set of sequencing oligos, and each oligo will be used with 4 clones. 

In [69]:
sequencing_data_csv_path = 'Test Data/full_data.csv'
naming_csv_path = 'Test Data/name_ref.csv'

base_data = 'Evan_09-20-22_sequencing_hard'
genes = ['gene1', 'gene2', 'gene3']

gene1_amplicons = ['1_2'] # 1 amplicon
gene2_amplicons = ['3_4', '5_6'] # 2 amplicons
gene3_amplicons = ['7_8', '9_10', '11_12'] # 3 amplicons
amplicons = [gene1_amplicons, gene2_amplicons, gene3_amplicons]

gene1_oligos = [[1, 2, 3]] # note this is a 2D list. the length of the outer list should match the number of amplicons
gene2_oligos = [[4, 5, 6], [7, 8, 9]] # 3 oligos for each amplicon of gene2. 
gene3_oligos = [[10, 11, 12], [13, 14, 15], [16, 17]] # can have different numbers of oligos
oligos = [gene1_oligos, gene2_oligos, gene3_oligos] # the final oligos variable will be a 3D list

clones = 4

naming_data = generate_csv(base_data, genes, amplicons, oligos, clones, sequencing_data_csv_path)
write_naming_csv(naming_csv_path, naming_data)

This program allows each gene to have a variable number of amplicons, and each amplicon to have a variable set of oligos, each with some number of clones. The output of Example 2 can be seen in the full_data.csv and name_ref.csv files in the Test Data folder. Because Example 2 writes to the same paths as Example 1 without setting $append$, it overwrites the output of Example 1. 
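
As a quick sanity check on Example 2, you can work out how many samples to expect before opening the csv files: the total number of oligo entries times the number of clones. The helper expected_row_count below is hypothetical and only counts entries; it does not call generate_csv().

```python
def expected_row_count(oligos, clones):
    """Hypothetical helper: number of data rows generate_csv() would produce."""
    oligo_entries = sum(len(amp_oligos) for gene_oligos in oligos for amp_oligos in gene_oligos)
    return oligo_entries * clones

example2_oligos = [
    [[1, 2, 3]],                               # gene1: 3 oligo entries
    [[4, 5, 6], [7, 8, 9]],                    # gene2: 6 oligo entries
    [[10, 11, 12], [13, 14, 15], [16, 17]],    # gene3: 8 oligo entries
]
total = expected_row_count(example2_oligos, 4)
print(total)  # 17 oligo entries * 4 clones = 68
```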