# Nucleotide percentage plots

This notebook contains code to process the "Nucleotide_percentage_table" file from the CRISPResso output into a file containing the percentage of each nucleotide at each target position, numbered relative to the sgRNA. It also produces a plot showing the same information (as in Figure S4A, D, G, J).
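
To illustrate what "numbered relative to the sgRNA" can mean, here is a minimal sketch. It is not the implementation in `notebook_functions`; the helper name `relative_positions` and the numbering convention (first guide base is position 1, the base just upstream is 0) are assumptions made for this example, and reverse-complement guides are not handled.

```python
def relative_positions(amplicon, guide):
    """Number each amplicon base relative to the first base of the guide.

    Assumption (not taken from the notebook): the first guide base is
    position 1, the base immediately upstream is 0, then -1, and so on.
    """
    start = amplicon.find(guide)
    if start == -1:
        raise ValueError("guide sequence not found in amplicon")
    return [i - start + 1 for i in range(len(amplicon))]


# Toy amplicon: three flanking bases, a guide sequence, two trailing bases
guide = "GCTCCTCCATGGCAGTGACC"
amplicon = "AAA" + guide + "TT"
positions = relative_positions(amplicon, guide)
print(positions[:5])
```

Under this convention, an x-axis window such as -25 to 25 would select the bases within that range of the guide start.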

In [1]:
import sys
sys.path.append('../scripts/')
import pandas as pd 
import numpy as np 
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from os import path
from pathlib import Path
import notebook_functions as nb

#import aa_guideseq_visualization as guideseq
mpl.rc('pdf', fonttype=42)
mpl.rcParams['font.sans-serif'] = "Arial"
mpl.rcParams['font.family'] = "sans-serif"


## User inputs

<font color='blue'> Please follow steps indicated in blue, then run the notebook to generate output files. If the files are formatted as described in the documentation, the code in the 'Functions' section should not need to be altered. </font> 

**Metainformation file** 

<font color='blue'> <b>Step 1:</b> Please enter the filepath to the metainformation input file used in the allele frequencies notebook. The same file will be used in this notebook. </font> 

In [4]:
global input_file
# input_filepath = '../../Metainfo_input_CBE_TP53_first_last_codon.csv'
input_filepath = input("Please enter input filepath here: ")
input_file = pd.read_csv(input_filepath)

input_file

Please enter input filepath here: ../../AudreyData/TP53/Metainfo_input_ABE_TP53_fixed_updated_1d_sample.csv


Unnamed: 0,sg,sgRNA_sequence,translation_ref_seq,BEV_start,BEV_end,primer,frame,first_codon,last_codon,rev_com,BEV_ref,BEV_test
0,1d,GCTCCTCCATGGCAGTGACC,[TTCCTCTTGCAGCAGCCAGACTGCCTTCCGGGTCACTGCC]ATGG...,417,426,F3_R2,1,ATG,CTG,True,417;418,425;426


<font color='blue'><b>Step 2:</b> Enter filepath to folder containing CRISPResso output files here. Please make sure that the filepath ends in a '/'.  </font> 

Please note that, within the given folder, each folder containing the CRISPResso output files for an individual sample should be named in the format 'CRISPResso_on_'+bev+'\_'+primer, where bev = ('BEV' OR 'NGBEV') + sample_number and primer = primer name. 
Ex. <font color='grey'>CRISPResso_on</font><font color='purple'>_BEV_001</font><font color='green'>_F2_R2</font>
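
As a quick check of the naming convention, the sketch below builds a folder name from its parts. The helper `crispresso_folder_name` is hypothetical (it is not part of `notebook_functions`), and it assumes an underscore separates the 'BEV'/'NGBEV' prefix from the sample number, as in the example above.

```python
def crispresso_folder_name(bev_string_id, sample_number, primer):
    """Hypothetical helper: build the expected CRISPResso output folder name."""
    if bev_string_id not in ("BEV", "NGBEV"):
        raise ValueError("bev_string_id must be 'BEV' or 'NGBEV'")
    # Assumption: prefix and sample number are joined by an underscore
    return f"CRISPResso_on_{bev_string_id}_{sample_number}_{primer}"


name = crispresso_folder_name("BEV", "001", "F2_R2")
print(name)
```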

In [5]:
global bev_string_id
bev_string_id = input('Please enter either \'BEV\' or \'NGBEV\' to indicate which string is used when naming your CRISPResso files.')
# Reject anything other than the two supported naming prefixes
if bev_string_id not in ('BEV', 'NGBEV'):
    raise ValueError('Invalid input. Please enter either \'BEV\' or \'NGBEV\' to specify which string is used in CRISPResso file names. Be careful not to add any extra spaces.')


Please enter either 'BEV' or 'NGBEV' to indicate which string is used when naming your CRISPResso files.NGBEV


In [6]:
global CRISPResso_filepath
# CRISPResso_filepath= "AudreyData/CRISPRessoBatch_on_F2R1_batch_file_v2/"
CRISPResso_filepath = input("Please enter CRISPResso filepath here: ")
CRISPResso_filepath = nb.check_folder_filepath(CRISPResso_filepath)
print(CRISPResso_filepath)


Please enter CRISPResso filepath here: ../../AudreyData/TP53/TP53_ABE_Sample_CRISPResso/
../../AudreyData/TP53/TP53_ABE_Sample_CRISPResso/


<font color='blue'><b>Step 3:</b> Enter filepath to folder where the files generated by this notebook will be stored. Please make sure that the filepath ends in a '/'. If the folders in this file path do not currently exist, they will be created when the notebook is run.  </font> 

In [7]:
global output_filepath
# output_filepath = 'AudreyData/Validation_CRISPResso_results/'
output_filepath = input("Please enter output folder filepath here: ")
output_filepath = nb.check_folder_filepath(output_filepath)
print(output_filepath)

Please enter output folder filepath here: ../../AudreyData/TP53/ABE/
../../AudreyData/TP53/ABE/
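
For readers who want to see the folder handling described in Step 3 spelled out, here is a minimal sketch. The helper `ensure_folder` is hypothetical and only mirrors the behaviour described above (trailing '/', folders created if missing); it is not the code of `nb.check_folder_filepath`.

```python
import os
import tempfile


def ensure_folder(folder_path):
    """Hypothetical helper: add a trailing '/' if missing and create the folder."""
    if not folder_path.endswith('/'):
        folder_path += '/'
    # exist_ok=True makes the call safe when the folder already exists
    os.makedirs(folder_path, exist_ok=True)
    return folder_path


# Demonstrate inside a temporary directory so nothing is left behind in the project
base = tempfile.mkdtemp()
out = ensure_folder(os.path.join(base, 'TP53', 'ABE'))
print(out.endswith('/'), os.path.isdir(out))
```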


<font color='blue'><b>Step 4:</b> Please select the type of base editor (BE) used in the samples in the input file. Then, click on the next cell to continue. </font>

In [8]:
global be_type

be_type_input = input("Please specify the type of base editor used in the input samples by entering 'A' for A base editor or 'C' for C base editor: ")


Please specify the type of base editor used in the input samples by entering 'A' for A base editor or 'C' for C base editor: A


In [9]:
# Make sure a valid base editor type was entered
if be_type_input not in ('A', 'C'):
    raise ValueError('Invalid input. Please enter either A or C to specify base editor.')

# 'A' -> 'ABE', 'C' -> 'CBE'
be_type = be_type_input + 'BE'


<font color='blue'> <b>Ready to run functions!</b> </font>

In [10]:
# Build input_df with columns guide_seq, BEV, primer, type, left_lim, right_lim, width, height
input_df = pd.DataFrame()

#check for NaN values i.e. blank rows
if input_file.isnull().values.any(): 
    input_file = nb.clean_input_file(input_file)

allele_freq_input_df = input_file


BEV_test_df = allele_freq_input_df[['sgRNA_sequence', 'BEV_test', 'primer']].copy()
BEV_test_df['type'] = 'test'
BEV_test_df = BEV_test_df.copy().rename(columns = {'sgRNA_sequence': 'guide_seq', 'BEV_test':'BEV'})

BEV_ref_df = allele_freq_input_df[['sgRNA_sequence', 'BEV_ref', 'primer']].copy()
BEV_ref_df['type'] = 'ref'
BEV_ref_df = BEV_ref_df.copy().rename(columns = {'sgRNA_sequence': 'guide_seq', 'BEV_ref':'BEV'})

input_df = pd.concat([BEV_ref_df, BEV_test_df]).reset_index(drop=True)

# Set x-axis limits
input_df['left_lim'] = -25
input_df['right_lim'] = 25

# Set plot dimensions
input_df['width'] = 6
input_df['height'] = 6

input_df

Unnamed: 0,guide_seq,BEV,primer,type,left_lim,right_lim,width,height
0,GCTCCTCCATGGCAGTGACC,417;418,F3_R2,ref,-25,25,6,6
1,GCTCCTCCATGGCAGTGACC,425;426,F3_R2,test,-25,25,6,6
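
The ref/test stacking in the cell above could also be expressed with `DataFrame.melt`. The sketch below is an alternative under stated assumptions, not the notebook's method: `meta` is a one-row stand-in for the metainformation file (values copied from the sample output), and `long_df` is an illustrative name.

```python
import pandas as pd

# One-row stand-in for the metainformation file
meta = pd.DataFrame({
    'sgRNA_sequence': ['GCTCCTCCATGGCAGTGACC'],
    'primer': ['F3_R2'],
    'BEV_ref': ['417;418'],
    'BEV_test': ['425;426'],
})

# Stack BEV_ref and BEV_test into one 'BEV' column, recording the source in 'type'
long_df = meta.melt(id_vars=['sgRNA_sequence', 'primer'],
                    value_vars=['BEV_ref', 'BEV_test'],
                    var_name='type', value_name='BEV')
long_df['type'] = long_df['type'].str.replace('BEV_', '', regex=False)
long_df = long_df.rename(columns={'sgRNA_sequence': 'guide_seq'})
print(long_df)
```

The column order differs from the concatenation approach, but the rows carry the same guide, sample, primer, and type information.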


In [13]:
# Process each row: build the nucleotide percentage table, then plot it
for i, row in input_df.iterrows():
    print(row['BEV'])
    # 'BEV' holds semicolon-separated sample numbers, e.g. '417;418'
    bev_list = row['BEV'].split(';')
    # Joined sample numbers, e.g. '417_418', serve as the output name
    output_name = '_'.join(bev_list)
    bev_df = nb.get_bev_df(bev_list, output_name, row['guide_seq'], be_type, CRISPResso_filepath,
                           bev_string_id, output_filepath)
    nb.make_nuc_per_plot(bev_df, row['left_lim'], row['right_lim'], bev_list, row['width'], row['height'],
                         be_type, output_name, output_filepath)

417;418
GCTCCTCCATGGCAGTGACC
GCTCCTCCATGGCAGTGACC
425;426
GCTCCTCCATGGCAGTGACC
GCTCCTCCATGGCAGTGACC
