# epPCR run for the main manuscript

#### Command to run docker:

If you seek to replicate these results, please use LevSeq version 1.2.4.

The data needed to repeat these experiments is available on [Zenodo](https://zenodo.org/records/13694463).

Once you have downloaded the data from Zenodo, place ParLQ-ep1 and ParLQ-ep2 in separate folders. Then, for each one, `cd` into the directory containing the data and run the corresponding command below:


#### LevSeq command for ep1
```
docker run --rm -v "$(pwd):/levseq_results" yueminglong/levseq:levseq-1.1.1-arm64 /levseq_results/parLQ_20240421 /levseq_results/20240421 /levseq_results/20240421-YL-ParLQ-ep1.csv
```
#### LevSeq command for ep2
```
docker run --rm -v "$(pwd):/levseq_results" levseq /levseq_results/parLQ_20240502 /levseq_results/20240502 /levseq_results/20240502-YL-ParLQ-ep2.csv
```

#### Description

1. `"$(pwd):/levseq_results"` mounts the current directory into the container at `/levseq_results`. We expect you to run the command from the folder where you downloaded the raw data from Zenodo; otherwise, replace `$(pwd)` with the full path to that folder.
2. `yueminglong/levseq:levseq-1.1.1-arm64` is the docker image to pull. It is the arm64 build, which I used because I ran this on an M1 Mac. Use just `levseq` if you are running a local image.
3. `/levseq_results/parLQ_20240502` is the name of the folder that the data will be output into (items 3 to 5 show the ep2 values).
4. `/levseq_results/2024050` is the name of the folder with the data.
5. `/levseq_results/20240502-YL-ParLQ-ep2.csv` is the name of the csv reference file.
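
The pieces described above can also be assembled programmatically. The following is a minimal sketch and not part of LevSeq: `build_levseq_command` is a hypothetical helper that mounts an explicit data directory instead of relying on `$(pwd)`, the host path is a placeholder, and the remaining arguments mirror the ep1 command above.

```python
import shlex
from pathlib import PurePosixPath

# Path inside the container, as used in the commands above.
MOUNT_POINT = PurePosixPath("/levseq_results")


def build_levseq_command(data_dir, image, output_name, reads_dir, reference_csv):
    """Hypothetical helper: build the `docker run` argument list described above."""
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:{MOUNT_POINT}",   # host data directory -> container mount
        image,                                # docker image name or tag
        str(MOUNT_POINT / output_name),       # folder the results are written to
        str(MOUNT_POINT / reads_dir),         # folder with the raw data
        str(MOUNT_POINT / reference_csv),     # csv reference file
    ]


cmd = build_levseq_command(
    data_dir="/path/to/ParLQ-ep1",  # placeholder: wherever the Zenodo data was unpacked
    image="yueminglong/levseq:levseq-1.1.1-arm64",
    output_name="parLQ_20240421",
    reads_dir="20240421",
    reference_csv="20240421-YL-ParLQ-ep1.csv",
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` if you prefer to launch the container from Python.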



## Imports

For the manuscript we used LevSeq version 1.2.4.

In [1]:
import pandas as pd  # used below; imported explicitly rather than relying on the star import
from levseq import *
%load_ext autoreload
%autoreload 2
from levseq.seqfit import process_plate_files, gen_seqfitvis, normalise_calculate_stats
import warnings
warnings.filterwarnings('ignore')


# Load and process the plates

These were run on two different dates, four plates in one run and then six plates in another run.

In [2]:
# Relative path, matching the one used when the same file is read for plotting further down
processed_plate_df, seqfit_path = process_plate_files(product=["cis", "trans"], input_csv="../../data/epPCR/epPCR_main_manuscript/ParLQ-ep1/parLQ_20240421.csv")
processed_plate_ep_1_df = processed_plate_df.copy()
processed_plate_ep_1_df['Plate'] = [f'ep1_{p}' for p in processed_plate_ep_1_df['Plate'].values]

In [3]:
# Relative path, matching the one used when the same file is read for plotting further down
processed_plate_df, seqfit_path = process_plate_files(product=["cis", "trans"], input_csv="../../data/epPCR/epPCR_main_manuscript/ParLQ-ep2/parLQ_20240502.csv")
processed_plate_df['Plate'] = [f'ep2_{p}' for p in processed_plate_df['Plate'].values]
processed_plate_df = pd.concat([processed_plate_df, processed_plate_ep_1_df])
processed_plate_df

In [4]:
# Since they were run on different days, add in IDs to make life easier
processed_plate_df['id'] = [f'{p}_{w}' for p, w in processed_plate_df[['Plate', 'Well']].values]
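
To see why the run prefix matters, here is a small, self-contained sketch. The plate and well names are made up for illustration and are not taken from the real data: without the prefix, wells from the two runs would share an identifier.

```python
from collections import Counter

# Toy records standing in for rows from the two runs (values are made up).
rows = [
    {"run": "ep1", "Plate": "plate1", "Well": "A1"},
    {"run": "ep1", "Plate": "plate1", "Well": "A2"},
    {"run": "ep2", "Plate": "plate1", "Well": "A1"},  # same plate/well name, different run
]

# Without the run prefix the identifier is ambiguous.
naive_ids = [f"{r['Plate']}_{r['Well']}" for r in rows]

# With the prefix (mirroring the ep1_/ep2_ plate renaming above) every id is unique.
prefixed_ids = [f"{r['run']}_{r['Plate']}_{r['Well']}" for r in rows]

duplicates = [i for i, n in Counter(naive_ids).items() if n > 1]
print(duplicates)
print(len(set(prefixed_ids)) == len(rows))
```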

## Calculate statistics for cis and trans and then combine the results

In [5]:
parent = '#PARENT#'
value_columns = ['cis']
normalise = 'standard' # one of parent, standard, minmax, none
stats_method = 'mannwhitneyu'

cis_stats_df = normalise_calculate_stats(processed_plate_df, value_columns, normalise=normalise, stats_method=stats_method, parent_label=parent)
cis_stats_df = cis_stats_df.sort_values(by='amount greater than parent mean', ascending=False)


In [6]:
parent = '#PARENT#'
value_columns = ['trans']
normalise = 'standard' # one of parent, standard, minmax, none
stats_method = 'mannwhitneyu'

trans_stats_df = normalise_calculate_stats(processed_plate_df, value_columns, normalise=normalise, stats_method=stats_method, parent_label=parent)
trans_stats_df = trans_stats_df.sort_values(by='amount greater than parent mean', ascending=False)
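
As a rough illustration of what these settings mean, the sketch below z-scores a set of made-up values and computes a Mann-Whitney U statistic by direct pair counting. It shows the general technique only and is not LevSeq's implementation: whether `normalise='standard'` corresponds exactly to a z-score, and how `normalise_calculate_stats` groups wells, is not shown in this notebook. The helper names and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def mann_whitney_u(sample, reference):
    """U statistic for `sample`: pairs where sample > reference count 1, ties count 0.5."""
    return sum(
        1.0 if s > r else 0.5 if s == r else 0.0
        for s in sample
        for r in reference
    )


# Made-up values: three parent wells followed by three wells of one variant.
plate_values = [1.0, 1.2, 0.9, 2.0, 2.2, 1.9]
scaled = z_score(plate_values)
parent_scaled, variant_scaled = scaled[:3], scaled[3:]

# Every variant well exceeds every parent well, so U is 3 * 3 = 9.
u_stat = mann_whitney_u(variant_scaled, parent_scaled)
print(u_stat)
```

For real analyses, `scipy.stats.mannwhitneyu` returns a p-value alongside the statistic.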


In [7]:
trans_stats_df.set_index('amino-acid_substitutions', inplace=True)
cis_stats_df.set_index('amino-acid_substitutions', inplace=True)

stats_df = trans_stats_df.join(cis_stats_df, on='amino-acid_substitutions', how='inner', lsuffix='_trans', rsuffix='_cis')
stats_df

# Plot the figures

Read in the output from LevSeq. These are the visualization files `parLQ_20240421.csv` and `parLQ_20240502.csv`.

We combine these files into a single dataframe.

In [8]:
df_ep1 = pd.read_csv("../../data/epPCR/epPCR_main_manuscript/ParLQ-ep1/parLQ_20240421.csv", index_col=0)
df_ep2 = pd.read_csv("../../data/epPCR/epPCR_main_manuscript/ParLQ-ep2/parLQ_20240502.csv", index_col=0)

df_ep2['Plate'] = [f'ep2_{p}' for p in df_ep2['Plate'].values]
df_ep1['Plate'] = [f'ep1_{p}' for p in df_ep1['Plate'].values]

df = pd.concat([df_ep2, df_ep1])
# Make an id since the two different dates used some of the same plates
df['id'] = [f'{p}_{w}' for p, w in df[['Plate', 'Well']].values]

df['# Mutations'] = [len(str(m).split('_')) if m not in ['#N.A.#', '#PARENT#', '-', '#LOW#'] else 0 for m in df['amino-acid_substitutions'].values]
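
The counting rule in the last line can be checked on individual labels. This is a minimal sketch of the same logic as a standalone function; `count_mutations` is a hypothetical name, and the double-substitution label is made up by joining two substitutions that appear later in this notebook.

```python
# Labels that do not describe substitutions and therefore count as zero mutations.
NON_VARIANT_LABELS = {"#N.A.#", "#PARENT#", "-", "#LOW#"}


def count_mutations(label):
    """Count substitutions in a label such as 'F70L_F89S'; special labels count as 0."""
    if label in NON_VARIANT_LABELS:
        return 0
    # Substitutions are joined with underscores, so splitting gives one entry each.
    return len(str(label).split("_"))


print(count_mutations("F70L"))
print(count_mutations("F70L_F89S"))
print(count_mutations("#PARENT#"))
```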

In [9]:
df['amino-acid_substitutions'].value_counts()

In [10]:
df[['Alignment Count', 'Average mutation frequency', '# Mutations']].describe()

In [11]:
df['# Mutations'].value_counts()

In [12]:
df[df['# Mutations'] == 0]

In [13]:
df.set_index('id', inplace=True)
processed_plate_df.set_index('id', inplace=True)
df = df.join(processed_plate_df, rsuffix='_processed_plate_df', how='outer')
df

In [25]:
df.to_csv('sequence_function_paired_data_ParLQ.csv')

In [14]:
# Now classify each of them with the specific labels
df['trunc_label'] = ['Truncated' if '*' in str(v) else 'OK' for v in df['aa_sequence'].values]  # str() guards against missing sequences
df['Type'] = [m if '*' not in str(v) else '#TRUNCATED#' for m, v in df[['amino-acid_substitutions', 'aa_sequence']].values]

na_df = df[df['Type'] == '#N.A.#']
trunc_df = df[df['Type'] == '#TRUNCATED#']
deletion_df = df[df['amino-acid_substitutions'] == '-'] # Deletion
parent_df = df[df['Type'] == '#PARENT#']
variant_df = df[~df['Type'].isin(['#PARENT#', '#N.A.#', '#TRUNCATED#', '-'])]
u.dp(['Number of frame shifts: ', len(na_df), 
      '\nNumber of truncations: ', len(trunc_df), 
      '\nNumber of parents: ',  len(parent_df), 
      '\nNumber of variants: ',  len(variant_df)
     ])


In [15]:
df['Type'] = [v if v != '-' else '#DELETION#' for v in df['Type'].values]
df['Type'] = [v if v[0] == '#' else '#VARIANT#' for v in df['Type'].values]
df['Type'] = [v if v != '#DELETION#' else 'Deletion' for v in df['Type'].values]
df['Type'] = [v if v != '#VARIANT#' else 'Variant' for v in df['Type'].values]
df['Type'] = [v if v != '#PARENT#' else 'Parent' for v in df['Type'].values]
df['Type'] = [v if v != '#TRUNCATED#' else 'Truncated' for v in df['Type'].values]
df['Type'] = [v if v != '#LOW#' else 'Low' for v in df['Type'].values]
df['Type'] = [v if v != '#N.A.#' else 'Empty' for v in df['Type'].values]

df['Type'].value_counts()
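
The labelling in the two cells above can be summarised as a single function applied to one well at a time, which makes the rules easy to check on toy inputs. This is a sketch of the same logic and not a replacement for the cells; `classify_well` is a hypothetical name and the sequences are made up.

```python
# Special labels from the LevSeq output and the display names used in the plots.
SPECIAL_LABELS = {"-": "Deletion", "#PARENT#": "Parent", "#LOW#": "Low", "#N.A.#": "Empty"}


def classify_well(substitutions, aa_sequence):
    """Return the display label for one well, mirroring the cells above."""
    if "*" in str(aa_sequence):              # stop codon inside the translated sequence
        return "Truncated"
    if substitutions in SPECIAL_LABELS:
        return SPECIAL_LABELS[substitutions]
    if str(substitutions).startswith("#"):   # any other '#...#' label is kept unchanged
        return substitutions
    return "Variant"                          # anything else is a list of substitutions


print(classify_well("F70L", "MKVLA"))
print(classify_well("F70L", "MK*LA"))
print(classify_well("#PARENT#", "MKFLA"))
print(classify_well("-", "MKLA"))
```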


In [16]:
cols = ['Alignment Count', 'Average mutation frequency', '# Mutations', 
       'P adj. value', 'cis', 'trans']

In [17]:
na_df[cols].describe()

In [18]:
trunc_df[cols].describe()

In [19]:
deletion_df.describe()

In [20]:
parent_df.describe()

In [21]:
variant_df.describe()

# Join with the processed dataframe for other visualisation

In [22]:
cols = ['Alignment Count', 'Average mutation frequency', '# Mutations', 'P adj. value', 'cis', 'trans']

In [23]:
import seaborn as sns
import matplotlib.pyplot as plt

figure_dir = ''
sns.set(rc={'figure.figsize': (3, 3), 'font.family':  'sans-serif',
                'font.sans-serif': 'Arial', 'font.size': 12}, style='whitegrid')
parent = '#97CA43'
variant = '#3A578F'
deletion = '#6E6E6E'
truncation = '#FCE518'
empty = 'lightgrey'
low = '#A6A7AC'

palette = [empty, low, deletion, truncation, parent, variant]
sns.set_palette(palette)  # assigning to sns.palette has no effect; set_palette sets the default colour cycle

def set_ax_params(ax):
    ax.tick_params(direction='out', length=2, width=1.0)
    ax.tick_params(labelsize=10)
    ax.tick_params(axis='x', which='major', pad=2.0)
    ax.tick_params(axis='y', which='major', pad=2.0)
    
ax = sns.histplot(data=df, x="Alignment Count", hue="Type", bins=20, palette=palette, 
                  hue_order=['Empty', 'Low', 'Deletion', 'Truncated', 'Parent', 'Variant'], multiple="stack")
set_ax_params(ax)
plt.title('Alignment counts')
plt.savefig(f'{figure_dir}Figure1_Bar_AlignmentCounts.svg', bbox_inches='tight')

In [24]:
col = 'trans'
ax = sns.histplot(data=df, x=col, hue="Type", bins=20, palette=palette, hue_order=['Empty', 'Low', 'Deletion', 'Truncated', 'Parent', 'Variant'], multiple="stack")
set_ax_params(ax)
plt.title(f'{col}')
plt.savefig(f'{figure_dir}Figure1_Bar_Trans.svg', bbox_inches='tight')

In [80]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['svg.fonttype'] = 'none'

ax = sns.scatterplot(data=df, x='cis', y='trans', hue='Type', palette=palette, hue_order=['Empty', 'Low', 'Deletion', 'Truncated', 'Parent', 'Variant'],
                     size='# Mutations')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
set_ax_params(ax)
plt.xlim(0, 3.5*10**6)
plt.ylim(0, 3.5*10**6)
ax.yaxis.set_ticks([0, 1*10**6, 2*10**6, 3*10**6])

plt.title('Cis vs Trans')
plt.savefig(f'{figure_dir}Figure1_Scatter_Cis-Trans.svg', bbox_inches='tight')

In [78]:
stats_df[stats_df.index=='F70L']

In [79]:
stats_df[stats_df.index=='F89S']