# P53 - DMS Analysis
#### by Frido Petersen, Dario Prifti, Maximilian Fidlin and Enno Schäfer
*With special thanks to our Co-Worker, inspiration and beloved friend: Chat-GPT*

In [None]:
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import seaborn as sns
import data_exploration as de
import data_cleanup as dc
import functions as fun
import Documentation as doc
import severity_score as ses
import matplotlib.pyplot as plt

In [None]:
# These are all the datasets we used
gia_null_eto: pd.DataFrame = pd.read_csv('../DMS_data/P53_HUMAN_Giacomelli_NULL_Etoposide_2018.csv')
gia_null_nut: pd.DataFrame = pd.read_csv('../DMS_data/P53_HUMAN_Giacomelli_NULL_Nutlin_2018.csv')
gia_wt_nut: pd.DataFrame = pd.read_csv('../DMS_data/P53_HUMAN_Giacomelli_WT_Nutlin_2018.csv')
kot_hum: pd.DataFrame = pd.read_csv('../DMS_data/P53_HUMAN_Kotler_2018.csv')

aa = pd.read_csv('../DMS_data/aminoacids.csv')

## Comparability of p53 Datasets
#### Finding similarities and differences in the 4 datasets on p53

In [None]:
# Giacomelli Null Etoposide, Giacomelli Wildtype Nutlin, Giacomelli Null Nutlin
fun.mult_hmap(doc.gia_null_eto_norm, doc.gia_wt_nut_norm, doc.gia_null_nut_norm)

In [None]:
# Kotler
fun.hmap(doc.kot_hum_norm_amp)

In [None]:
# The amino acids in the original sequence that, when replaced, caused the most negative DMS scores across the whole protein.
fun.calculate_average_dms_score_old(('GNE', doc.gia_null_eto_norm_amp), ('GNN', doc.gia_null_nut_norm_amp), ('GWN', doc.gia_wt_nut_norm_amp), ('KH', doc.kot_hum_norm_amp))

In [None]:
# The amino acids that, when mutated to, caused the largest decreases in the DMS score, indicating a substantial impact on protein function.
fun.calculate_average_dms_score_new(('GNE', doc.gia_null_eto_norm_amp), ('GNN', doc.gia_null_nut_norm_amp), ('GWN', doc.gia_wt_nut_norm_amp), ('KH', doc.kot_hum_norm_amp))

In [None]:
# To make these comparisons clearer, we used a heatmap to illustrate the substitution trends. The x-axis shows the DMS scores of the AAs being replaced and the y-axis the DMS scores of the AAs they are replaced with.
fun.hmap_mean_variance(doc.mean_substitutionsGNE)

In [None]:
# The 5 positions with the lowest mean DMS scores in the "Giacomelli null etoposide" dataset
doc.GNELV

In [None]:
# The 5 positions with the highest mean DMS scores in the "Giacomelli null etoposide" dataset
doc.GNEHV

In [None]:
# The positions most affected by mutation in each dataset:
doc.lowest_vals.head(20)

In [None]:
# As a final way to show the differences between the datasets, we plot them as line graphs in a single figure.
# The plot shows the mean DMS score for each position, so positions that are strongly affected by mutation can be told apart quickly from those that are not.
# The mean is the sum of all DMS scores at a position divided by the number of values summed, which also gives a rough comparability with the Kotler dataset shown in the same graph:
doc.liniengraph(dataframes=[doc.gia_null_eto_z_mmn_df_mean, doc.gia_null_nut_z_mmn_df_mean, doc.gia_wt_nut_z_mmn_df_mean, doc.kot_hum_z_mmn_df_mean])
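
The per-position mean described in the cell above (sum of all DMS scores at a position divided by their count) can be sketched with pandas. This is only an illustration, not the code behind `doc.liniengraph`: it assumes a long-format table with a `mutant` column holding strings such as `M1A` and a `DMS_score` column. The column names and the toy values are assumptions.

```python
import pandas as pd

# Toy long-format table; column names and values are assumptions for illustration.
toy = pd.DataFrame({
    "mutant": ["M1A", "M1C", "E2A", "E2C", "E2D"],
    "DMS_score": [-1.0, -3.0, 0.5, 1.0, 1.5],
})

# Extract the position from strings like "M1A" (wild-type AA, position, new AA).
toy["position"] = (
    toy["mutant"].str.extract(r"^[A-Z](\d+)[A-Z]$", expand=False).astype(int)
)

# Mean per position = sum of the scores at that position / number of scores.
mean_per_position = toy.groupby("position")["DMS_score"].mean()
print(mean_per_position)
```

Because every dataset is reduced to one value per position, curves from datasets with different numbers of measured substitutions can be drawn on the same axis.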

## Data cleanup
#### Preparing the data to enable further analyses

In [None]:
# Min-max normalisation
norm_frame = dc.aufteilung_mut_pos(dc.norm(gia_null_eto))
print("Z-transformation and min-max normalisation of df")
fun.hmap(norm_frame)
print("Position of low and high values of frame")
dc.min_max_val(norm_frame)
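
As a rough sketch of the two steps named in the cell above, a z-transformation followed by min-max normalisation could look as follows. How `dc.norm` is actually implemented is not shown here, so the function names, the use of the population standard deviation and the toy scores are all assumptions.

```python
import numpy as np

def z_transform(values):
    """Centre on the mean and scale by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def min_max(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# Hypothetical raw DMS scores.
scores = np.array([-2.0, 0.0, 1.0, 3.0])
z = z_transform(scores)
scaled = min_max(z)
print(z)
print(scaled)
```

Both steps are linear, so the ranking of the substitutions is unchanged; only the scale differs, which is what makes scores from different experiments easier to put side by side.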

After we finished cleaning our data, we transformed it into a new, more compact format.
In this new data frame the rows correspond to the original AA sequence and the columns represent the exchange with a specific AA (e.g. A). The values are the DMS scores for the respective substitution. The NAs that occur where the old and new AA are identical are set to zero. With this transformed data set, further analyses are easier to perform.
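
The reshaping described above can be sketched with a pandas pivot. This is not the code in `data_cleanup`: the column names `mutant` and `DMS_score`, the helper columns and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy long-format table; column names and values are assumptions.
toy = pd.DataFrame({
    "mutant": ["M1A", "M1C", "E2A", "E2C"],
    "DMS_score": [0.2, 0.4, 0.6, 0.8],
})

# Split "M1A" into wild-type AA, position and new AA.
parts = toy["mutant"].str.extract(r"^(?P<wt>[A-Z])(?P<position>\d+)(?P<new>[A-Z])$")
toy = toy.join(parts)
toy["position"] = toy["position"].astype(int)

# Rows: positions of the original sequence. Columns: the AA substituted in.
matrix = toy.pivot(index="position", columns="new", values="DMS_score")

# Add columns for the wild-type AAs and set "same AA" cells to zero.
wt_by_position = toy.drop_duplicates("position").set_index("position")["wt"]
matrix = matrix.reindex(columns=sorted(set(matrix.columns) | set(wt_by_position)))
for position, wt in wt_by_position.items():
    matrix.loc[position, wt] = 0.0

print(matrix)
```

Only the cells where the old and new AA coincide are zeroed in this sketch; substitutions that were never measured stay NaN, so they are not mistaken for neutral mutations.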

Max: We tried to obtain patient data, but we did not receive any.

## Data exploration
#### First, we calculated the distance and mean substitution matrices. With these DataFrames, we performed PCA, clustering and plotted the results. To understand the code and see additional plots, take a look at the "data_exploration" python package.

In [None]:
# calculate feature matrices
feature_matrix_aa = dc.clean_aa(aa)
feature_matrix_p53 = dc.rmv_na(dc.df_transform(norm_frame))

In [None]:
# calculate distance matrices as well as mean substitution matrices
dist_chem = de.aa_distance_matrix(aa)

dist_wt_p53 = de.dms_distance_matrix_wt(norm_frame)
dist_mut_p53 = de.dms_distance_matrix_mutated(norm_frame)

mean_subs_wt_p53 = dc.rmv_na(de.mean_substitutions(norm_frame))
mean_subs_mut_p53 = dc.rmv_na(de.mean_substitutions(norm_frame).T)

In [None]:
# hierarchical ward clustering
de.plot_hier_clust(dist_chem, title = "AAs chemical properties")
print("---------------------------------")
de.plot_hier_clust(dist_wt_p53, title = "p53 distance matrix of WT AAs")
de.plot_hier_clust(dist_mut_p53, title = "p53 distance matrix of mutated AAs")
print("---------------------------------")
de.plot_hier_clust(mean_subs_wt_p53, title = "p53 mean substitutions for WT AAs")
de.plot_hier_clust(mean_subs_mut_p53, title = "p53 mean substitutions for mutated AAs")
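
For readers unfamiliar with Ward clustering, the following is a minimal sketch using SciPy. It is not the implementation of `de.plot_hier_clust`, whose handling of its input is not shown in this notebook; the toy feature matrix and the choice of two clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy feature matrix: six items forming two well-separated groups (assumed data).
features = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [0.2, 0.1],
    [5.0, 5.1],
    [5.1, 5.0],
    [5.2, 5.1],
])

# Ward linkage merges, at each step, the two clusters whose union gives the
# smallest increase in within-cluster variance.
Z = linkage(features, method="ward")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` is also what `scipy.cluster.hierarchy.dendrogram` expects if a tree plot is wanted.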

In [None]:
# determine optimal amount of clusters
clusters_by_sil_chem = de.determine_clusters_silhouette(feature_matrix_aa)

clusters_by_sil_p53 = de.determine_clusters_silhouette(feature_matrix_p53)


print (clusters_by_sil_chem)
print ("---")
print (clusters_by_sil_p53)
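
A silhouette-based choice of the cluster number can be sketched with scikit-learn as below. This is an illustration under assumptions, not the body of `de.determine_clusters_silhouette`: the helper name `best_k_by_silhouette`, the search range and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(features, k_min=2, k_max=6):
    """Return the k with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in range(k_min, k_max + 1):
        model = AgglomerativeClustering(n_clusters=k, linkage="ward")
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return max(scores, key=scores.get), scores

# Synthetic data: three tight groups around assumed centres.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(features)
print(best_k, scores)
```

The silhouette score is only defined for at least two clusters and fewer clusters than samples, so the search range has to stay within those bounds.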

In [None]:
# Plotting, after pca and hierarchical clustering. clusters are defined by the colors shown in the legend.
de.pca_hierarchical_plot(dist_chem, optimal_num_cluster= clusters_by_sil_chem, title = "AAs clustered by chemical properties", show_var=True)

de.pca_hierarchical_plot(dist_wt_p53, optimal_num_cluster=clusters_by_sil_p53, title = "p53 clustered by distance matrix of WT AAs")
de.pca_hierarchical_plot(dist_mut_p53, optimal_num_cluster=clusters_by_sil_p53, title = "p53 clustered by distance matrix of mutated AAs", show_var=True)

de.pca_hierarchical_plot(mean_subs_wt_p53, optimal_num_cluster=clusters_by_sil_p53, title = "p53 clustered by mean substitutions of WT AAs")
de.pca_hierarchical_plot(mean_subs_mut_p53, optimal_num_cluster=clusters_by_sil_p53, title = "p53 clustered by mean substitutions of mutated AAs")

#------------------------------------------------------------------
# NOTE: These plots render poorly in this notebook and we do not know why. For better plots, look at the de.clustering_pca_plotting.ipynb file
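
The projection step behind such plots can be sketched with plain NumPy. This is a generic PCA via singular value decomposition, not the code in `de.pca_hierarchical_plot`; the function name `pca_2d` and the toy matrix are assumptions.

```python
import numpy as np

def pca_2d(features):
    """Project rows onto the first two principal components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                      # centre each column
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = S**2 / np.sum(S**2)             # explained variance ratio
    coords = X @ Vt[:2].T                       # coordinates on PC1 and PC2
    return coords, explained[:2]

# Toy matrix: the first two columns are collinear, the third is small noise.
features = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.1],
    [3.0, 6.0, -0.1],
    [4.0, 8.0, 0.0],
])
coords, explained = pca_2d(features)
print(coords)
print(explained)
```

The explained variance ratio corresponds to what a `show_var`-style option would typically report; cluster labels from a separate clustering step can then be used to colour the 2D points.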

##### *Outlook and additional information for Data Exploration*

TO MAKE THESE PLOTS FOR EACH DOMAIN: Take a look at the de.pca_hierarchical_plotting_domains.ipynb file

TO SEE THE PLOTS WITH K-MEANS: To quantify the effects of the clustering method used (here: hierarchical clustering), we also performed clustering with kmeans to compare the results in the report. Take a look at the de.pca_kmeans_plotting.ipynb file

TO PROVE OUR CODE RUNS ON ALL DATASETS: We ran our code on the Stiffler dataset on E. coli β-lactamase. To see the plots, take a look at the pca_hierarchical_plotting_ßlactamase.ipynb file

## Domain comparison
#### Comparing Clusterings of substitutions in the context of specific protein domains

For the comparison of the domains, we first cut our data into smaller chunks, one per domain. We then applied different types of analyses to those domains, all of which can be seen in the domain_comparison folder. We repeated those steps with only the amino acids reachable by a single base mutation (SMU) and with amino acids belonging to random codons. We then compared the complete dataset to the SMU dataset domain by domain. The random mutations were also compared to the SMU dataset.
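
Which amino acids count as reachable by a single base mutation can be worked out from the standard genetic code. The sketch below is an illustration only and not the code in the domain_comparison folder; the helper name `reachable_by_single_base_change` is hypothetical, and stop codons and synonymous changes are excluded by assumption.

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reachable_by_single_base_change(codon):
    """Amino acids obtainable by changing exactly one base of the codon."""
    original = CODON_TABLE[codon]
    reachable = set()
    for i, base in product(range(3), BASES):
        if base == codon[i]:
            continue
        mutated = codon[:i] + base + codon[i + 1:]
        aa = CODON_TABLE[mutated]
        if aa not in ("*", original):   # skip stop codons and synonymous changes
            reachable.add(aa)
    return reachable

print(sorted(reachable_by_single_base_change("TGG")))  # tryptophan codon
```

Each codon has nine single-base neighbours, so at most nine of the other 19 amino acids can be reached, which is why the SMU dataset is a strict subset of the complete one.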

In [None]:
# All DMS scores divided onto the different domains
all_dms = plt.imread('../domain_comparison/data/all_dms.png')

fig, ax = plt.subplots(figsize=(20, 20))
plt.imshow(all_dms)
plt.axis('off')
plt.show()

In [None]:
# Single mutations only
smu_dms = plt.imread('../domain_comparison/data/smu_dms.png')
fig1, ax1 = plt.subplots(figsize=(20, 20))
plt.imshow(smu_dms)
plt.axis('off')
plt.show()

In [None]:
#Comparing DNA binding domain
all_vs_smu = plt.imread('../domain_comparison/data/all_vs_smu.png')
fig2, ax2 = plt.subplots(figsize=(20, 20))
plt.imshow(all_vs_smu)
plt.axis('off')
plt.show()

## Calculating severity scores
#### Matching DMS_scores with the mutation probability (only for single mutations)
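
To make the idea of matching DMS scores with a mutation probability concrete, here is a small sketch. It is not the implementation of `ses.dms_smut`: it assumes that all nine single-base changes of a codon are equally likely (a simplification), and the function name, the example codon and the DMS scores are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the usual compact TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def single_mutation_probabilities(codon):
    """Probability of each outcome, assuming all 9 single-base changes are equally likely."""
    counts = Counter()
    for i, base in product(range(3), BASES):
        if base != codon[i]:
            counts[CODON_TABLE[codon[:i] + base + codon[i + 1:]]] += 1
    total = sum(counts.values())  # always 9 for a three-base codon
    return {aa: n / total for aa, n in counts.items()}

probs = single_mutation_probabilities("CGC")  # an arginine codon

# Hypothetical DMS scores for substitutions at this position.
dms_scores = {"H": -1.8, "C": -1.2, "G": -0.9}

# Pair each reachable substitution with its probability and its DMS score.
matched = {aa: (probs[aa], dms_scores[aa]) for aa in dms_scores if aa in probs}
print(matched)
```

Substitutions that need two or three base changes get no entry in `probs`, which is consistent with restricting the severity score to single mutations.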


In [None]:
severity_score_p53 = ses.dms_smut(ses.p53_codons_gia, gia_null_eto, bias_dms=False, include_original=True)
severity_score_p53.head(20)

In [None]:
severity_score_p53.compare(dc.df_split(norm_frame), keep_equal=True, keep_shape=True, result_names=('smut', 'dms'))