# <b>Data Diagnostics I </b> *✲ﾟ*｡✧٩(･ิᴗ･ิ๑)۶*✲ﾟ*｡✧

In this notebook we explore min-max and percentile normalization between datasets, as well as derivatives, and look at how our data change as a result, e.g. the distribution of each variable and the principal components.

In [None]:
import helper_functions as hf
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.preprocessing import RobustScaler

imputed_dataframe = pd.read_hdf("imputed_dataframe_0602.h5")
annotations = imputed_dataframe["state"]

In [None]:
px.scatter(imputed_dataframe, y="SMDVR", color=imputed_dataframe["state"], color_continuous_scale='viridis', marginal_y='histogram').show()

### Behaviour state annotations

In [None]:
# converting the behaviour state annotations to discrete values
dataframe = imputed_dataframe.copy()
#annotations = imputed_dataframe["state"]

conditions = [
    ((annotations) == 1.0),
    ((annotations)  == 2.0),
    ((annotations)  == 3.0),
    ((annotations)  == 4.0)
]

values = ['forward', 'reversal', 'sustained reversal', 'turn']

# replace values based on conditions, 1 = forward, 2 = reversal (rise), 3 = reversal (sustained), 4 = turn
transformed_annotations = np.select(conditions, values)

dataframe["state"] = hf.determine_turn(dataframe, transformed_annotations) # to determine whether a turn is a dorsal or a ventral turn
dataframe.to_hdf("imputed_dataframe_0602.h5", key="data")

In [None]:
px.scatter(dataframe, y="AVAR", color=dataframe["state"], color_continuous_scale='viridis', marginal_y='histogram').show()

## Check PC space

In [None]:
#let's look at the PCA space + the explained variance of the data
pca = hf.PCA(n_components=3)
raw_pca = pd.DataFrame(pca.fit_transform(dataframe.loc[:, ~dataframe.columns.isin(["state","dataset"])]))
raw_pca["state"] = dataframe["state"]
hf.plot_PCs(raw_pca, variances=pca.explained_variance_ratio_*100)


### 7 components

In [None]:
# fit a 7-component PCA to look at the explained variance of each component
pca7 = hf.PCA(n_components=7)
pca7.fit_transform(dataframe.loc[:, ~dataframe.columns.isin(["state","dataset"])])

In [None]:
pd.DataFrame(pca7.explained_variance_ratio_*100, columns=['Explained Variance']).to_hdf('explained_variance_2903.h5', key='unpreprocessed')

In [None]:
fig = px.bar(y=pca7.explained_variance_ratio_*100,  x=[i+1 for i in range(7)], text_auto='.2s',labels={"x":"PC","y":"explained variance (%)"}, height=400)
fig.update_layout(title='Explained Variance of PCA of Unpreprocessed Data', showlegend=False)
fig.show()

### Resampling (or Up-/Downsampling)
Our datasets have different sizes, so we have to resample them to a common length. Most recordings range from 3200 to 3780 time points, as we can see in the figure below, but there is one dataset with 4146 and one with 5450 time points. 8 datasets have exactly 3529 time points. We will therefore down- or upsample every dataset to this number via linear interpolation (new points are placed on the straight line between two neighbouring data points), as implemented in numpy.
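
As a rough illustration of the interpolation step, here is a minimal sketch for a single numeric trace. It assumes `np.interp` is the numpy routine meant above; `resample_trace` is a hypothetical helper written for this example and is not the code inside `hf.resample`.

```python
import numpy as np

def resample_trace(trace, target_length):
    """Linearly interpolate a 1-D trace onto `target_length` evenly spaced points."""
    trace = np.asarray(trace, dtype=float)
    # place old and new samples on the same normalized time axis [0, 1]
    old_positions = np.linspace(0.0, 1.0, num=len(trace))
    new_positions = np.linspace(0.0, 1.0, num=target_length)
    return np.interp(new_positions, old_positions, trace)

# toy traces, not our recordings
upsampled = resample_trace(np.arange(5.0), 9)          # 5 -> 9 points
downsampled = resample_trace(np.arange(5450.0), 3529)  # 5450 -> 3529 points
```

The first and last samples are preserved and everything in between is read off the straight lines connecting neighbouring points. This only makes sense for numeric traces; a label column such as `state` would need a different treatment.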

In [None]:
hf.visualize_fps(dataframe, title="frame rate of each dataset", xlabel="dataset", ylabel="frame rate", coloring="tab:red")

In [None]:
frames_num = 3529

# resample all dataframes to the same length of 3529 frames
resampled_dataframe = hf.resample(dataframe, list(dataframe.groupby('dataset').size().values), frames_num)

In [None]:
px.scatter(resampled_dataframe, y="PVR", color=resampled_dataframe["state"], color_continuous_scale='viridis', marginal_y='histogram').show()

### Truncation

We noticed some edge effects in the data, i.e. the first and last 100 time points are not very reliable. We will therefore truncate the data to the middle 3329 time points.
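
A minimal sketch of such a per-dataset truncation, assuming the datasets are stacked in one dataframe with a `dataset` column as in this notebook. `truncate_per_dataset` is a hypothetical stand-in for illustration; the actual `hf.truncate` may work differently.

```python
import pandas as pd

def truncate_per_dataset(df, n=100, group_column="dataset"):
    """Drop the first and last `n` rows of every group in `group_column`."""
    # position of each row within its own dataset: 0, 1, 2, ...
    position = df.groupby(group_column).cumcount()
    # number of rows of the dataset each row belongs to
    size = df.groupby(group_column)[group_column].transform("size")
    keep = (position >= n) & (position < size - n)
    return df[keep].reset_index(drop=True)

# toy data: two datasets with 6 time points each
toy = pd.DataFrame({"dataset": [0] * 6 + [1] * 6, "AVAR": range(12)})
trimmed = truncate_per_dataset(toy, n=2)
```

With 3529 time points per dataset and `n=100`, the same logic leaves 3529 - 2 * 100 = 3329 rows per dataset.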

In [None]:
# truncating with a default n of 100, i.e. we remove the first 100 and the last 100 observations from each dataset
truncated_dataframe = hf.truncate(resampled_dataframe, n=100) 

In [None]:
truncated_dataframe.to_hdf("truncated_dataframe_0602.h5", key="df")

### Normalization between datasets

As seen above, we have to deal with different scales across datasets, so a natural next step is to normalize the data to make the datasets comparable. One option is min-max normalization between datasets: we take the minimum and maximum value of each variable across all datasets and then rescale each dataset to this range. This will be done on the time derivatives of the resampled data.
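
To make the min-max idea concrete, here is a small sketch on made-up values. `min_max_across_datasets` is a hypothetical helper for this example only (it is not part of `helper_functions`); it computes $x' = (x - \min) / (\max - \min)$ per variable, with the minimum and maximum taken over all datasets together.

```python
import pandas as pd

def min_max_across_datasets(df, exclude=("state", "dataset")):
    """Rescale every signal column to [0, 1] using its global min and max."""
    signal_columns = [c for c in df.columns if c not in exclude]
    signals = df[signal_columns]
    col_min, col_max = signals.min(), signals.max()  # pooled over all datasets
    out = df.copy()
    out[signal_columns] = (signals - col_min) / (col_max - col_min)
    return out

# toy data: dataset 1 lives on a larger scale than dataset 0
toy = pd.DataFrame({"dataset": [0, 0, 1, 1], "AVAR": [0.0, 5.0, 10.0, 20.0]})
scaled = min_max_across_datasets(toy)
```

Note that a single extreme value in any dataset sets the range for all of them, and that a constant column would produce a division by zero.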

We can also try percentile normalization between datasets: we take the 5th and 95th percentile of each variable across all datasets and then rescale each dataset to this range. As with min-max normalization, this will be done on the time derivatives of the resampled data.
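
A percentile-based variant can be sketched the same way. Again, `percentile_normalize` is a hypothetical helper and the values are made up; the only change to the min-max formula is that the 5th and 95th percentiles take the place of the minimum and maximum, so a few extreme samples no longer define the range.

```python
import numpy as np
import pandas as pd

def percentile_normalize(df, lower=5, upper=95, exclude=("state", "dataset")):
    """Map the `lower`/`upper` percentiles of every signal column to 0 and 1."""
    signal_columns = [c for c in df.columns if c not in exclude]
    signals = df[signal_columns]
    p_low = signals.quantile(lower / 100)   # pooled over all datasets
    p_high = signals.quantile(upper / 100)
    out = df.copy()
    out[signal_columns] = (signals - p_low) / (p_high - p_low)
    return out

# toy data: values 0..100, so the 5th and 95th percentiles are 5 and 95
toy = pd.DataFrame({"dataset": 0, "AVAR": np.arange(101.0)})
normalized = percentile_normalize(toy)
```

Values outside the two percentiles end up below 0 or above 1 instead of being clipped.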

### Based on Quantiles: RobustScaler
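
Before applying the scaler to our data, a quick toy check of what it does. With `with_centering=False`, `RobustScaler` does not shift the data; it only divides each column by the spread between the two percentiles given in `quantile_range`. The values below are made up (0 to 100) and only serve to show this.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.arange(101.0).reshape(-1, 1)  # one toy feature with values 0..100

toy_scaler = RobustScaler(with_centering=False, with_scaling=True, quantile_range=(5, 99))
X_scaled = toy_scaler.fit_transform(X)

# spread between the 99th and 5th percentile of the toy feature
expected_scale = np.percentile(X, 99) - np.percentile(X, 5)
print(toy_scaler.scale_, expected_scale)
```

Because nothing is subtracted, zero stays at zero and the sign of every value is preserved.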

In [None]:
scaler = RobustScaler(with_centering=False, with_scaling=True, quantile_range=(5, 99))
scaler2 = RobustScaler(with_centering=False, with_scaling=True, quantile_range=(5, 99), unit_variance=True)

# normalize per dataset
quartiled_separate = hf.normalize_per_dataset(truncated_dataframe.loc[:, ~truncated_dataframe.columns.isin(["state", "dataset"])], truncated_dataframe.groupby("dataset").size().values, scaler)

# normalize across datasets 
quartiled_data = pd.DataFrame(scaler2.fit_transform(quartiled_separate), columns = quartiled_separate.columns)

In [None]:
quartiled_data["state"] = truncated_dataframe["state"]
quartiled_data["dataset"] = truncated_dataframe["dataset"]

In [None]:
column = "SMD"
px.scatter(quartiled_data, y=column, color=quartiled_data["state"], color_continuous_scale='viridis', marginal_y='histogram', hover_name="state", hover_data=["dataset", quartiled_data.index, column]).show()

In [None]:
px.scatter(quartiled_data, y="PVR", color="state", color_continuous_scale='viridis', marginal_y='histogram', hover_name="dataset", hover_data=["state", quartiled_data.index, column]).show()

## Normalization with 20% quantile subtraction

In [None]:

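def quantile_subtract(X, q=0.20):
    """Subtract the q-quantile of each signal column (all but "state"/"dataset") from that column."""
    X = X.copy()  # do not mutate the group handed over by groupby.apply
    signal_columns = X.columns[~X.columns.isin(["state", "dataset"])]
    # the column-wise quantiles are aligned on column names and subtracted from every row
    X[signal_columns] = X[signal_columns] - X[signal_columns].quantile(q)
    return X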

In [None]:
quartiled_data_copy = quartiled_data.copy()
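# group_keys=False keeps the original row index so that it still lines up with quartiled_data and truncated_dataframe
quartiled_data_copy = quartiled_data_copy.groupby("dataset", group_keys=False).apply(quantile_subtract)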

In [None]:
px.scatter(quartiled_data_copy, y="AVAR", color="state", color_continuous_scale='viridis', marginal_y='histogram', hover_name="dataset", hover_data=["state", quartiled_data.index, column]).show()

In [None]:
quartiled_data_copy.to_hdf("quartiled_data_0602.h5", key="df")

### PCA on normalized data
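
The PC traces in this section are smoothed with a moving average before plotting. A minimal sketch on a made-up constant signal shows what `np.convolve(..., mode='same')` does with a box kernel: the output keeps the length of the input, and the first and last few samples are pulled towards zero because the window runs over the zero-padded edges.

```python
import numpy as np

window_size = 10
kernel = np.ones(window_size) / window_size  # box kernel that sums to 1

signal = np.ones(50)  # toy signal, constant at 1
smoothed = np.convolve(signal, kernel, mode="same")

print(smoothed[:6])     # edge samples are below 1
print(smoothed[20:23])  # interior samples stay at 1
```

For our data this means the first and last handful of smoothed PC values are slightly attenuated, which only matters for visualization.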

In [None]:
pca_quartile = hf.PCA(n_components=3)
imputed_pc_quartile = pd.DataFrame(pca_quartile.fit_transform(quartiled_data_copy.loc[:, ~quartiled_data_copy.columns.isin(["state",'dataset'])]))

window_size = 10

# Applying a 10-sample sliding average for smoother visualizations!
for pc in range(3):
    imputed_pc_quartile[pc] = np.convolve(imputed_pc_quartile[pc], np.ones(window_size)/window_size, mode='same')

imputed_pc_quartile['state'] = truncated_dataframe["state"]
hf.plot_PCs(imputed_pc_quartile,imputed_pc_quartile['state'],'PCA_quartiled_0202.html')

# Appendix I: PCA weights

In [None]:
quartiled_transposed_dataframe = quartiled_data_copy.loc[:,~quartiled_data_copy.columns.isin(["state","dataset"])].T
n_components = 3

pca_all_splits = hf.get_LLO_PCAs(quartiled_transposed_dataframe, n_components=n_components)

fig = hf.make_subplots(rows=1, cols=1, shared_xaxes=True, y_title= "PCA weights", vertical_spacing=0.05)

variable_name = f"pca{n_components}_all_splits"
concatenated_pca = pd.concat(pca_all_splits[variable_name], axis=0)
concatenated_pca.sort_values(by=['Mode 3'], inplace=True)
fig.add_trace(hf.go.Box(
    x=concatenated_pca['neuron'],
    y=concatenated_pca['Mode 3'],
    boxpoints=False,
    name='Mode 3'
), row=1, col=1)


# Update layout
fig.update_layout(
    title_text="PC 3 weights for all neurons",
    height=600
)

# Show the figure
fig.show()
fig.write_html("PCA_neuron_weights.html")

# Appendix II: Grid Plots

In [None]:
%%capture

# we plot all the resampled traces 
saving_path="..\\plots\\23Jan\\resampled_plots\\"

hf.plot_from_stacked_imputed(length_dict, resampled_dataframe, resampled_dataframe, saving_path)

In [None]:
%%capture
# we plot all the truncated traces
saving_path="..\\plots\\23Jan\\truncated_plots\\"

hf.plot_from_stacked_imputed(length_dict, truncated_dataframe, truncated_dataframe, saving_path)

In [None]:
%%capture
import matplotlib.pyplot as plt

for column in quartiled_data_copy.columns:
    # plot the trace of one neuron across all datasets and save the plot
    fig, ax = plt.subplots(figsize=(40, 10))
    ax.plot(quartiled_data_copy[column], color="tab:blue")
    ax.set_ylabel(column)
    ax.set_xlabel("time")
    ax.set_title(column + " across all datasets")
    fig.savefig("..\\plots\\06Feb\\all_traces_normalized\\normalized_" + column + "_alldatasets.png")
    plt.close(fig)  # release the figure once it has been saved


In [None]:
%%capture
%matplotlib widget
saving_path="..\\plots\\23Jan\\normalized_5_95\\"

hf.plot_from_stacked_imputed(length_dict, quartiled_data, quartiled_data, saving_path)


In [None]:
%%capture
%matplotlib widget
saving_path="..\\plots\\23Jan\\normalized_on_truncated\\"

hf.plot_from_stacked_imputed(length_dict, quartiled_data, truncated_dataframe, saving_path)