# <b>Data Exploration </b> *✲ﾟ*｡✧٩(･ิᴗ･ิ๑)۶*✲ﾟ*｡✧

## Intro

<b>What data?</b> <br> We are working with the immobilized whole-brain imaging data from Kerem and Rebecca. The datasets include only control recordings, meaning that each worm is wild type and has no modification apart from the GFP. The data, originally stored in separate wbstruct.mat files, have been converted to a dictionary of pandas dataframes (stored in a pickle file for easier handling, and also as separate h5 and csv files for easier sharing).

<b>What kind of exploration?</b> <br> 
We want to understand the individual datasets and get a feeling for the problems we might face, in particular whether we need to do some processing before we start with the analysis. For now, we especially want to answer the following questions and make the corresponding decisions:
- Are all neurons in each dataset IDed? (If no, remove)
- How many neurons are IDed in each dataset?
- How many times is each neuron IDed in total? (If too few, remove or impute)
- Are neurons with few IDs in total unique? (If no, impute. Else remove)
- How do we deal with missing IDs of non-unique neurons? Which imputation method do we use?

In [None]:
import helper_functions as hf
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# 1. Quantifications 


We will first look at the number of IDs per neuron in all datasets and the number of IDs per dataset.
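
A minimal sketch of what such a count could look like, assuming that a neuron counts as IDed in a dataset when the dataframe has a column with its name. The function `count_ids_sketch` and the toy neuron columns and values are hypothetical; this is not the implementation of `hf.count_IDs`.

```python
from collections import Counter

import pandas as pd


def count_ids_sketch(frames):
    """Count IDs per neuron (across datasets) and per dataset.

    Assumption: every column of a dataframe is one IDed neuron.
    """
    ids_per_neuron = Counter()
    ids_per_set = {}
    for name, df in frames.items():
        ids_per_neuron.update(df.columns)  # +1 for every dataset containing the neuron
        ids_per_set[name] = len(df.columns)
    return dict(ids_per_neuron), ids_per_set


# toy data with made-up values, only to illustrate the bookkeeping
toy_frames = {
    "dataset_1": pd.DataFrame({"AVAL": [0.1, 0.2], "RMER": [0.3, 0.4]}),
    "dataset_2": pd.DataFrame({"AVAL": [0.5, 0.6], "SMDVL": [0.7, 0.8]}),
}
toy_per_neuron, toy_per_set = count_ids_sketch(toy_frames)
print(toy_per_neuron)
print(toy_per_set)
```

In the real dataframes, non-neuron columns such as `state` would have to be removed before counting.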

In [None]:
# load dataframe dictionaries
dataframes=hf.wbstruct_dataframes.loading_pkl('dataframes_rebecca_2301.pkl')
dataframes_kerem=hf.wbstruct_dataframes.loading_pkl('dataframes_kerem_0602.pkl') # data loaded with wbstruct_converter

# merging Rebecca's and Kerem's dataframe dictionaries
dataframes.update(dataframes_kerem)
hf.wbstruct_dataframes.saving_as_pkl(dataframes, 'dataframes_0602.pkl')

In [None]:
# collect the state annotations of all datasets and remove them from the dataframes
annotations = []
for k, v in dataframes.items():
    annotations.append(v["state"].values)
    v.drop("state", axis=1, inplace=True)  # in place, so the dictionary entry is updated as well
annotations = np.concatenate(annotations)

In [None]:
stacked_dataframe = hf.pd.concat([df for df in dataframes.values()], ignore_index=True)
threshold = 10

# we drop all neurons that were IDed fewer than `threshold` (10) times
all_IDed_neurons, IDs_per_set = hf.count_IDs(dataframes) # count how many times each neuron was IDed
stacked_dataframe = stacked_dataframe.drop(columns=[neuron for neuron in stacked_dataframe.columns if all_IDed_neurons[neuron] < threshold])

In [None]:
# adding which dataset each observation belongs to
new_col = []
for key, value in dataframes.items():
    new_col.extend([key] * len(value))
stacked_dataframe['dataset'] = new_col

#### Let's visualize this information with a cute matplotlib plot 

In [None]:
hf.visualize_IDs(IDs_per_set, title="Plot of datasets and number of IDs", xlabel="dataset", ylabel="IDs", coloring="tab:green")

In [None]:
# plot how many times each neuron was IDed across all datasets
fig, ax = hf.visualize_IDs(all_IDed_neurons, title="Plot of names of neurons and number of times IDed",xlabel="neuron ID",ylabel="count in all datasets",display_all_values=True)
hf.plt.show()


Most neurons are IDed more than 16 times, while some neurons are IDed fewer than 10 times. Considering that we have 25 datasets, imputing a neuron that is missing from most of them might not be a good idea, because we would impute more recordings than actually exist. We could set a threshold and only impute IDs that are present in at least 10 datasets. This would reduce the number of IDs to impute to 25.
But before we make the cut, we want to know how unique each neuron is. Suppose neuron RMER, which is rarely IDed, is not very unique across the datasets, i.e. its activity is well explained by the other neurons. Then we could impute its recording in all datasets where it is missing. But if RMER is very unique, the imputation might lead to a completely wrong result.
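
The thresholding step can be sketched as follows, assuming the datasets are stacked with `pd.concat` so that neurons missing from a dataset show up as NaN. The helper `keep_frequent_neurons` and the toy counts are hypothetical and only illustrate the idea.

```python
import pandas as pd


def keep_frequent_neurons(stacked, id_counts, threshold):
    """Drop the columns of neurons that were IDed fewer than `threshold` times."""
    rare = [col for col in stacked.columns if id_counts.get(col, 0) < threshold]
    return stacked.drop(columns=rare)


# stacking two toy datasets: RMER is missing in the second one and becomes NaN there
toy_stacked = pd.concat(
    [
        pd.DataFrame({"AVAL": [0.1, 0.2], "RMER": [0.3, 0.4]}),
        pd.DataFrame({"AVAL": [0.5, 0.6]}),
    ],
    ignore_index=True,
)
toy_counts = {"AVAL": 2, "RMER": 1}  # made-up ID counts
toy_filtered = keep_frequent_neurons(toy_stacked, toy_counts, threshold=2)
print(list(toy_filtered.columns))
```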

# 2. ID imputation with PPCA

Probabilistic PCA (PPCA) aims to estimate the principal axes of a dataset through maximum likelihood estimation of the parameters of a latent variable model. It is a probabilistic formulation of PCA that is more numerically stable and allows for missing data. With PPCA we can impute missing neuronal activities in some of our datasets.
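
To give an intuition for how a low-rank model can fill in missing entries, here is a rough sketch of iterative PCA-style imputation with plain NumPy. It is a simpler stand-in, not PPCA itself and not what `hf.utils_imputation` does: missing values are initialised with column means and then repeatedly replaced by a rank-`n_components` reconstruction. The function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, n_iter=200, tol=1e-8):
    """Fill NaNs by alternating between a truncated SVD fit and re-imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # start from column means of the observed values
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        low_rank = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
        # observed entries stay untouched, only missing ones are updated
        updated = np.where(missing, low_rank, X)
        converged = np.max(np.abs(updated - filled)) < tol
        filled = updated
        if converged:
            break
    return filled


# synthetic rank-2 data with roughly 10% of the entries removed at random
rng = np.random.default_rng(0)
complete = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6))
mask = rng.random(complete.shape) < 0.1
observed = complete.copy()
observed[mask] = np.nan

imputed = iterative_pca_impute(observed, n_components=2)
mean_filled = np.where(mask, np.nanmean(observed, axis=0), observed)
pca_error = np.abs(imputed[mask] - complete[mask]).mean()
mean_error = np.abs(mean_filled[mask] - complete[mask]).mean()
print(f"low-rank imputation error: {pca_error:.4f}, mean imputation error: {mean_error:.4f}")
```

Like PPCA, this approach only works well when the variables are close to linearly related, which is why the uniqueness check in Appendix I matters.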

In [None]:
stacked_dataframe_copy = stacked_dataframe.copy()
stacked_dataframe_copy = stacked_dataframe_copy.drop(columns="dataset")

In [None]:
# the data is imputed with PPCA
imputed_dataframe_og = hf.utils_imputation.impute_missing_values_in_dataframe(stacked_dataframe_copy)

In [None]:
imputed_dataframe_og["state"] = annotations
imputed_dataframe_og["dataset"] = stacked_dataframe["dataset"]
imputed_dataframe_og.to_hdf("imputed_dataframe_0602.h5", key="imputed_dataframe_0602")
imputed_dataframe_og.head()

# Appendix I: "Uniqueness" of neurons

So we want to know two things: how unique each neuron is across all datasets, and whether it makes sense to use PPCA for data imputation, considering that it assumes that the variables can be modelled linearly. 
We will fit a least-squares regression model d × n times (d = number of datasets, n = number of neurons), where in each round y is a single neuron and X is all neurons except y. The aim is to see how well a neuron can be explained by all the other neurons. We can get a rough understanding of this by looking at the R<sup>2</sup> value of each least-squares model. 

### R-Squared R<sup>2</sup>
The R<sup>2</sup> measures the proportion of a neuron's variance (spread) that is explained by all the other neurons. R<sup>2</sup> ranges from 0 to 1, where 1 indicates that all of the variance is explained by the other neurons and 0 indicates that none of it is. In our case, a high R<sup>2</sup>-value means that the neuron is not very unique and we can impute its missing IDs. A low R<sup>2</sup>-value means that the neuron is very unique and we should not impute its missing IDs. We will look at the average R<sup>2</sup> across all datasets. <br>
Runtime: ~ 3 minutes
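
A minimal sketch of this leave-one-neuron-out regression for a single dataset, assuming an ordinary least-squares fit with an intercept and R<sup>2</sup> computed on the same data. The function `r2_per_neuron` and the synthetic columns are hypothetical; `hf.get_R2_predictions` may work differently.

```python
import numpy as np
import pandas as pd


def r2_per_neuron(df):
    """Regress every column on all other columns and return the in-sample R2."""
    scores = {}
    for target in df.columns:
        y = df[target].to_numpy(dtype=float)
        X = df.drop(columns=target).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(X)), X])  # add an intercept term
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        ss_res = np.sum((y - X @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        scores[target] = 1.0 - ss_res / ss_tot
    return scores


# synthetic example: n3 is a linear mix of n1 and n2, n4 is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({
    "n1": a,
    "n2": b,
    "n3": 2 * a - b + 0.05 * rng.normal(size=500),
    "n4": rng.normal(size=500),
})
toy_r2 = r2_per_neuron(toy)
print({name: round(value, 3) for name, value in toy_r2.items()})
```

Here `n3` plays the role of a neuron that is not unique, while `n4` is unique in the sense used above.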

In [None]:
avg_r2, predictions, importances, raw_data = hf.get_R2_predictions(dataframes, all_IDed_neurons)

In [None]:
hf.visualize_IDs(avg_r2, title="Average R2 values", xlabel="neuron ID", ylabel="R2", coloring="tab:blue")

In [None]:
min_value = min(avg_r2.values(), key=lambda x: abs(x - 0.7)) # this finds the observed R2 value closest to 0.7, which serves as the cutoff

percent = hf.find_percent(avg_r2.values(), min_value) # this finds the percentage of neurons whose average R2 value is at least that cutoff
print("{:.2f}% of neurons have an average R2 value of at least {:.2f}".format(percent, min_value))

Given that more than 67 percent of the neurons have an average R<sup>2</sup>-value of at least 0.7, we can say that most neurons are not very unique. This means that we can impute the missing IDs of most neurons, and that PPCA is a reasonable choice for this task.
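
The percentage itself is a simple share of values above a cutoff. A small sketch with made-up R<sup>2</sup> values, where `share_at_least` is a hypothetical helper and not `hf.find_percent`:

```python
def share_at_least(values, cutoff):
    """Return the percentage of values that are greater than or equal to `cutoff`."""
    values = list(values)
    return 100.0 * sum(v >= cutoff for v in values) / len(values)


toy_avg_r2 = {"n1": 0.9, "n2": 0.8, "n3": 0.5, "n4": 0.71}  # made-up numbers
toy_share = share_at_least(toy_avg_r2.values(), 0.7)
print(f"{toy_share:.2f}% of neurons have an average R2 value of at least 0.70")
```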

# Appendix II: Variable p-values of the linear regression model

We want to know which neurons were important for modelling each of the neurons in the datasets. To find out, we look at the p-values of each linear regression model. The p-value indicates the significance of each predictor variable in the model of a given neuron.

Calculation of p-value: 

In [None]:
# we take the mean VIP score of each neuron, sort them and take the top 5
for key, value in importances.items():
    for neuron, scores in value.items():  # `scores` avoids shadowing the built-in `list`
        importances[key][neuron] = np.mean(scores)
    importances[key] = sorted(importances[key].items(), key=lambda item: item[1], reverse=True)[:5]

In [None]:
# we plot the top 5 neurons with the highest variable importances per neuron
fig, axes = plt.subplots(5, 6, figsize=(18, 9), sharey=True)
fig.suptitle('Top 5 neurons with highest Variable Importances per neuron')
plt.subplots_adjust(hspace = 2)
axes = axes.flatten()
count = 0
palette = iter(sns.husl_palette(30))
for i in importances.keys():
    
    keys = [neuron[0] for neuron in importances[i]]
    vals = [neuron[1] for neuron in importances[i]]
    sns.barplot(ax = axes[count], x=keys, y=vals, dodge=False, color=next(palette))
    axes[count].set_title(i)
    count = count + 1
    
    # start a new figure once all 30 subplots of the current grid are filled
    if count == len(axes):
        plt.show()
        fig, axes = plt.subplots(5, 6, figsize=(18, 9), sharey=True)
        fig.suptitle('Top 5 neurons with highest Variable Importances per neuron')
        plt.subplots_adjust(hspace = 2)
        axes = axes.flatten()
        count = 0
        palette = iter(sns.husl_palette(30))
    


# Appendix III: Saving plots

### Modelled neuron activity against true activity

In [None]:
%%capture
%matplotlib widget
# we don't want to output all plots, just save them

delta_path="..\\plots\\23Jan\\delta_plots\\"
model_path="..\\plots\\23Jan\\modelled_plots\\"

plot_kwargs = {'alpha': 0.7}

hf.plot_from_single_imputed(raw_data, predictions, delta_path, model_path, plot_kwargs=plot_kwargs)

### Imputed neuron activity against existing true activity
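
Because the imputation runs on the stacked dataframe, the imputed values have to be split back into the individual datasets before they can be compared per dataset. A minimal sketch of such an unstacking step, assuming the rows were stacked in the same order as the entries of the length dictionary. The helper `unstack_by_lengths` is hypothetical and not part of `hf`.

```python
import pandas as pd


def unstack_by_lengths(stacked, lengths):
    """Split a stacked dataframe into consecutive blocks of the given lengths."""
    if sum(lengths.values()) != len(stacked):
        raise ValueError("lengths do not add up to the number of stacked rows")
    frames = {}
    start = 0
    for name, n_rows in lengths.items():
        frames[name] = stacked.iloc[start:start + n_rows].reset_index(drop=True)
        start += n_rows
    return frames


# toy example: 2 rows from the first dataset, 3 rows from the second
toy_stacked_frame = pd.DataFrame({"AVAL": [0.1, 0.2, 0.3, 0.4, 0.5]})
toy_lengths = {"dataset_1": 2, "dataset_2": 3}
toy_unstacked = unstack_by_lengths(toy_stacked_frame, toy_lengths)
print({name: len(frame) for name, frame in toy_unstacked.items()})
```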

In [None]:
%%capture 

# we will save all dataframe keys and their lengths in a dictionary for the unstacking part
length_dict = {key: len(value) for key, value in dataframes.items()}

saving_path="..\\plots\\23Jan\\imputed_plots\\"

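# `imputed_dataframe` is never defined in this notebook; we assume the PPCA result `imputed_dataframe_og` is meant
hf.plot_from_stacked_imputed(length_dict, imputed_dataframe_og, stacked_dataframe, saving_path)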