# Clustering Data Preparation: Ganglion Cells in the Retina

- **Author:** David Felipe
- **Contact:** https://github.com/davidnfu0
- **Last Modification:** January 25, 2024
- **Description:** In this notebook, we will process data for clustering analysis. Our goal is to create DataFrames with varying characteristics to enable different approaches to clustering. These DataFrames will be used in subsequent notebooks for in-depth analysis.

## Introduction

### Importing Libraries


In [1]:
import sys
import umap
import pickle
import pandas as pd
from sklearn.decomposition import PCA


In [2]:
sys.path.append("../../")

In [3]:
from scripts import load_yaml_config, hide_warnings

### Paths and Configuration

In [4]:
configPath = "../../config/"
config = load_yaml_config(configPath + "general_config.yml")

### Loading the Data
This analysis works mainly with three DataFrames. The first contains features obtained from Spike-Triggered Averages (STA), the second contains stimulus data, and the third contains the STA data reduced with Principal Component Analysis (PCA). All three files must already be pre-processed before this notebook is run.

In [5]:
staDf = pd.read_csv("../../" + config["paths"]["data"]["sta_df"])
stimDf = pd.read_csv("../../" + config["paths"]["data"]["stim_df"])
staPca = pd.read_csv("../../" + config["paths"]["data"]["sta_pca_df"])

## Processing the Data

### Dividing the Data

In [6]:
for i, col in enumerate(staDf.columns):
    print(f"({i}: {col})", end=" ")
print()
for i, col in enumerate(stimDf.columns):
    print(f"({i}: {col})", end=" ")

(0: template) (1: x) (2: y) (3: w) (4: h) (5: a) (6: exc) (7: area) (8: total_spikes) (9: peak_poly) (10: time_to_peak_poly) (11: array_peak_position_poly) (12: hwhh_x_poly) (13: hwhh_y_poly) (14: bandwidth_poly) (15: zero_crossing_poly) (16: peak_func) (17: time_to_peak_func) (18: array_peak_position_func) (19: hwhh_x_func) (20: hwhh_y_func) (21: bandwidth_func) (22: zero_crossing_func) 
(0: template) (1: chirp-comp_1) (2: chirp-comp_2) (3: chirp-comp_3) (4: chirp-comp_4) (5: chirp-comp_5) (6: chirp-comp_6) (7: chirp-comp_7) (8: chirp-comp_8) (9: chirp-comp_9) (10: chirp-comp_10) (11: chirp-comp_11) (12: chirp-comp_12) (13: chirp-comp_13) (14: chirp-comp_14) (15: chirp-comp_15) (16: chirp-comp_16) (17: chirp-comp_17) (18: chirp-comp_18) (19: chirp-comp_19) (20: chirp-comp_20) (21: chirp-comp_21) (22: chirp-comp_22) (23: chirp-comp_23) (24: chirp-comp_24) (25: chirp-comp_25) (26: chirp-comp_26) (27: chirp-comp_27) (28: chirp-comp_28) (29: chirp-comp_29) (30: chirp-comp_30) (31: chirp-c

#### Polynomial Data
This DataFrame contains only the Spike-Triggered Average (STA) data that have been processed using polynomial fitting.

In [7]:
df_poly = staDf[
    [
        "template",
        "peak_poly",
        "time_to_peak_poly",
        "hwhh_x_poly",
        "hwhh_y_poly",
        "zero_crossing_poly",
        "bandwidth_poly",
    ]
].dropna()
# Renumber the rows 0..n-1 after dropping cells with missing values.
df_poly = df_poly.reset_index(drop=True)

#### Functional Data
This DataFrame exclusively contains Spike-Triggered Average (STA) data obtained by fitting a function described in one of the research papers.

In [8]:
df_func = staDf[
    [
        "template",
        "peak_func",
        "time_to_peak_func",
        "hwhh_x_func",
        "hwhh_y_func",
        "zero_crossing_func",
        "bandwidth_func",
    ]
].dropna()
# Renumber the rows 0..n-1 after dropping cells with missing values.
df_func = df_func.reset_index(drop=True)
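
The polynomial and functional subsets above follow the same pattern, so the selection could be factored into a single helper. The sketch below is only an illustration under assumptions: `select_fit_features` is a hypothetical name that does not exist in the project, and the toy frame merely mimics the column layout of `staDf` with made-up values.

```python
import numpy as np
import pandas as pd

# Base feature names shared by both fits (taken from the cells above).
FIT_FEATURES = [
    "peak",
    "time_to_peak",
    "hwhh_x",
    "hwhh_y",
    "zero_crossing",
    "bandwidth",
]


def select_fit_features(sta_df: pd.DataFrame, suffix: str) -> pd.DataFrame:
    """Hypothetical helper: keep `template` plus the features of one fit.

    Rows with any missing value are dropped and the index is renumbered.
    """
    columns = ["template"] + [f"{name}_{suffix}" for name in FIT_FEATURES]
    return sta_df[columns].dropna().reset_index(drop=True)


# Toy stand-in for staDf: three cells, one with a missing polynomial feature.
toy = pd.DataFrame({"template": ["temp_0", "temp_1", "temp_2"]})
for fit in ("poly", "func"):
    for name in FIT_FEATURES:
        toy[f"{name}_{fit}"] = [1.0, 2.0, 3.0]
toy.loc[1, "peak_poly"] = np.nan

toy_poly = select_fit_features(toy, "poly")
toy_func = select_fit_features(toy, "func")
print(toy_poly.shape, toy_func.shape)
```

With this layout, the cell that lacks a polynomial feature is removed only from the polynomial subset, so the two subsets can end up with different numbers of rows.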

#### All Stimuli Data
This DataFrame joins the Spike-Triggered Average (STA) data, reduced in dimensionality with Principal Component Analysis (PCA), with all the stimulus data. Rows are matched on the `template` column.

In [9]:
df_all_stim = stimDf.merge(staPca, on="template")
df_all_stim.dropna(inplace=True)
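
`DataFrame.merge` performs an inner join by default, so any template present in only one of the two tables is silently left out of the result. The following sketch shows one way to audit that loss. It uses toy frames with made-up values, and `sta-comp_1` is a hypothetical column name, since the columns of `staPca` are not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins; column names mimic stimDf and staPca, values are made up.
stim = pd.DataFrame(
    {"template": ["temp_0", "temp_1", "temp_2"], "chirp-comp_1": [0.1, 0.2, 0.3]}
)
sta_pca = pd.DataFrame(
    {"template": ["temp_1", "temp_2", "temp_3"], "sta-comp_1": [1.0, None, 3.0]}
)

# Same call as in the notebook: inner join on the shared `template` key.
merged = stim.merge(sta_pca, on="template")

# An outer join with indicator=True records which table each row came from.
audit = stim.merge(sta_pca, on="template", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", "template"].tolist()

# dropna() then removes matched rows that still contain missing values.
complete = merged.dropna()

print(sorted(unmatched))
print(complete["template"].tolist())
```

Comparing `len(merged)` and `len(complete)` against the sizes of the input tables gives a quick count of how many cells each step removes.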

#### Chirp-Only Data
This DataFrame holds the average spike responses of each cell to the chirp stimulus only; no Spike-Triggered Average (STA) data are included. Cells with missing values are dropped.

In [10]:
df_chirp_only = stimDf.dropna()

#### DataFrame Dictionary
Next, we will create a dictionary to store all the DataFrames. This approach simplifies the management and access of the various DataFrame structures.

In [11]:
dfs = {}
dfs["poly"] = df_poly
dfs["func"] = df_func
dfs["all_stim"] = df_all_stim
dfs["only_chirp"] = df_chirp_only
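# Feature-only copies without the `template` ID column.
# Despite the name, no scaling or normalization is applied in this step.
dfs_norm = {df_name: df.drop(columns=["template"]) for df_name, df in dfs.items()}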
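
Dropping the `template` column leaves the rows and their order unchanged, so every row in `dfs_norm[name]` still corresponds to the same row in `dfs[name]`. The sketch below, built on a toy frame, shows how labels computed on the feature-only table could be attached back to the template IDs. The labels are placeholders, not the output of any clustering run.

```python
import pandas as pd

# Toy frame with a non-contiguous index, as dropna() can leave behind.
df = pd.DataFrame(
    {"template": ["temp_0", "temp_2", "temp_5"], "feature": [0.1, 0.7, 0.9]},
    index=[0, 2, 5],
)
features = df.drop(columns=["template"])

# Placeholder labels standing in for the output of a clustering algorithm.
labels = [0, 1, 1]

# Positional assignment is safe because both frames share the same row order.
labelled = df[["template"]].assign(cluster=labels)
print(labelled)
```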

### Dimension Reduction
In this section, we will reduce the dimensionality of the DataFrames with the main goal of enabling their visualization in 2D. Two methods will be used: PCA and UMAP.

#### PCA
Principal Component Analysis (PCA) is a linear technique that re-expresses a dataset along uncorrelated axes ordered by the variance they capture. It finds the directions (principal components) along which the data vary most and projects the data onto the first few of them. Keeping the leading components minimizes the sum of squared distances between the data points and their projections, which is equivalent to retaining as much of the original variance as possible. In short, PCA reduces dimensionality while keeping the directions that explain most of the variation.

In [12]:
dfs_pca_2d = dict()
pcas_2d = dict()
for df_name, data in dfs_norm.items():
    pcas_2d[df_name] = PCA(n_components=2).fit(data)
    df_pca_2d = pcas_2d[df_name].transform(data)
    dfs_pca_2d[df_name] = df_pca_2d
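
How much of the variance a 2D projection retains can be read from the fitted model's `explained_variance_ratio_` attribute. The sketch below uses synthetic data rather than the notebook's DataFrames; the sample count, feature count, and noise level are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 5 features, most variance in 2 directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * [5.0, 2.0]
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + rng.normal(scale=0.1, size=(200, 5))

pca = PCA(n_components=2).fit(data)
projected = pca.transform(data)

# Fraction of the total variance captured by each of the two components.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

The same attribute is available on each entry of `pcas_2d`. scikit-learn's `PCA` centers the data but does not rescale it, and `dfs_norm` only removes the `template` column, so these ratios are one way to check whether a single large-valued feature dominates the projection.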

#### UMAP
Uniform Manifold Approximation and Projection (UMAP) is a nonlinear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data. Unlike linear methods such as PCA, UMAP is based on manifold learning: it prioritizes local neighborhood structure while retaining some of the global structure. It first builds a weighted graph in the original space, in which each point is connected to its nearest neighbors, and then optimizes a low-dimensional layout whose graph resembles the high-dimensional one as closely as possible. This makes UMAP effective for visualization, clustering, and data exploration.

In [13]:
hide_warnings()
dfs_umap_2d = dict()
umaps_2d = dict()
for df_name, data in dfs_norm.items():
    # Fit once; `embedding_` holds the 2D coordinates of the training data.
    umaps_2d[df_name] = umap.UMAP(random_state=0).fit(data)
    df_umap_2d = umaps_2d[df_name].embedding_
    dfs_umap_2d[df_name] = df_umap_2d
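
The nearest-neighbor graph mentioned in the UMAP description can be illustrated with scikit-learn alone. The sketch below is not UMAP itself: it only builds a plain k-nearest-neighbor graph on random toy points, which is the kind of structure UMAP starts from before weighting the edges and optimizing the layout. The value of `k` is an arbitrary choice for this illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 6))  # 50 toy samples with 6 features

k = 5  # hypothetical neighborhood size for this illustration
graph = kneighbors_graph(points, n_neighbors=k, mode="distance", include_self=False)

# Row i of the sparse matrix stores the distances from point i to its
# k nearest neighbors; counting stored entries per row gives k everywhere.
neighbors_per_point = np.diff(graph.indptr)
print(graph.shape, neighbors_per_point.min(), neighbors_per_point.max())
```

In `umap.UMAP`, the `n_neighbors` parameter plays the role of `k`: smaller values emphasize fine local detail, larger values emphasize broader structure.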

## Export the Data

In [14]:
with open("../.." + config["paths"]["data_cache"]["clustering"]["DFS"], "wb") as output:
    pickle.dump(dfs, output)
with open(
    "../.." + config["paths"]["data_cache"]["clustering"]["DFS_NORM"], "wb"
) as output:
    pickle.dump(dfs_norm, output)
with open(
    "../.." + config["paths"]["data_cache"]["clustering"]["DFS_PCA_2D"], "wb"
) as output:
    pickle.dump(dfs_pca_2d, output)
with open(
    "../.." + config["paths"]["data_cache"]["clustering"]["PCAS_2D"], "wb"
) as output:
    pickle.dump(pcas_2d, output)
with open(
    "../.." + config["paths"]["data_cache"]["clustering"]["DFS_UMAP_2D"], "wb"
) as output:
    pickle.dump(dfs_umap_2d, output)
with open(
    "../.." + config["paths"]["data_cache"]["clustering"]["UMAPS_2D"], "wb"
) as output:
    pickle.dump(umaps_2d, output)
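
A later notebook can read these files back with `pickle.load`. The round trip below is a minimal sketch that writes to a temporary directory; the file name `dfs.pkl` and the toy dictionary are assumptions for illustration, since the real cache paths come from `general_config.yml`.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Toy dictionary standing in for `dfs`; the real one holds four DataFrames.
toy_dfs = {"poly": pd.DataFrame({"template": ["temp_0"], "peak_poly": [1.0]})}

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "dfs.pkl"  # hypothetical file name

    with open(cache_file, "wb") as output:
        pickle.dump(toy_dfs, output)

    # A later notebook would read the file back in binary mode.
    with open(cache_file, "rb") as source:
        restored = pickle.load(source)

print(restored["poly"].equals(toy_dfs["poly"]))
```

Pickled PCA and UMAP models are best loaded with the same library versions that wrote them, and pickle files should only be loaded from trusted sources.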

___