# Dimension Reduction with PaCMAP

## Where Is the Data?

In [1]:
input_path = '../Data/Processed_Data/'
output_path = '../Data/Processed_Data/PaCMAP_Results/'

## Load Datasets

In [2]:
import pandas as pd

x = pd.read_pickle(input_path+'x.pkl')
y = pd.read_csv(input_path+'y.csv', index_col=0)

print(
    f'Dataset (x) contains {x.shape[0]} rows (samples) and {x.shape[1]} columns (mC sites).')


In [29]:
# Load control and relapse data
ctrl_rel = pd.read_pickle(input_path+'control_relapse.pkl')

# Split control and relapse data into x and y
ctrl_rel_x = ctrl_rel[x.columns]
ctrl_rel_y = ctrl_rel[y.columns]

# Split the annotations into control (normal) and relapse samples
normal_types = ['Bone Marrow Normal', 'Blood Derived Normal']
is_control = ctrl_rel_y['Sample Type'].isin(normal_types)
ctrl_y = ctrl_rel_y[is_control]
rel_y = ctrl_rel_y[~is_control]

# Select the matching rows (samples) of x
ctrl_x = ctrl_rel_x.loc[ctrl_y.index]
rel_x = ctrl_rel_x.loc[rel_y.index]

## Train-Test Split

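Here we split the data by clinical trial into a training (discovery) set and a testing (validation) set: samples from the AML02 and AML08 trials form the testing set, and all remaining samples form the training set.

We use ```x_train``` and ```y_train``` for the features and annotations of the training set, and ```x_test``` and ```y_test``` for those of the testing set.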

In [10]:
# Split train and test by clinical trial
y_train = y[~y['Clinical Trial'].isin(['AML02', 'AML08'])]
# y_train = y_train[y_train['Sample Type'].isin(['Diagnosis',
#        'Primary Blood Derived Cancer - Bone Marrow', 'Bone Marrow Normal',
#        'Primary Blood Derived Cancer - Peripheral Blood',
#        'Blood Derived Normal'])]

y_test = y[y['Clinical Trial'].isin(['AML02', 'AML08'])]

# Select samples in x that are in y_train
x_train = x.loc[y_train.index]
x_test = x.loc[y_test.index]

# x_train = pd.concat([x_train, ctrl_x], axis=0)
# y_train = pd.concat([y_train, ctrl_y], axis=0,keys=['Diagnosis','Control'], names=['sample_type'])


print(
    f"Discovery dataset (train) contains {x_train.shape[0]} rows (samples) and {x_train.shape[1]} columns (mC sites).")
print(
    f"\n{y_train['Clinical Trial'].value_counts(dropna=False).to_string()}\n")
print(
    f"Validation dataset (test) contains {x_test.shape[0]} rows (samples) and {x_test.shape[1]} columns (mC sites).")
print(f"\n{y_test['Clinical Trial'].value_counts(dropna=False).to_string()}\n")


Discovery dataset (train) contains 1553 rows (samples) and 310545 columns (mC sites).

AAML1031    704
AAML0531    630
AAML03P1     72
AML05        64
CCG2961      42
NaN          41

Validation dataset (test) contains 209 rows (samples) and 310545 columns (mC sites).

AML02    167
AML08     42



## Batch Correction with pyCombat

- __pyCombat__: a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods

- __Documentation__: [https://epigenelabs.github.io/pyComBat/](https://epigenelabs.github.io/pyComBat/)

- __Implementation Paper__: [bioRxiv](https://doi.org/10.1101/2020.03.17.995431)

- __Original Paper__: [Biostatistics](https://pubmed.ncbi.nlm.nih.gov/16632515/)
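
To give an intuition for what "correcting a batch effect" means, here is a deliberately simplified sketch that is not ComBat. It only removes a per-batch shift in the mean of a single feature, whereas ComBat also adjusts the scale and stabilizes the per-batch estimates with empirical Bayes. The `center_batches` helper and the toy values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def center_batches(values, batches):
    """Shift each batch so that its mean equals the overall mean (single feature)."""
    grand_mean = mean(values)

    # Group the values by batch label and compute each batch's mean.
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}

    # Remove the batch-specific offset, then restore the overall level.
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Toy feature measured in two batches; batch "B" is shifted upward by 1.0.
values = [0.25, 0.5, 0.75, 1.25, 1.5, 1.75]
batches = ["A", "A", "A", "B", "B", "B"]

corrected = center_batches(values, batches)
print(corrected)
```

After centering, both toy batches have the same mean, so the remaining variation within each batch is no longer confounded with the batch label.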

In [11]:
from combat.pycombat import pycombat

# Correct batch effects in the training dataset
x_train2 = pycombat(x_train.T, y_train['Batch']).T

print('Successfully corrected batch effects in the training dataset.')


Found 4 batches.
Adjusting for 0 covariate(s) or covariate level(s).
Standardizing Data across genes.
Fitting L/S model and finding priors.
Finding parametric adjustments.
Adjusting the Data
Successfully corrected batch effects in the training dataset.


## Dimension Reduction with PaCMAP

- __PaCMAP__: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure

- __GitHub__: [https://github.com/YingfanWang/PaCMAP](https://github.com/YingfanWang/PaCMAP)

- __Paper__: [Journal of Machine Learning Research](https://jmlr.org/papers/v22/20-1061.html)

In [23]:
import pacmap


def run_pacmap(x_train, x_test, n_components=2):
    """
    Fit PaCMAP on the training dataset, then apply the learned embedding to the test dataset.

    Parameters
    ----------
    x_train : pandas.DataFrame
        Training dataset.
    x_test : pandas.DataFrame
        Test dataset.
    n_components : int, optional
        Number of components. The default is 2.

    Returns
    -------
    embedding : numpy.ndarray
        Embedding of the training dataset.
    embedding_test : numpy.ndarray
        Embedding of the test dataset.

    """

    # Initialize PaCMAP. Note: these hyperparameters were chosen by prior tuning.
    reducer = pacmap.PaCMAP(n_components=n_components, n_neighbors=15,
                            MN_ratio=0.4, FP_ratio=16.0, random_state=42,
                            lr=0.1, num_iters=5000)

    # Fit the model on the training dataset and return its embedding
    embedding = reducer.fit_transform(x_train)

    # Project the test dataset into the learned embedding;
    # PaCMAP's transform takes the original training data as `basis`
    embedding_test = reducer.transform(x_test, basis=x_train.copy())

    return embedding, embedding_test


embedding, embedding_test = run_pacmap(x_train2, x_test, n_components=3)




```{note}

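You may have noticed that we called two methods of the PaCMAP class: ```fit_transform``` and ```transform```.

- ```fit``` means to learn the parameters of a model _from_ a dataset.

- ```transform``` means to apply the learned parameters of a model _to_ a dataset.

- ```fit_transform``` does both in one call: it fits the model on a dataset and returns that same dataset transformed.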
```
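
The fit/transform distinction can be seen in a toy estimator. This is a minimal sketch using only the standard library; `ToyScaler` is a hypothetical class written for illustration and is unrelated to PaCMAP's internals.

```python
from statistics import mean, pstdev


class ToyScaler:
    """Toy estimator that standardizes a list of numbers."""

    def fit(self, values):
        # Learn the parameters (mean and standard deviation) FROM the data.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the already-learned parameters TO any dataset.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        # Convenience method: fit on the data, then transform the same data.
        return self.fit(values).transform(values)


train = [1.0, 3.0]
test = [2.0, 5.0]

scaler = ToyScaler()
train_scaled = scaler.fit_transform(train)  # parameters come from train
test_scaled = scaler.transform(test)        # same parameters, reused on test
print(train_scaled, test_scaled)
```

The test values are scaled with the mean and standard deviation learned from the training values only, which mirrors how the notebook fits the embedding on the training set and then applies it to the test set.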

## Save Embedding

In [24]:
# Convert the embeddings (NumPy arrays) to pandas DataFrames
embedding = pd.DataFrame(embedding, index=x_train2.index,
                         columns=['PaCMAP 1', 'PaCMAP 2', 'PaCMAP 3'])
embedding_test = pd.DataFrame(embedding_test, index=x_test.index,
                              columns=['PaCMAP 1', 'PaCMAP 2', 'PaCMAP 3'])

# Save embeddings
embedding.to_pickle(output_path+'embedding.pkl')
embedding_test.to_pickle(output_path+'embedding_test.pkl')

print(
    f'Successfully saved {embedding.shape[0]} x_train samples and {embedding_test.shape[0]} x_test samples.\nPath: {output_path}')


Successfully saved 1553 x_train samples and 209 x_test samples.
Path: ../Data/Processed_Data/PaCMAP_Results/


## Watermark

In [19]:
%load_ext watermark

In [20]:
# Print the Python version and the versions of the main packages used
%watermark -v -p numpy,pandas,sklearn,combat,pacmap

Python implementation: CPython
Python version       : 3.10.10
IPython version      : 8.8.0

numpy  : 1.23.5
pandas : 1.5.2
sklearn: 1.2.0
combat : 0.3.3
pacmap : 0.7.0

