ComBat Batch Harmonization
- data: The expression matrix as a pandas DataFrame, with one row per gene (feature) and one column per sample.
- batch: List of batch indices, one per sample. It must contain as many elements as the expression matrix has columns, in the same order as the columns.

Source: https://blog.4dcu.be/programming/2021/04/21/Code-Nugget-Batch_Effects.html
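
Before harmonizing, it can help to confirm that the inputs match the layout described above. The following is a minimal sketch using pandas only; `batch_matches_samples` is a hypothetical helper (not part of pycombat) and the toy values are made up for illustration. In this notebook the samples are stored as rows, which is why the frame is transposed before it is passed to pycombat.

```python
import pandas as pd


def batch_matches_samples(data: pd.DataFrame, batch) -> bool:
    """Hypothetical helper: True if there is exactly one batch label per column."""
    # Samples are expected as columns, so compare against the column count.
    return len(list(batch)) == data.shape[1]


# Toy matrix: two features (rows) measured on three samples (columns).
toy = pd.DataFrame(
    {"s1": [1.0, 2.0], "s2": [1.5, 2.5], "s3": [3.0, 4.0]},
    index=["feature_a", "feature_b"],
)
toy_batch = [0, 0, 1]  # samples s1 and s2 share a batch, s3 is in another

print(batch_matches_samples(toy, toy_batch))
```

If the check fails, the usual cause is a matrix that still has samples as rows, or a batch list built from a frame with a different row order.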

In [None]:
import pandas as pd
from combat.pycombat import pycombat
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
combined_internal = pd.read_csv("../../data/normalized/combined.csv", sep= "\t", index_col=0)
combined_external = pd.read_csv("../../data/normalized/combined_ext.csv", sep= "\t", index_col=0)
scanner_internal = pd.read_excel("../../data/initial/T1/scanner_type_T1.xlsx", sheet_name='scanner_type_T1', engine='openpyxl')
scanner_external = pd.read_excel("../../data/initial/T1/scanner_type_T1_ext.xlsx", sheet_name='scanner_type_T1_ext', engine='openpyxl')

In [None]:
scanner_internal.drop(scanner_internal.columns[1], axis=1, inplace=True)
scanner_internal['ID_intern'] = scanner_internal['ID_intern'].str.slice(53, 61)
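# Assign back instead of using inplace=True on a column selection, which does not reliably modify the DataFrame in recent pandas versions.
scanner_internal['ID_intern'] = scanner_internal['ID_intern'].replace("_se", "", regex=True)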
# scanner_internal

In [None]:
scanner_external.drop(scanner_external.columns[1], axis=1, inplace=True)
scanner_external['ID_intern'] = scanner_external['ID_intern'].str.slice(68, 71)
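# Assign back instead of using inplace=True on a column selection, which does not reliably modify the DataFrame in recent pandas versions.
scanner_external['ID_intern'] = scanner_external['ID_intern'].replace("_T", "", regex=True)
scanner_external['ID_intern'] = scanner_external['ID_intern'].replace("_t", "", regex=True)
scanner_external['ID_intern'] = scanner_external['ID_intern'].replace("_", "", regex=True)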
# scanner_external

In [None]:
combined_internal = scanner_internal.join(combined_internal.set_index('ID_intern'), on='ID_intern', how='inner')
combined_internal

In [None]:
combined_external = scanner_external.join(combined_external.set_index('ID_intern'), on='ID_intern', how='inner')
combined_external

In [None]:
frames = [combined_internal, combined_external]
data = pd.concat(frames)

In [None]:
data

In [None]:
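# pycombat expects features as rows and samples as columns, hence the transposes; Scanner is the batch label for each sample.
data_corrected = pycombat(data.drop(columns=["ID_intern", "Scanner"]).transpose(), data["Scanner"].tolist(), mean_only=True).transpose()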

In [None]:
data_corrected.head(60)

In [None]:
data = data.drop("ID_intern", axis=1)

In [None]:
# long_df = data.melt(id_vars=["Scanner"])

Visualizing the data changes

In [None]:
# long_corrected_df = data_corrected.melt()
# merged_df = pd.merge(
#     long_df,
#     long_corrected_df,
#     left_index=True,
#     right_index=True,
#     suffixes=("_raw", "_corrected"),
# )
# g = sns.FacetGrid(
#     merged_df,
#     col="variable_raw",
#     height=3,
#     aspect=1,
#     sharex=False,
#     sharey=False,
#     col_wrap=3,
# )
# g.map_dataframe(sns.scatterplot, x="value_raw", y="value_corrected", hue="Scanner")
# plt.show()

Here a scatterplot compares the original value (x-axis) with the corrected value (y-axis), with points colored by batch. If no correction were applied, every sample would lie on the diagonal because its x- and y-values would be identical. Where a correction is applied, the points shift away from the diagonal.
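
The same comparison can be summarized numerically. The sketch below is an illustration only: `mean_shift_per_batch` is a hypothetical helper, and the toy values and the shift of -2.0 are made up rather than produced by pycombat. It averages the difference between corrected and raw values per batch and feature, so a value of 0 corresponds to points that stay on the diagonal.

```python
import pandas as pd


def mean_shift_per_batch(raw: pd.DataFrame, corrected: pd.DataFrame, batch) -> pd.DataFrame:
    """Hypothetical helper: mean of (corrected - raw) per batch and feature."""
    # Subtract positionally so duplicate index labels cannot misalign rows.
    diff = pd.DataFrame(corrected.to_numpy() - raw.to_numpy(), columns=raw.columns)
    diff["batch"] = list(batch)
    return diff.groupby("batch").mean()


# Toy data: samples as rows, features as columns (made-up values).
raw = pd.DataFrame({"f1": [1.0, 2.0, 5.0, 6.0], "f2": [0.0, 1.0, 0.0, 1.0]})
batch = ["A", "A", "B", "B"]

# Pretend the correction lowered f1 by 2.0 in batch B and changed nothing else.
corrected = raw.copy()
corrected.loc[[2, 3], "f1"] -= 2.0

shift = mean_shift_per_batch(raw, corrected, batch)
print(shift)
```

With the notebook's own frames, the raw features, `data_corrected`, and the `Scanner` column could be passed in the same way, provided the rows are in the same order.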

In [None]:
# Rebuild the full frame and re-attach the sample IDs to the corrected features (rows are in the same order).
data = pd.concat(frames)
data_harmonized = pd.concat([data["ID_intern"], data_corrected], axis=1)

In [None]:
data_harmonized

In [None]:
# data_harmonized.to_csv("../../data/harmonized/t1_harmonized.csv", sep='\t', encoding='utf-8', index=False)

Harmonized data split

In [None]:
data_harmonized['ID_intern'] = data_harmonized['ID_intern'].astype('str') 

In [None]:
strings = ['LIPO', 'LT']
combined_harmonized = data_harmonized[data_harmonized['ID_intern'].str.contains('|'.join(strings))]

combined_harmonized

In [None]:
combined_harmonized.to_csv("../../data/harmonized/combined_harmonized.csv", sep='\t', encoding='utf-8', index=False)

In [None]:
combined_external_harmonized = data_harmonized[~data_harmonized['ID_intern'].str.contains('|'.join(strings))]

combined_external_harmonized

In [None]:
combined_external_harmonized.to_csv("../../data/harmonized/combined_external_harmonized.csv", sep='\t', encoding='utf-8', index=False)
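
The split above selects rows whose ID contains 'LIPO' or 'LT' and treats everything else as external. As a minimal sketch of the same idea on a toy frame (the IDs and values below are made up, and escaping the markers with `re.escape` is an added precaution rather than something the cells above do), the mask can be computed once and reused so that the two parts are guaranteed to be complementary:

```python
import re

import pandas as pd

# Toy frame; only the 'LIPO' and 'LT' markers come from the cells above.
toy = pd.DataFrame(
    {"ID_intern": ["LIPO001", "LT002", "17", "42"], "feature": [0.1, 0.2, 0.3, 0.4]}
)

markers = ["LIPO", "LT"]
# Escape each marker so it is matched literally, then join into one regex.
pattern = "|".join(re.escape(m) for m in markers)

# Compute the mask once; its negation selects the remaining rows.
is_internal = toy["ID_intern"].astype(str).str.contains(pattern)
toy_internal = toy[is_internal]
toy_external = toy[~is_internal]

print(len(toy_internal), len(toy_external))
```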