## GSE48213: Identifying gene expression patterns associated with different breast cancer subtypes

The dataset includes both treated (estrogen) and control conditions. 
We use Adaptive CCA to identify differences in gene expression patterns between these conditions over time.

The time points (1, 2, 4, 8, 12 hours) are not equally spaced, which is common in biological experiments.
Adaptive CCA should be able to handle such unevenly spaced time points.
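
To make the uneven spacing concrete, the short sketch below computes the gaps between the listed time points and shows that they become more regular on a log2 time axis. It only illustrates the sampling grid and is not part of Adaptive CCA; the variable names are illustrative.

```python
import numpy as np

# Time points (hours) as listed above
time_points = np.array([1, 2, 4, 8, 12], dtype=float)

# Gaps between consecutive measurements on the original scale
gaps = np.diff(time_points)

# The same gaps after a log2 transform of the time axis
log_gaps = np.diff(np.log2(time_points))

print("gaps (hours):", gaps)
print("gaps (log2 hours):", np.round(log_gaps, 3))
```

On the original scale the gaps grow from 1 to 4 hours, while on the log2 scale the first three gaps are identical and only the last one is shorter.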

In [1]:
import os
import pandas as pd
import numpy as np

file_path = os.path.join(os.getcwd(), "..", "data", "GSE48213")

In [None]:
os.getcwd()

Dataset overview:

- 56 breast cancer cell lines were profiled
- The data represents gene expression levels in these cell lines
- Each cell line is in an unperturbed, baseline state


In the current file:

1. Column 1 (`EnsEMBL_Gene_ID`): unique identifier for each gene from the Ensembl database.
2. Column 2 (e.g., `MDAMB453`): expression value for each gene in the specific cell line.

The values are normalized read counts or FPKM/TPM values (Fragments Per Kilobase of transcript per Million mapped reads / Transcripts Per Million).
Higher values indicate higher expression of the gene in that cell line. Zero values indicate that the gene is not expressed, or that its expression is below the detection threshold.
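
As a small illustration of this layout, the sketch below builds a made-up table with the same structure (an `EnsEMBL_Gene_ID` column plus one column per cell line) and counts how many genes have zero expression in each cell line. The gene IDs, the second cell-line name, and all values are invented for the example and do not come from GSE48213.

```python
import pandas as pd

# Toy table mimicking the file layout; every value here is made up
toy = pd.DataFrame({
    "EnsEMBL_Gene_ID": ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"],  # placeholder IDs
    "MDAMB453": [0.0, 12.5, 3.1, 0.0],
    "CELL_LINE_2": [1.2, 0.0, 7.8, 0.0],  # hypothetical second cell line
})

# Use the gene ID as the row index so that only expression values remain
expr = toy.set_index("EnsEMBL_Gene_ID")

# Number of genes with zero expression in each cell line
zeros_per_cell_line = (expr == 0).sum()

# Genes that are zero in every cell line
not_detected_anywhere = expr.index[(expr == 0).all(axis=1)].tolist()

print(zeros_per_cell_line)
print(not_detected_anywhere)
```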


In [None]:
from utils.utils import load_data
load_data(file_path, os.getcwd())

In [7]:
output_file = os.path.join(os.getcwd(), "combined_data.txt")
data = pd.read_csv(output_file, sep="\t")

In [None]:
data.head()

In [None]:
print(data.shape)
data.info()  # info() prints its summary directly and returns None
print(data.head())

In [None]:
# Preprocessing: report missing values per column, then drop incomplete rows
print(data.isnull().sum())
data = data.dropna()

In [None]:
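# Use the gene ID column as the index so that only numeric expression values remain
if "EnsEMBL_Gene_ID" in data.columns:
    data = data.set_index("EnsEMBL_Gene_ID")

# Log2-transform the data (skip this step if the values are already log-scaled)
data = np.log2(data + 1)

# Standardize the data (optional, depending on your analysis needs).
# Note: StandardScaler works column-wise, i.e. per cell line.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns, index=data.index)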
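
One point worth checking at this step: `StandardScaler` standardizes each column, so with genes in rows and cell lines in columns it centers and scales every cell line, not every gene. The numpy-only sketch below reproduces that column-wise behavior on made-up numbers and contrasts it with per-gene z-scores. Which of the two is appropriate depends on the downstream analysis.

```python
import numpy as np

# Made-up matrix: 3 genes (rows) x 2 cell lines (columns)
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Column-wise z-scores: mean 0 and population std 1 per cell line,
# which is what StandardScaler computes for each column
per_cell_line = (x - x.mean(axis=0)) / x.std(axis=0)

# Row-wise z-scores: each gene standardized across cell lines
per_gene = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

print(per_cell_line.mean(axis=0))  # column means after column-wise scaling
print(per_gene.mean(axis=1))       # row means after row-wise scaling
```

With scikit-learn, per-gene standardization could be obtained by fitting the scaler on the transposed matrix and transposing the result back.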


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Heatmap of gene expression across cell lines
plt.figure(figsize=(12, 8))
sns.heatmap(data_scaled.iloc[:100, :], cmap='viridis', xticklabels=False, yticklabels=False)
plt.title('Heatmap of Gene Expression (First 100 Genes)')
plt.show()

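# Distribution of per-gene mean expression (averaged across cell lines).
# axis=1 averages over columns; the default column means are ~0 after scaling.
plt.figure(figsize=(10, 6))
data_scaled.mean(axis=1).hist(bins=50)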
plt.title('Distribution of Mean Gene Expression Across Cell Lines')
plt.xlabel('Mean Expression')
plt.ylabel('Frequency')
plt.show()

# PCA plot
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled.T)
plt.figure(figsize=(10, 8))
plt.scatter(pca_result[:, 0], pca_result[:, 1])
for i, cell_line in enumerate(data_scaled.columns):
    plt.annotate(cell_line, (pca_result[i, 0], pca_result[i, 1]))
plt.title('PCA of Cell Lines')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

# Correlation between cell lines
correlation = data_scaled.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation, cmap='coolwarm', center=0)
plt.title('Correlation Between Cell Lines')
plt.show()
