# 🧠 Machine Learning & Deep Learning for Microbiome and Multi-omics Data
### Training Hands-on Session
**Date:** _2025-10-21_  
**Author:** _Berkay Ekren_  
**Session:** Hands-On

##### Required Packages
1. Python >3.12


## 🕐 Hands-on Session 1: Machine Learning for Microbiome and Multi-omics Case Studies
### 📋 Objectives:
- Perform **classification** or **regression** using microbiome and omics datasets.
- Identify **biomarkers** relevant to aquaculture species.
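
As a minimal sketch of how classification and biomarker identification fit together, the example below trains a Random Forest on synthetic data and ranks features with permutation importance. The data, the choice of feature 3 as the only informative feature, and all variable names are illustrative assumptions, not part of the session datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an abundance table: 200 samples x 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Assumption for illustration: only feature 3 determines the class label
y = (X[:, 3] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# Permutation importance: drop in test score when one feature is shuffled
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
top_feature = int(ranking[0])

print(f"Test accuracy: {test_accuracy:.3f}")
print("Top 3 features by permutation importance:", ranking[:3])
```

Computing the importance on held-out data, as done here, tends to give a more honest ranking than impurity-based importances computed on the training set.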

### 🔗 Suggested Datasets:
- Microbiome OTU/ASV tables
- Metabolomics or transcriptomics profiles
- Aquaculture phenotype or environmental metadata

### 🧰 Tasks:
1. Load and preprocess data
2. Explore dataset (summary statistics, visualization)
3. Integrate datasets (early and late integration methods)
4. Apply ML models (e.g., Random Forest, SVM, Gradient Boosting)
5. Evaluate model performances
6. Identify potential biomarkers (feature importance, SHAP, etc.)
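
For the preprocessing step, one commonly used option for compositional abundance data is the centered log-ratio (CLR) transform. The sketch below is an assumption about how such a step could look, not the method prescribed by this session; the function name `clr_transform` and the pseudocount of 1.0 are illustrative choices.

```python
import numpy as np

def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio transform; rows are samples, columns are features."""
    # A pseudocount avoids log(0) for features that are absent in a sample
    shifted = np.asarray(counts, dtype=float) + pseudocount
    log_values = np.log(shifted)
    # Subtract each sample's mean log value (the log of its geometric mean)
    return log_values - log_values.mean(axis=1, keepdims=True)

# Toy count table: 2 samples x 3 taxa (made-up numbers)
toy_counts = np.array([[10, 0, 90],
                       [5, 5, 5]])
clr = clr_transform(toy_counts)
print(clr)
```

Each transformed row sums to zero, and a sample with identical counts for every taxon maps to all zeros.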

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Import data [1]
microbiome_df = pd.read_csv("data/microbiome.csv", sep="\t")
metabolome_df = pd.read_csv("data/metabolome.csv", sep="\t")

# Uncomment the two lines below to preview the first rows of each dataframe and check the file structure
#print(microbiome_df.head())
#print(metabolome_df.head())

# Check the distribution of the data with histograms
plt.figure(figsize=(14, 5))
sns.histplot(microbiome_df.iloc[:, 1:].values.flatten(), bins=50, color='blue', label='Microbiome', kde=True)
sns.histplot(metabolome_df.iloc[:, 1:].values.flatten(), bins=50, color='orange', label='Metabolome', kde=True)
plt.title('Distribution of Relative Abundance Values', fontsize=16)
plt.xlabel('Abundance')
plt.ylabel('Frequency')
plt.legend()
plt.show()

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # used by the classifiers and the meta-model below
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
import xgboost as xgb
import lightgbm as lgb

#### 🧬 Early integration in machine learning: Concatenate features before modelling
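
Concatenation aligns rows by sample ID, so samples missing from one table produce missing values unless they are dropped. The toy example below sketches this behaviour with made-up sample and feature names (`S1` to `S4`, `OTU_1`, `met_A`); the names and values are hypothetical and only illustrate the `join` argument of `pd.concat`.

```python
import pandas as pd

# Toy tables in the same layout as the input files: features as rows, samples as columns
microbiome_toy = pd.DataFrame(
    {"S1": [0.2, 0.8], "S2": [0.5, 0.5], "S3": [0.9, 0.1]},
    index=["OTU_1", "OTU_2"],
)
metabolome_toy = pd.DataFrame(
    {"S2": [1.5], "S3": [2.5], "S4": [3.5]},
    index=["met_A"],
)

# Transpose so that rows are samples and columns are features
X_micro = microbiome_toy.T
X_metab = metabolome_toy.T

# Default (outer) join keeps every sample and fills the gaps with NaN
outer = pd.concat([X_micro, X_metab], axis=1)

# Inner join keeps only the samples present in both tables
combined = pd.concat([X_micro, X_metab], axis=1, join="inner")

print(outer)
print(combined)
```

Whether to drop unmatched samples or impute them depends on the study design; the inner join is simply the more conservative default for a first pass.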

In [None]:
print("--- Strategy 1: Early Integration ---")

# 1. Set the first column as the index for each dataframe
microbiome_features = microbiome_df.set_index(microbiome_df.columns[0])
metabolome_features = metabolome_df.set_index(metabolome_df.columns[0])

# 2. Transpose the dataframes so that rows are samples and columns are features
X_microbiome = microbiome_features.T
X_metabolome = metabolome_features.T

# 3. Concatenate the dataframes horizontally (axis=1) to create a single feature matrix.
# This aligns the data by sample ID (the index).
early_integration_df = pd.concat([X_microbiome, X_metabolome], axis=1)

print("Shape of Microbiome data (samples, features):", X_microbiome.shape)
print("Shape of Metabolome data (samples, features):", X_metabolome.shape)
print("Shape of combined data for Early Integration:", early_integration_df.shape)

print("\n--- Early Integration DataFrame Head ---")
print(early_integration_df.head())

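# NOTE: y_classification and y_regression are placeholders for the target vectors
# (e.g., taken from the sample metadata). Define them, in the same sample order
# as early_integration_df, before running the splits below.

# Split data for classification task
X_train_early_c, X_test_early_c, y_train_c, y_test_c = train_test_split(
    early_integration_df, y_classification, test_size=0.3, random_state=42
)
# Split data for regression task
X_train_early_r, X_test_early_r, y_train_r, y_test_r = train_test_split(
    early_integration_df, y_regression, test_size=0.3, random_state=42
)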


In [None]:


# 1. Early Integration: Concatenate features before modeling
print("--- Strategy 1: Early Integration ---")
# The second omics layer loaded in this notebook is the metabolome
X_early_integration = pd.concat([X_microbiome, X_metabolome], axis=1)



# --- Models for Classification (Early Integration) ---
print("\nTraining Classification Models...")
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "XGBoost": xgb.XGBClassifier(eval_metric='logloss', random_state=42),  # use_label_encoder is deprecated and no longer needed
    "LightGBM": lgb.LGBMClassifier(random_state=42)
}

for name, model in classifiers.items():
    print(f"Fitting {name}...")
    # model.fit(X_train_early_c, y_train_c)
    # score = model.score(X_test_early_c, y_test_c)
    # print(f"{name} Accuracy: {score:.4f}")

# --- Models for Regression (Early Integration) ---
print("\nTraining Regression Models...")
regressors = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "SVR": SVR(),
    "XGBoost": xgb.XGBRegressor(objective='reg:squarederror', random_state=42),
    "LightGBM": lgb.LGBMRegressor(random_state=42)
}

for name, model in regressors.items():
    print(f"Fitting {name}...")
    # model.fit(X_train_early_r, y_train_r)
    # score = model.score(X_test_early_r, y_test_r) # R^2 score
    # print(f"{name} R^2 Score: {score:.4f}")


# --- 2. Late Integration: Train models separately, then combine predictions (Stacking) ---
print("\n--- Strategy 2: Late Integration (Stacking) ---")

# Split individual datasets
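# The same random_state and sample order give identical train/test sample sets for both tables
X_m_train, X_m_test, y_c_train, y_c_test = train_test_split(X_microbiome, y_classification, test_size=0.3, random_state=42)
# The second omics layer is the metabolome (the X_t_* names are kept for the cells below)
X_t_train, X_t_test, _, _ = train_test_split(X_metabolome, y_classification, test_size=0.3, random_state=42)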

# --- Classification (Late Integration) ---
# Step A: Train base models on each dataset
base_model_m = RandomForestClassifier(random_state=42).fit(X_m_train, y_c_train)
base_model_t = RandomForestClassifier(random_state=42).fit(X_t_train, y_c_train)

# Step B: Get predictions from base models on the test set
preds_m = base_model_m.predict_proba(X_m_test)[:, 1]
preds_t = base_model_t.predict_proba(X_t_test)[:, 1]

# Step C: Combine predictions to form a new feature set
X_late_integration_test = np.c_[preds_m, preds_t]

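# To train the meta-model, you'd typically use cross-validated (out-of-fold) predictions
# of the base models on the training set, so the meta-model never sees predictions
# made on samples the base models were fitted on.
# For simplicity, the meta-model is only instantiated here; it is not fitted.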
meta_model_c = LogisticRegression()
# meta_model_c.fit(X_late_integration_train, y_c_train)
# final_preds = meta_model_c.predict(X_late_integration_test)
print("\nLate integration concept for classification demonstrated.")
print("Meta-model would be trained on predictions from base models.")

# --- Regression (Late Integration) ---
# The same principle applies. Base regressors are trained, and a final meta-regressor
# (e.g., LinearRegression) is trained on their output predictions.
print("\nLate integration for regression follows the same stacking principle.")

# Note: The fitting and scoring lines are commented out.
# You can uncomment them to run the models on the placeholder data.
print("\nSetup complete. Replace placeholder data and uncomment model.fit/score lines to run.")


## 🕐 Hands-on Session 2: Deep Learning for Metagenomic & Multi-omics Data
### 📋 Objectives:
- Build **deep learning** models to predict outcomes from multi-omics datasets.
- Combine datasets (multi-view or multimodal learning).
- Evaluate performance and interpretability.

### 🧬 Suggested Frameworks:
- TensorFlow / Keras
- PyTorch / PyTorch Lightning

### 🧰 Tasks:
1. Prepare multi-omics datasets for modeling
2. Define and train deep learning models
3. Evaluate performance (accuracy, loss curves, confusion matrix)
4. Interpret model predictions
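
For the evaluation task, a binary classifier with a sigmoid output produces probabilities that must be thresholded before a confusion matrix can be computed. The sketch below shows that step with NumPy only; the helper name `binary_confusion_matrix`, the 0.5 threshold, and the toy labels and probabilities are illustrative assumptions.

```python
import numpy as np

def binary_confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return a 2x2 matrix with true classes as rows and predicted classes as columns."""
    y_true = np.asarray(y_true).astype(int)
    # Convert predicted probabilities to hard 0/1 labels
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    cm = np.zeros((2, 2), dtype=int)
    for true_label, pred_label in zip(y_true, y_pred):
        cm[true_label, pred_label] += 1
    return cm

# Made-up labels and sigmoid outputs for five samples
y_true = [0, 0, 1, 1, 1]
y_prob = [0.1, 0.7, 0.8, 0.4, 0.9]

cm = binary_confusion_matrix(y_true, y_prob)
accuracy = np.trace(cm) / cm.sum()

print(cm)
print(f"Accuracy: {accuracy:.2f}")
```

The row/column layout matches the convention used by `sklearn.metrics.confusion_matrix`, so the two can be swapped in practice.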

In [None]:
# 🧬 Example: Simple neural network with Keras
from tensorflow import keras
from tensorflow.keras import layers

# Example model
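model = keras.Sequential([
    # An explicit Input layer replaces the deprecated input_shape argument on Dense
    keras.Input(shape=(X_microbiome.shape[1],)),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # single sigmoid unit for binary classification
])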

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_m_train, y_c_train, epochs=20, validation_split=0.2)  # microbiome training split from Session 1

## 🕐 Hands-on Session 3: Integrated Workflow: From Data to Insights
### 📋 Objectives:
- Build an **end-to-end workflow** from raw data → preprocessing → ML/DL models.
- Combine microbiome, omics, and environmental data.
- Derive interpretable **biological insights** relevant to aquaculture.

### ⚙️ Example Workflow Steps:
1. Raw data QC and normalization
2. Feature selection or dimensionality reduction
3. Model training and validation
4. Post-hoc interpretation and visualization
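
The workflow steps above can be sketched as a single scikit-learn `Pipeline`, which keeps scaling and feature selection inside each cross-validation fold and so avoids leaking test information. The synthetic data, the choice of `SelectKBest` with `k=10`, and the Random Forest settings are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an integrated feature matrix: 120 samples x 50 features
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 50))
# Assumption for illustration: the label depends on the first two features only
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ("scale", StandardScaler()),                          # step 1: normalization
    ("select", SelectKBest(score_func=f_classif, k=10)),  # step 2: feature selection
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),  # step 3: model
])

# Step 3 (validation): stratified 5-fold cross-validation on the whole pipeline
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"Mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

After validation, fitting the pipeline once on all training data and inspecting `pipeline.named_steps["select"].get_support()` shows which features were retained, which is one starting point for the post-hoc interpretation step.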

In [None]:
# Example: Integrated pipeline pseudocode
# Step 1: Preprocess data
# Step 2: Train model
# Step 3: Evaluate results
# Step 4: Visualize findings

# Placeholder for pipeline code

### 📚 Suggested Reading & Resources
- [QIIME 2 Machine Learning Plugin](https://docs.qiime2.org/)
- [scikit-learn documentation](https://scikit-learn.org/stable/)
- [TensorFlow tutorials](https://www.tensorflow.org/tutorials)
- [PyTorch tutorials](https://pytorch.org/tutorials/)
- Example dataset: [EBI Metagenomics](https://www.ebi.ac.uk/metagenomics/)

**References**
1. Mazzella, V., Dell’Anno, A., Etxebarría, N., González-Gaya, B., Nuzzo, G., Fontana, A., & Núñez-Pons, L. (2024). High microbiome and metabolome diversification in coexisting sponges with different bio-ecological traits. Communications Biology, 7(1), 422.