# Use case 2: Temperature prediction with the original model set-up

This notebook trains an ElasticNet model in Python with the same train-test splits as the ritme runs and a modelling set-up similar to the one described in [the original publication by Sunagawa et al. 2015](https://www.science.org/doi/10.1126/science.1261359).


This notebook can be run in the following conda environment:

```shell
mamba create -n ritme_model -c adamova -c qiime2 -c conda-forge -c bioconda -c pytorch -c anaconda ritme ipykernel -y
conda activate ritme_model
```


The original publication describes its modelling set-up as follows:
"Compositional data (see above) were normalized to ranks across samples and then used to learn a regression model to predict environmental measures. In particular, we fitted an elastic net model (44) using inner cross-validation to set the hyperparameters as implemented by the scikit-learn Python package"

For a fair comparison with ritme, we run the inner cross-validation on the train set only and evaluate the resulting model on the held-out test set.
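
The set-up quoted above (rank normalization across samples, then an elastic net tuned by inner cross-validation) can be sketched on synthetic data. This is a minimal illustration, not the publication's exact pipeline: the `toy_*` names and the data are hypothetical, and ranking each feature column with `DataFrame.rank(axis=0)` is one plausible reading of "normalized to ranks across samples".

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in for a samples x features abundance table (hypothetical data)
rng = np.random.default_rng(0)
toy_ft = pd.DataFrame(
    rng.random((40, 6)),
    index=[f"sample_{i}" for i in range(40)],
    columns=[f"otu_{j}" for j in range(6)],
)
toy_target = 3.0 * toy_ft["otu_0"] + rng.normal(scale=0.1, size=40)

# Rank every feature across samples (axis=0 ranks within a column);
# tied values receive their average rank
toy_ranked = toy_ft.rank(axis=0, method="average")

# Hold out the last 10 samples; the inner CV only ever sees the first 30
toy_train, toy_test = toy_ranked.iloc[:30], toy_ranked.iloc[30:]
toy_model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, random_state=123)
toy_model.fit(toy_train, toy_target.iloc[:30])

# R² on the held-out samples, which played no part in hyperparameter selection
toy_test_r2 = toy_model.score(toy_test, toy_target.iloc[30:])
```

Note that the sketch ranks all samples before splitting, as would be the case when a table is rank-transformed once during pre-processing.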

## Setup

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score, root_mean_squared_error

In [2]:
######## USER INPUTS ########

# path to folder where train-test splits used for ritme are stored
data_splits_folder = "data_splits_u2"

# path to processed feature table used by original publication
path_to_suna_ft = "../../data/u2_tara_ocean/otu_table_tara_ocean_proc.tsv"
path_to_suna_md = "../../data/u2_tara_ocean/md_tara_ocean.tsv"

######## END USER INPUTS #####

## Prepare data

In [3]:
# Get indices of training and test data used by ritme
ritme_train_df = pd.read_pickle(f"{data_splits_folder}/train_val.pkl")
ritme_test_df = pd.read_pickle(f"{data_splits_folder}/test.pkl")

train_idx = ritme_train_df.index.tolist()
test_idx = ritme_test_df.index.tolist()
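
Because the comparison relies on reusing exactly the same splits, it can be worth confirming that the two index lists are disjoint and free of duplicates. The helper below is a hypothetical sketch (`check_split_indices` is not part of ritme) shown on toy data frames.

```python
import pandas as pd


def check_split_indices(train_idx, test_idx):
    """Raise if the splits share sample IDs or contain duplicates (hypothetical helper)."""
    train_set, test_set = set(train_idx), set(test_idx)
    if len(train_set) != len(train_idx) or len(test_set) != len(test_idx):
        raise ValueError("Duplicate sample IDs within a split")
    overlap = train_set & test_set
    if overlap:
        raise ValueError(f"Sample IDs in both splits: {sorted(overlap)}")
    return len(train_set), len(test_set)


# Toy stand-ins for the pickled train and test data frames
toy_train = pd.DataFrame({"x": [1, 2, 3]}, index=["s1", "s2", "s3"])
toy_test = pd.DataFrame({"x": [4]}, index=["s4"])
split_sizes = check_split_indices(toy_train.index.tolist(), toy_test.index.tolist())
```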

In [4]:
# Load the Sunagawa feature table and metadata, then join them on sample ID
suna_ft = pd.read_csv(path_to_suna_ft, sep="\t", index_col=0)
predictor_cols = suna_ft.columns

suna_md = pd.read_csv(path_to_suna_md, sep="\t", index_col=0)

suna_data = suna_md.join(suna_ft, how="inner")
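
An inner join silently drops samples that appear in only one of the two tables. The sketch below shows one way to list requested sample IDs that did not survive the join; `ids_missing_after_join` and the toy tables are hypothetical and only illustrate the idea.

```python
import pandas as pd


def ids_missing_after_join(requested_ids, joined_df):
    """Return requested sample IDs absent from the joined table (hypothetical helper)."""
    return sorted(set(requested_ids) - set(joined_df.index))


# Toy metadata with three samples and a feature table that lacks "s3"
toy_md = pd.DataFrame(
    {"temperature_mean_degc": [10.0, 20.0, 15.0]}, index=["s1", "s2", "s3"]
)
toy_ft = pd.DataFrame({"otu_0": [1.0, 2.0]}, index=["s1", "s2"])

# The inner join keeps only IDs present in both tables
toy_joined = toy_md.join(toy_ft, how="inner")
missing = ids_missing_after_join(["s1", "s3"], toy_joined)
```

Running such a check before selecting rows with `.loc` gives a readable list of missing IDs instead of a `KeyError`.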

In [5]:
train_df = suna_data.loc[train_idx]
test_df = suna_data.loc[test_idx]

print(f"Train data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

assert train_df.shape[0] == len(train_idx)
assert test_df.shape[0] == len(test_idx)

Train data shape: (129, 7252)
Test data shape: (10, 7252)


In [6]:
# Extract predictors and target
train_predictors = train_df[predictor_cols]
train_target = train_df["temperature_mean_degc"]

test_predictors = test_df[predictor_cols]
test_target = test_df["temperature_mean_degc"]

## Train & evaluate ElasticNet model in Python

In [7]:
# Train ElasticNet with inner 5-fold CV (features are used as loaded; this notebook applies no further rank transform)
enet_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9, 1.0],
    cv=5,
    random_state=123,
    n_jobs=-1,
)
enet_cv.fit(train_predictors, train_target)

  model = cd_fast.enet_coordinate_descent(


| ElasticNetCV parameter | Value |
|---|---|
| l1_ratio | [0.1, 0.5, ...] |
| eps | 0.001 |
| n_alphas | 'deprecated' |
| alphas | 'warn' |
| fit_intercept | True |
| precompute | 'auto' |
| max_iter | 1000 |
| tol | 0.0001 |
| cv | 5 |
| copy_X | True |
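
After fitting, `ElasticNetCV` exposes the hyperparameters selected by the inner cross-validation through its `alpha_` and `l1_ratio_` attributes, and the fitted coefficients through `coef_`. The sketch below inspects them on synthetic data; the `toy_cv`, `X_toy` and `y_toy` names are hypothetical. The larger `max_iter` only illustrates how the solver's iteration cap can be raised and is not a setting taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic regression problem: only the first two of 20 features carry signal
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 20))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=60)

toy_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9, 1.0],
    cv=5,
    random_state=123,
    max_iter=10_000,  # illustrative value, higher than the default of 1000
)
toy_cv.fit(X_toy, y_toy)

# Hyperparameters picked by the inner CV and sparsity of the final model
chosen_alpha = toy_cv.alpha_
chosen_l1_ratio = toy_cv.l1_ratio_
n_selected = int(np.count_nonzero(toy_cv.coef_))
print(
    f"alpha={chosen_alpha:.4g}, l1_ratio={chosen_l1_ratio}, "
    f"non-zero coefficients={n_selected}"
)
```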


In [8]:
# Predict
train_preds = enet_cv.predict(train_predictors)
test_preds = enet_cv.predict(test_predictors)

# Calculate R² and RMSE for train and test data
train_r2 = r2_score(train_target, train_preds)
train_rmse = root_mean_squared_error(train_target, train_preds)
test_r2 = r2_score(test_target, test_preds)
test_rmse = root_mean_squared_error(test_target, test_preds)
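
For reference, the two metrics can be reproduced by hand. The sketch below computes R² as one minus the ratio of residual to total sum of squares, and RMSE as the square root of the mean squared residual, on made-up numbers (`y_true` and `y_pred` are hypothetical values, not results from this notebook).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 12.0, 14.0, 18.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

# RMSE = sqrt(mean of squared residuals), in the units of the target
manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The manual R² agrees with scikit-learn's implementation
assert np.isclose(manual_r2, r2_score(y_true, y_pred))
```

Since RMSE is expressed in the target's units, the values reported below are in degrees Celsius.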

## Display all results

In [9]:
pd_results = pd.DataFrame(
    {
        "R2 test": [test_r2],
        "RMSE test": [test_rmse],
        "R2 train": [train_r2],
        "RMSE train": [train_rmse],
    },
    index=["elasticnet_py"],
).round(3)

pd_results

| | R2 test | RMSE test | R2 train | RMSE train |
|---|---|---|---|---|
| elasticnet_py | 0.802 | 5.688 | 1.0 | 0.049 |
