# Usecase 1: Age prediction original model set-up

This notebook trains random forest models in R and Python, using the same train-test splits as the ritme runs and the modelling set-up described in [the original publication by Subramanian et al. 2014](https://doi.org/10.1038/nature13421).


This notebook can be run in the following conda environment:

```shell
conda create -n subr_r_env -c conda-forge \
    r-base=4.2.3 \
    r-randomforest \
    r-reticulate \
    r-metrics \
    r-caret \
    r-irkernel \
    python=3.10 \
    pandas \
    rpy2 \
    scikit-learn \
    jupyterlab -y

conda activate subr_r_env
```


The original publication describes its modelling set-up as the default parameters of the R package implementation:

"R package ‘randomForest’, ntree = 10,000, using default mtry of p/3 where p is the number of input 97%-identity OTUs (features)"
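For regression, the `randomForest` package sets its default `mtry` to `max(floor(p / 3), 1)`, so "p/3" in the quote is rounded down rather than to the nearest integer. The sketch below illustrates the difference; the helper name `default_mtry_regression` is hypothetical and only serves this illustration.

```python
def default_mtry_regression(p: int) -> int:
    """Mirror randomForest's regression default: max(floor(p / 3), 1).

    Hypothetical helper for illustration, not part of any library.
    """
    if p < 1:
        raise ValueError("p must be a positive number of features")
    return max(p // 3, 1)


# Compare floor-based mtry with Python's round(), which can round up.
for n_features in (1, 5, 10, 1000):
    print(n_features, default_mtry_regression(n_features), round(n_features / 3))
```

For feature counts where `p / 3` has a fractional part of 0.5 or more, `round()` yields one more feature per split than the R default.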

## Setup

In [None]:
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, root_mean_squared_error

In [None]:
######## USER INPUTS ########

# path to folder where train-test splits used for ritme are stored
data_splits_folder = "data_splits_u1"

# path to filtered and rarefied feature table used by original publication
path_to_subr_ft = "../../data/u1_subramanian14/otu_table_subr14_rar.tsv"
path_to_subr_md = "../../data/u1_subramanian14/md_subr14_rar.tsv"

######## END USER INPUTS #####

## Prepare data

In [None]:
# Get indices of training and test data used by ritme
ritme_train_df = pd.read_pickle(f"{data_splits_folder}/train_val.pkl")
ritme_test_df = pd.read_pickle(f"{data_splits_folder}/test.pkl")

train_idx = ritme_train_df.index.tolist()
test_idx = ritme_test_df.index.tolist()

In [None]:
# Load the Subramanian metadata and feature table and join them on sample ID
subr_ft = pd.read_csv(path_to_subr_ft, sep="\t", index_col=0)
predictor_cols = subr_ft.columns

subr_md = pd.read_csv(path_to_subr_md, sep="\t", index_col=0)

subr_data = subr_md.join(subr_ft, how="inner")

In [None]:
train_df = subr_data.loc[train_idx]
test_df = subr_data.loc[test_idx]

assert train_df.shape[0] == len(train_idx)
assert test_df.shape[0] == len(test_idx)
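The row-count assertions above confirm that every requested sample was found exactly once. A further check that may be worth adding is that the train and test sets do not share samples. The following is a minimal sketch on toy IDs; `check_split` and the sample names are hypothetical and not part of ritme.

```python
def check_split(train_ids, test_ids, available_ids):
    """Report samples shared by both splits and samples absent from the data."""
    train, test, available = set(train_ids), set(test_ids), set(available_ids)
    return {
        "overlap": sorted(train & test),
        "missing": sorted((train | test) - available),
    }


# Toy example: "s3" leaks into both splits, "s9" is not in the joined table.
report = check_split(
    train_ids=["s1", "s2", "s3"],
    test_ids=["s3", "s9"],
    available_ids=["s1", "s2", "s3", "s4"],
)
print(report)
```

In the notebook, the same helper could be called with `train_idx`, `test_idx` and `subr_data.index`, expecting both lists to be empty.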

In [None]:
# Extract predictors and target
train_predictors = train_df[predictor_cols]
train_target = train_df["age_months"]

test_predictors = test_df[predictor_cols]
test_target = test_df["age_months"]

# Convert absolute abundances to relative abundances
train_predictors_rel = train_predictors.div(train_predictors.sum(axis=1), axis=0)
test_predictors_rel = test_predictors.div(test_predictors.sum(axis=1), axis=0)
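The conversion divides every count by its sample total, so each row of the resulting table sums to 1. The sketch below shows this on a small made-up count table (the OTU and sample names are invented) and guards against empty samples, which would otherwise produce `NaN` values.

```python
import pandas as pd

# Made-up count table: rows are samples, columns are OTUs.
counts = pd.DataFrame(
    {"otu_a": [10, 0, 5], "otu_b": [30, 20, 5], "otu_c": [60, 80, 90]},
    index=["s1", "s2", "s3"],
)

row_totals = counts.sum(axis=1)
# A sample with zero total reads would turn into NaN after division.
assert (row_totals > 0).all(), "found samples without any reads"

# axis=0 aligns the totals with the row index, so each row is divided by its own sum.
rel = counts.div(row_totals, axis=0)
print(rel)
```

Since the input table is rarefied, all row totals should already be equal, and the division then amounts to a constant rescaling of the counts.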

## Train & evaluate RandomForest model in R

In [None]:
# Determine the number of predictors
p = train_predictors_rel.shape[1]

In [None]:
# Activate pandas to R DataFrame conversion
pandas2ri.activate()

# Load the R function
with open("n4_run_rf_r_model.R") as f:
    r_code = f.read()
robjects.r(r_code)

# Get & call the R function
run_rf_model = robjects.globalenv["run_rf_model"]

results = run_rf_model(
    train_predictors_rel, train_target, test_predictors_rel, test_target, p
)

In [None]:
# Unpack the metrics returned by the R function
train_r2, train_rmse, test_r2, test_rmse = list(results)

In [None]:
pd_results = pd.DataFrame(
    {
        "R2 test": [test_r2],
        "RMSE test": [test_rmse],
        "R2 train": [train_r2],
        "RMSE train": [train_rmse],
    }
)
pd_results.index = ["original_r"]
pd_results = pd_results.round(3)
pd_results

## Train & evaluate RandomForest model in Python

In [None]:
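# Train Random Forest regression model.
# R's randomForest default for regression is mtry = max(floor(p / 3), 1);
# integer division mirrors that, whereas round() can round up.
rf_model = RandomForestRegressor(
    n_estimators=10000,
    max_features=max(p // 3, 1),
    random_state=123,
    n_jobs=-1,
)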
rf_model.fit(train_predictors_rel, train_target)

# Predictions on train + test
train_preds = rf_model.predict(train_predictors_rel)
test_preds = rf_model.predict(test_predictors_rel)

# Calculate R² and RMSE for training data
train_r2_py = r2_score(train_target, train_preds)
train_rmse_py = root_mean_squared_error(train_target, train_preds)

# Calculate R² and RMSE for testing data
test_r2_py = r2_score(test_target, test_preds)
test_rmse_py = root_mean_squared_error(test_target, test_preds)

pd_results_py = pd.DataFrame(
    {
        "R2 test": [test_r2_py],
        "RMSE test": [test_rmse_py],
        "R2 train": [train_r2_py],
        "RMSE train": [train_rmse_py],
    }
)
pd_results_py.index = ["original_py"]
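The R and Python rows are only directly comparable if both use the same definition of R². scikit-learn's `r2_score` computes the coefficient of determination, 1 − SS_res / SS_tot, while some R helpers report the squared Pearson correlation instead. Which definition `run_rf_model` uses is not shown in this notebook, so the sketch below only illustrates, on made-up numbers, how far the two can diverge. All function names in it are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def r2_squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between observations and predictions."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov**2 / (var_t * var_p)


# Made-up predictions: perfectly correlated with the truth but biased by +1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]

print(rmse(y_true, y_pred))
print(r2_determination(y_true, y_pred))
print(r2_squared_correlation(y_true, y_pred))
```

A constant bias leaves the squared correlation at 1 but lowers the coefficient of determination, so a gap between the two result rows may reflect the metric definition and not the model.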

## Display all results

In [None]:
pd_results = pd.concat([pd_results, pd_results_py])
pd_results = pd_results.round(3)
pd_results