# Random Forest Regression - HepG2 Cryoprotectant Optimization

Cryoprotectant optimization using Random Forest regression, with feature importance analysis.

## Import libraries and set constants

In [4]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

BASE_DIR = Path('..').resolve()
FEATURES, TARGET = ['% DMSO', 'TREHALOSE'], '% QUEDA DA VIABILIDADE'

## Load and prepare the data

Reads the CSV, converting percentage strings and comma decimals to floats. Rows with missing values are dropped, as are invalid combinations: both concentrations equal to zero, or any concentration outside 0-100%.

In [5]:
def safe_float(x):
    s = str(x).replace('%', '').replace(',', '.').strip()
    return float('nan') if s in ('', 'nan') else float(s)

df = pd.read_csv(BASE_DIR / 'data/raw/hepg2.csv', decimal=',', thousands='.')
for col in FEATURES + [TARGET]:
    df[col] = df[col].apply(safe_float)

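# Drop rows with a missing value in any feature or in the target
df = df.dropna(subset=FEATURES + [TARGET])
# Drop rows with no cryoprotectant at all (0% DMSO and 0% trehalose)
df = df[~((df[FEATURES[0]] == 0) & (df[FEATURES[1]] == 0))]
# Keep only rows whose concentrations lie within 0-100%
df = df[((df[FEATURES] >= 0).all(axis=1)) & ((df[FEATURES] <= 100).all(axis=1))]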

X, y = df[FEATURES].values, df[TARGET].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Dataset: {len(df)} samples | Train: {len(X_train)} | Test: {len(X_test)}")
print(f"Viability drop: {y.min():.2f}% - {y.max():.2f}% (mean: {y.mean():.2f}%)")

Dataset: 200 samples | Train: 160 | Test: 40
Viability drop: 0.15% - 100.00% (mean: 45.10%)
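
The conversion step can be tried in isolation on a few hand-written values. The sketch below mirrors the logic of `safe_float`; the name `parse_percent` and the sample strings are made up for illustration and do not come from `hepg2.csv`.

```python
import math

def parse_percent(value):
    """Convert values such as '12,5%' to 12.5; blanks and 'nan' become NaN."""
    text = str(value).replace('%', '').replace(',', '.').strip()
    if text in ('', 'nan'):
        return float('nan')
    return float(text)

# Hypothetical raw cells: comma decimal, stray spaces, plain float,
# empty string, a NaN as pandas would supply it, and an integer.
samples = ['12,5%', ' 7 %', '3.25', '', float('nan'), 40]
parsed = [parse_percent(s) for s in samples]

# Count how many cells ended up missing after conversion
missing = sum(math.isnan(v) for v in parsed)
print(parsed, "| missing:", missing)
```

Mapping unparseable blanks to NaN, rather than raising, lets the later `dropna` call remove those rows in one place.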


## Train the model and evaluate performance

Trains a forest of 200 trees with a maximum depth of 15 and at least 3 samples per leaf. Computes R², MAE, RMSE, and the feature importances.

In [6]:
rf = RandomForestRegressor(n_estimators=200, max_depth=15, min_samples_leaf=3, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred_test = rf.predict(X_test)
r2_train = r2_score(y_train, rf.predict(X_train))
r2_test = r2_score(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print("="*50)
print("RANDOM FOREST PERFORMANCE")
print("="*50)
print(f"R² (Train): {r2_train:.4f} | R² (Test): {r2_test:.4f}")
print(f"MAE: {mae:.4f}% | RMSE: {rmse:.4f}%")
print("\nFeature Importance:")
for feat, imp in zip(FEATURES, rf.feature_importances_):
    print(f"  {feat}: {imp:.1%}")

RANDOM FOREST PERFORMANCE
R² (Train): 0.9590 | R² (Test): 0.9618
MAE: 5.2239% | RMSE: 6.8736%

Feature Importance:
  % DMSO: 73.7%
  TREHALOSE: 26.3%
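
For reference, the three metrics reported above can be computed directly from their definitions. This is a minimal sketch with NumPy on made-up numbers (not values from the HepG2 dataset); on the same inputs it should agree with `r2_score`, `mean_absolute_error`, and `mean_squared_error`.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_hat

# MAE: mean of the absolute errors
mae = np.mean(np.abs(residuals))

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.4f} | RMSE: {rmse:.4f} | R²: {r2:.4f}")
```

Because the target is expressed in percentage points, MAE and RMSE carry the same unit, which is why the notebook prints them with a `%` suffix.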


## Generate predictions for a grid of combinations

Evaluates every combination of DMSO and trehalose concentrations (0-100%) in 1% steps, excluding the 0%/0% point. Identifies the 15 combinations with the lowest predicted viability drop, i.e. the highest predicted viability.

In [7]:
conc = np.arange(0, 101, 1)
grid = np.array(np.meshgrid(conc, conc)).reshape(2, -1).T
y_pred = rf.predict(grid)

valid = ~((grid[:, 0] == 0) & (grid[:, 1] == 0))
best_idx = np.argmin(y_pred[valid])
best_global_idx = np.where(valid)[0][best_idx]

best_dmso, best_tre = grid[best_global_idx]
best_viab = 100 - y_pred[best_global_idx]

top_idx = np.argsort(y_pred[valid])[:15]
top_global = np.where(valid)[0][top_idx]

print("\n" + "="*50)
print("OPTIMAL RECOMMENDATION")
print("="*50)
print(f"DMSO: {best_dmso:.0f}% | Trehalose: {best_tre:.0f}%")
print(f"Predicted Viability: {best_viab:.2f}%")
print("\nTOP 15 COMBINATIONS:")
print("-" * 60)
print(f"{'Rank':>4} {'DMSO':>7} {'Trehalose':>11} {'Viability':>12}")
print("-" * 60)
for i, idx in enumerate(top_global, 1):
    d, t = grid[idx]
    v = 100 - y_pred[idx]
    print(f"{i:4d} {d:6.0f}% {t:10.0f}% {v:11.2f}%")


OPTIMAL RECOMMENDATION
DMSO: 1% | Trehalose: 0%
Predicted Viability: 97.75%

TOP 15 COMBINATIONS:
------------------------------------------------------------
Rank    DMSO   Trehalose    Viability
------------------------------------------------------------
   1      1%          0%       97.75%
   2      0%          1%       97.75%
   3      3%          0%       97.75%
   4      2%          0%       97.75%
   5      1%          1%       97.75%
   6      2%          1%       97.75%
   7      3%          1%       97.75%
   8      0%          2%       97.75%
   9      2%          2%       97.75%
  10      3%          2%       97.75%
  11      1%          2%       97.75%
  12      6%          2%       96.75%
  13      5%          1%       96.75%
  14      6%          1%       96.75%
  15      4%          2%       96.75%
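
The grid construction and ranking can be followed on a 3 x 3 example. The sketch below uses a hypothetical stand-in for the model (`toy_drop`, a step function chosen only because it produces ties, as the forest does above) and assumes a stable sort is wanted so that tied combinations keep their grid order; the notebook itself uses the default `np.argsort`.

```python
import numpy as np

conc = np.arange(0, 3)  # 0, 1, 2 instead of 0..100

# Same construction as in the notebook: one row per (DMSO, trehalose) pair,
# with the first column varying fastest.
grid = np.array(np.meshgrid(conc, conc)).reshape(2, -1).T

# Hypothetical stand-in for rf.predict(grid): a plateaued "viability drop"
toy_drop = (grid.sum(axis=1) // 2).astype(float)

# Exclude the 0%/0% point, keeping the indices into the full grid
valid_idx = np.flatnonzero(~((grid[:, 0] == 0) & (grid[:, 1] == 0)))

# A stable sort keeps tied predictions in grid order, so reruns rank them identically
order = np.argsort(toy_drop[valid_idx], kind='stable')
top_pairs = grid[valid_idx[order[:3]]]

print(grid.shape)
print(top_pairs)
```

Indexing with `valid_idx[order]` maps positions in the filtered array back to rows of the full grid, which is the role `np.where(valid)[0][top_idx]` plays in the cell above.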


## Compare the prediction with dataset observations

Checks whether the model's recommendation agrees with the best case actually observed in the experimental data.

In [8]:
best_obs_idx = df[TARGET].argmin()
best_obs = df.iloc[best_obs_idx]
best_obs_viab = 100 - best_obs[TARGET]

print("\n" + "="*60)
print("COMPARISON: RF Prediction vs Dataset Observed")
print("="*60)
print(f"\nDataset Best Observed:")
print(f"   DMSO: {best_obs[FEATURES[0]]:.0f}% | Trehalose: {best_obs[FEATURES[1]]:.0f}%")
print(f"   Viability: {best_obs_viab:.2f}% (actual measurement)")

print(f"\nRandom Forest Recommendation:")
print(f"   DMSO: {best_dmso:.0f}% | Trehalose: {best_tre:.0f}%")
print(f"   Predicted Viability: {best_viab:.2f}%")

test_best = np.array([[best_obs[FEATURES[0]], best_obs[FEATURES[1]]]])
rf_pred_for_best = 100 - rf.predict(test_best)[0]
print(f"\n   RF prediction for observed best: {rf_pred_for_best:.2f}%")


COMPARISON: RF Prediction vs Dataset Observed

Dataset Best Observed:
   DMSO: 2% | Trehalose: 0%
   Viability: 99.85% (actual measurement)

Random Forest Recommendation:
   DMSO: 1% | Trehalose: 0%
   Predicted Viability: 97.75%

   RF prediction for observed best: 97.75%
