# XGBoost Regression (v2 Optimized) - HepG2 Cryoprotectant Optimization

Gradient boosting model with tuned hyperparameters to improve predictive performance.

## Import libraries and configure constants

In [3]:
import pandas as pd, numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings('ignore')

BASE_DIR = Path('..').resolve()
FEATURES, TARGET = ['% DMSO', 'TREHALOSE'], '% QUEDA DA VIABILIDADE'

In [4]:
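def safe_float(x):
    """Parse values such as '12,5%' into floats; blank or 'nan' becomes NaN."""
    s = str(x).replace('%', '').replace(',', '.').strip()
    return float('nan') if s in ('', 'nan') else float(s)

df = pd.read_csv(BASE_DIR / 'data/raw/hepg2.csv', decimal=',', thousands='.')
for col in FEATURES + [TARGET]:
    df[col] = df[col].apply(safe_float)

# Drop rows with missing features or target
df = df.dropna(subset=FEATURES + [TARGET])
# Drop the no-cryoprotectant condition (both concentrations equal to 0)
df = df[~((df[FEATURES[0]] == 0) & (df[FEATURES[1]] == 0))]
# Keep only concentrations inside the 0-100% range
df = df[((df[FEATURES] >= 0).all(axis=1)) & ((df[FEATURES] <= 100).all(axis=1))]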

X, y = df[FEATURES].values, df[TARGET].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Dataset: {len(df)} samples | Train: {len(X_train)} | Test: {len(X_test)}")
print(f"Viability drop: {y.min():.2f}% - {y.max():.2f}% (mean: {y.mean():.2f}%)")

Dataset: 200 samples | Train: 160 | Test: 40
Viability drop: 0.15% - 100.00% (mean: 45.10%)


## Load and prepare data

Reads the CSV, converting percentage strings and comma decimals to floats. Rows with missing values or invalid combinations (both concentrations at 0, or any concentration outside 0-100%) are removed.
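
To illustrate the cleaning rules described above, here is a minimal sketch that runs without the CSV file. The helper name `parse_percent`, the sample strings, and the sample rows are hypothetical and are not taken from `hepg2.csv`; the parsing logic mirrors `safe_float` from the notebook.

```python
import math

def parse_percent(value):
    """Convert strings such as '12,5%' to float; blank or 'nan' becomes NaN."""
    s = str(value).replace('%', '').replace(',', '.').strip()
    return float('nan') if s in ('', 'nan') else float(s)

# Hypothetical raw cells (illustrative only, not from the dataset)
samples = ['12,5%', ' 7 ', '', 'nan', 3.25]
parsed = [parse_percent(v) for v in samples]

# Hypothetical (DMSO, trehalose) rows used to show the two filters
rows = [(0.0, 0.0), (5.0, 10.0), (120.0, 3.0), (2.0, 0.0)]
kept = [
    (d, t) for d, t in rows
    if not (d == 0 and t == 0)           # drop the no-cryoprotectant condition
    and 0 <= d <= 100 and 0 <= t <= 100  # keep only valid percentages
]

print(parsed)
print(kept)
```

Only one of the two concentrations needs to be non-zero for a row to be kept, so a condition such as 2% DMSO with 0% trehalose survives the filter.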

In [5]:
xgb = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.85,
    colsample_bytree=0.9,
    gamma=0.5,
    min_child_weight=2,
    random_state=42,
    verbosity=0
)
# eval_set is only monitored here; without early stopping it does not change the fit
xgb.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

y_pred_test = xgb.predict(X_test)
r2_train = r2_score(y_train, xgb.predict(X_train))
r2_test = r2_score(y_test, y_pred_test)
mae = mean_absolute_error(y_test, y_pred_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print("="*50)
print("XGBOOST PERFORMANCE")
print("="*50)
print(f"R² (Train): {r2_train:.4f} | R² (Test): {r2_test:.4f}")
print(f"MAE: {mae:.4f}% | RMSE: {rmse:.4f}%")
print("\nFeature Importance:")
for feat, imp in zip(FEATURES, xgb.feature_importances_):
    print(f"  {feat}: {imp:.1%}")

XGBOOST PERFORMANCE
R² (Train): 0.9421 | R² (Test): 0.9398
MAE: 6.6003% | RMSE: 8.6245%

Feature Importance:
  % DMSO: 73.4%
  TREHALOSE: 26.6%


In [6]:
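# Evaluate every integer concentration pair from 0% to 100%
conc = np.arange(0, 101, 1)
grid = np.array(np.meshgrid(conc, conc)).reshape(2, -1).T  # columns: DMSO, trehalose
y_pred = xgb.predict(grid)

# Exclude the (0%, 0%) point, which was also removed from the training data
valid = ~((grid[:, 0] == 0) & (grid[:, 1] == 0))
best_idx = np.argmin(y_pred[valid])  # lowest predicted viability drop
best_global_idx = np.where(valid)[0][best_idx]

best_dmso, best_tre = grid[best_global_idx]
best_viab = 100 - y_pred[best_global_idx]  # viability = 100 - predicted drop

# The 15 lowest predicted drops (the order among tied values is not guaranteed)
top_idx = np.argsort(y_pred[valid])[:15]
top_global = np.where(valid)[0][top_idx]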

print("\n" + "="*50)
print("OPTIMAL RECOMMENDATION")
print("="*50)
print(f"DMSO: {best_dmso:.0f}% | Trehalose: {best_tre:.0f}%")
print(f"Predicted Viability: {best_viab:.2f}%")
print("\nTOP 15 COMBINATIONS:")
print("-" * 60)
print(f"{'Rank':>4} {'DMSO':>7} {'Trehalose':>11} {'Viability':>12}")
print("-" * 60)
for i, idx in enumerate(top_global, 1):
    d, t = grid[idx]
    v = 100 - y_pred[idx]
    print(f"{i:4d} {d:6.0f}% {t:10.0f}% {v:11.2f}%")


OPTIMAL RECOMMENDATION
DMSO: 2% | Trehalose: 20%
Predicted Viability: 100.81%

TOP 15 COMBINATIONS:
------------------------------------------------------------
Rank    DMSO   Trehalose    Viability
------------------------------------------------------------
   1      2%         29%      100.81%
   2      2%         20%      100.81%
   3      4%         29%      100.81%
   4      3%         28%      100.81%
   5      2%         28%      100.81%
   6      3%         29%      100.81%
   7      2%         26%      100.81%
   8      2%         22%      100.81%
   9      4%         21%      100.81%
  10      2%         27%      100.81%
  11      3%         27%      100.81%
  12      4%         27%      100.81%
  13      2%         21%      100.81%
  14      4%         20%      100.81%
  15      3%         20%      100.81%


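## Generate predictions for the grid of combinations

Evaluates every combination of concentrations (0-100%) in 1% steps and identifies the 15 combinations with the highest predicted viability. All 15 tie at 100.81%, which is expected from a tree ensemble because its predictions are piecewise constant. The order among tied rows is therefore arbitrary, and the rank-1 row (2% DMSO, 29% trehalose) need not match the reported optimum (2% DMSO, 20% trehalose). A predicted viability above 100% corresponds to a negative predicted viability drop, which is not physically meaningful.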
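
One way to make the grid ranking reproducible when many combinations share the same prediction is to break ties explicitly and to clip the predicted drop to the 0-100% range before reporting viability. The sketch below is not the notebook's method: `predicted_drop` is a hypothetical stand-in for `xgb.predict`, and its plateau region and values are invented for illustration so the example runs without a trained model.

```python
from itertools import product

def predicted_drop(dmso, trehalose):
    """Hypothetical stand-in for the model: piecewise constant on a plateau."""
    if 2 <= dmso <= 4 and 20 <= trehalose <= 29:
        return -0.81  # illustrative plateau below zero (viability above 100%)
    return 5.0 + 0.5 * abs(dmso - 3) + 0.1 * abs(trehalose - 25)

# All integer pairs from 0% to 100%, excluding the (0%, 0%) point
conc = range(0, 101)
grid = [(d, t) for d, t in product(conc, conc) if not (d == 0 and t == 0)]

# Sort by predicted drop, then by DMSO and trehalose, so ties resolve
# deterministically toward the lowest concentrations.
ranked = sorted(grid, key=lambda p: (predicted_drop(*p), p[0], p[1]))
best = ranked[0]
best_drop = predicted_drop(*best)
tied = [p for p in ranked if predicted_drop(*p) == best_drop]

# Clip the drop to [0, 100] so the reported viability stays physical
best_viability = 100 - min(max(best_drop, 0.0), 100.0)

print(best, len(tied), best_viability)
```

Reporting the size of the tied set alongside the single best point makes it clear when the "optimum" is really a region of equally ranked candidates.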

In [7]:
best_obs_idx = df[TARGET].argmin()
best_obs = df.iloc[best_obs_idx]
best_obs_viab = 100 - best_obs[TARGET]

print("\n" + "="*60)
print("COMPARISON: XGBoost Prediction vs Dataset Observed")
print("="*60)
print(f"\nDataset Best Observed:")
print(f"   DMSO: {best_obs[FEATURES[0]]:.0f}% | Trehalose: {best_obs[FEATURES[1]]:.0f}%")
print(f"   Viability: {best_obs_viab:.2f}% (actual measurement)")

print(f"\nXGBoost Recommendation:")
print(f"   DMSO: {best_dmso:.0f}% | Trehalose: {best_tre:.0f}%")
print(f"   Predicted Viability: {best_viab:.2f}%")

test_best = np.array([[best_obs[FEATURES[0]], best_obs[FEATURES[1]]]])
xgb_pred_for_best = 100 - xgb.predict(test_best)[0]
print(f"\n   XGBoost prediction for observed best: {xgb_pred_for_best:.2f}%")


COMPARISON: XGBoost Prediction vs Dataset Observed

Dataset Best Observed:
   DMSO: 2% | Trehalose: 0%
   Viability: 99.85% (actual measurement)

XGBoost Recommendation:
   DMSO: 2% | Trehalose: 20%
   Predicted Viability: 100.81%

   XGBoost prediction for observed best: 98.73%


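## Compare prediction with dataset observations

Checks whether the model's recommendation agrees with the best cases actually observed in the experimental data. Here the best observed condition (2% DMSO, 0% trehalose, 99.85% viability) uses the same DMSO level as the recommendation but no trehalose, and the model predicts 98.73% viability for it.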
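
As a rough way to read this comparison, the gap between the two predictions can be set against the test MAE reported earlier. The sketch below uses only the numbers printed in the notebook output; the variable names are hypothetical, and comparing against the MAE is a heuristic assumed here, not a statistical test.

```python
# Values copied from the notebook output above
recommended_pred = 100.81  # predicted viability at 2% DMSO, 20% trehalose
observed_best = 99.85      # measured viability at 2% DMSO, 0% trehalose
pred_at_observed = 98.73   # model prediction for the observed best condition
test_mae = 6.6003          # mean absolute error on the test set

# Predicted gain of the recommendation over the observed best condition
gap_between_predictions = recommended_pred - pred_at_observed

# Model error on the single best observed measurement
residual_at_observed = observed_best - pred_at_observed

# Heuristic: a gain smaller than the typical error is hard to distinguish from noise
within_typical_error = abs(gap_between_predictions) < test_mae

print(f"Predicted gain: {gap_between_predictions:.2f} points")
print(f"Residual at observed best: {residual_at_observed:.2f} points")
print(f"Gain below test MAE: {within_typical_error}")
```

Under this heuristic, the predicted gain from adding trehalose is smaller than the model's average test error, so it would likely need experimental confirmation before being treated as an improvement.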