# 🌾 Agricultural Crop Classification with Feedforward Neural Networks (FNN)
**Date:** 2025-11-10

This notebook implements a complete pipeline for classifying the **most suitable crop** for a region from **soil and climate** variables (pH, N, P, K, humidity, temperature, soil type, rainfall, etc.).

## Objectives
- Prepare and explore the data.
- Train **baselines** (Logistic Regression, Random Forest).
- Design and train an **FNN (softmax)** for multiclass classification.
- Evaluate with robust metrics: accuracy, macro F1, confusion matrix, per-class ROC/PR.
- Document the architecture, training, advantages/limitations, and next steps.

> **Tip:** If you don't have your own data, you can start with the public *Crop Recommendation* dataset (N, P, K, temperature, humidity, pH, rainfall, and a crop `label`). Save the CSV as `data/crops.csv` and point `DATA_PATH` to it below.


In [None]:
# === General configuration ===
PROJECT_SEED = 42
DATA_PATH = 'data/crops.csv'  # <-- Change this to the path of your dataset

import os, numpy as np, pandas as pd
np.random.seed(PROJECT_SEED)

# Plots and metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, f1_score, classification_report, confusion_matrix, roc_auc_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# TensorFlow / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

## 1. Loading the data and a quick look

In [None]:
# Load the data
df = pd.read_csv(DATA_PATH)
print(df.shape)
display(df.head())

# Infer the target, numeric, and categorical columns
target_col = 'label' if 'label' in df.columns else 'crop'
numeric_cols = [c for c in df.columns if c not in [target_col] and df[c].dtype != 'object']
categorical_cols = [c for c in df.columns if c not in [target_col] and df[c].dtype == 'object']

print("Target:", target_col)
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)

# Missing values (fraction per column, 10 worst columns)
display(df.isna().mean().sort_values(ascending=False).head(10))
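
The `dtype != 'object'` test in the cell above treats every non-object column as numeric, which would also sweep in boolean or datetime columns if a dataset had them. A minimal sketch of a stricter split using `select_dtypes`, on a toy frame whose column names (`soil_type`, `irrigated`) and values are made up for illustration and are not part of the dataset described here:

```python
import pandas as pd

# Toy frame; column names and values are illustrative assumptions only
toy = pd.DataFrame({
    "ph": [6.5, 7.1, 5.8],
    "N": [90, 40, 60],
    "soil_type": ["clay", "sandy", "loam"],
    "irrigated": [True, False, True],
    "label": ["rice", "maize", "rice"],
})
toy_target = "label"
features = toy.drop(columns=[toy_target])

# "number" selects integer and float columns only (booleans are excluded)
toy_numeric = features.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is routed to the categorical branch
toy_categorical = [c for c in features.columns if c not in toy_numeric]

print("Numeric:", toy_numeric)
print("Categorical:", toy_categorical)
```

With this split the boolean column ends up one-hot encoded instead of standardized; whether that is the right treatment depends on the dataset.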

## 2. Essential EDA (quick and actionable)

In [None]:
# Class distribution
cls_counts = df[target_col].value_counts().sort_values(ascending=False)
print(cls_counts)

plt.figure()
cls_counts.plot(kind='bar')
plt.title('Class distribution')
plt.xlabel('Crop')
plt.ylabel('Frequency')
plt.show()

# Summary statistics of the numeric columns
display(df[numeric_cols].describe().T)

## 3. Stratified Train/Val/Test split

In [None]:
X = df.drop(columns=[target_col])
y = df[target_col].astype('category')
class_names = y.cat.categories.tolist()

# Train (70%) vs. temporary holdout (30%)
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=PROJECT_SEED
)

# Split the holdout evenly into Val and Test (15% of the data each)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=PROJECT_SEED
)

len(X_train), len(X_val), len(X_test)
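
To see what `stratify` buys in the two-stage split above, here is a small sketch on synthetic labels (the class names and the 70/20/10 imbalance are assumptions chosen for illustration). It repeats the same 70/15/15 split and compares class proportions across the partitions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic imbalanced labels (assumed for illustration): 700 / 200 / 100 rows
y_demo = pd.Series(np.repeat(["rice", "maize", "coffee"], [700, 200, 100]))
X_demo = pd.DataFrame({"x": rng.normal(size=len(y_demo))})

# Same two-stage scheme as the notebook: 70% train, then the rest split in half
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

# Class proportions per partition; with stratification the columns should match closely
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "val": y_va.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions.round(3))
```

Running the same check on the real `y_train`, `y_val`, and `y_test` is a cheap way to confirm that rare crops are represented in every partition.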

## 4. Preprocessing (scale numeric features, one-hot encode categorical ones)

In [None]:
numeric_tf = Pipeline(steps=[('scaler', StandardScaler())])
categorical_tf = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_tf, numeric_cols),
        ('cat', categorical_tf, categorical_cols)
    ]
)

# Fit the preprocessor on the training set only
preprocessor.fit(X_train)

# Transform all three partitions with the fitted preprocessor
X_train_t = preprocessor.transform(X_train)
X_val_t   = preprocessor.transform(X_val)
X_test_t  = preprocessor.transform(X_test)

# If there are no categorical columns, get_feature_names_out may fail; handle that case
try:
    feature_names = preprocessor.get_feature_names_out().tolist()
except Exception:
    feature_names = [f'f{i}' for i in range(X_train_t.shape[1])]

X_train_t = X_train_t.toarray() if hasattr(X_train_t, 'toarray') else X_train_t
X_val_t   = X_val_t.toarray() if hasattr(X_val_t, 'toarray') else X_val_t
X_test_t  = X_test_t.toarray() if hasattr(X_test_t, 'toarray') else X_test_t

# Integer class codes for the target (used by the sparse cross-entropy loss and the sklearn metrics)
y_train_idx = y_train.cat.codes.values
y_val_idx   = y_val.cat.codes.values
y_test_idx  = y_test.cat.codes.values

num_classes = len(class_names)
print('Input dim:', X_train_t.shape[1], 'Num classes:', num_classes)

## 5. Baselines (Logistic Regression, Random Forest)

In [None]:
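# Multinomial Logistic Regression
# (the default lbfgs solver already fits a multinomial model for multiclass targets;
# the explicit `multi_class` argument is deprecated in recent scikit-learn releases)
logreg = LogisticRegression(max_iter=1000)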
logreg.fit(X_train_t, y_train_idx)
pred_lr = logreg.predict(X_val_t)
print('LogReg - Acc:', accuracy_score(y_val_idx, pred_lr), 'F1(macro):', f1_score(y_val_idx, pred_lr, average='macro'))

# Random Forest
rf = RandomForestClassifier(n_estimators=300, random_state=PROJECT_SEED)
rf.fit(X_train_t, y_train_idx)
pred_rf = rf.predict(X_val_t)
print('RandomForest - Acc:', accuracy_score(y_val_idx, pred_rf), 'F1(macro):', f1_score(y_val_idx, pred_rf, average='macro'))
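
The baselines report macro F1 next to accuracy because accuracy alone can look good on imbalanced classes. A small worked example with hypothetical labels (90 samples of one class, 10 of another) and a degenerate model that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced validation labels: 90 of class 0, 10 of class 1
y_true_demo = np.array([0] * 90 + [1] * 10)
# A degenerate "model" that always predicts the majority class
y_pred_demo = np.zeros_like(y_true_demo)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
# zero_division=0 scores the never-predicted class as F1 = 0 without a warning
f1_macro_demo = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

print(f"Accuracy: {acc_demo:.3f} | F1 macro: {f1_macro_demo:.3f}")
```

Accuracy is 0.9 here, while macro F1 averages the majority-class F1 (about 0.947) with a zero for the ignored class and lands near 0.474.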

## 6. FNN: architecture and training

In [None]:
tf.keras.utils.set_random_seed(PROJECT_SEED)

def build_fnn(input_dim, num_classes, hidden=(128, 64, 32), dropout=0.2):
    model = keras.Sequential([layers.Input(shape=(input_dim,))])
    for h in hidden:
        model.add(layers.Dense(h, activation='relu'))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(dropout))
    model.add(layers.Dense(num_classes, activation='softmax'))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

model = build_fnn(X_train_t.shape[1], num_classes)
callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=12, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=6)
]

history = model.fit(
    X_train_t, y_train_idx,
    validation_data=(X_val_t, y_val_idx),
    epochs=200,
    batch_size=256,
    callbacks=callbacks,
    verbose=1
)
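
The output layer and the loss used above can be written out by hand. For logits $z$ the softmax is $p_k = e^{z_k} / \sum_j e^{z_j}$, and the sparse categorical cross-entropy for integer labels $y_i$ is $-\frac{1}{n}\sum_i \log p_{i,y_i}$. The NumPy sketch below illustrates those formulas. It is not the Keras implementation, and the clipping constant `eps` is an assumption added for numerical safety:

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability; the result is unchanged
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_idx, probs, eps=1e-7):
    # Pick the predicted probability of the true class for each row
    p_true = probs[np.arange(len(y_idx)), y_idx]
    # Clip to avoid log(0); eps is an illustrative choice
    return float(-np.log(np.clip(p_true, eps, 1.0)).mean())

# Two samples, three classes; labels are integer class indices like y_train_idx
logits_demo = np.array([[2.0, 1.0, 0.1],
                        [0.0, 0.0, 0.0]])
y_idx_demo = np.array([0, 2])

probs_demo = softmax(logits_demo)
loss_demo = sparse_categorical_crossentropy(y_idx_demo, probs_demo)

print(probs_demo.round(3))
print(f"loss: {loss_demo:.4f}")
```

Using integer labels with the sparse loss avoids building a one-hot target matrix, which is why the notebook feeds `y_train_idx` directly to `model.fit`.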

## 7. Evaluation on Test

In [None]:
# Training curve
plt.figure()
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val')
plt.title('Loss curve')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Test metrics
test_probs = model.predict(X_test_t, verbose=0)
test_pred  = test_probs.argmax(axis=1)

acc = accuracy_score(y_test_idx, test_pred)
f1m = f1_score(y_test_idx, test_pred, average='macro')
print(f'FNN Test - Accuracy: {acc:.4f} | F1 macro: {f1m:.4f}')

print("\nPer-class report:")
print(classification_report(y_test_idx, test_pred, target_names=class_names))

# Confusion matrix
cm = confusion_matrix(y_test_idx, test_pred)
plt.figure()
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion matrix')
plt.colorbar()
tick_marks = range(len(class_names))
plt.xticks(tick_marks, class_names, rotation=90)
plt.yticks(tick_marks, class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()
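
The objectives list per-class ROC/PR, which the evaluation cell above does not compute. One possible way to get them is a one-vs-rest treatment of the predicted probabilities. The sketch below uses synthetic labels and scores (assumptions standing in for `y_test_idx` and `test_probs`), so its numbers say nothing about the real model:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(42)
n_classes_demo = 3
n_rows = 300

# Synthetic integer labels standing in for y_test_idx
y_idx_roc = rng.integers(0, n_classes_demo, size=n_rows)

# Synthetic "probabilities": random scores nudged toward the true class, rows normalized to 1
scores = rng.random((n_rows, n_classes_demo))
scores[np.arange(n_rows), y_idx_roc] += 1.0
probs_roc = scores / scores.sum(axis=1, keepdims=True)

# One-vs-rest: binarize the labels and score each class column separately
y_onehot = label_binarize(y_idx_roc, classes=np.arange(n_classes_demo))
per_class_auc = []
for k in range(n_classes_demo):
    auc_k = roc_auc_score(y_onehot[:, k], probs_roc[:, k])
    ap_k = average_precision_score(y_onehot[:, k], probs_roc[:, k])
    per_class_auc.append(auc_k)
    print(f"class {k}: ROC AUC={auc_k:.3f} | average precision={ap_k:.3f}")

# The macro one-vs-rest AUC is the unweighted mean of the per-class values
macro_auc = roc_auc_score(y_idx_roc, probs_roc, multi_class="ovr", average="macro")
print(f"macro OvR ROC AUC: {macro_auc:.3f}")
```

On the notebook's own variables the equivalent call would take `y_test_idx` and `test_probs`, provided every class appears in the test partition.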

## 8. FNN vs. Baselines comparison

In [None]:
from pandas import DataFrame

# Predict once per model on the validation set and reuse the predictions
val_pred_lr  = logreg.predict(X_val_t)
val_pred_rf  = rf.predict(X_val_t)
val_pred_fnn = model.predict(X_val_t, verbose=0).argmax(axis=1)

val_lr_acc = accuracy_score(y_val_idx, val_pred_lr)
val_lr_f1  = f1_score(y_val_idx, val_pred_lr, average='macro')

val_rf_acc = accuracy_score(y_val_idx, val_pred_rf)
val_rf_f1  = f1_score(y_val_idx, val_pred_rf, average='macro')

val_fnn_acc = accuracy_score(y_val_idx, val_pred_fnn)
val_fnn_f1  = f1_score(y_val_idx, val_pred_fnn, average='macro')

results = DataFrame({
    'Modelo': ['LogisticRegression', 'RandomForest', 'FNN'],
    'Val_Accuracy': [val_lr_acc, val_rf_acc, val_fnn_acc],
    'Val_F1_Macro': [val_lr_f1, val_rf_f1, val_fnn_f1]
})

results

## 9. Limitations and next steps
- **Data:** broaden variable coverage (soil texture, organic matter, radiation, NDVI).
- **Minority classes:** apply *class weights* or *oversampling* if the classes are imbalanced.
- **Hyperparameters:** search over layer sizes, *dropout*, and *learning rate*.
- **Explainability:** use Permutation Importance on the Random Forest and compare it with a sensitivity analysis of the FNN.
- **Production:** package the preprocessor and the model (Pickle + SavedModel) and expose them via REST.
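
Two of the next steps listed above, class weights and Permutation Importance, can be sketched with scikit-learn alone. The data below is synthetic (`make_classification` with an assumed 70/20/10 imbalance) and stands in for the crop features, so the weights and rankings are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced 3-class problem standing in for the crop data
X_syn, y_syn = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1],
    random_state=42,
)
X_tr_s, X_va_s, y_tr_s, y_va_s = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=42
)

# "balanced" weights are n_samples / (n_classes * class_count): rare classes weigh more
classes_s = np.unique(y_tr_s)
weights_s = compute_class_weight(class_weight="balanced", classes=classes_s, y=y_tr_s)
class_weight_dict = {int(c): float(w) for c, w in zip(classes_s, weights_s)}
print(class_weight_dict)

# The same {class index: weight} dict can be passed to Keras via fit(class_weight=...)
rf_s = RandomForestClassifier(
    n_estimators=100, class_weight=class_weight_dict, random_state=42
)
rf_s.fit(X_tr_s, y_tr_s)

# Permutation importance: drop in validation macro F1 when one feature is shuffled
perm = permutation_importance(
    rf_s, X_va_s, y_va_s, scoring="f1_macro", n_repeats=5, random_state=42
)
ranking = np.argsort(perm.importances_mean)[::-1]
print("Features by importance:", ranking.tolist())
print(perm.importances_mean.round(3))
```

Permutation importance is model-agnostic, so the same call could in principle be applied to the FNN through a scikit-learn-compatible wrapper, which would make the comparison with the Random Forest direct.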

## 10. Alignment with sustainability (ENEL)
- Better crop recommendations matched to real conditions → **efficient energy use** (irrigation, pumping) and **demand planning**.
- Ability to **map zones** with a high probability of success per crop for **infrastructure planning**.
