<a href="https://colab.research.google.com/github/francobarrionuevoenv21/Codeforenviro_projects_code/blob/main/Ejercicio_final_AA_BARRIONUEVO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 🌎💻📄 **GeoDojo** > Machine Learning Course > Final Project
----

#### **Assignment**

1. Context
* The dataset is _Geo-referenced Annual Crop Yields_, from the Kellogg Biological Station Long-Term Ecological Research team [(link)](https://lter.kbs.msu.edu/datatables/80).
* It contains **annual harvest yields for corn, soybean, and wheat crops** from the _Main Cropping System Experiment (MCSE)_ (October 1996 to July 2013).

2. Tasks

   2.1. Carry out an exploratory data analysis (EDA) of the dataset's variables using at least one chart or table not generated by YData-Profiling. Create at least one new feature derived from the original ones. If you identify a data-quality problem and decide to exclude some records, state it explicitly.

   2.2. Build a model to predict the target variable `crop_flow_lb_s`. Use at least two types of tree ensembles, or one tree ensemble and a linear regression.

   2.3. Evaluate the models' performance using metrics and plots.

   2.4. Select one model and justify the decision.

3. Variables

| variable         | description                                          | unit of measure   |
|------------------|------------------------------------------------------|-------------------|
| `longitude`      | Longitude of the monitor at the time of the record   | degrees           |
| `latitude`       | Latitude of the monitor at the time of the record    | degrees           |
| `crop_flow_lb_s` | Flow rate of the crop passing through the sensor     | pounds per second |
| `datetime`       | Date and time of the sample                          |                   |
| `duration`       | Seconds since the last record                        | seconds           |
| `distance_in`    | Length of the sampled area                           | inches            |
| `swth_wdth_in`   | Width of the sampled area                            | inches            |
| `moisture`       | Gravimetric grain moisture                           | %                 |
| `status`         | Status (up, down)                                    |                   |
| `pass_num`       | Identifier                                           |                   |
| `serial_number`  | Serial number of the yield-monitoring device         |                   |
| `field`          | User-generated description of the surveyed field     |                   |
| `dataset`        | User-generated description of the load within a field |                  |
| `product`        | User-generated description of the surveyed crop      |                   |
| `elevation_ft`   | GPS elevation                                        | feet              |
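
Item 2.1 asks for at least one derived feature. The following is a minimal sketch, assuming the sampled area is simply `distance_in` times `swth_wdth_in`. The column names `area_ft2` and `speed_in_s` are hypothetical, and the three rows only reuse values that appear in the data preview further down, not the full dataset.

```python
import pandas as pd

# Three records with values taken from the data preview (illustration only).
sample = pd.DataFrame({
    "duration": [1, 1, 1],                   # seconds
    "distance_in": [47.0, 38.0, 18.0],       # inches
    "swth_wdth_in": [180.0, 180.0, 180.0],   # inches
})

# Hypothetical derived feature: sampled area in square feet
# (144 square inches per square foot).
sample["area_ft2"] = sample["distance_in"] * sample["swth_wdth_in"] / 144

# Hypothetical derived feature: distance covered per second of record.
sample["speed_in_s"] = sample["distance_in"] / sample["duration"]

print(sample)
```

Records with `duration` equal to 0, if any exist, would need to be handled before computing the speed.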


---

In [None]:
!pip install ydata-profiling --quiet



In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

from ydata_profiling import ProfileReport

# Set Pandas Options
pd.set_option("display.max_columns",   100,
              "display.max_rows",      200,
              "display.float_format",  "{:.2f}".format)

In [None]:
df = pd.read_csv("https://lter.kbs.msu.edu/datatables/80.csv", skiprows = 35)
df.columns = ["longitude", "latitude", "crop_flow_lb_s", "datetime", "duration", "distance_in", "swth_wdth_in",
              "moisture", "status", "pass_num", "serial_number", "field", "dataset", "product", "elevation_ft"]

# Keep only records from 2011 onward
df["datetime"] = pd.to_datetime(df["datetime"])
df = df[df["datetime"] >= "2011-01-01"].copy()

df

|        | longitude | latitude | crop_flow_lb_s | datetime                  | duration | distance_in | swth_wdth_in | moisture | status | pass_num | serial_number | field         | dataset | product            | elevation_ft |
|--------|-----------|----------|----------------|---------------------------|----------|-------------|--------------|----------|--------|----------|---------------|---------------|---------|--------------------|--------------|
| 0      | -85.37    | 42.41    | 2.89           | 2013-07-24 19:20:03+00:00 | 1        | 47.00       | 180.00       | 6.60     | 0.00   | 141      | 5648          | F1: Lysimeter | L1:     | Wheat (Sft Rd Wtr) | 953.50       |
| 1      | -85.37    | 42.41    | 2.89           | 2013-07-24 19:20:02+00:00 | 1        | 38.00       | 180.00       | 6.60     | 0.00   | 141      | 5648          | F1: Lysimeter | L1:     | Wheat (Sft Rd Wtr) | 953.50       |
| 2      | -85.37    | 42.41    | 3.69           | 2013-07-24 19:20:01+00:00 | 1        | 18.00       | 180.00       | 6.60     | 0.00   | 141      | 5648          | F1: Lysimeter | L1:     | Wheat (Sft Rd Wtr) | 953.50       |
| 3      | -85.37    | 42.41    | 3.39           | 2013-07-24 19:20:00+00:00 | 1        | 39.00       | 180.00       | 6.60     | 0.00   | 141      | 5648          | F1: Lysimeter | L1:     | Wheat (Sft Rd Wtr) | 953.50       |
| 4      | -85.37    | 42.41    | 3.99           | 2013-07-24 19:19:59+00:00 | 1        | 61.00       | 180.00       | 6.60     | 0.00   | 141      | 5648          | F1: Lysimeter | L1:     | Wheat (Sft Rd Wtr) | 953.50       |
| ...    | ...       | ...      | ...            | ...                       | ...      | ...         | ...          | ...      | ...    | ...      | ...           | ...           | ...     | ...                | ...          |
| 123456 | -85.37    | 42.41    | 4.18           | 2011-11-14 11:08:28+00:00 | 1        | 28.00       | 180.00       | 0.00     | 1.00   | 1        | 5648          | F1: T1R5-West | L1:     | Corn               | 1013.25      |
| 123457 | -85.37    | 42.41    | 4.08           | 2011-11-14 11:08:27+00:00 | 1        | 28.00       | 180.00       | 0.00     | 1.00   | 1        | 5648          | F1: T1R5-West | L1:     | Corn               | 1013.00      |
| 123458 | -85.37    | 42.41    | 4.48           | 2011-11-14 11:08:26+00:00 | 1        | 28.00       | 180.00       | 0.00     | 1.00   | 1        | 5648          | F1: T1R5-West | L1:     | Corn               | 1013.33      |
| 123459 | -85.37    | 42.41    | 4.28           | 2011-11-14 11:08:25+00:00 | 1        | 29.00       | 180.00       | 0.00     | 1.00   | 1        | 5648          | F1: T1R5-West | L1:     | Corn               | 1013.33      |
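
The EDA in item 2.1 calls for at least one table not produced by YData-Profiling. A per-crop summary is one option. The sketch below rebuilds only the nine rows visible in the preview above (not the full dataset), and the output column names `n`, `mean_flow`, and `zero_moisture_share` are hypothetical. The visible corn rows report a `moisture` of 0.00, which may point to a data-quality issue worth checking on the full data.

```python
import pandas as pd

# Nine rows copied from the preview; the real analysis would use the full df.
preview = pd.DataFrame({
    "product": ["Wheat (Sft Rd Wtr)"] * 5 + ["Corn"] * 4,
    "crop_flow_lb_s": [2.89, 2.89, 3.69, 3.39, 3.99, 4.18, 4.08, 4.48, 4.28],
    "moisture": [6.6] * 5 + [0.0] * 4,
})

# Per-crop record count, mean flow, and share of records with zero moisture.
summary = preview.groupby("product").agg(
    n=("crop_flow_lb_s", "size"),
    mean_flow=("crop_flow_lb_s", "mean"),
    zero_moisture_share=("moisture", lambda s: (s == 0).mean()),
)
print(summary)
```

Running the same aggregation on `df` would show whether zero moisture is limited to specific crops, fields, or dates before deciding to exclude any records.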


In [None]:
df.dtypes

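| column           | dtype               |
|------------------|---------------------|
| `longitude`      | float64             |
| `latitude`       | float64             |
| `crop_flow_lb_s` | float64             |
| `datetime`       | datetime64[ns, UTC] |
| `duration`       | int64               |
| `distance_in`    | float64             |
| `swth_wdth_in`   | float64             |
| `moisture`       | float64             |
| `status`         | float64             |
| `pass_num`       | int64               |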


In [None]:
# Generate a data profiling report
profile = ProfileReport(df, title = "Profiling Report")
profile




In [None]:
target = "crop_flow_lb_s"
regresores = ["longitude", "latitude", "duration", "distance_in", "swth_wdth_in", "pass_num", "elevation_ft"]

train, test = train_test_split(df, test_size = 0.2, random_state = 1111)
X_train = train[regresores]
y_train = train[target]
X_test = test[regresores]
y_test = test[target]

In [None]:
grilla_hiperparametros = {"n_estimators"  : list(np.linspace(50, 150, 10).astype(int)),
                          "learning_rate" : list(np.logspace(np.log10(0.001), np.log10(0.3), base = 10, num = 10)),
                          "random_state"  : [1111]}
grilla_hiperparametros

{'n_estimators': [np.int64(50),
  np.int64(61),
  np.int64(72),
  np.int64(83),
  np.int64(94),
  np.int64(105),
  np.int64(116),
  np.int64(127),
  np.int64(138),
  np.int64(150)],
 'learning_rate': [np.float64(0.001),
  np.float64(0.0018846716378431187),
  np.float64(0.003551987182490267),
  np.float64(0.006694329500821696),
  np.float64(0.01261661294457514),
  np.float64(0.023778172582285134),
  np.float64(0.04481404746557166),
  np.float64(0.08445976423531824),
  np.float64(0.15917892219322097),
  np.float64(0.29999999999999993)],
 'random_state': [1111]}
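
The grid above can be checked with a few lines of NumPy. This sketch reproduces the two value lists, shows that the learning rates share a constant ratio between neighbours, and counts the combinations. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

n_estimators = np.linspace(50, 150, 10).astype(int)
learning_rate = np.logspace(np.log10(0.001), np.log10(0.3), num=10)

# astype(int) truncates, so 61.11 becomes 61 and 138.89 becomes 138.
print(n_estimators.tolist())

# Log-spaced values have a constant ratio between consecutive entries.
ratios = learning_rate[1:] / learning_rate[:-1]
print(ratios.round(4))

# np.geomspace builds the same sequence directly from the two endpoints.
same = np.allclose(learning_rate, np.geomspace(0.001, 0.3, num=10))

# Every n_estimators value can pair with every learning_rate value.
n_combinations = len(n_estimators) * len(learning_rate)
print(same, n_combinations)
```

With 100 possible combinations, a randomized search limited to `n_iter = 10` evaluates a tenth of the grid.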

In [None]:
%%time
# Tree-ensemble regression (Gradient Boosting)
reg = RandomizedSearchCV(estimator = GradientBoostingRegressor(),
                         param_distributions = grilla_hiperparametros,
                         n_jobs = -1,
                         cv = 5,
                         n_iter = 10,
                         random_state = 1111)
reg.fit(X_train, y_train)

CPU times: user 19.8 s, sys: 730 ms, total: 20.5 s
Wall time: 6min 28s
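
When no `scoring` argument is given, `RandomizedSearchCV` ranks candidates with the estimator's own `score` method, which for scikit-learn regressors is R². The notebook reports RMSE, so one option is to make the search optimize that metric as well. The sketch below assumes synthetic data from `make_regression` and a deliberately small grid so that it runs quickly. It is not the notebook's actual search.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's data, only to keep the sketch runnable.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1111)

# Small illustrative grid (assumption, not the grid used in the notebook).
small_grid = {
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1, 0.3],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=1111),
    param_distributions=small_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    n_iter=4,
    cv=3,
    random_state=1111,
)
search.fit(X, y)

# Flip the sign to read the cross-validated RMSE of the best candidate.
cv_rmse = -search.best_score_
print(search.best_params_, round(cv_rmse, 2))
```

`search.cv_results_` holds the mean score of every sampled candidate, which helps when justifying the selected model.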


In [None]:
df_train = pd.concat([X_train, y_train], axis = 1)
df_test = pd.concat([X_test, y_test], axis = 1)

df_train_pred = df_train.copy()
df_test_pred = df_test.copy()

# Predict with the model (fit on the training set only)
df_train_pred["y_pred"] = reg.predict(df_train_pred[regresores])
df_test_pred["y_pred"] = reg.predict(df_test_pred[regresores])

# Metrics (train vs. test)
metrica_train_rmse = np.round(root_mean_squared_error(df_train_pred[target], df_train_pred["y_pred"]),2).astype(str)
metrica_test_rmse = np.round(root_mean_squared_error(df_test_pred[target], df_test_pred["y_pred"]),2).astype(str)

# Print the best hyperparameter combination
print("Best hyperparameter combination:")
print(reg.best_params_)
print("")

print("RMSE - Train:", metrica_train_rmse, "Test:", metrica_test_rmse)

Best hyperparameter combination:
{'random_state': 1111, 'n_estimators': np.int64(127), 'learning_rate': np.float64(0.15917892219322097)}

RMSE - Train: 2.37 Test: 2.38
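
An RMSE value is easier to judge next to a naive baseline. Predicting the mean of the target for every record gives an RMSE equal to the population standard deviation of that target. The sketch below illustrates this with the nine `crop_flow_lb_s` values visible in the data preview, not with the full dataset. The helper `rmse` is a hypothetical name.

```python
import numpy as np

# Target values copied from the data preview (illustration only).
y = np.array([2.89, 2.89, 3.69, 3.39, 3.99, 4.18, 4.08, 4.48, 4.28])

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Baseline: predict the mean for every record.
baseline = rmse(y, np.full_like(y, y.mean()))

# The baseline RMSE matches the population standard deviation (ddof=0).
print(round(baseline, 3), round(float(y.std()), 3))
```

In the notebook, the equivalent check would compare the test RMSE against the RMSE obtained by predicting `y_train.mean()` for every row of `X_test`.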
