# Ensemble Modeling of Crop Yield

**Goal:** Develop a machine learning solution that predicts crop yields and identifies the key factors influencing agricultural productivity, enabling better resource planning and improved profitability.

In this notebook we will:
1. Engineer features and preprocess the data
2. Split the data into training, validation, and test sets
3. Apply several models
    - decision tree
    - random forest
    - gradient-boosted trees
    - SVM
4. Tune hyperparameters with grid search
5. Evaluate the models
    - compute MAPE, MAE, RMSE, R², and training time (if it ends up being relevant)
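The evaluation step could be wrapped in one small helper so every model is scored the same way. This is a minimal sketch, not the notebook's actual evaluation code: the function name `regression_report` and the toy numbers are assumptions made for illustration, and only standard `sklearn.metrics` functions are used.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return MAPE, MAE, RMSE and R^2 for one set of predictions (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mape": mean_absolute_percentage_error(y_true, y_pred),  # a fraction, not a percent
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),  # square root puts the error back in the target's units
        "r2": r2_score(y_true, y_pred),
    }


# Toy values purely for illustration (not taken from the dataset).
report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(report)
```

Taking the square root of the MSE explicitly keeps the sketch independent of the scikit-learn version in use.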

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
df = pd.read_csv("yield_df.csv", index_col=0)
df.head()

| | Area | Item | Year | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp |
|---|---|---|---|---|---|---|---|
| 0 | Albania | Maize | 1990 | 36613 | 1485.0 | 121.0 | 16.37 |
| 1 | Albania | Potatoes | 1990 | 66667 | 1485.0 | 121.0 | 16.37 |
| 2 | Albania | Rice, paddy | 1990 | 23333 | 1485.0 | 121.0 | 16.37 |
| 3 | Albania | Sorghum | 1990 | 12500 | 1485.0 | 121.0 | 16.37 |
| 4 | Albania | Soybeans | 1990 | 7000 | 1485.0 | 121.0 | 16.37 |


### Important takeaways from EDA
- rows are keyed by the full combination [Area, Item, Year, hg/ha_yield, average_rain_fall_mm_per_year, pesticides_tonnes, avg_temp], meaning that there are duplicate rows if two regions experienced the same weather and produced similar levels of crops
- if we compare trends across different items, we will need to scale the yields, since some crops are produced at much higher levels than others (the same applies to average rainfall, pesticides, and avg_temp)
- some countries don't produce certain crops
- "Year" represents the season
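The first two takeaways can be checked with a few lines of pandas. The sketch below is an illustration under assumptions: the toy frame reuses the dataset's column names but its values are invented, and per-crop z-scoring is just one possible way to put crops on a common scale.

```python
import pandas as pd

# Toy frame with the same column names as yield_df.csv; the values are invented.
toy = pd.DataFrame({
    "Area": ["A", "A", "B", "B", "B"],
    "Item": ["Maize", "Maize", "Maize", "Potatoes", "Potatoes"],
    "Year": [1990, 1990, 1990, 1990, 1991],
    "hg/ha_yield": [30000, 30000, 40000, 150000, 170000],
})

# Count rows that repeat an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# Standardise yield within each crop so crops with very different
# production levels can be compared on one scale.
grouped = toy.groupby("Item")["hg/ha_yield"]
toy["yield_z"] = (toy["hg/ha_yield"] - grouped.transform("mean")) / grouped.transform("std")

print(n_duplicates)
print(toy)
```

After this transformation each crop's `yield_z` has mean 0, so a value of 1.0 means "one standard deviation above that crop's typical yield" regardless of the crop.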

#### Data stats:
- Items: 10
- Seasons: 23
- Areas: 101
- Total rows: 28242
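Counts like these can be reproduced with `nunique()` and `len()`. The snippet is a sketch on an invented three-row stand-in frame (the name `toy` and its values are assumptions); on the real data the same expressions would be applied to the loaded `df`.

```python
import pandas as pd

# Invented stand-in rows; with the real data this would be the loaded dataframe.
toy = pd.DataFrame({
    "Area": ["Albania", "Albania", "Algeria"],
    "Item": ["Maize", "Potatoes", "Maize"],
    "Year": [1990, 1990, 1991],
})

stats = {
    "items": toy["Item"].nunique(),    # distinct crops
    "seasons": toy["Year"].nunique(),  # distinct years
    "areas": toy["Area"].nunique(),    # distinct countries
    "total_rows": len(toy),
}
print(stats)
```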

# Feature Engineering/Selection

In [14]:
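# Drop the country column; `columns=` already names the axis, so `axis=1` is not needed
engineered_df = df.drop(columns=['Area'])
# One-hot encode crop and year; drop_first removes one reference level per feature
df_encoded = pd.get_dummies(engineered_df, columns=["Item", "Year"], drop_first=True)
df_encoded.head()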


| | hg/ha_yield | average_rain_fall_mm_per_year | pesticides_tonnes | avg_temp | Item_Maize | Item_Plantains and others | Item_Potatoes | Item_Rice, paddy | Item_Sorghum | Item_Soybeans | ... | Year_2004 | Year_2005 | Year_2006 | Year_2007 | Year_2008 | Year_2009 | Year_2010 | Year_2011 | Year_2012 | Year_2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36613 | 1485.0 | 121.0 | 16.37 | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 66667 | 1485.0 | 121.0 | 16.37 | False | False | True | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 23333 | 1485.0 | 121.0 | 16.37 | False | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 12500 | 1485.0 | 121.0 | 16.37 | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 7000 | 1485.0 | 121.0 | 16.37 | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |


In [15]:
from sklearn.model_selection import train_test_split

# Features: every column except the target
X = df_encoded.drop(columns=['hg/ha_yield'])
y = df_encoded['hg/ha_yield']

# Generate training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Split the held-out 30% evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
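As a sanity check on the two-step split above, the resulting set sizes can be inspected directly. This sketch runs the same two `train_test_split` calls on 1,000 synthetic rows; the `*_demo` and shortened variable names are assumptions introduced so the example does not depend on the CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 synthetic rows stand in for the encoded feature matrix.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

# Same two-step split as in the notebook: hold out 30%, then halve it.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
print(sizes)
```

The sizes come out as a 70/15/15 split, and no row appears in more than one set because the second split only draws from the held-out 30%.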


In [17]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train,y_train)

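rf_preds = rf_model.predict(X_test)
# mean_squared_error returns the MSE, in squared units ((hg/ha)^2);
# its square root, np.sqrt(rf_mse), is the RMSE in hg/ha.
rf_mse = mean_squared_error(y_test, rf_preds)
print(f"Random Forest MSE: {rf_mse}")


Random Forest MSE: 186977745.17045566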
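The plan at the top of the notebook lists hyperparameter tuning with grid search. One way this could look for the random forest is sketched below. The parameter grid is hypothetical and deliberately tiny, synthetic data from `make_regression` stands in for `X_train` and `y_train`, and `GridSearchCV` uses k-fold cross-validation on the data it is given rather than the separate validation set created earlier.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data stands in for X_train / y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Hypothetical grid, kept tiny so the example runs in seconds.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises scores, hence the negative sign
    cv=3,
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(search.best_params_, best_rmse)
```

Scoring with `neg_root_mean_squared_error` keeps the tuning criterion in the same units as the target, which makes it directly comparable to an RMSE reported on the test set.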
