<h2>Scikit-learn - Crop dataset - Gilles Witters</h2>

In this notebook I explain how I used scikit-learn to predict crop production from the crop dataset.

First, let's import the libraries we need.

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

Now let's load the crop dataset.

In [14]:
# Load the dataset
crop_df = pd.read_csv('../../4. Other/4.1 Datasets/Crop/Crop cleaned_2.csv')

Let's print the first 5 rows to check that the dataframe loaded correctly.

In [15]:
crop_df.head(5)

Unnamed: 0.1,Unnamed: 0,index,crop,hectares,production,year,yield
0,0,575,3,1573000.0,1580000.0,1922,1.00445
1,1,576,3,1630000.0,1547000.0,1923,0.94908
2,2,577,3,1763000.0,1721000.0,1924,0.976177
3,3,578,3,1598000.0,1659000.0,1925,1.038173
4,4,579,3,1962000.0,2042000.0,1926,1.040775
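
The `Unnamed: 0.1`, `Unnamed: 0` and `index` columns in the output above look like row indices saved by earlier exports rather than real features. Below is a minimal sketch of one way to drop them, assuming that interpretation is right. The small `sample_df` is a hypothetical stand-in for `crop_df`, rebuilt from the first two rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for crop_df, rebuilt from the first two rows printed above
sample_df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "index": [575, 576],
    "crop": [3, 3],
    "hectares": [1573000.0, 1630000.0],
    "production": [1580000.0, 1547000.0],
    "year": [1922, 1923],
    "yield": [1.00445, 0.94908],
})

# Drop columns that look like leftover indices (assumption: they carry no signal)
leftover = [c for c in sample_df.columns if c.startswith("Unnamed") or c == "index"]
clean_df = sample_df.drop(columns=leftover)
print(list(clean_df.columns))
```

The features selected in the next step do not include these columns, so this only tidies the dataframe and does not change the model.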


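Now let's split the data into a training set and a test set. We use crop, hectares, year and yield as features and production as the target; 20% of the rows are held out for testing.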

In [16]:
# Split the data into training and test sets
X = crop_df[['crop', 'hectares', 'year', 'yield']]
y = crop_df['production']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
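
One thing worth checking before training: in the rows printed earlier, `yield` appears to equal `production` divided by `hectares`, which would mean the target can be computed directly from two of the features. Here is a quick check of that assumption, using the five rows shown above as a stand-in for the full `crop_df`:

```python
import numpy as np

# Values copied from the first five rows printed earlier
hectares = np.array([1573000.0, 1630000.0, 1763000.0, 1598000.0, 1962000.0])
production = np.array([1580000.0, 1547000.0, 1721000.0, 1659000.0, 2042000.0])
yield_ = np.array([1.00445, 0.94908, 0.976177, 1.038173, 1.040775])

# If yield = production / hectares, then hectares * yield reproduces production
reconstructed = hectares * yield_
matches = np.allclose(reconstructed, production, rtol=1e-4)
print("production == hectares * yield (within rounding):", matches)
```

If this holds for the whole dataset, a high score may mostly reflect the model recovering hectares × yield, so it could be worth also training a model without `yield` for comparison.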

Now that our data is ready, we will train our model. Because this is a regression task, not a classification task, we will use LinearRegression from sklearn.linear_model.

In [17]:
# Create and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

Now that our model is trained, let's use it to predict production for the test set.

In [18]:
# Make predictions
y_pred = model.predict(X_test)

Finally, let's evaluate the model using mean squared error (MSE) and R-squared (R2) to assess how well it predicts production values.

In [19]:
# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print("Mean Squared Error: {:.2f}".format(mse))
print("R-squared: {:.2f}".format(r2))

Mean Squared Error: 7318892042224.08
R-squared: 0.67


Now let's ask ChatGPT what our metrics mean and whether the model is performing well.

<b>Mean Squared Error (MSE):</b> The MSE is a measure of the average squared difference between the actual production values (ground truth) and the predicted values made by your regression model. A smaller MSE indicates that the model's predictions are closer to the actual values and, therefore, more accurate. In your case, an MSE of 7,318,892,042,224.08 is relatively large, but it's important to consider the context of your data. <br>

<b>R-squared (R2):</b> The R2 score measures the proportion of the variance in the dependent variable (production) that can be explained by the independent variables (Crop, hectares etc...) in your model. It ranges from 0 to 1, where 1 indicates a perfect fit, and 0 suggests that the model does not explain any of the variance. An R2 score of 0.67 means that approximately 67% of the variance in the production values is explained by your model. In other words, the model's predictions reasonably capture the underlying patterns in the data. <br>

In summary, while the MSE is relatively large, the R2 score of 0.67 suggests that your current linear regression model is performing reasonably well in explaining and predicting a significant portion of the variability in the production values. However, there is still room for improvement, and you may explore more advanced regression techniques or feature engineering to further enhance the model's performance.
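
Because MSE is expressed in squared production units, its square root (RMSE) is often easier to read: it is in the same units as `production`. A small sketch using the MSE value printed above:

```python
import math

# MSE reported above for the linear regression model
mse = 7318892042224.08

# RMSE is in the same units as the target (production)
rmse = math.sqrt(mse)
print("Root Mean Squared Error: {:,.0f}".format(rmse))
```

That comes to roughly 2.7 million production units, which is larger than the production values in the first rows shown earlier (about 1.5 to 2 million), so the typical error of the linear model is substantial.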

So we can see there is still room for improvement. Let's try the Gradient Boosting Regressor now.

In [21]:
regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print("Mean Squared Error: {:.2f}".format(mse))
print("R-squared: {:.2f}".format(r2))

Mean Squared Error: 122013803657.78
R-squared: 0.99


Wow, that's a lot better.

Even ChatGPT says:
An MSE of 122,013,803,657.78 and an R2 of 0.99 indicate that your Gradient Boosting Regressor model is performing exceptionally well in predicting production values. The model's predictions are highly accurate, and it explains a large portion of the variance in the production data. These results suggest that your model is a strong predictor for crop production.
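
A score from a single 80/20 split is only one estimate. One way to check how stable it is would be k-fold cross-validation with `cross_val_score`. The sketch below uses synthetic data as a stand-in (an assumption, since the crop CSV is not reproduced here), with production generated as hectares × yield plus noise; `X_demo`, `y_demo` and `demo_regressor` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): production = hectares * yield + noise
rng = np.random.default_rng(0)
hectares = rng.uniform(1e5, 2e6, size=200)
yield_ = rng.uniform(0.8, 1.2, size=200)
X_demo = np.column_stack([hectares, yield_])
y_demo = hectares * yield_ + rng.normal(0, 1e4, size=200)

# Same hyperparameters as the regressor above
demo_regressor = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)

# 5-fold cross-validation, scoring each held-out fold with R-squared
scores = cross_val_score(demo_regressor, X_demo, y_demo, cv=5, scoring="r2")
print("R-squared per fold:", np.round(scores, 3))
print("Mean R-squared: {:.3f}".format(scores.mean()))
```

With the notebook's own data, the same call would take `regressor`, `X` and `y` in place of the demo objects.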

Let's now make a prediction for a specific set of input values.

The model expects the features in this order: crop, hectares, year and yield.

In [27]:
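# Predict with `model`, the linear regression fitted earlier (not the gradient boosting `regressor`)
# Feature order: crop, hectares, year, yield
prediction = model.predict([[2,1000000.0,2025,1.004450095359186]])
print("Predicted Production:", prediction[0])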

Predicted Production: 1794304.0262655085




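The linear regression model predicts a production of around 1,794,304. As a sanity check: in the rows printed earlier, yield equals production divided by hectares, so 1,000,000 hectares at a yield of about 1.00445 would correspond to a production of roughly 1,004,450, well below this prediction.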
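
The model was fitted on a DataFrame with named columns, so passing a plain nested list relies on getting the column order right by hand, and recent scikit-learn versions warn about missing feature names. Here is a sketch of passing a one-row DataFrame instead. The tiny training set is built from the five rows printed earlier and only stands in for the real data; `demo_model` and `new_row` are hypothetical names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

feature_names = ["crop", "hectares", "year", "yield"]

# Stand-in training data (assumption): the five rows printed earlier
train_df = pd.DataFrame({
    "crop": [3, 3, 3, 3, 3],
    "hectares": [1573000.0, 1630000.0, 1763000.0, 1598000.0, 1962000.0],
    "year": [1922, 1923, 1924, 1925, 1926],
    "yield": [1.00445, 0.94908, 0.976177, 1.038173, 1.040775],
    "production": [1580000.0, 1547000.0, 1721000.0, 1659000.0, 2042000.0],
})

demo_model = LinearRegression()
demo_model.fit(train_df[feature_names], train_df["production"])

# Build the new sample with the same column names used during fitting
new_row = pd.DataFrame(
    [[2, 1000000.0, 2025, 1.004450095359186]], columns=feature_names
)
prediction = demo_model.predict(new_row)
print("Predicted Production:", prediction[0])
```

With the notebook's own objects this would be `model.predict(new_row)` for the linear regression or `regressor.predict(new_row)` for the Gradient Boosting Regressor.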