# Sowing Success: How Machine Learning Helps Farmers Select the Best Crops

![Farmer in a field](farmer_in_a_field.jpg)

Measuring essential soil metrics such as nitrogen, phosphorous, potassium levels, and pH value is an important aspect of assessing soil condition. However, it can be an expensive and time-consuming process, which can cause farmers to prioritize which metrics to measure based on their budget constraints.

Farmers have various options when it comes to deciding which crop to plant each season. Their primary objective is to maximize the yield of their crops, taking into account different factors. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring basic elements such as nitrogen and potassium levels. Each crop has an ideal soil condition that ensures optimal growth and maximum yield.

A farmer has reached out to you, as a machine learning expert, for assistance in selecting the best crop for their field. They have provided a dataset called `soil_measures.csv`, which contains:

- `"N"`: Nitrogen content ratio in the soil
- `"P"`: Phosphorous content ratio in the soil
- `"K"`: Potassium content ratio in the soil
- `"ph"`: pH value of the soil
- `"crop"`: categorical values that contain various crops (target variable).

Each row in this dataset represents various measures of the soil in a particular field. Based on these measurements, the crop specified in the `"crop"` column is the optimal choice for that field.  

### The task in this project involves two main objectives:

1. Predict the Crop Type: Use the variables N (Nitrogen), P (Phosphorous), K (Potassium), and the pH value of the soil to build a machine learning model that predicts the type of crop (categorical target variable) best suited to a given set of soil conditions. This is a classic multi-class classification problem.


2. Identify the Most Significant Variable: Apart from predicting the crop type, a key part of the project is to determine which of these soil metrics (N, P, K, or pH) is the most predictive of the crop type. This involves analyzing the feature importance from the model to see which variable contributes the most to the model's predictive performance. This helps in understanding which soil metric is most critical for deciding the crop type, which can be very valuable for optimizing the use of resources in agricultural practices.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics





In [3]:

crops = pd.read_csv("soil_measures.csv")

In [3]:
crops.head(10)

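| | N | P | K | ph | crop |
|---|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 6.502985 | rice |
| 1 | 85 | 58 | 41 | 7.038096 | rice |
| 2 | 60 | 55 | 44 | 7.840207 | rice |
| 3 | 74 | 35 | 40 | 6.980401 | rice |
| 4 | 78 | 42 | 42 | 7.628473 | rice |
| 5 | 69 | 37 | 42 | 7.073454 | rice |
| 6 | 69 | 55 | 38 | 5.700806 | rice |
| 7 | 94 | 53 | 40 | 5.718627 | rice |
| 8 | 89 | 54 | 38 | 6.685346 | rice |
| 9 | 68 | 58 | 38 | 6.336254 | rice |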


In [4]:
crops.dtypes

N         int64
P         int64
K         int64
ph      float64
crop     object
dtype: object

In [5]:
crops.isnull().sum()

N       0
P       0
K       0
ph      0
crop    0
dtype: int64

In [6]:
crops.describe(include='all')

| | N | P | K | ph | crop |
|---|---|---|---|---|---|
| count | 2200.0 | 2200.0 | 2200.0 | 2200.0 | 2200 |
| unique | | | | | 22 |
| top | | | | | rice |
| freq | | | | | 100 |
| mean | 50.551818 | 53.362727 | 48.149091 | 6.46948 | |
| std | 36.917334 | 32.985883 | 50.647931 | 0.773938 | |
| min | 0.0 | 5.0 | 5.0 | 3.504752 | |
| 25% | 21.0 | 28.0 | 20.0 | 5.971693 | |
| 50% | 37.0 | 51.0 | 32.0 | 6.425045 | |
| 75% | 84.25 | 68.0 | 49.0 | 6.923643 | |


In [7]:
crops["crop"].unique()

array(['rice', 'maize', 'chickpea', 'kidneybeans', 'pigeonpeas',
       'mothbeans', 'mungbean', 'blackgram', 'lentil', 'pomegranate',
       'banana', 'mango', 'grapes', 'watermelon', 'muskmelon', 'apple',
       'orange', 'papaya', 'coconut', 'cotton', 'jute', 'coffee'],
      dtype=object)
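The summary above reports 2,200 rows, 22 unique crops, and a top frequency of 100, which implies that every crop has exactly 100 rows. A quick class-balance check makes that explicit, and a stratified split keeps the balance in both the training and test sets. The sketch below is illustrative only: the `demo` frame is a hypothetical miniature of the dataset, not the real `soil_measures.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the dataset (assumption): 3 crops, 10 rows each
demo = pd.DataFrame({
    "N": range(30),
    "crop": ["rice"] * 10 + ["maize"] * 10 + ["jute"] * 10,
})

# Class balance check: how many rows does each crop have?
class_counts = demo["crop"].value_counts()
print(class_counts)

# A stratified split keeps per-crop proportions equal in train and test
train_demo, test_demo = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["crop"]
)
test_counts = test_demo["crop"].value_counts()
print(test_counts)
```

Without `stratify`, a random split can leave some crops with noticeably fewer test rows than others, which makes per-class metrics noisier.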

# A. Predict the Crop Type

### 1. Logistic Regression Model

In [4]:
# Define features and target variable
X = crops[['N', 'P', 'K', 'ph']] 
y = crops['crop']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the StandardScaler
scaler = StandardScaler()

# Scale the training data and transform the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [5]:
# Initialize the logistic regression model
model = LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs')

# Train the model on the scaled training data
model.fit(X_train_scaled, y_train)

LogisticRegression(max_iter=1000, multi_class='multinomial')

In [6]:
from sklearn.metrics import accuracy_score, classification_report

# Predict the crop types on the scaled test data
y_pred = model.predict(X_test_scaled)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print a detailed classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.6590909090909091
              precision    recall  f1-score   support

       apple       0.88      0.30      0.45        23
      banana       1.00      1.00      1.00        21
   blackgram       0.76      0.65      0.70        20
    chickpea       1.00      0.77      0.87        26
     coconut       0.81      0.63      0.71        27
      coffee       0.76      0.76      0.76        17
      cotton       0.88      0.88      0.88        17
      grapes       0.45      0.93      0.60        14
        jute       0.50      0.48      0.49        23
 kidneybeans       0.45      0.65      0.53        20
      lentil       0.30      0.64      0.41        11
       maize       0.91      1.00      0.95        21
       mango       0.40      0.53      0.45        19
   mothbeans       0.60      0.25      0.35        24
    mungbean       0.67      0.74      0.70        19
   muskmelon       0.62      0.76      0.68        17
      orange       1.00      1.00      1.00        1

### Implementation Steps:
* Data Preparation: Features and labels are defined, data is split into training and testing sets.

* Feature Scaling: StandardScaler is used to standardize the features. This helps logistic regression because its iterative solver (here `lbfgs`) tends to converge faster, and its regularization penalty treats all features evenly, when the features are on a comparable scale.

* Model Training and Prediction: A logistic regression model is initialized and trained on the scaled data. Predictions are then made on the test set.

* Evaluation: The model's performance is assessed using accuracy and a detailed classification report which provides precision, recall, and F1-scores for each class.
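The scaling and training steps above can also be bundled into a single scikit-learn pipeline, which guarantees that the scaler is fitted on training data only. This is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the soil dataset, and the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for soil_measures.csv (assumption): 4 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() scales the training data and trains the classifier in one call;
# score() and predict() reuse the training-set scaling on new data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

pipe_accuracy = pipe.score(X_te, y_te)
print("Pipeline accuracy:", pipe_accuracy)
```

A pipeline is also convenient for serialization, since one object then carries both the preprocessing and the model.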

### Outcomes:
* Overall Accuracy: 65.91%, meaning the model correctly predicts the crop type for about two-thirds of the test set. This leaves clear room for improvement.

* Performance Variability: The model performs well for certain crops like banana and chickpea but struggles with others like pigeonpeas and rice. This variability across classes is a further sign that the model needs improvement.
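To see which crops get mistaken for which, a confusion matrix is a useful complement to the classification report. The sketch below uses a handful of hypothetical labels in place of `y_test` and `y_pred`, so the counts are illustrative and not results from this project.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, standing in for y_test and y_pred
y_true_demo = ["rice", "rice", "jute", "jute", "maize", "maize"]
y_pred_demo = ["jute", "rice", "jute", "rice", "maize", "maize"]

labels_demo = ["jute", "maize", "rice"]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo)

# Rows are the true crops, columns are the predicted crops
cm_df = pd.DataFrame(cm, index=labels_demo, columns=labels_demo)
print(cm_df)
```

Off-diagonal cells show the confusions: in this toy example one rice sample is predicted as jute and one jute sample as rice, while maize is always correct.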

### 2. Random Forest Classifier

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils.class_weight import compute_class_weight
import joblib


crops = pd.read_csv('soil_measures.csv')

# Ensure ph is treated as float
crops['ph'] = crops['ph'].astype(float)

# Encode the target variable
label_encoder = LabelEncoder()
crops['crop'] = label_encoder.fit_transform(crops['crop'])

# Define features and target
X = crops[['N', 'P', 'K', 'ph']]
y = crops['crop']

# Split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Calculate class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(enumerate(class_weights))

# Define the model with class weights
rf = RandomForestClassifier(random_state=42, class_weight=class_weights_dict)

# Set up hyperparameter grid for tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Setup the grid search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, 
                           n_jobs=-1, verbose=2, scoring='f1_macro')

# Fit grid search to the data
grid_search.fit(X_train_scaled, y_train)

# Best model
best_model = grid_search.best_estimator_

# Predict using the best model
y_pred = best_model.predict(X_test_scaled)

# Decode the predictions back to original class labels
y_test_decoded = label_encoder.inverse_transform(y_test)
y_pred_decoded = label_encoder.inverse_transform(y_pred)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test_decoded, y_pred_decoded))


Fitting 3 folds for each of 81 candidates, totalling 243 fits
Accuracy: 0.825
              precision    recall  f1-score   support

       apple       0.76      0.70      0.73        23
      banana       1.00      1.00      1.00        21
   blackgram       0.86      0.90      0.88        20
    chickpea       1.00      1.00      1.00        26
     coconut       0.84      1.00      0.92        27
      coffee       0.94      0.94      0.94        17
      cotton       0.89      1.00      0.94        17
      grapes       0.56      0.64      0.60        14
        jute       0.56      0.83      0.67        23
 kidneybeans       0.77      1.00      0.87        20
      lentil       0.50      0.73      0.59        11
       maize       1.00      0.95      0.98        21
       mango       1.00      0.74      0.85        19
   mothbeans       0.95      0.83      0.89        24
    mungbean       0.79      1.00      0.88        19
   muskmelon       0.59      0.59      0.59        17
   

### Implementation Steps:

* Model Setup: Random Forest Classifier is defined.


* Hyperparameter Tuning: GridSearchCV is utilized to find the optimal parameters (like the number of trees, maximum depth of trees, etc.) across a specified grid. This helps in optimizing the model by tuning it to the best possible configuration for the given data.


* Model Training and Prediction: The best model from GridSearchCV is used to make predictions on the test set.


* Evaluation: Similar to logistic regression, the model’s performance is evaluated using accuracy and a detailed classification report.
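After a grid search, it is worth inspecting which configuration won and how it scored in cross-validation. The following sketch shows the relevant attributes on a deliberately small grid. It assumes synthetic data and uses the built-in `class_weight="balanced"` option, which is similar in spirit to the explicit class-weight dictionary used above; the grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the soil data (assumption): 4 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Deliberately small grid so the example runs quickly
small_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_grid=small_grid, cv=3, scoring="f1_macro",
)
search.fit(X_demo, y_demo)

# The winning configuration and its mean cross-validated macro-F1
print("Best parameters:", search.best_params_)
print("Best CV macro-F1:", round(search.best_score_, 3))
```

Comparing `best_score_` with the test-set score gives a rough sense of whether the tuned model generalizes beyond the folds it was selected on.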

### Outcomes:
* Overall Accuracy: 82.5%, a significant improvement over the logistic regression model.

* Enhanced Performance: The Random Forest model shows not only higher overall accuracy but also improved precision, recall, and F1-scores for most crops. This indicates a better handling of class variability and an overall stronger predictive performance.




### Rationale Behind This Model Selection
* Logistic Regression: A good baseline model for classification tasks, used here in its multinomial form. It is relatively simple and interpretable, but as a linear model it cannot capture non-linear relationships or interactions between features unless they are engineered explicitly.

* Random Forest: An ensemble method that builds multiple decision trees and aggregates their predictions. It is more robust against overfitting than a single decision tree and can capture non-linear patterns in the data, which makes it well suited to multi-class tasks like this one.


The significant improvement in accuracy and class-specific metrics with Random Forest suggests that a more flexible model is better suited to this task. Because several soil metrics jointly influence the crop type, Random Forest can capture interactions and non-linear relationships that the linear logistic regression model cannot.
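A single train/test split can be noisy, so one way to make the model comparison more robust is k-fold cross-validation. This is a minimal sketch assuming synthetic data; the scores it prints say nothing about the soil dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the soil data (assumption): 4 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean accuracy over 5 folds for each candidate model
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean helps judge whether a gap between two models is larger than the fold-to-fold variation.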

# B. Identify the Most Significant Variable

This objective involves analyzing feature importance to determine which soil metric is most predictive of the crop type. The importances are taken from the tuned Random Forest model:

In [15]:
# Extract feature importances
importances = best_model.feature_importances_
feature_names = ['Nitrogen', 'Phosphorous', 'Potassium', 'pH']

# Create a DataFrame to view the importances
importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances}
                             ).sort_values(by='Importance', ascending=False)


print(importances_df)


       Feature  Importance
2    Potassium    0.347810
1  Phosphorous    0.257595
0     Nitrogen    0.205841
3           pH    0.188754


Here, Potassium has the highest importance score, indicating it was the most influential in determining the crop type according to the Random Forest model. Conversely, pH has the lowest score, suggesting it was the least influential of the four features in this specific context.
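The `feature_importances_` attribute of a Random Forest is impurity-based and is computed from the training data. Permutation importance on held-out data is a common cross-check: it measures how much the score drops when one feature's values are shuffled. The sketch below assumes synthetic data with placeholder feature names, so its ranking is illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the soil data (assumption); names are placeholders
feature_names_demo = ["N", "P", "K", "ph"]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=2,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

perm_df = pd.DataFrame(
    {"Feature": feature_names_demo, "Importance": result.importances_mean}
).sort_values(by="Importance", ascending=False)
print(perm_df)
```

If both importance measures agree on the ranking, the conclusion about the most influential soil metric is on firmer ground.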



### Why This Analysis Matters

Understanding feature importance is crucial for several reasons:

* Interpreting the Model: It helps in understanding what drives the model's predictions, providing insights into the underlying patterns in the farm data.

* Feature Selection: Knowing the most important features can guide efforts to streamline the model or focus data collection on specific areas, thereby improving efficiency and effectiveness.

* Practical Applications: In agriculture, identifying the most predictive soil metrics can inform soil management and fertilizer use, leading to more efficient and sustainable farming practices.
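If a farmer can afford to measure only one soil metric, a direct way to compare candidates is to train one model per feature and score each on held-out data. The following is a sketch of that idea under stated assumptions: synthetic data, placeholder feature names, and a logistic regression as the per-feature model. It is not the method used in the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the soil data (assumption); names are placeholders
feature_names_demo = ["N", "P", "K", "ph"]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=3,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Train one classifier per feature and record its weighted F1 on the test set
feature_scores = {}
for i, name in enumerate(feature_names_demo):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr[:, [i]], y_tr)
    preds = clf.predict(X_te[:, [i]])
    feature_scores[name] = f1_score(
        y_te, preds, average="weighted", zero_division=0
    )

best_feature = max(feature_scores, key=feature_scores.get)
print(feature_scores)
print("Best single feature:", best_feature)
```

Unlike model-wide importances, this answers the budget question directly: how well can the crop be predicted from one measurement alone?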


### Why Random Forest Is a Good Choice

Random Forest is an excellent choice for this type of analysis for several reasons:

- Handles Multiclass Classification: It is inherently suited for multiclass classification problems, effectively managing the complexity of predicting multiple crop types.

- Feature Importance: It provides natural feature importance metrics, guiding practical agricultural decisions like prioritizing soil tests for specific nutrients.

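- Non-Linear Relationships: The model captures non-linear relationships between features and the target variable. Because its trees split on thresholds, it also does not depend on feature scaling, so the StandardScaler step used above is harmless but not required for this model.

- Robustness: By averaging many trees trained on bootstrapped samples, Random Forest reduces the risk of overfitting compared with a single decision tree and tends to give more stable predictions.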

### Final Recommendations

- Focus on High-Importance Features: Given the findings, potassium and phosphorous should be prioritized in soil management practices and testing. These nutrients are highly influential in crop prediction and should be closely monitored.

- Regular Model Updating: As new data becomes available, particularly with changes in agricultural practices or crop varieties, the model should be regularly retrained to maintain its accuracy and relevance.

## Model Export and Serialization with joblib

In [16]:

# Save the model, the label encoder, and the scaler
joblib.dump(best_model, 'random_forest_model.pkl')
joblib.dump(label_encoder, 'label_encoder.pkl')
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']
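To use the saved artifacts later, all three must be loaded and applied in the same order as during training: scale the input, predict, then decode the label. The sketch below demonstrates the round trip on a tiny hypothetical model saved to a temporary directory, so that it runs without the real `.pkl` files; the training rows and the `new_sample` values are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny hypothetical training set (assumption); columns are N, P, K, ph
X_demo = np.array([
    [90, 42, 43, 6.5], [85, 58, 41, 7.0],
    [20, 60, 20, 5.8], [25, 65, 22, 5.6],
])
crops_demo = ["rice", "rice", "kidneybeans", "kidneybeans"]

enc = LabelEncoder()
y_demo = enc.fit_transform(crops_demo)
sc = StandardScaler()
model_demo = RandomForestClassifier(n_estimators=10, random_state=42)
model_demo.fit(sc.fit_transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp:
    # Save all three artifacts, then load them back as a fresh session would
    artifacts = {"model": model_demo, "encoder": enc, "scaler": sc}
    for name, obj in artifacts.items():
        joblib.dump(obj, os.path.join(tmp, f"{name}.pkl"))
    loaded_model = joblib.load(os.path.join(tmp, "model.pkl"))
    loaded_encoder = joblib.load(os.path.join(tmp, "encoder.pkl"))
    loaded_scaler = joblib.load(os.path.join(tmp, "scaler.pkl"))

# Same order as in training: scale, predict, decode
new_sample = np.array([[88, 45, 42, 6.6]])
encoded_pred = loaded_model.predict(loaded_scaler.transform(new_sample))
predicted_crop = loaded_encoder.inverse_transform(encoded_pred)[0]
print("Predicted crop:", predicted_crop)
```

Files written by joblib should be loaded only from trusted sources and, ideally, with the same scikit-learn version that produced them.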