This notebook builds and evaluates machine learning models for crop recommendation, using a dataset whose features describe soil nutrients and climate conditions. After label encoding the target and standardizing the features, we train **four different classifiers**: Logistic Regression, K-Nearest Neighbors, Random Forest, and Gradient Boosting. Each model is assessed with **5-fold cross-validation**, so their performance can be compared on accuracy and on weighted-average precision, recall, and F1 score. The goal is to *identify the best-performing model to assist in predicting optimal crops based on environmental data*.
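
The standardization and cross-validation steps described above can also be combined in a scikit-learn `Pipeline`, so that the scaler is refit on the training portion of every fold. The following is a minimal sketch under stated assumptions, not the notebook's own method: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_classification`, and the shuffled `StratifiedKFold` splitter and fixed `random_state` values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crop data (assumption: 7 numeric features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

# The scaler is a pipeline step, so it is fit only on each fold's training rows
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# Stratified 5-fold split with shuffling, seeded for repeatable folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_demo = cross_validate(
    pipeline, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1_weighted"]
)

print("Mean accuracy:", cv_demo["test_accuracy"].mean())
print("Mean weighted F1:", cv_demo["test_f1_weighted"].mean())
```

The same pattern should apply to any of the four classifiers by swapping the final pipeline step.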

In [None]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

# Set up a directory for Kaggle API configuration
#os.makedirs('/root/.kaggle', exist_ok=True)

# Move the kaggle.json file (API key) to the correct directory for authentication
#!cp kaggle.json /root/.kaggle/

# Set the correct permissions to protect the Kaggle API key file
#!chmod 600 /root/.kaggle/kaggle.json

# Download the dataset using Kaggle API
!kaggle datasets download -d siddharthss/crop-recommendation-dataset

# Unzip the downloaded dataset (-o overwrites existing files without prompting)
!unzip -o crop-recommendation-dataset.zip

# Load the dataset into a DataFrame
df = pd.read_csv('Crop_recommendation.csv')
df.head()  # Display the first few rows of the dataset

Dataset URL: https://www.kaggle.com/datasets/siddharthss/crop-recommendation-dataset
License(s): Attribution 3.0 IGO (CC BY 3.0 IGO)
crop-recommendation-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  crop-recommendation-dataset.zip
replace Crop_recommendation.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

    N   P   K  temperature   humidity        ph    rainfall  label
0  90  42  43    20.879744  82.002744  6.502985  202.935536   rice
1  85  58  41    21.770462  80.319644  7.038096  226.655537   rice
2  60  55  44    23.004459  82.320763  7.840207  263.964248   rice
3  74  35  40    26.491096  80.158363  6.980401  242.864034   rice
4  78  42  42    20.130175  81.604873  7.628473  262.717340   rice


In [None]:
# Check whether the dataset has any missing values
missing_values = df.isnull().sum()
print(missing_values)  # Print count of missing values per column

N              0
P              0
K              0
temperature    0
humidity       0
ph             0
rainfall       0
label          0
dtype: int64


In [None]:


# Encode target labels as numeric values
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])

# Separate features (X) and target (y)
X = df.drop('label', axis=1)
y = df['label']

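# Standardize the feature data
# Note: the scaler is fit on the full dataset before cross-validation, so each
# fold's held-out rows contribute to the scaling statistics; putting the scaler
# and model in a sklearn Pipeline would refit it on the training folds only
scaler = StandardScaler()
X = scaler.fit_transform(X)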

# Initialize classifiers to test 4 algorithms
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

# Define the scoring metrics for model evaluation
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')
}

# Perform 5-fold cross-validation for each model
results = {}
for model_name, model in models.items():
    print(f"\nModel: {model_name}")
    cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

    # Calculate and store the average of each metric across the 5 folds
    results[model_name] = {
        'Accuracy': np.mean(cv_results['test_accuracy']),
        'Precision': np.mean(cv_results['test_precision']),
        'Recall': np.mean(cv_results['test_recall']),
        'F1 Score': np.mean(cv_results['test_f1'])
    }

    # Display the per-fold accuracies and the mean of each metric
    print("accuracies of all folds: ", cv_results['test_accuracy'])
    print(f"Accuracy: {results[model_name]['Accuracy']:.4f}")
    print(f"Precision: {results[model_name]['Precision']:.4f}")
    print(f"Recall: {results[model_name]['Recall']:.4f}")
    print(f"F1 Score: {results[model_name]['F1 Score']:.4f}")



Model: Logistic Regression
accuracies of all folds:  [0.97272727 0.95454545 0.97727273 0.96818182 0.98409091]
Accuracy: 0.9714
Precision: 0.9734
Recall: 0.9714
F1 Score: 0.9712

Model: K-Nearest Neighbors
accuracies of all folds:  [0.96363636 0.97045455 0.97045455 0.97045455 0.98181818]
Accuracy: 0.9714
Precision: 0.9744
Recall: 0.9714
F1 Score: 0.9712

Model: Random Forest
accuracies of all folds:  [0.99772727 0.99318182 0.99772727 0.99545455 0.98636364]
Accuracy: 0.9941
Precision: 0.9946
Recall: 0.9941
F1 Score: 0.9941

Model: Gradient Boosting


In this notebook, we tested four machine learning models (Logistic Regression, K-Nearest Neighbors, Random Forest, and Gradient Boosting) using 5-fold cross-validation. Among them, the **Random Forest Classifier** gave the best overall results, with a mean accuracy of 0.9941 and a weighted F1 score of 0.9941, compared with a mean accuracy of 0.9714 for both Logistic Regression and K-Nearest Neighbors. This makes it the top choice for this crop recommendation task. Its strong performance is likely due to its ability to capture non-linear relationships and interactions among the soil and climate features, which helps it make more reliable predictions.
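
The comparison above can be made explicit by tabulating the mean scores and sorting them. The sketch below is illustrative only: the `reported` dictionary (a hypothetical name) holds the mean values copied from the printed output above, and Gradient Boosting is left out because its scores are not shown there. In the notebook itself, the `results` dictionary has the same nested shape and could be passed in directly.

```python
import pandas as pd

# Mean cross-validation scores copied from the printed output above.
# Gradient Boosting is omitted because its output is not shown.
reported = {
    "Logistic Regression": {"Accuracy": 0.9714, "Precision": 0.9734,
                            "Recall": 0.9714, "F1 Score": 0.9712},
    "K-Nearest Neighbors": {"Accuracy": 0.9714, "Precision": 0.9744,
                            "Recall": 0.9714, "F1 Score": 0.9712},
    "Random Forest": {"Accuracy": 0.9941, "Precision": 0.9946,
                      "Recall": 0.9941, "F1 Score": 0.9941},
}

# One row per model, one column per metric, sorted by weighted F1 (best first)
summary = pd.DataFrame.from_dict(reported, orient="index")
summary = summary.sort_values("F1 Score", ascending=False)

best_model = summary.index[0]
print(summary)
print("Best by F1 Score:", best_model)
```

Sorting by F1 score is one possible ranking criterion; accuracy or recall could be used instead, depending on which kind of error matters most for the recommendation task.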