# Cultivating Success: Leveraging Machine Learning to Optimize Crop Selection
Assessing soil health means measuring key metrics such as nitrogen, phosphorus, and potassium levels and pH. These measurements can be costly and time-consuming, so farmers often have to choose which metrics to measure within their budget.

Farmers face numerous decisions regarding which crops to plant each season, with their primary goal being to optimize crop yield. One significant factor influencing crop performance is soil condition, which can be gauged through measurements of essential elements like nitrogen and potassium. Each crop has specific soil requirements that support its optimal growth and yield.

A farmer has sought your expertise in machine learning to help determine the best crop for his field. The dataset provided, `soil_measures.csv`, includes the following columns:

- `"N"`: Nitrogen content ratio in the soil
- `"P"`: Phosphorus content ratio in the soil
- `"K"`: Potassium content ratio in the soil
- `"ph"`: Soil pH value
- `"crop"`: Categorical values representing different crops (target variable)

Each entry in this dataset reflects soil measurements from a specific field, and the crop listed in the `"crop"` column is the ideal choice for that soil condition.

The goal of this project is to build multi-class classification models that predict the crop type, and to identify the single feature that is most predictive of the crop when used on its own.

In [17]:
# All required libraries are imported here for you.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

# Step 1: Read the data into a pandas DataFrame and perform exploratory data analysis

# Display the first five rows of the dataset
crops.head()

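|   | N  | P  | K  | ph       | crop |
|---|----|----|----|----------|------|
| 0 | 90 | 42 | 43 | 6.502985 | rice |
| 1 | 85 | 58 | 41 | 7.038096 | rice |
| 2 | 60 | 55 | 44 | 7.840207 | rice |
| 3 | 74 | 35 | 40 | 6.980401 | rice |
| 4 | 78 | 42 | 42 | 7.628473 | rice |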



The first five rows of `crops` all belong to the crop "rice". Each row holds measurements for nitrogen (N), phosphorus (P), potassium (K), and soil pH, and the values vary from row to row. Seeing only rice here most likely means the file is grouped by crop, not that the dataset is limited to rice; the unique-value check later in the notebook lists the other crops.
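One way to check whether a file is grouped by label is to compare the labels in the first rows with the labels in the whole column and count how many contiguous runs of labels there are. The sketch below uses a tiny made-up frame named `toy` as a stand-in for `crops`; its values are invented for illustration and are not taken from `soil_measures.csv`.

```python
from itertools import groupby

import pandas as pd

# Toy stand-in for `crops`; values are made up for illustration only.
toy = pd.DataFrame({
    "N": [90, 85, 60, 20, 25, 30],
    "ph": [6.5, 7.0, 7.8, 5.9, 6.1, 6.3],
    "crop": ["rice", "rice", "rice", "maize", "maize", "maize"],
})

# Labels visible in the first rows versus labels in the whole column.
head_labels = toy["crop"].head(3).unique().tolist()
all_labels = toy["crop"].unique().tolist()

# Count contiguous runs of identical labels. If every label forms exactly
# one run, the rows are grouped by crop and head() can only show one crop.
n_runs = sum(1 for _ in groupby(toy["crop"].tolist()))
grouped_by_crop = n_runs == toy["crop"].nunique()

print(head_labels, all_labels, grouped_by_crop)
```

On the real data the same idea applies with `crops` in place of `toy`; `crops.sample(5, random_state=0)` is another quick way to look at rows from across the file.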

In [18]:
# Display basic information about the dataset
print(crops.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   N       2200 non-null   int64  
 1   P       2200 non-null   int64  
 2   K       2200 non-null   int64  
 3   ph      2200 non-null   float64
 4   crop    2200 non-null   object 
dtypes: float64(1), int64(3), object(1)
memory usage: 86.1+ KB
None


The dataset has 2,200 entries and 5 columns: `"N"` (Nitrogen content), `"P"` (Phosphorus content), `"K"` (Potassium content), `"ph"` (pH value), and `"crop"` (crop type). No column has missing values. `"N"`, `"P"`, and `"K"` are stored as integers, `"ph"` as a float, and `"crop"` as an object (string) column. With consistent data types and complete entries, the dataset is ready for analysis.

In [19]:
# Check for missing values
missing_values = crops.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 N       0
P       0
K       0
ph      0
crop    0
dtype: int64


This check confirms what `info()` already showed: none of the columns (`N`, `P`, `K`, `ph`, `crop`) contains missing values, so no imputation or row removal is needed before modeling.

In [20]:
# Check unique crop types
unique_crops = crops['crop'].unique()
print("Unique Crop Types:\n", unique_crops)

Unique Crop Types:
 ['rice' 'maize' 'chickpea' 'kidneybeans' 'pigeonpeas' 'mothbeans'
 'mungbean' 'blackgram' 'lentil' 'pomegranate' 'banana' 'mango' 'grapes'
 'watermelon' 'muskmelon' 'apple' 'orange' 'papaya' 'coconut' 'cotton'
 'jute' 'coffee']


The dataset contains 22 distinct crop types. They range from staple grains such as rice and maize, through pulses like chickpea and lentil and fruits like pomegranate and mango, to cash crops such as coffee and cotton. With this many classes, the model faces a broad multi-class classification task in which each crop may have different optimal soil conditions.
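Counting the labels printed above gives a direct check on the number of classes, and from it the accuracy that uniform random guessing would reach. The list below is copied from the printed output; the variable names `n_classes` and `chance_accuracy` are introduced here for illustration.

```python
# Crop labels copied from the printed output of crops['crop'].unique().
unique_crops = [
    'rice', 'maize', 'chickpea', 'kidneybeans', 'pigeonpeas', 'mothbeans',
    'mungbean', 'blackgram', 'lentil', 'pomegranate', 'banana', 'mango',
    'grapes', 'watermelon', 'muskmelon', 'apple', 'orange', 'papaya',
    'coconut', 'cotton', 'jute', 'coffee',
]

# Number of distinct classes the classifier has to separate.
n_classes = len(set(unique_crops))

# Expected accuracy of guessing a class uniformly at random.
chance_accuracy = 1 / n_classes

print(n_classes, round(chance_accuracy, 4))
```

On the loaded frame, `crops['crop'].nunique()` returns the same count without copying the list, and `crops['crop'].value_counts()` shows how many rows each crop has.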

In [21]:
# Step 2: Split the data

# Features and target variable
X = crops.drop('crop', axis=1)
y = crops['crop']


In [22]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
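
The split above is a plain random split. With many classes, one variation worth considering is passing `stratify=y`, which keeps the class proportions the same in the training and test sets. The sketch below shows the effect on a small made-up frame; `toy`, its values, and the `_toy`/`_tr`/`_te` variable names are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 4 classes with 10 rows each (values are made up).
toy = pd.DataFrame({
    "K": range(40),
    "crop": ["rice", "maize", "coffee", "jute"] * 10,
})
X_toy = toy[["K"]]
y_toy = toy["crop"]

# stratify=y_toy asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Rows per class in the test set.
test_counts = y_te.value_counts().to_dict()
print(len(X_te), test_counts)
```

Switching the notebook to a stratified split would change which rows land in the test set, so the accuracy figures reported later would no longer be reproduced exactly.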

In [23]:
# Step 3: Evaluate feature performance

# Create a dictionary to store each feature's predictive performance
best_predictive_feature = {}

# Loop through the features
for feature in X.columns:
    # Create and train the logistic regression model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[[feature]], y_train)
    
    # Predict target values using the test set
    y_pred = model.predict(X_test[[feature]])
    
    # Evaluate the performance of the feature using accuracy as the metric
    accuracy = metrics.accuracy_score(y_test, y_pred)
    
    # Store the feature name and the respective model's evaluation score
    best_predictive_feature[feature] = accuracy


In [24]:
# Step 4: Create the best_predictive_feature variable

# Identify the feature with the highest accuracy
best_feature_name = max(best_predictive_feature, key=best_predictive_feature.get)
best_score = best_predictive_feature[best_feature_name]

# Create the best_predictive_feature variable
best_predictive_feature = {best_feature_name: best_score}

# Display the best predictive feature
print("Best Predictive Feature:", best_predictive_feature)


Best Predictive Feature: {'K': 0.25681818181818183}


The loop fits one logistic regression per feature and compares the resulting test-set accuracies. `"K"` (potassium content) scores highest, at about 0.257. This makes potassium the most predictive single feature among the four when each is used alone with this model and this train/test split; it does not establish that potassium matters most when the features are combined. The absolute score is also modest: a model that sees only `"K"` picks the right crop roughly one time in four, although with 22 crop types that is still well above the roughly 0.045 that uniform random guessing would achieve.
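To put a single-feature accuracy in context, it can help to compare it with a trivial baseline and to look at the full ranking instead of only the top entry. The following is a sketch on synthetic data, not on `soil_measures.csv`: the columns `signal` and `noise`, the class labels, and all values are made up, and `DummyClassifier` from scikit-learn is used as the baseline.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only:
# "signal" shifts with the class, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n_per_class = 50
y_toy = pd.Series(np.repeat(["a", "b", "c", "d"], n_per_class), name="label")
X_toy = pd.DataFrame({
    "signal": np.repeat([0.0, 10.0, 20.0, 30.0], n_per_class)
              + rng.normal(0, 1, y_toy.size),
    "noise": rng.normal(0, 1, y_toy.size),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = metrics.accuracy_score(y_te, baseline.predict(X_te))

# One single-feature logistic regression per column, as in the loop above.
scores = {}
for feature in X_toy.columns:
    model = LogisticRegression(max_iter=1000).fit(X_tr[[feature]], y_tr)
    scores[feature] = metrics.accuracy_score(y_te, model.predict(X_te[[feature]]))

# Full ranking, best feature first.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print("baseline:", baseline_acc)
print("ranking:", ranking)
```

Applied to the crop data, the same pattern would show how far each of `N`, `P`, `K`, and `ph` sits above the baseline and how close the runner-up features are to `K`.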