**Project Description:** A farmer reached out to you as a machine learning expert seeking help to select the best crop for his field. Due to budget constraints, the farmer explained that he could afford to measure only one of the four essential soil measures:<br>
•	Nitrogen content ratio in the soil <br>
•	Phosphorus content ratio in the soil <br>
•	Potassium content ratio in the soil <br>
•	pH value of the soil <br>
This is a classic feature selection problem: the objective is to pick the single most important feature for predicting the crop accurately. <br>

**Sowing Success - How Machine Learning Helps Farmers Select the Best Crops:** <br> 
Farmers have several options when deciding which crop to plant each season, and their primary objective is to maximize yield. One crucial factor that affects crop growth is the condition of the soil in the field, which can be assessed by measuring essential metrics such as nitrogen, phosphorus, and potassium levels and pH value. Each crop has an ideal soil condition that ensures optimal growth and maximum yield. However, measuring these metrics can be expensive and time-consuming, so farmers often have to prioritize which ones to measure within their budget. The farmer has provided a dataset called soil_measures.csv, which contains: <br>
"N": Nitrogen content ratio in the soil <br>
"P": Phosphorus content ratio in the soil <br>
"K": Potassium content ratio in the soil <br>
"ph": pH value of the soil <br>
"crop": categorical values that contain various crops (target variable). <br>

Each row in this dataset represents the soil measurements for a particular field. Based on these measurements, the crop specified in the "crop" column is the optimal choice for that field. In this project, you will build multi-class classification models to predict the type of "crop" and identify the single most important feature for predictive performance. <br>
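
Before modelling, it can help to confirm that a loaded table matches the layout described above. The following is a minimal sketch, not part of the original notebook: the helper `summarize_soil_frame` and the three-row sample are hypothetical, and the lowercase `ph` column name is assumed from the notebook code rather than from the description.

```python
import pandas as pd

# Column names assumed from the notebook code (note the lowercase 'ph')
EXPECTED_COLUMNS = ["N", "P", "K", "ph", "crop"]


def summarize_soil_frame(df):
    """Check that the expected columns exist and return a few basic counts."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return {
        "n_rows": len(df),
        "n_missing_values": int(df[EXPECTED_COLUMNS].isna().sum().sum()),
        "n_crops": int(df["crop"].nunique()),
    }


# Tiny made-up sample standing in for soil_measures.csv (illustrative values only)
sample = pd.DataFrame({
    "N": [90, 20, 85],
    "P": [42, 60, 58],
    "K": [43, 20, 41],
    "ph": [6.5, 5.8, 7.0],
    "crop": ["rice", "lentil", "rice"],
})

summary = summarize_soil_frame(sample)
print(summary)
```

With the real file, the same helper could be called on the result of `pd.read_csv("soil_measures.csv")`.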

Identify the single feature that has the strongest predictive performance for classifying crop types. <br>
•	Find the feature in the dataset that produces the best score for predicting "crop". <br>
•	From this information, create a variable called best_predictive_feature: a dictionary containing the best predictive feature name as the key and the evaluation score (for the metric you chose) as the value. <br>
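
One way to approach the task above is to fit the same classifier on each feature by itself and compare cross-validated scores. The sketch below is an illustration under stated assumptions rather than the notebook's own method: the helper `score_single_features`, the choice of logistic regression, accuracy as the metric, and the synthetic `demo` data (in which only `K` is made informative) are all hypothetical stand-ins for the real soil_measures.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def score_single_features(df, features, target="crop", cv=5):
    """Return {feature: mean cross-validated accuracy}, using one feature at a time."""
    scores = {}
    for feature in features:
        # Scale the single feature, then fit a multi-class logistic regression
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
        cv_scores = cross_val_score(
            model, df[[feature]], df[target], cv=cv, scoring="accuracy"
        )
        scores[feature] = float(cv_scores.mean())
    return scores


# Synthetic stand-in for soil_measures.csv (made-up values, not the real data).
# Only K depends on the crop here, so K should come out on top.
rng = np.random.default_rng(0)
n = 300
crop = rng.choice(["rice", "maize", "lentil"], size=n)
k_means = {"rice": 20.0, "maize": 60.0, "lentil": 100.0}
demo = pd.DataFrame({
    "N": rng.normal(50, 30, n),
    "P": rng.normal(50, 30, n),
    "K": np.array([k_means[c] for c in crop]) + rng.normal(0, 5, n),
    "ph": rng.normal(6.5, 0.8, n),
    "crop": crop,
})

feature_scores = score_single_features(demo, ["N", "P", "K", "ph"])
best_name = max(feature_scores, key=feature_scores.get)

# Dictionary with the best feature name as key and its score as value
best_predictive_feature = {best_name: feature_scores[best_name]}
print(best_predictive_feature)
```

On the real dataset the ranking may differ from this toy example, and another metric (for instance a macro-averaged F1 score) could be substituted through the `scoring` argument.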

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the dataset
crops = pd.read_csv("soil_measures.csv")

# Check for null values and describe the dataset
print(crops.isna().sum())
print(crops.describe())

# Encode the target variable using LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(crops['crop'])

# Define the features and target variable
X = crops[['N', 'P', 'K', 'ph']]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

# Initialize variables to store the best score and best model name
best_score = 0
best_model_name = ""

# Define a list of classification models to be included in the pipeline
models = [
    ('knn', KNeighborsClassifier()),
    ('decision_tree', DecisionTreeClassifier()),
    ('logistic_regression', LogisticRegression(max_iter=10000, solver='saga')),
    ('random_forest', RandomForestClassifier()),
    ('svc', SVC())
]

# Define parameter grids for GridSearchCV for each model
param_grids = {
    'knn': {'knn__n_neighbors': [2, 3, 4, 5, 6]},
    'decision_tree': {'decision_tree__max_depth': [5, 6, 7, 8, 9, 10]},
    'logistic_regression': {'logistic_regression__C': [0.1, 1.0, 10.0]},
    'random_forest': {'random_forest__n_estimators': [50, 100], 'random_forest__max_depth': [None, 10]},
    'svc': {'svc__C': [0.1, 1.0], 'svc__kernel': ['linear', 'rbf']}
}

# Function to perform GridSearchCV and evaluate models
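def evaluate_models(X_train_selected, X_test_selected):
    global best_score, best_model_name
    # Reset the trackers so each feature set is judged on its own results,
    # not on the best score left over from a previous call
    best_score, best_model_name = 0, ""
    best_pipelines = {}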
    for name, model in models:
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            (name, model)
        ])
        grid_search = GridSearchCV(pipeline, param_grids[name], cv=5, n_jobs=-1)
        grid_search.fit(X_train_selected, y_train)
        
        print(f"Best parameters for {name}: {grid_search.best_params_}")
        print(f"Best score for {name}: {grid_search.best_score_}")
        
        best_pipelines[name] = grid_search.best_estimator_

        if grid_search.best_score_ > best_score:
            best_score = grid_search.best_score_
            best_model_name = name

    # Evaluate the best model on the test set for each pipeline
    for name, best_pipeline in best_pipelines.items():
        y_pred = best_pipeline.predict(X_test_selected)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Test set accuracy of the best {name} model: {accuracy}")
    
    # Generate test-set predictions with the best-performing model for this feature set
    best_pipeline = best_pipelines[best_model_name]
    best_crop_prediction = best_pipeline.predict(X_test_selected)

    # Report the crop this model predicts most often on the test set
    # (a summary of its predictions, not a recommendation for a specific field)
    best_crop_name = label_encoder.inverse_transform([pd.Series(best_crop_prediction).mode()[0]])[0]

    print("Best Performing Model:", best_model_name)
    print("Most Frequently Predicted Crop:", best_crop_name)

# Feature Importance: Select features based on their importance scores from a RandomForestClassifier.
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print("Feature Importance:\n", feature_importance_df)

top_features_importance = feature_importance_df['Feature'].head(3).tolist()
X_train_importance_selected = X_train[top_features_importance]
X_test_importance_selected = X_test[top_features_importance]

print("\nEvaluating models with Feature Importance selected features:")
evaluate_models(X_train_importance_selected, X_test_importance_selected)

# RFECV: Automatically select the optimal number of features using recursive feature elimination with cross-validation.
rfecv_selector = RFECV(RandomForestClassifier(), step=1, cv=5, n_jobs=-1)
rfecv_selector.fit(X_train, y_train)
selected_features_rfecv = X.columns[rfecv_selector.support_]
print("RFECV Selected Features:", selected_features_rfecv)

X_train_rfecv_selected = rfecv_selector.transform(X_train)
X_test_rfecv_selected = rfecv_selector.transform(X_test)

print("\nEvaluating models with RFECV selected features:")
evaluate_models(X_train_rfecv_selected, X_test_rfecv_selected)

# Ensemble Methods: Combine multiple models to improve overall performance.
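# Scale inside a pipeline, as in the tuned models above, so the linear models see standardized inputs
ensemble = Pipeline([
    ('scaler', StandardScaler()),
    ('voting', VotingClassifier(estimators=[
        ('logistic_regression', LogisticRegression(C=10.0, max_iter=20000, solver='saga')),
        ('random_forest', RandomForestClassifier(n_estimators=100, max_depth=10)),
        ('svc', SVC(C=1.0, kernel='linear'))
    ], voting='hard', n_jobs=-1))
])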

ensemble.fit(X_train_rfecv_selected, y_train)
ensemble_accuracy = accuracy_score(y_test, ensemble.predict(X_test_rfecv_selected))
print(f"\nEnsemble Test Set Accuracy: {ensemble_accuracy}")

# RandomizedSearchCV: Explore a broader range of hyperparameters to find the best model configuration.
param_grids_randomized = {
    'knn': {'knn__n_neighbors': range(1, 20)},
    'decision_tree': {'decision_tree__max_depth': range(1, 20)},
    'logistic_regression': {'logistic_regression__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
    'random_forest': {'random_forest__n_estimators': [50, 100, 200], 'random_forest__max_depth': [None, 10, 20]},
    'svc': {'svc__C': [0.1, 1.0, 10.0], 'svc__kernel': ['linear', 'rbf']}
}

best_pipelines_randomized = {}
for name, model in models:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        (name, model)
    ])
    random_search = RandomizedSearchCV(pipeline, param_grids_randomized[name], n_iter=5, cv=5, random_state=12, n_jobs=-1)
    random_search.fit(X_train_rfecv_selected, y_train)
    
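    print(f"Best parameters for {name} (RandomizedSearchCV): {random_search.best_params_}")
    print(f"Best score for {name} (RandomizedSearchCV): {random_search.best_score_}")

    # Keep the tuned pipeline so it can be reused later
    best_pipelines_randomized[name] = random_search.best_estimator_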

N       0
P       0
K       0
ph      0
crop    0
dtype: int64
                 N            P            K           ph
count  2200.000000  2200.000000  2200.000000  2200.000000
mean     50.551818    53.362727    48.149091     6.469480
std      36.917334    32.985883    50.647931     0.773938
min       0.000000     5.000000     5.000000     3.504752
25%      21.000000    28.000000    20.000000     5.971693
50%      37.000000    51.000000    32.000000     6.425045
75%      84.250000    68.000000    49.000000     6.923643
max     140.000000   145.000000   205.000000     9.935091
Feature Importance:
   Feature  Importance
2       K    0.316785
1       P    0.265932
0       N    0.211928
3      ph    0.205355

Evaluating models with Feature Importance selected features:
Best parameters for knn: {'knn__n_neighbors': 5}
Best score for knn: 0.6392045454545455
Best parameters for decision_tree: {'decision_tree__max_depth': 10}
Best score for decision_tree: 0.6636363636363637
Best parameters f