PHASE 3: 
1. Clean the dataset. 
2. Ensure no data leakage.
3. Prepare two ML-ready datasets: one for Crop Recommendation (classification) and one for Yield Prediction (regression).
4. Create proper train/test splits.
5. Save the processed data (for the report and for reproducibility).

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer


In [2]:
df = pd.read_csv("../data/raw/crop_yield.csv")
df.head()


Unnamed: 0,Crop,Crop_Year,Season,State,Area,Production,Annual_Rainfall,Fertilizer,Pesticide,Yield
0,Arecanut,1997,Whole Year,Assam,73814.0,56708,2051.4,7024878.38,22882.34,0.796087
1,Arhar/Tur,1997,Kharif,Assam,6637.0,4685,2051.4,631643.29,2057.47,0.710435
2,Castor seed,1997,Kharif,Assam,796.0,22,2051.4,75755.32,246.76,0.238333
3,Coconut,1997,Whole Year,Assam,19656.0,126905000,2051.4,1870661.52,6093.36,5238.051739
4,Cotton(lint),1997,Kharif,Assam,1739.0,794,2051.4,165500.63,539.09,0.420909


Load the raw data directly from disk, so that preprocessing is reproducible and does not depend on the state of the EDA notebook.

In [3]:
df = df.drop(columns=["Production"])


Production is used to compute Yield, so using it as an input would leak the target. We drop it before modelling.
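The drop above raises a `KeyError` if the cell is re-run after Production is already gone. A minimal sketch of a re-runnable version with an explicit leakage guard follows. The toy values and the `LEAKY_COLS` constant are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame using the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "Crop": ["Rice", "Maize", "Rice"],
    "Area": [100.0, 50.0, 80.0],
    "Production": [250.0, 90.0, 210.0],
    "Yield": [2.5, 1.8, 2.6],
})

# Hypothetical constant listing columns that would leak the target.
LEAKY_COLS = ["Production"]

# errors="ignore" makes the drop safe to run more than once.
toy = toy.drop(columns=LEAKY_COLS, errors="ignore")
toy = toy.drop(columns=LEAKY_COLS, errors="ignore")  # second run is a no-op

# Guard: fail loudly if a leaky column is still present.
leftover = [col for col in LEAKY_COLS if col in toy.columns]
assert not leftover, f"Leaky columns still present: {leftover}"
print(list(toy.columns))
```

Keeping the leaky columns in one list makes it easy to extend the guard if other derived columns turn up later.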

In [4]:
y_crop = df["Crop"]


In [5]:
X_crop = df.drop(columns=["Crop", "Yield"])


CLASSIFICATION - Crop Recommendation.
1. Target: Crop
2. Features: all remaining columns (Crop and Yield are dropped from the inputs).

In [6]:
y_yield = df["Yield"]
X_yield = df.drop(columns=["Yield"])


YIELD PREDICTION - REGRESSION
1. Target: Yield
2. Crop is kept as a feature (an important predictor of yield).
3. Yield is removed from all inputs.

In [7]:
X_crop.shape, y_crop.shape
X_yield.shape, y_yield.shape


((19689, 8), (19689,))

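Sanity check - confirm that features and target have the same number of rows. A notebook cell displays only its last expression, so the output above is the Yield Prediction pair: 19,689 rows and 8 feature columns.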
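One way to see both pairs at once is to print them explicitly and assert the properties being checked. The sketch below uses a small made-up frame with the notebook's variable names, so the shapes it prints are not those of the real dataset.

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Crop": ["Rice", "Maize", "Rice", "Wheat"],
    "Season": ["Kharif", "Kharif", "Rabi", "Rabi"],
    "Area": [100.0, 50.0, 80.0, 60.0],
    "Yield": [2.5, 1.8, 2.6, 3.1],
})

y_crop, X_crop = toy["Crop"], toy.drop(columns=["Crop", "Yield"])
y_yield, X_yield = toy["Yield"], toy.drop(columns=["Yield"])

# print() shows every pair, unlike relying on the cell's last expression.
for name, X, y in [("crop", X_crop, y_crop), ("yield", X_yield, y_yield)]:
    print(f"{name}: X={X.shape}, y={y.shape}")
    assert len(X) == len(y), f"{name}: row counts differ"

# The targets must not appear among their own features.
assert "Crop" not in X_crop.columns and "Yield" not in X_crop.columns
assert "Yield" not in X_yield.columns
```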

In [8]:
cat_cols_crop = ["Season", "State"]
num_cols_crop = [col for col in X_crop.columns if col not in cat_cols_crop]


Identify categorical and numerical features for Crop Recommendation. Every column not listed as categorical is treated as numerical.

In [9]:
cat_cols_yield = ["Crop", "Season", "State"]
num_cols_yield = [col for col in X_yield.columns if col not in cat_cols_yield]


Identify categorical and numerical features for Yield Prediction. Crop is now a categorical input rather than the target.
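The column lists above are written by hand. As an alternative sketch, the same lists can be derived from the dtypes with `DataFrame.select_dtypes`, which avoids silently treating a new text column as numerical. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Made-up rows with a subset of the notebook's column names.
X = pd.DataFrame({
    "Crop": ["Rice", "Maize"],
    "Season": ["Kharif", "Rabi"],
    "State": ["Assam", "Kerala"],
    "Crop_Year": [1997, 1998],
    "Area": [100.0, 50.0],
})

# Numeric dtypes become numerical features; everything else is categorical.
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

print("categorical:", cat_cols)
print("numerical:", num_cols)
```

Note that this treats Crop_Year as numerical, which matches what the hand-written lists do.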

In [10]:
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_crop, y_crop,
    test_size=0.2,
    random_state=42,
    stratify=y_crop
)


Crop Recommendation split.
Stratified split: the class proportions of Crop are preserved in both train and test.
Order of operations: split first, then encode, then scale.
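To illustrate what stratification preserves, the sketch below splits a small imbalanced label series and compares class proportions before and after. The labels and counts are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 60% Rice, 30% Maize, 10% Wheat.
y = pd.Series(["Rice"] * 60 + ["Maize"] * 30 + ["Wheat"] * 10, name="Crop")
X = pd.DataFrame({"Area": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, the test set keeps the same class proportions as the full data.
print("full:", y.value_counts(normalize=True).round(2).to_dict())
print("test:", y_test.value_counts(normalize=True).round(2).to_dict())
```

One caveat worth checking on the real data: `train_test_split` raises a `ValueError` when `stratify` is used and some class has fewer than two samples.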

In [11]:
Xy_train, Xy_test, yy_train, yy_test = train_test_split(
    X_yield, y_yield,
    test_size=0.2,
    random_state=42
)


Yield Prediction Split - No stratification needed for regression.

In [12]:
preprocess_crop = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols_crop),
        ("num", StandardScaler(), num_cols_crop)
    ]
)


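Build the preprocessing pipeline for Crop Recommendation: one-hot encode the categorical columns and standardize the numerical ones. With handle_unknown="ignore", a category that was not seen during fitting is encoded as all zeros instead of raising an error.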

In [13]:
preprocess_yield = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols_yield),
        ("num", StandardScaler(), num_cols_yield)
    ]
)


Build the preprocessing pipeline for Yield Prediction. It is the same as above, except that Crop is also one-hot encoded.

In [14]:
Xc_train_enc = preprocess_crop.fit_transform(Xc_train)
Xc_test_enc = preprocess_crop.transform(Xc_test)


Crop Recommendation.
Fit on TRAIN only, then transform both TRAIN and TEST, to avoid data leakage.

In [15]:
Xy_train_enc = preprocess_yield.fit_transform(Xy_train)
Xy_test_enc = preprocess_yield.transform(Xy_test)


Yield Prediction.
Fit on TRAIN only, then transform both TRAIN and TEST, to avoid data leakage.

In [16]:
import joblib

joblib.dump(preprocess_crop, "../data/processed/preprocess_crop.pkl")
joblib.dump(preprocess_yield, "../data/processed/preprocess_yield.pkl")


['../data/processed/preprocess_yield.pkl']

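Save the fitted preprocessors so that the same transformations can be reloaded and reused later. Note that this cell saves the preprocessor objects, not the transformed arrays.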
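A quick round-trip check can confirm that a saved preprocessor behaves identically after reloading. The sketch below uses a `StandardScaler` as a stand-in for the fitted `preprocess_crop` / `preprocess_yield` objects and writes to a temporary directory. The file name is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for a fitted preprocessor; the data is made up.
train = pd.DataFrame({"Area": [10.0, 20.0, 30.0]})
scaler = StandardScaler().fit(train)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocess_demo.pkl"  # hypothetical file name
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The reloaded object should transform new rows exactly like the original.
new_rows = pd.DataFrame({"Area": [20.0, 40.0]})
same = np.allclose(scaler.transform(new_rows), restored.transform(new_rows))
print("round trip OK:", same)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, so recording the scikit-learn version alongside the file helps reproducibility.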