## Feature Engineering

This notebook prepares the cleaned crop yield dataset for machine learning models.
It focuses on handling missing values, encoding categorical variables, scaling numeric
features, and constructing the final feature matrix and target variable.


In [36]:
import os

os.getcwd()


'c:\\Users\\pc-msi\\Desktop\\CropYieldML-Review-1\\notebooks'

In [37]:
os.listdir()


['01_exploration.ipynb',
 '02_feature_engineering.ipynb',
 '03_baseline_models.ipynb',
 '04_model_comparison.ipynb']

In [38]:
os.listdir("../data")


['processed', 'raw', 'README.md']

## Load Cleaned Dataset

In [39]:
import pandas as pd

df = pd.read_csv("../processed/df_merged_clean.csv")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56717 entries, 0 to 56716
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   domain_code                    56717 non-null  int64  
 1   domain                         56717 non-null  object 
 2   area_code                      56717 non-null  int64  
 3   area                           56717 non-null  object 
 4   element_code                   56717 non-null  int64  
 5   element                        56717 non-null  object 
 6   item_code                      56717 non-null  int64  
 7   item                           56717 non-null  object 
 8   year_code                      56717 non-null  int64  
 9   year                           56717 non-null  int64  
 10  unit                           56717 non-null  object 
 11  yield_value                    56717 non-null  int64  
 12  average_rain_fall_mm_per_year  25385 non-null 

## Define Target and Feature Matrix

In [40]:
# Target
y = df["yield_value"]

# Features
X = df.drop(columns=[
    "yield_value",      # target
    "pesticide_value",  # all missing
    "domain_code",      # IDs/codes
    "domain",           # optional, can drop
    "area_code",        # ID
    "element_code",     # ID
    "element",          # optional
    "item_code",        # ID
    "year_code",        # ID
    "unit_code",        # ID
    "unit",             # optional
    "domain_code_code"  # redundant
])
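An alternative to listing every column to drop is to name the columns to keep, so that a renamed or missing column fails loudly. The sketch below uses a one-row toy frame with made-up values; `feature_cols`, `toy`, `X_toy`, and `y_toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the merged dataset (values are invented for illustration).
toy = pd.DataFrame({
    "area": ["A"],
    "area_code": [3],
    "item": ["Maize"],
    "year": [1990],
    "yield_value": [36613],
    "average_rain_fall_mm_per_year": [1485.0],
    "avg_temp": [16.4],
})

# Name the features explicitly instead of dropping everything else.
feature_cols = ["area", "item", "year", "average_rain_fall_mm_per_year", "avg_temp"]

# Fail early if an expected column is absent from the frame.
missing = [c for c in feature_cols if c not in toy.columns]
if missing:
    raise KeyError(f"Missing expected feature columns: {missing}")

X_toy = toy[feature_cols].copy()  # copy so later assignments do not touch `toy`
y_toy = toy["yield_value"]
print(list(X_toy.columns))
```

Selecting by name gives the same five feature columns as the drop-based cell, assuming the column names match those shown in the `X.columns` output.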


## Check X Columns

In [41]:
X.columns


Index(['area', 'item', 'year', 'average_rain_fall_mm_per_year', 'avg_temp'], dtype='object')

## Handle Missing Values

In [42]:
from sklearn.impute import SimpleImputer

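num_cols = ['average_rain_fall_mm_per_year', 'avg_temp']

# Replace missing values with each column's median.
# Note: the imputer is fit on the full dataset here; if a hold-out set is used
# later, fit it on the training rows only to avoid leaking test statistics.
imputer = SimpleImputer(strategy='median')
X[num_cols] = imputer.fit_transform(X[num_cols])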

# Verify
X[num_cols].isna().sum()


average_rain_fall_mm_per_year    0
avg_temp                         0
dtype: int64
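The `df.info()` output above reports 25385 non-null rainfall values out of 56717 rows, so roughly 55% of that column is filled with a single median. One option, shown here only as a sketch on made-up toy values, is to have `SimpleImputer` also emit 0/1 missing-value indicator columns via `add_indicator=True`, so a model can distinguish imputed rows from observed ones. The names `toy`, `imputer_with_flags`, and `imputed` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the two numeric columns (values are invented).
toy = pd.DataFrame({
    "average_rain_fall_mm_per_year": [1485.0, np.nan, 1010.0, np.nan],
    "avg_temp": [16.4, 15.9, np.nan, 17.2],
})

# add_indicator=True appends one 0/1 column per feature that contained
# missing values at fit time.
imputer_with_flags = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer_with_flags.fit_transform(toy)

print(imputed.shape)   # two imputed columns followed by two indicator columns
print(imputed[:, 0])   # rainfall with NaNs replaced by the column median
```

Whether the extra indicator columns help depends on the downstream model; this notebook's pipeline does not use them.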

## Encode Categorical Features

In [43]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

cat_cols = ['area', 'item']
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

X_cat_encoded = encoder.fit_transform(X[cat_cols])

# Check shape
X_cat_encoded.shape


(56717, 222)

In [44]:
## Combine Numeric and Categorical Features

X_num = X.drop(columns=cat_cols).values
X_final = np.hstack([X_num, X_cat_encoded])

print("Final feature matrix shape:", X_final.shape)


Final feature matrix shape: (56717, 225)


## Feature Scaling 

In [45]:
## StandardScaler for numeric columns

from sklearn.preprocessing import StandardScaler

# Numeric columns: first 3 columns (year + rainfall + temp)
scaler = StandardScaler()
X_final[:, :3] = scaler.fit_transform(X_final[:, :3])

# Verify
print("First 5 rows after scaling:")
print(X_final[:5, :5])


First 5 rows after scaling:
[[-1.77707006 -0.08679222 -1.297768    1.          0.        ]
 [-1.71508552 -0.08679222 -1.32403801  1.          0.        ]
 [-1.65310099 -0.08679222 -1.14014796  1.          0.        ]
 [-1.59111645 -0.08679222 -1.39880649  1.          0.        ]
 [-1.52913191 -0.08679222 -1.3644534   1.          0.        ]]
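The imputation, encoding, and scaling steps above are each fit on the full dataset. If the modeling notebooks evaluate on a hold-out set, one way to keep test-set statistics out of preprocessing is to wrap the same steps in a `ColumnTransformer` and fit it on the training split only. This is a sketch on invented toy data, not the notebook's actual pipeline; `preprocess`, `numeric_cols`, `categorical_cols`, and the split settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same five feature columns (values are invented).
toy = pd.DataFrame({
    "area": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "item": ["Maize", "Wheat", "Maize", "Wheat", "Maize", "Wheat", "Maize", "Maize"],
    "year": [1990, 1991, 1990, 1991, 1990, 1991, 1992, 1992],
    "average_rain_fall_mm_per_year": [1485.0, np.nan, 1010.0, 1010.0,
                                      np.nan, 800.0, 1485.0, np.nan],
    "avg_temp": [16.4, 15.9, np.nan, 17.2, 20.1, 19.8, 16.0, 17.5],
})

numeric_cols = ["year", "average_rain_fall_mm_per_year", "avg_temp"]
categorical_cols = ["area", "item"]

preprocess = ColumnTransformer(
    [
        # Numeric columns: median imputation, then standardization.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical columns: one-hot encoding, ignoring unseen categories.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_train, X_test = train_test_split(toy, test_size=0.25, random_state=0)

# Statistics (medians, means, categories) come from the training rows only.
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)

print(X_train_ready.shape, X_test_ready.shape)
```

The numeric columns come first in the output, matching the column order used in the scaling cell above.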


In [46]:
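## Target Transformation

# log1p compresses large yield values; np (numpy) is already imported above.
# Predictions made on this scale can be mapped back with np.expm1.
y_log = np.log1p(y.values)  # log(1 + y)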
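A model trained on `y_log` predicts on the log scale, so its outputs need to be mapped back before they are compared with raw yields. A minimal sketch of the round trip, using made-up yield values and hypothetical names `y_toy`, `y_toy_log`, and `y_back`:

```python
import numpy as np

# Invented yield values; zero is included to show that log1p handles it.
y_toy = np.array([0.0, 36613.0, 66667.0, 23333.0])

y_toy_log = np.log1p(y_toy)   # log(1 + y), defined at y = 0
y_back = np.expm1(y_toy_log)  # exp(x) - 1, the inverse of log1p

print(np.allclose(y_back, y_toy))  # True
```

Error metrics computed on the log scale are not directly comparable with metrics on the original scale, which is presumably why both `y.npy` and `y_log.npy` are saved below.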


## Save Final Features for Modeling

In [47]:
import joblib
import numpy as np
import os

os.makedirs("../processed", exist_ok=True)

np.save("../processed/X_final.npy", X_final)
np.save("../processed/y.npy", y.values)
np.save("../processed/y_log.npy", y_log)

joblib.dump(scaler, "../processed/scaler.pkl")
joblib.dump(encoder, "../processed/encoder.pkl")

print("Feature engineering complete. Files saved in ../processed/")


Feature engineering complete. Files saved in ../processed/
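A downstream notebook can read these artifacts back with `np.load` and `joblib.load`. The sketch below writes to a temporary directory rather than `../processed/` so that it runs on its own; `X_demo`, `demo_scaler`, and the two-column toy matrix are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as out_dir:
    # Toy numeric matrix (invented values) and a scaler fit on it.
    X_demo = np.array([[1990.0, 1485.0], [1991.0, 1010.0], [1992.0, 800.0]])
    demo_scaler = StandardScaler().fit(X_demo)

    # Save the scaled matrix and the fitted scaler, as the cell above does.
    np.save(os.path.join(out_dir, "X_final.npy"), demo_scaler.transform(X_demo))
    joblib.dump(demo_scaler, os.path.join(out_dir, "scaler.pkl"))

    # Reload both artifacts in the way a modeling notebook might.
    X_loaded = np.load(os.path.join(out_dir, "X_final.npy"))
    scaler_loaded = joblib.load(os.path.join(out_dir, "scaler.pkl"))

    # The reloaded scaler can undo the scaling to recover original units.
    X_restored = scaler_loaded.inverse_transform(X_loaded)

print(np.allclose(X_restored, X_demo))  # True
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so recording the scikit-learn version alongside `scaler.pkl` and `encoder.pkl` may be worthwhile.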
