# Steps

## 1. [Import the Libraries](#1.-Import-the-libraries)
## 2. [Load the data](#2.-Load-the-Data)
## 3. [Preprocessing](#3.-Preprocessing-the-Data)
- 3.1 [OneHotEncoding](#3.1-Data-tranformation)
- 3.2 [Standard Scaling](#3.2-Standard-Scaling)
## 4. [Data preparation](#4.-Data-Preparation)
## 5. [Model Evaluator](#5.-Mean-Average-Precision)
## 6. [Model training](#6.-Model-Training)
- 6.1 [MAP@3 Score](#6.1-Model-Score)
## 7. [Feature Engineering](#7.-Adding-New-Features)
- 7.1 [Numeric Features](#7.1-Numerical-Columns)
## [Submission](#Submit-the-test-data-prediction)

## 1. Import the libraries

[🔝 Return to top](#Steps)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder # Used to label Categorical Target Variable
from sklearn.preprocessing import OneHotEncoder # Used for unordered categorical features
from sklearn.preprocessing import OrdinalEncoder # Used for ordered categorical features
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.compose import ColumnTransformer

from xgboost import XGBClassifier, plot_importance
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import average_precision_score

########  hyperparameter Tuning  #################
#import optuna

import warnings
warnings.filterwarnings('ignore')

## 2. Load the Data

[🔝 Return to top](#Steps)

In [2]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.set_index('id', inplace = True)
test_data.set_index('id', inplace = True)
train_copy = train_data.copy()
test_copy = test_data.copy()

## 3. Preprocessing the Data

[🔝 Return to top](#Steps)

In [3]:
numerical_columns = [ col for col in train_data.columns if train_data[col].dtype != 'O']
#['Temparature', 'Humidity', 'Moisture', 'Nitrogen', 'Potassium', 'Phosphorous']
# All Categorical Columns -> ['Soil Type', 'Crop Type', 'Fertilizer Name'] 
categorical_columns = [col for col in train_data.columns if train_data[col].dtype == 'object' and col != 'Fertilizer Name']
#['Soil Type', 'Crop Type']

### 3.1 Data tranformation

[🔝 Return to top](#Steps)

In [4]:
labeler = LabelEncoder()
train_data['Fertilizer Name'] = labeler.fit_transform(train_data['Fertilizer Name'])

In [5]:
cat_encoder = OneHotEncoder() # Unordered categorical columns

preprocessor = ColumnTransformer(
    transformers = [
        ('categorical', OneHotEncoder(sparse_output = False, handle_unknown = 'ignore'), categorical_columns)
],
    remainder = 'passthrough'  # Keep other columns same
)

In [6]:
for col in categorical_columns:
    train_dummy = pd.get_dummies(train_data[col], dtype= int )
    train_data = pd.concat([train_data, train_dummy], axis = 1)
    test_dummy = pd.get_dummies(test_data[col], dtype = int)
    test_data = pd.concat([test_data, test_dummy], axis = 1)

In [7]:
train_data.drop(categorical_columns, axis = 1, inplace = True)
test_data.drop(categorical_columns, axis = 1, inplace = True)

In [8]:
test_data.shape

(250000, 22)
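
Calling `pd.get_dummies` separately on the train and test frames yields matching columns only when both frames contain the same set of categories. The sketch below shows one way the test columns could be forced to match the training columns with `DataFrame.reindex`; the toy frames are hypothetical and this step is not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frames: the test frame is missing two soil types.
toy_train = pd.DataFrame({"Soil Type": ["Sandy", "Clayey", "Loamy"]})
toy_test = pd.DataFrame({"Soil Type": ["Sandy", "Sandy"]})

train_dummies = pd.get_dummies(toy_train["Soil Type"], dtype=int)
test_dummies = pd.get_dummies(toy_test["Soil Type"], dtype=int)

# Give the test frame exactly the training columns, in the same order.
# Categories absent from the test frame become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```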

### 3.2 Standard Scaling

[🔝 Return to top](#Steps)
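
This section is empty in the notebook. As a hedged sketch of what standard scaling with the imported `StandardScaler` might look like, assuming the scaler is fitted on the training rows only: the toy frames and the list `toy_numeric` are hypothetical. Tree-based models such as the ones trained below are generally insensitive to feature scaling, so this step may be optional here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for the numeric part of the data.
toy_train = pd.DataFrame({"Humidity": [50.0, 60.0, 70.0],
                          "Moisture": [30.0, 40.0, 50.0]})
toy_test = pd.DataFrame({"Humidity": [60.0], "Moisture": [50.0]})
toy_numeric = ["Humidity", "Moisture"]

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only ...
toy_train[toy_numeric] = scaler.fit_transform(toy_train[toy_numeric])
# ... and apply the same statistics to the test rows.
toy_test[toy_numeric] = scaler.transform(toy_test[toy_numeric])
```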

## 4. Data Preparation

[🔝 Return to top](#Steps)

In [9]:
X = train_data.drop(['Fertilizer Name'], axis = 1)
y = train_data['Fertilizer Name']

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

## 5. Mean Average Precision

[🔝 Return to top](#Steps)

In [10]:
############################### MAP@3 ####################################
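def mapk(y_test, y_pred_proba, k=3):
    # Indices of the k most probable classes per row, most probable first
    topk_indices = np.argsort(y_pred_proba, axis=1)[:, ::-1][:, :k]
    sum_ap = 0

    for indx, lst in enumerate(topk_indices):
        true_label = y_test.iloc[indx]

        try:
            # Reciprocal rank of the true label within the top k
            ap = 1 / (list(lst).index(true_label) + 1)
        except ValueError:
            # True label is not among the top k predictions
            ap = 0

        sum_ap += ap

    return sum_ap / len(y_test)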
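
A small worked example may make the metric easier to check by hand. The sketch below re-implements the same idea under the hypothetical name `mapk_sketch` so that it runs on its own; it assumes labels are encoded as `0..n_classes-1`, as `LabelEncoder` produces, and the probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd

def mapk_sketch(y_true, y_pred_proba, k=3):
    """MAP@k for single-label targets encoded as 0..n_classes-1."""
    # Indices of the k most probable classes per row, most probable first.
    topk = np.argsort(y_pred_proba, axis=1)[:, ::-1][:, :k]
    total = 0.0
    for row, true_label in zip(topk, y_true):
        hits = np.where(row == true_label)[0]
        if hits.size:
            # Reciprocal of the 1-based rank of the true label.
            total += 1.0 / (hits[0] + 1)
    return total / len(y_true)

y_true = pd.Series([0, 1, 2, 3])
proba = np.array([
    [0.70, 0.10, 0.10, 0.10],  # true class 0 ranked 1st -> 1
    [0.50, 0.30, 0.15, 0.05],  # true class 1 ranked 2nd -> 1/2
    [0.40, 0.30, 0.20, 0.10],  # true class 2 ranked 3rd -> 1/3
    [0.40, 0.30, 0.20, 0.10],  # true class 3 ranked 4th -> 0
])

score = mapk_sketch(y_true, proba)  # (1 + 1/2 + 1/3 + 0) / 4
```

Each row contributes the reciprocal rank of its true label if that label is among the top three predictions and zero otherwise, which is why a score near 0.33 can arise from many different mixes of first, second, and third place hits.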

## 6. Model Training 

[🔝 Return to top](#Steps)

In [11]:
models = [XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    use_label_encoder=False,
    eval_metric='mlogloss'
),
          CatBoostClassifier(verbose = 0),
          LGBMClassifier(verbose = -1)
         ]

In [12]:
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)

In [13]:
for model in models:
    print(f"Evaluating: {model.__class__.__name__}")
    scores = []

    for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
        X_train, X_test, y_train, y_test = X.iloc[train_index], X.iloc[test_index], y.iloc[train_index], y.iloc[test_index]

        model.fit(X_train, y_train)
        y_pred_proba = model.predict_proba(X_test)
        score = mapk(y_test, y_pred_proba)
        scores.append(score)
        print('-'*40)
        print(f"  Fold {fold+1} ---> MAP@3 Score = {score:.4f}")

    print(f"Average MAP@3 score for {model.__class__.__name__} = {np.mean(scores):.4f}\n")

Evaluating: XGBClassifier
----------------------------------------
  Fold 1 ---> MAP@3 Score = 0.3229
----------------------------------------
  Fold 2 ---> MAP@3 Score = 0.3234
----------------------------------------
  Fold 3 ---> MAP@3 Score = 0.3236
----------------------------------------
  Fold 4 ---> MAP@3 Score = 0.3242
----------------------------------------
  Fold 5 ---> MAP@3 Score = 0.3239
Average MAP@3 score for XGBClassifier = 0.3236

Evaluating: CatBoostClassifier
----------------------------------------
  Fold 1 ---> MAP@3 Score = 0.3277
----------------------------------------
  Fold 2 ---> MAP@3 Score = 0.3274
----------------------------------------
  Fold 3 ---> MAP@3 Score = 0.3278
----------------------------------------
  Fold 4 ---> MAP@3 Score = 0.3282
----------------------------------------
  Fold 5 ---> MAP@3 Score = 0.3284
Average MAP@3 score for CatBoostClassifier = 0.3279

Evaluating: LGBMClassifier
----------------------------------------
  Fold 1 ---> 

##### output

## 7. Adding New Features

[🔝 Return to top](#Steps)

### 7.1 Numerical Columns

## Submit the test data prediction

[🔝 Return to top](#Steps)

In [18]:
model = CatBoostClassifier(verbose = 0)

model.fit(X,y)

probs = model.predict_proba(test_data)
top3_indices = np.argsort(probs , axis = 1)[ : , ::-1][:,:3]
top3_labels = labeler.inverse_transform(top3_indices.ravel()).reshape(top3_indices.shape)

pred_top3 = [' '.join(row) for row in top3_labels]
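
The decoding step above can be traced on a toy input. This sketch assumes, as in the notebook, that the columns of `predict_proba` follow the integer codes assigned by `LabelEncoder`; the names `toy_labeler` and `toy_probs` and the probabilities are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

toy_labeler = LabelEncoder()
toy_labeler.fit(["DAP", "20-20", "28-28", "17-17-17"])
# classes_ are sorted: ['17-17-17', '20-20', '28-28', 'DAP'] -> codes 0..3

toy_probs = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.50, 0.30, 0.15, 0.05],
])

# Column indices of the three most probable classes per row, best first.
top3 = np.argsort(toy_probs, axis=1)[:, ::-1][:, :3]

# inverse_transform expects a 1-D array, so flatten and restore the shape.
top3_names = toy_labeler.inverse_transform(top3.ravel()).reshape(top3.shape)

# One space-separated string per row, as the submission format requires.
toy_pred = [" ".join(row) for row in top3_names]
```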

In [19]:
submission = pd.DataFrame({
    "id" : test_data.index,
    "Fertilizer Name" : pred_top3
})

submission.to_csv("submission.csv", index = False) 
submission.head()

|   | id | Fertilizer Name |
|---|---|---|
| 0 | 750000 | 10-26-26 DAP 28-28 |
| 1 | 750001 | 17-17-17 20-20 10-26-26 |
| 2 | 750002 | 20-20 14-35-14 10-26-26 |
| 3 | 750003 | 14-35-14 17-17-17 DAP |
| 4 | 750004 | 20-20 10-26-26 17-17-17 |
