# **IE 423 Task 3**

# **Black Friday Sales**


**Initialize**

The necessary libraries are imported.

In [2]:
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

**Load Data**

In [8]:
from google.colab import drive
drive.mount('/content/drive')
dfsls= pd.read_csv('/content/drive/My Drive/train.csv')
dfsls.head()

# Target variable is the Purchase column as explained in dataset description
df = dfsls.dropna(subset=["Purchase"])
y = df.loc[:,['Purchase']].values.ravel()
X = df.drop(['Purchase'],axis=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8, test_size=0.2,random_state=1)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**BUILD RANDOM FOREST MODEL**

In [22]:
from sklearn.ensemble import RandomForestRegressor

# Function for building and scoring Random Forest models
def get_random_forest_mae(X_train, X_test, y_train, y_test):
    mdlRfs = RandomForestRegressor(random_state=1)
    mdlRfs.fit(X_train, y_train)
    y_tst_prd = mdlRfs.predict(X_test)
    mae = mean_absolute_error(y_test, y_tst_prd)
    return (mae)
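
The function above scores each model by mean absolute error (MAE), the average of the absolute differences between actual and predicted values. A minimal sketch of that calculation, using made-up numbers rather than the Black Friday data, might look like this:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up purchase amounts and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

# MAE is the mean of the absolute differences between truth and prediction
mae_manual = np.mean(np.abs(y_true - y_pred))

# scikit-learn computes the same quantity; the signature is (y_true, y_pred)
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

A lower MAE means the predictions are, on average, closer to the actual purchase amounts, in the same units as the target.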




In [10]:
# Try to build a model using all features
get_random_forest_mae(X_train, X_test, y_train, y_test)

ValueError: could not convert string to float: 'P00304042'

As the error above shows, the random forest model cannot work with string values such as the product IDs.

**Use only the numerical features to train and test the model.**

In [14]:
# Select numeric features
cols_num = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
Xnum = X[cols_num]

# Split numeric features into training and test sets
Xnum_train, Xnum_test, y_train, y_test = train_test_split(Xnum,y,train_size=0.8, test_size=0.2,random_state=1)

#Try to build a model using all numeric features
get_random_forest_mae(Xnum_train, Xnum_test, y_train, y_test)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

With the non-numerical columns excluded, the NaN values are now the problem.

Count the missing values in each column. Only Product_Category_2 and Product_Category_3 have missing values.

In [15]:
Xnum_train.isna().sum()

User_ID                    0
Occupation                 0
Marital_Status             0
Product_Category_1         0
Product_Category_2    138892
Product_Category_3    306504
dtype: int64
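
Raw counts are easier to judge as a share of all rows. A small sketch, assuming a made-up frame `toy` that stands in for `Xnum_train`, shows one way to get the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for Xnum_train (values are illustrative only)
toy = pd.DataFrame({
    "Product_Category_1": [1, 5, 8, 3],
    "Product_Category_2": [2.0, np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, np.nan, np.nan, 14.0],
})

# isna() gives a boolean frame; the column mean of booleans is the missing share
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the real training frame, the same call would show how much of each column an imputation method has to invent.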

These missing values must be handled, and there are several options. We will work through them one by one.

**OPTION 1**
Dropping columns with missing values: This method is wasteful because the whole column is lost, including the values that are present.

In [16]:
# Identify columns with missing values and then drop such columns
cols_num_null = [col for col in Xnum_train.columns
    if Xnum_train[col].isnull().any()]
Xnum_train_drpnull = Xnum_train.drop(cols_num_null, axis=1)
Xnum_test_drpnull = Xnum_test.drop(cols_num_null, axis=1)

In [17]:
print('MAE from Approach 1 (Drop features with missing values):')
print(get_random_forest_mae(Xnum_train_drpnull, Xnum_test_drpnull, y_train, y_test))

MAE from Approach 1 (Drop features with missing values):
2091.2402741391948


Dropping the two incomplete columns discards a lot of data, but the model can now be trained and gives a baseline MAE of about 2091.

**OPTION 2**
**Filling the missing values by imputation.**
Imputation means replacing each missing value with a substitute number.

In [25]:
# Replace missing values by forward fill (alternatives: a constant such as 0, or bfill)
# DataFrame.ffill() replaces the deprecated fillna(method='ffill')
Xnum_train_repnull = Xnum_train.ffill()
Xnum_test_repnull = Xnum_test.ffill()

# Fill any remaining NaNs (e.g., at the beginning of columns) with 0
Xnum_train_repnull = Xnum_train_repnull.fillna(0)  # Fill remaining NaNs with 0
Xnum_test_repnull = Xnum_test_repnull.fillna(0)    # Fill remaining NaNs with 0

# Check if NaNs still exist
print("Remaining NaNs in Xnum_train_repnull:", Xnum_train_repnull.isna().sum().sum())
print("Remaining NaNs in Xnum_test_repnull:", Xnum_test_repnull.isna().sum().sum())

print('MAE from Approach 2 (Replace missing values with forward fill):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

Remaining NaNs in Xnum_train_repnull: 0
Remaining NaNs in Xnum_test_repnull: 0
MAE from Approach 2 (Replace missing values with forward fill):
2272.0611026771116


The MAE with forward fill (about 2272) is higher than in Option 1 (about 2091), so this imputation makes the model worse rather than better.
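
Forward fill copies the value from the previous row, so its result depends on row order. After a random train/test split the row order carries no meaning, which may make the filled values close to arbitrary. A minimal sketch with made-up data (the frame `toy` and its index labels are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up column with two gaps; index labels identify the rows
toy = pd.DataFrame(
    {"Product_Category_2": [2.0, np.nan, 8.0, np.nan]},
    index=["a", "b", "c", "d"],
)

# Original order a, b, c, d: row b copies a (2.0), row d copies c (8.0)
filled_original = toy.ffill()

# Same rows in the order c, b, a, d: row b now copies c, row d copies a
filled_reordered = toy.loc[["c", "b", "a", "d"]].ffill()

print(filled_original.loc["b", "Product_Category_2"])
print(filled_reordered.loc["b", "Product_Category_2"])
```

The same row receives a different imputed value depending only on which row happens to precede it.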

In [26]:
# Replace with mean value
Xnum_train_repnull = Xnum_train.fillna(Xnum_train.mean())
Xnum_test_repnull = Xnum_test.fillna(Xnum_test.mean())

print('MAE from Approach 2 (Replace missing values with mean):')
print(get_random_forest_mae(Xnum_train_repnull, Xnum_test_repnull, y_train, y_test))

MAE from Approach 2 (Replace missing values with mean):
2193.456214200903
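
The cell above fills gaps in the test set with the test set's own column means. A common alternative is to learn the means from the training data only and reuse them for the test data. The following is a sketch of that variant with scikit-learn's `SimpleImputer` on made-up frames (`train` and `test` are illustrative, not the notebook's variables), not the method used for the result above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test frames, for illustration only
train = pd.DataFrame({"Product_Category_2": [2.0, 4.0, np.nan, 6.0]})
test = pd.DataFrame({"Product_Category_2": [np.nan, 16.0]})

# Learn the column mean on the training data only (mean of 2, 4, 6)
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training mean for the test data instead of recomputing it
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print(test_filled)
```

Fitting on the training data keeps the preprocessing identical for both sets and avoids using test-set statistics.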


In [27]:
# let us replace all missing numeric values with the column mean
X_train[cols_num]=Xnum_train_repnull[cols_num]
X_test[cols_num]=Xnum_test_repnull[cols_num]

Next, we add the non-numerical features to the model by converting them to numerical values.

In [28]:
# Select non-numeric features
cols_obj = [col for col in X.columns if X[col].dtype == 'object']
cols_obj

['Product_ID', 'Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [30]:
# Label encoding on all non-numeric features

from sklearn.preprocessing import LabelEncoder

Xle_train = X_train.copy()
Xle_test = X_test.copy()
# Apply label encoder to each column with non-numeric data
label_encoder = LabelEncoder()
for col in cols_obj:
    # Handle unseen values in test data
    Xle_train[col] = label_encoder.fit_transform(X_train[col])
    try:
        Xle_test[col] = label_encoder.transform(X_test[col])
    except ValueError:
        # Handle unseen labels in the test set
        unseen_labels = set(X_test[col]) - set(label_encoder.classes_)
        # Option 1: Replace unseen labels with a placeholder (e.g., 'unknown')
        Xle_test[col] = X_test[col].apply(lambda x: x if x in label_encoder.classes_ else 'unknown')
        Xle_test[col] = label_encoder.transform(Xle_test[col])
        # Option 2: Ignore rows with unseen labels (might lead to data loss)
        # Xle_test = Xle_test[X_test[col].isin(label_encoder.classes_)]

ValueError: invalid literal for int() with base 10: 'unknown'

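Label encoding all non-numeric columns fails. The fallback replaces unseen test values with 'unknown', but the encoder was never fitted on that placeholder, so the transform still raises an error. We therefore label-encode only the low-cardinality categorical features, which leaves out Product_ID.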
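
If a high-cardinality column such as Product_ID should be kept, one possible approach is scikit-learn's `OrdinalEncoder`, which can map categories unseen during fitting to a reserved code. This is a sketch with made-up product IDs, not part of the analysis that follows:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up product IDs; "P999" appears only in the test data
train = pd.DataFrame({"Product_ID": ["P001", "P002", "P003"]})
test = pd.DataFrame({"Product_ID": ["P002", "P999"]})

# Categories not seen during fit are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)

print(test_codes.ravel())
```

The value -1 is an arbitrary choice here; any integer that cannot collide with a fitted category code would work.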

In [31]:
# Select categorical features
cols_cat = [col for col in X.columns if X[col].dtype == 'object' and X[col].nunique()<10]
cols_cat

['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years']

In [32]:
# Label encoding on only categorical features

from sklearn.preprocessing import LabelEncoder

Xle_train = X_train.copy()
Xle_test = X_test.copy()
# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in cols_cat:
    Xle_train[col] = label_encoder.fit_transform(X_train[col])
    Xle_test[col] = label_encoder.transform(X_test[col])

In [33]:
# Encode and Build/Score using all Categorical columns

mae = get_random_forest_mae(Xle_train[cols_num + cols_cat], Xle_test[cols_num + cols_cat], y_train, y_test)
print("MAE from Label Encoding all Categorical columns:")
print(mae)

MAE from Label Encoding all Categorical columns:
2154.210619732392


Let's implement a **Gradient Boosting Model**.


In [34]:
from xgboost import XGBRegressor

#Build and score default Gradient Boosting Model
mdlXgbMlb = XGBRegressor()
mdlXgbMlb.fit(Xle_train[cols_num + cols_cat], y_train)
y_test_pred = mdlXgbMlb.predict(Xle_test[cols_num + cols_cat])
mae = mean_absolute_error(y_test, y_test_pred)

print("MAE from default XGBoost model:")
print(mae)

#Build and score a tuned Gradient Boosting Model
mdlXgbMlb = XGBRegressor(n_estimators=5000, learning_rate=0.01, max_depth=5)
mdlXgbMlb.fit(Xle_train[cols_num + cols_cat], y_train)
y_test_pred = mdlXgbMlb.predict(Xle_test[cols_num + cols_cat])
mae = mean_absolute_error(y_test, y_test_pred)

print("MAE from tuned XGBoost model:")
print(mae)

MAE from default XGBoost model:
2091.6067442607223


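The default XGBoost model reaches an MAE of about 2091.6. This is lower than the random forest with label-encoded categorical features (about 2154.2) and both imputation variants from Option 2, and practically equal to the random forest from Option 1 (about 2091.2). No MAE is shown for the tuned XGBoost model.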