## Supermarket Customer Satisfaction Prediction

Given *data about purchases made at three supermarkets*, let's try to predict the **satisfaction level** (the `Rating` column) that a customer gives to a purchase.

We will use a variety of regression models to make our predictions.

Data source: https://www.kaggle.com/datasets/faresashraf1001/supermarket-sales

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.simplefilter('ignore')

In [2]:
data = pd.read_csv('SuperMarket Analysis.csv')
data

,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Sales,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,Alex,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,1:08:00 PM,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,Giza,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29:00 AM,Cash,76.40,4.761905,3.8200,9.6
2,631-41-3108,Alex,Yangon,Normal,Female,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,1:23:00 PM,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,Alex,Yangon,Member,Female,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,8:33:00 PM,Ewallet,465.76,4.761905,23.2880,8.4
4,373-73-7910,Alex,Yangon,Member,Female,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37:00 AM,Ewallet,604.17,4.761905,30.2085,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,Giza,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,1:46:00 PM,Ewallet,40.35,4.761905,2.0175,6.2
996,303-96-2227,Cairo,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,5:16:00 PM,Ewallet,973.80,4.761905,48.6900,4.4
997,727-02-1313,Alex,Yangon,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,1:22:00 PM,Cash,31.84,4.761905,1.5920,7.7
998,347-56-2442,Alex,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,3:33:00 PM,Cash,65.82,4.761905,3.2910,4.1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Invoice ID               1000 non-null   object 
 1   Branch                   1000 non-null   object 
 2   City                     1000 non-null   object 
 3   Customer type            1000 non-null   object 
 4   Gender                   1000 non-null   object 
 5   Product line             1000 non-null   object 
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64  
 8   Tax 5%                   1000 non-null   float64
 9   Sales                    1000 non-null   float64
 10  Date                     1000 non-null   object 
 11  Time                     1000 non-null   object 
 12  Payment                  1000 non-null   object 
 13  cogs                     1000 non-null   float64
 14  gross margin percentage  

### Preprocessing

In [12]:
df = data.copy()

In [13]:
# Drop ID column
df = df.drop('Invoice ID', axis=1)

In [14]:
# Split df into X and y
y = df['Rating']
X = df.drop('Rating', axis=1)

In [15]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=7)

In [16]:
X_train

,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Sales,Date,Time,Payment,cogs,gross margin percentage,gross income
822,Giza,Naypyitaw,Member,Male,Sports and travel,10.17,1,0.5085,10.6785,2/7/2019,2:15:00 PM,Cash,10.17,4.761905,0.5085
188,Alex,Yangon,Normal,Male,Home and lifestyle,74.07,1,3.7035,77.7735,2/10/2019,12:50:00 PM,Ewallet,74.07,4.761905,3.7035
251,Giza,Naypyitaw,Member,Male,Fashion accessories,35.19,10,17.5950,369.4950,3/17/2019,7:06:00 PM,Credit card,351.90,4.761905,17.5950
71,Giza,Naypyitaw,Member,Female,Fashion accessories,62.12,10,31.0600,652.2600,2/11/2019,4:19:00 PM,Cash,621.20,4.761905,31.0600
664,Giza,Naypyitaw,Normal,Female,Sports and travel,98.80,2,9.8800,207.4800,2/21/2019,11:39:00 AM,Cash,197.60,4.761905,9.8800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
579,Cairo,Mandalay,Normal,Male,Health and beauty,69.51,2,6.9510,145.9710,3/1/2019,12:15:00 PM,Ewallet,139.02,4.761905,6.9510
502,Giza,Naypyitaw,Normal,Male,Home and lifestyle,69.40,2,6.9400,145.7400,1/27/2019,7:48:00 PM,Ewallet,138.80,4.761905,6.9400
537,Alex,Yangon,Normal,Male,Home and lifestyle,97.94,1,4.8970,102.8370,3/7/2019,11:44:00 AM,Ewallet,97.94,4.761905,4.8970
196,Giza,Naypyitaw,Member,Male,Health and beauty,43.70,2,4.3700,91.7700,3/26/2019,6:03:00 PM,Cash,87.40,4.761905,4.3700


#### Constructing Pipeline

In [17]:
{column: len(X_train[column].unique()) for column in X_train.select_dtypes('object').columns}

{'Branch': 3,
 'City': 3,
 'Customer type': 2,
 'Gender': 2,
 'Product line': 6,
 'Date': 89,
 'Time': 426,
 'Payment': 3}

In [18]:
{column: X_train[column].unique() for column in X_train.select_dtypes('object').columns}

{'Branch': array(['Giza', 'Alex', 'Cairo'], dtype=object),
 'City': array(['Naypyitaw', 'Yangon', 'Mandalay'], dtype=object),
 'Customer type': array(['Member', 'Normal'], dtype=object),
 'Gender': array(['Male', 'Female'], dtype=object),
 'Product line': array(['Sports and travel', 'Home and lifestyle', 'Fashion accessories',
        'Health and beauty', 'Electronic accessories',
        'Food and beverages'], dtype=object),
 'Date': array(['2/7/2019', '2/10/2019', '3/17/2019', '2/11/2019', '2/21/2019',
        '1/12/2019', '3/19/2019', '1/25/2019', '2/15/2019', '1/27/2019',
        '2/18/2019', '1/14/2019', '1/30/2019', '1/24/2019', '1/2/2019',
        '1/31/2019', '1/28/2019', '2/23/2019', '2/14/2019', '1/16/2019',
        '2/2/2019', '2/12/2019', '1/11/2019', '2/22/2019', '2/25/2019',
        '3/16/2019', '3/15/2019', '1/15/2019', '3/29/2019', '2/4/2019',
        '3/2/2019', '3/30/2019', '2/9/2019', '1/6/2019', '1/7/2019',
        '1/29/2019', '3/12/2019', '3/8/2019', '3/28/2019', 

In [19]:
# Categorizing our features
binary_features = [
    'Customer type',
    'Gender'
]

date_features = [
    'Date'
]

time_features = [
    'Time'
]

nominal_features = [
    'Branch',
    'City',
    'Product line',
    'Payment'
]
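The four lists above cover only the text-valued columns. A quick check, sketched here on a tiny hypothetical frame that borrows a few of the dataset's column names (the `demo_` names are illustrative, not part of the notebook), shows how to list any columns that were not assigned to a feature group:

```python
import pandas as pd

# Hypothetical miniature of X_train using a subset of the real column names
X_demo = pd.DataFrame({
    'Branch': ['Alex', 'Giza'],
    'Customer type': ['Member', 'Normal'],
    'Gender': ['Female', 'Male'],
    'Date': ['1/5/2019', '3/8/2019'],
    'Time': ['1:08:00 PM', '10:29:00 AM'],
    'Unit price': [74.69, 15.28],
    'Quantity': [7, 5],
})

# Illustrative feature groups mirroring the lists defined in the notebook
demo_binary = ['Customer type', 'Gender']
demo_date = ['Date']
demo_time = ['Time']
demo_nominal = ['Branch']

# Any column that appears in no group is left unassigned
assigned = set(demo_binary + demo_date + demo_time + demo_nominal)
unassigned = [column for column in X_demo.columns if column not in assigned]
print(unassigned)
```

Running the same check against the real `X_train` would list the numeric columns, which matter later because a `ColumnTransformer` drops unlisted columns by default.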

In [29]:
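# Create custom transformers for date and time features
from sklearn.base import BaseEstimator, TransformerMixin


class DateEncoder(BaseEstimator, TransformerMixin):
    """Split each date column into year, month, and day columns."""

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for column in list(X.columns):
            parsed = pd.to_datetime(X[column])
            X[column + '_year'] = parsed.dt.year
            X[column + '_month'] = parsed.dt.month
            X[column + '_day'] = parsed.dt.day
            X = X.drop(column, axis=1)
        return X


class TimeEncoder(BaseEstimator, TransformerMixin):
    """Split each time column into hour and minute columns."""

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for column in list(X.columns):
            parsed = pd.to_datetime(X[column])
            X[column + '_hour'] = parsed.dt.hour
            X[column + '_minute'] = parsed.dt.minute
            X = X.drop(column, axis=1)
        return X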
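The `Time` values shown earlier look like `1:08:00 PM`. As a minimal sketch, assuming every value follows that 12-hour pattern (an assumption not verified against the full dataset), passing an explicit `format` to `pd.to_datetime` removes any guesswork in parsing before the hour and minute are extracted:

```python
import pandas as pd

# Two time strings copied from the data preview above
times = pd.Series(['1:08:00 PM', '10:29:00 AM'])

# %I = 12-hour clock, %p = AM/PM marker
parsed = pd.to_datetime(times, format='%I:%M:%S %p')

hours = parsed.dt.hour.tolist()
minutes = parsed.dt.minute.tolist()
print(hours, minutes)
```

If some rows used a different layout, this call would raise an error instead of silently inferring a format, which may or may not be the behavior you want.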

In [28]:
DateEncoder().transform(X_train[['Date']])

,Date_year,Date_month,Date_day
822,2019,2,7
188,2019,2,10
251,2019,3,17
71,2019,2,11
664,2019,2,21
...,...,...,...
579,2019,3,1
502,2019,1,27
537,2019,3,7
196,2019,3,26


In [30]:
# Construct transformer pipelines for each feature type
binary_transformer = Pipeline(steps = [
    ('Ordinal', OrdinalEncoder())
])

date_transformer = Pipeline(steps=[
    ('date', DateEncoder())
])

time_transformer = Pipeline(steps=[
    ('time', TimeEncoder())
])

nominal_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder())
])

In [31]:
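# Combine transformers with ColumnTransformer.
# Note: remainder defaults to 'drop', so the numeric columns that are not listed
# in any feature group (e.g. 'Unit price', 'Quantity', 'Sales') are excluded
# from the model inputs.
preprocessor = ColumnTransformer(transformers=[
    ('binary', binary_transformer, binary_features),
    ('date', date_transformer, date_features),
    ('time', time_transformer, time_features),
    ('nominal', nominal_transformer, nominal_features)
])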
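A `ColumnTransformer` only forwards the columns named in its `transformers` list unless `remainder='passthrough'` is set. The sketch below uses a small hypothetical frame (not the notebook's data) to show the difference in output width. Whether keeping the numeric columns would improve the scores reported in this notebook is untested here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with one nominal and two numeric columns
X_demo = pd.DataFrame({
    'Payment': ['Cash', 'Ewallet', 'Cash'],
    'Unit price': [10.17, 74.07, 35.19],
    'Quantity': [1, 1, 10],
})

# Default behavior: unlisted columns are dropped
dropping = ColumnTransformer(transformers=[
    ('nominal', OneHotEncoder(), ['Payment'])
])

# remainder='passthrough': unlisted columns are appended unchanged
passing = ColumnTransformer(transformers=[
    ('nominal', OneHotEncoder(), ['Payment'])
], remainder='passthrough')

shape_dropped = dropping.fit_transform(X_demo).shape
shape_passed = passing.fit_transform(X_demo).shape
print(shape_dropped, shape_passed)
```

The first shape counts only the two one-hot columns for `Payment`; the second also includes `Unit price` and `Quantity`.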

### Training

In [32]:
# Define models
models = {
    "                     Linear Regression": LinearRegression(),
    " Linear Regression (L2 Regularization)": Ridge(),
    " Linear Regression (L1 Regularization)": Lasso(),
    "                   K-Nearest Neighbors": KNeighborsRegressor(),
    "                        Neural Network": MLPRegressor(),
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "   Support Vector Machine (RBF Kernel)": SVR(),
    "                         Decision Tree": DecisionTreeRegressor(),
    "                         Random Forest": RandomForestRegressor(),
    "                     Gradient Boosting": GradientBoostingRegressor(),
    "                               XGBoost": XGBRegressor(),
    "                              LightGBM": LGBMRegressor(),
    "                              CatBoost": CatBoostRegressor(verbose=0)
}

In [33]:
# Make a scaler (shared by every pipeline below; it is refit on each pipeline.fit call)
scaler = StandardScaler()

for name, model in models.items():
    # Construct the final pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('scaler', scaler),
        ('regressor', model)
    ])
    # Fit the pipeline
    pipeline.fit(X_train, y_train)
    print(name + " trained.")

                     Linear Regression trained.
 Linear Regression (L2 Regularization) trained.
 Linear Regression (L1 Regularization) trained.
                   K-Nearest Neighbors trained.
                        Neural Network trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                         Decision Tree trained.
                         Random Forest trained.
                     Gradient Boosting trained.
                               XGBoost trained.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000104 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 160
[LightGBM] [Info] Number of data points in the train set: 700, number of used features: 21
[LightGBM] [Info] Start training from score 6.985429
                              LightGBM trained.
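The loop above builds a new `Pipeline` on every iteration but does not keep it. One alternative, sketched here on synthetic data with two stand-in models (the `demo_` names and the data are assumptions for illustration), is to store each fitted pipeline in a dictionary so it can be scored later without being rebuilt:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real training set
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

demo_models = {'Linear Regression': LinearRegression(), 'Ridge': Ridge()}

fitted_pipelines = {}
for name, model in demo_models.items():
    # Fresh scaler and cloned model, so no fitted state is shared between pipelines
    pipeline = Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('regressor', clone(model)),
    ])
    fitted_pipelines[name] = pipeline.fit(X_demo, y_demo)

# Later: score each stored pipeline directly
demo_scores = {name: p.score(X_demo, y_demo) for name, p in fitted_pipelines.items()}
print(demo_scores)
```

Keeping one pipeline object per model makes the link between training and evaluation explicit.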
                     

### Results

In [34]:
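# Rebuild each pipeline from the already-fitted preprocessor, scaler, and model
# (all were fit in place during training), then score it on the test set
for name, model in models.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('scaler', scaler),
        ('regressor', model)
    ])
    print(name + " R^2 Score: {:.5f}".format(pipeline.score(X_test, y_test)))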

                     Linear Regression R^2 Score: -0.01253
 Linear Regression (L2 Regularization) R^2 Score: -0.01250
 Linear Regression (L1 Regularization) R^2 Score: -0.00063
                   K-Nearest Neighbors R^2 Score: -0.14468
                        Neural Network R^2 Score: -0.06546
Support Vector Machine (Linear Kernel) R^2 Score: -0.03148
   Support Vector Machine (RBF Kernel) R^2 Score: -0.06555
                         Decision Tree R^2 Score: -1.11951
                         Random Forest R^2 Score: -0.00739
                     Gradient Boosting R^2 Score: -0.05248
                               XGBoost R^2 Score: -0.25997
                              LightGBM R^2 Score: -0.13633
                              CatBoost R^2 Score: -0.12970
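Every score printed above is negative. R^2 compares a model's squared error with the error of always predicting the mean of the test targets, so a negative value means the model did worse than that constant baseline on the test set. A minimal sketch with made-up ratings and predictions (hypothetical values, not taken from any model above):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true ratings and poor predictions
y_true = np.array([9.1, 9.6, 7.4, 8.4, 5.3])
y_pred = np.array([6.0, 7.5, 8.0, 5.5, 7.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # model's squared error
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean baseline's squared error
r2_manual = 1 - ss_res / ss_tot

# Predicting the mean everywhere gives an R^2 of zero
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print(r2_manual, r2_score(y_true, y_pred), r2_baseline)
```

Here the predictions have a larger squared error than the mean baseline, so the manual value and `r2_score` agree on a negative number.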
