## Supermarket Customer Satisfaction Prediction

Given *data about purchases made at three supermarkets*, let's try to predict the **satisfaction level** of a given customer, which the dataset records in the `Rating` column.

We will train a variety of regression models and compare their R^2 scores on a held-out test set.

Data source: https://www.kaggle.com/datasets/lovishbansal123/sales-of-a-supermarket

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('archive/supermarket_sales.csv')
data

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29,Cash,76.40,4.761905,3.8200,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,20:33,Ewallet,465.76,4.761905,23.2880,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,233-67-5758,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2
996,303-96-2227,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,17:16,Ewallet,973.80,4.761905,48.6900,4.4
997,727-02-1313,A,Yangon,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,13:22,Cash,31.84,4.761905,1.5920,7.7
998,347-56-2442,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,15:33,Cash,65.82,4.761905,3.2910,4.1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Invoice ID               1000 non-null   object 
 1   Branch                   1000 non-null   object 
 2   City                     1000 non-null   object 
 3   Customer type            1000 non-null   object 
 4   Gender                   1000 non-null   object 
 5   Product line             1000 non-null   object 
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64  
 8   Tax 5%                   1000 non-null   float64
 9   Total                    1000 non-null   float64
 10  Date                     1000 non-null   object 
 11  Time                     1000 non-null   object 
 12  Payment                  1000 non-null   object 
 13  cogs                     1000 non-null   float64
 14  gross margin percentage  

### Initial Preprocessing

In [4]:
df = data.copy()

In [5]:
# Drop Invoice ID column
df = df.drop('Invoice ID', axis=1)
df

Unnamed: 0,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.8200,80.2200,3/8/2019,10:29,Cash,76.40,4.761905,3.8200,9.6
2,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,A,Yangon,Member,Male,Health and beauty,58.22,8,23.2880,489.0480,1/27/2019,20:33,Ewallet,465.76,4.761905,23.2880,8.4
4,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,C,Naypyitaw,Normal,Male,Health and beauty,40.35,1,2.0175,42.3675,1/29/2019,13:46,Ewallet,40.35,4.761905,2.0175,6.2
996,B,Mandalay,Normal,Female,Home and lifestyle,97.38,10,48.6900,1022.4900,3/2/2019,17:16,Ewallet,973.80,4.761905,48.6900,4.4
997,A,Yangon,Member,Male,Food and beverages,31.84,1,1.5920,33.4320,2/9/2019,13:22,Cash,31.84,4.761905,1.5920,7.7
998,A,Yangon,Normal,Male,Home and lifestyle,65.82,1,3.2910,69.1110,2/22/2019,15:33,Cash,65.82,4.761905,3.2910,4.1


In [6]:
# Split df into X and y
y = df['Rating'].copy()
X = df.drop('Rating', axis=1)

In [7]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
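
As a quick sanity check on the split parameters, the sketch below applies the same `train_size`, `shuffle`, and `random_state` arguments to a stand-in array of 1000 rows. The names `X_demo` and `y_demo` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the dataset (1000)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Same arguments as the notebook's split
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=1
)

# Repeating the call with the same random_state reproduces the same partition
X_a2, X_b2, y_a2, y_b2 = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=1
)

print(len(X_a), len(X_b))
```

With 1000 rows, `train_size=0.7` yields 700 training rows and 300 test rows, and fixing `random_state` keeps the partition stable across runs.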

In [8]:
X_train

Unnamed: 0,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income
731,A,Yangon,Normal,Male,Health and beauty,56.00,3,8.4000,176.4000,2/28/2019,19:33,Ewallet,168.00,4.761905,8.4000
716,A,Yangon,Member,Female,Fashion accessories,71.46,7,25.0110,525.2310,3/28/2019,16:06,Ewallet,500.22,4.761905,25.0110
640,B,Mandalay,Member,Female,Food and beverages,98.79,3,14.8185,311.1885,2/23/2019,20:00,Ewallet,296.37,4.761905,14.8185
804,B,Mandalay,Member,Female,Electronic accessories,75.59,9,34.0155,714.3255,2/23/2019,11:12,Cash,680.31,4.761905,34.0155
737,C,Naypyitaw,Normal,Male,Electronic accessories,58.76,10,29.3800,616.9800,1/29/2019,14:26,Ewallet,587.60,4.761905,29.3800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
767,B,Mandalay,Normal,Male,Sports and travel,13.69,6,4.1070,86.2470,2/13/2019,13:59,Cash,82.14,4.761905,4.1070
72,B,Mandalay,Member,Female,Food and beverages,48.52,3,7.2780,152.8380,3/5/2019,18:17,Ewallet,145.56,4.761905,7.2780
908,A,Yangon,Member,Female,Food and beverages,79.54,2,7.9540,167.0340,3/27/2019,16:30,Ewallet,159.08,4.761905,7.9540
235,A,Yangon,Normal,Female,Sports and travel,93.14,2,9.3140,195.5940,1/20/2019,18:09,Ewallet,186.28,4.761905,9.3140


In [9]:
{column: X_train[column].nunique() for column in X_train.select_dtypes('object').columns}

{'Branch': 3,
 'City': 3,
 'Customer type': 2,
 'Gender': 2,
 'Product line': 6,
 'Date': 89,
 'Time': 427,
 'Payment': 3}

In [10]:
{column: X_train[column].unique() for column in X_train.select_dtypes('object').columns}

{'Branch': array(['A', 'B', 'C'], dtype=object),
 'City': array(['Yangon', 'Mandalay', 'Naypyitaw'], dtype=object),
 'Customer type': array(['Normal', 'Member'], dtype=object),
 'Gender': array(['Male', 'Female'], dtype=object),
 'Product line': array(['Health and beauty', 'Fashion accessories', 'Food and beverages',
        'Electronic accessories', 'Sports and travel',
        'Home and lifestyle'], dtype=object),
 'Date': array(['2/28/2019', '3/28/2019', '2/23/2019', '1/29/2019', '3/25/2019',
        '1/13/2019', '3/8/2019', '1/23/2019', '1/20/2019', '2/24/2019',
        '2/7/2019', '2/15/2019', '2/19/2019', '3/14/2019', '2/27/2019',
        '2/2/2019', '3/20/2019', '2/12/2019', '1/5/2019', '1/19/2019',
        '2/10/2019', '2/22/2019', '2/20/2019', '3/13/2019', '1/1/2019',
        '2/5/2019', '2/25/2019', '1/8/2019', '2/8/2019', '2/16/2019',
        '3/16/2019', '1/3/2019', '1/11/2019', '1/31/2019', '1/10/2019',
        '2/1/2019', '3/15/2019', '2/4/2019', '1/9/2019', '3/11/2019',


#### Constructing Pipeline

In [11]:
# Categorize our features
binary_features = [
    'Customer type',
    'Gender'
]

date_feature = [
    'Date'
]

time_features = [
    'Time'
]

nominal_features = [
    'Branch',
    'City',
    'Product line',
    'Payment'
]

In [12]:
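# Creating custom transformers for date and time features
from sklearn.base import BaseEstimator, TransformerMixin

class DateEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for column in list(X.columns):
            # Dates in this dataset look like 1/27/2019 (month/day/year)
            parsed = pd.to_datetime(X[column], format='%m/%d/%Y')
            X[column + '_year'] = parsed.dt.year
            X[column + '_month'] = parsed.dt.month
            X[column + '_day'] = parsed.dt.day
            X = X.drop(column, axis=1)
        return X

class TimeEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for column in list(X.columns):
            # Times in this dataset look like 13:08 (24-hour HH:MM)
            parsed = pd.to_datetime(X[column], format='%H:%M')
            X[column + '_hour'] = parsed.dt.hour
            X[column + '_minute'] = parsed.dt.minute
            X = X.drop(column, axis=1)
        return X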
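
To make the decomposition concrete, here is a small sketch of the same date and time split applied to two rows copied from the table above. It uses the vectorized `.dt` accessor. The explicit `format` strings are an assumption based on the sample values shown (month/day/year dates and 24-hour `HH:MM` times), and the names `sample` and `encoded` are illustrative only.

```python
import pandas as pd

# Two rows mimicking the 'Date' and 'Time' columns of the dataset
sample = pd.DataFrame({
    'Date': ['1/5/2019', '3/8/2019'],
    'Time': ['13:08', '10:29'],
})

# Parse the strings once, then read the components off the datetime values
dates = pd.to_datetime(sample['Date'], format='%m/%d/%Y')
times = pd.to_datetime(sample['Time'], format='%H:%M')

encoded = pd.DataFrame({
    'Date_year': dates.dt.year,
    'Date_month': dates.dt.month,
    'Date_day': dates.dt.day,
    'Time_hour': times.dt.hour,
    'Time_minute': times.dt.minute,
})
print(encoded)
```

Each original column is replaced by plain integer columns, which the downstream scaler and regressors can consume directly.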

In [13]:
# Construct Transformer Pipelines for each feature type

binary_transformer = Pipeline(
    steps=[
        ('ordinal', OrdinalEncoder())
    ]
)

date_transformer = Pipeline(
    steps=[
        ('date', DateEncoder())
    ]
)

time_transformer = Pipeline(
    steps=[
        ('time', TimeEncoder())
    ]
)

nominal_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder())
    ]
)

In [14]:
# Combine transformers with ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
    ('binary', binary_transformer, binary_features),
    ('date', date_transformer, date_feature),
    ('time', time_transformer, time_features),
    ('nominal', nominal_transformer, nominal_features)
],
    remainder = 'passthrough'
)
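
The following sketch shows, on a three-row toy frame, how a `ColumnTransformer` with `remainder='passthrough'` assembles its output: the listed transformers contribute their columns, and any unlisted column is passed through unchanged. The frame `toy` and the name `toy_preprocessor` are hypothetical and only mirror a subset of the notebook's columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame with one binary, one nominal, and one numeric column
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Payment': ['Cash', 'Ewallet', 'Credit card'],
    'Quantity': [7, 5, 1],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('binary', OrdinalEncoder(), ['Gender']),
        ('nominal', OneHotEncoder(), ['Payment']),
    ],
    remainder='passthrough',  # keep 'Quantity' as-is
)

out = toy_preprocessor.fit_transform(toy)

# 1 ordinal column + 3 one-hot columns + 1 passthrough column
print(out.shape)  # (3, 5)
```

With `remainder='drop'` (the default), the `Quantity` column would be discarded instead, which is why the notebook sets `passthrough` to keep the numeric features.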

### Training

In [15]:
# Define models
models = {
    "                     Linear Regression": LinearRegression(),
    " Linear Regression (L2 Regularization)": Ridge(),
    " Linear Regression (L1 Regularization)": Lasso(),
    "                   K-Nearest Neighbors": KNeighborsRegressor(),
    "                        Neural Network": MLPRegressor(), 
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "   Support Vector Machine (RBF Kernel)": SVR(),
    "                          DecisionTree": DecisionTreeRegressor(),
    "                         Random Forest": RandomForestRegressor(),
    "                     Gradient Boosting": GradientBoostingRegressor(),
    "                               XGBoost": XGBRegressor(),
    "                              LightGBM": LGBMRegressor(),
    "                              CatBoost": CatBoostRegressor(verbose=0)
}

In [16]:
# Make a scaler
scaler = StandardScaler()

for name, model in models.items():
    # Construct the final pipeline
    pipeline = Pipeline(
        steps = [
            ('preprocessor', preprocessor),
            ('scaler', scaler),
            ('regressor', model)
        ]
    )
    # Fit the pipeline
    pipeline.fit(X_train, y_train)
    print(name + " trained.")

                     Linear Regression trained.
 Linear Regression (L2 Regularization) trained.
 Linear Regression (L1 Regularization) trained.
                   K-Nearest Neighbors trained.
                        Neural Network trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                          DecisionTree trained.
                         Random Forest trained.
                     Gradient Boosting trained.
                               XGBoost trained.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000502 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1340
[LightGBM] [Info] Number of data points in the train set: 700, number of used features: 27
[LightGBM] [Info] Start training from score 6.960143
                              LightGBM trained.
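
The loop above reuses one `preprocessor` and one `scaler` object in every pipeline, so each fit overwrites their state in place. A possible alternative, sketched below on synthetic data, is to clone the shared steps and keep each fitted pipeline in a dictionary for later scoring. The names `demo_models`, `shared_scaler`, and `fitted_pipelines` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_models = {'Linear Regression': LinearRegression(), 'Ridge': Ridge()}
shared_scaler = StandardScaler()

fitted_pipelines = {}
for name, model in demo_models.items():
    pipeline = Pipeline(steps=[
        ('scaler', clone(shared_scaler)),  # fresh, unfitted copy per pipeline
        ('regressor', model),
    ])
    # fit() returns the pipeline itself, so it can be stored directly
    fitted_pipelines[name] = pipeline.fit(X_demo[:70], y_demo[:70])

# Score each stored pipeline on the held-out rows
scores = {
    name: fitted.score(X_demo[70:], y_demo[70:])
    for name, fitted in fitted_pipelines.items()
}
print(scores)
```

Keeping the fitted pipelines around means the evaluation step does not need to rebuild anything, and each pipeline owns its own preprocessing state.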
                    

### Results

In [18]:
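# Rebuilding the pipelines here works only because Pipeline.fit() fitted the shared
# preprocessor, scaler, and model objects in place during training
for name, model in models.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('scaler', scaler),
        ('regressor', model)
    ])
    print(name + " R^2 Score: {:.5f}".format(pipeline.score(X_test, y_test)))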

                     Linear Regression R^2 Score: -0.03844
 Linear Regression (L2 Regularization) R^2 Score: -0.03816
 Linear Regression (L1 Regularization) R^2 Score: -0.00059
                   K-Nearest Neighbors R^2 Score: -0.20285
                        Neural Network R^2 Score: -0.19971
Support Vector Machine (Linear Kernel) R^2 Score: -0.08847
   Support Vector Machine (RBF Kernel) R^2 Score: -0.14270
                          DecisionTree R^2 Score: -0.98741
                         Random Forest R^2 Score: -0.06153
                     Gradient Boosting R^2 Score: -0.08775
                               XGBoost R^2 Score: -0.39204
                              LightGBM R^2 Score: -0.22274
                              CatBoost R^2 Score: -0.16819
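
Every score above is negative. To help read these numbers, the sketch below computes R^2 on a tiny made-up target vector: predicting the mean of the evaluated targets for every sample scores exactly 0, and predictions that fit worse than that constant score below 0. The arrays are invented for illustration and are unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets for illustration
y_true = np.array([4.0, 6.0, 8.0, 10.0])

# Constant prediction equal to the mean of y_true: R^2 = 1 - SS_res / SS_tot = 0
mean_pred = np.full_like(y_true, y_true.mean())
baseline_r2 = r2_score(y_true, mean_pred)

# Predictions that are farther from the truth than the mean is: R^2 < 0
bad_pred = np.array([10.0, 8.0, 6.0, 4.0])
bad_r2 = r2_score(y_true, bad_pred)

print(baseline_r2, bad_r2)
```

Read this way, a negative test score means the model did no better on the test split than a constant prediction of the test-set mean rating would have.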
