# Final Project

Welcome to the final practical project for our course on [Data Science Bootcamp](https://open.hpi.de/courses/datascience2023). Throughout this project, you will go through the entire data science process, starting from data loading and cleaning, all the way to running a model and making predictions. This hands-on project will provide you with valuable experience and allow you to apply the concepts and techniques you've learned in the course. Get ready to dive into real-world data analysis and build your skills as a data scientist!


## Important Remarks:

 - The ultimate goal of this project is to conduct a comprehensive data analysis and build 2 models using the provided datasets.
 - Code is not the only thing graded here. Well-written and understandable documentation of your code is expected.
 - Clear reasoning behind your choices in every step of the notebook is important, be it the choice of a data cleaning technique, the selection of certain features in your analysis, or the choice of your 2 models.

# Importing packages


### Packages in use:

**pandas** - Open Source Data Analysis and Manipulation Tool, current version 2.0.3, [documentation in this link](https://pandas.pydata.org/).<br>
**seaborn** - Data Visualization library based on matplotlib, current version 0.12, [documentation in this link](https://seaborn.pydata.org/).<br>
**matplotlib** - Library for creating static, animated, and interactive visualizations in Python, current version 3.5, [documentation in this link](https://matplotlib.org/).<br>
**numpy** - Package for scientific computing, current version 1.25.0, [documentation in this link](https://numpy.org/).<br>
**scikit-learn (sklearn)** - Open Source Library for predictive data analysis, current version 1.3, [documentation in this link](https://scikit-learn.org/stable/#).<br>
> Details for **StandardScaler** can be found in the [link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).<br>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder        # Encodes target labels as integers
from sklearn.preprocessing import StandardScaler      # Scales / standardizes features
from sklearn.model_selection import train_test_split  # Splits data into train and test subsets
from sklearn.neighbors import KNeighborsClassifier    # K-nearest neighbors classifier
from sklearn.linear_model import LogisticRegression   # Logistic regression classifier
from sklearn.metrics import accuracy_score            # Accuracy metric

# Load the dataset into data


In [2]:
spmkt = pd.read_csv("/OpenHPI - DS Bootcamp - Final Project/supermarket_survey.csv",
                    sep = ';',
                    header = 0)

# Dataset overview and statistical summary


In [3]:
# Set the max columns to None so that all columns are shown when df.head() is called.
pd.set_option('display.max_columns', None)

Peeking at the dataset for study...

In [4]:
spmkt.head()

Unnamed: 0,randomInt,age,gender,district,modeOfTransportation,distance,G03Q13amountOfPeople,income,frequency,days[1],days[2],days[3],days[4],days[5],days[6],days[7],time[1],time[2],time[3],time[4],time[5],moneySpent,orderingItems,deliveringItems,willingPayDelivery,findProducts,usingDiscounts,preferCash,preferCashless,isRelaxing,satisGeneralStore,satisMusic,satisQualityProducts,satisGeneralAssortment,satisVeganProducts,satisOrganicProducts,satisGlutenfreeProducts,satisAnimalProducts,ideasExtendedBusiness,ideasHelpCarry,ideasCustomerCouncil,ideasFreeWifi,ideasTouchDisplay,ideasSelfCheckout,ideasBikeParking,ideasUndergroundParking
0,4,,Male,Godham,Own Car,1-2km,3.0,120000.0,Twice,No,No,No,Yes,Yes,No,No,No,No,Yes,Yes,No,Between 75 and 100 USD,… ordering online.,… get them directly delivered to my address.,15 to 20 USD,Neutral / Undecided,Rather disagree,Strongly disagree,Rather agree,Strongly agree,4.0,4.0,4.0,3.0,2.0,8.0,8.0,7.0,2.0,4.0,3.0,4.0,,4.0,,
1,4,,,,,,,,,No,No,No,No,No,No,No,No,No,No,No,No,,,,,,,,,,,,,,,,,,,,,,,,,
2,3,20-25,Female,Springtown,Own Car,>7km,2.0,15.0,Three times,No,No,Yes,No,No,No,No,No,No,No,No,Yes,Between 50 and 75,,…get them myself in and from the store.,,,,,,,,,7.0,7.0,7.0,7.0,7.0,,7.0,7.0,7.0,7.0,,7.0,7.0,7.0
3,4,,,,,,,1337.0,,No,No,No,No,No,No,No,No,No,No,No,No,,,,,,,,,,,,,,,,,,,,,,,,,
4,3,15-20,Male,Piltunder,Own Car,1-2km,4.0,250000.0,Twice,No,No,Yes,No,No,No,Yes,No,No,Yes,Yes,No,Between 50 and 75,…selecting them myself in the store.,…get them myself in and from the store.,,Strongly agree,Rather agree,Rather disagree,Strongly agree,Rather disagree,8.0,9.0,7.0,,8.0,8.0,8.0,1.0,9.0,2.0,1.0,10.0,10.0,10.0,8.0,


In [5]:
spmkt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 46 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   randomInt                353 non-null    int64  
 1   age                      345 non-null    object 
 2   gender                   347 non-null    object 
 3   district                 334 non-null    object 
 4   modeOfTransportation     341 non-null    object 
 5   distance                 338 non-null    object 
 6   G03Q13amountOfPeople     345 non-null    object 
 7   income                   331 non-null    float64
 8   frequency                339 non-null    object 
 9   days[1]                  353 non-null    object 
 10  days[2]                  353 non-null    object 
 11  days[3]                  353 non-null    object 
 12  days[4]                  353 non-null    object 
 13  days[5]                  353 non-null    object 
 14  days[6]                  3

### Some background on this dataset - supermarket_survey.csv

It was created as part of the Data Science Bootcamp at OpenHPI, with answers from the participants. The answers came from the participants themselves, but were given with the following general persona in mind: <br>
"*As a mid-career professional, you are a parent of a family. You plan, schedule and organize your grocery shopping because it gets the job done. You buy in bulk, and if you get a chance, you would get your items delivered, even for a premium.*"<br>

The variables and the questions behind them: <br>
- randomInt - random, sequential number. Not part of the survey, but automatically created by the survey tool.<br>
- age - How old are you?<br>
- gender - What is your gender?<br>
- district - In what district do you buy your groceries the most often?<br>
- modeOfTransportation - What mode of transportation do you use primarily to get to and from our store?<br>
- distance - What is the distance you have to cover to go to one of our stores?<br>
- G03Q13amountOfPeople - For how many people are you usually buying groceries for?<br>
- income - What is your overall household income (in USD)? <br>
- frequency -  How often per week do you buy groceries?<br>
- days[X] - When are you usually in our supermarket?, where X is a number that represents the day of the week: 1 - Monday, 2 - Tuesday and so on, up to 7 - Sunday.<br>
- time[X] - At what time of day do you buy your groceries?, where X is a number that represents the time slot: 1 - 8AM to 11AM, 2 - 11AM to 2PM, 3 - 2PM to 5PM, 4 - 5PM to 8PM and 5 - after 8PM till the store closes.<br>
- moneySpent - How much money do you spend on average each time you buy groceries (in USD)?<br>
- orderingItems - What preferences do you have when it comes to selecting / ordering your items?<br>
- deliveringItems - What preferences do you have when it comes to getting / picking up your items? <br>
- willingPayDelivery - If you picked “directly delivered to my address” in the previous question, how much are you willing to pay for this direct delivery? Assume we deliver on the same-day.<br>
- findProducts - Scale of agreement: I can easily find the product I am looking for in the store.<br>
- usingDiscounts - Scale of agreement: I am using discount codes, special offers and promotions to get better prices.<br>
- preferCash - Scale of agreement: I prefer paying with cash.<br>
- preferCashless - Scale of agreement: I prefer paying cashless.<br>
- isRelaxing - Scale of agreement: Shopping is not only about getting my items, it is also relaxing.<br>
- satisGeneralStore - How satisfied are you in general, with your local store?<br>
- satisMusic - How satisfied are you with the music we play in our stores?<br>
- satisQualityProducts - How satisfied are you with the quality of the products?<br>
- satisGeneralAssortment - How satisfied are you with the general assortment of products?<br>
- satisVeganProducts - How satisfied are you with the existing assortment of vegan products?<br>
- satisOrganicProducts - How satisfied are you with the existing assortment of organic products?<br>
- satisGlutenfreeProducts - How satisfied are you with the existing assortment of gluten-free products?<br>
- satisAnimalProducts - How satisfied are you with the existing assortment of fresh animal products (meat, eggs, fish)?<br>
- ideasExtendedBusiness - Extended business hours (before 8 am, after 9 pm)<br>
- ideasHelpCarry - Might you need help to carry your selected products back to your car?<br>
- ideasCustomerCouncil - Are you interested in being part of a customer-store council, helping us further? (unpaid)<br>
- ideasFreeWifi - Free WiFi in the store<br>
- ideasTouchDisplay - Touch-Displays for navigation, offers and in-store entertainment<br>
- ideasSelfCheckout - Self-checkout (for less than 10 items)<br>
- ideasBikeParking - Bike parking spots<br>
- ideasUndergroundParking - Underground parking<br>



Checking some variables, and their numbers and details

In [6]:
# Variable gender
spmkt.gender.value_counts()

gender
Male                 232
Female                84
Prefer not to say     22
Diverse                9
Name: count, dtype: int64

In [7]:
# Variable age
spmkt.age.value_counts().sort_index(ascending=True)

age
15-20    10
20-25    53
25-30    49
30-35    37
35-40    53
40-45    34
45-50    22
50-55    28
55-60    17
60-65    14
65-70    20
70-75     4
>75       4
Name: count, dtype: int64

In [8]:
# Variable frequency - how often the person goes to the supermarket
spmkt.frequency.value_counts()

frequency
Twice                   136
Once                     99
Three times              69
Four times               23
More than four times     12
Name: count, dtype: int64

In [9]:
# Variable modeOfTransportation - type of transportation the person uses when going to the supermarket
spmkt.modeOfTransportation.value_counts()

modeOfTransportation
Bicycle                       108
Walking                       105
Own Car                        97
Public transportation          20
Rented car (“car sharing”)      6
Taxi                            5
Name: count, dtype: int64

In [10]:
# Variable distance - the distance the person needs to travel to get to the supermarket
spmkt.distance.value_counts()

distance
500 meters to 1km               102
1-2km                            88
3-5km                            61
Less than few hundred meters     48
5-7km                            23
>7km                             16
Name: count, dtype: int64

In [11]:
# Variable G03Q13amountOfPeople - for how many people the person buys groceries in the supermarket
spmkt.G03Q13amountOfPeople.value_counts()

G03Q13amountOfPeople
2            104
1             93
3             58
4             55
5 or more     35
Name: count, dtype: int64

In [12]:
# Variable income - person's income
spmkt.income.describe()

count       331.000000
mean      66275.568882
std      132542.950482
min      -99932.000000
25%        2290.000000
50%       21000.000000
75%       80284.000000
max      999999.000000
Name: income, dtype: float64

In [13]:
# Variable days[x] - day or days of the week the person goes to the supermarket
print(f"Supermarket on Mondays: {spmkt.loc[spmkt['days[1]'] == 'Yes', 'days[1]'].count()}.")
print(f"Supermarket on Tuesdays: {spmkt.loc[spmkt['days[2]'] == 'Yes', 'days[2]'].count()}.")
print(f"Supermarket on Wednesdays: {spmkt.loc[spmkt['days[3]'] == 'Yes', 'days[3]'].count()}.")
print(f"Supermarket on Thursdays: {spmkt.loc[spmkt['days[4]'] == 'Yes', 'days[4]'].count()}.")
print(f"Supermarket on Fridays: {spmkt.loc[spmkt['days[5]'] == 'Yes', 'days[5]'].count()}.")
print(f"Supermarket on Saturdays: {spmkt.loc[spmkt['days[6]'] == 'Yes', 'days[6]'].count()}.")
print(f"Supermarket on Sundays: {spmkt.loc[spmkt['days[7]'] == 'Yes', 'days[7]'].count()}.")

Supermarket on Mondays: 117.
Supermarket on Tuesdays: 105.
Supermarket on Wednesdays: 104.
Supermarket on Thursdays: 89.
Supermarket on Fridays: 134.
Supermarket on Saturdays: 189.
Supermarket on Sundays: 50.
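
The seven print statements above repeat one pattern: compare a column with `'Yes'` and count the matches. As a minimal sketch, assuming a small hypothetical DataFrame `toy` that only mimics the `days[x]` column naming (it is not the survey data), the same counts could be obtained in a single vectorized step:

```python
import pandas as pd

# Hypothetical toy data that mimics the days[x] columns of the survey
toy = pd.DataFrame({
    "days[1]": ["Yes", "No", "Yes"],
    "days[2]": ["No", "No", "Yes"],
    "days[3]": ["No", "No", "No"],
})

# Readable names for the columns (only three days in this toy example)
day_names = {"days[1]": "Mondays", "days[2]": "Tuesdays", "days[3]": "Wednesdays"}

# Comparing the columns with 'Yes' yields booleans; summing counts the True values per column
yes_counts = (toy[list(day_names)] == "Yes").sum()

for col, name in day_names.items():
    print(f"Supermarket on {name}: {yes_counts[col]}.")
```

The same idea would apply to the `time[x]` columns; only the dictionary of names would change.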


In [14]:
# Variable time[x] - period of the day the person goes to the supermarket
print(f"Supermarket between 8AM and 11AM: {spmkt.loc[spmkt['time[1]'] == 'Yes', 'time[1]'].count()}.")
print(f"Supermarket between 11AM and 2PM: {spmkt.loc[spmkt['time[2]'] == 'Yes', 'time[2]'].count()}.")
print(f"Supermarket between 2PM and 5PM: {spmkt.loc[spmkt['time[3]'] == 'Yes', 'time[3]'].count()}.")
print(f"Supermarket between 5PM and 8PM: {spmkt.loc[spmkt['time[4]'] == 'Yes', 'time[4]'].count()}.")
print(f"Supermarket after 8PM: {spmkt.loc[spmkt['time[5]'] == 'Yes', 'time[5]'].count()}.")

Supermarket between 8AM and 11AM: 112.
Supermarket between 11AM and 2PM: 76.
Supermarket between 2PM and 5PM: 66.
Supermarket between 5PM and 8PM: 192.
Supermarket after 8PM: 98.


In [15]:
# Variable moneySpent - amount of money (in dollars) that the person spends each time he/she goes to the supermarket
spmkt.moneySpent.value_counts()

moneySpent
Between 25 and 50 USD     103
Between 50 and 75          58
Less than 25 USD           58
More than 125 USD          44
Between 75 and 100 USD     42
100 to 125 USD             33
Name: count, dtype: int64

In [16]:
# Variable orderingItems - person's preference for buying in the store or selecting online
spmkt.orderingItems.value_counts()

orderingItems
…selecting them myself in the store.    250
… ordering online.                       84
Name: count, dtype: int64

## Question to evaluate
<br>
How do some characteristics (variables) affect how a person likes to order her/his items (variable orderingItems)?

# Data cleaning

For this study, only some of the variables in the original dataset will be used. To simplify the analysis, a new dataset, *spmkt1*, is created as a subset of the original.

In [17]:
# Creating the subset spmkt1 with the variables selected for the analysis of the proposed question

spmkt1 = spmkt[['age', 'gender', 'modeOfTransportation', 'distance', 'G03Q13amountOfPeople', 'income',
                'frequency', 'days[1]', 'days[2]', 'days[3]', 'days[4]', 'days[5]', 'days[6]', 'days[7]',
                'time[1]', 'time[2]', 'time[3]', 'time[4]', 'time[5]', 'moneySpent', 'orderingItems' ]]

# Renaming the columns for better understanding

spmkt1 = spmkt1.rename(columns = {'modeOfTransportation': 'transport',
                                  'G03Q13amountOfPeople': 'amount_of_people',
                                  'days[1]': 'monday',
                                  'days[2]': 'tuesday',
                                  'days[3]': 'wednesday',
                                  'days[4]': 'thursday',
                                  'days[5]': 'friday',
                                  'days[6]': 'saturday',
                                  'days[7]': 'sunday',
                                  'time[1]': 'morning',
                                  'time[2]': 'lunch_time',
                                  'time[3]': 'afternoon',
                                  'time[4]': 'evening',
                                  'time[5]': 'night'})

In [18]:
# Checking the new dataframe information (and also, the number of entries - which allows identification 
# of missing values in some variables)

spmkt1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   age               345 non-null    object 
 1   gender            347 non-null    object 
 2   transport         341 non-null    object 
 3   distance          338 non-null    object 
 4   amount_of_people  345 non-null    object 
 5   income            331 non-null    float64
 6   frequency         339 non-null    object 
 7   monday            353 non-null    object 
 8   tuesday           353 non-null    object 
 9   wednesday         353 non-null    object 
 10  thursday          353 non-null    object 
 11  friday            353 non-null    object 
 12  saturday          353 non-null    object 
 13  sunday            353 non-null    object 
 14  morning           353 non-null    object 
 15  lunch_time        353 non-null    object 
 16  afternoon         353 non-null    object 
 1

In [19]:
# Checking some of the data in the new dataset created

spmkt1.head()

Unnamed: 0,age,gender,transport,distance,amount_of_people,income,frequency,monday,tuesday,wednesday,thursday,friday,saturday,sunday,morning,lunch_time,afternoon,evening,night,moneySpent,orderingItems
0,,Male,Own Car,1-2km,3.0,120000.0,Twice,No,No,No,Yes,Yes,No,No,No,No,Yes,Yes,No,Between 75 and 100 USD,… ordering online.
1,,,,,,,,No,No,No,No,No,No,No,No,No,No,No,No,,
2,20-25,Female,Own Car,>7km,2.0,15.0,Three times,No,No,Yes,No,No,No,No,No,No,No,No,Yes,Between 50 and 75,
3,,,,,,1337.0,,No,No,No,No,No,No,No,No,No,No,No,No,,
4,15-20,Male,Own Car,1-2km,4.0,250000.0,Twice,No,No,Yes,No,No,No,Yes,No,No,Yes,Yes,No,Between 50 and 75,…selecting them myself in the store.


To make the analysis and the treatment of the variables easier, some long strings in the categorical variables are replaced with shorter labels that carry the same information.

In [20]:
# Adjusting categorical variable contents, to simplify the categories.

spmkt1['age'] = spmkt1['age'].replace({'>75': '75plus'})

spmkt1['gender'] = spmkt1['gender'].replace({'Prefer not to say': 'NI'})

spmkt1['transport'] = spmkt1['transport'].replace({'Own Car': 'car',
                                                   'Bicycle': 'bike',
                                                   'Walking': 'walking',
                                                   'Public transportation': 'public',
                                                   'Rented car (“car sharing”)': 'rented',
                                                   'Taxi': 'taxi'})

spmkt1['distance'] = spmkt1['distance'].replace({'500 meters to 1km': '500m-1km',
                                                 'Less than few hundred meters': 'less500m',
                                                 '>7km': 'more7km'})

spmkt1['amount_of_people'] = spmkt1['amount_of_people'].replace({'5 or more': 'more5'})

spmkt1['moneySpent'] = spmkt1['moneySpent'].replace({'Between 25 and 50 USD': '25-50USD',
                                                     'Between 50 and 75': '50-75USD',
                                                     'Less than 25 USD': 'less25USD',
                                                     'More than 125 USD': 'more125USD',
                                                     'Between 75 and 100 USD': '75-100USD',
                                                     '100 to 125 USD': '100-125USD'})

spmkt1['orderingItems'] = spmkt1['orderingItems'].replace({'…selecting them myself in the store.': 'store',
                                                           '… ordering online.': 'online'})

In [21]:
# Checking for missing values (in the original dataset spmkt)
spmkt.isnull().sum()

randomInt                    0
age                          8
gender                       6
district                    19
modeOfTransportation        12
distance                    15
G03Q13amountOfPeople         8
income                      22
frequency                   14
days[1]                      0
days[2]                      0
days[3]                      0
days[4]                      0
days[5]                      0
days[6]                      0
days[7]                      0
time[1]                      0
time[2]                      0
time[3]                      0
time[4]                      0
time[5]                      0
moneySpent                  15
orderingItems               19
deliveringItems             20
willingPayDelivery         187
findProducts                19
usingDiscounts              27
preferCash                  22
preferCashless              24
isRelaxing                  26
satisGeneralStore           21
satisMusic                  65
satisQua

In [22]:
# Deleting entries with missing values
spmkt1_nona = spmkt1.dropna()
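
With its default arguments, `dropna()` removes every row that has at least one missing value, so it is worth checking how many entries are lost. The sketch below illustrates this on a hypothetical toy DataFrame (`toy` and its values are made up for illustration and are not taken from the survey):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in different columns
toy = pd.DataFrame({
    "age": ["20-25", np.nan, "30-35", "40-45"],
    "income": [1200.0, 500.0, np.nan, 80000.0],
})

# Default behavior (how="any"): a row is dropped if any of its values is missing
toy_nona = toy.dropna()

rows_removed = len(toy) - len(toy_nona)
print(f"Rows before: {len(toy)}, after: {len(toy_nona)}, removed: {rows_removed}")
```

Applied to the notebook, the same comparison between `len(spmkt1)` and `len(spmkt1_nona)` would show how many survey entries remain for modeling.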

Both models considered in this study (KNN and Logistic Regression) expect numerical variables (or at least dummy variables - y/n). With the exception of the variable *income*, all other variables are categorical, with two or more categories. The next step converts all the categorical variables into dummy variables.

In [23]:
# Transforming categorical variables in dummy variables
spmkt1_nona = pd.get_dummies(spmkt1_nona, drop_first = True)

spmkt1_nona

Unnamed: 0,income,age_20-25,age_25-30,age_30-35,age_35-40,age_40-45,age_45-50,age_50-55,age_55-60,age_60-65,age_65-70,age_70-75,age_75plus,gender_Female,gender_Male,gender_NI,transport_car,transport_public,transport_rented,transport_taxi,transport_walking,distance_3-5km,distance_5-7km,distance_500m-1km,distance_less500m,distance_more7km,amount_of_people_2,amount_of_people_3,amount_of_people_4,amount_of_people_more5,frequency_More than four times,frequency_Once,frequency_Three times,frequency_Twice,monday_Yes,tuesday_Yes,wednesday_Yes,thursday_Yes,friday_Yes,saturday_Yes,sunday_Yes,morning_Yes,lunch_time_Yes,afternoon_Yes,evening_Yes,night_Yes,moneySpent_25-50USD,moneySpent_50-75USD,moneySpent_75-100USD,moneySpent_less25USD,moneySpent_more125USD,orderingItems_store
4,250000.0,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,True,False,False,False,True,False,False,True,True,False,False,True,False,False,False,True
5,500.0,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,True,False,True,False,False,True,True,False,False,False,True,False,True
6,5000.0,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,True,False,False,False,True,True,False,False,False,False,False,False,False,False,True
8,600.0,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,True,True,True,False,True,False,False,False,False,True,False,False,False,False,True,False,True
9,1200.0,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,True,False,True,True,False,False,False,False,True,True,False,False,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
348,45700.0,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,True,False,False,True,False,False,False,True,False,False,False,True,False,False,False,False,True
349,50.0,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
350,5.5,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,True,True
351,600.0,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,True,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,True,True,False,True,True,False,False,False,True,False,False,False,True
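
To make the effect of `drop_first = True` easier to see, here is a minimal sketch on a hypothetical toy DataFrame (the column values are illustrative only). For each categorical variable, the first category in sorted order is dropped and acts as the reference category:

```python
import pandas as pd

# Hypothetical toy data with one multi-category and one Yes/No variable
toy = pd.DataFrame({
    "transport": ["bike", "car", "walking", "car"],
    "monday": ["Yes", "No", "No", "Yes"],
})

# drop_first=True removes the first category of each variable ('bike' and 'No' here)
dummies = pd.get_dummies(toy, drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```

This matches the column list above: there is no `age_15-20`, `transport_bike` or `orderingItems_online` column, because those categories were dropped. A row in which `orderingItems_store` is False therefore corresponds to a person who prefers ordering online.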


# Dataset preparation

In [25]:
# Defining features and assigning X and y for the analysis.

# Creating an auxiliary dataset without the target column
spmkt1_aux = spmkt1_nona.loc[:, spmkt1_nona.columns != 'orderingItems_store']

# Creating variable for selecting the columns for X
feature_columns = []

for col in spmkt1_aux:
    feature_columns.append(col)
    
# Defining X:
X = spmkt1_nona[feature_columns].values

# Defining y:
y = spmkt1_nona['orderingItems_store'].values

# Label encoding, so y holds integers (0/1) instead of boolean values
label_enc = LabelEncoder()
y = label_enc.fit_transform(y)

In [26]:
# Splitting the data, considering 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
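
As a small illustration of what the 80-20 split produces, the sketch below applies `train_test_split` with the same `test_size` and `random_state` to hypothetical toy arrays (`X_toy` and `y_toy` are made up and only serve to show the resulting shapes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and a binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Same arguments as in the notebook: 20% of the rows go to the test subset
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```

scikit-learn also offers a `stratify` argument that keeps the class proportions similar in both subsets; the notebook does not use it, so it is mentioned here only as an option.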

In [27]:
# Display some elements before standardization
X_train[0:5]

array([[400000.0, False, False, False, False, False, True, False, False,
        False, False, False, False, False, True, False, True, False,
        False, False, False, True, False, False, False, False, False,
        True, False, False, False, False, False, True, False, True,
        False, False, True, False, False, True, False, False, False,
        False, False, False, True, False, False],
       [125000.0, False, False, True, False, False, False, False, False,
        False, False, False, False, False, True, False, False, False,
        False, False, True, False, False, True, False, False, True,
        False, False, False, False, False, True, False, True, True,
        False, True, False, False, False, False, False, False, True,
        False, True, False, False, False, False],
       [3300.0, False, False, False, False, True, False, False, False,
        False, False, False, False, False, True, False, False, False,
        False, False, False, True, False, False, False, False,

In [28]:
# Scale / standardization of the features using StandardScaler
# Initiating the object and apply scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
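
The scaler is fitted on the training data only and then reused on the test data, so the test set is transformed with the mean and standard deviation learned from the training set. A minimal sketch on hypothetical one-feature toy arrays shows that this corresponds to computing (x - mean) / std with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a single feature
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean and std from the training data
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training mean and std

# Manual standardization with the training statistics (population std, ddof=0)
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)
manual_train = (X_train_toy - train_mean) / train_std
manual_test = (X_test_toy - train_mean) / train_std

print(np.allclose(X_train_scaled, manual_train))
print(np.allclose(X_test_scaled, manual_test))
```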

In [29]:
# Display some elements after standardization
X_train[0:5]

array([[ 3.54985098, -0.43054135, -0.41702883, -0.35355339, -0.43054135,
        -0.3233349 ,  3.76662979, -0.29124119, -0.2071677 , -0.21774709,
        -0.25649459, -0.0910975 , -0.0910975 , -0.57259833,  0.70056944,
        -0.25649459,  1.72731195, -0.25649459, -0.14494276, -0.12936925,
        -0.67460345,  2.25277607, -0.26548932, -0.66171733, -0.41702883,
        -0.2071677 , -0.6681531 ,  2.47932628, -0.4571162 , -0.31551151,
        -0.18450624, -0.62972353, -0.53452248,  1.27000127, -0.76021186,
         1.48235233, -0.69405138, -0.61064012,  1.24815654, -1.18572366,
        -0.40333538,  1.46827932, -0.5153882 , -0.50257071, -1.18572366,
        -0.65529517, -0.68106932, -0.43723732,  2.82842712, -0.46368092,
        -0.38943391],
       [ 0.721313  , -0.43054135, -0.41702883,  2.82842712, -0.43054135,
        -0.3233349 , -0.26548932, -0.29124119, -0.2071677 , -0.21774709,
        -0.25649459, -0.0910975 , -0.0910975 , -0.57259833,  0.70056944,
        -0.25649459, -0.57893

# Creating ML model 1 - KNN

The performance of a KNN model may change depending on the number of neighbors considered. For this reason, the analysis tests 2 to 7 neighbors.

In [30]:
# Number of neighbors = 2
classifier_2 = KNeighborsClassifier(n_neighbors=2)

# Number of neighbors = 3
classifier_3 = KNeighborsClassifier(n_neighbors=3)

# Number of neighbors = 4
classifier_4 = KNeighborsClassifier(n_neighbors=4)

# Number of neighbors = 5
classifier_5 = KNeighborsClassifier(n_neighbors=5)

# Number of neighbors = 6
classifier_6 = KNeighborsClassifier(n_neighbors=6)

# Number of neighbors = 7
classifier_7 = KNeighborsClassifier(n_neighbors=7)

In [31]:
# Fitting the model, considering the number of neighbors in analysis

# Fitting the model for 2 neighbors
classifier_2.fit(X_train, y_train)

# Fitting the model for 3 neighbors
classifier_3.fit(X_train, y_train)

# Fitting the model for 4 neighbors
classifier_4.fit(X_train, y_train)

# Fitting the model for 5 neighbors
classifier_5.fit(X_train, y_train)

# Fitting the model for 6 neighbors
classifier_6.fit(X_train, y_train)

# Fitting the model for 7 neighbors
classifier_7.fit(X_train, y_train)

## Prediction on Test data

In [32]:
# Predicting the Test set results, considering the number of neighbors in analysis

# Predicting the test set results (2 neighbors)
y_predict_2 = classifier_2.predict(X_test)

# Predicting the test set results (3 neighbors)
y_predict_3 = classifier_3.predict(X_test)

# Predicting the test set results (4 neighbors)
y_predict_4 = classifier_4.predict(X_test)

# Predicting the test set results (5 neighbors)
y_predict_5 = classifier_5.predict(X_test)

# Predicting the test set results (6 neighbors)
y_predict_6 = classifier_6.predict(X_test)

# Predicting the test set results (7 neighbors)
y_predict_7 = classifier_7.predict(X_test)

## Model 1 Performance

In [33]:
# Check the results against the test subset of the dataset

# Model considering 2 neighbors
accuracy_2 = accuracy_score(y_test, y_predict_2)
print(f"Accuracy for the model with 2 neighbors: {accuracy_2 * 100}")

# Model considering 3 neighbors
accuracy_3 = accuracy_score(y_test, y_predict_3)
print(f"Accuracy for the model with 3 neighbors: {accuracy_3 * 100}")

# Model considering 4 neighbors
accuracy_4 = accuracy_score(y_test, y_predict_4)
print(f"Accuracy for the model with 4 neighbors: {accuracy_4 * 100}")

# Model considering 5 neighbors
accuracy_5 = accuracy_score(y_test, y_predict_5)
print(f"Accuracy for the model with 5 neighbors: {accuracy_5 * 100}")

# Model considering 6 neighbors
accuracy_6 = accuracy_score(y_test, y_predict_6)
print(f"Accuracy for the model with 6 neighbors: {accuracy_6 * 100}")

# Model considering 7 neighbors
accuracy_7 = accuracy_score(y_test, y_predict_7)
print(f"Accuracy for the model with 7 neighbors: {accuracy_7 * 100}")

Accuracy for the model with 2 neighbors: 62.295081967213115
Accuracy for the model with 3 neighbors: 68.85245901639344
Accuracy for the model with 4 neighbors: 67.21311475409836
Accuracy for the model with 5 neighbors: 68.85245901639344
Accuracy for the model with 6 neighbors: 73.77049180327869
Accuracy for the model with 7 neighbors: 73.77049180327869
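
The create, fit, predict and score steps are identical for every number of neighbors, so they could also be written as a loop. The sketch below is one possible way to do this; it runs on synthetic data from `make_classification` (an assumption made only to keep the example self-contained), so its accuracies are not comparable to the results above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey features and target (illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Same preprocessing as in the notebook: fit the scaler on train, reuse it on test
scaler_syn = StandardScaler()
X_tr = scaler_syn.fit_transform(X_tr)
X_te = scaler_syn.transform(X_te)

# One model per number of neighbors, from 2 to 7
accuracies = {}
for k in range(2, 8):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    accuracies[k] = accuracy_score(y_te, knn.predict(X_te))
    print(f"Accuracy for the model with {k} neighbors: {accuracies[k] * 100:.2f}")

# In case of a tie, max() returns the smallest k that reaches the best accuracy
best_k = max(accuracies, key=accuracies.get)
print(f"Best number of neighbors: {best_k}")
```

Keeping the accuracies in a dictionary makes it straightforward to pick the best value or to plot accuracy against the number of neighbors.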


## Comments for the Model 1, KNN results

The number of neighbors considered in the model affects the accuracy. The tests show that, with the chosen features, the best accuracy is reached with 6 neighbors (7 neighbors gives the same result).<br>
So, considering the features chosen and 6 neighbors, the model can predict with an accuracy of 73.77%.

# Creating ML model 2 - Logistic Regression

In [36]:
# Creating the Logistic Regression Model
model = LogisticRegression()

# Training the model
model.fit(X_train, y_train)

## Prediction on Test data

In [37]:
# Prediction
y_pred = model.predict(X_test)

## Model 2 Performance

In [38]:
# Accuracy
accuracy_LR = accuracy_score(y_test, y_pred)
print('Accuracy of Logistic Regression for the dataset:', accuracy_LR)

Accuracy of Logistic Regression for the dataset: 0.7213114754098361
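
For reference, `accuracy_score` returns the fraction of predictions that match the true labels, a value between 0 and 1. A minimal sketch with hypothetical toy labels (not the notebook's predictions) shows the equivalence with a manual computation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions: 3 of the 5 predictions are correct
y_true_toy = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 1])

acc = accuracy_score(y_true_toy, y_pred_toy)
manual_acc = np.mean(y_true_toy == y_pred_toy)  # share of matching positions

print(acc, manual_acc)
```

Note that the Logistic Regression accuracy above is printed as a fraction (about 0.72), while the KNN accuracies were multiplied by 100, so the two are on different scales until converted.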


# Report and insight from your analysis

The dataset *supermarket_survey.csv* used in this study has 353 entries and 46 columns. Among the variables selected for this analysis, income is the only numerical one; all the others are categorical. The dataset contains information about people who buy groceries in a supermarket. <br>

It is a very extensive collection of data that allows many analyses. This study focuses on a single question: how do some characteristics (variables) affect how a person likes to order her/his items (variable orderingItems)? In other words, is it possible to predict whether a person prefers to order his/her items online, considering the other characteristics provided?

This analysis may allow the supermarket to review some of its online marketing strategies, to better organize its delivery process, or to estimate whether new hires will be necessary to support the store or the online stock.

Part of the data cleaning was to rename some variables (to simpler names) and to shorten the category values. This prepared the later step of converting the categorical variables into dummy variables, as both models chosen (KNN and Logistic Regression) do not work directly with categorical data.

Some findings from the dataset could be identified during the EDA phase:<br>
- Most buyers are male;<br>
- Most buyers are between 20 and 55 years old;<br>
- Most buyers go to the supermarket twice a week;<br>
- Bicycle and walking are the preferred modes of transportation;<br>
- The preferred days for going to the supermarket are Saturdays and Fridays;<br>
- And the preferred times are in the evening (between 5PM and 8PM) and in the morning (between 8AM and 11AM).<br>
- And finally, most buyers prefer to select their products themselves in the store.

For the predictive analysis, the dataset was split into 2 subsets:<br>
- Training set, with 80% of data;<br>
- Test set, with 20% of data.<br>

After running both models, KNN performed slightly better, with an accuracy of 73.77% in the prediction, while Logistic Regression reached an accuracy of 72.13%.<br>
<br>
Perhaps with other variables, or with a larger dataset, the models could be retrained to provide better accuracy and predictions.

***