# Python Scientific Data Analysis
## Course's Final Project
### Barak Daniel - 204594329

## Installations needed for the program to run:

In [None]:
#!pip install numpy
#!pip install pandas
#!pip install seaborn
#!pip install matplotlib
#!pip install scikit-learn
#!pip install scipy
#!pip install pydotplus

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.naive_bayes import GaussianNB
import sklearn as skl
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pydotplus
import random

# Intro

### Overview
This project is about targeted marketing. Each row of the given data set describes one customer, and the features describe the customer's general status (age, marital status, etc.) and shopping behavior.
In this section of the project I go over the data set and the goal, to understand the whole process before starting the actual work on the data.

### So what exactly is targeted marketing?
Targeting in marketing is a strategy that breaks a large market into smaller segments to concentrate on a specific group of customers within that audience. It defines a segment of customers based on their unique characteristics and focuses solely on serving them.
Instead of trying to reach an entire market, a brand uses target marketing to put their energy into connecting with a specific, defined group within that market.
For this reason I'll break down all the features in the given dataset ('customers3.csv') and explain what each one means.


### Feature data breakdown
The following features are included in the data set given in 'customers3.csv':
- ID - Unique ID to each customer
- Gender - The gender of the customer
- Ever_Married - Indicates whether the customer has ever been married
- Age - The age of the customer
- Graduated - Has the customer graduated high school
- Profession - The profession of the customer
- Work_Experience - The number of years of the customer's experience in their profession
- Spending_Score - The spending habits of the customer classified to 3 categories
- Family_Size - The number of family members the customer has in his household
- Shop_Day - The day of the week which the customer is shopping on the most
- Shop_Other - Normalized measure of customer deviation from average store customer spending on non specified products
- Shop_Dairy - Normalized measure of customer deviation from average store customer spending on dairy products
- Shop_Household - Normalized measure of customer deviation from average store customer spending on household products
- Shop_Meat - Normalized measure of customer deviation from average store customer spending on meat products
- Group - The target group which the customer belongs to

### Feature type breakdown
- ID - Numerical discrete (Integer)
- Gender - Categorical (Male/Female)
- Ever_Married - Categorical nominal (Yes/No)
- Age - Numerical continuous (Integer)
- Graduated - Categorical (Yes/No)
- Profession - Categorical nominal
- Work_Experience - Numerical discrete (Integer)
- Spending_Score - Categorical ordinal (Low/Average/High)
- Family_Size - Numerical discrete (Integer)
- Shop_Day - Categorical ordinal (Sunday, Monday, ..., Saturday)
- Shop_Other - Numerical continuous (Double)
- Shop_Dairy - Numerical continuous (Double)
- Shop_Household - Numerical continuous (Double)
- Shop_Meat - Numerical continuous (Double)
- Group - Categorical nominal
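
The ordinal/nominal distinction in the list above can be made explicit in pandas. The following is only a sketch on a small made-up frame (not the project data), and it assumes the three Spending_Score levels are ordered Low < Average < High:

```python
import pandas as pd

# Toy rows for illustration only; these values are not taken from customers3.csv.
toy = pd.DataFrame({
    "Spending_Score": ["Low", "High", "Average", "Low"],
    "Profession": ["Doctor", "Artist", "Doctor", "Engineer"],
})

# Ordinal feature: declare the order explicitly so sorting and max/min respect it.
score_type = pd.CategoricalDtype(categories=["Low", "Average", "High"], ordered=True)
toy["Spending_Score"] = toy["Spending_Score"].astype(score_type)

# Nominal feature: categories without any order.
toy["Profession"] = toy["Profession"].astype("category")

score_codes = toy["Spending_Score"].cat.codes.tolist()
highest_score = toy["Spending_Score"].max()
print(score_codes)
print(highest_score)
```

With an ordered dtype the integer codes follow the declared order, which is what makes a numeric encoding of an ordinal feature meaningful.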

In [None]:
df = pd.read_csv('customers3.csv')
count = df.count()

print("The number of rows is: {}".format(len(df.index)))
print("The number of columns is: {}".format(len(df.columns)))
print("The number of cells is: {}".format(len(df.index) * len(df.columns)))
print("The number of cells with concrete values is: {}".format(count.sum()))
print("The number of cells without concrete values is: {}\n".format(len(df.index) * len(df.columns) - count.sum()))

print("\nThe number of concrete values for each feature:")
df.count()


### The size of the data set is:
- 8120 rows of customer data (+1 for the header row)
- 15 columns for the features
- 8120*15 = 121,800 cells, but we can see that not all of them have concrete values.

### Missing values:
After inspecting the dataset and trying to understand it, I found many cells with missing values.
After reading the dataset, these NaN values need to be transformed. For each feature with missing data, I'll examine it and decide which method is best for dealing with those values (mode, mean, median, removal, etc.).
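
As a compact way to see where the missing values sit, the counts can be taken per feature. This is a minimal sketch on a made-up frame (the column values are invented for illustration), not the project's actual output:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the real data comes from customers3.csv.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Age": [34.0, 51.0, np.nan, 29.0],
    "Group": ["A", "B", "B", "D"],
})

n_rows, n_cols = toy.shape
total_cells = toy.size                 # rows * columns
missing_per_column = toy.isna().sum()  # NaN/None count for each feature
total_missing = int(missing_per_column.sum())

print(f"{n_rows} rows x {n_cols} columns = {total_cells} cells")
print(missing_per_column)
print(f"{total_missing} cells are missing")
```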

### Other types of missing values:
The values that are present must be validated as well. After going through the features, the possible problems are a numeric value outside the range given in the feature definition, a negative value where only non-negative values make sense, and so on.

After going through the dataset, these are the features that need to be fixed:
- Shop_Day - Must contain values from 1 to 7, but some values fall outside this range. These will be treated as missing and filled by the same method as the rest of the feature's missing values.
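
A range check like the one described for Shop_Day could be sketched as follows. The values here are invented for illustration, and the 1 to 7 bounds are taken from the feature definition above:

```python
import numpy as np
import pandas as pd

# Hypothetical Shop_Day values; 0 and 22 stand in for out-of-range entries.
shop_day = pd.Series([1, 7, 0, 3, np.nan, 22, 5], name="Shop_Day")

# between() is inclusive on both ends and returns False for NaN,
# so missing values are excluded explicitly to count only real out-of-range entries.
out_of_range = shop_day.notna() & ~shop_day.between(1, 7)

invalid_values = sorted(shop_day[out_of_range].unique().tolist())
n_invalid = int(out_of_range.sum())
print("Out-of-range values:", invalid_values)
print("Rows affected:", n_invalid)
```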


## Initial Data Analysis

As we saw above, there are many missing values and categorical values that need to be transformed before the data can be analyzed completely.
In this section of the project I deal with those values. For each feature a check is made (in the cells below, the correlation with the target group), and the result decides which method is used for that feature.

The first feature with missing data to handle is 'Gender'. Since it has fewer than 60 missing values and is binary, the method I use is to check the distribution of its values over the data set and then fill the missing values according to that distribution.


In [None]:
male = 0
female = 0
for index,row in df.iterrows():
    if(row['Gender'] == 'Male'):
        male += 1
    elif(row['Gender'] == 'Female'):
        female += 1

print("Female percentage", female/(male+female))
print("Male percentage", male/(male+female))

So we can see that females represent ~45.25% of the rows with a known gender and males ~54.75%.
Now we fill the missing values with this distribution:

In [None]:
nans = df['Gender'].isna()
length = sum(nans)
replacement = random.choices(['Male', 'Female'], weights=[.5475, .4525], k=length)
df.loc[nans,'Gender'] = replacement

df.Gender.count()
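
The distribution-based fill can also take its weights directly from the observed counts instead of typed-in numbers. This is a sketch on a made-up column, with a fixed seed added only so the example is repeatable:

```python
import random
import pandas as pd

random.seed(0)  # fixed seed so the sketch is repeatable

# Toy column for illustration; not the project data.
gender = pd.Series(["Male", "Female", None, "Male", None, "Female", "Male"])

# Observed counts of the non-missing values (NaN is excluded by default).
counts = gender.value_counts()
labels = counts.index.tolist()
weights = counts.tolist()  # random.choices accepts relative weights

missing = gender.isna()
gender_filled = gender.copy()
gender_filled.loc[missing] = random.choices(labels, weights=weights, k=int(missing.sum()))

print(gender_filled.value_counts())
```

Because `random.choices` works with relative weights, the raw counts can be passed as they are and do not need to be converted to fractions first.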

The next feature I'll deal with is Ever_Married, which is also binary (Yes/No).


In [None]:
married = 0
notmarried = 0
for index,row in df.iterrows():
    if(row['Ever_Married'] == 'Yes'):
        married += 1
    elif(row['Ever_Married'] == 'No'):
        notmarried += 1


married_list = ["Yes", "No"]
df["Ever_Married_Transformed"] = pd.Categorical(df.Ever_Married, ordered=True, categories=married_list).codes + 1
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Ever_Married_Transformed"]))
# drop() returns a new DataFrame, so the result has to be assigned back
df = df.drop(['Ever_Married_Transformed', 'Group_Transformed'], axis=1)

print("Percentage of married = ", (married/(married+notmarried)))
print("Percentage of not married = ", (notmarried/(married+notmarried)))


The correlation to the group is low and therefore I can use the same method as I did in 'Gender'.
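
One detail of this encoding may be worth keeping in mind when reading the correlation: pandas gives missing values the code -1, so after `+ 1` they become 0 and take part in the correlation like a real category. The sketch below uses made-up values to show this, along with one possible way (an assumption, not what the cells here do) to turn those zeros back into NaN:

```python
import pandas as pd

# Toy values for illustration only.
married = pd.Series(["Yes", "No", None, "Yes"])

# Same encoding as above: Yes -> 1, No -> 2, and a missing value -> 0.
codes = pd.Categorical(married, categories=["Yes", "No"], ordered=True).codes + 1
print(codes.tolist())

# Optional: restore NaN for the missing rows so Series.corr skips them.
codes_nan = pd.Series(codes, dtype="float").where(codes > 0)
print(codes_nan.tolist())
```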

In [None]:
nans = df['Ever_Married'].isna()
length = sum(nans)
replacement = random.choices(['Yes', 'No'], weights=[.5859, .4141], k=length)
df.loc[nans,'Ever_Married'] = replacement

df.Ever_Married.count()

The next feature to deal with is 'Age'. First I'll check some of the feature's statistics, such as the range, mean and median.
Afterwards I fill the missing values with whichever method gives the better result.

In [None]:
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Age"]))

ageMin = df.Age.min()
ageMax = df.Age.max()
ageMean = df.Age.mean()
ageMedian = df.Age.median()

print("\nAge aggregations:\nMean = {}\nMedian = {}".format(ageMean, ageMedian))
print("Min age is: {}  --- Max age is: {}\n".format(ageMin, ageMax))

df["Mean_Age"] = df.Age.fillna(ageMean)
df["Median_Age"] = df.Age.fillna(ageMedian)


print("Corr with mean: ", df["Group_Transformed"].corr(df["Mean_Age"]))
print("Corr with median: ", df["Group_Transformed"].corr(df["Median_Age"]))

As we can see, the two values are nearly the same, but I still prefer the median because its correlation is slightly higher.
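
The compare-then-fill pattern used for Age can be wrapped in a small helper. The function name `compare_fills` and the toy data are hypothetical, and picking the winner by absolute correlation is an assumption of this sketch:

```python
import numpy as np
import pandas as pd

def compare_fills(feature: pd.Series, target: pd.Series) -> dict:
    """Return the Pearson correlation with `target` for each candidate fill value."""
    candidates = {"mean": feature.mean(), "median": feature.median()}
    return {name: target.corr(feature.fillna(value)) for name, value in candidates.items()}

# Toy data for illustration; `group_code` plays the role of Group_Transformed.
age = pd.Series([22.0, 35.0, np.nan, 61.0, 48.0, np.nan, 30.0])
group_code = pd.Series([1, 2, 2, 4, 3, 1, 2])

scores = compare_fills(age, group_code)
best = max(scores, key=lambda name: abs(scores[name]))
print(scores, "->", best)
```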

In [None]:
df.Age = df.Median_Age
df = df.drop(['Group_Transformed', "Mean_Age", "Median_Age"], axis=1)

df.Age.count()

The next feature is Graduated, which is missing a few values. Like the other binary features above, I fill it according to the distribution of its values.

In [None]:
grad = 0
ungrad = 0
for index,row in df.iterrows():
    if(row['Graduated'] == 'Yes'):
        grad += 1
    elif(row['Graduated'] == 'No'):
        ungrad += 1

print("Graduated percentage", grad/(grad+ungrad))
print("Haven't graduated percentage", ungrad/(grad+ungrad))

nans = df['Graduated'].isna()
length = sum(nans)
# Weights come straight from the counts above, so they cannot get out of sync with the labels
replacement = random.choices(['Yes', 'No'], weights=[grad, ungrad], k=length)
df.loc[nans,'Graduated'] = replacement

df.Graduated.count()

The next feature is Profession. Unlike the features dealt with so far, it is a nominal categorical feature with many possible values rather than a binary one. To fill this column without losing the data in the rest of those rows, I fill the missing values with the mode.

In [None]:
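# mode() returns a Series (ties are possible), so take its first entry as the fill value
proMode = df.Profession.mode()[0]
# Fill the missing professions from the Profession column itself
df["Profession"] = df.Profession.fillna(proMode)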
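
A small sketch of mode filling on made-up values shows why the scalar matters: `fillna` aligns a Series argument on the index, so passing the Series returned by `mode()` only fills rows whose index label happens to appear in it.

```python
import pandas as pd

# Toy column for illustration only.
profession = pd.Series(["Artist", "Doctor", None, "Artist", None])

# Passing the whole mode() Series: it has index label 0 only, so rows 2 and 4 stay missing.
still_missing = int(profession.fillna(profession.mode()).isna().sum())

# Passing the scalar: every missing row receives the most common value.
most_common = profession.mode()[0]
profession_filled = profession.fillna(most_common)

print(still_missing)
print(profession_filled.tolist())
```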

The next feature is Work_Experience. Because there is also an Age column, it may be possible to fill the missing values by deductive imputation, so I'll first check the correlation between the two and then decide how to fill this feature.

In [None]:
df.Age.corr(df.Work_Experience)

Because the correlation is low, the Age feature is not a good enough basis for filling those values. Next I'll compare the mean and the median as fill values.
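
When reading this correlation it may help to remember that `Series.corr` only uses rows where both features are present, so rows with a missing Work_Experience do not contribute. The sketch below checks this on made-up values:

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
age = pd.Series([25.0, 40.0, 33.0, 58.0, 47.0])
work = pd.Series([1.0, np.nan, 4.0, 0.0, 9.0])

# Series.corr excludes any row where either value is missing.
r_all = age.corr(work)

# Same computation after dropping incomplete rows by hand.
complete = pd.concat([age, work], axis=1, keys=["age", "work"]).dropna()
r_complete = complete["age"].corr(complete["work"])

print(round(r_all, 4), round(r_complete, 4))
```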

In [None]:
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Work_Experience"]))

workMin = df.Work_Experience.min()
workMax = df.Work_Experience.max()
workMean = df.Work_Experience.mean()
workMedian = df.Work_Experience.median()

print("\nWork experience aggregations:\nMean = {}\nMedian = {}".format(workMean, workMedian))
print("Min work exp is: {}  --- Max work exp is: {}\n".format(workMin, workMax))

df["Mean_work"] = df.Work_Experience.fillna(workMean)
df["Median_work"] = df.Work_Experience.fillna(workMedian)


print("Corr with mean: ", df["Group_Transformed"].corr(df["Mean_work"]))
print("Corr with median: ", df["Group_Transformed"].corr(df["Median_work"]))

df["Work_Experience"] = df.Work_Experience.fillna(workMean)
df = df.drop(['Group_Transformed', "Mean_work", "Median_work"], axis=1)

print(df.Work_Experience.count())

As we can see above, filling the missing values with the mean gives the better correlation here, so that is the method used.

The next feature is "Family_Size". For this feature I will test the mode, median and mean.
For the mean option I round the mean so that the fill is a valid family size value.
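
The rounding step could look like the sketch below, on made-up household sizes. Note that Python's built-in `round` sends exact halves to the nearest even integer (for example, `round(2.5)` is 2), which only matters if the mean lands exactly on a half:

```python
import numpy as np
import pandas as pd

# Toy household sizes for illustration only.
family_size = pd.Series([1, 2, 2, 4, np.nan, 3, np.nan])

raw_mean = float(family_size.mean())   # mean of the five known values
fill_value = round(raw_mean)           # nearest integer, so the fill is a valid size
family_filled = family_size.fillna(fill_value)

print(raw_mean, fill_value, family_filled.tolist())
```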

In [None]:
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Family_Size"]))

familyMean = round(df.Family_Size.mean())
familyMedian = df.Family_Size.median()
familyMode = df.Family_Size.mode()[0]

print("\nFamily size aggregations:\nMean = {}\nMedian = {}\nMode = {}\n".format(familyMean, familyMedian, familyMode))

df["Mean_family"] = df.Family_Size.fillna(familyMean)
df["Median_family"] = df.Family_Size.fillna(familyMedian)
df["Mode_family"] = df.Family_Size.fillna(familyMode)


print("Corr with mean: ", df["Group_Transformed"].corr(df["Mean_family"]))
print("Corr with median: ", df["Group_Transformed"].corr(df["Median_family"]))
print("Corr with mode: ", df["Group_Transformed"].corr(df["Mode_family"]))

df["Family_Size"] = df.Family_Size.fillna(familyMean)
df = df.drop(['Group_Transformed', "Mean_family", "Median_family", "Mode_family"], axis=1)

print(df.Family_Size.count())

Both mean and median gave the same result and a better correlation than the mode, which is why I chose their value to fill the "Family_Size" feature.

For the next feature, "Shop_Day", before filling the missing values I need to deal with the invalid values the feature contains:
it must contain values from 1 to 7, but there are values outside this range.

In [None]:
print(df.Shop_Day.unique())

The values that need to be dealt with first are 0 and 22.

In [None]:
# Replace the out-of-range values (0 and 22) with NaN directly on the DataFrame,
# which avoids chained assignment on a view of the column
df.loc[df.Shop_Day.isin([0, 22]), 'Shop_Day'] = np.nan

print(df.Shop_Day.unique())

Now that the invalid values are gone, I can fill the missing values after comparing the mean, median and mode.

In [None]:
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Shop_Day"]))

dayMean = round(df.Shop_Day.mean())
dayMedian = df.Shop_Day.median()
dayMode = df.Shop_Day.mode()[0]

print("\nShop day aggregations:\nMean = {}\nMedian = {}\nMode = {}\n".format(dayMean, dayMedian, dayMode))

df["Mean_day"] = df.Shop_Day.fillna(dayMean)
df["Median_day"] = df.Shop_Day.fillna(dayMedian)
df["Mode_day"] = df.Shop_Day.fillna(dayMode)


print("Corr with mean: ", df["Group_Transformed"].corr(df["Mean_day"]))
print("Corr with median: ", df["Group_Transformed"].corr(df["Median_day"]))
print("Corr with mode: ", df["Group_Transformed"].corr(df["Mode_day"]))

df["Shop_Day"] = df.Shop_Day.fillna(dayMean)
df = df.drop(['Group_Transformed', "Mean_day", "Median_day", "Mode_day"], axis=1)

print(df.Shop_Day.count())

As we can see above, the mean gives a better correlation than the median and the mode, which share the same value. Therefore I chose to use it.

Moving on to the next feature, "Shop_Dairy", I will compare the mean, median and mode here as well.

In [None]:
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Shop_Dairy"]))

dairyMean = df.Shop_Dairy.mean()
dairyMedian = df.Shop_Dairy.median()
dairyMode = df.Shop_Dairy.mode()[0]

print("\nDairy aggregations:\nMean = {}\nMedian = {}\nMode = {}\n".format(dairyMean, dairyMedian, dairyMode))

df["Mean_dairy"] = df.Shop_Dairy.fillna(dairyMean)
df["Median_dairy"] = df.Shop_Dairy.fillna(dairyMedian)
df["Mode_dairy"] = df.Shop_Dairy.fillna(dairyMode)


print("Corr with mean: ", df["Group_Transformed"].corr(df["Mean_dairy"]))
print("Corr with median: ", df["Group_Transformed"].corr(df["Median_dairy"]))
print("Corr with mode: ", df["Group_Transformed"].corr(df["Mode_dairy"]))

df["Shop_Dairy"] = df.Shop_Dairy.fillna(dairyMean)
df = df.drop(['Group_Transformed', "Mean_dairy", "Median_dairy", "Mode_dairy"], axis=1)

print(df.Shop_Dairy.count())

The mean and median are close in their correlation with the group, but the mean is still higher, so I use it here as well.

For the next two features, "Shop_Household" and "Shop_Meat", I will use the same method as for Shop_Dairy.

In [None]:
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Shop_Household"]))

HouseholdMean = df.Shop_Household.mean()
HouseholdMedian = df.Shop_Household.median()
HouseholdMode = df.Shop_Household.mode()[0]

print("\nHousehold aggregations:\nMean = {}\nMedian = {}\nMode = {}\n".format(HouseholdMean, HouseholdMedian, HouseholdMode))

df["Mean_Household"] = df.Shop_Household.fillna(HouseholdMean)
df["Median_Household"] = df.Shop_Household.fillna(HouseholdMedian)
df["Mode_Household"] = df.Shop_Household.fillna(HouseholdMode)


print("Corr with mean: ", df["Group_Transformed"].corr(df["Mean_Household"]))
print("Corr with median: ", df["Group_Transformed"].corr(df["Median_Household"]))
print("Corr with mode: ", df["Group_Transformed"].corr(df["Mode_Household"]))

df["Shop_Household"] = df.Shop_Household.fillna(HouseholdMean)
df = df.drop(['Group_Transformed', "Mean_Household", "Median_Household", "Mode_Household"], axis=1)

print(df.Shop_Household.count())

In [None]:
group_list = ["A", "B", "C", "D"]
df["Group_Transformed"] = pd.Categorical(df.Group, ordered=True, categories=group_list).codes + 1

print(df["Group_Transformed"].corr(df["Shop_Meat"]))

# Shop_Meat is a continuous normalized measure, so the mean is not rounded (same as Dairy and Household)
MeatMean = df.Shop_Meat.mean()
MeatMedian = df.Shop_Meat.median()
MeatMode = df.Shop_Meat.mode()[0]

print("\nMeat aggregations:\nMean = {}\nMedian = {}\nMode = {}\n".format(MeatMean, MeatMedian, MeatMode))

df["Mean_Meat"] = df.Shop_Meat.fillna(MeatMean)
df["Median_Meat"] = df.Shop_Meat.fillna(MeatMedian)
df["Mode_Meat"] = df.Shop_Meat.fillna(MeatMode)


print("Corr with mean: ", df["Group_Transformed"].corr(df["Mean_Meat"]))
print("Corr with median: ", df["Group_Transformed"].corr(df["Median_Meat"]))
print("Corr with mode: ", df["Group_Transformed"].corr(df["Mode_Meat"]))

df["Shop_Meat"] = df.Shop_Meat.fillna(MeatMean)
df = df.drop(['Group_Transformed', "Mean_Meat", "Median_Meat", "Mode_Meat"], axis=1)

print(df.Shop_Meat.count())

For both "Shop_Household" and "Shop_Meat", filling with the mean of the feature gives the best result.