# Customer analysis

The aim of this project is to apply classification and clustering to a dataset, analyze the results, and make predictions that are relevant to the predefined goals.

For this project I chose a dataset that collects the data needed to analyse customer behavior when making purchases from a company, that is, a *Customer Personality Analysis*.

The dataset is available at:
https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis

## Goals
The goals of this project are to:
1. Predict whether offers are an effective way to get a client to buy from the store. We do this through binary classification of the attribute "Response", which tells us whether a customer accepted the offer (1) or refused it (0), so each customer is classified as 1 or 0. This is done in this notebook.
2. Segment customers based on characteristics such as age, income and family situation. By applying clustering techniques we identify distinct customer groups that may have different needs and behaviors. We first segment customers using all the characteristics given by the dataset, and afterwards we consider only their spending habits and Income. See the notebook "Clustering".

## Attributes
In this dataset the attributes are divided into four categories:

**People**
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise

**Products**
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years

**Promotions**
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

**Place of purchase**
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

The attribute descriptions were copied from the dataset's presentation page, which can be found at the link given above.

## Libraries needed for the project
The following libraries are needed for this project (some of them were added during development and ended up unused):

In [1]:
# Basic necessary libraries
import warnings
warnings.filterwarnings('ignore')
import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from collections import Counter
from mlxtend.plotting import plot_decision_regions
import mglearn

In [2]:
# For the analysis of the dataset
import pandas as pd
import missingno as msno

# For preprocessing
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler, OrdinalEncoder, StandardScaler, KBinsDiscretizer, add_dummy_feature
from sklearn.preprocessing import LabelEncoder, Binarizer, Normalizer, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# For splitting in train and test sets
from sklearn.model_selection import train_test_split

#----------------------------------------------------------------------------------------------------------------------------------------------
# We import a broad range of estimators to facilitate model selection

# Classifiers - Supervised learning
from sklearn.linear_model import Perceptron, LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor # Decision trees, even if I will probably only use the classifier
from sklearn.svm import LinearSVC, SVC # Support Vector Machine

# Dimensionality reduction
from sklearn.decomposition import PCA # Feature extraction (unsupervised)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA # Feature extraction (supervised)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS # Feature selection

# To deal with imbalanced classes
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Model selection
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_validate, RepeatedStratifiedKFold, HalvingGridSearchCV, HalvingRandomSearchCV
from sklearn.model_selection import cross_val_predict, RepeatedKFold, ShuffleSplit, StratifiedShuffleSplit, learning_curve, validation_curve, cross_val_score
from random import choice
import itertools
from imblearn.pipeline import Pipeline as IMBPipeline

# Ensemble learning
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier
from xgboost import XGBClassifier

# Model performance evaluation
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix, matthews_corrcoef, roc_curve, get_scorer_names
from sklearn.metrics import precision_score, accuracy_score, recall_score,  precision_recall_curve
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

# For the refinement of the model selection
from scipy.stats import loguniform, beta, uniform


# Dataset analysis

To import the dataset we use the Pandas library, which lets us inspect the dataset and operate on it.

In [3]:
dataset_marketing = pd.read_csv('marketing_campaign.csv', sep="\t")
dataset_marketing

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223.0,0,1,13-06-2013,46,709,...,5,0,0,0,0,0,0,3,11,0
2236,4001,1946,PhD,Together,64014.0,2,1,10-06-2014,56,406,...,7,0,0,0,1,0,0,3,11,0
2237,7270,1981,Graduation,Divorced,56981.0,0,0,25-01-2014,91,908,...,6,0,1,0,0,0,0,3,11,0
2238,8235,1956,Master,Together,69245.0,0,1,24-01-2014,8,428,...,3,0,0,0,0,0,0,3,11,0


Now we check the information about the dataset...

In [4]:
dataset_marketing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

... and the specific number of null values that it has.

In [5]:
dataset_marketing.isnull().sum(axis=0)

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

Checking the information about the dataset, we find that it is almost perfectly clean: only the Income column has missing values (24 out of 2240 rows). For the purposes of the project we will therefore introduce some noise, so that we can then apply different data-preprocessing methods.
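
Raw null counts are easier to judge when expressed as a share of the rows. The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the real dataset); the same expression could be applied to `dataset_marketing`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: one column with gaps, one without
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0, np.nan],
    "Recency": [58, 38, 26, 26, 94],
})

# isnull().mean() is the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage
missing_share = toy.isnull().mean().mul(100).round(2)
print(missing_share)
```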

Now, we corrupt the dataset in the notebook "Noise in the dataset"

## Adding noise to the dataset

In this part of the project we corrupt the dataset by adding missing values, so that we can then demonstrate how to deal with missing data.

After importing the necessary libraries and the dataset, we decide how we want to corrupt it.

For the project we have decided to add missing values to the columns:
- Recency
- Kidhome
- AcceptedCmp4

In [6]:
# Replace randomly chosen entries of a column with NaN.
# amount < 1 is read as a fraction of the column, amount >= 1 as an absolute count.
def add_missing_columns(col, amount):
    X = col.copy()
    size = amount if amount >= 1 else int(len(X) * amount)
    indexes = np.random.choice(len(X), size, replace=False)  # distinct positions
    X.iloc[indexes] = np.nan  # positional assignment, independent of the index labels
    return X

dataset_marketing["Recency"] = add_missing_columns(dataset_marketing["Recency"], 0.1)  # 10% of missing values
dataset_marketing["Kidhome"] = add_missing_columns(dataset_marketing["Kidhome"], 0.15)  # 15% of missing values
dataset_marketing["AcceptedCmp4"] = add_missing_columns(dataset_marketing["AcceptedCmp4"], 30)  # 30 missing values in the column
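
The corruption above draws from NumPy's global random state, so every run blanks different entries. As a hedged illustration, the sketch below shows a seeded variant on a toy Series; the function name `add_missing_seeded` and the `seed` parameter are hypothetical and not part of the project, and the cast to float restricts it to numeric columns.

```python
import numpy as np
import pandas as pd

def add_missing_seeded(col, amount, seed=0):
    """Hypothetical seeded variant: set `amount` entries of a numeric column to NaN."""
    rng = np.random.default_rng(seed)
    out = col.astype("float64")  # float can hold NaN (numeric columns only)
    # amount < 1 is a fraction of the column, amount >= 1 an absolute count
    size = amount if amount >= 1 else int(len(out) * amount)
    positions = rng.choice(len(out), size=size, replace=False)
    out.iloc[positions] = np.nan
    return out

toy = pd.Series(range(20))
by_fraction = add_missing_seeded(toy, 0.25)  # 25% of 20 entries
by_count = add_missing_seeded(toy, 3)        # exactly 3 entries
print(by_fraction.isna().sum(), by_count.isna().sum())
```

With a fixed seed, calling the function twice with the same arguments blanks the same positions.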

Moreover, I will insert completely empty rows, amounting to 5% of the rows that we have.

In [7]:
# Insert rows made entirely of NaN at random positions.
# amount < 1 is read as a fraction of the existing rows, amount >= 1 as an absolute count.
def add_missing_rows(df, amount):
    X = df.copy()
    rows, cols = X.shape
    size = amount if amount >= 1 else int(rows * amount)
    # Half-integer labels place each new row between two existing rows
    indexes = np.random.choice(rows, size, replace=False) + 0.5
    for i in indexes:
        X.loc[i] = np.full((cols,), np.nan)
    X = X.sort_index().reset_index(drop=True)
    return X

dataset_marketing = add_missing_rows(dataset_marketing, 0.05)  # insert empty rows equal to 5% of the existing rows
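
The half-integer trick used above can be seen more easily on a tiny example. This sketch uses a made-up four-row frame and builds the empty rows with `pd.concat` instead of a loop; the seed and the number of blank rows are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": ["w", "x", "y", "z"]})

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n_blank = 2

# Labels such as 0.5 or 2.5 sort between the existing integer labels
labels = rng.choice(len(toy), size=n_blank, replace=False) + 0.5
blank = pd.DataFrame(np.nan, index=labels, columns=toy.columns)

# Append the empty rows, sort them into place, then renumber the index
corrupted = pd.concat([toy, blank]).sort_index().reset_index(drop=True)
print(corrupted)
```

The original rows keep their relative order, and each empty row ends up directly after a randomly chosen existing row.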

Finally, we add enough missing values to a column to show that we are able to drop it, which is necessary when a column has too many missing values to be dealt with. The same can be done with rows.

We will be doing this to the columns "Year_Birth" and "Dt_Customer", since they are less relevant for the purpose of the project.

In [8]:
dataset_marketing["Year_Birth"] = add_missing_columns(dataset_marketing["Year_Birth"], 0.45) # I want to add 45% of missing values
dataset_marketing["Dt_Customer"] = add_missing_columns(dataset_marketing["Dt_Customer"], 0.50) # I want to add 50% of missing values
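
Dropping columns with too many gaps can be sketched as follows. The 40% cut-off (`max_missing`) and the toy values are assumptions made for illustration; the project may use a different threshold.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Year_Birth": [1957, np.nan, np.nan, 1984, np.nan, np.nan],
    "Income": [58138.0, 46344.0, np.nan, 26646.0, 58293.0, 61223.0],
    "Response": [1, 0, 0, 0, 0, 1],
})

max_missing = 0.40  # hypothetical cut-off: drop columns with more than 40% NaN

# Fraction of missing entries per column, then the names above the cut-off
missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > max_missing].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```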

Now we check the info of the dataset and see for ourselves that it is no longer clean, since it has a considerable amount of missing data.

In [9]:
dataset_marketing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   float64
 1   Year_Birth           1227 non-null   float64
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              1904 non-null   float64
 6   Teenhome             2240 non-null   float64
 7   Dt_Customer          1119 non-null   object 
 8   Recency              2016 non-null   float64
 9   MntWines             2240 non-null   float64
 10  MntFruits            2240 non-null   float64
 11  MntMeatProducts      2240 non-null   float64
 12  MntFishProducts      2240 non-null   float64
 13  MntSweetProducts     2240 non-null   float64
 14  MntGoldProds         2240 non-null   float64
 15  NumDealsPurchases    2240 non-null   f

In [10]:
dataset_marketing.isnull().sum(axis=0)

ID                      112
Year_Birth             1125
Education               112
Marital_Status          112
Income                  136
Kidhome                 448
Teenhome                112
Dt_Customer            1233
Recency                 336
MntWines                112
MntFruits               112
MntMeatProducts         112
MntFishProducts         112
MntSweetProducts        112
MntGoldProds            112
NumDealsPurchases       112
NumWebPurchases         112
NumCatalogPurchases     112
NumStorePurchases       112
NumWebVisitsMonth       112
AcceptedCmp3            112
AcceptedCmp4            142
AcceptedCmp5            112
AcceptedCmp1            112
AcceptedCmp2            112
Complain                112
Z_CostContact           112
Z_Revenue               112
Response                112
dtype: int64

Now we save the corrupted dataset as a csv file

In [11]:
dataset_marketing.to_csv('marketing_corrupted.csv')

Note: the dataset is saved so that we don't get a different dataset each time we run the project, since the values are removed at random.
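
Because `to_csv` writes the row index by default, reloading the file yields an extra "Unnamed: 0" column. The sketch below shows, on a made-up frame and an in-memory buffer instead of a file on disk, how passing `index=False` avoids it; this is an optional alternative, not what the project does.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ID": [5524.0, np.nan, 4141.0],
    "Education": ["Graduation", "PhD", np.nan],
})

buffer = io.StringIO()           # in-memory stand-in for a csv file
toy.to_csv(buffer, index=False)  # do not write the row index as a column
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```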

## Importing the new corrupted dataset

The new corrupted dataset is:

In [12]:
dataset_marketing= pd.read_csv('marketing_corrupted.csv', sep=",")
dataset_marketing

Unnamed: 0.1,Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,0,5524.0,1957.0,Graduation,Single,58138.0,0.0,0.0,04-09-2012,58.0,...,7.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,1.0
1,1,2174.0,1954.0,Graduation,Single,46344.0,1.0,1.0,,38.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
2,2,4141.0,,Graduation,Together,71613.0,,0.0,21-08-2013,26.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
3,3,6182.0,1984.0,Graduation,Together,26646.0,1.0,0.0,10-02-2014,26.0,...,6.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
4,4,5324.0,1981.0,PhD,Married,58293.0,1.0,0.0,19-01-2014,94.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2347,2347,10870.0,1967.0,Graduation,Married,61223.0,0.0,1.0,13-06-2013,46.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
2348,2348,4001.0,1946.0,PhD,Together,64014.0,2.0,1.0,,,...,7.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,11.0,0.0
2349,2349,7270.0,,Graduation,Divorced,56981.0,0.0,0.0,,91.0,...,6.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
2350,2350,8235.0,1956.0,Master,Together,69245.0,0.0,1.0,24-01-2014,8.0,...,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0


Now we have missing values in every column of the dataset, as shown in the notebook where we corrupted it.
We need to handle these missing values before proceeding with further analysis and modeling of the data.
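
One possible way to handle such gaps is sketched below on a made-up frame: drop the rows that are entirely empty, then fill the remaining gaps with the median for a continuous column and the most frequent value for the others. This uses plain pandas as a simple stand-in; the choice of strategy per column is an assumption, and an imputer such as scikit-learn's `SimpleImputer` could be used instead.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, np.nan, 26646.0],
    "Kidhome": [0.0, np.nan, np.nan, 1.0, 1.0],
    "Education": ["Graduation", np.nan, "PhD", "Graduation", np.nan],
})

# Rows where every value is missing carry no information, so drop them
cleaned = toy.dropna(how="all").reset_index(drop=True)

# Continuous column: fill with the median of the observed values
cleaned["Income"] = cleaned["Income"].fillna(cleaned["Income"].median())

# Discrete or categorical columns: fill with the most frequent value
for col in ["Kidhome", "Education"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(cleaned)
```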