# Customer analysis

The aim of this project is to apply classification and clustering to a dataset, analyze the results, and make predictions that are relevant to the predefined goals.

For this project I chose a dataset that collects the data needed to analyze a customer's behavior when purchasing from a company, that is, a *Customer Personality Analysis*.

The dataset is available at:
https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis

## Goals
The goals of this project are to:
1. Predict whether offers are an effective way to get a client to buy in the store. We do this with binary classification on the attribute "Response", which tells us whether a customer accepted the offer (1) or refused it (0), so the classifier maps each customer's feature values to 1 or 0. This is done in this notebook.
2. Segment customers based on characteristics such as age, income, and family situation. By applying clustering techniques, we identify distinct customer groups that may have different needs and behaviors. We first segment customers using all the characteristics given by the dataset, and afterwards we consider only their spending habits and Income. See the notebook "Clustering".
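
As a minimal sketch of the first goal, the snippet below separates "Response" from the remaining columns, makes a stratified train/test split, and fits a classifier. The ten rows are invented placeholders rather than values from the dataset, and the choice of scaling followed by logistic regression is an assumption for illustration, not necessarily the model selected later in this notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real data: these ten rows are invented for illustration.
toy = pd.DataFrame({
    "Income":   [20000, 35000, 52000, 61000, 75000, 83000, 28000, 46000, 90000, 67000],
    "Recency":  [80, 65, 40, 30, 12, 5, 90, 55, 8, 20],
    "Response": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
})

# "Response" is the binary target; every other column is a feature.
X = toy.drop(columns="Response")
y = toy["Response"]

# Stratify so that both classes appear in the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# Assumed model: standardize the features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# One predicted label (0 or 1) per customer in the test split.
predictions = model.predict(X_test)
```

With the real dataset, `toy` would be replaced by the loaded DataFrame after preprocessing; the split and fit steps stay the same.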

## Attributes
In this dataset the attributes are divided into four categories:

**People**
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise

**Products**
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years

**Promotions**
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

**Place of purchase**
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

The attribute descriptions were copied from the dataset's presentation, which can be found at the link given above.
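
The grouping above can also be kept as a small lookup table, which makes it easy to check a loaded table against the documented attributes. This is only a sketch: the names `ATTRIBUTE_GROUPS` and `check_columns` are hypothetical helpers introduced here, not part of the dataset or of any library.

```python
# Documented attributes, grouped exactly as in the list above.
ATTRIBUTE_GROUPS = {
    "People": [
        "ID", "Year_Birth", "Education", "Marital_Status", "Income",
        "Kidhome", "Teenhome", "Dt_Customer", "Recency", "Complain",
    ],
    "Products": [
        "MntWines", "MntFruits", "MntMeatProducts",
        "MntFishProducts", "MntSweetProducts", "MntGoldProds",
    ],
    "Promotions": [
        "NumDealsPurchases", "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
        "AcceptedCmp4", "AcceptedCmp5", "Response",
    ],
    "Place of purchase": [
        "NumWebPurchases", "NumCatalogPurchases",
        "NumStorePurchases", "NumWebVisitsMonth",
    ],
}


def check_columns(columns):
    """Compare column names against the documented attributes.

    Returns a pair (missing, undocumented): documented names that are absent
    from `columns`, and names in `columns` that the list above does not cover.
    """
    documented = [name for names in ATTRIBUTE_GROUPS.values() for name in names]
    present = list(columns)
    missing = [name for name in documented if name not in present]
    undocumented = [name for name in present if name not in documented]
    return missing, undocumented
```

For a pandas DataFrame `df` holding the dataset, `check_columns(df.columns)` would list any documented attribute that is absent and any extra column that the description does not mention.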

## Libraries needed for the project
The following libraries are needed to develop this project (some were added during development but ended up unused):

In [1]:
# Basic necessary libraries
import warnings
warnings.filterwarnings('ignore')
import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from collections import Counter
from mlxtend.plotting import plot_decision_regions
import mglearn

# For the analysis of the dataset
import pandas as pd
import missingno as msno

# For preprocessing
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler, OrdinalEncoder, StandardScaler, KBinsDiscretizer, add_dummy_feature, LabelEncoder, Binarizer
from sklearn.preprocessing import Normalizer, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# For splitting in train and test sets
from sklearn.model_selection import train_test_split

# ------------------------------------------------------------------
# The imports below deliberately cover a broad set of estimators and tools to make model selection easier

# Classifiers - Supervised learning
from sklearn.linear_model import Perceptron, LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor # Decision trees; probably only the classifier will be used
from sklearn.svm import LinearSVC, SVC # Support Vector Machines

# Dimensionality reduction
from sklearn.decomposition import PCA # Feature extraction (unsupervised)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA # Feature extraction (supervised)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS # Feature selection

# To deal with imbalanced classes
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Model selection
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_validate, RepeatedStratifiedKFold, HalvingGridSearchCV, HalvingRandomSearchCV
from sklearn.model_selection import cross_val_predict, RepeatedKFold, ShuffleSplit, StratifiedShuffleSplit, learning_curve, validation_curve, cross_val_score
from random import choice
import itertools
from imblearn.pipeline import Pipeline as IMBPipeline

# Ensemble learning
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier
from xgboost import XGBClassifier

# Model performance evaluation
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix, matthews_corrcoef, roc_curve, get_scorer_names
from sklearn.metrics import precision_score, accuracy_score, recall_score,  precision_recall_curve
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

# For the refinement of the model selection
from scipy.stats import loguniform, beta, uniform


ModuleNotFoundError: No module named 'mlxtend'

In [3]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install matplotlib

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install seaborn

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.
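
The `ModuleNotFoundError` above shows that at least `mlxtend` was not installed when the import cell ran, and the install cells cover only some of the imported libraries. A minimal sketch for listing everything that is still missing follows; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here, and the mapping assumes the usual PyPI package name for each imported module.

```python
from importlib.util import find_spec

# Import name -> package name to pass to pip (assumed PyPI names).
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scipy": "scipy",
    "sklearn": "scikit-learn",
    "imblearn": "imbalanced-learn",
    "mlxtend": "mlxtend",
    "mglearn": "mglearn",
    "missingno": "missingno",
    "xgboost": "xgboost",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of all modules that cannot currently be imported."""
    return sorted(
        package
        for module, package in required.items()
        if find_spec(module) is None
    )


missing = missing_packages()
if missing:
    # Print a single install command instead of one cell per package.
    print("Install with: %pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

Running this before the import cell gives one install command for every absent package, after which the kernel may need a restart, as the notes above indicate.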
