# Customer Segmentation

#### By: Brian Rafferty

## Introduction

#### In this project, I perform unsupervised learning on a supermarket's customer dataset to group individuals who share characteristics. Unsupervised grouping of a dataset is called clustering, and it helps businesses learn more about the nuances of their data. Here, clustering gives the supermarket's stakeholders an opportunity to increase revenue in future quarters by identifying which groups of customers are likely to respond positively to marketing campaigns.

## Table of Contents

### 1. Import Libraries
### 2. Load Data
### 3. Exploratory Data Analysis
### 4. Data Cleaning
### 5. Principal Component Analysis
### 6. Clustering
### 7. Profiling
### 8. Conclusion

## Import Libraries

In [1]:
import numpy as np
np.random.seed(4)
import pandas as pd
pd.set_option('display.max_columns', None)
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from matplotlib.colors import ListedColormap
from sklearn import metrics
import seaborn as sns

## Load Data

In [2]:
df = pd.read_csv('../input/customer-personality-analysis/marketing_campaign.csv', sep='\t')

## Exploratory Data Analysis

#### Here I will go through the dataset to learn more about it. What I learn here will influence the steps I must take in the data cleaning process. Things that are important to understand before moving forward are: 
#### 1. Data Shape
#### 2. Data Types
#### 3. Distribution
#### 4. Missing Data

In [3]:
print("Data Shape\n-----------------\n# of Rows: {}\n# of Columns: {}".format(df.shape[0], df.shape[1]))

In [4]:
# Use .info() to see the datatype for each column
df.info()

In [5]:
df.describe()

In [6]:
df.isna().sum()

##### Notes: The dataset contains 2240 rows and 29 columns. Of the 29 columns, only Education and Marital_Status require encoding during data cleaning. Many columns have skewed distributions that I will need to address during data cleaning: Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, and NumWebVisitsMonth. Z_CostContact and Z_Revenue each hold a single constant value in every row, so they carry no information and will be dropped. Lastly, Income is missing in 24 rows, so I will drop those rows as well.
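
#### The checks described in the notes above can be sketched as follows. This is a minimal illustration on a toy DataFrame with made-up values (the names `toy`, `constant_cols`, and `skewness` are hypothetical), not the output of the real dataset.

```python
import pandas as pd

# Toy stand-in for the supermarket data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Income": [30000, 32000, 35000, 31000, 600000],  # one extreme value gives a long right tail
    "Kidhome": [0, 1, 0, 1, 2],
    "Z_CostContact": [3, 3, 3, 3, 3],                # same value in every row
})

# Columns with a single unique value carry no information for clustering
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]

# Sample skewness per column; large positive values indicate a long right tail
skewness = toy.skew()

print(constant_cols)
print(skewness.round(2))
```

#### On the real data, the same two calls (`nunique` and `skew`) could be run on `df` directly to confirm which columns are constant or heavily skewed.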

## Data Cleaning

#### To properly clean the dataset, my workflow will entail:
#### 1. Remove rows with missing values in the Income column
#### 2. Drop columns that carry no useful information
#### 3. Remove extreme outliers from Income and Year_Birth
#### 4. Engineer new features and encode string columns as integers
#### 5. Scale the remaining columns

In [7]:
# drop rows with missing values for Income
df = df.dropna(axis=0)
df.shape

In [8]:
# drop columns with useless data
df.drop(['Dt_Customer', 'Z_CostContact', 'Z_Revenue'], axis=1, inplace=True)
df.shape

In [9]:
# remove extreme outliers from Income and Year_Birth
# (alternative kept for reference: a z-score filter)
#df = df[(np.abs(stats.zscore(df[['Year_Birth', 'Income']])) < 3).all(axis=1)]
df = df[(df['Income'] < 200000) & (df['Year_Birth'] > 1920)]
print(df.shape)
df.head()

In [10]:
# do some feature engineering

# find total amount spent at stores
col_list = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df['TotalSpent'] = df[col_list].sum(axis = 1)

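# encode marital status as the number of adults in the household (1 = single, 2 = couple)
df.loc[df['Marital_Status'].isin(['Alone', 'Absurd', 'YOLO', 'Divorced', 'Widow', 'Single']), 'Marital_Status'] = 1
df.loc[df['Marital_Status'].isin(['Together', 'Married']), 'Marital_Status'] = 2
# the column still has object dtype after the assignments above, so cast it to int
df['Marital_Status'] = df['Marital_Status'].astype(int)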

# encode string columns to ints
df['Education'] = df['Education'].astype('category').cat.codes
df.head()
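
#### One detail of the encoding step is worth illustrating: `astype('category').cat.codes` numbers the categories in sorted (alphabetical) order, which need not match their natural ranking. The sketch below uses hypothetical education labels, and the `pd.CategoricalDtype` variant is shown only as one possible alternative, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical education labels, for illustration only
education = pd.Series(["Master", "High School", "Bachelor", "High School"])

# Default behaviour: categories are sorted alphabetically and codes follow that order
alphabetical_codes = education.astype("category").cat.codes.tolist()

# Alternative: state the ranking explicitly so the codes reflect education level
level_order = pd.CategoricalDtype(["High School", "Bachelor", "Master"], ordered=True)
ordered_codes = education.astype(level_order).cat.codes.tolist()

print(alphabetical_codes)
print(ordered_codes)
```

#### Whether the ordering matters depends on how the encoded column is used later; for distance-based methods, codes that follow the real ranking are usually easier to interpret.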

In [11]:
# make copy of dataframe keeping metrics: 'AcceptedCmp3','AcceptedCmp4','AcceptedCmp5','AcceptedCmp1','AcceptedCmp2','Response','Complain'
orig_df = df.copy()
df.drop(['AcceptedCmp3','AcceptedCmp4','AcceptedCmp5','AcceptedCmp1',
         'AcceptedCmp2','Response','Complain','ID'], axis=1, inplace=True)
df.head()

In [12]:
# scale all remaining columns
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df.head()
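
#### As a quick sanity check on the scaling step, the sketch below applies `StandardScaler` to a toy DataFrame (made-up values, hypothetical variable names) and confirms that each column ends up with mean 0 and unit standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data on very different scales (hypothetical values)
toy = pd.DataFrame({
    "Income": [20000.0, 40000.0, 60000.0, 80000.0],
    "TotalSpent": [50.0, 400.0, 900.0, 1600.0],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# StandardScaler uses the population standard deviation, hence ddof=0
col_means = scaled.mean()
col_stds = scaled.std(ddof=0)

print(col_means.round(6))
print(col_stds.round(6))
```

#### Putting every column on the same scale keeps large-valued columns such as Income from dominating the distance calculations used by PCA and clustering.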

## Principal Component Analysis

#### The dataset has too many columns for a clustering algorithm to work well on it directly, so I will reduce it to three dimensions with a dimensionality reduction technique called Principal Component Analysis (PCA).

In [13]:
pca = PCA(n_components=3)
pca.fit(df)
pca_df = pd.DataFrame(pca.transform(df), columns=(["col1","col2", "col3"]))
pca_df.head()
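
#### A useful follow-up to PCA is to check how much of the variance the retained components keep. The sketch below does this on synthetic data (random values generated for illustration, not the supermarket dataset) using the fitted model's `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 rows and 6 features driven by 2 hidden factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

pca_check = PCA(n_components=3).fit(X)

# Share of total variance captured by each component, in decreasing order
ratios = pca_check.explained_variance_ratio_
retained = ratios.sum()

print(ratios.round(3))
print(round(float(retained), 3))
```

#### On the real data, `pca.explained_variance_ratio_.sum()` would show how much information the three components preserve; a low value would suggest keeping more components.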

## Clustering

#### Using the reduced dimensions provided by PCA, I will now cluster the dataset and assign the results to the original dataset for future profiling. In order to cluster the data, I will first employ the Elbow Method to determine the optimal number of clusters for this dataset. Afterwards I will apply a hierarchical clustering method so that my results are reproducible. Lastly, I will visualize the clusters that I produced with a 3D scatter plot.

In [14]:
elbow = KElbowVisualizer(KMeans(), k=10)
elbow.fit(pca_df)
elbow.show()

#### The elbow plot suggests that the optimal number of clusters for the dataset is 4.
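
#### The elbow reading can be cross-checked with a second criterion. The sketch below is one possible approach, not part of the original analysis: it scores several values of k with the silhouette coefficient on synthetic blobs (generated with `make_blobs`, not the supermarket data) and picks the k with the highest score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (not the supermarket data)
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=4)

# Fit KMeans for each candidate k and record the mean silhouette coefficient
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(best_k)
```

#### Applied to `pca_df`, the same loop would give an independent check on the choice of 4 clusters.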

#### Now I cluster the dataset using Agglomerative Clustering (a type of hierarchical clustering) to generate a cluster prediction for each row in the PCA dataset. I will then connect those clusters directly to the original dataset so that I can conduct profiling in the future.

In [15]:
#Initiating the Agglomerative Clustering model 
ac = AgglomerativeClustering(n_clusters=4)
# fit model and predict clusters
predictions = ac.fit_predict(pca_df)
pca_df["Clusters"] = predictions
orig_df["Clusters"]= predictions
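
#### Attaching the labels to `orig_df` works because `fit_predict` returns a plain NumPy array, which pandas assigns by position. This matters here because `orig_df` keeps its original row labels after filtering, so its index has gaps. The sketch below uses a toy DataFrame with hypothetical values to show the difference between positional and label-based assignment.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after rows are filtered out
orig = pd.DataFrame({"Income": [10, 20, 30, 40]}, index=[0, 2, 5, 7])
labels = np.array([1, 0, 1, 0])  # hypothetical cluster labels

# A NumPy array is assigned by position, so every row receives a label
by_position = orig.assign(Clusters=labels)

# A Series is aligned on index labels, so rows 5 and 7 find no match and become NaN
by_label = orig.assign(Clusters=pd.Series(labels))

print(by_position["Clusters"].tolist())
print(int(by_label["Clusters"].isna().sum()))
```

#### If the labels were ever wrapped in a Series or DataFrame column first, calling `.to_numpy()` before assignment would keep the positional behaviour.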

#### With the clusters generated, now I will visualize the results in a 3D scatterplot.

In [16]:
#Plotting the clusters
x = pca_df["col1"]
y = pca_df["col2"]
z = pca_df["col3"]
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, s=40, c=pca_df["Clusters"], marker='o')
ax.set_title("The Plot Of The Clusters")
plt.show()

## Profiling

#### With each row in the dataset placed into a cluster, I will begin profiling the results (determine the different characteristics of each cluster).

In [17]:
colors = ['#a6cee3','#1f78b4','#b2df8a','#33a02c']
pl = sns.countplot(x=orig_df["Clusters"], palette=colors)
pl.set_title("Distribution Of The Clusters")
plt.show()

In [18]:
pl = sns.scatterplot(data=orig_df, x="TotalSpent", y="Income", hue="Clusters", palette=colors)
pl.set_title("Cluster Profiles Based On Income And Total Spending")
plt.legend()
plt.show()

In [19]:
plt.figure()
pl=sns.boxenplot(x=orig_df["Clusters"], y=orig_df["TotalSpent"], palette=colors)
plt.show()

In [20]:
orig_df.head()

#### Cluster 1 is our star customer group: its members spend the most money at the store.
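
#### A claim like this can also be backed with numbers rather than read off the plots. The sketch below is a minimal example on a toy DataFrame (made-up values, hypothetical variable names): it averages Income and TotalSpent per cluster and looks up the cluster with the highest average spend.

```python
import pandas as pd

# Toy data with three clusters (hypothetical values, not the real results)
toy = pd.DataFrame({
    "Clusters":   [0, 0, 1, 1, 2, 2],
    "Income":     [30000, 34000, 80000, 76000, 55000, 61000],
    "TotalSpent": [60, 100, 1500, 1300, 500, 700],
})

# Mean income and spending per cluster
summary = toy.groupby("Clusters")[["Income", "TotalSpent"]].mean()

# Label of the cluster with the highest average spending
top_cluster = summary["TotalSpent"].idxmax()

print(summary)
print(top_cluster)
```

#### Running the same `groupby` on `orig_df` would give a compact table to accompany the scatter and boxen plots.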

In [23]:
characteristics = [ "Kidhome","Teenhome", "Year_Birth", "Education", "Marital_Status"]

for col in characteristics:
    # jointplot creates its own figure, so no separate plt.figure() call is needed
    sns.jointplot(x=orig_df[col], y=orig_df["TotalSpent"], hue=orig_df["Clusters"], kind="kde", palette=colors)
    plt.show()

#### Cluster 0: 
* Most are parents
* Most are younger
* Wide range of education levels
* Most are married

#### Cluster 1:
* Not parents
* Wide range of ages
* Most have high education
* Half are married

#### Cluster 2:
* Most are parents
* Most are older
* Most have high education
* Most are married

#### Cluster 3:
* Most are parents
* Most are older
* Most have high education
* Half are married

## Conclusion

#### Using unsupervised learning to segment customers, I learned that individuals who are not parents and are highly educated (Cluster 1) spend the most at the store. Future marketing campaigns should target this group to maximize profit.