# Context
## Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business tailor its product to target customers in different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

# Content
## Attributes

### People

    ID: Customer's unique identifier
    Year_Birth: Customer's birth year
    Education: Customer's education level
    Marital_Status: Customer's marital status
    Income: Customer's yearly household income
    Kidhome: Number of children in customer's household
    Teenhome: Number of teenagers in customer's household
    Dt_Customer: Date of customer's enrollment with the company
    Recency: Number of days since customer's last purchase
    Complain: 1 if the customer complained in the last 2 years, 0 otherwise
### Products

    MntWines: Amount spent on wine in last 2 years
    MntFruits: Amount spent on fruits in last 2 years
    MntMeatProducts: Amount spent on meat in last 2 years
    MntFishProducts: Amount spent on fish in last 2 years
    MntSweetProducts: Amount spent on sweets in last 2 years
    MntGoldProds: Amount spent on gold in last 2 years
### Promotion

    NumDealsPurchases: Number of purchases made with a discount
    AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
    AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
    AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
    AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
    AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
    Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
### Place

    NumWebPurchases: Number of purchases made through the company’s website
    NumCatalogPurchases: Number of purchases made using a catalogue
    NumStorePurchases: Number of purchases made directly in stores
    NumWebVisitsMonth: Number of visits to company’s website in the last month
### Target
    The goal is to perform clustering in order to summarize customer segments.

### Acknowledgement
    The dataset for this project is provided by Dr. Omar Romero-Hernandez.

### Solution
    The following link describes one approach to solving this problem.
https://thecleverprogrammer.com/2021/02/08/customer-personality-analysis-with-python/


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

In [2]:
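# The file is tab-separated. Dt_Customer appears to be stored day-first (DD-MM-YYYY), hence dayfirst=True.
data = pd.read_csv("../input/customer-personality-analysis/marketing_campaign.csv", sep="\t", parse_dates=["Dt_Customer"], dayfirst=True)
data.head(20).T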

In [3]:
data.info()

Let's check whether there are any missing values.

In [4]:
data.isna().sum()

Let's simply drop the rows with missing values, since there are only a few of them.

In [5]:
data.dropna(inplace=True)
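
Before dropping rows it can help to quantify how much data is lost. The following is a minimal sketch on a made-up frame (the `toy` values are hypothetical and only stand in for the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Recency": [58, 38, 26, 26],
})

# Share of rows that contain at least one missing value.
missing_share = toy.isna().any(axis=1).mean()

# Dropping is reasonable when this share is small.
cleaned = toy.dropna()
print(f"{missing_share:.0%} of rows dropped, {len(cleaned)} rows kept")
```

On the real data the same two lines can be run on `data` right before `dropna`.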

In [6]:
# Compute age from the year of birth, using 2021 as the reference year.
data["Age"]= 2021 - data["Year_Birth"]

In [7]:
data.describe()
# Income and Age contain some outliers, so both are restricted to a plausible range below.

In [8]:
# Keep only plausible values; .copy() avoids chained-assignment warnings in later cells.
data = data[(data["Income"] < 250000) & (data["Age"] < 90)].copy()
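
The thresholds above were chosen by eye. As an alternative, a common rule of thumb flags values that lie more than 1.5 interquartile ranges outside the quartiles. This is a sketch of that rule, not the method used in this notebook; `iqr_bounds` and the sample incomes are hypothetical:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (low, high) limits using the k * IQR rule of thumb."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes with one obviously extreme value.
incomes = pd.Series([30000, 42000, 51000, 58000, 65000, 72000, 666666])

low, high = iqr_bounds(incomes)
outliers = incomes[(incomes < low) | (incomes > high)]
print(low, high, outliers.tolist())
```

A larger `k` makes the rule more tolerant, which may be preferable for skewed columns such as income.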

In [9]:
data.Education.value_counts()

In [10]:
# Collapse the various education levels into two groups: Undergraduate and Postgraduate.
data.Education = data.Education.replace({
    "Basic": "Undergraduate",
    "2n Cycle": "Undergraduate",
    "Graduation": "Postgraduate",
    "PhD": "Postgraduate",
    "Master": "Postgraduate",
})

In [11]:
data.Marital_Status.value_counts()

In [12]:
# Several labels describe the same situation, so they are merged into two groups: Together and Single.
# Assigning the result back avoids relying on inplace=True on a column selection.
data["Marital_Status"] = data["Marital_Status"].replace({
    "Married": "Together",
    "Alone": "Single",
    "Absurd": "Single",
    "YOLO": "Single",
    "Divorced": "Single",
    "Widow": "Single",
})

In [13]:
data.Marital_Status.value_counts()

In [14]:
# Rename the spending columns so they are easier to read.
data.rename(columns={"MntWines":"Wines","MntFruits":"Fruits","MntMeatProducts":"MeatProducts","MntFishProducts":"FishProducts","MntSweetProducts":"SweetProducts","MntGoldProds":"GoldProds"},inplace=True)

In [15]:
# Instead of looking at spending per product, compute total spending first, since it gives
# the big picture. A pair-wise analysis of all products follows later.
data["Total Spending"] = data["Wines"] + data["Fruits"] + data["MeatProducts"] + data["FishProducts"] + data["SweetProducts"] + data["GoldProds"]

In [16]:
# First, check which education group spends the most on average:
data.groupby("Education")[["Total Spending","Income"]].mean()

In [17]:
# Again, instead of separating kids and teens, compute their total to see the big picture.
data["Children"] = data["Kidhome"] + data["Teenhome"]

In [18]:
# Replace the registration date with customer seniority: time elapsed up to the latest enrollment date.
data["Seniority"] = data["Dt_Customer"].max() - data["Dt_Customer"]
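
Seniority computed this way is a timedelta. If a plain number is preferred, the `.dt.days` accessor converts it to whole days. A small sketch with made-up enrollment dates (the `enrolled` series is hypothetical):

```python
import pandas as pd

# Hypothetical enrollment dates.
enrolled = pd.Series(pd.to_datetime(["2012-09-04", "2014-03-08", "2014-06-29"]))

# Days between each enrollment and the most recent enrollment in the data.
seniority_days = (enrolled.max() - enrolled).dt.days
print(seniority_days.tolist())
```

A numeric column like this keeps the actual distances between customers, which a label-encoded column only preserves as a rank.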

In [19]:
# Family size = number of adults (2 if Together, 1 if Single) plus number of children.
data["Family_Size"] = data["Marital_Status"].map({"Together": 2, "Single": 1}) + data["Children"]

In [20]:
# Share of customers for each family size.
data["Family_Size"].value_counts(normalize=True).sort_index()

In [21]:
# Check which family size spends the most on average.
data.groupby(by="Family_Size")[["Total Spending","Income","NumDealsPurchases"]].mean()

In [22]:
# Total number of campaigns accepted by each customer.
data["TotalAcceptedCmp"] = (data.AcceptedCmp1 + data.AcceptedCmp2 + data.AcceptedCmp3 + data.AcceptedCmp4 + data.AcceptedCmp5)

In [23]:
# Plot how many customers accepted 0, 1, 2, ... campaigns.
data.TotalAcceptedCmp.value_counts().plot(kind="bar")

In [24]:
data[(data.Complain ==1) & (data.TotalAcceptedCmp >0)]

From the output above, there is no clear correlation between a customer having complained and that customer accepting a campaign offer. So there is no need to give a promotion to customers who have complained.
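
One way to look at this relationship more directly is to compare acceptance rates between customers who complained and those who did not. The rows below are made up purely to show the mechanics; on the real data the same `groupby` would be applied to `data`:

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
toy = pd.DataFrame({
    "Complain":         [0, 0, 0, 0, 1, 1],
    "TotalAcceptedCmp": [0, 1, 0, 2, 0, 0],
})

# 1 if the customer accepted at least one campaign, 0 otherwise.
toy["AcceptedAny"] = (toy["TotalAcceptedCmp"] > 0).astype(int)

# Acceptance rate per complaint group.
rates = toy.groupby("Complain")["AcceptedAny"].mean()
print(rates)
```

Because very few customers complain, the rate for that group rests on a small sample and should be read with caution.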

In [25]:
data.columns

In [26]:
# Drop the columns that are no longer needed.
data.drop(columns=["Year_Birth","ID","Z_CostContact","Z_Revenue","Marital_Status","Dt_Customer"],inplace=True)

In [27]:
data.dtypes[data.dtypes == "object"]

In [28]:
# To get rid of object dtypes, apply a label encoder to turn the categories into numbers.
label_encoder = LabelEncoder()
data["Education"] = label_encoder.fit_transform(data["Education"])
data["Education"].unique()

In [29]:
# Apply the same technique to the Seniority column.
data.Seniority = label_encoder.fit_transform(data.Seniority)
data.Seniority

In [30]:
data.dtypes

In [31]:
# Since there are no object or datetime columns left in the dataset, a scaler can be applied.
# Scaling is needed because the columns have very different ranges, and they should be on a
# similar scale to obtain healthy results.
scaler = StandardScaler()
scaler.fit(data)
scaler_df = pd.DataFrame(scaler.transform(data), columns=data.columns)

In [32]:
scaler_df.head().T
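
As a quick sanity check, every column should have mean 0 and (population) standard deviation 1 after standard scaling. A minimal sketch on a small made-up frame (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two hypothetical columns on very different scales.
toy = pd.DataFrame({
    "Income":   [20000.0, 40000.0, 60000.0, 80000.0],
    "Children": [0.0, 1.0, 2.0, 3.0],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# StandardScaler uses the population standard deviation, hence ddof=0.
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
```

The same check can be run on `scaler_df` from the cell above.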

# Principal Component Analysis
PCA is a dimensionality reduction technique. Dimensionality reduction is useful when there are many features and some of them are highly correlated with each other. Instead of working with the correlated features directly, PCA replaces them with uncorrelated linear combinations of the original columns (the principal components), ordered by how much of the variance each one explains. Keeping only the first few components removes most of the redundancy between correlated columns, so the model does not rely on them.
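
To illustrate the idea, the sketch below builds two strongly correlated synthetic features and shows that their principal components are uncorrelated. The data is generated at random and has nothing to do with the customer dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Second feature is roughly twice the first, plus noise, so the two are correlated.
x = rng.normal(size=500)
X = np.column_stack([x, 2 * x + rng.normal(scale=1.0, size=500)])

# Scale, then rotate onto the principal components.
Z = PCA().fit_transform(StandardScaler().fit_transform(X))

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
print(round(corr_before, 3), round(abs(corr_after), 6))
```

The correlation between the original columns is high, while the correlation between the two components is zero up to floating-point error.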

In [33]:
pca = PCA()
pca.fit(scaler.transform(data))
pca_data = pca.transform(scaler.transform(data))

In [34]:
#The following code constructs the Scree plot
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
 
plt.bar(x=range(1,len(per_var)+1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.xticks(rotation=90)
plt.title('Scree Plot')
plt.show()
# In the plot above, the first component explains about 30% of the variance; the first two components together explain about 40%.

From that plot, we can see that after the third component the additional variance explained by each component becomes small. So we can rebuild the model with 3 components.
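
Besides reading the scree plot, the cumulative explained variance can be used to pick the number of components for a chosen threshold. In this sketch the first two ratios match the figures quoted above and the remaining ones are made up for illustration; `components_for` is a hypothetical helper:

```python
import numpy as np

# Illustrative ratios: the first two follow the text above, the rest are invented.
explained_ratio = np.array([0.30, 0.10, 0.08, 0.05, 0.04])
cumulative = np.cumsum(explained_ratio)

def components_for(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

n_needed = components_for(0.45)
print(cumulative.round(2).tolist(), n_needed)
```

With a fitted model, `explained_ratio` would be `pca.explained_variance_ratio_` from the full PCA above.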

In [35]:
pca = PCA(n_components=3)
pca.fit(scaler.transform(data))
#Let's store it in a Data Frame
PCA_df = pd.DataFrame(pca.transform(scaler.transform(data)),columns=["PCA1","PCA2","PCA3"])

In [36]:
PCA_df

In [37]:
# Compute the inertia (within-cluster sum of squares) for 1 to 10 clusters.
wcss=[]
for i in range(1,11):
    kmeans = KMeans(n_clusters= i, init='k-means++', random_state=0)
    kmeans.fit(PCA_df)
    wcss.append(kmeans.inertia_)

In [38]:
#Visualizing the ELBOW method to get the optimal value of K 
plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('no of clusters')
plt.ylabel('wcss')
plt.show()

In [39]:
# In the plot above, inertia improves only slightly beyond 4 clusters, so 4 clusters are used.
kmeansmodel = KMeans(n_clusters= 4, init='k-means++', random_state=0)
y_kmeans= kmeansmodel.fit_predict(PCA_df)
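
The elbow can be hard to read by eye, so it may be worth cross-checking the choice of K with the silhouette score (higher is better). This is a complementary technique, not the one used in this notebook. The sketch runs on synthetic blobs with four well-separated centers chosen for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four clearly separated groups (hypothetical centers).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# The silhouette score is only defined for at least 2 clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, `X` would be the three PCA columns of `PCA_df`, taken before the cluster labels are added.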

In [40]:
# Add the cluster labels to both the PCA data frame and the original data.
PCA_df["Clusters"] = y_kmeans
data["Clusters"] = y_kmeans

In [41]:
sns.countplot(data=data,x=data["Clusters"]).set_title("Distribution Of The Clusters")
plt.show()

In [42]:
sns.scatterplot(data=data,x=data["Total Spending"],y=data["Income"],hue=data["Clusters"]).set_title("Income-Total Spending Clusters")
plt.show()

From the plot above:
* Cluster 0 can be named as average-income and average-spending customers
* Cluster 1 can be named as low-income and low-spending customers
* Cluster 2 can be named as high-income and high-spending customers
* Cluster 3 can be named as high-income and average-spending customers
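
Labels like these can be backed up with per-cluster averages rather than read off the scatter plot alone. A minimal sketch with made-up numbers (the `toy` frame is hypothetical; on the real data the same `groupby` would be applied to `data`):

```python
import pandas as pd

# Hypothetical customers with already assigned cluster labels.
toy = pd.DataFrame({
    "Clusters":       [0, 0, 1, 1, 2, 2],
    "Income":         [50000, 54000, 20000, 24000, 80000, 84000],
    "Total Spending": [500, 700, 50, 150, 1500, 1700],
})

# Mean income and spending per cluster, ordered from lowest to highest income.
profile = toy.groupby("Clusters")[["Income", "Total Spending"]].mean().sort_values("Income")
print(profile)
```

Sorting by income makes it easy to see which cluster sits at the low, middle and high end.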

In [43]:
sns.scatterplot(data=data,x=data["Total Spending"],y=data["Family_Size"],hue=data["Clusters"]).set_title("Family Size - Total Spending Clusters")
plt.show()

The graph above shows that the smaller the family, the more it tends to spend. Marketing can therefore focus on cluster 2 and cluster 3 customers, since they make up the majority of high-spending customers.

In [44]:
sns.countplot(data=data,x=data["TotalAcceptedCmp"], hue = data["Clusters"]).set_title("Distribution Of The Total Accepted Campaigns by Clusters")
plt.show()

In the graph above, most customers did not accept any of the offered campaigns, with cluster 3 as the exception. Cluster 3 customers should definitely be offered campaigns, since they are high-income, average-spending customers. Cluster 2 customers are also worth considering, since they are the high-income, high-spending group.

In [45]:
sns.pairplot(data,x_vars=["Wines","Fruits","MeatProducts","FishProducts","SweetProducts","GoldProds"],y_vars=["Income","Total Spending","Family_Size","TotalAcceptedCmp"],hue="Clusters")
plt.show()

In the plot above, I tried to characterize the clusters by product type so that the marketing team can see which product to offer to which customer cluster. Cluster 3 customers accept all kinds of campaigns and spend more on wine, fish, gold and meat products, whereas cluster 2 customers tend to spend more on wine and gold products.
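
The same question can also be answered numerically by computing each product's share of a cluster's total spending. The figures below are invented to show the calculation; the column names follow the renamed columns of this notebook:

```python
import pandas as pd

products = ["Wines", "MeatProducts", "GoldProds"]

# Hypothetical spending per customer.
toy = pd.DataFrame({
    "Clusters":     [0, 0, 1, 1],
    "Wines":        [100, 300, 10, 30],
    "MeatProducts": [50, 50, 30, 30],
    "GoldProds":    [50, 50, 10, 10],
})

# Total spending per product within each cluster, then each row normalized to sum to 1.
totals = toy.groupby("Clusters")[products].sum()
shares = totals.div(totals.sum(axis=1), axis=0)

# Product with the largest share in each cluster.
top_product = shares.idxmax(axis=1)
print(shares.round(2))
print(top_product)
```

Shares make clusters with very different spending levels comparable, which raw amounts in a pair plot do not.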

# Conclusion

In this notebook, I clustered the customers and found a total of 4 clusters. I analysed the clusters by total spending against income, family size and total accepted campaigns, and finally drew a pair-wise plot over all products to support better marketing strategies for the right products.