About this Dataset
Context
Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to tailor products to the specific needs, behaviors, and concerns of different types of customers.

Customer personality analysis helps a business tailor its product to target customers in different segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

Content
Attributes

People

ID: Customer's unique identifier
Year_Birth: Customer's birth year
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Customer's yearly household income
Kidhome: Number of children in customer's household
Teenhome: Number of teenagers in customer's household
Dt_Customer: Date of customer's enrollment with the company
Recency: Number of days since customer's last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise

Products

MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years

Promotion

NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month

Target
Need to perform clustering to summarize customer segments.
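
One common way to summarize segments, once each customer has a cluster label, is to average the attributes within each cluster. The sketch below uses a tiny made-up table; the values and the `Cluster` column are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the customer table; all values are invented for illustration.
toy = pd.DataFrame({
    "Income":   [30000, 32000, 80000, 85000],
    "MntWines": [10, 20, 500, 700],
    "Cluster":  [0, 0, 1, 1],   # hypothetical cluster labels
})

# Mean of each attribute per cluster, plus the number of customers in it.
summary = toy.groupby("Cluster").mean()
summary["n_customers"] = toy.groupby("Cluster").size()
print(summary)
```

Reading the per-cluster means side by side is usually enough to give each segment a rough description, such as "high income, heavy wine spend".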

https://www.kaggle.com/karnikakapoor/customer-segmentation-clustering/notebook#DATA-PREPROCESSING

In [None]:
# Basic libraries
import numpy as np
import pandas as pd
import os
import time

# Visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import warnings
warnings.filterwarnings("ignore", category = FutureWarning)


# Set the plot style
plt.style.use('default')
# Korean font for plots (Malgun Gothic ships with Windows; change it if unavailable)
from matplotlib import font_manager, rc
plt.rc("font", family = "Malgun Gothic")
plt.rc("axes", unicode_minus = False)

# scikit-learn
from sklearn.datasets import *
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV, KFold

from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, RocCurveDisplay
from sklearn.metrics import classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier, VotingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier, plot_importance
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')
os.chdir('/content/gdrive/MyDrive/Colab Notebooks/data/')

In [None]:
data = pd.read_csv('marketing_campaign.csv', sep = "\t")
data.head(10)

In [None]:
data.info()

In [None]:
data = data.dropna()

In [None]:
# Dates are assumed to be day-first strings (e.g. 21-08-2013), so say so explicitly
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], dayfirst=True)
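
Date parsing is easy to get silently wrong when day and month are both 12 or less. The sketch below assumes `Dt_Customer` holds day-month-year strings; the two sample values are illustrative. An explicit `format` makes the interpretation unambiguous.

```python
import pandas as pd

# Illustrative day-month-year strings (assumed layout of Dt_Customer).
dates = pd.Series(["04-09-2012", "21-08-2013"])

# An explicit format avoids guessing whether "04-09" is April 9 or 4 September.
parsed = pd.to_datetime(dates, format="%d-%m-%Y")
print(parsed.dt.month.tolist())
```

If the file turns out to use a different layout, the explicit format raises an error instead of quietly swapping day and month.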

In [None]:
data.head()

In [None]:
data['Age'] = 2021 - data['Year_Birth']

In [None]:
data.head()

In [None]:
data = data.drop('Year_Birth', axis=1)

In [None]:
data = data.drop('ID', axis=1)

In [None]:
data.hist(figsize = (24,16))
plt.show()

In [None]:
numeric_col = [col for col in data if data[col].dtype != "object" ]
object_col = [col for col in data if data[col].dtype == "object" ]

In [None]:
LE = LabelEncoder()
# Replace each text column with integer codes (the encoder is refit per column)
for i in object_col:
    data[i] = LE.fit_transform(data[i])
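
Since the loop reuses a single encoder, only the mapping for the last column remains in `LE.classes_` afterwards. The sketch below shows how to read a mapping back from a fitted encoder, using made-up category values. Note that codes follow sorted label order, not any meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category values, not read from the dataset.
levels = ["PhD", "Basic", "Master", "Basic"]

le = LabelEncoder()
codes = le.fit_transform(levels)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
```

Keeping one encoder per column (for example in a dict keyed by column name) would preserve every mapping for later interpretation of the clusters.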

In [None]:
data_eda = data.copy()

In [None]:
data_eda.head()

In [None]:
data_eda = data_eda.drop('Dt_Customer', axis=1)

In [None]:
scaler = StandardScaler()
# Fit and transform the same object so the scaler sees consistent input
df_scaled = scaler.fit_transform(data_eda)
df_scaled = pd.DataFrame(df_scaled, columns=data_eda.columns)
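
As a quick sanity check on standardization, every scaled column should have mean 0 and standard deviation 1. The sketch below demonstrates this on a small synthetic array, not on the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

Z = StandardScaler().fit_transform(X)

# After scaling, both columns are centered with unit (population) std.
print(Z.mean(axis=0), Z.std(axis=0))
```

Putting features on a common scale matters here because PCA is driven by variance, so an unscaled column such as income would otherwise dominate the components.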

In [None]:
df_scaled.head()

In [None]:
df_scaled.hist(figsize = (24,16))
plt.show()

In [None]:
k = 3
# Create a PCA model with k principal components
pca = PCA( n_components = k ) 
# Run PCA on the standardized variables
df_pca = pca.fit_transform(df_scaled)

In [None]:
df_pca

In [None]:
# Scores (principal component scores)

# Build the list of component column names ( ['PC1', 'PC2', ....] )
pc_names = []
for i in range( 1, k+1 ):
    pc_names.append( 'PC'+str(i) )

# Put the PCA result into a DataFrame
df_pca = pd.DataFrame( df_pca, columns = pc_names)

print('< Principal component scores >')
display( df_pca )
print()

print('< Correlation between principal components >')
display( df_pca.corr() )

In [None]:
# Explained variance and explained variance ratio of each principal component

pd.DataFrame( {'Explained variance' : pca.explained_variance_,
               'Explained variance ratio' : pca.explained_variance_ratio_},
             index = pc_names )
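
To judge whether a given number of components is enough, it can help to look at the cumulative explained variance ratio. The sketch below runs on synthetic random data; the 90% threshold is an arbitrary illustrative choice, not something the notebook prescribes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # synthetic stand-in for the scaled data

# Fit PCA with all components to see the full variance profile.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3), n_needed)
```

If the first three components cover only a small share of the variance, clusters found in that space describe a correspondingly partial view of the customers.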

In [None]:
df_pca

In [None]:
x =df_pca["PC1"]
y =df_pca["PC2"]
z =df_pca["PC3"]
#To plot
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="maroon", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()

In [None]:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=4).fit(df_pca)
labels = gmm.predict(df_pca)

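# plt.scatter is 2D (a third positional argument would be the marker size),
# so draw the three components on 3D axes instead
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z, c=labels, cmap='viridis')
plt.show()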

In [None]:
# BIC, AIC

n_components = np.arange(1, 21)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(df_pca) for n in n_components]

plt.plot(n_components, [m.bic(df_pca) for m in models], label='BIC')
plt.plot(n_components, [m.aic(df_pca) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components');
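
Besides reading the curves by eye, the number of components with the lowest BIC can be picked programmatically. The sketch below does this on synthetic two-blob data; the data and the candidate range are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian blobs in 2D.
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(8, 1, size=(150, 2))])

candidates = list(range(1, 6))
bics = [GaussianMixture(n, covariance_type="full", random_state=0).fit(X).bic(X)
        for n in candidates]

# Lower BIC is better; take the candidate that minimizes it.
best_n = candidates[int(np.argmin(bics))]
print(best_n)
```

BIC penalizes extra parameters more heavily than AIC, so when the two curves disagree, BIC tends to point to the smaller model.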

In [None]:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=9).fit(df_pca)
labels = gmm.predict(df_pca)

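# Plot the three components on 3D axes (plt.scatter alone is 2D)
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z, c=labels, cmap='viridis')
plt.show()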