# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# magic word for producing visualizations in notebook
%matplotlib inline

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 221 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.
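
As a minimal sketch of the semicolon-delimited read described above, the snippet below parses a tiny in-memory sample. The sample rows and the name `sample_csv` are hypothetical and only illustrate the `sep=';'` argument; they are not taken from the project files.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a semicolon-delimited layout
sample_csv = io.StringIO("LNR;ANREDE_KZ;CAMEO_DEU_2015\n1;2;6B\n2;1;\n")

# sep=';' tells pandas to split fields on semicolons instead of commas
sample = pd.read_csv(sample_csv, sep=';')
print(sample.shape)
print(sample.isna().sum())
```

The empty last field in the second row is read as a missing value, which is the kind of gap the cleaning steps later have to handle.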

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

## Read-in Raw Data

In [2]:
# load in the data
azdias = pd.read_csv('Kmeans_raw_data/azdias.csv')
customers = pd.read_csv('Kmeans_raw_data/customers.csv')



In [3]:
print(azdias.shape)
print(customers.shape)

(891221, 366)
(191652, 369)


## Convert String Data to Numeric Data

In [4]:
# Take a dataframe and return it with a numeric CAMEO_DEU_2015 column
def numerical_dataframe(str_df):
    '''Takes a dataframe that is assumed to have a `CAMEO_DEU_2015` column.
       This function does two things:
       1) converts `CAMEO_DEU_2015` values to numerical values (the leading digit of each code)
       2) replaces missing values with 6, because the mode of the column is '6B'
       :param str_df: dataframe to process
       :return: the dataframe with a numerical `CAMEO_DEU_2015` column'''
        
    target_column = 'CAMEO_DEU_2015'
    if target_column not in str_df.columns.values.tolist():
        print('CAMEO_DEU_2015 column not found')
        return

    CAMEO_DEU_2015_nb = []
    for category in str_df['CAMEO_DEU_2015']:
        if category in ['1A','1B','1C','1D','1E']:
            CAMEO_DEU_2015_nb.append(1)
        elif category in ['2A','2B','2C','2D']:
            CAMEO_DEU_2015_nb.append(2)
        elif category in ['3A','3B','3C','3D']:
            CAMEO_DEU_2015_nb.append(3)
        elif category in ['4A','4B','4C','4D','4E']:
            CAMEO_DEU_2015_nb.append(4)
        elif category in ['5A','5B','5C','5D','5E','5F']:
            CAMEO_DEU_2015_nb.append(5)
        elif category in ['6A','6B','6C','6D','6E','6F']:
            CAMEO_DEU_2015_nb.append(6)
        elif category in ['7A','7B','7C','7D','7E']:
            CAMEO_DEU_2015_nb.append(7)
        elif category in ['8A','8B','8C','8D']:
            CAMEO_DEU_2015_nb.append(8)
        elif category in ['9A','9B','9C','9D','9E']:
            CAMEO_DEU_2015_nb.append(9)
        else:
            CAMEO_DEU_2015_nb.append(6)
    # Overwrite the original string column with its numeric encoding
    str_df['CAMEO_DEU_2015'] = CAMEO_DEU_2015_nb
    return str_df

In [5]:
azdias = numerical_dataframe(azdias)
customers = numerical_dataframe(customers)

In [6]:
azdias.index = azdias['LNR']
customers.index = customers['LNR']

In [7]:
azdias["CAMEO_DEUG_2015"] = pd.to_numeric(azdias.CAMEO_DEUG_2015, errors='coerce')
customers["CAMEO_DEUG_2015"] = pd.to_numeric(customers.CAMEO_DEUG_2015, errors='coerce')
print(azdias["CAMEO_DEUG_2015"].dtypes)
print(customers["CAMEO_DEUG_2015"].dtypes)

float64
float64


## Convert all missing/unknown values into -1

The raw data is coded inconsistently. In some columns, both -1 and 0 stand for unknown/missing data; in other columns, 0 or 9 stands for unknown/missing data.<br>
This section includes four parts:
1. Convert all unknown/missing values into -1, so that only -1 stands for a missing value.
2. Select only useful columns from the raw data. Delete redundant and unrelated columns.
3. Drop the columns with a high missing rate.
4. Use the average value to replace the missing values.
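
The first step can be illustrated on a toy frame. This is only a sketch: the column names `a` and `b` and their unknown codes are hypothetical, and it uses the nested-dictionary form of `DataFrame.replace` rather than the helper functions defined in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: in column 'a' 0 means unknown, in column 'b' 9 means unknown
toy = pd.DataFrame({'a': [0, 1, 2], 'b': [9, 3, np.nan]})

# Nested dict: for each column, map that column's unknown code to -1
toy = toy.replace({'a': {0: -1}, 'b': {9: -1}})
# Any remaining NaN is also treated as missing
toy = toy.fillna(-1)
print(toy)
```

After this step, -1 is the only marker for a missing value in either column.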

In [8]:
# Replace target_val with new_val (e.g. 0 with -1) in the given columns
def replace_val(df,target_columns,target_val,new_val):      
    converted_df = df[target_columns].replace(target_val,new_val)
    df = df.drop(target_columns, axis=1)
    prepared_df = pd.concat([df,converted_df],axis=1)
    
    return prepared_df

In [9]:
# The raw data uses 0, 9, or -1 for unknown values. To make the definition consistent, I compiled the lists below.
# Based on these lists, we convert the relevant 0/9 codes so that only -1 stands for an unknown value.
zero_to_neg_one = ["ALTERSKATEGORIE_GROB","ALTER_HH","ANREDE_KZ","CJT_GESAMTTYP","GEBAEUDETYP","HH_EINKOMMEN_SCORE","KBA05_BAUMAX","KBA05_GBZ","KKK","NATIONALITAET_KZ","PRAEGENDE_JUGENDJAHRE","REGIOTYP","RETOURTYP_BK_S","TITEL_KZ","WOHNDAUER_2008","W_KEIT_KIND_HH"]
nine_to_neg_one = ["KBA05_ALTER1","KBA05_ALTER2","KBA05_ALTER3","KBA05_ALTER4","KBA05_ANHANG","KBA05_AUTOQUOT","KBA05_CCM1","KBA05_CCM2","KBA05_CCM3","KBA05_CCM4","KBA05_DIESEL","KBA05_FRAU","KBA05_HERST1","KBA05_HERST2","KBA05_HERST3","KBA05_HERST4","KBA05_HERST5","KBA05_HERSTTEMP","KBA05_KRSAQUOT","KBA05_KRSHERST1","KBA05_KRSHERST2","KBA05_KRSHERST3","KBA05_KRSKLEIN","KBA05_KRSOBER","KBA05_KRSVAN","KBA05_KRSZUL","KBA05_KW1","KBA05_KW2","KBA05_KW3","KBA05_MAXAH","KBA05_MAXBJ","KBA05_MAXHERST","KBA05_MAXSEG","KBA05_MAXVORB","KBA05_MOD1","KBA05_MOD2","KBA05_MOD3","KBA05_MOD4","KBA05_MOD8","KBA05_MODTEMP","KBA05_MOTOR","KBA05_MOTRAD","KBA05_SEG1","KBA05_SEG10","KBA05_SEG2","KBA05_SEG3","KBA05_SEG4","KBA05_SEG5","KBA05_SEG6","KBA05_SEG7","KBA05_SEG8","KBA05_SEG9","KBA05_VORB0","KBA05_VORB1","KBA05_VORB2","KBA05_ZUL1","KBA05_ZUL2","KBA05_ZUL3","KBA05_ZUL4","RELAT_AB","SEMIO_DOM","SEMIO_ERL","SEMIO_FAM","SEMIO_KAEM","SEMIO_KRIT","SEMIO_KULT","SEMIO_LUST","SEMIO_MAT","SEMIO_PFLICHT","SEMIO_RAT","SEMIO_REL","SEMIO_SOZ","SEMIO_TRADV","SEMIO_VERT","ZABEOTYP"]

In [10]:
def convert_unknown_to_negone(df,zero_to_neg_one,nine_to_neg_one):
    df = replace_val(df,zero_to_neg_one,0,-1)
    df = replace_val(df,nine_to_neg_one,9,-1)
    df = df.fillna(value=-1)
    return df

In [11]:
azdias_df = convert_unknown_to_negone(azdias,zero_to_neg_one,nine_to_neg_one)
customers_df = convert_unknown_to_negone(customers,zero_to_neg_one,nine_to_neg_one)

### Column Selection

I chose 114 meaningful columns from the raw data (more than 360 columns) based on the provided metadata.

In [12]:
col_list = ['ANREDE_KZ', 'CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'GREEN_AVANTGARDE', 'HEALTH_TYP', 'LP_FAMILIE_FEIN', 'LP_STATUS_FEIN', 'NATIONALITAET_KZ', 'PRAEGENDE_JUGENDJAHRE', 'RETOURTYP_BK_S', 'SEMIO_SOZ', 'SEMIO_FAM', 'SEMIO_REL', 'SEMIO_MAT', 'SEMIO_VERT', 'SEMIO_LUST', 'SEMIO_ERL', 'SEMIO_KULT', 'SEMIO_RAT', 'SEMIO_KRIT', 'SEMIO_DOM', 'SEMIO_KAEM', 'SEMIO_PFLICHT', 'SEMIO_TRADV', 'SHOPPER_TYP', 'VERS_TYP', 'ZABEOTYP', 'ALTER_HH', 'ANZ_PERSONEN', 'ANZ_TITEL', 'HH_EINKOMMEN_SCORE', 'D19_VERSAND_ANZ_12', 'D19_VERSAND_ANZ_24', 'D19_GESAMT_ONLINE_DATUM', 'D19_GESAMT_DATUM', 'D19_BANKEN_ONLINE_DATUM', 'D19_BANKEN_DATUM', 'D19_TELKO_ONLINE_DATUM', 'D19_TELKO_DATUM', 'D19_VERSAND_ONLINE_DATUM', 'D19_VERSAND_DATUM', 'D19_VERSI_ONLINE_DATUM', 'D19_VERSI_DATUM', 'D19_VERSAND_ONLINE_QUOTE_12', 'W_KEIT_KIND_HH', 'ANZ_HH_TITEL', 'KONSUMNAEHE', 'WOHNLAGE', 'CAMEO_DEUG_2015', 'KBA05_AUTOQUOT', 'KBA05_KRSOBER', 'KBA05_MAXAH', 'KBA05_MAXBJ', 'KBA05_MOTRAD', 'KBA05_SEG9', 'WOHNDAUER_2008', 'D19_BANKEN_DIREKT', 'D19_BANKEN_GROSS', 'D19_BANKEN_LOKAL', 'D19_BANKEN_REST', 'D19_BEKLEIDUNG_GEH', 'D19_BEKLEIDUNG_REST', 'D19_BIO_OEKO', 'D19_BILDUNG', 'D19_BUCH_CD', 'D19_DIGIT_SERV', 'D19_DROGERIEARTIKEL', 'D19_ENERGIE', 'D19_FREIZEIT', 'D19_GARTEN', 'D19_HANDWERK', 'D19_HAUS_DEKO', 'D19_KINDERARTIKEL', 'D19_KOSMETIK', 'D19_LEBENSMITTEL', 'D19_NAHRUNGSERGAENZUNG', 'D19_RATGEBER', 'D19_REISEN', 'D19_SAMMELARTIKEL', 'D19_SCHUHE', 'D19_SONSTIGE', 'D19_TECHNIK', 'D19_TELKO_MOBILE', 'D19_TELKO_REST', 'D19_TIERARTIKEL', 'D19_VERSICHERUNGEN', 'D19_VOLLSORTIMENT', 'D19_WEIN_FEINKOST', 'BALLRAUM', 'EWDICHTE', 'INNENSTADT', 'GEBAEUDETYP_RASTER', 'KKK', 'MOBI_REGIO', 'ONLINE_AFFINITAET', 'REGIOTYP', 'KBA13_AUTOQUOTE', 'KBA13_HALTER_20', 'KBA13_HALTER_25', 'KBA13_HALTER_30', 'KBA13_HALTER_35', 'KBA13_HALTER_40', 'KBA13_HALTER_45', 'KBA13_HALTER_50', 'KBA13_HALTER_55', 'KBA13_HALTER_60', 'KBA13_HALTER_65', 'KBA13_HALTER_66', 'PLZ8_BAUMAX', 'PLZ8_HHZ', 'PLZ8_GBZ', 'ARBEIT', 'ORTSGR_KLS9']

In [13]:
azdias_df = azdias_df[col_list]
customers_df = customers_df[col_list]

In [14]:
azdias_df.shape

(891221, 114)

### Drop out the columns with a high missing rate

In [15]:
# This function finds the columns of the specified table that have a low missing-value rate.
# table: the specified table
# threshold: columns with a missing rate (in percent) higher than this threshold are left out
# The function returns the list of remaining columns; it does not modify the table.

def drop_missing_val_column(table, threshold):
    missing_rate = []
    for column in table.columns:
        rate = table[table[column]==-1].shape[0]/len(table)*100
        missing_rate.append(rate)
    
    column_remain = []
    column_name = table.columns
    for i in range(0,table.shape[1]):
        if missing_rate[i]<=threshold:
            column_remain.append(column_name[i])
    
    return column_remain
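
The same missing-rate filter can also be written without explicit loops. The sketch below is an alternative under the same assumption that -1 marks a missing value; the function name `low_missing_columns` and the toy frame are hypothetical.

```python
import pandas as pd

def low_missing_columns(table, threshold):
    # Percentage of rows equal to -1, computed per column
    missing_rate = (table == -1).mean() * 100
    # Keep the columns whose missing rate does not exceed the threshold
    return missing_rate[missing_rate <= threshold].index.tolist()

# Toy frame: 'x' is 75% missing, 'y' is 25% missing
toy = pd.DataFrame({'x': [-1, -1, -1, 4], 'y': [1, -1, 3, 4]})
print(low_missing_columns(toy, 50))
```

With a threshold of 100, every column is kept, which matches the choice made for the K-means model in the next cell.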

In [16]:
# For the Kmeans model, I want to keep all of the 114 columns, so the threshold is set to 100.
column_remain=drop_missing_val_column(customers_df,100)
print(len(column_remain))

114


### Use average value to replace the missing value

In [17]:
# The Data_process function changes all missing/unknown values (-1) to the mean value of each column.
import helper

def Data_process(df,column_remain):
    df_remain = df[column_remain]
    df_new = helper.replace_unknown_to_mean(df_remain)
    return df_new
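
The `helper` module is not shown in this notebook. As a rough sketch of what `replace_unknown_to_mean` might do, the function below replaces -1 with the column mean. It is an assumption that the mean is computed over the known values only; the name `replace_unknown_to_mean_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd

def replace_unknown_to_mean_sketch(df, unknown=-1):
    # Treat the unknown code as NaN so it is excluded from the mean (assumption)
    masked = df.replace(unknown, np.nan)
    # Fill each column's gaps with that column's mean of known values
    return masked.fillna(masked.mean())

toy = pd.DataFrame({'x': [1.0, -1.0, 3.0], 'y': [-1.0, 4.0, 4.0]})
filled = replace_unknown_to_mean_sketch(toy)
print(filled)
```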
    

In [18]:
# After this step, all -1 values have been replaced with the mean value of each column.
customers_df = Data_process(customers_df,column_remain)
customers_df.shape

(191652, 114)

In [19]:
azdias_df = Data_process(azdias_df,column_remain)
azdias_df.shape

(891221, 114)

In [20]:
customers_df['Flag']=1
azdias_df['Flag']=0

Segmentation_df = pd.concat([customers_df,azdias_df],join = 'inner')
Segmentation_df.shape

(1082873, 115)

In [21]:
Segmentation_df.head()

LNR,ANREDE_KZ,CJT_GESAMTTYP,FINANZTYP,GFK_URLAUBERTYP,GREEN_AVANTGARDE,HEALTH_TYP,LP_FAMILIE_FEIN,LP_STATUS_FEIN,NATIONALITAET_KZ,PRAEGENDE_JUGENDJAHRE,...,KBA13_HALTER_55,KBA13_HALTER_60,KBA13_HALTER_65,KBA13_HALTER_66,PLZ8_BAUMAX,PLZ8_HHZ,PLZ8_GBZ,ARBEIT,ORTSGR_KLS9,Flag
9626,1,5.0,2,4.0,1,1.0,2.0,10.0,1.0,4.0,...,5.0,5.0,4.0,3.0,1.0,5.0,5.0,1.0,2.0,1
9628,1,3.677928,2,6.302268,0,1.0,4.254448,6.68791,1.0,5.687074,...,2.983593,2.985203,3.336145,3.225709,1.556607,3.634893,3.622192,2.82485,5.119517,1
143872,2,2.0,2,3.0,1,2.0,1.0,10.0,1.0,4.0,...,3.0,3.0,3.0,4.0,3.0,3.0,2.0,3.0,5.0,1
143873,1,2.0,6,10.0,0,2.0,0.0,9.0,1.0,1.0,...,3.0,3.0,1.0,2.0,1.0,3.0,4.0,1.0,3.0,1
143874,1,6.0,2,2.0,0,3.0,10.0,1.0,1.0,8.0,...,3.0,3.0,3.0,4.0,2.0,3.0,3.0,3.0,5.0,1


## Normalize the data
We need to standardize the scale of the numerical columns in order to consistently compare the values of different features. We can use a MinMaxScaler to transform the numerical values so that they all fall between 0 and 1. This is a necessary step before using PCA.
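
For reference, MinMaxScaler maps each value with (x - min) / (max - min), computed per column. The sketch below checks this by hand on a hypothetical column with values from 1 to 6; a 1 to 6 range would be consistent with a CJT_GESAMTTYP value of 5.0 becoming 0.8 in the outputs in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single column with values on a 1-6 scale
col = np.array([[1.0], [2.0], [5.0], [6.0]])

scaled = MinMaxScaler().fit_transform(col)
# Same result by hand: (x - min) / (max - min)
by_hand = (col - col.min()) / (col.max() - col.min())

print(scaled.ravel())
print(np.allclose(scaled, by_hand))
```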


In [22]:
# scale numerical features into a normalized range, 0-1
# reference to the following website: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

from sklearn.preprocessing import MinMaxScaler

scaler=MinMaxScaler()

Segmentation_scaled=pd.DataFrame(scaler.fit_transform(Segmentation_df.astype(float)))
Segmentation_scaled.columns=Segmentation_df.columns
Segmentation_scaled.index=Segmentation_df.index

Segmentation_scaled.head()

LNR,ANREDE_KZ,CJT_GESAMTTYP,FINANZTYP,GFK_URLAUBERTYP,GREEN_AVANTGARDE,HEALTH_TYP,LP_FAMILIE_FEIN,LP_STATUS_FEIN,NATIONALITAET_KZ,PRAEGENDE_JUGENDJAHRE,...,KBA13_HALTER_55,KBA13_HALTER_60,KBA13_HALTER_65,KBA13_HALTER_66,PLZ8_BAUMAX,PLZ8_HHZ,PLZ8_GBZ,ARBEIT,ORTSGR_KLS9,Flag
9626,0.0,0.8,0.2,0.272727,1.0,0.0,0.181818,1.0,0.0,0.214286,...,1.0,1.0,0.75,0.5,0.0,1.0,1.0,0.0,0.222222,1.0
9628,0.0,0.535586,0.2,0.482024,0.0,0.0,0.386768,0.63199,0.0,0.334791,...,0.495898,0.496301,0.584036,0.556427,0.139152,0.658723,0.655548,0.228106,0.568835,1.0
143872,1.0,0.2,0.2,0.181818,1.0,0.5,0.090909,1.0,0.0,0.214286,...,0.5,0.5,0.5,0.75,0.5,0.5,0.25,0.25,0.555556,1.0
143873,0.0,0.2,1.0,0.818182,0.0,0.5,0.0,0.888889,0.0,0.0,...,0.5,0.5,0.0,0.25,0.0,0.5,0.75,0.0,0.333333,1.0
143874,0.0,1.0,0.2,0.090909,0.0,1.0,0.909091,0.0,0.0,0.5,...,0.5,0.5,0.5,0.75,0.25,0.5,0.5,0.25,0.555556,1.0


In [23]:
customers_clean = Segmentation_scaled[Segmentation_scaled['Flag']==1]
customers_clean = customers_clean.drop(['Flag'], axis=1)
print(customers_clean.shape)

(191652, 114)


In [24]:
azdias_clean = Segmentation_scaled[Segmentation_scaled['Flag']==0]
azdias_clean = azdias_clean.drop(['Flag'], axis=1)
print(azdias_clean.shape)

(891221, 114)


In [25]:
customers_clean.head()

LNR,ANREDE_KZ,CJT_GESAMTTYP,FINANZTYP,GFK_URLAUBERTYP,GREEN_AVANTGARDE,HEALTH_TYP,LP_FAMILIE_FEIN,LP_STATUS_FEIN,NATIONALITAET_KZ,PRAEGENDE_JUGENDJAHRE,...,KBA13_HALTER_50,KBA13_HALTER_55,KBA13_HALTER_60,KBA13_HALTER_65,KBA13_HALTER_66,PLZ8_BAUMAX,PLZ8_HHZ,PLZ8_GBZ,ARBEIT,ORTSGR_KLS9
9626,0.0,0.8,0.2,0.272727,1.0,0.0,0.181818,1.0,0.0,0.214286,...,0.75,1.0,1.0,0.75,0.5,0.0,1.0,1.0,0.0,0.222222
9628,0.0,0.535586,0.2,0.482024,0.0,0.0,0.386768,0.63199,0.0,0.334791,...,0.475533,0.495898,0.496301,0.584036,0.556427,0.139152,0.658723,0.655548,0.228106,0.568835
143872,1.0,0.2,0.2,0.181818,1.0,0.5,0.090909,1.0,0.0,0.214286,...,0.25,0.5,0.5,0.5,0.75,0.5,0.5,0.25,0.25,0.555556
143873,0.0,0.2,1.0,0.818182,0.0,0.5,0.0,0.888889,0.0,0.0,...,0.75,0.5,0.5,0.0,0.25,0.0,0.5,0.75,0.0,0.333333
143874,0.0,1.0,0.2,0.090909,0.0,1.0,0.909091,0.0,0.0,0.5,...,0.75,0.5,0.5,0.5,0.75,0.25,0.5,0.5,0.25,0.555556


### PCA

We can use PCA to reduce the dimensionality of our data. One important parameter for PCA is `n_components`; I chose 50. The `explanined_var` function below shows that about 85% of the variance in the customers data is captured with n_components=50.
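
One way to check such a choice is to look at the cumulative explained variance and find the smallest number of components that reaches a target share. The sketch below does this on hypothetical random data (`X_demo`), not on the project data, and the 85% target is used only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 10 features
X_demo = rng.normal(size=(200, 10))

pca_demo = PCA().fit(X_demo)
# Running total of the variance captured by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
# Smallest k whose cumulative share reaches 85%
k = int(np.searchsorted(cumulative, 0.85) + 1)
print(k, cumulative[k - 1])
```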

In [26]:
import numpy as np
from sklearn.decomposition import PCA
customers_X = customers_clean.values
azdias_X = azdias_clean.values
pca = PCA(n_components=50)
pca.fit(customers_X)

PCA(n_components=50)

In [27]:
var_np = pca.explained_variance_ratio_

In [28]:
def explanined_var(var_ratio, n_components):
    # Sum the explained variance ratio of the first n_components components
    explained_var = 0
    for i in range(n_components):
        explained_var = explained_var + var_ratio[i]
    return explained_var

In [29]:
explanined_variance = explanined_var(var_np,50)
print(explanined_variance)

0.8523575797952553


In [30]:
customers_pca = pca.transform(customers_X)
customers_pca.shape

(191652, 50)

In [31]:
azdias_pca = pca.transform(azdias_X)
azdias_pca.shape

(891221, 50)

In [32]:
customers_pca_df = pd.DataFrame(customers_pca, index=customers.index)
azdias_pca_df = pd.DataFrame(azdias_pca, index=azdias.index)

In [33]:
print(customers_pca_df.shape)
print(azdias_pca_df.shape)

(191652, 50)
(891221, 50)


### Choose K clusters for K-means

In [34]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = customers_pca_df.values

cluster_list = [13,14,15,16]

for n in cluster_list:
    clusterer = KMeans(n_clusters=n, random_state=10)
    labels = clusterer.fit_predict(X)

    # Use silhouette_score to find the best cluster number.
    silhouette = silhouette_score(X, labels)
    print("For n_clusters =", n,
          "The average silhouette_score is :", silhouette)


For n_clusters = 13 The average silhouette_score is : 0.1886748212230266
For n_clusters = 14 The average silhouette_score is : 0.19259809508881234
For n_clusters = 15 The average silhouette_score is : 0.19240962316461996
For n_clusters = 16 The average silhouette_score is : 0.1942616429789558


### K-means Customer Group Segmentation

Silhouette scores from a separate run over n_clusters = 2 to 16 (the values for 13 to 16 differ slightly from the cell output above):<br/>
For n_clusters = 2 The average silhouette_score is : 0.17901421116635363<br/>
For n_clusters = 3 The average silhouette_score is : 0.17144843232551113<br/>
For n_clusters = 4 The average silhouette_score is : 0.18548645269641556<br/>
For n_clusters = 5 The average silhouette_score is : 0.18951748899143575<br/>
For n_clusters = 6 The average silhouette_score is : 0.19626502613419164<br/>
For n_clusters = 7 The average silhouette_score is : 0.19794623465007988<br/>
For n_clusters = 8 The average silhouette_score is : 0.1975584576362408<br/>
For n_clusters = 9 The average silhouette_score is : 0.2025562541429389<br/>
For n_clusters = 10 The average silhouette_score is : 0.2020959211536878<br/>
For n_clusters = 11 The average silhouette_score is : 0.20290045542667645<br/>
For n_clusters = 12 The average silhouette_score is : 0.18727972425048411<br/>
For n_clusters = 13 The average silhouette_score is : 0.18972808734848196<br/>
For n_clusters = 14 The average silhouette_score is : 0.19077473486719984<br/>
For n_clusters = 15 The average silhouette_score is : 0.19457337532228255<br/>
For n_clusters = 16 The average silhouette_score is : 0.19632406199537455<br/>

n_clusters=11 has the highest score, so we select n_clusters=11 for the customers data.
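
The selection rule above (pick the `n_clusters` with the highest average silhouette score) can be sketched as follows. The data here is hypothetical: three well-separated synthetic blobs rather than the PCA-transformed customers data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data with three well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=10)

scores = {}
for n in range(2, 7):
    labels = KMeans(n_clusters=n, n_init=10, random_state=10).fit_predict(X_demo)
    # Average silhouette score over all samples for this n
    scores[n] = silhouette_score(X_demo, labels)

# Choose the cluster count with the highest score
best_n = max(scores, key=scores.get)
print(best_n)
```

On large datasets, `silhouette_score` also accepts a `sample_size` argument to score a random subsample, which can shorten the run considerably.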


## Customer Segmentation Report

Set the parameter n_clusters=11 for K-means. The 11 centroids reflect the characteristics of 11 customer groups. People in the azdias data who are close to one of those 11 centroids are more likely to become our customers. The cells below print the 11 centroids.
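
One possible way to make "close to a centroid" concrete is to measure each person's distance to the nearest customer centroid. The sketch below is an illustration on hypothetical random data (`customers_demo`, `population_demo`) with 3 clusters; it is not the analysis performed in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
# Hypothetical stand-ins for the PCA-transformed customers and population data
customers_demo = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
population_demo = rng.normal(loc=0.0, scale=3.0, size=(500, 5))

# Fit the clusters on the customers only
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=10).fit(customers_demo)

# transform() returns the distance from each row to every centroid
nearest_dist = kmeans_demo.transform(population_demo).min(axis=1)
# Cluster index assigned to each population row
nearest_cluster = kmeans_demo.predict(population_demo)

# Rank population rows from closest to farthest from any customer centroid
ranked = np.argsort(nearest_dist)
print(ranked[:5], nearest_cluster[ranked[:5]])
```

Rows at the top of `ranked` are the ones that look most like an existing customer group under this distance measure.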

In [35]:
Kmeans_clusterer = KMeans(n_clusters = 11, random_state=10)
Final_cluster_labels = Kmeans_clusterer.fit_predict(X)

In [36]:
Kmeans_clusterer.cluster_centers_

array([[-1.60260846e+00,  1.48242825e-01, -6.48260207e-01,
         1.82952470e-01,  5.49400208e-02,  6.98498145e-02,
         1.18471126e-01,  4.58629721e-02,  5.64132803e-02,
        -2.23400899e-01, -9.99482043e-03,  4.87456793e-02,
         5.70493493e-02,  2.59755762e-02, -3.46547755e-02,
        -1.31245265e-02, -5.30274928e-02, -9.16559647e-02,
        -3.34030307e-02, -4.18155163e-02,  1.72866602e-02,
        -3.07382384e-02, -5.15692133e-02, -1.26962825e-02,
        -3.40600245e-02, -3.49252237e-02,  5.49530647e-02,
        -4.67244026e-03,  5.42287704e-02,  3.81105645e-03,
        -2.98968760e-02, -1.98959684e-02, -1.91328449e-02,
        -1.00661364e-01,  1.07254215e-02,  1.63105305e-02,
        -1.25076210e-01, -1.68612299e-01, -7.35934502e-02,
         1.19872414e-01, -2.30404209e-02,  6.37684982e-02,
         1.66393045e-02, -5.73538062e-02,  1.54869624e-01,
         4.84602656e-02, -4.12398327e-02, -3.50830479e-03,
        -5.22654268e-02, -2.05730298e-02],
       [ 1.36