# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project will include many more features and has not been pre-cleaned. You are also free to choose whatever approach you'd like to analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# magic word for producing visualizations in notebook
%matplotlib inline

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

In [2]:
azdias = pd.read_csv('Data/Udacity_AZDIAS_052018.csv', sep=';')
customers = pd.read_csv('Data/Udacity_CUSTOMERS_052018.csv', sep=';')

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
#get to know the features of azdias
azdias.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,910215,-1,,,,,,,,,...,,,,,,,,3,1,2
1,910220,-1,9.0,0.0,,,,,21.0,11.0,...,4.0,8.0,11.0,10.0,3.0,9.0,4.0,5,2,1
2,910225,-1,9.0,17.0,,,,,17.0,10.0,...,2.0,9.0,9.0,6.0,3.0,9.0,2.0,5,2,3
3,910226,2,1.0,13.0,,,,,13.0,1.0,...,0.0,7.0,10.0,11.0,,9.0,7.0,3,2,4
4,910241,-1,1.0,20.0,,,,,14.0,3.0,...,2.0,3.0,5.0,4.0,2.0,9.0,3.0,4,1,3


In [4]:
azdias.info()
azdias.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int64(93), object(6)
memory usage: 2.4+ GB


(891221, 366)

According to the shape information, there are 891,221 rows/individuals in the dataset. Each individual has 366 features. 


## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

## Data Cleaning

1.1 From the DIAS Attributes excel sheet, I created a csv file called Data_NaN that denotes values that mean unknown cases in each column. From Data_NaN I create a corresponding dictionary.

In [5]:
pd_NaN = pd.read_csv('Data/Data_NaN.csv', sep=',', header=0)
pd_NaN.head()

Unnamed: 0,Attribute,Value
0,AGER_TYP,-1
1,ALTERSKATEGORIE_GROB,"-1, 0"
2,ALTER_HH,0
3,ANREDE_KZ,"-1, 0"
4,BALLRAUM,-1


In [6]:
pd_NaN['Value'] = pd_NaN['Value'].apply(lambda x: x.split(","))
pd_NaN['Value'] = pd_NaN['Value'].apply(lambda x: list(map(int,x)))

In [7]:
dict_NaN = dict(zip(pd_NaN['Attribute'], pd_NaN['Value']))
print(dict_NaN)

{'AGER_TYP': [-1], 'ALTERSKATEGORIE_GROB': [-1, 0], 'ALTER_HH': [0], 'ANREDE_KZ': [-1, 0], 'BALLRAUM': [-1], 'CAMEO_DEUG_2015': [-1], 'CAMEO_INTL_2015': [-1], 'CJT_GESAMTTYP': [0], 'EWDICHTE': [-1], 'FINANZTYP': [-1], 'FINANZ_ANLEGER': [-1], 'FINANZ_HAUSBAUER': [-1], 'FINANZ_MINIMALIST': [-1], 'FINANZ_SPARER': [-1], 'FINANZ_UNAUFFAELLIGER': [-1], 'FINANZ_VORSORGER': [-1], 'GEBAEUDETYP': [-1, 0], 'HEALTH_TYP': [-1], 'HH_EINKOMMEN_SCORE': [-1, 0], 'INNENSTADT': [-1], 'KBA05_ALTER1': [-1, 9], 'KBA05_ALTER2': [-1, 9], 'KBA05_ALTER3': [-1, 9], 'KBA05_ALTER4': [-1, 9], 'KBA05_ANHANG': [-1, 9], 'KBA05_ANTG1': [-1], 'KBA05_ANTG2': [-1], 'KBA05_ANTG3': [-1], 'KBA05_ANTG4': [-1], 'KBA05_AUTOQUOT': [-1, 9], 'KBA05_BAUMAX': [-1, 0], 'KBA05_CCM1': [-1, 9], 'KBA05_CCM2': [-1, 9], 'KBA05_CCM3': [-1, 9], 'KBA05_CCM4': [-1, 9], 'KBA05_DIESEL': [-1, 9], 'KBA05_FRAU': [-1, 9], 'KBA05_GBZ': [-1, 0], 'KBA05_HERST1': [-1, 9], 'KBA05_HERST2': [-1, 9], 'KBA05_HERST3': [-1, 9], 'KBA05_HERST4': [-1, 9], 'KBA05_

In [8]:
#double check whether any item in dictionary keys is not in the column names 
list_1 = list(dict_NaN.keys())
list_2 = list(azdias.columns.values)
diff_list = np.setdiff1d(list_1,list_2)
print(diff_list)  



[]


1.2 Convert the unknown values in azdias into NaN. 

In [9]:
azdias_1 = azdias[:10000].copy()

In [10]:
def convert_unknown_to_NaN(df, unknown = dict_NaN):
    '''
    It maps unknown values by column and transform them to nan values.
    
    Input:
    df: original dataframe;
    unkown: dictionary mapping columns and their respective unknown values.
    
    Output:Processed dataframe that maps unknown values to NaN   
    '''
    df_1 = df.copy()
    for col in list(unknown.keys()):
        df_1.loc[:, col] = df_1.loc[:, col].apply(lambda x: np.nan if x in unknown[col] else x)
    return df_1 

In [11]:
azdias_NaN = convert_unknown_to_NaN(df=azdias_1)

In [12]:
azdias_NaN.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,910215,,,,,,,,,,...,,,,,,,,3,1,2
1,910220,,9.0,,,,,,21.0,11.0,...,4.0,8.0,11.0,10.0,3.0,9.0,4.0,5,2,1
2,910225,,9.0,17.0,,,,,17.0,10.0,...,2.0,9.0,9.0,6.0,3.0,9.0,2.0,5,2,3
3,910226,2.0,1.0,13.0,,,,,13.0,1.0,...,0.0,7.0,10.0,11.0,,9.0,7.0,3,2,4
4,910241,,1.0,20.0,,,,,14.0,3.0,...,2.0,3.0,5.0,4.0,2.0,9.0,3.0,4,1,3


1.3 Find the features that have NaN above 15% and delete the features. Here due to the constraint of memory, I chose first 10,000 data to decide the NaN percentage. 

In [13]:
thresh = int(0.85*10000)
azdias_NaN_drop = azdias_NaN.dropna(axis=1, thresh=thresh, inplace=False)
azdias_NaN_drop.head()

Unnamed: 0,LNR,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,...,VERS_TYP,VHA,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,910215,,,,,,,,,,...,,,,,,,,3,1,2
1,910220,9.0,11.0,0.0,0.0,2.0,12.0,0.0,3.0,6.0,...,2.0,0.0,8.0,11.0,10.0,9.0,4.0,5,2,1
2,910225,9.0,10.0,0.0,0.0,1.0,7.0,0.0,3.0,2.0,...,1.0,0.0,9.0,9.0,6.0,9.0,2.0,5,2,3
3,910226,1.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,4.0,...,1.0,1.0,7.0,10.0,11.0,9.0,7.0,3,2,4
4,910241,1.0,3.0,0.0,0.0,4.0,3.0,0.0,4.0,2.0,...,2.0,0.0,3.0,5.0,4.0,9.0,3.0,4,1,3


Find the columns that were dropped due to high NaN percentage

In [14]:
list_1 = list(azdias_NaN_drop.columns.values)
list_2 = list(azdias_NaN.columns.values)
column_NaN = np.setdiff1d(list_2,list_1).tolist()
print(column_NaN)

['AGER_TYP', 'ALTERSKATEGORIE_FEIN', 'ALTER_HH', 'ALTER_KIND1', 'ALTER_KIND2', 'ALTER_KIND3', 'ALTER_KIND4', 'D19_BANKEN_ONLINE_QUOTE_12', 'D19_GESAMT_ONLINE_QUOTE_12', 'D19_KONSUMTYP', 'D19_LETZTER_KAUF_BRANCHE', 'D19_LOTTO', 'D19_SOZIALES', 'D19_TELKO_ONLINE_QUOTE_12', 'D19_VERSAND_ONLINE_QUOTE_12', 'D19_VERSI_ONLINE_QUOTE_12', 'EXTSEL992', 'KBA05_ALTER1', 'KBA05_ALTER2', 'KBA05_ALTER3', 'KBA05_ALTER4', 'KBA05_ANHANG', 'KBA05_ANTG1', 'KBA05_ANTG2', 'KBA05_ANTG3', 'KBA05_ANTG4', 'KBA05_AUTOQUOT', 'KBA05_BAUMAX', 'KBA05_CCM1', 'KBA05_CCM2', 'KBA05_CCM3', 'KBA05_CCM4', 'KBA05_DIESEL', 'KBA05_FRAU', 'KBA05_GBZ', 'KBA05_HERST1', 'KBA05_HERST2', 'KBA05_HERST3', 'KBA05_HERST4', 'KBA05_HERST5', 'KBA05_KRSAQUOT', 'KBA05_KRSHERST1', 'KBA05_KRSHERST2', 'KBA05_KRSHERST3', 'KBA05_KRSKLEIN', 'KBA05_KRSOBER', 'KBA05_KRSVAN', 'KBA05_KRSZUL', 'KBA05_KW1', 'KBA05_KW2', 'KBA05_KW3', 'KBA05_MAXAH', 'KBA05_MAXBJ', 'KBA05_MAXHERST', 'KBA05_MAXSEG', 'KBA05_MAXVORB', 'KBA05_MOD1', 'KBA05_MOD2', 'KBA05_MOD3'

1.4 Find the columns of object dtypes: either convert the column into int/float dtypes, or drop the column

In [15]:
azdias_object = azdias_NaN_drop.select_dtypes(include='object')

azdias_object.head()

Unnamed: 0,CAMEO_DEU_2015,CAMEO_DEUG_2015,CAMEO_INTL_2015,EINGEFUEGT_AM,OST_WEST_KZ
0,,,,,
1,8A,8.0,51.0,1992-02-10 00:00:00,W
2,4C,4.0,24.0,1992-02-12 00:00:00,W
3,2A,2.0,12.0,1997-04-21 00:00:00,W
4,6B,6.0,43.0,1992-02-12 00:00:00,W


CAMEO_DEU_2015 is categorical information. It is hard to use dummy variables as it has a large number of categories. Meanwhile, the major information about social class is also documented in CAMEO_DEUG_2015. Therefore, I decided to drop it. 

CAMEO_DEUG_2015 is ordinal information. I decided to convert it into integer. Notice that CAMEO_DEUG_2015 has 'X' for unknown. 

CAMEO_INTL_2015 has two information: the first number denotes family wealth, the second number denotes family composition. So I decided to split it into two columns. Notice that CAMEO_INTL_2015 has 'XX' for unknown. 

EINGEFUEGT_AM not much information as to what it is, so I decided to drop it. 

OST_WEST_KZ is converted to a dummy variable, as there is only West(W) and Other(O). If it's West, the value is 1; else 0.


In [16]:
def process_ob_col(df):
    '''
    It does data processing on the columns that have object as dtypes
    
    Input:
    df: original Dataframe;
    
    Output:
    processed DataFrame with only float/int values 
    '''
    #convert some remaining unknown cases into NaN 
    object_NaN_dict = {'CAMEO_DEUG_2015':['X'], 'CAMEO_INTL_2015': ['XX']}
    for col in list(object_NaN_dict.keys()):
        df.loc[:, col] = df.loc[:, col].apply(lambda x: np.nan if x in object_NaN_dict[col] else x)
        
    #drop two columns that do not have much additional information
    df_1 = df.drop(axis=1, columns=['CAMEO_DEU_2015', 'EINGEFUEGT_AM'], inplace=False)
    
    #convert Cameo_Deug_2015 into integers
    df_1.loc[:,'CAMEO_DEUG_2015'] = df_1.loc[:,'CAMEO_DEUG_2015'].apply(lambda x: float(x))
    
    #split CAMEO_INTL_2015 into two columns 
    df_1.loc[:,'CAMEO_INTL_2015'] = df_1.loc[:,'CAMEO_INTL_2015'].apply(lambda x: str(x))
    df_1.loc[:,'CAMEO_INTL_FAM_Wealth'] = [int(df_1['CAMEO_INTL_2015'].iloc[i][0]) if df_1['CAMEO_INTL_2015'].iloc[i] != 'nan' else np.nan for i in range(df_1.shape[0])]
    df_1.loc[:, 'CAMEO_INTL_FAM_COMPOSITION'] = [int(df_1['CAMEO_INTL_2015'].iloc[i][1]) if str(df_1['CAMEO_INTL_2015'].iloc[i]) != 'nan' else np.nan for i in range(df_1.shape[0])]

    #convert OST_WEST_KZ into a dummy variable
    OST_dict = {'W': 1, 'E':0}
    df_1['OST_WEST_KZ'] = df_1['OST_WEST_KZ'].map(OST_dict)
    
    df_2 = df_1.drop(axis=1, columns = 'CAMEO_INTL_2015', inplace = False)
    return df_2

In [17]:
azdias_process = process_ob_col(azdias_NaN_drop)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)


In [18]:
azdias_process.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 278 entries, LNR to CAMEO_INTL_FAM_COMPOSITION
dtypes: float64(191), int64(87)
memory usage: 21.2 MB


1.5 Delete the rows with high percentage of NaN. Fill all NaN with column mean. 

In [19]:
def drop_row(df):
    """Drop the rows of the DataFrame that has over 15% NaN
       
       Input:original DataFrame; 
       
       Output:a transformed DataFrame that dropped high NaN rows"""
       
    thresh = int(0.85*df.shape[1])
    df_1 = df.dropna(axis=0, thresh=thresh, inplace=False)
    
    return df_1


In [20]:
azdias_drop_row = drop_row(azdias_process)

In [21]:
azdias_drop_row.head()

Unnamed: 0,LNR,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB,CAMEO_INTL_FAM_Wealth,CAMEO_INTL_FAM_COMPOSITION
1,910220,9.0,11.0,0.0,0.0,2.0,12.0,0.0,3.0,6.0,...,8.0,11.0,10.0,9.0,4.0,5,2,1,5.0,1.0
2,910225,9.0,10.0,0.0,0.0,1.0,7.0,0.0,3.0,2.0,...,9.0,9.0,6.0,9.0,2.0,5,2,3,2.0,4.0
3,910226,1.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,4.0,...,7.0,10.0,11.0,9.0,7.0,3,2,4,1.0,2.0
4,910241,1.0,3.0,0.0,0.0,4.0,3.0,0.0,4.0,2.0,...,3.0,5.0,4.0,9.0,3.0,4,1,3,4.0,3.0
5,910244,1.0,5.0,0.0,0.0,1.0,2.0,0.0,2.0,6.0,...,10.0,7.0,4.0,9.0,7.0,4,2,1,5.0,4.0


In [22]:
def fill_col_mean(df):
    """Fill the NaN with column mean
       
       Input: original DataFrame
       
       Output: transformed DataFrame that fills all NaN"""
    
    col_mean = df.mean()
    df_1 = df.fillna(col_mean, inplace=False)
    
    return df_1

In [23]:
azdias_finished = fill_col_mean(azdias_drop_row)

In [24]:
azdias_finished.isnull().any().any()

False

In [25]:
-1 in azdias_finished #make sure there is no unknown cases, which tend to be denoted by -1

False

1.6 Final Step for Data Cleaning: I created a function that collects all previous data cleaning procedures. So that for all the rest of DataFrames, I will just need to run this function on them for all preprocessing. 

In [26]:
def data_process(df):
    """All data cleaning process for the raw DataFrame from csv file 
       input: original dataframe 
       return: processed dataframe that has only integer/float as its values and no NaN"""
    df_1 = convert_unknown_to_NaN(df) #convert the unknown values to NaN
    df_2 = df_1.drop(axis=1, columns=column_NaN, inplace=False) #drop the columns that have high percentage of NaN
    df_3 = process_ob_col(df_2) #process the columns that have object dtypes 
    df_4 = drop_row(df_3) #drop the rows that have a high percentage of NaN 
    df_5 = fill_col_mean(df_4) #fill the remaining NaN with column means
    
    return df_5 
    

In [27]:
del azdias_1, azdias_NaN, azdias_NaN_drop, azdias_object, azdias_process, azdias_drop_row, azdias_finished

In [28]:
azdias_processed = data_process(azdias)

In [29]:
azdias_processed.head()

Unnamed: 0,LNR,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB,CAMEO_INTL_FAM_Wealth,CAMEO_INTL_FAM_COMPOSITION
1,910220,9.0,11.0,0.0,0.0,2.0,12.0,0.0,3.0,6.0,...,8.0,11.0,10.0,9.0,4.0,5,2,1,5.0,1.0
2,910225,9.0,10.0,0.0,0.0,1.0,7.0,0.0,3.0,2.0,...,9.0,9.0,6.0,9.0,2.0,5,2,3,2.0,4.0
3,910226,1.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,4.0,...,7.0,10.0,11.0,9.0,7.0,3,2,4,1.0,2.0
4,910241,1.0,3.0,0.0,0.0,4.0,3.0,0.0,4.0,2.0,...,3.0,5.0,4.0,9.0,3.0,4,1,3,4.0,3.0
5,910244,1.0,5.0,0.0,0.0,1.0,2.0,0.0,2.0,6.0,...,10.0,7.0,4.0,9.0,7.0,4,2,1,5.0,4.0


In [30]:
azdias_processed.isnull().any().any()

False

## PCA Model

2.1 Final minor adjustment to the data to facilitate the PCA model training. First I noticed that LNR is unique for each row and functions as an identifier for each individual. So I decided to drop it. Meanwhile, I want to normalize the data using a MinMaxScaler, so that the model can be trained faster. 

In [31]:
azdias_processed.drop(axis=1, columns = 'LNR', inplace=True)

In [32]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range= (0, 1)) #set the range to be between 0 and 1 for all values in the dataframe 

azdias_scaled= pd.DataFrame(scaler.fit_transform(azdias_processed.astype(float)))
azdias_scaled.index = azdias_processed.index
azdias_scaled.columns = azdias_processed.columns
azdias_scaled.head()


Unnamed: 0,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,CAMEO_DEUG_2015,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB,CAMEO_INTL_FAM_Wealth,CAMEO_INTL_FAM_COMPOSITION
1,1.0,0.018487,0.0,0.0,0.044444,0.026726,0.0,0.25,0.833333,0.875,...,0.7,0.833333,0.9,1.0,0.5,0.8,1.0,0.0,1.0,0.0
2,1.0,0.016807,0.0,0.0,0.022222,0.01559,0.0,0.25,0.166667,0.375,...,0.8,0.666667,0.5,1.0,0.25,0.8,1.0,0.25,0.25,0.75
3,0.0,0.001681,0.0,0.0,0.0,0.004454,0.0,0.125,0.5,0.125,...,0.6,0.75,1.0,1.0,0.875,0.4,1.0,0.375,0.0,0.25
4,0.0,0.005042,0.0,0.0,0.088889,0.006682,0.0,0.375,0.166667,0.625,...,0.2,0.333333,0.3,1.0,0.375,0.6,0.0,0.25,0.75,0.5
5,0.0,0.008403,0.0,0.0,0.022222,0.004454,0.0,0.125,0.833333,0.875,...,0.9,0.5,0.3,1.0,0.875,0.6,1.0,0.0,1.0,0.75


In [96]:
azdias_scaled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 785421 entries, 1 to 891220
Columns: 277 entries, AKT_DAT_KL to CAMEO_INTL_FAM_COMPOSITION
dtypes: float64(277)
memory usage: 1.6 GB


2.2 Define a PCA Model

In [33]:
import boto3
import sagemaker 
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()
bucket_name = session.default_bucket()

In [34]:
prefix = 'Arvato_Project'

output_path='s3://{}/{}/'.format(bucket_name, prefix)

print('Training artifacts will be uploaded to: {}'.format(output_path))

Training artifacts will be uploaded to: s3://sagemaker-us-west-1-178050996200/Arvato_Project/


In [35]:
# define a PCA model
from sagemaker import PCA

N_COMPONENTS=276  #This is a rule of thumb: n-1, for starting the PCA model 

pca_SM = PCA(role=role,
             instance_count=1,
             instance_type='ml.m5.2xlarge',
             output_path=output_path,
             num_components=N_COMPONENTS, 
             sagemaker_session=session)

In [36]:
train_data_np = azdias_scaled.values.astype('float32')

In [37]:
# convert to RecordSet format
formatted_train_data = pca_SM.record_set(train_data_np)

2.3 Train the Model 

In [38]:
%%time

# train the PCA mode on the formatted data
pca_SM.fit(formatted_train_data) 

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-06-07 00:32:44 Starting - Starting the training job...
2021-06-07 00:32:47 Starting - Launching requested ML instancesProfilerReport-1623025964: InProgress
......
2021-06-07 00:34:13 Starting - Preparing the instances for training.........
2021-06-07 00:35:39 Downloading - Downloading input data
2021-06-07 00:35:39 Training - Downloading the training image..[34mDocker entrypoint called with argument(s): train[0m
[34mRunning default environment configuration script[0m
[34m[06/07/2021 00:35:56 INFO 139699241162560] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'algorithm_mode': 'regular', 'subtract_mean': 'true', 'extra_components': '-1', 'force_dense': 'true', 'epochs': 1, '_log_level': 'info', '_kvstore': 'dist_sync', '_num_kv_servers': 'auto', '_num_gpus': 'auto'}[0m
[34m[06/07/2021 00:35:56 INFO 139699241162560] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'feature

2.4 Access Model Attributes.

In [39]:
training_job_name= 'pca-2021-06-07-00-32-44-462'

In [40]:
import os 
model_key = os.path.join(prefix, training_job_name, 'output/model.tar.gz')
#download and unzip the model 
boto3.resource('s3').Bucket(bucket_name).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
os.system('unzip model_algo-1')

2304

In [41]:
import mxnet as mx 

In [42]:
pca_model_params = mx.ndarray.load('model_algo-1')

2.5 Pick the top n components that explain 80% variance in the data

In [43]:
s=pd.DataFrame(pca_model_params['s'].asnumpy()) #get the s value for each component
print(s)

               0
0            NaN
1       3.780533
2      11.522208
3      12.418921
4      15.376204
..           ...
271   815.508240
272   919.391479
273   963.828003
274  1116.289307
275  1366.424927

[276 rows x 1 columns]


In [44]:
def explained_variance(s, n_top_components):
    '''Calculates the approx. data variance that n_top_components captures.
       :param s: A dataframe of singular values for top components; 
           the top value is in the last row.
       :param n_top_components: An integer, the number of top components to use.
       :return: The expected data variance covered by the n_top_components.'''
    start_idx = N_COMPONENTS - n_top_components 
    sum = 0
    for i in range(start_idx, N_COMPONENTS):
        sum = sum + np.square(s.iloc[i, 0])
    exp_variance = sum / np.square(s).sum()
    
    #alternative: exp_variance = np.square(s.iloc[start_idx:,:]).sum()/np.square(s).sum()
    return exp_variance

In [49]:
#test how many components can capture 80% variance of the data

n_top_components = 80 # select a value for the number of top components

# calculate the explained variance
exp_variance = explained_variance(s, n_top_components)
print('Explained variance: ', exp_variance[0])

Explained variance:  0.8061778


Top 80 components can explain 80.6% data variance, which I think is sufficient. So I decided to reduce the features to top 80 components. 

2.5 Transform the dataframe to top 80 components to prepare for the K-Means Model. 

Upload train_data_np to S3

In [62]:
data_dir = '../data'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [63]:
df = pd.DataFrame(train_data_np)
df.to_csv(os.path.join(data_dir, 'population.csv'), header=False, index=False)

In [64]:
prefix = 'Arvato_Project'

population_location = session.upload_data(os.path.join(data_dir, 'population.csv'), key_prefix=prefix)

In [65]:
pca_transformer = pca_SM.transformer(instance_count = 1, instance_type = 'ml.m5.xlarge')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


In [66]:
pca_transformer.transform(population_location, content_type='text/csv', split_type='Line')

........................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[34m[06/07/2021 03:17:30 INFO 140359184901952] loaded entry point class algorithm.serve.server_config:config_api[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] nvidia-smi: took 0.033 seconds to run.[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] nvidia-smi identified 0 GPUs.[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] loading entry points[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] Loaded iterator creator application/x-labeled-vector-protobuf for content type ('application/x-labeled-vector-protobuf', '1.0')[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] Loaded iterator creator protobuf for content type ('protobuf', '1.0')[0m
[34m[06/07/2021 03:17:31 

In [67]:
pca_transformer.wait()

[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[35mDocker entrypoint called with argument(s): serve[0m
[35mRunning default environment configuration script[0m
[34m[06/07/2021 03:17:30 INFO 140359184901952] loaded entry point class algorithm.serve.server_config:config_api[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] nvidia-smi: took 0.033 seconds to run.[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] nvidia-smi identified 0 GPUs.[0m
[35m[06/07/2021 03:17:30 INFO 140359184901952] loaded entry point class algorithm.serve.server_config:config_api[0m
[35m[06/07/2021 03:17:31 INFO 140359184901952] nvidia-smi: took 0.033 seconds to run.[0m
[35m[06/07/2021 03:17:31 INFO 140359184901952] nvidia-smi identified 0 GPUs.[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] loading entry points[0m
[34m[06/07/2021 03:17:31 INFO 140359184901952] Loaded iterator creator application/x-labeled-vector-protobuf for

In [70]:
!aws s3 cp --recursive $pca_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/pca-2021-06-07-03-13-37-419/population.csv.out to ../data/population.csv.out


In [75]:
pca_azdias = pd.read_csv(os.path.join(data_dir, 'population.csv.out'), header=None)

In [76]:
pca_azdias.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,266,267,268,269,270,271,272,273,274,275
0,"{""projection"":[0.0",0.002962,0.003644,-0.009814,-0.000183,0.000558,-0.041419,-0.000217,-0.002356,-0.002631,...,1.010599,-0.013584,0.14347,0.412984,-0.238003,0.805062,-1.091954,1.098113,-1.369603,1.647016286849975]}
1,"{""projection"":[-0.0",-0.004227,0.000597,0.002544,-0.004696,-0.001938,0.009331,0.001769,0.000555,0.015348,...,-0.037286,-0.344759,-0.197108,-0.49863,-0.85072,1.104947,-1.63598,0.710835,-0.276317,0.007751166820526]}
2,"{""projection"":[-0.0",0.001334,0.002924,-0.018144,-0.004146,-0.001604,-0.008281,-2.7e-05,0.001746,0.016393,...,-0.3064,-0.50038,-0.020077,0.819936,0.575298,0.533876,-0.637406,0.852452,1.328577,-1.270629405975341]}
3,"{""projection"":[0.0",0.000383,-0.001195,0.009304,0.004494,-0.056253,0.006406,0.007586,0.085912,-0.035176,...,0.132282,-0.385286,-0.96228,0.527872,0.869326,-1.042782,0.067069,-2.146102,-1.237431,-1.113250732421875]}
4,"{""projection"":[-0.0",-0.00391,9.4e-05,-0.003612,0.000411,-0.021549,-0.016386,-0.023148,0.006358,0.051482,...,0.212654,0.444882,-1.00343,0.764915,-0.116067,0.096514,-0.648185,0.573537,0.475229,-0.505623340606689]}


In [91]:
def clean_out(df):
    """Clean the DataFrame that directly reads from model output, as column 0 and column 275 are messy
       input: the original DataFrame. 
       The function cleaned the original DataFrame so that the transformed DataFrame has output all in the type of float
       Return: None, as the cleaning is on the original DataFrame directly"""
    df.iloc[:,0] = df.iloc[:, 0].apply(lambda x: x.split("["))
    df.iloc[:,0] = df.iloc[:, 0].apply(lambda x: float(x[1]))
    df.iloc[:, 275] = df.iloc[:, 275].apply(lambda x: x.split("]"))
    df.iloc[:, 275] = df.iloc[:, 275].apply(lambda x: float(x[0]))

In [92]:
clean_out(pca_azdias)

In [95]:
pca_azdias.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785421 entries, 0 to 785420
Columns: 276 entries, 0 to 275
dtypes: float64(276)
memory usage: 1.6 GB


In [97]:
# create dimensionality-reduced data

def create_transformed_df(pca_azdias, n_top_components):
    ''' Return a dataframe of data points with component features. 
        The dataframe should contain component values.
        :param pca_azdias: A DataFrame of values for all pca components, returned by a PCA model.
        :param n_top_components: An integer, the number of top components to use.
        :return: A DataFrame with top n component as columns. The order is listed from most significant to least significant.      
     '''
    end=pca_azdias.shape[1]
    start=end - n_top_components
    
    pca_selected = pca_azdias.iloc[:, start:end]
    transformed_azdias= pca_selected.iloc[:, ::-1]
    
    return transformed_azdias

In [100]:
n = 80 
transformed_pca_azdias = create_transformed_df(pca_azdias, n)

In [101]:
transformed_pca_azdias.head()

Unnamed: 0,275,274,273,272,271,270,269,268,267,266,...,205,204,203,202,201,200,199,198,197,196
0,1.647016,-1.369603,1.098113,-1.091954,0.805062,-0.238003,0.412984,0.14347,-0.013584,1.010599,...,-0.186213,-0.083994,-0.027546,0.158789,-0.14422,0.426835,-0.173856,-0.016744,-0.479797,-0.2977
1,0.007751,-0.276317,0.710835,-1.63598,1.104947,-0.85072,-0.49863,-0.197108,-0.344759,-0.037286,...,-0.098447,-0.141517,-0.099606,0.010261,0.187617,0.030561,-0.211164,0.098304,-0.175605,-0.125958
2,-1.270629,1.328577,0.852452,-0.637406,0.533876,0.575298,0.819936,-0.020077,-0.50038,-0.3064,...,-0.189295,-0.143599,-0.204862,-0.299902,-0.403524,0.187218,-0.676447,-0.122719,-0.398483,0.259492
3,-1.113251,-1.237431,-2.146102,0.067069,-1.042782,0.869326,0.527872,-0.96228,-0.385286,0.132282,...,-0.191818,-0.225572,0.295389,-0.428186,-0.03043,-0.034004,-0.022518,-0.079278,-0.107772,-0.522634
4,-0.505623,0.475229,0.573537,-0.648185,0.096514,-0.116067,0.764915,-1.00343,0.444882,0.212654,...,0.054639,-0.026198,-0.014021,0.226206,0.159745,0.228327,0.320978,0.129235,-0.178949,-0.031519


In [102]:
column_names = []
for i in range(1, n+1):
    column_names.append('c_'+str(i)) 

transformed_pca_azdias.columns = column_names

transformed_pca_azdias.head()

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,...,c_71,c_72,c_73,c_74,c_75,c_76,c_77,c_78,c_79,c_80
0,1.647016,-1.369603,1.098113,-1.091954,0.805062,-0.238003,0.412984,0.14347,-0.013584,1.010599,...,-0.186213,-0.083994,-0.027546,0.158789,-0.14422,0.426835,-0.173856,-0.016744,-0.479797,-0.2977
1,0.007751,-0.276317,0.710835,-1.63598,1.104947,-0.85072,-0.49863,-0.197108,-0.344759,-0.037286,...,-0.098447,-0.141517,-0.099606,0.010261,0.187617,0.030561,-0.211164,0.098304,-0.175605,-0.125958
2,-1.270629,1.328577,0.852452,-0.637406,0.533876,0.575298,0.819936,-0.020077,-0.50038,-0.3064,...,-0.189295,-0.143599,-0.204862,-0.299902,-0.403524,0.187218,-0.676447,-0.122719,-0.398483,0.259492
3,-1.113251,-1.237431,-2.146102,0.067069,-1.042782,0.869326,0.527872,-0.96228,-0.385286,0.132282,...,-0.191818,-0.225572,0.295389,-0.428186,-0.03043,-0.034004,-0.022518,-0.079278,-0.107772,-0.522634
4,-0.505623,0.475229,0.573537,-0.648185,0.096514,-0.116067,0.764915,-1.00343,0.444882,0.212654,...,0.054639,-0.026198,-0.014021,0.226206,0.159745,0.228327,0.320978,0.129235,-0.178949,-0.031519


In [103]:
transformed_pca_azdias.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785421 entries, 0 to 785420
Data columns (total 80 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   c_1     785421 non-null  float64
 1   c_2     785421 non-null  float64
 2   c_3     785421 non-null  float64
 3   c_4     785421 non-null  float64
 4   c_5     785421 non-null  float64
 5   c_6     785421 non-null  float64
 6   c_7     785421 non-null  float64
 7   c_8     785421 non-null  float64
 8   c_9     785421 non-null  float64
 9   c_10    785421 non-null  float64
 10  c_11    785421 non-null  float64
 11  c_12    785421 non-null  float64
 12  c_13    785421 non-null  float64
 13  c_14    785421 non-null  float64
 14  c_15    785421 non-null  float64
 15  c_16    785421 non-null  float64
 16  c_17    785421 non-null  float64
 17  c_18    785421 non-null  float64
 18  c_19    785421 non-null  float64
 19  c_20    785421 non-null  float64
 20  c_21    785421 non-null  float64
 21  c_22    78

2.6 As we will eventually run the same K-Means on the customer data for the comparison. I decided to transform the customer data too to prepare for the K-Means Model. 

In [104]:
#drop the additional three columns in customers DataFrame, as they do not have correspondents in the population dataframe
customers.drop(axis=1, columns = ['CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'], inplace=True)

In [105]:
customers_processed = data_process(customers)

In [107]:
customers_processed.isnull().any().any()

False

In [108]:
customers_processed.drop(axis=1, columns = 'LNR', inplace=True)

In [109]:
scaler_2 = MinMaxScaler(feature_range= (0, 1))

customers_scaled= pd.DataFrame(scaler_2.fit_transform(customers_processed.astype(float)))
customers_scaled.index = customers_processed.index
customers_scaled.columns = customers_processed.columns
customers_scaled.head()

Unnamed: 0,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,CAMEO_DEUG_2015,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB,CAMEO_INTL_FAM_Wealth,CAMEO_INTL_FAM_COMPOSITION
0,0.0,0.001912,0.0,0.0,0.095238,0.002667,0.0,0.0,0.333333,0.0,...,0.4,0.166667,0.1,1.0,0.875,0.4,0.0,0.375,0.0,0.5
2,0.0,0.001912,0.0,0.0,0.047619,0.002667,0.0,0.25,1.0,0.5,...,0.9,1.0,1.0,1.0,0.25,0.4,1.0,0.375,0.5,0.75
3,0.0,0.0,0.003213,0.0,0.0,0.002667,0.0,0.0,1.0,0.375,...,0.5,0.25,0.1,1.0,0.875,0.0,0.0,0.375,0.25,0.75
4,0.0,0.013384,0.0,0.0,0.190476,0.018667,0.0,0.25,0.333333,0.75,...,0.2,0.333333,0.3,1.0,0.375,0.0,0.0,0.25,0.75,0.0
5,0.0,0.001912,0.0,0.0,0.095238,0.002667,0.0,0.25,1.0,0.5,...,0.0,0.083333,0.0,1.0,0.125,0.2,0.0,0.25,0.5,0.75


In [112]:
customers_scaled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140371 entries, 0 to 191651
Columns: 277 entries, AKT_DAT_KL to CAMEO_INTL_FAM_COMPOSITION
dtypes: float64(277)
memory usage: 297.7 MB


In [113]:
customers_scaled.to_csv(os.path.join(data_dir, 'customers.csv'), header=False, index=False)

In [115]:
customers_location = session.upload_data(os.path.join(data_dir, 'customers.csv'), key_prefix=prefix)

In [117]:
pca_transformer.transform(customers_location, content_type='text/csv', split_type='Line')

.............................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[35mDocker entrypoint called with argument(s): serve[0m
[35mRunning default environment configuration script[0m
[34m[06/07/2021 05:15:32 INFO 140289434404672] loaded entry point class algorithm.serve.server_config:config_api[0m
[34m[06/07/2021 05:15:32 INFO 140289434404672] nvidia-smi: took 0.030 seconds to run.[0m
[34m[06/07/2021 05:15:32 INFO 140289434404672] nvidia-smi identified 0 GPUs.[0m
[34m[06/07/2021 05:15:32 INFO 140289434404672] loading entry points[0m
[34m[06/07/2021 05:15:32 INFO 140289434404672] Loaded iterator creator application/x-labeled-vector-protobuf for content type ('application/x-labeled-vector-protobuf', '1.0')[0m
[34m[06/07/2021 05:15:32 INFO 140289434404672] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/07/2021 05:15:32 I

In [120]:
!aws s3 cp --recursive $pca_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/pca-2021-06-07-05-10-53-190/customers.csv.out to ../data/customers.csv.out


In [121]:
pca_customers = pd.read_csv(os.path.join(data_dir, 'customers.csv.out'), header=None)

In [122]:
clean_out(pca_customers)

In [123]:
n = 80 
transformed_pca_customers = create_transformed_df(pca_customers, n)

In [126]:
column_names = []
for i in range(1, n+1):
    column_names.append('c_'+str(i)) 

transformed_pca_customers.columns = column_names

transformed_pca_customers.head()

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,...,c_71,c_72,c_73,c_74,c_75,c_76,c_77,c_78,c_79,c_80
0,-2.252123,1.753698,0.103065,-1.090806,-1.064244,0.632722,-0.057567,0.160269,-0.214497,0.117398,...,0.32523,-0.0183,-0.374028,-0.132821,0.324162,0.532375,-0.025256,-0.04794,0.020278,0.080819
1,0.513018,2.112345,-1.026261,-0.264072,0.612925,1.245031,-0.092872,0.297801,-1.008559,-0.562731,...,-0.261533,-0.246735,0.454559,-0.16498,-0.288754,0.086745,-0.026099,0.085968,-0.013625,0.107346
2,-1.654017,1.695449,0.534332,0.014496,-1.015586,-0.519252,1.286682,-0.367541,-0.005661,-0.56822,...,-0.180873,-0.118214,-0.071836,0.168075,0.073534,0.160634,-0.164807,-0.394947,0.088519,0.071462
3,-0.745471,-1.49605,-1.596949,0.785237,-0.59872,1.134042,-0.452675,-0.63122,0.319046,0.595553,...,-0.164907,-0.054612,-0.742969,-0.476534,0.353693,0.105524,-0.123342,-0.56917,0.389028,-0.307141
4,-1.194546,1.270095,-1.568836,-1.421759,-0.930147,-0.161934,0.206986,0.05401,-0.552936,-0.011217,...,0.315226,0.16611,0.384029,0.09158,0.098449,-0.14668,0.289756,-0.714226,-0.063649,-0.075566


## K-Means 

3.1 I used within-cluster sum-of-squares, i.e. inertia, as a metric to find the best number of clusters. I chose 8 as the number of clusters, as beyond 8, the decrease of inertia slows down significantly. Here I trained the whole population data on the K-Means model. 

In [125]:
from sagemaker import KMeans

In [143]:
cluster_num = 8
kmeans = KMeans(role=role,
                instance_count=1,
                instance_type='ml.m5.2xlarge',
                output_path=output_path, 
                k=cluster_num,
                sagemaker_session=session)

In [144]:
azdias_np = transformed_pca_azdias.values.astype('float32')

In [145]:
formatted_azdias = kmeans.record_set(azdias_np)

In [146]:
kmeans.fit(formatted_azdias)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-06-07 22:56:28 Starting - Starting the training job...
2021-06-07 22:56:34 Starting - Launching requested ML instancesProfilerReport-1623106588: InProgress
.........
2021-06-07 22:58:00 Starting - Preparing the instances for training...
2021-06-07 22:58:58 Downloading - Downloading input data...
2021-06-07 22:59:26 Training - Training image download completed. Training in progress.
2021-06-07 22:59:26 Uploading - Uploading generated training model.[34mDocker entrypoint called with argument(s): train[0m
[34mRunning default environment configuration script[0m
[34m[06/07/2021 22:59:23 INFO 140127454623552] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'init_method': 'random', 'mini_batch_size': '5000', 'epochs': '1', 'extra_center_factor': 'auto', 'local_lloyd_max_iter': '300', 'local_lloyd_tol': '0.0001', 'local_lloyd_init_method': 'kmeans++', 'local_lloyd_num_trials': 'auto', 'half_life_time_size': '0', 'eva

3.2 Create a transformer based on the trained K-Means model. Batch transform both population data and customers data for comparison. 

In [147]:
kmeans_transformer = kmeans.transformer(instance_count = 1, instance_type = 'ml.m5.xlarge')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


In [149]:
transformed_pca_customers.to_csv(os.path.join(data_dir, 'pca_customers.csv'), header=False, index=False)
transformed_pca_azdias.to_csv(os.path.join(data_dir, 'pca_azdias.csv'), header=False, index=False)

In [150]:
pca_azdias_location = session.upload_data(os.path.join(data_dir, 'pca_azdias.csv'), key_prefix=prefix)
pca_customers_location = session.upload_data(os.path.join(data_dir, 'pca_customers.csv'), key_prefix=prefix)

In [151]:
kmeans_transformer.transform(pca_azdias_location, content_type='text/csv', split_type='Line')

...........................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] loading entry points[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] loaded request iterator application/json[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] loaded request iterator application/jsonlines[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] loaded request iterator application/x-recordio-protobuf[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] loaded request iterator text/csv[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] loaded response encoder application/json[0m
[34m[06/07/2021 23:35:47 INFO 139859105855296] loaded response encoder application/jsonlines[0m
[34m[06/07/2021 23:35:47 INFO 1398591058

In [152]:
!aws s3 cp --recursive $kmeans_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/kmeans-2021-06-07-23-31-24-233/pca_azdias.csv.out to ../data/pca_azdias.csv.out


In [153]:
df_azdias_kmeans = pd.read_csv(os.path.join(data_dir, 'pca_azdias.csv.out'), header=None)

In [154]:
df_azdias_kmeans.head()

Unnamed: 0,0,1
0,"{""closest_cluster"": 4.0","""distance_to_cluster"": 3.194801092147827}"
1,"{""closest_cluster"": 4.0","""distance_to_cluster"": 3.446899652481079}"
2,"{""closest_cluster"": 7.0","""distance_to_cluster"": 3.2063260078430176}"
3,"{""closest_cluster"": 3.0","""distance_to_cluster"": 4.0568084716796875}"
4,"{""closest_cluster"": 2.0","""distance_to_cluster"": 3.3404178619384766}"


In [155]:
kmeans_transformer.transform(pca_customers_location, content_type='text/csv', split_type='Line')

..........................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[35mDocker entrypoint called with argument(s): serve[0m
[35mRunning default environment configuration script[0m
[34m[06/07/2021 23:43:58 INFO 140692085131072] loading entry points[0m
[34m[06/07/2021 23:43:58 INFO 140692085131072] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/07/2021 23:43:58 INFO 140692085131072] loaded request iterator application/json[0m
[34m[06/07/2021 23:43:58 INFO 140692085131072] loaded request iterator application/jsonlines[0m
[34m[06/07/2021 23:43:58 INFO 140692085131072] loaded request iterator application/x-recordio-protobuf[0m
[34m[06/07/2021 23:43:58 INFO 140692085131072] loaded request iterator text/csv[0m
[34m[06/07/2021 23:43:58 INFO 140692085131072] loaded response encoder application/json[0m
[34m[06/07/2021 23:43:5

In [156]:
!aws s3 cp --recursive $kmeans_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/kmeans-2021-06-07-23-39-49-063/pca_customers.csv.out to ../data/pca_customers.csv.out


In [157]:
df_customers_kmeans = pd.read_csv(os.path.join(data_dir, 'pca_customers.csv.out'), header=None)

In [160]:
df_customers_kmeans.head()

Unnamed: 0,0,1
0,"{""closest_cluster"": 2.0","""distance_to_cluster"": 3.5577127933502197}"
1,"{""closest_cluster"": 6.0","""distance_to_cluster"": 3.769582509994507}"
2,"{""closest_cluster"": 2.0","""distance_to_cluster"": 3.3980414867401123}"
3,"{""closest_cluster"": 0.0","""distance_to_cluster"": 4.010591983795166}"
4,"{""closest_cluster"": 2.0","""distance_to_cluster"": 3.645479440689087}"


3.3 Read in and clean the data from K-Means Model output for both population data and customers data. Analyze two sets of data. 

In [168]:
def clean_kmeans_output(df):
    """Clean the dataframe that directly reads from K-Means model output
       input: the original DataFrame
       return: a cleaned DataFrame with cluster information for each individual"""
    df_1 = df.iloc[:, 0].copy()
    df_1 = df_1.apply(lambda x: x.split(":"))
    df_1 = df_1.apply(lambda x: int(float(x[1])))
    
    return df_1

In [169]:
customers_cluster = clean_kmeans_output(df_customers_kmeans)

In [172]:
customers_cluster.head()

0    2
1    6
2    2
3    0
4    2
Name: 0, dtype: int64

In [173]:
azdias_cluster = clean_kmeans_output(df_azdias_kmeans)

In [174]:
azdias_cluster.head()

0    4
1    4
2    7
3    3
4    2
Name: 0, dtype: int64

Count the percentage for each category for two sets of data. 

In [176]:
(azdias_cluster.value_counts()/azdias_cluster.count())*100

4    15.225592
1    13.795659
6    12.942486
2    12.775314
5    12.091222
7    11.745548
3    11.745421
0     9.678758
Name: 0, dtype: float64

In [177]:
(customers_cluster.value_counts()/customers_cluster.count())*100

3    32.010885
2    29.340106
6    11.693298
7    11.554381
5     9.881671
0     3.134551
1     1.689808
4     0.695300
Name: 0, dtype: float64

3.4 Calculate the percentage difference for each category between customer base and general population. 

In [184]:
azdias_percent = (azdias_cluster.value_counts()/azdias_cluster.count())*100
customers_percent = (customers_cluster.value_counts()/customers_cluster.count())*100 

In [188]:
for i in range(0,8):
    diff = customers_percent.loc[i] - azdias_percent.loc[i]
    print('category {} has {} more or less percents in customers than in general population'.format(i, diff))

category 0 has -6.544207687231854 more or less percents in customers than in general population
category 1 has -12.105850910214343 more or less percents in customers than in general population
category 2 has 16.564791731429132 more or less percents in customers than in general population
category 3 has 20.265464830480155 more or less percents in customers than in general population
category 4 has -14.530292077986726 more or less percents in customers than in general population
category 5 has -2.209551699014632 more or less percents in customers than in general population
category 6 has -1.2491871490291064 more or less percents in customers than in general population
category 7 has -0.19116703843262783 more or less percents in customers than in general population


## According to the result, being in category 3 and 2 increases the individual's chance to be a customer significantly, while being in category 4,1 and 0 decreases the individual's chance to be a customer significantly. 

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [278]:
mailout_train = pd.read_csv('Data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
mailout_train.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,RESPONSE,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,1763,2,1.0,8.0,,,,,8.0,15.0,...,5.0,2.0,1.0,6.0,9.0,3.0,3,0,2,4
1,1771,1,4.0,13.0,,,,,13.0,1.0,...,1.0,2.0,1.0,4.0,9.0,7.0,1,0,2,3
2,1776,1,1.0,9.0,,,,,7.0,0.0,...,6.0,4.0,2.0,,9.0,2.0,3,0,1,4
3,1460,2,1.0,6.0,,,,,6.0,4.0,...,8.0,11.0,11.0,6.0,9.0,1.0,3,0,2,4
4,1783,2,1.0,9.0,,,,,9.0,53.0,...,2.0,2.0,1.0,6.0,9.0,3.0,3,0,1,3


In [279]:
mailout_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Columns: 367 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int64(94), object(6)
memory usage: 120.3+ MB


## Process mailout_train

1.1 Seperate mailout_train into train_features and train_labels

In [280]:
train_features = mailout_train.drop(axis=1, columns = 'RESPONSE', inplace=False)
train_labels = mailout_train['RESPONSE']

In [281]:
train_features.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,1763,2,1.0,8.0,,,,,8.0,15.0,...,2.0,5.0,2.0,1.0,6.0,9.0,3.0,3,2,4
1,1771,1,4.0,13.0,,,,,13.0,1.0,...,3.0,1.0,2.0,1.0,4.0,9.0,7.0,1,2,3
2,1776,1,1.0,9.0,,,,,7.0,0.0,...,1.0,6.0,4.0,2.0,,9.0,2.0,3,1,4
3,1460,2,1.0,6.0,,,,,6.0,4.0,...,4.0,8.0,11.0,11.0,6.0,9.0,1.0,3,2,4
4,1783,2,1.0,9.0,,,,,9.0,53.0,...,4.0,2.0,2.0,1.0,6.0,9.0,3.0,3,1,3


In [282]:
train_labels.head()

0    0
1    0
2    0
3    0
4    0
Name: RESPONSE, dtype: int64

1.2 Process train_features to prepare for PCA batch transform 

In [284]:
#reorder the column order to make sure it is the same order as azdias 
column_order = azdias.columns.tolist()
train_features = train_features[column_order]

In [285]:
#A new data_process is needed as we don't want to delete rows: 
def data_process_2(df):
    """All data cleaning process for the raw DataFrame from csv file 
       input: original dataframe 
       return: processed dataframe that has only integer/float as its values and no NaN"""
    df_1 = convert_unknown_to_NaN(df) #convert the unknown values to NaN
    df_2 = df_1.drop(axis=1, columns=column_NaN, inplace=False) #drop the columns that have high percentage of NaN
    df_3 = process_ob_col(df_2) #process the columns that have object dtypes 
    df_4 = fill_col_mean(df_3) #fill the remaining NaN with column means
    
    return df_4

In [288]:
train_features_processed = data_process_2(train_features)

In [289]:
train_features_processed.head()

Unnamed: 0,LNR,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB,CAMEO_INTL_FAM_Wealth,CAMEO_INTL_FAM_COMPOSITION
0,1763,1.0,15.0,0.0,0.0,1.0,13.0,0.0,3.0,5.0,...,5.0,2.0,1.0,9.0,3.0,3,2,4,3.0,4.0
1,1771,4.0,1.0,0.0,0.0,2.0,1.0,0.0,2.0,5.0,...,1.0,2.0,1.0,9.0,7.0,1,2,3,3.0,2.0
2,1776,1.0,0.0,0.049574,0.0,0.0,1.0,0.0,4.0,1.0,...,6.0,4.0,2.0,9.0,2.0,3,1,4,1.0,4.0
3,1460,1.0,4.0,0.0,0.0,2.0,4.0,0.0,4.0,2.0,...,8.0,11.0,11.0,9.0,1.0,3,2,4,1.0,4.0
4,1783,1.0,53.0,0.0,0.0,1.0,44.0,0.0,3.0,4.0,...,2.0,2.0,1.0,9.0,3.0,3,1,3,4.0,1.0


In [290]:
train_features_processed.isnull().any().any()

False

In [291]:
train_features_processed.drop(axis=1, columns = 'LNR', inplace=True)

In [292]:
scaler_3 = MinMaxScaler(feature_range= (0, 1))

train_features_scaled= pd.DataFrame(scaler_3.fit_transform(train_features_processed.astype(float)))
train_features_scaled.index = train_features_processed.index
train_features_scaled.columns = train_features_processed.columns
train_features_scaled.head()

Unnamed: 0,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,CAMEO_DEUG_2015,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB,CAMEO_INTL_FAM_Wealth,CAMEO_INTL_FAM_COMPOSITION
0,0.0,0.034247,0.0,0.0,0.041667,0.03523,0.0,0.25,0.666667,0.5,...,0.4,0.083333,0.0,1.0,0.375,0.4,1.0,0.375,0.5,0.75
1,0.375,0.002283,0.0,0.0,0.083333,0.00271,0.0,0.125,0.666667,0.5,...,0.0,0.083333,0.0,1.0,0.875,0.0,1.0,0.25,0.5,0.25
2,0.0,0.0,0.002479,0.0,0.0,0.00271,0.0,0.375,0.0,0.125,...,0.5,0.25,0.1,1.0,0.25,0.4,0.0,0.375,0.0,0.75
3,0.0,0.009132,0.0,0.0,0.083333,0.01084,0.0,0.375,0.166667,0.125,...,0.7,0.833333,1.0,1.0,0.125,0.4,1.0,0.375,0.0,0.75
4,0.0,0.121005,0.0,0.0,0.041667,0.119241,0.0,0.25,0.5,0.75,...,0.1,0.083333,0.0,1.0,0.375,0.4,0.0,0.25,0.75,0.0


In [294]:
train_features_scaled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Columns: 277 entries, AKT_DAT_KL to CAMEO_INTL_FAM_COMPOSITION
dtypes: float64(277)
memory usage: 90.8 MB


1.3 Run the PCA batch transform on the train features; Clean the transformed data to show only the values for top 80 components

In [295]:
train_features_scaled.to_csv(os.path.join(data_dir, 'train_features.csv'), header=False, index=False)

In [296]:
train_features_location = session.upload_data(os.path.join(data_dir, 'train_features.csv'), key_prefix=prefix)

In [297]:
pca_transformer.transform(train_features_location, content_type='text/csv', split_type='Line')

..........................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[34m[06/08/2021 04:42:27 INFO 140596081870656] loaded entry point class algorithm.serve.server_config:config_api[0m
[34m[06/08/2021 04:42:27 INFO 140596081870656] nvidia-smi: took 0.030 seconds to run.[0m
[34m[06/08/2021 04:42:27 INFO 140596081870656] nvidia-smi identified 0 GPUs.[0m
[34m[06/08/2021 04:42:27 INFO 140596081870656] loading entry points[0m
[34m[06/08/2021 04:42:27 INFO 140596081870656] Loaded iterator creator application/x-labeled-vector-protobuf for content type ('application/x-labeled-vector-protobuf', '1.0')[0m
[34m[06/08/2021 04:42:27 INFO 140596081870656] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/08/2021 04:42:27 INFO 140596081870656] Loaded iterator creator protobuf for content type ('protobuf', '1.0')[0m
[34m[06/08/2021 04:42:2

In [298]:
!aws s3 cp --recursive $pca_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/pca-2021-06-08-04-38-12-110/train_features.csv.out to ../data/train_features.csv.out


In [299]:
pca_train_features = pd.read_csv(os.path.join(data_dir, 'train_features.csv.out'), header=None)

In [300]:
clean_out(pca_train_features)

In [301]:
n = 80 
transformed_pca_train_features = create_transformed_df(pca_train_features, n)

In [302]:
column_names = []
for i in range(1, n+1):
    column_names.append('c_'+str(i)) 

transformed_pca_train_features.columns = column_names

transformed_pca_train_features.head()

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,...,c_71,c_72,c_73,c_74,c_75,c_76,c_77,c_78,c_79,c_80
0,1.144516,1.887148,-1.882622,-0.048405,0.690094,0.020077,0.809921,0.020945,0.075813,-0.021491,...,-0.054432,-0.362911,0.284236,0.155066,0.010453,-0.049075,-0.32216,-0.088001,-0.347405,-0.073487
1,-2.439446,0.340658,0.197613,0.693218,0.979631,0.496533,0.851029,-0.06731,0.234468,0.077654,...,0.473752,-0.252177,0.22471,0.132591,0.029336,-0.059287,-0.424146,-0.242151,0.453012,-0.540896
2,-1.309988,1.616213,-0.912996,0.496057,-1.021015,-1.439816,-0.153049,0.416211,-0.023488,-1.677067,...,0.360875,0.025349,-0.312739,-0.345905,-0.184987,0.010898,0.272235,-0.180548,0.090512,0.021791
3,-0.766258,1.82262,-1.365426,-2.630953,0.756407,0.424133,-0.724646,-0.754212,-0.371986,0.397551,...,-0.066517,-0.093322,-0.24912,-0.003015,0.075503,-0.312867,-0.187745,0.148759,-0.154118,-0.506283
4,-0.149392,1.341818,-1.403229,-0.793912,-0.99283,-1.549229,0.569451,-0.270216,0.536317,0.119262,...,0.585516,-0.055558,-0.153592,-0.103277,0.180957,0.131901,0.213031,-0.011378,0.155462,0.011923


## Note: transformed_pca_train_features will be the new training_features for supervised model input, due to its reduced dimensionality

1.4 Run the K-Means model on the transformed_pca_train_features, to know which category each individual falls into. 

In [303]:
transformed_pca_train_features.to_csv(os.path.join(data_dir, 'pca_train_features.csv'), header=False, index=False)

In [304]:
pca_train_features_location = session.upload_data(os.path.join(data_dir, 'pca_train_features.csv'), key_prefix=prefix)

In [305]:
kmeans_transformer.transform(pca_train_features_location, content_type='text/csv', split_type='Line')

............................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] loading entry points[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] loaded request iterator application/json[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] loaded request iterator application/jsonlines[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] loaded request iterator application/x-recordio-protobuf[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] loaded request iterator text/csv[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] loaded response encoder application/json[0m
[34m[06/08/2021 04:51:36 INFO 140331701110592] loaded response encoder application/jsonlines[0m
[34m[06/08/2021 04:51:36 INFO 140331701

In [306]:
!aws s3 cp --recursive $kmeans_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/kmeans-2021-06-08-04-47-09-728/pca_train_features.csv.out to ../data/pca_train_features.csv.out


In [307]:
df_train_features_kmeans = pd.read_csv(os.path.join(data_dir, 'pca_train_features.csv.out'), header=None)

In [308]:
train_features_cluster = clean_kmeans_output(df_train_features_kmeans)

In [309]:
train_features_cluster.head()

0    6
1    7
2    2
3    2
4    2
Name: 0, dtype: int64

1.5 Create a DataFrame that maps each category into a normalized number that indicates the relative likelihood of the individial in this category to be the client. 

In [310]:
#create a panda Dataframe, the index is the category number (0-7). 
#the value is the difference percentage (customer percentage - population percentage) which was shown in Part 1 section 3.4
diff_list = []
for i in range(0,8):
    diff_list.append(customers_percent.loc[i] - azdias_percent.loc[i])
print(diff_list)

[-6.544207687231854, -12.105850910214343, 16.564791731429132, 20.265464830480155, -14.530292077986726, -2.209551699014632, -1.2491871490291064, -0.19116703843262783]


In [312]:
category_df = pd.DataFrame(diff_list)

In [313]:
category_df

Unnamed: 0,0
0,-6.544208
1,-12.105851
2,16.564792
3,20.265465
4,-14.530292
5,-2.209552
6,-1.249187
7,-0.191167


In [314]:
#normalize the value between (0,1)
scaler_4 = MinMaxScaler(feature_range= (0, 1))
category_scaled= pd.DataFrame(scaler_4.fit_transform(category_df.astype(float)))
category_scaled.index = category_df.index
category_scaled

Unnamed: 0,0
0,0.229513
1,0.069676
2,0.893646
3,1.0
4,0.0
5,0.354087
6,0.381687
7,0.412094


1.6 Add additional column to the train features. It can add more information to the training data that can increase model accuracy.

In [315]:
category_series = category_scaled.iloc[:,0]

In [316]:
category_dict = category_series.to_dict()

In [317]:
train_features_cluster_mapped = train_features_cluster.map(category_dict)

In [318]:
train_features_cluster_mapped.head()

0    0.381687
1    0.412094
2    0.893646
3    0.893646
4    0.893646
Name: 0, dtype: float64

In [319]:
train_features_finalized = pd.concat([transformed_pca_train_features, train_features_cluster_mapped], axis=1)

In [320]:
train_features_finalized.head()

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,...,c_72,c_73,c_74,c_75,c_76,c_77,c_78,c_79,c_80,0
0,1.144516,1.887148,-1.882622,-0.048405,0.690094,0.020077,0.809921,0.020945,0.075813,-0.021491,...,-0.362911,0.284236,0.155066,0.010453,-0.049075,-0.32216,-0.088001,-0.347405,-0.073487,0.381687
1,-2.439446,0.340658,0.197613,0.693218,0.979631,0.496533,0.851029,-0.06731,0.234468,0.077654,...,-0.252177,0.22471,0.132591,0.029336,-0.059287,-0.424146,-0.242151,0.453012,-0.540896,0.412094
2,-1.309988,1.616213,-0.912996,0.496057,-1.021015,-1.439816,-0.153049,0.416211,-0.023488,-1.677067,...,0.025349,-0.312739,-0.345905,-0.184987,0.010898,0.272235,-0.180548,0.090512,0.021791,0.893646
3,-0.766258,1.82262,-1.365426,-2.630953,0.756407,0.424133,-0.724646,-0.754212,-0.371986,0.397551,...,-0.093322,-0.24912,-0.003015,0.075503,-0.312867,-0.187745,0.148759,-0.154118,-0.506283,0.893646
4,-0.149392,1.341818,-1.403229,-0.793912,-0.99283,-1.549229,0.569451,-0.270216,0.536317,0.119262,...,-0.055558,-0.153592,-0.103277,0.180957,0.131901,0.213031,-0.011378,0.155462,0.011923,0.893646


In [321]:
train_features_finalized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Data columns (total 81 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c_1     42962 non-null  float64
 1   c_2     42962 non-null  float64
 2   c_3     42962 non-null  float64
 3   c_4     42962 non-null  float64
 4   c_5     42962 non-null  float64
 5   c_6     42962 non-null  float64
 6   c_7     42962 non-null  float64
 7   c_8     42962 non-null  float64
 8   c_9     42962 non-null  float64
 9   c_10    42962 non-null  float64
 10  c_11    42962 non-null  float64
 11  c_12    42962 non-null  float64
 12  c_13    42962 non-null  float64
 13  c_14    42962 non-null  float64
 14  c_15    42962 non-null  float64
 15  c_16    42962 non-null  float64
 16  c_17    42962 non-null  float64
 17  c_18    42962 non-null  float64
 18  c_19    42962 non-null  float64
 19  c_20    42962 non-null  float64
 20  c_21    42962 non-null  float64
 21  c_22    42962 non-null  float64
 22

## Benchmark Model: LinearLearner 

In [335]:
#instantiate a LinearLearner model
from sagemaker import LinearLearner

# instantiate LinearLearner
# here as the result is highly imbalanced, I adjusted the positve_example_weight_mult to balanced 
# I also set the criteria to be precision at target recall, to compensate the imbalanced nature of the data set  
LinearLearner_1 = LinearLearner (role = role,
                               instance_count= 1,
                               instance_type = 'ml.m5.2xlarge',
                               output_path = output_path,
                               predictor_type = 'binary_classifier', 
                               sagemaker_session = session, 
                               epochs = 30,
                               binary_classifier_model_selection_criteria='precision_at_target_recall',
                               target_recall = 0.85, 
                               positive_example_weight_mult='balanced')

In [336]:
# create RecordSet of training data
train_labels = train_labels.astype('float32')
train_features = train_features_finalized.astype('float32')

train_labels_np = train_labels.to_numpy()
train_features_np = train_features.to_numpy()

formatted_train_data = LinearLearner_1.record_set(train = train_features_np, labels = train_labels_np)

In [337]:
%%time 
# train the estimator on formatted training data
LinearLearner_1.fit(formatted_train_data)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-06-08 17:15:45 Starting - Starting the training job...
2021-06-08 17:15:47 Starting - Launching requested ML instancesProfilerReport-1623172545: InProgress
......
2021-06-08 17:17:15 Starting - Preparing the instances for training......
2021-06-08 17:18:15 Downloading - Downloading input data...
2021-06-08 17:18:44 Training - Downloading the training image..[34mDocker entrypoint called with argument(s): train[0m
[34mRunning default environment configuration script[0m
[34m[06/08/2021 17:18:53 INFO 139839868135232] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias':

The ROC-AUC score is 0.6831. I instantiated another LinearLearner model that uses accuracy as the selection criteria, in the hope to improve ROC-AUC. 

In [338]:
#tune the model a slightly differently to check the result 
LinearLearner_2 = LinearLearner (role = role,
                                 instance_count= 1,
                                 instance_type = 'ml.m5.2xlarge',
                                 output_path = output_path,
                                 predictor_type = 'binary_classifier', 
                                 sagemaker_session = session, 
                                 epochs = 30,
                                 binary_classifier_model_selection_criteria = 'accuracy',
                                 positive_example_weight_mult='balanced')

In [339]:
LinearLearner_2.fit(formatted_train_data)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-06-08 17:20:03 Starting - Starting the training job...
2021-06-08 17:20:05 Starting - Launching requested ML instancesProfilerReport-1623172803: InProgress
.........
2021-06-08 17:21:54 Starting - Preparing the instances for training......
2021-06-08 17:22:54 Downloading - Downloading input data...
2021-06-08 17:23:34 Training - Downloading the training image..[34mDocker entrypoint called with argument(s): train[0m
[34mRunning default environment configuration script[0m
[34m[06/08/2021 17:23:39 INFO 140708987352896] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bia

The roc_auc_score is 0.6831. I decided to just go with the first model: LinearLearner_1 as the ROC_AUC_Score didn't improve much.

## Custom Model - Pytorch Neural Network 

3.1 Compile model, train and predict documents for the Pytorch Neural Network model. 

In [342]:
!pygmentize Source/model.py

[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m[04m[36m.[39;49;00m[04m[36mfunctional[39;49;00m [34mas[39;49;00m [04m[36mF[39;49;00m

[37m## It lays the neural network structure: how many layers and how many nodes within each layer [39;49;00m
[37m# In the Class SimpleNet, it enables input_dim and hidden_dim to be input information[39;49;00m
[34mclass[39;49;00m [04m[32mSimpleNet[39;49;00m(nn.Module):
    
    [37m## TODO: Define the init function[39;49;00m
    [34mdef[39;49;00m [32m__init__[39;49;00m([36mself[39;49;00m, input_dim, hidden_dim, output_dim):
        [33m'''Defines layers of a neural network.[39;49;00m
[33m           :param input_dim: Number of input features[39;49;00m
[33m           :param hidden_dim: Size of h

3.2 Prepare and upload the data for Pytorch model

In [343]:
import os

def make_csv(x, y, filename, data_dir):
    '''Merges features and labels and converts them into one csv file with labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param file_name: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    # your code here
    file = pd.concat([pd.DataFrame(y), pd.DataFrame(x)], axis=1)
    file.to_csv(os.path.join(data_dir, filename),header=False, index=False)
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

In [344]:
make_csv(train_features, train_labels, 'Pytorch_train.csv', data_dir)

Path created: ../data/Pytorch_train.csv


In [347]:
path_pytorch = os.path.join(data_dir, 'Pytorch_train.csv')
Pytorch_train_location = session.upload_data(path=path_pytorch, key_prefix=prefix)

3.3 Instantiate a Pytorch model and train it

In [350]:
# import a PyTorch wrapper
from sagemaker.pytorch import PyTorch


#the estimator is made of three components: normal SageMaker estimator components, Pytorch components and 
#hyperparameters that can be parsed into the model 

estimator = PyTorch(entry_point= 'train.py',
                    source_dir = 'Source', #here we need to use not only train.py, but also model.py; so I specify the folder location 
                    framework_version ='1.0',
                    py_version = 'py3',
                    role = role,
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path = output_path,
                    sagemaker_session=session,
                    hyperparameters={
                        'input_dim': 81,  # num of features
                        'hidden_dim': 55,
                        'output_dim': 1,
                        'epochs': 150 # could change to higher
                    })


In [351]:
estimator.fit({'train': Pytorch_train_location})

2021-06-08 19:00:33 Starting - Starting the training job...
2021-06-08 19:00:34 Starting - Launching requested ML instancesProfilerReport-1623178832: InProgress
......
2021-06-08 19:01:45 Starting - Preparing the instances for training......
2021-06-08 19:03:03 Downloading - Downloading input data
2021-06-08 19:03:03 Training - Downloading the training image..[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device[0m
[34mbash: no job control in this shell[0m
[34m2021-06-08 19:03:16,057 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training[0m
[34m2021-06-08 19:03:16,060 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2021-06-08 19:03:16,072 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.[0m
[34m2021-06-08 19:03:17,484 sagemaker_pytorch_container.training INFO     Invoking user training script.[0m
[34m2021-06-08 19:03:17,715 sagemaker-container

## Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you'll be taken to the competition page where, if you have a Kaggle account, you can enter.

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.

## Prepare the data for both LinearLearner Model and Pytorch Model 

1.1 Prepare data for both models

In [358]:
mailout_test = pd.read_csv('Data/Udacity_MAILOUT_052018_TEST.csv', sep=';')

In [359]:
column_order = azdias.columns.tolist()
mailout_test = mailout_test[column_order]

In [360]:
mailout_test_processed = data_process_2(mailout_test)

In [361]:
mailout_test_processed.isnull().any().any()

False

In [362]:
identification_info = mailout_test_processed['LNR']
mailout_test_processed.drop(axis=1, columns = ['LNR'], inplace=True)

In [364]:
mailout_test_processed.shape

(42833, 277)

In [365]:
scaler_4 = MinMaxScaler(feature_range= (0, 1))

mailout_test_scaled= pd.DataFrame(scaler_4.fit_transform(mailout_test_processed.astype(float)))
mailout_test_scaled.columns = mailout_test_processed.columns
mailout_test_scaled.head()

Unnamed: 0,AKT_DAT_KL,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,CAMEO_DEUG_2015,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB,CAMEO_INTL_FAM_Wealth,CAMEO_INTL_FAM_COMPOSITION
0,0.0,0.005277,0.0,0.0,0.142857,0.005333,0.0,0.25,0.833333,0.125,...,0.4,0.416667,0.2,1.0,0.375,0.4,0.0,0.375,0.0,0.5
1,0.0,0.05277,0.0,0.0,0.071429,0.056,0.0,0.375,1.0,0.5,...,0.4,0.083333,0.0,1.0,0.625,0.4,0.0,0.375,0.5,0.0
2,1.0,0.005277,0.0,0.0,0.285714,0.005333,0.0,0.375,0.0,0.75,...,0.8,0.416667,0.2,1.0,0.5,0.4,1.0,0.375,0.75,0.0
3,0.75,0.002639,0.0,0.0,0.0,0.002667,0.0,0.375,0.0,0.125,...,0.5,0.416667,0.2,1.0,0.25,0.4,1.0,0.375,0.0,0.5
4,0.0,0.002639,0.0,0.0,0.285714,0.002667,0.0,0.25,0.833333,0.5,...,0.1,0.25,0.2,1.0,0.875,0.6,1.0,0.375,0.5,0.0


In [366]:
mailout_test_scaled.to_csv(os.path.join(data_dir, 'mail_test.csv'), header=False, index=False)

In [367]:
mailout_test_location = session.upload_data(os.path.join(data_dir, 'mail_test.csv'), key_prefix=prefix)

In [368]:
pca_transformer.transform(mailout_test_location, content_type='text/csv', split_type='Line')

...............................[34mDocker entrypoint called with argument(s): serve[0m
[35mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[34m[06/08/2021 19:22:50 INFO 140180668028736] loaded entry point class algorithm.serve.server_config:config_api[0m
[34m[06/08/2021 19:22:50 INFO 140180668028736] nvidia-smi: took 0.033 seconds to run.[0m
[34m[06/08/2021 19:22:50 INFO 140180668028736] nvidia-smi identified 0 GPUs.[0m
[34m[06/08/2021 19:22:50 INFO 140180668028736] loading entry points[0m
[34m[06/08/2021 19:22:50 INFO 140180668028736] Loaded iterator creator application/x-labeled-vector-protobuf for content type ('application/x-labeled-vector-protobuf', '1.0')[0m
[34m[06/08/2021 19:22:50 INFO 140180668028736] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/08/2021 19:22:50 INFO 140180668028736] Loaded iterator creator protobuf fo

In [370]:
!aws s3 cp --recursive $pca_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/pca-2021-06-08-19-17-54-294/mail_test.csv.out to ../data/mail_test.csv.out


In [371]:
pca_mailout_test= pd.read_csv(os.path.join(data_dir, 'mail_test.csv.out'), header=None)

In [372]:
clean_out(pca_mailout_test)

In [373]:
n = 80 
transformed_pca_mailout_test = create_transformed_df(pca_mailout_test, n)

In [376]:
column_names = []
for i in range(1, n+1):
    column_names.append('c_'+str(i)) 

transformed_pca_mailout_test.columns = column_names

transformed_pca_mailout_test.head()

Unnamed: 0,c_1,c_2,c_3,c_4,c_5,c_6,c_7,c_8,c_9,c_10,...,c_71,c_72,c_73,c_74,c_75,c_76,c_77,c_78,c_79,c_80
0,-1.5438,1.400662,-1.666795,-1.081734,-1.011073,-0.949689,0.662075,-0.06756,-0.347799,-0.603502,...,0.158677,0.075576,0.229689,-0.086758,0.202016,-0.006913,-0.306524,-0.229235,-0.185732,-0.517027
1,0.256624,1.593392,-0.817794,0.230621,-0.59063,-0.100754,0.873446,-0.434377,0.335984,0.891219,...,0.4564,-0.078358,0.040518,-0.161276,-0.071263,-0.335096,-0.173424,0.100908,0.433059,-0.5636
2,-0.864506,1.721451,-0.658591,-0.312438,0.660538,0.331321,-0.998032,0.980327,-1.140898,0.707565,...,-0.470539,-0.063139,-0.159207,0.161077,-0.192976,0.253821,-0.119096,-0.229712,-0.185274,0.1624
3,-0.370743,1.989629,0.121335,0.897753,1.075825,0.022024,-1.188638,-0.460623,-0.022873,-0.280394,...,0.184332,0.150099,0.467827,-0.462807,-0.017165,0.063159,-0.21661,0.010917,0.088971,0.33834
4,-3.513388,-1.0493,-1.10683,1.220684,0.950794,0.017198,0.91459,1.116144,1.937798,0.843823,...,-0.128742,-0.236225,-0.057347,-0.276576,0.483695,-0.094855,-0.238647,-0.090327,0.153859,0.415749


In [378]:
transformed_pca_mailout_test.shape

(42833, 80)

Transformed_pca_train_features will be used as the input information, due to its reduced dimensionality. 

In [379]:
transformed_pca_mailout_test.to_csv(os.path.join(data_dir, 'pca_mailout_test.csv'), header=False, index=False)

In [380]:
pca_mailout_test_location = session.upload_data(os.path.join(data_dir, 'pca_mailout_test.csv'), key_prefix=prefix)

In [381]:
kmeans_transformer.transform(pca_mailout_test_location, content_type='text/csv', split_type='Line')

............................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] loading entry points[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] loaded request iterator application/json[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] loaded request iterator application/jsonlines[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] loaded request iterator application/x-recordio-protobuf[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] loaded request iterator text/csv[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] loaded response encoder application/json[0m
[34m[06/08/2021 19:34:46 INFO 140473528604480] loaded response encoder application/jsonlines[0m
[34m[06/08/2021 19:34:46 INFO 140473528

In [382]:
!aws s3 cp --recursive $kmeans_transformer.output_path $data_dir

download: s3://sagemaker-us-west-1-178050996200/kmeans-2021-06-08-19-30-10-544/pca_mailout_test.csv.out to ../data/pca_mailout_test.csv.out


In [383]:
df_mailout_test_kmeans = pd.read_csv(os.path.join(data_dir, 'pca_mailout_test.csv.out'), header=None)

In [384]:
mailout_test_cluster = clean_kmeans_output(df_mailout_test_kmeans)

In [385]:
mailout_test_cluster.head()

0    2
1    2
2    7
3    7
4    3
Name: 0, dtype: int64

In [387]:
type(mailout_test_cluster)

pandas.core.series.Series

In [388]:
mailout_test_cluster_mapped = mailout_test_cluster.map(category_dict)

In [389]:
mailout_test_cluster_mapped.head()

0    0.893646
1    0.893646
2    0.412094
3    0.412094
4    1.000000
Name: 0, dtype: float64

In [390]:
mailout_test_finalized = pd.concat([transformed_pca_mailout_test, mailout_test_cluster_mapped], axis=1)

In [391]:
mailout_test_finalized.shape

(42833, 81)

## Use LinearLearner to predict

In [392]:
linear_predictor = LinearLearner_1.deploy(initial_instance_count=1, instance_type='ml.t2.xlarge')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


---------------!

In [393]:
mailout_test = mailout_test_finalized.astype('float32')
mailout_test_np = mailout_test.to_numpy()

In [394]:
prediction_batches = [linear_predictor.predict(batch) for batch in np.array_split(mailout_test_np, 100)]
test_preds = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) for batch in prediction_batches])

In [398]:
test_preds_df = pd.DataFrame(test_preds)

In [399]:
test_preds_df.head()

Unnamed: 0,0
0,0.0
1,1.0
2,0.0
3,1.0
4,0.0


In [400]:
finalized_prediction = pd.concat([identification_info, test_preds_df], axis=1)

In [404]:
finalized_prediction.head()

Unnamed: 0,LNR,RESPONSE
0,1754,0.0
1,1770,1.0
2,1465,0.0
3,1470,1.0
4,1478,0.0


In [403]:
finalized_prediction.columns = ['LNR', 'RESPONSE']

In [407]:
finalized_prediction.to_csv(os.path.join(data_dir, 'finalized_prediction_benchmark'), index=False)

Benchmark Model get a score of 0.5984

## Use Custom Pytorch Model to predict

In [408]:
pytorch_predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.xlarge')

-------------!

In [414]:
test_preds = np.squeeze(pytorch_predictor.predict(mailout_test_np))

In [417]:
type(test_preds)

numpy.ndarray

In [418]:
test_preds_df = pd.DataFrame(test_preds)

In [419]:
finalized_prediction = pd.concat([identification_info, test_preds_df], axis=1)

In [420]:
finalized_prediction.head()

Unnamed: 0,LNR,0
0,1754,0.004032807
1,1770,0.0001765097
2,1465,9.861202e-16
3,1470,2.503072e-07
4,1478,0.0070894


In [421]:
finalized_prediction.columns = ['LNR', 'RESPONSE']

In [422]:
finalized_prediction.to_csv(os.path.join(data_dir, 'finalized_prediction_Pytorch'), index=False)

## Pytorch models gets a score of 0.533. Two places to improve the model: imbalanced data; add more weight to the final category information. 