# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned to a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to become customers of the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following predetermined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable will be a blog post reporting your findings.

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shutil
import sagemaker
from sagemaker import get_execution_role
import subprocess
import json
from sklearn.model_selection import train_test_split



# magic word for producing visualizations in notebook
%matplotlib inline



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
# Set up the SageMaker session
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
execution_role = get_execution_role()  # imported from sagemaker above
region = sagemaker_session.boto_region_name



In [3]:
# Define data location constants: local directory and S3 prefixes
local_data_dir = 'data'
s3_data_path = f's3://{bucket}/data'
s3_model_path = f's3://{bucket}/model'

## Initial Model and Kaggle Submission

Below I set up an initial AutoGluon run without any refinement of the data, then submit the result to Kaggle.

In [4]:
%%capture

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon #--no-cache-dir
!pip install kaggle
!pip install python-dotenv
from autogluon.tabular import TabularPredictor


### Setting up Kaggle Creds


In [12]:
!mkdir -p kaggle
!touch kaggle/kaggle.json
!chmod 600 kaggle/kaggle.json



In [5]:
%env KAGGLE_CONFIG_DIR=kaggle

env: KAGGLE_CONFIG_DIR=kaggle


In [21]:
from dotenv import dotenv_values

CONFIG = dotenv_values('env.txt')
kaggle_username = CONFIG['KAGGLE_USERNAME']
kaggle_key = CONFIG['KAGGLE_KEY']

# Save the API token to the kaggle.json file
with open("kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))
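
The `mkdir`, `touch`, `chmod` and write steps above can also be done in one place from Python. The following is a minimal sketch, not the method used in this notebook: `write_kaggle_token` is a hypothetical helper name, and the demo uses placeholder credentials in a temporary directory rather than the real `kaggle` folder.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_kaggle_token(config_dir, username, key):
    """Write kaggle.json into config_dir and restrict it to the owner (mode 600)."""
    config_path = Path(config_dir)
    config_path.mkdir(parents=True, exist_ok=True)
    token_path = config_path / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # Owner read/write only, the same permissions as `chmod 600`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Demo with placeholder credentials in a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_token = write_kaggle_token(demo_dir, "example-user", "example-key")
demo_payload = json.loads(demo_token.read_text())
print(demo_payload["username"])
```

Setting the permissions after the write, rather than before, ensures the restriction also applies when the file did not exist yet.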

### Downloading and Prepping Data

In [138]:
train_data = pd.read_csv(f'{s3_data_path}/train.csv')



In [139]:
train_data.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,RESPONSE,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,1763,2,1.0,8.0,,,,,8.0,15.0,...,5.0,2.0,1.0,6.0,9.0,3.0,3,0,2,4
1,1771,1,4.0,13.0,,,,,13.0,1.0,...,1.0,2.0,1.0,4.0,9.0,7.0,1,0,2,3
2,1776,1,1.0,9.0,,,,,7.0,0.0,...,6.0,4.0,2.0,,9.0,2.0,3,0,1,4
3,1460,2,1.0,6.0,,,,,6.0,4.0,...,8.0,11.0,11.0,6.0,9.0,1.0,3,0,2,4
4,1783,2,1.0,9.0,,,,,9.0,53.0,...,2.0,2.0,1.0,6.0,9.0,3.0,3,0,1,3


In [140]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Columns: 367 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int64(94), object(6)
memory usage: 120.3+ MB


### Doing Simple Training
For our first submission, we drop all columns except the ones explored in the proposal section. The goal is simply to get a basic submission and a baseline to compare against at the end of the project.

In [141]:
train_data[['LNR', 'RESPONSE', 'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN', 'ANREDE_KZ']].describe()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
count,42962.0,42962.0,42357.0,42357.0,42357.0,42962.0
mean,42803.120129,0.012383,17.661071,5.274996,5.927001,1.595084
std,24778.339984,0.110589,14.085702,4.470538,3.398336,0.490881
min,1.0,0.0,0.0,0.0,1.0,1.0
25%,21284.25,0.0,6.0,2.0,3.0,1.0
50%,42710.0,0.0,15.0,4.0,5.0,2.0
75%,64340.5,0.0,32.0,10.0,9.0,2.0
max,85795.0,1.0,40.0,12.0,10.0,2.0


In [142]:
selected_columns = ['LNR', 'RESPONSE', 'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN', 'ANREDE_KZ']
init_train = train_data[selected_columns]

In [143]:
init_train.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,1763,0,8.0,2.0,3.0,2
1,1771,0,19.0,5.0,9.0,2
2,1776,0,0.0,0.0,10.0,1
3,1460,0,16.0,4.0,3.0,2
4,1783,0,9.0,3.0,6.0,1


In [144]:
init_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LNR                  42962 non-null  int64  
 1   RESPONSE             42962 non-null  int64  
 2   LP_LEBENSPHASE_FEIN  42357 non-null  float64
 3   LP_LEBENSPHASE_GROB  42357 non-null  float64
 4   LP_STATUS_FEIN       42357 non-null  float64
 5   ANREDE_KZ            42962 non-null  int64  
dtypes: float64(3), int64(3)
memory usage: 2.0 MB


In [145]:
features = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN', 'ANREDE_KZ']
init_train = init_train.dropna(subset=features).copy()

In [146]:
init_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42357 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LNR                  42357 non-null  int64  
 1   RESPONSE             42357 non-null  int64  
 2   LP_LEBENSPHASE_FEIN  42357 non-null  float64
 3   LP_LEBENSPHASE_GROB  42357 non-null  float64
 4   LP_STATUS_FEIN       42357 non-null  float64
 5   ANREDE_KZ            42357 non-null  int64  
dtypes: float64(3), int64(3)
memory usage: 2.3 MB


In [147]:
# Normalize key values
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
columns_to_normalize = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN']
init_train[columns_to_normalize] = scaler.fit_transform(init_train[columns_to_normalize])


In [148]:
init_train.describe()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
count,42357.0,42357.0,42357.0,42357.0,42357.0,42357.0
mean,42801.576764,0.012442,-1.087026e-16,6.106135e-17,4.5796010000000003e-17,1.59551
std,24780.83482,0.110848,1.000012,1.000012,1.000012,0.490799
min,1.0,0.0,-1.253845,-1.17996,-1.449845,1.0
25%,21275.0,0.0,-0.8278756,-0.7325817,-0.8613144,1.0
50%,42709.0,0.0,-0.1889223,-0.285203,-0.2727842,2.0
75%,64347.0,0.0,1.01799,1.056933,0.9042763,2.0
max,85795.0,1.0,1.585948,1.504312,1.198541,2.0
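
The `std` row above reads 1.000012 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the factor from the row count shown by `info()`; the small `values` array is a made-up example, not project data.

```python
import numpy as np

n_rows = 42357  # non-null rows after dropna, taken from init_train.info()

# Ratio between the sample (ddof=1) and population (ddof=0) standard deviation.
expected_std = np.sqrt(n_rows / (n_rows - 1))
print(round(float(expected_std), 6))

# The same effect on a tiny hypothetical column.
values = np.array([0.0, 6.0, 15.0, 32.0, 40.0])
z = (values - values.mean()) / values.std(ddof=0)  # what StandardScaler computes
print(z.std(ddof=0), z.std(ddof=1))
```

The difference is negligible at this sample size, so no correction is needed.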


In [149]:
# Map the ANREDE_KZ codes to labelled categories
init_train['ANREDE_KZ'] = init_train['ANREDE_KZ'].map({1: 'Male', 2: 'Female'}).astype('category')

In [150]:
init_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42357 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   LNR                  42357 non-null  int64   
 1   RESPONSE             42357 non-null  int64   
 2   LP_LEBENSPHASE_FEIN  42357 non-null  float64 
 3   LP_LEBENSPHASE_GROB  42357 non-null  float64 
 4   LP_STATUS_FEIN       42357 non-null  float64 
 5   ANREDE_KZ            42357 non-null  category
dtypes: category(1), float64(3), int64(2)
memory usage: 2.0 MB


In [151]:
init_train.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,1763,0,-0.685886,-0.732582,-0.861314,Female
1,1771,0,0.095057,-0.061514,0.904276,Female
2,1776,0,-1.253845,-1.17996,1.198541,Male
3,1460,0,-0.117927,-0.285203,-0.861314,Female
4,1783,0,-0.614891,-0.508892,0.021481,Male


In [152]:
init_train['RESPONSE'] = init_train['RESPONSE'].map({0: 'NOPURCHASE', 1: 'PURCHASE'}).astype('category')
init_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42357 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   LNR                  42357 non-null  int64   
 1   RESPONSE             42357 non-null  category
 2   LP_LEBENSPHASE_FEIN  42357 non-null  float64 
 3   LP_LEBENSPHASE_GROB  42357 non-null  float64 
 4   LP_STATUS_FEIN       42357 non-null  float64 
 5   ANREDE_KZ            42357 non-null  category
dtypes: category(2), float64(3), int64(1)
memory usage: 1.7 MB


In [153]:
init_train.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,1763,NOPURCHASE,-0.685886,-0.732582,-0.861314,Female
1,1771,NOPURCHASE,0.095057,-0.061514,0.904276,Female
2,1776,NOPURCHASE,-1.253845,-1.17996,1.198541,Male
3,1460,NOPURCHASE,-0.117927,-0.285203,-0.861314,Female
4,1783,NOPURCHASE,-0.614891,-0.508892,0.021481,Male


In [154]:
#Get Buyers
purchase_records = init_train[init_train['RESPONSE'] == 'PURCHASE']
purchase_records.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
90,21511,PURCHASE,1.585948,1.504312,1.198541,Female
129,61905,PURCHASE,-0.614891,-0.508892,0.315746,Male
173,15467,PURCHASE,1.514953,1.504312,1.198541,Male
205,25211,PURCHASE,-0.117927,-0.285203,-0.861314,Male
248,83461,PURCHASE,0.166052,-0.061514,1.198541,Male


In [155]:
# Weight buyers 100x more than non-buyers to counter the class imbalance
init_train['RECORD_WEIGHT'] = np.where(init_train['RESPONSE'] == 'PURCHASE', 100, 1)

In [156]:
init_train.describe()

Unnamed: 0,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,RECORD_WEIGHT
count,42357.0,42357.0,42357.0,42357.0,42357.0
mean,42801.576764,-1.087026e-16,6.106135e-17,4.5796010000000003e-17,2.231744
std,24780.83482,1.000012,1.000012,1.000012,10.973985
min,1.0,-1.253845,-1.17996,-1.449845,1.0
25%,21275.0,-0.8278756,-0.7325817,-0.8613144,1.0
50%,42709.0,-0.1889223,-0.285203,-0.2727842,1.0
75%,64347.0,1.01799,1.056933,0.9042763,1.0
max,85795.0,1.585948,1.504312,1.198541,100.0


In [157]:
init_train.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ,RECORD_WEIGHT
0,1763,NOPURCHASE,-0.685886,-0.732582,-0.861314,Female,1
1,1771,NOPURCHASE,0.095057,-0.061514,0.904276,Female,1
2,1776,NOPURCHASE,-1.253845,-1.17996,1.198541,Male,1
3,1460,NOPURCHASE,-0.117927,-0.285203,-0.861314,Female,1
4,1783,NOPURCHASE,-0.614891,-0.508892,0.021481,Male,1


In [158]:
purchase_records_sample = init_train[init_train['RESPONSE'] == 'PURCHASE']
purchase_records_sample.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ,RECORD_WEIGHT
90,21511,PURCHASE,1.585948,1.504312,1.198541,Female,100
129,61905,PURCHASE,-0.614891,-0.508892,0.315746,Male,100
173,15467,PURCHASE,1.514953,1.504312,1.198541,Male,100
205,25211,PURCHASE,-0.117927,-0.285203,-0.861314,Male,100
248,83461,PURCHASE,0.166052,-0.061514,1.198541,Male,100


In [159]:
# Normalize the weights
init_train['RECORD_WEIGHT'] /= init_train['RECORD_WEIGHT'].sum()
init_train.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ,RECORD_WEIGHT
0,1763,NOPURCHASE,-0.685886,-0.732582,-0.861314,Female,1.1e-05
1,1771,NOPURCHASE,0.095057,-0.061514,0.904276,Female,1.1e-05
2,1776,NOPURCHASE,-1.253845,-1.17996,1.198541,Male,1.1e-05
3,1460,NOPURCHASE,-0.117927,-0.285203,-0.861314,Female,1.1e-05
4,1783,NOPURCHASE,-0.614891,-0.508892,0.021481,Male,1.1e-05


In [160]:
purchase_records_sample = init_train[init_train['RESPONSE'] == 'PURCHASE']
purchase_records_sample.head()

Unnamed: 0,LNR,RESPONSE,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ,RECORD_WEIGHT
90,21511,PURCHASE,1.585948,1.504312,1.198541,Female,0.001058
129,61905,PURCHASE,-0.614891,-0.508892,0.315746,Male,0.001058
173,15467,PURCHASE,1.514953,1.504312,1.198541,Male,0.001058
205,25211,PURCHASE,-0.117927,-0.285203,-0.861314,Male,0.001058
248,83461,PURCHASE,0.166052,-0.061514,1.198541,Male,0.001058
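
The normalized weights shown above can be checked by hand from the printed statistics. This is a back-of-the-envelope sketch: the buyer count is inferred from the rounded `RESPONSE` mean (0.012442) and the row count (42357) reported earlier, so treat it as an estimate rather than a value read from the data.

```python
n_rows = 42357
n_buyers = round(0.012442 * n_rows)  # inferred from the printed mean of RESPONSE

# Sum of the raw weights: 100 per buyer, 1 per non-buyer.
raw_total = 100 * n_buyers + (n_rows - n_buyers)

buyer_weight = 100 / raw_total
non_buyer_weight = 1 / raw_total

# Fraction of the total weight carried by the buyers.
buyer_share = n_buyers * buyer_weight

print(n_buyers, raw_total)
print(round(buyer_weight, 6), non_buyer_weight, round(buyer_share, 3))
```

With a factor of 100 the buyers carry slightly more than half of the total weight. A factor equal to the ratio of non-buyers to buyers would split the weight evenly between the two classes.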


In [161]:
init_train.drop(columns=['LNR'], inplace=True)
init_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42357 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   RESPONSE             42357 non-null  category
 1   LP_LEBENSPHASE_FEIN  42357 non-null  float64 
 2   LP_LEBENSPHASE_GROB  42357 non-null  float64 
 3   LP_STATUS_FEIN       42357 non-null  float64 
 4   ANREDE_KZ            42357 non-null  category
 5   RECORD_WEIGHT        42357 non-null  float64 
dtypes: category(2), float64(4)
memory usage: 1.7 MB


In [162]:
## Split Data into Training and Validation sets
init_train_data, init_valid_data = train_test_split(init_train, test_size=0.2, random_state=1)

In [163]:
init_train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33885 entries, 39087 to 16011
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   RESPONSE             33885 non-null  category
 1   LP_LEBENSPHASE_FEIN  33885 non-null  float64 
 2   LP_LEBENSPHASE_GROB  33885 non-null  float64 
 3   LP_STATUS_FEIN       33885 non-null  float64 
 4   ANREDE_KZ            33885 non-null  category
 5   RECORD_WEIGHT        33885 non-null  float64 
dtypes: category(2), float64(4)
memory usage: 1.4 MB
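
With only about 1.2% buyers, a purely random split can leave the validation set with noticeably more or fewer buyers than the training set. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses a hypothetical toy frame with a similar imbalance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with roughly the same imbalance as RESPONSE.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=5000),
    "RESPONSE": np.where(np.arange(5000) < 60, "PURCHASE", "NOPURCHASE"),
})

# stratify keeps the class proportions the same in both parts.
toy_train, toy_valid = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy["RESPONSE"]
)

train_rate = (toy_train["RESPONSE"] == "PURCHASE").mean()
valid_rate = (toy_valid["RESPONSE"] == "PURCHASE").mean()
print(f"train: {train_rate:.4f}  valid: {valid_rate:.4f}")
```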


In [164]:
predictor = TabularPredictor(label="RESPONSE", verbosity=4, sample_weight='RECORD_WEIGHT').fit(
    train_data=init_train_data,
    time_limit=600,
    presets="best_quality"
)

In [165]:
results = predictor.evaluate(init_valid_data)



In [166]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                      model  score_val eval_metric  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0       WeightedEnsemble_L3   0.990822    accuracy      11.232841  331.159106                0.086079          21.868798            3       True         18
1   RandomForestEntr_BAG_L2   0.990350    accuracy       9.832861  276.697928                1.108165          14.902271            2       True         16
2   RandomForestGini_BAG_L2   0.989671    accuracy       9.960351  273.347094                1.235655          11.551436            2       True         15
3    NeuralNetFastAI_BAG_L1   0.987546    accuracy       0.737310  162.591293                0.737310         162.591293            1       True         10
4     KNeighborsDist_BAG_L1   0.987546    accuracy       0.881267    0.063840                0.881267           0.063840            1       True          2
5 



{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestGini_BAG_L1': 'StackerEnsembleModel_RF',
  'RandomForestEntr_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesGini_BAG_L1': 'StackerEnsembleModel_XT',
  'ExtraTreesEntr_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'XGBoost_BAG_L1': 'StackerEnsembleModel_XGBoost',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
  'RandomForestGini_BAG_L2': 'StackerEnsembleModel_RF',
  'RandomForestEntr_BAG_L2': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
  'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
 'model_perfor

In [167]:
model_path = predictor.path
print(f"The model was saved in: {model_path}")

The model was saved in: AutogluonModels/ag-20231201_133737


In [168]:
# Load the test file for the submission
test_data = pd.read_csv(f'{s3_data_path}/test.csv')



In [169]:
init_test_columns = [col for col in selected_columns if col != 'RESPONSE']
# Copy the slice so later in-place edits do not act on a view of test_data
init_test = test_data[init_test_columns].copy()
init_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42833 entries, 0 to 42832
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LNR                  42833 non-null  int64  
 1   LP_LEBENSPHASE_FEIN  42255 non-null  float64
 2   LP_LEBENSPHASE_GROB  42255 non-null  float64
 3   LP_STATUS_FEIN       42255 non-null  float64
 4   ANREDE_KZ            42833 non-null  int64  
dtypes: float64(3), int64(2)
memory usage: 1.6 MB


In [170]:
init_test.describe()

Unnamed: 0,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
count,42833.0,42255.0,42255.0,42255.0,42833.0
mean,42993.16562,17.775175,5.313194,5.936528,1.595475
std,24755.599728,14.096616,4.475535,3.412463,0.490806
min,2.0,0.0,0.0,1.0,1.0
25%,21650.0,6.0,2.0,3.0,1.0
50%,43054.0,15.0,4.0,5.0,2.0
75%,64352.0,32.0,10.0,9.0,2.0
max,85794.0,40.0,12.0,10.0,2.0


In [171]:
# Fill missing values with each column's mean
for col in columns_to_normalize:
    init_test.loc[:, col] = init_test[col].fillna(init_test[col].mean())
init_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42833 entries, 0 to 42832
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LNR                  42833 non-null  int64  
 1   LP_LEBENSPHASE_FEIN  42833 non-null  float64
 2   LP_LEBENSPHASE_GROB  42833 non-null  float64
 3   LP_STATUS_FEIN       42833 non-null  float64
 4   ANREDE_KZ            42833 non-null  int64  
dtypes: float64(3), int64(2)
memory usage: 1.6 MB


In [172]:
# Use the same scaler from training data
init_test.loc[:, columns_to_normalize] = scaler.transform(init_test[columns_to_normalize])

In [173]:
init_test.describe()

Unnamed: 0,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
count,42833.0,42833.0,42833.0,42833.0,42833.0
mean,42993.16562,0.008101,0.008544,0.002803,1.595475
std,24755.599728,0.994011,0.994352,0.99737,0.490806
min,2.0,-1.253845,-1.17996,-1.449845,1.0
25%,21650.0,-0.827876,-0.732582,-0.861314,1.0
50%,43054.0,-0.188922,-0.285203,0.002803,2.0
75%,64352.0,1.01799,1.056933,0.904276,2.0
max,85794.0,1.585948,1.504312,1.198541,2.0
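
In the cells above, the test frame is imputed with its own column means and then scaled with the training scaler, which is why the test means land close to, but not exactly at, zero. An alternative is to take the fill values from the fitted scaler as well, so that every statistic comes from the training data. This is a sketch of that variant on hypothetical miniature frames; `toy_train`, `toy_test` and `toy_scaler` are illustrative names, not objects from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature train/test frames standing in for init_train and init_test.
cols = ["LP_LEBENSPHASE_FEIN", "LP_STATUS_FEIN"]
toy_train = pd.DataFrame({
    cols[0]: [0.0, 6.0, 15.0, 32.0, 40.0],
    cols[1]: [1.0, 3.0, 5.0, 9.0, 10.0],
})
toy_test = pd.DataFrame({
    cols[0]: [8.0, np.nan, 40.0],
    cols[1]: [np.nan, 3.0, 10.0],
})

toy_scaler = StandardScaler().fit(toy_train[cols])

# Fill gaps with the training means stored on the fitted scaler, then scale.
train_means = pd.Series(toy_scaler.mean_, index=cols)
toy_test_filled = toy_test.fillna(train_means)
toy_test_scaled = pd.DataFrame(
    toy_scaler.transform(toy_test_filled[cols]),
    columns=cols,
    index=toy_test.index,
)
print(toy_test_scaled)
```

Imputed cells end up at 0 after scaling, which is the training mean in standardized units.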


In [174]:
#categorize the same as training data
init_test.loc[:, 'ANREDE_KZ'] = init_test['ANREDE_KZ'].map({1: 'Male', 2: 'Female'}).astype('category')

In [177]:
# Drop the ID column so the test frame matches the training features
init_test = init_test.drop(columns=['LNR'])
init_test

Unnamed: 0,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,0.166052,-0.061514,1.198541,Male
1,-0.827876,-0.732582,-1.449845,Male
2,1.585948,1.504312,1.198541,Female
3,-1.253845,-1.179960,-0.861314,Female
4,1.372964,1.504312,0.904276,Female
...,...,...,...,...
42828,-1.253845,-1.179960,-0.272784,Female
42829,-0.401907,-0.508892,0.904276,Male
42830,1.514953,1.504312,1.198541,Male
42831,-0.614891,-0.508892,0.315746,Female


In [178]:
init_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42833 entries, 0 to 42832
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   LP_LEBENSPHASE_FEIN  42833 non-null  float64 
 1   LP_LEBENSPHASE_GROB  42833 non-null  float64 
 2   LP_STATUS_FEIN       42833 non-null  float64 
 3   ANREDE_KZ            42833 non-null  category
dtypes: category(1), float64(3)
memory usage: 1.0 MB


In [179]:
init_test.head()

Unnamed: 0,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,0.166052,-0.061514,1.198541,Male
1,-0.827876,-0.732582,-1.449845,Male
2,1.585948,1.504312,1.198541,Female
3,-1.253845,-1.17996,-0.861314,Female
4,1.372964,1.504312,0.904276,Female


In [180]:
predictions = predictor.predict(init_test.copy())
predictions.describe()

count          42833
unique             1
top       NOPURCHASE
freq           42833
Name: RESPONSE, dtype: object
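
Every test row is predicted as `NOPURCHASE`. That is consistent with the fit summary, where the evaluation metric is accuracy: with about 1.2% buyers, a constant "no purchase" prediction already scores about 0.9876. A ranking metric such as ROC AUC does not reward that shortcut. The sketch below illustrates the difference on synthetic labels and scores; the data and the `roc_auc` helper are illustrative assumptions, not results from this notebook.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:124] = 1  # about 1.2% positives, similar to RESPONSE

constant_accuracy = (np.zeros_like(y) == y).mean()  # always predict "no purchase"
random_scores = rng.random(len(y))                  # uninformative scores
informative_scores = random_scores + 0.5 * y        # hypothetical useful model

print(constant_accuracy)
print(roc_auc(y, random_scores), roc_auc(y, informative_scores))
```

In AutoGluon, one thing to try would be passing `eval_metric="roc_auc"` to `TabularPredictor` and scoring the test rows with `predict_proba` instead of `predict`.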

In [93]:
predictions.head()

0    NOPURCHASE
1    NOPURCHASE
2    NOPURCHASE
3    NOPURCHASE
4    NOPURCHASE
Name: RESPONSE, dtype: object
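
To turn predictions into a submission file, each `LNR` has to be paired with its score. The sketch below assumes the competition expects a CSV with the columns `LNR` and `RESPONSE`; that format is an assumption and should be checked against the competition's sample submission. `build_submission` is a hypothetical helper, and the demo values are made up.

```python
import pandas as pd


def build_submission(lnr, purchase_probability):
    """Pair each LNR with its predicted purchase probability."""
    submission = pd.DataFrame({
        "LNR": pd.Series(lnr).to_numpy(),
        "RESPONSE": pd.Series(purchase_probability).to_numpy(),
    })
    # Each person should appear exactly once in the submission.
    if submission["LNR"].duplicated().any():
        raise ValueError("LNR values must be unique")
    return submission


# Made-up IDs and probabilities, for illustration only.
demo_submission = build_submission([1754, 1770, 1465], [0.02, 0.01, 0.30])
demo_csv = demo_submission.to_csv(index=False)
print(demo_csv)
```

In this notebook the IDs would come from `test_data['LNR']`, and the probabilities could come from the `PURCHASE` column of `predictor.predict_proba(init_test)`.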