# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned to a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to become customers. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shutil
import sagemaker
from sagemaker import get_execution_role
import subprocess
import json





# magic word for producing visualizations in notebook
%matplotlib inline



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
# Setup Sagemaker Session
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
execution_role = sagemaker.session.get_execution_role()
region = sagemaker_session.boto_region_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
#download data to notebook
#define data location constants
local_data_dir = 'data'
s3_data_path = f's3://{bucket}/data' 
s3_model_path = f's3://{bucket}/model'

## Initial Model and Kaggle Submission

Below I set up an initial AutoGluon run without any refinement of the data, then submit the result to Kaggle.

In [4]:
%%capture

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir
!pip install kaggle
!pip install python-dotenv
from autogluon.tabular import TabularPredictor



### Setting up Kaggle Creds


In [12]:
!mkdir -p kaggle
%env KAGGLE_CONFIG_DIR=kaggle
!touch kaggle/kaggle.json
!chmod 600 kaggle/kaggle.json

env: KAGGLE_CONFIG_DIR=kaggle


In [21]:
from dotenv import dotenv_values

CONFIG = dotenv_values('env.txt')
kaggle_username = CONFIG['KAGGLE_USERNAME']
kaggle_key = CONFIG['KAGGLE_KEY']

# Save the API token to the kaggle.json file
with open("kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))
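
The `touch`, `chmod 600`, and write steps above can also be done in one place from Python. This is a minimal sketch assuming a POSIX filesystem; `write_kaggle_token` is a hypothetical helper name, not part of the Kaggle CLI, and the credentials shown are placeholders.

```python
import json
import os
import tempfile


def write_kaggle_token(config_dir, username, key):
    """Write kaggle.json so that only the owner can read or write it (mode 0o600)."""
    os.makedirs(config_dir, exist_ok=True)
    token_path = os.path.join(config_dir, "kaggle.json")
    # Create (or truncate) the file with owner-only permissions from the start.
    fd = os.open(token_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump({"username": username, "key": key}, f)
    # Tighten the mode in case the file already existed with looser permissions.
    os.chmod(token_path, 0o600)
    return token_path


# Demo with placeholder credentials in a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_token(demo_dir, "example-user", "example-key")
print(demo_path)
```

Creating the file with restricted permissions before writing avoids a short window in which the key would be readable by other users.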

### Downloading and Prepping Data

In [45]:
train_data = pd.read_csv(f'{s3_data_path}/train.csv')



In [46]:
train_data.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,RESPONSE,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,1763,2,1.0,8.0,,,,,8.0,15.0,...,5.0,2.0,1.0,6.0,9.0,3.0,3,0,2,4
1,1771,1,4.0,13.0,,,,,13.0,1.0,...,1.0,2.0,1.0,4.0,9.0,7.0,1,0,2,3
2,1776,1,1.0,9.0,,,,,7.0,0.0,...,6.0,4.0,2.0,,9.0,2.0,3,0,1,4
3,1460,2,1.0,6.0,,,,,6.0,4.0,...,8.0,11.0,11.0,6.0,9.0,1.0,3,0,2,4
4,1783,2,1.0,9.0,,,,,9.0,53.0,...,2.0,2.0,1.0,6.0,9.0,3.0,3,0,1,3


In [47]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Columns: 367 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int64(94), object(6)
memory usage: 120.3+ MB


### Doing Simple Training
For the first submission, we drop every column except the ones explored in the proposal section. The goal is simply to get a basic submission and a baseline to compare against at the end of the project.

In [48]:
train_data[['RESPONSE', 'LNR', 'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN', 'ANREDE_KZ']].describe()

Unnamed: 0,RESPONSE,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
count,42962.0,42962.0,42357.0,42357.0,42357.0,42962.0
mean,0.012383,42803.120129,17.661071,5.274996,5.927001,1.595084
std,0.110589,24778.339984,14.085702,4.470538,3.398336,0.490881
min,0.0,1.0,0.0,0.0,1.0,1.0
25%,0.0,21284.25,6.0,2.0,3.0,1.0
50%,0.0,42710.0,15.0,4.0,5.0,2.0
75%,0.0,64340.5,32.0,10.0,9.0,2.0
max,1.0,85795.0,40.0,12.0,10.0,2.0
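
The `RESPONSE` mean in the table above is also the share of positive rows, so it gives a quick read on class imbalance. The sketch below only redoes that arithmetic from the two printed numbers; the variable names are illustrative.

```python
n_rows = 42962            # count of RESPONSE in the describe() output
response_rate = 0.012383  # mean of RESPONSE in the describe() output

# The mean of a 0/1 column times the row count recovers the number of positives.
n_positive = round(n_rows * response_rate)
n_negative = n_rows - n_positive

print(f"positives: {n_positive}, negatives: {n_negative}")
print(f"negatives per positive: {n_negative / n_positive:.1f}")
```

With roughly 80 non-responders for every responder, accuracy is not a useful yardstick here, which is worth keeping in mind when reading the model scores later on.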


In [49]:
selected_columns = ['RESPONSE', 'LNR', 'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN', 'ANREDE_KZ']
init_train = train_data[selected_columns]

In [50]:
init_train.head()

Unnamed: 0,RESPONSE,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,0,1763,8.0,2.0,3.0,2
1,0,1771,19.0,5.0,9.0,2
2,0,1776,0.0,0.0,10.0,1
3,0,1460,16.0,4.0,3.0,2
4,0,1783,9.0,3.0,6.0,1


In [51]:
init_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42962 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   RESPONSE             42962 non-null  int64  
 1   LNR                  42962 non-null  int64  
 2   LP_LEBENSPHASE_FEIN  42357 non-null  float64
 3   LP_LEBENSPHASE_GROB  42357 non-null  float64
 4   LP_STATUS_FEIN       42357 non-null  float64
 5   ANREDE_KZ            42962 non-null  int64  
dtypes: float64(3), int64(3)
memory usage: 2.0 MB


In [52]:
features = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN', 'ANREDE_KZ']
init_train = init_train.dropna(subset=features).copy()

In [53]:
init_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42357 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   RESPONSE             42357 non-null  int64  
 1   LNR                  42357 non-null  int64  
 2   LP_LEBENSPHASE_FEIN  42357 non-null  float64
 3   LP_LEBENSPHASE_GROB  42357 non-null  float64
 4   LP_STATUS_FEIN       42357 non-null  float64
 5   ANREDE_KZ            42357 non-null  int64  
dtypes: float64(3), int64(3)
memory usage: 2.3 MB


In [54]:
# Normalize key values
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
columns_to_normalize = ['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB', 'LP_STATUS_FEIN']
init_train[columns_to_normalize] = scaler.fit_transform(init_train[columns_to_normalize])


In [55]:
init_train.describe()

Unnamed: 0,RESPONSE,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
count,42357.0,42357.0,42357.0,42357.0,42357.0,42357.0
mean,0.012442,42801.576764,-1.087026e-16,6.106135e-17,4.5796010000000003e-17,1.59551
std,0.110848,24780.83482,1.000012,1.000012,1.000012,0.490799
min,0.0,1.0,-1.253845,-1.17996,-1.449845,1.0
25%,0.0,21275.0,-0.8278756,-0.7325817,-0.8613144,1.0
50%,0.0,42709.0,-0.1889223,-0.285203,-0.2727842,2.0
75%,0.0,64347.0,1.01799,1.056933,0.9042763,2.0
max,1.0,85795.0,1.585948,1.504312,1.198541,2.0
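
The scaled columns show a standard deviation of 1.000012 rather than exactly 1. That is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). A small check of that arithmetic, using only the row count printed above:

```python
import math

n = 42357  # rows left after dropna

# For a column scaled to population std 1, the sample std is sqrt(n / (n - 1)).
sample_std = math.sqrt(n / (n - 1))
print(round(sample_std, 6))
```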


In [56]:
#categorize sex
init_train['ANREDE_KZ'] = init_train['ANREDE_KZ'].map({1: 'Male', 2: 'Female'}).astype('category')

In [57]:
init_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42357 entries, 0 to 42961
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   RESPONSE             42357 non-null  int64   
 1   LNR                  42357 non-null  int64   
 2   LP_LEBENSPHASE_FEIN  42357 non-null  float64 
 3   LP_LEBENSPHASE_GROB  42357 non-null  float64 
 4   LP_STATUS_FEIN       42357 non-null  float64 
 5   ANREDE_KZ            42357 non-null  category
dtypes: category(1), float64(3), int64(2)
memory usage: 2.0 MB


In [58]:
init_train.head()

Unnamed: 0,RESPONSE,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,0,1763,-0.685886,-0.732582,-0.861314,Female
1,0,1771,0.095057,-0.061514,0.904276,Female
2,0,1776,-1.253845,-1.17996,1.198541,Male
3,0,1460,-0.117927,-0.285203,-0.861314,Female
4,0,1783,-0.614891,-0.508892,0.021481,Male


In [59]:
init_train.set_index('LNR', inplace=True)
predictor = TabularPredictor(label="RESPONSE", eval_metric='log_loss').fit(
    train_data=init_train,
    time_limit=600,
    presets="best_quality"
)

No path specified. Models will be saved in: "AutogluonModels/ag-20231130_140410"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked overfitting.
Sub-fit(s) time limit is: 600 seconds.
Starting holdout-based sub-fit for dynamic stacking. Context path is: AutogluonModels/ag-20231130_140410/ds_sub_fit/sub_fit_ho.
Running the sub-fit in a ray process to avoid memory leakage.


OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.16.41.198, ID: 6c4c5d6a5a83e73b75a2d41da1b682098f8c4f7afb0ff6fcd1425029) where the task (task ID: 29bc9987691ae2e47fd79286bed5c45830c8ef1d01000000, name=_sub_fit, pid=22103, memory used=0.43GB) was running was 3.57GB / 3.76GB (0.950027), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: a60f430bacbaed466b0240e1c016508befed931f29a0aab2f9aed63e) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.16.41.198`. To see the logs of the worker, use `ray logs worker-a60f430bacbaed466b0240e1c016508befed931f29a0aab2f9aed63e*out -ip 172.16.41.198. Top 10 memory users:
PID	MEM(GB)	COMMAND
5821	1.82	/home/ec2-user/anaconda3/envs/python3/bin/python -m ipykernel_launcher -f /home/ec2-user/.local/shar...
22103	0.43	ray::_sub_fit
10860	0.16	/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin/python3.10 -m pylsp
3702	0.15	/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin/python3.10 /home/ec2-user/anaconda3/envs/JupyterS...
31047	0.08	/home/ec2-user/anaconda3/envs/python3/bin/python -u /home/ec2-user/anaconda3/envs/python3/lib/python...
30951	0.07	/home/ec2-user/anaconda3/envs/python3/bin/python /home/ec2-user/anaconda3/envs/python3/lib/python3.1...
30947	0.06	/home/ec2-user/anaconda3/envs/python3/bin/python -u /home/ec2-user/anaconda3/envs/python3/lib/python...
31019	0.06	/home/ec2-user/anaconda3/envs/python3/bin/python -u /home/ec2-user/anaconda3/envs/python3/lib/python...
18873	0.04	/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin/python3.10 /home/ec2-user/anaconda3/envs/JupyterS...
22336	0.04	/home/ec2-user/anaconda3/envs/python3/bin/python /home/ec2-user/anaconda3/envs/python3/lib/python3.1...
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
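
The log above shows the node at 3.57 GB of 3.76 GB when Ray killed the dynamic-stacking sub-fit. Besides moving to an instance with more memory, one option is to ask AutoGluon for a lighter run. The settings below are an untested assumption about what might fit on this node, and each argument should be checked against the installed AutoGluon version.

```python
# Hypothetical lower-memory settings for TabularPredictor.fit (assumptions, not measured).
low_memory_fit_kwargs = {
    "time_limit": 600,
    "presets": "medium_quality",  # lighter preset than best_quality
    "dynamic_stacking": False,    # skip the Ray sub-fit that was killed above
}

# Usage (not executed here; needs AutoGluon and the prepared init_train):
# predictor = TabularPredictor(label="RESPONSE", eval_metric="log_loss").fit(
#     train_data=init_train, **low_memory_fit_kwargs
# )
print(low_memory_fit_kwargs)
```

A lighter preset usually trades some accuracy for memory, so it is best treated as a way to get a first baseline rather than a final configuration.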

In [19]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                      model  score_val eval_metric  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0       WeightedEnsemble_L2  -0.066645    log_loss       3.607294  299.268882                0.003067           4.502586            2       True         13
1    NeuralNetFastAI_BAG_L1  -0.066768    log_loss       1.335047  244.308689                1.335047         244.308689            1       True         10
2         LightGBMXT_BAG_L1  -0.066780    log_loss       0.175785   25.935876                0.175785          25.935876            1       True          3
3           LightGBM_BAG_L1  -0.066829    log_loss       0.050700   24.467825                0.050700          24.467825            1       True          4
4      LightGBMLarge_BAG_L1  -0.066842    log_loss       0.067685   24.387013                0.067685          24.387013            1       True         12
5 



{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestGini_BAG_L1': 'StackerEnsembleModel_RF',
  'RandomForestEntr_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesGini_BAG_L1': 'StackerEnsembleModel_XT',
  'ExtraTreesEntr_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'XGBoost_BAG_L1': 'StackerEnsembleModel_XGBoost',
  'LightGBMLarge_BAG_L1': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif_BAG_L1': -0.19939988684776724,
  'KNeighborsDist_BAG_L1': -0.19938935052767662,
  'LightGBMXT_BAG_L1': -0.06677984446545061,
  'LightGBM_BAG_L1': -0.06682867410013044,
  'RandomForestGini_BAG_L1': -0.07491378454467366,
  'RandomForestEnt
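
To put the validation scores in context, it helps to compare them with the log loss of a model that predicts the base response rate for every row. The sketch below uses the `RESPONSE` mean after `dropna` and the best `score_val` from the leaderboard. It is an approximation, since the validation folds need not have exactly that base rate.

```python
import math

p = 0.012442  # mean of RESPONSE after dropna (describe() output)

# Log loss of a constant prediction equal to the base rate p.
baseline_log_loss = -(p * math.log(p) + (1 - p) * math.log(1 - p))
best_model_log_loss = 0.066645  # WeightedEnsemble_L2 score_val with the sign flipped

print(f"constant baseline: {baseline_log_loss:.4f}")
print(f"best model:        {best_model_log_loss:.4f}")
```

The two values are very close, which suggests that the four selected features carry little signal beyond the base rate.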

In [18]:
model_path = predictor.path
print(f"The model was saved in: {model_path}")

The model was saved in: AutogluonModels/ag-20231130_133606


In [21]:
# Load the test data used for the Kaggle submission
test_data = pd.read_csv(f'{s3_data_path}/test.csv')


In [22]:
init_test_columns = [col for col in selected_columns if col != 'RESPONSE']
# Take an explicit copy so later column assignments do not act on a slice of test_data
init_test = test_data[init_test_columns].copy()
init_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42833 entries, 0 to 42832
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LNR                  42833 non-null  int64  
 1   LP_LEBENSPHASE_FEIN  42255 non-null  float64
 2   LP_LEBENSPHASE_GROB  42255 non-null  float64
 3   LP_STATUS_FEIN       42255 non-null  float64
 4   ANREDE_KZ            42833 non-null  int64  
dtypes: float64(3), int64(2)
memory usage: 1.6 MB


In [23]:
init_test.describe()

Unnamed: 0,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
count,42833.0,42255.0,42255.0,42255.0,42833.0
mean,42993.16562,17.775175,5.313194,5.936528,1.595475
std,24755.599728,14.096616,4.475535,3.412463,0.490806
min,2.0,0.0,0.0,1.0,1.0
25%,21650.0,6.0,2.0,3.0,1.0
50%,43054.0,15.0,4.0,5.0,2.0
75%,64352.0,32.0,10.0,9.0,2.0
max,85794.0,40.0,12.0,10.0,2.0
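
The `count` row shows 578 missing values in each `LP_*` test column (42833 - 42255). Training rows with missing values were dropped, but test rows cannot be dropped because every `LNR` needs a prediction. One option, sketched here on toy data, is to fill test gaps with the training means so that both sets are processed with the same statistics. The frames and values below are made up for illustration.

```python
import pandas as pd

lp_columns = ["LP_LEBENSPHASE_FEIN", "LP_LEBENSPHASE_GROB", "LP_STATUS_FEIN"]

# Toy stand-ins for the selected training and test columns.
train = pd.DataFrame({
    "LP_LEBENSPHASE_FEIN": [6.0, 15.0, 32.0, 8.0],
    "LP_LEBENSPHASE_GROB": [2.0, 4.0, 10.0, 2.0],
    "LP_STATUS_FEIN": [3.0, 5.0, 9.0, 3.0],
})
test = pd.DataFrame({
    "LNR": [1754, 1770],
    "LP_LEBENSPHASE_FEIN": [19.0, None],
    "LP_LEBENSPHASE_GROB": [5.0, None],
    "LP_STATUS_FEIN": [9.0, None],
})

# Statistics come from the training data only.
train_means = train[lp_columns].mean()

# fillna with a Series fills each column with the value under the matching label.
test_filled = test.fillna(train_means)
print(test_filled)
```

Because `StandardScaler` centers on the training means, values imputed this way land exactly at zero after scaling.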


In [24]:
## Need to fill in NA values
# fillna returns a new DataFrame, which avoids chained assignment on a slice of test_data
lp_means = init_test[columns_to_normalize].mean()
init_test = init_test.fillna(lp_means)
init_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42833 entries, 0 to 42832
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LNR                  42833 non-null  int64  
 1   LP_LEBENSPHASE_FEIN  42833 non-null  float64
 2   LP_LEBENSPHASE_GROB  42833 non-null  float64
 3   LP_STATUS_FEIN       42833 non-null  float64
 4   ANREDE_KZ            42833 non-null  int64  
dtypes: float64(3), int64(2)
memory usage: 1.6 MB


In [26]:
# Use the same scaler from training data
init_test[columns_to_normalize] = scaler.transform(init_test[columns_to_normalize])  



In [31]:
init_test.describe()

Unnamed: 0,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN
count,42833.0,42833.0,42833.0,42833.0
mean,42993.16562,0.008101,0.008544,0.002803
std,24755.599728,0.994011,0.994352,0.99737
min,2.0,-1.253845,-1.17996,-1.449845
25%,21650.0,-0.827876,-0.732582,-0.861314
50%,43054.0,-0.188922,-0.285203,0.002803
75%,64352.0,1.01799,1.056933,0.904276
max,85794.0,1.585948,1.504312,1.198541


In [32]:
#categorize the same as training data
init_test['ANREDE_KZ'] = init_test['ANREDE_KZ'].map({1: 'Male', 2: 'Female'}).astype('category')  # Use the same mapping as in training data



In [33]:
init_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42833 entries, 0 to 42832
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   LNR                  42833 non-null  int64   
 1   LP_LEBENSPHASE_FEIN  42833 non-null  float64 
 2   LP_LEBENSPHASE_GROB  42833 non-null  float64 
 3   LP_STATUS_FEIN       42833 non-null  float64 
 4   ANREDE_KZ            0 non-null      category
dtypes: category(1), float64(3), int64(1)
memory usage: 1.3 MB


In [35]:
init_test.head()

Unnamed: 0,LNR,LP_LEBENSPHASE_FEIN,LP_LEBENSPHASE_GROB,LP_STATUS_FEIN,ANREDE_KZ
0,1754,0.166052,-0.061514,1.198541,
1,1770,-0.827876,-0.732582,-1.449845,
2,1465,1.585948,1.504312,1.198541,
3,1470,-1.253845,-1.17996,-0.861314,
4,1478,1.372964,1.504312,0.904276,
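
The `ANREDE_KZ` column is empty in the output above (`0 non-null`). The out-of-order cell numbers suggest a likely cause: the mapping cell was run a second time on a column that already held the labels, and `map` with a dict returns `NaN` for any value that is not a key. The sketch below reproduces that on toy data and shows one way to make the step safe to re-run; `label_anrede` is a hypothetical helper.

```python
import pandas as pd

ANREDE_LABELS = {1: "Male", 2: "Female"}

raw = pd.Series([2, 1, 2, 1], name="ANREDE_KZ")

once = raw.map(ANREDE_LABELS)    # codes -> labels
twice = once.map(ANREDE_LABELS)  # labels are not keys of the dict -> all NaN


def label_anrede(column):
    """Map codes to labels, leaving values that are already labels untouched."""
    as_objects = column.astype(object)
    return as_objects.map(lambda v: ANREDE_LABELS.get(v, v)).astype("category")


first = label_anrede(raw)
second = label_anrede(first)  # re-running does not wipe the column

print(twice.isna().all())
print(list(second))
```

An alternative with the same effect is to always map from the untouched source column (here `test_data['ANREDE_KZ']`) rather than from the frame being modified.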


In [34]:
predictions = predictor.predict(init_test)
predictions.describe()

count    42833.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
Name: RESPONSE, dtype: float64
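
Every prediction is 0 because `predict` returns hard class labels for a binary problem, and with a response rate near 1.2% the predicted probabilities rarely get anywhere near the decision threshold. If the submission is scored with a ranking metric such as ROC AUC, the probabilities from `predictor.predict_proba(init_test)` are more useful than the labels. The sketch below shows the difference with made-up probabilities, which are not output from the model above.

```python
# Hypothetical positive-class probabilities keyed by LNR (illustrative values only).
proba = {1754: 0.010, 1770: 0.031, 1465: 0.008, 1470: 0.019, 1478: 0.012}

# Hard labels at a 0.5 threshold: every row becomes 0.
labels = {lnr: int(p >= 0.5) for lnr, p in proba.items()}

# The probabilities still rank individuals from most to least likely to respond.
ranked = sorted(proba, key=proba.get, reverse=True)

print(labels)
print(ranked)
```

The all-zero labels therefore do not by themselves show that the model learned nothing, only that no individual crossed the threshold.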
