## Introduction

This post discusses in detail the project "Create a Customer Segmentation Report for Arvato Financial Services", one of the capstone options of the Data Scientist nanodegree. It is in fact a continuation of a previous project whose solution was posted [here](https://github.com/bvcmartins/dsndProject3). I chose it because of its broad scope, which involves a reasonably complex data cleaning procedure, unsupervised learning, analysis of imbalanced data, and prediction using supervised learning tools. In what follows I discuss my solution.

### The dataset

Arvato kindly provided us with the following four datasets:

1. Azdias: a sample of the general German population (891221 entries, 366 features)
2. Customers: same features but containing only customers (191652 entries, 369 features)
3. Mailout_train: training set containing potential customers chosen to receive mail ads. It also records whether each person responded to the ad
4. Mailout_test: testing set for the supervised learning model

On top of that, two other files were provided:
1. DIAS Attributes - Values 2017: information about code levels of each attribute
2. DIAS Information Levels - Attributes 2017: high-level information about each attribute

Most of the features are ordinal, and the numbers only represent labels for ranked value levels. Columns marked as float actually contain integers; pandas stored them as float only because they contain NaN, which is itself a float. Recent pandas versions provide the nullable type Int64, which supports missing values in integer columns.
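To illustrate the dtype issue, here is a minimal sketch on a toy column (the column name is made up and does not come from the dataset). It shows how a missing value forces an integer column to float, and how casting to `Int64` restores integer semantics.

```python
import numpy as np
import pandas as pd

# Toy column: integer codes plus one missing value, so pandas infers float64.
codes = pd.Series([1, 2, np.nan, 4], name="example_feature")
print(codes.dtype)  # float64

# Cast to the nullable integer dtype; the missing entry becomes pd.NA.
codes_int = codes.astype("Int64")
print(codes_int.dtype)  # Int64
```

Note that the capital "I" matters: `"int64"` is the plain NumPy dtype and cannot hold missing values.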

There are also 6 features of type object. These are categorical variables, except for EINGEFUEGT_AM, which is datetime.

Most of the features contained NaNs. Actually, NaNs comprised almost 10% of all data.

### Data Cleaning

Cleaning this dataset was a relatively complex task. The steps are outlined below:

* pre-cleaning
* converting missing values to NaN
* assessing missing data per feature
* assessing missing data per row
* converting mixed-type features to ordinal or binary features
* one-hot encoding categorical features
* standard scaling numerical features

#### Pre-cleaning

We defined a function to perform general-purpose operations, such as converting all numeric features to Int64 (which supports missing values in integer columns) and substituting some non-standard missing data encodings.

#### Converting missing values to NaN

The challenge with this step was that the missing data coding was feature-dependent. Most of the missing values were coded as -1 or 0, and some of them were coded as X or XX (codes that are not listed in the DIAS files). The latter were converted to NaNs during pre-cleaning, while the former were first converted to an unused code (-100) to avoid problems with datatypes and then to NaNs.
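The feature-dependent conversion can be sketched as follows. This is a simplified illustration, not the notebook's code: the feature names and the `missing_codes` mapping are hypothetical (in the project the mapping would be derived from the DIAS files), and the intermediate -100 step is skipped because `Series.mask` writes NaN directly.

```python
import pandas as pd

# Hypothetical mapping: feature name -> codes that mean "missing" for it.
missing_codes = {
    "feature_a": [-1, 0],
    "feature_b": [-1],          # here 0 is a valid level, not a missing code
    "feature_c": ["X", "XX"],
}

df = pd.DataFrame({
    "feature_a": [1, -1, 0, 3],
    "feature_b": [0, -1, 2, 2],
    "feature_c": ["1A", "X", "XX", "2B"],
})

def codes_to_nan(frame, mapping):
    """Replace the per-feature missing codes with NaN, leaving other values as-is."""
    out = frame.copy()
    for col, codes in mapping.items():
        # mask() sets entries to NaN wherever the condition is True
        out[col] = out[col].mask(out[col].isin(codes))
    return out

cleaned = codes_to_nan(df, missing_codes)
print(cleaned.isna().sum())
```

Keeping the codes in a per-feature mapping is what prevents a valid 0 in one column from being erased just because 0 means "unknown" in another.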

#### Assessing missing data per feature

With all missing values converted to NaN, we were able to assess which features had more than 445 000 missing entries, that is, half of the total number of entries. As shown below, we found 9 features satisfying this criterion. They corresponded to 18% of all missing values and were all dropped.

![](./figures/main_40_0.png)
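A minimal sketch of this per-feature filter, on toy data. The helper name `drop_sparse_features` and the column names are illustrative; the 50% threshold mirrors the rule described above.

```python
import numpy as np
import pandas as pd

def drop_sparse_features(frame, max_missing_fraction=0.5):
    """Drop columns whose share of missing values exceeds the threshold."""
    missing_fraction = frame.isna().mean()
    to_drop = missing_fraction[missing_fraction > max_missing_fraction].index
    return frame.drop(columns=to_drop), list(to_drop)

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "half_missing": [np.nan, np.nan, 1.0, 2.0],       # exactly 50% missing
    "complete": [1.0, 2.0, 3.0, 4.0],
})

reduced, dropped = drop_sparse_features(toy)
print(dropped)
```

The comparison is strict, so a column with exactly half of its entries missing is kept.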

#### Assessing missing data per row

After analyzing missing data per column, we turned our attention to missing values patterns associated with rows. As shown in the figure below, the distribution of missing data per row is multimodal. We selected the leftmost cluster, with values above 180, for a statistical test.

![](./figures/main_47_0.png)

We applied the Kolmogorov-Smirnov test, feature by feature, to check whether the selected rows differ overall from the main body of data. The null hypothesis is that both groups are drawn from the same distribution.

Because we were executing multiple comparisons, we applied the very strict Bonferroni correction to the p-values.
The results showed that the difference between the two groups was significant for only 8.2% of the tested features. Note that this number is not a p-value and should not be compared with the 0.05 significance level.

![](./figures/ks_test.png)

We decided that differences in 8.2% of the tested columns were acceptable, and we did not drop the rows.
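The procedure can be sketched like this, assuming `scipy` is available. The data are synthetic and the helper `ks_bonferroni` is illustrative, not the notebook's function. Dividing the significance level by the number of tests is equivalent to multiplying each p-value by that number.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_bonferroni(group_a, group_b, alpha=0.05):
    """Two-sample KS test per column with a Bonferroni-corrected threshold."""
    columns = list(group_a.columns)
    corrected_alpha = alpha / len(columns)  # Bonferroni: alpha / number of tests
    rows = []
    for col in columns:
        result = ks_2samp(group_a[col].dropna(), group_b[col].dropna())
        rows.append({
            "feature": col,
            "p_value": result.pvalue,
            "significant": bool(result.pvalue < corrected_alpha),
        })
    return pd.DataFrame(rows)

# Synthetic groups: one feature with the same distribution, one clearly shifted.
rng = np.random.default_rng(0)
group_a = pd.DataFrame({"same": rng.normal(0, 1, 500), "shifted": rng.normal(0, 1, 500)})
group_b = pd.DataFrame({"same": rng.normal(0, 1, 500), "shifted": rng.normal(3, 1, 500)})

report = ks_bonferroni(group_a, group_b)
share_significant = report["significant"].mean()
print(report)
print(f"Share of features that differ: {share_significant:.1%}")
```

The last line is the analogue of the 8.2% figure: a share of features, not a p-value.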

#### Data Imputation

Imputation was carried out separately for numeric and object variables. Numeric variables were imputed using the median value of the column while object-type variables used the most frequent value.
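A minimal sketch of this two-track imputation on toy data. The helper `impute` and the column names are illustrative, and the sketch assumes every column has at least one observed value.

```python
import numpy as np
import pandas as pd

def impute(frame):
    """Fill numeric columns with the median and other columns with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            fill_value = out[col].median()
        else:
            # mode() ignores missing values; take the first mode if there are ties
            fill_value = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill_value)
    return out

toy = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 10.0],
    "cat": ["a", "b", "a", None],
})
filled = impute(toy)
print(filled)
```

The median is a natural choice for ordinal codes because it is always close to an existing level and is not pulled by extreme values.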

#### Re-encode mixed features

After removing all NaNs, the next step was to re-encode the variables of mixed type. These were:

* PRAEGENDE_JUGENDJAHRE
* CAMEO_INTL_2015
* LP_LEBENSPHASE_FEIN
* LP_LEBENSPHASE_GROB

The variable PLZ8_BAUMAX could also have been re-encoded, but the explanatory gain was too small. The description of the derived variables and their levels is provided in the notebook.
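As an illustration of what re-encoding a mixed feature can look like, the sketch below splits a two-digit code into its tens and units digits. It assumes that CAMEO_INTL_2015 packs two ordinal scales into one code (to my understanding of the data dictionary, wealth in the tens digit and life stage in the units digit); that reading, the helper name, and the derived variable names are assumptions, and the actual derived variables are the ones described in the notebook.

```python
import pandas as pd

def split_two_digit_code(series):
    """Split a two-digit code into (tens digit, units digit) as nullable integers."""
    numeric = pd.to_numeric(series, errors="coerce").astype("Int64")
    tens = numeric // 10
    units = numeric % 10
    return tens, units

# Toy values; the real column holds codes read from the raw file.
codes = pd.Series(["51", "24", None, "13"], name="CAMEO_INTL_2015")
wealth, life_stage = split_two_digit_code(codes)  # hypothetical derived names
print(pd.DataFrame({"code": codes, "wealth": wealth, "life_stage": life_stage}))
```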

#### One-hot encoding

With the data cleaned and the mixed variables re-encoded, we performed one-hot encoding of all categorical variables. Binary variables were left as they were. A total of 13 features were transformed.
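A minimal sketch of this step with `pandas.get_dummies`, on a toy frame with made-up column names. Only the multi-level column is listed in `columns`, so the binary one passes through unchanged.

```python
import pandas as pd

toy = pd.DataFrame({
    "multi_level": ["a", "b", "c", "a"],  # categorical with three levels
    "binary_flag": [0, 1, 1, 0],          # binary: left untouched
})

# Encode only the listed columns; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["multi_level"], dtype=int)
print(encoded)
```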

#### Scaling

At first the standard scaler was used, but it was too sensitive to outliers. A combination of robust scaling with outlier removal led to improved results.
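The difference between the two scalers is easy to see on a toy column with a single extreme value. This sketch uses scikit-learn's `StandardScaler` and `RobustScaler` with their default settings; the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values and one extreme outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

standard = StandardScaler().fit_transform(values)  # centers on mean, scales by std
robust = RobustScaler().fit_transform(values)      # centers on median, scales by IQR

print("standard:", standard.ravel().round(3))
print("robust:  ", robust.ravel().round(3))
```

With the standard scaler the outlier inflates the standard deviation, so the four ordinary values collapse to almost the same number. The robust scaler keeps them spread out because the median and inter-quartile range barely notice the outlier.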

#### Removing Outliers

Some of the features had outliers that were skewing the subsequent scaling, so outlier removal was needed at this step to avoid problems with the PCA step. After trying some of the built-in outlier detection methods from sklearn (isolation forest, local outlier factor), we decided that the most robust method for this application was to simply calculate the inter-quartile range and remove everything outside of it (an absolute distance of 3 for a z-transformed distribution). However, this threshold was too harsh, and we determined that a distance of 4 would be a better compromise.
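One possible reading of this rule is a filter on z-transformed values with a cutoff of 4. The sketch below implements that reading on synthetic data; the helper `remove_outliers` is illustrative and may differ from the notebook's implementation. It assumes the data are already imputed and that no column is constant.

```python
import numpy as np
import pandas as pd

def remove_outliers(frame, columns, threshold=4.0):
    """Keep rows whose z-score is within the threshold for all given columns."""
    subset = frame[columns]
    z_scores = (subset - subset.mean()) / subset.std(ddof=0)
    keep = (z_scores.abs() <= threshold).all(axis=1)
    return frame[keep]

# Synthetic feature: standard normal values with one planted outlier.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(0, 1, 1000)})
toy.loc[0, "x"] = 50.0

filtered = remove_outliers(toy, ["x"])
print(len(toy), "->", len(filtered))
```

Raising the threshold from 3 to 4 trades a few more retained extreme values for fewer discarded rows.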

At the end of the data-cleaning process, we defined a pandas pipeline that streamlined the cleaning of the dataframes used in the next section.

### Customer Segmentation Report

The first part of this project analyzed a dataset comprising a sample of the general population of Germany. This dataset will serve as the basis for the prediction analysis that will be carried out below. The analysis comprised the following steps:

1. Reduce dimensionality of the general population dataset
2. Cluster the reduced space in order to identify customer segments
3. Apply the PCA transformation defined in 1 to the customers dataset
4. Cluster the reduced customer space and identify which clusters have a population excess

#### PCA

After being cleaned and prepared, the dataset had 496 features. Since some of these features might be redundant or unimportant for prediction, reducing the dimensionality is desirable for good prediction performance. We did this by applying a PCA transformation and selecting the minimum number of dimensions that can explain 80% of the variance.

Because of its non-statistical nature, PCA is very straightforward and basically only requires us to select the number of dimensions we want to keep. In this case we tested spaces containing between 110 and 120 dimensions and selected 124 dimensions as the minimum number needed to explain 80% of the variance.

![](figures/screeplot_pca.png)
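The selection rule can be sketched with scikit-learn as follows. The data are synthetic (20 features driven by 5 latent factors), so the resulting number of components has nothing to do with the project's figure; only the procedure is the point.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 correlated features generated from 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

# Fit a full PCA and accumulate the explained variance ratio.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

X_reduced = PCA(n_components=n_components).fit_transform(X)
print(n_components, X_reduced.shape)
```

Plotting `cumulative` against the component index gives the kind of curve shown in the figure above.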

#### K-Means

Next, with the dimensionality reduced, we used k-means to generate clusters containing similar instances. This is the basis for the customer segmentation that is carried out using specific customer data. K-means is the most straightforward approach to this problem, and it basically requires only the definition of the number of clusters.

Another option, which might be explored in future versions of this project, is to use DBSCAN to generate the clusters. Some preliminary tests did not show an improvement in prediction; however, the non-separable nature of this dataset might profit from a density-based approach.

The scree plot showed that the score started growing linearly at around 12 clusters, which was the number we chose for this analysis. A more formal choice could have used the silhouette score but its high computational cost prevented its use.

![](figures/screeplot_kmeans.png)
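A curve like the one above can be produced with a loop over candidate cluster counts. This sketch uses synthetic blobs and scikit-learn's `KMeans`; it records the inertia (within-cluster sum of squares), which is the negative of what `KMeans.score` returns, so the exact quantity plotted in the figure may differ.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=0)

# Fit k-means for a range of k and record the inertia of each fit.
inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The number of clusters is read off where the curve stops dropping sharply and starts changing roughly linearly.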

With 12 clusters we obtained an approximately balanced cluster population distribution. This is the baseline distribution for the comparison with the customer cluster occupation.

![](figures/general_pop.png)

If we plot the spatial distribution of the clusters projected on the first two PCA axes, we notice that the data form a single cloud that is not easily separable. This is not a good case for k-means, and DBSCAN would probably perform better.

![](figures/general_clusters.png)

#### Application to Customer dataset

We applied this pipeline to customers, our second dataset, which shows how current company clients are distributed according to the same features present in azdias.

As shown below, the customers dataset shows a population excess in clusters 1 and 9. This result suggests that people in azdias who fall in these clusters have a higher probability of becoming clients.

![](figures/customer_clusters.png)
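The comparison behind this plot amounts to computing, for each cluster, the share of customers minus the share of the general population. A minimal sketch with made-up labels follows; the helper `cluster_excess` is illustrative, and in the project the labels would come from the fitted k-means model.

```python
import numpy as np
import pandas as pd

def cluster_excess(general_labels, customer_labels, n_clusters):
    """Per-cluster share in each dataset and the customer excess over baseline."""
    index = range(n_clusters)
    general = (pd.Series(general_labels)
               .value_counts(normalize=True)
               .reindex(index, fill_value=0.0))
    customers = (pd.Series(customer_labels)
                 .value_counts(normalize=True)
                 .reindex(index, fill_value=0.0))
    return pd.DataFrame({
        "general": general,
        "customers": customers,
        "excess": customers - general,
    })

# Toy labels: the general population is balanced, customers pile up in cluster 1.
general_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
customer_labels = np.array([1, 1, 1, 1, 1, 1, 2, 3])

table = cluster_excess(general_labels, customer_labels, n_clusters=4)
over_represented = table.index[table["excess"] > 0].tolist()
print(table)
print("Over-represented clusters:", over_represented)
```

Working with shares rather than raw counts is what makes the two datasets comparable despite their different sizes.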

The PCA component weights for clusters 1 and 9 are shown below. Note how only the first 10 components are relevant for the description of each cluster.



### Supervised Learning Model

The previous section was aimed at selecting new potential customers to target with a mailout campaign. The mailout data was split into two approximately equal parts, each with around 43 000 entries. One of the blocks is the training set, which contains the same features seen above plus a RESPONSE column indicating whether the person became a customer of the company following the campaign. The other block was used to generate predictions.

Training data has around 43 000 entries and 367 features.

As expected for this kind of study, the response classes are very imbalanced. The classes are:

* 0 - did not become customer after campaign
* 1 - became customer after campaign

Only 1.2% of the entries are of class 1.
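With so few positives, any train/validation split needs care so that both parts keep the same class ratio. The sketch below shows a stratified split on synthetic labels with a 1.2% positive rate; it is an illustration of one common precaution and an assumption on my part, not a description of the split used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 000 rows, 1.2% of them in class 1.
rng = np.random.default_rng(0)
n_rows = 10_000
y = np.zeros(n_rows, dtype=int)
y[:120] = 1
X = rng.normal(size=(n_rows, 3))

# stratify=y keeps the class ratio identical in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(f"train positives: {y_train.mean():.3%}")
print(f"valid positives: {y_valid.mean():.3%}")
```

For the same reason, accuracy is a poor metric here: always predicting class 0 would already be right for almost 99% of the entries.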

![](figures/mailout_clusters_1.png)

![](figures/mailout_clusters_0.png)

