## Introduction

This post discusses in detail the project "Create a Customer Segmentation Report for Arvato Financial Services", which is one of the Data Scientist Nanodegree capstone options. It is a continuation of a previous project whose solution was posted [here](https://github.com/bvcmartins/dsndProject3). I chose it because of its broad scope, which involves a reasonably complex data cleaning procedure, unsupervised learning, analysis of imbalanced data, and prediction using supervised learning tools. In what follows I discuss my solution.

### The dataset

Arvato kindly provided us with the following four datasets:

1. Azdias: general sample of the German population (891221 entries, 366 features)
2. Customers: same features but containing only customers (191652 entries, 369 features)
3. Mailout_train: training set containing potential customers chosen to receive mail ads. It also records whether each recipient responded to the ad
4. Mailout_test: testing set for the supervised learning model

In addition, two other files were provided:
1. DIAS Attributes - Values 2017: information about code levels of each attribute
2. DIAS Information Levels - Attributes 2017: high-level information about each attribute

Most of the features are ordinal, and the numbers are only labels for ranked value levels. Columns marked as float actually hold integers; they were typed as float only because they contain NaN, which is itself a float. Recent pandas versions provide the nullable integer type Int64, which can store missing values in an integer column.
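As a minimal sketch of that behaviour, the toy column below (made up for illustration, not taken from the Arvato data) starts out as float because of a NaN and is then cast to the nullable `Int64` dtype:

```python
import numpy as np
import pandas as pd

# Toy column standing in for an ordinal feature with one missing entry.
raw = pd.Series([1.0, 3.0, np.nan, 2.0])
print(raw.dtype)  # the NaN forces a float column

# Cast to the nullable integer dtype; the missing entry becomes pd.NA.
as_int = raw.astype("Int64")
print(as_int.dtype)
print(as_int)
```

The cast only succeeds when every non-missing value is a whole number, which is the situation described above.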

There are also 6 features of type object. These are categorical variables, except for EINGEFUEGT_AM, which is datetime.

Most of the features contained NaNs. In fact, NaNs made up almost 10% of all values.

### Data Cleaning

Cleaning this dataset was a relatively complex task. The steps are outlined below:

* pre-cleaning
* converting missing values to NaN
* assessing missing data per feature
* assessing missing data per row
* converting mixed-type features to ordinal or binary features
* one-hot encoding categorical features
* standard scaling numerical features
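The last two steps can be illustrated with a small sketch. The post does not say which tools were used for them, so the pandas-only approach below, the toy frame, and its column names are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical toy frame: one categorical and one numerical feature.
toy = pd.DataFrame({
    "category": ["a", "b", "a", "c"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["category"], dtype=int)

# Standard scaling: subtract the mean and divide by the population
# standard deviation (ddof=0), giving zero mean and unit variance.
mean = encoded["value"].mean()
std = encoded["value"].std(ddof=0)
encoded["value"] = (encoded["value"] - mean) / std

print(encoded)
```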

#### Pre-cleaning

We defined a function to perform general-purpose operations, such as converting all numeric features to Int64 (which supports missing values in integer columns) and replacing some non-standard missing-data encodings.

#### Converting missing values to NaN

The challenge with this step was that the missing-data coding was feature-dependent. Most of the missing values were coded as -1 or 0, and some were coded as X or XX, although these codes were not listed in the DIAS files. The latter were converted to NaNs during pre-cleaning, while the former were first converted to an unused code (-100) to avoid datatype problems and then to NaNs.
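The idea of a feature-dependent conversion can be sketched as follows. This is a simplified variant that masks the codes directly rather than going through the intermediate -100 code; the column names, the `missing_codes` lookup, and the `codes_to_nan` helper are all hypothetical:

```python
import pandas as pd

# Hypothetical toy data; column names and codes are illustrative only.
df = pd.DataFrame({
    "FEATURE_A": [1, -1, 3, 0],
    "FEATURE_B": [0, 2, -1, 4],
    "FEATURE_C": ["1A", "XX", "2B", "X"],
})

# Hypothetical per-feature lookup of the codes that mean "missing".
missing_codes = {
    "FEATURE_A": [-1, 0],
    "FEATURE_B": [-1],          # assume 0 is a valid level for this feature
    "FEATURE_C": ["X", "XX"],
}

def codes_to_nan(frame, codes):
    """Return a copy in which each column's missing codes are replaced by NaN."""
    out = frame.copy()
    for col, bad_values in codes.items():
        # mask() replaces the entries where the condition is True with NaN.
        out[col] = out[col].mask(out[col].isin(bad_values))
    return out

clean = codes_to_nan(df, missing_codes)
print(clean.isna().sum())
```

Keeping the codes in a per-feature lookup means a value such as 0 can be treated as missing in one column and as a valid level in another.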

#### Assessing missing data per feature

With all missing values converted to NaN, we identified the features with more than 445,000 missing values, roughly half of the total number of entries. As shown below, 9 features met this criterion. They accounted for 18% of all missing values and were all dropped.

![](./figures/main_40_0.png)
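A minimal sketch of this kind of column filter is given below. The helper name `drop_sparse_columns`, its threshold parameter, and the toy data are assumptions for illustration, not the code used in the project:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing_frac=0.5):
    """Drop columns whose fraction of missing values exceeds the threshold."""
    missing_frac = frame.isna().mean()  # per-column fraction of NaNs
    to_drop = missing_frac[missing_frac > max_missing_frac].index.tolist()
    return frame.drop(columns=to_drop), to_drop

# Toy data: one column is 75% missing, one exactly 50%, one complete.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 1.0, 2.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

reduced, dropped = drop_sparse_columns(toy)
print(dropped)
print(reduced.columns.tolist())
```

Because the comparison is strict, a column with exactly half of its values missing is kept.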

#### Assessing missing data per row

After analyzing missing data per column, we turned our attention to missing-value patterns across rows. As shown in the figure below, the distribution of missing data per row is multimodal. We selected the cluster of rows with more than 180 missing values for a statistical test.

![](./figures/main_47_0.png)
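Counting missing values per row and splitting the data at a threshold could look like the sketch below. The helper `split_by_row_missingness` and the toy frame are hypothetical; on the real data the threshold would be 180:

```python
import numpy as np
import pandas as pd

def split_by_row_missingness(frame, threshold):
    """Split rows into those with at most / more than `threshold` NaNs."""
    n_missing = frame.isna().sum(axis=1)  # number of NaNs in each row
    main_rows = frame[n_missing <= threshold]
    sparse_rows = frame[n_missing > threshold]
    return main_rows, sparse_rows

# Toy frame with three features; the second row is missing two of them.
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],
    "f2": [1.0, np.nan, np.nan],
    "f3": [1.0, 2.0, 3.0],
})

main_rows, sparse_rows = split_by_row_missingness(toy, threshold=1)
print(len(main_rows), len(sparse_rows))
```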

We applied the two-sample Kolmogorov-Smirnov test, feature by feature, to check whether the selected rows differ from the main body of data. The null hypothesis is that both groups are drawn from the same distribution.

Because we were performing multiple comparisons, we applied the very strict Bonferroni correction to the p-values.
The results showed that the difference between the two groups was significant for only 8.2% of the tested features. Note that this number is not a p-value and should not be compared with the 0.05 significance level.
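The procedure can be sketched as below, assuming SciPy's `ks_2samp` for the test and a Bonferroni-adjusted threshold of `alpha / n_features`. The helper `significant_features` and the toy data are illustrative assumptions; the toy groups are built so that one feature is identical in both and the other does not overlap at all:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def significant_features(group_a, group_b, alpha=0.05):
    """Return the features whose distributions differ between two groups.

    Runs a two-sample KS test per shared column and compares each p-value
    with the Bonferroni-corrected level alpha / (number of tests).
    """
    features = [c for c in group_a.columns if c in group_b.columns]
    corrected_alpha = alpha / len(features)
    significant = []
    for col in features:
        a = group_a[col].dropna().to_numpy()
        b = group_b[col].dropna().to_numpy()
        result = ks_2samp(a, b)
        if result.pvalue < corrected_alpha:
            significant.append(col)
    return significant

# Toy ordinal data: "same" is identical in both groups, "shifted" is not.
levels = np.arange(100) % 5
main_body = pd.DataFrame({"same": levels, "shifted": levels})
selected = pd.DataFrame({"same": levels, "shifted": levels + 5})

flagged = significant_features(main_body, selected)
fraction = len(flagged) / main_body.shape[1]
print(flagged, fraction)
```

The returned fraction plays the role of the 8.2% figure quoted above: it is the share of tested features flagged as different, not a p-value.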

We decided that differences in 8.2% of the tested columns were acceptable, so we did not drop the rows.

#### Data Imputation



