# Black Box Data Analysis

## Previous Notebooks

- [Data Ingestion and Cleaning](1-Data_Ingestion_and_Cleaning.ipynb)
- [EDA](2-EDA.ipynb)
- [Apriori](3-Apriori.ipynb)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfTransformer

## Clustering

In this notebook I use k-means to cluster the annualities. I try four different feature sets, built from two kinds of features: the patterns obtained from apriori in the previous notebook, and summary statistics computed by treating each annuality as a time series (e.g. mean, standard deviation).

The clusterings run on a dataset of over 5 GB, so to be able to execute them I ran this notebook on [Cloud Datalab](https://cloud.google.com/datalab/), retrieving the data from [Google Cloud Storage Buckets](https://cloud.google.com/storage/docs/json_api/v1/buckets) through the `datalab.storage` module, which integrates Jupyter notebooks running on Datalab with Cloud Storage.

In [3]:
from datalab.context import Context
import datalab.storage as storage
from io import BytesIO
# setting the bucket name
sample_bucket = storage.Bucket('k2proj_ale')

In [3]:
# fetching the file with the time series data
remote_csv = sample_bucket.item('voucher.csv').read_from()
# reading the file into a pandas DataFrame
vouchers = pd.read_csv(BytesIO(remote_csv))
# dropping the binned km columns and setting the index
vouchers.drop(['km_bin', 'km_quant'], inplace=True, axis=1)
vouchers.set_index(['n_voucher', 'annuality'], inplace=True)

In [5]:
# fetching the apriori features built with arbitrary ranges
remote_pickle = sample_bucket.item('processed/vou_bin_full.pkl').read_from()
# reading the file into a pandas DataFrame
vou_apriori = pd.read_pickle(BytesIO(remote_pickle))
# fetching the apriori features built with quantile ranges
remote_pickle = sample_bucket.item('processed/vou_quant_full.pkl').read_from()
# reading the file into a pandas DataFrame
vou_apriori_quant = pd.read_pickle(BytesIO(remote_pickle))

### tf-idf

Before clustering the data I normalize it.

First I apply tf-idf to the apriori features: this normalizes the data and down-weights some very common patterns (e.g. the AAA pattern, which translates to "not using the car for three days in a row").

In [7]:
tfidf = TfidfTransformer()
# toarray() returns a plain ndarray instead of the np.matrix produced by todense()
vou_apriori = pd.DataFrame(tfidf.fit_transform(vou_apriori.values).toarray(), index=vou_apriori.index, columns=vou_apriori.columns)
vou_apriori_quant = pd.DataFrame(tfidf.fit_transform(vou_apriori_quant.values).toarray(), index=vou_apriori_quant.index, columns=vou_apriori_quant.columns)
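
To make the effect concrete, here is a minimal sketch on a made-up count matrix (the values are illustrative, not taken from the dataset). It reproduces by hand what I understand to be `TfidfTransformer`'s default weighting: a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. Treat the formula as an assumption about the library defaults rather than a description of this notebook's exact output.

```python
import numpy as np

# Hypothetical pattern counts: rows are annualities, columns are patterns.
# Column 0 plays the role of a ubiquitous pattern (such as "AAA"),
# column 1 is a pattern that appears in only one row.
counts = np.array([
    [3.0, 3.0],
    [3.0, 0.0],
    [3.0, 0.0],
    [3.0, 0.0],
])

n_rows = counts.shape[0]
# Document frequency: in how many rows each pattern appears at least once.
df = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (assumed scikit-learn default).
idf = np.log((1 + n_rows) / (1 + df)) + 1
# Weight the raw counts, then scale every row to unit Euclidean length.
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(idf)       # the rare pattern gets the larger idf
print(tfidf[0])  # same raw count, but the rare pattern carries more weight
```

In the first row both patterns have the same raw count, yet after weighting the rare one dominates, which is the behaviour wanted for patterns like AAA.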

### Scaling

Then I standardize the time series data so that its units are comparable to those of the apriori features after tf-idf, and I join the scaled features to copies of the apriori datasets.

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
sc = StandardScaler()
vouchers = pd.DataFrame(sc.fit_transform(vouchers), index=vouchers.index, columns=vouchers.columns)
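
As a rough illustration of what the scaler does, the sketch below standardizes two made-up columns by hand. The column meanings and values are hypothetical; the only thing carried over from the notebook is the operation itself: subtract the column mean and divide by the population standard deviation.

```python
import numpy as np

# Hypothetical time series statistics on very different scales (made-up values).
stats = np.array([
    [10.0, 200.0],
    [20.0, 400.0],
    [30.0, 600.0],
])

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0, which is NumPy's default).
mean = stats.mean(axis=0)
std = stats.std(axis=0)
scaled = (stats - mean) / std

print(scaled)
```

After scaling, both columns have zero mean and unit variance, so neither dominates the Euclidean distances used by k-means just because of its units.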

In [10]:
vou_apriori_all = vou_apriori.copy()
vou_apriori_all = vou_apriori_all.join(vouchers)
vou_apriori_quant_all = vou_apriori_quant.copy()
vou_apriori_quant_all = vou_apriori_quant_all.join(vouchers)
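
The join relies on both frames sharing the `(n_voucher, annuality)` index. A small sketch with made-up rows and hypothetical column names (`AAA`, `km_mean`) shows the behaviour worth checking: `DataFrame.join` is a left join by default, so any annuality missing from the time series frame would get NaN values, which scikit-learn's `KMeans` rejects.

```python
import pandas as pd

# Hypothetical index values; the real frames are indexed by (n_voucher, annuality).
idx = pd.MultiIndex.from_tuples(
    [(1, 2015), (1, 2016), (2, 2015)], names=['n_voucher', 'annuality']
)
patterns = pd.DataFrame({'AAA': [0.9, 0.8, 0.7]}, index=idx)
# The statistics frame lacks one of the rows on purpose.
series_stats = pd.DataFrame({'km_mean': [0.5, -0.5]}, index=idx[[0, 2]])

joined = patterns.join(series_stats)  # left join on the shared index
print(joined)
print(joined.isna().sum())  # unmatched rows show up as NaN counts here
```

A quick `isna().sum()` on the joined frames before fitting is a cheap way to confirm that every row found a match.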

### K-Means

Finally I use k-means to cluster the annualities. I fit four different clusterings with 10 clusters each, a number I determined by running the algorithm on a subset of the data and looking at the inertia for various numbers of clusters and choices of features (see this [notebook](0-10000 Clients Analysis.ipynb)).

The clusterings I try are:

- only apriori data with arbitrary ranges
- only apriori data with ranges determined from quantiles
- apriori data with arbitrary ranges and time series data
- apriori data with ranges determined from quantiles and time series data

![Inertia](./reports/figures/inertia.png)
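
A curve like the one above can be produced along these lines. This is only a sketch on synthetic data, not the code of the linked notebook: the blobs, the range of `k`, and the `random_state` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a subset of the data: three well-separated blobs.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The number of clusters is then read off where the curve stops dropping sharply; on this toy data that happens at three clusters, while on the real features the choice was 10.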

In [11]:
km = KMeans()
km_quant = KMeans()
km_all = KMeans()
km_quant_all = KMeans()

In [12]:
km.set_params(n_clusters=10)
km.fit(vou_apriori)
km_quant.set_params(n_clusters=10)
km_quant.fit(vou_apriori_quant)
km_all.set_params(n_clusters=10)
km_all.fit(vou_apriori_all)
km_quant_all.set_params(n_clusters=10)
km_quant_all.fit(vou_apriori_quant_all)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [13]:
vou_apriori['label'] = km.labels_
vou_apriori_quant['label'] = km_quant.labels_
vou_apriori_all['label'] = km_all.labels_
vou_apriori_quant_all['label'] = km_quant_all.labels_

I also save each annuality's distance from every cluster centroid, so that the annualities closest to each centroid can be found later on.

In [14]:
# one distance column per cluster: dist0 ... dist9
dist_cols = ['dist%d' % i for i in range(10)]
distances = pd.DataFrame(km.transform(vou_apriori.drop('label', axis=1)),
                         index=vou_apriori.index,
                         columns=dist_cols)
distances_quant = pd.DataFrame(km_quant.transform(vou_apriori_quant.drop('label', axis=1)),
                               index=vou_apriori_quant.index,
                               columns=dist_cols)
distances_all = pd.DataFrame(km_all.transform(vou_apriori_all.drop('label', axis=1)),
                             index=vou_apriori_all.index,
                             columns=dist_cols)
distances_quant_all = pd.DataFrame(km_quant_all.transform(vou_apriori_quant_all.drop('label', axis=1)),
                                   index=vou_apriori_quant_all.index,
                                   columns=dist_cols)

In [15]:
vou_apriori = vou_apriori.join(distances)
vou_apriori_quant = vou_apriori_quant.join(distances_quant)
vou_apriori_all = vou_apriori_all.join(distances_all)
vou_apriori_quant_all = vou_apriori_quant_all.join(distances_quant_all)
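
To show how these columns can be used later, here is a small sketch with made-up points and centroids. It computes the same quantity that `KMeans.transform` returns (the Euclidean distance from each row to each centroid) and then picks the rows closest to one centroid. All values are hypothetical.

```python
import numpy as np

# Hypothetical feature rows and centroids (made-up values).
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.5, 5.0]])
centroids = np.array([[0.5, 0.0], [5.5, 5.0]])

# Euclidean distance from every row to every centroid,
# arranged as (n_rows, n_clusters) like the dist0 ... dist9 columns.
dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# The nearest centroid gives the cluster label.
labels = dist.argmin(axis=1)
# Row positions sorted by distance to centroid 1, closest first.
closest_to_1 = np.argsort(dist[:, 1])[:2]

print(labels)
print(closest_to_1)
```

On the labelled frames the equivalent would presumably be sorting the rows of one cluster by the matching `dist` column and taking the first few.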

Finally, I save the labelled datasets to the bucket:

In [18]:
sample_bucket.item('processed/vou_bin_full_labeled.csv')\
             .write_to(vou_apriori[['label', 'dist0', 'dist1', 'dist2', 'dist3', 'dist4', 'dist5', 'dist6', 'dist7', 'dist8', 'dist9']]\
               .to_csv(), 'text/csv')

In [28]:
sample_bucket.item('processed/vou_quant_full_labeled.csv')\
             .write_to(vou_apriori_quant[['label', 'dist0', 'dist1', 'dist2', 'dist3', 'dist4', 'dist5', 'dist6', 'dist7', 'dist8', 'dist9']]\
               .to_csv(), 'text/csv')

In [16]:
sample_bucket.item('processed/vou_bin_all_full_labeled.csv')\
             .write_to(vou_apriori_all[['label', 'dist0', 'dist1', 'dist2', 'dist3', 'dist4', 'dist5', 'dist6', 'dist7', 'dist8', 'dist9']]\
               .to_csv(), 'text/csv')

In [21]:
sample_bucket.item('processed/vou_quant_all_full_labeled.csv')\
             .write_to(vou_apriori_quant_all[['label', 'dist0', 'dist1', 'dist2', 'dist3', 'dist4', 'dist5', 'dist6', 'dist7', 'dist8', 'dist9']]\
               .to_csv(), 'text/csv')
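
When these files are read back, the two index levels come back as ordinary columns unless they are named explicitly. A minimal local round trip, using an in-memory string instead of the bucket and a made-up two-row frame, might look like this:

```python
from io import StringIO

import pandas as pd

# Hypothetical labelled frame with the same index names as the notebook's data.
idx = pd.MultiIndex.from_tuples(
    [(1, 2015), (2, 2015)], names=['n_voucher', 'annuality']
)
labelled = pd.DataFrame({'label': [3, 7], 'dist3': [0.25, 1.5]}, index=idx)

# With no path argument, to_csv returns the CSV content as a string.
csv_text = labelled.to_csv()
# The index levels are written as plain columns, so name them when reading back.
restored = pd.read_csv(StringIO(csv_text), index_col=['n_voucher', 'annuality'])

print(restored)
```

With the bucket, the string passed to `read_csv` would instead come from the downloaded item, wrapped in `BytesIO` as in the loading cells at the top of this notebook.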

## Following Notebooks

- [Clustering on Premises](4b-Clustering_on_Prem.ipynb)
- [Interpreting Clusters](5-Interpreting_Clusters.ipynb)