# Black Box Data Analysis

## Previous Notebooks

- [Data Ingestion and Cleaning](1-Data_Ingestion_and_Cleaning.ipynb)
- [EDA](2-EDA.ipynb)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Apriori

In this notebook I use the Apriori algorithm to find recurring patterns in each client's insurance annuality.

[Apriori](https://en.wikipedia.org/wiki/Apriori_algorithm) is an iterative procedure that finds progressively larger item sets that appear frequently in a dataset. I apply it to the daily mileages of an insurance annuality, represented as a sequence of letters: one letter per day, each letter standing for a given range of kilometers. The algorithm mines frequent patterns of *consecutive letters* by counting each candidate pattern across all the sequences.
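To make the letter encoding concrete, here is a minimal sketch of how a series of daily distances could be turned into such a sequence. The bin edges, the sample distances and the variable names are assumptions made for illustration; the encoding actually used was produced in the previous notebooks and may differ.

```python
import string

import numpy as np
import pandas as pd

# Hypothetical daily distances in km for one annuality (not real data).
daily_km = pd.Series([0.0, 3.2, 0.0, 12.5, 35.0, 38.1, 0.0, 95.4])

# Hypothetical bin edges: "A" is reserved for days with 0 km,
# the following letters cover increasingly long daily distances.
edges = [0, 10, 20, 30, 40, np.inf]                     # (0, 10], (10, 20], ...
letters = list(string.ascii_uppercase[1:len(edges)])    # B, C, D, E, F

# pd.cut uses right-closed intervals, so 0 km falls outside every bin (NaN).
codes = pd.cut(daily_km, bins=edges, labels=letters)
# Replace the days with 0 km by the letter "A".
codes = codes.astype(object).where(daily_km > 0, "A")

sequence = "".join(codes)
print(sequence)  # ABACEEAF
```

Each character of `sequence` is one day, so a run such as `EE` reads as two consecutive days in the same distance range.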

First I'll test the algorithm on a single annuality, then I'll run it on all the data.

In [2]:
vouchers = pd.read_csv('./data/processed/voucher.csv', sep=',')

In [3]:
vouchers.set_index(['n_voucher', 'annuality'], inplace=True) #669482 records

In [4]:
vouchers.head()

| n_voucher | annuality | km_bin | km_quant | km_day_mean | km_day_median | km_day_quant25 | km_day_quant75 | km_day_std | km_day_min | km_day_max | km_day_sum | km_day_count | km_day_count_zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3058761 | 2017 | ABABAAAABABAABABAAAABAABBAAAABAAAAAABAAAAABBAA... | ABABAAAABABAABABAAAABAABBAAAABAAAAAABAAAAABBAA... | 0.536811 | 0.0 | 0.0 | 0.0 | 1.403325 | 0.0 | 12.702 | 88.037 | 164 | 41 |
| 7443483 | 2016 | BGDBCBDBDBCABEAEDBBBCBCCBBHDABACBBBBEDBCCDFGAI... | BFDBBBDBDBCABEAEDBBBBBBCBBGDABADBBBBEDBCCDFFAH... | 10.39029 | 5.022 | 0.0 | 13.595 | 18.776679 | 0.0 | 241.569 | 3730.114 | 359 | 247 |
| 2909524 | 2016 | HBEGHDFIHIHGHIEEEIIGFHHEIIDGHEEHFIHHHFEECIGBIE... | GBEFGEFIGIGFGHEEEHHFFHGEHHDFGEEHFHGGGFEEBHFBHE... | 43.680388 | 36.956 | 14.322 | 55.492 | 47.298928 | 0.0 | 480.42 | 15987.022 | 366 | 322 |
| 2735973 | 2015 | AADBDBBBBBGBBBCCHBFBBBBBEBBDBCFAAHBBCBBBBCBBBB... | AADBDBBBBBFBBBCBGBFBBBBBEBBDBCFAAGBBCBBBBBBBBB... | 17.254551 | 0.0 | 0.0 | 13.317 | 37.457406 | 0.0 | 230.577 | 6297.911 | 365 | 176 |
| 2806463 | 2015 | BDCABBDDHDEDBCGCCBBCBBCCCBBBBABBBCDCDBCBBBBADB... | BDBABBDDGDEDBBFBCBBCBBCCBBBBBABBBBDCDBBBBBBADB... | 22.697526 | 17.938 | 10.527 | 26.169 | 21.578107 | 0.0 | 125.154 | 6650.375 | 293 | 286 |


In [5]:
vou_apriori = vouchers[['km_bin', 'km_quant']].copy()

### Single Annuality

In [6]:
test = vou_apriori.loc[(2189826, 2015)].values[0]
test

'EEAAAECEEEABEADEEBAEEEEEAADEEGEAAEEEEEADAAAAAAAAAAAAAAEEEEAABABAAEAAAAAAAAAACBCDABAEEEDABAAAAEAAEEEEAAAAFEAAABEEEEEAAEEEEEAAEEAAAAAAEEEEBAEEHEGAAEEEEEABEEEEECAEEEEEBAAAECEAAEEEEEACEEAAAAAAABBABAEEAEEAACAEEEAAEEDEEAAEEEEEAAEECEAAAECBEEAAEEEEEAAEEGFEBAFEEEEABEEEEDABEEEEDAAEEEEEAAAAEEEAAEEEEEAADEEEEAAEEEEEJAAEEEFAAEEEEEAAECCEAAAEEEEEAAEEEEFBAEGEAAAAGEEEFAAAAAAAAAGEEE'

In [7]:
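def apriori_single(eps, string):
    """Find patterns of 1 to 7 consecutive letters occurring at least eps times in string."""
    from collections import Counter
    from itertools import product
    freq = {}
    cnt = dict(Counter(string))
    # Level 1: single letters that occur at least eps times
    freq[1] = {k:v for k, v in cnt.items() if v >= eps}
    # Level 0 is an alias for the frequent letters, used to extend candidates
    freq[0] = freq[1]
    for i in range(2, 8):
        freq[i] = {}
        # Candidates: each frequent pattern of length i-1 followed by a frequent letter
        for item in product(freq[i-1], freq[0]):
            key = ''.join(item)
            # Count in the argument, not in the global `test`; str.count is non-overlapping
            value = string.count(key)
            if value >= eps:
                freq[i][key] = value
    return freq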

In [8]:
apriori_single(20, test)

{0: {'A': 143, 'E': 170},
 1: {'A': 143, 'E': 170},
 2: {'AA': 58, 'AE': 35, 'EA': 35, 'EE': 66},
 3: {'AAA': 23, 'AAE': 27, 'AEE': 28, 'EAA': 28, 'EEA': 26, 'EEE': 30},
 4: {'AAEE': 21, 'AEEE': 21, 'EEAA': 20, 'EEEE': 23},
 5: {},
 6: {},
 7: {}}

I'm interested in the frequent patterns of three to seven letters. Each pattern is a run of consecutive days whose daily distances fall into fixed ranges. In this case, for example, EEAA stands for two days in a row with 30 to 40 km travelled followed by two days with 0 km travelled, and this pattern occurs 20 times in the annuality for this black box.
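One detail worth keeping in mind when reading these counts: Python's `str.count` counts *non-overlapping* occurrences, so a long run of the same letter contributes fewer matches than the number of positions where the pattern starts. The sketch below contrasts the two ways of counting on a toy string; `count_overlapping` is a hypothetical helper written for this comparison and is not used elsewhere in the notebook.

```python
import re


def count_overlapping(sequence: str, pattern: str) -> int:
    """Count every position where `pattern` starts, overlaps included."""
    # A zero-width lookahead matches at each start position without consuming text.
    return len(re.findall(f"(?={re.escape(pattern)})", sequence))


toy = "AAAAEE"
non_overlapping = toy.count("AA")           # matches at positions 0 and 2
overlapping = count_overlapping(toy, "AA")  # matches at positions 0, 1 and 2

print(non_overlapping, overlapping)  # 2 3
```

Which of the two is preferable depends on how the counts are used downstream; the numbers shown in this notebook are non-overlapping counts.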

### All Data

I'm going to apply apriori to the arbitrary ranges and to the ranges determined using quantiles, obtaining two distinct sets of frequent items.

In [9]:
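def apriori(eps, df, col):
    """Same search as apriori_single, with counts summed over every sequence in df[col]."""
    from collections import Counter
    from itertools import product
    freq = {}
    # Letter counts per sequence, summed into one global count
    cnt = df[col].apply(lambda x: Counter(x))
    cnt = dict(cnt.sum())
    freq[1] = {k:v for k, v in cnt.items() if v >= eps}
    freq[0] = freq[1]
    for i in range(2, 8):
        freq[i] = {}
        for item in product(freq[i-1], freq[0]):
            key = ''.join(item)
            # Total non-overlapping occurrences of the candidate across all sequences
            value = df[col].apply(lambda x: x.count(key)).sum()
            if value >= eps:
                freq[i][key] = value
        # Progress: pattern length and number of frequent patterns found
        print(i, len(freq[i]))
    return freq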
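Before running on the full dataset, it can help to trace the level-wise search on a few short sequences where the counts can be checked by hand. The sketch below is a compact restatement of the same idea, not the function above: `frequent_substrings`, its `max_len` parameter and the toy sequences are assumptions introduced for illustration.

```python
from collections import Counter
from itertools import product

import pandas as pd


def frequent_substrings(eps, sequences, max_len=7):
    """Level-wise search for substrings whose total count is at least eps."""
    singles = Counter("".join(sequences))
    alphabet = {k: v for k, v in singles.items() if v >= eps}
    freq = {1: alphabet}
    for n in range(2, max_len + 1):
        freq[n] = {}
        # Extend each frequent pattern of length n-1 by one frequent letter.
        for prefix, letter in product(freq[n - 1], alphabet):
            key = prefix + letter
            # Non-overlapping occurrences, summed over all sequences.
            total = sum(s.count(key) for s in sequences)
            if total >= eps:
                freq[n][key] = total
    return freq


# Toy sequences (hypothetical, not taken from the voucher data).
toy_sequences = pd.Series(["AABAAB", "AABBA", "BAAB"])
toy_freq = frequent_substrings(3, toy_sequences, max_len=4)
print(toy_freq)
# {1: {'A': 9, 'B': 6}, 2: {'AA': 4, 'AB': 4, 'BA': 3}, 3: {'AAB': 4}, 4: {}}
```

Once a level comes back empty, every later level is empty too, because there are no patterns left to extend.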

#### Arbitrary ranges

In [19]:
freq = apriori(100000, vou_apriori, 'km_bin')

2 105
3 442
4 265
5 169
6 91
7 62


In [21]:
freq

{0: {'A': 61671232,
  'B': 36460031,
  'C': 32278814,
  'D': 23294271,
  'E': 16419601,
  'F': 11552275,
  'G': 8220973,
  'H': 15889149,
  'I': 8030280,
  'J': 1487761,
  'K': 1029516},
 1: {'A': 61671232,
  'B': 36460031,
  'C': 32278814,
  'D': 23294271,
  'E': 16419601,
  'F': 11552275,
  'G': 8220973,
  'H': 15889149,
  'I': 8030280,
  'J': 1487761,
  'K': 1029516},
 2: {'AA': 22722773,
  'AB': 8155068,
  'AC': 5056166,
  'AD': 2924634,
  'AE': 1790952,
  'AF': 1164958,
  'AG': 797994,
  'AH': 1539041,
  'AI': 907650,
  'AJ': 223795,
  'AK': 161681,
  'BA': 8347267,
  'BB': 8845488,
  'BC': 6410436,
  'BD': 3130183,
  'BE': 1764306,
  'BF': 1088005,
  'BG': 715033,
  'BH': 1307205,
  'BI': 692150,
  'BJ': 154021,
  'BK': 114593,
  'CA': 4988685,
  'CB': 6465018,
  'CC': 6729246,
  'CD': 4490339,
  'CE': 2404888,
  'CF': 1388541,
  'CG': 872781,
  'CH': 1490214,
  'CI': 727613,
  'CJ': 152754,
  'CK': 111225,
  'DA': 2846935,
  'DB': 3142568,
  'DC': 4474466,
  'DD': 4026122,
  'DE

Since it's a time-consuming procedure, I save the results in a pickle file and reload them from there:

In [30]:
import pickle

# with open('./data/interim/freq.pkl', 'wb') as handle:
#     pickle.dump(freq, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('./data/interim/freq.pkl', 'rb') as handle:
    freq = pickle.load(handle)
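The comment-out and reload pattern above can also be wrapped in a small helper that computes the result only when the file is missing. This is a sketch under assumptions: `load_or_compute` is a hypothetical name, and the temporary directory and toy result exist only to make the example runnable.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if missing."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result


# Demonstration with a stand-in for the expensive computation.
calls = []

def expensive():
    calls.append(1)          # record how many times the computation runs
    return {1: {"A": 3}}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "interim" / "toy_freq.pkl"
    first = load_or_compute(cache, expensive)    # computes and writes the file
    second = load_or_compute(cache, expensive)   # reads the file instead

print(first == second, len(calls))  # True 1
```

In this notebook the call could look like `load_or_compute('./data/interim/freq.pkl', lambda: apriori(100000, vou_apriori, 'km_bin'))`.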

#### Quantiles' ranges

In [10]:
freq_quant = apriori(100000, vou_apriori, 'km_quant')

2 81
3 374
4 317
5 127
6 107
7 75


In [11]:
freq_quant

{0: {'A': 61671232,
  'B': 44283768,
  'C': 21626604,
  'D': 21656029,
  'E': 21573294,
  'F': 21699646,
  'G': 10821855,
  'H': 8596047,
  'I': 4405428},
 1: {'A': 61671232,
  'B': 44283768,
  'C': 21626604,
  'D': 21656029,
  'E': 21573294,
  'F': 21699646,
  'G': 10821855,
  'H': 8596047,
  'I': 4405428},
 2: {'AA': 22722773,
  'AB': 9513800,
  'AC': 3306286,
  'AD': 2793235,
  'AE': 2385220,
  'AF': 2142951,
  'AG': 1043258,
  'AH': 902746,
  'AI': 634443,
  'BA': 9697871,
  'BB': 11746128,
  'BC': 5306141,
  'BD': 3933032,
  'BE': 3015871,
  'BF': 2459046,
  'BG': 1103576,
  'BH': 885402,
  'BI': 547320,
  'CA': 3256401,
  'CB': 5342772,
  'CC': 3514209,
  'CD': 3026558,
  'CE': 2245951,
  'CF': 1669650,
  'CG': 690533,
  'CH': 522461,
  'CI': 299860,
  'DA': 2722149,
  'DB': 3947178,
  'DC': 3013992,
  'DD': 3552030,
  'DE': 3128403,
  'DF': 2320054,
  'DG': 896355,
  'DH': 638907,
  'DI': 342875,
  'EA': 2316364,
  'EB': 3022856,
  'EC': 2232316,
  'ED': 3099205,
  'EE': 3737776

Since it's a time-consuming procedure, I save the results in a pickle file and reload them from there:

In [12]:
import pickle

# with open('./data/interim/freq_quant.pkl', 'wb') as handle:
#     pickle.dump(freq_quant, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('./data/interim/freq_quant.pkl', 'rb') as handle:
    freq_quant = pickle.load(handle)

### Creating Features

Finally, I create two new datasets whose features are the number of occurrences, in each annuality, of every frequent pattern of length 3 or more.
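As a small illustration of what the resulting feature table looks like, the sketch below builds it for three toy sequences. The sequences, the index values and `toy_freq` are hypothetical. It assembles all the columns in one step, which is an alternative to adding them one at a time; with hundreds of patterns, recent pandas versions may emit a fragmentation `PerformanceWarning` for the column-by-column approach.

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical values).
sequences = pd.Series(
    ["AABAAB", "AABBA", "BAAB"],
    index=pd.Index([101, 102, 103], name="n_voucher"),
)
toy_freq = {3: {"AAB": 4}, 4: {"AABA": 1}}

# Flatten the pattern dictionary into a single list, shortest patterns first.
patterns = [p for n in sorted(toy_freq) for p in toy_freq[n]]

# One column per pattern: non-overlapping occurrences in each sequence.
features = pd.DataFrame(
    {p: sequences.map(lambda s, p=p: s.count(p)) for p in patterns}
)
print(features)
```

Each row is one sequence and each column is one pattern, which is the shape the clustering notebooks can consume.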

#### Arbitrary ranges

In [32]:
vou_bin = vou_apriori.copy()

In [33]:
features = [list(freq[i].keys()) for i in range(3, 8)]
features = [item for sublist in features for item in sublist]

In [34]:
for f in features:
    vou_bin[f] = vou_bin['km_bin'].apply(lambda x: x.count(f))

In [36]:
vou_bin.drop(['km_bin', 'km_quant'], axis=1, inplace=True)

In [37]:
vou_bin.to_pickle('./data/processed/vou_bin_full.pkl')

#### Quantiles' ranges

In [13]:
vou_quant = vou_apriori.copy()

In [14]:
features = [list(freq_quant[i].keys()) for i in range(3, 8)]
features = [item for sublist in features for item in sublist]

In [15]:
for f in features:
    vou_quant[f] = vou_quant['km_quant'].apply(lambda x: x.count(f))

In [16]:
vou_quant.drop(['km_bin', 'km_quant'], axis=1, inplace=True)

In [17]:
vou_quant.to_pickle('./data/processed/vou_quant_full.pkl')

## Following Notebooks

- [Clustering on the Cloud](4a-Clustering_on_Cloud.ipynb)
- [Clustering on Premises](4b-Clustering_on_Prem.ipynb)
- [Interpreting Clusters](5-Interpreting_Clusters.ipynb)