# Machine Learning Engineer Nanodegree
## Capstone Project
Felipe Santos  
October 23rd, 2018

## I. Definition

### Project Overview

Optical character recognition (OCR) is the process of converting images of text into machine-readable text. The output is unstructured, so the extracted text has to be classified before it can be used to gain knowledge from documents and to improve document retrieval. Tradeshift built a machine-learning product on that foundation: it classifies the text blocks of a document into categories such as dates, addresses, and names to enrich the output of the OCR process. The company hosted a competition on Kaggle, a data science platform, and opened its data so that the community could try to beat its machine-learning model on this classification problem. The competition and the dataset are available through this [link](https://www.kaggle.com/c/tradeshift-text-classification) to any Kaggle member who, like me, wants to beat the benchmark.

### Problem Statement

In this competition, we have to build a supervised machine learning model that predicts the probability that a block of text belongs to a particular label; a block can have multiple labels. For every document, words are detected and combined to form text blocks, which may overlap each other. Each text block is enclosed within a spatial box, which is depicted by a red line in the sketch below. The text blocks from all documents are aggregated into a data set in which each text block corresponds to one sample (row). The text is produced by the OCR, and the host organization provides features such as the hashed content of the text, the position and size of the box, whether the text can be parsed as a date or as a number, and information about the surrounding text blocks in the original document. The final classifier is intended to beat the Tradeshift benchmark. The tasks involved in reaching that goal are:

- Download and preprocess the Tradeshift dataset;
- Do some feature engineering;
- Train different classifiers;
- Tune the hyperparameters of the algorithm;
- Beat the benchmark.

### Metrics

The evaluation metric chosen by the organizers for this competition is the negative logarithm of the likelihood function, averaged over $N_t$ test samples and $K$ labels, as shown by the following equation:

$$\textrm{LogLoss} = \frac{1}{N_{t} \cdot K} \sum_{idx=1}^{N_{t} \cdot K} \textrm{LogLoss}_{idx}$$
$$= \frac{1}{N_{t} \cdot K} \sum_{idx=1}^{N_{t} \cdot K} \left[ - y_{idx} \log(\hat{y}_{idx}) - (1 - y_{idx}) \log(1 - \hat{y}_{idx})\right]$$
$$= \frac{1}{N_{t} \cdot K} \sum_{i=1}^{N_{t}} \sum_{j=1}^K \left[ - y_{ij} \log(\hat{y}_{ij}) - (1 - y_{ij}) \log(1 - \hat{y}_{ij})\right]$$

- $f$ is the prediction model
- $\theta$ is the parameter of the model
- $y_{ij}$ is the true value (0 or 1) of the $j$th label for the $i$th sample
- $\hat{y}_{ij}$ is the predicted probability that the $j$th label is true for the $i$th sample
- $\log$ represents the natural logarithm
- $idx = (i - 1) \cdot K + j$

This function penalizes predictions that are confident and wrong. In the worst case, predicting true (1) for a false label (0) adds infinity to the LogLoss, since $-\log(0) = \infty$, which makes the total score infinite regardless of the other scores.

This metric is also symmetric in the sense that predicting 0.1 for a false (0) sample has the same penalty as predicting 0.9 for a positive (1) sample. The value is bounded between zero and infinity, i.e. $\textrm{LogLoss} \in [0, \infty)$. The competition corresponds to a minimization problem where smaller metric values, $\textrm{LogLoss} \sim 0$, imply better prediction models. To avoid complications with infinite values, the predictions are clipped to the range $[10^{-15}, 1-10^{-15}]$.
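
The per-label penalty, its symmetry, and the effect of clipping can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `label_penalty` is an illustrative choice, not part of the competition code.

```python
import math

EPS = 1e-15  # clipping bound quoted in the text


def label_penalty(y_true, y_pred, eps=EPS):
    """Log loss contribution of a single (sample, label) pair."""
    # Clip the prediction so that log(0) can never occur.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# Symmetry: 0.1 for a false label costs the same as 0.9 for a true label.
penalty_false = label_penalty(0, 0.1)
penalty_true = label_penalty(1, 0.9)

# Worst case: predicting 1.0 for a false label is capped by the clipping,
# at roughly 34.5 instead of infinity.
worst_case = label_penalty(0, 1.0)

print(penalty_false, penalty_true, worst_case)
```

Because of the clipping, a single confident mistake stays finite, but it still costs far more than a cautious prediction.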

#### Example
This is an example from the competition. If the 'answer' file is:
``` csv
id_label,pred
1_y1,1.0000
1_y2,0.0000
1_y3,0.0000
1_y4,0.0000
2_y1,0.0000
2_y2,1.0000
2_y3,0.0000
2_y4,1.0000
3_y1,0.0000
3_y2,0.0000
3_y3,1.0000
3_y4,0.0000

```

And the submission file is:
``` csv
id_label,pred
1_y1,0.9000
1_y2,0.1000
1_y3,0.0000
1_y4,0.3000
2_y1,0.0300
2_y2,0.7000
2_y3,0.2000
2_y4,0.8500
3_y1,0.1900
3_y2,0.0000
3_y3,1.0000
3_y4,0.2700

``` 
The score is 0.1555, as shown by:

$$L = - \frac{1}{12} \left[ \log(0.9) + \log(1-0.1) + \log(1-0.0) + \log(1-0.3) + \log(1-0.03) + \log(0.7) + \log(1-0.2) + \log(0.85) + \log(1-0.19) + \log(1-0.0) + \log(1.0) + \log(1-0.27) \right] = 0.1555$$
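
The same number can be reproduced in code. The sketch below hard-codes the twelve values of the two example files in file order and averages the clipped log loss; `mean_log_loss` is an illustrative helper, not the official scoring script.

```python
import math

# True labels and submitted probabilities from the example files, in file order
# (1_y1 ... 1_y4, 2_y1 ... 2_y4, 3_y1 ... 3_y4).
answers = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
submission = [0.90, 0.10, 0.00, 0.30, 0.03, 0.70,
              0.20, 0.85, 0.19, 0.00, 1.00, 0.27]


def mean_log_loss(y_true, y_pred, eps=1e-15):
    """Average log loss over all (sample, label) pairs, with clipping."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


score = mean_log_loss(answers, submission)
print(round(score, 4))
```

The clipping changes the exact predictions of 0.0 and 1.0 only by about $10^{-15}$, so it does not affect the rounded score.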

## II. Analysis

### Data Exploration

1. [Loading Data](#loading_data)
2. [First Look](#first_look)
3. [Metadata](#metadata)
4. [Descriptive statistics](#descriptive)
5. [Data Quality Checks](#quality_check)

#### Loading Data <a class="anchor" id="loading_data"></a>

In [1]:
%load_ext autoreload
%autoreload 2

import src.describe as d
import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
# None disables column-width truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)

train_features = d.read_train_features()
train_labels = d.read_train_labels()

#### First look <a class="anchor" id="first_look"></a>

In [2]:
train_features.shape

(1700000, 146)

In [3]:
train_features.head()

Unnamed: 0,id,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100,x101,x102,x103,x104,x105,x106,x107,x108,x109,x110,x111,x112,x113,x114,x115,x116,x117,x118,x119,x120,x121,x122,x123,x124,x125,x126,x127,x128,x129,x130,x131,x132,x133,x134,x135,x136,x137,x138,x139,x140,x141,x142,x143,x144,x145
0,1,NO,NO,dqOiM6yBYgnVSezBRiQXs9bvOFnRqrtIoXRIElxD7g8=,GNjrXXA3SxbgD0dTRblAPO9jFJ7AIaZnu/f48g5XSUk=,0.576561,0.073139,0.481394,0.115697,0.472474,YES,NO,NO,NO,NO,42,0.396065,3,6,0.991018,0.0,0.82,3306,4676,YES,NO,YES,0,0.405047,0.46461,NO,NO,NO,NO,mimucPmJSF6NI6KM6cPIaaVxWaQyIQzSgtwTTb9bKlc=,s7mTY62CCkWUFc36AW2TlYAy5CIcniD2Vz+lHzyYCLg=,0.576561,0.073139,0.481394,0.115697,0.45856,YES,NO,YES,NO,NO,9,0.368263,2,10,0.992729,0.0,0.94,3306,4676,YES,NO,YES,1,0.375535,0.451301,+2TNtXRI6r9owdGCS80Ia9VVv8ZpuOpVaHEvxRGGu78=,NO,NO,Op+X3asn5H7EQJErI7PR0NkUs3YB+Ld/8OfWuiOC8tU=,GeerC2BbPUcQfQO86NmvOsKrfTvmW7HF+Iru9y+7DPA=,0.576561,0.073139,0.481394,0.115697,0.487598,YES,NO,NO,NO,NO,42,0.363131,6,10,0.987596,0.0,0.71,3306,4676,YES,NO,YES,0,0.375535,0.479734,bxU52teuxC05EZyzFihSiKHczE2ZAIVCXekVLG7j3C0=,NO,NO,+dia7tCOijlRGbABX0YKG5L85x/hXLyJwwplN5Qab04=,f4Uu1R9nnf/h03aqiRQT0Fw3WItzNToLCyRlW1Pn8Z8=,0.576561,0.073139,0.481394,0.115697,0.473079,YES,NO,NO,NO,NO,37,0.333618,4,6,0.987169,0.0,0.89,3306,4676,YES,NO,YES,1,0.34645,0.46461,0.576561,0.073139,0.481394,0.115697,0.473079,YES,NO,NO,NO,NO,42,0.363131,5,6,0.987596,0.0,0.81,3306,4676,YES,NO,YES,2,0.375535,0.46461
1,2,,,,,0.0,0.0,0.0,0.0,0.0,,,,,,0,0.0,0,0,0.0,0.0,0.0,0,0,,,,0,0.0,0.0,NO,NO,NO,NO,l0G2rvmLGE6mpPtAibFsoW/0SiNnAuyAc4k35TrHvoQ=,lblNNeOLanWhqgISofUngPYP0Ne1yQv3QeNHqCAoh48=,1.058379,0.125832,0.932547,0.663037,0.569047,YES,NO,NO,NO,NO,9,0.709921,5,6,0.96824,0.0,0.81,4678,3306,YES,NO,YES,3,0.741682,0.560282,MZZbXga8gvaCBqWpzrh2iKdOkcsz/bG/z4BVjUnqWT0=,NO,NO,TqL9cs8ZFzALzVpZv6wYBDi+6zwhrdarQE/3FH+XAlA=,aZTF/lredyP4cukeN8bh6kpBjYmS1QFNpPOg2LVm3Lg=,1.058379,0.125832,0.932547,0.663037,0.628474,YES,NO,NO,NO,NO,2,0.679371,8,7,0.937387,0.0,0.84,4678,3306,YES,NO,YES,1,0.741984,0.619282,YvZUuCDjLu9VvkCdBWgARWQrvm+FSXgxp0zIrMjcLBc=,NO,NO,dsyhxXKNNJy4WVGD/v4+UGyW3jHWkx2xTdg3STsf34A=,X6dDAI/DZOWvu0Dg6gCgRoNr2vTUz/mc4SdHTNUPS38=,1.058379,0.125832,0.932547,0.663037,0.602394,NO,NO,NO,NO,NO,11,0.581367,3,6,0.966122,0.0,0.87,4678,3306,NO,NO,NO,3,0.615245,0.59363,1.058379,0.125832,0.932547,0.663037,0.602394,YES,NO,NO,NO,NO,9,0.709921,4,6,0.96824,0.0,0.51,4678,3306,YES,NO,YES,4,0.741682,0.59363
2,3,NO,NO,ib4VpsEsqJHzDiyL0dZLQ+xQzDPrkxE+9T3mx5fv2wI=,X6dDAI/DZOWvu0Dg6gCgRoNr2vTUz/mc4SdHTNUPS38=,1.341803,0.051422,0.935572,0.04144,0.50171,NO,NO,YES,NO,NO,2,0.838475,3,5,0.966122,0.0,0.74,4678,3306,NO,NO,NO,2,0.872353,0.493159,NO,NO,YES,YES,9TRXThP/ifDpJRGFX1LQseibUA1NJ3XM53gy+1eZ46k=,XSJ6E8aAoZC7/KAu3eETpfMg3mCq7HVBFIVIsoMKh9E=,1.341803,0.051422,0.935572,0.04144,0.447627,YES,NO,NO,NO,YES,2,0.752269,5,7,0.95493,0.0,0.82,4678,3306,YES,NO,YES,2,0.797338,0.438435,cr+kkNnNFV9YL0vz029hk3ohIDmGuABRVNhFe0ePZyo=,NO,NO,oFsUwSLCWcj8UA1cqILh5afKVcvwlFA+ohJ147Wkz5I=,WV5vAHFyqkeuyFB5KVNGFOBuwjkUGKYc8wh9QfpVzAA=,1.341803,0.051422,0.935572,0.04144,0.522873,YES,NO,NO,NO,NO,1,0.732305,6,6,0.95493,0.0,0.8,4678,3306,YES,NO,YES,0,0.777374,0.513681,X6dDAI/DZOWvu0Dg6gCgRoNr2vTUz/mc4SdHTNUPS38=,NO,NO,mRPnGiKVOWTk/vzZaqlLXZRtdrkcQ/sX0hqBCqOuKq0=,oo9tGpHvTredpg9JkHgYbZAuxcwtSpQxU5mA/zUbxY8=,1.341803,0.051422,0.935572,0.04144,0.50171,NO,NO,NO,NO,NO,2,0.65729,6,5,0.936479,0.0,0.79,4678,3306,NO,NO,NO,0,0.720811,0.493159,1.341803,0.051422,0.935572,0.04144,0.50171,NO,NO,YES,NO,NO,5,0.742589,3,5,0.966122,0.0,0.85,4678,3306,NO,NO,NO,1,0.776467,0.493159
3,4,YES,NO,BfrqME7vdLw3suQp6YAT16W2piNUmpKhMzuDrVrFQ4w=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.556564,YES,NO,NO,NO,NO,37,0.127405,8,15,0.959171,0.0,0.96,3306,4678,YES,NO,YES,1,0.168234,0.546582,NO,NO,YES,NO,BfrqME7vdLw3suQp6YAT16W2piNUmpKhMzuDrVrFQ4w=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.556564,YES,NO,NO,NO,NO,37,0.127405,8,15,0.959171,0.0,0.96,3306,4678,YES,NO,YES,1,0.168234,0.546582,XQG0f+jmjLI0UHAXXH2RYL4MEHa+yd9okO+730PCZuc=,YES,NO,/1yAAEg6Qib4GMD+wvGOlGmpCIPIAzioWtcCwbns9/I=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.557774,YES,NO,NO,NO,NO,22,0.067764,8,15,0.959598,0.0,0.93,3306,4678,YES,NO,YES,2,0.108166,0.547792,Vl+TDNSupucNoI+Fqeo7bMCkxg1hRjgTSS6NYb9BW00=,YES,NO,/1yAAEg6Qib4GMD+wvGOlGmpCIPIAzioWtcCwbns9/I=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.557774,YES,NO,NO,NO,NO,22,0.067764,8,15,0.959598,0.0,0.93,3306,4678,YES,NO,YES,2,0.108166,0.547792,0.653912,0.041471,0.940787,0.090851,0.557774,YES,NO,NO,NO,NO,0,0.067764,17,15,0.92755,0.0,0.945,3306,4678,NO,NO,YES,3,0.168234,0.546582
4,5,NO,NO,RTjsrrR8DTlJyaIP9Q3Z8s0zseqlVQTrlSe97GCWfbk=,3yK2OPj1uYDsoMgsxsjY1FxXkOllD8Xfh20VYGqT+nU=,1.415919,0.0,1.0,0.0,0.375297,NO,NO,YES,NO,NO,1,0.523543,4,11,0.963004,0.0,1.0,1263,892,NO,NO,NO,2,0.560538,0.361045,NO,NO,NO,NO,XEDyQD4da6aJkZiBf+r7LD2VdhLGnCMsSpuRFUyCZgg=,Co/nVSLofrWsM5qpcKLXfekegArokgN29XjEXttuXK4=,1.415919,0.0,1.0,0.0,0.300079,YES,NO,NO,NO,YES,6,0.16704,3,3,0.971973,0.0,1.0,1263,892,YES,NO,YES,1,0.195067,0.285827,wIHg6aGH2GMPX6l1pCTzeS1bXE4jxRqmd9ubES4HgW8=,NO,NO,ST8+q2Jgb91pWEwLwmSoJzXEGsQKeQGbzlLbgHPtj4w=,rB07AAHPffU4zFFF8IrqfKSltyWcPyy4+q+IM5SLZiQ=,1.415919,0.0,1.0,0.0,0.400633,NO,NO,NO,NO,NO,9,0.144619,10,14,0.944507,-0.5,1.0,1263,892,NO,NO,NO,1,0.221973,0.386382,WYQEP5EEzM+P+nfkHKLkGko/S3RdBgfEQ3IcyYwrChE=,NO,NO,fylJzYvYlM0+kRBeLB3eFKKgCibqxFvBa8hL+WStwCE=,IoM2E9pNxABFR+H3yfapUL+ThKm7GtTzY7js9H/H99o=,1.415919,0.0,1.0,0.0,0.375297,NO,NO,NO,NO,NO,1,0.065022,8,11,0.92713,0.0,1.0,1263,892,NO,NO,NO,0,0.137892,0.361045,1.415919,0.0,1.0,0.0,0.375297,NO,NO,NO,NO,NO,9,0.146861,11,11,0.900224,0.0,1.0,1263,892,NO,NO,NO,1,0.246637,0.361045


In [4]:
train_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700000 entries, 0 to 1699999
Columns: 146 entries, id to x145
dtypes: float64(55), int64(31), object(60)
memory usage: 1.8+ GB


In [5]:
train_labels.shape

(1700000, 34)

In [6]:
train_labels.head()

Unnamed: 0,id,y1,y2,y3,y4,y5,y6,y7,y8,y9,y10,y11,y12,y13,y14,y15,y16,y17,y18,y19,y20,y21,y22,y23,y24,y25,y26,y27,y28,y29,y30,y31,y32,y33
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [7]:
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700000 entries, 0 to 1699999
Data columns (total 34 columns):
id     int64
y1     int64
y2     int64
y3     int64
y4     int64
y5     int64
y6     int64
y7     int64
y8     int64
y9     int64
y10    int64
y11    int64
y12    int64
y13    int64
y14    int64
y15    int64
y16    int64
y17    int64
y18    int64
y19    int64
y20    int64
y21    int64
y22    int64
y23    int64
y24    int64
y25    int64
y26    int64
y27    int64
y28    int64
y29    int64
y30    int64
y31    int64
y32    int64
y33    int64
dtypes: int64(34)
memory usage: 441.0 MB


#### Metadata <a class="anchor" id="metadata"></a>

In this section, we categorize the columns to make them easier to manipulate. We'll store:
* **dtype**: int, float, str
* **category**: content, numerical, boolean

In [5]:
meta = d.create_features_meta(train_features)
meta.head(10)

Unnamed: 0_level_0,role,category,dtype
varname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
id,id,numerical,int64
x1,input,boolean,object
x2,input,boolean,object
x3,input,content,object
x4,input,content,object
x5,input,numerical,float64
x6,input,numerical,float64
x7,input,numerical,float64
x8,input,numerical,float64
x9,input,numerical,float64
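
The helper `d.create_features_meta` lives in the project's `src.describe` module and is not shown in this report. The sketch below is a hypothetical stand-in that produces a table with the same columns (`role`, `category`, `dtype`). It assumes that non-numeric columns holding only `YES`/`NO` are boolean flags and that every other non-numeric column is hashed content; the real helper may use different rules.

```python
import pandas as pd


def create_features_meta_sketch(df, id_col="id"):
    """Hypothetical stand-in for d.create_features_meta (not the author's code)."""
    rows = []
    for name in df.columns:
        if pd.api.types.is_numeric_dtype(df[name]):
            category = "numerical"
        else:
            # Assumption: columns containing only YES/NO are boolean flags,
            # any other non-numeric column is hashed text content.
            values = set(df[name].dropna().unique())
            category = "boolean" if values <= {"YES", "NO"} else "content"
        role = "id" if name == id_col else "input"
        rows.append({"varname": name, "role": role,
                     "category": category, "dtype": str(df[name].dtype)})
    return pd.DataFrame(rows).set_index("varname")


# Toy frame with one column of each kind seen in the dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "x1": ["NO", None, "YES"],
    "x3": ["dqOiM6yB", None, "ib4VpsEs"],
    "x5": [0.576561, 0.0, 1.341803],
    "x15": [42, 0, 2],
})
meta_toy = create_features_meta_sketch(toy)
print(meta_toy)
```

Indexing the table by `varname` is what makes lookups such as `meta[meta.category == 'boolean'].index` return feature names directly.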


Extract all boolean features:

In [6]:
meta[meta.category == 'boolean'].index

Index(['x1', 'x2', 'x10', 'x11', 'x12', 'x13', 'x14', 'x24', 'x25', 'x26',
       'x30', 'x31', 'x32', 'x33', 'x41', 'x42', 'x43', 'x44', 'x45', 'x55',
       'x56', 'x57', 'x62', 'x63', 'x71', 'x72', 'x73', 'x74', 'x75', 'x85',
       'x86', 'x87', 'x92', 'x93', 'x101', 'x102', 'x103', 'x104', 'x105',
       'x115', 'x116', 'x117', 'x126', 'x127', 'x128', 'x129', 'x130', 'x140',
       'x141', 'x142'],
      dtype='object', name='varname')

See the quantity of feature per category:

In [7]:
pd.DataFrame({'count' : meta.groupby(['category', 'dtype'])['dtype'].size()}).reset_index()

Unnamed: 0,category,dtype,count
0,boolean,object,50
1,content,object,10
2,numerical,int64,31
3,numerical,float64,55


#### Descriptive Statistics <a class="anchor" id="descriptive"></a>

In this section we apply the _describe_ method to the features, split by category and dtype, to calculate the mean, standard deviation, max, min, and so on.

**_Numerical float variables_**

In [8]:
float_features = meta[(meta.category == 'numerical') & (meta.dtype == 'float64')].index
float_train_features = train_features[float_features]
float_train_features_describe = float_train_features.describe()
float_train_features_describe

Unnamed: 0,x5,x6,x7,x8,x9,x16,x19,x20,x21,x28,x29,x36,x37,x38,x39,x40,x47,x50,x51,x52,x59,x60,x66,x67,x68,x69,x70,x77,x80,x81,x82,x89,x90,x96,x97,x98,x99,x100,x107,x110,x111,x112,x119,x120,x121,x122,x123,x124,x125,x132,x135,x136,x137,x144,x145
count,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0
mean,0.9551493,0.05531406,0.7906443,0.1731225,0.4462953,0.4196774,0.8185989,-0.06392546,0.7858669,0.452319,0.4378473,1.082852,0.06117782,0.8954949,0.1830648,0.4846973,0.4348303,0.9112922,-0.08473989,0.8910999,0.489881,0.4749733,1.082232,0.06106558,0.884159,0.1989396,0.5124087,0.427005,0.8953298,-0.08269609,0.8798031,0.4870141,0.5029204,1.027301,0.05602975,0.8466713,0.1833974,0.4780893,0.3693654,0.8783841,-0.06801648,0.8460217,0.4053633,0.4689895,1.12141,0.06366866,0.9229534,0.1967985,0.5163925,0.4397928,0.9336593,0.07678663,0.9231219,0.5238243,0.5053399
std,0.5278641,0.1318832,0.3549407,0.3326885,0.3026847,0.2945485,0.342265,0.4972203,0.3457971,0.3019166,0.3010004,0.4222894,0.1369235,0.2233018,0.3280037,0.2633006,0.2817998,0.1892331,0.6268609,0.202493,0.2823276,0.2633976,0.4265242,0.136556,0.2422241,0.3513484,0.2731636,0.2850214,0.2214036,0.606453,0.2234765,0.2869244,0.2727644,0.476128,0.1275268,0.2958083,0.3368791,0.2857023,0.2660854,0.2688927,0.5270641,0.2804542,0.267306,0.2848163,0.3789949,0.1397335,0.1604844,0.3432269,0.2577492,0.2781111,0.07191982,1.538817,0.121564,0.2703335,0.2582544
min,0.0,0.0,0.0,0.0,-1.042755,-0.5919283,-0.3520179,-46.0,0.0,-0.5762332,-1.051465,0.0,0.0,0.0,0.0,-1.839272,-0.6244395,-0.4652466,-48.0,0.0,-0.5997758,-1.847189,0.0,0.0,0.0,0.0,-1.839272,-0.6244395,-0.4652466,-48.0,0.0,-0.5997758,-1.847189,0.0,0.0,0.0,0.0,-1.042755,-0.6244395,-0.3520179,-48.0,0.0,-0.5997758,-1.051465,0.0,0.0,0.0,0.0,-2.584323,-0.6793722,-0.3520179,-48.0,0.0,-0.5762332,-2.592241
25%,0.6367211,0.0,0.8438324,0.0,0.1961279,0.1670404,0.9292196,0.0,0.79,0.206278,0.1854545,0.7068146,0.0,0.92317,0.0,0.2782293,0.1717569,0.9327354,0.0,0.86,0.2331839,0.2686212,0.7068146,0.0,0.9204477,0.0,0.3056215,0.1636771,0.9310954,0.0,0.86,0.2302266,0.2955626,0.6539119,0.0,0.9056261,0.0,0.2541568,0.1267636,0.9383872,0.0,0.85,0.1729664,0.2435905,0.7879613,0.0,0.9277072,0.0,0.3151227,0.1785164,0.916293,0.0,0.88,0.2756953,0.3041061
50%,1.270115,0.0,0.9588627,0.0,0.4393339,0.4002242,0.9630045,0.0,0.95,0.4433551,0.4295221,1.294118,0.0,1.0,0.0,0.4632322,0.4104658,0.9582577,0.0,1.0,0.4697309,0.4536817,1.294118,0.0,1.0,0.0,0.5002273,0.3976471,0.9576656,0.0,1.0,0.4639082,0.4899103,1.294118,0.0,0.9788328,0.0,0.4656831,0.3279412,0.9652466,0.0,0.98,0.3721973,0.456057,1.296128,0.0,1.0,0.0,0.4957192,0.4170404,0.9495516,0.0,1.0,0.5235426,0.484375
75%,1.414798,0.05837871,1.0,0.1451906,0.6866182,0.682207,0.9809417,0.0,1.0,0.7197309,0.6767036,1.414798,0.06624319,1.0,0.2001202,0.6844131,0.6894619,0.9758016,0.0,1.0,0.7387773,0.6750069,1.414798,0.06594071,1.0,0.2177858,0.739096,0.6863479,0.9760349,0.0,1.0,0.7432735,0.7292162,1.414798,0.062954,1.0,0.18,0.704996,0.609583,0.9814708,0.0,1.0,0.6479412,0.6952862,1.414798,0.06836056,1.0,0.221213,0.7247429,0.690583,0.9725093,0.0,1.0,0.7640909,0.7132486
max,2.732124,0.9987901,1.0,1.753333,1.942155,7.929372,0.9997862,14.0,1.0,7.96861,1.932647,2.732124,0.998242,1.0,1.753333,1.942155,3.884529,0.9997862,14.0,1.0,3.923767,1.932647,2.732124,0.998242,1.0,1.793262,1.942155,3.884529,0.9997862,9.333333,1.0,3.923767,1.932647,2.732124,0.995312,1.0,1.753333,1.942155,7.929372,0.9997862,14.0,1.0,7.96861,1.932647,2.732124,0.998242,1.0,1.753333,1.942155,7.974215,0.9999985,147.0,1.0,8.002242,1.932647


In [9]:
float_train_features_describe.loc[['min', 'max']]

Unnamed: 0,x5,x6,x7,x8,x9,x16,x19,x20,x21,x28,x29,x36,x37,x38,x39,x40,x47,x50,x51,x52,x59,x60,x66,x67,x68,x69,x70,x77,x80,x81,x82,x89,x90,x96,x97,x98,x99,x100,x107,x110,x111,x112,x119,x120,x121,x122,x123,x124,x125,x132,x135,x136,x137,x144,x145
min,0.0,0.0,0.0,0.0,-1.042755,-0.591928,-0.352018,-46.0,0.0,-0.576233,-1.051465,0.0,0.0,0.0,0.0,-1.839272,-0.624439,-0.465247,-48.0,0.0,-0.599776,-1.847189,0.0,0.0,0.0,0.0,-1.839272,-0.624439,-0.465247,-48.0,0.0,-0.599776,-1.847189,0.0,0.0,0.0,0.0,-1.042755,-0.624439,-0.352018,-48.0,0.0,-0.599776,-1.051465,0.0,0.0,0.0,0.0,-2.584323,-0.679372,-0.352018,-48.0,0.0,-0.576233,-2.592241
max,2.732124,0.99879,1.0,1.753333,1.942155,7.929372,0.999786,14.0,1.0,7.96861,1.932647,2.732124,0.998242,1.0,1.753333,1.942155,3.884529,0.999786,14.0,1.0,3.923767,1.932647,2.732124,0.998242,1.0,1.793262,1.942155,3.884529,0.999786,9.333333,1.0,3.923767,1.932647,2.732124,0.995312,1.0,1.753333,1.942155,7.929372,0.999786,14.0,1.0,7.96861,1.932647,2.732124,0.998242,1.0,1.753333,1.942155,7.974215,0.999998,147.0,1.0,8.002242,1.932647


In [10]:
float_train_features.isnull().any().any()

False

The features that are already scaled to [0, 1] are: x6, x7, x21, x37, x38, x52, x67, x68, x82, x97, x98, x112, x122, x123, x137.

Depending on the classifier, we may need to scale the remaining features.

None of these features contains NaN values.
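
One way to scale only the features that fall outside [0, 1] is min-max scaling. The sketch below is an assumption about how this could be done, not the method used later in the project; it runs on a toy frame with illustrative values and uses only pandas.

```python
import pandas as pd

# Toy stand-in for a few float features; values are illustrative only.
toy = pd.DataFrame({
    "x6": [0.0, 0.05, 0.99],     # already inside [0, 1]
    "x20": [-46.0, 0.0, 14.0],   # far outside [0, 1]
    "x5": [0.0, 1.27, 2.73],
})

# Columns whose observed range already lies inside [0, 1] are left untouched.
in_unit_range = (toy.min() >= 0.0) & (toy.max() <= 1.0)
to_scale = in_unit_range.index[~in_unit_range]

# Min-max scaling: (x - min) / (max - min), computed per column.
scaled = toy.copy()
col_min = toy[to_scale].min()
col_range = toy[to_scale].max() - col_min
scaled[to_scale] = (toy[to_scale] - col_min) / col_range

print(list(to_scale))
print(scaled)
```

In a real pipeline the minimum and range would be computed on the training set and reused unchanged on the test set. Tree-based classifiers usually do not need this step at all.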


**_Numerical int variables_**

In [11]:
int_features = meta[(meta.category == 'numerical') & (meta.dtype == 'int64')].index
int_train_features = train_features[int_features]
int_train_features_describe = int_train_features.describe()
int_train_features_describe

Unnamed: 0,id,x15,x17,x18,x22,x23,x27,x46,x48,x49,x53,x54,x58,x76,x78,x79,x83,x84,x88,x106,x108,x109,x113,x114,x118,x131,x133,x134,x138,x139,x143
count,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0,1700000.0
mean,850000.5,6.154404,4.487084,8.096322,2301.595,1874.765,2.814097,8.15102,6.973005,8.36331,2605.915,2098.133,2.640614,8.259205,7.502141,8.401614,2581.685,2079.391,2.685396,7.851039,4.854556,8.419647,2469.781,1998.761,2.314082,4.809295,9.301809,8.842868,2688.457,2163.423,3.632134
std,490747.9,8.957511,4.623426,7.123864,1745.12,1517.991,4.409801,10.3605,9.311837,6.656968,1632.477,1421.664,3.82331,10.55778,11.29549,6.83433,1648.193,1433.16,3.921535,10.01937,4.483374,6.886586,1691.323,1470.973,3.914509,8.966942,7.725215,6.665332,1591.809,1392.787,9.42412
min,1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0
25%,425000.8,1.0,2.0,3.0,1261.0,892.0,0.0,2.0,3.0,4.0,1262.0,892.0,0.0,2.0,3.0,4.0,1262.0,892.0,0.0,2.0,2.0,4.0,1261.0,892.0,0.0,0.0,4.0,4.0,1262.0,892.0,1.0
50%,850000.5,3.0,4.0,7.0,1263.0,892.0,1.0,4.0,5.0,7.0,1263.0,918.0,1.0,4.0,5.0,7.0,1263.0,918.0,1.0,4.0,4.0,7.0,1263.0,918.0,1.0,1.0,7.0,7.0,1263.0,918.0,2.0
75%,1275000.0,7.0,6.0,11.0,4400.0,3307.0,4.0,9.0,8.0,11.0,4659.0,3307.0,4.0,9.0,8.0,12.0,4643.0,3307.0,4.0,9.0,7.0,12.0,4400.0,3307.0,3.0,5.0,13.0,12.0,4672.0,3308.0,5.0
max,1700000.0,153.0,237.0,219.0,19500.0,14167.0,672.0,153.0,371.0,219.0,19500.0,14167.0,337.0,153.0,271.0,219.0,19500.0,14167.0,284.0,153.0,243.0,219.0,19500.0,14167.0,410.0,153.0,301.0,219.0,19500.0,14167.0,1219.0


In [12]:
int_train_features_describe.loc[['min', 'max']]

Unnamed: 0,id,x15,x17,x18,x22,x23,x27,x46,x48,x49,x53,x54,x58,x76,x78,x79,x83,x84,x88,x106,x108,x109,x113,x114,x118,x131,x133,x134,x138,x139,x143
min,1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0
max,1700000.0,153.0,237.0,219.0,19500.0,14167.0,672.0,153.0,371.0,219.0,19500.0,14167.0,337.0,153.0,271.0,219.0,19500.0,14167.0,284.0,153.0,243.0,219.0,19500.0,14167.0,410.0,153.0,301.0,219.0,19500.0,14167.0,1219.0


In [13]:
# Check the features themselves, not the describe() summary
int_train_features.isnull().any().any()

False

None of the int numerical features is scaled, so depending on the algorithm we will have to scale them. There are no missing values. The problem here is that we cannot tell whether a given feature is categorical or quantitative.

**_Content variables_**

In [14]:
content_features = meta[(meta.category == 'content')].index
content_train_features = train_features[content_features]
content_train_features_describe = content_train_features.describe()
content_train_features_describe

Unnamed: 0,x3,x4,x34,x35,x61,x64,x65,x91,x94,x95
count,1451737,1451737,1649154,1649154,1699968,1628939,1628939,1699968,1559381,1559381
unique,201881,26428,241245,35700,461141,245418,36814,134998,184925,26791
top,MZZbXga8gvaCBqWpzrh2iKdOkcsz/bG/z4BVjUnqWT0=,hCXwO/JldK5zcd9ejOD1FwmEgCf96eTdEVy7OtY2Y2g=,MZZbXga8gvaCBqWpzrh2iKdOkcsz/bG/z4BVjUnqWT0=,YvZUuCDjLu9VvkCdBWgARWQrvm+FSXgxp0zIrMjcLBc=,X/hdUOVR5KuExVGLzjhLcM2CyIqym9t0Nh+ZX05M+1w=,MZZbXga8gvaCBqWpzrh2iKdOkcsz/bG/z4BVjUnqWT0=,YvZUuCDjLu9VvkCdBWgARWQrvm+FSXgxp0zIrMjcLBc=,+yhSY//Hpg7u0bSA7NYmcmRFgv3bF4Tw3BMHrBqaTtA=,MZZbXga8gvaCBqWpzrh2iKdOkcsz/bG/z4BVjUnqWT0=,hCXwO/JldK5zcd9ejOD1FwmEgCf96eTdEVy7OtY2Y2g=
freq,51212,86750,57978,84162,60369,65666,93540,89494,47700,105073


In [15]:
uniques = set()
for c in content_train_features.columns:
    uniques.update(content_train_features[c].unique().tolist())
print('total uniques words={}'.format(len(uniques)))

# flattening all the words to count them
all_words = pd.Series(content_train_features.values.flatten('F'))
all_words = all_words.to_frame().reset_index()
print('total words={}'.format(all_words.shape[0]))
all_words = all_words.rename(columns= {0: 'words'})
all_words = pd.DataFrame({'count' : all_words.groupby(['words'])['words'].size()}).reset_index()
all_words.sort_values('count', ascending=False).head(10)

total uniques words=979749
total words=17000000


Unnamed: 0,words,count
565834,YvZUuCDjLu9VvkCdBWgARWQrvm+FSXgxp0zIrMjcLBc=,392698
538278,X6dDAI/DZOWvu0Dg6gCgRoNr2vTUz/mc4SdHTNUPS38=,356811
692301,hCXwO/JldK5zcd9ejOD1FwmEgCf96eTdEVy7OtY2Y2g=,317031
376725,MZZbXga8gvaCBqWpzrh2iKdOkcsz/bG/z4BVjUnqWT0=,273502
199214,B+EJpnEbkYtLnwDQYN1dP1rcfnoCnxAjKLYwQZE07Ew=,260233
15027,+yhSY//Hpg7u0bSA7NYmcmRFgv3bF4Tw3BMHrBqaTtA=,260166
528829,WV5vAHFyqkeuyFB5KVNGFOBuwjkUGKYc8wh9QfpVzAA=,237367
264280,FExKgjj6CsbToTubdZ+kGsOmUx3gCvZVJCdZPcdPNF4=,208934
808722,oo9tGpHvTredpg9JkHgYbZAuxcwtSpQxU5mA/zUbxY8=,182455
49401,1CiKJR7D66tRwH6l6wwv0p+D/tAuoW+NdSNqPTbvDoQ=,176907


In [16]:
content_train_features.isnull().sum()

x3     248263
x4     248263
x34    50846 
x35    50846 
x61    32    
x64    71061 
x65    71061 
x91    32    
x94    140619
x95    140619
dtype: int64

Across the hashed-word columns there are 979,749 unique values out of 17,000,000 cells (1.7 million rows x 10 columns), so unique values make up 5.76% of the total. The total counts every cell, including the missing ones. This suggests that the words can have a large impact on the classifier, because many of them occur multiple times. The NaN values, however, need to be handled.
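
A variant of the count that excludes missing cells, together with one possible treatment of the NaN values, is sketched below on a toy frame. The `__MISSING__` sentinel is an assumption chosen for illustration; any token that cannot collide with a real hash would work.

```python
import pandas as pd

toy = pd.DataFrame({
    "x3": ["hashA", None, "hashB", "hashA"],
    "x4": ["hashC", None, "hashC", "hashA"],
})

# Put all content columns into one long Series of tokens.
tokens = toy.melt(value_name="token")["token"]

n_present = int(tokens.notna().sum())   # cells that actually hold a word
n_unique = tokens.nunique()             # nunique() ignores NaN by default
unique_share = n_unique / n_present

# value_counts() also skips NaN, so this is the frequency of real words only.
counts = tokens.value_counts()

# One simple treatment: replace NaN with an explicit sentinel token.
filled = tokens.fillna("__MISSING__")

print(n_present, n_unique, unique_share)
print(counts.head())
```

Treating "missing" as its own token keeps the information that a neighbouring text block was absent, which may itself be predictive.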

**_Boolean variables_**

In [17]:
bool_vars = meta[(meta.category == 'boolean')].index
train_features[bool_vars].describe()
train_features[bool_vars].isnull().sum()

x1      248190
x2      248190
x10     248263
x11     248263
x12     248263
x13     248263
x14     248263
x24     248263
x25     248263
x26     248263
x30     0     
x31     0     
x32     50772 
x33     50772 
x41     50846 
x42     50846 
x43     50846 
x44     50846 
x45     50846 
x55     50846 
x56     50846 
x57     50846 
x62     70978 
x63     70978 
x71     71061 
x72     71061 
x73     71061 
x74     71061 
x75     71061 
x85     71061 
x86     71061 
x87     71061 
x92     140526
x93     140526
x101    140619
x102    140619
x103    140619
x104    140619
x105    140619
x115    140619
x116    140619
x117    140619
x126    32    
x127    32    
x128    32    
x129    32    
x130    32    
x140    32    
x141    32    
x142    32    
dtype: int64

Among the boolean features, only two (x30 and x31) have no missing values, so the missing values in all the other boolean features need to be handled.
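
A possible encoding of the `YES`/`NO` columns that also handles the missing values is sketched below. Mapping NaN to -1 and adding a separate `_missing` indicator column are assumptions made for illustration, not necessarily the treatment used later in the project.

```python
import pandas as pd

toy = pd.DataFrame({
    "x1": ["NO", None, "YES"],
    "x30": ["NO", "NO", "YES"],
})

encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    # YES -> 1, NO -> 0; values not in the mapping (including NaN) become
    # NaN after map() and are then replaced by -1.
    encoded[col] = toy[col].map({"YES": 1, "NO": 0}).fillna(-1).astype("int8")
    # Indicator so a model can tell "missing" apart from a real value.
    encoded[col + "_missing"] = toy[col].isna().astype("int8")

print(encoded)
```

The missing counts above come in groups of identical size (for example 248,263 for x10 to x14), so one indicator per group may be enough in practice.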

**_Label variables_**

In [46]:
total = train_labels.shape[0]
for col in train_labels.columns:
    if col != 'id':
        print(train_labels[col].value_counts(sort=True))
        print('')

0    1689631
1    10369  
Name: y1, dtype: int64

0    1698871
1    1129   
Name: y2, dtype: int64

0    1664400
1    35600  
Name: y3, dtype: int64

0    1677704
1    22296  
Name: y4, dtype: int64

0    1699855
1    145    
Name: y5, dtype: int64

0    1573102
1    126898 
Name: y6, dtype: int64

0    1635569
1    64431  
Name: y7, dtype: int64

0    1698519
1    1481   
Name: y8, dtype: int64

0    1567117
1    132883 
Name: y9, dtype: int64

0    1670709
1    29291  
Name: y10, dtype: int64

0    1698432
1    1568   
Name: y11, dtype: int64

0    1575878
1    124122 
Name: y12, dtype: int64

0    1675185
1    24815  
Name: y13, dtype: int64

0    1700000
Name: y14, dtype: int64

0    1695913
1    4087   
Name: y15, dtype: int64

0    1681200
1    18800  
Name: y16, dtype: int64

0    1699824
1    176    
Name: y17, dtype: int64

0    1699704
1    296    
Name: y18, dtype: int64

0    1698863
1    1137   
Name: y19, dtype: int64

0    1695057
1    4943   
Name: y20, dtype: int64

...

Each label takes only the values 0 and 1, so every label is a binary classification problem. Label y14 has no positive examples in the training set.

In [47]:
total_perc = 0
for col in train_labels.columns:
    if col != 'id':
        total_1 = total - train_labels[col].value_counts(sort=True)[0]
        perc = total_1 / total
        total_perc += perc
        print('Column {} has {} positive labels, {:.2%} of total'.format(col, total_1, perc))
print('total_perc={}'.format(total_perc))

Column y1 has 10369 positive labels, 0.61% of total
Column y2 has 1129 positive labels, 0.07% of total
Column y3 has 35600 positive labels, 2.09% of total
Column y4 has 22296 positive labels, 1.31% of total
Column y5 has 145 positive labels, 0.01% of total
Column y6 has 126898 positive labels, 7.46% of total
Column y7 has 64431 positive labels, 3.79% of total
Column y8 has 1481 positive labels, 0.09% of total
Column y9 has 132883 positive labels, 7.82% of total
Column y10 has 29291 positive labels, 1.72% of total
Column y11 has 1568 positive labels, 0.09% of total
Column y12 has 124122 positive labels, 7.30% of total
Column y13 has 24815 positive labels, 1.46% of total
Column y14 has 0 positive labels, 0.00% of total
Column y15 has 4087 positive labels, 0.24% of total
Column y16 has 18800 positive labels, 1.11% of total
Column y17 has 176 positive labels, 0.01% of total
Column y18 has 296 positive labels, 0.02% of total
Column y19 has 1137 positive labels, 0.07% of total
Column y20 has 4943 positive labels, 0.29% of total
...

Since `total_perc` sums to more than 1, some records have more than one label.
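
A sum of positive rates above 1 is sufficient to show that multi-label rows exist, but it is not necessary, and it does not say how many there are. Counting the labels per row answers the question directly. The sketch below shows the idea on a toy label frame with the same layout as `train_labels` (an `id` column followed by 0/1 label columns).

```python
import pandas as pd

toy_labels = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "y1": [1, 0, 0, 0],
    "y2": [0, 1, 0, 1],
    "y3": [0, 0, 0, 1],
})

# Number of positive labels in each row.
labels_per_row = toy_labels.drop(columns="id").sum(axis=1)

# How many rows have 0, 1, 2, ... labels.
distribution = labels_per_row.value_counts().sort_index()

# Rows that carry more than one label.
n_multi = int((labels_per_row > 1).sum())

print(distribution)
print(n_multi)
```

The same distribution also reveals rows with no positive label at all, which the sum of rates cannot show.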

#### Data Quality Checks <a class="anchor" id="quality_check"></a>

**_Checking Missing Values_**

In [18]:
vars_with_missing = []
for f in train_features.columns:
    missings = train_features[f].isnull().sum()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings/train_features.shape[0]
        category = meta.loc[f]['category']
        dtype = meta.loc[f]['dtype']
        
        print('Variable {} ({}, {}) has {} records ({:.2%}) with missing values'.format(f, category, dtype, missings, missings_perc))
        
print('In total, there are {} variables with missing values'.format(len(vars_with_missing)))

Variable x1 (boolean, object) has 248190 records (14.60%) with missing values
Variable x2 (boolean, object) has 248190 records (14.60%) with missing values
Variable x3 (content, object) has 248263 records (14.60%) with missing values
Variable x4 (content, object) has 248263 records (14.60%) with missing values
Variable x10 (boolean, object) has 248263 records (14.60%) with missing values
Variable x11 (boolean, object) has 248263 records (14.60%) with missing values
Variable x12 (boolean, object) has 248263 records (14.60%) with missing values
Variable x13 (boolean, object) has 248263 records (14.60%) with missing values
Variable x14 (boolean, object) has 248263 records (14.60%) with missing values
Variable x24 (boolean, object) has 248263 records (14.60%) with missing values
Variable x25 (boolean, object) has 248263 records (14.60%) with missing values
Variable x26 (boolean, object) has 248263 records (14.60%) with missing values
Variable x32 (boolean, object) has 50772 records (2.99%) with missing values
...

**_Checking the cardinality of the int variables_**

Cardinality is the number of distinct values of a variable. We check it to see which features could be turned into dummy variables.

In [19]:
for f in int_train_features:
    dist_values = int_train_features[f].value_counts().shape[0]
    print('Variable {} has {} distinct values'.format(f, dist_values))

Variable id has 1700000 distinct values
Variable x15 has 97 distinct values
Variable x17 has 119 distinct values
Variable x18 has 108 distinct values
Variable x22 has 498 distinct values
Variable x23 has 419 distinct values
Variable x27 has 193 distinct values
Variable x46 has 122 distinct values
Variable x48 has 167 distinct values
Variable x49 has 110 distinct values
Variable x53 has 499 distinct values
Variable x54 has 419 distinct values
Variable x58 has 149 distinct values
Variable x76 has 127 distinct values
Variable x78 has 184 distinct values
Variable x79 has 109 distinct values
Variable x83 has 498 distinct values
Variable x84 has 417 distinct values
Variable x88 has 141 distinct values
Variable x106 has 109 distinct values
Variable x108 has 123 distinct values
Variable x109 has 109 distinct values
Variable x113 has 500 distinct values
Variable x114 has 419 distinct values
Variable x118 has 159 distinct values
Variable x131 has 125 distinct values
Variable x133 has 158 distinct values
...

At this point I cannot tell whether to treat these variables as categorical and transform them into dummy variables, or to treat them as quantitative variables.
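
One common heuristic is to one-hot encode only the integer features whose cardinality is below some cut-off and keep the rest numeric. The sketch below illustrates this with `pd.get_dummies` on a toy frame. The threshold `MAX_LEVELS` is a hypothetical value chosen for the toy data, not a recommendation for this dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "x15": [42, 0, 2, 42, 7, 9],   # many distinct values: keep numeric
    "x27": [0, 1, 0, 2, 1, 0],     # few distinct values: candidate for dummies
})

MAX_LEVELS = 3  # hypothetical cut-off, chosen only for this toy example

# Cardinality = number of distinct values per column.
cardinality = toy.nunique()
categorical_cols = cardinality.index[cardinality <= MAX_LEVELS].tolist()

# One-hot encode the low-cardinality columns; the others pass through as-is.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(categorical_cols)
print(encoded.columns.tolist())
```

Every integer feature listed above has 97 or more distinct values, so dummy encoding would add at least that many columns per feature. This is one argument for treating them as quantitative unless the visualizations suggest otherwise.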

### Exploratory Visualization

1. [Categorical Variables](#categorical_variables)

#### Categorical Variables <a class="anchor" id="categorical_variables"></a> 