# Machine Learning Engineer Nanodegree
## Capstone Proposal
Felipe Santos

October 4th, 2018

## Proposal

My proposal for the capstone project is to beat the benchmark in the Tradeshift Text Classification competition on Kaggle. In this competition, the machine learning engineer has to classify text blocks in documents into certain labels, which makes it a multiclass classification problem with tabular data. The competition started on 10/02/2014 and ended on 11/10/2014, and it is now listed as a featured competition.


### Domain Background
"Optical character recognition (also optical character reader, OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo" [1]. OCR is used as a data-entry method so that documents can be stored more compactly and searched more easily. However, the process only performs a literal translation from image to text, so text classification algorithms are needed to extract information from this unstructured format and to move from document retrieval to knowledge discovery [2]. The need to automatically retrieve useful knowledge from the huge amount of textual data in order to assist human analysis is fully apparent [3].

The Tradeshift competition is about predicting the probability that a piece of text belongs to a given class. The dataset was created from thousands of documents, representing millions of words. In each document, several bounding boxes containing text are selected, features are extracted from these texts, and labels are assigned. OCR (optical character recognition) is used for the text extraction, and supervised machine learning is used to classify the text. In the provided dataset, the OCR step has already been performed and the features have already been extracted. I want to work on this project to gain insights for a future project of my own that has some similarities with this competition.

### Problem Statement

In this competition, we have to build a supervised machine learning model that predicts the labels of text parsed by OCR, using the features provided in the Tradeshift dataset. For all the documents, words are detected and combined to form text blocks that may overlap each other. Each text block is enclosed within a spatial box, which is depicted by a red line in the sketch below. The text blocks from all documents are aggregated in a data set where each text block corresponds to one sample (row).

![text classification](imgs/text-classification.png)
![text classification](imgs/text-classification-2.png)

### Datasets and Inputs

The files with the dataset used for this capstone are available in the Data section of the [competition page](https://www.kaggle.com/c/tradeshift-text-classification). There are 4 files, all in CSV format with a 1-row header; each row stores a different sample and the columns are separated by commas:
- **train.csv**, contains all features for the training set;
- **trainLabels.csv**, contains one row per label per sample, and the order of the rows is aligned with train.csv;
- **test.csv**, contains all features for the testing set;
- **sampleSubmission.csv**, contains a sample submission to the kaggle competition.
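
Because the rows of trainLabels.csv follow the same order as train.csv, features and labels can be loaded separately and then joined on the sample identifier. The sketch below illustrates the idea with two tiny in-memory CSV strings standing in for the real files. It assumes one row per sample with one column per label; the column names (`id`, `x1`, `x2`, `y1`, `y2`), the values, and the helper `load_aligned` are made up for illustration, and the real files are much wider.

```python
import io

import pandas as pd

# Tiny stand-ins for train.csv and trainLabels.csv (hypothetical values).
TRAIN_CSV = """id,x1,x2
1,NO,0.57
2,,0.0
3,YES,1.34
"""
LABELS_CSV = """id,y1,y2
1,0,1
2,1,0
3,0,0
"""


def load_aligned(features_src, labels_src):
    """Read features and labels and join them on `id`, checking row alignment."""
    features = pd.read_csv(features_src)
    labels = pd.read_csv(labels_src)
    # The two files are expected to list the samples in the same order.
    if not features["id"].equals(labels["id"]):
        raise ValueError("feature and label rows are not aligned")
    # validate="one_to_one" makes pandas fail loudly on duplicated ids.
    return features.merge(labels, on="id", validate="one_to_one")


train = load_aligned(io.StringIO(TRAIN_CSV), io.StringIO(LABELS_CSV))
print(train.shape)
```

With the real data, the two `io.StringIO` objects would be replaced by the paths to train.csv and trainLabels.csv.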

This dataset has ~2.2M samples (1,700,000 in the training set and 545,082 in the test set, roughly 76% and 24%), with 145 features and 33 labels to classify. The test set is split into public (30%) and private (70%) sets, which are used for the public and the private leaderboards of the competition. The features of the dataset fall into one of these categories:
- **Content**: The cryptographic hash of the raw text.
- **Parsing**: Indicates if the text parses as number, text, alphanumeric, etc.
- **Spatial**: Indicates the box position, size, etc.
- **Relational**: Includes information about the surrounding text blocks in the original document. If there is no such surrounding text block, e.g. a text block at the top of the document has no text block above it, these features are empty (no value).

The feature values can be:
- **Numbers**. Continuous/discrete numerical values.
- **Boolean**. The values include YES (true) or NO (false).
- **Categorical**. Values within a finite set of possible values.
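
Most estimators need numeric inputs, so the three value types above are usually encoded differently. A minimal sketch, assuming pandas and made-up column names (`flag`, `token_hash`, `width`): boolean YES/NO strings become 1/0 with missing values preserved, and categorical strings become integer codes. This is one possible encoding, not necessarily the one used later in the project.

```python
import pandas as pd

# Hypothetical sample with one column per value type; None marks a missing value.
df = pd.DataFrame(
    {
        "flag": ["YES", "NO", None, "YES"],         # boolean
        "token_hash": ["a1=", "b2=", "a1=", None],  # categorical (hashed text)
        "width": [0.57, 0.0, 1.34, 0.65],           # numerical
    }
)

# Boolean: map YES/NO to 1/0; missing entries stay NaN.
df["flag"] = df["flag"].map({"YES": 1, "NO": 0})

# Categorical: one integer code per distinct value; pandas uses -1 for missing.
df["token_hash"] = df["token_hash"].astype("category").cat.codes

print(df)
```

Keeping missing booleans as NaN (instead of forcing them to 0) preserves the information that a relational feature was empty.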

Some observations: 
* The order of samples and features is random. In fact, two consecutive samples in the table will most likely not belong to the same document.
* Some documents are OCR'ed; hence, some noise in the data is expected.
* The documents have different formats and the text belongs to several languages.
* The number of pages and text blocks per document is not constant.
* The meaning of the features and class is not provided.

### Data Exploration

1. [Loading Data](#loading_data)
2. [First Look](#first_look)
3. [Metadata](#metadata)

#### Loading Data <a class="anchor" id="loading_data"></a>

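```python
import pandas as pd

import src.describe as d  # project-local helper module for reading and describing the data

# Show every column and the full cell contents when displaying DataFrames.
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)  # None means no truncation (replaces the deprecated -1)

train_features = d.read_train_features()
test_features = d.read_test_features()
```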

#### First look <a class="anchor" id="first_look"></a>

```python
train_features.shape
```

```
(1700000, 146)
```

```python
train_features.head()
```

Unnamed: 0,id,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100,x101,x102,x103,x104,x105,x106,x107,x108,x109,x110,x111,x112,x113,x114,x115,x116,x117,x118,x119,x120,x121,x122,x123,x124,x125,x126,x127,x128,x129,x130,x131,x132,x133,x134,x135,x136,x137,x138,x139,x140,x141,x142,x143,x144,x145
0,1,NO,NO,dqOiM6yBYgnVSezBRiQXs9bvOFnRqrtIoXRIElxD7g8=,GNjrXXA3SxbgD0dTRblAPO9jFJ7AIaZnu/f48g5XSUk=,0.576561,0.073139,0.481394,0.115697,0.472474,YES,NO,NO,NO,NO,42,0.396065,3,6,0.991018,0.0,0.82,3306,4676,YES,NO,YES,0,0.405047,0.46461,NO,NO,NO,NO,mimucPmJSF6NI6KM6cPIaaVxWaQyIQzSgtwTTb9bKlc=,s7mTY62CCkWUFc36AW2TlYAy5CIcniD2Vz+lHzyYCLg=,0.576561,0.073139,0.481394,0.115697,0.45856,YES,NO,YES,NO,NO,9,0.368263,2,10,0.992729,0.0,0.94,3306,4676,YES,NO,YES,1,0.375535,0.451301,+2TNtXRI6r9owdGCS80Ia9VVv8ZpuOpVaHEvxRGGu78=,NO,NO,Op+X3asn5H7EQJErI7PR0NkUs3YB+Ld/8OfWuiOC8tU=,GeerC2BbPUcQfQO86NmvOsKrfTvmW7HF+Iru9y+7DPA=,0.576561,0.073139,0.481394,0.115697,0.487598,YES,NO,NO,NO,NO,42,0.363131,6,10,0.987596,0.0,0.71,3306,4676,YES,NO,YES,0,0.375535,0.479734,bxU52teuxC05EZyzFihSiKHczE2ZAIVCXekVLG7j3C0=,NO,NO,+dia7tCOijlRGbABX0YKG5L85x/hXLyJwwplN5Qab04=,f4Uu1R9nnf/h03aqiRQT0Fw3WItzNToLCyRlW1Pn8Z8=,0.576561,0.073139,0.481394,0.115697,0.473079,YES,NO,NO,NO,NO,37,0.333618,4,6,0.987169,0.0,0.89,3306,4676,YES,NO,YES,1,0.34645,0.46461,0.576561,0.073139,0.481394,0.115697,0.473079,YES,NO,NO,NO,NO,42,0.363131,5,6,0.987596,0.0,0.81,3306,4676,YES,NO,YES,2,0.375535,0.46461
1,2,,,,,0.0,0.0,0.0,0.0,0.0,,,,,,0,0.0,0,0,0.0,0.0,0.0,0,0,,,,0,0.0,0.0,NO,NO,NO,NO,l0G2rvmLGE6mpPtAibFsoW/0SiNnAuyAc4k35TrHvoQ=,lblNNeOLanWhqgISofUngPYP0Ne1yQv3QeNHqCAoh48=,1.058379,0.125832,0.932547,0.663037,0.569047,YES,NO,NO,NO,NO,9,0.709921,5,6,0.96824,0.0,0.81,4678,3306,YES,NO,YES,3,0.741682,0.560282,MZZbXga8gvaCBqWpzrh2iKdOkcsz/bG/z4BVjUnqWT0=,NO,NO,TqL9cs8ZFzALzVpZv6wYBDi+6zwhrdarQE/3FH+XAlA=,aZTF/lredyP4cukeN8bh6kpBjYmS1QFNpPOg2LVm3Lg=,1.058379,0.125832,0.932547,0.663037,0.628474,YES,NO,NO,NO,NO,2,0.679371,8,7,0.937387,0.0,0.84,4678,3306,YES,NO,YES,1,0.741984,0.619282,YvZUuCDjLu9VvkCdBWgARWQrvm+FSXgxp0zIrMjcLBc=,NO,NO,dsyhxXKNNJy4WVGD/v4+UGyW3jHWkx2xTdg3STsf34A=,X6dDAI/DZOWvu0Dg6gCgRoNr2vTUz/mc4SdHTNUPS38=,1.058379,0.125832,0.932547,0.663037,0.602394,NO,NO,NO,NO,NO,11,0.581367,3,6,0.966122,0.0,0.87,4678,3306,NO,NO,NO,3,0.615245,0.59363,1.058379,0.125832,0.932547,0.663037,0.602394,YES,NO,NO,NO,NO,9,0.709921,4,6,0.96824,0.0,0.51,4678,3306,YES,NO,YES,4,0.741682,0.59363
2,3,NO,NO,ib4VpsEsqJHzDiyL0dZLQ+xQzDPrkxE+9T3mx5fv2wI=,X6dDAI/DZOWvu0Dg6gCgRoNr2vTUz/mc4SdHTNUPS38=,1.341803,0.051422,0.935572,0.04144,0.50171,NO,NO,YES,NO,NO,2,0.838475,3,5,0.966122,0.0,0.74,4678,3306,NO,NO,NO,2,0.872353,0.493159,NO,NO,YES,YES,9TRXThP/ifDpJRGFX1LQseibUA1NJ3XM53gy+1eZ46k=,XSJ6E8aAoZC7/KAu3eETpfMg3mCq7HVBFIVIsoMKh9E=,1.341803,0.051422,0.935572,0.04144,0.447627,YES,NO,NO,NO,YES,2,0.752269,5,7,0.95493,0.0,0.82,4678,3306,YES,NO,YES,2,0.797338,0.438435,cr+kkNnNFV9YL0vz029hk3ohIDmGuABRVNhFe0ePZyo=,NO,NO,oFsUwSLCWcj8UA1cqILh5afKVcvwlFA+ohJ147Wkz5I=,WV5vAHFyqkeuyFB5KVNGFOBuwjkUGKYc8wh9QfpVzAA=,1.341803,0.051422,0.935572,0.04144,0.522873,YES,NO,NO,NO,NO,1,0.732305,6,6,0.95493,0.0,0.8,4678,3306,YES,NO,YES,0,0.777374,0.513681,X6dDAI/DZOWvu0Dg6gCgRoNr2vTUz/mc4SdHTNUPS38=,NO,NO,mRPnGiKVOWTk/vzZaqlLXZRtdrkcQ/sX0hqBCqOuKq0=,oo9tGpHvTredpg9JkHgYbZAuxcwtSpQxU5mA/zUbxY8=,1.341803,0.051422,0.935572,0.04144,0.50171,NO,NO,NO,NO,NO,2,0.65729,6,5,0.936479,0.0,0.79,4678,3306,NO,NO,NO,0,0.720811,0.493159,1.341803,0.051422,0.935572,0.04144,0.50171,NO,NO,YES,NO,NO,5,0.742589,3,5,0.966122,0.0,0.85,4678,3306,NO,NO,NO,1,0.776467,0.493159
3,4,YES,NO,BfrqME7vdLw3suQp6YAT16W2piNUmpKhMzuDrVrFQ4w=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.556564,YES,NO,NO,NO,NO,37,0.127405,8,15,0.959171,0.0,0.96,3306,4678,YES,NO,YES,1,0.168234,0.546582,NO,NO,YES,NO,BfrqME7vdLw3suQp6YAT16W2piNUmpKhMzuDrVrFQ4w=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.556564,YES,NO,NO,NO,NO,37,0.127405,8,15,0.959171,0.0,0.96,3306,4678,YES,NO,YES,1,0.168234,0.546582,XQG0f+jmjLI0UHAXXH2RYL4MEHa+yd9okO+730PCZuc=,YES,NO,/1yAAEg6Qib4GMD+wvGOlGmpCIPIAzioWtcCwbns9/I=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.557774,YES,NO,NO,NO,NO,22,0.067764,8,15,0.959598,0.0,0.93,3306,4678,YES,NO,YES,2,0.108166,0.547792,Vl+TDNSupucNoI+Fqeo7bMCkxg1hRjgTSS6NYb9BW00=,YES,NO,/1yAAEg6Qib4GMD+wvGOlGmpCIPIAzioWtcCwbns9/I=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653912,0.041471,0.940787,0.090851,0.557774,YES,NO,NO,NO,NO,22,0.067764,8,15,0.959598,0.0,0.93,3306,4678,YES,NO,YES,2,0.108166,0.547792,0.653912,0.041471,0.940787,0.090851,0.557774,YES,NO,NO,NO,NO,0,0.067764,17,15,0.92755,0.0,0.945,3306,4678,NO,NO,YES,3,0.168234,0.546582
4,5,NO,NO,RTjsrrR8DTlJyaIP9Q3Z8s0zseqlVQTrlSe97GCWfbk=,3yK2OPj1uYDsoMgsxsjY1FxXkOllD8Xfh20VYGqT+nU=,1.415919,0.0,1.0,0.0,0.375297,NO,NO,YES,NO,NO,1,0.523543,4,11,0.963004,0.0,1.0,1263,892,NO,NO,NO,2,0.560538,0.361045,NO,NO,NO,NO,XEDyQD4da6aJkZiBf+r7LD2VdhLGnCMsSpuRFUyCZgg=,Co/nVSLofrWsM5qpcKLXfekegArokgN29XjEXttuXK4=,1.415919,0.0,1.0,0.0,0.300079,YES,NO,NO,NO,YES,6,0.16704,3,3,0.971973,0.0,1.0,1263,892,YES,NO,YES,1,0.195067,0.285827,wIHg6aGH2GMPX6l1pCTzeS1bXE4jxRqmd9ubES4HgW8=,NO,NO,ST8+q2Jgb91pWEwLwmSoJzXEGsQKeQGbzlLbgHPtj4w=,rB07AAHPffU4zFFF8IrqfKSltyWcPyy4+q+IM5SLZiQ=,1.415919,0.0,1.0,0.0,0.400633,NO,NO,NO,NO,NO,9,0.144619,10,14,0.944507,-0.5,1.0,1263,892,NO,NO,NO,1,0.221973,0.386382,WYQEP5EEzM+P+nfkHKLkGko/S3RdBgfEQ3IcyYwrChE=,NO,NO,fylJzYvYlM0+kRBeLB3eFKKgCibqxFvBa8hL+WStwCE=,IoM2E9pNxABFR+H3yfapUL+ThKm7GtTzY7js9H/H99o=,1.415919,0.0,1.0,0.0,0.375297,NO,NO,NO,NO,NO,1,0.065022,8,11,0.92713,0.0,1.0,1263,892,NO,NO,NO,0,0.137892,0.361045,1.415919,0.0,1.0,0.0,0.375297,NO,NO,NO,NO,NO,9,0.146861,11,11,0.900224,0.0,1.0,1263,892,NO,NO,NO,1,0.246637,0.361045


```python
test_features.shape
```

```
(545082, 146)
```

```python
test_features.head()
```

Unnamed: 0,id,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100,x101,x102,x103,x104,x105,x106,x107,x108,x109,x110,x111,x112,x113,x114,x115,x116,x117,x118,x119,x120,x121,x122,x123,x124,x125,x126,x127,x128,x129,x130,x131,x132,x133,x134,x135,x136,x137,x138,x139,x140,x141,x142,x143,x144,x145
0,1700001,YES,NO,I7K7j9mvUurktDbybGa8nYojS5TrOrQqvAandHsdjv8=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653698,0.041684,0.940573,0.090851,0.873563,YES,NO,NO,NO,NO,37,0.126977,8,15,0.958529,0.0,0.98,3306,4678,YES,NO,YES,1,0.168448,0.863581,NO,NO,YES,NO,I7K7j9mvUurktDbybGa8nYojS5TrOrQqvAandHsdjv8=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653698,0.041684,0.940573,0.090851,0.873563,YES,NO,NO,NO,NO,37,0.126977,8,15,0.958529,0.0,0.98,3306,4678,YES,NO,YES,1,0.168448,0.863581,8ngG7Rfo7qXTJAywYaCsCPAw+f11cPD9a+uHlRuieyM=,YES,NO,o6BnPdK/EIene9/AvXkPZ6aSO/2zILFcZioWHGVP2pU=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653698,0.041684,0.940573,0.090851,0.874773,YES,NO,NO,NO,NO,36,0.066909,8,15,0.958316,0.0,0.98,3306,4678,YES,NO,YES,2,0.108593,0.864791,Vl+TDNSupucNoI+Fqeo7bMCkxg1hRjgTSS6NYb9BW00=,YES,NO,o6BnPdK/EIene9/AvXkPZ6aSO/2zILFcZioWHGVP2pU=,YGCdISifn4fLao/ASKdZFhGIq23oqzfSbUVb6px1pig=,0.653698,0.041684,0.940573,0.090851,0.874773,YES,NO,NO,NO,NO,36,0.066909,8,15,0.958316,0.0,0.98,3306,4678,YES,NO,YES,2,0.108593,0.864791,0.653698,0.041684,0.940573,0.090851,0.874773,YES,NO,NO,NO,NO,0,0.066909,17,15,0.925835,0.0,0.98,3306,4678,NO,NO,YES,3,0.168448,0.863581
1,1700002,NO,NO,cSuQaz0xx+UxAtskjMBN1xCeacDm/4oJYEIDFL3CMoU=,vOPAlFJxrZoxaQht2ylUA2U0jUyFGxJF7iy/SNua2+U=,1.294118,0.0,1.0,0.0,0.468855,YES,NO,NO,NO,NO,24,0.662309,5,2,0.96732,-0.666667,1.0,1188,918,YES,NO,YES,1,0.694989,0.455387,NO,NO,NO,NO,cSuQaz0xx+UxAtskjMBN1xCeacDm/4oJYEIDFL3CMoU=,vOPAlFJxrZoxaQht2ylUA2U0jUyFGxJF7iy/SNua2+U=,1.294118,0.0,1.0,0.0,0.468855,YES,NO,NO,NO,NO,24,0.662309,5,2,0.96732,-0.666667,1.0,1188,918,YES,NO,YES,1,0.694989,0.455387,cSuQaz0xx+UxAtskjMBN1xCeacDm/4oJYEIDFL3CMoU=,NO,NO,7e8q0Rrx2L8hpW8RdrFNt8ynVvcsY7Q7ViSyWOwuKu8=,vOPAlFJxrZoxaQht2ylUA2U0jUyFGxJF7iy/SNua2+U=,1.294118,0.0,1.0,0.0,0.495791,YES,NO,NO,NO,NO,24,0.662309,5,3,0.96732,-0.5,1.0,1188,918,YES,NO,YES,2,0.694989,0.482323,vOPAlFJxrZoxaQht2ylUA2U0jUyFGxJF7iy/SNua2+U=,NO,NO,cSuQaz0xx+UxAtskjMBN1xCeacDm/4oJYEIDFL3CMoU=,vOPAlFJxrZoxaQht2ylUA2U0jUyFGxJF7iy/SNua2+U=,1.294118,0.0,1.0,0.0,0.468855,YES,NO,NO,NO,NO,24,0.662309,5,2,0.96732,-0.666667,1.0,1188,918,YES,NO,YES,1,0.694989,0.455387,1.294118,0.0,1.0,0.0,0.482323,YES,NO,NO,NO,NO,24,0.662309,5,2,0.96732,-0.666667,1.0,1188,918,YES,NO,YES,1,0.694989,0.468855
2,1700003,NO,NO,VduR3ZHc2+rs/i34uA1VtOPTyOogJacJNc3mBuRRjIU=,ZSznNzP7c1xuAbA4HWA+NnJ4UXhlkZckpXtvQW/EJPw=,0.457076,0.318488,0.360094,0.444477,0.354545,YES,NO,NO,NO,NO,4,0.336361,3,2,0.976267,0.0,0.9,4400,3413,YES,NO,YES,1,0.360094,0.344773,NO,NO,NO,NO,X/hdUOVR5KuExVGLzjhLcM2CyIqym9t0Nh+ZX05M+1w=,+yhSY//Hpg7u0bSA7NYmcmRFgv3bF4Tw3BMHrBqaTtA=,0.292997,0.317316,0.359801,0.280398,0.227273,YES,NO,YES,NO,YES,3,0.317609,1,2,0.992089,0.0,0.96,4400,3413,YES,NO,YES,0,0.32552,0.218182,X/hdUOVR5KuExVGLzjhLcM2CyIqym9t0Nh+ZX05M+1w=,NO,NO,X/hdUOVR5KuExVGLzjhLcM2CyIqym9t0Nh+ZX05M+1w=,+yhSY//Hpg7u0bSA7NYmcmRFgv3bF4Tw3BMHrBqaTtA=,0.619689,0.319074,0.360973,0.607091,0.480682,YES,NO,YES,NO,YES,4,0.319074,1,2,0.992089,0.0,0.93,4400,3413,YES,NO,YES,0,0.326985,0.471591,+yhSY//Hpg7u0bSA7NYmcmRFgv3bF4Tw3BMHrBqaTtA=,NO,NO,MOZj/907WDOWl1+ZpoMfqXxs1oBk76QxZwvAwW9LNgY=,B48p7PcIW1G1nphWvrJegvSKCnWQMjPUX07rpP4Vj6Y=,0.47407,0.096103,0.214767,0.445063,0.354773,YES,NO,NO,NO,NO,7,0.096396,11,1,0.900381,0.0,0.75,4400,3413,YES,NO,YES,0,0.196015,0.345455,0.457076,0.318488,0.360094,0.444477,0.354545,YES,NO,YES,NO,YES,3,0.318488,1,2,0.992382,0.0,0.97,4400,3413,YES,NO,YES,0,0.326106,0.345455
3,1700004,NO,NO,S7l8SI3WbSTbhCSCcNJBWCtNjh8fSqS3ZhPZ3X+EGGU=,aZTF/lredyP4cukeN8bh6kpBjYmS1QFNpPOg2LVm3Lg=,1.085878,0.874811,0.939825,1.054732,0.767586,YES,NO,NO,NO,NO,4,0.885697,8,1,0.948594,0.0,0.8,4677,3307,YES,NO,YES,0,0.937103,0.759461,NO,NO,NO,NO,PbaMhgf10YDVUUMlKO6IxqKSIB0vvZIN64Z3pUJzBgQ=,olN1LoaeSyI8h+udI/jquozrw4R8YQ+cVwHq1dOUO5s=,1.08739,0.650438,0.837617,0.366193,0.75433,NO,NO,NO,NO,NO,1,0.679468,9,1,0.928636,0.0,0.88,4677,3307,NO,NO,NO,0,0.750832,0.746632,sHweftx4qqeyXKMZcPZdldPKrQu4dM9zuoa+7ooC9x0=,NO,NO,tEsP0R/Cq4ryYxehkYXOF2AyXZR22wtnejWWVZGhFYg=,KHv6EVs1MRD/fNwsFRp7fjDNC+Bx+9doF4J8CMDPB40=,1.33414,0.534321,0.702752,1.323254,0.942912,NO,NO,NO,NO,NO,3,0.534321,20,1,0.831569,0.0,0.79,4677,3307,NO,NO,NO,0,0.702752,0.935643,aZTF/lredyP4cukeN8bh6kpBjYmS1QFNpPOg2LVm3Lg=,NO,NO,m1PSqaYPEc+sfNboZvjb1Ip8oyPpq0NOjKnM1sJljr4=,aZTF/lredyP4cukeN8bh6kpBjYmS1QFNpPOg2LVm3Lg=,1.088298,0.442395,0.561536,1.057151,0.769083,YES,NO,NO,NO,NO,1,0.510735,8,1,0.949501,0.0,0.83,4677,3307,YES,NO,YES,2,0.561234,0.760744,1.08739,0.650438,0.837617,0.366193,0.768441,YES,NO,NO,NO,NO,1,0.700333,8,1,0.949501,0.0,0.87,4677,3307,YES,NO,YES,1,0.750832,0.760316
4,1700005,NO,NO,YKBNeCS2s8QQz1xy7Aes8Ina4f27Q022R7ggwOX58l0=,6E1RoTs+Er+giZsKw158lVBRFEjPAoSaWCcPwdXuk6k=,1.414798,0.0,1.0,0.0,0.348653,NO,NO,NO,NO,NO,5,0.516816,8,11,0.936099,0.0,1.0,1262,892,NO,NO,NO,10,0.580717,0.337559,NO,NO,NO,NO,YKBNeCS2s8QQz1xy7Aes8Ina4f27Q022R7ggwOX58l0=,6E1RoTs+Er+giZsKw158lVBRFEjPAoSaWCcPwdXuk6k=,1.414798,0.0,1.0,0.0,0.348653,NO,NO,NO,NO,NO,5,0.516816,8,11,0.936099,0.0,1.0,1262,892,NO,NO,NO,10,0.580717,0.337559,l1NizwRHYtgiyed9qQPABdS+DsXm82LoA4WH2tp6hrg=,NO,NO,YKBNeCS2s8QQz1xy7Aes8Ina4f27Q022R7ggwOX58l0=,6E1RoTs+Er+giZsKw158lVBRFEjPAoSaWCcPwdXuk6k=,1.414798,0.0,1.0,0.0,0.348653,NO,NO,NO,NO,NO,5,0.516816,8,11,0.936099,0.0,1.0,1262,892,NO,NO,NO,10,0.580717,0.337559,WzetbpM8ANovx8fMs5eICpHyI7l4Ms148EdQN9Dqwg8=,NO,NO,U51h5TWl46NEumKBW6cXNIQ5L1eD3Ru9cREW4p71slo=,DPMubReLUctb4CKC75lbXJO030Si6cjLqJqk4hIOxoQ=,1.414798,0.0,1.0,0.0,0.348653,NO,NO,NO,NO,NO,3,0.451794,7,11,0.943946,0.0,1.0,1262,892,NO,NO,NO,9,0.507848,0.337559,1.414798,0.0,1.0,0.0,0.348653,NO,NO,NO,NO,NO,0,0.451794,16,11,0.880045,-1.0,1.0,1262,892,NO,NO,NO,10,0.580717,0.337559


```python
train_features.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700000 entries, 0 to 1699999
Columns: 146 entries, id to x145
dtypes: float64(55), int64(31), object(60)
memory usage: 1.8+ GB
```


```python
test_features.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545082 entries, 0 to 545081
Columns: 146 entries, id to x145
dtypes: float64(55), int64(31), object(60)
memory usage: 607.2+ MB
```
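
The row counts reported above allow a quick check of the split sizes. The short calculation below uses only the two row counts printed by `shape` (1,700,000 and 545,082) and the 30%/70% public/private leaderboard split mentioned in the dataset description. How the leaderboard partition is rounded is not specified, so the last two numbers are approximate.

```python
n_train = 1_700_000  # rows in train_features
n_test = 545_082     # rows in test_features
n_total = n_train + n_test

train_share = n_train / n_total
test_share = n_test / n_total

# Approximate leaderboard partition of the test set (30% public, 70% private).
n_public = round(0.30 * n_test)
n_private = n_test - n_public

print(f"total samples: {n_total:,}")
print(f"train share:   {train_share:.1%}")
print(f"test share:    {test_share:.1%}")
print(f"public / private test rows: ~{n_public:,} / ~{n_private:,}")
```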


#### Metadata <a class="anchor" id="metadata"></a>

In this section, we will categorize the columns to make them easier to manipulate. We'll store:
* **dtype**: int, float, str
* **category**: content, numerical, boolean


```python
meta = d.create_features_meta(train_features)
meta
```

| varname | category | dtype |
|---|---|---|
| id | numerical | int64 |
| x1 | numerical | object |
| x2 | numerical | object |
| x3 | numerical | object |
| x4 | numerical | object |
| x5 | numerical | float64 |
| x6 | numerical | float64 |
| x7 | numerical | float64 |
| x8 | numerical | float64 |
| x9 | numerical | float64 |
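
The implementation of `d.create_features_meta` is not shown in this proposal. As an illustration only, the sketch below builds a similar table from any DataFrame. The function name `build_features_meta` and its classification rules (numeric columns are numerical, string columns holding only YES/NO are boolean, any other column is content) are assumptions and may differ from the helper used above.

```python
import pandas as pd


def build_features_meta(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with an inferred category and its dtype."""
    rows = []
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            category = "numerical"
        elif set(col.dropna().unique()) <= {"YES", "NO"}:
            category = "boolean"
        else:
            category = "content"
        rows.append({"varname": name, "category": category, "dtype": str(col.dtype)})
    return pd.DataFrame(rows).set_index("varname")


# Hypothetical miniature of the training features.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "x1": ["NO", None, "YES"],
        "x3": ["dqOiM6yB=", None, "ib4VpsEs="],
        "x5": [0.576561, 0.0, 1.341803],
    }
)
meta_sketch = build_features_meta(sample)
print(meta_sketch)
```

Indexing the result by `varname` makes it easy to select groups of columns, for example `meta_sketch[meta_sketch.category == "boolean"].index`.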


### References
* [0] - https://www.kaggle.com/c/tradeshift-text-classification
* [1] - https://en.wikipedia.org/wiki/Optical_character_recognition
* [2] - http://ccis2k.org/iajit/PDF/vol.5,no.1/3-37.pdf - Zakaria Elberrichi, Abdelattif Rahmoun, and Mohamed Amine Bentaalah - Using WordNet for Text Categorization - The International Arab Journal of Information Technology, Vol. 5, No. 1, January 2008
* [3] - http://odur.let.rug.nl/vannoord/TextCat/textcat.pdf - William B. Cavnar and John M. Trenkle - N-Gram-Based Text Categorization - Environmental Research Institute of Michigan
* [4] - http://www.ijaiem.org/Volume2Issue3/IJAIEM-2013-03-13-025.pdf - Bhumika, Prof Sukhjit Singh Sehra, Prof Anand Nayyar - A REVIEW PAPER ON ALGORITHMS USED FOR TEXT CLASSIFICATION - International Journal of Application or Innovation in Engineering & Management (IJAIEM) - Volume 2, Issue 3, March 2013
* [5] - https://www.researchgate.net/publication/319688772_Using_text_mining_to_classify_research_papers - Sulova, Snezhana & Todoranova, Latinka & Penchev, Bonimir & Nacheva, Radka. (2017). Using text mining to classify research papers. 10.5593/SGEM2017/21/S07.083. 
* [6] - https://www.researchgate.net/publication/280609521_Text_Classification_of_Technical_Papers_Based_on_Text_Segmentation - Nguyen, Thien & Shirai, Kiyoaki. (2013). Text Classification of Technical Papers Based on Text Segmentation. 10.1007/978-3-642-38824-8_25. 
* [7] - https://blog.tradeshift.com/hundreds-compete-to-improve-machine-learning-algorithm-for-5k-prize/ 