The most important aspect before **any application** of Machine Learning is undoubtedly the pre-processing of data.

Several books, blog posts and tutorials are concerned with the nice part of the job (e.g. insights from the data, applying novel algorithms, et cetera) or with the application of the algorithms themselves (e.g. Linear Regression, Support Vector Machines, Neural Networks, et cetera).

Nevertheless, the data (a.k.a. dataset, database, et cetera) plays a decisive role in the success, or lack thereof, of any Machine Learning implementation.

In a quick analogy, the data is like a key and the algorithm is like the lock of a door.

The key may have some friction (e.g. truncated data, missing data, non-standard data, et cetera).

The lock may have some friction too (e.g. parameter tuning for 'fitting', eliminating calculation steps, using 'ensemble' algorithms in the pipeline, et cetera).

But the lock will only open with the matching key (e.g. a dataset whose predictor attributes and outcome (target) are correct).

Now that we know how important data is for applying algorithms, and consequently for creating models, let's talk about an important aspect of the input data: pre-processing.

Pre-processing is any activity that adjusts the data, in terms of format or sampling, so that algorithms can build models while making better use of computational resources (time complexity and space complexity).

The main activities of data pre-processing are:

- Sampling
- Feature Engineering, which comprises (i) Feature Extraction and (ii) Feature Selection
- Binning
- Scaling
- Standardization
- Normalization
- Dimensionality Reduction
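
Sampling and binning appear in the list above but are not demonstrated later, so here is a minimal sketch of both on the Iris data. The 30% hold-out size, the choice of three equal-width bins and all variable names are assumptions made for illustration, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_all, y_all = iris.data, iris.target

# Sampling: hold out 30% of the rows, keeping the class proportions (stratify).
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0
)

# Binning: turn the first feature (sepal length) into 3 equal-width bins.
sepal_length = X_all[:, 0]
edges = np.linspace(sepal_length.min(), sepal_length.max(), num=4)
# Only the two inner edges are passed, so the bin labels are 0, 1 and 2.
sepal_length_bin = np.digitize(sepal_length, edges[1:-1])

print(X_train.shape, X_test.shape)
print(np.bincount(sepal_length_bin))
```

Fixing 'random_state' only makes the split repeatable; any other seed would be just as valid.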

First of all, let's import the datasets module from the scikit-learn package.

In [1]:
from sklearn import datasets

Now, let's create an object called *datasets* and load the Iris dataset into it. Note that this rebinds the name *datasets* from the module to the loaded data.

In [2]:
datasets = datasets.load_iris()

This dataset is split into two sets, one called data and another called target.
The 'data' set holds all the attributes (i.e. independent variables, predictors, features, etc.) of the Iris dataset.

The 'target' set holds the classes of the Iris dataset.

Let's see how much data each set holds using the shape attribute.

In [3]:
print(datasets.data.shape)

(150, 4)


In the data set we have 150 records and 4 attributes.

Let's see what we have in the target set.

In [4]:
print(datasets.target.shape)

(150,)


As we can see, there is only one column, which holds the dependent variable.

Best practice: always look at your data. Always.

In [5]:
datasets.data

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4

In [6]:
datasets.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
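
Printing the raw arrays is a good start, and a short summary can make the same inspection easier to read. The sketch below is one possible way to do it; it reloads Iris under the illustrative name 'iris' so that it runs on its own, and the output formatting is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Per-feature summary: name, minimum, mean and maximum of each column.
for name, column in zip(iris.feature_names, iris.data.T):
    print(f"{name:20s} min={column.min():.1f} mean={column.mean():.2f} max={column.max():.1f}")

# Number of records per class in the target.
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(f"{name:12s} {count}")

# A basic sanity check for missing values.
has_missing = bool(np.isnan(iris.data).any())
print("missing values:", has_missing)
```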

For the sake of example, we'll put the data into separate objects.

As a naming convention, we'll use X for the predictor variables and y for the dependent variable in the target set.

In [7]:
X = datasets.data

In [8]:
y = datasets.target

Let's see the data in X and y objects.

In [9]:
X

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4

In [10]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

To import and use data stored on the web, you can use urllib together with the numpy library.

In [11]:
import numpy as np
import urllib.request

Fetch the data from a URL:

In [12]:
with urllib.request.urlopen("http://goo.gl/j0Rvxq") as url:
    s = url.read()

Store the downloaded bytes:

In [13]:
raw_data = s

With the object stored, we will parse the file.

In [14]:
dataset = np.loadtxt(raw_data, delimiter=",")

Now let's check our dataset as it is.

In [15]:
print(dataset.shape)

(23279,)



As in the previous example, the intention is to assign the independent variables to the object X and the dependent variable to the object y, where:

- Variables 0 to 7 are the independent variables (X)
- Variable 8 is the dependent variable (y)



In [16]:
dataset[8]

44.0

In [17]:
y = dataset[:,8]

IndexError: too many indices for array

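The slice fails because the array is one-dimensional, as the shape (23279,) above shows: np.loadtxt received the raw bytes instead of lines of text, so each byte appears to have been read as a separate value (44.0 is the byte value of a comma). X and y therefore still hold the Iris data, which is what the rest of this walkthrough uses.

With the data loaded and checked, we can now perform some pre-processing activities.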
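
One possible way to parse downloaded bytes into a two-dimensional array is sketched below. It assumes the file is comma-separated numeric text with nine columns, as the variable description above suggests. The bytes in 'sample_bytes' are made-up stand-in rows, not the real file, and the names 'table', 'X_web' and 'y_web' are hypothetical, chosen so that the Iris X and y are not overwritten.

```python
import io
import numpy as np

# Hypothetical stand-in for the downloaded bytes (three made-up rows, nine columns).
sample_bytes = b"1,2,3,4,5,6,7,8,0\n9,8,7,6,5,4,3,2,1\n2,4,6,8,1,3,5,7,1\n"

# Decode the bytes to text and wrap them in a file-like object,
# so that np.loadtxt parses one row per line instead of one value per byte.
table = np.loadtxt(io.StringIO(sample_bytes.decode("utf-8")), delimiter=",")

# Separate names are used so the Iris X and y stay untouched.
X_web = table[:, 0:8]  # columns 0 to 7: independent variables
y_web = table[:, 8]    # column 8: dependent variable

print(table.shape, X_web.shape, y_web.shape)
```

With real data, 'sample_bytes' would be replaced by the bytes read from the URL.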

The first pre-processing technique we will use is scaling.

Scaling is widely used when the values in a dataset have a very high variance, which would make processing them computationally expensive.



First, we will import the 'preprocessing' module of the scikit-learn package.

In [None]:
from sklearn import preprocessing

Let's create an object named 'normalized_X' by applying the scale function of the preprocessing module. Despite the object's name, scale standardizes each column to zero mean and unit variance.

In [None]:
normalized_X = preprocessing.scale(X)

With the object created, we will now make a small comparison between the object X (our data) and the normalized_X object (the data we just scaled).

In [None]:
X

In [None]:
normalized_X

As we can verify visually, the values change significantly: they go from magnitudes of units to values with many more decimal places, some of which are negative.
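
To make the effect of scale concrete, the sketch below compares it with the same computation written by hand in NumPy. It reloads Iris under the illustrative name 'X_iris' so that it runs on its own; the manual formula uses the population standard deviation, which is what scale uses.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X_iris = load_iris().data

# scale() standardizes each column: subtract its mean, divide by its standard deviation.
scaled = preprocessing.scale(X_iris)

# The same computation done by hand (np.std defaults to the population formula).
manual = (X_iris - X_iris.mean(axis=0)) / X_iris.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(scaled.std(axis=0).round(6))   # approximately 1 for every column
```

Since every column is centered on zero, roughly half of the values become negative, which matches the comparison described above.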


Another form of pre-processing is normalization.

Normalization follows the same principle as standardization, with the fundamental difference that the scale of the values is reduced: the normalize function rescales each record (row) to unit norm.

In the same way as we did for standardization, we will create an object, this time using the normalize function of the preprocessing module.

In [18]:
standarized_X = preprocessing.normalize(X)

NameError: name 'preprocessing' is not defined

Let us now compare the output data of the X and standarized_X objects.

In [19]:
X

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5.4,  3.7,  1.5,  0.2],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 4.8,  3. ,  1.4,  0.1],
       [ 4.3,  3. ,  1.1,  0.1],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.7,  4.4,  1.5,  0.4],
       [ 5.4,  3.9,  1.3,  0.4],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 5.7,  3.8,  1.7,  0.3],
       [ 5.1,  3.8,  1.5,  0.3],
       [ 5.4,  3.4,  1.7,  0.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.6,  3.6,  1. ,  0.2],
       [ 5.1,  3.3,  1.7,  0.5],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5. ,  3. ,  1.6,  0.2],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.2,  3.5,  1.5,  0.2],
       [ 5.2,  3.4,  1.4,  0.2],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 4

In [20]:
standarized_X

NameError: name 'standarized_X' is not defined

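The NameError messages above appear because the cell that imports 'preprocessing' was not executed in this session. Once it has run, all values in standarized_X fall on a scale of 0 to 1.

This can help the algorithms do a better job, as the range of the data has been reduced.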
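
The sketch below shows what normalize does to the Iris data, and contrasts it with MinMaxScaler, a different scikit-learn tool that maps each column (rather than each row) onto the range 0 to 1. The MinMaxScaler comparison is an addition for illustration, not part of the original walkthrough, and the variable names are assumptions.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris

X_iris = load_iris().data

# normalize() works row by row: each record is divided by its own L2 norm (the default).
unit_rows = preprocessing.normalize(X_iris)
row_norms = np.linalg.norm(unit_rows, axis=1)
print(np.allclose(row_norms, 1.0))
print(unit_rows.min(), unit_rows.max())

# MinMaxScaler works column by column: each feature is mapped onto [0, 1].
minmax = preprocessing.MinMaxScaler().fit_transform(X_iris)
print(minmax.min(axis=0), minmax.max(axis=0))
```

The Iris measurements are all positive, which is why the unit-norm rows also end up between 0 and 1; data with negative values would not.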

A key issue when dealing with a large volume of data, in terms of the number of attributes (or dimensionality), is knowing which independent variables matter for training the predictive model.

For this activity we will use a scikit-learn class called 'ExtraTreesClassifier', which exposes an attribute that estimates which variables have the greatest predictive power.

First, we will import the class.

In [21]:
from sklearn.ensemble import ExtraTreesClassifier

With the class imported, let's create our model by instantiating it.

In [22]:
model = ExtraTreesClassifier()

Now we will perform the 'fitting' of the model.

One tip: whenever you fit a model with scikit-learn's supervised learning estimators, the rule is always:

- model_object.fit(X, y)

Here model_object is the predictive model being used, X holds the independent variables, and y is the dependent variable to be predicted.
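
As an illustration of this convention, the sketch below fits two unrelated scikit-learn classifiers with exactly the same call. The choice of estimators, the 'max_iter' value and the variable names are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
features, labels = iris.data, iris.target

# Two different kinds of model, fitted through the same fit(X, y) interface.
estimators = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

predictions = {}
for estimator in estimators:
    estimator.fit(features, labels)
    name = type(estimator).__name__
    predictions[name] = estimator.predict(features)
    print(name, predictions[name].shape)
```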

In [23]:
model.fit(datasets.data, datasets.target)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

We fitted the model using only its default settings. In general, tree-based algorithms like this one are used to obtain an importance score for each variable.

In [24]:
print(model.feature_importances_)

[ 0.12863308  0.02569993  0.46076935  0.38489764]


In [25]:
model.fit(X, y)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [26]:
print(model.feature_importances_)

[ 0.08015787  0.04929704  0.31759867  0.55294642]


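This result shows that, in terms of predictive importance, the most significant variables are the fourth (55%) and the third (32%), followed by the first (8%) and the second (5%). Because the model was created without a fixed random_state, these figures change from one fit to the next, as the two outputs above show.

If there were too many variables to consider and we wanted to reduce the computational cost (and therefore time), we could eliminate the two least important variables; since we are keeping things simple, we will consider all of them.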
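
If we did want to drop the least important variables, one way to do it is sketched below. Fixing 'random_state' makes the importances repeatable; the values 'n_estimators=100' and 'top_k = 2', as well as the variable names, are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

iris = load_iris()

# A fixed random_state makes the importances identical across runs.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)
importances = forest.feature_importances_

# Column indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"{iris.feature_names[idx]:20s} {importances[idx]:.3f}")

# Keep only the top_k most important columns, in their original order.
top_k = 2
kept = np.sort(ranking[:top_k])
X_reduced = iris.data[:, kept]
print(X_reduced.shape)
```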

This concludes the basic pre-processing.

As a general rule in modeling, the ideal is for the data to arrive already processed from the database, since the DBMS engine is well suited to this type of activity, in terms of both speed and robustness.

Besides, the more modular the pipeline is, with each tool in the chain doing what it does best, the lower the complexity of the system as a whole and the greater its robustness to the problems that may occur.