<img src="cover.png">

# LESSON 2


4 June 2018

### The Scikit-Learn library
<img src="scikit-learn-logo-small.png">
The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The original codebase was later rewritten by other developers. In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took leadership of the project and made the first public release on 1 February 2010. Of the various scikits, scikit-learn and scikit-image were described as "well-maintained and popular" in November 2012.

As of 2018, scikit-learn is under active development.

Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

http://scikit-learn.org/stable/index.html

### Definitions

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

Supervised learning: the data comes with additional attributes that we want to predict. This problem can be either:

* classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning, where there is a limited number of categories and each of the n samples provided must be labeled with the correct category or class.
* regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.
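The difference between the two tasks can be illustrated with a minimal sketch. The toy data below is invented for illustration, and the choice of `KNeighborsClassifier` and `LinearRegression` is an assumption made only to show a discrete target next to a continuous one.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Toy input: six samples with a single feature each (invented values)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Classification: the target is a discrete class label
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
predicted_class = clf.predict([[4.5]])[0]

# Regression: the target is a continuous value (here exactly 2*x + 1)
y_reg = 2.0 * X.ravel() + 1.0
reg = LinearRegression().fit(X, y_reg)
predicted_value = reg.predict([[4.5]])[0]

print(predicted_class, predicted_value)
```

Both estimators expose the same `fit` / `predict` interface; only the type of the target changes.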



#### Training set and testing set

Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the training set on which we learn data properties and one that we call the testing set on which we test these properties.
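As a rough sketch of this split, scikit-learn provides `train_test_split` in `sklearn.model_selection`. The arrays, the 30% test fraction and the `random_state` value below are illustrative assumptions, not values taken from this lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus one label per sample
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the samples for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

The model should only ever be fitted on the training portion; the testing portion is kept aside to estimate how well it behaves on data it has not seen.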

<img src="ml_map.png">

http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [2]:
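import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Column names: 20 time steps for each of the 6 channels (Ax..Gz), plus the 'movement' label
header = [name + '_' + str(i) for name in ['Ax','Ay','Az','Gx','Gy','Gz'] for i in range(1,21)]+['movement']
# Load the recordings; header=None means no header row is read from the file
data = pd.read_csv('movements.csv', header=None, index_col=0, names=header)
data.head()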

Unnamed: 0,Ax_1,Ax_2,Ax_3,Ax_4,Ax_5,Ax_6,Ax_7,Ax_8,Ax_9,Ax_10,...,Gz_12,Gz_13,Gz_14,Gz_15,Gz_16,Gz_17,Gz_18,Gz_19,Gz_20,movement
1524893000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-22.458015,1
1524893000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-22.458015,-29.381679,1
1524893000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-22.458015,-29.381679,-24.10687,1
1524893000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-22.458015,-29.381679,-24.10687,-14.061069,1
1524893000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-22.458015,-29.381679,-24.10687,-14.061069,-1.458015,1


It looks like this:
<img src="Features.png">
which is the correct shape for the data to be "digested" by an AI algorithm.

### So...
How many samples are there?

In [3]:
len(data)

751

Each sample is labeled with the corresponding movement in the "movement" column.

How many features are there?

In [8]:
len(data.columns) - 1

120

Let's extract the feature names:

In [19]:
features = [col for col in data.columns.tolist() if col != 'movement']
print(features)

['Ax_1', 'Ax_2', 'Ax_3', 'Ax_4', 'Ax_5', 'Ax_6', 'Ax_7', 'Ax_8', 'Ax_9', 'Ax_10', 'Ax_11', 'Ax_12', 'Ax_13', 'Ax_14', 'Ax_15', 'Ax_16', 'Ax_17', 'Ax_18', 'Ax_19', 'Ax_20', 'Ay_1', 'Ay_2', 'Ay_3', 'Ay_4', 'Ay_5', 'Ay_6', 'Ay_7', 'Ay_8', 'Ay_9', 'Ay_10', 'Ay_11', 'Ay_12', 'Ay_13', 'Ay_14', 'Ay_15', 'Ay_16', 'Ay_17', 'Ay_18', 'Ay_19', 'Ay_20', 'Az_1', 'Az_2', 'Az_3', 'Az_4', 'Az_5', 'Az_6', 'Az_7', 'Az_8', 'Az_9', 'Az_10', 'Az_11', 'Az_12', 'Az_13', 'Az_14', 'Az_15', 'Az_16', 'Az_17', 'Az_18', 'Az_19', 'Az_20', 'Gx_1', 'Gx_2', 'Gx_3', 'Gx_4', 'Gx_5', 'Gx_6', 'Gx_7', 'Gx_8', 'Gx_9', 'Gx_10', 'Gx_11', 'Gx_12', 'Gx_13', 'Gx_14', 'Gx_15', 'Gx_16', 'Gx_17', 'Gx_18', 'Gx_19', 'Gx_20', 'Gy_1', 'Gy_2', 'Gy_3', 'Gy_4', 'Gy_5', 'Gy_6', 'Gy_7', 'Gy_8', 'Gy_9', 'Gy_10', 'Gy_11', 'Gy_12', 'Gy_13', 'Gy_14', 'Gy_15', 'Gy_16', 'Gy_17', 'Gy_18', 'Gy_19', 'Gy_20', 'Gz_1', 'Gz_2', 'Gz_3', 'Gz_4', 'Gz_5', 'Gz_6', 'Gz_7', 'Gz_8', 'Gz_9', 'Gz_10', 'Gz_11', 'Gz_12', 'Gz_13', 'Gz_14', 'Gz_15', 'Gz_16', 'Gz_17', 
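With the feature names available, the table can be separated into a feature matrix and a label vector, which is the layout scikit-learn estimators expect: one row per sample, one column per feature. The sketch below uses a small random stand-in for `movements.csv` (the `toy` frame and its values are hypothetical), with the same column naming scheme as above.

```python
import numpy as np
import pandas as pd

# Same naming scheme as the notebook: 6 channels x 20 time steps, plus the label
channels = ['Ax', 'Ay', 'Az', 'Gx', 'Gy', 'Gz']
header = [name + '_' + str(i) for name in channels for i in range(1, 21)] + ['movement']

# Hypothetical stand-in for the real file: 5 rows of random values
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(5, len(header)), columns=header)
toy['movement'] = [1, 3, 5, 1, 3]

# Feature matrix X has shape (n_samples, n_features); label vector y has shape (n_samples,)
feature_names = [col for col in toy.columns if col != 'movement']
X = toy[feature_names].values
y = toy['movement'].values

print(X.shape, y.shape)
```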

Let's shuffle the order of the samples at random...

In [32]:
shuffled_data = data.sample(frac=1)
shuffled_data.head(10)

Unnamed: 0,Ax_1,Ax_2,Ax_3,Ax_4,Ax_5,Ax_6,Ax_7,Ax_8,Ax_9,Ax_10,...,Gz_12,Gz_13,Gz_14,Gz_15,Gz_16,Gz_17,Gz_18,Gz_19,Gz_20,movement
1524893000.0,1.529785,1.305908,0.923096,0.713379,0.473633,0.306641,0.081055,-0.125732,-0.159912,0.027588,...,38.21374,42.145038,43.78626,48.305344,49.503817,52.21374,44.480916,14.183206,-12.229008,3
1524893000.0,0.0,0.0,0.0,0.0,0.997559,1.009277,1.000488,1.046631,1.133301,1.23877,...,-5.793893,-6.496183,-7.465649,-7.274809,-5.656489,-3.969466,-2.221374,-1.007634,-0.290076,5
1524893000.0,-0.152588,0.05957,0.311279,0.433838,0.658936,0.864258,1.140137,1.550537,1.709961,1.562256,...,-10.396947,-18.427481,-26.572519,-34.847328,-43.374046,-44.076336,-41.29771,-35.648855,-31.229008,1
1524893000.0,0.342041,0.152344,0.0354,-0.091309,-0.087891,0.081055,0.303955,0.513916,0.7146,0.921143,...,39.633588,35.80916,36.541985,33.167939,12.343511,-1.961832,-10.938931,-18.656489,-23.709924,1
1524893000.0,0.691162,0.968506,1.103027,1.244385,1.395508,1.242432,0.862793,0.752441,0.509277,0.127197,...,9.541985,21.312977,31.541985,37.305344,39.206107,44.664122,43.610687,42.618321,33.419847,1
1524893000.0,0.918945,0.624268,0.407959,0.202393,-0.019287,-0.115967,-0.002197,0.120605,0.335693,0.548096,...,40.847328,41.053435,41.412214,40.083969,26.229008,13.160305,-2.816794,-13.053435,-21.320611,1
1524893000.0,1.999939,1.999939,1.988037,1.604248,1.273438,1.159424,1.280273,1.456543,1.594971,1.659668,...,-42.946565,-40.030534,-34.160305,-25.961832,-15.099237,-6.282443,2.045802,15.358779,23.862595,3
1524893000.0,0.971924,1.016113,1.025879,1.061768,1.029297,1.125732,1.178711,1.202637,1.102783,1.057617,...,1.641221,1.389313,0.167939,-0.641221,-2.083969,-2.633588,-2.984733,-3.748092,-4.48855,5
1524893000.0,1.19043,1.000244,0.942139,0.949707,1.032227,1.035889,1.137451,0.971436,0.849121,0.544678,...,-22.21374,-6.557252,5.80916,15.778626,26.419847,32.59542,35.374046,37.274809,39.854962,1
1524893000.0,0.593262,0.83667,1.297607,1.813232,1.999939,1.999939,1.999939,1.770996,1.414795,1.145996,...,-23.755725,-26.068702,-31.732824,-38.412214,-39.099237,-43.732824,-44.916031,-39.641221,-34.030534,3
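Because `sample(frac=1)` draws a new random permutation on every run, the rows shown above will differ between executions. If a repeatable shuffle is wanted, `DataFrame.sample` accepts a `random_state` argument; the seed value and the toy frame below are arbitrary choices made for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (invented values)
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50],
                    'movement': [1, 1, 3, 3, 5]})

# frac=1 returns every row once, in random order; a fixed seed repeats that order
shuffled_a = toy.sample(frac=1, random_state=42)
shuffled_b = toy.sample(frac=1, random_state=42)

# Same seed, same permutation; no rows are lost or duplicated
same_order = shuffled_a.index.equals(shuffled_b.index)
same_rows = sorted(shuffled_a.index) == sorted(toy.index)

print(same_order, same_rows)
```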


**Now let's try to build a classifier!**

### Importing the libraries

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn import neighbors




In [42]:
# Use the first 500 shuffled samples for training and the remaining ones for testing
n_of_train_samples = 500
train_data = shuffled_data.iloc[:n_of_train_samples]
test_data = shuffled_data.iloc[n_of_train_samples:]

In [None]:
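# Take the features from the training split only, so the scaler never sees test samples
X_train = train_data[features].values
sc = StandardScaler()
# Learn the per-feature mean and standard deviation on the training set and apply them
X_train_scaled = sc.fit_transform(X_train)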
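One possible continuation, given that `neighbors` was imported above, is to reuse the fitted scaler on the test set and train a k-nearest-neighbours classifier. This is only a sketch under stated assumptions: the data is synthetic (two well-separated classes, not the movement recordings), and `n_neighbors=5` is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn import neighbors

# Synthetic stand-in for the movement data: two separated classes, 4 features
rng = np.random.RandomState(0)
X_a = rng.normal(loc=0.0, scale=1.0, size=(60, 4))
X_b = rng.normal(loc=5.0, scale=1.0, size=(60, 4))
X = np.vstack([X_a, X_b])
y = np.array([1] * 60 + [3] * 60)

# Shuffle, then split into 80 training and 40 testing samples
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit the scaler on the training set only, then apply the same transform to the test set
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Train a k-nearest-neighbours classifier and measure accuracy on unseen samples
clf = neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train_scaled, y_train)
accuracy = clf.score(X_test_scaled, y_test)

print(accuracy)
```

Calling `transform` rather than `fit_transform` on the test set matters: the test samples are scaled with the statistics learned from the training set, so no information leaks from the held-out data into the model.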