# Lab1 - Getting started with scikit-learn for Machine Learning

Scikit-learn is one of the most widely used machine learning libraries for Python. The project was started in Paris, at Inria and Telecom ParisTech. It is easy to use and to extend.

A basic scikit-learn getting-started tutorial is available [here](http://scikit-learn.org/stable/tutorial/basic/tutorial.html).

The goal of this first notebook is to get used to the scikit-learn library. We will load the iris dataset, a well-known dataset for machine learning, and split it into a training dataset and a test dataset. The first one will be used to fit a **k-nearest-neighbour classifier**, while the second one will be used to test the fitted model.

The first step is to **import** the libraries needed. The *sklearn* library provides the iris dataset, which is loaded by means of the *load_iris()* function. The *numpy* library provides the *unique* function, which returns the sorted unique elements of an array - in our case, it shows that there are only three possible output classes: [0, 1, 2].

In [7]:
# Import the necessary classes
import numpy as np
from sklearn import datasets

# Load and parse the data file
iris = datasets.load_iris()
iris_X = iris.data
iris_Y = iris.target
np.unique(iris_Y)

array([0, 1, 2])
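Before splitting the data, it can help to look at its shape. The following is a minimal sketch of such an inspection; the variable names (`n_samples`, `class_names`, `class_counts`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# iris.data is a 2D array: one row per flower, one column per feature
n_samples, n_features = iris.data.shape

# Human-readable names of the three classes encoded as 0, 1, 2
class_names = list(iris.target_names)

# Number of samples in each class
class_counts = np.bincount(iris.target)

print(n_samples, n_features)
print(class_names)
print(class_counts)
```

The dataset contains 150 samples with 4 features each, evenly spread over the three classes.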

Once the initial data are ready, we can define the **training** and **testing** datasets, starting from the original iris dataset. First a random seed is initialized, then a random permutation of the row indices of the dataset is generated. Indexing iris_X and iris_Y with these shuffled indices gives a shuffled version of the iris dataset in which each sample still matches its label.
Then, we generate four arrays:

- **iris_X_train** and **iris_Y_train**: all shuffled elements except the last 10 (140 samples)
- **iris_X_test** and **iris_Y_test**: the last 10 shuffled elements

**NB:** NumPy arrays, like Python lists, support the following slicing syntax:

    array2 = array[start_point:end_point]

This code assigns to *array2* the elements of the source *array* from *start_point* up to, but not including, *end_point*. If nothing is specified before the *:*, *start_point* defaults to 0; if nothing is specified after it, *end_point* defaults to *len(array)*. Negative start/end points (i.e. -X) count X elements back from the end of the array: **[-10:]** means "*the last 10 elements of the array*", while **[:-10]** means "*all elements from the beginning excluding the last 10*".
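The slicing rules can be checked on a small array. This is a minimal sketch using a hypothetical 20-element array, not the iris data:

```python
import numpy as np

array = np.arange(20)            # hypothetical example array: 0, 1, ..., 19

middle = array[2:5]              # positions 2, 3, 4 (the end point is excluded)
all_but_last_10 = array[:-10]    # everything except the last 10 elements
last_10 = array[-10:]            # only the last 10 elements

print(middle)
print(all_but_last_10)
print(last_10)

# The two negative slices split the array into two parts with no overlap
assert len(all_but_last_10) + len(last_10) == len(array)
```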

In [8]:
# Split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
# Take some elements from the shuffled array
iris_X_train = iris_X[indices[:-10]]
iris_Y_train = iris_Y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_Y_test = iris_Y[indices[-10:]]
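As an alternative to the manual shuffle above, scikit-learn provides the helper `train_test_split`. The sketch below assumes we want the same sizes (10 test samples); because it shuffles differently, the resulting split will generally not contain the same samples as the manual one. The variable names are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# test_size=10 keeps exactly 10 samples for testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0
)

print(X_train.shape, X_test.shape)
```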

Now we create a **k-nearest-neighbour classifier** with its default parameters and fit it on the training data.

In [49]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
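To illustrate the idea behind the classifier, here is a simplified sketch of a k-nearest-neighbour prediction written directly in NumPy. It mirrors the parameters shown above (5 neighbours, Euclidean distance, uniform weights) but is not scikit-learn's implementation; the function `knn_predict_one` and the toy data are hypothetical, and ties may be broken differently than in the library.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x, k=5):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Each neighbour casts one vote for its own label
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Tiny hypothetical dataset: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred = knn_predict_one(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
print(pred)
```

The query point lies next to the first cluster, so its three nearest neighbours all carry label 0.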

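Finally, we evaluate the model on the test instances and compute its accuracy, i.e. the fraction of test labels predicted correctly. The test error is one minus this value.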

In [53]:
from sklearn.metrics import accuracy_score
# Predict the labels of the held-out test instances (not the training ones)
iris_prediction = knn.predict(iris_X_test)
# Fraction of test labels predicted correctly
accuracy_score(iris_Y_test, iris_prediction)

0.90000000000000002
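The test error can be derived from the accuracy, and the misclassified samples can be located by comparing the two label arrays. This is a small sketch on hypothetical labels (`y_true` and `y_pred` are made up for illustration, not taken from the run above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for 10 test samples
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 1, 2, 0, 1])   # one mistake, at position 3

accuracy = accuracy_score(y_true, y_pred)
test_error = 1 - accuracy

# Positions where the prediction differs from the true label
misclassified = np.flatnonzero(y_true != y_pred)

print(accuracy, test_error, misclassified)
```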