In [None]:
from sklearn.datasets import load_iris 
iris_dataset = load_iris() # load_iris returns a Bunch object, which is similar to a dictionary

In [None]:
print('keys of iris dataset: \n{}'.format(iris_dataset.keys()))

In [None]:
# target_names is an array of strings
print('Target names: {}'.format(iris_dataset['target_names']))

In [None]:
# the value of feature_names is a list of strings, giving the description of each feature
print('Feature names: \n{}'.format(iris_dataset['feature_names']))

In [None]:
# the data itself is contained in the target and data fields
# data contains the numeric measurements of sepal length, sepal width, petal length, and petal
# width in a NumPy array
print('Type of data: {}'.format(type(iris_dataset['data'])))
print('Shape of data: {}'.format(iris_dataset['data'].shape))
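
The target field holds the species of each flower as an integer code, and target_names maps each code to its name. The following is a minimal sketch of how the two fit together; the variable names target and class_counts are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# target holds one integer code (0, 1, or 2) per flower
target = iris_dataset['target']
print('Shape of target: {}'.format(target.shape))

# count how many samples carry each code, then pair each count
# with the species name stored at the same index in target_names
class_counts = np.bincount(target)
for name, count in zip(iris_dataset['target_names'], class_counts):
    print('{}: {} samples'.format(name, count))
```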

Training and Testing Data
To assess the model's performance, we must show it new data for which we have labels. This is usually done by splitting the labeled data we have collected into two parts. One part of the data is used to build our machine learning model, and is called the training data or training set. The rest of the data is used to assess how well the model works; this is called the test data, test set, or hold-out set.

scikit-learn contains a function that shuffles the dataset and splits it for you: the train_test_split function. By default, this function extracts 75% of the rows in the data as the training set, together with the corresponding labels for this data. The remaining 25% of the data, together with the remaining labels, is declared as the test set. Deciding how much data you want to put into the training and test sets respectively is somewhat arbitrary, but using a test set containing 25% of the data is a good rule of thumb.

In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lowercase y. We use a capital X because the data is a two-dimensional array (a matrix) and a lowercase y because the target is a one-dimensional array (a vector).

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'],iris_dataset['target'], random_state=0)
print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))
print('y_test shape: {}'.format(y_test.shape))
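
The 75%/25% proportion is only the default. As a sketch, the same split can be written with the test_size parameter of train_test_split spelled out, which makes the proportion visible and easy to change. The variable n_total is introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# test_size=0.25 states the default 75%/25% split explicitly;
# random_state=0 fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.25, random_state=0)

# compare the size of each part with the size of the full dataset
n_total = iris_dataset['data'].shape[0]
print('Training fraction: {:.2f}'.format(X_train.shape[0] / n_total))
print('Test fraction: {:.2f}'.format(X_test.shape[0] / n_total))
```

Because 150 rows cannot be divided exactly in a 75/25 ratio, the fractions are only approximately 0.75 and 0.25.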

Looking at your data
One way to look at your data is to use a scatter plot. We can look at multiple dimensions of data by creating a pair plot, which looks at all possible pairs of features. If you have a small number of features, such as the four we have here, this is quite reasonable. 

To create the plot, we first turn the NumPy array into a pandas DataFrame. pandas has a function to create pair plots called scatter_matrix. The diagonal of this matrix is filled with histograms of each feature.

In [None]:
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import mglearn
from IPython.display import display

# re-create the train/test split so that this cell can run on its own
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)

# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# create a scatter matrix from the dataframe, color by y_train
scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o', hist_kwds={'bins':20}, s=60, alpha=.8, cmap=mglearn.cm3)

Building your first model: K-Nearest Neighbors
Building this model only consists of storing the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point. Then it assigns the label of this training point to the new data point.

The k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class in the neighbors module. Before we can use the model, we need to instantiate the class into an object. We will set the number of neighbors, the n_neighbors parameter, to 1.
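
To make the idea of "closest point" concrete, here is a minimal sketch of a one-nearest-neighbor prediction written directly in NumPy. The helper predict_one_nearest is hypothetical and not part of scikit-learn. It assumes Euclidean distance, which is the default metric of KNeighborsClassifier, and it is meant only to illustrate the idea, not to reproduce the library's implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

def predict_one_nearest(X_train, y_train, x_new):
    """Hypothetical helper: return the label of the closest training point."""
    # Euclidean distance from x_new to every row of the training set
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # index of the training point with the smallest distance
    nearest = np.argmin(distances)
    # the prediction is simply that point's label
    return y_train[nearest]

# a single new flower: sepal length, sepal width, petal length, petal width
x_new = np.array([4, 2.9, 1, 0.2])
label = predict_one_nearest(X_train, y_train, x_new)
print('Predicted target name: {}'.format(iris_dataset['target_names'][label]))
```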

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

# we build the model on the training set by calling the fit method, which takes
# the training data (X_train) and the training labels (y_train) as NumPy arrays
knn.fit(X_train, y_train)

# build some new data (create a NumPy array)
X_new = np.array([[4,2.9,1,0.2]])

# now we use our KNN object to make a prediction
prediction = knn.predict(X_new)
print('Prediction: {}'.format(prediction))
print('Predicted target name: {}'.format(iris_dataset['target_names'][prediction]))

Evaluating the Model
We know what the correct species is for each iris in the test set. Therefore, we can make a prediction for each iris in the test data and compare it against its label (the known species). We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted.

In [None]:
# we want to print out the predictions our model makes based on our test input data
y_pred = knn.predict(X_test)
print('Test set predictions:\n {}'.format(y_pred))

# manually calculate the score
print('Test set score: {:.2f}'.format(np.mean(y_pred == y_test)))

# we could also use the score method of the knn object, which will compute the test set accuracy for us
print('Test set score: {:.2f}'.format(knn.score(X_test,y_test)))
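
To see what the accuracy computation does, it can help to work through a tiny example by hand. The labels and predictions below are made up for illustration and are not taken from the iris data. The sketch also calls accuracy_score from sklearn.metrics, which computes the same fraction.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# made-up labels and predictions for four flowers (illustration only)
y_true = np.array([0, 1, 2, 2])
y_hat = np.array([0, 1, 1, 2])

# element-wise comparison gives True where the prediction is right
matches = y_hat == y_true

# the mean of a boolean array is the fraction of True values:
# three of the four predictions match, so the accuracy is 3/4
manual_accuracy = np.mean(matches)
library_accuracy = accuracy_score(y_true, y_hat)

print('Manual accuracy: {:.2f}'.format(manual_accuracy))
print('accuracy_score: {:.2f}'.format(library_accuracy))
```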