### Classification
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [2]:
# ---- Load the iris dataset ----
iris = datasets.load_iris()  # returns a Bunch; pass return_X_y=True to get (data, target) instead

# ---- Transform iris dataset to dataframe ----
iris_df = pd.DataFrame(iris.data, columns = iris.feature_names)

In [3]:
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
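The dataframe above holds only the four measurements; the class labels live separately in `iris.target`. As a rough sketch, the labels could be attached as a readable column for exploration. The names `iris_labeled` and `species` are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_labeled = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to their species names.
# "species" is a hypothetical column name chosen for this sketch.
iris_labeled["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Number of observations per class
print(iris_labeled["species"].value_counts())
```

Keeping the labels in a separate copy leaves `iris_df` as a pure feature matrix for the model.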


We now split the dataset into a training set (75%) and a test set (25%), so that the classifier can be evaluated on observations it has not seen during training.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(iris_df, iris["target"], test_size=0.25, random_state=42)
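To make the effect of the split concrete, the sketch below repeats it and prints the resulting sizes. The second call adds `stratify`, which is not used in the original cell; it is shown only as an optional variation that keeps the class proportions similar in both parts.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Same split as in the notebook cell above
X_train, X_test, y_train, y_test = train_test_split(
    iris_df, iris["target"], test_size=0.25, random_state=42
)
print("train:", X_train.shape, "test:", X_test.shape)

# Assumption for illustration: a stratified variant of the same split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris_df, iris["target"], test_size=0.25, random_state=42,
    stratify=iris["target"],
)
# Number of test observations per class
print("test class counts (stratified):", np.bincount(y_test_s))
```

With only 150 observations, a stratified split may give a slightly more representative test set, though either approach is reasonable here.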

#### Build a model with 3 neighbors

In [5]:
knn = KNeighborsClassifier(n_neighbors = 3)

In [6]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
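For intuition about what the fitted classifier does at prediction time, here is a minimal NumPy sketch of the k-nearest-neighbors idea: compute Euclidean distances, take the `k` closest training points, and return the majority label. This is a simplified illustration, not scikit-learn's implementation; the function `knn_predict` and the toy data are hypothetical, and the sketch assumes labels are non-negative integers.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Brute-force k-NN with Euclidean distance and an unweighted majority vote."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)

    # Pairwise distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]

    # Majority vote; on a tie, argmax picks the smallest label
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: two well-separated groups of points
toy_X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(toy_X, toy_y, np.array([[0.05, 0.05], [4.9, 5.2]]), k=3)
print(toy_pred)
```

The `n_neighbors=3` argument in the notebook plays the role of `k` here: it sets how many neighbors vote, not how many groups the data is divided into.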

#### How good is the classifier?

In [7]:
y_pred = knn.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 1.0
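A perfect score on a test set of only 38 observations can depend on the particular split. One common way to get a steadier estimate is k-fold cross-validation. The sketch below is an addition to the original notebook and assumes 5 folds and a small, arbitrary set of `n_neighbors` values; `cv_results` is an illustrative name.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

cv_results = {}
for k in (1, 3, 5, 7):  # candidate values chosen for illustration only
    # For classifiers, an integer cv uses stratified folds
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    cv_results[k] = scores
    print(f"k={k}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing the mean accuracy across folds gives a fairer basis for choosing `n_neighbors` than a single train/test split.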


#### Credits
* [DataCamp](https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn)
* [Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [Aletta Smits](https://nl.linkedin.com/in/alettasmits)
* [Wikipedia](https://en.wikipedia.org/wiki/Statistical_classification)