<img src="header.png" align="left"/>

# Exercise Classification of IRIS flowers (10 points)

The goal of this exercise is to predict the species of an iris flower from 4 features, using several classification methods.
We use the data set published by Edgar Anderson and R. A. Fisher in 1936 [1][2]. The dataset contains 150 samples, each with
4 measured values (sepal length, sepal width, petal length, petal width) as features and the correct class as label.

```
[1] E. Anderson, "The species problem in Iris," Annals of the Missouri Botanical Garden, vol. 23, no. 3, pp. 457–509, 1936. doi:10.2307/2394164. JSTOR 2394164.
[2] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
```

**NOTE**

Document your results by adding a markdown cell or a Python cell (as a comment) and writing your answers there. For some tasks, the result cell is already provided.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ditomax/mlexercises/blob/master/02%20Exercise%20classification%20of%20IRIS%20flowers.ipynb)


# Import of modules

In [2]:
#
# Prepare colab
#
COLAB=False
try:
    %tensorflow_version 2.x
    print("running on google colab")
    COLAB=True
except Exception:
    print("not running on google colab")


#
# Turn off some warnings
#
from warnings import simplefilter
# ignore FutureWarning explicitly, then every other Warning subclass as well
simplefilter(action='ignore', category=FutureWarning)
simplefilter(action='ignore', category=Warning)

#
# Import modules
#
import pandas as pd
import numpy as np
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

not running on google colab


In [3]:
#
# Set size of figures
#
plt.rcParams['figure.figsize'] = [16, 9]

# Loading and checking data

In [None]:
# 
# Load data
# 
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

if COLAB:
    df = pd.read_csv('https://raw.githubusercontent.com/ditomax/mlexercises/master/data/iris/iris.csv', names=names)
else:
    df = pd.read_csv('data/iris/iris.csv', names=names)


In [None]:
#
# Basic data check
#
print(df.head())

<div class="alert alert-block alert-info">

## Task

Check the distribution of the classes: implement your own code to print how many samples belong to each class in this dataset (1 point)

</div>
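
As a hedged illustration of what "distribution of classes" means, the sketch below counts labels in a small toy DataFrame. The name `toy_df` and all values in it are invented for this example and are not taken from `iris.csv`; the same pandas calls should carry over to the `class` column of `df`.

```python
import pandas as pd

# Toy stand-in for the exercise DataFrame; names and values are invented.
toy_df = pd.DataFrame({
    "petal-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Absolute number of samples per class
class_counts = toy_df["class"].value_counts()

# Relative share of each class (the shares sum to 1.0)
class_shares = toy_df["class"].value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

A balanced dataset shows (nearly) equal counts for all classes, which matters later when interpreting accuracy.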

In [1]:
# your code here

In [None]:
# 
# Separate the dataset into training data and test data
#
array = df.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.40
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=42)

<div class="alert alert-block alert-info">

## Task

Look up the documentation of the model_selection.train_test_split function (2 points)
1. Describe the parameters used above.
1. Find out what stratification means.
1. Change the split to 70% training data and 30% test data.
</div>
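
To make the effect of stratification tangible, here is a minimal sketch on deliberately imbalanced toy data. The arrays `X_toy` and `y_toy` and the split ratio of 0.25 are assumptions chosen for illustration only; they are not the values the task asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, imbalanced labels (16 x class 0, 4 x class 1).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

# test_size=0.25 puts 5 samples into the test part.
# stratify=y_toy asks for the same class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Without the `stratify` argument, the rare class could by chance be missing from the small test part entirely.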

In [None]:
# your changed code here

In [None]:
#
# Train a simple classifier using the kNN method
#
knn_classifier = KNeighborsClassifier(n_neighbors=5,metric='euclidean')
knn_classifier.fit(X_train, Y_train)

In [None]:
#
# Calculate the accuracy of the trained model
#
predictions = knn_classifier.predict(X_validation)
print('Accuracy: {}'.format(accuracy_score(Y_validation, predictions)))
#
# Task: search the internet for a concise description of the accuracy quality measure. (1 point)
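
Accuracy is the fraction of samples whose predicted label equals the true label. The sketch below checks this on invented toy labels (`y_true_toy` and `y_pred_toy` are hypothetical names, not data from the exercise) by comparing a manual computation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented toy labels: 4 of the 5 predictions are correct.
y_true_toy = np.array(["a", "a", "b", "b", "c"])
y_pred_toy = np.array(["a", "b", "b", "b", "c"])

# Manual computation: share of positions where prediction and truth agree
manual_accuracy = np.mean(y_true_toy == y_pred_toy)

# The same quantity computed by scikit-learn
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, library_accuracy)
```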

In [None]:
#
# Print a confusion matrix (rows: true classes, columns: predicted classes)
#
print(confusion_matrix(Y_validation, predictions))

<div class="alert alert-block alert-info">

## Task

Search the internet for a description of the meaning of a confusion matrix and write it down here. (1 point)

</div>
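
As a small, hedged example of how to read such a matrix, the sketch below builds one from invented labels. The label lists are made up for illustration; the `labels` argument fixes the order of rows and columns.

```python
from sklearn.metrics import confusion_matrix

# Invented toy labels, not results from the exercise
labels = ["setosa", "versicolor", "virginica"]
y_true_toy = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred_toy = ["setosa", "setosa", "versicolor", "virginica", "virginica", "versicolor"]

# Row i = true class labels[i], column j = predicted class labels[j]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)

# The diagonal holds the correctly classified samples
print("correct predictions:", cm.trace())
```

In this toy case the first row has all its samples on the diagonal, while the second and third rows each show one sample confused with the other class.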

In [3]:
# your description goes here

# Test of a second classification method

In [4]:
#
# Train another classifier using the decision tree method
# 

<div class="alert alert-block alert-info">

## Task

Implement a decision tree classifier for the IRIS dataset (2 points)
    
Test the classifier with accuracy_score using Y_validation
</div>

In [5]:
# your code here
tree_classifier = ...

In [None]:
#
# Calculate accuracy
#
predictions = tree_classifier.predict(X_validation)
print('Accuracy: {}'.format(accuracy_score(Y_validation, predictions)))

# Testing multiple methods at the same time

In [None]:
scoring = 'accuracy'

models = []
models.append(('KNN', KNeighborsClassifier(n_neighbors=5)))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))

results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=42,shuffle=True)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("Model {}: accuracy {:.3f} (deviation {:.3f})".format(name, cv_results.mean(), cv_results.std()))
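
The cell above relies on k-fold cross-validation. As a rough sketch of what `KFold` does, the example below splits ten hypothetical sample indices into five folds and counts how often each index lands in a test fold. The array `indices` is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical sample indices instead of the real dataset
indices = np.arange(10)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how often each sample is used for testing
test_counts = np.zeros(10, dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(indices)):
    test_counts[test_idx] += 1
    print("fold {}: train={} test={}".format(fold, train_idx, test_idx))

print("times each sample was tested:", test_counts)
```

Every sample is tested exactly once and used for training in the other four folds, which is why the mean over the folds is a more stable estimate than a single split.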

<div class="alert alert-block alert-info">

## Task

Explain which model you would use for a project on the IRIS data. (2 points)
</div>

In [7]:
# your decision here

# Optimizing one method with hyperparameter optimization

The idea of hyperparameter optimization is to train the model with several candidate values of its hyperparameters and to select the values that yield the best quality (here: accuracy).
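
A minimal sketch of this idea, deliberately using a different model and hyperparameter than the task: it varies `max_depth` of a `DecisionTreeClassifier`. The candidate list, the variable names, and the use of `load_iris` from scikit-learn (instead of the course CSV file) are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# scikit-learn's bundled copy of the iris data (assumption: stands in for iris.csv)
X_demo, y_demo = load_iris(return_X_y=True)

candidates = [1, 2, 3, 5]  # hypothetical candidate values for max_depth
kfold_search = KFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = {}
for depth in candidates:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold_search, scoring="accuracy")
    mean_scores[depth] = scores.mean()
    print("max_depth={}: accuracy {:.3f} (deviation {:.3f})".format(
        depth, scores.mean(), scores.std()))

# Select the candidate with the highest mean cross-validated accuracy
best_depth = max(mean_scores, key=mean_scores.get)
print("best max_depth:", best_depth)
```

Keeping the per-fold scores of each candidate, and not only their mean, makes it possible to compare the spread of the results as well.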

<div class="alert alert-block alert-info">

## Task

Implement a hyperparameter search for the hyperparameter n_neighbors of the KNeighborsClassifier and select and write down the best result. (2 points)
</div>

In [None]:
scoring = 'accuracy'

In [None]:
results = []
parameters = []

#
# your code here
#

In [None]:
# Compare the cross-validation results for each n_neighbors value
fig = plt.figure()
fig.suptitle('n_neighbors Comparison')
ax = fig.add_subplot(111)
positions = range(len(parameters))
ax.set_xticks(np.arange(0, len(parameters)))
plt.violinplot(results,positions)
ax.set_xticklabels(parameters)
plt.show()

In [8]:
# your selection of the best result for n_neighbors goes here