# Iris Plant Species Classification

## Analyze the data using the same techniques as for the last assignment.
Decide for yourself which and how to use the specific commands. Answer
the following questions in the report and include figures supporting your
answers:

### Which classes exist? Are they (roughly) balanced?

In [None]:
from matplotlib import pyplot as plt
import pandas as pd
from sklearn import preprocessing

import utils

plt.rc('font', size=16)

df = pd.read_csv('iris.csv')
utils.ratio(df, 'Name')

Classes: Iris-setosa, Iris-versicolor, Iris-virginica
They are perfectly balanced.
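
The `utils.ratio` helper is not shown in this report. As a rough sketch of an equivalent check, assuming the helper reports relative class frequencies, pandas' `value_counts(normalize=True)` gives the same kind of answer. The `toy` frame below is a hypothetical stand-in for `iris.csv` with 50 rows per class.

```python
import pandas as pd

# Hypothetical stand-in for iris.csv: 50 rows per class.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
toy = pd.DataFrame({"Name": labels})

# Relative frequency of each class; equal values mean a balanced dataset.
class_ratio = toy["Name"].value_counts(normalize=True)
print(class_ratio)
```

Each class accounts for one third of the rows, which is what "perfectly balanced" means here.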

### Which noteworthy trends of features and relations between features as well as features and Classes do you see?

In [None]:
import seaborn as sns

sns.pairplot(df, hue='Name')

In [None]:
from matplotlib import pyplot as plt
plt.figure(figsize=(15,10))
# numeric_only=True skips the string column 'Name' (required in pandas >= 2.0)
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.title("Correlation matrix")
plt.show()

In [None]:
df.plot.box(by='Name',figsize=(40, 15), fontsize=16)

PetalLength and PetalWidth correlate strongly and positively.
PetalLength and SepalWidth correlate negatively (see the correlation matrix above).
PetalLength and PetalWidth separate the classes well and can be used to distinguish them.
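
The correlations read off the heatmap can also be checked numerically. This is a minimal sketch, assuming scikit-learn's bundled copy of the iris data (`load_iris`) as a stand-in for `iris.csv`. Its columns are ordered sepal length, sepal width, petal length, petal width, so they are addressed by index rather than by the CSV's column names.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: sklearn's bundled iris data stands in for iris.csv.
# Column order: sepal length, sepal width, petal length, petal width.
features = load_iris().data

# Pearson correlation matrix between the four feature columns.
corr = np.corrcoef(features, rowvar=False)

petal_length_vs_petal_width = corr[2, 3]
petal_length_vs_sepal_width = corr[2, 1]

print(f"PetalLength vs PetalWidth: {petal_length_vs_petal_width:+.2f}")
print(f"PetalLength vs SepalWidth: {petal_length_vs_sepal_width:+.2f}")
```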

### If you needed to distinguish the classes with those features, which features would you choose, and why?

PetalLength and PetalWidth, because their value ranges barely overlap between classes (see the boxplot above).
SepalLength and SepalWidth are not ideal for distinguishing the flowers, since their value ranges overlap more than those of any other feature.

## Training

In order to classify the three different Iris plant species, set up your first
ML toolchain including the following steps:

### Data and Feature Preprocessing (if necessary and applicable)
#### Are there any outliers in the data which might need to be removed?

In [None]:
X = df[['PetalLength', 'PetalWidth', 'SepalLength', 'SepalWidth']]
X.describe()

As the boxplot and the `describe()` summary show, the data contains some outliers that we could remove. With only 150 samples, however, dropping rows is not a good idea (and it rarely is in general). A better approach is to replace outliers with the mean or median, or to clip them to a minimum and maximum value.

We decided to go for the second method and simply clip the values to the 99% and 1% quantile.

In [None]:
y = df['Name']
q_max = X.quantile(.99)
q_min = X.quantile(.01)
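# Outlier clipping: limit each feature to its 1% and 99% quantiles.
# clip() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that an in-place clip on a slice of df would trigger.
X = X.clip(lower=q_min, upper=q_max, axis=1)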
X.describe()
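
To quantify which points a boxplot draws as outliers, the usual 1.5 × IQR whisker rule can be sketched as follows. The helper name `iqr_outlier_mask` and the sample values are made up for illustration and are not part of the notebook.

```python
import numpy as np

def iqr_outlier_mask(values, whisker=1.5):
    """Return a boolean mask marking values outside the boxplot whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Made-up measurements with one obviously extreme value.
sample = [3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 4.4]
mask = iqr_outlier_mask(sample)
print(np.asarray(sample)[mask])
```

Applied per feature column, the number of `True` entries in the mask gives the outlier count for that feature.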

#### Are there any missing values which need to be taken care of?

In [None]:
# NaN
df.isnull().values.any()

#### Do you need to apply any feature preprocessing steps? (e.g Normalization, Feature Deletion/Reduction/Addition)

Strictly speaking, we need neither normalization nor feature deletion, but some models perform better on normalized data.
With only 150 samples, removing features does not really provide any performance benefit, but we decided to do it anyway with sklearn.

In [None]:
# Scaling
scaler = preprocessing.StandardScaler().fit(X)  # unsupervised: the target is not needed

X_scaled = scaler.transform(X)

# transform() returns a NumPy array; restore the column names
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled

In [None]:
X_scaled.plot.density()

We decided to add some features to see if we can gain any useful information.


In [None]:
X_scaled["PetalLengthSquared"] = X_scaled["PetalLength"] * X_scaled["PetalLength"]
X_scaled["PetalWidthSquared"] = X_scaled["PetalWidth"] * X_scaled["PetalWidth"]
X_scaled["SepalLengthSquared"] = X_scaled["SepalLength"] * X_scaled["SepalLength"]
X_scaled["SepalWidthSquared"] = X_scaled["SepalWidth"] * X_scaled["SepalWidth"]

X_scaled["Name"] = y
sns.pairplot(X_scaled, hue="Name")
X_scaled.drop(columns=["Name"], inplace=True)  # remove target

* Are there any categorical features that need to be transformed so that it can be used for classification task?
    * No. All our features are numerical; the only categorical column is the target `Name`.
* Do you think it makes sense to derive any more features from the given ones? Why/why not?
    * It could make sense, depending on the data. It is possible to generate information that can help a model perform better. Since we do not have a lot of samples we can definitely try to derive new features.
* Split up the dataset into a training and a separate held back test set in a clever way
    * Why is such a train/test split important?
        * A: So that we can evaluate the model on data it has never seen and check whether it merely memorized the training data like a lookup table. It is our last line of defense and essential for measuring how well the model generalizes.
    * Which train/test split percentage do you choose and why?
        * A: We chose a 70/30 split. With only 150 samples we want enough data left to evaluate the model, and 70% and 30% of 150 are both whole numbers (105 and 45 samples).
    * Think about how can you make sure to include samples from all three classes in both datasets and why this is important.
        * A: If a class has no samples in our training data, the model can at best make a wild guess if a sample of that class is passed to the model. We ensured that every class is represented by using `sklearn.model_selection.train_test_split` and supplying it with the `stratify` parameter.


#### Feature Selection

In the lecture we learned that feature extraction methods like PCA or LDA should be fit on the training data only, not on the test data, so we need to split our data now. We use a 70/30 split: 70% of the data for training and 30% as the held-back test set.

* Use an appropriate cross-validation setup for the training:
    * `X_train` and `y_train` represent our training data, and `X_test` and `y_test` our held-back test set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, test_size=0.30)  # 70/30 split
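
To check that a stratified split really keeps all three classes balanced in both parts, the class counts can be inspected after the split. This is a minimal sketch, assuming scikit-learn's bundled iris data as a stand-in for `X_scaled` and `y`. The fixed `random_state` is added only to make the sketch reproducible and is not part of the split above.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: sklearn's bundled iris data stands in for X_scaled and y.
data, target = load_iris(return_X_y=True)

# random_state is fixed here only to make the sketch reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    data, target, stratify=target, test_size=0.30, random_state=0
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With 50 samples per class and a 70/30 split, stratification puts 35 samples of each class into the training set and 15 into the test set.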

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=4)
pca.fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
X_train
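
The rule that PCA is fit on training data only applies to every preprocessing step, including scaling, which in the cells above is fit on the full dataset before the split. One way to enforce this automatically is a scikit-learn `Pipeline`. The following is a sketch under assumptions: it uses the bundled iris data, and `n_components=2` and `n_neighbors=5` are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data, target = load_iris(return_X_y=True)

# Scaler and PCA are refit on the training part of every CV fold,
# so the held-out fold never influences the preprocessing.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),  # illustrative choice, not the notebook's setting
    KNeighborsClassifier(n_neighbors=5),
)

fold_scores = cross_val_score(pipeline, data, target, cv=3, scoring="accuracy")
print(fold_scores.mean())
```

A pipeline like this can also be passed to `GridSearchCV` as the estimator, so that hyperparameter search and preprocessing share the same folds.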

### Model Training
* Train different classification models to distinguish between the three Iris Plant Species:
    * Use the following models: k Nearest Neighbour, Decision Tree, Support Vector Machine


#### KNN

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn import svm, neighbors, neural_network

knn = GridSearchCV(
    estimator= neighbors.KNeighborsClassifier(),
    param_grid= [{'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance'], 'leaf_size': [15, 20]}],
    scoring= "accuracy",
    cv= 3)

knn.fit(X_train, y_train)
knn.best_params_

#### Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = GridSearchCV(
    estimator= DecisionTreeClassifier(),
    param_grid= [{
        'splitter': ['best', 'random'],
        'max_depth': [10, 100, 1000],
        'criterion': ['gini', 'entropy', 'log_loss'],
        'class_weight': ['balanced']}],
    scoring= "accuracy",
    cv= 3
)

tree.fit(X_train, y_train)
tree.best_params_

#### SVM

In [None]:
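# scipy provides loguniform directly; sklearn.utils.fixes.loguniform was only a
# compatibility shim and may not exist in recent scikit-learn releases
from scipy.stats import loguniform

svc = GridSearchCV(
    estimator= svm.SVC(),
    param_grid= [{
        # positional arguments are (a, b, loc, scale):
        # C = 100 + 1000 * x with x drawn log-uniformly from [0.1, 1]
        'C': loguniform(0.1, 1, 100, 1000).rvs(20),
        'class_weight': ['balanced'],
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        # 20 random gamma candidates, drawn log-uniformly from [0.000035, 0.000245]
        'gamma': loguniform(0.000035, 0.000245).rvs(20)
    }]
)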

svc.fit(X_train, y_train)
svc.best_params_

#### Neural Network playground

In [None]:
nn = GridSearchCV(
    estimator=neural_network.MLPClassifier(max_iter=10000),
    param_grid= [{
        'hidden_layer_sizes': [6, 9, 12],
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'solver': ['lbfgs', 'sgd', 'adam'],
        'learning_rate': ['constant', 'adaptive']
    }]
)

nn.fit(X_train, y_train)
nn.best_params_

* Use different hyperparameter settings for each model and explain why and how you chose them
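    * We defined a set of candidate values for each hyperparameter and let `GridSearchCV` evaluate every combination with cross-validation (3-fold for KNN and the decision tree, sklearn's default for the SVM and the neural network). The combination with the best mean accuracy was kept.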
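
Beyond `best_params_`, a fitted `GridSearchCV` also stores the score of every combination in `cv_results_`, which helps to explain why a setting was chosen. This is a minimal sketch, assuming scikit-learn's bundled iris data instead of the PCA-transformed `X_train`, and a reduced KNN grid for brevity.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

data, target = load_iris(return_X_y=True)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]},
    scoring="accuracy",
    cv=3,
)
search.fit(data, target)

# One row per parameter combination, best mean CV accuracy first.
columns = ["param_n_neighbors", "param_weights", "mean_test_score",
           "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

print(results.head())
print(search.best_params_, search.best_score_)
```

Comparing `mean_test_score` against `std_test_score` shows whether the best combination is clearly ahead or whether several settings perform about equally well.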

### Performance Estimates
* Estimate the models’ performances on the held back test set:

In [None]:
knn.score(X_test, y_test)

In [None]:
tree.score(X_test, y_test)

In [None]:
svc.score(X_test, y_test)

In [None]:
nn.score(X_test, y_test)

* Compare the models with their hyperparameter settings with two different error/performance measures
* Why did you choose the specific error/performance measures?
    * We chose sklearn's built-in `classification_report`, since it combines several metrics (precision, recall, F1-score) and reports them for each label.
* What do they tell you?
    * They tell us how well a model performs on the held-back test set, both for each label and overall.

In [None]:
from sklearn.metrics import classification_report

y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
y_pred = svc.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
y_pred = nn.predict(X_test)
print(classification_report(y_test, y_pred))
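
The per-label numbers in the report can be complemented by a confusion matrix, which shows which classes get mixed up with each other. This is a minimal sketch, assuming scikit-learn's bundled iris data and a plain `KNeighborsClassifier` with `n_neighbors=5` as a stand-in for the tuned models above. The fixed `random_state` is only there for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data, target = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    data, target, stratify=target, test_size=0.30, random_state=0
)

# Stand-in model; the notebook's tuned estimators would be used the same way.
model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
predicted = model.predict(X_te)

accuracy = accuracy_score(y_te, predicted)
macro_f1 = f1_score(y_te, predicted, average="macro")
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_te, predicted)

print(f"accuracy: {accuracy:.3f}, macro F1: {macro_f1:.3f}")
print(matrix)
```

Accuracy is the share of the diagonal in the matrix, while macro F1 averages the per-class F1-scores with equal weight, so the two measures diverge when one class is predicted noticeably worse than the others.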

* Which model performs best with which hyperparameter settings and why do you think it does that way?
    * The KNN and SVM classifiers perform best, because each class is mostly cleanly separated from the others.

In [None]:
knn.best_params_

In [None]:
svc.best_params_

* Explain which model you would use in deployment and why
    * We would use the KNN model, since it is the simplest model with the best score.