# Classification Models

In this notebook, three classification models are built to predict the species of an iris flower from its attributes. The models come from the Scikit-Learn library, and techniques such as cross-validation are applied in case of overfitting.

Finally, the models are briefly compared in order to select the best one, which is then compared in the final notebook with a Multi-Layer Perceptron model.
 

In [7]:
__author__ = "Víctor Vega Sobral"

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.svm import SVR
import matplotlib.pyplot as plt
import numpy as np
import joblib

### Import Iris Normalized Data

In [9]:
iris_df = pd.read_csv("datasets/iris_normalized_data.csv")

# Drop the first column, which holds no meaningful data
iris_df.drop(iris_df.columns[0], axis=1, inplace=True)
iris_df.head()


|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 | 0 |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 | 0 |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 | 0 |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 | 0 |
| 4 | -1.021849 | 1.26346 | -1.341272 | -1.312977 | 0 |


### Data Split

- Train set: 70% of the dataset.
- Dev set: 15% of the dataset.
- Test set: 15% of the dataset.

For this, we first split the data 70/30: 70% for the train set and 30% held out.

Then, we split the held-out 30% in half, which gives 15% of the total for the dev set and 15% for the test set.

We didn't use a dev set in previous subjects. However, in **the Deep Learning Specialization on Coursera that I'm taking, Andrew Ng explains the importance of having a dev set**.

The dev set is kept separate from the training and test sets, and it is commonly used to:
- Adjust hyperparameters.
- Evaluate the model's performance before the final test.
- Detect possible overfitting and select the best model.
- Make decisions about the model's generalization just before the final test.
- Avoid biasing the evaluation by repeatedly using the test set for validation.
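As a minimal sketch of the first point, the loop below picks a hyperparameter by its dev-set accuracy and never touches the test set. It rests on assumptions that are not part of this notebook: the data is scikit-learn's bundled Iris set rather than the normalized CSV, and `KNeighborsClassifier` with the candidate values of `k` is only an illustrative choice of model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's bundled Iris set, not the notebook's CSV
X, y = load_iris(return_X_y=True)

# Same 70/15/15 scheme as this notebook
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Try a few candidate values and score each one on the dev set only
dev_scores = {}
for k in (1, 3, 5, 7, 9):  # hypothetical candidate values
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    dev_scores[k] = model.score(X_dev, y_dev)  # mean accuracy on dev

best_k = max(dev_scores, key=dev_scores.get)
print(dev_scores)
print("Best k on dev:", best_k)
```

Because the test set plays no part in choosing `best_k`, the final test score remains an unbiased estimate of how the chosen model generalizes.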

Therefore, the workflow is as follows:

1. Train the models and evaluate them on the ``dev`` set. The model with the best dev-set performance is the candidate for the test set.
    - Train with the ``training`` set.
    - Evaluate on the ``dev`` set.
    - Create a `csv` file for every set, so that they can be loaded in other notebooks.


2. Tune hyperparameters using the `dev` set.
    - Only if the dev set results aren't good.
    - Avoid overfitting with **cross-validation techniques** such as `GridSearchCV` and `RandomizedSearchCV`.
    - Finally, re-evaluate the tuned model on dev before passing it to test.

3. **In the next notebook**:
    - Implement an MLP and evaluate it on the dev set.
    - Compare its performance with the best traditional model.
    - If the MLP is better than the other model, it is the one finally passed to the test set.
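Step 2 of the workflow could look roughly like the sketch below: a cross-validated random search on the training set, followed by a re-evaluation on dev. The choices here are assumptions for illustration only: `SVC` as the model, the search range for `C`, and scikit-learn's bundled Iris data in place of the normalized CSV.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled Iris set
X, y = load_iris(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Randomized search with 5-fold cross-validation, run on the training set only
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": uniform(loc=0.1, scale=10)},  # C sampled from [0.1, 10.1]
    n_iter=10,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

# Re-evaluate the tuned model on dev before it is ever shown the test set
dev_accuracy = search.score(X_dev, y_dev)
print("Best params:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Dev accuracy:", dev_accuracy)
```

A large gap between the cross-validation accuracy and the dev accuracy would suggest that the tuned model is still overfitting the training data.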








In [12]:
X = iris_df.drop('class', axis=1)  # Features
y = iris_df['class']  # Target labels

# Hold out 30% of the data, so the train set is 70% of the total
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Split the held-out 30% evenly between the dev and test sets
X_dev, X_test, y_dev, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Store each set in its own CSV file

X_train.to_csv("datasets/iris_data_splits/x_train.csv", index=False)
y_train.to_csv("datasets/iris_data_splits/y_train.csv", index=False)
X_dev.to_csv("datasets/iris_data_splits/x_dev.csv", index=False)
y_dev.to_csv("datasets/iris_data_splits/y_dev.csv", index=False)
X_test.to_csv("datasets/iris_data_splits/x_test.csv", index=False)
y_test.to_csv("datasets/iris_data_splits/y_test.csv", index=False)


# Final split proportions:
# Train: 70% 
# Dev: 15% 
# Test: 15%
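A quick sanity check of these proportions can be sketched as follows. It assumes the dataset has the 150 rows of the classic Iris data (the row count is not shown in this notebook) and splits plain indices instead of the real DataFrame. Since the held-out 30% would then be 45 samples, an odd number, the dev and test sets cannot be exactly equal and differ by one sample.

```python
from sklearn.model_selection import train_test_split

n_samples = 150  # assumption: size of the classic Iris dataset
indices = list(range(n_samples))

# Same two-step split as above, applied to row indices only
train_idx, temp_idx = train_test_split(indices, test_size=0.3, random_state=42)
dev_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

# Report the size and share of each split
for name, part in (("train", train_idx), ("dev", dev_idx), ("test", test_idx)):
    print(f"{name}: {len(part)} samples ({len(part) / n_samples:.1%})")

# The three splits should be disjoint and cover every row exactly once
assert sorted(train_idx + dev_idx + test_idx) == indices
```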
