# Hyperparameters and Model Validation

We saw earlier the buy-n-large framework of implementing a supervised Machine Learning model.

1. Choose a class of model
2. Choose model hyperparameters
3. Fit the model to the training data
4. Use the model to predict labels for new data

A huge part of this whole process is choosing the model, tuning the parameters of the model and then training and testing of the trained model. Let's have a deeper look into this.

## Model Validation

The concept of Model validation is pretty straight forward. Testing the prediction of the model on a provided training data with known output.

Here's a naive approach to this problem.

### The wrong way

We've used this approach in one of the previous sections. 

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [14]:
import seaborn as sns
Dataset=sns.load_dataset('iris')

X_data=Dataset.iloc[:,:4]
Y_data=Dataset.iloc[:,4]


0          setosa
50     versicolor
100     virginica
Name: species, dtype: object

Now we choose a model to train this data on. For this instance we'll pick k-neighbours.

In [31]:
from sklearn.neighbors import KNeighborsClassifier

model=KNeighborsClassifier(
    n_neighbors=1
)

Now we train the model

In [32]:
model.fit(X_data,Y_data)
Y_pred=model.predict(X_data)

And then we calculate it's accuracy.

In [33]:
from sklearn.metrics import confusion_matrix,accuracy_score
# confusion_matrix(Y_data,Y_pred)
accuracy_score(Y_data,Y_pred)

1.0

From the accuracy above, it should be pretty evident that our model is performing perfectly which is false. Our model trains and evaluates on the same data. Furthermore, the K-nearest-neighbourhood algorithm is an instance based algorithm. i.e. it simply stores the data and predicts the labels of each new data point based on the class of the nearest data points in the training dataset. Here the each data point it tries to predict the label of is already in the dataset. It'll obviously be always right in such a case.

### The right way: Holdout Sets

A correct evaluation can be made by training the model on a subset of the dataset so that we don't over train the model and our evaluation method stays more critical and realistic during evaluation.

In [44]:
from sklearn.model_selection import train_test_split
X_test,X_train,Y_test,Y_train=train_test_split(X_data,Y_data)

In [45]:
model.fit(X_train,Y_train)
Y_pred=model.predict(X_test)

accuracy_score(Y_test,Y_pred)

0.8928571428571429

Here, although our model is not performing perfectly, the accuracy seems more realistic and true since it was tested on data that the model was oblivious of while training.