One of the most important questions asked in the data science community is: which classification model should I select? In this section we will learn how to quickly and efficiently select the **best classification model** for a given dataset with any number of features and any number of observations.

This **Model_selection** folder contains all the classification models that we learned and implemented in Part-3. In all of the code templates in this folder, we have removed the print commands to simplify the implementation. We have also removed the visualization of the training and test sets, because that visualization only works when the dataset has exactly two features.

**About the dataset:** \\
Here we take the classic breast cancer classification dataset (benign tumor or malignant tumor) from the UCI ML repository. This dataset is a generic one containing many features, all of them numerical, and a binary dependent variable vector (the 'class' column) taking the value 2 (benign tumor) or 4 (malignant tumor). Each row corresponds to a patient, and for each patient 11 pieces of information were gathered (10 features in the left columns and 1 dependent variable in the rightmost column). Doctors understand the meaning of each feature, whereas data scientists may not know what each attribute/feature means. This is fine, because the data scientist can still build classification ML models and learn the correlations between the features and the dependent variable vector. Finally, the classification model trained on the training set will be able to predict the tumor type (benign: 2 or malignant: 4) for each patient (new or old).

The basic format of all the code templates in this folder is: (i) data pre-processing, (ii) training the ML classification model on the training set, and (iii) evaluating the trained model on the test set by computing the confusion matrix and the accuracy score.

The only thing that changes from one code template to another is the single cell in which we build and train the ML model we want to try.

The code templates in this folder are valid as long as the dataset contains the features in the left columns and the dependent variable vector in the rightmost column, regardless of the number of features, and as long as all the features are numerical. If there is a categorical feature, don't forget to apply data preprocessing tools first (such as one-hot encoding or ordinal encoding).
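As a minimal sketch of the categorical case mentioned above, the cell below one-hot encodes a single categorical column before the rest of the template runs. The toy data, the column names (`age`, `size`, `region`) and the column index `2` are hypothetical and only serve the illustration; they are not part of the breast cancer dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy dataset: two numerical features, one categorical feature,
# and the dependent variable in the rightmost column.
toy = pd.DataFrame({
    'age': [35, 52, 47, 61],
    'size': [1.2, 3.4, 2.8, 4.1],
    'region': ['north', 'south', 'north', 'east'],
    'class': [2, 4, 2, 4],
})

X_toy = toy.iloc[:, :-1].values   # features (left columns)
y_toy = toy.iloc[:, -1].values    # dependent variable (rightmost column)

# One-hot encode column index 2 ('region') and pass the numerical columns
# through unchanged. sparse_threshold=0 forces a dense array as output.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [2])],
    remainder='passthrough',
    sparse_threshold=0,
)
X_encoded = np.array(ct.fit_transform(X_toy)).astype(float)

print(X_encoded.shape)   # 3 one-hot columns + 2 numerical columns
```

The encoded columns are placed first and the untouched numerical columns follow, so `X_encoded` can be passed to the splitting and scaling cells in place of `X`.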

These code templates are designed so that you only need to make minimal modifications, such as changing the name of the dataset.

# Logistic Regression

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
# As long as the dataset contains the features in the left columns and the dependent variable vector in the
# rightmost column, the format below is going to work, regardless of the number of features in the dataset.

dataset = pd.read_csv('Breast_cancer_Wisconsin.csv')
X = dataset.iloc[:, :-1].values   # All rows, all columns except the last one: the matrix of features.
y = dataset.iloc[:, -1].values   # All rows, last column only: the dependent variable vector.

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
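
The scaling cell above fits the scaler on the training set only and then reuses the same statistics on the test set. A minimal NumPy sketch of that idea, using small made-up arrays (`X_train_demo` and `X_test_demo` are hypothetical names and values), could look like this:

```python
import numpy as np

# Made-up numbers, only for illustration.
X_train_demo = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

# Statistics are computed on the training set only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)   # population standard deviation (ddof=0)

# The same mean and std are applied to both sets, so no information
# from the test set leaks into the preprocessing step.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0))   # approximately 0 for every column
print(X_train_scaled.std(axis=0))    # approximately 1 for every column
```

Note that the scaled test set does not have zero mean or unit variance in general, because it is standardized with the training-set statistics.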

## Training the Logistic Regression model on the Training set

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

## Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[103   4]
 [  5  59]]


0.9473684210526315

So, with logistic regression we got only 4 + 5 = 9 incorrect predictions out of 171, for an accuracy of 94.7%. Great!
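
The arithmetic behind these two numbers can be checked directly from the confusion matrix printed above. In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions. The sketch below simply retypes the matrix by hand:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual class 2 / 4, columns: predicted class 2 / 4).
cm = np.array([[103, 4],
               [5, 59]])

correct = np.trace(cm)        # diagonal: 103 + 59 = 162
total = cm.sum()              # all test observations: 171
incorrect = total - correct   # off-diagonal: 4 + 5 = 9
accuracy = correct / total

print(incorrect, round(accuracy, 4))   # 9 0.9474
```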

Below we tabulate the accuracy score for each classification model used in this dataset.

| Classification model | Accuracy score |
|---|---|
| Logistic regression | 94.7% |
| K-nearest neighbor (K-NN) | 94.7% |
| Support vector machine (SVM) | 94.15% |
| Kernel SVM | 95.3% |
| Naive Bayes | 94.15% |
| Decision tree classification | 95.9% |
| Random forest classification | 93.6% |

So, we see that the big winner on this dataset is Decision tree classification. Random forest classification often wins on complex datasets with many features and many observations, but that is not the case here. A very surprising result! That is why we need to try all the ML models and choose the one that fits best.
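
Trying every model can also be done in a single loop instead of one notebook per model. The sketch below is one possible way to do it, not the code used to produce the numbers above: it runs on synthetic stand-in data from `make_classification` so that it works without the CSV file, and the hyperparameters (`n_neighbors=5`, `kernel='linear'`, `kernel='rbf'`, `criterion='entropy'`, `n_estimators=10`) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (assumption); replace with the X and y loaded from the CSV.
X, y = make_classification(n_samples=600, n_features=9, random_state=0)

# Same preprocessing as in the template above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# One entry per model; the hyperparameters are illustrative assumptions.
models = {
    'Logistic regression': LogisticRegression(random_state=0),
    'K-NN': KNeighborsClassifier(n_neighbors=5),
    'SVM (linear)': SVC(kernel='linear', random_state=0),
    'Kernel SVM (rbf)': SVC(kernel='rbf', random_state=0),
    'Naive Bayes': GaussianNB(),
    'Decision tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
    'Random forest': RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0),
}

# Train each model on the training set and score it on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from best to worst accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:<22s} {score:.4f}')

best_model = max(scores, key=scores.get)
print('Best model:', best_model)
```

When reading such a ranking, keep the size of the test set in mind: with 171 test rows, the gap between 95.9% and 94.7% in the table above corresponds to just two predictions.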
