# Module 3 - Classification

In the previous module, you learned about data, data visualizations, and several ways to manipulate and preprocess your data. In this module, you will learn about classification, one application of supervised learning. A supervised learning algorithm analyzes labeled training data and produces an inferred function, which can be used to map new examples to outputs. Classification is one of many supervised learning methods. In this module, you will work with common techniques used to implement classification and will gain an understanding of the metrics used to evaluate classification models.


## Introduction to classification

### Classification metrics

The main aim of classification is to find the best boundary to separate the classes of the data. You need to confirm that the results of your classifier are valid and reliable.
In this section, you are going to learn more about metrics you can use to evaluate any classification algorithm.

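**Classification matrix**

For a binary problem, every prediction falls into one of four cells of the classification (confusion) matrix:

- x1: True Positive: the example is positive and is predicted positive.
- x2: False Positive: the example is negative but is predicted positive.
- x3: False Negative: the example is positive but is predicted negative.
- x4: True Negative: the example is negative and is predicted negative.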

Precision = x1 / (x1+x2)

Recall = x1 / (x1+x3)

Accuracy = (x1 + x4) / (x1 + x2 + x3 + x4)
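
As a quick worked example, the three formulas can be computed directly from the four counts x1 to x4. This is only a sketch: the counts are made-up values for illustration, and the helper name `classification_metrics` is hypothetical rather than part of any library.

```python
def classification_metrics(x1, x2, x3, x4):
    """Return (precision, recall, accuracy) from the four matrix counts.

    x1 = true positives, x2 = false positives,
    x3 = false negatives, x4 = true negatives.
    """
    precision = x1 / (x1 + x2)
    recall = x1 / (x1 + x3)
    accuracy = (x1 + x4) / (x1 + x2 + x3 + x4)
    return precision, recall, accuracy


# Made-up counts, for illustration only
precision, recall, accuracy = classification_metrics(x1=40, x2=10, x3=20, x4=30)
print(precision, recall, accuracy)
```

With these counts, precision is 40/50, recall is 40/60, and accuracy is 70/100, which shows that the three metrics can disagree on the same classifier.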


## Logistic Regression in Scikit-Learn

Start by loading the data into a Pandas DataFrame.
For this sample code, the "Iris" dataset mentioned before is used; here it is loaded with Scikit-Learn's load_iris() instead of being read from a CSV file.

In [29]:
import pandas as pd
from sklearn.datasets import load_iris

# with return_X_y=True, load_iris returns a (features, target) tuple
x, y = load_iris(as_frame=True, return_X_y=True)




In [31]:
# standardize features (zero mean, unit variance)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

x = scaler.fit_transform(x)


array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       ...])

In [32]:
# divide into train and test sets

from sklearn.model_selection import train_test_split

# 1/3 of the data will be used to test

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.33, random_state=0) 

Use the Scikit-Learn implementation of logistic regression.

Arguments to sklearn.linear_model.LogisticRegression:

- solver: the algorithm to use in the optimization problem. In this example and in the editors, you will use 'lbfgs', which supports both binary and multi-class problems.

- multi_class: selects how multi-class targets are handled:

    - 'ovr': one-vs-rest; a separate binary classifier is fit for each class.
    - 'multinomial': a single model is fit over all classes at once.
    - 'auto': chooses 'ovr' for binary data (or for the 'liblinear' solver) and 'multinomial' otherwise. You will simply use this value.

Note: recent Scikit-Learn releases (1.5 onward) deprecate the multi_class argument, so this code may emit a deprecation warning there.
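
To make the difference between the one-vs-rest and multi-class strategies concrete, here is a minimal sketch on the Iris data. It is an illustration under assumptions, not the course's own code: it uses Scikit-Learn's `OneVsRestClassifier` wrapper to build the one-vs-rest variant explicitly, and it relies on the default behaviour of `LogisticRegression` with the 'lbfgs' solver for the multi-class variant.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

# One-vs-rest: one binary logistic regression is trained per class
ovr_model = OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=1000))
ovr_model.fit(X, y)

# A single logistic regression model trained on all classes at once
multi_model = LogisticRegression(solver='lbfgs', max_iter=1000)
multi_model.fit(X, y)

print(len(ovr_model.estimators_))  # number of binary models
print(multi_model.coef_.shape)     # (n_classes, n_features)
```

The one-vs-rest wrapper holds one fitted binary model per class, while the single model stores one row of coefficients per class.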

In [41]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver = 'lbfgs', multi_class='auto')

model.fit(x_train, y_train) # fit the model using the train data

y_pred = model.predict(x_test) # generate predicted y's using test data

score = model.score(x_test, y_test) # compare prediction with test data

print(score)


0.94


### Mission: Classifying Iris using Logistic Regression

In this mission, you will work with the Iris dataset and will practice how to extract and classify the data in the Iris dataset using logistic regression. You will then evaluate the developed classifier. 

To complete this mission, perform the following task in the editor provided:

Try classifying the Iris dataset using logistic regression!

To complete this mission, your code should perform the following tasks:

Train a LogisticRegression model using X_train and y_train with the following parameters:
- random_state=0 (for consistent output);
- L-BFGS solver;
- multi_class='multinomial' (because the data is not binary);
- 1000 maximum iterations

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True )

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial', max_iter=1000)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

score = precision_score(y_test, y_pred, average='weighted')

print('precision score: '+ str(score))



precision score: 0.98125


### Mission: Classifying Mushrooms using Logistic Regression

Mission Instructions:

In this mission, you will work with the Mushrooms dataset and will practice how to extract and classify the data in the Mushrooms dataset using logistic regression. You will then evaluate the developed classifier.

To complete this mission, perform the following task in the editor provided:

- Train a LogisticRegression model using X_train and y_train with the following parameters:
    - random_state=0 (for consistent output);
    - L-BFGS solver;
    - multi_class='ovr' (because the data is binary); and
    - 100 maximum iterations.
-  Score your model using X_test and y_test:
    - Using the test dataset, your code should return the score of the model.


In [13]:
def main():
    import pandas as pd
    dataset = pd.read_csv('mushrooms.csv')

    y = dataset.iloc[:, 0].values
    selected_X = dataset.iloc[:, 1:3].values

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    encoded_y = LabelEncoder().fit_transform(y)
    encoded_X = OneHotEncoder().fit_transform(selected_X).toarray()

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(encoded_X, encoded_y, test_size=0.33, random_state=0)

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr', max_iter=100)

    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    from sklearn.metrics import precision_score

    score = precision_score(y_test, y_pred)

    print('precision score: ' + str(score))

    return score

main()

precision score: 0.5615212527964206


0.5615212527964206

## K-Nearest Neighbours (KNN)

Start by loading the data into a Pandas DataFrame.
For this sample code, the "Iris" dataset mentioned before is used; here it is loaded with Scikit-Learn's load_iris() instead of being read from a CSV file.

In [3]:
import pandas as pd

from sklearn.datasets import load_iris

X, y = load_iris(as_frame = True, return_X_y=True)

Standardize your data using Scikit-Learn's standard scaler.
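
As a side sketch, separate from the notebook cells, the scaling step can be reproduced by hand: each feature column has its mean subtracted and is divided by its standard deviation. The variable names `X_raw`, `X_manual`, and `X_scaled` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw, _ = load_iris(return_X_y=True)

# Manual standardization: subtract the column mean, divide by the column std
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The same transformation done by Scikit-Learn
X_scaled = StandardScaler().fit_transform(X_raw)

print(np.allclose(X_manual, X_scaled))
```

Distance-based classifiers such as KNN are sensitive to feature scale, which is why this step comes before fitting the model.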

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)


Divide your data into train and test sets.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

Use the Scikit-Learn implementation of the classifier and predict the data.

In [10]:
from sklearn.neighbors import KNeighborsClassifier

# choose k value

K = 3

model = KNeighborsClassifier(n_neighbors=K)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

KNeighborsClassifier(n_neighbors=3)

Get the accuracy score after fitting.
Note:
KNeighborsClassifier.score() returns the mean accuracy of the model's predictions on the given test data and labels.

In [12]:
score = model.score(X_test, y_test)

print(score)

0.96
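
The value of K above is picked by hand. One possible way to compare several candidates, sketched here under assumptions, is 5-fold cross-validation. The candidate list is arbitrary, and wrapping the scaler and classifier in a pipeline is a design choice of this sketch rather than something the notebook does.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cv_scores = {}
for k in (1, 3, 5, 7, 9):  # candidate values, chosen arbitrarily
    # The scaler inside the pipeline is refit on each training fold
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipeline, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print(best_k)
```

Keeping the scaler inside the pipeline means the held-out fold never influences the scaling statistics, which gives a fairer estimate for each K.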


### Mission: Classifying Iris using KNN

Mission Instructions:

In this mission, you will work with the Iris dataset and will practice how to extract and classify data using the k-nearest neighbors classifier. You will then evaluate the developed classifier.

Perform the following task to complete this mission:

Step 1    

Try classifying the Iris dataset using k-nearest neighbor (KNN)!

To complete this mission, your code should perform the following tasks:

Train a KNeighborsClassifier model using X_train and y_train with the following parameters:

Three (3) neighbors

Score your model using X_test and y_test.

Using the test dataset, your code should return the score of the model.

In [15]:
def main():

    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    from sklearn.neighbors import KNeighborsClassifier

    k = 3

    model = KNeighborsClassifier(n_neighbors=k)

    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    score = model.score(X_test, y_test)

    print(score)

    return score

main()

0.96


0.96

### Mission: Classifying Mushrooms using KNN

In this mission, you will work with the Mushrooms dataset and will practice how to extract and classify data using the k-nearest neighbors classifier. You will then evaluate the developed classifier.

To complete this mission, perform the following task in the editor provided:

Step 1    

Try classifying the Mushrooms dataset using k-nearest neighbor (KNN).

To complete this mission, your code should perform the following tasks:

Train a KNeighborsClassifier model using X_train and y_train with the following parameters:

- five (5) neighbors.

Score your model using X_test and y_test.
Your code should return the score of the model using the test dataset.



In [22]:

def main():
    import pandas as pd
    dataset = pd.read_csv('mushrooms.csv')

    y = dataset.iloc[:, 0].values
    selected_X = dataset.iloc[:, 1:3].values

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    encoded_y = LabelEncoder().fit_transform(y)
    encoded_X = OneHotEncoder().fit_transform(selected_X).toarray()

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(encoded_X, encoded_y, test_size=0.33, random_state=0)

    from sklearn.neighbors import KNeighborsClassifier

    k = 5

    model = KNeighborsClassifier(n_neighbors=k)

    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    score = model.score(X_test, y_test)

    print(score)

    return score

main()


0.6135770234986945


0.6135770234986945


## Support Vector Machines

### Optimal hyperplane

There are many possible hyperplanes that separate the classes; the optimal hyperplane is the one with the maximum margin (the distance between the hyperplane and the closest data points of each class) and the least misclassification error.

The boundary hyperplane chosen by the support vector machine (SVM) is justified when you test your model: with a larger margin, any new point can be classified with more confidence.

**Why Support Vector Machines (SVMs)?** 
In general, support vector machines (SVMs) are very good when you have a large number of features.

**Non-Optimal Hyperplane**
Some non-optimal hyperplanes are accepted, but they are not ideal, as they are biased toward one of the classes. Therefore, they do not generalize properly, which is the end goal.

### How Do Support Vector Machines (SVMs) Work?

An SVM's objective is to find a hyperplane in an N-dimensional space (where N is the number of features) that separates the data points of the different classes. As discussed earlier, the features can be any input to your machine learning model, and the class is the classification output that belongs to this input.

**SVMs: Simply Put**
The support vector machine finds the hyperplane that provides the best possible separation for different classes in the multi-dimensional hyperspace.

### Kernels

Once a suitable extra feature has been added, it is easy for a support vector machine (SVM) to find a linear hyperplane between the two classes. However, another burning question arises: do you need to add this feature manually to have a hyperplane? No, the support vector machine (SVM) has a technique called the kernel trick.
Applying the kernel trick in the previously mentioned Scenario 5 leads to a non-linear boundary when viewed in the original dimensional space. The kernel used is called a polynomial kernel.
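
The following sketch illustrates the idea on a toy dataset. It is not the scenario from the lesson: the data comes from Scikit-Learn's `make_circles`, and the helper `add_radius_feature` is a hypothetical name introduced here. It compares a linear kernel on the raw features, a linear kernel after adding the feature x1² + x2² by hand, and a degree-2 polynomial kernel that needs no manual feature.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data: one class forms a ring around the other (not linearly separable)
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# 1) Linear kernel on the two raw features
linear_score = SVC(kernel='linear').fit(X_train, y_train).score(X_test, y_test)


# 2) Linear kernel after manually adding the feature x1^2 + x2^2
def add_radius_feature(points):
    return np.column_stack([points, (points ** 2).sum(axis=1)])


manual_score = (
    SVC(kernel='linear')
    .fit(add_radius_feature(X_train), y_train)
    .score(add_radius_feature(X_test), y_test)
)

# 3) Polynomial kernel on the raw features: no manual feature is needed
poly_score = SVC(kernel='poly', degree=2).fit(X_train, y_train).score(X_test, y_test)

print(linear_score, manual_score, poly_score)
```

The manual feature and the polynomial kernel both let the classifier separate the rings, while the plain linear kernel cannot, which is the point of the kernel trick.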

A hard margin means that the support vector machine (SVM) is very rigid in finding the best margin that satisfies the minimum classification error and the maximum margin between support vectors. This might cause overfitting.

### Types of SVM Kernels

A kernel is the function the support vector machine (SVM) uses when it is trained to classify data; the linear kernel was the one introduced in this lesson. There are more types of kernels; some examples are: polynomial, Gaussian, Gaussian radial basis function (RBF), and hyperbolic tangent kernels.
Using Scikit-Learn, you can easily use any of the above-mentioned kernels without having to get into the mathematical details.

- kernel: the kernel used to implement the kernel trick, as previously discussed. The kernel argument can take multiple values, such as:
    - 'linear': for a simple linear equation kernel.
    - 'poly' / 'rbf': polynomial and radial basis function equations, which are useful for creating non-linear hyperplanes.

**Parameters**

- gamma: the kernel coefficient, mainly used with non-linear hyperplanes. It controls how far the influence of a single training example reaches. The higher the gamma, the more closely the support vector machine (SVM) fits individual training points in trying to achieve the lowest misclassification error, so it should be tuned to avoid overfitting.
- C: penalty parameter of the error term. It controls how hard or soft the margin is, that is, the tradeoff between a smooth decision boundary and classifying the training points correctly.
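
A small experiment can make the effect of gamma visible. The sketch below is an illustration only: the gamma values are arbitrary, and it simply records training and test accuracy for an RBF-kernel SVC on the Iris data so the two can be compared.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

results = {}
for gamma in (0.01, 1, 100):  # arbitrary values spanning small to large
    classifier = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_train, y_train)
    results[gamma] = (
        classifier.score(X_train, y_train),  # training accuracy
        classifier.score(X_test, y_test),    # test accuracy
    )

for gamma, (train_acc, test_acc) in results.items():
    print(gamma, train_acc, test_acc)
```

A large gap between training and test accuracy at a high gamma is a sign of overfitting; the same loop can be repeated over values of C.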


### SVM Using Python

Now, you will see how to write Python code that uses the Scikit-Learn implementation of the support vector machine (SVM) algorithm to classify data.



You are going to use support vector machine (SVM) to train the classifier, and since you are going to perform a classification task, you will use the support vector classifier class.
You will use the Iris data set, so you will import the datasets from Sklearn.
Also, you need to evaluate the model, so you will evaluate using the confusion matrix introduced in previous lessons.

In [30]:
from sklearn.svm import SVC

from sklearn import datasets

from sklearn import model_selection

from sklearn.metrics import confusion_matrix




Now, load the data and split it into training and testing data using the sklearn.model_selection.train_test_split() function.

In [31]:
iris_data = datasets.load_iris()
x = iris_data.data
y = iris_data.target
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y,test_size=0.2)

The next step is the most important one.
The data for this example is simple and close to linearly separable, so you will use the linear kernel.
Also, if you cannot visualize your data (more than 3 dimensions), always try the linear kernel first.


In [32]:
classifier = SVC(kernel='linear')
classifier.fit(x_train, y_train)

SVC(kernel='linear')

Now, you can predict for the test data and see the confusion matrix.

In [33]:
y_pred = classifier.predict(x_test)
conf = confusion_matrix(y_test, y_pred)
print(conf)

[[ 6  0  0]
 [ 0 13  0]
 [ 0  0 11]]
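
As a sketch of how to read this matrix, the block below copies the printed values into a NumPy array and derives accuracy, per-class precision, and per-class recall from it. It assumes Scikit-Learn's convention that rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
conf = np.array([[6, 0, 0],
                 [0, 13, 0],
                 [0, 0, 11]])

correct = np.trace(conf)  # diagonal entries are the correct predictions
accuracy = correct / conf.sum()

# Per-class precision: diagonal divided by column sums (all predictions of that class)
precision_per_class = np.diag(conf) / conf.sum(axis=0)
# Per-class recall: diagonal divided by row sums (all true members of that class)
recall_per_class = np.diag(conf) / conf.sum(axis=1)

print(accuracy, precision_per_class, recall_per_class)
```

Because every off-diagonal entry is zero here, all 30 test examples were classified correctly.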


### Mission: Classifying Iris using SVM

Mission Instructions:

In this mission, you will work with the Iris dataset and will practice how to extract and classify data using the support vector classifier. You will then evaluate the developed classifier.

To complete this mission, perform the following task in the provided editor:

**Try classifying the Iris dataset using support vector machines!**

To complete this mission, your code should perform the following tasks in the provided editor:

Train a support vector classifier (SVC) model using X_train and y_train with the following parameters:

- random_state=0 (for consistent output);
- C = 1.0;
- linear kernel; and
- auto gamma.

Score your model using X_test and y_test.

Your code should return the score of the model using the test dataset.

Revisit the Iris dataset here.

The score is the precision of the classifier on the test data. A higher score means your classifier has classified the data more precisely. To pass this mission, return the score that shows the precision of your classifier under the given parameters.

In [42]:
def main():

    from sklearn.datasets import load_iris
    X, y = load_iris(return_X_y=True)

    from sklearn.model_selection import train_test_split
    X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    from sklearn.svm import SVC

    classifier = SVC(random_state=0, C = 1, kernel='linear', gamma='auto')

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

    from sklearn.metrics import precision_score

    score = precision_score(y_test, y_pred, average='weighted')

    print('precision score: ' + str(score))

    return score

main()

precision score: 0.98125


0.98125

### Mission: Classifying Mushrooms using SVM

Mission Instructions:

In this mission, you will work with the Mushrooms dataset and will practice how to extract and classify data using the support vector classifier. You will then evaluate the developed classifier.

To complete this mission, perform the following task in the provided editor:

Step 1    

Try classifying the Mushrooms dataset using support vector machines!

To complete this mission, your code should perform the following tasks:

Train a support vector classifier (SVC) model using X_train and y_train with the following parameters:

- random_state=0 (for consistent output);
- C = 0.2;
- RBF kernel; and
- auto gamma.

Score your model using X_test and y_test.

Your code should return the score of the model using the test dataset.

Revisit the Mushrooms dataset here.

In [44]:

def main():

    import pandas as pd
    dataset = pd.read_csv('mushrooms.csv')

    y = dataset.iloc[:, 0].values
    selected_X = dataset.iloc[:, 1:3].values

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    encoded_y = LabelEncoder().fit_transform(y)
    encoded_X = OneHotEncoder().fit_transform(selected_X).toarray()

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(encoded_X, encoded_y, test_size=0.33, random_state=0)

    # new code

    from sklearn.svm import SVC 

    classifier = SVC(random_state=0, C = 0.2, kernel='rbf', gamma='auto')

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

    from sklearn.metrics import precision_score

    score = precision_score(y_test, y_pred, average='weighted')

    print('precision score: ' + str(score))

    return score

main()

precision score: 0.6381153467744507


0.6381153467744507

## Decision tree

A decision tree is simply a tree structure built on nodes representing input attributes, branches representing decisions, and leaves representing the final label. Decision trees can be used for both classification and regression problems. Regression is going to be discussed thoroughly in the next module.

Each split is based on a specific attribute. The algorithm chooses the splits that most increase the certainty of the output decision.

**Decision Tree Algorithm**
Imagine you have n-points in your dataset.

Step 1

Select the root of the tree for splitting the data.
Note:
Ways of choosing the best split will be discussed in the next section.

Step 2

After the initial split, start splitting each node, using the split measures, until the stopping condition.

Step 3

Stopping conditions:
- No more splits are possible.
- No more instances remain.
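
One common split measure is information gain, which is what the 'entropy' criterion in Scikit-Learn's decision trees is based on. The sketch below is a minimal, assumed implementation for illustration: the function names `entropy` and `information_gain` are hypothetical, and the labels are made up.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-(probabilities * np.log2(probabilities)).sum())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children


# Made-up labels: a split that separates the two classes perfectly
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
perfect_gain = information_gain(parent, parent[:4], parent[4:])

# A split that leaves both children as mixed as the parent
mixed_gain = information_gain(parent, parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]])

print(perfect_gain, mixed_gain)
```

A split with higher information gain leaves purer child nodes, so at each node the algorithm would prefer the first split over the second.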



### Pruning

The performance of a tree can be further increased by pruning. It involves removing the branches that make use of features having low importance. This way, you reduce the complexity of the tree and the possibility of overfitting the data (a model that has high training accuracy and low testing accuracy), and thus increase its predictive power.
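
Scikit-Learn does not prune by feature importance directly. As a stand-in, the sketch below uses its cost-complexity pruning option, the `ccp_alpha` parameter of `DecisionTreeClassifier`, which is a different pruning technique with the same goal of a smaller tree. The value 0.02 is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Fully grown tree
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree; ccp_alpha=0.02 is an arbitrary illustrative value
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_train, y_train)

print(full_tree.tree_.node_count, full_tree.score(X_test, y_test))
print(pruned_tree.tree_.node_count, pruned_tree.score(X_test, y_test))
```

Comparing node counts and test scores side by side shows how much complexity was removed and what it cost, or gained, in accuracy.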

### Random forest

A random forest is an ensemble technique that trains a large number of individual decision trees, lets each tree make its own prediction, and then combines the outputs of all the decision trees, for example by averaging them.

Not all learners are correct all the time. However, using ensemble techniques helps in cancelling individual errors, since the direction of the pack of learners usually compensates for small individual errors. These learners either use a subset of the features or a subset of the data samples.

For a random forest to perform well:
- Unique patterns should occur in the data
- Learner predictions should be uncorrelated
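
The benefit of uncorrelated learners can be seen in a small simulation. This is a sketch under strong assumptions: a binary problem, learners whose errors are fully independent, and an arbitrary 60% accuracy per learner. Real trees in a forest are only partly uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

n_learners = 101     # odd, so a majority always exists
n_samples = 10_000
p_correct = 0.6      # each simulated learner is right 60% of the time

# True where a learner is correct; learners err independently of each other
correct = rng.random((n_samples, n_learners)) < p_correct

single_accuracy = correct[:, 0].mean()
# The majority vote is correct when more than half of the learners are correct
ensemble_accuracy = (correct.sum(axis=1) > n_learners / 2).mean()

print(single_accuracy, ensemble_accuracy)
```

Under these assumptions the majority vote is far more accurate than any single learner, because independent errors rarely line up on the same example.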

### Ensemble techniques

Ensemble techniques let a group of weak learners work on the data by producing several training datasets at the training stage. Different datasets are produced using simple random sampling with replacement over the whole dataset.

**Bagging:** An ensemble technique where any element has the same probability of appearing in a new data set.

**Boosting:** An ensemble technique where elements are weighted, which means some of the data samples will take part more often in the datasets created for the weak learners.
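
The two sampling schemes can be sketched with NumPy. This only illustrates the sampling step described above, with made-up weights; actual boosting algorithms update the weights iteratively based on the errors of earlier learners, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
indices = np.arange(n_samples)

# Bagging-style resample: every element has the same chance on every draw
bootstrap = rng.choice(indices, size=n_samples, replace=True)
unique_fraction = np.unique(bootstrap).size / n_samples

# Boosting-style resample: the first 100 elements get 5x the weight (made-up weights)
weights = np.ones(n_samples)
weights[:100] = 5.0
weights /= weights.sum()
weighted = rng.choice(indices, size=n_samples, replace=True, p=weights)
upweighted_share = (weighted < 100).mean()

print(unique_fraction, upweighted_share)
```

In the uniform resample only part of the original elements appear, some more than once, while in the weighted resample the up-weighted elements make up a much larger share than their 10% of the data.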


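```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier  # random forests live in sklearn.ensemble

# X_train and y_train come from an earlier train_test_split call
model = DecisionTreeClassifier(criterion='gini', splitter='best')
model.fit(X_train, y_train)
```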

### Mission: Decision trees & Iris

Mission Instructions:

In this mission, you will work with the Iris dataset and will practice how to extract and classify the data in the Iris dataset using the decision tree classifier. You will then evaluate the developed classifier.

To complete this mission, perform the following task in the provided editor:

Step 1    

Try classifying the Iris dataset using decision trees!

To complete this mission, your code should perform the following tasks:

Train a decision tree classifier model using X_train and y_train with the following parameters:

- random_state=0 (for consistent output); and
- information gain criterion.

Score your model using X_test and y_test.

Your code should return the score of the model using the test dataset.

Revisit the Iris dataset here.


In [47]:
def main():

    from sklearn.datasets import load_iris
    X, y = load_iris(return_X_y=True )

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    from sklearn.tree import DecisionTreeClassifier

    classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

    from sklearn.metrics import recall_score

    score = recall_score(y_test, y_pred, average='weighted')

    print('recall score: ' + str(score))

    return score

main()

recall score: 0.96


0.96

### Mission: Mushrooms and Random Forest

Mission Instructions:

In this mission, you will work with the Mushrooms dataset and will practice how to extract and classify the data in the Mushrooms dataset using the random forest classifier. You will then evaluate the developed classifier.

To complete this mission, perform the following task in the provided editor:  

Try classifying the Mushrooms dataset using random forests!

To complete this, your code should perform the following tasks:

Train a random forest classifier model using X_train and y_train with the following parameters:

- random_state=0 (for consistent output);
- Gini impurity criterion; and
- 100 estimators.

Score your model using X_test and y_test.


In [49]:
def main():

    import pandas as pd
    dataset = pd.read_csv('mushrooms.csv')

    y = dataset.iloc[:, 0].values
    selected_X = dataset.iloc[:, 1:3].values

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    encoded_y = LabelEncoder().fit_transform(y)
    encoded_X = OneHotEncoder().fit_transform(selected_X).toarray()

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(encoded_X, encoded_y, test_size=0.33, random_state=0)

    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=0)

    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)

    from sklearn.metrics import recall_score

    score = recall_score(y_test, y_pred)

    print('recall score: ' + str(score))

    return score

main()

recall score: 0.6180392156862745


0.6180392156862745