In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Lecture 7 - Classification, Generalization, Overfitting, Evaluation, Cross-validation and Scikit-learn

---

### Content

1. Classification
2. Generalization
3. Overfitting
4. Classifiers goodness-of-fit
5. Cross-validation
6. Comparing multiple classifiers
7. Scikit-learn
8. No Free Lunch Theorem
9. Feature Selection


### Learning Outcomes

At the end of this lecture, you should be able to:

* explain the difference between classification and regression
* explain the theory of generalization, the phenomenon of overfitting, and the 'no free lunch' theorem
* discuss and apply various measures for evaluating classifier accuracy
* use cross-validation for training and evaluating classifiers
* compare the accuracies of multiple classification algorithms across multiple datasets
* explain the rationale of feature selection and apply it 
---

# Classification

In machine learning, classification is the task of devising a *classifier* capable of assigning a particular class/category, from a set of possible classes, to an unknown instance/sample. A machine learning algorithm builds, evaluates and optimises a classifier on the training data in such a way that it discriminates instances of different classes from each other, in the hope that these patterns will *generalise* and be valid on data which the algorithm has not *seen* during the training phase.

Classification (like regression) belongs to the family of **supervised learning** methods. In order to be able to perform supervised learning, a training set is required where all of the samples' **class values are known in advance**. The classifier is **trained** to learn how to map each of the samples' features/attributes to their corresponding class labels.

Once the classifier is trained to do this on a fully labelled dataset, it is then used to assign class labels to unknown samples given only the samples' feature vectors. If there are two classes used in the prediction, then this is referred to as a **binary classification problem**. If there are more than two classes in the dataset, this is then called a **multiclass classification problem**.

As the number of classes in a prediction problem increases, so does the difficulty in maintaining high accuracy.

A classifier can be fixed-size, irrespective of the amount of data provided for training. These algorithms are called parametric. Classifiers can also be variable in size, and thus grow in complexity with the amount of available data, allowing them to capture and encode complex decision boundaries. These algorithms are referred to as non-parametric.

Classification is a vital and widely used technique. Examples include classifying whether a given email is "spam" or "non-spam"; banks use it to classify a given transaction as "legitimate" or "fraudulent"; medical staff use it to assign a diagnosis to a patient based on the patient's observed characteristics; financial analysts use it to predict whether a given stock should be classified as "invest now" or "do not invest" at a given point in time, given a range of accompanying indicators.

In order to perform classification, an algorithm is first needed that creates a classifier. There are many types of machine learning classification algorithms. 

We have looked at kNN and seen how it can be used as a classification algorithm. Some of the other well known ones are Neural Networks, Support Vector Machines, Tree classifiers, Boosted Ensembles, Bagging, Random Forests as well as Naive Bayes.

A good summary, in the form of a mind map, of the various sorts of algorithms that exist can be seen here:

![Jason Brownlee http://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/ ](../figures/algorithms.jpg)

> Source: Jason Brownlee http://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/

# Parametric and non-parametric classification algorithms in machine learning

Machine learning algorithms can be broadly categorised into parametric and non-parametric algorithms.  

>    A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. — Artificial Intelligence: A Modern Approach

**Parametric algorithms** make strong assumptions about the data and are fixed in their complexity/size. Algorithms of this type are:  Logistic Regression,  Perceptron, Neural Networks and Naive Bayes.

There are some advantages and disadvantages associated with these types of algorithms. They tend to be simpler due to few parameters and can be more interpretable. They are also faster to train on average and may require smaller datasets. Their weakness is their inflexibility and the highly constrained form, which, if unsuitable for the particular problem, will generate a poor fit. Often, these kinds of algorithms are more suited to less complex domains.


**Non-parametric machine learning algorithms** do not make strong assumptions about the form of the mapping function that maps features to class labels. Because of this, they have more flexibility in learning the decision boundaries. These kinds of algorithms are good when you do not have much prior knowledge about the data and have access to reasonably large datasets.

>Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features. — Artificial Intelligence: A Modern Approach

Some examples of non-parametric machine learning algorithms are: k-Nearest Neighbors, Boosting, Decision Trees, Random Forest, SVM. 

The benefits of these algorithms are their flexibility and the absence of strong assumptions needed for training. They also tend to achieve stronger predictive performance on complex problems. The downside is that they require more data and more runtime to train. Their biggest disadvantage is their propensity to overfit.
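The fixed-size versus variable-size distinction can be made concrete with a minimal sketch. It assumes synthetic data from scikit-learn's `make_classification` (not the Wine data), and compares a logistic regression, whose parameter count stays the same, against kNN, which stores every training sample. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

sizes = {}
for n_samples in (100, 1000):
    # Synthetic two-class data with 5 features (illustrative only)
    X_demo, y_demo = make_classification(n_samples=n_samples, n_features=5, random_state=0)

    logreg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

    # Parametric: one weight per feature plus an intercept, whatever n_samples is
    n_params = logreg.coef_.size + logreg.intercept_.size
    # Non-parametric: kNN keeps every training sample in order to answer queries
    n_stored = knn.n_samples_fit_
    sizes[n_samples] = (n_params, n_stored)
    print(n_samples, 'samples ->', n_params, 'logistic regression parameters,',
          n_stored, 'samples stored by kNN')
```

The logistic regression model is the same size after seeing ten times more data, while the kNN model grows with the training set.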


### Example dataset - Wine

We will return to the Wine dataset to explore classification.

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mtpl
import seaborn as sns


%matplotlib inline

df = pd.read_csv(
    '../datasets/wine_data.csv',
     usecols=[0,6,7]
    )

df.columns=['Class','Magnesium','Flavanoids']

# relabel class 3 as class 0 (handles the label being stored as an int or as a string)
df['Class'] = df['Class'].replace({3: 0, '3': 0})
df.to_csv('../datasets/wine_data_test.csv', header=None, index=None)

df.head(5)

Confirm we have 3 class labels:

In [None]:
df.Class.unique()


Confirm data types:

In [None]:
df.dtypes


Get counts for each class:

In [None]:
df.groupby('Class').count()

In [None]:
df.groupby('Class').count() / df.groupby('Class').count()['Magnesium'].sum()

In classification problems, the **ability to separate classes** from one another is the most important consideration. Histograms of the feature values per class can be a useful tool for **eyeballing** some features and getting a rough feeling for their **discriminative power**.

Here we are visualising the histograms of the two features for each of the three classes:

In [None]:
from matplotlib import pyplot as plt
plt.figure(figsize=(10,8))

colors = ('blue', 'red', 'green')

for label,color in zip(range(0,3), colors):
    mean = np.mean(df['Magnesium'][df['Class'] == label]) # class sample mean
    stdev = np.std(df['Magnesium'][df['Class'] == label]) # class standard deviation
    df['Magnesium'][df['Class'] == label].hist(alpha=0.3, # opacity level
             label=r'class {} ($\mu={:.2f}$, $\sigma={:.2f}$)'.format(label, mean, stdev), # raw string keeps the LaTeX backslashes intact
             color=color,
             bins=15)

plt.title('Wine data set - Distribution of Magnesium content')
plt.xlabel('Magnesium content', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.legend(loc='upper right')

plt.show()

In [None]:

plt.figure(figsize=(10,8))

colors = ('blue', 'red', 'green')

for label,color in zip(range(0,3), colors):
    mean = np.mean(df['Flavanoids'][df['Class'] == label]) # class sample mean
    stdev = np.std(df['Flavanoids'][df['Class'] == label]) # class standard deviation
    df['Flavanoids'][df['Class'] == label].hist(alpha=0.3, # opacity level
             label=r'class {} ($\mu={:.2f}$, $\sigma={:.2f}$)'.format(label, mean, stdev), # raw string keeps the LaTeX backslashes intact
             color=color,
             bins=15)

plt.title('Wine data set - Distribution of Flavanoids content')
plt.xlabel('Flavanoids content', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.legend(loc='upper right')

plt.show()

Let's see classification on this dataset in action.

We will use the kNN classifier.

In [None]:
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn import preprocessing

We will first normalise all the features using standardisation with the built in method provided by scikit-learn: 

In [None]:
std_scale = preprocessing.StandardScaler().fit(df[['Flavanoids', 'Magnesium']].dropna())
X = std_scale.transform(df[['Flavanoids', 'Magnesium']].dropna())
X

The above gives us our X values that are our features.
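As a small aside, standardisation itself is just a per-feature z-score. The sketch below uses a hypothetical four-row feature matrix (not the Wine data) to check that subtracting each column's mean and dividing by its standard deviation gives the same result as `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two features on very different scales
raw = np.array([[100.0, 1.0],
                [120.0, 3.0],
                [ 90.0, 2.0],
                [110.0, 2.0]])

# Manual z-score: subtract each column's mean, divide by its standard deviation
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(raw)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both features have zero mean and unit standard deviation, so neither dominates the Euclidean distance used by kNN.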

Next we extract the corresponding class value for each row of X:

In [None]:
y = df.Class
y

We can now train our classifier on our training dataset:

In [None]:
knn_classifier = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_classifier.fit(X, y)

Having trained our classifier, we now need to see how well it has trained. 

For this we use the same dataset that we trained the classifier on. This dataset is called the training dataset. 

We ask the classifier to predict the class label for each of the x values in our training dataset.

In [None]:
classification_results = knn_classifier.predict(X)
classification_results

The above shows all the classifications/predictions that the classifier makes for each row of X.

We can put this into the original dataframe for later analysis:

In [None]:
df['Classification'] = classification_results
df.head(30)

We can calculate the overall accuracy of the classifier on the training dataset. This is one way of summarising the effectiveness of our training and is an important step in evaluating a classifier before moving on to the next steps.

In [None]:
knn_classifier.score(X, y)

Valid questions now are:

1. What does the classification accuracy of a classifier on a training dataset tell us about how accurate it will be on data which is outside of the training dataset?

2. How can we be confident that our classifier will perform to similar levels of accuracy on data it has not been trained on?

3. Will the classifier generalize beyond the training dataset?

# Generalization

The goal of training a classifier, or creating a predictive model on a particular dataset is not so that it would make accurate predictions only on these samples. To use an absurd scenario, we could potentially create a look-up table, or an immense set of if-else statements for every example in a dataset and thereby achieve 100% accuracy. This process would be called memorising the data and is practically useless for the task of predictive analytics.
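The look-up table scenario can be sketched in a few lines. The samples and labels are hypothetical, and `memoriser_predict` is a name made up for this illustration.

```python
# Toy training data: feature tuples mapped to class labels (hypothetical values)
train_samples = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.9), (5.9, 3.0)]
train_labels = ['a', 'a', 'b', 'b']

# "Training" is just memorising every sample
lookup = dict(zip(train_samples, train_labels))

def memoriser_predict(sample):
    # Returns the stored label, or None when the sample was never seen
    return lookup.get(sample)

train_accuracy = sum(
    memoriser_predict(s) == label for s, label in zip(train_samples, train_labels)
) / len(train_samples)

# Close to a known sample, but not identical
unseen_prediction = memoriser_predict((5.0, 3.4))

print('training accuracy:', train_accuracy)
print('prediction for an unseen sample:', unseen_prediction)
```

The memoriser is perfect on the data it has stored and has nothing to say about anything else.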

Instead, the **goal is that the classifier would uncover and encode patterns describing the underlying structural relationships, with the intention of them generalising and thus being able to accurately predict data that the algorithm has not previously "seen"**. This is referred to as  generalization. In essence, machine learning is the attempt to take the limited amount of information it can gather and generalise. It embodies the movement from the 'specific to the general'.

Generalization is the property of a classifier or modelling process whereby the classifier is relevant for prediction purposes on data that were not used to train it. We must keep in mind that every dataset is a finite sample of a total population for a given domain (unless you have a dataset that contains all samples that will ever be used to describe a given problem), and we want the classifier to be fit-for-purpose on the population of a given domain as a whole.

Hence, **creating a classifier that is perfectly accurate on one training dataset is no guarantee of it being accurate at predicting unseen samples**. In fact, machine learning algorithms are notoriously susceptible to finding meaningless, or phantom, patterns that are random idiosyncrasies within a given dataset sample and have no value outside of the training dataset.
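A minimal sketch of such phantom patterns, assuming purely random data: the features and labels below are generated independently, so there is no real relationship to learn, yet an unrestricted decision tree still fits the training set perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Features and labels are generated independently, so there is nothing real to learn
X_noise_train = rng.normal(size=(200, 5))
y_noise_train = rng.integers(0, 2, size=200)
X_noise_test = rng.normal(size=(1000, 5))
y_noise_test = rng.integers(0, 2, size=1000)

# An unrestricted tree keeps splitting until every training sample is classified correctly
tree = DecisionTreeClassifier(random_state=0).fit(X_noise_train, y_noise_train)

train_acc = tree.score(X_noise_train, y_noise_train)
test_acc = tree.score(X_noise_test, y_noise_test)
print('training accuracy:', train_acc)
print('test accuracy:', test_acc)
```

The training accuracy is perfect, while the test accuracy stays close to what coin-flipping would achieve.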

Sometimes we may have concerns that the training data were not representative of the true population, which leads to bad generalization, and this does happen often. The data can be noisy (mislabelled or erroneous data) or have lots of outliers (valid but extreme values). But often it is the case that the algorithm used to train a classifier or build a model created too good a "fit" (too complex) on the training dataset, one that is ultimately useless and misleading beyond the samples it was trained on.

Training classifiers which are variable-size (optimisable), and can thus increase in complexity, often leads to the pattern below. Both the training and generalisation (test) error decrease initially during the early stages of classifier training. If the training ceases too early, then the classifier has usually not been given enough time to learn the separating class decision boundary sufficiently. This is referred to as underfitting. As the training algorithm builds a more complex classifier (model), the training error will continue to decrease. As the training error approaches zero, the training algorithm is said to 'converge'. However, complete convergence does not necessarily mean that the classifier will generalise well. It is usually quite easy to achieve full convergence on a training set. The difference between the final training error (convergence) and the generalisation (test) error is the indicator of the degree of overfitting that has occurred.

Generalisation error is almost always greater than the training error. We want this difference to be as small as possible.

![Source Wikipedia](../figures/generalization_overfitting.jpg)

Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc

# Overfitting

**Overfitting is usually what happens when classifier generalization is not achieved**. It is the tendency of machine learning and data mining algorithms to tailor classifiers to the training data, at the expense of generalization to previously unseen data points. 

It should be noted that inherent within all machine learning and data mining algorithms is the tendency to overfit to some degree. Some algorithms will accentuate this tendency more than others and it will often depend on the dataset and the type of the problem.

If we try hard enough and push the classifiers to become more and more complex, the algorithms will invariably find patterns in a dataset. The Nobel Laureate Ronald Coase once said

> “If you torture the data long enough, it will confess.”

The problem of overfitting is probably one of the greatest challenges for a data scientist.

> The answer is not to use a data mining procedure
that doesn’t overfit because all of them do. Nor is the answer to simply use models
that produce less overfitting, because there is a fundamental trade-off between model
complexity and the possibility of overfitting. Sometimes we may simply want more
complex models, because they will better capture the real complexities of the application
and thereby be more accurate. There is no single choice or procedure that will eliminate
overfitting. The best strategy is to recognize overfitting and to manage complexity in a
principled way.

Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.

In [None]:
from time import sleep
from datetime import datetime
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
from pylab import rcParams

rcParams['figure.figsize'] = 18, 10
rcParams['font.size'] = 20
rcParams['figure.dpi'] = 350
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['patch.edgecolor'] = 'white'
rcParams['font.family'] = 'StixGeneral'

wine_df = pd.read_csv(
    '../datasets/wine_data.csv'
    )

X = wine_df.iloc[:, 1:]
y = wine_df.iloc[:, 0]  # class labels as a 1-D Series, which is what scikit-learn expects

treeclf = DecisionTreeClassifier(max_depth=1, random_state=1)


In [None]:
training_error = pd.Series([1.0])
test_error = pd.Series([1.0])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2, test_size=0.5)
for i in range(1,13):
    # fit once per depth, then score the same fitted tree on both sets
    treeclf = DecisionTreeClassifier(max_depth=i, random_state=i).fit(X_train, y_train)
    training_error[i] = 1.0 - treeclf.score(X_train, y_train)
    test_error[i] = 1.0 - treeclf.score(X_test, y_test)

# plot once, after all depths have been evaluated
plt.plot(training_error, label='Training error')
plt.plot(test_error, label='Test error')
plt.title('Generalisation, Convergence and Overfitting Example on Wine Dataset', size=30)
plt.xlabel('Classifier Complexity (Decision Tree Depth)', size=25)
plt.ylabel('Error Rate / Misclassification Rate', size=25)
plt.legend()
    

## Bias-Variance Tradeoff 

![https://www.kdnuggets.com/2016/08/bias-variance-tradeoff-overview.html ](https://www.kdnuggets.com/wp-content/uploads/bias-and-variance.jpg) 

# Classifier Evaluation

In order to ascertain whether our classifier is learning the training dataset (converging) and whether it is overfitting, that is, performing poorly on previously unseen data, we need to have some evaluation metrics at our disposal.

Fortunately, the field of machine learning and data mining has been around for a long time, so there are some very well established metrics that enable us to evaluate our classifiers.

Multiclass classification problems are treated similarly, however, there is usually a greater need in multiclass problems to break down the classification results at per-class label level in order to examine the performance. This is often necessary in the presence of unbalanced class distributions. 

A binary class **confusion-matrix** is a good starting point to visualize and understand the performance of a classifier. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).


![Source Wikipedia](../figures/confusion_matrix.jpg)

Source: http://en.wikipedia.org/wiki/Confusion_matrix



![Source Wikipedia](../figures/classification_terms.jpg)

Source: http://en.wikipedia.org/wiki/Confusion_matrix
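The four cells of a binary confusion matrix can be counted directly. The sketch below uses hypothetical actual and predicted labels (1 = positive, 0 = negative) and follows the convention stated above: rows are actual classes, columns are predicted classes.

```python
# Hypothetical actual and predicted labels for a binary problem (1 = positive, 0 = negative)
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # positives correctly found
fn = sum(a == 1 and p == 0 for a, p in pairs)  # positives missed
fp = sum(a == 0 and p == 1 for a, p in pairs)  # negatives wrongly flagged
tn = sum(a == 0 and p == 0 for a, p in pairs)  # negatives correctly rejected

# Rows are actual classes, columns are predicted classes
print('             pred 1  pred 0')
print('actual 1     {:>6}  {:>6}'.format(tp, fn))
print('actual 0     {:>6}  {:>6}'.format(fp, tn))
```

Every metric in the sections that follow is a ratio built from these four counts.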

## Accuracy

Accuracy is the most commonly used metric. It is simplistic but useful as a first step. It is the number of correctly classified samples divided by the number of all samples in the dataset.

**Exercise:** Given the Wine dataframe from above, write a function that accepts it and returns the overall accuracy for the dataset:

In [None]:
def calculate_accuracy(df):
    #your code here

    return  

In [None]:
calculate_accuracy(df)

The complement of accuracy is the *error rate*, which is reported quite often in the literature. It is simply 1 - accuracy.

## Recall

Recall evaluates how many of the samples that belong to a particular class are actually classified as belonging to that class.
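As a formula, treating class $i$ as the positive class and all other classes as negative (the symbols $TP_i$ and $FN_i$ are notation assumed here for the true positives and false negatives of class $i$):

$$Recall_i = \frac{TP_i}{TP_i + FN_i}$$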

**Exercise:** Given the Wine dataframe from above, write a function that accepts it and returns the recall (aka hit rate) for each class:

In [None]:
def calculate_recall_per_class(df):
    recall = {}
    #your code here

        
    return recall

In [None]:
calculate_recall_per_class(df)
#should return something like:
#{1: 0.9322033898305084, 2: 0.6619718309859155, 3: 0.875}

## False Positive Rate

The false positive rate is the proportion of samples that are incorrectly classified as belonging to a particular class, out of all samples that do not belong to that class.
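As a formula, again treating class $i$ as the positive class (the symbols $FP_i$ and $TN_i$ are notation assumed here for the false positives and true negatives of class $i$):

$$FPR_i = \frac{FP_i}{FP_i + TN_i}$$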

**Exercise:** Given the Wine dataframe from above, write a function that accepts it and returns the false positive rate for each class:

In [None]:
def calculate_false_positive_rate_per_class(df):
    fp_rate = {}
    #your code here

        
    return fp_rate

In [None]:
calculate_false_positive_rate_per_class(df)
#should return something like:
#{1: 0.13445378151260504, 2: 0.09345794392523364, 3: 0.06153846153846154}

## Precision

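Precision is the proportion of samples classified as belonging to a particular class that actually belong to that class. It takes into account the false positive classifications together with the true positive classifications.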
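As a formula, treating class $i$ as the positive class (the symbols $TP_i$ and $FP_i$ are notation assumed here for the true positives and false positives of class $i$):

$$Precision_i = \frac{TP_i}{TP_i + FP_i}$$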

**Exercise:** Given the Wine dataframe from above, write a function that accepts it and returns the precision rate for each class:

In [None]:
def calculate_precision_per_class(df):
    precision = {}
    #your code here

        
    return precision

In [None]:
calculate_precision_per_class(df)
#should return something like:
#{1: 0.7746478873239436, 2: 0.8245614035087719, 3: 0.84}

## F1-score

Another important measure that is insightful with respect to the accuracy of each given class label is the F-value (known also as the F-measure or the F1-score).

The F1-score is particularly important for datasets with uneven class distributions. Training a classifier on such problems is very difficult, and all algorithms struggle to produce classifiers that accurately generalise on all class labels within such datasets. Usually the classes with the smallest number of samples experience the worst generalisation. The F1-score to some degree provides a more balanced assessment of a classifier's generalisation for each class label, as it is the harmonic mean of the recall and precision values. The harmonic mean is a useful type of average to use in this instance since, unlike the arithmetic mean, it is pulled towards the smaller of the two values, so a class scores well only if both its precision and its recall are high.
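The effect of the harmonic mean can be seen with two hypothetical precision/recall pairs. The helper name `f1_from` is made up for this sketch.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall pairs for illustration
balanced = f1_from(0.8, 0.8)
lopsided = f1_from(0.9, 0.1)
lopsided_arithmetic = (0.9 + 0.1) / 2

print('balanced F1:', round(balanced, 3))
print('lopsided F1:', round(lopsided, 3), 'vs arithmetic mean:', lopsided_arithmetic)
```

When precision and recall agree, the F1-score equals both; when one of them collapses, the F1-score collapses with it, whereas the arithmetic mean would still report a middling value.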


**Exercise:** Given the Wine dataframe from above, write a function that accepts it and returns the f1-score for each class:

In [None]:
def calculate_f1score_per_class(df):
    f1score = {}
    #your code here

        
    return f1score

In [None]:
calculate_f1score_per_class(df)
#should return something like:
#{1: 0.8461538461538461, 2: 0.734375, 3: 0.8571428571428571}

## Confusion Matrix

In a **multiclass scenario**, the binary confusion matrix seen above can easily be extended to an $n \times n$ matrix where $n$ is the number of classes.

In a multiclass setting it is particularly informative to know which classes are being misclassified as labels of another class. 
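As a sketch of how such a matrix is read, the example below takes a hypothetical 3-class confusion matrix (the counts are made up, with rows as actual classes and columns as predicted classes) and locates the most frequent confusion.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual class, columns = predicted class
class_names = ['class 0', 'class 1', 'class 2']
cm = np.array([[45,  3,  0],
               [ 6, 60,  5],
               [ 1,  9, 49]])

# Zero out the diagonal so only the misclassifications remain
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Locate the largest off-diagonal cell: the most frequent confusion
actual_idx, predicted_idx = np.unravel_index(errors.argmax(), errors.shape)
print('most common mistake: {} predicted as {} ({} samples)'.format(
    class_names[actual_idx], class_names[predicted_idx], errors[actual_idx, predicted_idx]))
```

The diagonal holds the correct classifications; everything off the diagonal tells you which class was mistaken for which.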


**Exercise:** Given the Wine dataframe from above, write a function that accepts it and returns a confusion matrix data frame:

In [None]:
def calculate_confusion_matrix(df):
    #your code here
            
    return 
    
    

In [None]:
calculate_confusion_matrix(df)


## Geometric Mean

Often, on class-imbalanced problems, it is necessary to derive a single value to express a classifier's accuracy on a dataset. In these cases the total error (or accuracy) metric is inadequate, and a good alternative is the geometric mean of the per-class recall values, defined as follows:


<div style="font-size: 120%;">  
$Geometric\ mean = \left(\prod_{i=1}^n Recall_i \right)^{1/n}$
</div>

The geometric mean is never larger than the arithmetic mean and is pulled strongly towards the lowest recall value, so a classifier that neglects one class is penalised. This makes it the more appropriate summary here.
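A minimal sketch of why accuracy misleads on imbalanced data, assuming a hypothetical 95/5 class split and a "classifier" that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced problem: 95 samples of class 0, 5 samples of class 1
y_true = np.array([0] * 95 + [1] * 5)

# A "classifier" that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
recalls = recall_score(y_true, y_majority, average=None)  # one recall value per class
g_mean = np.prod(recalls) ** (1.0 / len(recalls))

print('accuracy:', acc)
print('per-class recall:', recalls)
print('arithmetic mean of recalls:', recalls.mean())
print('geometric mean of recalls:', g_mean)
```

The accuracy looks excellent, but the geometric mean drops to zero because the minority class is never recognised.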

**Exercise:** Write a function that calculates the geometric mean for the data frame below:

In [None]:
df.head()

In [None]:
def calculate_geometric_mean(df):
    gmean = 1.0
    recall = calculate_recall_per_class(df)

    #YOUR CODE HERE 

   
    return np.power(gmean, 1.0 / float(len(recall)))
    

In [None]:
calculate_geometric_mean(df)
#SHOULD PRODUCE 0.8143030848180245


# Cross-Validation

Under generalization and overfitting, we discussed the importance of developing classifiers that do not just remember the training data but are able to generalize on data that the machine learning algorithms had not previously seen. One solution is to split datasets into training and test sets. The training is performed on one, and the generalization is determined on the other.

There are some problems with doing it this way. By defining two sets, we reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, test) sets. Also, what proportion of the data do we use for training and what proportion for testing?
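The dependence on the random split can be seen in a short sketch. To keep it self-contained it uses the copy of the Wine data bundled with scikit-learn (`load_wine`, all 13 features) rather than the CSV file used elsewhere in this lecture, and a kNN classifier with k=5.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn's bundled copy of the Wine data (used here so the sketch is self-contained)
X_wine, y_wine = load_wine(return_X_y=True)

split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X_wine, y_wine, test_size=0.2, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

# Same data, same algorithm, different random split: the test accuracy can change
print([round(s, 3) for s in split_scores])
```

A single number from a single split therefore says less about the classifier than it first appears to.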

Another, more robust approach is to split the whole dataset several consecutive times into different training set and test set combinations, and to return the average of the prediction scores obtained with the different sets. This technique is called **k-fold cross-validation**, where k determines the number of segments (folds) that the data is split into. In each round, 1 fold is retained for testing the classifier and the other k-1 folds are used for training.

In the image example below, k=4 (4 folds). The classifier will be trained on the combination of 3 folds and  evaluated on the 4th fold (the test fold). This procedure is then repeated 4 times until every fold has been used as the test set once so that we can eventually calculate the average error rate of our model from the error rate of every iteration, which gives us an idea of how well our model generalizes.


![Source Wikipedia](../figures/cross-validation-001_small.png)

Source: https://github.com/rasbt/pattern_classification/blob/master/machine_learning/supervised_intro/images/cross-validation-001_small.png
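The 4-fold procedure described above can be sketched with scikit-learn's `KFold` on 12 hypothetical sample indices. The sketch only shows how the indices are partitioned; no classifier is trained.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(12)  # 12 hypothetical sample indices
kfold = KFold(n_splits=4, shuffle=True, random_state=0)

# Track how many times each sample lands in a test fold
test_counts = np.zeros(len(samples), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples)):
    print('fold', fold, 'train:', train_idx, 'test:', test_idx)
    test_counts[test_idx] += 1

print('times each sample was used for testing:', test_counts)
```

Each sample is used for testing exactly once and for training in the remaining three rounds.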

Keep in mind that the correct labels for the test fold are available at all times, but the classifier does not see them when it makes predictions; they are used only afterwards, to compare against its predictions.

Given the serious problem of class-imbalanced datasets, the most robust way to implement the above procedure is **stratified k-fold cross-validation**. By stratified, we mean that every fold is not only (approximately) the same size as the other folds, but also has (approximately) the same distribution of class labels as the full dataset.
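The difference stratification makes can be sketched on hypothetical imbalanced labels, comparing scikit-learn's plain `KFold` (without shuffling) against `StratifiedKFold`. The helper name `minority_per_fold` is made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 40 samples of class 0 followed by 10 of class 1
labels = np.array([0] * 40 + [1] * 10)
features = np.zeros((len(labels), 1))  # placeholder features; only the labels matter here

def minority_per_fold(splitter):
    # Count class-1 samples in the test part of each fold
    return [int(labels[test_idx].sum()) for _, test_idx in splitter.split(features, labels)]

plain = minority_per_fold(KFold(n_splits=5))
stratified = minority_per_fold(StratifiedKFold(n_splits=5))

print('KFold           :', plain)
print('StratifiedKFold :', stratified)
```

With the unshuffled plain split, all minority samples end up in a single test fold; with stratification, every fold receives its share.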

Overall, the cross-validation approach can be computationally expensive, but it is worth the extra effort because it does not waste as much data as fixing an arbitrary test set does. This is an advantage in problems where the number of samples is very small. The question, once you have trained and tested all your k classifiers, is: which classifier do you deploy in your application?




**Exercise:** Write a function that takes a dataframe, number of folds, column index that contains class labels, the number of features, number of classes, as well as the file name. The function will then create k stratified folds for cross-validation and write each individual fold to file with an appropriate name. Output for each class and each fold how many samples are being written so that you can confirm that your function is working.

In [None]:
def cross_fold_file_generator(df, kfolds, class_index, num_of_features, num_of_classes, file_stem):
    file_name = ['fold_' + str(x) for x in range(kfolds)]
    
     #your code here
        

For our wine dataset, the function should produce something like the following:

In [None]:
cross_fold_file_generator(df, kfolds = 5, class_index = 0, num_of_features = len(df.columns) - 1, num_of_classes = len(df['Class'].unique()), file_stem = '../datasets/wine_xfold_')

#should generate something like:
#class:  1 fold:  0 samples:  12
#class:  1 fold:  1 samples:  12
#class:  1 fold:  2 samples:  12
#class:  1 fold:  3 samples:  12
#class:  1 fold:  4 samples:  11
#class:  2 fold:  0 samples:  14
#class:  2 fold:  1 samples:  14
#class:  2 fold:  2 samples:  14
#class:  2 fold:  3 samples:  15
#class:  2 fold:  4 samples:  14
#class:  0 fold:  0 samples:  10
#class:  0 fold:  1 samples:  10
#class:  0 fold:  2 samples:  9
#class:  0 fold:  3 samples:  10
#class:  0 fold:  4 samples:  9

## Using scikit-learn's evaluation metrics and dataset splitting functions

All that we have covered above is conveniently implemented for us in scikit-learn.

In [None]:
df.head()

We can train a kNN classifier again on all our training data as follows and use the model for prediction:

In [None]:
knn_classifier = knn_classifier.fit(df[['Magnesium','Flavanoids']], df['Class'])
knn_classifier


We can find out how well our classifier learned the training dataset based on overall accuracy:

In [None]:
knn_classifier.score(df[['Magnesium','Flavanoids']], df['Class'])


Or we can use the classifier to classify individual samples:

In [None]:
inputs = np.array([2.2, 3.3]) 
print('Predicted class: ', knn_classifier.predict(inputs.reshape(-1, 2))[0])


If we want to split the dataset into a training and a test set, with the test set comprising 20% of the dataset, we do as follows:

In [None]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['Magnesium','Flavanoids']], df['Class'], random_state=1, test_size=0.2)

print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)


We can now train the classifier on the training dataset and test it on the unseen dataset:

In [None]:
knn_classifier = knn_classifier.fit(X_train, y_train)
y_pred = knn_classifier.predict(X_test)
y_pred


In [None]:
knn_classifier.score(X_test, y_test)


We can generate a more comprehensive accuracy report that investigates the accuracy at the individual class level:

In [None]:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred))


Let's see what the confusion matrix looks like:

In [None]:
print(metrics.confusion_matrix(y_test, y_pred))


If we prefer to use stratified k-fold cross-validation, then:

In [None]:
from sklearn.model_selection import cross_val_score
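# cross-validate on the full dataset so that every sample is used for testing exactly once
scores = cross_val_score(knn_classifier, df[['Magnesium','Flavanoids']], df['Class'], cv=5, scoring='accuracy')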
scores


The code above uses 5 folds with accuracy as the evaluation metric, and returns the accuracy for each of the folds. When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds.

We can find the mean and standard deviation of all the results:

In [None]:
print('mean is: ', scores.mean())
print('STD is: ', scores.std())


### Check Mode (Majority-Class Baseline Accuracy)

In [None]:
# example of the majority class naive classifier in scikit-learn
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# define model
model = DummyClassifier(strategy='most_frequent')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
# calculate accuracy
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.3f' % accuracy)

### Check Random Guess Accuracy 

In [None]:
# define model
model = DummyClassifier(strategy='uniform')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
# calculate accuracy
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.3f' % accuracy)

### Check Random Guess Accuracy Based on Stratified Distribution 

In [None]:
# define model
model = DummyClassifier(strategy='stratified')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
# calculate accuracy
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.3f' % accuracy)


**Exercise:** Load the student grade dataset and use the kNN classifier to predict the class grade based on all the assignment results.

First train on all the data and test the accuracy.

Then use different numbers of folds to test how well your classifiers generalize.

Lastly, experiment with using different combinations of assignment features and observe if the accuracy increases/decreases as you omit some of them.

# Machine Learning - Evaluating Multiple Algorithms

Working in the field of machine learning inevitably requires the practitioner to **compare the generalizability of one (or more) algorithms against others** in order to determine which might be a better solution for a given problem.

When performing such comparisons, it is important to realize that the same algorithm with different settings (such as the value of k in kNN) counts as a different algorithm and should be treated as such in the comparisons.

The question is: how to best summarize a series of algorithms with different settings and their performances across multiple datasets? 

In such circumstances, the practitioner turns to **statistical techniques** to provide answers. In particular, the practitioner is expected to summarize every algorithm's accuracy in terms of **mean ranks** across all datasets. The mean ranks provide an informative summary of the **overall performance** of all algorithms, which can then be analysed further using **non-parametric** tests such as Friedman's test and a range of post-hoc tests.

In essence, Friedman's test evaluates whether the observed differences between the algorithms' mean ranks are larger than would be expected by chance if all algorithms performed equally well (the null hypothesis). Rejecting the null hypothesis then opens the door to post-hoc tests that detail which algorithms' mean ranks differ significantly from the others'.
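
As a minimal sketch of this procedure, the code below ranks three algorithms on six datasets and applies Friedman's test using `scipy.stats`. The accuracy values and algorithm names are hypothetical, made up purely for illustration; they are not results from any real experiment.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# HYPOTHETICAL accuracies: rows = datasets, columns = algorithms
algorithm_names = ['kNN (k=1)', 'kNN (k=5)', 'Naive Bayes']
acc = np.array([
    [0.81, 0.85, 0.79],
    [0.90, 0.92, 0.88],
    [0.70, 0.75, 0.72],
    [0.65, 0.69, 0.60],
    [0.88, 0.91, 0.86],
    [0.77, 0.80, 0.74],
])

# rank the algorithms within each dataset (rank 1 = highest accuracy)
ranks = np.vstack([rankdata(-row) for row in acc])
mean_ranks = ranks.mean(axis=0)
for name, mean_rank in zip(algorithm_names, mean_ranks):
    print(f'{name:12s} mean rank = {mean_rank:.2f}')

# Friedman's test: one argument per algorithm, measured over the same datasets
stat, p_value = friedmanchisquare(*acc.T)
print(f'Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}')
```

A small p-value suggests that at least one algorithm's mean rank differs from the others; which ones differ is a question for a post-hoc test.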

Interested readers are referred to Demšar's article which provides examples on how to conduct statistical comparisons between multiple classifiers' results on multiple datasets:

> Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1-30.

The image below shows an example of how to display a summary of algorithms' performances using the geometric mean and the mean ranks: 

![Source Teo Susnjak](../figures/ml_mean_rank_example.jpg)

# No Free Lunch

Every model/classifier is a simplified representation of reality. By definition, a simplification discards detail deemed irrelevant in order to emphasise the aspect of reality that is of interest for further study.

These simplifications are founded on assumptions that every machine learning algorithm embodies to varying degrees. These assumptions may hold in some situations, but not in others. The consequence is that a classifier that operates well in one situation may well fail in another. Making bold claims that a given machine learning algorithm is in general more accurate than another is therefore strongly frowned upon.

In 1997, Wolpert and Macready described the “No Free Lunch” theorem, which simply states that there is no one model/classifier that works best for every problem. The assumptions that make a classifier effective and accurate for one problem may not hold for a different one. Because of this, it is common in machine learning to try multiple models and find one that works best for a particular problem. This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model.

Therefore, depending on the problem domain and requirements, it is important to consider multiple factors before settling on a machine learning algorithm. One must assess the trade-offs between speed, accuracy, and complexity/interpretability of different machine learning algorithms and their classifiers, and select one that works best for that particular problem and set of requirements.
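
In practice, trying multiple models often looks like the sketch below, which cross-validates several candidate classifiers on the same data and prints them ordered by mean accuracy. The Wine data bundled with scikit-learn is used as a stand-in, and the candidate list and the names `X_demo`, `y_demo` and `cv_summary` are assumptions made for illustration. kNN appears twice with different values of k, since different settings count as different algorithms.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset (not the lecture's own data)
X_demo, y_demo = load_wine(return_X_y=True)

# candidate classifiers; the same algorithm with different settings is listed separately
candidates = {
    'kNN (k=1)': KNeighborsClassifier(n_neighbors=1),
    'kNN (k=15)': KNeighborsClassifier(n_neighbors=15),
    'Naive Bayes': GaussianNB(),
    'Tree (depth 3)': DecisionTreeClassifier(max_depth=3, random_state=1),
}

# 5-fold cross-validated accuracy for every candidate
cv_summary = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())

# print candidates from best to worst mean accuracy
for name, (mean, std) in sorted(cv_summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f'{name:15s} {mean:.3f} +/- {std:.3f}')
```

The ordering found on one dataset should not be expected to carry over to another, which is the practical message of the theorem.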

# Feature Selection $^1$

The main purpose of machine learning is to produce classifiers that generalise in their predictive accuracy beyond the datasets used to train them. To a large degree, **their final accuracy is dependent on the descriptive strength and quality of the features** that constitute the training dataset. 

**It is often tempting to simply provide a machine learning algorithm with as many features as are available for a given dataset. However, doing so has been consistently shown to be associated with negative outcomes.**

The inclusion of a large number of features in a training dataset presents **computational challenges** that mostly arise during the training phase and can be prohibitive for some algorithms, but can also be a **strain at detection time** for real-time systems processing high-volume data streams. Unnecessary and **redundant features** increase the search space for a machine learning algorithm. This in turn **dilutes the signal strength of a true pattern** and makes it more likely that, due to the presence of noisy and irrelevant features, a spurious pattern will be discovered instead.

In general, it is not known *a priori* which features are meaningful, and **finding the optimal feature subset has been proven to be an NP-complete problem**. Nonetheless, it is still imperative that feature selection algorithms be applied to a dataset as a pre-processing step before training classifiers, in order to reduce feature dimensionality.

**Not only are both the computational complexity and the generalisability improved by selecting the most concise subset, but the resulting model is more interpretable due to the fact that it is generated with the fewest possible number of parameters.** 

There are many algorithms and techniques available to perform feature selection. Some of the more commonly used are: Chi$^2$, Information Gain, PCA, LDA, Gain Ratio, Gini Index, SVM.

Feature selection techniques can generally be divided into two broad categories. **Filter methods** are univariate techniques which consider the relevance of a particular feature in isolation from the other features and rank the features according to a metric. These algorithms are computationally efficient since they do not involve the machine learning algorithm in their evaluation. However, they can be susceptible to selecting subsets of features that may not produce favourable results when combined with a chosen machine learning algorithm. These methods lack the ability to detect interactions among features as well as feature redundancy.

On the other hand, **wrapper methods** overcome some of these shortcomings. They explicitly use the chosen machine learning algorithm to select the feature subsets and tend to outperform filter methods in predictive accuracy. However, these techniques exhibit bias in favour of a specific machine learning algorithm, and since they are computationally more intensive, they are also frequently impractical on large datasets.
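
To make the wrapper idea concrete, here is a minimal sketch of greedy forward selection: at each step, the feature whose addition gives the chosen classifier the best cross-validated accuracy is kept. The Iris data, the kNN classifier, the target of two features and all variable names are assumptions chosen for illustration; this is one simple wrapper strategy, not the only one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_demo, y_demo = iris.data, iris.target
wrapped_model = KNeighborsClassifier(n_neighbors=5)  # the classifier that guides the search

selected = []                               # indices of features chosen so far
remaining = list(range(X_demo.shape[1]))    # indices still available
n_wanted = 2                                # illustrative target subset size

while len(selected) < n_wanted:
    # score every remaining feature when added to the current subset
    trial_scores = {
        j: cross_val_score(wrapped_model, X_demo[:, selected + [j]], y_demo, cv=5).mean()
        for j in remaining
    }
    best = max(trial_scores, key=trial_scores.get)
    selected.append(best)
    remaining.remove(best)

selected_names = [iris.feature_names[j] for j in selected]
print(selected_names)
```

Every candidate subset requires a full round of cross-validation, which illustrates why wrapper methods become expensive as the number of features grows.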

Hybrid filter-wrapper methods have been a subject of recent research due to their ability to exploit the strengths of both strategies. Hybrid approaches essentially allow any combination of filter and wrapper methods to be combined. Due to this, some novel and interesting hybrid approaches have recently been proposed, such as: using the union of feature-subset outputs from Information Gain, Gain Ratio, Gini Index and correlation filter methods as inputs to the wrapper Genetic Algorithm; hybridizing the Gravitational Search Algorithm with a Support Vector Machine; and using a Particle Swarm Optimisation-based multi-objective feature selection approach in combination with k-Nearest-Neighbour. Given their flexibility, hybrid approaches thus offer some degree of tuning of the trade-offs between accuracy and performance. Nonetheless, devising a feature selection algorithm that is both highly accurate and computationally efficient is still an open question.

For reasons of simplicity, we will consider only filter methods here, and more specifically we will look at Chi$^2$. Chi$^2$ performs a statistical test which ascertains whether a given feature and the class label are independent. A test result which indicates strong dependence between feature values and the associated class label means that the given feature possesses discriminative ability. The scores of all features are then ranked, resulting in an ordering of feature usefulness. The Chi$^2$ test does not take into account dependence between the features themselves and is thus unable to detect feature redundancy.

Chi$^2$ is designed to work with feature counts on categorical data that results in non-negative values. However, Chi$^2$ can be used on continuous data by discretizing the feature vectors and counting the occurrences of feature values in the given bins. 
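
The discretize-and-count idea can be sketched as follows. Each continuous feature is cut into three equal-frequency bins, a contingency table of bin against class is built, and a Chi$^2$ test of independence is run on that table with `scipy.stats.chi2_contingency`. The Iris data, the choice of three bins and the variable names are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = pd.Series(iris.target, name='class')

chi2_results = {}
for col in features.columns:
    # discretize the continuous feature into (up to) 3 equal-frequency bins
    binned = pd.qcut(features[col], q=3, labels=False, duplicates='drop')
    # count how often each bin co-occurs with each class
    table = pd.crosstab(binned, labels)
    # test whether bin membership and class label are independent
    stat, p_value, dof, expected = chi2_contingency(table)
    chi2_results[col] = (stat, p_value)

# rank features from most to least dependent on the class label
ranking = sorted(chi2_results, key=lambda c: chi2_results[c][0], reverse=True)
for col in ranking:
    stat, p_value = chi2_results[col]
    print(f'{col:20s} chi2 = {stat:8.2f}  p = {p_value:.3g}')
```

The number of bins is a free choice here; different binnings can change the scores and, occasionally, the ranking.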

> $^1$ Susnjak, T., Kerry, D., Barczak, A., Reyes, N., & Gal, Y. (2015, November). Wisdom of Crowds: An Empirical Study of Ensemble-Based Feature Selection Strategies. In Australasian Joint Conference on Artificial Intelligence (pp. 526-538). Springer International Publishing.

## Example 

In [None]:
grades = pd.read_csv('../datasets/grades_prediction_mode.csv')
print(grades.head() )
grades = grades.dropna()

### Machine learning with no feature selection 

In [None]:
# create numeric column for the response
# note: features and response must both be entirely numeric!
mapping = {'A+':0, 'A':1, 'A-':2 ,'B+':3, 'B':4, 'B-':5, 'C+':6, 'C':7, 'R':8, 'D':9,'E':10, 'DNC':11}
#mapping = {'A+':0, 'A':0, 'A-':0 ,'B+':1, 'B':1, 'B-':1, 'C+':2, 'C':2, 'R':3, 'D':4,'E':4, 'DNC':4}
grades['grade_num'] = grades.Grade.map(mapping)

# create X (features) from the assignment and exam columns
X = grades[['A1', 'A2', 'A3', 'A4', 'A5', 'Exam']]

# create y (response)
y = grades.grade_num

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()

In [None]:
from sklearn.tree import DecisionTreeClassifier
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
X_train.shape

In [None]:
print("Test accuracy for NB (no feature selection):", nb.fit(X_train, y_train).score(X_test, y_test))
print("Test accuracy for Tree (no feature selection):", treeclf.fit(X_train, y_train).score(X_test, y_test))
print("Test accuracy for kNN (no feature selection):", knn.fit(X_train, y_train).score(X_test, y_test))

In [None]:
print('NB', metrics.classification_report(y_test, nb.predict(X_test)) )
print('tree', metrics.classification_report(y_test, treeclf.predict(X_test)) )
print('knn', metrics.classification_report(y_test, knn.predict(X_test)) )

In [None]:
print('NB \n',metrics.confusion_matrix(y_test, nb.predict(X_test)) )
print('tree \n',metrics.confusion_matrix(y_test, treeclf.predict(X_test)) )
print('knn \n',metrics.confusion_matrix(y_test, knn.predict(X_test)) )


### Machine learning with feature selection

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
X_new = SelectKBest(chi2, k=4).fit_transform(X, y)
X_new.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=1, test_size=0.2)

In [None]:
print("Test accuracy for NB (with feature selection):", nb.fit(X_train, y_train).score(X_test, y_test) )
print("Test accuracy for Tree (with feature selection):", treeclf.fit(X_train, y_train).score(X_test, y_test) )
print("Test accuracy for knn (with feature selection):", knn.fit(X_train, y_train).score(X_test, y_test) )
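
Note that `SelectKBest(chi2, k=4).fit_transform(X, y)` above sees every row, including those that later land in the test set, so the reported test accuracy can be somewhat optimistic. One common way to avoid this is to place the selection step and the classifier in a `Pipeline`, so that the features are re-selected using only the training portion of each fold. The sketch below assumes the Wine data bundled with scikit-learn as a stand-in (its features are non-negative, as `chi2` requires); the names `X_demo`, `y_demo`, `pipe` and `pipe_scores` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# stand-in dataset with non-negative features
X_demo, y_demo = load_wine(return_X_y=True)

# feature selection and classifier are fitted together, fold by fold
pipe = Pipeline([
    ('select', SelectKBest(chi2, k=4)),
    ('clf', GaussianNB()),
])

pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print('mean is: ', pipe_scores.mean())
print('STD is: ', pipe_scores.std())
```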

**Exercise:** Given the Wine dataset, perform classification as above using all features for classification and then using a subset of the features. Split your train/test datasets 50/50. Report on the experimental findings. 

In [None]:
wine_df = pd.read_csv('../datasets/wine_data.csv')
wine_df.head()

# Classification Process Summarized

1. We **begin with a data set** containing multiple samples, elements, records, or instances (all are the same terms used by different disciplines). 

2. Each instance is a **feature vector** that consists of a number of features or attributes.

3. One of the features is special: it represents the instance's class - the **class label**. Each instance **belongs to exactly one class**.

4. Classification problems are either **binary** or **multiclass**.

5. A number of classification algorithms are limited to only binary classification. However, multiclass problems can be **decomposed into a series of binary classification problems**, i.e., an instance belongs either to the target class or to any other class.

6. A classifier takes as input an instance (i.e., a feature vector) and **produces a class label**.

7. Creating and using a classifier entails a three-step process of: **training, testing, and deployment** in an application.

8. We first split the existing data set into a **training set** and a **test set**. 

9. In the training phase, we present each instance from the training set to the classification algorithm. 

10. We then compare the class label produced by the algorithm to the true class label of the record in question.

11. Where possible, we then adjust the algorithm's **“parameters”** to achieve the greatest possible **accuracy** or, equivalently, the lowest possible **error rate**.

12. The results can be **summarized** in a so-called **confusion matrix** whose entries are the number of records for each combination of true and predicted class.

13. Unfortunately, the **error rate derived from the training set** (the training error) is typically **too optimistic** as an indicator of the error rate the classifier would achieve on new data — that is, on data that was not used during the learning phase. 

14. This is the purpose of the test set: after we have optimized the algorithm using only the training data, we let the classifier operate on the elements of the test set to see how well it classifies them. The error rate obtained in this way is an estimate of the **generalization error** and is a much more reliable indicator of the accuracy of the classifier.

15. Keep in mind the **trade-off** between **classifier complexity** and **overfitting**. A classifier can usually be made complex enough to classify every training sample correctly, but it is then memorizing the training data rather than learning patterns that generalize; this is overfitting. On the other hand, if the classifier is too simple, it cannot learn the relationships within the data and both its training and generalization errors will be poor; this is known as underfitting.

16. Analyse the features in the dataset. Remove features that are redundant or irrelevant, as they might compromise the generalisability of the classifier.

17. Once a classifier has been developed and tested, it can be used to **classify truly new and unknown data points**, that is, data points for which the correct class label is not known. (This is in contrast to the test set, where the class labels were known but not used by the classifier when making a prediction.)
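
The core of the process summarized above (split, train, compare training and test accuracy, summarize in a confusion matrix) can be sketched in a few lines. This is an illustration only, assuming the Wine data bundled with scikit-learn and a decision tree whose depth serves as the complexity setting; the variable names are hypothetical.

```python
from sklearn import metrics
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset, split into a training set and a test set
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# vary classifier complexity and compare training accuracy with test accuracy
accuracy_by_depth = {}
for depth in (1, 3, None):  # None lets the tree grow until it fits the training data
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    accuracy_by_depth[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f'max_depth={depth}: train={accuracy_by_depth[depth][0]:.3f}, '
          f'test={accuracy_by_depth[depth][1]:.3f}')

# summarize the test-set predictions of one chosen classifier
final_clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)
cm = metrics.confusion_matrix(y_te, final_clf.predict(X_te))
print(cm)
```

The training accuracy of the unrestricted tree is the optimistic figure; the test accuracy is the one that estimates performance on new data.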

Adapted from: 
> Source: Janert, P. K. (2010). Data analysis with open source tools.  O'Reilly Media, Inc