Classification
========


Classification is a type of supervised machine learning in which the goal is to build a model that can assign input data to predefined classes or categories. The algorithm is "supervised" because it learns from a labeled dataset, where each data point is associated with a label indicating its class membership.

The primary objective of classification is to learn a decision boundary or a mapping function that can effectively separate different classes in the input data space. Once the model is trained on the labeled data, it can predict the class labels of new, unseen data points based on the patterns it has learned during training.

Commonly used classification algorithms include Logistic Regression, Decision Trees, Random Forests, Support Vector Machines (SVM), and Neural Networks. The choice of algorithm depends on the nature of the data, the complexity of the decision boundary, and the interpretability requirements.

We will now work through an example classification problem using scikit-learn.
As usual, we import numpy and matplotlib to manipulate and visualise the data:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

To visualize the workings of machine learning algorithms, it is often helpful to study one-dimensional or two-dimensional data, that is, data with only one or two features.

In practice, datasets usually have many more features, but high-dimensional data is hard to plot on a two-dimensional screen.

We will illustrate some very simple examples before we move on to more "real world" data sets.

First, we will look at a two-class classification problem in two dimensions, that is, the problem of assigning one of two class labels to each sample in the dataset.

For this quick example, we will use synthetic data generated by the ``make_blobs`` function from `sklearn`, which generates isotropic Gaussian blobs, given the sample size and number of features.

In [None]:
from sklearn.datasets import make_blobs

In [None]:
?make_blobs

First, we generate some data:

In [11]:
# insert code here
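
One way this step might look is sketched below. The sample size, number of centers, spread, and seed are assumptions chosen for illustration; the text does not fix them.

```python
from sklearn.datasets import make_blobs

# Assumed settings: 100 samples drawn from 2 blobs, with a fixed seed
# so that the same points are generated on every run.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

# X holds the features (one row per sample), y holds the class labels.
print(X[:5])
print(y[:5])
```

Leaving `n_features` at its default gives two features per sample, which is what the two-dimensional plots in this section rely on.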

Then we inspect the shape of each array:

In [12]:
# insert code here

As the data is two-dimensional, we can plot each sample as a point in two-dimensional space, with the first feature being the x-axis and the second feature being the y-axis.

In [13]:
# insert code here
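
A possible plotting sketch follows. The data settings, colours, and markers are assumptions made for illustration only.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Assumed data settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

fig, ax = plt.subplots()
# First feature on the x-axis, second feature on the y-axis,
# with one colour per class (blue/red is an arbitrary choice).
ax.scatter(X[y == 0, 0], X[y == 0, 1], c="blue", s=40, label="class 0")
ax.scatter(X[y == 1, 0], X[y == 1, 1], c="red", s=40, marker="s", label="class 1")
ax.set_xlabel("first feature")
ax.set_ylabel("second feature")
ax.legend(loc="upper right")
# In a notebook the figure is rendered at the end of the cell.
```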

As classification is a supervised task, and we are interested in how well the model generalises (i.e. predicts on new, unseen data), we split our data into a **training set**,
to build the model from, and a **test set**, to evaluate how well our model performs on new data.

The ``train_test_split`` function from the ``model_selection`` module does that for us; by default, it randomly splits off 25% of the data for testing.

<img src="./resources/imgs/train_test_split.svg" width="80%">


In [14]:
# insert code here
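
A minimal sketch of the split, assuming the synthetic data settings shown in the comments and an arbitrary seed:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Assumed data settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

# With no test_size given, 25% of the samples are held out for testing.
# random_state is an arbitrary seed that makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)

print(X_train.shape, X_test.shape)
```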

The `stratify` parameter preserves the proportion of each class in both subsets. For example, here 50% of the data belongs to class 1 and 50% to class 0.

In [15]:
# insert code here
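
A stratified split might be written as follows. Passing the labels to `stratify` asks for the class proportions of `y` to be kept in both subsets; the data settings and seed are again assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Assumed data settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

# stratify=y keeps the class balance of y in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

print(X_train.shape, X_test.shape)
```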

Let's have a look at the shape of the train and test data:

In [16]:
# insert code here

In [17]:
# insert code here

After a stratified split, roughly half of each subset still belongs to each class:

In [18]:
# insert code here

In [19]:
# insert code here

In [20]:
# insert code here

In [21]:
# insert code here
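
One way to check the class proportions is to count the labels with `np.bincount`, as in this sketch (data settings and seed are assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

# np.bincount counts how often each label (0, 1) occurs;
# dividing by the array length turns the counts into fractions.
all_frac = np.bincount(y) / len(y)
train_frac = np.bincount(y_train) / len(y_train)
test_frac = np.bincount(y_test) / len(y_test)

print("All:     ", all_frac)
print("Training:", train_frac)
print("Test:    ", test_frac)
```

Because 75 and 25 are odd, the subsets cannot be split exactly in half, so the fractions are close to, but not exactly, 0.5.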

Now, let's have a look at how the training and testing of a supervised learning model works:
- Feed **training data** (X) and **training labels** (y) into the model; 


Once the model is trained:
- Make predictions using the test data; 
- Evaluate the performance using test labels.

<img src="./resources/imgs/supervised_workflow.svg" width="60%" style="background: #4682B4">


## Scikit-Learn Estimator API

Every algorithm is exposed in scikit-learn via an "Estimator" object. For instance, logistic regression is available as:

In [22]:
# insert code here

This type of statistical model (also known as logit model) is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, True or False, class 0 or class 1, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1. 
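
For reference, the probability estimate described above is usually written with the logistic (sigmoid) function. The symbols are notation introduced here, not taken from the text: $\mathbf{x}$ is the feature vector of a sample, $\mathbf{w}$ the learned weights, and $b$ the learned offset.

$$
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
$$

Since $\sigma$ maps any real number into the interval $(0, 1)$, the output can be read as a probability, and $P(y = 0 \mid \mathbf{x}) = 1 - P(y = 1 \mid \mathbf{x})$.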

Within machine learning, logistic regression belongs to the family of supervised machine learning models. It is also considered a discriminative model, which means that it attempts to distinguish between classes (or categories). 

All models in scikit-learn have a very consistent interface.
First, we instantiate the estimator object.

In [23]:
# insert code here

In [24]:
# insert code here

In [25]:
# insert code here

In [26]:
# insert code here
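
A short sketch of importing and instantiating the estimator. The variable name `classifier` is a choice made here for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Instantiate with default hyperparameters.
# Nothing is learned from data until fit() is called.
classifier = LogisticRegression()

# Hyperparameters can be inspected before fitting, e.g. the
# inverse regularisation strength C.
print(classifier.get_params()["C"])
```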

To build the model from our data, that is, to learn how to classify new points, we call the ``fit`` method with the training data and the corresponding training labels (the desired output for each training data point):

In [27]:
# insert code here
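
The fitting step could look like this, assuming the synthetic data and stratified split described in the comments:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

classifier = LogisticRegression()
# fit() learns the model parameters from the training data and labels.
classifier.fit(X_train, y_train)

# After fitting, the classes seen during training are stored on the estimator.
print(classifier.classes_)
```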

We can then apply the model to unseen data and predict the outcome with the ``predict`` method:

In [28]:
# insert code here
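
A sketch of the prediction step, under the same assumed data and split settings:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

classifier = LogisticRegression().fit(X_train, y_train)

# predict() returns one class label per row of X_test.
prediction = classifier.predict(X_test)
print(prediction)
```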

We can compare these against the true labels:

In [29]:
# insert code here

We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct (i.e. **accuracy**):

In [30]:
# insert code here
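
The comparison and the accuracy computation can be sketched in a few lines. The data and split settings are assumptions; the accuracy is simply the mean of the element-wise comparison between predictions and true labels.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

classifier = LogisticRegression().fit(X_train, y_train)
prediction = classifier.predict(X_test)

# Print predictions and true labels side by side for a visual check.
print(prediction)
print(y_test)

# prediction == y_test is a boolean array; its mean is the fraction correct.
accuracy = np.mean(prediction == y_test)
print(accuracy)
```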

All scikit-learn classifiers also provide a convenience method, ``score``, that computes this directly from the test data:

In [31]:
# insert code here

It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:

In [32]:
# insert code here
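
A sketch using ``score`` on both subsets, under the same assumed data and split settings. For classifiers, ``score`` returns the mean accuracy on the data it is given.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

classifier = LogisticRegression().fit(X_train, y_train)

# score() predicts on the given data and compares against the given labels.
test_score = classifier.score(X_test, y_test)
train_score = classifier.score(X_train, y_train)

print("test accuracy: ", test_score)
print("train accuracy:", train_score)
```

A training score much higher than the test score would suggest that the model has fitted the training data too closely.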

``LogisticRegression`` is a so-called linear model,
which means its decision boundary is linear in the input space.

In 2D, this simply means it finds a line to separate the blue points from the red:

In [33]:
# insert code here
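
One way to draw the separating line is to solve `w0 * x0 + w1 * x1 + b = 0` for `x1`, using the fitted coefficients. This is a hand-rolled sketch rather than a scikit-learn plotting utility, and it assumes `w1` is non-zero; data settings, colours, and variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)
classifier = LogisticRegression().fit(X_train, y_train)

# The boundary is the set of points where w0*x0 + w1*x1 + b = 0.
w0, w1 = classifier.coef_[0]
b = classifier.intercept_[0]
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
x1 = -(w0 * x0 + b) / w1  # assumes w1 != 0

fig, ax = plt.subplots()
ax.scatter(X[y == 0, 0], X[y == 0, 1], c="blue", s=40, label="class 0")
ax.scatter(X[y == 1, 0], X[y == 1, 1], c="red", s=40, marker="s", label="class 1")
ax.plot(x0, x1, "k--", label="decision boundary")
ax.set_xlabel("first feature")
ax.set_ylabel("second feature")
ax.legend(loc="upper right")
```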

**Estimated parameters**: All the estimated parameters are attributes of the estimator object, ending with an underscore. 

Here, these are the coefficients and the offset of the line:

In [34]:
# insert code here
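
The fitted parameters can be read directly from the estimator, as in this sketch (same assumed data and split settings):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)
classifier = LogisticRegression().fit(X_train, y_train)

# coef_ has shape (1, 2) here: one weight per feature for the binary problem.
print(classifier.coef_)
# intercept_ has shape (1,): the offset of the line.
print(classifier.intercept_)
```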

Another classifier: K Nearest Neighbors
------------------------------------------------

Another popular and easy to understand classifier is K nearest neighbors (kNN).  

It has one of the simplest learning strategies: given a new, unknown observation, look up the training samples whose features are closest to it and assign the predominant class among them.

The interface is exactly the same as for ``LogisticRegression``.

In [35]:
# insert code here

In [36]:
# insert code here

This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:

In [37]:
# insert code here
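
A sketch of the instantiation, with `knn` as an illustrative variable name:

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors=1: each prediction copies the label of the single closest
# training sample. Left unset, the estimator uses 5 neighbours.
knn = KNeighborsClassifier(n_neighbors=1)
print(knn.get_params()["n_neighbors"])
```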

We fit the model with our training data:

In [38]:
# insert code here

In [39]:
# insert code here

And then we evaluate the score in the same way:

In [40]:
# insert code here

In [41]:
# insert code here
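
Fitting and scoring follow the same pattern as before. The sketch below assumes the same synthetic data and stratified split used for illustration elsewhere in this section.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed data and split settings, as the text does not fix them.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=1)
# For kNN, fit() essentially stores the training samples for later lookup.
knn.fit(X_train, y_train)

knn_train_score = knn.score(X_train, y_train)
knn_test_score = knn.score(X_test, y_test)
print("train accuracy:", knn_train_score)
print("test accuracy: ", knn_test_score)
```

With a single neighbour, every training point is its own nearest neighbour, so the training accuracy is perfect as long as no two identical points carry different labels. The test score is therefore the more informative number.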

## Random Forest Classifier

Here we'll explore a class of algorithms based on decision trees.
Decision trees are, at their root, extremely intuitive. They
encode a series of "if" and "else" choices, similar to how a person might make a decision.
However, which questions to ask and how to proceed for each answer are learned entirely from the data.

For example, if you wanted to create a guide to identifying an animal found in nature, you
might ask the following series of questions:

- Is the animal bigger or smaller than a meter long?
    + *bigger*: does the animal have horns?
        - *yes*: are the horns longer than ten centimeters?
        - *no*: is the animal wearing a collar?
    + *smaller*: does the animal have two or four legs?
        - *two*: does the animal have wings?
        - *four*: does the animal have a bushy tail?

and so on.  This binary splitting of questions is the essence of a decision tree.
One of the main benefits of tree-based models is that they require little preprocessing of the data.
They can work with variables of different types (continuous and discrete) and are invariant to scaling of the features.

Another benefit is that tree-based models are what is called "non-parametric", which means they don't have a fixed set of parameters to learn. Instead, a tree model can become more and more flexible if given more data.
In other words, the number of free parameters grows with the number of samples and is not fixed, as it is, for example, in linear models.

A random forest builds many such trees, each on a random subset of the samples and features, and combines their predictions.


In [42]:
# insert code here

In [43]:
# insert code here

In [44]:
# insert code here

In [45]:
# insert code here

In [46]:
# insert code here

Exercise 1 
=========
Apply the ``KNeighborsClassifier`` and the ``RandomForestClassifier`` to the ``iris`` dataset.
- Load the data.
- Split the dataset into training and test subsets.
- Train a ``KNeighborsClassifier`` and a ``RandomForestClassifier``.
- Try different values of ``n_neighbors`` in the ``KNeighborsClassifier`` and observe how the training and test scores change.
- Try different values of ``n_estimators`` in the ``RandomForestClassifier`` and observe how the training and test scores change.
- Print out the scores for each case.

In [47]:
# insert code here

Exercise 2 
========= 
Email Spam Classification

Objective:
The objective of this exercise is to build a machine learning model that can classify emails as "spam" or "not spam" (ham) based on their attributes.

Data:
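You can use the classic "Spambase" dataset, which contains features extracted from email texts. It is not bundled with scikit-learn itself, but it is hosted on OpenML and in the UCI Machine Learning Repository, and can be downloaded, for example, with scikit-learn's ``fetch_openml`` function. The dataset is already preprocessed and ready for use.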

Steps:

- Load the Spambase dataset (for example, via scikit-learn's ``fetch_openml``).
- Split the data into training and test sets.
- Choose a classification algorithm (e.g., Logistic Regression, Decision Trees, or Random Forests) and create a model using scikit-learn.
- Train the model using the training data.
- Evaluate the model's performance on the test data using accuracy or other appropriate metrics.
- Experiment with different hyperparameters and algorithms to see how the model's performance changes.

Challenge:
For an extra challenge, try to implement and evaluate multiple classifiers (e.g., Logistic Regression, Decision Trees, and Random Forests) to compare their performance.

Hints:

- Use the ``train_test_split`` function from scikit-learn to split the dataset into training and test sets.
- Create a model using the chosen classifier, fit it to the training data, and then use it to predict the labels for the test data.
- Use the ``accuracy_score`` function from scikit-learn to evaluate the model's performance on the test data.

In [None]:
# insert code here