# Applied Machine Learning (INFR11211)

# Lab 1: Classification

In this lab we work with a spam filtering dataset. We will learn how to perform classification tasks using Naive Bayes and Logistic Regression. For this, we will use the the packages introduced in Lab 0, and `scikit-learn` package (`sklearn`): a machine learning library for Python which works with numpy array, and pandas DataFrame objects.

**Please Note**: Throughout this lab we make reference to [`methods`](https://en.wikipedia.org/wiki/Method_%28computer_programming%29) for specific objects e.g. "make use of the predict method of the MultinomialNB classifier". If you get confused, refer to the documentation and just ctrl+f for the object concerned:
* [Scikit-learn API documentation](https://scikit-learn.org/stable/modules/classes.html) 
* [Seaborn API documentation](https://seaborn.github.io/api.html)
* [Matplotlib Pyplot documentation](https://matplotlib.org/stable/api/pyplot_summary.html)
* [Pandas API documentation](https://pandas.pydata.org/pandas-docs/version/1.5.3/reference/index.html)
* [Numpy documentation](https://numpy.org/doc/stable/)

There are also tonnes of great examples online; googling key words with the word "example" will serve you well. However, note that sometimes examples online will use different version of the packages. 

First, we need to import the packages (run all the code cells as you read along):

In [1]:
# Import packages
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline

*Clarification*:

* The `%matplotlib inline` command is a special ipython [built in magic command](http://ipython.readthedocs.io/en/stable/interactive/magics.html) which forces the matplotlib plots to be rendered within the notebook.

## Spambase dataset

The [Spambase](http://archive.ics.uci.edu/ml/datasets/Spambase) classification dataset consists of tagged emails from a single email account. You should read through the description available for this data to get a feel for what you're dealing with. We have downloaded the dataset for you.

You will find the dataset located at `./datasets/spambase.csv` (the `datasets` directory is adjacent to this file). Execute the cell below to load the csv into in a pandas DataFrame object. 

In [2]:
# Load the dataset
data_path = os.path.join(os.getcwd(), 'datasets', 'spambase.csv')
spambase = pd.read_csv(data_path, delimiter = ',')

We have now loaded the data. Let's get a feeling of what the data looks like by using the `head()` method.

In [3]:
spambase.head(5) # Display the 5 first rows of the dataframe

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61.0,278.0,1.0
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101.0,1028.0,1.0
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485.0,2259.0,1.0
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40.0,191.0,1.0
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40.0,191.0,1.0


### ========== Question 1 ==========

**a)** Display the number of features in the dataset (i.e. number of columns).

In [8]:
# Your Code goes here:
print(len(spambase.columns))

58


**b)** Display the number of observations (i.e. number of rows).

In [9]:
# Your Code goes here:
print(spambase.size)

266858


**c)** Display the mean and standard deviation of each feature.

In [13]:
# Your Code goes here:
spambase.agg(['mean', 'std']).T

Unnamed: 0,mean,std
word_freq_make,0.104553,0.305358
word_freq_address,0.213015,1.290575
word_freq_all,0.280656,0.504143
word_freq_3d,0.065425,1.395151
word_freq_our,0.312223,0.672513
word_freq_over,0.095901,0.273824
word_freq_remove,0.114208,0.391441
word_freq_internet,0.105295,0.401071
word_freq_order,0.090067,0.278616
word_freq_mail,0.239413,0.644755


We now want to *remove* some of the features from our data. There are various reasons for wanting to do so, for instance we might think that these are not relevant to the task we want to perform (i.e. e-mail classification) or they might have been contaminated with noise during the data collection process.

## Data cleaning

### ========== Question 2 ==========

**a)** Delete the `capital_run_length_average`, `capital_run_length_longest`, and  `capital_run_length_total` features. 
*Hint*: You should make use of the [`drop`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) method. 

*Tip*: some pandas methods have the argument `inplace` which you can use to determine whether they alter the object they are called upon and return nothing, or return a new object. This is particularly useful if you are dealing with huge datasets where you would typically want to operate `inplace`.

In [16]:
# Student needs to provide code similar to below
spambase.drop(["capital_run_length_average", "capital_run_length_longest", 
                          "capital_run_length_total"], axis=1, inplace=True)
## or, less efficiently
# spambase = spambase.drop(["capital_run_length_average", "capital_run_length_longest", 
#                           "capital_run_length_total"], axis=1)


**b)** Display the new number of features. Does it look like what you expected?

In [17]:
# Your Code goes here:
print(len(spambase.columns))

55


The remaining features represent relative frequencies of various important words and characters in emails. This is true for all features except `is_spam` which represents whether the e-mail was annotated as spam or not. So each e-mail is represented by a 55 dimensional vector representing whether or not a particular word exists in an e-mail. This is the so called [bag of words](http://en.wikipedia.org/wiki/Bag_of_words_model) representation and is clearly a very crude approximation since it does not take into account the order of the words in the emails.

### ========== Question 3  ==========
Now we want to simplify the problem by transforming our dataset. We will replace all numerical values which represent word frequencies with a binary value representing whether each word was present in a document or not.

**a)** Crate a new dataframe called `spambase_binary` from `spambase`. *Hint*: Look into the [`copy`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html?highlight=copy#pandas.DataFrame.copy) method in pandas. 

*Tip*: Be careful, in python, unless you explictly say not to, assigment is typically just reference e.g.
```python
i = [1, 3]
j = i
i[1] = 5
print(j)
```
outputs:
```
[1, 5]
```

In [19]:
# Your Code goes here:
spambase_binary = spambase.copy(deep=True)

**b)** Convert all features in `spambase_binary` to Boolean values: 1 if the word or character is present in the email, or 0 otherwise.

In [25]:
# Your Code goes here:
spambase_binary.applymap(lambda x: 1 if x > 0 else 0)

  spambase_binary.applymap(lambda x: 1 if x > 0 else 0)


Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_edu,word_freq_table,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,is_spam
0,0,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,1,1,1,0,1,1,1,1,0,1,...,0,0,0,0,1,0,1,1,1,1
2,1,0,1,0,1,1,1,1,1,1,...,1,0,0,1,1,0,1,1,1,1
3,0,0,0,0,1,0,1,1,1,1,...,0,0,0,0,1,0,1,0,0,1
4,0,0,0,0,1,0,1,1,1,1,...,0,0,0,0,1,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,1,0,1,0,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
4597,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
4598,1,0,1,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,0,0,0
4599,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0


**c)** Display the 5 last observations of the transformed dataset.

In [26]:
# Your Code goes here:
print(spambase_binary.tail(5))

      word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
4596            0.31                0.0           0.62           0.0   
4597            0.00                0.0           0.00           0.0   
4598            0.30                0.0           0.30           0.0   
4599            0.96                0.0           0.00           0.0   
4600            0.00                0.0           0.65           0.0   

      word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
4596           0.00            0.31               0.0                 0.0   
4597           0.00            0.00               0.0                 0.0   
4598           0.00            0.00               0.0                 0.0   
4599           0.32            0.00               0.0                 0.0   
4600           0.00            0.00               0.0                 0.0   

      word_freq_order  word_freq_mail  ...  word_freq_edu  word_freq_table  \
4596              0.0     

## Multinomial Naive Bayes classification

Given the transformed dataset, we now wish to train a Naïve Bayes classifier to distinguish spam from regular email by fitting a distribution of the number of occurrences of each word for all the spam and non-spam e-mails. Read about the [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) and the underlying assumption if you are not already familiar with it from the lectures. In this lab we focus on the [Multinomial Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes). 

We will make use of the `MultinomialNB` class in `sklearn`. **Check out the user guide [description](https://scikit-learn.org/stable/modules/naive_bayes.html?highlight=multinomialnb#multinomial-naive-bayes) and [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html?highlight=multinomialnb#sklearn.naive_bayes.MultinomialNB) to familiarise yourself with this class.**

All classifiers in `sklearn` implement a `fit()` and `predict()` [method](https://en.wikipedia.org/wiki/Method_%28computer_programming%29). The first learns the parameters of the model and the latter classifies inputs. For a Naive Bayes classifier, the [`fit`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html?highlight=multinomialnb#sklearn.naive_bayes.MultinomialNB.fit) method takes at least two input arguments `X` and `y`, where `X` are the input features and `y` are the labels associated with each example in the training dataset (i.e. targets). 

As a first step we extract the input features and targets from the DataFrame. To do so, we will use the [`values`](https://pandas.pydata.org/pandas-docs/version/1.3.1/reference/api/pandas.DataFrame.values.html) property. For the input features we want to select all columns except `is_spam` and for this we may use the [`drop`](https://pandas.pydata.org/pandas-docs/version/1.3.1/reference/api/pandas.DataFrame.drop.html) method which discards the specified columns along the given axis. In fact, we can combine these two operations in one step.

### ========== Question 4 ==========

**a)** Create a Pandas DataFrame object `X` containing only the features (i.e. exclude the label `is_spam`). We need to do this as it is the input Scikit-learn objects expect for fitting. These will be our training features. *Hint*: make use of the `drop` method.

In [27]:
# Your Code goes here:
X = spambase_binary.drop('is_spam', axis=1).values

**b)** Create a Pandas Series object `y` that contains only the label from `spambase_binary`. These will be our training class labels. 

In [28]:
# Your Code goes here:
Y = spambase_binary['is_spam'].values

**c)** Display the dimensionality (i.e. `shape`) of each of the two arrays. *Hint:* The shape of `X` and `y` should be `(4601, 54)` and `(4601,)` respectively.

In [29]:
# Your Code goes here:
print(X.shape)
print(Y.shape)

(4601, 54)
(4601,)


**d)** Display the count of the number of emails that are spam (i.e. where `y == 1`), and those that are not spam (i.e. where `y == 0`).

In [35]:
# Your Code goes here:
print("spam count:" ,  np.sum(Y == 1))
print("non-spam count:" , np.sum(Y == 0))

spam count: 1813
non-spam count: 2788


### ========== Question 5 ==========

Now we want to train a Multinomial Naive Bayes classifier. Initialise a `MultinomialNB` object and [`fit`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html?highlight=multinomialnb#sklearn.naive_bayes.MultinomialNB.fit) the classifier using the `X` and `y` arrays extracted in the cell above.

In [36]:
# Your Code goes here:
MultinomialNB_model = MultinomialNB()
MultinomialNB_model.fit(X, Y)

## Model evaluation

We can evaluate the classifier by looking at the classification accuracy on the data. 

Scikit-learn model objects have built in scoring methods. The default [`score` method for `MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html?highlight=multinomialnb+score#sklearn.naive_bayes.MultinomialNB.score) estimates the classification accuracy score. Alternatively, you can compute the prediction for the training data and make use of the [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html?highlight=accuracy_score#sklearn.metrics.accuracy_score) function (that is in fact what the classifier's `score()` method does under the hood).

### ========== Question 6 ========== 

**a)** Display the log-prior probabilities for each class. *Hint:* use tab-completion to figure out which feature of the `MultinomialNB` structure you are interested in.

In [37]:
# Your Code goes here:
MultinomialNB_model.predict_log_proba(X)

array([[-4.95399423e+00, -7.08017803e-03],
       [-1.03709914e+01, -3.13287076e-05],
       [-1.61321821e+01, -9.86012338e-08],
       ...,
       [-1.23494348e-01, -2.15267169e+00],
       [-5.76123192e-01, -8.25703501e-01],
       [-3.11236681e-01, -1.31878704e+00]])

**b)** Extract the predictions from your classifier using the training data as input. *Hint*: make use of the `predict` method of the `MultinomialNB` classifier.

In [39]:
# Your Code goes here:
Pred = MultinomialNB_model.predict(X)

**c)** Compute the classification accuracy on the training data by either using the `accuracy_score` metric or the `score` method of the `MultinomialNB`. 

In [None]:
# Your Code goes here:
MultinomialNB_model.score(X, Y)

### ========== Question 7 ==========

The empirical log probability of input features given a class $P\left(x_i  |  y\right)$ is given by the feature `feature_log_prob` of the classifier. For each feature there are two such conditional probabilities, one for each class. 

**a)** What dimensionality do you expect the `feature_log_prob_` array to have? Why?

***Student needs to answer similar to below:***

There is a probability for each feature(/variable) conditional on each of the two outcome values, so the dimensionality should logically be (54,2) or (2,54)



**b)** Inspect the log probabilities of the features. Verify that it has the expected dimensionality (i.e. `shape`).

In [None]:
# Your Code goes here:

**c)** Create a list of the names of the features that have higher log probability when the email is `Ham` than `Spam` i.e. what features imply an email is more likely to be `Ham`? *Hint:* There are a many ways to do this. Try it on your own then, if you get stuck, you can do it using index numbers (look up [`np.argwhere`](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html)), or using a boolean mask (look up [pandas indexing](https://pandas.pydata.org/docs/user_guide/indexing.html)). The column names of a Pandas DataFrame are contained in the `columns` feature.

In [None]:
# Your Code goes here:


### ========== Question 8 ==========

For the final part of this section we will now pretend we are spammers wishing to fool a spam checking system based on Naïve Bayes into classifying a spam e-mail as ham (i.e. a valid e-mail). For this we will use a test set consisting of just one data point (i.e. e-mail). This tiny dataset is called `spambase_test` and has already been pre-processed for you which means that the redundant features have been removed and word frequencies have been replaced by word presence/absence.

**a)** Load `./datasets/spambase_test.csv` dataset into a new pandas structure

In [None]:
# Your Code goes here:


**b)** Use `spambase_test` to create a pandas DataFrame object X_test, contatining the test features, and pandas Series object y_test, containing the test outcome

In [None]:
# Your Code goes here: 


**c)** Feed the input features into the classifier and compare the outcome to the true label. Make sure you don't feed the target into the classifier as you will receive an error (why?). Does the classifer classify the spam e-mail correctly?

In [None]:
# Your Code goes here:


**d)** Pick one (perhaps random) feature that has higher probability for the ham class (using your feature names from earlier) and set the corresponding value in `X_test` to 1. Now predict the new outcome. Has it changed? If not, keep modifying more features until you have achieved the desired outcome (i.e. model classifies the e-mail as ham).

In [None]:
# Your Code goes here:


## Logistic Regression

### ========== Question 9 ==========
We now train a [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression) classifier and [`fit`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression.fit#sklearn.linear_model.LogisticRegression.fit) it by using the training data. Use the `lbfgs` solver and default settings for the other parameters. Report the classification accuracy on both the training (`X` and `y`) and test sets (`X_test` and `y_test`). Does your classifier generalise well on unseen data?
(You can use the default [`score` method for `LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression.fit#sklearn.linear_model.LogisticRegression.score) to evaluate the classification accuracy)


**a)** Train the Logistic Regression binary classifier.

In [None]:
# Your Code goes here:


**b)** Print the weight and bias of the binary Logistic Regression classifier (look up the Attributes of [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression)).

In [None]:
# Your Code goes here:
