# MATHS1004 Mathematics for Data Science I
## Computer Lab 6

Welcome to the final computer lab! In this lab we'll pull together a few of the pieces we've learned throughout this series of labs, and also delve a little deeper into naive Bayes classifiers.

But first:

## A warmup activity

In Week 10 we calculated the expected value of winnings for X Lotto, based on buying a single ticket. The facts were:
- Each draw consists of 6 random numbers drawn from a barrel containing the numbers 1-45;
- A single ticket costs $4.20;
- A single ticket gives you 6 entries (i.e., 6 sets of 6 numbers);
- You win the jackpot if you get all 6 numbers drawn correct, and lose the cost of your ticket otherwise.

Based on this we calculated that your expected winnings are negative, even when the jackpot is as high as $4 million.

**Question**: What do your expected winnings look like if you buy multiple tickets? Is there a point at which you might expect to make a profit?

Answer this question by creating a function `expected_winnings(jackpot,ticket_price,num_plays)`, which uses the definition of expectation for a discrete random variable to calculate the expected winnings given a `jackpot` value, cost of a single ticket `ticket_price`, and number of tickets purchased `num_plays`.
- Use your function to plot expected winnings as a function of number of tickets purchased, for the jackpot and ticket prices given above. What happens to your expected winnings as you buy more tickets?
- At what value of the `jackpot` does playing X Lotto start to become profitable?
- How cheap would tickets need to become for you to want to purchase a ticket in a $4M draw?
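
A minimal sketch of one way to set this up, under two assumptions that are mine rather than the lab's: every entry on every ticket is a different set of 6 numbers, and all tickets are for the same draw. Then the chance of holding the winning combination is the number of entries divided by the number of possible combinations, and the expectation follows from the two outcomes (win or lose). The keyword arguments `entries_per_ticket`, `pool` and `picks` are illustrative names, not part of the lab's specification.

```python
import numpy as np
import scipy.special

def expected_winnings(jackpot, ticket_price, num_plays,
                      entries_per_ticket=6, pool=45, picks=6):
    """Expected net winnings (dollars) from buying num_plays tickets in one draw.

    Assumes all entries are distinct combinations, so their win chances add.
    """
    n_combinations = scipy.special.binom(pool, picks)   # 8,145,060 for 6 from 45
    n_entries = entries_per_ticket * num_plays
    p_win = min(n_entries / n_combinations, 1.0)        # a probability cannot exceed 1
    cost = ticket_price * num_plays
    # E[X] = sum of x * P(X = x) over the two outcomes (win, lose)
    return p_win * (jackpot - cost) + (1 - p_win) * (-cost)

# Expected winnings for 1 to 10 tickets in a $4M draw
plays = np.arange(1, 11)
winnings = np.array([expected_winnings(4e6, 4.20, n) for n in plays])
print(winnings)
```

Under these assumptions the expectation is linear in the number of tickets until every combination is covered, so buying more tickets scales the expected loss (or gain) rather than changing its sign.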

You'll need to use the [scipy function for the binomial coefficient](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.special.binom.html), so I'll load that for you below.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.special

def expected_winnings(jackpot,ticket_price,num_plays):
    # Your code here: return the expected winnings in dollars
    pass

## Naive Bayes classification in practice

Let's see how Naive Bayes can be used to predict the rating of film reviews, using a famous dataset. The [IMDB reviews dataset](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) contains 50k reviews from IMDB.com. Download the csv file that came with this lab, put it in an appropriate directory (remember having this fun in Lab 1?), and then load it using `pandas` (I've added a few options so the output looks more readable).

In [None]:
import pandas as pd


pd.set_option('display.max_colwidth', 280)

df = pd.read_csv('PATH_TO_imdb_master.csv', encoding="ISO-8859-1",index_col=0)

df.head()

This dataset contains positive, negative, and neutral reviews. We'll just consider the positive and negative reviews, and a small sample for now. Create a new dataframe `dfpn` containing just the reviews having `label` == `pos` or `label` == `neg`. Then execute the cell underneath to take a sample of just 100 reviews.

In [None]:
dfs = dfpn.sample(100,random_state=19)
dfs.tail(10)

Now to train a model! The dataset came with a split of the data into training + testing, so let's use that. Working from the sample `dfs`, create variables `docs_train` and `docs_test` containing the reviews where the `type` is 'train' and 'test', respectively. Do the same for the labels, to create `labels_train` and `labels_test`.
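
As a rough sketch of the filtering and splitting steps, here they are on a tiny made-up dataframe. The column names `review`, `label` and `type` are assumed to match the csv (check the output of `df.head()`), and the value `unsup` for rows that are neither positive nor negative is also an assumption. In the lab you would apply the split to the sample `dfs`.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are assumptions
toy = pd.DataFrame({
    'type':   ['train', 'train', 'test', 'test', 'train'],
    'review': ['Loved it', 'Dreadful', 'Great fun', 'So dull', 'No rating given'],
    'label':  ['pos', 'neg', 'pos', 'neg', 'unsup'],
})

# Keep only the rows labelled 'pos' or 'neg'
toy_pn = toy[toy['label'].isin(['pos', 'neg'])]

# Boolean masks select the predefined train and test rows
is_train = toy_pn['type'] == 'train'
is_test = toy_pn['type'] == 'test'

docs_train, labels_train = toy_pn.loc[is_train, 'review'], toy_pn.loc[is_train, 'label']
docs_test, labels_test = toy_pn.loc[is_test, 'review'], toy_pn.loc[is_test, 'label']

print(len(docs_train), len(docs_test))
```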

So now we have a lot of text data, but how should we convert this into counts, like we did for spam filtering? More importantly, which words or phrases should we use? The beauty of a NB classifier is that it's robust to using *all* the data, so we'll just use a handy function to count instances of *all* words.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

tokeniser = CountVectorizer()

counts_train = tokeniser.fit_transform(docs_train.str.lower())
counts_test = tokeniser.transform(docs_test.str.lower())

print(counts_train.shape)

The `fit_transform` and `transform` calls convert the reviews into large sparse matrices of counts. (How many features or predictors do we have here?) And now we can use those counts `counts_train` along with the labels `labels_train` to "fit" our model.

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(counts_train, labels_train)
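
To see what these two steps do on something small enough to check by hand, here is a sketch on a made-up four-review corpus (the reviews and the names `toy_docs`, `toy_labels`, `toy_clf` are invented for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_docs = ['awful boring film', 'awful plot', 'wonderful film', 'wonderful funny film']
toy_labels = ['neg', 'neg', 'pos', 'pos']

vectoriser = CountVectorizer()
toy_counts = vectoriser.fit_transform(toy_docs)   # sparse matrix: one row per review, one column per word

print(toy_counts.shape)        # (number of reviews, size of vocabulary)
print(toy_counts.toarray())    # a dense view is fine for a matrix this small

toy_clf = MultinomialNB().fit(toy_counts, toy_labels)

# New text must go through the SAME fitted vectoriser (transform, not fit_transform)
new_counts = vectoriser.transform(['an awful film'])
new_proba = toy_clf.predict_proba(new_counts)
print(toy_clf.predict(new_counts), new_proba)
```

Words that never appeared in the training reviews (here "an") have no column in the matrix, so they are simply ignored at prediction time.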

Easy as that! Now we can make predictions on the unseen `test` data. Let's also take a look at the actual probabilities predicted by the model:

In [None]:
y_pred = clf.predict(counts_test)
p_pred = clf.predict_proba(counts_test)


print('predict / actual / probabilities')
print()
for a,b,c in zip(y_pred,labels_test,p_pred):
    print(a,'\t',b,'\t',c)

Take a moment to look at those probabilities. Notice that the model makes a lot of mistakes, and sometimes is highly overconfident about those mistakes (for example, the very first line), but other times is much less confident.

A nice property of NB compared with other machine learning algorithms is that you can look at the relative probabilities of each word to predict the two classes. These are contained in `clf.feature_log_prob_` and we can use them as below to see the top words predicting the positive versus negative classes:

In [None]:
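# Difference of log-probabilities = log of the ratio P(word|pos) / P(word|neg).
# clf.classes_ is sorted alphabetically, so row 0 is 'neg' and row 1 is 'pos'.
log_prob_ratio_sorted = (clf.feature_log_prob_[1, :] - clf.feature_log_prob_[0, :]).argsort()

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = tokeniser.get_feature_names_out()

n_features = 20
print('Top words for "negative" class:')
print(np.take(feature_names, log_prob_ratio_sorted[:n_features]))
print()
print('Top words for "positive" class:')
print(np.take(feature_names, log_prob_ratio_sorted[-n_features:]))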



This model is not that great, and there's a lot of randomness here, but already we see a few words that "pass the stupidity test": "awful" for the negative class, and "liked" and "wonderful" for the positive.
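
To make the ranking step concrete, here is a small sketch on an invented corpus. It ranks words by the difference of the two rows of `feature_log_prob_`, which equals the log of the ratio of the word's probability in the positive class to its probability in the negative class. The corpus and variable names are made up, and the vocabulary is recovered from `vocabulary_` so that the sketch does not depend on which scikit-learn version is installed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_docs = ['awful awful film', 'awful plot', 'wonderful film', 'wonderful wonderful cast']
toy_labels = ['neg', 'neg', 'pos', 'pos']

vectoriser = CountVectorizer()
toy_clf = MultinomialNB().fit(vectoriser.fit_transform(toy_docs), toy_labels)

# Words ordered by column index (vocabulary_ maps word -> column)
words = np.array(sorted(vectoriser.vocabulary_, key=vectoriser.vocabulary_.get))

# classes_ is sorted, so row 0 is 'neg' and row 1 is 'pos'
log_ratio = toy_clf.feature_log_prob_[1, :] - toy_clf.feature_log_prob_[0, :]
order = log_ratio.argsort()    # most 'neg'-leaning first, most 'pos'-leaning last

print('Most negative word:', words[order[0]])
print('Most positive word:', words[order[-1]])
```

A word that is equally common in both classes (here "film") gets a log-ratio of zero and lands in the middle of the ranking.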

Finally, how to summarise all of this information? It's useful to look at a [report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) of measures like accuracy (the proportion of correct predictions), as well as things like *precision*, *recall*, etc (which you'll hear more about as you progress further into data science). The [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix) is particularly useful, telling us the number of misclassifications we made in the off-diagonal elements. I'll import the functions, and then you write the commands to use them on your model.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
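
For reference, each of these functions takes the true labels first and the predicted labels second. A sketch on made-up labels (`toy_true` and `toy_pred` are invented for illustration, not output from the model):

```python
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

toy_true = ['neg', 'neg', 'pos', 'pos', 'pos']
toy_pred = ['neg', 'pos', 'pos', 'pos', 'neg']

# Rows are the true classes, columns the predicted classes, in the order given by `labels`
cm = confusion_matrix(toy_true, toy_pred, labels=['neg', 'pos'])
print(cm)

acc = accuracy_score(toy_true, toy_pred)   # proportion of predictions that match
print(acc)

print(classification_report(toy_true, toy_pred))
```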



Overall: our model is not bad, particularly given that we only used 100 out of 50K reviews! Now:
- Go back and change the sample of 100 to, say, 1000 reviews. How does the model improve? Make sure you look not just at the accuracy etc, but the informative features as well.
- What about 10,000 reviews? See how things improve then.
- You might like to make a plot of how the accuracy improves as the amount of data increases. How about the other measures like precision, recall, and F1?

## Naive Bayes for continuous data

Naive Bayes has plenty of applications to text and other discrete forms of data, but what if my data is continuous? How do I calculate any of the probabilities $P(\cdot)$ in

$$
P(c|x_1,x_2,\ldots,x_n) = \frac{P(x_1 | c) P(x_2|c) \ldots P(x_n | c) P(c)}{P(x_1,x_2,\ldots,x_n)}
$$

if the $x_i$'s are continuous?

The answer is to use one of the models for *continuous random variables* (our lecture topic in Week 11), of which normal (or *Gaussian*) random variables are by far the most common.
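
Concretely, a Gaussian naive Bayes model replaces each $P(x_i | c)$ with a normal density whose mean and variance are estimated from the training points in class $c$. The symbols $\mu_{i,c}$ and $\sigma_{i,c}^2$ are notation introduced here for that mean and variance:

$$
f(x_i | c) = \frac{1}{\sqrt{2\pi\sigma_{i,c}^2}} \exp\left(-\frac{(x_i - \mu_{i,c})^2}{2\sigma_{i,c}^2}\right)
$$

These densities are then multiplied together in the numerator in the same way as the probabilities $P(x_i | c)$ in the formula above.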

We'll demonstrate using the very famous [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), first introduced by Ronald Fisher (who happened to live at the University of Adelaide at the end of his life and is buried in St Peter's Cathedral just over the river from here!). Load it up from `sklearn`:

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Shuffle the rows to make training easier later on
data = np.zeros((len(y),5))
data[:,:-1] = X
data[:,-1] = y
np.random.shuffle(data)
X = data[:,:-1]
y = data[:,-1]

print(X.shape)

The game is to predict which of the 3 types of iris (Setosa, Versicolour, and Virginica) the 150 samples are, given the 4 predictors (Sepal Length, Sepal Width, Petal Length and Petal Width).

First, let's see if there is any structure in this dataset! Have an explore, by making a few different scatter plots of the different predictors. Colour your points by the type of iris to see if there is any structure.

You should see that there is some clustering, but how to spot this automatically?

PCA, of course! Using the skills you learnt in Lab 4, do a Principal Component Analysis of the iris dataset, and plot PC1 versus PC2. You should be able to see clusters!

(You can follow the procedure from Lab 4, or you might want to look at `sklearn`'s PCA function, which I used in lectures -- the notebook is included with this lab.)
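
If you take the `sklearn` route, a minimal sketch looks like the following. It reloads the data so that it runs on its own, and the names `pca` and `scores` are just illustrative. Note that `sklearn`'s PCA centres the data but does not rescale it, so no standardisation is applied here.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

pca = PCA(n_components=2)            # keep the first two principal components
scores = pca.fit_transform(X_iris)   # PCA centres the data, then projects it

print(scores.shape)                     # one (PC1, PC2) pair per flower
print(pca.explained_variance_ratio_)    # share of variance captured by each component

# To plot: plt.scatter(scores[:, 0], scores[:, 1], c=y_iris)
```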

Definitely clusters there! We should be able to train a NB model to classify irises.

Using the process from above, but this time importing and using `GaussianNB` instead of `MultinomialNB`, and using the first 100 rows of the data in `X` to train the model, create a NB classification model to predict iris type. Use `sklearn` to report on the accuracy etc of the model, plus look at the confusion matrix to see which classes were misclassified. 
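
A sketch of how this could go, assuming the remaining 50 rows are used for testing (the lab does not say which rows to test on). It reshuffles the data with a fixed seed so that it runs on its own; the seed and the variable names are arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

iris = datasets.load_iris()
rng = np.random.default_rng(0)              # fixed seed, chosen arbitrarily
order = rng.permutation(len(iris.target))   # shuffled row indices
X_all, y_all = iris.data[order], iris.target[order]

# First 100 shuffled rows for training, the remaining 50 for testing (an assumption)
X_train, y_train = X_all[:100], y_all[:100]
X_test, y_test = X_all[100:], y_all[100:]

gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])   # rows: true class, columns: predicted
print(acc)
print(cm)
```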

Overall your model should be performing pretty well! From your exploratory data analysis earlier, can you see why those particular types of iris were more difficult to distinguish?

Naive Bayes is a principled, interpretable, and often-overlooked data science tool. Challenge: You might like to see how it performs on some other examples we've encountered during this course, e.g.:
- the breast cancer dataset (also using continuous data);
- the Titanic dataset (which contains both continuous and discrete data!).