<a href="https://colab.research.google.com/github/anyuanay/INFO213/blob/main/INFO213_Week9_naiveBayes_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 213: Data Science Programming 2
___

### Week 9: Probability, Naive Bayes, and Text Classification

**Question:**
- How can we make fast and straightforward probabilistic predictions for text classification?


**Objectives:**
- Define the problem of using conditional probabilities for classification
- Describe the assumptions in Naive Bayes classification
- Estimate the prior probabilities for Naive Bayes classification
- Explain smoothing techniques in estimation
- Implement Naive Bayes classification from scratch
- Apply Naive Bayes methods in the Scikit Learn package

## Introduction

- Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets.
- Because they are so fast and have so few tunable parameters, they end up being very useful as a quick-and-dirty baseline for a classification problem.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Retrieval Practice on Probability

## Bayesian Classification

- Naive Bayes classifiers are built on Bayesian classification methods.
- These rely on Bayes's theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities.
- In Bayesian classification, we're interested in finding the probability of a label given some observed features, which we can write as $P(L~|~{\rm features})$.
- Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:

$$
P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})}
$$

- If we are trying to decide between two labels—let's call them $L_1$ and $L_2$—then one way to make this decision is to compute the ratio of the posterior probabilities for each label:

$$
\frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})} = \frac{P({\rm features}~|~L_1)}{P({\rm features}~|~L_2)}\frac{P(L_1)}{P(L_2)}
$$

- All we need now is some model by which we can compute $P({\rm features}~|~L_i)$ for each label.
- Such a model is called a *generative model* because it specifies the hypothetical random process that generates the data.
- Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier.
- The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.

- This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.
- Different types of naive Bayes classifiers rest on different naive assumptions about the data.

**Naive Assumption**
- Given a label $L$ and data described by $n$ features $f_1, \ldots, f_n$, we assume that, once the label is known, the event "the data contains feature $f_i$" is independent of the event "the data contains feature $f_j$" for every other feature $f_j$.
- In other words, the features of the data are conditionally independent given the label. By the definition of conditional independence, we can write
$$
P(f_1, f_2, ..., f_n|L) = P(f_1|L)\times P(f_2|L)\times ...\times P(f_n|L).
$$

### A Really Dumb Spam Filter

- Imagine a “universe” that consists of receiving a message chosen randomly from all possible messages.
- Let $S$ be the event “the message is spam,” $V$ the event “the message contains the word Gold,” and $R$ the event “the message contains the word Rolex.”
- Then Bayes’s Theorem tells us that the probability that the message is spam, conditional on containing both the words Gold and Rolex, is:

$$P(S|V, R) = \frac{P(V, R|S)\times P(S)}{P(V, R)}$$

- By the Naive assumption, we can write it as:

$$P(S|V, R) = \frac{P(V, R|S)\times P(S)}{P(V, R)} = \frac{P(V|S)\times P(R|S)\times P(S)}{P(V, R)}$$

- Given a set of spam and non-spam emails, we can estimate the quantities $P(V|S)$, $P(R|S)$, $P(S)$, and $P(V, R)$ by counting the number of occurrences of each event.
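
The counting estimate can be sketched on a toy set of labeled messages. The flags below are made up for illustration. As an assumption of this sketch, the denominator $P(V, R)$ is obtained by applying the naive assumption to non-spam messages as well and summing the two class scores, rather than by counting joint occurrences directly.

```python
# Hypothetical toy data: (contains_gold, contains_rolex, is_spam)
messages = [
    (True, True, True), (True, False, True), (True, True, True), (False, True, True),
    (True, False, False), (False, False, False), (False, False, False),
    (False, True, False), (False, False, False), (False, False, False),
]

def fraction(rows, index):
    """Fraction of rows whose flag at position `index` is True."""
    return sum(row[index] for row in rows) / len(rows)

spam = [m for m in messages if m[2]]
ham = [m for m in messages if not m[2]]

p_s = len(spam) / len(messages)                       # P(S)
p_v_s, p_r_s = fraction(spam, 0), fraction(spam, 1)   # P(V|S), P(R|S)
p_v_h, p_r_h = fraction(ham, 0), fraction(ham, 1)     # P(V|not S), P(R|not S)

# Naive assumption applied within each class
spam_score = p_v_s * p_r_s * p_s
ham_score = p_v_h * p_r_h * (1 - p_s)

# Normalising by the sum of both scores plays the role of P(V, R)
p_spam_given_v_r = spam_score / (spam_score + ham_score)
print(f"P(S | V, R) = {p_spam_given_v_r:.3f}")
```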

## Probability
- It is hard to do data science without some sort of understanding of probability and its mathematics.
- For our purposes you should think of probability as a way of quantifying the uncertainty associated with events chosen from some universe of events.
- The universe consists of all possible outcomes, and any subset of these outcomes is an event; for example, “the die rolls a one” or “the die rolls an even number.”

- Notationally, we write $P(E)$ to mean “the probability of the event $E$.”
- We’ll use probability theory to build models. We’ll use probability theory to evaluate
models. We’ll use probability theory all over the place.

### Dependence and Independence
- Roughly speaking, we say that two events E and F are dependent if knowing something
about whether E happens gives us information about whether F happens (and
vice versa). Otherwise they are independent.

- Mathematically, we say that two events E and F are independent if the probability that
they both happen is the product of the probabilities that each one happens:

$$P(E, F) = P(E)\times P(F)$$
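
The product rule can be checked by exact enumeration. The sketch below uses two fair dice (an example chosen here for illustration) and tests one independent pair of events and one dependent pair.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice: (first, second)
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate over an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_even(o):
    return o[0] % 2 == 0

def sum_is_7(o):
    return o[0] + o[1] == 7

def sum_is_8(o):
    return o[0] + o[1] == 8

# "first die is even" and "sum is 7" satisfy P(E, F) = P(E) * P(F)
independent_7 = prob(lambda o: first_even(o) and sum_is_7(o)) == prob(first_even) * prob(sum_is_7)

# "first die is even" and "sum is 8" do not
independent_8 = prob(lambda o: first_even(o) and sum_is_8(o)) == prob(first_even) * prob(sum_is_8)

print(independent_7, independent_8)
```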

### Properties of Probability
- Let $S$ be the set of all possible outcomes. The probabilities of the outcomes in $S$ must satisfy the following properties:
    1. $P(a) \geq 0$ for $a\in S$
    2. $\sum_{a\in S} P(a) = 1$

### Conditional Probability
- When two events E and F are independent, then by definition we have:

$$P(E, F) = P(E)\times P(F)$$

- If they are not necessarily independent (and if the probability of F is not zero), then
we define the probability of E “conditional on F” as:

$$P(E|F) = P(E, F) /P(F)$$

- You should think of this as the probability that E happens, given that we know that F happens. We often rewrite this as:

$$P(E, F) = P(E|F)\times P(F)$$

- When E and F are independent, you can check that this gives:

$$P(E|F) = P(E)$$

- which is the mathematical way of expressing that knowing F occurred gives us no additional information about whether E occurred.

### Example
- One common tricky example involves a family with two (unknown) children.
- If we assume that:
    1. Each child is equally likely to be a boy or a girl
    2. The gender of the second child is independent of the gender of the first child

- Then the event “two boys” has probability 1/4, the event “one girl, one boy” has probability 1/2, and the event “two girls” has probability 1/4.

- OK. Let us consider the following two questions:
    1. What is the probability of the event “both children are girls” (B) conditional on the event “the older child is a girl” (G)?
    2. What is the probability of the event “both children are girls” conditional on the event “at least one of the children is a girl” (L)?

- Once you have worked out your answers, let us simulate the events and check them.

```
import random
def random_kid():
    return random.choice(["boy", "girl"])
```

```
both_girls = 0
older_girl = 0
either_girl = 0
girl_boy = 0

random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1
    if younger != older:
        girl_boy += 1
```

```
print('The probability of "both children are girls (B) conditional on the event the older child is a girl (G)" is: ' + \
     str(both_girls / older_girl))
```

```
print('The probability of "both children are girls (B) conditional on the event at least one of the children is a girl (L)" is: ' + \
     str(both_girls / either_girl))
```

```
print('The probability of "one boy one girl (D) conditional on the event at least one of the children is a girl (L)" is: ' + \
     str(girl_boy / either_girl))
```

#### Why is it that, if you know at least one child is a girl, having one boy and one girl is twice as likely as having two girls?
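
One way to see the answer is to enumerate the four equally likely families exactly instead of simulating them. In the sketch below (helper names are illustrative), three families contain at least one girl, and two of those three have one child of each gender.

```python
from fractions import Fraction
from itertools import product

# (older, younger): four equally likely families
families = list(product(["boy", "girl"], repeat=2))

def conditional(event, given):
    """P(event | given), computed by counting equally likely outcomes."""
    given_outcomes = [f for f in families if given(f)]
    return Fraction(sum(1 for f in given_outcomes if event(f)), len(given_outcomes))

def both_girls_event(f):
    return f == ("girl", "girl")

def older_girl_event(f):
    return f[0] == "girl"

def at_least_one_girl(f):
    return "girl" in f

def one_of_each(f):
    return f[0] != f[1]

p_b_given_g = conditional(both_girls_event, older_girl_event)   # P(B | G)
p_b_given_l = conditional(both_girls_event, at_least_one_girl)  # P(B | L)
p_d_given_l = conditional(one_of_each, at_least_one_girl)       # P(D | L)
print(p_b_given_g, p_b_given_l, p_d_given_l)
```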

### Bayes’s Theorem
- One of the data scientist’s best friends is Bayes’s Theorem, which is a way of “reversing” conditional probabilities.
- Let’s say we need to know the probability of some event $E$ conditional on some other event $F$ occurring.
- But we only have information about the probability of $F$ conditional on $E$ occurring. Using the definition of conditional probability twice tells us that:

$$P(E|F) = \frac{P(E, F)}{P(F)} = \frac{P(F|E) P(E)} {P(F)}$$

- The event $F$ can be split into the two mutually exclusive events “$F$ and $E$” and “$F$ and not $E$.” If we write $\neg E$ for “not $E$” (i.e., “$E$ doesn’t happen”), then:

$$P(F) = P(F, E) + P(F, \neg E)$$

so that:

$$P(E|F) = \frac{P(F|E) P(E)}  {P(F|E) P(E) + P(F|\neg E) P(\neg E)}$$

which is how Bayes’s Theorem is often stated.

#### Exercise
- Imagine a certain disease that affects 1 in every 10,000 people.
- And imagine that there is a test for this disease that gives the correct result (“diseased” if you have
the disease, “nondiseased” if you don’t) 99% of the time.
- What is the probability of a person having the disease if the person has a positive test?

In [None]:
# The second term of the denominator is P(positive | healthy) * P(healthy) = 0.01 * 0.9999
round((0.99*0.0001) / (0.99*0.0001 + 0.01*0.9999), 6)

0.009804
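
As a cross-check, the same calculation can be wrapped in a small helper that follows the expanded form of Bayes's Theorem. This is a sketch with illustrative names. It reads "correct 99% of the time" as $P(\text{positive}|\text{disease}) = 0.99$ and $P(\text{positive}|\text{no disease}) = 0.01$, and it uses exact fractions to avoid rounding.

```python
from fractions import Fraction

def posterior(prior, p_pos_given_e, p_pos_given_not_e):
    """P(E | positive) from P(E), P(positive | E) and P(positive | not E)."""
    numerator = p_pos_given_e * prior
    return numerator / (numerator + p_pos_given_not_e * (1 - prior))

p_disease = Fraction(1, 10000)          # 1 in every 10,000 people
p_pos_given_disease = Fraction(99, 100)
p_pos_given_healthy = Fraction(1, 100)

p_disease_given_pos = posterior(p_disease, p_pos_given_disease, p_pos_given_healthy)
print(p_disease_given_pos, float(p_disease_given_pos))
```

Under these assumptions, fewer than 1% of people who test positive have the disease, because healthy people vastly outnumber diseased ones.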

### Random Variables
- A random variable is a variable whose possible values have an associated probability distribution.
- A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails.

### Expected Value
- The expected value $E[X]$ of a random variable $X$ is the average value of $X$ weighted by its probabilities:

$$E[X] = P(X = x_{1}) x_{1} + P(X=x_{2}) x_{2}+...+P(X=x_{n}) x_{n}$$

### Variance
- The variance of a random variable $X$ is the expected value of the squared deviation from the mean of $X$, $\mu= E[X]$:

$$Var(X) = E[(X-\mu)^2]$$

### Standard Deviation
- The standard deviation is the square root of the variance: $\sigma = \sqrt{Var(X)}$.
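
The three definitions above can be illustrated with a fair six-sided die. In this minimal sketch a distribution is represented as a dictionary mapping each value to its probability, which is a representation chosen here for illustration.

```python
import math

def expected_value(dist):
    """E[X] for a discrete distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    """Var(X) = E[(X - mu)^2]."""
    mu = expected_value(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

die = {face: 1 / 6 for face in range(1, 7)}

die_mean = expected_value(die)
die_var = variance(die)
die_sd = math.sqrt(die_var)
print(f"mean={die_mean:.3f}, variance={die_var:.3f}, sd={die_sd:.3f}")
```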

### Continuous Distributions
- A coin flip corresponds to a discrete distribution—one that associates positive probability with discrete outcomes.
- Often we’ll want to model distributions across a continuum
of outcomes. (For our purposes, these outcomes will always be real numbers, although that’s not always the case in real life.)
- For example, the uniform distribution
puts equal weight on all the numbers between 0 and 1.

- Because there are infinitely many numbers between 0 and 1, this means that the weight it assigns to individual points must necessarily be zero.
- For this reason, we represent a continuous distribution with a probability density function (pdf) such that
the probability of seeing a value in a certain interval equals the integral of the density function over the interval.

- We will often be more interested in the cumulative distribution function (cdf), which
gives the probability that a random variable is less than or equal to a certain value.

#### Exercise
- Write functions for the probability density function and the cumulative distribution function of the uniform distribution.
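
One possible solution, assuming the uniform distribution on the interval from 0 to 1 described earlier in this section:

```python
def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1)."""
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x):
    """Probability that a uniform random variable on [0, 1) is <= x."""
    if x < 0:
        return 0      # the variable is never less than 0
    elif x < 1:
        return x      # e.g. P(X <= 0.4) = 0.4
    else:
        return 1      # the variable is always less than 1

# Probability of landing in [0.2, 0.5] is the difference of the cdf values
p_interval = uniform_cdf(0.5) - uniform_cdf(0.2)
print(p_interval)
```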

### The Normal Distribution
- The normal distribution is the king of distributions. It is the classic bell curve–shaped distribution and is completely determined by two parameters: its mean $\mu$ and its
standard deviation $\sigma$. The mean indicates where the bell is centered, and the
standard deviation how “wide” it is.

- It has the probability density function:
$$
f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi} \sigma} \exp \left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

- which we can implement as:

```
import math

def normal_pdf(x, mu=0, sigma=1):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt_two_pi * sigma)
```

```
import math
import matplotlib.pyplot as plt
%matplotlib inline
xs = [x / 10.0 for x in range(-50, 50)]
plt.figure(figsize = (15, 9))
plt.plot(xs,[normal_pdf(x,sigma=1) for x in xs],'-',label='mu=0,sigma=1')
plt.plot(xs,[normal_pdf(x,sigma=2) for x in xs],'--',label='mu=0,sigma=2')
plt.plot(xs,[normal_pdf(x,sigma=0.5) for x in xs],':',label='mu=0,sigma=0.5')
plt.plot(xs,[normal_pdf(x,mu=-1) for x in xs],'-.',label='mu=-1,sigma=1')
plt.legend()
plt.title("Various Normal pdfs")
plt.grid()
plt.show()
```

- When $\mu = 0$ and $\sigma = 1$, it’s called the standard normal distribution.
- If $Z$ is a standard normal random variable, then it turns out that:
$$X = \sigma Z + \mu$$
is also normal but with mean $\mu$ and standard deviation $\sigma$.
- Conversely, if $X$ is a normal random variable with mean $\mu$ and standard deviation $\sigma$,
$$Z = (X - \mu) /\sigma$$
is a standard normal variable. This transformation is called **rescaling** (or standardization).
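
The two transformations can be checked empirically. The sketch below draws standard normal samples with `random.gauss`, rescales them to a mean of 5 and a standard deviation of 2 (values picked for illustration), and then standardizes them back.

```python
import random
import statistics

random.seed(0)
target_mu, target_sigma = 5.0, 2.0

# Z ~ N(0, 1), then X = sigma * Z + mu
z_samples = [random.gauss(0, 1) for _ in range(10000)]
x_samples = [target_sigma * z + target_mu for z in z_samples]

# Standardizing X recovers Z (up to floating-point error)
z_back = [(x - target_mu) / target_sigma for x in x_samples]

x_mean = statistics.mean(x_samples)
x_sd = statistics.stdev(x_samples)
print(f"sample mean of X = {x_mean:.2f}, sample sd of X = {x_sd:.2f}")
```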

### The Central Limit Theorem

- One reason the normal distribution is so useful is the central limit theorem, which says (in essence) that a random variable defined as the average of a large number of independent and identically distributed random variables is itself approximately normally
distributed.

- In particular, if $x_{1}, ..., x_{n}$ are independent and identically distributed random variables with mean $\mu$ and standard deviation $\sigma$, and if $n$ is large, then:
$$
(x_{1} + ... + x_{n})/n
$$
is approximately normally distributed with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$.
- Equivalently (but often more usefully),
$$
\frac{(x_{1} + ... + x_{n}) - \mu n}{\sigma \sqrt{n}}
$$
is approximately normally distributed with mean 0 and standard deviation 1.
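
A quick simulation can make the first statement concrete. This sketch averages $n = 50$ draws from the uniform distribution on $[0, 1)$, which has $\mu = 0.5$ and $\sigma = \sqrt{1/12}$, and compares the spread of those averages with the predicted $\sigma / \sqrt{n}$. The sample sizes are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)
n = 50             # number of variables averaged in each trial
num_trials = 5000  # number of averages we collect

sample_means = [
    sum(random.random() for _ in range(n)) / n
    for _ in range(num_trials)
]

uniform_mu = 0.5
uniform_sigma = math.sqrt(1 / 12)
predicted_sd = uniform_sigma / math.sqrt(n)

observed_mean = statistics.mean(sample_means)
observed_sd = statistics.stdev(sample_means)
print(f"mean of averages: {observed_mean:.4f} (predicted {uniform_mu})")
print(f"sd of averages:   {observed_sd:.4f} (predicted {predicted_sd:.4f})")
```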

### Example
- An easy way to illustrate this is by looking at binomial random variables, which have
two parameters n and p.
- A Binomial(n,p) random variable is simply the sum of n
independent Bernoulli(p) random variables, each of which equals 1 with probability p and 0 with probability 1 − p:

```
def bernoulli_trial(p):
    return 1 if random.random() < p else 0
def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))
```

- The mean of a Bernoulli(p) variable is $p$, and its standard deviation is $\sqrt{p(1 - p)}$.
- The central limit theorem says that as n gets large, a Binomial(n,p) variable is approximately a normal random variable with mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1 - p)}$. If we plot both, you can easily see the resemblance:

```
from collections import Counter

def normal_cdf(x, mu=0, sigma=1):
    """Cumulative distribution function of the normal distribution."""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def make_hist(p, n, num_points):
    """Plot Binomial(n, p) samples against their normal approximation."""
    data = [binomial(n, p) for _ in range(num_points)]
    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.figure(figsize=(15, 9))
    # Matplotlib 2.0+ centers each bar on its x value by default, so no offset is needed
    plt.bar(list(histogram.keys()),
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')
    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))
    # use a line chart to show the normal approximation
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma) for i in xs]
    plt.plot(xs, ys)
    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.grid()
    plt.show()
```

```
make_hist(0.75, 100, 10000)
```

# Retrieval Practice

## Naive Bayes Spam Filter

- Imagine now that we have a vocabulary of many words $w_{1}, ...,w_{n}$.
- To move this into the realm of probability theory, we’ll write $X_{i}$ for the event “a message contains the
word $w_{i}$.”
- Also imagine that (through some unspecified-at-this-point process) we’ve come up with an estimate $P(X_{i}|S)$ for the probability that a spam message contains
the ith word, and a similar estimate $P(X_{i}|\neg S)$ for the probability that a nonspam
message contains the ith word.

- Naive Bayes method:
$$
P(X_{1} = x_{1}, \ldots, X_{n} = x_{n}|S) = P(X_{1} = x_{1}|S) \times \cdots \times P(X_{n} = x_{n}|S)
$$

- The Naive Bayes assumption allows us to compute the joint probability on the left simply by multiplying together the individual probability estimates on the right, one for each vocabulary word.

- In practice, you usually want to avoid multiplying lots of probabilities together, to avoid a problem called underflow, in which computers don’t deal well with floating point
numbers that are too close to zero.

- Recalling from algebra that $\log (ab) = \log (a) + \log (b)$ and that $\exp (\log (x)) = x$, we usually compute $p_{1} \times \cdots \times p_{n}$ as the equivalent (but floating-point-friendlier):

$$
\exp (\log (p_{1}) + \cdots + \log (p_{n}))
$$
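
The underflow problem is easy to reproduce. In this sketch, 1,000 probabilities of 0.01 (made-up values) multiply to a number far too small for a float, while the sum of their logarithms stays well within range.

```python
import math

probabilities = [0.01] * 1000

# Direct multiplication: the true value is 1e-2000, which underflows to 0.0
naive_product = 1.0
for p in probabilities:
    naive_product *= p

# Summing logs keeps the magnitude representable
log_sum = sum(math.log(p) for p in probabilities)

print(naive_product)  # 0.0
print(log_sum)        # about -4605.17
```

Note that `math.exp(log_sum)` would underflow as well, so when comparing classes it is common to compare the log scores directly and exponentiate only after subtracting the largest one.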

- If we have a fair number of “training” messages labeled as spam and not-spam, an obvious first try is to estimate $P(X_{i}|S)$ simply as the fraction of spam messages containing
word $w_{i}$.

- To avoid the zero-probability problem (a word that never appears in the spam training messages would force the whole product to zero), we usually use some kind of smoothing. In particular, we’ll choose a pseudocount $k$ and estimate the probability of seeing the $i$th word in a spam message as:
$$
P(X_{i}|S) = \frac{k + \text{number of spams containing } w_{i}}{2k + \text{number of spams}}
$$

- Similarly for $P(X_{i}|\neg S)$.
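
Putting the pieces together, a from-scratch classifier might look like the sketch below. It is a minimal illustration rather than a reference implementation. The class name, the tokenizer, the toy messages, and the choice to estimate the prior $P(S)$ from the training fractions are all assumptions made here. Scores are accumulated in log space, and words absent from a message contribute $1 - P(X_i|\cdot)$.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

class NaiveBayesSpamFilter:
    def __init__(self, k=0.5):
        self.k = k                    # smoothing pseudocount
        self.tokens = set()           # vocabulary seen in training
        self.spam_counts = Counter()  # word -> number of spam messages containing it
        self.ham_counts = Counter()   # word -> number of non-spam messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, messages):
        """messages: iterable of (text, is_spam) pairs."""
        for text, is_spam in messages:
            if is_spam:
                self.n_spam += 1
            else:
                self.n_ham += 1
            for token in tokenize(text):
                self.tokens.add(token)
                if is_spam:
                    self.spam_counts[token] += 1
                else:
                    self.ham_counts[token] += 1

    def word_probabilities(self, token):
        """Smoothed estimates of P(token | spam) and P(token | not spam)."""
        p_spam = (self.k + self.spam_counts[token]) / (2 * self.k + self.n_spam)
        p_ham = (self.k + self.ham_counts[token]) / (2 * self.k + self.n_ham)
        return p_spam, p_ham

    def predict(self, text):
        """Return the estimated P(spam | text)."""
        words = tokenize(text)
        total = self.n_spam + self.n_ham
        log_spam = math.log(self.n_spam / total)  # log prior
        log_ham = math.log(self.n_ham / total)
        for token in self.tokens:
            p_spam, p_ham = self.word_probabilities(token)
            if token in words:
                log_spam += math.log(p_spam)
                log_ham += math.log(p_ham)
            else:
                log_spam += math.log(1 - p_spam)
                log_ham += math.log(1 - p_ham)
        # Subtract the larger log score before exponentiating to avoid underflow
        largest = max(log_spam, log_ham)
        spam_score = math.exp(log_spam - largest)
        ham_score = math.exp(log_ham - largest)
        return spam_score / (spam_score + ham_score)

# Hypothetical training messages
training_messages = [
    ("rolex gold watches cheap", True),
    ("cheap gold now", True),
    ("meeting notes attached", False),
    ("lunch meeting tomorrow", False),
    ("project notes and lunch", False),
]

spam_filter = NaiveBayesSpamFilter(k=0.5)
spam_filter.train(training_messages)
print(spam_filter.predict("cheap rolex"))
print(spam_filter.predict("meeting tomorrow"))
```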



## Use Scikit-Learn Multinomial Naive Bayes

- We assume that features were generated from a simple multinomial distribution.
- The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates.

- The idea is the same as in other naive Bayes variants (for example Gaussian naive Bayes, which models each class with a best-fit Gaussian), except that here we model the data distribution with a best-fit multinomial distribution.
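
To make the multinomial model concrete, here is a small pure-Python sketch on made-up documents. It estimates $P(w|c)$ as $(\text{count of } w \text{ in class } c + \alpha) / (\text{total words in class } c + \alpha |V|)$, the usual additive-smoothing estimate, and scores a document by the log prior plus the summed log word probabilities. All names and data are illustrative, and words outside the training vocabulary are simply ignored, which is a choice made for this sketch.

```python
import math
from collections import Counter

# Hypothetical training documents: (text, label)
train_docs = [
    ("engine wheel engine", "autos"),
    ("wheel brake", "autos"),
    ("orbit rocket orbit", "space"),
    ("rocket launch", "space"),
]
alpha = 1.0  # additive smoothing strength

vocab = sorted({w for text, _ in train_docs for w in text.split()})
word_counts = {}        # label -> Counter of word occurrences
doc_counts = Counter()  # label -> number of documents
for text, label in train_docs:
    word_counts.setdefault(label, Counter()).update(text.split())
    doc_counts[label] += 1

def log_posterior(text, label):
    """Unnormalised log P(label | text) under the multinomial model."""
    counts = word_counts[label]
    total_words = sum(counts.values())
    score = math.log(doc_counts[label] / len(train_docs))  # log prior
    for w in text.split():
        if w in vocab:
            score += math.log((counts[w] + alpha) / (total_words + alpha * len(vocab)))
    return score

def predict_label(text):
    """Label with the highest log posterior."""
    return max(word_counts, key=lambda label: log_posterior(text, label))

print(predict_label("rocket orbit launch"))
print(predict_label("brake engine"))
```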

### Example: Classifying Text

- One place where multinomial naive Bayes is often used is in text classification, where the features are related to word counts or frequencies within the documents to be classified.

- Here we will use the sparse word count features from the 20 Newsgroups corpus to show how we might classify these short documents into categories.

- Let's download the data and take a look at the target names:

```
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names
```

For simplicity here, we will select just a few of these categories, and download the training and testing set:

```
categories = ['comp.graphics', 'rec.autos', 'sci.space', 'talk.politics.misc']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

```

```
# Map target numbers to target names
target_num_name = {i: name for i, name in enumerate(train.target_names)}
target_num_name
```

```
train.keys()
```

Here is a representative entry from the data:

```
print(train.data[5])
```

```
train.target[5]
```

```
pd.unique(train.target)
```

```
test.data[2]
```

```
test.target[2]
```

- In order to use this data for machine learning, we need to be able to convert the content of each string into a vector of numbers.
- For this we will use the count vectorizer, and create a pipeline that attaches it to a multinomial naive Bayes classifier:

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
```

```
model_vectorizer = CountVectorizer()
clf = MultinomialNB()
pipe = make_pipeline(model_vectorizer, clf)
```

- With this pipeline, we can apply the model to the training data, and predict labels for the test data:

```
pipe.fit(train.data, train.target)
```

```
preds = pipe.predict(test.data)
```

- Now that we have predicted the labels for the test data, we can evaluate them to learn about the performance of the estimator.
- For example, here is the confusion matrix between the true and predicted labels for the test data:

```
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, preds)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
```

```
from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, support = precision_recall_fscore_support(test.target, preds)
```

```
# Print nicely with class names
for i, name in enumerate(test.target_names):
    print(f"{name:25s} | Precision: {precision[i]:.2f}  Recall: {recall[i]:.2f}  F1: {f1[i]:.2f}  Support: {support[i]}")
```

- Here's a quick utility function that will return the prediction for a single string:

```
def predict_category(s, train=train, pipe=pipe):
    pred = pipe.predict([s])
    return train.target_names[pred[0]]
```

Let's try it out:

In [None]:
predict_category('Honda Accord is a Sedan')

'rec.autos'

In [None]:
predict_category('discussion solar system')

'sci.space'

In [None]:
predict_category('determining the screen size')

'comp.graphics'

In [None]:
predict_category('George Washington was the first president of U.S.A.')

'talk.politics.misc'

- Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string; nevertheless, the result is striking.
- Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective.

## When to Use Naive Bayes

Because naive Bayesian classifiers make such stringent assumptions about data, they will generally not perform as well as a more complicated model.
That said, they have several advantages:

- They are extremely fast for both training and prediction
- They provide straightforward probabilistic prediction
- They are often very easily interpretable
- They have very few (if any) tunable parameters

These advantages mean a naive Bayesian classifier is often a good choice as an initial baseline classification.
If it performs suitably, then congratulations: you have a very fast, very interpretable classifier for your problem.
If it does not perform well, then you can begin exploring more sophisticated models, with some baseline knowledge of how well they should perform.

Naive Bayes classifiers tend to perform especially well in one of the following situations:

- When the naive assumptions actually match the data (very rare in practice)
- For very well-separated categories, when model complexity is less important
- For very high-dimensional data, when model complexity is less important

The last two points seem distinct, but they actually are related: as the dimension of a dataset grows, it is much less likely for any two points to be found close together (after all, they must be close in *every single dimension* to be close overall).
This means that clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information.
For this reason, simplistic classifiers like naive Bayes tend to work as well or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.

# Retrieval Practice

# Applying Machine Learning To Sentiment Analysis

## Obtaining the IMDb movie review dataset

The IMDb movie review set can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).
After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, `cd` into the download directory, and execute

`tar -zxf aclImdb_v1.tar.gz`

B) If you are working with Windows, download an archiver such as [7Zip](http://www.7-zip.org) to extract the files from the download archive.

**Optional code to download and unzip the dataset via Python:**

In [None]:
import os
import sys
import tarfile
import time
import urllib.request

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

if os.path.exists(target):
    os.remove(target)

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.**2 * duration)
    percent = count * block_size * 100. / total_size

    sys.stdout.write(f'\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB '
                     f'| {speed:.2f} MB/s | {duration:.2f} sec elapsed')
    sys.stdout.flush()


if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
    urllib.request.urlretrieve(source, target, reporthook)

100% | 80.23 MB | 9.33 MB/s | 8.60 sec elapsed

In [None]:
if not os.path.isdir('aclImdb'):

    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

## Preprocessing the movie dataset into more convenient format

Install pyprind by running the next code cell.

In [None]:
!pip install pyprind

Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl.metadata (1.1 kB)
Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3


In [None]:
import pyprind
import pandas as pd
import os
import sys
from packaging import version


# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}

# stream=2 writes to stderr; if the progress bar does not show, try stream=sys.stdout
pbar = pyprind.ProgBar(50000, stream=2)

df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()

            if version.parse(pd.__version__) >= version.parse("1.3.2"):
                x = pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
                df = pd.concat([df, x], ignore_index=True)

            else:
                df = df.append([[txt, labels[l]]],
                               ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:31


Shuffling the DataFrame:

In [None]:
import numpy as np


if version.parse(pd.__version__) >= version.parse("1.3.2"):
    df = df.sample(frac=1, random_state=0).reset_index(drop=True)

else:
    np.random.seed(0)
    df = df.reindex(np.random.permutation(df.index))

Optional: Saving the assembled data as CSV file:

In [None]:
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [None]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')

# the following is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})

df.head(3)

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |


In [None]:
df.shape

(50000, 2)

## Transforming documents into feature vectors

By calling the `fit_transform` method on `CountVectorizer` in the next cell, we construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

Now let us print the contents of the vocabulary to get a better understanding of the underlying concepts:

In [None]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


- As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary, which maps the unique words to integer indices.
- Each index position in the feature vectors corresponds to the integer value stored for a word in the `CountVectorizer` vocabulary.
- For example, the first feature at index position 0 represents the count of the word "and", which only occurs in the last document, and the word "is" at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences.
- The values in the feature vectors are also called the raw term frequencies: *tf(t, d)*, the number of times a term *t* occurs in a document *d*.
- Next, let us print the feature vectors that we just created:

In [None]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
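
To see what the vectorizer is doing, the same vocabulary and count matrix can be rebuilt in plain Python. This sketch assumes a tokenizer that lowercases the text and keeps runs of two or more word characters, which mirrors `CountVectorizer`'s default settings. The helper names are illustrative.

```python
import re
from collections import Counter

sentences = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def simple_tokenize(doc):
    """Lowercase and keep tokens made of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Assign indices to the unique words in alphabetical order
unique_words = sorted({token for doc in sentences for token in simple_tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(unique_words)}

def to_vector(doc):
    """Raw term frequencies of `doc`, ordered by vocabulary index."""
    counts = Counter(simple_tokenize(doc))
    return [counts[word] for word in unique_words]

bag_manual = [to_vector(doc) for doc in sentences]
print(vocabulary)
for row in bag_manual:
    print(row)
```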


## Assessing word relevancy via term frequency-inverse document frequency

In [None]:
np.set_printoptions(precision=2)

- When we are analyzing text data, we often encounter words that occur across multiple documents from both classes.
- Those frequently occurring words typically don't contain useful or discriminatory information.
- In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors.
- The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

- Here, tf(t, d) is the term frequency that we introduced in the previous section, and the inverse document frequency *idf(t, d)* can be calculated as:
$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$
where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*.
- Note that adding the constant 1 to the denominator is optional and serves to avoid division by zero for terms that occur in none of the training examples; the log is used to ensure that low document frequencies are not given too much weight.

- Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


- As we saw in the previous subsection, the word "is" had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.


- However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the `TfidfTransformer` calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The idf equation implemented in scikit-learn (with `smooth_idf=True`, using the natural logarithm) is:

$$\text{idf} (t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

- The tf-idf equation that was implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

- While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the `TfidfTransformer` normalizes the tf-idfs directly.

- By default (`norm='l2'`), scikit-learn's `TfidfTransformer` applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector *v* by its L2-norm:

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$

- To make sure that we understand how `TfidfTransformer` works, let us walk through an example and calculate the tf-idf of the word "is" in the 3rd document.

- The word "is" has a term frequency of 3 (tf = 3) in document 3 ($d_3$), and the document frequency of this term is 3 since the term "is" occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:

$$\text{idf}("is", d_3) = \log \frac{1+3}{1+3} = 0$$

- Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

$$\text{tf-idf}("is", d_3)= 3 \times (0+1) = 3$$

In [None]:
tf_is = 3
n_docs = 3
idf_is = np.log((n_docs+1) / (3+1))
tfidf_is = tf_is * (idf_is + 1)
print(f'tf-idf of term "is" = {tfidf_is:.2f}')

tf-idf of term "is" = 3.00


- If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29].
- However, we notice that the values in this feature vector are different from the values that we obtained from the `TfidfTransformer` that we used previously.
- The final step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:

$$\text{tf-idf}_{norm} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$=[0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{norm}("is", d_3) = 0.45$$

- As we can see, the results match the results returned by scikit-learn's `TfidfTransformer` (below). Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.

In [None]:
tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf

array([3.39, 3.  , 3.39, 1.29, 1.29, 1.29, 2.  , 1.69, 1.29])

In [None]:
l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf

array([0.5 , 0.45, 0.5 , 0.19, 0.19, 0.19, 0.3 , 0.25, 0.19])
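
- The steps above (smoothed idf, tf-idf, L2-normalization) can also be sketched from scratch with NumPy. This is only an illustration, not scikit-learn's actual implementation. The `counts` matrix is assumed to match the `CountVectorizer` output for the three example documents, and the names `tfidf_from_counts` and `manual_tfidf` are made up for this sketch.

```python
import numpy as np

# Assumed term counts for the three example documents
# (columns: and, is, one, shining, sun, sweet, the, two, weather).
counts = np.array([
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
])


def tfidf_from_counts(counts):
    """Smoothed tf-idf followed by row-wise L2-normalization."""
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)            # document frequency per term
    idf = np.log((1 + n_docs) / (1 + df))    # smoothed idf
    raw = counts * (idf + 1)                 # tf * (idf + 1)
    norms = np.sqrt((raw ** 2).sum(axis=1, keepdims=True))
    return raw / norms                       # each row now has length 1


manual_tfidf = tfidf_from_counts(counts)
print(manual_tfidf.round(2))
```

- Under these assumptions, the last row of `manual_tfidf`, rounded to two decimals, should agree with the normalized vector for the 3rd document shown above.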

## Cleaning text data

In [None]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


In [None]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [None]:
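import re


def preprocessor(text):
    # Remove HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons like :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace non-word characters with spaces, then append
    # the emoticons with the "nose" character (-) removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text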

In [None]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [None]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
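
- To see what each regular expression in `preprocessor` contributes, the sketch below runs the same steps one at a time on the test string. The intermediate variable names (`no_html`, `words_only`, `cleaned`) are made up for illustration.

```python
import re

sample = "</a>This :) is :( a test :-)!"

# Step 1: drop anything that looks like an HTML tag.
no_html = re.sub(r'<[^>]*>', '', sample)

# Step 2: collect emoticons (eyes, optional nose, mouth) before they are lost.
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and collapse every run of non-word characters to one space.
words_only = re.sub(r'[\W]+', ' ', no_html.lower())

# Step 4: append the emoticons, with the nose character removed.
cleaned = words_only + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words_only))
print(repr(cleaned))
```

- Note that step 3 leaves a trailing space in `words_only`, which is why the emoticons end up separated from the last word.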

In [None]:
df['review'] = df['review'].apply(preprocessor)

## Processing documents into tokens

In [None]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [None]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [None]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [None]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']
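
- Stop-word removal itself is just a membership filter. The sketch below uses a tiny, hypothetical stop list (`toy_stop`, not NLTK's) and no stemming, so that it runs without any downloads. Storing the stop words in a `set` makes each lookup fast, which can matter when filtering many documents.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
toy_stop = {'a', 'and', 'the', 'is', 'they'}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = 'a runner likes running and runs a lot'.split()
filtered = remove_stop_words(tokens, toy_stop)
print(filtered)
```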

# Training a logistic regression model for document classification

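The reviews were already stripped of HTML and punctuation in the cleaning step above, which speeds up the grid search later. Now split the DataFrame into the first 25,000 reviews for training and the remaining reviews for testing:

In [None]:
# .loc slices are label-based and include the end label,
# so stop at 24999 to keep row 25000 out of the training set
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values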

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

small_param_grid = [{
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [None],
    'vect__tokenizer': [tokenizer],
    'clf__penalty': ['l2'],
    'clf__C': [1.0]
}]

lr_tfidf = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(solver='liblinear'))
])

gs_lr_tfidf = GridSearchCV(
    lr_tfidf,
    small_param_grid,
    scoring='accuracy',
    cv=5,
    verbose=1,
    n_jobs=-1
)

**Important Note about `n_jobs`**

- Please note that it is highly recommended to use `n_jobs=-1` (instead of `n_jobs=1`) in the previous code example to utilize all available cores on your machine and speed up the grid search.
- However, some Windows users reported issues when running the previous code with `n_jobs=-1`, related to pickling the `tokenizer` and `tokenizer_porter` functions for multiprocessing on Windows. If you run into this, one workaround is to fall back to `n_jobs=1`.
- Another workaround is to replace the tokenizer functions in the parameter grid (`[tokenizer]` above, or `[tokenizer, tokenizer_porter]` in a larger grid) with `[str.split]`. Note, however, that plain `str.split` does not support stemming.

In [None]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits




In [None]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

Best parameter set: {'clf__C': 1.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7ae126ff22a0>}
CV Accuracy: 0.889


In [None]:
clf = gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Test Accuracy: 0.894


# Retrieval Practice

## comment:
    
- Please note that `gs_lr_tfidf.best_score_` is the average k-fold cross-validation score. That is, if we have a `GridSearchCV` object with 5-fold cross-validation (like the one above), the `best_score_` attribute returns the average score of the best model over the 5 folds. To illustrate this with an example:

In [None]:
from sklearn.linear_model import LogisticRegression
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

np.random.seed(0)
np.set_printoptions(precision=6)
y = [np.random.randint(3) for i in range(25)]
X = (y + np.random.randn(25)).reshape(-1, 1)

cv5_idx = list(StratifiedKFold(n_splits=5, shuffle=False).split(X, y))

lr = LogisticRegression()
cross_val_score(lr, X, y, cv=cv5_idx)

array([0.6, 0.4, 0.6, 0.2, 0.6])

- By executing the code above, we created a simple data set of random integers that represent our class labels. Next, we fed the indices of 5 cross-validation folds (`cv5_idx`) to the `cross_val_score` scorer, which returned 5 accuracy scores; these are the 5 accuracy values for the 5 test folds.

- Next, let us use the `GridSearchCV` object and feed it the same 5 cross-validation sets (via the pre-generated `cv5_idx` indices):

In [None]:
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression()
gs = GridSearchCV(lr, {}, cv=cv5_idx, verbose=3).fit(X, y)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END ..................................., score=0.600 total time=   0.0s
[CV 2/5] END ..................................., score=0.400 total time=   0.0s
[CV 3/5] END ..................................., score=0.600 total time=   0.0s
[CV 4/5] END ..................................., score=0.200 total time=   0.0s
[CV 5/5] END ..................................., score=0.600 total time=   0.0s


- As we can see, the scores for the 5 folds are exactly the same as the ones from `cross_val_score` earlier.

- Now, the `best_score_` attribute of the `GridSearchCV` object, which becomes available after `fit`ting, returns the average accuracy score of the best model:

In [None]:
gs.best_score_

np.float64(0.48)

- As we can see, the result above is consistent with the average score computed with `cross_val_score`.

In [None]:
lr = LogisticRegression()
cross_val_score(lr, X, y, cv=cv5_idx).mean()

np.float64(0.48)

# Working with bigger data - online algorithms and out-of-core learning

In [None]:
# This cell is not contained in the book but
# added for convenience so that the notebook
# can be executed starting here, without
# executing prior code in this notebook

import os
import gzip


if not os.path.isfile('movie_data.csv'):
    if not os.path.isfile('movie_data.csv.gz'):
        print('Please place a copy of the movie_data.csv.gz '
              'in this directory. You can obtain it by '
              'a) executing the code in the beginning of this '
              'notebook or b) by downloading it from GitHub: '
              'https://github.com/rasbt/machine-learning-book/'
              'blob/main/ch08/movie_data.csv.gz')
    else:
        with gzip.open('movie_data.csv.gz', 'rb') as in_f, \
                open('movie_data.csv', 'wb') as out_f:
            out_f.write(in_f.read())

In [None]:
import numpy as np
import re
from nltk.corpus import stopwords


# `stop` is defined as earlier in this notebook.
# It is repeated here for convenience, so that this section
# can be run standalone without executing the prior code
stop = stopwords.words('english')


def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
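
- The slicing in `stream_docs` relies on every line ending in `,<label>\n` with a single-digit label. The sketch below checks that logic on two made-up rows held in memory. `fake_csv` and `stream_lines` are hypothetical names, and the `review,sentiment` layout is an assumption about the CSV file.

```python
import io

# Two made-up rows in the assumed "review,sentiment" layout, header included.
fake_csv = io.StringIO(
    'review,sentiment\n'
    '"Great movie, loved it",1\n'
    '"Dull and far too long",0\n'
)


def stream_lines(handle):
    """Yield (text, label) pairs from an open file-like object."""
    next(handle)  # skip the header row
    for line in handle:
        # line[-1] is the newline, line[-2] the label digit, line[-3] the comma.
        yield line[:-3], int(line[-2])


parsed = list(stream_lines(fake_csv))
print(parsed)
```

- Note that the surrounding double quotes stay part of the text; only the trailing comma, label, and newline are cut off.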

In [None]:
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [None]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
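
- A quick way to check the behavior of `get_minibatch` is to feed it a tiny in-memory iterator. The sketch repeats the function so that it runs on its own; `toy_stream` and its three documents are made up.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # The stream ran dry before the batch was full.
        return None, None
    return docs, y


toy_stream = iter([('doc a', 1), ('doc b', 0), ('doc c', 1)])

first = get_minibatch(toy_stream, size=2)   # a full batch of two documents
second = get_minibatch(toy_stream, size=2)  # only one document is left

print(first)
print(second)
```

- As the second call shows, a final batch that cannot be filled completely is discarded rather than returned as a smaller batch.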

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
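
- The idea behind a hashing vectorizer can be sketched in a few lines: each token is hashed to a column index, so no vocabulary has to be kept in memory. The sketch below is a simplified illustration, not scikit-learn's implementation. It uses MD5 from the standard library as a stand-in hash (scikit-learn uses MurmurHash3), and the function name `hashed_counts` and the tiny `n_features=16` are assumptions made for readability.

```python
import hashlib


def hashed_counts(tokens, n_features=16):
    """Map tokens to a fixed-length count vector via hashing."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        idx = int(digest, 16) % n_features   # column index for this token
        vec[idx] += 1                        # different tokens may collide
    return vec


vec_a = hashed_counts('the sun is shining'.split())
vec_b = hashed_counts('the sun is shining'.split())
print(vec_a)
```

- Because the index depends only on the token itself, the same text always yields the same vector, and new documents can be transformed without a prior `fit`. A large `n_features` (such as the `2**21` used above) keeps hash collisions rare.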

In [None]:
# loss='log_loss' trains a logistic regression model with SGD
# (older scikit-learn releases spelled this loss 'log')
clf = SGDClassifier(loss='log_loss', random_state=1)


doc_stream = stream_docs(path='movie_data.csv')

In [None]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:36
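
- What `partial_fit` does conceptually is update the model weights from one minibatch at a time, without revisiting earlier data. The sketch below shows this with a tiny pure-Python logistic regression on made-up one-dimensional data. `TinyOnlineLogReg` and `toy_batches` are hypothetical stand-ins and far simpler than `SGDClassifier`.

```python
import math


class TinyOnlineLogReg:
    """Hypothetical minimal online learner with a partial_fit method."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict_proba(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One stochastic gradient step on the log loss per example.
        for x, target in zip(X, y):
            error = target - self.predict_proba(x)
            self.w = [wi + self.lr * error * xi for wi, xi in zip(self.w, x)]
        return self


def toy_batches(n_batches, size):
    """Yield deterministic minibatches; the label is 1 when the feature is positive."""
    i = 0
    for _batch in range(n_batches):
        X, y = [], []
        for _sample in range(size):
            v = ((i * 37) % 21 - 10) / 10   # values between -1 and 1
            X.append([1.0, v])              # bias term plus one feature
            y.append(1 if v > 0 else 0)
            i += 1
        yield X, y


model = TinyOnlineLogReg(n_features=2)
for X_batch, y_batch in toy_batches(n_batches=10, size=20):
    model.partial_fit(X_batch, y_batch)

print(model.predict_proba([1.0, 0.8]) > 0.5)
print(model.predict_proba([1.0, -0.8]) > 0.5)
```

- Only the current minibatch and the weight vector are held in memory at any time, which is the property that makes out-of-core learning possible.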


In [None]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.868
