# Part 1: Introduction to text classification

## 1. Getting started with Naive Bayes Text Classification

#### Probabilistic Model of Classification

In a probabilistic classification model we want to estimate $P(c|x)$, the probability that a sample $x$ belongs to class $c$. Naive Bayes is one such probabilistic classifier: it uses Bayes' Rule to classify samples, and it is called _"naive"_ because it assumes that all features of $x$ are conditionally independent given the class.

#### Bayes' Rule

$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$

#### Text Classification using Naive Bayes classifier

Consider the task of classifying text documents as expressing positive or negative sentiment. We will design the Naive Bayes classifier for this problem as follows:

Samples are text documents, and their features are the words that make up these documents.

- Each document $d$ is a sequence of words, $d = w_1w_2...w_n$, where $w_i$ are the tokens of the document and $n$ is the total number of tokens in the document $d$.

- The training dataset consists of many (document, sentiment) pairs, $\{d_i, s_i\}$.

- Each document $d_i$ is associated with a sentiment $s_i \in \{0,1\}$, $0$ being negative sentiment and $1$ being positive sentiment.

Using **Bayes' Rule** we have

$p(s|d) = \frac{p(d|s)p(s)}{p(d|s)p(s) + p(d|\bar{s})p(\bar{s})}$

And from the **independence assumption** of features

$p(d|s) = p(w_1,w_2,..., w_n|s) = p(w_1|s)p(w_2|s)...p(w_n|s)$

Also, the **IMDb reviews dataset** that we use here contains an equal number of positive and negative reviews, so $p(s) = 0.5$ and $p(\bar{s})=0.5$.

This simplifies our formulation for $p(s|d)$:

$p(s|d) = \frac{p(d|s)}{p(d|s) + p(d|\bar{s})}$

If we use a threshold of $p_T(s|d) = 0.5$ to decide the final label, the model simplifies to

$y=
    \begin{cases}
      1, & \text{if } p(d|s=1) \geq p(d|s=0)\\
      0, & \text{otherwise}
    \end{cases}$
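
As a minimal sketch of the decision rule above, the snippet below classifies a toy document. The per-word likelihoods are made-up values chosen only for illustration, and the names `p_word_given_pos`, `p_word_given_neg`, `likelihood` and `classify` are hypothetical; they are not part of the notebook's code.

```python
import math

# Hypothetical per-word likelihoods p(w|s); the numbers are invented for this example.
p_word_given_pos = {"great": 0.05, "plot": 0.02, "boring": 0.001}
p_word_given_neg = {"great": 0.01, "plot": 0.02, "boring": 0.03}


def likelihood(tokens, p_word):
    """p(d|s) under the independence assumption: the product of p(w_i|s)."""
    return math.prod(p_word[w] for w in tokens)


def classify(tokens):
    """Return (label, p(s=1|d)) assuming equal class priors."""
    p_pos = likelihood(tokens, p_word_given_pos)
    p_neg = likelihood(tokens, p_word_given_neg)
    posterior_pos = p_pos / (p_pos + p_neg)  # the equal priors cancel out
    label = 1 if p_pos >= p_neg else 0
    return label, posterior_pos


label, posterior = classify(["great", "plot"])
print(label, round(posterior, 2))  # label 1, posterior of about 0.83
```

With equal priors, comparing the two likelihoods is the same as checking whether the posterior is at least 0.5.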

#### A measure for numerical stability

Each $p(w_i|s)$ is very small, and the product of many such numbers needed for $p(d|s)$ quickly underflows: even double-precision floating point cannot represent the result, and it becomes zero. Hence, for numerical stability, we work with log probabilities,

$\log p(d|s) = \log p(w_1,w_2,..., w_n|s) = \log p(w_1|s) + \log p(w_2|s) + ...+ \log p(w_n|s)$
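
The underflow is easy to reproduce. The sketch below assumes a hypothetical document of 1000 tokens, each with likelihood $10^{-4}$ (illustrative values, not taken from the dataset), and compares the plain product with the sum of logs.

```python
import math

# Hypothetical document: 1000 tokens, each with likelihood 1e-4.
probs = [1e-4] * 1000

# Multiplying directly underflows: the true value 1e-4000 is far below
# the smallest positive double, so the running product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -9210.34
```

Because the logarithm is monotonic, comparing $\log p(d|s=1)$ with $\log p(d|s=0)$ gives the same decision as comparing the probabilities themselves.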

More on Naive Bayes for sentiment analysis: GitHub repository [bsantraigi/Sentiment Analysis](https://github.com/bsantraigi/Sentiment-Analysis)

In [2]:
# Import 'os' for preliminary tasks like directory listing etc.
import os

# Import re for regex string matching
import re

# Import nltk for word tokenization
import nltk

# Import Counter and defaultdict from Python's standard library
# Counter - maintains a count of each element
# defaultdict - dictionary that creates a default value for missing keys instead of raising KeyError
from collections import Counter, defaultdict

# Import tqdm for progress bars (tqdm.notebook replaces the deprecated top-level tqdm_notebook)
from tqdm.notebook import tqdm as tqdm_notebook

# Import numpy for different mathematical operations on arrays / matrices
import numpy as np

In [3]:
# Install the nltk punkt component for tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Daan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Download the data

In [4]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz -P data/

'wget' is not recognized as an internal or external command,
operable program or batch file.


### Extract the data

In [5]:
%%time
!tar -xzf data/aclImdb_v1.tar.gz -C data/

CPU times: total: 0 ns
Wall time: 76 ms


tar: Error opening archive: Failed to open 'data/aclImdb_v1.tar.gz'
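
The two shell commands above depend on `wget` and `tar` being installed, and the recorded output shows both failing on this Windows machine. A platform-independent alternative, sketched here with only the standard library, might look as follows. The helper name `download_and_extract` and the constant `DATA_URL` are hypothetical, and the download is large, so the call is left commented out.

```python
import os
import tarfile
import urllib.request

DATA_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download_and_extract(url=DATA_URL, target_dir="data"):
    """Download the archive into target_dir (unless it is already there) and unpack it."""
    os.makedirs(target_dir, exist_ok=True)
    archive_path = os.path.join(target_dir, os.path.basename(url))

    # Skip the download when a previous run already fetched the archive.
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)

    # Unpack next to the archive; this creates target_dir/aclImdb/ for the IMDb data.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    return archive_path


# download_and_extract()  # uncomment to fetch and unpack the dataset
```

After a successful run the reviews should be under `data/aclImdb/`, the folder the following cells expect.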


### Data Samples
- Dataset is split into two parts for training and testing
- Positive and negative samples are organized in individual folders
- Each sample document is stored in a .txt file

### Let's read in the data

In [6]:
data_folder = 'data/aclImdb/'

In [7]:
rp = os.path.join(data_folder, 'train/pos')
train_positive = [os.path.join(rp, f) for f in os.listdir(rp)]
rp = os.path.join(data_folder, 'train/neg')
train_negative = [os.path.join(rp, f) for f in os.listdir(rp)]

rp = os.path.join(data_folder, 'test/pos')
test_positive = [os.path.join(rp, f) for f in os.listdir(rp)]
rp = os.path.join(data_folder, 'test/neg')
test_negative = [os.path.join(rp, f) for f in os.listdir(rp)]

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'data/aclImdb/train/pos'

### Regex for cleaning HTML tags
- The pattern `<.*?>` matches anything between a pair of angle brackets. The lazy quantifier `*?` matches as few characters as possible, which makes sure we match only one HTML tag at a time.

In [None]:
re_html_cleaner = re.compile(r"<.*?>")
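
To see why the lazy quantifier matters, this small sketch compares it with the greedy `<.*>` on a made-up review snippet (the sample string is illustrative only).

```python
import re

sample = "A<br />B<br />C"

# Greedy: .* runs from the first "<" to the last ">", swallowing the text in between.
greedy = re.sub(r"<.*>", " ", sample)

# Lazy: .*? stops at the first ">", so each tag is replaced on its own.
lazy = re.sub(r"<.*?>", " ", sample)

print(repr(greedy))  # 'A C'
print(repr(lazy))    # 'A B C'
```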

#### Limit number of samples
To quickly train a small model, consider setting `n_train` and `n_test` to relatively small numbers, e.g. `1000`. Set
`n_train = n_test = None` to use all the samples available (the slice `[:None]` keeps the whole list, whereas `[:-1]` would drop the last file).

In [None]:
n_train = 25000
n_test = 25000

### (Conditional) Unigram Counter
- Counts how often each token occurs in the positive and in the negative training documents. These counts are later normalized into the empirical distributions $p(w|s=1)$ and $p(w|s=0)$.

In [None]:
# Distribution of word tokens in positive samples
positive_word_counts = Counter()

for _fname in tqdm_notebook(train_positive[:n_train], desc="Crunching +ve samples: "):
    with open(_fname, encoding="utf-8") as f:
        text = f.read().strip()
        text = re_html_cleaner.sub(" ", text)
        positive_word_counts += Counter(nltk.word_tokenize(text))

# Distribution of word tokens in negative samples
negative_word_counts = Counter()

for _fname in tqdm_notebook(train_negative[:n_train], desc="Crunching -ve samples: "):
    with open(_fname, encoding="utf-8") as f:
        text = f.read().strip()
        text = re_html_cleaner.sub(" ", text)
        negative_word_counts += Counter(nltk.word_tokenize(text))
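
The two loops above differ only in their input list, so they could be folded into one helper. The sketch below is an illustration rather than the notebook's own code: `count_tokens` and its `tokenizer` parameter are hypothetical names, and `str.split` stands in for `nltk.word_tokenize` so that the block runs without NLTK data. It uses `Counter.update`, which adds the new counts in place.

```python
import re
from collections import Counter

HTML_TAG = re.compile(r"<.*?>")


def count_tokens(paths, tokenizer=str.split):
    """Count tokens over all files in paths after replacing HTML tags with spaces."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = HTML_TAG.sub(" ", f.read().strip())
        # update() adds the token counts of this document to the running total.
        counts.update(tokenizer(text))
    return counts
```

In the notebook this could be called as `count_tokens(train_positive[:n_train], tokenizer=nltk.word_tokenize)`, and likewise for the negative files.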

In [None]:
print('Top k frequent words from positive class:\n\n')
for w, c in positive_word_counts.most_common(10):
    print(f"{w}\t{c}")

print('\n\nTop k frequent words from negative class:\n\n')
for w, c in negative_word_counts.most_common(10):
    print(f"{w}\t{c}")

#### Unigram counts to probability distribution

$p(w|s) = \frac{N_{s,w}}{N_{s,*}} = \frac{N_{s,w}}{\sum_{w' \in W}N_{s,w'}}$

#### Additive Smoothing
- Note that if some token $u$, unseen in the training documents, occurs in a test document, $p(doc_{test}|s)$ becomes $0$ because $N_{s,u} = 0$ for that token.
- We apply _Additive Smoothing_ to prevent the probability from going to zero.

$p(w|s) = \frac{\alpha + N_{s,w}}{\sum_{w' \in W}(\alpha + N_{s,w'})} = \frac{\alpha + N_{s,w}}{\alpha V + \sum_{w' \in W}N_{s,w'}}$

where $V = |W|$ is the total vocabulary size.
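
A toy instance of the smoothed estimate, assuming a three-word vocabulary and made-up counts (the function name `smoothed_probs` is hypothetical), shows that a token with zero count still receives a small non-zero probability and that the distribution still sums to one.

```python
from collections import Counter


def smoothed_probs(counts, vocab, alpha=0.1):
    """p(w|s) = (alpha + N_{s,w}) / (alpha * V + sum over w' of N_{s,w'})."""
    total = sum(counts[w] for w in vocab)
    denom = alpha * len(vocab) + total
    return {w: (alpha + counts[w]) / denom for w in vocab}


# Made-up counts for one class; "dull" never occurs in this class.
counts = Counter({"good": 3, "fun": 1})
vocab = ["good", "fun", "dull"]

probs = smoothed_probs(counts, vocab)
print(probs["dull"] > 0)    # True
print(sum(probs.values()))  # 1.0 up to floating-point rounding
```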

In [None]:
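len_corpus_pos = sum(positive_word_counts.values())
len_corpus_neg = sum(negative_word_counts.values())
# V is the size of the shared vocabulary W (tokens seen in either class),
# so both class distributions are smoothed over the same set of tokens
V = len(positive_word_counts.keys() | negative_word_counts.keys())
alpha = 0.1
# Tokens missing from a class have N_{s,w} = 0, so their default
# log-probability is log(alpha / (alpha*V + N_{s,*}))
log_p_vocab_pos = defaultdict(
    lambda: np.log(alpha/(V*alpha + len_corpus_pos)),
    {w:np.log((alpha + c)/(V*alpha + len_corpus_pos)) for w,c in positive_word_counts.items()}
)
log_p_vocab_neg = defaultdict(
    lambda: np.log(alpha/(V*alpha + len_corpus_neg)),
    {w:np.log((alpha + c)/(V*alpha + len_corpus_neg)) for w,c in negative_word_counts.items()}
)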

In [None]:
p_data_pos = len(train_positive)/(len(train_positive) + len(train_negative))
print(f"Prob. of +ve sentiment in our dataset: {p_data_pos}")

#### get_prob_pos(doc)

A function that accepts a document string as input, tokenizes it and computes the log-likelihoods $\log p(d|s=1)$ and $\log p(d|s=0)$. It returns 1 if $p(d|s=1) \geq p(d|s=0)$, otherwise 0. Despite its name, it returns a hard label rather than a probability.

In [None]:
def get_prob_pos(doc):
    text = doc.strip()
    text = re_html_cleaner.sub(" ", text)
    tokens = nltk.word_tokenize(text)
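    # Accumulate log-likelihoods, starting from log(1) = 0
    p_pos = 0.0
    p_neg = 0.0
    for token in tokens:
        p_pos += log_p_vocab_pos[token]
        p_neg += log_p_vocab_neg[token]

    # Equal priors cancel, so compare the class log-likelihoods directly
    return 1.0*(p_pos >= p_neg)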
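
If a probability is wanted instead of a hard label, the simplified posterior $p(s|d) = \frac{p(d|s)}{p(d|s) + p(d|\bar{s})}$ can be evaluated from the two log-likelihoods without leaving log space. The following is a minimal sketch under the equal-prior assumption used in this notebook. The helper name `posterior_from_logs` is hypothetical, and the cut-off of 700 is just a safe margin below the point where `math.exp` overflows.

```python
import math


def posterior_from_logs(log_p_pos, log_p_neg):
    """p(s=1|d) = 1 / (1 + exp(log p(d|s=0) - log p(d|s=1))), assuming equal priors."""
    diff = log_p_neg - log_p_pos
    if diff > 700:  # exp(diff) would overflow; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


print(posterior_from_logs(-10.0, -10.0))  # 0.5
```

Only the difference of the two log-likelihoods enters the formula, so it still works for long documents whose individual likelihoods would underflow.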

In [None]:
results = []
for _fname in tqdm_notebook(test_positive[:n_test], desc="Classifying test data: "):
    with open(_fname, encoding="utf-8") as f:
        results.append((1, get_prob_pos(f.read())))


for _fname in tqdm_notebook(test_negative[:n_test], desc="Classifying test data: "):
    with open(_fname, encoding="utf-8") as f:
        results.append((0, get_prob_pos(f.read())))

### Performance evaluation of our model

**Accuracy:** Overall performance of the model: the fraction of all test samples that were labelled correctly.

**Recall:** Out of all +ve samples in the test set, the fraction that the model labelled +ve.

**Precision:** How precise is the model? Out of all samples that the model labelled +ve, the fraction that are actually +ve.

In [None]:
true_pos = 0
false_pos = 0
true_neg = 0
false_neg = 0
for true_label, pred_label in results:
    if true_label == 1 and pred_label == 1:
        true_pos += 1
    elif true_label == 1 and pred_label == 0:
        false_neg += 1
    elif true_label == 0 and pred_label == 1:
        false_pos += 1
    elif true_label == 0 and pred_label == 0:
        true_neg += 1

In [None]:
print(f"Accuracy: {(true_pos + true_neg)/(true_pos + true_neg + false_pos + false_neg):0.4F}")
print(f"Recall: {(true_pos)/(true_pos + false_neg):0.4F}")
print(f"Precision: {(true_pos)/(true_pos + false_pos):0.4F}")

### Classification metrics using sklearn package

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
results[:10]

In [None]:
u, v = zip((1, 1), (1, 0), (0, 1), (0, 1))
print(u)
print(v)

In [None]:
# Use zip to collect the first and second elements of each tuple in the results
# list into two separate tuples
y_true, y_pred = zip(*results)

In [None]:
print("Accuracy (using sklearn package):", accuracy_score(y_true, y_pred))

In [None]:
# ----------------
# Confusion Matrix
# ----------------
#
# ----------
#  |  0 |  1 <- pred
# ----------
# 0| tn | fp
# ----------
# 1| fn | tp
# ----------

print(confusion_matrix(y_true, y_pred))

### F1-score

The F1-score is the harmonic mean of precision and recall:

$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$


In [None]:
# Compute the F1 score, also known as balanced F-score
precision = (true_pos)/(true_pos + false_pos)
recall = (true_pos)/(true_pos + false_neg)
F1_score = 2*(precision*recall)/(precision+recall)
print("F1 Score: ", F1_score)
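
The divisions above raise `ZeroDivisionError` when the model predicts no positives at all, or when the test set contains none. A defensive variant could look like the sketch below. The function name `precision_recall_f1` is hypothetical, and returning `0.0` for an undefined metric is a convention assumed here, not something the notebook prescribes.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 wherever a metric is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 2 false negatives.
print(precision_recall_f1(8, 2, 2))
```

In the notebook it would be called as `precision_recall_f1(true_pos, false_pos, false_neg)`.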


# Part 2: Understanding text classification

In [None]:
import os
import pandas as pd
import numpy as np

import spacy

import nltk
nltk.download('punkt')
from nltk.text import Text
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk.probability import FreqDist, MLEProbDist

In [None]:
# Install and import scattertext and its dependency pytextrank
!pip install scattertext
!pip install pytextrank

import scattertext as st
import pytextrank


In [None]:
# Download a spacy language model
!python -m spacy download en_core_web_lg

### Format data

In [None]:
def create_dataframe_from_txt_files():
    data = {'text': [], 'label': []}

    # Define the folders to iterate over
    folders = ["data/aclImdb/test/pos/", "data/aclImdb/train/pos/",
               "data/aclImdb/test/neg/", "data/aclImdb/train/neg/"]

    # Iterate over each folder
    for folder in folders:
        label = "positive" if "pos" in folder else "negative"

        # Iterate over .txt files in the folder
        for filename in os.listdir(folder):
            if filename.endswith(".txt"):
                file_path = os.path.join(folder, filename)
                with open(file_path, 'r', encoding='utf-8') as file:
                    text = file.read()
                    data['text'].append(text)
                    data['label'].append(label)

    # Create a DataFrame
    df = pd.DataFrame(data)
    return df

# Call the function to create the DataFrame
df = create_dataframe_from_txt_files()
df

In [None]:
# Shuffle and downsize
df = df.sample(frac=1).reset_index(drop=True)
df = df.head(5000)

### Run scattertext

In [None]:
from scattertext import *

In [None]:
# Select the spaCy pipeline: full language model or whitespace tokenizer?
nlp = spacy.load('en_core_web_lg')
#nlp = st.WhitespaceNLP.whitespace_nlp

# Build "corpus"
st_corpus = st.CorpusFromPandas(df, category_col='label', text_col='text',nlp=nlp).build()

In [None]:
# Create a term frequency dataframe from the corpus
term_freq_df = st_corpus.get_term_freq_df()
# Scaled F-score of each term for the positive category
term_freq_df['Positive sentiment'] = st_corpus.get_scaled_f_scores('positive')
# Scaled F-score of each term for the negative category
term_freq_df['Negative sentiment'] = st_corpus.get_scaled_f_scores('negative')

# Print the top distinguishing terms (by scaled F-score) for each category
print("Top positive: \n")
print(list(term_freq_df.sort_values(by='Positive sentiment', ascending=False).index[:10]))
print("\nTop negative: \n")
print(list(term_freq_df.sort_values(by='Negative sentiment', ascending=False).index[:10]))

# Sort by top scores
top_pos=term_freq_df.sort_values(by='Positive sentiment', ascending=False)
top_neg=term_freq_df.sort_values(by='Negative sentiment', ascending=False)

print(top_pos)
print(top_neg)

In [None]:
# Plot using scattertext explorer
html = st.produce_scattertext_explorer(
                   st_corpus,
                   category='positive',
                   category_name='Positive',
                   not_category_name='Negative',
                   width_in_pixels=1200,
                   height_in_pixels=600,
                   show_diagonal=True,
                   show_characteristic=False,
                   minimum_term_frequency=10,
                   #d3_color_scale='d3.interpolateViridis',
                   term_scorer=st.ScaledFScorePresets()) #or use term_scorer=st.RankDifference()
                   #transform=st.Scalers.dense_rank)


# Save html file (the with-block closes the file handle after writing)
html_file_name = "explore_sentiment_analysis.html"
with open(html_file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

### Download results

In [None]:
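# Download the html file to open in a browser.
# This cell only works on Google Colab; when running locally, open the
# saved html file from the working directory instead.

from google.colab import files
files.download('explore_sentiment_analysis.html')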