# AI - Computer Assignment 3

## Text Processing and Naïve Bayes

In this assignment we classify news pieces using Bayes' theorem, which states:

$$ P(C_i | X) = \frac{P(C_i) * P(X | C_i)}{P(X)} $$
            
We try to classify texts using the formula above. Bayes' theorem involves four values:

- Posterior Probability: the probability of a news piece being in category $C_i$ given that it includes the words $X$.
- Likelihood: the probability of the word $X_i$ appearing in a news piece of category $C_i$.
- Class Prior Probability: the probability of a news piece being in category $C_i$.
- Predictor Prior Probability (evidence): the probability of observing the words $X$, which is the same for every category and can therefore be dropped.

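Since $P(X)$ is the same for every category, and assuming the words are conditionally independent given the category (the "naïve" assumption), the posterior probability is proportional to:

$$ P(C_i | X) \propto P(C_i) * P(X_1 | C_i) * \dots * P(X_n | C_i) $$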

In the training phase we calculate the likelihood of each word in each category, for use in the evaluation and test phases. We also calculate the Class Prior Probability, which is the number of news pieces in category $C_i$ divided by the total number of news pieces.
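
The scoring rule above can be sketched on a toy dataset. This is only an illustration, not the notebook's `Classifier`: the `training` documents, the helper names `log_score` and `predict`, and the `unseen_count` parameter are assumptions made for the example. Scores are summed in log space, and a word never seen in a category gets a small pseudo-count instead of zero.

```python
import math
from collections import Counter

# Hypothetical toy training data: category -> list of already-tokenized documents.
training = {
    "TRAVEL": [["flight", "hotel", "beach"], ["hotel", "trip"]],
    "BUSINESS": [["market", "stock", "profit"], ["stock", "trade"], ["market", "deal"]],
}

total_docs = sum(len(docs) for docs in training.values())

# Class prior P(C_i): share of documents in each category.
priors = {c: len(docs) / total_docs for c, docs in training.items()}

# Word counts per category, used for the likelihood P(X_j | C_i).
word_counts = {c: Counter(w for doc in docs for w in doc) for c, docs in training.items()}
totals = {c: sum(counts.values()) for c, counts in word_counts.items()}


def log_score(words, category, unseen_count=0.1):
    """Log of P(C_i) * prod_j P(X_j | C_i); unseen words get a small pseudo-count."""
    score = math.log10(priors[category])
    for word in words:
        count = word_counts[category][word] or unseen_count
        score += math.log10(count / totals[category])
    return score


def predict(words):
    """Return the category with the highest log score."""
    return max(training, key=lambda c: log_score(words, c))


print(predict(["hotel", "beach"]))
print(predict(["stock", "deal"]))
```

With this toy data the first call should pick `TRAVEL` and the second `BUSINESS`, because the words of each query occur only in that category.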

The following block implements the `Classifier` class, together with the `clean_text` utility function and a helper class for categories:

In [1]:
import math
from collections import Counter
from random import random

import nltk
import pandas as pd
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from prettytable import PrettyTable

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))


def clean_text(text):
    def get_wordnet_pos(nltk_tag):
        return {
            "J": wordnet.ADJ,
            "N": wordnet.NOUN,
            "V": wordnet.VERB,
            "R": wordnet.ADV,
        }.get(nltk_tag[0].upper(), None)

    result = []
    for sentence in nltk.sent_tokenize(text):
        tagged_sentence = [
            (word, get_wordnet_pos(tag))
            for (word, tag) in nltk.pos_tag(nltk.regexp_tokenize(sentence.lower(), r'\w+'))
            if len(word) > 1 and word not in stop_words
        ]

        for word, tag in tagged_sentence:
            if tag:
                word = lemmatizer.lemmatize(word, tag)
            else:
                word = lemmatizer.lemmatize(word)

            result.append(word)

    return result


class Classifier:

    class Category:

        def __init__(self, category_title, category_total_rows, data_total_rows):
            self.category_title = category_title
            self.category_probability = category_total_rows / data_total_rows

            self.words = Counter()
            self.word_count = 0

        def add_text(self, text):
            for word in clean_text(text):
                self.words[word] += 1
                self.word_count += 1

        def calculate_probability(self, text):
            p = math.log10(self.category_probability)
            for word in clean_text(text):
                p += math.log10((self.words[word] or 0.1) / self.word_count)

            return p

    def __init__(self, data_file_name, classification_cols, category_col, oversample=False):
        self._category_col = category_col
        self._classification_cols = classification_cols

        df = pd.read_csv(data_file_name,
                         usecols=[*classification_cols, category_col])

        df.dropna(inplace=True)

        self.total_rows, _ = df.shape
        category_titles = df[category_col].unique()

        df = df.groupby(category_col)
        # Size of the largest category, used as the oversampling target.
        max_category_data_rows = df.size().max()

        self.categories = {}
        self.test_data = {}
        for category_title in category_titles:
            category_df = df.get_group(category_title)
            category_total_rows, _ = category_df.shape

            if oversample:
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
                category_df = pd.concat([
                    category_df,
                    category_df.sample(
                        n=max_category_data_rows - category_total_rows,
                        replace=True,
                    ),
                ])

            category_total_rows, _ = category_df.shape
            category = Classifier.Category(
                category_title, category_total_rows, self.total_rows)

            test_data = []
            for _, row in category_df.iterrows():
                if random() < 0.8:
                    for col in classification_cols:
                        category.add_text(row[col])
                else:
                    test_data.append(row)

            self.test_data[category_title] = pd.DataFrame(
                columns=category_df.columns,
                data=test_data,
            )

            self.categories[category_title] = category

    def _find_category(self, text, include_categories=None):
        _, result_category = max([
            (category.calculate_probability(text), category.category_title)
            for category in self.categories.values()
            if not include_categories or category.category_title in include_categories
        ])

        return result_category

    def evaluate(self, classification_col, categories=None):
        valid_categories = list(self.categories.keys())
        if not categories:
            categories = valid_categories
        else:
            categories = [
                category for category in categories if category in valid_categories]

        actual_categories, predicted_categories = zip(*[
            (category, self._find_category(
                row[classification_col], categories))
            for category in categories
            for _, row in self.test_data[category].iterrows()
        ])

        category_indices = {
            category: index for (index, category) in enumerate(categories)
        }

        # Rows are actual categories, columns are predicted categories.
        confusion_matrix = [[0 for _ in categories] for _ in categories]
        for actual_category, predicted_category in zip(actual_categories, predicted_categories):
            confusion_matrix[
                category_indices[actual_category]
            ][
                category_indices[predicted_category]
            ] += 1

        confusion_matrix_table = PrettyTable(
            field_names=[
                'Confusion Matrix',
                *categories
            ],
        )
        for i, row in enumerate(confusion_matrix):
            confusion_matrix_table.add_row([
                categories[i],
                *row,
            ])

        print(confusion_matrix_table)

        metrics_table = PrettyTable(
            field_names=[
                '',
                'Accuracy',
                'Precision',
                'Recall',
            ],
        )

        # Accuracy is one value for the whole model: correct predictions / all predictions.
        all_true_positive_cases = sum(
            confusion_matrix[i][i] for i in range(len(confusion_matrix)))
        all_cases = len(actual_categories)
        accuracy = all_true_positive_cases / all_cases

        for i, category in enumerate(categories):
            true_positive_cases = confusion_matrix[i][i]
            # Row sum: news pieces whose actual category is this one.
            actual_positive_cases = sum(
                confusion_matrix[i][j] for j in range(len(confusion_matrix)))
            # Column sum: news pieces predicted as this category.
            predicted_positive_cases = sum(
                confusion_matrix[j][i] for j in range(len(confusion_matrix)))

            precision = true_positive_cases / predicted_positive_cases
            recall = true_positive_cases / actual_positive_cases

            metrics_table.add_row([
                category,
                accuracy,
                precision,
                recall,
            ])

        print(metrics_table)

        return confusion_matrix

    def classify(self, test_file_name, classification_col):
        df = pd.read_csv(test_file_name, usecols=[
                         'index', classification_col]).dropna()

        df[self._category_col] = df[classification_col].apply(
            self._find_category)

        return df


### Cleaning the data
The `clean_text` function cleans the given text using the `NLTK` tokenizers and `WordNetLemmatizer`. It first splits the text into sentences; for each sentence it lowercases and tokenizes the words, tags each word with its part of speech, and drops stop words and single-character tokens. The remaining words are then lemmatized with `WordNetLemmatizer`, using the part-of-speech tag when one is available. The result is the base or dictionary form of each word, known as the lemma. This function is called on every piece of text before it is processed.

### Classifier
The `Classifier` constructor takes the dataset file name, the classification columns, and the category column as arguments. If `oversample=True` is passed, the classifier oversamples the data, as explained later.

#### Training
Training is done in the constructor. After the dataset is loaded into a `pandas` data frame, the total number of rows and the category titles are extracted. The dataset is then grouped by the category column; for each category roughly 80% of the rows (each row is picked at random with probability 0.8) are used for training and the rest are stored for evaluating the model. The word counts of each category are stored in instances of the `Category` class.
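
The per-row random split can be sketched in isolation. This is a simplified illustration with a hypothetical list of `rows` and a seeded generator (the notebook itself uses an unseeded `random()`), so the split sizes are only approximately 80/20.

```python
from random import Random

rng = Random(0)  # seeded only so the sketch is reproducible

rows = [f"news {i}" for i in range(1000)]  # hypothetical rows of one category

train_rows, test_rows = [], []
for row in rows:
    # Each row goes to training with probability 0.8, so the split is
    # approximately (not exactly) 80/20.
    if rng.random() < 0.8:
        train_rows.append(row)
    else:
        test_rows.append(row)

print(len(train_rows), len(test_rows))
```

Because the split is random per row, two runs without a fixed seed produce different training and test sets, and therefore slightly different metrics.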

#### Evaluate
The `evaluate` method takes the classification column and the categories to include; if no categories are provided, all of them are used to evaluate the model. It calculates the probability of each news piece being in every category and chooses the category with the highest probability as the predicted category. Then the confusion matrix and some metrics are printed.

>##### Confusion Matrix
The confusion matrix is a table with two dimensions (“Actual” and “Predicted”) and the same set of categories in both dimensions. The "Actual" categories are the rows and the "Predicted" ones are the columns. Each cell is the number of news pieces with that actual category and that predicted category. The confusion matrix is not a performance measure in itself, but almost all performance metrics are based on the numbers inside it:

>>###### Accuracy
Accuracy is the number of correct predictions made by the model over all predictions made.

>>###### Precision
Precision is the number of correct predictions over predicted cases in each category.

>>###### Recall
Recall is the number of correct predictions over actual cases in each category.
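
As a worked example, the three metrics can be computed directly from a confusion matrix. The sketch below uses the TRAVEL/BUSINESS matrix printed by the two-category evaluation in this notebook; the variable names are chosen for the example only.

```python
# Rows are actual categories, columns are predicted categories.
categories = ["TRAVEL", "BUSINESS"]
confusion_matrix = [
    [1677, 136],   # actual TRAVEL
    [116, 1627],   # actual BUSINESS
]

n = len(categories)
total = sum(sum(row) for row in confusion_matrix)
correct = sum(confusion_matrix[i][i] for i in range(n))
accuracy = correct / total

metrics = {}
for i, category in enumerate(categories):
    true_positive = confusion_matrix[i][i]
    actual_positive = sum(confusion_matrix[i])                          # row sum
    predicted_positive = sum(confusion_matrix[j][i] for j in range(n))  # column sum
    metrics[category] = {
        "precision": true_positive / predicted_positive,
        "recall": true_positive / actual_positive,
    }

print(round(accuracy, 4))
for category, values in metrics.items():
    print(category, round(values["precision"], 4), round(values["recall"], 4))
```

Note that accuracy is a single number for the whole model, which is why the metrics table repeats the same accuracy on every row, while precision and recall are computed per category.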

#### Test
The `classify` method takes a test dataset and finds the category of every news piece in it. A pandas data frame is returned.

Now let's test the classifier.
First we train the model:

In [2]:
from time import time

start = time()
classifier = Classifier(
    data_file_name='./data.csv',
    classification_cols=['short_description', 'headline'],
    category_col='category',
    oversample=True,
)
print("Elapsed Time (train):", time() - start)

Elapsed Time (train): 32.95787072181702


Now let's evaluate the model using the travel and business categories, as required in phase 1:

In [3]:
start = time()
confusion_matrix = classifier.evaluate(
    classification_col='short_description',
    categories=["TRAVEL", "BUSINESS"],
)
print("Elapsed Time (evaluate phase1):", time() - start)

+------------------+--------+----------+
| Confusion Matrix | TRAVEL | BUSINESS |
+------------------+--------+----------+
|      TRAVEL      |  1677  |   136    |
|     BUSINESS     |  116   |   1627   |
+------------------+--------+----------+
+----------+--------------------+--------------------+--------------------+
|          |      Accuracy      |     Precision      |       Recall       |
+----------+--------------------+--------------------+--------------------+
|  TRAVEL  | 0.9291338582677166 | 0.9353039598438372 | 0.9249862107004965 |
| BUSINESS | 0.9291338582677166 | 0.9228587634713556 | 0.9334480780263913 |
+----------+--------------------+--------------------+--------------------+
Elapsed Time (evaluate phase1): 7.449356317520142


Now let's evaluate the model using all categories, as required in phase 2:

In [4]:
start = time()
confusion_matrix = classifier.evaluate(
    classification_col='short_description',
)
print("Elapsed Time (evaluate phase2):", time() - start)

+------------------+--------+----------------+----------+
| Confusion Matrix | TRAVEL | STYLE & BEAUTY | BUSINESS |
+------------------+--------+----------------+----------+
|      TRAVEL      |  1613  |       79       |   121    |
|  STYLE & BEAUTY  |  191   |      1474      |    97    |
|     BUSINESS     |   98   |       50       |   1595   |
+------------------+--------+----------------+----------+
+----------------+--------------------+--------------------+--------------------+
|                |      Accuracy      |     Precision      |       Recall       |
+----------------+--------------------+--------------------+--------------------+
|     TRAVEL     | 0.8804061677322301 | 0.8480546792849631 | 0.8896856039713182 |
| STYLE & BEAUTY | 0.8804061677322301 | 0.9195258889582034 | 0.8365493757094211 |
|    BUSINESS    | 0.8804061677322301 | 0.8797573083287369 | 0.9150889271371199 |
+----------------+--------------------+--------------------+--------------------+
Elapsed Time (evalua

Finally we classify the test data; the result is saved in "output.csv":

In [5]:
start = time()
res = classifier.classify(
    test_file_name='./test.csv',
    classification_col='short_description',
)
print("Elapsed Time (classify):", time() - start)
res[['index', 'category']].to_csv('output.csv', index=False)

Elapsed Time (classify): 6.985297441482544


#### Oversampling
As we can see, the business category has about half as many news pieces as the other categories, which makes the Class Prior Probability of this category lower than the others. We resample the rows of the smaller categories with replacement until every category has the same number of rows. This is called oversampling.
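
A minimal sketch of oversampling, using only the standard library and a hypothetical `data` dictionary instead of the notebook's `pandas` data frame. Every original row is kept, and the missing rows are drawn at random with replacement from the same category.

```python
from random import Random

rng = Random(0)  # seeded only so the sketch is reproducible

# Hypothetical imbalanced dataset: category -> rows.
data = {
    "TRAVEL": [f"travel {i}" for i in range(10)],
    "BUSINESS": [f"business {i}" for i in range(5)],
}

# Every category is grown to the size of the largest one.
target = max(len(rows) for rows in data.values())

balanced = {
    # Keep every original row and draw the missing ones with replacement.
    category: rows + rng.choices(rows, k=target - len(rows))
    for category, rows in data.items()
}

print({category: len(rows) for category, rows in balanced.items()})
```

One thing to keep in mind: in the `Classifier` constructor the oversampling happens before the train/test split, so a duplicated row can end up in both sets, which may make the evaluation look somewhat better than it is.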

#### Lemmatization vs Stemming
Lemmatization is preferred over stemming because it performs a morphological analysis of the words and returns a dictionary form, whereas stemming only strips affixes. Here, however, it improved the results by only about 1 percent.

#### Word only in one category
We use logarithms to make the calculation simpler (the multiplications become summations). If a word is not present in a category, we assign it a very low likelihood (a pseudo-count of 0.1 in `calculate_probability`) to prevent the log(0) problem. As a consequence, if a word is present in only one category, every other category is heavily penalized and the news piece is likely to be assigned to that category.
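
Besides turning products into sums, working in log space also avoids floating-point underflow, which is worth illustrating. The sketch below uses hypothetical per-word likelihoods; it is not taken from the dataset.

```python
import math

likelihoods = [1e-5] * 100  # hypothetical per-word likelihoods of a 100-word text

# Multiplying many small probabilities underflows to zero in floating point.
product = 1.0
for p in likelihoods:
    product *= p

# Summing the logarithms keeps the score representable.
log_sum = sum(math.log10(p) for p in likelihoods)

print(product)
print(log_sum)
```

The direct product collapses to `0.0`, so every category would tie, while the sum of logarithms stays around -500 and can still be compared across categories.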

#### Considering precision as the only metric
Precision is about being precise: it is the number of true positive cases over all predicted positive cases. Suppose we have a model that detects cancer and a dataset of 100 patients, 5 of whom have cancer. If the model flags only one patient and that prediction is correct, the precision is 100%, but 4 patients were missed, i.e. the recall is 20%, which is not good.
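
The numbers in this example can be checked with a few lines; the variable names are chosen for the illustration.

```python
# 100 patients, 5 of whom actually have cancer; the model flags only one, correctly.
true_positive = 1
false_positive = 0
false_negative = 4  # patients with cancer that the model missed

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(precision, recall)  # 1.0 0.2
```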

#### TF-IDF
TF (Term Frequency) is the frequency of a word in a document over the number of all words in the document, the same as what we used here except that we removed the stop words.
IDF (Inverse Document Frequency) measures how informative a word is: it is the logarithm of the number of all documents over the number of documents the word appears in. It weighs down words such as stop words that appear in many documents and weighs up words that are rare across documents.
This measure can be used as the likelihood.
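
A minimal TF-IDF sketch on hypothetical tokenized documents. It uses the plain `log(N / df)` form of IDF; libraries often use smoothed variants, so the exact values may differ from theirs. The function names `tf`, `idf`, and `tf_idf` are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
documents = [
    ["stock", "market", "news"],
    ["travel", "news", "hotel"],
    ["stock", "trade", "news"],
]


def tf(word, document):
    """Occurrences of the word divided by the number of words in the document."""
    return Counter(document)[word] / len(document)


def idf(word, documents):
    """log(N / df); assumes the word occurs in at least one document."""
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    return tf(word, document) * idf(word, documents)


print(tf_idf("news", documents[0], documents))
print(tf_idf("stock", documents[0], documents))
print(tf_idf("market", documents[0], documents))
```

Here "news" appears in every document and gets a weight of zero, while "market", which appears in only one document, gets the highest weight of the three.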