![Banner](./img/AI_Special_Program_Banner.jpg)

# Text Classification Application - The Naive Bayes Method in `scikit-learn`
---

Here we take a first look at *text processing* in Python, again using `scikit-learn`. Text data requires thorough pre-processing and [feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) before a model can be trained, and the resulting feature matrices are very sparse.

## Table of contents
---

- [Insult detection](#Insult-detection)
    - [Data preparation](#Data-preparation)
    - [Modeling with Naive Bayes classifier](#Modeling-with-Naive-Bayes-classifier)
    - [Using a different vectorizer: CountVectorizer](#Using-a-different-vectorizer:-CountVectorizer)
- [Learning outcomes](#Learning-outcomes)

## Insult detection
---

The aim is to recognize insults in discussion forums. The corresponding data set is available in the `data` folder.
It was downloaded from the [data repository](https://github.com/ipython-books/cookbook-data) of the [Cookbook](https://ipython-books.github.io/). The required file is `troll.csv`. This was originally provided by the company [Impermium](https://impermium.com) as part of a [Kaggle competition](https://www.kaggle.com/c/detecting-insults-in-social-commentary).

***Note:*** The specialized package [nltk](http://www.nltk.org/) is often used for text processing instead of `scikit-learn`.

### Data preparation

First, we import the required packages again.

In [1]:
import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection as ms
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb

Then we read the data into a Pandas dataframe.

In [2]:
df = pd.read_csv("data/troll.csv")

Each row contains a comment, its label as *offensive* (1) or not (0), and the date of the comment.

In [3]:
df.tail()

| | Insult | Date | Comment |
|---|---|---|---|
| 3942 | 1 | 20120502172717Z | "you are both morons and that is never happening" |
| 3943 | 0 | 20120528164814Z | "Many toolbars include spell check, like Yahoo... |
| 3944 | 0 | 20120620142813Z | "@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F... |
| 3945 | 0 | 20120528205648Z | "How about Felix? He is sure turning into one ... |
| 3946 | 0 | 20120515200734Z | "You're all upset, defending this hipster band... |


Now we define the *feature matrix* $\mathbf{X}$ and the *classes* (labels) $\mathbf{y}$. The latter is simple:

In [4]:
y = df['Insult']

The feature matrix, on the other hand, is much more difficult to obtain. `scikit-learn` needs numerical values as inputs, so the text must be converted into a matrix. This **data preprocessing** (i.e. *data preparation* according to [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining)) usually takes place in two steps:
1. **Tokenization:** Each text is split into words (tokens), and a **vocabulary** is extracted, i.e. a list of all words used in the texts (in our case: in the comments).
2. **Counting:** For each data record (also: *document*), we count how often each word of the vocabulary occurs. As the vocabulary is generally very large and only very few of its words are actually used in a particular data record (in our case: in a comment), this results in a matrix that mainly contains zeros (i.e. is sparsely populated).
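
The two steps above can be sketched by hand with the standard library. This is only an illustration on made-up sentences (not taken from `troll.csv`); the helper `tokenize` is a hypothetical name, and its regular expression is chosen to resemble the default token pattern of the `scikit-learn` vectorizers (lowercased tokens of at least two word characters).

```python
import re
from collections import Counter

# Hypothetical toy documents, not taken from troll.csv.
docs = [
    "You are right",
    "you are so so wrong",
    "That is right",
]

def tokenize(doc):
    """Lowercase the text and keep tokens of at least two word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Step 1: the vocabulary maps each distinct token to a column index.
tokens_per_doc = [tokenize(d) for d in docs]
all_tokens = sorted({t for tokens in tokens_per_doc for t in tokens})
vocabulary = {word: index for index, word in enumerate(all_tokens)}

# Step 2: one row per document, one column per vocabulary entry.
counts = []
for tokens in tokens_per_doc:
    token_counts = Counter(tokens)
    counts.append([token_counts.get(word, 0) for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

Even in this tiny example most entries of each row are zero, which is why a sparse matrix format is used for real data.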

The entire process is also referred to as **Bag of Words**. With `scikit-learn` it takes only two lines of code. For the "counting" step we use the [Tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) method, which down-weights words that occur very frequently ("the", "and", etc.). *Tf-idf* weights each word so that words occurring in a large number of data records receive less weight than words that occur in only a small number of data records and are therefore *more specific*.
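
To make the weighting concrete, here is a minimal sketch of a Tf-idf computation on a made-up count matrix. It assumes the smoothed idf formula and the row-wise Euclidean normalization that the `scikit-learn` documentation describes for the default settings; the names `doc_freq` and `tfidf_row` are illustrative only.

```python
import math

# Hypothetical term counts: rows are documents, columns are the words below.
words = ["you", "idiot", "thanks"]
counts = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]

n_docs = len(counts)
# Document frequency: in how many documents does each word occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(words))]
# Smoothed inverse document frequency (assumed scikit-learn default).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    """Weight the counts by idf, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for word, weight in zip(words, tfidf[0]):
    print(f"{word}: {weight:.3f}")
```

In the first document, "you" and "idiot" both occur once, but "idiot" receives the larger weight because it appears in fewer documents. Since every row is scaled to length 1, no single weight can exceed 1 under this normalization.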

In [5]:
tf = text.TfidfVectorizer()  # tf-idf 
X_vec = tf.fit(df['Comment'])
X = X_vec.transform(df['Comment'])
# X = tf.fit_transform(df['Comment'])
print(f'In {X.shape[0]} comments are {X.shape[1]} different words')

In 3947 comments are 16469 different words


So there are 3947 comments and 16469 different words. Let's take a look at the vocabulary first. Each word found is assigned an *index* in the feature matrix:

In [6]:
# a dict cannot be sliced directly, so we convert its items to a list first
list(X_vec.vocabulary_.items())[:10]

[('you', 16397),
 ('fuck', 5434),
 ('your', 16405),
 ('dad', 3409),
 ('really', 11568),
 ('don', 4075),
 ('understand', 14793),
 ('point', 10754),
 ('xa0', 15720),
 ('it', 7048)]

Thus the word `dad` has the index 3409 and `really` the index 11568.

The feature matrix $X$ is a [scipy.sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html). This special data structure stores only the non-zero entries, together with two index arrays that record their positions in the matrix (examples can be found in the link above). Let's take a closer look at all this:

In [7]:
X.nnz  # number of stored (non-zero) entries

100269

In [8]:
X.data  # non-zero entries (tf-idf weights) as an array

array([0.30831977, 0.20722267, 0.48374632, ..., 0.15469143, 0.07678784,
       0.20409929])

In [9]:
X.max() # highest tf-idf-value

1.0

In [10]:
X.indices  # column index of each stored entry

array([ 3409,  5434, 16397, ..., 15294, 16397, 16405])

In [11]:
X.indptr  # row pointers: row i spans data[indptr[i]:indptr[i+1]]

array([     0,      4,     19, ..., 100202, 100231, 100269])
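
How the three arrays `data`, `indices` and `indptr` fit together is easier to see on a small matrix. The following sketch uses a made-up 3x4 matrix (the names `dense`, `small` and `row1` are illustrative). In the output above, `indptr` starts with 0, 4, 19, which means the first comment has 4 stored entries and the second has 15.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 matrix; the last row is empty on purpose.
dense = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
])
small = csr_matrix(dense)

print(small.data)     # stored non-zero values, row by row
print(small.indices)  # column index of each stored value
print(small.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Recover the non-zero entries of row 1 from the three arrays.
start, end = small.indptr[1], small.indptr[2]
row1 = dict(zip(small.indices[start:end].tolist(),
                small.data[start:end].tolist()))
print(row1)
```

An empty row simply repeats the previous pointer value, so it costs no storage in `data` or `indices`.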

It is also interesting to see how sparsely populated the feature matrix is. We can compute the share of non-zero entries as follows:

In [12]:
print("The feature matrix has ~{0:.2f}% not null entries.".format(
          100 * X.nnz / float(X.shape[0] * X.shape[1])))

The feature matrix has ~0.15% not null entries.


Now we can train a classifier. As before, we first split the data into training and test data.

In [13]:
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=.2, random_state = 17)

### Modeling with Naive Bayes classifier

We use the [Multinomial Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_na%C3%AFve_Bayes), which models the frequencies of the words as integer counts. In practice, however, the model also works with the fractional weights of the Tf-idf vectorization. In addition, *smoothing* is performed using a parameter $\alpha$ (an explanation of the approach can be found at [Stanford University](https://nlp.stanford.edu/courses/cs224n/2001/gruffydd/smoothing.html); we already know it as Laplace estimation, which corresponds to $\alpha = 1$). We choose $\alpha$ with a cross-validated grid search. Further Naive Bayes models in `scikit-learn` can be found [here](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes).

In [14]:
bnb = ms.GridSearchCV(nb.MultinomialNB(), param_grid={'alpha':np.logspace(-2., 2., 50)})
bnb.fit(X_train, y_train);
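
The effect of the smoothing parameter can be illustrated without the classifier itself. The sketch below uses made-up counts and a hypothetical helper `smoothed_log_prob`; it applies additive smoothing, $(N_{cw} + \alpha) / (N_c + \alpha \, |V|)$, which is the estimate the `scikit-learn` documentation gives for `MultinomialNB`. It is not the library's implementation.

```python
import numpy as np

# Hypothetical word counts: rows are documents, columns are vocabulary words.
X_counts = np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 0],
])
y_toy = np.array([1, 1, 0])

def smoothed_log_prob(X, y, alpha):
    """Return log P(word | class) with additive smoothing, one row per class."""
    classes = np.unique(y)
    # Total count of each word within each class.
    word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
    smoothed = word_counts + alpha
    return np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

for alpha in (0.01, 1.0, 100.0):
    probs = np.exp(smoothed_log_prob(X_counts, y_toy, alpha))
    print(alpha, np.round(probs[1], 3))
```

A small $\alpha$ keeps the estimates close to the raw relative frequencies, while a large $\alpha$ pushes them towards a uniform distribution. The grid `np.logspace(-2., 2., 50)` used above covers this range from 0.01 to 100.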

How well does this classifier work on our test data?

In [15]:
print(f'The hit rate is around {round(100*bnb.score(X_test, y_test),2)}%')

The hit rate is around 76.71%


Note that the hit rate (accuracy) depends on the particular split into training and test data, which includes a random component: a `random_state` other than 17 would lead to different results. It is also interesting to see which words have the highest probability in offensive comments. This can be found out as follows:

In [16]:
insult_class_prob_sorted = bnb.best_estimator_.feature_log_prob_[1, :].argsort()

word_list_insult = np.take(X_vec.get_feature_names_out(), insult_class_prob_sorted[:-30:-1])

print(f'Words with offensive connotations: \n {word_list_insult}')

Words with offensive connotations: 
 ['you' 'your' 'are' 'to' 'the' 'and' 'idiot' 're' 'of' 'fuck' 'that' 'it'
 'is' 'like' 'stupid' 'xa0' 'an' 'moron' 'in' 'bitch' 'dumb' 'just' 'go'
 'have' 'as' 'not' 'on' 'fucking' 'ass']
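
A note on the slicing used above: `argsort()` returns indices in ascending order, and the slice `[:-30:-1]` walks them backwards but stops before position -30, so it yields 29 words rather than 30 (the printed list indeed has 29 entries). A small sketch with made-up values shows the difference to reversing first and then taking the first `k` entries:

```python
import numpy as np

# Hypothetical log-probabilities for a five-word vocabulary.
words = np.array(["and", "idiot", "nice", "you", "zebra"])
log_prob = np.array([-1.2, -1.5, -4.0, -0.7, -6.0])

order = log_prob.argsort()        # indices from least to most probable
top3 = words[order[::-1][:3]]     # reverse, then keep the first three
top_slice = words[order[:-3:-1]]  # this slice keeps only two entries

print(top3)
print(top_slice)
```

The list also contains many frequent function words such as "the" and "and": a high probability within the insult class does not by itself mean that a word is specific to insults.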


Finally, we test how well this works with examples we have devised ourselves.

In [17]:
print(bnb.predict(tf.transform([
    "You are absolutely right.",
    "This is beyond moronic.",
    "LOL"
    ])))

[0 1 0]


The model correctly classifies our self-devised examples.

### Using a different vectorizer: CountVectorizer

Instead of the previously used [Tf-Idf vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), we can also try the simple [Counting vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [18]:
co = text.CountVectorizer()  # count
Xc_vec = co.fit(df['Comment'])
Xc = Xc_vec.transform(df['Comment'])
# Xc = co.fit_transform(df['Comment'])
print(f'In {Xc.shape[0]} comments are {Xc.shape[1]} different words')

In 3947 comments are 16469 different words


In [19]:
list(Xc_vec.vocabulary_.items())[:10]

[('you', 16397),
 ('fuck', 5434),
 ('your', 16405),
 ('dad', 3409),
 ('really', 11568),
 ('don', 4075),
 ('understand', 14793),
 ('point', 10754),
 ('xa0', 15720),
 ('it', 7048)]

In [20]:
Xc.nnz  # number of stored (non-zero) entries

100269

Thus, the vocabulary and the number of non-zero entries of the feature matrix are consistent with what we had before (which was to be expected, since both vectorizers tokenize the text in the same way). The stored values, on the other hand, are now raw *counts* and no longer Tf-idf weights:

In [21]:
Xc.data  # non-zero entries (counts) as an array

array([1, 1, 1, ..., 1, 1, 1])

In [22]:
Xc.max() # highest count

97

In [23]:
Xc.indices  # column index of each stored entry

array([ 3409,  5434, 16397, ..., 15294, 16397, 16405], dtype=int32)

In [24]:
Xc.indptr  # row pointers: row i spans data[indptr[i]:indptr[i+1]]

array([     0,      4,     19, ..., 100202, 100231, 100269], dtype=int32)

In [25]:
print("The feature matrix has ~{0:.2f}% entries different from 0.".format(
          100 * Xc.nnz / float(Xc.shape[0] * Xc.shape[1])))

The feature matrix has ~0.15% entries different from 0.
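
The agreement between the two vectorizers can be checked directly on a toy corpus. The following sketch uses made-up sentences (not the forum data) and compares the vocabulary and the positions of the non-zero entries; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy comments, not taken from troll.csv.
toy = ["you are right", "you are so so wrong", "that is right"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
C = count_vec.fit_transform(toy).toarray()
T = tfidf_vec.fit_transform(toy).toarray()

# Same words, same column indices, same positions of non-zero entries.
same_vocab = count_vec.vocabulary_ == tfidf_vec.vocabulary_
same_pattern = np.array_equal(C != 0, T != 0)
print(same_vocab, same_pattern)

# Only the stored values differ: integer counts versus tf-idf weights.
print(C[1])
print(np.round(T[1], 3))
```

Converting to dense arrays with `toarray()` is only reasonable for such a small example; for the full feature matrix it would defeat the purpose of the sparse format.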


Now again the division into training and test data:

In [26]:
(Xc_train, Xc_test, yc_train, yc_test) = ms.train_test_split(Xc, y, test_size=.2, random_state = 17)

And now the multinomial Naive Bayes method again:

In [27]:
bnbc = ms.GridSearchCV(nb.MultinomialNB(), param_grid={'alpha':np.logspace(-2., 2., 50)})
bnbc.fit(Xc_train, yc_train);

How well does this classifier work on our test data?

In [28]:
print(f'The hit rate is around {round(100*bnbc.score(Xc_test, yc_test),2)}%')

The hit rate is around 77.85%


The hit rate is about one percentage point higher than before (77.85% vs. 76.71%), although a difference of this size on a single split should be interpreted with caution. One possible explanation is that words that generally occur frequently in the data have a certain predictive power for the classification problem. Insults often use the personal pronoun "*you*", but it occurs frequently in general and is therefore down-weighted by the Tf-idf vectorizer. However, this hypothesis would have to be analysed in more detail by comparing the important words between the two vectorizers.

And again the self-invented examples:

In [29]:
print(bnbc.predict(co.transform([
    "You are absolutely right.",
    "This is beyond moronic.",
    "LOL"
    ])))

[0 1 0]


Both models thus produce the same predictions for our own examples.

## Learning outcomes
---

The most important learning objectives of this notebook at a glance:

* Extensive data preparation is necessary when building models that work with text data.
* Words need to be converted to numeric representations by tokenizing and counting in a bag-of-words process.
* Different vectorization methods count (and therefore weight) words differently.
* The multinomial Naive Bayes classifier can classify documents based on word frequencies, and its per-class log-probabilities (`feature_log_prob_`) can be used to check for important words.