# 📚 Text classification using Python and Scikit-learn

Text classification is a technique that automatically assigns labels to pieces of text, such as articles, blog posts, or reviews. Many businesses use it because it lets them organize and analyze text data without manual labeling. This blog post will teach you how to classify text using Python and the Scikit-learn library.

But first, you might be wondering why learning how to classify text with Python and Scikit-learn is important. After all, there are many ways to classify text, so what's the big deal?

Well, text classification is a very powerful tool. It helps keep spam out of your inbox, helps authors detect plagiarism, and helps your grammar checker understand the various parts of speech.

And one of the simplest ways to do it is with Python and Scikit-learn! With a little effort, you can be up and running in no time.

In the next sections, I'll show you how. Let's get started!

## 0️⃣ Prerequisites

1. Create a virtual environment using `conda` or `venv`
2. Install the required libraries: `pip install numpy pandas notebook scikit-learn`

## 1️⃣ Imports

In [68]:
import re
import string

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB

## 2️⃣ Read the data

First, let's start by reading the data. We'll use a dataset called **20 news groups**, which `scikit-learn` downloads and caches the first time you request it.

Use this code to read the data:

In [210]:
categories = [
    "alt.atheism",
    "misc.forsale",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
]

news_group_data = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes"), categories=categories
)

df = pd.DataFrame(dict(text=news_group_data["data"], target=news_group_data["target"]))
df["target"] = df.target.map(lambda x: news_group_data["target_names"][x])

This code reads the 20 news groups dataset. Here's how it works:
- **Lines 1 to 7:** Define a list of categories, which are the different types of newsgroups that will be used in the analysis.
- **Lines 9 to 11:** Use the `fetch_20newsgroups` function to get data from the 20 news groups dataset. This function removes the headers, footers, and quotes from the data, and only gets data from the categories that are specified in the `categories` list.
- **Lines 13 and 14:** Create a dataframe from the data that was fetched. The dataframe has two columns, one for the text of the newsgroup post and one for the category (target) of the newsgroup. You change the target column so that it displays the actual category name instead of a number.
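
The integer targets returned by `fetch_20newsgroups` are indices into the dataset's `target_names` list. To see the mapping step in isolation, here's a minimal sketch. It assumes a hand-made dictionary (`toy_data` is made up and stands in for the downloaded dataset, so nothing is fetched):

```python
import pandas as pd

# Hypothetical stand-in for the object returned by fetch_20newsgroups.
toy_data = {
    "data": ["Shuttle launch delayed", "Bike for sale", "Orbit question"],
    "target": [1, 0, 1],
    "target_names": ["misc.forsale", "sci.space"],
}

toy_df = pd.DataFrame({"text": toy_data["data"], "target": toy_data["target"]})

# Each integer target is an index into target_names.
toy_df["target"] = toy_df["target"].map(lambda i: toy_data["target_names"][i])

print(toy_df)
```

After the mapping, the `target` column holds readable category names instead of numbers.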

## 3️⃣ Clean the text column

Next, you'll clean the text to remove the punctuation marks and multiple spaces:

In [216]:
def process_text(text):
    text = str(text).lower()
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", " ", text
    )
    text = " ".join(text.split())
    return text

df["clean_text"] = df.text.map(process_text)

This code stores a cleaned version of each post in a new column called `clean_text`. The `process_text` function takes a string as input, lowercases it, replaces every punctuation mark with a space, and collapses repeated whitespace into single spaces.
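
To get a feel for what the cleaning does, you can run the function on a single string. This is a small sketch that repeats `process_text` so it runs on its own; the sample sentence is made up:

```python
import re
import string


def process_text(text):
    # Lowercase, replace punctuation with spaces, then collapse whitespace.
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())


sample = "Hello,   World!! It's  2023..."
print(process_text(sample))  # hello world it s 2023
```

Notice that the apostrophe counts as punctuation, so contractions such as "It's" are split into two tokens.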

## 4️⃣ Train/test split

Next, you'll split the dataset into a training and a testing set:

In [217]:
df_train, df_test = train_test_split(df, test_size=0.20, stratify=df.target)

The `train_test_split` function is used to split a dataset into a training set and a testing set. You provide the dataframe you wish to split and specify the following parameters:
- `test_size`: size of the testing set (as a decimal fraction of the total dataset).
- `stratify`: ensures that the training and testing sets are split in a stratified manner, meaning that the proportion of each class in the dataset is preserved in both sets.

Next, you'll use these datasets to train and evaluate your model.
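
If you want to check what `stratify` does, here's a minimal sketch on a made-up, imbalanced dataset (`toy_df` is hypothetical). The `random_state` argument is an addition that isn't in the code above; it only makes the split repeatable:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced data: 80 rows of class "a" and 20 rows of class "b".
toy_df = pd.DataFrame(
    {"text": [f"doc {i}" for i in range(100)], "target": ["a"] * 80 + ["b"] * 20}
)

# random_state is an addition here so the split is repeatable.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.20, stratify=toy_df.target, random_state=42
)

# Both sets keep the original 80/20 class proportions.
print(toy_train.target.value_counts(normalize=True))
print(toy_test.target.value_counts(normalize=True))
```

Without `stratify`, the class proportions in a small testing set could drift away from those of the full dataset.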

## 5️⃣ Create bag-of-words features

Machine learning models cannot handle raw text directly. To train your model, you'll need to turn the text into numerical features. You'll use `CountVectorizer` for that:

In [223]:
vec = CountVectorizer(
    ngram_range=(1, 3), 
    stop_words="english",
)

X_train = vec.fit_transform(df_train.clean_text)
X_test = vec.transform(df_test.clean_text)

y_train = df_train.target
y_test = df_test.target

In the code above you used `CountVectorizer` to turn the text into numerical features. Here's what's happening:

- **Lines 1 to 4:** You use `CountVectorizer` to build a [bag-of-words representation](https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation) of the `clean_text` column so that a machine learning model can understand it. You specify two parameters: `ngram_range` and `stop_words`. `ngram_range` is the range of n-gram sizes that the vectorizer will extract. An n-gram is a sequence of n words, so `(1, 3)` means that it will use sequences of 1, 2, and 3 words. `stop_words` controls which words the vectorizer ignores. Passing `"english"` tells it to use scikit-learn's built-in list of common English words.
- **Lines 6 and 7:** You generate the matrices of token counts (bag-of-words) for your training and testing set and save them into `X_train` and `X_test`.
- **Lines 9 and 10:** You save the response variable from the training and testing set into `y_train` and `y_test`.


## 6️⃣ Train and evaluate the model

Now you can train the model and evaluate it on the testing set:

In [225]:
nb = MultinomialNB()
nb.fit(X_train, y_train)

preds = nb.predict(X_test)
print(classification_report(y_test, preds))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.53      0.68       160
          misc.forsale       0.98      0.89      0.94       195
             sci.space       0.91      0.88      0.89       197
soc.religion.christian       0.65      0.99      0.79       200
    talk.politics.guns       0.92      0.88      0.90       182

              accuracy                           0.85       934
             macro avg       0.89      0.83      0.84       934
          weighted avg       0.88      0.85      0.84       934



In **lines 1 and 2** you train a [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) model. This is a simple probabilistic model that is commonly used with discrete features such as word counts.

Then, in **lines 4 and 5**, you predict the labels of the testing set and print a report with the precision, recall, and F1 score of each class, along with the overall accuracy.
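
In the report above, `alt.atheism` has high precision but low recall, while `soc.religion.christian` shows the opposite pattern. One possible reading is that many `alt.atheism` posts get labeled as `soc.religion.christian`, though the report alone can't confirm that. A confusion matrix would show it directly. Here's a minimal sketch on made-up labels (not the real predictions) that produces the same kind of pattern:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: half of the true "a" samples are predicted as "c".
y_true = ["a", "a", "a", "a", "c", "c", "c", "c"]
y_pred = ["a", "a", "c", "c", "c", "c", "c", "c"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["a", "c"])
print(cm)

precision_a = precision_score(y_true, y_pred, pos_label="a")
recall_a = recall_score(y_true, y_pred, pos_label="a")
precision_c = precision_score(y_true, y_pred, pos_label="c")
print(precision_a, recall_a, precision_c)
```

Class "a" keeps perfect precision but loses half of its recall, and the misplaced samples pull down the precision of class "c". To inspect the real model, you could pass `y_test` and `preds` to `confusion_matrix` instead.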

## 7️⃣ Save and load the model

If you'd like to save the model for later, you can use `joblib`. Here's how you'd save the model you just finished training:

In [196]:
import joblib

joblib.dump(nb, "nb.joblib")

['nb.joblib']

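Then, if you want to reuse your model later on, you can load it and classify new samples as shown here. The loaded model still needs the fitted vectorizer `vec` to turn text into features, so in a fresh session you'd have to save and load `vec` with `joblib` as well: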

In [197]:
nb_loaded = joblib.load("nb.joblib")

sample_text = ["space stars planets astronomy"]
sample_vec = vec.transform(sample_text)
nb_loaded.predict(sample_vec)

array(['sci.space'], dtype='<U22')
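
One way to avoid keeping track of the vectorizer and the model separately is to bundle them in a scikit-learn `Pipeline` and save that single object. This is a sketch of an alternative, not what the code above does. The training texts, labels, and the `pipe.joblib` file name are made up to keep the example self-contained:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data, only to keep the example self-contained.
toy_texts = [
    "rocket launch orbit",
    "planet orbit telescope",
    "selling used bike",
    "laptop for sale cheap",
]
toy_labels = ["space", "space", "forsale", "forsale"]

# The pipeline vectorizes the text and then classifies it.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipe.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    pipe_loaded = joblib.load(path)

# The loaded pipeline accepts raw text directly.
prediction = pipe_loaded.predict(["orbit telescope"])
print(prediction)
```

Because the pipeline stores the fitted vectorizer next to the classifier, the loaded object can classify raw strings without any extra setup.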

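## 8️⃣ Use cross-validation

A single train/test split gives you only one estimate of the model's performance. With stratified 10-fold cross-validation, you train and evaluate the model on ten different splits and average the scores: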

In [203]:
kf = StratifiedKFold(n_splits=10)

f1_scores = []
acc_scores = []
kappa_scores = []

for train_idx, val_idx in kf.split(X=df, y=df.target):
    df_train = df.iloc[train_idx, :]
    df_val = df.iloc[val_idx, :]
    
    vec = CountVectorizer(
        ngram_range=(1, 3), 
        stop_words="english"
    )
    
    X_train = vec.fit_transform(df_train.clean_text)
    y_train = df_train.target
    
    X_val = vec.transform(df_val.clean_text)
    y_val = df_val.target

    nb = MultinomialNB()

    nb.fit(X_train, y_train)
    preds = nb.predict(X_val)
    
    f1_scores.append(f1_score(y_val, preds, average="macro"))
    acc_scores.append(accuracy_score(y_val, preds))
    kappa_scores.append(cohen_kappa_score(y_val, preds))
    
print(
    f"f1={np.mean(f1_scores):.2f}+/-{np.std(f1_scores):.2f}",
    f"accuracy={np.mean(acc_scores):.2f}",
    f"kappa={np.mean(kappa_scores):.2f}",
)

f1=0.84+/-0.04 accuracy=0.84 kappa=0.80
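
If you prefer a more compact version, scikit-learn's `cross_validate` can run the same kind of loop for you. Here's a minimal sketch, assuming you wrap the vectorizer and the model in a pipeline. It uses made-up data and only 2 folds so it runs quickly without downloading anything, which means the scores it prints say nothing about the real dataset:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up data; with the real dataset you'd pass df.clean_text and df.target.
toy_texts = [
    "rocket launch orbit",
    "orbit telescope planet",
    "planet rocket moon",
    "moon launch telescope",
    "selling used bike",
    "bike for sale cheap",
    "cheap used laptop",
    "laptop sale selling",
]
toy_labels = ["space"] * 4 + ["forsale"] * 4

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

scoring = {
    "f1": "f1_macro",
    "accuracy": "accuracy",
    "kappa": make_scorer(cohen_kappa_score),
}

# The vectorizer is refit inside every fold, as in the manual loop.
results = cross_validate(
    pipe, toy_texts, toy_labels, cv=StratifiedKFold(n_splits=2), scoring=scoring
)

for name in scoring:
    scores = results[f"test_{name}"]
    print(f"{name}={np.mean(scores):.2f}+/-{np.std(scores):.2f}")
```

Fitting the vectorizer inside each fold matters because the validation texts should never influence the vocabulary the model is trained on.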
