# 📚 Text classification using Python and Scikit-learn

Text classification is the task of automatically assigning labels to pieces of text, such as articles, blog posts, or reviews. Many businesses use text classification algorithms to save time and money by reducing the amount of manual labor needed to organize and analyze their text data.

These algorithms are extremely powerful tools when used correctly. Text classification models keep your email free of spam, assist authors in detecting plagiarism, and help your grammar checker understand the various parts of speech.

If you want to build a text classifier, you have many options to choose from. You can use traditional methods such as bag-of-words, more advanced methods like Word2Vec embeddings, or cutting-edge approaches like BERT or GPT-3.

However, if you want to get something up and running quickly at no cost, you should build your text classification model with Python and Scikit-learn. I'll show you how in this tutorial.

Let's get started!

## 0️⃣ Prerequisites

1. Create a virtual environment using `conda` or `venv`
2. Install the required libraries: `pip install numpy pandas notebook scikit-learn`

## 1️⃣ Imports

In [1]:
import joblib
import re
import string

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB

## 2️⃣ Read the data

In [2]:
categories = [
    "alt.atheism",
    "misc.forsale",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
]

news_group_data = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes"), categories=categories
)

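df = pd.DataFrame(dict(text=news_group_data["data"], target=news_group_data["target"]))
# Target integers index into `target_names`, whose order can differ from the `categories` list
df["target"] = df.target.map(lambda x: news_group_data["target_names"][x])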

## 3️⃣ Clean the text column

In [3]:
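def process_text(text):
    # Lowercase so that "Space" and "space" count as the same token
    text = str(text).lower()
    # Replace every punctuation character with a space
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    text = " ".join(text.split())
    return text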

df["clean_text"] = df.text.map(process_text)
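
To see what the cleaning step does, here is a minimal sketch that runs it on one made-up string. The function body is repeated from the cell above so the snippet runs on its own; the sample sentence is an assumption for illustration only.

```python
import re
import string


def process_text(text):
    # Same cleaning steps as the cell above, repeated so this snippet is self-contained
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())


sample = "Hello, World!!  It's   2024..."  # hypothetical input
cleaned = process_text(sample)
print(cleaned)  # hello world it s 2024
```

Note that the apostrophe is treated like any other punctuation mark, so contractions such as "it's" are split into two tokens.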

## 4️⃣ Train/test split

In [4]:
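# random_state fixes the shuffle so the split is reproducible (the value 42 is arbitrary)
df_train, df_test = train_test_split(
    df, test_size=0.20, stratify=df.target, random_state=42
)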
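
The `stratify` argument keeps the class proportions the same in both splits. Here is a small sketch on made-up data (the toy frame and its labels are assumptions, not part of the newsgroups dataset) that shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 12 "space" rows and 8 "forsale" rows, a 60/40 mix
toy = pd.DataFrame(
    {
        "text": [f"doc {i}" for i in range(20)],
        "target": ["space"] * 12 + ["forsale"] * 8,
    }
)

toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy.target, random_state=0
)

# Class shares in each split; both should match the 60/40 mix of the full frame
train_share = toy_train.target.value_counts(normalize=True)
test_share = toy_test.target.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, a small or imbalanced dataset can end up with a test set that under-represents some classes.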

## 5️⃣ Create bag-of-words features

In [None]:
vec = CountVectorizer(
    ngram_range=(1, 3),  # unigrams, bigrams, and trigrams
    stop_words="english",  # drop common English words such as "the" and "for"
)

# Learn the vocabulary from the training set only, then reuse it for the test set
X_train = vec.fit_transform(df_train.clean_text)
X_test = vec.transform(df_test.clean_text)

y_train = df_train.target
y_test = df_test.target

## 6️⃣ Train and evaluate the model

In [None]:
nb = MultinomialNB()
nb.fit(X_train, y_train)

preds = nb.predict(X_test)
print(classification_report(y_test, preds))
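
The imports at the top also bring in `accuracy_score`, `f1_score`, and `cohen_kappa_score`, which are handy when you want single summary numbers instead of a full report. Here is a small sketch on made-up labels (the toy lists are assumptions so that the snippet runs on its own); in the notebook you would pass `y_test` and `preds` instead.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Hypothetical labels: three of the four predictions are correct
y_true_toy = ["a", "a", "b", "b"]
y_pred_toy = ["a", "b", "b", "b"]

acc = accuracy_score(y_true_toy, y_pred_toy)
# Macro F1 averages the per-class F1 scores, giving every class equal weight
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")
# Cohen's kappa corrects the observed agreement for agreement expected by chance
kappa = cohen_kappa_score(y_true_toy, y_pred_toy)

print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f} kappa={kappa:.3f}")
# accuracy=0.750 macro_f1=0.733 kappa=0.500
```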

## 7️⃣ Save and load the model

In [None]:
joblib.dump(nb, "nb.joblib")
joblib.dump(vec, "vec.joblib")

In [None]:
nb_saved = joblib.load("nb.joblib")
vec_saved = joblib.load("vec.joblib")

sample_text = ["Space, Stars, Planets and Astronomy!"]
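# Clean each string the same way as the training data, then vectorize the cleaned text
clean_sample_text = [process_text(text) for text in sample_text]
sample_vec = vec_saved.transform(clean_sample_text)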
nb_saved.predict(sample_vec)
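
One alternative worth knowing about: instead of saving the vectorizer and the model as two files, you can wrap them in a scikit-learn pipeline and save a single object. The sketch below is not what this notebook does; it uses a tiny made-up corpus, a temporary directory, and a hypothetical file name (`text_clf.joblib`) so that it runs on its own, and it leaves out the cleaning function to stay short.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, only to make the example runnable
toy_texts = [
    "rocket launch orbit",
    "nasa space shuttle",
    "selling used bike",
    "for sale cheap laptop",
]
toy_labels = ["sci.space", "sci.space", "misc.forsale", "misc.forsale"]

# The pipeline vectorizes raw text and then classifies it in one call
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_clf.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    pipe_loaded = joblib.load(path)

prediction = pipe_loaded.predict(["space rocket"])
print(prediction)
```

With a single artifact, the vectorizer and the model can't get out of sync, and prediction takes raw strings directly.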