### Text Classification (with Machine Learning)

Text Classification is the task of categorizing text into predefined categories or classes. It is typically done using supervised learning techniques, where the model predicts the category of a given text based on its content.

Common examples of text classification include:

- Spam Detection: Classifying emails as "spam" or "ham" (non-spam).
- Sentiment Analysis: Determining whether a piece of text expresses positive, negative, or neutral sentiment.
- Topic Categorization: Classifying articles into topics such as sports, politics, and technology.

In short, text classification involves analyzing the content of text and assigning it to one or more predefined categories.

#### Text Classification Examples:

- **Spam Detection**: Email -> "spam" or "ham"
- **Sentiment Analysis**: Text -> "positive", "negative", or "neutral"
- **Topic Categorization**: News article -> "Sports", "Politics", "Technology"

#### Text Classification Flowchart:

!["text-classification"](../images/5/5-text-classification-flowchart.png)


---


#### Real-Life Application of Text Classification Using the Spam Dataset &rarr; [Spam_Dataset.csv](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)


In [45]:
# Import Libs
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\iscie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\iscie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\iscie\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [46]:
# The dataset
data = pd.read_csv("../data/Spam_Dataset.csv", encoding="latin-1")
print(data.head())

     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


In [47]:
data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)

data.columns = ["label", "text"]
print(data.head())

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [48]:
# EDA: Missing values
print(data.isna().sum())

label    0
text     0
dtype: int64


In [49]:
# Text preprocessing
"""
    Special Chars
    Lowercase
    Tokenize
    Remove Stopwords
    Lemmatize
"""

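text = list(data["text"])

corpus = []
lemmatizer = WordNetLemmatizer()
# Build the stopword set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words("english"))

for message in text:
    # Replace every character that is not a letter with a space
    r = re.sub(r"[^a-zA-Z]", " ", message)
    r = r.lower()
    r = r.split()
    r = [word for word in r if word not in stop_words]
    r = [lemmatizer.lemmatize(word) for word in r]
    r = " ".join(r)

    corpus.append(r)

data["text2"] = corpus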
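The cleaning step depends on the exact regular expression. A "not a letter" character class is written `[^a-zA-Z]`; without the brackets, `^a-zA-Z` anchors at the start of the string and only matches the literal text `a-zA-Z`. The sketch below shows the difference on one message. The `STOPWORDS` set and the `clean` helper are assumptions made for illustration (NLTK's English list is much longer), and lemmatization is left out so the snippet needs only the standard library.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer).
STOPWORDS = {"a", "in", "to", "the", "is"}


def clean(message: str) -> str:
    """Hypothetical helper: strip non-letters, lowercase, drop stopwords."""
    # Replace every character that is not an ASCII letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", message)
    # Lowercase, split on whitespace, and drop stopwords.
    tokens = [t for t in letters_only.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)


sample = "Free entry in 2 a wkly comp to win FA Cup!"
cleaned = clean(sample)
# Without brackets the pattern matches nothing here, so the text is unchanged.
unbracketed = re.sub(r"^a-zA-Z", " ", sample)

print(cleaned)      # free entry wkly comp win fa cup
print(unbracketed)  # Free entry in 2 a wkly comp to win FA Cup!
```

Digits and punctuation disappear only with the bracketed pattern, which keeps tokens such as `2` or `cup!` out of the vocabulary.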

In [50]:
# Train-Test split (67% - 33%)
X = data["text2"]
Y = data["label"]

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42
)
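To make the split concrete, here is a minimal standard-library sketch of a seeded shuffle split. It is a simplified stand-in and not scikit-learn's implementation: `shuffle_split` is a hypothetical helper, the resulting rows will differ from what `train_test_split` selects, and rounding the test count up is, to my understanding, how scikit-learn treats a float `test_size`.

```python
import math
import random


def shuffle_split(items, test_size, seed):
    """Hypothetical helper: return (train, test) after a seeded shuffle."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)         # same seed, same order
    n_test = math.ceil(test_size * len(items))   # round the test count up
    test_idx = set(indices[:n_test])
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test


messages = [f"message {i}" for i in range(10)]
train, test = shuffle_split(messages, test_size=0.33, seed=42)
print(len(train), len(test))  # 6 4
```

Fixing the seed plays the same role as `random_state=42`: rerunning the cell yields the same partition, so results stay comparable between runs.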

In [51]:
# Feature extraction: BoW
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
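A bag-of-words matrix can be sketched in plain Python to show why the vocabulary is fitted on the training data only and then reused for the test data. This is a simplified illustration: `fit_vocabulary` and `transform` are hypothetical helpers, and whitespace splitting is not the tokenization `CountVectorizer` applies by default.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each unique training token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Count vocabulary tokens per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocab)
        for tok, n in counts.items():
            if tok in vocab:
                row[vocab[tok]] = n
        rows.append(row)
    return rows


train_docs = ["free prize win free", "see you at lunch"]
test_docs = ["win a free lunch"]

vocab = fit_vocabulary(train_docs)          # columns: at, free, lunch, prize, see, win, you
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)   # "a" was never seen in training, so it is dropped

print(train_matrix)  # [[0, 2, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1]]
print(test_matrix)   # [[0, 1, 1, 0, 0, 1, 0]]
```

Because the test rows use the training vocabulary, both matrices have the same number of columns, which is what the classifier expects at prediction time.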

In [52]:
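# Model training: fit a decision tree on the bag-of-words features
dt = DecisionTreeClassifier()
dt.fit(X_train_cv, Y_train)

# Vectorize the test set with the vocabulary learned from the training set
x_test_cv = cv.transform(X_test)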

In [53]:
# Prediction
predictions = dt.predict(x_test_cv)

c_matrix = confusion_matrix(Y_test, predictions)

In [54]:
# Accuracy: correct predictions (the diagonal) as a percentage of all predictions
print(
    "Accuracy:",
    100 * c_matrix.trace() / c_matrix.sum(),
)

Accuracy: 97.6073953235454
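The accuracy figure can be read directly off the confusion matrix. The sketch below uses a made-up 2x2 matrix (the counts are hypothetical, not results from this dataset) laid out as scikit-learn does: rows are true labels, columns are predicted labels, and labels are sorted, so "ham" comes before "spam". It also shows the recall for the spam class, which can be lower than the overall accuracy suggests when ham messages are the majority.

```python
# Hypothetical counts for illustration only.
c_matrix = [
    [950, 10],   # true ham:  predicted ham, predicted spam
    [20, 120],   # true spam: predicted ham, predicted spam
]

total = sum(sum(row) for row in c_matrix)
correct = sum(c_matrix[i][i] for i in range(len(c_matrix)))  # diagonal entries

# Accuracy as diagonal / total ...
accuracy = 100 * correct / total
# ... which equals "total minus the two off-diagonal cells" for a 2x2 matrix.
off_diagonal_form = 100 * (total - c_matrix[1][0] - c_matrix[0][1]) / total

# Share of true spam messages that were caught.
spam_recall = 100 * c_matrix[1][1] / sum(c_matrix[1])

print(f"Accuracy: {accuracy:.2f}")        # Accuracy: 97.27
print(f"Spam recall: {spam_recall:.2f}")  # Spam recall: 85.71
```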
