<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB888_VII_3_TextClassExample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification: Getting started


In this tutorial, we walk through a text classification example to see how text data can be used in ML applications. We rely on (a slightly reduced version of) the **AG News Classification Dataset**, which contains news articles, their titles, and each article's category (class ids 1-4, where 1-World, 2-Sports, 3-Business, 4-Sci/Tech).

As usual, let's start by loading the relevant libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from wordcloud import WordCloud

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score

from tensorflow import keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

And let's import the dataset:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
News = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB888_VII_3_TextClassExample.csv', index_col=0)

In [None]:
News.head()

## Prepare and Clean the Data

We start by preparing the text data. We first reset the index (the first column is really the category label, not an index), rename it to "Category", and make it a categorical variable:

In [None]:
News = News.reset_index()
News = News.rename(columns={"Class Index": "Category"})
News['Category'] = pd.Categorical(News['Category'])
News.head()

Then, let's combine the Title and Description columns:

In [None]:
News['Combined'] = News['Title'] + ' ' + News['Description']
News = News.drop(['Title', 'Description'], axis=1)
News.head()

Next, we take a few steps to process the text into a more usable format. Let's start by **removing punctuation**:

In [None]:
def remove_punctuation(text):
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

News['Combined'] = News['Combined'].apply(remove_punctuation)
News.head()
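
To see what this step does on a single string, here is a small standalone sketch that applies the same `str.maketrans` approach to two illustrative strings (chosen for demonstration, not pulled from the data frame). It also shows a side effect worth knowing about: punctuation is deleted rather than replaced by a space, so abbreviations and hyphenated words get fused together.

```python
import string

def remove_punctuation(text):
    # Build a translation table that deletes every character in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

headline = "Wall St. Bears Claw Back Into the Black (Reuters)"
cleaned_headline = remove_punctuation(headline)
print(cleaned_headline)

# Periods and hyphens are removed without inserting a space,
# so the pieces on either side are joined into one token.
fused = remove_punctuation("U.S.-based start-up")
print(fused)
```

If fused tokens turn out to be a problem, one option would be to map punctuation to spaces instead of deleting it; we keep the simpler version here.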

Next, let's make it all **lower case**:

In [None]:
News['Combined'] = News['Combined'].str.lower()
News.head()

Next, let's **remove "stopwords"**, i.e., very common words (such as "the" or "and") that contribute little to the meaning:

In [None]:
nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set makes the membership test below much faster than a list
News['Combined'] = News['Combined'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
News.head()
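
The filtering logic is easy to check on one sentence. The sketch below uses a tiny hand-picked stopword set (`toy_stopwords`, a made-up stand-in for the much longer NLTK list) and a hypothetical helper `drop_stopwords`, so it runs without downloading anything.

```python
# A tiny hand-picked stopword set, used only for illustration;
# the NLTK English list is considerably longer.
toy_stopwords = {"the", "a", "of", "in", "and", "to", "is"}

def drop_stopwords(text, stopword_set):
    # Keep only the whitespace-separated tokens that are not stopwords
    return ' '.join(word for word in text.split() if word not in stopword_set)

sentence = "the price of oil is rising in the world markets"
filtered = drop_stopwords(sentence, toy_stopwords)
print(filtered)
```

Note that the comparison is case-sensitive, which is why lower-casing the text before this step matters.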

Next, let's do **"stemming"**, i.e., let's strip suffixes such as "-ly" or "-ing" so that related word forms collapse to a common stem:

In [None]:
porter = PorterStemmer()
News['Combined'] = News['Combined'].apply(lambda x: ' '.join([porter.stem(word) for word in x.split()]))
News.head()
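
To get a feel for what the stemmer produces, here is a short sketch that runs the same `PorterStemmer` on a few words picked for illustration. The stems are not always dictionary words, which is fine for our purposes because we only need related forms to map to the same token.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# A few illustrative words (not taken from the dataset)
words = ["running", "markets", "studies", "quickly"]
stems = [porter.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```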

And finally, let's carry out **"lemmatization"**, i.e., let's map each word to its dictionary root form (its lemma):

In [None]:
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
News['Combined'] = News['Combined'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
News.head()

Let's save the data as our "processed" data set:

In [None]:
News.to_csv('GB888_VII_3_TextClassExample_processed.csv', index=False)

## Visualize data

We visualize the data via word clouds characterizing the different categories. Recall that the categories are 1-World, 2-Sports, 3-Business, 4-Sci/Tech:

In [None]:
# Function to generate and display word cloud
def generate_wordcloud(text, title):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

# Generate word cloud for all data
combined_text = ' '.join(News['Combined'].tolist())
generate_wordcloud(combined_text, "Combined Word Cloud (All Data)")

# Generate word clouds for each category
for category in range(1, 5):
    category_text = ' '.join(News[News['Category'] == category]['Combined'].tolist())
    generate_wordcloud(category_text, f"Combined Word Cloud (Category {category})")


So the categories look quite different!

## Encode the Text

We use TF-IDF (term frequency-inverse document frequency) to encode the text. This is a variant of the bag-of-words approach that weights each word by how frequently it appears in an article and how rare it is across all articles:

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(News['Combined'])
X.shape

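The way to think about X is as a matrix with one row per article and one column per distinct word in the corpus (there seem to be 74389 of them). Each entry is the TF-IDF weight of that word in that article, so most entries are zero and X is stored as a sparse matrix.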
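
To make the encoding concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It follows the weighting that, as far as we know, `TfidfVectorizer` applies with its default settings (smoothed inverse document frequency, then scaling each row to unit length); treat it as an illustration of the idea rather than a replacement for the library.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "stock market fall",
    "team win match",
    "stock price rise market",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer words get larger weights
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    # Term count times idf, then scale the row to unit Euclidean length
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

X_toy = [tfidf_row(doc) for doc in tokenized]
print(len(X_toy), "rows x", len(vocab), "columns")
```

In the first toy document, "fall" receives a larger weight than "stock" because "stock" also appears in another document and is therefore less distinctive.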

## ML Modeling

So let's do some ML. We start by splitting our data into a training and a test set:

In [None]:
y = News['Category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Multi-class Regression

We start with a multinomial logistic regression. Note that LogisticRegression in sklearn supports multi-class outcomes directly, so we just rely on that:

In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
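
Conceptually, a multinomial logistic regression computes one linear score per class and turns the scores into probabilities with a softmax. The NumPy sketch below illustrates this with made-up coefficients `W`, intercepts `b`, and a toy feature vector `x`; none of these are the fitted values from the model above.

```python
import numpy as np

def softmax(scores):
    # Subtract the maximum for numerical stability; the probabilities are unchanged
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical coefficients: 4 classes x 3 features, plus one intercept per class
W = np.array([[ 0.5, -1.0,  0.2],
              [-0.3,  0.8,  0.0],
              [ 0.1,  0.1,  0.9],
              [ 0.0, -0.2, -0.4]])
b = np.array([0.1, 0.0, -0.1, 0.0])
x = np.array([1.0, 0.0, 2.0])  # one toy feature vector

scores = W @ x + b             # one linear score per class
probs = softmax(scores)        # class probabilities that sum to one
predicted_class = int(np.argmax(probs)) + 1  # classes are labeled 1-4
print(probs.round(3), predicted_class)
```

In our setting the feature vector is a row of the TF-IDF matrix, so each class gets one coefficient per word.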

To evaluate, let's use a multi-class confusion matrix:

In [None]:
y_pred = model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=[1, 2, 3, 4], yticklabels=[1, 2, 3, 4])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()


Let's calculate the overall accuracy:

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
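
The overall accuracy can also be read off the confusion matrix, together with per-class recall and precision, which show where the errors concentrate. Here is a small sketch on a hypothetical 4x4 matrix (`toy_conf`, with invented counts), using the sklearn convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (rows: actual class, columns: predicted class)
toy_conf = np.array([[90,  2,  5,  3],
                     [ 1, 97,  1,  1],
                     [ 6,  1, 85,  8],
                     [ 4,  0, 10, 86]])

# Correct predictions sit on the diagonal
overall_accuracy = np.trace(toy_conf) / toy_conf.sum()

# Recall: share of each actual class that was predicted correctly
recall_per_class = np.diag(toy_conf) / toy_conf.sum(axis=1)

# Precision: share of each predicted class that was actually correct
precision_per_class = np.diag(toy_conf) / toy_conf.sum(axis=0)

print(overall_accuracy)
print(recall_per_class.round(3))
print(precision_per_class.round(3))
```

The same three lines can be applied to `conf_matrix` from the cell above.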

Pretty good!

### Neural Network Modeling

Let's also try a simple neural network model. Since we are using keras, we convert the outcomes into levels between 0 and 3 (rather than 1 to 4) and then one-hot encode them:

In [None]:
y = News['Category'].astype(int) - 1
y = to_categorical(y,4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
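
For readers who have not seen one-hot encoding before, the following NumPy-only sketch reproduces the idea on four toy labels (it does not call keras): each label becomes a vector with a single 1 in the position of its class, and `argmax` recovers the label.

```python
import numpy as np

labels = np.array([3, 1, 4, 2])   # toy labels on the original 1-4 scale
zero_based = labels - 1           # shift to 0-3

# Row i of the 4x4 identity matrix is the one-hot vector for class i
one_hot = np.eye(4)[zero_based]
print(one_hot)

# argmax undoes the encoding; adding 1 returns to the 1-4 scale
recovered = np.argmax(one_hot, axis=1) + 1
print(recovered)
```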

Since we have a high-dimensional input vector, we go with a simple network. Somewhat arbitrarily, we build a neural network with one hidden layer of three neurons with ReLU activation functions. Since this is a multi-class classification problem, we use a softmax output layer:

In [None]:
# Define the neural network model
model = Sequential()
model.add(Dense(3, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(4, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
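
Even this small network has a lot of parameters, because every input word connects to every hidden neuron. A quick back-of-the-envelope count, assuming the 74389 input columns mentioned earlier (the exact number depends on `X.shape[1]` in your run):

```python
n_features = 74389   # vocabulary size mentioned earlier; an assumption for this sketch
n_hidden = 3
n_classes = 4

# Each dense layer has (inputs x outputs) weights plus one bias per output
hidden_params = n_features * n_hidden + n_hidden
output_params = n_hidden * n_classes + n_classes
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

Almost all of the parameters sit in the first layer, which is why we keep the hidden layer so narrow.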

In [None]:
# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split = 0.2)

It seems like we're overfitting after epoch 5 or 6...
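
One common response to this pattern is early stopping: monitor the validation loss and stop once it has not improved for a few epochs. The sketch below illustrates the rule in plain Python on invented loss values (`val_loss` is hypothetical, not the output of the run above, and `stopping_epoch` is a helper written for this example). In keras, the `EarlyStopping` callback provides this kind of rule.

```python
# Hypothetical per-epoch validation losses, made up for illustration
val_loss = [0.62, 0.41, 0.33, 0.29, 0.27, 0.28, 0.30, 0.33, 0.36, 0.40]

def stopping_epoch(losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for patience-based early stopping."""
    best_epoch, best_loss, waited = 1, losses[0], 0
    for epoch, loss in enumerate(losses[1:], start=2):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training runs to the last epoch
    return len(losses), best_epoch

stop_epoch, best_epoch = stopping_epoch(val_loss, patience=2)
print(stop_epoch, best_epoch)
```

With these toy numbers, training would stop two epochs after the best one, and we would keep the weights from the best epoch.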

To see how we are doing, we determine the accuracy on the test set and again plot a multi-class confusion matrix:

In [None]:
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true_classes = np.argmax(y_test, axis=1)

conf_matrix = confusion_matrix(y_true_classes, y_pred_classes)

# Plot the confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=[1, 2, 3, 4], yticklabels=[1, 2, 3, 4])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

accuracy = accuracy_score(y_true_classes, y_pred_classes)
print(f"Accuracy: {accuracy}")


So, the performance is fairly similar to that of the logistic regression model. We can, of course, work more on refining the model. However, a question is whether a simple feed-forward model using this bag-of-words input vector (which ignores order, grammar, and "meaning") is the best we can do...