# Random Forest for Topic Classification

In this notebook, a Random Forest is used to perform topic classification on the "GenericMixOfTopic" dataset. The task is multi-label: there are 40 labels in total, each corresponding to a topic that may be present in the text.

### Import the required libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, hamming_loss, classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import LabelEncoder
import json
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

### Load the data

The file is in Parquet format and can be read with the Pandas library as follows:

In [2]:
file_path = "GenericMixOfTopic.parquet"
data = pd.read_parquet(file_path, engine='pyarrow')

# Keep only the first 999 rows to limit the computational cost.
# .copy() makes the subset independent, so columns can be added to it later.
data = data.iloc[:999].copy()
print(data.head())

         id                                              title  \
0  44579372                                  Julius Julskötare   
1  69360653                                         Josia Topf   
2  41642068  St. Peter Chaldean Catholic Cathedral (El Cajo...   
3   4351257                                  Allahabad Address   
4    648505                             Glomerulus (olfaction)   

                  topic                            topics_with_percentages  \
0                 Mixed  b'{"Entertainment":0.67,"Culture":0.17,"Mass_m...   
1                 Mixed  b'{"People":0.5,"Sports":0.5,"Academic_discipl...   
2                 Mixed  b'{"Religion":0.36,"Culture":0.21,"Time":0.14,...   
3                 Mixed  b'{"History":0.3,"Government":0.2,"Philosophy"...   
4  Academic_disciplines  b'{"Academic_disciplines":1.0,"Business":0,"Co...   

                                                text  
0    'Julius Julskötare' (&quot;Julius Christmas ...  
1    Infobox athlete | n

The dataset has no missing values:

In [3]:
print(data.isnull().sum())

id                         0
title                      0
topic                      0
topics_with_percentages    0
text                       0
dtype: int64


Notice the data type of the target variable:

In [4]:
type(data['topics_with_percentages'][0])

bytes
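
Because the target is stored as raw bytes, it has to be decoded and parsed before it can be used. The sketch below shows the idea on a single made-up value; `raw` is a hypothetical example shaped like the entries of `topics_with_percentages`, not a row taken from the dataset.

```python
import json

# Hypothetical raw value, shaped like the entries of 'topics_with_percentages'
raw = b'{"People":0.5,"Sports":0.5,"Business":0}'

# bytes -> str -> dict
topics = json.loads(raw.decode("utf-8"))

# A fixed, sorted topic order guarantees that every row maps to the same columns
topic_order = sorted(topics)
vector = [topics[name] for name in topic_order]

print(type(topics).__name__)  # dict
print(topic_order)            # ['Business', 'People', 'Sports']
print(vector)                 # [0, 0.5, 0.5]
```

`json.loads` also accepts `bytes` directly, so the explicit `decode` call mainly documents the expected encoding.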

### Data preparation

Text preprocessing is performed on the 'combined text', the concatenation of the 'title' and the 'text' columns. The data is vectorized with **TF-IDF** (Term Frequency-Inverse Document Frequency), a statistical measure of how important a word is to a document within a collection or corpus. The target variable is also preprocessed: each JSON string is decoded into a dictionary and then expanded into a fixed-order vector of topic percentages.
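
As a rough illustration of how TF-IDF weights words, the sketch below fits scikit-learn's `TfidfVectorizer` with its default settings on a made-up three-document corpus (the corpus and variable names are assumptions for this example only). A word that occurs in every document receives a lower weight than a word that occurs in a single one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "football" occurs in every document,
# "stadium" only in the first one
corpus = [
    "football match stadium",
    "football election parliament",
    "football recipe kitchen",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(corpus)

vocab = vectorizer.vocabulary_
idf_football = vectorizer.idf_[vocab["football"]]
idf_stadium = vectorizer.idf_[vocab["stadium"]]

# With the default smoothing, idf = ln((1 + n_docs) / (1 + doc_freq)) + 1
print("idf football:", idf_football)  # 1.0, since the word is in all 3 documents
print("idf stadium:", idf_stadium)    # about 1.69, since the word is in 1 document

# Weights of the first document (rows are L2-normalised by default)
doc0 = weights[0].toarray().ravel()
print("football weight in doc 0:", doc0[vocab["football"]])
print("stadium weight in doc 0:", doc0[vocab["stadium"]])
```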

In [5]:
data['topics_with_percentages'] = data['topics_with_percentages'].apply(lambda x: json.loads(x.decode('utf-8')))

all_topics = set()
for topics in data['topics_with_percentages']:
    all_topics.update(topics.keys())
all_topics = sorted(all_topics)  

def topics_to_vector(topics_dict):
    return [topics_dict.get(topic, 0) for topic in all_topics]
y = pd.DataFrame(data['topics_with_percentages'].apply(topics_to_vector).tolist(), columns=all_topics)
y.head

<bound method NDFrame.head of      Academic_disciplines  Business  Communication  Concepts  Culture  \
0                     0.0       0.0            0.0       0.0     0.17   
1                     0.0       0.0            0.0       0.0     0.00   
2                     0.0       0.0            0.0       0.0     0.21   
3                     0.0       0.0            0.0       0.0     0.00   
4                     1.0       0.0            0.0       0.0     0.00   
..                    ...       ...            ...       ...      ...   
994                   0.0       0.0            0.0       0.0     0.00   
995                   0.0       0.0            0.0       0.0     0.00   
996                   0.0       0.0            0.0       0.0     0.33   
997                   0.0       0.0            0.0       0.0     0.06   
998                   0.0       0.0            0.0       0.0     0.00   

     Economy  Education  Energy  Engineering  Entertainment  ...  People  \
0        0.0     

**TF-IDF vectorization** is applied. Note that the dataset is split before the vectorizer is fitted, so that no information from the test set is used during training.

In [6]:
data['combined_text'] = data['title'] + " " + data['text']

X_train_text, X_test_text, y_train, y_test = train_test_split(data['combined_text'], y, test_size=0.4, random_state=42)

tfidf = TfidfVectorizer(max_features=10000, stop_words='english')
X_train = tfidf.fit_transform(X_train_text)
X_test = tfidf.transform(X_test_text)
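
The effect of fitting the vectorizer on the training split only can be seen on a tiny made-up example (the texts and the `_demo` names below are hypothetical). Words that occur only in the test text are not part of the learned vocabulary and are simply ignored by `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature train/test split
train_texts = ["alpine skiing race", "olympic skiing medal"]
test_texts = ["olympic curling final"]

tfidf_demo = TfidfVectorizer()
X_train_demo = tfidf_demo.fit_transform(train_texts)  # vocabulary learned here only
X_test_demo = tfidf_demo.transform(test_texts)        # reuses the training vocabulary

print(sorted(tfidf_demo.vocabulary_))  # ['alpine', 'medal', 'olympic', 'race', 'skiing']
print(X_train_demo.shape, X_test_demo.shape)  # (2, 5) (1, 5)

# Only "olympic" is known from training; "curling" and "final" are dropped
print(X_test_demo.nnz)  # 1
```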

In [7]:
data.drop(columns=['topic'], inplace=True)

### Model training

In [8]:
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
multi_target_rf_regressor = MultiOutputRegressor(rf_regressor, n_jobs=-1)
multi_target_rf_regressor.fit(X_train, y_train)
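
To make the role of `MultiOutputRegressor` more concrete, here is a minimal sketch on synthetic data (the arrays and the `_demo` names are assumptions, unrelated to the dataset). The wrapper fits one independent copy of the base regressor per target column. `RandomForestRegressor` can also be fitted on a 2-D target directly, in which case a single forest predicts all columns jointly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data: 60 samples, 5 features, 3 target columns
rng = np.random.default_rng(0)
X_demo = rng.random((60, 5))
Y_demo = np.column_stack([X_demo[:, 0], 0.5 * X_demo[:, 1], 1.0 - X_demo[:, 0]])

# Wrapper: one forest per target column
base = RandomForestRegressor(n_estimators=10, random_state=0)
wrapped = MultiOutputRegressor(base).fit(X_demo, Y_demo)
print(len(wrapped.estimators_))           # 3
print(wrapped.predict(X_demo[:2]).shape)  # (2, 3)

# Native multi-output: a single forest for all target columns
native = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, Y_demo)
print(native.predict(X_demo[:2]).shape)   # (2, 3)
```

The wrapper lets each topic get its own trees and allows the per-target fits to run in parallel through `n_jobs`; the native form trains fewer trees overall. Which one performs better on this dataset is not something this sketch establishes.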

### Model evaluation
To evaluate the model, predictions are computed on the test set and compared with the ground truth using the **mean squared error** and the **mean absolute error**, each reported separately per topic.

In [9]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

predictions = multi_target_rf_regressor.predict(X_test)

mse = mean_squared_error(y_test, predictions, multioutput='raw_values')
mae = mean_absolute_error(y_test, predictions, multioutput='raw_values')

print("Mean Squared Error for each topic:", mse)
print("Mean Absolute Error for each topic:", mae)

Mean Squared Error for each topic: [0.00500225 0.00905847 0.00013276 0.00369158 0.01500095 0.00826325
 0.00724945 0.00852742 0.00405975 0.01273741 0.00435363 0.00037107
 0.00622053 0.01867246 0.01360279 0.00672659 0.009994   0.01010954
 0.01322949 0.00345883 0.0032599  0.01020598 0.00683212 0.008464
 0.00469459 0.00976192 0.00574811 0.00653638 0.00565337 0.02941734
 0.00264023 0.00633472 0.00858599 0.00901926 0.02060709 0.01890832
 0.0127583  0.00417683 0.00411572]
Mean Absolute Error for each topic: [0.02720175 0.03260025 0.0012795  0.01122125 0.055798   0.02353875
 0.03119775 0.01234775 0.01794875 0.053096   0.02239175 0.00228125
 0.01505925 0.07072675 0.05758725 0.0245325  0.06153525 0.03144775
 0.038952   0.011901   0.008667   0.021279   0.01973875 0.018928
 0.017419   0.04369025 0.01120325 0.02405875 0.0188575  0.11031425
 0.00878825 0.0318835  0.0264815  0.0257875  0.05723275 0.06283175
 0.0339285  0.040018   0.02536575]
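
The per-topic arrays above come from `multioutput='raw_values'`, which returns one error per target column instead of a single average. A small worked example with made-up numbers (two samples, two topics; all `_demo` names are hypothetical) shows the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical ground truth and predictions: rows are samples, columns are topics
y_true_demo = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
y_pred_demo = np.array([[0.5, 0.0],
                        [0.0, 1.0]])

# One value per column: the first topic has errors 0.5 and 0, the second has none
mse_per_topic = mean_squared_error(y_true_demo, y_pred_demo, multioutput="raw_values")
mae_per_topic = mean_absolute_error(y_true_demo, y_pred_demo, multioutput="raw_values")
print(mse_per_topic)  # [0.125 0.   ]  -> (0.5**2 + 0**2) / 2 for the first topic
print(mae_per_topic)  # [0.25 0.  ]    -> (0.5 + 0) / 2 for the first topic

# Default behaviour: uniform average over the columns
mse_average = mean_squared_error(y_true_demo, y_pred_demo)
print(mse_average)    # 0.0625
```

In the notebook, the raw arrays follow the column order of `y_test`, so wrapping them as `pd.Series(mse, index=y_test.columns).sort_values(ascending=False)` would attach the topic names and rank the topics by error.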


### Visualizations

Five random samples from the test set are displayed, comparing the ground truth with the prediction for each topic.

In [11]:
import random
sample_indices = random.sample(list(y_test.index), 5)

for idx in sample_indices:
    row = data.loc[idx]
    pos = y_test.index.get_loc(idx)  # position of this sample's row in `predictions`
    print(f"Text: {row['combined_text'][:300]}...")  # first 300 characters of combined_text
    actual_topics = {topic: y_test.at[idx, topic] for topic in all_topics}
    predicted_topics = {topic: predictions[pos][i] for i, topic in enumerate(all_topics)}
    print(f"Actual Topics with Percentages: {actual_topics}")
    print(f"Predicted Topics with Percentages: {predicted_topics}")
    print("\n" + "-"*80 + "\n")

Text: Italy at the 1992 Winter Olympics   Italy competed at the 1992 Winter Olympics in Albertville , France . Medalists Competitors The following is the list of number of competitors in the Games. Alpine skiing ;Men Men's combined ;Women Women's combined Biathlon ;Men ;Men's 4 x 7.5&amp;nbsp;km relay ;Wo...
Actual Topics with Percentages: {'Academic_disciplines': 0.0, 'Business': 0.0, 'Communication': 0.0, 'Concepts': 0.0, 'Culture': 0.0, 'Economy': 0.0, 'Education': 0.0, 'Energy': 0.0, 'Engineering': 0.0, 'Entertainment': 0.0, 'Entities': 0.0, 'Ethics': 0.0, 'Food_and_drink': 0.0, 'Geography': 0.0, 'Government': 0.0, 'Health': 0.0, 'History': 0.0, 'Human_behavior': 0.0, 'Humanities': 0.0, 'Information': 0.0, 'Internet': 0.0, 'Knowledge': 0.0, 'Language': 0.0, 'Law': 0.0, 'Life': 0.0, 'Mass_media': 0.0, 'Mathematics': 0.0, 'Military': 0.0, 'Nature': 0.0, 'People': 0.0, 'Philosophy': 0.0, 'Politics': 0.0, 'Religion': 0.0, 'Science': 0.0, 'Society': 0.0, 'Sports': 1.0, 'Technology': 0.0
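
Since most entries in these dictionaries are zero, a more compact comparison can keep only the largest shares. The helper below is a hypothetical sketch (`top_topics` and the demo dictionaries are not part of the notebook) that puts actual and predicted percentages side by side and keeps the `k` topics with the highest value in either column.

```python
import pandas as pd

def top_topics(actual, predicted, k=3):
    """Hypothetical helper: keep the k topics with the largest actual or predicted share."""
    table = pd.DataFrame({"actual": pd.Series(actual), "predicted": pd.Series(predicted)})
    # Rank each topic by the larger of its two values
    table["max_share"] = table.max(axis=1)
    return table.nlargest(k, "max_share").drop(columns="max_share")

# Made-up values, shaped like the dictionaries printed above
actual_demo = {"Sports": 1.0, "People": 0.0, "Culture": 0.0, "History": 0.0}
predicted_demo = {"Sports": 0.8, "People": 0.15, "Culture": 0.05, "History": 0.0}

result = top_topics(actual_demo, predicted_demo, k=2)
print(result)
```

Inside the loop above, it could be called as `top_topics(actual_topics, predicted_topics)`.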