AI - TP1_2

Bastien SAUVAT and Bastien FAISANT

# Exercise 3: Text classification on the Ohsumed dataset

*Objective: The goal of this exercise is to build a text classifier using deep neural networks. Your task
is to construct a classifier using the available training set and evaluate it using the test set. The classifier
should predict the category of each article.*

In [1]:
import os
from collections import defaultdict
from tensorflow.keras.preprocessing.text import text_to_word_sequence  # Tokenizer is imported once, below
import pandas as pd
import numpy as np
import sklearn
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from functools import reduce
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, Dropout, GlobalAveragePooling1D, Bidirectional, LSTM, Dense, MaxPooling1D, Conv1D # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
from tensorflow.keras import regularizers
import re
import string
import matplotlib.pyplot as plt

## Data parsing

We start by parsing the Ohsumed dataset in order to create two dataframes (training and test) with one column for the texts and another for the associated classes.

In [2]:
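def get_info(path: str):
    """Read every file under `path`; return texts grouped by category and sorted word counts."""
    # Skip the first os.walk entry (the root folder); each remaining entry is one category folder
    files = []
    for folder_name, _, file_names in list(os.walk(path))[1:]:
        category = os.path.basename(folder_name)   # e.g. "C01", whatever the path separator
        for file_name in file_names:
            files.append((category, os.path.join(folder_name, file_name)))

    word_counts = defaultdict(int)
    texts = defaultdict(list)
    for (category, file_path) in files:
        with open(file_path, 'r') as infile:
            text = infile.read()
        texts[category].append(text)
        for word in text_to_word_sequence(text):
            word_counts[word] += 1
    # Sort (word, count) pairs from most to least frequent
    words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
    return (texts, words)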

In [3]:
training_texts, training_words = get_info("./data/ohsumed-first-20000-docs/training/")
test_texts, test_words = get_info("./data/ohsumed-first-20000-docs/test/")

In [4]:
def get_df(dataset: dict[str, list[str]]):
    # Flatten {class: [texts]} into one row per (class, text) pair
    classes = []
    texts = []
    for class_name, class_texts in dataset.items():
        for text in class_texts:
            texts.append(text)
            classes.append(class_name)

    df = pd.DataFrame({'Classes': classes, 'Texts': texts})
    return df


In [5]:
train_set = get_df(training_texts)
test_set = get_df(test_texts)

## Data exploration

In [6]:
train_set.head()

| | Classes | Texts |
|---|---|---|
| 0 | C01 | Augmentation mentoplasty using Mersilene mesh.... |
| 1 | C01 | Multiple intracranial mucoceles associated wit... |
| 2 | C01 | Replacement of an aortic valve cusp after neon... |
| 3 | C01 | The value of indium 111 leukocyte scanning in ... |
| 4 | C01 | Febrile infants less than eight weeks old. Pre... |


In [7]:
train_set.info()
train_set.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10433 entries, 0 to 10432
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Classes  10433 non-null  object
 1   Texts    10433 non-null  object
dtypes: object(2)
memory usage: 163.1+ KB


| | Classes | Texts |
|---|---|---|
| count | 10433 | 10433 |
| unique | 23 | 6286 |
| top | C23 | Magnetic resonance imaging of radiation optic ... |
| freq | 1799 | 6 |


In [8]:
test_set.info()
test_set.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12733 entries, 0 to 12732
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Classes  12733 non-null  object
 1   Texts    12733 non-null  object
dtypes: object(2)
memory usage: 199.1+ KB


| | Classes | Texts |
|---|---|---|
| count | 12733 | 12733 |
| unique | 23 | 7643 |
| top | C23 | The butterfly rash and the malar flush. What d... |
| freq | 2153 | 7 |


In [9]:
train_examples_per_category = train_set['Classes'].value_counts()
print("Training Set - Examples per Category:")
print(train_examples_per_category)

Training Set - Examples per Category:
Classes
C23    1799
C14    1249
C04    1163
C10     621
C06     588
C21     546
C20     525
C12     491
C08     473
C01     423
C18     388
C17     295
C05     283
C13     281
C15     215
C16     200
C19     191
C11     162
C02     158
C09     125
C07     100
C22      92
C03      65
Name: count, dtype: int64


In [10]:
test_examples_per_category = test_set['Classes'].value_counts()
print("\nTest Set - Examples per Category:")
print(test_examples_per_category)


Test Set - Examples per Category:
Classes
C23    2153
C04    1467
C14    1301
C10     941
C21     717
C20     695
C06     632
C08     600
C12     548
C01     506
C05     429
C18     400
C13     386
C17     348
C15     320
C02     233
C16     228
C11     202
C19     191
C07     146
C09     129
C22      91
C03      70
Name: count, dtype: int64


In [11]:
# Calculate word frequency
word_frequency = {}
for word, freq in training_words:
    word_frequency[word] = freq

# Most common words
most_common_words = training_words[:10]
print("\nMost Common Words:")
print(most_common_words)

# Least common words
least_common_words = training_words[-10:]
print("\nLeast Common Words:")
print(least_common_words)


Most Common Words:
[('the', 85034), ('of', 84510), ('and', 57271), ('in', 55122), ('to', 30870), ('with', 30625), ('a', 30482), ('patients', 22491), ('was', 19231), ('were', 16884)]

Least Common Words:
[('dma', 1), ('suberimidate', 1), ('dms', 1), ('dithiobis', 1), ('sulfosuccinimidylpropionate', 1), ('dtssp', 1), ('opossums', 1), ('131iodine', 1), ('intraperitonealization', 1), ('ball', 1)]


In [13]:
# Calculate word frequency
word_frequency = {}
for word, freq in test_words:
    word_frequency[word] = freq

# Most common words
most_common_words = test_words[:10]
print("\nMost Common Words:")
print(most_common_words)

# Least common words
least_common_words = test_words[-10:]
print("\nLeast Common Words:")
print(least_common_words)


Most Common Words:
[('of', 106687), ('the', 105229), ('and', 71681), ('in', 69009), ('with', 38515), ('to', 37885), ('a', 37853), ('patients', 27426), ('was', 23847), ('were', 21782)]

Least Common Words:
[('approximates', 1), ('intratumorally', 1), ('perilesionally', 1), ('karyotypically', 1), ('foster', 1), ('greeted', 1), ('tempered', 1), ('earn', 1), ('nonmitochondrial', 1), ('saponin', 1)]


In the training set, there are 10433 examples distributed across 23 classes. Only 6286 of the texts are unique, so many texts appear more than once (the most repeated one appears 6 times). The classes are strongly imbalanced: the most represented class is `C23` with 1799 examples, while the least represented, `C03`, has only 65. The most frequent word is `the` with 85034 occurrences. At the other end, many words, such as `suberimidate`, occur only once.

In the test set, there are 12733 examples distributed across the same 23 classes, with 7643 unique texts (the most repeated one appears 7 times). As in the training set, the most represented class is `C23` with 2153 examples, and the least represented is `C03` with 70. The most frequent word is `of` with 106687 occurrences. Here too, many words, such as `approximates`, occur only once.
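The gap between the number of rows and the number of unique texts can be inspected directly. The sketch below is a minimal illustration on made-up rows (the helper name `labels_per_text` and the toy abstracts are assumptions, not part of the notebook). It maps each distinct text to the set of classes it is filed under, which shows whether repeated texts are plain duplicates or texts carrying several labels.

```python
from collections import defaultdict


def labels_per_text(rows):
    """Map each distinct text to the set of classes it appears under."""
    seen = defaultdict(set)
    for label, text in rows:
        seen[text].add(label)
    return dict(seen)


# Toy (class, text) rows standing in for train_set; not real Ohsumed data
rows = [
    ("C01", "abstract a"),
    ("C04", "abstract a"),
    ("C23", "abstract b"),
    ("C23", "abstract c"),
    ("C14", "abstract c"),
]

mapping = labels_per_text(rows)
# Keep only the texts that are filed under more than one class
multi_label = {text: sorted(classes) for text, classes in mapping.items() if len(classes) > 1}

print("rows:", len(rows), "unique texts:", len(mapping))
print("texts with several classes:", multi_label)
```

On the real data, the same function could be fed with `zip(train_set['Classes'], train_set['Texts'])`. If most repeated texts turn out to carry several labels, a single-label classifier cannot predict all of their rows correctly.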

## Pre-processing

Pre-processing turns the raw articles into fixed-length integer sequences in three steps: the class labels are converted to integers, the text is cleaned, and the cleaned tokens are encoded and padded.

In [14]:
english_stops = set(stopwords.words('english'))

In [15]:
def convert_classes_to_integers(classes):
    # "C01" -> 1, ..., "C23" -> 23 (labels start at 1, not 0)
    unique_classes = classes.unique()
    class_mapping = {cls: int(cls[1:]) for cls in unique_classes}
    return classes.map(class_mapping)
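Because the labels produced above run from 1 to 23, a softmax layer trained with sparse categorical cross-entropy needs 24 outputs, with index 0 never used. The sketch below shows the mapping on a few class names, together with a zero-based alternative that would allow exactly 23 outputs. The helper `class_to_int` and the sample labels are illustrative assumptions.

```python
def class_to_int(name: str) -> int:
    """Strip the leading 'C' and parse the rest, e.g. 'C01' -> 1."""
    return int(name[1:])


sample_labels = ["C01", "C23", "C10"]

# One-based labels, as produced by the notebook's mapping
one_based = [class_to_int(c) for c in sample_labels]

# Zero-based alternative: shift by one so that labels span 0..22
zero_based = [label - 1 for label in one_based]

# Sparse categorical cross-entropy needs every label to be a valid output index,
# so the output layer must have at least max(label) + 1 units
units_needed_one_based = 23 + 1   # labels 1..23
units_needed_zero_based = 22 + 1  # labels 0..22

print(one_based, zero_based, units_needed_one_based, units_needed_zero_based)
```

Either choice works as long as the number of output units matches the label range. The zero-based variant avoids spending parameters on an output that no example ever uses.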

The text was first cleaned by removing HTML tags and replacing every non-alphabetic character with a space, so that only words remain.<br>
Three refinement steps were then applied. Stopwords, common words like "the" or "and" that carry little discriminatory value, were removed. The remaining words were converted to lower case. Finally, stemming was performed with the Porter stemmer from NLTK, which reduces words to their root form so that variants of the same word share one token.
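The order of the cleaning steps matters. The sketch below is a simplified stand-in for the notebook's pipeline: it uses a tiny hard-coded stop list instead of NLTK's and leaves out stemming (both are assumptions made to keep the example self-contained). It compares filtering stop words before and after lower-casing on one sample sentence.

```python
import re

# Tiny illustrative subset; the notebook uses NLTK's English stop word list
STOP_WORDS = {"the", "of", "and", "in", "with", "a"}


def clean(text, lowercase_first):
    """Strip HTML tags and non-letters, then drop stop words and lower-case."""
    text = re.sub(r"<.*?>", "", text)        # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)   # replace non-letters with spaces
    words = text.split()
    if lowercase_first:
        words = [w.lower() for w in words]
        return [w for w in words if w not in STOP_WORDS]
    # Filter first, lower-case second: capitalized stop words are not matched
    words = [w for w in words if w not in STOP_WORDS]
    return [w.lower() for w in words]


sample = "<b>The</b> value of indium 111 leukocyte scanning"

print(clean(sample, lowercase_first=False))
print(clean(sample, lowercase_first=True))
```

With the filter applied first, the sentence-initial "The" is not recognized as a stop word and ends up in the vocabulary as "the". Lower-casing before filtering removes it.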

In [16]:
def load_dataset(texts: dict[str, list[str]]):
    stemmer = PorterStemmer()
    df = get_df(texts)

    x_data = df['Texts']
    y_data = df['Classes']

    # PRE-PROCESS ARTICLES
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
    # Note: stop words are filtered before lower-casing, so capitalized stop words such as "The" are kept
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case
    x_data = x_data.apply(lambda review: [stemmer.stem(w) for w in review]) # perform stemming

    # Replace class name by their number
    y_data = convert_classes_to_integers(y_data)

    return x_data, y_data

In [17]:
x_train, y_train = load_dataset(training_texts)
x_test, y_test = load_dataset(test_texts)

x_train.describe()

count                                                 10433
unique                                                 6286
top       [magnet, reson, imag, radiat, optic, neuropath...
freq                                                      6
Name: Texts, dtype: object

In [18]:
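def get_max_length():
    # Despite its name, this returns the mean (not the maximum) article length
    # of the global x_train, rounded up to an integer
    review_length = [len(review) for review in x_train]
    return int(np.ceil(np.mean(review_length)))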

Next, the cleaned token lists were encoded as integers: a Keras `Tokenizer` fitted on the training set assigns an index to each word, and each article becomes a sequence of these indices. This encoding is required because the neural network only accepts numerical input.<br>
Because the network expects inputs of identical length, the sequences were padded with zeros or truncated at the end using the `pad_sequences` function. The fixed length is 92 tokens: the mean training article length after cleaning (112, rounded up) minus 20.
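To make the encoding and padding steps concrete, here is a small pure-Python sketch that mimics what they do. It is an illustrative re-implementation under simplifying assumptions, not the Keras code: the helpers `build_index`, `encode` and `pad_post` are hypothetical names, indices are assigned by decreasing frequency starting at 1, unknown words are dropped, and padding and truncation both happen at the end of the sequence.

```python
from collections import Counter


def build_index(docs):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(word for doc in docs for word in doc)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(doc, index):
    """Replace words by their index; words missing from the index are dropped."""
    return [index[word] for word in doc if word in index]


def pad_post(seq, maxlen, value=0):
    """Cut the sequence at maxlen, then fill the end with `value` up to maxlen."""
    trimmed = list(seq[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))


train_docs = [["magnet", "reson", "imag"], ["imag", "radiat"]]
index = build_index(train_docs)

short = pad_post(encode(["imag", "unseen", "radiat"], index), maxlen=4)
long = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)

print(index)
print(short, long)
```

Index 0 is reserved for padding, which is why the vocabulary size passed to the embedding layer is the number of indexed words plus one.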

In [19]:
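# ENCODE ARTICLES
token = Tokenizer(lower=False)   # texts are already lower-cased
token.fit_on_texts(x_train)      # the word index is built from the training set only
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)   # words unseen in training are dropped

# Sequence length: mean training article length minus 20
max_length = get_max_length() - 20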

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[ 1714 10950     6 ...  4898   194   881]
 [  269  1088  4900 ...  6597    22   168]
 [  477   276   332 ...     0     0     0]
 ...
 [ 3985  3986  1553 ...  3985  3986   188]
 [   17  4568   702 ...   111     8     9]
 [  116   489   477 ...   899  7877  2850]] 

Encoded X Test
 [[3051 1555 1052 ... 1407  314  594]
 [ 764 8901  878 ...  161   26  537]
 [ 738   68  601 ...   18 2131  774]
 ...
 [ 537 1156 3812 ...  785   34  940]
 [ 182  208  334 ... 1481 3547  974]
 [ 472 2612 1706 ...  151   90   98]] 

Maximum review length:  92


Model improvements:
- LSTM model with a Dense layer of 24 units
- Increased batch size
- Use of the LSTM model in both directions (Bidirectional)

## Create and train the model

In the first model, I used a Long Short-Term Memory (LSTM) network, a type of recurrent neural network (RNN) designed to process sequential data such as text. This initial model consists of an Embedding layer, an LSTM layer, and a Dense output layer, for a total of 664632 trainable parameters.<br>
After 5 epochs of training, the accuracy on the training data reached 29.17%.

In [20]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64
model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length=max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(24, activation='softmax'))  # labels run from 1 to 23, so output index 0 is never used
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 92, 32)            638240    
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dense (Dense)               (None, 24)                1560      
                                                                 
Total params: 664,632
Trainable params: 664,632
Non-trainable params: 0
_________________________________________________________________
None
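The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The vocabulary size is not printed in the notebook; it is inferred here from the embedding parameter count (638240 / 32), which is an assumption of this example.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of size embed_dim per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


EMBED_DIM = 32
LSTM_OUT = 64
VOCAB_SIZE = 638240 // EMBED_DIM  # inferred from the summary, not printed by the notebook

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBED_DIM),
    "lstm": lstm_params(EMBED_DIM, LSTM_OUT),
    "dense": dense_params(LSTM_OUT, 24),
}
total = sum(counts.values())

print(VOCAB_SIZE, counts, total)
```

The embedding layer accounts for about 96% of the parameters, so the vocabulary size dominates the size of this model.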


In [21]:
checkpoint = ModelCheckpoint(
    'models/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [22]:
model.fit(x_train, y_train, batch_size = 128, epochs = 5, callbacks=[checkpoint])

Epoch 1/5
Epoch 1: accuracy improved from -inf to 0.16917, saving model to models\LSTM.h5
Epoch 2/5
Epoch 2: accuracy improved from 0.16917 to 0.17234, saving model to models\LSTM.h5
Epoch 3/5
Epoch 3: accuracy improved from 0.17234 to 0.20991, saving model to models\LSTM.h5
Epoch 4/5
Epoch 4: accuracy improved from 0.20991 to 0.25793, saving model to models\LSTM.h5
Epoch 5/5
Epoch 5: accuracy improved from 0.25793 to 0.29167, saving model to models\LSTM.h5


<keras.callbacks.History at 0x22882c07a60>

For the second model, I introduced several additional layers to enhance its complexity and performance.

- Embedding Layer: This layer maps each word index to a trainable dense vector. During training, words that play a similar role for the classification task tend to receive similar vectors. An embedding dimension of 128 is used here, up from 32 in the first model.

- Conv1D Layer: This 1-dimensional convolutional layer slides `128` filters with a kernel size of `5` over the embedded word vectors and applies the `relu` activation function. It captures local patterns within short windows of consecutive words.

- MaxPooling1D Layer: This layer down-samples the convolution output by keeping the maximum of every 4 consecutive positions, which retains the strongest feature responses.

- Bidirectional LSTM Layer: This layer runs two LSTMs of 128 units each, one reading the sequence forward and one backward, and concatenates their outputs. A dropout of 0.2 is applied to the LSTM inputs.

- Dense Layer: This output layer takes the result of the previous layers and applies a `softmax` activation function with `24 units`.

The optimizer used is `Adam`. Additionally, the loss function employed is `Sparse Categorical Crossentropy` as it's suitable for multi-class classification tasks with integer labels.
<br>

This extended architecture has a total of 2904344 trainable parameters.<br>
After 5 epochs, the training accuracy of this larger model reached 48.56%, compared to 29.17% for the initial LSTM model. Since several things changed at once (embedding size, convolution, pooling, bidirectionality, dropout), the gain cannot be attributed to a single layer.

In [23]:
# ARCHITECTURE
EMBED_DIM = 128
LSTM_OUT = 128
model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length=max_length))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(Bidirectional(LSTM(LSTM_OUT, dropout=0.2)))
model.add(Dense(24, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 92, 128)           2552960   
                                                                 
 conv1d (Conv1D)             (None, 88, 128)           82048     
                                                                 
 max_pooling1d (MaxPooling1D  (None, 22, 128)          0         
 )                                                               
                                                                 
 bidirectional (Bidirectiona  (None, 256)              263168    
 l)                                                              
                                                                 
 dense_1 (Dense)             (None, 24)                6168      
                                                                 
Total params: 2,904,344
Trainable params: 2,904,344
Non-trainable params: 0
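The output shapes and parameter counts of this second summary follow from the layer settings. The sketch below recomputes them with the standard formulas, assuming the default `valid` padding for the convolution and a pooling stride equal to the pool size. The vocabulary size is again inferred from the embedding parameter count rather than read from the notebook.

```python
def conv1d_output_length(length, kernel_size):
    # 'valid' padding: the kernel must fit entirely inside the sequence
    return length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights plus one bias
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


SEQ_LEN, EMBED_DIM, FILTERS, KERNEL, POOL, LSTM_OUT = 92, 128, 128, 5, 4, 128
VOCAB_SIZE = 2552960 // EMBED_DIM  # inferred from the summary

conv_len = conv1d_output_length(SEQ_LEN, KERNEL)   # length after Conv1D
pool_len = conv_len // POOL                        # length after MaxPooling1D

counts = {
    "embedding": VOCAB_SIZE * EMBED_DIM,
    "conv1d": conv1d_params(KERNEL, EMBED_DIM, FILTERS),
    "bidirectional": 2 * lstm_params(FILTERS, LSTM_OUT),  # forward + backward LSTM
    "dense": (2 * LSTM_OUT) * 24 + 24,                    # input is the concatenated 256 values
}
total = sum(counts.values())

print(conv_len, pool_len, counts, total)
```

As in the first model, most parameters sit in the embedding layer. The pooling step shortens the sequence fed to the LSTMs from 88 to 22 positions.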

In [24]:
checkpoint = ModelCheckpoint(
    'models/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [25]:
model.fit(x_train, y_train, batch_size = 128, epochs = 5, callbacks=[checkpoint])

Epoch 1/5


Epoch 1: accuracy improved from -inf to 0.19189, saving model to models\LSTM.h5
Epoch 2/5
Epoch 2: accuracy improved from 0.19189 to 0.30931, saving model to models\LSTM.h5
Epoch 3/5
Epoch 3: accuracy improved from 0.30931 to 0.38675, saving model to models\LSTM.h5
Epoch 4/5
Epoch 4: accuracy improved from 0.38675 to 0.44906, saving model to models\LSTM.h5
Epoch 5/5
Epoch 5: accuracy improved from 0.44906 to 0.48557, saving model to models\LSTM.h5


<keras.callbacks.History at 0x228f365bdc0>

## Testing

In [27]:
loss, accuracy = model.evaluate(x_test, y_test)
print('Loss: {}'.format(loss))
print('Accuracy: {}'.format(accuracy))

Loss: 1.9650187492370605
Accuracy: 0.39511504769325256


I evaluated the second model on the test set to measure its performance on unseen articles.<br>
It reached a test accuracy of approximately 39.51% with a loss of approximately 1.97.<br>
This is about 9 points below the 48.56% training accuracy of the final epoch, so the model overfits to some degree. Training accuracy was also still rising after 5 epochs, which suggests that training had not yet converged. <br>
The model has clearly learned patterns that carry over to unseen data, but the accuracy remains modest and leaves room for improvement, for example by training longer while monitoring a validation split.
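To put the test accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline from the test set class counts printed earlier in this notebook. The baseline itself is an addition for comparison and was not part of the original experiments.

```python
# Test set examples per category, copied from the value_counts output above
test_counts = {
    "C23": 2153, "C04": 1467, "C14": 1301, "C10": 941, "C21": 717, "C20": 695,
    "C06": 632, "C08": 600, "C12": 548, "C01": 506, "C05": 429, "C18": 400,
    "C13": 386, "C17": 348, "C15": 320, "C02": 233, "C16": 228, "C11": 202,
    "C19": 191, "C07": 146, "C09": 129, "C22": 91, "C03": 70,
}

total = sum(test_counts.values())

# A classifier that always answers the most frequent class
majority_class, majority_count = max(test_counts.items(), key=lambda kv: kv[1])
baseline_accuracy = majority_count / total

model_accuracy = 0.39511504769325256  # test accuracy reported by model.evaluate

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
```

The model is well above this baseline. Because the classes are imbalanced, per-class metrics would still be worth reporting alongside overall accuracy.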
