<br/>

<br/>

<center>
       <h1>BERT for multi-class text classification</h1>
       <h2>Wael Ben Hadj Yahia, Dorian Hervé</h2>
</center>
  

Sources:
<ul>
<li>https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794</li>
<li>https://github.com/mdipietro09/DataScience_ArtificialIntelligence_Utils/blob/master/natural_language_processing/example_text_classification.ipynb</li>
</ul>


<br/>

We implemented BERT (in practice its lighter variant, DistilBERT) for multi-class text classification. The data comes from an AI challenge, in which we are currently participating, where the goal is to assign the correct job category to a given job description. There are 28 classes to choose from.

### Data loading

In [5]:
import pandas as pd
import pickle
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

train_df = pd.read_json("data/defi-ia-insa-toulouse/train.json")
test_df = pd.read_json("data/defi-ia-insa-toulouse/test.json")
train_label = pd.read_csv("data/defi-ia-insa-toulouse/train_label.csv")

In [6]:
train_df.head()

Unnamed: 0,Id,description,gender
0,0,She is also a Ronald D. Asmus Policy Entrepre...,F
1,1,He is a member of the AICPA and WICPA. Brent ...,M
2,2,Dr. Aster has held teaching and research posi...,M
4,3,He runs a boutique design studio attending cl...,M
5,4,"He focuses on cloud security, identity and ac...",M


In [7]:
test_df.head()

Unnamed: 0,Id,description,gender
3,0,She currently works on CNN’s newest primetime...,F
6,1,Lavalette’s photographs have been shown widel...,M
11,2,Along with his academic and professional deve...,M
17,3,She obtained her Ph.D. in Islamic Studies at ...,F
18,4,She studies issues of women and Islam and has...,F


In [8]:
train_label

Unnamed: 0,Id,Category
0,0,19
1,1,9
2,2,19
3,3,24
4,4,24
...,...,...
217192,217192,19
217193,217193,22
217194,217194,19
217195,217195,19


### Data processing

In [3]:
test_df['description'] = test_df['description'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
test_df.head()

Unnamed: 0,Id,description,gender
3,0,She currently works on CNNs newest primetime ...,F
6,1,Lavalettes photographs have been shown widely...,M
11,2,Along with his academic and professional deve...,M
17,3,She obtained her Ph.D. in Islamic Studies at ...,F
18,4,She studies issues of women and Islam and has...,F


In [4]:
train_df['description'] = train_df['description'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
train_df.head()

Unnamed: 0,Id,description,gender
0,0,She is also a Ronald D. Asmus Policy Entrepre...,F
1,1,He is a member of the AICPA and WICPA. Brent ...,M
2,2,Dr. Aster has held teaching and research posi...,M
4,3,He runs a boutique design studio attending cl...,M
5,4,"He focuses on cloud security, identity and ac...",M
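
To see what the normalization step above does to a single string, here is a minimal standard-library sketch. The helper name `to_ascii` and the sample string are ours, chosen for illustration; the logic mirrors the pandas chain (NFKD decomposition, then dropping every non-ASCII character).

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Decompose characters (NFKD), then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

# Accents are stripped; the curly apostrophe has no ASCII decomposition and is dropped.
print(to_ascii("CNN’s café"))
```

This is why "CNN’s" becomes "CNNs" in the outputs above: characters without an ASCII equivalent are removed rather than replaced.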


### Import of the pre-trained model DistilBert

In [5]:
from transformers import AutoTokenizer, TFDistilBertModel

## DistilBERT tokenizer
## (for full BERT: BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True))
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

## DistilBERT model
## (for full BERT: TFBertModel.from_pretrained('bert-base-uncased'))
nlp = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

## to inspect the hidden layer with embeddings for a sample text:
#txt = "bank river"
#input_ids = np.array(tokenizer.encode(txt))[None,:]
#embedding = nlp(input_ids)
#embedding[0][0]

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'vocab_projector', 'vocab_transform', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


### Transformation of training data

In [6]:
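import re
import numpy as np
# remove_stopwords is not defined anywhere in this notebook; gensim's helper is ASSUMED here
from gensim.parsing.preprocessing import remove_stopwords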

#Generating a BERT formatted X_TRAIN
corpus = [str(remove_stopwords(txt).strip()) for txt in train_df.description.values]
maxlen = 500

## add special tokens
# np.int was removed in NumPy 1.24; the built-in int gives the same value (177)
maxqnans = int((maxlen-145)/2)
#to use cased, we should remove the .lower() and load the cased model
corpus_tokenized = ["[CLS] " + " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '', str(txt).lower().strip()))[:maxqnans])+ " [SEP]" for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(txt.split(" "))) for txt in corpus_tokenized]
    
## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]
    
## generate idx
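# map tokens to ids directly, instead of letting encode() add special tokens and stripping them again
idx = [tokenizer.convert_tokens_to_ids(seq.split(" ")) for seq in txt2seq]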
    
## generate segments
segments = [] 
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
             i += 1
    segments.append(temp)
## feature matrix
X_train_BERT = [np.asarray(idx, dtype='int32'), np.asarray(masks, dtype='int32'), np.asarray(segments, dtype='int32')]

### Observation from the feature matrix

In [7]:
i = 50
print("txt: ", corpus[i])
print("tokenized size :", len([tokenizer.convert_ids_to_tokens(idx) for idx in X_train_BERT[0][i].tolist()]))
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train_BERT[0][i].tolist()])
print("idx size: ", len(X_train_BERT[0][i]))
print("idx: ", X_train_BERT[0][i])
print("mask size: ", len(X_train_BERT[1][i]))
print("mask: ", X_train_BERT[1][i])
print("segment size: ", len(X_train_BERT[2][i]))
print("segment: ", X_train_BERT[2][i])

txt:  Specializing commercial, advertising editorial photography, Jared shoots photography video large list national regional clients. Emphasizing outdoor & sports, people & lifestyle, architecture, Jared visionary creative photography Oregon region. He available world wide assignments stock sales.
tokenized size : 500
tokenized: ['[CLS]', 'specializing', 'commercial', 'advertising', 'editorial', 'photography', 'jared', 'shoots', 'photography', 'video', 'large', 'list', 'national', 'regional', 'clients', 'emphasizing', 'outdoor', 'sports', 'people', 'lifestyle', 'architecture', 'jared', 'visionary', 'creative', 'photography', 'oregon', 'region', 'he', 'available', 'world', 'wide', 'assignments', 'stock', 'sales', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'
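
To make the padding and mask layout shown above concrete, here is a toy sketch on a very short sequence. The function `pad_and_mask` is a hypothetical helper, not part of the notebook, and it truncates at `maxlen - 2` for simplicity, whereas the notebook truncates each description at 177 tokens.

```python
def pad_and_mask(tokens, maxlen, pad_token="[PAD]"):
    """Return (padded_tokens, attention_mask), both of length maxlen."""
    # Add the special tokens, leaving room for them when truncating.
    tokens = ["[CLS]"] + list(tokens)[: maxlen - 2] + ["[SEP]"]
    n_real = len(tokens)
    padded = tokens + [pad_token] * (maxlen - n_real)
    # 1 marks a real token, 0 marks padding.
    mask = [1] * n_real + [0] * (maxlen - n_real)
    return padded, mask

padded, mask = pad_and_mask(["he", "runs", "a", "studio"], maxlen=8)
print(padded)
print(mask)
```

The mask tells the model which positions carry real tokens, so attention can ignore the `[PAD]` positions.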

### Transformation of test data

In [8]:
import re

#Generating a BERT formatted X_TEST
corpus = [str(remove_stopwords(txt).strip()) for txt in test_df.description.values]  # to modify
maxlen = 500

## add special tokens
# np.int was removed in NumPy 1.24; the built-in int gives the same value (177)
maxqnans = int((maxlen-145)/2)
#to use cased, we should remove the .lower() and load the cased model
corpus_tokenized = ["[CLS] " + " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '', str(txt).lower().strip()))[:maxqnans])+ " [SEP]" for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(txt.split(" "))) for txt in corpus_tokenized]
    
## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]
    
## generate idx
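# map tokens to ids directly, instead of letting encode() add special tokens and stripping them again
idx = [tokenizer.convert_tokens_to_ids(seq.split(" ")) for seq in txt2seq]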
    
## generate segments
segments = [] 
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
             i += 1
    segments.append(temp)
## feature matrix
X_test_BERT = [np.asarray(idx, dtype='int32'), np.asarray(masks, dtype='int32'), np.asarray(segments, dtype='int32')]

### Observation from the feature matrix

In [9]:
i = 10
print("txt :", corpus[i])
print("tokenized: size", len([tokenizer.convert_ids_to_tokens(idx) for idx in X_test_BERT[0][i].tolist()]))
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_test_BERT[0][i].tolist()])
print("idx size: ", len(X_test_BERT[0][i]))
print("idx: ", X_test_BERT[0][i])
print("mask size: ", len(X_test_BERT[1][i]))
print("mask: ", X_test_BERT[1][i])
print("segment size: ", len(X_test_BERT[2][i]))
print("segment: ", X_test_BERT[2][i])

txt : Dr. Knight completed undergraduate training University Wisconsin-Madison medical training Medical College Wisconsin. She completed residency training combined Internal Medicine Psychiatry program Rush University Chicago, double board certified specialties. She completed post-doctoral research fellowship psychoneuroimmunology University Rochester Medical Center.
tokenized: size 500
tokenized: ['[CLS]', 'dr', 'knight', 'completed', 'undergraduate', 'training', 'university', 'wisconsin', '##mad', '##ison', 'medical', 'training', 'medical', 'college', 'wisconsin', 'she', 'completed', 'residency', 'training', 'combined', 'internal', 'medicine', 'psychiatry', 'program', 'rush', 'university', 'chicago', 'double', 'board', 'certified', 'special', '##ties', 'she', 'completed', 'postdoctoral', 'research', 'fellowship', 'psycho', '##ne', '##uro', '##im', '##mun', '##ology', 'university', 'rochester', 'medical', 'center', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'

### Construction of the deep learning model with transfer learning from the pre-trained DistilBERT

In [10]:
from keras import layers, models
from transformers import DistilBertConfig
## inputs (DistilBERT takes token ids and an attention mask, no segment ids)
idx = layers.Input((maxlen,), dtype="int32", name="input_idx")
masks = layers.Input((maxlen,), dtype="int32", name="input_masks")
## pre-trained DistilBERT with config
config = DistilBertConfig(dropout=0.1, 
           attention_dropout=0.1)
config.output_hidden_states = False
nlp = TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)
bert_out = nlp(idx, attention_mask=masks)[0]
## classification head on top of DistilBERT
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
y_out = layers.Dense(len(np.unique(train_label.Category.values)), 
                     activation='softmax')(x)
## compile
model = models.Model([idx, masks], y_out)
## freeze the two input layers and DistilBERT: only the dense head is trained
for layer in model.layers[:3]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy', 
              optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_idx (InputLayer)          [(None, 500)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
tf_distil_bert_model_1 (TFDisti ((None, 500, 768),)  66362880    input_idx[0][0]                  
                                                                 input_masks[0][0]                
__________________________________________________________________________________________________
global_average_pooling1d (Globa (None, 768)          0           tf_distil_bert_model_1
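
One detail of the pooling layer is worth illustrating. As far as we can tell, `GlobalAveragePooling1D` is called here without a mask, so it averages over all 500 positions, padding included. The NumPy sketch below shows a mask-weighted mean as a possible alternative. It is not what this notebook does; `masked_mean` and the toy arrays are hypothetical.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average hidden states over real tokens only.

    hidden: (batch, seq_len, dim); mask: (batch, seq_len), 1 for real tokens.
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1.0)       # avoid division by zero
    return summed / counts

# Toy example: two real positions followed by one padded position.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean(hidden, mask))   # averages the first two positions only
print(hidden.mean(axis=1))         # plain average, padded position included
```

With short descriptions padded to 500 positions, the plain average can be dominated by the padded positions, which may be one reason to prefer the masked variant.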

### Model train

In [11]:
## encode y: map each category to a contiguous integer index
dic_y_mapping = {n:label for n,label in enumerate(np.unique(train_label.Category.values))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in train_label.Category.values])

## train (the model takes two inputs, ids and masks; DistilBERT does not use segment ids)
model.fit(x=X_train_BERT[:2], y=y_train, batch_size=1, epochs=1, shuffle=True, verbose=1, validation_split=0.15)

# save the model
model.save("my_model")

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: my_model/assets
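
The label encoding used before training can also be written with `np.unique(..., return_inverse=True)`. The sketch below uses toy labels, not the challenge data. In this dataset the categories already appear to run from 0 to 27, so the mapping is effectively the identity, but the encoding matters whenever labels are not contiguous.

```python
import numpy as np

labels = np.array([19, 9, 19, 24, 24])   # toy labels for illustration
# classes: sorted unique labels; y_encoded: index of each label within classes
classes, y_encoded = np.unique(labels, return_inverse=True)
# Decoding a predicted index back to the original label is a simple lookup.
decoded = classes[y_encoded]
print(classes, y_encoded, decoded)
```

Here `classes` plays the role of `dic_y_mapping` and `y_encoded` the role of `y_train`.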


### Model predictions

In [17]:
## test (ids and masks only, matching the model's two inputs)
predicted_prob = model.predict(X_test_BERT[:2])
predicted = [dic_y_mapping[np.argmax(pred)] for pred in predicted_prob]

In [21]:
len(predicted)

54300

### Formatting predictions into expected shape

In [23]:
test_df.head()

Unnamed: 0,Id,description,gender
3,0,She currently works on CNNs newest primetime ...,F
6,1,Lavalettes photographs have been shown widely...,M
11,2,Along with his academic and professional deve...,M
17,3,She obtained her Ph.D. in Islamic Studies at ...,F
18,4,She studies issues of women and Islam and has...,F


In [25]:
test_df["Category"] = predicted
baseline_file = test_df[["Id","Category"]]
baseline_file.to_csv("baseline.csv", index=False)

In [26]:
baseline_file.head()

Unnamed: 0,Id,Category
3,0,6
6,1,20
11,2,20
17,3,19
18,4,19


## Result

<br/>

The resulting accuracy was a very disappointing 59%. To put this into perspective, a Logistic Regression on a truncated TF-IDF matrix gave us 72% accuracy, while a Linear SVM classifier on the same matrix reached 74%.
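
For reference, a TF-IDF baseline of the kind mentioned above can be sketched with scikit-learn as follows. This is not the exact code behind the 72% and 74% figures: the toy corpus is ours, and capping the vocabulary with `max_features` is only one assumed reading of "truncated".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the job descriptions (not the challenge data).
texts = [
    "she teaches mathematics at the university",
    "he is a professor of physics and teaches students",
    "she photographs weddings and portraits",
    "his photography studio shoots commercial portraits",
]
labels = [0, 0, 1, 1]

# max_features caps the vocabulary size (assumed interpretation of "truncated").
baseline = make_pipeline(TfidfVectorizer(max_features=5000), LinearSVC())
baseline.fit(texts, labels)
predictions = baseline.predict(["a professor who teaches chemistry"])
print(predictions)
```

Swapping `LinearSVC()` for `LogisticRegression()` from `sklearn.linear_model` gives the other baseline.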

We now check the hypothesis that our model simply learned to predict the most frequent class.

In [52]:
from pandas import Series
baseline_file.Category.value_counts()

19    18460
20     5025
26     5006
11     3478
3      3043
6      2553
14     2316
22     2097
8      1325
25     1319
5      1240
16     1237
13      990
15      943
24      867
9       636
18      622
1       575
12      519
4       399
2       363
21      283
27      248
0       246
10      215
17      176
7       119
Name: Category, dtype: int64

In [9]:
train_label.Category.value_counts()

19    70016
26    18820
20    14646
14    12622
6     12295
11    11607
22    10391
3      9145
8      6616
24     5841
16     5450
5      4621
15     4292
18     4124
1      4115
13     4060
25     3395
9      3121
27     2288
12     1639
0      1497
17     1406
23      967
2       944
7       858
10      831
4       807
21      783
Name: Category, dtype: int64


<br/>
Let's compare the proportion of class 19 (by far the most frequent one) and of class 12 (a rare one) in the training labels and in the predictions.

In [115]:
sum(train_label.Category==19)/len(train_label.Category)

0.3223617269115135

In [116]:
sum(baseline_file.Category==19)/len(baseline_file.Category)

0.33996316758747697

In [121]:
sum(train_label.Category==12)/len(train_label.Category)

0.007546144744172341

In [117]:
sum(baseline_file.Category==12)/len(baseline_file.Category)

0.009558011049723756

<br/>
The proportions are very similar in both datasets, so the predictor clearly did not fall back on always predicting the most frequent class. Note, however, that class 23 never appears among the predictions.

In [123]:
baseline_file.Category.value_counts()/len(baseline_file)

19    0.339963
20    0.092541
26    0.092192
11    0.064052
3     0.056041
6     0.047017
14    0.042652
22    0.038619
8     0.024401
25    0.024291
5     0.022836
16    0.022781
13    0.018232
15    0.017366
24    0.015967
9     0.011713
18    0.011455
1     0.010589
12    0.009558
4     0.007348
2     0.006685
21    0.005212
27    0.004567
0     0.004530
10    0.003959
17    0.003241
7     0.002192
Name: Category, dtype: float64

In [124]:
train_label.Category.value_counts()/len(train_label)

19    0.322362
26    0.086649
20    0.067432
14    0.058113
6     0.056608
11    0.053440
22    0.047841
3     0.042105
8     0.030461
24    0.026893
16    0.025092
5     0.021276
15    0.019761
18    0.018987
1     0.018946
13    0.018693
25    0.015631
9     0.014369
27    0.010534
12    0.007546
0     0.006892
17    0.006473
23    0.004452
2     0.004346
7     0.003950
10    0.003826
4     0.003716
21    0.003605
Name: Category, dtype: float64
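
The comparison of class proportions above can be condensed into two numbers: the list of classes that are never predicted, and the total variation distance between the two distributions (0 means identical, 1 means disjoint). The helpers below are a hypothetical sketch on toy labels, not the challenge data.

```python
from collections import Counter

def proportions(labels):
    """Map each label to its share of the list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def compare(train_labels, predicted_labels):
    """Return (classes never predicted, total variation distance)."""
    p_train = proportions(train_labels)
    p_pred = proportions(predicted_labels)
    never_predicted = sorted(set(p_train) - set(p_pred))
    all_labels = set(p_train) | set(p_pred)
    tv = 0.5 * sum(abs(p_train.get(l, 0.0) - p_pred.get(l, 0.0)) for l in all_labels)
    return never_predicted, tv

train = [0, 0, 0, 1, 1, 2]   # toy training labels
pred = [0, 0, 1, 1]          # toy predictions
never_predicted, tv = compare(train, pred)
print(never_predicted, round(tv, 3))
```

Applied to `train_label.Category` and `baseline_file.Category`, the same function would summarize the two tables above in one call.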

<br/>

Unless we implemented BERT in a suboptimal way, it seems ill-suited to this problem. In the future, we will try it again after cleaning the job descriptions (i.e., removing non-alphabetical characters and stopwords, and lemmatizing the text).
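
A minimal sketch of what such a cleaning step could look like is given below. The stopword list is a tiny illustrative set and `clean_text` is a hypothetical helper. Lemmatization is left as a pluggable function, for instance `WordNetLemmatizer().lemmatize` from NLTK, which requires the WordNet corpus to be downloaded.

```python
import re

# Tiny illustrative stopword list (an assumption); NLTK's list could be used instead.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "in", "on", "at",
             "he", "she", "his", "her"}

def clean_text(text, lemmatize=lambda word: word):
    """Lowercase, keep alphabetic characters only, drop stopwords, lemmatize."""
    # Replace every non-alphabetic, non-whitespace character with a space.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(lemmatize(w) for w in words)

# Toy sentence for illustration, not taken from the dataset.
print(clean_text("He runs a boutique design studio, founded in 2015!"))
```

By default no lemmatization is applied, so the sketch runs without any external resource.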