# NLP Classification with Transfer Learning

A simple neural network that predicts six labels for this problem (a comment can carry several of them at once):
* 'toxic'
* 'severe_toxic'
* 'obscene'
* 'threat'
* 'insult'
* 'identity_hate'

Built with Keras, TensorFlow, and MLflow.

In [0]:
%pip install tensorflow mlflow

Python interpreter will be restarted.


In [0]:
import numpy as np
np.random.seed(42)
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import text, sequence
import pandas as pd



In [0]:
dbutils.fs.ls("/dbfs/FileStore/jigsaw_processed")

Out[2]: [FileInfo(path='dbfs:/dbfs/FileStore/jigsaw_processed/_committed_1945499015450261817', name='_committed_1945499015450261817', size=221, modificationTime=1678144024000),
 FileInfo(path='dbfs:/dbfs/FileStore/jigsaw_processed/_committed_5582298936961953304', name='_committed_5582298936961953304', size=219, modificationTime=1678830377000),
 FileInfo(path='dbfs:/dbfs/FileStore/jigsaw_processed/_committed_6231977519080488190', name='_committed_6231977519080488190', size=220, modificationTime=1678312030000),
 FileInfo(path='dbfs:/dbfs/FileStore/jigsaw_processed/_committed_6967408854108031852', name='_committed_6967408854108031852', size=231, modificationTime=1678143793000),
 FileInfo(path='dbfs:/dbfs/FileStore/jigsaw_processed/_committed_8969578946190677926', name='_committed_8969578946190677926', size=221, modificationTime=1678144544000),
 FileInfo(path='dbfs:/dbfs/FileStore/jigsaw_processed/_committed_vacuum1390852046845635690', name='_committed_vacuum1390852046845635690', size=195,

In [0]:
# Read data
N_ROWS = 10000

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

Load the previously processed dataset.

In [0]:
dataset = spark.read.parquet("/dbfs/FileStore/jigsaw_processed").select(["comment_text", "comment_text_stem"]+LABELS).limit(N_ROWS).toPandas()

for label in LABELS:
  dataset[label] = dataset[label].astype(int)

In [0]:
dataset.head()

,comment_text,comment_text_stem,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,Explanation\nWhy the edits made under my usern...,"[explan, edit, made, usernam, hardcor, metalli...",0,0,0,0,0,0
1,D'aww! He matches this background colour I'm s...,"[aww, match, background, colour, seem, stuck, ...",0,0,0,0,0,0
2,"Hey man, I'm really not trying to edit war. It...","[hey, man, realli, tri, edit, war, guy, consta...",0,0,0,0,0,0
3,"""\nMore\nI can't make any real suggestions on ...","[make, real, suggest, improv, wonder, section,...",0,0,0,0,0,0
4,"You, sir, are my hero. Any chance you remember...","[sir, hero, chanc, rememb, page]",0,0,0,0,0,0


In [0]:
dataset['comment_text_stem'].values.shape

Out[6]: (10000,)

Load the embeddings produced in the "JigSaw_NLPword2vec" file.

Drop duplicate words and rebuild a contiguous, 1-based word index, which leaves index 0 free for padding and unknown words.

In [0]:
dbutils.fs.ls("/dbfs/FileStore/embeddings_100_3_10_1_1")


Out[7]: [FileInfo(path='dbfs:/dbfs/FileStore/embeddings_100_3_10_1_1/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1678142900000),
 FileInfo(path='dbfs:/dbfs/FileStore/embeddings_100_3_10_1_1/_committed_1742439629418457023', name='_committed_1742439629418457023', size=123, modificationTime=1678142899000),
 FileInfo(path='dbfs:/dbfs/FileStore/embeddings_100_3_10_1_1/_started_1742439629418457023', name='_started_1742439629418457023', size=0, modificationTime=1678142894000),
 FileInfo(path='dbfs:/dbfs/FileStore/embeddings_100_3_10_1_1/part-00000-tid-1742439629418457023-e9100de5-e74c-4376-b575-7367350ec203-41-1-c000.snappy.parquet', name='part-00000-tid-1742439629418457023-e9100de5-e74c-4376-b575-7367350ec203-41-1-c000.snappy.parquet', size=32583846, modificationTime=1678142899000)]

In [0]:
embeddings = spark.read.format("parquet").load("/dbfs/FileStore/embeddings_100_3_10_1_1")
embeddings=embeddings.toPandas()
embeddings["comment_text_stem"] = embeddings["comment_text_stem"].map(lambda x:x[0])
embeddings.drop_duplicates(subset='comment_text_stem',inplace=True)
embeddings['word_index_new']=range(1,len(embeddings)+1)
embeddings.set_index("comment_text_stem", inplace=True)
embeddings.head()

comment_text_stem,words,words_array,features,word_index,word_index_new
0000z,0000z,[0000z],"[-0.09071973711252213, -0.24721206724643707, -...",1,1
000ft,000ft,[000ft],"[-0.04879645258188248, -0.07324051111936569, -...",2,2
000kg,000kg,[000kg],"[0.0027164840139448643, -0.05179981142282486, ...",3,3
000th,000th,[000th],"[0.006760281510651112, -0.03317908197641373, 0...",4,4
00am,00am,[00am],"[-0.13817279040813446, -0.29865097999572754, 0...",5,5


In [0]:
text_encoded=[]
for i, s in enumerate(dataset['comment_text_stem'].values):
  sentence_encoded=[]
  for word in s:
    try:
      sentence_encoded.append(embeddings.loc[word]['word_index_new'])
    except KeyError:
      sentence_encoded.append(0)
  text_encoded.append(sentence_encoded)
  if i%1000==0:
    print(i)

dataset["comment_text_encoded"] = text_encoded


0
1000
2000
3000
4000
5000
6000
7000
8000
9000
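
The loop above calls `embeddings.loc[word]` once per token, which builds a full row for every lookup and can be slow on a large vocabulary. A minimal sketch of the same encoding with a plain dictionary, assuming a toy vocabulary in place of the real `embeddings` frame (the `toy_embeddings`, `word_to_id`, and `encode_sentence` names are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the `embeddings` frame: indexed by stem, with a 1-based id.
toy_embeddings = pd.DataFrame(
    {"word_index_new": [1, 2, 3]},
    index=pd.Index(["aww", "match", "thank"], name="comment_text_stem"),
)

# Build the word -> id mapping once, so each token costs a single dict lookup.
word_to_id = {w: int(i) for w, i in toy_embeddings["word_index_new"].items()}

def encode_sentence(stems, mapping, unknown_id=0):
    """Map each stem to its id; out-of-vocabulary stems get `unknown_id`."""
    return [mapping.get(stem, unknown_id) for stem in stems]

encoded = encode_sentence(["aww", "colour", "thank"], word_to_id)
print(encoded)  # [1, 0, 3]
```

As in the original loop, unknown words map to 0, the same index that padding uses later.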


Split the dataset into training and validation sets.

In [0]:
msk = np.random.rand(len(dataset)) < 0.8
train = dataset[msk]
val = dataset[~msk]
y_train = train[LABELS].values 
y_val = val[LABELS].values
print("Train size:", train.shape[0], ", Val size:", val.shape[0])

Train size: 8038 , Val size: 1962
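
Because each row is assigned by an independent random draw, the split is only approximately 80/20, which is why the sizes above are 8038 and 1962 rather than exactly 8000 and 2000. A minimal sketch of the same masking idea on row positions, using a seeded NumPy `Generator` instead of the global seed (the `rng`, `n_rows`, `train_idx`, and `val_idx` names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)  # local, reproducible random source
n_rows = 1000                    # toy row count, not the notebook's 10000

# Each row lands in the training set with probability 0.8.
mask = rng.random(n_rows) < 0.8
train_idx = np.flatnonzero(mask)
val_idx = np.flatnonzero(~mask)

print("Train size:", len(train_idx), ", Val size:", len(val_idx))
```

The two index arrays are disjoint and together cover every row, so no comment appears in both sets.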


In [0]:
train_words_indexes = list(train["comment_text_encoded"].values)
val_words_indexes = list(val["comment_text_encoded"].values)

In [0]:
# Sanity check: the encoded sequence must have one index per stem
print(train['comment_text_stem'].values[1],len(train['comment_text_stem'].values[1]))  # stems before encoding
print(train_words_indexes[1],len(train_words_indexes[1]))  # word indexes after encoding

['aww' 'match' 'background' 'colour' 'seem' 'stuck' 'thank' 'talk'
 'januari' 'utc'] 10
[3530, 24547, 3670, 8295, 35342, 38099, 39566, 38968, 20449, 42131] 10


Pad the inputs so that every comment has the same length.

In [0]:
q = 0.85
maxlen = int(np.quantile([len(s) for s in train_words_indexes], q))  # sequence length that covers a fraction q of the training comments
print(f'max length q({q}): {maxlen}')
train_words_indexes = sequence.pad_sequences(train_words_indexes, maxlen=maxlen, padding='post')
val_words_indexes = sequence.pad_sequences(val_words_indexes, maxlen=maxlen, padding='post')

max length q(0.85): 58
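
With `padding='post'` and the default `truncating='pre'`, short comments get zeros appended, while comments longer than `maxlen` keep only their last `maxlen` tokens. A minimal NumPy sketch of that behavior, written as an illustration and not as the Keras implementation (the `pad_post_truncate_pre` helper is hypothetical and assumes `maxlen >= 1`):

```python
import numpy as np

def pad_post_truncate_pre(seqs, maxlen, value=0):
    """Pad at the end with `value`; drop leading tokens of over-long sequences."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        kept = np.asarray(seq[-maxlen:], dtype=np.int32)  # keep the last maxlen tokens
        out[row, :len(kept)] = kept
    return out

padded = pad_post_truncate_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=3)
print(padded.tolist())  # [[5, 6, 0], [3, 4, 5]]
```

If the start of a comment matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.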


In [0]:
# Sanity check
print(train_words_indexes.shape)
train_words_indexes[:2]

(8038, 58)
Out[13]: array([[13523, 12237, 23887, 42075, 17359, 25206, 13818, 33478, 43463,
        42279,  8019, 15576, 42934, 27279, 44877, 11460, 13665, 30628,
        33176, 39359, 38968, 29214, 36334, 33431,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0],
       [17939, 24162, 32688, 40650, 12237, 43139, 16962,  8718, 33176,
        33111, 19561, 38968, 12237, 19740, 38968, 29214, 35342,  6670,
        14781,   916, 19552,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0]], dtype=int32)

Now, the inputs are ready for the Neural Network.

In [0]:
import datetime
from tensorflow.keras import callbacks
%reload_ext tensorboard

experiment_log_dir = "/dbfs/FileStore/experiments/jgsaw_base/"
log_dir = experiment_log_dir + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)


In [0]:
%tensorboard --logdir $experiment_log_dir

ERROR: Failed to launch TensorBoard (exited with 1).
Contents of stderr:
2023-03-16 23:05:14.073014: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-16 23:05:14.228534: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-16 23:05:14.228598: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-16 23:05:15.101848: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvin

Prepare the loaded embeddings for the current problem.

In [0]:
embeddings_dim = embeddings.features.values[0].shape[0]
embeddings_dim

Out[16]: 100

In [0]:
embeddings_matrix = np.vstack(
  [np.zeros((1,embeddings_dim)),
   np.vstack(embeddings.features)]
)

In [0]:
embeddings_matrix.shape

Out[18]: (45266, 100)
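
The matrix has one more row than the vocabulary: row 0 is a zero vector reserved for padding and unknown words, and row `i` holds the vector of the word whose `word_index_new` is `i`. A small sketch of that layout with a toy two-word vocabulary (the `vocab_vectors` and `toy_matrix` names and the 4-dimensional vectors are assumptions for illustration):

```python
import numpy as np

embedding_dim = 4
# Toy vectors for the words with ids 1 and 2.
vocab_vectors = [np.full(embedding_dim, 0.1), np.full(embedding_dim, 0.2)]

# Prepend a zero row so that id 0 (padding / unknown) maps to a null vector.
toy_matrix = np.vstack([np.zeros((1, embedding_dim)), np.vstack(vocab_vectors)])

print(toy_matrix.shape)  # (3, 4)
```

The same reasoning gives the shape above: 45265 unique stems plus the padding row make 45266 rows.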

Check with a sample word that the matrix row matches the source embedding.

In [0]:
word_detail= 'join'
# Check some embeddings
print(f"Word: {word_detail}")
w_id = embeddings.loc[word_detail]["word_index_new"]
print("Id:", w_id)
print(f"Embedding in matrix: {embeddings_matrix[w_id][:20]} ...")
print(f"Embedding in source: {embeddings.loc[word_detail].features[:20]} ...")

Word: join
Id: 20834
Embedding in matrix: [ 0.09790695  0.02189239  0.14752306  0.05103488  0.05161414  0.09839237
 -0.03173011 -0.08204264 -0.09759831 -0.4352763   0.09824388 -0.13795614
  0.01458105 -0.22353576  0.37919179 -0.00846863  0.16990623 -0.01161456
  0.05995915 -0.24855721] ...
Embedding in source: [ 0.09790695  0.02189239  0.14752306  0.05103488  0.05161414  0.09839237
 -0.03173011 -0.08204264 -0.09759831 -0.4352763   0.09824388 -0.13795614
  0.01458105 -0.22353576  0.37919179 -0.00846863  0.16990623 -0.01161456
  0.05995915 -0.24855721] ...


In [0]:
embeddings.loc[word_detail]

Out[20]: words                                                          join
words_array                                                  [join]
features          [0.09790695458650589, 0.02189238928258419, 0.1...
word_index                                                    37921
word_index_new                                                20834
Name: join, dtype: object

Train three different models to classify the dataset and apply transfer learning.

In [0]:
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Dropout
from tensorflow.keras.models import Model
import tensorflow as tf

def get_embeddings_model(input_len, num_words, embeddings_dim, embeddings_matrix=None, trainable=False, layers=(64,)):
    """Build and compile a multi-label classifier on top of an embedding layer."""
    # Inputs: one padded sequence of word indexes per comment
    inputs = Input(shape=(input_len,), name="comment_words_idx")

    # Embedding layer: start from the pretrained matrix when given, otherwise learn it from scratch
    if embeddings_matrix is not None:
        x = Embedding(num_words, embeddings_dim, weights=[embeddings_matrix], trainable=trainable)(inputs)
    else:
        x = Embedding(num_words, embeddings_dim)(inputs)
    x = Flatten()(x)

    # Classification head
    for n_neurons in layers:
        x = Dense(n_neurons, activation='sigmoid')(x)

    # Output: one independent sigmoid per label
    output = Dense(6, activation='sigmoid')(x)

    # Model
    model = Model([inputs], output)
    model.compile(loss="binary_crossentropy",
                  optimizer='adam', metrics=['accuracy', tf.keras.metrics.AUC(name='auc')])
    return model

In [0]:
import time

In [0]:
train_words_indexes, y_train

Out[23]: (array([[13523, 12237, 23887, ...,     0,     0,     0],
        [17939, 24162, 32688, ...,     0,     0,     0],
        [24046, 32669, 38330, ...,     0,     0,     0],
        ...,
        [10304, 13321, 13815, ..., 31795,     0, 33941],
        [37715,  2931, 11479, ..., 37715, 22886, 39566],
        [ 6002, 39751, 27889, ...,     0,     0,     0]], dtype=int32),
 array([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        ...,
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]))

In [0]:
import tensorflow.keras.backend as K
import mlflow

# Three hypotheses: frozen pretrained embeddings, fine-tuned pretrained embeddings,
# and embeddings learned from scratch.
hypotheses=[
  {"embeddings_matrix":embeddings_matrix, "epochs":1, "layers":[64], "trainable":False},
  {"embeddings_matrix":embeddings_matrix, "epochs":1, "layers":[64], "trainable":True},
  {"embeddings_matrix":None, "epochs":1, "layers":[64], "trainable":True},
]

mlflow.tensorflow.autolog()
results=[]
for hyp in hypotheses:
  with mlflow.start_run(run_name="jigsaw_embeddings") as run:
    mlflow.log_params(hyp)
    start = time.time()

    # get_embeddings_model already compiles the model, so no second compile is needed
    model = get_embeddings_model(maxlen, embeddings_matrix.shape[0], embeddings_matrix.shape[1],
                                 hyp["embeddings_matrix"], layers=hyp["layers"], trainable=hyp["trainable"])

    model.summary()  # summary() prints directly and returns None

    trainable_count = int(np.sum([K.count_params(w) for w in model.trainable_weights]))
    mlflow.log_param('trainable_params',trainable_count)

    history = model.fit(train_words_indexes, y_train,
                        batch_size=128,
                        epochs=hyp["epochs"],
                        validation_data=(val_words_indexes, y_val)
                       )

    mlflow.log_metric('train_time_s',time.time()-start)
    # Log the last-epoch value of every tracked metric
    for metric, value in history.history.items():
      mlflow.log_metric(metric, value[-1])

    results.append({'runid':run.info.run_id,
                    'history':history.history})
    print(results[-1])
    print("---------------------------------")
    print("---------------------------------")
    print("---------------------------------")

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 comment_words_idx (InputLay  [(None, 58)]             0         
 er)                                                             
                                                                 
 embedding_3 (Embedding)     (None, 58, 100)           4526600   
                                                                 
 flatten_3 (Flatten)         (None, 5800)              0         
                                                                 
 dense_6 (Dense)             (None, 64)                371264    
                                                                 
 dense_7 (Dense)             (None, 6)                 390       
                                                                 
Total params: 4,898,254
Trainable params: 371,654
Non-trainable params: 4,526,600
___________________________________________
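
The parameter counts in the summary can be reproduced by hand from the layer shapes, which also shows how much freezing the embedding layer saves. A short arithmetic sketch using the sizes reported in this notebook (the variable names are illustrative):

```python
vocab_rows, embedding_dim = 45266, 100       # shape of embeddings_matrix
input_len, hidden_units, n_labels = 58, 64, 6

embedding_params = vocab_rows * embedding_dim                # one vector per row
flat_size = input_len * embedding_dim                        # Flatten output: 5800
hidden_params = flat_size * hidden_units + hidden_units      # weights + biases
output_params = hidden_units * n_labels + n_labels           # weights + biases

total_params = embedding_params + hidden_params + output_params
trainable_when_frozen = hidden_params + output_params

print(f"{embedding_params:,}")       # 4,526,600
print(f"{total_params:,}")           # 4,898,254
print(f"{trainable_when_frozen:,}")  # 371,654
```

With the embeddings frozen, only the two dense layers are trained, roughly 7.6% of the total parameters.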

Use the best model.

In [0]:
train_results = pd.concat([pd.DataFrame(r['history'],index=[r['runid']]) for r in results])
train_results = train_results.sort_values('val_auc',ascending=False).reset_index().rename(columns={'index':'runid'})
train_results

,runid,loss,accuracy,auc,val_loss,val_accuracy,val_auc
0,d4e24833e3bd4217943895762460ffd4,0.249039,0.912043,0.691705,0.132734,0.994393,0.868639
1,3fddda789ffd45e19cedcedf1d8e925d,0.316091,0.025006,0.679059,0.14967,0.108563,0.866774
2,efe6fe9cb4f742bba7e4c7299fe69999,0.279277,0.297089,0.551858,0.151595,0.994903,0.754478


In [0]:
best_run = train_results.loc[train_results['val_auc']==train_results['val_auc'].max(),"runid"][0]
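
The selection above relies on the frame being sorted and re-indexed so that label 0 is the best row. A minimal alternative sketch that does not depend on row order, using `idxmax` on a toy frame (the `toy_results` frame and the `run_a`, `run_b`, `run_c` ids are placeholders; the `val_auc` values are copied from the table above):

```python
import pandas as pd

# Toy results in arbitrary order; run ids are placeholders, not real MLflow ids.
toy_results = pd.DataFrame({
    "runid": ["run_a", "run_b", "run_c"],
    "val_auc": [0.754478, 0.868639, 0.866774],
})

# idxmax returns the index label of the highest val_auc, whatever the row order.
best_toy_run = toy_results.loc[toy_results["val_auc"].idxmax(), "runid"]
print(best_toy_run)  # run_b
```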

In [0]:
best_model = mlflow.tensorflow.load_model(f'runs:/{best_run}/model')
best_model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 comment_words_idx (InputLay  [(None, 58)]             0         
 er)                                                             
                                                                 
 embedding_4 (Embedding)     (None, 58, 100)           4526600   
                                                                 
 flatten_4 (Flatten)         (None, 5800)              0         
                                                                 
 dense_8 (Dense)             (None, 64)                371264    
                                                                 
 dense_9 (Dense)             (None, 6)                 390       
                                                                 
Total params: 4,898,254
Trainable params: 4,898,254
Non-trainable params: 0
_________________________________________________

In [0]:
for i,p in enumerate(best_model.predict(val_words_indexes)):
  if len(np.array(LABELS)[p>.5])>1:
    print(i)
    break

540
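
Each output unit is an independent sigmoid, so a comment can exceed the 0.5 threshold on several labels at once, and the loop above stops at the first such comment. A small sketch of how thresholded probabilities map back to label names, using made-up probabilities (the `probs` values are hypothetical and not model output):

```python
import numpy as np

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical predicted probabilities for two comments.
probs = np.array([
    [0.91, 0.08, 0.77, 0.02, 0.64, 0.11],
    [0.03, 0.01, 0.02, 0.01, 0.04, 0.01],
])

# Keep the names of all labels whose probability exceeds the threshold.
predicted = [[name for name, on in zip(LABELS, row > 0.5) if on] for row in probs]
multi_label_rows = [i for i, names in enumerate(predicted) if len(names) > 1]

print(predicted)         # [['toxic', 'obscene', 'insult'], []]
print(multi_label_rows)  # [0]
```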


In [0]:
val_words_indexes[88], y_val[88]

Out[31]: (array([27269, 15156, 23887, 26209, 15156,  2931, 15156, 12237, 15156,
        23408, 39994,  1285, 15156,  3122, 30486,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0], dtype=int32),
 array([1, 0, 1, 0, 1, 0]))

In [0]:
val.iloc[88]

Out[32]: comment_text            I NEVER FUCKING MADE THIS MOTHER FUCKING ARTIC...
comment_text_stem       [never, fuck, made, mother, fuck, articl, fuck...
toxic                                                                   1
severe_toxic                                                            0
obscene                                                                 1
threat                                                                  0
insult                                                                  1
identity_hate                                                           0
comment_text_encoded    [27269, 15156, 23887, 26209, 15156, 2931, 1515...
Name: 437, dtype: object

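The two best runs are nearly tied on validation AUC (0.869 vs. 0.867), while the third is lower at 0.754. Model 0, which keeps the pretrained embeddings frozen, trains only 371,654 of its 4,898,254 parameters, so it offers good accuracy at a lower cost.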