This code was used in the "Science nlp classification" Kaggle challenge (https://www.kaggle.com/c/nlpsci/leaderboard#score) and scored 0.81720, which was 1st place as of 15.01.2022.

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [3]:
from build_model import build_model
from preprocessing import create_tokenizer_from_hub_module, convert_single_example, \
                        convert_examples_to_features, convert_text_to_examples, initialize_vars

In [4]:
from plot_functions import plot_label

In [None]:
sess = tf.compat.v1.Session()

Let's load the data and look at the columns.
Note that there is one column with the prediction target, and it is non-binary, so we are dealing with multi-class classification.
Also note that there are two columns with input text: TITLE and ABSTRACT. In this approach, I simply merge them into one input text. An alternative approach will be shown in another notebook.

In [15]:
data = pd.read_csv(r'C:\Users\Anna\Files\mylearning\ScienceCategories\nlpsci\train.csv')
data.head()

In [None]:
data['text'] = data['TITLE'] + ' ' + data['ABSTRACT']
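
One thing to watch with this kind of merge: if either column has a missing value, the `+` yields NaN for the whole row. A minimal sketch of a merge that tolerates missing values, shown on a toy frame (the rows are made up; only the TITLE and ABSTRACT column names come from the data):

```python
import pandas as pd

# Toy frame standing in for train.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "TITLE": ["Graph kernels", None],
    "ABSTRACT": ["We study kernels on graphs.", "An abstract without a title."],
})

# Replace missing values with empty strings before concatenating,
# then strip the stray space left when one side was empty.
toy["text"] = (toy["TITLE"].fillna("") + " " + toy["ABSTRACT"].fillna("")).str.strip()

print(toy["text"].tolist())
```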

In [16]:
data.shape

(15972, 5)

Use 10% of the data as the dev set. Note that no random_state is set, so the split changes between runs.

In [17]:
train, dev = train_test_split(data, test_size=0.1)
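
If the classes are imbalanced, a stratified split keeps the label proportions the same in both parts, and a fixed random_state makes the split repeatable. A sketch on toy data (the label counts and the seed 42 are my assumptions, not values from this notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with imbalanced labels: 60 / 30 / 10 rows per class (made up).
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": [0] * 60 + [1] * 30 + [2] * 10,
})

# stratify keeps the class proportions in both parts;
# random_state fixes the shuffle so the split is reproducible.
toy_train, toy_dev = train_test_split(
    toy, test_size=0.1, stratify=toy["label"], random_state=42
)

print(toy_dev["label"].value_counts().sort_index().to_dict())
```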

In [18]:
train.shape

(14374, 5)

Check the average length of an input text, counted in whitespace-separated words

In [19]:
lens = [len(t.split()) for t in train['text'].tolist()]
sum(lens) / len(lens)

158.32711840823708

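150 looks like a reasonable choice of max_seq_length. Keep in mind that the average above counts words, while BERT's WordPiece tokenizer usually produces more tokens than words, so texts longer than the limit get truncated.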

In [25]:
max_seq_length = 150
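
The mean alone does not say how many texts run past the limit. A small sketch that reports percentiles and the share of texts longer than a given limit; `length_report` is a hypothetical helper, and it counts whitespace words like the cell above, not WordPiece tokens:

```python
import numpy as np

def length_report(texts, limit):
    # Word counts per text, using the same whitespace split as above.
    lens = np.array([len(t.split()) for t in texts])
    return {
        "mean": float(lens.mean()),
        "p50": float(np.percentile(lens, 50)),
        "p90": float(np.percentile(lens, 90)),
        "share_over_limit": float((lens > limit).mean()),
    }

# Toy texts with 3, 5, 1 and 8 words.
sample = ["a b c", "a b c d e", "a", "a b c d e f g h"]
report = length_report(sample, limit=4)
print(report)
```

On the real data this could be called as length_report(train['text'].tolist(), max_seq_length).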

In [26]:
train_text = train['text'].tolist()
train_text = np.array(train_text, dtype=object)[:, np.newaxis]

dev_text = dev['text'].tolist()
dev_text = np.array(dev_text, dtype=object)[:, np.newaxis]

In [28]:
bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
tokenizer = create_tokenizer_from_hub_module(bert_path)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Convert data to InputExample format

In [29]:
train_examples = convert_text_to_examples(train_text)
dev_examples = convert_text_to_examples(dev_text)

Convert data to BERT input format

In [30]:
(train_input_ids,
train_input_masks,
train_segment_ids) = convert_examples_to_features(tokenizer, train_examples, max_seq_length=max_seq_length)
(dev_input_ids,
dev_input_masks,
dev_segment_ids) = convert_examples_to_features(tokenizer, dev_examples, max_seq_length=max_seq_length)

Converting examples to features: 100%|██████████| 14374/14374 [01:37<00:00, 147.90it/s]
Converting examples to features: 100%|██████████| 1598/1598 [00:11<00:00, 145.08it/s]


In [31]:
train_input_ids.shape

(14374, 150)
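
The preprocessing module is not shown in this notebook, so here is a rough sketch of what the three arrays hold under the standard BERT input convention, which I assume convert_examples_to_features follows. The toy vocabulary and the `to_features` helper are hypothetical, and whitespace splitting stands in for WordPiece:

```python
# Toy vocabulary; the real one comes from the BERT hub module.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "deep": 4, "learning": 5, "rocks": 6}

def to_features(text, max_seq_length):
    # Reserve two positions for [CLS] and [SEP] and truncate the rest.
    tokens = text.lower().split()[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    input_mask = [1] * len(input_ids)       # 1 marks a real token
    padding = max_seq_length - len(input_ids)
    input_ids += [VOCAB["[PAD]"]] * padding
    input_mask += [0] * padding             # 0 marks padding
    segment_ids = [0] * max_seq_length      # single-text input: all zeros
    return input_ids, input_mask, segment_ids

ids, mask, segments = to_features("Deep learning rocks", max_seq_length=8)
print(ids)
print(mask)
print(segments)
```

Every text ends up with exactly max_seq_length positions, which is why the array above has 150 columns.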

Get the column with labels.
Get the number of unique labels; we will need it to build the model.

In [None]:
num_labels = len(set(train['label'].tolist()))
train_labels = train['label'].to_numpy()
dev_labels = dev['label'].to_numpy()
num_labels
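
The sparse_categorical_crossentropy loss used later expects integer labels in the range 0 to num_labels - 1. If a dataset has string labels or gaps in the numbering, they can be remapped first. A sketch with np.unique; the raw labels here are made up:

```python
import numpy as np

raw_labels = np.array([10, 30, 10, 20, 30])

# classes holds the sorted unique labels;
# encoded holds each label's index into classes.
classes, encoded = np.unique(raw_labels, return_inverse=True)
num_classes = len(classes)

# Check that the encoded labels cover exactly 0 .. num_classes - 1.
assert set(encoded.tolist()) == set(range(num_classes))

# Predicted indices can be mapped back to the original labels the same way.
decoded = classes[encoded]
print(encoded.tolist(), decoded.tolist())
```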

Use EarlyStopping to stop training if val_loss does not improve for 3 consecutive epochs, and restore the weights from the best epoch

In [32]:
my_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                min_delta=0,
                                patience=3,
                                verbose=0,
                                mode='auto',
                                baseline=None,
                                restore_best_weights=True)
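
The patience logic can be sketched in plain Python. This is a simplified model of the callback's behaviour, not the Keras source, and the loss values are made up:

```python
def simulate_early_stopping(val_losses, patience, min_delta=0.0):
    """Return (last epoch run, best epoch), both 1-based."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: count the epoch and stop once patience runs out.
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

losses = [0.90, 0.60, 0.65, 0.70, 0.62, 0.50]
stopped, best = simulate_early_stopping(losses, patience=3)
print(stopped, best)
```

With patience=3 the run ends three epochs after the best one, so the later improvement in the toy list is never reached. restore_best_weights=True then rolls the model back to the best epoch.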

Building the model: pass max_seq_length to define the input shape and num_labels to define the number of units in the last layer.
Compiling the model: for multi-class classification where each label is an integer, we use the sparse_categorical_crossentropy loss and the accuracy metric.

In [None]:
model = build_model(max_seq_length, num_labels)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5, beta_1=0.8),
              metrics=["accuracy"])
model.summary()
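
To make the choice of loss concrete, here is a NumPy sketch of what sparse categorical cross-entropy computes, leaving out the numerical safeguards a real implementation adds. The probabilities are made up, and the function is my own illustration, not the Keras code:

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    # Pick the predicted probability of the true class for each row,
    # then average the negative log-likelihood.
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])  # integer class ids, not one-hot vectors

loss = sparse_categorical_crossentropy(labels, probs)

# The one-hot formulation (categorical_crossentropy) gives the same value.
one_hot = np.eye(3)[labels]
loss_one_hot = float(-(one_hot * np.log(probs)).sum(axis=1).mean())
print(loss, loss_one_hot)
```

The sparse variant simply saves the step of one-hot encoding the label column.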

In [37]:
# Instantiate variables
initialize_vars(sess)

In [39]:
epochs = 20
batch_size = 32

history = model.fit(
    [train_input_ids, train_input_masks, train_segment_ids],
    train_labels,
    validation_data=(
        [dev_input_ids, dev_input_masks, dev_segment_ids],
        dev_labels,
    ),
    epochs=epochs,
    batch_size=batch_size,
    callbacks=[my_callback]
)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Train on 14374 samples, validate on 1598 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20


<tensorflow.python.keras.callbacks.History at 0x1e61fd05fc8>

Plot model history for accuracy:

In [None]:
plot_label(history, 'acc')

Plot model history for loss:

In [None]:
plot_label(history, 'loss')

To save the model's weights:

In [None]:
model.save_weights(r'weights')

Load test data and convert to BERT input format

In [34]:
test = pd.read_csv(r'C:\Users\Anna\Files\mylearning\ScienceCategories\nlpsci\test.csv')
test['text'] = test['TITLE'] + ' ' + test['ABSTRACT']
test_text = test['text'].tolist()
test_text = np.array(test_text, dtype=object)[:, np.newaxis]

In [35]:
test_examples = convert_text_to_examples(test_text)

In [36]:
(test_input_ids,
test_input_masks,
test_segment_ids) = convert_examples_to_features(tokenizer, test_examples, max_seq_length=max_seq_length)

Converting examples to features: 100%|██████████| 5000/5000 [00:35<00:00, 141.68it/s]


Let's predict test data labels now

In [42]:
pred = model.predict([test_input_ids,
                      test_input_masks,
                      test_segment_ids], batch_size=batch_size)

In [43]:
pred.shape

(5000, 6)

Convert probabilities to class labels

In [44]:
pred_class = np.argmax(pred, axis = 1)
pred_class[:10]

array([1, 0, 2, 3, 2, 3, 2, 2, 1, 1], dtype=int64)
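
Besides the winning class, it can be useful to look at how confident each prediction is. A sketch on toy probabilities (the numbers and the 0.5 threshold are made up for illustration):

```python
import numpy as np

# Toy softmax outputs for 3 samples and 4 classes.
toy_pred = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.40, 0.35, 0.20, 0.05],
    [0.05, 0.05, 0.10, 0.80],
])

toy_class = np.argmax(toy_pred, axis=1)  # most probable class per row
toy_conf = np.max(toy_pred, axis=1)      # probability assigned to that class

# Rows whose winning probability is below the threshold are worth a look.
threshold = 0.5
uncertain = np.where(toy_conf < threshold)[0]
print(toy_class, uncertain)
```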

In [45]:
pred_class.shape

(5000,)

Add the column names needed for the submission

In [49]:
results = pd.DataFrame(data=pred_class, columns=['label'])
results['ID'] = results.index + 1  # IDs start at 1
results = results[['ID', 'label']]  # put the ID column first
results.to_csv(r'submission.csv', index=False)
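
Before uploading, a quick sanity check of the file can catch formatting slips. In this sketch `check_submission` is a hypothetical helper, and the expected format (ID and label columns, unique IDs, labels between 0 and num_labels - 1) is inferred from the code above, not taken from the competition rules:

```python
import io
import pandas as pd

def check_submission(csv_source, n_rows, num_labels):
    """Return a list of problems found in a submission file (empty if none)."""
    sub = pd.read_csv(csv_source)
    problems = []
    if list(sub.columns) != ["ID", "label"]:
        problems.append("unexpected columns: %s" % list(sub.columns))
        return problems
    if len(sub) != n_rows:
        problems.append("expected %d rows, found %d" % (n_rows, len(sub)))
    if not sub["ID"].is_unique:
        problems.append("duplicate IDs")
    if not sub["label"].between(0, num_labels - 1).all():
        problems.append("labels outside 0..%d" % (num_labels - 1))
    return problems

# In-memory toy files stand in for submission.csv.
good = io.StringIO("ID,label\n1,0\n2,5\n3,2\n")
bad = io.StringIO("ID,label\n1,0\n1,7\n")

good_problems = check_submission(good, n_rows=3, num_labels=6)
bad_problems = check_submission(bad, n_rows=2, num_labels=6)
print(good_problems, bad_problems)
```

For the file written above, the call would be check_submission('submission.csv', n_rows=5000, num_labels=6).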