https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb#scrollTo=mHwjUwYgi-uL

In [120]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers
import warnings
warnings.filterwarnings('ignore')

In [121]:
df = pd.read_csv('input/train.csv', header=None)

In [122]:
df.shape

(6920, 2)

Use only the first `size_train` rows for training, because of resource constraints. Here `size_train` is 6920, which is the whole file.

In [123]:
size_train=6920
batch_1 = df[:size_train]

In [124]:
batch_1[1].value_counts()

1    3610
0    3310
Name: 1, dtype: int64

Loading DistilBERT

In [125]:
#For DistilBERT:
model_class = transformers.DistilBertModel
tokenizer_class = transformers.DistilBertTokenizer
pretrained_weights = 'distilbert-base-uncased'

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (transformers.BertModel, transformers.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Prepare the data for DistilBERT

#### --- Padding + attention masks built by hand (not actually needed when using encode_plus, see the section below)

In [126]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)
print(max_len)

67


In [127]:
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
np.array(padded).shape

(6920, 67)

In [128]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(6920, 67)

In [129]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)
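
To make the manual padding and masking steps above easier to check, here is a minimal sketch on two toy sequences. The token IDs are made up for illustration, and the approach assumes that ID 0 is reserved for padding and never appears as a real token.

```python
import numpy as np

# Toy token-ID lists of unequal length (illustrative IDs, not real tokenizer output).
tokenized = [[101, 2023, 102], [101, 2023, 2003, 2307, 102]]

# Right-pad every sequence with 0 up to the longest length in the batch.
max_len = max(len(ids) for ids in tokenized)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

# 1 marks a real token, 0 marks padding.
attention_mask = (padded != 0).astype(int)

print(padded)
print(attention_mask)
```

The mask has the same shape as the padded array, so the two can be passed to the model together.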

#### --- The same preparation with encode_plus:

In [130]:
sentences= batch_1[0].values

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
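
Steps (2), (3), (5) and (6) listed in the comments above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the IDs 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` IDs of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Hypothetical helper: wrap IDs in [CLS]/[SEP], then pad or truncate."""
    # Reserve two positions for the special tokens.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)
    # Fill the remainder with padding, which the mask marks with 0.
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([1037, 18385], max_length=6)
long_ids, long_mask = pad_or_truncate([1037, 18385, 1010, 6057, 1998], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short input is padded and the long one loses its last token, so both end up with exactly `max_length` positions.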



In [131]:
# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

Original:  a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
Token IDs: tensor([  101,  1037, 18385,  1010,  6057,  1998,  2633, 18276,  2128, 16603,
         1997,  5053,  1998,  1996,  6841,  1998,  5687,  5469,  3152,   102,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])


Run the model

In [132]:
with torch.no_grad():
    # Use the masks built by encode_plus, which belong to the current input_ids.
    last_hidden_states = model(input_ids, attention_mask=attention_masks)

Let's slice out only the part of the output that we need: the output corresponding to the first token of each sentence. For sentence classification, BERT adds a special token called [CLS] (for classification) at the beginning of every sentence, hence the 0 index in the slice. The output for that token can be thought of as an embedding of the entire sentence.

We'll save those in the features variable, as they'll serve as the features for our logistic regression model.

NOTE: the model outputs a tuple, and its first element (the last hidden states) is the one we need. See the documentation at https://huggingface.co/transformers/migration.html

In [133]:
features = last_hidden_states[0][:,0,:].numpy()
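
The `[:, 0, :]` slice is easier to see on a small stand-in array. The sketch below uses NumPy with a tiny made-up shape. In the notebook the tensor would be (6920, 67, 768), assuming DistilBERT's hidden size of 768.

```python
import numpy as np

# Stand-in for the encoder output: (batch, sequence length, hidden size).
batch, seq_len, hidden = 3, 4, 5
hidden_states = np.arange(batch * seq_len * hidden).reshape(batch, seq_len, hidden)

# Keep all sentences, only position 0 (the [CLS] token), and all hidden units.
features = hidden_states[:, 0, :]

print(features.shape)
```

The result has one row per sentence and one column per hidden unit, which is the 2D layout scikit-learn expects.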

In [134]:
labels = batch_1[1]

Let's now split our dataset into a training set and a testing set (even though we're using `size_train` sentences from the SST2 training set).

In [135]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
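
With no extra arguments, `train_test_split` shuffles the rows and holds out a quarter of them (its default `test_size` is 0.25). The NumPy sketch below only imitates that behaviour to show the resulting sizes. The seed is an assumption added here for repeatability, and in the real call `random_state` (and `stratify=labels`, to keep the class balance) serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability
n_samples = 6920
indices = rng.permutation(n_samples)

# Hold out a quarter of the rows, mirroring the default test_size=0.25.
n_test = int(np.ceil(0.25 * n_samples))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))
```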

Grid search for the logistic regression parameters

In [136]:
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best score: ', grid_search.best_score_)

best parameters:  {'C': 5.263252631578947}
best score:  0.8302504816955685
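
It can help to look at the grid itself. The best value reported above, 5.263252631578947, is the second point of the linear grid, so everything between 0.0001 and about 5.26 was never tried. The log-spaced grid below is a possible alternative and an assumption, not part of the original notebook.

```python
import numpy as np

# The grid used above: 20 linearly spaced values between 0.0001 and 100.
linear_grid = np.linspace(0.0001, 100, 20)
print(linear_grid[:3])  # first few candidates

# A log-spaced alternative (an assumption, not from the original notebook):
# one candidate per order of magnitude from 1e-4 to 1e2.
log_grid = np.logspace(-4, 2, 7)
print(log_grid)
```

Passing `{'C': log_grid}` to `GridSearchCV` would probe small and large regularization strengths more evenly.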


In [137]:
lr_clf = LogisticRegression(C=5.2)
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=5.2, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [138]:
lr_clf.score(test_features, test_labels)

0.8433526011560694

In [139]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.507 (+/- 0.03)
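
As a second sanity check, the accuracy of always predicting the most frequent label can be computed straight from the label counts printed earlier. This is a rough sketch over the whole dataset rather than the training split, and it is not the same thing as the `DummyClassifier` score, whose default strategy depends on the scikit-learn version.

```python
# Label counts reported by value_counts() earlier in the notebook.
counts = {1: 3610, 0: 3310}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
majority_accuracy = counts[majority_label] / total

print(majority_label, round(majority_accuracy, 4))
```

Either baseline sits close to 0.5, well below the logistic regression score.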


For reference, the highest accuracy score for this dataset is currently 96.8. DistilBERT can be trained to improve its score on this task through a process called fine-tuning, which updates the model's weights so that it performs better on this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT achieves an accuracy score of 90.7. The full-size BERT model achieves 94.9.

And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at fine-tuning. You can also go back and switch from DistilBERT to BERT and see how that works.



Important: for an example of fine-tuning the BERT language model (using pregenerate_training_data.py and finetune_on_pregenerated.py), see:

https://github.com/Shivampanwar/Bert-text-classification/blob/master/bert_language_model_with_sequence_classification.ipynb