# Preliminaries

Write the installed package versions to a file each time the notebook runs, so the dependencies can be recovered later if needed.

Requirements for each notebook are hosted in the companion GitHub repo, and can be pulled down and installed here if needed. The companion repo is located at https://github.com/azunre/transfer-learning-for-nlp

In [1]:
!pip freeze > kaggle_image_requirements.txt

# Prepare Book Reviews from MDSD Data

In [2]:
!ls ../input/multi-domain-sentiment-dataset-books-and-dvds/

books.negative.review  dvd.negative.review
books.positive.review  dvd.positive.review


### Preliminary overarching hyperparameters

In [3]:
Nsamp = 1000 # number of samples to generate in each class - 'positive', 'negative'
maxtokens = 200 # the maximum number of tokens per document
maxtokenlen = 100 # the maximum length of each token

### Tokenization

In [4]:
def tokenize(row):
    # split on single spaces and keep at most maxtokens tokens per document
    if row is None or row == '':
        tokens = []
    else:
        tokens = str(row).split(" ")[:maxtokens]
    return tokens

### Stop-word removal

In [5]:
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

def stop_word_removal(row):
    # drop stop words and empty strings; the comparison is case-sensitive
    tokens = [token for token in row if token and token not in stopwords]
    return tokens

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Use regular expressions to remove unnecessary characters

In [6]:
import re

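def remove_reg_expressions(row):
    tokens = []
    try:
        for token in row:
            token = token.lower()
            token = re.sub(r'[\W\d]', " ", token) # replace non-word characters and digits with spaces
            token = token[:maxtokenlen] # truncate token
            tokens.append(token)
    except (AttributeError, TypeError): # row is not iterable, or a token is not a string
        token = ""
        tokens.append(token)
    return tokens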
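
To see how tokenization, stop-word removal and character cleanup interact, the sketch below chains them on one sentence in the same order the parse step uses. It is only an illustration: `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK list, and `preprocess` is a made-up helper that is not part of the notebook.

```python
import re

MAX_TOKENS = 200     # mirrors maxtokens
MAX_TOKEN_LEN = 100  # mirrors maxtokenlen
STOP_WORDS = {"the", "a", "is", "to"}  # hypothetical stand-in for the NLTK stop-word list

def preprocess(text):
    # 1. split on spaces and truncate the document
    tokens = text.split(" ")[:MAX_TOKENS]
    # 2. drop stop words (case-sensitive, as in the notebook) and empty strings
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    # 3. lowercase, replace non-word characters and digits with spaces, truncate
    return [re.sub(r"[\W\d]", " ", t.lower())[:MAX_TOKEN_LEN] for t in tokens]

sample = "The plot is hard to follow, 100% boring!"
result = preprocess(sample)
print(result)
# ['the', 'plot', 'hard', 'follow ', '    ', 'boring ']
```

Because stop words are removed before lowercasing, a capitalized "The" survives the filter, and tokens made only of digits or punctuation turn into runs of spaces rather than disappearing.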

### Data shuffling function

In [7]:
import numpy as np

def unison_shuffle_data(data, header):
    # apply one random permutation to both arrays so each document keeps its label
    p = np.random.permutation(len(header))
    data = data[p]
    header = np.asarray(header)[p]
    return data, header
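
As a quick sanity check of the idea, the following sketch shuffles a toy dataset and confirms that every document is still paired with its original label. The helper name `unison_shuffle`, the `seed` parameter and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def unison_shuffle(data, labels, seed=0):
    # one permutation, applied to both arrays
    rng = np.random.default_rng(seed)
    p = rng.permutation(len(labels))
    return np.asarray(data, dtype=object)[p], np.asarray(labels)[p]

docs = ["bad book", "dull plot", "great read", "loved it"]
labels = [0, 0, 1, 1]

shuffled_docs, shuffled_labels = unison_shuffle(docs, labels)

# the set of (document, label) pairs must be unchanged by the shuffle
pairs_before = set(zip(docs, labels))
pairs_after = set(zip(shuffled_docs.tolist(), shuffled_labels.tolist()))
print(pairs_before == pairs_after)  # True
```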

### Parse

In [8]:
import numpy as np

def parse_MDSD(data):    
    out_lst = []
    for i in range(len(data)):
        txt = ""
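        if(data[i]=="<review_text>\n"):
            # collect every line up to the closing tag; starting at j=i keeps the
            # opening tag line too, which is why "review_text" leads each row below
            j=i
            while(data[j]!="</review_text>\n"):
                txt = txt+data[j]
                j = j+1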
            text = tokenize(txt)
            text = stop_word_removal(text)
            text = remove_reg_expressions(text)
            out_lst.append(text)
            
    return out_lst

with open ("../input/multi-domain-sentiment-dataset-books-and-dvds/books.negative.review", "r", encoding="latin1") as myfile:
    data=myfile.readlines()
neg_books = parse_MDSD(data)
len(neg_books)

with open ("../input/multi-domain-sentiment-dataset-books-and-dvds/books.positive.review", "r", encoding="latin1") as myfile:
    data=myfile.readlines()
pos_books = parse_MDSD(data)
len(pos_books)

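header = [0]*len(neg_books) # label 0 = negative
header.extend([1]*len(pos_books)) # label 1 = positive
neg_books.extend(pos_books)
MDSD_data = np.array(neg_books, dtype=object) # token lists differ in length, so store them as objects

data, sentiments = unison_shuffle_data(MDSD_data, header)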

Write the prepared dataset to file for use by transformers.

In [9]:
import pandas as pd

train_df = pd.DataFrame(data=data)
print(train_df.shape)


(2000, 1)


In [10]:
train_df.to_csv("albert_dataset.csv")
train_df.head()

Unnamed: 0,0
0,"[ review_text this, emotionally, charged , wi..."
1,"[ review_text for, part , people, author, tho..."
2,"[ review_text i, , pages, go, i, can t, wai..."
3,"[ review_text this, required, text, one, grad..."
4,"[ review_text very, detailed, information, hy..."


# Fine-tune ALBERT on Data

Initialize the ALBERT tokenizer from the pretrained checkpoint.

In [11]:
from transformers import AlbertTokenizer



In [12]:
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2") # use pre-trained ALBERT tokenizer

Having prepared the tokenizer, load the pretrained checkpoint into an ALBERT masked language model.

In [13]:
from transformers import AlbertForMaskedLM # use masked language modeling

model = AlbertForMaskedLM.from_pretrained("albert-base-v2") # initialize to pretrained checkpoint

print("Number of parameters in ALBERT model:")
print(model.num_parameters())

Number of parameters in ALBERT model:
11812272


Build the dataset with the tokenizer, using a helper class included with transformers.

In [14]:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="albert_dataset.csv",
    block_size=128, # maximum number of tokens kept per line
)
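
Roughly speaking, this dataset class treats each non-empty line of the file as one training example and truncates it to `block_size` tokens. The sketch below imitates that behaviour in plain Python under simplifying assumptions: `load_lines` is a hypothetical helper, and whitespace splitting stands in for the ALBERT tokenizer, which really produces subword ids and adds special tokens.

```python
def load_lines(text, block_size, tokenize=str.split):
    """Turn each non-empty line into one example of at most block_size tokens."""
    examples = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are skipped
        examples.append(tokenize(line)[:block_size])
    return examples

sample_text = "one two three four five\n\nsix seven\n"
examples = load_lines(sample_text, block_size=3)
print(examples)
# [['one', 'two', 'three'], ['six', 'seven']]
```

Note that when the input is a CSV file, the header row is just another line and becomes an example as well.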

We will also need a "data collator". This is a helper that assembles a batch of tokenized sample lines (each at most block_size tokens long) into a special object that PyTorch can consume for neural network training.

In [15]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True, mlm_probability=0.15) # use masked language modeling, and mask words with probability of 0.15
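
To make the masking step more concrete, here is a minimal NumPy sketch of BERT-style masking as it is commonly described: each token is selected with probability `mlm_probability`; of the selected tokens about 80% become the mask token, about 10% become a random token, and the rest stay unchanged. The function `mask_tokens`, the `mask_id` and `vocab_size` values, and the use of -100 as the "ignore" label are assumptions for illustration. The real collator also avoids masking special tokens, which this sketch does not.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    rng = np.random.default_rng(seed)
    inputs = np.array(token_ids, dtype=np.int64)
    labels = inputs.copy()

    # choose which positions the model must predict
    selected = rng.random(inputs.shape) < mlm_probability
    labels[~selected] = -100  # assumed "ignore" value for positions not predicted

    # split the selected positions 80 / 10 / 10
    roll = rng.random(inputs.shape)
    use_mask = selected & (roll < 0.8)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)

    inputs[use_mask] = mask_id
    inputs[use_random] = rng.integers(0, vocab_size, size=int(use_random.sum()))
    return inputs, labels

# placeholder ids; mask_id and vocab_size are illustrative values
token_ids = list(range(100, 120))
masked_inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=30000)
print(masked_inputs)
print(labels)
```

Positions whose label is -100 contribute nothing to the loss, so the model is trained only on the selected tokens.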

Define standard training arguments, and then use them with the previously defined dataset and collator to define a "trainer" that runs for ten epochs, i.e. ten passes over the roughly 2,000 lines of the dataset file.

In [16]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="albert",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_gpu_train_batch_size=16,
    save_total_limit=1,
)

In [17]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

Train and time.

In [18]:
import time
start = time.time()
trainer.train()
end = time.time()
print("Number of seconds for training:")
print((end-start))

{"loss": 1.3136331552267075, "learning_rate": 3.0158730158730158e-05, "epoch": 3.9682539682539684, "step": 500}

{"loss": 1.085478025019169, "learning_rate": 1.0317460317460318e-05, "epoch": 7.936507936507937, "step": 1000}

Number of seconds for training:
325.6463761329651
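
The logged learning rates and epoch values can be reproduced with a little arithmetic. The sketch below assumes a single GPU, a dataset file of 2,001 lines (the header row plus 2,000 reviews), a base learning rate of 5e-5 and a linear decay to zero without warmup. These are assumptions about the defaults in use, not values set explicitly in this notebook.

```python
import math

num_lines = 2001        # assumed: CSV header row + 2000 review rows
batch_size = 16         # per_gpu_train_batch_size, assuming one GPU
num_epochs = 10
base_lr = 5e-5          # assumed default learning rate

steps_per_epoch = math.ceil(num_lines / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step):
    # linear decay from base_lr at step 0 to zero at total_steps
    return base_lr * (total_steps - step) / total_steps

print(steps_per_epoch, total_steps)
for step in (500, 1000):
    print(step, step / steps_per_epoch, linear_lr(step))
```

Under these assumptions there are 126 steps per epoch and 1,260 steps in total, and the computed values at steps 500 and 1000 agree with the `learning_rate` and `epoch` fields in the training log.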


In [19]:
trainer.save_model("albert_fine-tuned") # save model

Test the model on a "fill-in-the-blank" task: take a sentence, mask a word, and predict a completion with the pipelines API.

In [20]:
# Define fill-in-the-blanks pipeline
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="albert_fine-tuned",
    tokenizer=tokenizer
)
print(fill_mask("The author fails to [MASK] the plot."))

[{'sequence': '[CLS] the author fails to describe the plot.[SEP]', 'score': 0.059310346841812134, 'token': 4996}, {'sequence': '[CLS] the author fails to clarify the plot.[SEP]', 'score': 0.03886289522051811, 'token': 23116}, {'sequence': '[CLS] the author fails to correct the plot.[SEP]', 'score': 0.035548847168684006, 'token': 4456}, {'sequence': '[CLS] the author fails to explain the plot.[SEP]', 'score': 0.03406033292412758, 'token': 3271}, {'sequence': '[CLS] the author fails to satisfy the plot.[SEP]', 'score': 0.029976123943924904, 'token': 14711}]


Compare with the original pretrained model.

In [21]:
# Define fill-in-the-blanks pipeline
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="albert-base-v2",
    tokenizer=tokenizer
)
print(fill_mask("The author fails to [MASK] the plot."))

[{'sequence': '[CLS] the author fails to satisfy the plot.[SEP]', 'score': 0.11895451694726944, 'token': 14711}, {'sequence': '[CLS] the author fails to understand the plot.[SEP]', 'score': 0.08055300265550613, 'token': 1369}, {'sequence': '[CLS] the author fails to explain the plot.[SEP]', 'score': 0.04448580741882324, 'token': 3271}, {'sequence': '[CLS] the author fails to execute the plot.[SEP]', 'score': 0.03402300551533699, 'token': 15644}, {'sequence': '[CLS] the author fails to realize the plot.[SEP]', 'score': 0.021951846778392792, 'token': 4007}]
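
One simple way to compare the two sets of predictions is to look at which completions they share and how the scores moved. The sketch below does this with the five completions printed by each model, with scores rounded to four decimals. The dictionary names are illustrative, and a single sentence is of course only anecdotal evidence.

```python
# top-5 completions for "The author fails to [MASK] the plot." (scores rounded)
fine_tuned = {"describe": 0.0593, "clarify": 0.0389, "correct": 0.0355,
              "explain": 0.0341, "satisfy": 0.0300}
pretrained = {"satisfy": 0.1190, "understand": 0.0806, "explain": 0.0445,
              "execute": 0.0340, "realize": 0.0220}

shared = sorted(set(fine_tuned) & set(pretrained))
only_fine_tuned = sorted(set(fine_tuned) - set(pretrained))

print("shared:", shared)
print("new after fine-tuning:", only_fine_tuned)
for word in shared:
    print(word, pretrained[word], "->", fine_tuned[word])
```

For this sentence, "explain" and "satisfy" appear in both lists, while "describe", "clarify" and "correct" only show up after fine-tuning on the book reviews.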
