# Lab 1 Text Processing Tools

## NLTK

Download the libraries and check that they are working:
https://github.com/hb20007/hands-on-nltk-tutorial

In [1]:
# install nltk
# ! pip install nltk
import nltk

### Tokenization

#### Sentence Tokenizer

In [2]:
# download punkt package
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/kaishuai/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
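# note: the line-continuation backslashes after "Hanlon's" and "floor." have no space before them,
# so the adjacent words are joined in the output below (for example "Hanlon'smilkman")
passage = "Mrkgnao! the cat said loudly. She blinked up out of her avid shameclosing eyes, mewing \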
plaintively and long, showing him her milkwhite teeth. He watched the dark eyeslits narrowing \
with greed till her eyes were green stones. Then he went to the dresser, took the jug Hanlon's\
milkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.\
— Gurrhr! she cried, running to lap."

doc = nltk.sent_tokenize(passage)
for s in doc:
    print(">",s)

> Mrkgnao!
> the cat said loudly.
> She blinked up out of her avid shameclosing eyes, mewing plaintively and long, showing him her milkwhite teeth.
> He watched the dark eyeslits narrowing with greed till her eyes were green stones.
> Then he went to the dresser, took the jug Hanlon'smilkman had just filled for him, poured warmbubbled milk on a saucer and set it slowly on the floor.— Gurrhr!
> she cried, running to lap.


#### Word Tokenization

There are different methods for tokenizing text into words, such as:
1. TreebankWordTokenizer
2. WordPunctTokenizer
3. WhitespaceTokenizer
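
To make the difference between these tokenizers concrete, here is a minimal sketch that approximates two of them with the standard `re` module. The regular expression `\w+|[^\w\s]+` is the pattern NLTK documents for `WordPunctTokenizer`; the function names `whitespace_tokenize` and `wordpunct_tokenize` are illustrative and are not part of NLTK.

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace only; punctuation stays attached to words.
    return text.split()

def wordpunct_tokenize(text):
    # Alternate between runs of word characters and runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

sentence = "Mary had a little lamb it's fleece was white as snow."
print("WHITESPACE:", whitespace_tokenize(sentence))
print("WORDPUNCT :", wordpunct_tokenize(sentence))
```

The whitespace version keeps `it's` and `snow.` as single tokens, while the word/punctuation version splits the apostrophe and the final period into tokens of their own.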

In [4]:
from nltk import word_tokenize

sentence = "Mary had a little lamb it's fleece was white as snow."

# default tokenization (requires the 'punkt' data downloaded above)
default_tokens = word_tokenize(sentence)

# WordPunctTokenizer
punct_tokenizer = nltk.tokenize.WordPunctTokenizer()
punct_tokens = punct_tokenizer.tokenize(sentence)

print("DEFAULT: ", default_tokens)
print("PUNCT  : ", punct_tokens)

DEFAULT:  ['Mary', 'had', 'a', 'little', 'lamb', 'it', "'s", 'fleece', 'was', 'white', 'as', 'snow', '.']
PUNCT  :  ['Mary', 'had', 'a', 'little', 'lamb', 'it', "'", 's', 'fleece', 'was', 'white', 'as', 'snow', '.']


Exercise: use nltk.tokenize.SpaceTokenizer() to tokenize the above sentence.

In [5]:
# your code here
# space_tokenizer = nltk.tokenize.SpaceTokenizer()
# space_tokens = space_tokenizer.tokenize(sentence)

# print("SPACE: ", space_tokens)

### Part of Speech Tagging

In [6]:
# download the tagger data
# replace "your_download_dir" with your own download directory;
# the packages used later can be downloaded in the same way
nltk.download('averaged_perceptron_tagger', download_dir="your_download_dir")
nltk.data.path = ["your_download_dir"] + nltk.data.path

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/kaishuai/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [7]:
pos_default = nltk.pos_tag(default_tokens)
print("DEFAULT: ", pos_default)

pos_punct = nltk.pos_tag(punct_tokens)
print("PUNCT  : ", pos_punct)

DEFAULT:  [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'NN'), ('it', 'PRP'), ("'s", 'VBZ'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]
PUNCT  :  [('Mary', 'NNP'), ('had', 'VBD'), ('a', 'DT'), ('little', 'JJ'), ('lamb', 'NN'), ('it', 'PRP'), ("'", "''"), ('s', 'JJ'), ('fleece', 'NN'), ('was', 'VBD'), ('white', 'JJ'), ('as', 'IN'), ('snow', 'NN'), ('.', '.')]


Exercise: tag the part of speech of the space-separated tokens.

In [8]:
# your code here
# pos_space = nltk.pos_tag(space_tokens)

# print("SPACE: ", pos_space)

### Stemming and Lemmatization

#### Stemming

In [9]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
snowball = nltk.stem.snowball.SnowballStemmer("english")

print("PorterStemmer:", [porter.stem(t) for t in default_tokens])
print("LancasterStemmer:", [lancaster.stem(t) for t in default_tokens])
print("SnowballStemmer:", [snowball.stem(t) for t in default_tokens])

PorterStemmer: ['mari', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'wa', 'white', 'as', 'snow', '.']
LancasterStemmer: ['mary', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'was', 'whit', 'as', 'snow', '.']
SnowballStemmer: ['mari', 'had', 'a', 'littl', 'lamb', 'it', "'s", 'fleec', 'was', 'white', 'as', 'snow', '.']


In [10]:
# try another sentence
sentence2 = "When I was going into the woods I saw a bear lying asleep on the forest floor"
print("Sentence:", sentence2)

# first tokenize the sentence
tokens2 = word_tokenize(sentence2)
# then apply stemming
for stemmer in [porter, lancaster, snowball]:
    print([stemmer.stem(t) for t in tokens2])

Sentence: When I was going into the woods I saw a bear lying asleep on the forest floor
['when', 'I', 'wa', 'go', 'into', 'the', 'wood', 'I', 'saw', 'a', 'bear', 'lie', 'asleep', 'on', 'the', 'forest', 'floor']
['when', 'i', 'was', 'going', 'into', 'the', 'wood', 'i', 'saw', 'a', 'bear', 'lying', 'asleep', 'on', 'the', 'forest', 'flo']
['when', 'i', 'was', 'go', 'into', 'the', 'wood', 'i', 'saw', 'a', 'bear', 'lie', 'asleep', 'on', 'the', 'forest', 'floor']


#### Lemmatizing

In [11]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/kaishuai/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
# try the WordNet lemmatizer
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in default_tokens])

['Mary', 'had', 'a', 'little', 'lamb', 'it', "'s", 'fleece', 'wa', 'white', 'a', 'snow', '.']


In [13]:
# try another sentence
sentence2 = "When I was going into the woods I saw a bear lying asleep on the forest floor"
print("Sentence:", sentence2)

# first tokenize the sentence
tokens2 = word_tokenize(sentence2)
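# then apply lemmatization
# (the POS tags are computed but not passed to the lemmatizer, so every token is treated as a noun)
tokens2_pos = nltk.pos_tag(tokens2)
print([wnl.lemmatize(t) for t in tokens2])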

Sentence: When I was going into the woods I saw a bear lying asleep on the forest floor
['When', 'I', 'wa', 'going', 'into', 'the', 'wood', 'I', 'saw', 'a', 'bear', 'lying', 'asleep', 'on', 'the', 'forest', 'floor']
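
Note that `wnl.lemmatize(t)` treats every token as a noun by default, which is why `going` and `lying` are left unchanged and `was` becomes `wa`. One common remedy is to pass the part of speech from `nltk.pos_tag` to the lemmatizer. The sketch below shows only the tag mapping; `penn_to_wordnet` is a hypothetical helper name, and the single-letter codes are the values of `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ`, and `wordnet.ADV`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS code."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default

# A few tags of the kind that appear in the pos_tag outputs above.
for tag in ["VBD", "VBG", "NNS", "JJ"]:
    print(tag, "->", penn_to_wordnet(tag))

# Assumed usage with the objects defined in the cells above:
# print([wnl.lemmatize(t, pos=penn_to_wordnet(tag)) for t, tag in tokens2_pos])
```

With this mapping, verb forms such as `was` and `going` are looked up as verbs instead of nouns.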


## spaCy

Download spaCy according to the official website: https://spacy.io/usage#quickstart

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [15]:
# try analyzing the sentence
# all the information is available in the token object
doc = nlp("When I was going into the woods I saw a bear lying asleep on the forest floor?")

print("text\tlemma\tpos\ttag\tdep\tshape\tis_alpha\tis_stop")
for token in doc:
    print(token.text, "\t", token.lemma_, "\t", token.pos_, "\t", token.tag_, "\t", token.dep_, "\t", 
            token.shape_, "\t", token.is_alpha, "\t", token.is_stop, sep='')

text	lemma	pos	tag	dep	shape	is_alpha	is_stop
When	when	ADV	WRB	advmod	Xxxx	True	True
I	-PRON-	PRON	PRP	nsubj	X	True	True
was	be	VERB	VBD	aux	xxx	True	True
going	go	VERB	VBG	advcl	xxxx	True	False
into	into	ADP	IN	prep	xxxx	True	True
the	the	DET	DT	det	xxx	True	True
woods	wood	NOUN	NNS	pobj	xxxx	True	False
I	-PRON-	PRON	PRP	nsubj	X	True	True
saw	see	VERB	VBD	ROOT	xxx	True	False
a	a	DET	DT	det	x	True	True
bear	bear	NOUN	NN	nsubj	xxxx	True	False
lying	lie	VERB	VBG	ccomp	xxxx	True	False
asleep	asleep	ADJ	JJ	dobj	xxxx	True	False
on	on	ADP	IN	prep	xx	True	True
the	the	DET	DT	det	xxx	True	True
forest	forest	NOUN	NN	compound	xxxx	True	False
floor	floor	NOUN	NN	pobj	xxxx	True	False
?	?	PUNCT	.	punct	?	False	False


## Vectorization and Document Similarity Calculation

### Gensim

Content in demofile.txt: \
Mars is the fourth planet in our solar system. \
It is second-smallest planet in the Solar System after Mercury. \
Saturn is yellow planet.

Content (query) in demofile2.txt: \
Mars is a cold desert world. It is half the size of the Earth.

Tasks: \
1. vectorize each sentence; \
2. compute the similarity between the query and each document (sentence) in demofile.txt.

In [16]:
import nltk, gensim
from nltk.tokenize import word_tokenize, sent_tokenize

gen_docs = []

# first tokenize each sentence, which is the first step in most NLP tasks
with open('data/demofile.txt') as f:
    docs = f.readlines()
    for doc in docs:
        gen_docs.append([w.lower() for w in word_tokenize(doc)])

print(gen_docs)

[['mars', 'is', 'the', 'fourth', 'planet', 'in', 'our', 'solar', 'system', '.'], ['it', 'is', 'second-smallest', 'planet', 'in', 'the', 'solar', 'system', 'after', 'mercury', '.'], ['saturn', 'is', 'yellow', 'planet', '.']]


In [17]:
# create a dictionary for all tokens in the documents
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary.token2id)

{'.': 0, 'fourth': 1, 'in': 2, 'is': 3, 'mars': 4, 'our': 5, 'planet': 6, 'solar': 7, 'system': 8, 'the': 9, 'after': 10, 'it': 11, 'mercury': 12, 'second-smallest': 13, 'saturn': 14, 'yellow': 15}


In [18]:
# create a bag-of-words corpus
corpus = []
for doc in gen_docs:
    corpus.append(dictionary.doc2bow(doc))
    print(corpus[-1])

# (0, 1) means that word id 0 appears once in the first document
# (1, 1) means that word id 1 appears once in the first document

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
[(0, 1), (2, 1), (3, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]
[(0, 1), (3, 1), (6, 1), (14, 1), (15, 1)]


### Vectorization

In [35]:
# simple binary (presence/absence) vector representation based on the bag-of-words, called one-hot here
v = gensim.matutils.corpus2csc(corpus).toarray().T
v[v > 1] = 1  # clip counts above 1 so that each entry only records presence
print("Shape:", v.shape)  # 3 documents, 16 unique words (i.e., the size of the dictionary)
print("One-hot Vector:\n", v)  # 1 if the word appears in the document, 0 otherwise

Shape: (3, 16)
One-hot Vector:
 [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1.]]
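
For intuition, the same binary vectors can be built without gensim. This is a sketch, not gensim's implementation: it assumes ids are handed out document by document to unseen tokens in sorted order, which reproduces the `token2id` mapping printed above. `build_vocab` and `binary_vector` are hypothetical names.

```python
gen_docs = [
    ['mars', 'is', 'the', 'fourth', 'planet', 'in', 'our', 'solar', 'system', '.'],
    ['it', 'is', 'second-smallest', 'planet', 'in', 'the', 'solar', 'system', 'after', 'mercury', '.'],
    ['saturn', 'is', 'yellow', 'planet', '.'],
]

def build_vocab(docs):
    # Assign the next free id to each unseen token, document by document.
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def binary_vector(doc, token2id):
    # 1.0 if the token occurs in the document, 0.0 otherwise.
    vec = [0.0] * len(token2id)
    for token in doc:
        if token in token2id:
            vec[token2id[token]] = 1.0
    return vec

token2id = build_vocab(gen_docs)
vectors = [binary_vector(doc, token2id) for doc in gen_docs]
print(len(vectors), "documents x", len(token2id), "dimensions")
for vec in vectors:
    print(vec)
```

Strictly speaking, a vector with several 1s is a binary bag-of-words (sometimes called multi-hot) rather than a one-hot vector, which has exactly one 1.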


### Document Similarity

#### Similarity based on One-Hot representation

In [20]:
# load the query and tokenize it
with open('data/demofile2.txt') as f:
    query = f.readlines()[0]
    query_doc = [w.lower() for w in word_tokenize(query)]
print('query_doc', query_doc)

# convert the query to bag-of-words
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)

query_doc ['mars', 'is', 'a', 'cold', 'desert', 'world', '.', 'it', 'is', 'half', 'the', 'size', 'of', 'the', 'earth', '.']
[(0, 2), (3, 2), (4, 1), (9, 2), (11, 1)]


In [36]:
v = gensim.matutils.corpus2csc(corpus + [query_doc_bow]).toarray().T
v[v > 1] = 1
corpus_v = v[:-1]
query_v = v[-1]
print("Shape:", query_v.shape)  # 1 document, 16 unique words (i.e., the size of the dictionary)
print("One-hot Vector:\n", query_v)  # >0 if the word appears in the document, 0 otherwise

Shape: (16,)
One-hot Vector:
 [1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]


In [37]:
# calculate the cosine similarity
import numpy as np
from numpy.linalg import norm

def cos_sim(a, b):
    # cosine similarity: dot product divided by the product of the vector norms
    return np.dot(a, b) / (norm(a) * norm(b))

# compare the query with each document in the corpus
cos_similarities = [cos_sim(query_v, d) for d in corpus_v]
print('Comparing Result:', cos_similarities)

Comparing Result: [0.5656854249492379, 0.5393598899705937, 0.3999999999999999]
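
As a check on the first value, the cosine similarity can be worked out by hand. The query vector has 5 non-zero entries, the first document has 10, and they share 4 positions (ids 0, 3, 4, and 9), so the similarity is 4 / (√5 · √10) ≈ 0.566. The pure-Python sketch below repeats that arithmetic for all three documents, with the vectors copied from the outputs above; `cos_sim_plain` is a name chosen for this illustration.

```python
import math

corpus_v = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
]
query_v = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]

def cos_sim_plain(a, b):
    # dot(a, b) / (|a| * |b|), written out without NumPy
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for d in corpus_v:
    print(round(cos_sim_plain(query_v, d), 4))
```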


#### Similarity based on TF-IDF

In [23]:
import numpy as np

# create a tf-idf model from the corpus
tf_idf = gensim.models.TfidfModel(corpus)

for doc in tf_idf[corpus]:
    print([(dictionary[id], np.around(freq, decimals=2)) for id, freq in doc])

# ('fourth', 0.53) means that the word 'fourth' has a tf-idf score of 0.53 in the first document
# ('saturn', 0.71) means that the word 'saturn' has a tf-idf score of 0.71 in the third document
# Tokens that appear in every document ('.', 'is', 'planet') have an IDF of 0, so they are omitted from the output.

[('fourth', 0.53), ('in', 0.2), ('mars', 0.53), ('our', 0.53), ('solar', 0.2), ('system', 0.2), ('the', 0.2)]
[('in', 0.17), ('solar', 0.17), ('system', 0.17), ('the', 0.17), ('after', 0.47), ('it', 0.47), ('mercury', 0.47), ('second-smallest', 0.47)]
[('saturn', 0.71), ('yellow', 0.71)]
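
These weights can be reproduced by hand. With its default settings, gensim's `TfidfModel` multiplies the raw term count by log2(N / df), where N is the number of documents and df is the number of documents containing the term, and then scales each document vector to unit length. The sketch below is a pure-Python illustration of that formula under those assumed defaults; `tfidf_weights` is a hypothetical helper, not a gensim function.

```python
import math

gen_docs = [
    ['mars', 'is', 'the', 'fourth', 'planet', 'in', 'our', 'solar', 'system', '.'],
    ['it', 'is', 'second-smallest', 'planet', 'in', 'the', 'solar', 'system', 'after', 'mercury', '.'],
    ['saturn', 'is', 'yellow', 'planet', '.'],
]

def tfidf_weights(docs):
    n_docs = len(docs)

    # Document frequency: in how many documents does each token occur?
    df = {}
    for doc in docs:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1

    weighted_docs = []
    for doc in docs:
        # Adding the IDF once per occurrence gives term count * IDF.
        weights = {}
        for token in doc:
            weights[token] = weights.get(token, 0.0) + math.log2(n_docs / df[token])
        # Drop zero weights, then scale the vector to unit (L2) length.
        weights = {t: w for t, w in weights.items() if w > 0.0}
        length = math.sqrt(sum(w * w for w in weights.values()))
        weighted_docs.append({t: w / length for t, w in weights.items()})
    return weighted_docs

for weights in tfidf_weights(gen_docs):
    print(sorted((t, round(w, 2)) for t, w in weights.items()))
```

For the third document only `saturn` and `yellow` survive, each with the same raw weight, so after normalization both equal 1/√2 ≈ 0.71.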


In [24]:
# building the index
sims = gensim.similarities.Similarity('gensim', tf_idf[corpus], num_features=len(dictionary))
print(sims)

Similarity index with 3 documents in 0 shards (stored under gensim)


In [25]:
# load the query and tokenize it
with open ('data/demofile2.txt') as f:
    query = f.readlines()[0]
    query_doc = [w.lower() for w in word_tokenize(query)]
print('query_doc',query_doc)

# convert the query to bag-of-words
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)

query_doc ['mars', 'is', 'a', 'cold', 'desert', 'world', '.', 'it', 'is', 'half', 'the', 'size', 'of', 'the', 'earth', '.']
[(0, 2), (3, 2), (4, 1), (9, 2), (11, 1)]


In [26]:
# perform a similarity query against the corpus
query_doc_tf_idf = tf_idf[query_doc_bow]

# print document similarity
print('Comparing Result:', sims[query_doc_tf_idf])

Comparing Result: [0.42364514 0.37414625 0.        ]


## Classification using word2vec

In [27]:
import pandas as pd

# load the training data
train_df = pd.read_csv('data/BBC_News_Train.csv', header=0)
print('BBC News Training DataFrame:\n', train_df)

BBC News Training DataFrame:
      ArticleId                                               Text  Category
0         1833  worldcom ex-boss launches defence lawyers defe...  business
1          154  german business confidence slides german busin...  business
2         1101  bbc poll indicates economic gloom citizens in ...  business
3         1976  lifestyle  governs mobile choice  faster  bett...      tech
4          917  enron bosses in $168m payout eighteen former e...  business
..         ...                                                ...       ...
994       1647  van nistelrooy set to return manchester united...     sport
995       1827  yelling takes cardiff hat-trick european cross...     sport
996       2187  battered dollar hits another low the dollar ha...  business
997       1897  turkey turns on the economic charm three years...  business
998       1795  robinson ready for difficult task england coac...     sport

[999 rows x 3 columns]


In [28]:
# tokenize the news
news_docs = []
for sent in train_df.Text.tolist():
    news_docs.append([w.lower() for w in word_tokenize(sent)])

# get labels (here, we simplify the category into two classes: business and not-business)
label_text = train_df.Category.tolist()
labels = [1 if label == 'business' else 0 for label in label_text]
print('Labels:', labels)

Labels: [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0

In [29]:
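# train a word2vec model on all news documents. This may take a moment.
# note: these keyword names are from gensim 3.x; gensim 4.0 renamed `size` to `vector_size` and `iter` to `epochs`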
word2vec = gensim.models.Word2Vec(news_docs,
                        size = 300,
                        window = 8,
                        min_count = 5,
                        sg = 0,
                        alpha = 0.025,
                        iter=10,
                        batch_words = 10000)

In [30]:
# create vector for each sentence based on word2vec
data = []

# simply add the word vectors in the sentence as the sentence vector
for i in range(len(news_docs)):
    tmp = np.zeros([300])
    for w in news_docs[i]:
        if w in word2vec.wv:
            tmp = tmp + word2vec.wv[w]

    data.append(tmp)
data = np.array(data)
print('Shape:', data.shape)

Shape: (999, 300)
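
Summing makes the length of a sentence vector grow with the number of words in the article. A common alternative, shown here as a sketch with made-up 3-dimensional vectors rather than the trained model, is to average the vectors of the in-vocabulary words. The `toy_vectors` table and the `mean_vector` helper are hypothetical stand-ins for `word2vec.wv` and the loop above.

```python
# Invented vectors for illustration only; real word2vec vectors have 300 dimensions here.
toy_vectors = {
    "dollar": [0.9, 0.1, 0.0],
    "hits":   [0.2, 0.4, 0.4],
    "low":    [0.7, 0.3, 0.2],
}

def mean_vector(tokens, vectors, dim=3):
    # Average the vectors of known tokens; skip out-of-vocabulary words.
    total = [0.0] * dim
    count = 0
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
            count += 1
    if count == 0:
        return total  # all-zero vector when no token is known
    return [t / count for t in total]

# "another" is not in the toy vocabulary, so it is ignored.
print(mean_vector(["dollar", "hits", "another", "low"], toy_vectors))
```

Whether the sum or the mean works better for a given classifier is an empirical question; tree-based models such as XGBoost are less sensitive to feature scale than linear models.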


In [31]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

### Train a Binary classifier using XGBoost

In [32]:
# build an XGBoost model for classification
# ! pip install xgboost
from xgboost import XGBClassifier

model_XGB = XGBClassifier()
model_XGB.fit(X_train, y_train)

In [33]:
# evaluate the model: precision, recall, and F1 at a fixed threshold, plus ROC AUC
from sklearn.metrics import classification_report, roc_auc_score

y_prob = model_XGB.predict_proba(X_test)[:,1]

# y_prob is the predicted probability of class 1 (business)
# class predictions are obtained by thresholding these probabilities;
# here the threshold is 0.5: predict 1 if the probability is greater than 0.5, otherwise 0

# precision, recall, and F1 score at threshold = 0.5
y_pred = np.where(y_prob > 0.5, 1, 0)
auc_roc = roc_auc_score(y_test, y_prob)

target_names = ['Not Business', 'Business']
print(classification_report(y_test, y_pred, target_names=target_names))
print("AUC =", auc_roc)

              precision    recall  f1-score   support

Not Business       0.95      0.98      0.96       149
    Business       0.93      0.84      0.89        51

    accuracy                           0.94       200
   macro avg       0.94      0.91      0.93       200
weighted avg       0.94      0.94      0.94       200

AUC = 0.98499802605606
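
To see how the threshold turns probabilities into the precision and recall figures in the report, the sketch below computes them by hand for a small made-up example. The labels and probabilities are invented for illustration and are not taken from the BBC data; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    # Turn probabilities into class predictions with the given threshold.
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented labels and predicted probabilities for class 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.3, 0.7]

for threshold in (0.5, 0.35):
    print(threshold, precision_recall_f1(y_true, y_prob, threshold))
```

In this toy example, lowering the threshold from 0.5 to 0.35 recovers the positive that was missed at 0.4, so recall rises; in general a lower threshold trades precision for recall.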


Practice:
1. Try constructing other models, such as a linear regression model, tree-based models, an SVM, RNN models, or a simple MLP.
2. Try other preprocessing operations, such as removing stopwords, punctuation, or numbers.

## Classification using BERT

### Tokenization

In [1]:
import torch
from transformers import BertTokenizer

In [2]:
# BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
vocab = tokenizer.get_vocab() # a dictionary

# special tokens and ids
print(vocab['[CLS]'])
print(vocab['[SEP]'])
print(vocab['[PAD]'])

101
102
0


In [3]:
# tokenize a sentence ([CLS] is written out as text here and is recognized as a special token)
sent1 = "[CLS] He remains characteristically confident and optimistic."
encoding1 = tokenizer.tokenize(sent1)
sent_ids1 = tokenizer.convert_tokens_to_ids(encoding1)
print('Sentence tokenization:', encoding1)
print('Sentence ids:',sent_ids1)

Sentence tokenization: ['[CLS]', 'he', 'remains', 'characteristic', '##ally', 'confident', 'and', 'optimistic', '.']
Sentence ids: [101, 2002, 3464, 8281, 3973, 9657, 1998, 21931, 1012]


In [4]:
# tokenize a second text that contains [SEP] and [PAD] written out as text
sent2 = "He remains characteristically confident and optimistic. \
        [SEP] Therefore, I believe he will finally achieve success. [PAD]"
encoding2 = tokenizer.tokenize(sent2)
sent_ids2 = tokenizer.convert_tokens_to_ids(encoding2)
print('Sentence tokenization:', encoding2)
print('Sentence ids:',sent_ids2)

Sentence tokenization: ['he', 'remains', 'characteristic', '##ally', 'confident', 'and', 'optimistic', '.', '[SEP]', 'therefore', ',', 'i', 'believe', 'he', 'will', 'finally', 'achieve', 'success', '.', '[PAD]']
Sentence ids: [2002, 3464, 8281, 3973, 9657, 1998, 21931, 1012, 102, 3568, 1010, 1045, 2903, 2002, 2097, 2633, 6162, 3112, 1012, 0]


In [5]:
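# encode a pair of sentences
# the tokenizer automatically adds [CLS] and [SEP] tokens; because sent1 and sent2 already contain
# special tokens written out as text, the output below shows a doubled [CLS] (101, 101) and extra [SEP]/[PAD] ids
# token_type_ids distinguish the two sentences (0 for the first, 1 for the second); each value is mapped to a segment embedding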
sent_encoding = tokenizer.encode_plus(sent1, sent2)
print(sent_encoding)

{'input_ids': [101, 101, 2002, 3464, 8281, 3973, 9657, 1998, 21931, 1012, 102, 2002, 3464, 8281, 3973, 9657, 1998, 21931, 1012, 102, 3568, 1010, 1045, 2903, 2002, 2097, 2633, 6162, 3112, 1012, 0, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
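
The layout of this output follows a fixed pattern: `[CLS] A [SEP] B [SEP]`, with `token_type_ids` set to 0 for the first segment (including `[CLS]` and the first `[SEP]`) and 1 for the second. The sketch below rebuilds that layout by hand for two short id lists, using the special-token ids printed earlier (101, 102, 0). `build_pair_inputs` is a hypothetical helper, and the optional padding step is an illustration of how `attention_mask` marks padded positions.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def build_pair_inputs(ids_a, ids_b, max_length=None):
    # [CLS] A [SEP] B [SEP]
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Every real token is attended to.
    attention_mask = [1] * len(input_ids)

    # Optional right-padding up to max_length; padded positions get mask 0.
    if max_length is not None and len(input_ids) < max_length:
        pad = max_length - len(input_ids)
        input_ids = input_ids + [PAD_ID] * pad
        token_type_ids = token_type_ids + [0] * pad
        attention_mask = attention_mask + [0] * pad

    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

ids_a = [2002, 3464]        # 'he', 'remains'
ids_b = [3568, 1010, 1045]  # 'therefore', ',', 'i'
encoded = build_pair_inputs(ids_a, ids_b, max_length=10)
for key, value in encoded.items():
    print(key, value)
```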


### Configuration & Model

In [6]:
from transformers import BertConfig, BertModel

# build a randomly initialized model from the default configuration and print its parameters
configuration = BertConfig()
model = BertModel(configuration)
configuration = model.config
print(configuration)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [7]:
# load the pre-trained model 
# (you can also download the model from the link: https://huggingface.co/google-bert/bert-base-uncased, 
# and then load it from the local directory)
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("He remains characteristically confident and optimistic.",
                   return_tensors="pt")
print('Inputs:', inputs)

# the input is a dictionary with keys: input_ids, token_type_ids, attention_mask
outputs = model(**inputs)
last_hidden_states = outputs[0]  # one 768-dimensional vector per input token
print('Last hidden states:', last_hidden_states)
print('Last hidden states size:',last_hidden_states.size())

Inputs: {'input_ids': tensor([[  101,  2002,  3464,  8281,  3973,  9657,  1998, 21931,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Last hidden states: tensor([[[-0.3684, -0.0749, -0.3237,  ..., -0.0464,  0.3429,  0.2990],
         [-0.3090, -0.2400, -0.3401,  ..., -0.0900,  0.8401, -0.6425],
         [-0.0244,  0.1629, -0.3323,  ..., -0.1590, -0.3468,  0.1759],
         ...,
         [-0.2457,  0.1194,  0.1172,  ...,  0.0415, -0.2971, -0.5022],
         [ 0.6263, -0.0320, -0.3789,  ...,  0.2608, -0.4490, -0.3570],
         [ 0.4926,  0.0245, -0.1460,  ...,  0.3488, -0.5853, -0.4121]]],
       grad_fn=<NativeLayerNormBackward0>)
Last hidden states size: torch.Size([1, 10, 768])


### Train a Multi-Class Sentence Classification Model

BBC_News_Train.csv contains the data for news classification.

In [8]:
import pandas as pd

# load the training data
train_df = pd.read_csv('data/BBC_News_Train.csv', header=0)
print('BBC News Training DataFrame:\n', train_df)

BBC News Training DataFrame:
      ArticleId                                               Text  Category
0         1833  worldcom ex-boss launches defence lawyers defe...  business
1          154  german business confidence slides german busin...  business
2         1101  bbc poll indicates economic gloom citizens in ...  business
3         1976  lifestyle  governs mobile choice  faster  bett...      tech
4          917  enron bosses in $168m payout eighteen former e...  business
..         ...                                                ...       ...
994       1647  van nistelrooy set to return manchester united...     sport
995       1827  yelling takes cardiff hat-trick european cross...     sport
996       2187  battered dollar hits another low the dollar ha...  business
997       1897  turkey turns on the economic charm three years...  business
998       1795  robinson ready for difficult task england coac...     sport

[999 rows x 3 columns]


In [9]:
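# get labels (first 50 articles only, matching the 50 texts encoded below)
# note: the iteration order of a set of strings can change between runs, so the label ids may differ
# from the output shown; enumerate sorted(label_type) instead for a stable mapping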
label_text = train_df.Category.tolist()[:50]
label_type = set(label_text)
label2id = {label: id for id, label in enumerate(label_type)}
labels = [label2id[label] for label in label_text]
labels = torch.tensor(labels)
print('Labels:', labels)

Labels: tensor([0, 0, 0, 3, 0, 4, 1, 2, 0, 2, 4, 2, 0, 0, 1, 1, 2, 1, 1, 3, 1, 2, 1, 1,
        3, 0, 3, 1, 4, 0, 3, 2, 0, 4, 1, 2, 1, 1, 1, 3, 3, 1, 1, 3, 2, 2, 3, 3,
        3, 0])


In [10]:
input_ids, attention_masks = [], []

# 50 sentences for example, you can try more in your experiments
for sent in train_df.Text[:50]:
    encoded_dict = tokenizer.encode_plus(
                        ' '.join(sent.split()[:100]),    # Text to encode (first 100 words).
                        add_special_tokens = True,      # Add '[CLS]' and '[SEP]'.
                        max_length = 102,               # Fixed length so the inputs can be stacked into a matrix.
                        padding = 'max_length',         # Pad shorter inputs up to max_length.
                        truncation = True,              # Truncate longer inputs down to max_length.
                        return_attention_mask = True,   # Construct attention masks.
                        return_tensors = 'pt',          # Return PyTorch tensors.
                   )
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
print('Original: ', ' '.join(train_df.Text[0].split()[:60]))
print('Token ids:', input_ids[0])


Token ids: tensor([  101,  2088,  9006,  4654,  1011,  5795, 18989,  4721,  9559,  6984,
         2280,  2088,  9006,  2708, 15941,  1041, 29325,  2015,  2114,  1037,
         6046,  1997,  9861,  5571,  2031,  2170,  1037,  2194, 13300, 16558,
        25114,  2004,  2037,  2034,  7409,  1012, 15809,  6201,  2088,  9006,
         1055,  4654,  1011,  2132,  1997,  4722,  9529, 22333,  5501,  2000,
        12052,  9529,  6078,  2012,  1996,  2149, 18126,  2015,  5016,  1999,
         2526,  1012,  2014, 16234,  2419,  2000,  1996,  7859,  1997,  1996,
         3813,  2206,  1996,  5456,  1997,  2019,  1002,  2340, 24700,  1006,
        27813,  1012,  1021, 24700,  1007,  9529,  9861,  1012,  2720,  1041,
        29325,  2015,  2038, 12254,  2025,  5905,  2000,  5571,  1997,  9861,
         1998,   102])




In [11]:
from torch.utils.data import TensorDataset, random_split
dataset = TensorDataset(input_ids, attention_masks, labels)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset into training and validation by randomly selecting samples
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

   45 training samples
    5 validation samples


In [12]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
batch_size = 4

# create the dataloader to load the data automatically in batches
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially (the original order).
            batch_size = batch_size # Evaluate with this batch size.
        )

In [13]:
from transformers import BertForSequenceClassification

# BertForSequenceClassification adds a classification layer on top of BERT and supports multi-class classification
# for binary classification, the same class can be used with num_labels = 2

cls_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = len(label_type), # The number of output labels.
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# use GPU
cls_model.cuda()
params = list(cls_model.named_parameters())
print('The BERT model has {:} different named parameters.\n'.format(len(params)))
print('==== Embedding Layer ====\n')
for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== First Transformer ====\n')
for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== Output Layer ====\n')
for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (

In [14]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation

# set the optimizer
optimizer = AdamW(cls_model.parameters(),
                  lr = 2e-5,          # learning rate
                  eps = 1e-8,         # Adam epsilon
                  weight_decay = 0.0  # matches the default of the deprecated transformers.AdamW
                )

from transformers import get_linear_schedule_with_warmup

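epochs = 5  # must match the number of epochs run by the training loop, otherwise the learning rate reaches 0 too early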
total_steps = len(train_dataloader) * epochs

# create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,  # you can change this value for your experiments
                                            num_training_steps = total_steps)
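
`get_linear_schedule_with_warmup` scales the base learning rate by a factor that rises linearly from 0 to 1 over the warmup steps and then falls linearly to 0 at `num_training_steps`. The sketch below writes that factor out in plain Python so the shape of the schedule is easy to inspect. `linear_schedule_factor` is a hypothetical name, and the step counts are illustrative: 12 batches per epoch (45 training samples at a batch size of 4) and 5 epochs, the number the training loop later in this lab runs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    # Warmup phase: factor grows linearly from 0 to 1.
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    # Decay phase: factor shrinks linearly and is clamped at 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
batches_per_epoch = 12   # illustrative: ceil(45 / 4)
epochs = 5               # illustrative: should equal the epochs actually trained
total_steps = batches_per_epoch * epochs

for step in range(0, total_steps + 1, batches_per_epoch):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(step, base_lr * factor)
```

Because the factor reaches 0 at `num_training_steps`, any optimizer steps taken beyond that count use a learning rate of 0, so `total_steps` should be computed from the same number of epochs that the training loop actually runs.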



The following two blocks define two helper functions.

In [15]:
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [16]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [18]:
import random
import numpy as np
from tqdm import tqdm
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
epochs = 5
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    
    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    cls_model.train()
    
    # For each batch of training data...
    for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        # Progress update every 4 batches.
        if step % 4 == 0 and step != 0:
            # Calculate the elapsed time, formatted as h:mm:ss.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
        # Unpack the batch (input ids, attention masks, labels) and copy
        # each tensor to the GPU.
        b_input_ids = batch[0].to('cuda')
        b_input_mask = batch[1].to('cuda')
        b_labels = batch[2].to('cuda')
        
        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        cls_model.zero_grad()   
        outputs = cls_model(b_input_ids, 
                             token_type_ids=None, 
                             attention_mask=b_input_mask, 
                             labels=b_labels)
        loss = outputs.loss
        logits = outputs.logits

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss = total_train_loss + loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(cls_model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()
    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")
    
    
    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    cls_model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0
    
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to('cuda')
        b_input_mask = batch[1].to('cuda')
        b_labels = batch[2].to('cuda')
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.

            outputs = cls_model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask,
                            labels=b_labels
                           )
            loss, logits = outputs.loss, outputs.logits
            print(loss)
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.cpu().numpy()

        # Logits are unnormalized scores, not probabilities; applying a
        # softmax would turn them into class probabilities.
        # Multi-class predictions are obtained by taking the argmax of the logits.
        # Note: getting predictions for binary classification can differ from
        # the multi-class case (e.g., thresholding a single logit instead of argmax).

        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...


 25%|██▌       | 3/12 [00:00<00:00, 22.95it/s]

  Batch     4  of     12.    Elapsed: 0:00:00.


 75%|███████▌  | 9/12 [00:00<00:00, 26.78it/s]

  Batch     8  of     12.    Elapsed: 0:00:00.


100%|██████████| 12/12 [00:00<00:00, 26.84it/s]



  Average training loss: 1.07
  Training epcoh took: 0:00:00

Running Validation...
tensor(1.2061, device='cuda:0')
tensor(1.1778, device='cuda:0')
  Accuracy: 0.88
  Validation Loss: 1.19
  Validation took: 0:00:00

Training...


 25%|██▌       | 3/12 [00:00<00:00, 28.11it/s]

  Batch     4  of     12.    Elapsed: 0:00:00.


 75%|███████▌  | 9/12 [00:00<00:00, 28.62it/s]

  Batch     8  of     12.    Elapsed: 0:00:00.


100%|██████████| 12/12 [00:00<00:00, 28.41it/s]



  Average training loss: 1.19
  Training epcoh took: 0:00:00

Running Validation...
tensor(1.2061, device='cuda:0')
tensor(1.1778, device='cuda:0')
  Accuracy: 0.88
  Validation Loss: 1.19
  Validation took: 0:00:00

Training...


 33%|███▎      | 4/12 [00:00<00:00, 31.06it/s]

  Batch     4  of     12.    Elapsed: 0:00:00.


 67%|██████▋   | 8/12 [00:00<00:00, 28.78it/s]

  Batch     8  of     12.    Elapsed: 0:00:00.


100%|██████████| 12/12 [00:00<00:00, 28.19it/s]



  Average training loss: 1.11
  Training epcoh took: 0:00:00

Running Validation...
tensor(1.2061, device='cuda:0')
tensor(1.1778, device='cuda:0')
  Accuracy: 0.88
  Validation Loss: 1.19
  Validation took: 0:00:00

Training...


 33%|███▎      | 4/12 [00:00<00:00, 31.03it/s]

  Batch     4  of     12.    Elapsed: 0:00:00.


 67%|██████▋   | 8/12 [00:00<00:00, 30.59it/s]

  Batch     8  of     12.    Elapsed: 0:00:00.


100%|██████████| 12/12 [00:00<00:00, 30.45it/s]



  Average training loss: 1.14
  Training epcoh took: 0:00:00

Running Validation...
tensor(1.2061, device='cuda:0')
tensor(1.1778, device='cuda:0')
  Accuracy: 0.88
  Validation Loss: 1.19
  Validation took: 0:00:00

Training...


 33%|███▎      | 4/12 [00:00<00:00, 31.83it/s]

  Batch     4  of     12.    Elapsed: 0:00:00.


 67%|██████▋   | 8/12 [00:00<00:00, 30.92it/s]

  Batch     8  of     12.    Elapsed: 0:00:00.


100%|██████████| 12/12 [00:00<00:00, 30.70it/s]


  Average training loss: 1.13
  Training epcoh took: 0:00:00

Running Validation...
tensor(1.2061, device='cuda:0')
tensor(1.1778, device='cuda:0')
  Accuracy: 0.88
  Validation Loss: 1.19
  Validation took: 0:00:00

Training complete!
Total training took 0:00:02 (h:mm:ss)
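
The training loop above clips the gradient norm to 1.0 before each optimizer step. As a rough illustration of what norm clipping does, the plain-Python sketch below rescales a list of gradient values so that their overall L2 norm does not exceed `max_norm`. The helper `clip_by_global_norm` is hypothetical and works on a flat list of floats; `torch.nn.utils.clip_grad_norm_` applies the same idea in place to the model's parameter gradients.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Return (clipped_grads, total_norm) for a flat list of gradient values."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever shrink the gradients: the scale factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Large gradients (norm 5.0) are scaled down to norm ~1.0.
clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)                            # 5.0
print([round(g, 3) for g in clipped])  # [0.6, 0.8]

# Gradients already within the limit are left unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)                           # [0.3, 0.4]
```

Clipping keeps the direction of the gradient and only limits its length, which is why it helps against exploding gradients without changing which way the parameters move.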





### Multiple Choice

Another BERT-based example, this time for multiple choice.

In [20]:
from transformers import BertForMultipleChoice

MC_model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."
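# choice0 is the correct answer; unsqueeze(0) adds the batch dimension
labels = torch.tensor(0).unsqueeze(0)
# Pair the prompt with each choice; padding=True pads both pairs to the same length
encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
# The model expects inputs of shape (batch_size, num_choices, seq_len), so add a batch dimension
outputs = MC_model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)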
loss = outputs.loss
logits = outputs.logits
print('Loss:', loss)
print('Labels:', labels)
print('Logits:', logits)

Loss: tensor(0.6706, grad_fn=<NllLossBackward0>)
Labels: tensor([0])
Logits: tensor([[0.4332, 0.3876]], grad_fn=<ViewBackward0>)
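
As a sanity check on the output above, the reported loss can be reproduced by hand: the multiple-choice loss is the cross-entropy between the softmax over the per-choice logits and the index of the correct choice. The sketch below is plain Python, uses the logits as printed above (rounded to four decimals), and the helper name `cross_entropy` is illustrative.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of a single example: -log(softmax(logits)[label])."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

logits = [0.4332, 0.3876]  # one score per choice, copied from the output above
label = 0                  # index of the correct choice

loss = cross_entropy(logits, label)
print(round(loss, 4))  # 0.6706
```

The value is close to ln(2) ≈ 0.693, the loss of a uniform guess between two choices, which is what one would expect from a classifier head that has not been trained yet.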


Practice:
1. Try this out with SWAG_MC_Train.csv;