<a href="https://colab.research.google.com/github/ceying/MGMT962Tutorial/blob/main/Tutorial3_TextAnalytics_DocumentClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face Transformers with pretrained BERT embeddings

In [64]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# install libraries
!pip install transformers
!pip install torch

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=58706009c64

## The steps are as follows:
1.   Create the corpora
2.   Check the sentiment of each document
3.   Tokenize and vectorize each document with pretrained embeddings (BERT)
4.   Assess the similarity between the documents




### Step 1: Create the Corpora

In [29]:
# define the corpora: here we treat each sentence as a document
batch_sentences =["Star light, star bright",
                  "The first star I see tonight",
                  "I wish I may, I wish I might",
                  "Have the wish I wish tonight"]

### Step 2: Check the sentiment of each document

In [30]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier(batch_sentences)

[{'label': 'POSITIVE', 'score': 0.9998798966407776},
 {'label': 'POSITIVE', 'score': 0.9996365308761597},
 {'label': 'NEGATIVE', 'score': 0.9938480257987976},
 {'label': 'NEGATIVE', 'score': 0.6100978255271912}]

Interesting: "I wish I may, I wish I might" is labeled NEGATIVE with a score of about 0.99, while "Have the wish I wish tonight" is only weakly negative (about 0.61).
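To read the classifier output alongside the text, it can help to pair each sentence with its result. The sketch below assumes the results come back in the same order as `batch_sentences`; the scores are copied from the output above and rounded to four decimals, and `signed_score` is a hypothetical helper for illustration, not part of the `transformers` library.

```python
# Sentences and classifier results copied from the cells above (scores rounded).
batch_sentences = ["Star light, star bright",
                   "The first star I see tonight",
                   "I wish I may, I wish I might",
                   "Have the wish I wish tonight"]
results = [{'label': 'POSITIVE', 'score': 0.9999},
           {'label': 'POSITIVE', 'score': 0.9996},
           {'label': 'NEGATIVE', 'score': 0.9938},
           {'label': 'NEGATIVE', 'score': 0.6101}]

def signed_score(result):
    """Hypothetical helper: put the score on a [-1, 1] scale, negative for NEGATIVE."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']

# Print one line per document: signed score first, then the text.
for sentence, result in zip(batch_sentences, results):
    print(f"{signed_score(result):+.4f}  {sentence}")
```

A signed scale like this makes it easier to see that the fourth sentence sits much closer to neutral than the third.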

### Step 3: Tokenize and Vectorize each document

In [33]:
# download the pretrained BERT tokenizer and model
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


In [37]:
# tokenization & vectorization - split each document into tokens and convert them into padded id sequences
inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")


In [39]:
print(inputs)

{'input_ids': tensor([[ 101, 2732, 2422, 1010, 2732, 4408,  102,    0,    0,    0,    0],
        [ 101, 1996, 2034, 2732, 1045, 2156, 3892,  102,    0,    0,    0],
        [ 101, 1045, 4299, 1045, 2089, 1010, 1045, 4299, 1045, 2453,  102],
        [ 101, 2031, 1996, 4299, 1045, 4299, 3892,  102,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}


In [38]:
# decode the ids to confirm that every sequence was padded to the same length
for ids in inputs['input_ids']:
  print(tokenizer.decode(ids))

[CLS] star light, star bright [SEP] [PAD] [PAD] [PAD] [PAD]
[CLS] the first star i see tonight [SEP] [PAD] [PAD] [PAD]
[CLS] i wish i may, i wish i might [SEP]
[CLS] have the wish i wish tonight [SEP] [PAD] [PAD] [PAD]
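The padding and the `attention_mask` printed earlier follow a simple rule: shorter sequences are filled with the `[PAD]` id up to the longest length, and the mask marks real tokens with 1 and padding with 0. Below is a plain-Python sketch of that rule, not the tokenizer's actual implementation; `pad_batch` is a hypothetical name, and the ids are copied from the first three rows of the printed `input_ids`.

```python
PAD_ID = 0  # id of [PAD], as seen in the input_ids printed above

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id sequence to the longest length and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Unpadded ids for the first three sentences, copied from the printed tensor.
sequences = [
    [101, 2732, 2422, 1010, 2732, 4408, 102],
    [101, 1996, 2034, 2732, 1045, 2156, 3892, 102],
    [101, 1045, 4299, 1045, 2089, 1010, 1045, 4299, 1045, 2453, 102],
]

input_ids, attention_mask = pad_batch(sequences)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The model uses the mask to ignore padded positions, which is why the mask is passed along with the ids.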


In [52]:
# create embeddings with pretrained model
outputs = model(**inputs)
print(outputs)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.1484,  0.2017, -0.1538,  ..., -0.3616,  0.1508,  0.4252],
         [ 0.4018,  0.6051,  0.2562,  ...,  0.0886,  0.8013,  0.3536],
         [ 0.4991,  0.4525,  1.0441,  ...,  0.0792,  0.6579, -0.7694],
         ...,
         [-0.2082, -0.1311,  0.3170,  ...,  0.5147,  0.1566, -0.2772],
         [-0.1246,  0.0765,  0.5740,  ...,  0.2817,  0.0549, -0.0986],
         [-0.2771, -0.3326,  0.0269,  ...,  0.5889,  0.0403, -0.3932]],

        [[-0.1637,  0.1043, -0.1241,  ..., -0.3076,  0.2827,  0.1035],
         [-0.4080, -0.3223, -0.1646,  ..., -0.3607,  0.3664,  0.0643],
         [-0.3479,  0.0284,  0.3115,  ..., -0.2890,  0.1873, -0.7466],
         ...,
         [-0.1183, -0.1137,  0.3209,  ...,  0.0545,  0.2423, -0.0424],
         [-0.3010, -0.2314,  0.2606,  ...,  0.1395,  0.2417, -0.2490],
         [-0.3598, -0.3175,  0.2953,  ...,  0.1286,  0.2519, -0.2023]],

        [[ 0.0475,  0.3088, -0.0851,  ..., -0.3279,  

In [53]:
last_hidden_state = outputs.last_hidden_state

In [63]:
last_hidden_state[0][0].shape

torch.Size([768])
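`last_hidden_state` is indexed as `[sentence][token][feature]`, so `last_hidden_state[0][0]` is the 768-dimensional vector of the `[CLS]` token of the first sentence. The sketch below uses small nested lists as a stand-in for the tensor (the toy numbers and the 2-feature width are made up for illustration) to show two ways of getting one vector per sentence: taking the `[CLS]` vector, or averaging over the non-padding tokens. Mean pooling is a common alternative and is named here only for comparison; the notebook itself works with the `[CLS]` position.

```python
# Toy stand-in for last_hidden_state: 2 sentences x 3 tokens x 2 features.
hidden = [
    [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]],   # sentence 0 (last token is padding)
    [[0.0, 1.0], [2.0, 2.0], [4.0, 0.0]],   # sentence 1 (no padding)
]
attention_mask = [[1, 1, 0],
                  [1, 1, 1]]

def cls_vector(hidden, sentence):
    """Vector of the first token ([CLS]) of the given sentence."""
    return hidden[sentence][0]

def mean_pool(hidden, attention_mask, sentence):
    """Average the token vectors of one sentence, skipping padded positions."""
    kept = [vec for vec, m in zip(hidden[sentence], attention_mask[sentence])
            if m == 1]
    width = len(kept[0])
    return [sum(vec[k] for vec in kept) / len(kept) for k in range(width)]

print(cls_vector(hidden, 1))                 # first-token vector of sentence 1
print(mean_pool(hidden, attention_mask, 0))  # average of the two real tokens
```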

### Step 4: Assess the similarity between the documents

In [55]:
import torch
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

In [74]:
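# quick look at consecutive document pairs and their similarity scores
# last_hidden_state is indexed [sentence][token], so [i][0] is the [CLS] vector of sentence i
for i in range(3):
  print((batch_sentences[i],
        batch_sentences[i+1],
        cos(last_hidden_state[i][0], last_hidden_state[i+1][0])))

The scores recorded below come from a run that indexed `last_hidden_state[0][i]` and `last_hidden_state[0][i+1]`, which are adjacent token positions within the first sentence, so the per-sentence `[CLS]` comparison above will print different values.

('Star light, star bright', 'The first star I see tonight', tensor(0.6800, grad_fn=<DivBackward0>))
('The first star I see tonight', 'I wish I may, I wish I might', tensor(0.2605, grad_fn=<DivBackward0>))
('I wish I may, I wish I might', 'Have the wish I wish tonight', tensor(0.4705, grad_fn=<DivBackward0>))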
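Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude; `eps` only guards against division by zero. Below is a plain-Python sketch of that formula. The function name `cosine_similarity` and the toy vectors are illustrative, and the code follows the textbook definition rather than reproducing the exact numerics of `torch.nn.CosineSimilarity`.

```python
import math

def cosine_similarity(u, v, eps=1e-6):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)  # eps avoids dividing by zero

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))     # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))     # orthogonal
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))   # opposite direction
```

Scores near 1 mean the two vectors point the same way, scores near 0 mean they are unrelated, and negative scores mean they point in opposite directions.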


## Tutorial Question
Now it's your turn to try it out.

In [None]:
# first, create your document by replacing the xxxx
your_doc=["xxxxxxxxx"]

In [None]:
# run this to create the sequence vector for your document
your_inputs = tokenizer(your_doc, padding=True, truncation=True, return_tensors="pt")

In [None]:
# run this to create embeddings with pretrained model
your_outputs = model(**your_inputs)
your_last_hidden_state = your_outputs.last_hidden_state

In [None]:
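# now assess the similarity between your document and one of the earlier ones by changing i and j
# (your_doc holds a single document, so i stays 0; [j][0] is the [CLS] vector of batch_sentences[j])
i, j = 0, 0
your_doc[i]
batch_sentences[j]
cos(your_last_hidden_state[i][0], last_hidden_state[j][0])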

In [77]:
# For example:
your_doc=["My dog is a super star"]

your_inputs = tokenizer(your_doc, padding=True, truncation=True, return_tensors="pt")
your_outputs = model(**your_inputs)
your_last_hidden_state = your_outputs.last_hidden_state

your_doc[0]
batch_sentences[0]
cos(your_last_hidden_state[0][0],last_hidden_state[0][0])

'My dog is a super star'

'Star light, star bright'

tensor(0.8503, grad_fn=<DivBackward0>)

Finally, screenshot your similarity score for submission.