<a id='top'></a><a name='top'></a>
# Chapter 11: Performing Sentiment Analysis on Text Data

Book: [Blueprints for Text Analysis Using Python](https://www.oreilly.com/library/view/blueprints-for-text/9781492074076/)

Repo: https://github.com/blueprints-for-text-analytics-python/blueprints-text

* [Introduction](#introduction)
* [11.0 Imports and Setup](#11.0)
* [11.1 Sentiment Analysis](#11.1)
* [11.2 Introducing the Amazon Customer Reviews Dataset](#11.2)
* [11.3 Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches](#11.3)
    - [11.3.1 Bing Liu Lexicon](#11.3.1)
    - [11.3.2 Disadvantage of a Lexicon-Based Approach](#11.3.2)
* [11.4 Supervised Learning Approaches](#11.4)
    - [11.4.1 Preparing Data for a Supervised Learning Approach](#11.4.1)
* [11.5 Blueprint: Vectorizing Text Data And Applying a Supervised Machine Learning Algorithm](#11.5)
    - [11.5.1 Step 1: Data Preparation](#11.5.1)
    - [11.5.2 Step 2: Train-Test Split](#11.5.2)
    - [11.5.3 Step 3: Text Vectorization](#11.5.3)
    - [11.5.4 Step 4: Training the Machine Learning Model](#11.5.4)
* [11.6 Pretrained Language Models Using Deep Learning](#11.6)
    - [11.6.1 Deep Learning and Transfer Learning](#11.6.1)
* [11.7 Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model](#11.7)
    - [11.7.1 Step 1: Loading Models and Tokenization](#11.7.1)
    - [11.7.2 Step 2: Model Training](#11.7.2)
    - [11.7.3 Step 3: Model Evaluation](#11.7.3)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Dataset

* reviews_5_balanced.json.gz (Amazon product reviews): [script](#reviews_5_balanced.json.gz), [source](https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/amazon-product-reviews/reviews_5_balanced.json.gz)
    - This dataset of Amazon customer reviews was scraped and compiled by researchers at Stanford University. The latest version consists of product reviews from the Amazon website between 1996 and 2018 across several categories. It includes product reviews, product ratings, and other information such as helpfulness votes and product metadata. The blueprints focus on product reviews and use only those that are one sentence long, which keeps the blueprints simple and removes the aggregation step: a review containing multiple sentences can include both positive and negative sentiment, so tagging every sentence in such a review with the same sentiment would be incorrect. Only some of the categories are used so that the data fits in memory and processing time stays short.
    

### Explore

* Multiple techniques to estimate sentiment from text snippets.
* Simple rule-based techniques for estimating sentiment.
* Language models like BERT for estimating sentiment.

---
<a name='11.0'></a><a id='11.0'></a>
# 11.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
# Start with clean project
# !rm -f *.py 
# !rm -f *.txt 
# !rm -f *.gz 
# !rm -f *.pickle
# !rm -fr outputs

In [2]:
req_file =  "requirements_11.txt"

In [3]:
%%writefile {req_file}
isort
scikit-learn
spacy
textacy
tqdm
torch
transformers
watermark

Overwriting requirements_11.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [5]:
%%writefile imports.py
import html
import locale
import pickle
import pprint
import re
import string
import warnings

import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
import textacy
import torch
from nltk import pos_tag
from nltk.corpus import opinion_lexicon, stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer, word_tokenize
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from spacy.lang.en import STOP_WORDS as stop_words
from spacy.tokens import Span, Token
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from tqdm import notebook, tqdm, trange
from transformers import (AdamW, BertConfig, BertForSequenceClassification,
                          BertTokenizer, get_linear_schedule_with_warmup)
from watermark import watermark

Overwriting imports.py


In [6]:
!isort imports.py
!cat imports.py

import html
import locale
import pickle
import pprint
import re
import string

import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
import textacy
import torch
from nltk import pos_tag
from nltk.corpus import opinion_lexicon, stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer, word_tokenize
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from spacy.lang.en import STOP_WORDS as stop_words
from spacy.tokens import Span, Token
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from tqdm import notebook, tqdm, trange
from transformers import (AdamW, BertConfig, BertForSequenceClassification,
                          BertTokenizer, get_linear_sched

In [7]:
import html
import locale
import pickle
import pprint
import re
import string
import warnings

import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
import textacy
import torch
from nltk import pos_tag
from nltk.corpus import opinion_lexicon, stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer, word_tokenize
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from spacy.lang.en import STOP_WORDS as stop_words
from spacy.tokens import Span, Token
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from tqdm import notebook, tqdm, trange
from transformers import (AdamW, BertConfig, BertForSequenceClassification,
                          BertTokenizer, get_linear_schedule_with_warmup)
from watermark import watermark

In [8]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
BASE_DIR = '.'
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)

print(watermark(iversions=True, globals_=globals(),python=True, machine=True))

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.18.1

Compiler    : Clang 14.0.0 (clang-1400.0.29.202)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

pandas : 2.2.2
re     : 2.2.1
sys    : 3.11.5 (main, Jan 16 2024, 17:25:53) [Clang 14.0.0 (clang-1400.0.29.202)]
sklearn: 1.5.2
nltk   : 3.8.1
numpy  : 1.26.0
textacy: 0.13.0
torch  : 2.2.2
seaborn: 0.13.2
tqdm   : 4.66.5
spacy  : 3.7.4



In [9]:
# Misc downloads
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('opinion_lexicon')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/gpb/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /Users/gpb/nltk_data...
[nltk_data] Downloading package omw-1.4 to /Users/gpb/nltk_data...
[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /Users/gpb/nltk_data...
[nltk_data]   Unzipping corpora/opinion_lexicon.zip.
[nltk_data] Downloading package stopwords to /Users/gpb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/gpb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

---
<a name='11.1'></a><a id='11.1'></a>
# 11.1 Sentiment Analysis
<a href="#top">[back to top]</a>

---
<a name='11.2'></a><a id='11.2'></a>
# 11.2 Introducing the Amazon Customer Reviews Dataset
<a href="#top">[back to top]</a>

<a id='reviews_5_balanced.json.gz'></a><a name='reviews_5_balanced.json.gz'></a>
### Dataset: reviews_5_balanced.json.gz
<a href="#top">[back to top]</a>

In [10]:
file = "reviews_5_balanced.json.gz"
!wget -nc -q https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master/data/amazon-product-reviews/reviews_5_balanced.json.gz
!ls -l {file}

-rw-r--r--  1 gpb  staff  16649530 11  3 19:38 reviews_5_balanced.json.gz


In [11]:
df = pd.read_json(file, lines=True)
df = df.drop(columns=['reviewTime','unixReviewTime']) ###
df = df.rename(columns={'reviewText': 'text'}) ###
df.sample(5, random_state=12)

Unnamed: 0,overall,verified,reviewerID,asin,text,summary
163807,5,False,A2A8GHFXUG1B28,B0045Z4JAI,Good Decaf... it has a good flavour for a deca...,Nice!
195640,5,True,A1VU337W6PKAR3,B00K0TIC56,I could not ask for a better system for my sma...,I could not ask for a better system for my sma...
167820,4,True,A1Z5TT1BBSDLRM,B0012ORBT6,good product at a good price and saves a trip ...,Four Stars
104268,1,False,A4PRXX2G8900X,B005SPI45U,I like the principle of a raw chip - something...,No better alternatives but still tastes bad.
51961,1,True,AYETYLNYDIS2S,B00D1HLUP8,"Fake China knockoff, you get what you pay for.",Definitely not OEM


---
<a name='11.3'></a><a id='11.3'></a>
# 11.3 Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches
<a href="#top">[back to top]</a>

<a name='11.3.1'></a><a id='11.3.1'></a>
## 11.3.1 Bing Liu Lexicon
<a href="#top">[back to top]</a>

In [12]:
print('Total number of words in opinion lexicon', len(opinion_lexicon.words()))
print('Examples of positive words in opinion lexicon',
      opinion_lexicon.positive()[:5])
print('Examples of negative words in opinion lexicon',
      opinion_lexicon.negative()[:5])

Total number of words in opinion lexicon 6789
Examples of positive words in opinion lexicon ['a+', 'abound', 'abounds', 'abundance', 'abundant']
Examples of negative words in opinion lexicon ['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable']


In [13]:
# Let's create a dictionary which we can use for scoring our review text
df.rename(columns={"reviewText": "text"}, inplace=True)
pos_score = 1
neg_score = -1
word_dict = {}

# Adding the positive words to the dictionary
for word in opinion_lexicon.positive():
    word_dict[word] = pos_score

# Adding the negative words to the dictionary
for word in opinion_lexicon.negative():
    word_dict[word] = neg_score

def bing_liu_score(text):
    sentiment_score = 0
    bag_of_words = word_tokenize(text.lower())
    # Guard against empty text, which would otherwise divide by zero
    if not bag_of_words:
        return 0.0
    for word in bag_of_words:
        if word in word_dict:
            sentiment_score += word_dict[word]
    return sentiment_score / len(bag_of_words)

In [14]:
df['Bing_Liu_Score'] = df['text'].progress_apply(bing_liu_score)
df[['asin','text','Bing_Liu_Score']].sample(2, random_state=0)

progress-bar: 100%|█████████████████████████| 294240/294240 [00:40<00:00, 7331.83it/s]


Unnamed: 0,asin,text,Bing_Liu_Score
188097,B00099QWOU,As expected,0.0
184654,B000RW1XO8,Works as designed...,0.25


In [15]:
df['Bing_Liu_Score'] = preprocessing.scale(df['Bing_Liu_Score'])
df.groupby('overall').agg({'Bing_Liu_Score':'mean'})

overall,Bing_Liu_Score
1,-0.587784
2,-0.427183
4,0.345291
5,0.529736
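
The `preprocessing.scale` call above standardizes the scores to zero mean and unit variance before they are averaged per rating. The following is a minimal pure-Python sketch of that transformation, assuming the population standard deviation is used; the function name `standardize` and the sample scores are made up for illustration.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof=0)
    if std == 0:
        # Constant input: centering alone already yields zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

raw_scores = [-0.2, 0.0, 0.1, 0.25, 0.35]  # made-up sentiment scores
z_scores = standardize(raw_scores)
print([round(z, 3) for z in z_scores])
```

After scaling, a negative value means "below the average score of the whole dataset" rather than "negative sentiment", which is why the per-rating means above straddle zero.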


<a name='11.3.2'></a><a id='11.3.2'></a>
## 11.3.2 Disadvantage of a Lexicon-Based Approach
<a href="#top">[back to top]</a>
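
One commonly cited weakness of word-level lexicon scoring is that it ignores context such as negation. The sketch below illustrates this with a tiny hypothetical lexicon (`TOY_LEXICON`) and whitespace tokenization; it is not the Bing Liu lexicon or the `bing_liu_score` function from the blueprint, only a simplified stand-in that follows the same scoring idea.

```python
# Hypothetical four-word lexicon, for illustration only
TOY_LEXICON = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def toy_lexicon_score(text):
    """Sum the polarity of known words and normalize by the token count."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    total = sum(TOY_LEXICON.get(tok.strip(".,!?"), 0) for tok in tokens)
    return total / len(tokens)

plain = toy_lexicon_score("This is good")
negated = toy_lexicon_score("This is not good")
print(plain, negated)  # both are positive: "not" carries no score
```

Both sentences receive a positive score because the word "not" is unknown to the lexicon, so the negated sentence is misread as praise.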


---
<a name='11.4'></a><a id='11.4'></a>
# 11.4 Supervised Learning Approaches
<a href="#top">[back to top]</a>

<a name='11.4.1'></a><a id='11.4.1'></a>
## 11.4.1 Preparing Data for a Supervised Learning Approach
<a href="#top">[back to top]</a>

In [16]:
pd.set_option('display.max_rows', None)  ###
pd.set_option('display.max_columns', None)  ###
pd.set_option('display.width', None)  ###
pd.set_option('display.max_colwidth', None)  ###

file = "reviews_5_balanced.json.gz"
df = pd.read_json(file, lines=True)
df = df.rename(columns={'reviewText': 'text'})  ###

# Assigning a new [1,0] target class label based on the product rating
df['sentiment'] = 0
df.loc[df['overall'] > 3, 'sentiment'] = 1
df.loc[df['overall'] < 3, 'sentiment'] = 0

# Removing unnecessary columns to keep a simple dataframe
df.drop(columns=[
    'reviewTime', 'unixReviewTime', 'overall', 'reviewerID', 'summary'],
        inplace=True)

df.sample(3)

Unnamed: 0,verified,asin,text,sentiment
236082,True,B00ZR8ACIA,great,1
41386,True,B00IKPV2JU,This switch is trash didn't even have it in a year and already left turn signal not acting right do yourself a favor and go with acdelco switch much better this is a waste of money that you don't have,0
81398,True,B00FPKICT6,not for use if you drop your phone often,0


---
<a name='11.5'></a><a id='11.5'></a>
# 11.5 Blueprint: Vectorizing Text Data And Applying a Supervised Machine Learning Algorithm
<a href="#top">[back to top]</a>

<a name='11.5.1'></a><a id='11.5.1'></a>
## 11.5.1 Step 1: Data Preparation
<a href="#top">[back to top]</a>

In [17]:
def clean(text):
    # convert html escapes like &amp; to characters.
    text = html.unescape(text) 
    # tags like <tab>
    text = re.sub(r'<[^<>]*>', ' ', text)
    # markdown URLs like [Some text](https://....)
    text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
    # text or code in brackets like [0]
    text = re.sub(r'\[[^\[\]]*\]', ' ', text)
    # standalone sequences of specials, matches &# but not #cool
    text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
    # standalone sequences of hyphens like --- or ==
    text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
    # sequences of white spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
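
To see what the "standalone sequences of specials" step does in isolation, the following sketch applies only that one pattern from `clean` to two made-up strings. The helper name `strip_standalone_specials` is hypothetical and the other cleaning steps are deliberately left out.

```python
import re

# Same pattern as the "standalone sequences of specials" step in clean()
SPECIALS = r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)'

def strip_standalone_specials(text):
    """Remove runs of special characters that stand alone between whitespace."""
    return re.sub(SPECIALS, ' ', text).strip()

removed = strip_standalone_specials("rated &# today")
kept = strip_standalone_specials("so #cool today")
print(removed)
print(kept)
```

The first string loses the free-standing `&#`, while `#cool` survives because the special character is attached to a word.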

In [18]:
df['text_orig'] = df['text'].copy()
df['text'] = df['text'].progress_apply(clean)

progress-bar: 100%|████████████████████████| 294240/294240 [00:05<00:00, 49446.81it/s]


In [19]:
# First method that performs Tokenization and Lemmatization by re-using the blueprint from Chapter 4 
# This can take longer to run due to the size of the dataset!

def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(
        doc,
        filter_stops = False,
        filter_punct = True,
        filter_nums = True,
        include_pos = ['ADJ', 'NOUN', 'VERB', 'ADV'],
        exclude_pos = None,
        min_freq = 1)
           ]

# def clean_text(text):
#     doc = nlp(text)
#     lemmas = extract_lemmas(doc)
#     return ' '.join(lemmas)

In [20]:
# Alternate method that uses Wordnet POS tags instead of spaCy - can run faster with similar accuracy
# Tokenization and Lemmatization using wordnet. Re-uses parts of blueprint from Chapter 4
# Uses wordnet POS tags instead of spaCy
# return the wordnet object value corresponding to the POS tag

def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    

def clean_text(text):
    # lower text
    text = text.lower()
    # tokenize text and remove punctuation
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove stop words
    stop = stopwords.words('english')
    text = [x for x in text if x not in stop]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    # pos tag text
    pos_tags = pos_tag(text)
    # lemmatize text
    text = [WordNetLemmatizer().lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    # remove words with only one letter
    text = [t for t in text if len(t) > 1]
    # join all
    text = " ".join(text)
    return text

In [21]:
df["text"] = df["text"].progress_apply(clean_text)

## Remove observations that are empty after the cleaning step.
df = df[df['text'].str.len() != 0]

progress-bar: 100%|█████████████████████████| 294240/294240 [04:24<00:00, 1110.84it/s]


<a name='11.5.2'></a><a id='11.5.2'></a>
## 11.5.2 Step 2: Train-Test Split
<a href="#top">[back to top]</a>

In [22]:
X_train, X_test, Y_train, Y_test = train_test_split(
    df['text'],
    df['sentiment'],
    test_size=0.2,
    random_state=42,
    stratify=df['sentiment']
)

print ('Size of Training Data ', X_train.shape[0])
print ('Size of Test Data ', X_test.shape[0])

print ('Distribution of classes in Training Data :')
print ('Positive Sentiment ', str(sum(Y_train == 1)/ len(Y_train) * 100.0))
print ('Negative Sentiment ', str(sum(Y_train == 0)/ len(Y_train) * 100.0))

print ('Distribution of classes in Testing Data :')
print ('Positive Sentiment ', str(sum(Y_test == 1)/ len(Y_test) * 100.0))
print ('Negative Sentiment ', str(sum(Y_test == 0)/ len(Y_test) * 100.0))

Size of Training Data  234108
Size of Test Data  58527
Distribution of classes in Training Data :
Positive Sentiment  50.90770071932612
Negative Sentiment  49.09229928067388
Distribution of classes in Testing Data :
Positive Sentiment  50.9081278726058
Negative Sentiment  49.09187212739419


<a name='11.5.3'></a><a id='11.5.3'></a>
## 11.5.3 Step 3: Text Vectorization
<a href="#top">[back to top]</a>

In [23]:
tfidf = TfidfVectorizer(min_df = 10, ngram_range=(1,1))
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)
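
As a rough illustration of what the vectorizer computes, here is a pure-Python sketch of TF-IDF weighting on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which match the `TfidfVectorizer` defaults as far as I know; the corpus and the helper `tfidf_vector` are made up, and the `min_df=10` filter used above is not reproduced.

```python
import math
from collections import Counter

docs = ["great product", "great price", "bad product quality"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df_counts.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(tokens)
    weights = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[1])  # "great price"
print(vec)
```

In this toy corpus "price" occurs in one document and "great" in two, so "price" receives the larger weight within the second document.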

<a name='11.5.4'></a><a id='11.5.4'></a>
## 11.5.4 Step 4: Training the Machine Learning Model
<a href="#top">[back to top]</a>

In [24]:
model1 = LinearSVC(random_state=42, tol=1e-5)
model1.fit(X_train_tf, Y_train)

In [25]:
Y_pred = model1.predict(X_test_tf)
print ('Accuracy Score - ', accuracy_score(Y_test, Y_pred))
print ('ROC-AUC Score - ', roc_auc_score(Y_test, Y_pred))

Accuracy Score -  0.8658396979172006
ROC-AUC Score -  0.8660667427476778
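
The two scores above are nearly identical, which is expected here: `Y_pred` contains hard 0/1 labels, and for hard labels the ROC-AUC reduces to the average of the true positive rate and the true negative rate. The sketch below works this through on made-up labels; passing continuous scores (for example from `model1.decision_function`) to `roc_auc_score` would presumably give a more informative AUC, though that is not what the cell above does.

```python
y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up ground truth
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]  # made-up hard predictions

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(pairs)
tpr = tp / (tp + fn)  # true positive rate (recall on the positive class)
tnr = tn / (tn + fp)  # true negative rate (recall on the negative class)
auc_from_hard_labels = (tpr + tnr) / 2

print(accuracy, auc_from_hard_labels)
```

With roughly balanced classes, as in this dataset, accuracy and this balanced average stay close to each other.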


In [26]:
sample_reviews = df.sample(5, random_state=22)
sample_reviews_tf = tfidf.transform(sample_reviews['text'])
sentiment_predictions = model1.predict(sample_reviews_tf)

sentiment_predictions = pd.DataFrame(
    data = sentiment_predictions,
    index=sample_reviews.index,
    columns=['sentiment_prediction']
)

sample_reviews = pd.concat([sample_reviews, sentiment_predictions], axis=1)
print ('Some sample reviews with their sentiment - ')
sample_reviews[['text_orig','sentiment_prediction']]

Some sample reviews with their sentiment - 


Unnamed: 0,text_orig,sentiment_prediction
29500,"Its a nice night light, but not much else apparently!",1
98387,"Way to small, do not know what to do with them or how to use them",0
113648,"Didn't make the room ""blue"" enough - returned with no questions asked",0
281527,Excellent,1
233713,fit like oem and looks good,1


In [27]:
def baseline_scorer(text):
    score = bing_liu_score(text)
    if score > 0:
        return 1
    else:
        return 0
    
Y_pred_baseline = X_test.progress_apply(baseline_scorer)
acc_score = accuracy_score(Y_test, Y_pred_baseline)
print (acc_score)

progress-bar: 100%|███████████████████████████| 58527/58527 [00:12<00:00, 4683.05it/s]

0.7525073897517385





### Saving the trained model and vectorizer for use with the API later

In [28]:
with open('./sentiment_classification.pickle', 'wb') as f:
    pickle.dump(model1, f)
with open('./sentiment_vectorizer.pickle', 'wb') as f:
    pickle.dump(tfidf, f)

---
<a name='11.6'></a><a id='11.6'></a>
# 11.6 Pretrained Language Models Using Deep Learning
<a href="#top">[back to top]</a>

<a name='11.6.1'></a><a id='11.6.1'></a>
## 11.6.1 Deep Learning and Transfer Learning
<a href="#top">[back to top]</a>


---
<a name='11.7'></a><a id='11.7'></a>
# 11.7 Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model
<a href="#top">[back to top]</a>

In [29]:
# This is an optional step to reduce the size of the data by sampling only a fraction of the observations
# (10% in the active line below; the commented-out line samples 40%)
# It is very useful to conduct a first run using a GPU (on Google Colab)
# A larger number of observations can cause longer runtime and automatic shutdown on the Colab free instance

# df = df.sample(frac=0.4, random_state=42)
df = df.sample(frac=0.1, random_state=42)

<a name='11.7.1'></a><a id='11.7.1'></a>
## 11.7.1 Step 1: Loading Models and Tokenization
<a href="#top">[back to top]</a>

In [30]:
config = BertConfig.from_pretrained('bert-base-uncased', finetuning_task='binary')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
%%time

# Pass truncation=True explicitly: otherwise the tokenizer warns that
# `max_length` was given without activating truncation, see
# https://github.com/huggingface/transformers/issues/5397
# padding='max_length' replaces the deprecated pad_to_max_length=True.

def get_tokens(text, tokenizer, max_seq_length, add_special_tokens=True):
    input_ids = tokenizer.encode(
        text, 
        add_special_tokens=add_special_tokens, 
        max_length=max_seq_length,
        truncation=True,
        padding='max_length'
    )
    # Real tokens get mask 1; [PAD] tokens (id 0) get mask 0
    attention_mask = [int(token_id > 0) for token_id in input_ids]
    assert len(input_ids) == max_seq_length
    assert len(attention_mask) == max_seq_length
    return (input_ids, attention_mask)


text = "Here is the sentence I want embeddings for."
input_ids, attention_mask = get_tokens(
    text, 
    tokenizer, 
    max_seq_length=30, 
    add_special_tokens = True
)
input_tokens = tokenizer.convert_ids_to_tokens(input_ids)

print (text)
print (input_tokens)
print (input_ids)
print (attention_mask)


Here is the sentence I want embeddings for.
['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
[101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
CPU times: user 3.91 ms, sys: 780 µs, total: 4.69 ms
Wall time: 14.6 ms
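
The relationship between the padded ids and the attention mask in the output above can be sketched without the tokenizer. The helper `pad_and_mask` below is hypothetical; it assumes `[PAD]` has id 0, as the printed ids suggest, and it truncates naively, whereas the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
PAD_ID = 0  # id of [PAD], as seen in the printed input_ids above

def pad_and_mask(token_ids, max_seq_length):
    """Pad (or naively truncate) to a fixed length and build the attention mask."""
    ids = token_ids[:max_seq_length]
    padding = [PAD_ID] * (max_seq_length - len(ids))
    mask = [1] * len(ids) + [0] * len(padding)
    return ids + padding, mask

# The 14 non-padding ids from the example sentence above
ids = [101, 2182, 2003, 1996, 6251, 1045, 2215,
       7861, 8270, 4667, 2015, 2005, 1012, 102]
padded, mask = pad_and_mask(ids, 30)
print(padded)
print(mask)
```

The mask tells the model which positions hold real tokens, so the 16 padding positions do not influence attention.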


In [32]:
%%time
X_train, X_test, Y_train, Y_test = train_test_split(
    df['text_orig'],
    df['sentiment'],
    test_size=0.2,
    random_state=42,
    stratify=df['sentiment']
)

X_train_tokens = X_train.progress_apply(get_tokens, args=(tokenizer, 50))
X_test_tokens = X_test.progress_apply(get_tokens, args=(tokenizer, 50))

progress-bar: 100%|███████████████████████████| 23411/23411 [00:16<00:00, 1425.27it/s]
progress-bar: 100%|█████████████████████████████| 5853/5853 [00:03<00:00, 1792.63it/s]

CPU times: user 14.9 s, sys: 455 ms, total: 15.3 s
Wall time: 20.3 s





In [33]:
input_ids_train = torch.tensor(
    [features[0] for features in X_train_tokens.values], dtype=torch.long)

input_mask_train = torch.tensor(
    [features[1] for features in X_train_tokens.values], dtype=torch.long)

label_ids_train = torch.tensor(Y_train.values, dtype=torch.long)

print (input_ids_train.shape)
print (input_mask_train.shape)
print (label_ids_train.shape)

torch.Size([23411, 50])
torch.Size([23411, 50])
torch.Size([23411])


In [34]:
input_ids_train[2]

tensor([ 101, 2106, 2025, 2131, 2151, 2537, 2012, 2035, 1037, 2261, 2318, 2000,
        2272, 2039, 2021, 2351,  102,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0])

In [35]:
train_dataset = TensorDataset(input_ids_train,input_mask_train,label_ids_train)

In [36]:
input_ids_test = torch.tensor([features[0] for features in X_test_tokens.values], 
                              dtype=torch.long)

input_mask_test = torch.tensor([features[1] for features in X_test_tokens.values], 
                               dtype=torch.long)

label_ids_test = torch.tensor(Y_test.values, 
                              dtype=torch.long)

test_dataset = TensorDataset(input_ids_test, input_mask_test, label_ids_test)

<a name='11.7.2'></a><a id='11.7.2'></a>
## 11.7.2 Step 2: Model Training
<a href="#top">[back to top]</a>

In [37]:
train_batch_size = 64
num_train_epochs = 2

train_sampler = RandomSampler(train_dataset)

train_dataloader = DataLoader(train_dataset, 
                              sampler=train_sampler, 
                              batch_size=train_batch_size)

t_total = len(train_dataloader) * num_train_epochs

print ("Num training examples = ", len(train_dataset))
print ("Train batch size  = ", train_batch_size)
print ("Num training steps in an epoch = ", len(train_dataloader))
print ("Num Epochs = ", num_train_epochs)
print ("Total num training steps = ", t_total)

Num training examples =  23411
Train batch size  =  64
Num training steps in an epoch =  366
Num Epochs =  2
Total num training steps =  732
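
The step counts printed above follow from simple arithmetic, sketched here with the same numbers. The sketch assumes the `DataLoader` keeps the final, smaller batch, which is its default behavior.

```python
import math

num_examples = 23411
batch_size = 64
num_epochs = 2

# The last batch is partially filled, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)  # 366
total_steps = steps_per_epoch * num_epochs              # 732
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, total_steps, last_batch_size)
```

The total step count matters because the learning-rate scheduler defined next decays over exactly `t_total` steps.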


In [38]:
learning_rate = 1e-4
adam_epsilon = 1e-8
warmup_steps = 0

optimizer = AdamW(model.parameters(), lr=learning_rate, eps=adam_epsilon)

scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=warmup_steps, 
    num_training_steps=t_total
)
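
To make the scheduler's effect concrete, the sketch below reimplements the multiplier that a linear schedule with warmup applies to the base learning rate: a linear ramp up during warmup, then a linear decay to zero. The function name `linear_schedule_multiplier` is my own, and the formula is my understanding of the library's behavior rather than a copy of its code.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-4
for step in (0, 366, 732):
    factor = linear_schedule_multiplier(step, num_warmup_steps=0,
                                        num_training_steps=732)
    print(step, base_lr * factor)
```

With `warmup_steps = 0`, as configured above, training starts at the full learning rate, reaches half of it after the first epoch, and ends at zero.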

In [None]:
%%time
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iterator = trange(num_train_epochs, desc="Epoch")

## Move the model to the device once and put it in 'train' mode
model.to(device)
model.train()
    
for epoch in train_iterator:
    epoch_iterator = notebook.tqdm(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):

        ## Reset all gradients at start of every iteration
        model.zero_grad()
        
        ## Move the input observations to the same device as the model
        batch = tuple(t.to(device) for t in batch)
        
        ## Identify the inputs to the model
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        ## Forward Pass through the model. Input -> Model -> Output
        outputs = model(**inputs)

        ## Determine the deviation (loss)
        loss = outputs[0]
        print("\r%f" % loss.item(), end='')

        ## Back-propagate the loss (automatically calculates gradients)
        loss.backward()

        ## Prevent exploding gradients by clipping the gradient norm to 1.0 
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        ## Update the parameters and learning rate
        optimizer.step()
        scheduler.step()
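
The gradient clipping step in the loop rescales all gradients together when their combined norm exceeds the limit. The following pure-Python sketch shows the idea on a flat list of made-up gradient values; `clip_by_global_norm` is a hypothetical helper, and the small `eps` term mirrors what I understand the PyTorch utility to use for numerical safety.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink; gradients that are already small stay untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is limited.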

In [40]:
model.save_pretrained('outputs')

<a name='11.7.3'></a><a id='11.7.3'></a>
## 11.7.3 Step 3: Model Evaluation
<a href="#top">[back to top]</a>

In [None]:
test_batch_size = 64

test_sampler = SequentialSampler(test_dataset)

test_dataloader = DataLoader(
    test_dataset, 
    sampler=test_sampler, 
    batch_size=test_batch_size
)

# Optionally reload the fine-tuned model that was saved earlier
# model = BertForSequenceClassification.from_pretrained('outputs')

# Initialize the prediction and actual labels
preds = None
out_label_ids = None

## Put model in "eval" mode and make sure it is on the right device
model.eval()
model.to(device)

for batch in notebook.tqdm(test_dataloader, desc="Evaluating"):
    
    ## Move the input observations to the same device as the model
    batch = tuple(t.to(device) for t in batch)
    
    ## Do not track any gradients since in 'eval' mode
    with torch.no_grad():
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        ## Forward pass through the model
        outputs = model(**inputs)

        ## We get loss since we provided the labels
        tmp_eval_loss, logits = outputs[:2]

        ## There may be more than one batch of items in the test dataset
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(
                out_label_ids, 
                inputs['labels'].detach().cpu().numpy(), 
                axis=0
            )
    
## Get final predictions and accuracy
preds = np.argmax(preds, axis=1)
acc_score = accuracy_score(out_label_ids, preds)
print ('Accuracy Score on Test data ', acc_score)
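
The last step turns the model's two logits per review into a class label by taking the index of the larger one, then compares the labels with the ground truth. A minimal pure-Python sketch with made-up logits and labels:

```python
# Made-up logits for four reviews: [score for class 0, score for class 1]
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 1.5]]
labels = [0, 1, 1, 1]

# argmax over each row picks the predicted class
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

No softmax is needed for this step, since applying it would not change which logit is the largest.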
