# A Visual Notebook to Using BERT for the First TIme.ipynb



<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification.png" />

In this notebook, we will use pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of sentences from film reviews. And we will calssify each sentence as either speaking "positively" about its subject of "negatively".

## Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" />

Under the hood, the model is actually made up of two model.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit learn will take in the result of DistilBERT’s processing, and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this of vector as an embedding for the sentence that we can use for classification.


<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />

## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

## Installing the transformers library
Let's start by installing the huggingface transformers library so we can load our deep learning NLP model.

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 13.1 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 77.8 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
[K     |████████████████████████████████| 59 kB 7.0 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 75.9 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 76.1 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Atte

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [3]:
df = pd.read_excel('/content/Topics and Keywords.xlsx', sheet_name = 'Final')
# df['lowered_keywords'] = df['Keywords'].apply(lambda x: x.lower())
df.head()

Unnamed: 0,content,category
0,big data analytics,technical keyword
1,robotics,technical keyword
2,Internet of things,technical keyword
3,Artificial intelligence,technical keyword
4,ML,technical keyword


In [7]:
df = df.sample(frac = 1)
df['content'] = df['content'].apply(lambda x: x.lower())
df['category'] = df['category'].apply(lambda x: x.lower())
df.head()

Unnamed: 0,content,category
701,smart parking technology,research area
359,thermochromic,technical keyword
244,fluorescence-cum-optical-based biosensors,technical keyword
877,driverless haulers,research area
6,yield monitoring,technical keyword


For performance reasons, we'll only use 2,000 sentences from the dataset

We can ask pandas how many sentences are labeled as "positive" (value 1) and how many are labeled "negative" (having the value 0)

In [8]:
df['category'].value_counts()

research area        487
technical keyword    461
Name: category, dtype: int64

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model. 

In [9]:
# For DistilBERT:
# model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Right now, the variable `model` holds a pretrained distilBERT model -- a version of BERT that is smaller, but much faster and requiring a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to so some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into word and subwords in the format BERT is comfortable with.

In [10]:
tokenized = df['content'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

### Padding
After tokenization, `tokenized` is a list of sentences -- each sentences is represented as a list of tokens. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

In [11]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [None]:
padded

Our dataset is now in the `padded` variable, we can view its dimensions below:

In [13]:
np.array(padded).shape

(948, 15)

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [14]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(948, 15)

## Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

In [15]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [None]:
last_hidden_states

Let's slice only the part of the output that we need. That is the output corresponding the first token of each sentence. The way BERT does sentence classification, is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features to our logitics regression model.

In [17]:
type(last_hidden_states)

transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions

In [None]:
features = last_hidden_states[0][:,0,:].numpy()
features

In [83]:
# def bert_embedding_pipeline(keyword):
#   tokenised = tokenizer.encode(keyword, add_special_tokens=True)
#   # padded = np.array(tokenised)
#   max_len = 15
#   padded = np.array([tokenised + [0]*(max_len-len(tokenised))])
#   attention_mask = np.where(padded != 0, 1, 0)
#   input_ids = torch.tensor(padded)  
#   attention_mask = torch.tensor(attention_mask)
#   with torch.no_grad():
#     last_hidden_states = model(input_ids, attention_mask = attention_mask)
#   keyword_features = last_hidden_states[0][:,0,:].numpy()
#   output = clf.predict(keyword_features)
#   new_array = ['Technical Keyword' if i==0 else 'Research Area' for i in output]
#   return new_array

# bert_embedding_pipeline('lidar sensor')  

In [19]:
features.shape

(948, 768)

The labels indicating which sentence is positive and negative now go into the `labels` variable

In [20]:
df['label'] = df['category'].replace(['technical keyword','research area'],[0,1])
labels = df['label']

In [None]:
# # Import label encoder
# from sklearn import preprocessing
  
# # label_encoder object knows how to understand word labels.
# label_encoder = preprocessing.LabelEncoder()
  
# # Encode labels in column 'species'.
# df['Category']= label_encoder.fit_transform(df['Category'])
  
# df['Category'].unique()

#Classification

In [24]:
import time

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import train_test_split
# from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
# #-------------------------------------------------------------------#
# from sklearn.datasets import fetch_20newsgroups
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction.text import TfidfTransformer
# #-------------------------------------------------------------------#


import warnings
warnings.filterwarnings("ignore")

In [25]:
lr = LogisticRegression()
lsvc = LinearSVC()
svc = SVC() 
# mnbc = MultinomialNB()
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier(n_estimators=100)
knc = KNeighborsClassifier()
# ncc = NearestCentroid()
bgc = BaggingClassifier(KNeighborsClassifier())
gbc = GradientBoostingClassifier(n_estimators=50,verbose=2)


all_model = [lr, lsvc, svc, dtc, rfc, knc , bgc]

In [32]:
X_train, X_test, y_train, y_test = train_test_split(features, labels,test_size=0.2,random_state = 42,shuffle = True)


In [39]:
model_eva_dict = {}

for clf in all_model:
    model_score = {}
    clf_name = str(clf).split("(")[0]
    s_time = time.time()
    
    print("Model training started for", clf_name)
    clf.fit(X_train, y_train)
    ts = time.time()-s_time
    print("Traning time is", ts)
    
    predicted = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print("Accuracy: ",round(acc * 100,2))
    b_acc = metrics.balanced_accuracy_score(y_test, predicted)
    t_s = time.time()-s_time
    
    model_score["Model"] = clf
    model_score["Accuracy_score"] = round(acc * 100,2)
    model_score["Balanced_accuracy_score"] = round(b_acc * 100,2)
    model_score["Training_time"] = ts
    model_score["Total_time"] = ts
    
    model_eva_dict[clf_name] = model_score
    
    print()
#     break

Model training started for LogisticRegression
Traning time is 0.05448555946350098
Accuracy:  82.63

Model training started for LinearSVC
Traning time is 0.6382486820220947
Accuracy:  76.32

Model training started for SVC
Traning time is 0.09622740745544434
Accuracy:  80.53

Model training started for DecisionTreeClassifier
Traning time is 0.43171143531799316
Accuracy:  73.68

Model training started for RandomForestClassifier
Traning time is 1.0661473274230957
Accuracy:  80.0

Model training started for KNeighborsClassifier
Traning time is 0.0009744167327880859
Accuracy:  81.05

Model training started for BaggingClassifier
Traning time is 0.043843746185302734
Accuracy:  77.89



In [31]:
model_eva_dict

{'BaggingClassifier': {'Accuracy_score': 81.05,
  'Balanced_accuracy_score': 80.84,
  'Model': BaggingClassifier(base_estimator=KNeighborsClassifier()),
  'Total_time': 0.04031562805175781,
  'Training_time': 0.04031562805175781},
 'DecisionTreeClassifier': {'Accuracy_score': 71.05,
  'Balanced_accuracy_score': 70.93,
  'Model': DecisionTreeClassifier(),
  'Total_time': 0.44517064094543457,
  'Training_time': 0.44517064094543457},
 'KNeighborsClassifier': {'Accuracy_score': 81.05,
  'Balanced_accuracy_score': 80.75,
  'Model': KNeighborsClassifier(),
  'Total_time': 0.0010979175567626953,
  'Training_time': 0.0010979175567626953},
 'LinearSVC': {'Accuracy_score': 76.32,
  'Balanced_accuracy_score': 76.12,
  'Model': LinearSVC(),
  'Total_time': 0.6401405334472656,
  'Training_time': 0.6401405334472656},
 'LogisticRegression': {'Accuracy_score': 82.63,
  'Balanced_accuracy_score': 82.53,
  'Model': LogisticRegression(),
  'Total_time': 0.08279609680175781,
  'Training_time': 0.082796096

In [40]:
param_grid = {'C': [0.1, 1, 10, 100],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3) 

In [41]:
model_score = {}
clf_name = str(clf).split("(")[0]
s_time = time.time()

print("Model training started for", clf_name)
grid.fit(X_train, y_train)
ts = time.time()-s_time
print("Traning time is", ts)

predicted = grid.predict(X_test)
acc = metrics.accuracy_score(y_test, predicted)
print(acc)
b_acc = metrics.balanced_accuracy_score(y_test, predicted)
t_s = time.time()-s_time

model_score["Model"] = clf
model_score["Accuracy_score"] = round(acc * 100,2)
model_score["Balanced_accuracy_score"] = round(b_acc * 100,2)
model_score["Training_time"] = ts
model_score["Total_time"] = ts

Model training started for BaggingClassifier
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.513 total time=   0.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.513 total time=   0.1s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.513 total time=   0.1s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.510 total time=   0.1s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.510 total time=   0.1s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.539 total time=   0.1s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.526 total time=   0.1s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.559 total time=   0.1s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.550 total time=   0.1s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.563 total time=   0.1s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.763 total time=   0.1s
[CV

In [42]:
# print best parameter after tuning 
print(grid.best_params_) 

# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_) 

{'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
SVC(C=10, gamma=0.01)


In [43]:
clf = SVC(C=10, gamma=0.01, kernel='rbf', random_state=42)
# clf = some.classifier()
clf.fit(features, labels)

SVC(C=10, gamma=0.01, random_state=42)

In [45]:
import pickle
# now you can save it to a file
with open('bert_svc.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [46]:
# and later you can load it
with open('bert_svc.pkl', 'rb') as f:
    clf = pickle.load(f)

#Function Test

In [None]:
## Want BERT instead of distilBERT? Uncomment the following line:
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [None]:
#load the pickle model
with open('bert_svc.pkl', 'rb') as f:
    clf = pickle.load(f)

In [92]:
def bert_embedding_classification_pipeline(keyword):
  tokenised = tokenizer.encode(keyword, add_special_tokens=True)
  # padded = np.array(tokenised)
  max_len = 15
  padded = np.array([tokenised + [0]*(max_len-len(tokenised))])
  attention_mask = np.where(padded != 0, 1, 0)
  input_ids = torch.tensor(padded)  
  attention_mask = torch.tensor(attention_mask)
  with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask = attention_mask)
  keyword_features = last_hidden_states[0][:,0,:].numpy()
  output = clf.predict(keyword_features)
  new_array = ['Technical Keyword' if i==0 else 'Research Area' for i in output]
  return new_array


In [93]:
bert_embedding_classification_pipeline('lidar sensor')  

['Technical Keyword']