# Sentiment Analysis using BERT and TensorFlow

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SaschaHeyer/Sentiment-Analysis-GCP/blob/main/notebook/Sentiment_Analysis_BERT_and_TensorFlow.ipynb)

This notebook contains the code for the DoiT blog article https://blog.doit-intl.com/performing-surprisingly-easy-sentiment-analysis-on-google-cloud-platform-fc26b2e2b4b. If you want to deploy this model to Google Cloud, head over to the article.
 

## Author
Sascha Heyer - Senior Machine Learning Engineer at [DoiT International](https://twitter.com/doitint)<br/>
[Twitter](https://twitter.com/HeyerSascha)
[LinkedIn](https://www.linkedin.com/in/saschaheyer/)


## Install Huggingface Transformers

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.0 tokenizers-0.13.2 transformers-4.24.0


## Dependencies

In [2]:
from transformers import DistilBertTokenizerFast
from transformers import TFDistilBertForSequenceClassification

import tensorflow as tf
import json
import pandas as pd
import numpy as np
from io import StringIO

## Configuration

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings.
- Select GPU from the Hardware Accelerator drop-down.

In [3]:
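# tf.config.list_physical_devices is the stable API; the experimental alias also still works
num_gpus_available = len(tf.config.list_physical_devices('GPU'))
print("Num GPUs Available: ", num_gpus_available)
assert num_gpus_available > 0, "No GPU found; enable one in the notebook settings"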

Num GPUs Available:  1


## Load dataset

In [4]:
file = tf.io.gfile.GFile(
'gs://machine-learning-samples/datasets/sentiment/imdb/csv/dataset.csv', mode='r').read()

df = pd.read_csv(StringIO(file))

In [None]:
df.tail()

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0
49999,No one expects the Star Trek movies to be high...,0


## Train / Validation split

In [None]:
sentiments = df['sentiment'].values.tolist()
reviews = df['review'].values.tolist()

from sklearn.model_selection import train_test_split
training_sentences, validation_sentences, training_labels, validation_labels = train_test_split(reviews, sentiments, test_size=.2)
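
To make the 80/20 split above more concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation; the helper name `split_train_validation`, the fixed `seed`, and the toy data are assumptions made for illustration only.

```python
import random


def split_train_validation(texts, labels, validation_fraction=0.2, seed=42):
    """Shuffle paired examples and hold out a fraction for validation."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(indices) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = ([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    val = ([texts[i] for i in val_idx], [labels[i] for i in val_idx])
    return train, val


# Toy data (hypothetical): ten reviews with alternating labels.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
(train_x, train_y), (val_x, val_y) = split_train_validation(toy_texts, toy_labels)
print(len(train_x), len(val_x))  # 8 2
```

Shuffling indices rather than the lists themselves keeps every review paired with its own label, which is the property the split has to preserve.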

## Tokenization

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [None]:
tokenizer([training_sentences[0]], truncation=True,
                            padding=True, max_length=128)

{'input_ids': [[101, 2009, 2003, 5875, 2000, 2156, 2054, 2111, 2228, 1997, 2023, 3185, 1010, 2144, 2009, 2003, 1010, 1999, 2755, 1010, 3243, 4310, 1006, 2295, 2009, 6468, 2070, 1997, 1996, 11749, 2015, 1997, 14675, 12852, 1005, 1055, 3015, 1007, 1012, 2130, 2295, 2009, 2453, 4025, 1037, 2978, 26881, 2000, 2360, 2061, 1010, 1996, 3185, 2003, 2074, 17796, 2438, 2000, 13366, 2571, 6593, 2216, 2008, 2342, 3115, 5365, 5436, 18008, 1010, 1998, 21323, 1010, 2061, 2008, 2065, 2017, 5987, 2000, 2022, 7349, 1010, 2017, 2097, 2156, 1037, 3671, 6071, 17312, 2007, 7167, 1997, 9219, 1998, 1037, 4487, 2015, 5558, 18447, 2098, 5436, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2216, 2040, 2342, 1037, 7399, 1010, 3563, 1998, 4895, 27898, 5436, 2240, 2097, 5223, 2023, 3185, 1010, 2138, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [None]:
sequence = 'DoiT is a great company to work for'
tokenizer.tokenize(sequence)

['doi', '##t', 'is', 'a', 'great', 'company', 'to', 'work', 'for']

In [None]:
tokenizer(sequence)

{'input_ids': [101, 9193, 2102, 2003, 1037, 2307, 2194, 2000, 2147, 2005, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
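
The `doi` / `##t` split above comes from WordPiece, which breaks a word into the longest vocabulary entries it can find, reading left to right. The following is a simplified sketch of that greedy longest-match-first idea, assuming a tiny made-up vocabulary (`toy_vocab`) and whitespace splitting. The real tokenizer also normalizes text, splits punctuation, and uses the full `distilbert-base-uncased` vocabulary.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Lowercase, split on whitespace, and apply WordPiece to each word."""
    return [piece for word in text.lower().split() for piece in wordpiece(word, vocab)]


# Hypothetical miniature vocabulary, just large enough for the example sentence.
toy_vocab = {"doi", "##t", "is", "a", "great", "company"}
print(toy_tokenize("DoiT is a great company", toy_vocab))
# ['doi', '##t', 'is', 'a', 'great', 'company']
```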

In [None]:
train_encodings = tokenizer(training_sentences,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(validation_sentences,
                            truncation=True,
                            padding=True)
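
As a rough illustration of what `truncation=True` and `padding=True` do to a batch, the sketch below truncates each sequence, adds the `[CLS]` (101) and `[SEP]` (102) ids seen in the outputs above, and pads to the longest sequence in the batch. The helper `encode_batch` is hypothetical and only approximates the tokenizer's behavior; the pad id 0 is the `[PAD]` id of the BERT uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the BERT uncased vocabulary


def encode_batch(token_id_lists, max_length=8):
    """Truncate, add special tokens, pad to the longest sequence, and build masks."""
    # Reserve two positions for [CLS] and [SEP] when truncating.
    with_specials = [
        [CLS_ID] + ids[: max_length - 2] + [SEP_ID] for ids in token_id_lists
    ]
    longest = max(len(ids) for ids in with_specials)
    input_ids, attention_mask = [], []
    for ids in with_specials:
        pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = encode_batch([[9193, 2102, 2003], [2307]], max_length=8)
print(batch["input_ids"])       # [[101, 9193, 2102, 2003, 102], [101, 2307, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Every row ends up with the same length, which is what lets the encodings be stacked into a single tensor later on.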

## TensorFlow dataset

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    training_labels
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    validation_labels
))
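
`tf.data.Dataset.from_tensor_slices` slices its inputs along the first axis, so each dataset element pairs one example's features with that example's label. A pure-Python analogue, with a hypothetical helper name and toy values, might look like this:

```python
def slice_examples(features, labels):
    """Yield one (feature_dict, label) pair per example, slicing along the first axis."""
    n_examples = len(labels)
    for name, column in features.items():
        if len(column) != n_examples:
            raise ValueError(f"feature {name!r} has {len(column)} rows, expected {n_examples}")
    for i in range(n_examples):
        yield {name: column[i] for name, column in features.items()}, labels[i]


# Toy encodings (hypothetical values) in the same dict-of-lists layout as the tokenizer output.
toy_encodings = {
    "input_ids": [[101, 2307, 102], [101, 2919, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_labels = [1, 0]

examples = list(slice_examples(toy_encodings, toy_labels))
print(examples[0])
# ({'input_ids': [101, 2307, 102], 'attention_mask': [1, 1, 1]}, 1)
```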

## Model

In [None]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                              num_labels=2)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_layer_norm', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['dropout_19', 'pre_classifier', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [None]:
# Using the Hugging Face model saves us the time and effort of building the model ourselves
# https://www.tensorflow.org/official_models/fine_tuning_bert_files/output_8L__-erBwLIQ_0.png?dcb_=0.04391390122987171

model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________
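
The parameter counts in the summary can be checked by hand. A dense layer has `inputs * outputs` weights plus `outputs` biases, and DistilBERT base uses a hidden size of 768. The short check below takes the backbone count from the summary as given rather than deriving it.

```python
hidden_size = 768        # hidden dimension of distilbert-base-uncased
num_labels = 2           # negative / positive
backbone = 66_362_880    # "distilbert" row, copied from the summary above


def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs


pre_classifier = dense_params(hidden_size, hidden_size)
classifier = dense_params(hidden_size, num_labels)
total = backbone + pre_classifier + classifier
print(pre_classifier, classifier, total)  # 590592 1538 66955010
```

The classification head adds under 0.6 million parameters on top of the roughly 66 million in the pretrained backbone.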


## Training

In [None]:
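optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-08)
# The model outputs raw logits, so use a cross-entropy loss with from_logits=True.
# model.compute_loss can raise an AttributeError with recent transformers/TensorFlow
# releases; an explicit Keras loss avoids that.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# The datasets are already batched, so fit() needs no separate batch_size argument.
model.fit(train_dataset.shuffle(100).batch(16),
              epochs=2,
              validation_data=val_dataset.shuffle(100).batch(16))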
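
For a two-label classifier that outputs raw logits, the usual training objective is sparse categorical cross-entropy computed from those logits. The sketch below is a pure-Python illustration of that loss and of accuracy for a small batch. The function names and example logits are made up, and this is not the code Keras runs internally.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log-probability of the true class, computed stably from raw logits."""
    max_logit = max(logits)
    # log-sum-exp with the maximum subtracted to avoid overflow
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


def accuracy(batch_logits, labels):
    """Fraction of examples whose largest logit matches the label."""
    correct = sum(
        1 for logits, label in zip(batch_logits, labels)
        if max(range(len(logits)), key=logits.__getitem__) == label
    )
    return correct / len(labels)


# Hypothetical logits for three reviews; labels use 0 = negative, 1 = positive.
example_logits = [[2.0, -1.0], [-0.5, 1.5], [0.3, 0.1]]
example_labels = [0, 1, 1]
losses = [
    sparse_categorical_crossentropy_from_logits(logits, label)
    for logits, label in zip(example_logits, example_labels)
]
print(sum(losses) / len(losses), accuracy(example_logits, example_labels))
```

A confident correct prediction yields a loss near zero, while the third example, where the model slightly prefers the wrong class, contributes most of the average.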

IMDB Sentiment Benchmark https://paperswithcode.com/sota/sentiment-analysis-on-imdb

## Save model

In [None]:
model.save_pretrained("./model")

## Load model from storage (for demo purposes, without time for full training)

In [None]:
!gsutil cp -r gs://machine-learning-samples/models/sentiment/model . 

## Load model

In [None]:
loaded_model = TFDistilBertForSequenceClassification.from_pretrained("./model")

## Prediction

In [None]:
test_sentence = "DoiT is a great company"

# To test sarcasm, set test_sentence to a sarcastic sentence instead
predict_input = tokenizer.encode(test_sentence,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

tf_output = loaded_model.predict(predict_input)[0]

tf_prediction = tf.nn.softmax(tf_output, axis=1)
labels = ['Negative','Positive']
label = tf.argmax(tf_prediction, axis=1)
label = label.numpy()
print(labels[label[0]])
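
The last steps of the prediction cell turn the model's two logits into probabilities with a softmax and pick the index of the larger one. Here is a minimal pure-Python sketch of that post-processing. The example logits are made up, and the helper names are illustrative rather than part of any library.

```python
import math

labels = ['Negative', 'Positive']


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the predicted label and its probability."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]


# Hypothetical logits for one sentence: the second (positive) logit is larger.
predicted_label, confidence = classify([-1.2, 2.3])
print(predicted_label, round(confidence, 3))
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would select the same label. The softmax only matters when you also want a probability.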

## Excursion

### Masking

In [None]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("DoiT is a [MASK] company to work for.")

### Tokenization

In [None]:
tokenizer.tokenize('Cat Dog Cat Dog')

In [None]:
tokenizer(['Cat Dog Cat Dog'], 
          truncation=True,
          padding=True, 
          max_length=128)

### Biased
BERT was trained on Wikipedia and BookCorpus, so it picks up the biases present in that text. This is an important topic, and we need to keep it in mind whenever we work with machine learning models and data.

In [None]:
unmasker("The White man worked as a [MASK].")

In [None]:
unmasker("The woman worked as a [MASK].")

In [None]:
unmasker("The Black woman worked as a [MASK].")