# Sentiment Analysis Using BERT on Chinese Dataset

In this tutorial, we show how to perform sentiment classification of Chinese movie reviews using the pre-trained BERT model.

This example also shows the effectiveness of **transfer learning**.

## BERT in Short

- BERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

- The BERT model was proposed in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 
- It’s a bidirectional transformer pretrained using a combination of masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.
- In particular, BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation.
- The size of the large BERT model:
    - Transformer blocks: 24
    - Embedding dimension: 1024
    - Attention heads: 16
    - Total number of parameters: 340M

- The size of GPT-2 Model:
    - Transformer blocks: 48
    - Sequence length: 1024
    - Embedding dimension: 1600
    - Total number of parameters: 1.5B
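
The masked language modeling objective mentioned in the list above can be illustrated with a minimal sketch. It is a simplification: it always substitutes `[MASK]`, whereas the BERT paper replaces a selected token with `[MASK]` 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time. The helper `mask_tokens` is hypothetical and written only for illustration; the default rate of 15% follows the paper.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    which is what the model would be asked to predict.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")   # hide the token from the model
            labels.append(tok)        # remember what was hidden
        else:
            masked.append(tok)
            labels.append(None)       # nothing to predict here
    return masked, labels

tokens = list("這部電影真的很好看")
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pre-training, the loss is computed only at the positions where `labels` is not `None`.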

## Setup

In [2]:
import tensorflow as tf
import tensorflow.keras as keras
import pandas as pd
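import sklearn
import sklearn.metrics          # submodules are imported explicitly because
import sklearn.model_selection  # `import sklearn` alone does not guarantee
import sklearn.utils            # that they are loaded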
import unicodedata
import numpy as np
import re
import nltk
from nltk.corpus import stopwords

# import tensorflow_hub as hub

# from sklearn.model_selection import train_test_split

# from tqdm import tqdm
# import pickle
# from keras.models import Model
# import keras.backend as K
# from sklearn.metrics import confusion_matrix,f1_score,classification_report
# import matplotlib.pyplot as plt
# from keras.callbacks import ModelCheckpoint
# import itertools
# from keras.models import load_model
# from sklearn.utils import shuffle
# from transformers import *
# from transformers import BertTokenizer, TFBertModel, BertConfig


## Data Loading

In [17]:
# def unicode_to_ascii(s):
#     return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

# def clean_stopwords_shortwords(w):
#     stopwords_list=stopwords.words('english')
#     words = w.split() 
#     clean_words = [word for word in words if (word not in stopwords_list) and len(word) > 2]
#     return " ".join(clean_words) 

# def preprocess_sentence(w):
#     w = unicode_to_ascii(w.lower().strip())
#     w = re.sub(r"([?.!,¿])", r" ", w)
#     w = re.sub(r'[" "]+', " ", w)
#     w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
#     w=clean_stopwords_shortwords(w)
#     w=re.sub(r'@\w+', '',w)
#     return w

In [4]:
csv_file='../../../RepositoryData/data/marc_movie_review_metadata.csv'
csv_data=pd.read_csv(csv_file,encoding='utf-8')
csv_data.head()

Unnamed: 0,reviewID,title_CH,title_EN,genre,rating,reviews,reviews_sentiword_seg
0,Review_1,紫羅蘭永恆花園外傳－永遠與自動手記人偶－,Violet Evergarden - Eternity and the Auto Memo...,動畫,negative,唉，踩雷了，浪費時間，不推 唉，踩雷了，浪費時間，不推,唉 ， 踩 雷 了 ， 浪費 時間 ， 不 推 唉 ， 踩 雷 了 ， 浪費 時間 ， 不 推
1,Review_2,復仇者聯盟：終局之戰,Avengers: Endgame,動作_冒險,negative,片長三個小時，只有最後半小時能看，前面真的鋪陳太久，我旁邊的都看到打呼,片長 三個 小時 ， 只有 最後 半 小時 能 看 ， 前面 真的 鋪陳 太久 ， 我 旁邊...
2,Review_3,復仇者聯盟：終局之戰,Avengers: Endgame,動作_冒險,negative,史上之最，劇情拖太長，邊看邊想睡覺...... 1.浩克竟然學會跟旁人一起合照。 2.索爾...,史上 之 最 ， 劇情 拖 太長 ， 邊看邊 想 睡覺 . . . . . . 1. 浩克 ...
3,Review_4,復仇者聯盟：終局之戰,Avengers: Endgame,動作_冒險,negative,難看死ㄌ 難看死了 難看死ㄌ 看到睡著 拖戲拖很長 爛到爆,難看 死 ㄌ 難看 死 了 難看 死 ㄌ 看到 睡著 拖戲 拖 很長 爛 到 爆
4,Review_5,復仇者聯盟：終局之戰,Avengers: Endgame,動作_冒險,negative,連續三度睡著，真的演的太好睡了,連續 三度 睡著 ， 真的 演 的 太 好 睡 了


## Data Preprocessing

In [5]:
print('File has {} rows and {} columns'.format(csv_data.shape[0],csv_data.shape[1]))

File has 3200 rows and 7 columns


In [48]:
# csv_data = csv_data.loc[:, ~csv_data.columns.str.contains('Unnamed: 2', case=False)] 
# csv_data = csv_data.loc[:, ~csv_data.columns.str.contains('Unnamed: 3', case=False)] 
# csv_data = csv_data.loc[:, ~csv_data.columns.str.contains('Unnamed: 4', case=False)] 
# csv_data.head()
# csv_data=csv_data.dropna() 
# csv_data=csv_data.reset_index(drop=True)  # Reset index after dropping the columns/rows with NaN values

In [6]:
csv_data.rename(columns={'rating':'label','reviews':'text'}, inplace=True)

In [7]:
csv_data = sklearn.utils.shuffle(csv_data)                                                         # Shuffle the dataset
#print('Available labels: ',data.label.unique())                              # Print all the unique labels in the dataset
# csv_data['text']=csv_data['text'].map(preprocess_sentence)                           # Clean the text column using preprocess_sentence function defined above

In [8]:
print('File has {} rows and {} columns'.format(csv_data.shape[0],csv_data.shape[1]))
csv_data.head()

File has 3200 rows and 7 columns


Unnamed: 0,reviewID,title_CH,title_EN,genre,label,text,reviews_sentiword_seg
2425,Review_2426,葉問4：完結篇,IP MAN 4,動作_劇情,positive,後面直接他媽看哭\r\n能用功夫片把我看哭的，大概也只有葉師傅了！,後面 直接 他 媽 看 哭 \r \n 能 用 功夫 片 把 我 看 哭 的 ， 大概 也 ...
2312,Review_2313,返校,Detention,懸疑/驚悚,positive,看到韓粉洗負評，就知道這一定是好片,看到 韓粉 洗 負評 ， 就 知道 這 一定是 好片
405,Review_406,古曼童,Kumanthong,恐怖_懸疑/驚悚,negative,這是在描述神棍的片子\r\n整場滿頭問號\r\n想學泰國邪降但只有20%\r\n真的不用特地...,這是 在 描述 神棍 的 片子 \r \n 整場 滿頭 問號 \r \n 想學 泰國 邪降 ...
2748,Review_2749,返校,Detention,懸疑/驚悚,positive,挖操，電影還沒上映，一堆時空旅人來給一星影評\r\n\r\n笑死\r\n\r\n五毛網軍們是...,挖操 ， 電影 還沒 上映 ， 一堆 時 空 旅人 來給 一星 影評 \r \n \r \n...
1645,Review_1646,花椒之味,Fagaro,劇情,positive,好看啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊,好看 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊


## BERT Tokenizer

- You can find more pre-trained models supported by HuggingFace [here](https://huggingface.co/models?filter=zh).
- CKIP has also released their BERT models. Please see [here](https://huggingface.co/ckiplab).
- It seems that CKIP only releases the PyTorch version of its pre-trained models, not a TensorFlow version. The general Chinese model `bert-base-chinese` comes in both versions, so we will use this general one.


```{tip}
In `transformers`, there are several predefined TensorFlow models that use BERT for classification. Please see the Hugging Face `transformers` [BERT](https://huggingface.co/transformers/model_doc/bert.html) documentation.
```


In [26]:
num_classes = len(csv_data.label.unique())


from transformers import BertTokenizer, TFBertForSequenceClassification
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

In [27]:
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese',num_labels=num_classes)

Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
sent= '天阿，這電影實在是...無言啊！'
tokens=bert_tokenizer.tokenize(sent)
print(tokens)

['天', '阿', '，', '這', '電', '影', '實', '在', '是', '.', '.', '.', '無', '言', '啊', '！']


Parameters of the `TFBertForSequenceClassification` model:

1. `input_ids` (NumPy array or `tf.Tensor` of shape `(batch_size, sequence_length)`): Indices of the input sequence tokens in the vocabulary. They are often the only required input to the model, and they can be obtained from the BERT tokenizer.
    - `batch_size`: the number of examples (sentences) in a batch.
    - `sequence_length`: the number of tokens in a sentence.
2. `attention_mask` (NumPy array or `tf.Tensor` of shape `(batch_size, sequence_length)`): Mask to avoid performing attention on padding token indices. Mask values are in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked (that is, tokens added by padding). This argument tells the model which tokens should be attended to and which should not. If we have two sentences, one of length 8 and the other of length 10, the shorter one must be padded so that both have equal length; the attention mask distinguishes the padded positions from the real ones.
3. `labels` (`tf.Tensor` of shape `(batch_size,)`, optional): Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., num_classes - 1]`. If `num_classes == 1`, a regression loss is computed (mean squared error); if `num_classes > 1`, a classification loss is computed (cross-entropy).

The tokens shown above can then be converted into IDs that the model understands. This can be done by feeding the sentence directly to the tokenizer.
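
The padding example above (two sentences of 8 and 10 tokens) can be sketched in plain Python. The token IDs are made-up placeholders, `pad_batch` is a hypothetical helper rather than part of `transformers`, and the pad ID is assumed to be 0.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short = list(range(101, 109))   # 8 placeholder token IDs
long_ = list(range(201, 211))   # 10 placeholder token IDs
ids, mask = pad_batch([short, long_])
print(ids[0])    # the 8 IDs followed by two pad IDs
print(mask[0])   # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(mask[1])   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Padding goes on the right, in line with the advice for models with absolute position embeddings.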

In [29]:
tokenized_sequence= bert_tokenizer.encode_plus(sent,add_special_tokens = True, max_length =30,padding = True,
return_attention_mask = True)

In [30]:
tokenized_sequence

{'input_ids': [101, 1921, 7350, 8024, 6857, 7442, 2512, 2179, 1762, 3221, 119, 119, 119, 4192, 6241, 1557, 8013, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [31]:
bert_tokenizer.decode(tokenized_sequence['input_ids'])

'[CLS] 天 阿 ， 這 電 影 實 在 是... 無 言 啊 ！ [SEP]'
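
To make the steps above concrete, the following sketch reproduces the encoding of this one sentence without loading the tokenizer. The vocabulary is a tiny hand-built lookup copied from the outputs above, not the real `bert-base-chinese` vocabulary file, and `toy_encode` is a hypothetical helper. Splitting into single characters matches the tokenizer's output for this sentence; text containing Latin words or digits may be split differently by the real tokenizer.

```python
# Token-to-ID pairs copied from the tokenizer outputs shown above
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "天": 1921, "阿": 7350, "，": 8024, "這": 6857, "電": 7442, "影": 2512,
    "實": 2179, "在": 1762, "是": 3221, ".": 119, "無": 4192, "言": 6241,
    "啊": 1557, "！": 8013,
}

def toy_encode(sentence, vocab):
    """Mimic encode_plus for a single unpadded sentence."""
    tokens = ["[CLS]"] + list(sentence) + ["[SEP]"]   # add the special tokens
    input_ids = [vocab[t] for t in tokens]            # look up each token's ID
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),       # single segment
        "attention_mask": [1] * len(input_ids),       # no padding, so all ones
    }

encoded = toy_encode("天阿，這電影實在是...無言啊！", toy_vocab)
print(encoded["input_ids"])
```

The printed list should match the `input_ids` returned by `encode_plus` above, with 101 and 102 marking `[CLS]` and `[SEP]`.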

## From Text to BERT Input

In [32]:
csv_data['label_num'] = csv_data['label'].map({'negative':0,'positive':1})
csv_data.head()

Unnamed: 0,reviewID,title_CH,title_EN,genre,label,text,reviews_sentiword_seg,label_num
2425,Review_2426,葉問4：完結篇,IP MAN 4,動作_劇情,positive,後面直接他媽看哭\r\n能用功夫片把我看哭的，大概也只有葉師傅了！,後面 直接 他 媽 看 哭 \r \n 能 用 功夫 片 把 我 看 哭 的 ， 大概 也 ...,1
2312,Review_2313,返校,Detention,懸疑/驚悚,positive,看到韓粉洗負評，就知道這一定是好片,看到 韓粉 洗 負評 ， 就 知道 這 一定是 好片,1
405,Review_406,古曼童,Kumanthong,恐怖_懸疑/驚悚,negative,這是在描述神棍的片子\r\n整場滿頭問號\r\n想學泰國邪降但只有20%\r\n真的不用特地...,這是 在 描述 神棍 的 片子 \r \n 整場 滿頭 問號 \r \n 想學 泰國 邪降 ...,0
2748,Review_2749,返校,Detention,懸疑/驚悚,positive,挖操，電影還沒上映，一堆時空旅人來給一星影評\r\n\r\n笑死\r\n\r\n五毛網軍們是...,挖操 ， 電影 還沒 上映 ， 一堆 時 空 旅人 來給 一星 影評 \r \n \r \n...,1
1645,Review_1646,花椒之味,Fagaro,劇情,positive,好看啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊啊,好看 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊 啊,1


In [33]:
sentences=csv_data['text']
labels=csv_data['label_num']
len(sentences),len(labels)

(3200, 3200)

In [34]:
input_ids=[]
attention_masks=[]

for sent in sentences:
    bert_inp=bert_tokenizer.encode_plus(sent,add_special_tokens = True, max_length =32,pad_to_max_length = True,return_attention_mask = True)
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])

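## alvin's note:
## The deprecation warning for `pad_to_max_length` suggests `padding=True`, but that only pads to the
## longest sequence in the call, so a single sentence passed on its own is left unpadded.
## To pad every sentence to 32 tokens with the newer API, use `padding='max_length'`, `max_length=32` and `truncation=True`.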

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
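
The loop above relies on every sequence coming out at exactly 32 IDs, whatever the length of the review. A minimal sketch of that behaviour follows, assuming the IDs 101 and 102 for `[CLS]` and `[SEP]` (as seen in the earlier output) and 0 for `[PAD]` (an assumption). The helper `to_fixed_length` is hypothetical and only mimics the effect of truncating and padding to `max_length`.

```python
def to_fixed_length(token_ids, max_length=32, cls_id=101, sep_id=102, pad_id=0):
    """Add special tokens, truncate, and right-pad to exactly max_length."""
    body = token_ids[: max_length - 2]            # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return input_ids + [pad_id] * n_pad, attention_mask

short_ids, short_mask = to_fixed_length([1921, 7350, 8024])       # 3 tokens
long_ids, long_mask = to_fixed_length(list(range(1000, 1050)))    # 50 tokens
print(len(short_ids), sum(short_mask))   # 32 5
print(len(long_ids), sum(long_mask))     # 32 32
```

Because all rows have the same length, the lists can later be stacked into rectangular NumPy arrays.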


In [35]:
input_ids=np.asarray(input_ids).astype('int32')
attention_masks=np.array(attention_masks)
labels=np.array(labels)

In [36]:
len(input_ids),len(attention_masks),len(labels)


(3200, 3200, 3200)

The BERT tokenizer returns a dictionary from which we can get the input IDs and the attention masks. We then convert all the encodings to NumPy arrays.

Arguments of the BERT tokenizer:

1. `text` (`str`, `List[str]`, `List[List[str]]`): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
2. `add_special_tokens` (`bool`, optional, defaults to `True`): Whether or not to encode the sequences with the special tokens relative to their model.
3. `max_length` (`int`, optional): Controls the maximum length to use by one of the truncation/padding parameters (`max_length` ≤ 512).
4. `pad_to_max_length` (`bool`, optional, defaults to `False`): Whether or not to pad the sequences to the maximum length. This argument is deprecated in favor of `padding='max_length'`.
5. `return_attention_mask` (`bool`, optional): Whether to return the attention mask.

## Train-Test Split

In [37]:
train_inp,val_inp,train_label,val_label,train_mask,val_mask=sklearn.model_selection.train_test_split(input_ids,labels,attention_masks,test_size=0.2)

print('Train inp shape {} Val input shape {}\nTrain label shape {} Val label shape {}\nTrain attention mask shape {} Val attention mask shape {}'.format(train_inp.shape,val_inp.shape,train_label.shape,val_label.shape,train_mask.shape,val_mask.shape))

Train inp shape (2560, 32) Val input shape (640, 32)
Train label shape (2560,) Val label shape (640,)
Train attention mask shape (2560, 32) Val attention mask shape (640, 32)
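
Passing the three arrays to a single `train_test_split` call matters: the same shuffled row indices are applied to all of them, so each input row keeps its own label and attention mask. The pure-Python sketch below illustrates that idea with a hypothetical helper `aligned_split`; it is not scikit-learn's implementation, and the demo data are placeholders.

```python
import random

def aligned_split(*arrays, test_size=0.2, seed=0):
    """Split several equal-length sequences using one shared shuffled index."""
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all inputs must have the same length"
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # one permutation for every array
    n_test = int(round(n * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    result = []
    for a in arrays:
        result.append([a[i] for i in train_idx])  # train part, then test part,
        result.append([a[i] for i in test_idx])   # in the same order as sklearn
    return result

rows = [f"row{i}" for i in range(10)]
labels_demo = [i % 2 for i in range(10)]
masks_demo = [f"mask{i}" for i in range(10)]
tr_x, va_x, tr_y, va_y, tr_m, va_m = aligned_split(rows, labels_demo, masks_demo)
print(len(tr_x), len(va_x))   # 8 2
```

Splitting each array in a separate call with different random states would break this alignment.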


## Model Setup

In [40]:

import os
path = "./sentiment-analysis-using-bert-keras-chinese/models/"
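os.makedirs(path, exist_ok=True)  # also creates missing parent directories and is safe to re-run

## Callbacks
## TensorBoard creates `log_dir` automatically, but the directory for `model_save_path` must already exist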

In [38]:
log_dir='./sentiment-analysis-using-bert-keras-chinese/tensorboard_data/tb_bert'
model_save_path='./sentiment-analysis-using-bert-keras-chinese/models/bert_model.h5'


callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=model_save_path,save_weights_only=True,monitor='val_loss',mode='min',save_best_only=True),keras.callbacks.TensorBoard(log_dir=log_dir)]

bert_model.summary()  # summary() prints the table itself and returns None

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,epsilon=1e-08)

bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  102267648 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 102,269,186
Trainable params: 102,269,186
Non-trainable params: 0
_________________________________________________________________
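
The parameter counts in the summary can be reproduced by hand, which is a useful check on what the model contains. The sketch assumes the standard BERT-base configuration (12 layers, hidden size 768, feed-forward size 3072, 512 positions, 2 segment types) and a vocabulary of 21,128 entries for `bert-base-chinese`. These values are not shown in this tutorial and are taken here as assumptions. `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate,
                     max_positions=512, type_vocab=2):
    """Count the weights and biases of a BERT encoder including its pooler."""
    layer_norm = 2 * hidden                                    # scale + bias
    # word, position and segment embeddings, followed by a LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # query, key, value and output projections, followed by a LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # two dense layers (hidden -> intermediate -> hidden), followed by a LayerNorm
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

encoder = bert_param_count(vocab_size=21128, hidden=768, layers=12, intermediate=3072)
classifier = 768 * 2 + 2          # Dense layer: 768 inputs, 2 classes, plus 2 biases
print(encoder)                    # 102267648
print(classifier)                 # 1538
print(encoder + classifier)       # 102269186
```

Under the same formula, a BERT-large configuration (24 layers, hidden size 1024, feed-forward size 4096, and an assumed vocabulary of 30,522) gives roughly 335M parameters, close to the 340M figure usually quoted.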


## Model Training

In [39]:
history=bert_model.fit([train_inp,train_mask],
                       train_label,
                       batch_size=32,
                       epochs=1,
                       validation_data=([val_inp,val_mask],val_label),
                       callbacks=callbacks)



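In the original run, the `ModelCheckpoint` callback raised the following error when it tried to write the weights, because the `models/` directory did not exist yet (the cell that creates it was executed after training, as the cell numbers show):

OSError: Unable to create file (unable to open file: name = './sentiment-analysis-using-bert-keras-chinese/models/bert_model.h5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 602)

The weights were therefore saved manually once the directory had been created: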

In [41]:
bert_model.save_weights(model_save_path)

## Model Evaluation Using TensorBoard

In [144]:
# %load_ext tensorboard

In [None]:
# %tensorboard --logdir {log_dir}

## Model Evaluation: Metrics

In [42]:
# model_save_path='./sentiment-analysis-using-bert-keras/models/bert_model.h5'


trained_model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese',num_labels=2)
trained_model.compile(loss=loss,optimizer=optimizer, metrics=[metric])
trained_model.load_weights(model_save_path)

preds = trained_model.predict([val_inp,val_mask],batch_size=32)

Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier', 'dropout_75']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [43]:
pred_labels = preds[0].argmax(axis=1)
f1 = sklearn.metrics.f1_score(val_label,pred_labels)
print('F1 score',f1)
print('Classification Report')

# Names must follow the sorted label ids (0 = negative, 1 = positive);
# `csv_data.label.unique()` returns them in order of appearance after shuffling.
target_names=['negative','positive']
print(sklearn.metrics.classification_report(val_label,pred_labels,target_names=target_names))

F1 score 0.8888888888888888
Classification Report
              precision    recall  f1-score   support

    negative       0.94      0.80      0.87       312
    positive       0.83      0.95      0.89       328

    accuracy                           0.88       640
   macro avg       0.89      0.88      0.88       640
weighted avg       0.89      0.88      0.88       640
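
The numbers in the report are consistent with a single confusion matrix, which makes the relationship between the metrics easy to see. The counts below are reconstructed from the rounded report, the supports (312 reviews with `label_num` 0 and 328 with `label_num` 1) and the accuracy; they are an inference, not output from the original run. Class 1 is treated as the positive class, which is the default for `f1_score`. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for class 1, plus overall accuracy."""
    precision = tp / (tp + fp)     # of everything predicted as class 1, how much is right
    recall = tp / (tp + fn)        # of all true class-1 items, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

# Reconstructed counts (assumption): tn + fp = 312 class-0 items, fn + tp = 328 class-1 items
precision, recall, f1_demo, accuracy = binary_metrics(tn=250, fp=62, fn=16, tp=312)
print(round(precision, 2), round(recall, 2), round(f1_demo, 2))   # 0.83 0.95 0.89
print(accuracy)
```

The F1 value from these counts agrees with the `F1 score` line printed above, and the accuracy agrees with the 0.88 in the report.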


In [44]:
trained_model.evaluate([val_inp,val_mask],batch_size=32)



[0.0, 0.878125011920929]

## References

- [BERT Text Classification Using Keras](https://swatimeena989.medium.com/bert-text-classification-using-keras-903671e0207d)
- [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)
- [Text Extraction with BERT](https://keras.io/examples/nlp/text_extraction_with_bert/#text-extraction-with-bert)