In [None]:
from google.colab import drive
#drive.flush_and_unmount()
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers


# Deceptive Opinion Detection using BERT without Fine-Tuning
In this notebook, we use pre-trained BERT to encode text into feature vectors. The encoded embeddings are then used to train a logistic regression model for classification. The corpus consists of truthful and deceptive reviews of 20 Chicago hotels, and the model classifies each review as truthful or deceptive. The data is available on [Kaggle](https://www.kaggle.com/rtatman/deceptive-opinion-spam-corpus).  

## Agenda

1. Data Loading
2. Baseline Model: BoW Features + Logistic Regression
3. BERT without Fine-Tuning:

   3.1 DistilBERT is used to process the reviews and extract features

   3.2 Features are fed into Logistic Regression for classification.

DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.



It should be noted that although the `[CLS]` token acts as an "aggregate representation" for classification tasks, it is not the best choice for a high-quality sentence embedding vector. [According to](https://github.com/google-research/bert/issues/164) BERT author Jacob Devlin: "*I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations*."

(However, the [CLS] token does become meaningful if the model has been fine-tuned, where the last hidden layer of this token is used as the "sentence vector" for sequence classification.)
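
As a hedged illustration of the two pooling strategies mentioned above, the sketch below computes both the position-0 (`[CLS]`) vector and an attention-masked average over the real tokens. The random array only stands in for a model's last hidden state, and the helper names `cls_pooling` and `masked_mean_pooling` are hypothetical, not part of any library. Neither output is guaranteed to be a meaningful sentence vector without fine-tuning.

```python
import numpy as np

def cls_pooling(hidden_states):
    """Return the vector at position 0 (the [CLS] slot) for each sequence."""
    return hidden_states[:, 0, :]

def masked_mean_pooling(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                  # guard against empty rows
    return summed / counts

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 4, 8))   # stand-in for (batch, seq_len, hidden)
attention_mask = np.array([[1, 1, 1, 0],     # 1 = real token, 0 = padding
                           [1, 1, 0, 0]])

cls_vectors = cls_pooling(hidden_states)
mean_vectors = masked_mean_pooling(hidden_states, attention_mask)
```

Both results have shape `(batch, hidden)`, so either could be fed to the same downstream classifier. Which one works better on this corpus is not something the notebook measures.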


In [None]:
import pandas as pd
import re
from bs4 import BeautifulSoup
import seaborn as sns
import numpy as np

## 1. Data Loading

In [None]:
basefn = "/content/drive/My Drive/fraud_analysis/datasets/"
df_corpus = pd.read_csv(basefn + "deceptive-opinion.csv")
df_corpus['LABEL'] = 1
df_corpus.loc[df_corpus['deceptive']=='truthful', 'LABEL'] = 0

In [None]:
def review_to_words(raw_review):
    # Convert a raw review to a string of words.
    # The input is a single string (a raw hotel review), and
    # the output is a single string (a preprocessed hotel review).
    #
    # 1. Remove HTML (name the parser explicitly to avoid a bs4 warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # 2. Only keep letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()

    return " ".join(words)

# Get the number of reviews based on the dataframe column size
num_reviews = df_corpus["text"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; create an index i that goes from 0 to the length
# of the hotel review list
for i in range(0, num_reviews):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_reviews.append(review_to_words(df_corpus["text"][i]))
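
To make the effect of steps 2 and 3 concrete, here is a small sketch that applies only the regex and lowercasing steps to a made-up sentence. The helper `keep_letters` and the sample string are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def keep_letters(text):
    """Replace every non-letter character with a space, lowercase, and re-join."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())

sample = "I'm on the 12th floor - GREAT view!"   # made-up example
cleaned = keep_letters(sample)
print(cleaned)
```

Note that apostrophes and digits are replaced by spaces, so contractions split into separate tokens and numbers disappear. This is why fragments such as `i m` show up in the cleaned reviews.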

In [None]:
df_corpus['TEXT'] = clean_train_reviews
df_corpus = df_corpus[['TEXT', 'LABEL']]
df_corpus.head()

Unnamed: 0,TEXT,LABEL
0,we stayed for a one night getaway with family ...,0
1,triple a rate with upgrade to view room was le...,0
2,this comes a little late as i m finally catchi...,0
3,the omni chicago really delivers on all fronts...,0
4,i asked for a high floor away from the elevato...,0


## 2. Baseline Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
df_train, df_test, train_labels, test_labels = train_test_split(df_corpus.TEXT, df_corpus.LABEL, test_size=0.25, random_state=123)

#### BoW Model

In [None]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(df_train)
X_test = vectorizer.transform(df_test)
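
As a rough illustration of what a bag-of-words matrix contains, the sketch below builds word counts for three made-up sentences using only the standard library. It is a simplified stand-in, not `CountVectorizer` itself, which by default also ignores single-character tokens and returns a sparse matrix.

```python
from collections import Counter

# Made-up mini corpus for illustration
docs = ["the room was clean", "the staff was rude", "clean room clean sheets"]

# Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

bow_matrix = [bow_vector(doc) for doc in docs]
```

Each row is one document and each column is one vocabulary word, so word order within a review is discarded.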

#### Logistic Regression

In [None]:
lr_clf = LogisticRegression(max_iter=20000)
lr_clf.fit(X_train, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=20000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

#### Evaluating the baseline model

In [None]:
lr_clf.score(X_test, test_labels)

0.8225

## 3. BERT without Fine-Tuning

#### Load pre-trained model

In [None]:
import torch
from transformers import DistilBertTokenizer, DistilBertModel
# Load pre-trained model tokenizer (vocabulary)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# Load pre-trained model (weights)
model = DistilBertModel.from_pretrained('distilbert-base-uncased')


#### 3.1 Convert text into vectors using DistilBERT

### Tokenization-Padding-Masking

In [None]:
tokenized = df_corpus['TEXT'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))

In [None]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(1600, 512)
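
The padding and masking steps above can be sketched on a toy input. The token ids below are made up, and the sketch assumes that 0 is the padding id, as the cells above do.

```python
PAD_ID = 0  # assumption: 0 is the padding id, as in the cells above

# Two made-up token id sequences of different lengths
token_ids = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]

# Pad every sequence on the right up to the longest length
max_len = max(len(ids) for ids in token_ids)
padded = [ids + [PAD_ID] * (max_len - len(ids)) for ids in token_ids]

# 1 marks a real token, 0 marks padding that the model should ignore
attention_mask = [[1 if tok != PAD_ID else 0 for tok in row] for row in padded]
```

Deriving the mask from `tok != PAD_ID` is only safe when no real token shares the padding id.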

### Batch Inference

To save memory, the reviews are fed into the DistilBERT model in batches of 10.

In [None]:
feature_list = []
with torch.no_grad():  # inference only, so no gradients are needed
    for batch_idx in range(0, padded.shape[0], 10):
        # Process 10 reviews per forward pass
        input_ids = torch.tensor(padded[batch_idx:batch_idx+10])
        used_attention_mask = torch.tensor(attention_mask[batch_idx:batch_idx+10])
        last_hidden_states = model(input_ids, attention_mask=used_attention_mask)
        # Take the embedding of the [CLS] token (position 0) for each review
        features = last_hidden_states[0][:, 0, :].numpy()
        feature_list.append(features)

![picture](https://docs.google.com/uc?export=download&id=1h9keMjcvvXPJwU0fF4L16Smoe8gYVfSk)

In [None]:
# prepare features
features = np.vstack(feature_list)
features.shape

(1600, 768)
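
The batch-then-stack pattern can be sketched without loading the model. Here `fake_encode` is a hypothetical stand-in that returns one 768-dimensional vector per input row, and the input array is made up. The point is only to show how the slices and `np.vstack` fit together when the row count is not a multiple of the batch size.

```python
import numpy as np

def fake_encode(batch):
    """Hypothetical stand-in for the model: one 768-dim vector per row."""
    return np.zeros((batch.shape[0], 768))

inputs = np.arange(25 * 4).reshape(25, 4)   # 25 made-up "reviews" of 4 ids each
batch_size = 10

feature_list = []
for start in range(0, inputs.shape[0], batch_size):
    batch = inputs[start:start + batch_size]   # the last slice holds only 5 rows
    feature_list.append(fake_encode(batch))

# Stack the per-batch arrays row-wise into one (num_reviews, 768) matrix
features = np.vstack(feature_list)
batch_sizes = [f.shape[0] for f in feature_list]
```

Slicing past the end of the array simply yields a shorter final batch, so no special case is needed for the remainder.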

#### 3.2 Build Logistic Regression

In [None]:
# get labels
labels = df_corpus.LABEL.tolist()

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25, random_state=123)

In [None]:
lr_clf = LogisticRegression(max_iter=20000)
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=20000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
# validate the model
lr_clf.score(test_features, test_labels)

0.845

On this train/test split, the features encoded by pre-trained DistilBERT score higher than the BoW features (0.845 vs. 0.8225 test accuracy).  
And that's it! That's a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from DistilBERT to BERT and see how that works.