The purpose of this Colab notebook is to get hands-on experience with BERT and CNN. The project classifies clickbait news by comparing the title with the body content. I use two datasets. One is from http://www.fakenewschallenge.org/ (a multi-class dataset), and the other is from Kaggle, https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset (a binary-class dataset).

I build two approaches on top of BERT. The first uses the BERT sequence classification model (BertForSequenceClassification), whose output logits are passed to a softmax classifier. The second uses a basic BERT model to obtain embedding vectors, then applies a CNN as the downstream classification model.

# Load Data and Data Preparation

In [3]:
# load the dataset, which is a binary-class dataset
# the dataset is from https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
import pandas as pd
fake = pd.read_csv('../kaggle_dataset/Fake.csv')
real = pd.read_csv('../kaggle_dataset/True.csv')
# fake = pd.read_csv('Fake.csv', engine='python')
# real = pd.read_csv('True.csv', engine='python')
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [4]:
fake = fake.drop(['subject', 'date'], axis=1)
real = real.drop(['subject', 'date'], axis=1)
fake['Identity'] = 0
real['Identity'] = 1 
fake.head()

Unnamed: 0,title,text,Identity
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,0


In [5]:
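# remove the undesired characters in the data
import re

def clean(text):
    """Replace URLs, mentions, hashtags, digits, and HTML tags with spaces."""
    text = re.sub(r'http\S+', ' ', text)  # URLs
    text = re.sub(r'@\w+', ' ', text)     # @mentions
    text = re.sub(r'#\w+', ' ', text)     # hashtags
    text = re.sub(r'\d+', ' ', text)      # digits
    text = re.sub(r'<.*?>', ' ', text)    # HTML tags
    return text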

In [6]:
fake['title'] = fake['title'].apply(clean)
fake['text'] = fake['text'].apply(clean)
real['title'] = real['title'].apply(clean)
real['text'] = real['text'].apply(clean)
fake.head()

Unnamed: 0,title,text,Identity
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,0
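
To see what `clean` does, it may help to run it on a made-up string. The sketch below repeats the same five substitutions so that it runs on its own; the sample sentence and the final whitespace collapse are illustrative additions, not part of the original pipeline.

```python
import re

def clean(text):
    # same substitutions as the cleaning cell, in the same order
    text = re.sub(r'http\S+', ' ', text)   # URLs
    text = re.sub(r'@\w+', ' ', text)      # @mentions
    text = re.sub(r'#\w+', ' ', text)      # hashtags
    text = re.sub(r'\d+', ' ', text)       # digits
    text = re.sub(r'<.*?>', ' ', text)     # HTML tags
    return text

# hypothetical sample, not taken from the dataset
sample = "Visit http://example.com now @user #news 2017 <b>bold</b>"
cleaned = clean(sample)

# each match is replaced by a space, so runs of spaces remain;
# collapsing them is an optional extra step
collapsed = " ".join(cleaned.split())
print(collapsed)
```

Because every match becomes a space rather than an empty string, neighbouring words are never glued together, at the cost of leaving extra whitespace behind.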


In [7]:
# convert to a list in order to fit into the BERT model
fake_news = fake.values.tolist()
real_news = real.values.tolist()

data = fake_news + real_news

In [8]:
import random

random.shuffle(data)
print(len(data))
print(data[0])

44898
['CEO Who Threatened To Kill Trump With Sniper Rifle Says Life Has Been Turned Upside Down [VIDEO]', 'The former CEO of a local cybersecurity firm is talking first to  News   about his threat to shoot President-elect Donald Trump.Team   Investigator Allison Ash sat down with Matt Harrigan Tuesday afternoon. He tells Allison he s been getting death threats since the posts went viral and he and his family have relocated until the storm settles.Harrigan was the CEO of PacketSled until he resigned his position Tuesday morning. He said he s sorry for his words, and he wants his side of the story told.Harrigan wrote the series of Facebook posts on election night. They were words that he thought only his friends would see. He wrote,  I m going to kill the president. elect  and  Bring it, secret service.  He even mentioned getting a sniper rifle and targeting the White House once Donald Trump was living there.Harrigan said his Facebook friends shared the post on Twitter and that s how it

In [None]:
print(data[0][0])

Suspected Indonesian radicals armed with bows and arrows burn down police complex


In [9]:
# split data into train and test sets (only 5% is used for training)
import math
split_point = math.ceil(len(data) * 0.05)
train = data[0:split_point]
#train = data[0:1000]
test = data[split_point:len(data)]
print(len(train))
print(len(test))

2245
42653
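
Because `random.shuffle` is called without a seed, the split above changes on every run. A minimal sketch of a reproducible version follows; the helper name `split_data`, the seed value, and the toy rows are assumptions made for illustration.

```python
import math
import random

def split_data(rows, train_fraction=0.05, seed=42):
    """Shuffle a copy of rows with a fixed seed, then split it."""
    shuffled = list(rows)                   # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the order is repeatable
    split_point = math.ceil(len(shuffled) * train_fraction)
    return shuffled[:split_point], shuffled[split_point:]

# toy rows in the same [title, text, label] layout as `data`
rows = [[f"title {i}", f"body {i}", i % 2] for i in range(100)]
train_rows, test_rows = split_data(rows)
print(len(train_rows), len(test_rows))
```

Calling `split_data(rows)` twice returns the same partition, which makes later results easier to compare across runs.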


In [10]:
# drop the rows whose title or body is empty or too short to be useful

def seperate_title_body(arr):
  """Split rows of [title, body, label] into three parallel lists."""
  title = []
  body = []
  labels = []
  # keep a row only when both the title and the body exceed 10 characters
  for row in arr:
    if len(row[0]) > 10 and len(row[1]) > 10:
      title.append(row[0])
      body.append(row[1])
      labels.append(row[2])
  return (title, body, labels)

In [11]:
# separate to title, body, and label

train_title, train_body, y = seperate_title_body(train)
test_title, test_body, Y = seperate_title_body(test)
print(len(train_title))
print(len(train_body))
print(len(y))

2200
2200
2200


# BERT Tokenizer and Bert Classification
For each input, the model outputs a two-dimensional vector of logits, one value per class. Logits are the raw scores produced before an activation function such as sigmoid or softmax is applied.
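
As a rough illustration of how logits relate to class probabilities, the sketch below applies softmax by hand to two logit pairs of the kind this model prints later in the notebook. The helper `softmax` is written out for clarity; in practice `torch.softmax` would normally be used.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# two logit pairs in the format [score for class 0, score for class 1]
batch_logits = [[3.9957, -3.5112], [-4.1279, 3.7034]]

for logits in batch_logits:
    probs = softmax(logits)
    predicted = probs.index(max(probs))   # the larger logit gives the predicted class
    print(probs, predicted)
```

Softmax preserves the ordering of the logits, so the predicted class can be read from the logits directly; the probabilities are only needed when a confidence value is wanted.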

Tutorial References
https://colab.research.google.com/drive/1Y4o3jh3ZH70tl6mCd76vz_IxX23biCPP#scrollTo=6J-FYdx6nFE_

In [12]:
from transformers import BertTokenizer, BertModel
import torch
import torch.nn as nn

In [13]:
# word embedding

# call BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# padding makes every input the same length. Truncation cuts off inputs that
# are too long, because BERT cannot take more than 512 tokens

# BERT tokenize
train_input = tokenizer(train_title,
                  train_body,
                  padding=True,
                  truncation=True,
                  return_tensors="pt")

# test_input = tokenizer(test_title,
#                   test_body,
#                   padding=True,
#                   truncation=True,
#                   return_tensors="pt")
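
The effect of `padding` and `truncation` can be illustrated without loading a tokenizer. The following is a toy sketch in plain Python, not the Hugging Face implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and a small `max_len` stands in for BERT's limit of 512.

```python
def pad_and_truncate(sequences, max_len, pad_id=0):
    """Cut sequences longer than max_len, then pad all of them to one length."""
    truncated = [seq[:max_len] for seq in sequences]
    target = max(len(seq) for seq in truncated)      # longest sequence after truncation
    input_ids, attention_mask = [], []
    for seq in truncated:
        padding = [pad_id] * (target - len(seq))
        input_ids.append(seq + padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * len(padding))
    return input_ids, attention_mask

ids, mask = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_len=5,
)
print(ids)
print(mask)
```

Unlike this toy, the real tokenizer keeps the special tokens in place when it truncates; the sketch only shows why every row ends up with the same length and how the attention mask marks the padding.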

In [12]:
print(len(train_input['input_ids']))

19875


In [14]:
# TensorDataset bundles tensors of equal length into one indexable dataset
# DataLoader feeds that dataset to later steps in batches

from torch.utils.data import TensorDataset, DataLoader

y = torch.tensor(y)

# train_input['input_ids'].to(device)
# train_input['attention_mask'].to(device)
# y.to(device)

train_tensor = TensorDataset(train_input['input_ids'], train_input['attention_mask'], y)
bert_train = DataLoader(train_tensor, batch_size=16)
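
To make the roles of `TensorDataset` and `DataLoader` concrete, here is a small sketch with random stand-in tensors. The sizes (40 examples, sequence length 8) are arbitrary assumptions; only the batch size of 16 matches the cell above.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# stand-in tensors with the same roles as input_ids, attention_mask, and labels
input_ids = torch.randint(0, 30522, (40, 8))
attention_mask = torch.ones(40, 8, dtype=torch.long)
labels = torch.randint(0, 2, (40,))

dataset = TensorDataset(input_ids, attention_mask, labels)
loader = DataLoader(dataset, batch_size=16)

# 40 examples in batches of 16 give batches of 16, 16, and 8
for ids, mask, y in loader:
    print(ids.shape, mask.shape, y.shape)
```

Passing `shuffle=True` to `DataLoader` would reshuffle the examples at every epoch, which is a common choice for training data.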

In [15]:
# use the pre-built BertForSequenceClassification model, which adds a linear
# classification layer on top of the pooled BERT output

from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False
)

# when running on a local server, the following line is needed because it
# moves the model onto the GPU
model.cuda()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [16]:
from transformers import get_linear_schedule_with_warmup
epochs = 5
# total number of training steps = number of batches * number of epochs
total_steps = len(bert_train) * epochs

optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  eps = 1e-8
                )

# create a learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)
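
With `num_warmup_steps = 0`, the schedule decays the learning rate linearly from its initial value to zero over `total_steps`. The sketch below reproduces that multiplier in plain Python so the numbers can be inspected; `linear_schedule_lr` is a hypothetical helper, and the figure of 138 batches assumes the 2200 training examples and batch size of 16 shown earlier.

```python
import math

def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batches_per_epoch = math.ceil(2200 / 16)   # 138
epochs = 5
total_steps = batches_per_epoch * epochs   # 690

# learning rate at the start, the midpoint, and the end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, total_steps))
```

The rate starts at 2e-5, is halved at the midpoint, and reaches zero on the last step, which is why `scheduler.step()` has to be called once per batch rather than once per epoch.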


# Train and Test Model

In [24]:
torch.cuda.empty_cache()

In [19]:

t = torch.cuda.get_device_properties(0).total_memory
r = torch.cuda.memory_reserved(0) 
a = torch.cuda.memory_allocated(0)
print(t)
print(r)
print(a)

3221225472
2151677952
2094151680
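
The three numbers above are byte counts. A small conversion, sketched below with a hypothetical `to_gib` helper, makes them easier to compare with the figures PyTorch reports in its out-of-memory message.

```python
def to_gib(num_bytes):
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30

total, reserved, allocated = 3221225472, 2151677952, 2094151680

print(f"total:     {to_gib(total):.2f} GiB")
print(f"reserved:  {to_gib(reserved):.2f} GiB")
print(f"allocated: {to_gib(allocated):.2f} GiB")
# memory the caching allocator has reserved but not handed out to tensors
print(f"reserved but unused: {(reserved - allocated) / 2**20:.1f} MiB")
```

Roughly two thirds of the 3 GiB card is already in use before the first training step, which leaves little headroom for activations.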


In [18]:
# start training the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loss_values = []
for iter in range(0, epochs):
  print(f"iter: {iter}")
  #total_loss = 0
  model.train()
  total_loss = []
  for step, batch in enumerate(bert_train):
    if step % 100 == 0:
      print(f'step: {step}')
    #batch = torch.tensor(batch).to(device)
    input_id = batch[0].to(device)
    input_mask = batch[1].to(device)
    input_label = batch[2].to(device)

    model.zero_grad()
    outcome = model(input_id, token_type_ids=None, attention_mask=input_mask, 
                    labels=input_label)
    
    # outcome[0] is the loss; it is tracked to see whether it decreases
    loss = outcome[0]

    # store a plain float so the autograd graph of each batch can be freed
    total_loss.append(loss.item())

    loss.backward()

    # Clip the norm of the gradients to 1.0.
    # This is to help prevent the "exploding gradients" problem.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    optimizer.step()

    scheduler.step()
  
  #avg_train_loss = total_loss / len(bert_input)
  loss_values.append(total_loss)



iter: 0
step: 0


RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 3.00 GiB total capacity; 1.95 GiB already allocated; 39.58 MiB free; 2.00 GiB reserved in total by PyTorch)
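
The failed 192 MiB request is consistent with the size of a single attention-score tensor in `bert-base-uncased` (12 heads) for a batch of 16 sequences padded to 512 tokens in 32-bit floats. The sketch below shows the arithmetic; `attention_scores_mib` is a hypothetical helper, and it estimates one tensor only, not the full memory use of the model.

```python
def attention_scores_mib(batch_size, seq_len, num_heads=12, bytes_per_value=4):
    """Size of one layer's attention-score tensor (batch, heads, seq, seq) in MiB."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value / 2**20

# the setting used above, then a shorter sequence length, then a smaller batch
for batch_size, seq_len in [(16, 512), (16, 128), (4, 512)]:
    print(batch_size, seq_len, attention_scores_mib(batch_size, seq_len), "MiB")
```

Under these assumptions, passing a smaller `max_length` to the tokenizer or lowering `batch_size` in the `DataLoader` would shrink this tensor considerably. Whether that is enough for a 3 GiB GPU would need to be tested.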

In [None]:
print(loss_values)
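
`loss_values` holds one list of per-batch losses for each epoch, so the raw printout is long. A short sketch for reducing it to one average per epoch is given below; it assumes the losses were stored as plain floats (for example via `loss.item()`), and the numbers are invented.

```python
def epoch_averages(per_epoch_losses):
    """Average the per-batch losses of each epoch, skipping empty epochs."""
    return [sum(losses) / len(losses) for losses in per_epoch_losses if losses]

# invented losses for two epochs of three batches each
example_losses = [[0.70, 0.55, 0.40], [0.30, 0.25, 0.20]]
averages = epoch_averages(example_losses)
print(averages)
```

A falling sequence of averages is a quick sign that training is making progress.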

In [None]:
Y = torch.tensor(Y)  # test label
test_tensor = TensorDataset(test_input['input_ids'], test_input['attention_mask'], Y)
bert_test = DataLoader(test_tensor, batch_size=32)

In [None]:
print(len(Y))

8826


In [None]:
# Prediction on test set

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions = []

# Predict 
for batch in bert_test:
  # Move the batch to the same device as the model
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  
  # Telling the model not to compute or store gradients, saving memory and 
  # speeding up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)

  logits = outputs[0]

  # Detach the logits and move them to the CPU
  logits = logits.detach().cpu()
  #label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)

print('    DONE.')

    DONE.


In [None]:
print(len(predictions))

552


In [None]:
print(predictions[0])

tensor([[ 3.9957, -3.5112],
        [-4.1279,  3.7034],
        [-4.0963,  3.7056],
        [ 4.0848, -3.6926],
        [-4.1057,  3.7062],
        [ 3.9095, -3.6115],
        [ 3.9577, -3.6056],
        [ 4.0058, -3.5786],
        [ 4.0528, -3.6972],
        [ 4.0740, -3.6632],
        [-4.1151,  3.7369],
        [-4.0964,  3.6974],
        [ 4.0719, -3.6530],
        [ 4.0550, -3.7217],
        [ 4.0649, -3.6627],
        [ 4.1007, -3.6953]])
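
Each entry of `predictions` is a batch of logits, so turning them into class labels takes a concatenation and an `argmax`. The sketch below uses small made-up tensors in place of the real `predictions` and `Y`; the accuracy it computes is therefore illustrative only.

```python
import torch

# made-up stand-ins: two batches of logits and the matching true labels
example_predictions = [
    torch.tensor([[3.9, -3.5], [-4.1, 3.7]]),
    torch.tensor([[4.0, -3.6], [-4.0, 3.7]]),
]
example_labels = torch.tensor([0, 1, 1, 1])

all_logits = torch.cat(example_predictions, dim=0)   # shape: (4, 2)
predicted = all_logits.argmax(dim=1)                 # index of the larger logit per row
accuracy = (predicted == example_labels).float().mean().item()

print(predicted.tolist())
print(accuracy)
```

The same two lines (`torch.cat` followed by `argmax`) should apply to the real `predictions` list, provided the logits and labels are on the same device.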


In [None]:
import os

# Saving best practice: if you use the default names for the model, you can reload it using from_pretrained()

output_dir = './model_save/'

# Create output directory if needed
os.makedirs(output_dir, exist_ok=True)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))


In [None]:
# Load the fine-tuned model and vocabulary from disk
model = BertForSequenceClassification.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir)

# Copy the model to the GPU.
model.to(device)

# Use Existing BERT Model to Generate Vectors, then apply CNN
I use the existing BERT model (without fine-tuning) to generate a 768-dimensional vector for the title and for the body. Then, I apply a two-layer CNN to each of the vectors and combine the results to feed into 

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=labels)
# loss = outputs.loss
# logits = outputs.logits

In [None]:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
next_sentence = "The sky is blue due to the shorter wavelength of blue light."
encoding = tokenizer(prompt, next_sentence, return_tensors='pt')

outputs = model(**encoding)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
logits = outputs.logits
print(logits[0, 0] < logits[0, 1])

tensor(True)


In [None]:
print(logits[0, 0])

tensor(-3.0729, grad_fn=<SelectBackward>)


In [None]:
print(outputs)

NextSentencePredictorOutput(loss=None, logits=tensor([[-3.0729,  5.9056]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)


In [None]:
# call existing BERT model to get embedding vectors

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import sklearn.metrics as metrics

from transformers import BertTokenizer, BertModel

In [None]:
# this is the default model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello World!", return_tensors="pt")
outputs = model(**inputs)

In [None]:
print(len(outputs.last_hidden_state[0][0]))
#print(outputs.last_hidden_state[0][0])

768
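
The 768-dimensional vector printed above is the hidden state of the first token (`[CLS]`). As a rough sketch of the idea described at the start of this section, the module below passes a title vector and a body vector through two convolution layers each, concatenates the results, and classifies them. Everything here is an assumption for illustration: the class name `TitleBodyCNN`, the use of `Conv1d`, the channel and pooling sizes, the shared convolution weights, and the random tensors standing in for BERT outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TitleBodyCNN(nn.Module):
    """Hypothetical sketch: two Conv1d layers per input vector, then a linear classifier."""

    def __init__(self, num_labels=2):
        super().__init__()
        # padding=1 with kernel_size=3 keeps the length of the vector unchanged
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(32)      # fixed output length of 32
        self.fc1 = nn.Linear(2 * 16 * 32, 128)    # title and body features concatenated
        self.fc2 = nn.Linear(128, num_labels)

    def encode(self, vec):
        x = vec.unsqueeze(1)                # (batch, 768) -> (batch, 1, 768)
        x = F.relu(self.conv1(x))           # (batch, 8, 768)
        x = F.relu(self.conv2(x))           # (batch, 16, 768)
        x = self.pool(x)                    # (batch, 16, 32)
        return x.flatten(start_dim=1)       # (batch, 512)

    def forward(self, title_vec, body_vec):
        # the same convolution layers are applied to both inputs
        combined = torch.cat([self.encode(title_vec), self.encode(body_vec)], dim=1)
        return self.fc2(F.relu(self.fc1(combined)))   # (batch, num_labels) logits

# random tensors standing in for the BERT vectors of 4 titles and 4 bodies
title_vec = torch.randn(4, 768)
body_vec = torch.randn(4, 768)
logits = TitleBodyCNN()(title_vec, body_vec)
print(logits.shape)
```

In a full pipeline the random tensors would be replaced by `outputs.last_hidden_state[:, 0]` computed separately for the titles and the bodies.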


In [None]:
# simple CNN layer with BERT model

class CNN_Module (nn.Module):
  def __init__(self):
    super(CNN_Module, self).__init__()

    self.conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
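    # the input size was left blank in the original; as an assumption,
    # LazyLinear infers it from the first batch it sees
    self.d1 = nn.LazyLinear(128)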
    self.d2 = nn.Linear(128, 2)
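    # assumption: the pre-trained BERT encoder described above
    # (this line was left unfinished in the original)
    self.bert = BertModel.from_pretrained("bert-base-uncased")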