## BERT for Fake News Classification:
### Yassin Bahid

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing (NLP) model developed by Google. It uses a transformer-based neural network architecture to learn contextual relations between the words in a text. BERT is pre-trained on a large amount of unlabeled text, and it is bidirectional: when it builds the representation of a word, it takes into account both the preceding and the following words. Earlier language models typically read text in one direction only, so BERT has a better view of the context in which a word is used.
BERT achieved state-of-the-art results on a wide range of NLP tasks, including sentiment analysis, named entity recognition, and question answering. It has been used in a variety of applications, such as chatbots, language translation, and text classification. It is an important advancement in the field of NLP because it processes natural language text more accurately and with more nuance than previous models.

In this report, we modify the original BERT pretrained model to classify fake news articles.

In [1]:
import pandas as pd
import time
from torch.optim import Adam
from tqdm import tqdm
from torch import nn
from transformers import BertModel
import torch
import numpy as np
from transformers import BertTokenizer
from torch.utils.data import Dataset, DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Running On...:', device)

Running On...: cuda


## 2 - Data:

We use a Kaggle dataset of news articles, each labeled as either true or fake. The dataset is available at https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv. We divide this dataset into training, validation, and test subsets.

In [2]:
truedf = pd.read_csv('data/True.csv')
truedf['label'] = 1

fakedf = pd.read_csv('data/Fake.csv')
fakedf['label'] = 0

data = pd.concat([truedf, fakedf], ignore_index=True)
data

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1
...,...,...,...,...,...
44893,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
44894,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
44895,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
44896,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0


In [3]:
# Use the uncased vocabulary so the tokenizer matches the 'bert-base-uncased' model used below.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


labels_idx = {'Fake':0,
          'True':1
          }

class DataClass(Dataset):

    def __init__(self, dataframe, tokenizer, max_length):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        text = str(self.data.loc[index, 'text'])
        label = int(self.data.loc[index, 'label'])

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
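
To make the effect of `truncation=True`, `padding='max_length'`, and `return_attention_mask=True` concrete, here is a minimal pure-Python sketch of what the tokenizer call produces for one article that has already been converted to token ids. The helper `pad_or_truncate` is hypothetical (it is not part of `transformers`), and the special-token ids 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are assumed values for the BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids


def pad_or_truncate(token_ids, max_length):
    """Hypothetical helper mimicking truncation plus max_length padding."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


# A short input is padded; a long input is truncated.
short_ids, short_mask = pad_or_truncate([7592, 2088], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every item returned by `DataClass` therefore has the same length, which is what lets the `DataLoader` stack items into a batch.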



In [4]:
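# Initialize the training, validation, and test sets (80/10/10 split).
# np.random.seed() returns None, so the time-based seed is passed to sample() directly.
shuffled = data.sample(frac=1, random_state=round(time.time()))
train_end, val_end = int(.8*len(data)), int(.9*len(data))
traindf, valdf, testdf = shuffled.iloc[:train_end], shuffled.iloc[train_end:val_end], shuffled.iloc[val_end:]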


traindf, valdf, testdf = traindf.reset_index(drop=True), valdf.reset_index(drop=True), testdf.reset_index(drop=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train, val, test = DataClass(traindf,  tokenizer, max_length=256), DataClass(valdf, tokenizer, max_length=256), DataClass(testdf,  tokenizer, max_length=256)

In [5]:
print(len(train))
print(len(val))

35918
4490
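
As a sanity check on the sizes printed above, the cut points at 80% and 90% can be reproduced with plain arithmetic. This is a small sketch, assuming the dataset has 44,898 rows in total (the count implied by the printed sizes); `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(n_rows, first_frac=0.8, second_frac=0.9):
    """Return (train, val, test) sizes for cut points at 80% and 90%."""
    first_cut = int(first_frac * n_rows)
    second_cut = int(second_frac * n_rows)
    return first_cut, second_cut - first_cut, n_rows - second_cut


train_size, val_size, test_size = split_sizes(44898)
print(train_size, val_size, test_size)
```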


In [6]:
train_dataloader = DataLoader(train, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test, batch_size=8, shuffle=True)

## 3 - Bert Model:

### 3.1 - Architecture:
BERT is an encoder-only model: it keeps the encoder of the original transformer and has no decoder. The encoder is a multi-layer bidirectional transformer, which means every token attends to the tokens on both its left and its right, and self-attention captures the relationships between words in the text. For a downstream task such as text classification or question answering, a small task-specific output layer (often called a head) is added on top of the encoder.
The encoder consists of 12 transformer layers in BERT-base and 24 in BERT-large. Each layer has two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. The multi-head self-attention mechanism lets each head attend to different parts of the input text, while the feed-forward network applies a non-linear transformation to each token representation.
After the final transformer layer, a pooling layer takes the hidden state of the first token (`[CLS]`) and passes it through a dense layer with a tanh activation, which gives a fixed-size representation of the input text. This representation is the input to the task-specific head, which is typically a simple feed-forward network that produces the final output for the given NLP task.
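
The self-attention step described above can be sketched in a few lines of pure Python. This is a simplified, single-head illustration under stated assumptions: the queries, keys, and values are all the raw token vectors (a real BERT layer applies learned projections and uses several heads), and the function names are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    weights = []
    for q in x:
        # Each token scores every token in the sequence, to its left and to its right.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all token vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and has a non-zero entry for every position, which is the sense in which the encoder is bidirectional.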


In [7]:
class BertClassifier(nn.Module):
    """Pre-trained BERT encoder followed by dropout and a linear classification layer."""

    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(self.bert.config.hidden_size, num_classes)
        self.relu = nn.ReLU()

    def forward(self, input_ids, attention_mask):
        # With return_dict=False, BertModel returns (last_hidden_state, pooler_output);
        # only the pooled [CLS] representation is used for classification.
        _, pooled_output = self.bert(input_ids=input_ids, attention_mask=attention_mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        # Note: ReLU clamps negative class scores to zero before they reach the loss.
        final_layer = self.relu(linear_output)

        return final_layer
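
The classifier returns one score per class, and `nn.CrossEntropyLoss` (used in the training cell) applies a softmax to those scores before taking the negative log-likelihood of the true class. The sketch below reproduces that computation in pure Python for a single example. `cross_entropy` and `relu` are illustrative helpers, not the PyTorch implementations, and the raw scores are made-up values. It also shows how the final ReLU changes the loss for the same raw scores.

```python
import math


def cross_entropy(logits, label):
    """Softmax followed by negative log-likelihood for one example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def relu(values):
    return [max(0.0, v) for v in values]


raw = [-3.0, 2.0]    # hypothetical linear-layer output for (Fake, True)
clamped = relu(raw)  # the negative score becomes 0.0

loss_raw = cross_entropy(raw, label=1)
loss_clamped = cross_entropy(clamped, label=1)
print(round(loss_raw, 4), round(loss_clamped, 4))
```

The predicted class is the same in both cases, but the clamped scores give a larger loss because the gap between the two classes can no longer grow on the negative side.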

### 3.2 - Training:

In [8]:
model = BertClassifier(num_classes=2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss().to(device)

num_epochs = 2
best_val_loss = float('inf')
for epoch in range(num_epochs):
    # Train
    model.train()
    for batch in train_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        # print(outputs.shape)
        loss = criterion(outputs, labels.long())
        loss.backward()
        optimizer.step()

    # Validate
    model.eval()
    val_loss = 0
    val_correct = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids, attention_mask)

            # Weight each batch loss by its size so the average is per example.
            loss = criterion(outputs, labels)
            val_loss += loss.item() * len(labels)
            val_correct += (torch.argmax(outputs, dim=1) == labels).sum().item()

    val_loss /= len(valdf)
    # val_accuracy is the number of correct predictions out of len(valdf), not a fraction.
    val_accuracy = val_correct
    # `loss` was last assigned in the validation loop, so the value printed as
    # 'Train Loss' is the loss of the final validation batch.
    print('epoch:', epoch, '  |  Train Loss:', loss, '|  Val Loss: ', val_loss, ' | Val Accuracy:', val_accuracy)
    

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


epoch: 0   |  Train Loss: tensor(0.0002, device='cuda:0') |  Val Loss:  0.0004038849848575106  | Val Accuracy: 4490
epoch: 1   |  Train Loss: tensor(2.7299e-05, device='cuda:0') |  Val Loss:  0.007017729406151778  | Val Accuracy: 4487
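
The `Val Accuracy` column above is printed as a count of correct predictions and not as a fraction. A minimal sketch of the conversion, using the validation-set size of 4,490 printed earlier (the helper `accuracy` is illustrative):

```python
def accuracy(correct, total):
    """Fraction of correct predictions."""
    return correct / total


val_size = 4490
for epoch, correct in enumerate([4490, 4487]):
    print(f"epoch {epoch}: {accuracy(correct, val_size):.4f}")
# prints:
# epoch 0: 1.0000
# epoch 1: 0.9993
```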


### 3.3 - Evaluating the model:

In [9]:
def evaluate(model, test_dataloader):
    """Print the accuracy of `model` on the batches in `test_dataloader`."""
    model.eval()
    total_correct = 0
    with torch.no_grad():
        # Iterate over the dataloader that was passed in.
        for batch in test_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            output = model(input_ids, attention_mask)

            total_correct += (output.argmax(dim=1) == labels).sum().item()

    print(f'Test Accuracy: {total_correct / len(test_dataloader.dataset): .3f}')

evaluate(model, test_dataloader)

Test Accuracy:  0.999


## 4 - Discussion and Conclusion:

The model does outstandingly well. As pointed out in the paper, only 2 epochs are needed to fine-tune the model. The reported accuracy of $99.9\%$ far exceeds any of the models from the now closed Kaggle competition. One could try to improve the results further by using the full 512 tokens that the original BERT model supports, instead of the 256 used here. We could also use RoBERTa, which is pre-trained on a larger corpus and has a ready-made model class for sequence classification. Both of these options are too large for my local machine and consistently run out of memory. The Appendix presents the code to train RoBERTa. Note that for the training to achieve good performance, one needs to tune certain hyperparameters, such as the learning rate and the maximum sequence length.
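
To give a rough sense of why doubling the sequence length strains memory, the sketch below counts only the entries of the attention score matrices, which grow with the square of the sequence length. It is a back-of-the-envelope estimate under stated assumptions: 12 layers and 12 heads as in BERT-base, a batch size of 8 as in this report, and no accounting for other activations, model weights, or optimizer state. The function name is illustrative.

```python
def attention_score_elements(batch_size, seq_len, num_heads=12, num_layers=12):
    """Count the entries of all attention score matrices in one forward pass."""
    return batch_size * num_heads * num_layers * seq_len * seq_len


at_256 = attention_score_elements(batch_size=8, seq_len=256)
at_512 = attention_score_elements(batch_size=8, seq_len=512)
print(at_256, at_512, at_512 / at_256)
```

Going from 256 to 512 tokens multiplies this term by four, so halving the batch size alone does not bring it back to its previous level.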

## Appendix - RoBERTa:

### A.1 - Introduction:

RoBERTa (Robustly Optimized BERT pre-training Approach) is a language model developed by Facebook AI and introduced in 2019. It is based on the same architecture as BERT but is trained using an improved pre-training approach that enhances the original BERT model.
RoBERTa is a bidirectional transformer-based neural network model that uses a self-attention mechanism to understand the context of the words in the input text. It consists of multiple transformer layers, each of which contains a multi-head attention mechanism, a feed-forward neural network, and layer normalization. Unlike BERT, RoBERTa is pre-trained on a larger and more diverse corpus of text, for longer, and with larger batch sizes. RoBERTa also employs training techniques such as dynamic masking and training on longer sequences to improve its performance. With dynamic masking, a new set of masked tokens is chosen each time a sequence is fed to the model, instead of fixing the mask once during preprocessing. Training on longer sequences helps the model capture longer-range dependencies between words.
Overall, the RoBERTa architecture is highly effective in capturing the relationships between words in natural language text and has shown significant improvements in various NLP tasks, including sentiment analysis, question answering, and text classification.

### A.2 - Similarities and Differences with BERT:

Similarities:

- Both models are based on the transformer architecture, which uses self-attention mechanisms to process input sequences.
- Both models use a multi-layer bidirectional transformer encoder that processes the input text, with a task-specific output layer that produces the final output.
- Both models can be fine-tuned on a wide range of NLP tasks, such as sentiment analysis, question answering, and language translation.

Differences:

- RoBERTa was trained on a larger and more diverse corpus of data than BERT, using additional training techniques such as dynamic masking, so the model does not see the same masked positions every time it encounters a sequence. This has led to RoBERTa outperforming BERT on several NLP benchmarks.
- RoBERTa removed the next sentence prediction task from pre-training, allowing the model to focus on the masked language modeling task. BERT's pre-training includes this task.
- RoBERTa uses larger batch sizes and longer training times than BERT, which allows it to better capture complex patterns in the data.
- RoBERTa has shown better performance than BERT on several NLP tasks, particularly those that require more generalization.
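
The difference between static and dynamic masking can be sketched in pure Python. This is a simplified illustration, not the RoBERTa preprocessing code: it masks about 15% of the tokens by replacing them with `[MASK]` and ignores the rule that some selected tokens are swapped for random tokens or left unchanged. The sentence and the helper `mask_tokens` are made up for the example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace a random subset of tokens with [MASK] (simplified)."""
    n_mask = max(1, round(mask_prob * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "the senate passed the budget bill after a long debate on friday".split()

# Static masking: the pattern is drawn once and reused in every epoch.
static = mask_tokens(tokens, random.Random(0))
static_epochs = [static for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for masked in dynamic_epochs:
    print(" ".join(masked))
```

With static masking the model is asked to recover the same positions in every epoch, while dynamic masking usually asks for different positions each time.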

In [13]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2).to(device)


# Use the same max_length for all three splits; roberta-base cannot handle 1024-token inputs.
train, val, test = DataClass(traindf, tokenizer, max_length=256), DataClass(valdf, tokenizer, max_length=256), DataClass(testdf, tokenizer, max_length=256)
train_dataloader = DataLoader(train, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test, batch_size=8, shuffle=True)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.

### A.3 - Training:

In [None]:
# Create a fresh optimizer for the RoBERTa parameters; the learning rate is a starting point to tune.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

num_epochs = 5
for epoch in range(num_epochs):
    # Train
    model.train()
    train_loss = 0.0
    correct_predictions = 0

    for batch in train_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        optimizer.zero_grad()
        # Passing labels makes the model compute the cross-entropy loss itself.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs.loss, outputs.logits
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        correct_predictions += (preds == labels).sum().item()
    train_loss /= len(train_dataloader)
    train_accuracy = correct_predictions / len(train_dataloader.dataset)

    # Validate
    model.eval()
    val_loss = 0.0
    val_correct = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            val_loss += outputs.loss.item()
            preds = torch.argmax(outputs.logits, dim=1)
            val_correct += (preds == labels).sum().item()

    # Average the loss over batches and report accuracy as a fraction.
    val_loss /= len(val_dataloader)
    val_accuracy = val_correct / len(val_dataloader.dataset)

    print('epoch:', epoch, '  |  Train Loss:', train_loss, '|  Val Loss: ', val_loss, ' | Val Accuracy:', val_accuracy)
    