# The following model uses the DistilBERT language model as the base.
## Learn about it [here](https://en.wikipedia.org/wiki/BERT_(language_model)).

Possible ways to scale this up:
- Using the multilingual variant of BERT, which could extend detection to other languages because the model captures semantics across languages. Performance on those languages would initially be significantly worse, but even a small dataset in the target language could close much of the gap. **Main caveat**: significantly longer training time.
- Enlarging the dataset may yield noticeably better results.
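
As a rough sketch of the first option, switching to a multilingual model is mostly a matter of changing the checkpoint name. The snippet assumes the Hugging Face checkpoint `distilbert-base-multilingual-cased`; the `CHECKPOINTS` dictionary and the `load_model` helper are hypothetical names introduced only for illustration and are not used elsewhere in this notebook.

```python
# Hypothetical helper: choose a checkpoint by name and load it on demand.
CHECKPOINTS = {
    "english": "distilbert-base-uncased",                  # used in this notebook
    "multilingual": "distilbert-base-multilingual-cased",  # assumed alternative
}


def load_model(variant="english", num_labels=1):
    """Return (tokenizer, model) for the chosen checkpoint.

    The import sits inside the function so that defining the helper does not
    trigger any download; weights are fetched on the first call.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = CHECKPOINTS[variant]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_model("multilingual")` would then replace the tokenizer and model created further down; the rest of the pipeline should not need to change.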

# Imports

In [61]:
import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch implementation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm
from config import utils
from config import config as cfg

In [5]:
cwd = utils.get_repo_path()
notext = pd.read_csv(cwd/cfg.DATA_FOLDER/"fakenews_notext.csv")

In [11]:
notext.head()

| Unnamed: 0.1 | Unnamed: 0 | title | source_domain | tweet_num | real |
|---|---|---|---|---|---|
| 0 | 0 | Kandi Burruss Explodes Over Rape Accusation on... | toofab.com | 42 | 1 |
| 1 | 1 | People's Choice Awards 2018: The best red carp... | www.today.com | 0 | 1 |
| 2 | 2 | Sophia Bush Sends Sweet Birthday Message to 'O... | www.etonline.com | 63 | 1 |
| 3 | 3 | Colombian singer Maluma sparks rumours of inap... | www.dailymail.co.uk | 20 | 1 |
| 4 | 4 | Gossip Girl 10 Years Later: How Upper East Sid... | www.zerchoo.com | 38 | 1 |


In [15]:
notext['source_domain'] = notext['source_domain'].str.replace(r'^https?://(www\.)?|^www\.', '', regex=True)

In [16]:
notext.head()

| Unnamed: 0.1 | Unnamed: 0 | title | source_domain | tweet_num | real |
|---|---|---|---|---|---|
| 0 | 0 | Kandi Burruss Explodes Over Rape Accusation on... | toofab.com | 42 | 1 |
| 1 | 1 | People's Choice Awards 2018: The best red carp... | today.com | 0 | 1 |
| 2 | 2 | Sophia Bush Sends Sweet Birthday Message to 'O... | etonline.com | 63 | 1 |
| 3 | 3 | Colombian singer Maluma sparks rumours of inap... | dailymail.co.uk | 20 | 1 |
| 4 | 4 | Gossip Girl 10 Years Later: How Upper East Sid... | zerchoo.com | 38 | 1 |
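
To see what the prefix-stripping pattern does outside of pandas, here is a small standalone sketch using the same regular expression with the standard `re` module. The `strip_prefix` helper and the `example.com` / `example.org` inputs are made up for illustration; only `www.today.com` and `toofab.com` come from the data above.

```python
import re

# Same pattern as in the cell above: removes a leading scheme
# (with an optional "www.") or a bare leading "www.".
PREFIX = re.compile(r'^https?://(www\.)?|^www\.')


def strip_prefix(domain: str) -> str:
    """Return the domain without a leading scheme and/or 'www.'."""
    return PREFIX.sub('', domain)


samples = [
    "www.today.com",            # from the dataset
    "toofab.com",               # from the dataset, already clean
    "https://www.example.com",  # made-up input
    "http://example.org",       # made-up input
]
cleaned = [strip_prefix(s) for s in samples]
print(cleaned)
# ['today.com', 'toofab.com', 'example.com', 'example.org']
```

Because both alternatives are anchored with `^`, a `www.` that appears later in the string is left alone.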


In [17]:
titles = list(notext['title'])
targets = list(notext['real'])
domains = list(notext['source_domain'])

In [22]:
max([len(i) for i in titles]) # maximum title length

340

In [23]:
min([len(i) for i in titles]) # minimum title length

10

# Let the fun begin

## Train test split and tokenizers

In [62]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    titles, targets, test_size=0.2, random_state=42
)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=32, return_tensors='pt')
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=32, return_tensors='pt')

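
The `truncation=True, padding=True, max_length=32` arguments used above cut every title to at most 32 tokens and then pad shorter ones to the longest sequence in the batch. The toy function below mimics that behaviour on plain lists of made-up token ids. It is only a sketch: `pad_and_truncate` is a hypothetical name, and unlike the real tokenizer it does not handle special tokens.

```python
def pad_and_truncate(batch, max_length=32, pad_id=0):
    """Toy stand-in for truncation=True, padding=True on lists of token ids."""
    # Cut every sequence to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads to the longest sequence in the batch, not to max_length.
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[5, 6, 7, 8], [9, 10]])
print(ids)   # [[5, 6, 7, 8], [9, 10, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Since the train and test sets are encoded in separate calls, their padded widths can differ; each is padded to its own longest title, capped at 32 tokens.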

## Custom class for handling the titles

In [63]:
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels, dtype=torch.float)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encodings, train_labels)
test_dataset = NewsDataset(test_encodings, test_labels)

# Initialize the model using the pretrained DistilBERT model

In [64]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=1)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [65]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [66]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

In [67]:
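# Training loop
# Create the optimizer after the model so it tracks the current parameters.
# lr=5e-5 is an assumed, commonly used fine-tuning value, not taken from the original run.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    total_loss = 0.0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss  # with num_labels=1 and float labels this is an MSE loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # Sum of the per-batch losses over the epoch
    print(f"Epoch {epoch+1} Loss: {total_loss:.4f}")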

Epoch 1 Loss: 832.8175


Epoch 2 Loss: 831.6057


Epoch 3 Loss: 832.2799
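
One way to read these totals: each printed value is the sum of the per-batch losses, so dividing by the 1160 batches gives the mean loss. The sketch below does that and compares it with two constant predictors. It assumes the loss is a mean squared error on the raw logit, which, to our knowledge, is what `DistilBertForSequenceClassification` uses when `num_labels=1` and no `problem_type` is set. It also uses the share of real titles in the whole dataset (computed further down) as a stand-in for the training split. `constant_mse` is a hypothetical helper.

```python
# Numbers copied from the outputs in this notebook.
epoch_totals = [832.8175, 831.6057, 832.2799]
n_batches = 1160
frac_real = 0.7518968787722021  # share of titles labelled 1 (whole dataset)


def constant_mse(p, frac_pos):
    """MSE of always predicting p against 0/1 labels with a share frac_pos of ones."""
    return frac_pos * (1 - p) ** 2 + (1 - frac_pos) * p ** 2


per_batch = [total / n_batches for total in epoch_totals]
mse_always_zero = constant_mse(0.0, frac_real)          # output stuck at 0
mse_best_constant = constant_mse(frac_real, frac_real)  # output equal to the label mean

print([round(x, 4) for x in per_batch])
print(round(mse_always_zero, 4), round(mse_best_constant, 4))
```

Under these assumptions, a mean loss of about 0.72 sits close to what an output stuck near 0 would give (about 0.75) and far above the best constant prediction (about 0.19). That would suggest the weights are barely being updated, so the optimizer setup is worth checking first.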


# Accuracy?

In [81]:
model.eval()
predictions, true_labels = [], []
with torch.no_grad():
    for batch in tqdm(test_loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        # squeeze(-1) keeps the batch dimension even if the last batch has one item
        logits = outputs.logits.squeeze(-1)
        preds = (torch.sigmoid(logits) >= 0.5).long()
        predictions.extend(preds.cpu().numpy())
        # cast float labels to int so both lists hold the same type
        true_labels.extend(batch['labels'].long().cpu().numpy())

acc = accuracy_score(true_labels, predictions)
print(f"Test Accuracy: {acc:.2%}")

KeyboardInterrupt: 

# **SAVE THE MODEL AFTER TRAINING!**

---

In [70]:
torch.save(model.state_dict(), "fake_titles_model.pth")

---

In [72]:
sum(p.numel() for p in model.parameters()) # number of parameters

66954241

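Not the best results: the summed loss stays around 832 across all three epochs (roughly 0.72 per batch over 1160 batches), so the model has barely learned anything yet. However, we're on the right track.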

In [74]:
sum(targets)

17441

In [75]:
sum(targets)/len(targets)

0.7518968787722021

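The dataset is biased towards real titles: about 75% of the labels are 1, so a model that always predicts "real" would already reach roughly 75% accuracy. Balancing the dataset could improve the results. Increasing the number of epochs may also help (but not by too much, since that risks overfitting).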
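
To make the imbalance concrete, the sketch below derives the class counts from the two outputs above, the accuracy of a trivial always-real baseline, and a candidate weight for the positive class. The weight refers to the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`; using that loss here is an assumption, since the notebook does not set it up.

```python
n_real = 17441                   # sum(targets)
frac_real = 0.7518968787722021   # sum(targets) / len(targets)

# Recover len(targets) from the two printed values.
n_total = round(n_real / frac_real)
n_fake = n_total - n_real

# Accuracy of a model that always answers "real" (label 1).
baseline_acc = n_real / n_total

# Ratio of negatives to positives; below 1 because "real" is the majority class,
# so it would down-weight real titles relative to fake ones.
pos_weight = n_fake / n_real

print(n_total, n_fake)                              # 23196 5755
print(f"{baseline_acc:.2%}", round(pos_weight, 2))  # 75.19% 0.33
```

Any reported test accuracy should be compared against that 75% baseline rather than against 50%.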

# TESTING BY HAND

In [None]:
# RUN ONLY IF MODEL IS NOT YET TRAINED/LOADED
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1)
# Load the weights saved above; map_location lets a GPU-trained checkpoint load on CPU
model.load_state_dict(torch.load("fake_titles_model.pth", map_location="cpu"))
model.eval()

In [78]:
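def input_title(title):
    """Return (probability of 'real', predicted label) for a single title."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True, padding=True, max_length=32)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # match the model's device
    with torch.no_grad():
        outputs = model(**inputs)
    prob = torch.sigmoid(outputs.logits).item()
    prediction = 1 if prob >= 0.5 else 0
    return prob, prediction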

In [82]:
input_title("Nobody In White House Sure Who Guy Praying Over Trump Is")

(0.5013893842697144, 1)

In [83]:
input_title("Poll Finds Most Americans Would Swap Democracy For $100 Best Buy Gift Card")

(0.5002263784408569, 1)