# Preliminaries

The notebook was run on Google Colab with a Tesla T4 GPU. The pretrained models were finetuned on each dataset with the Hugging Face Trainer API. Two datasets are used: a local Filipino fake news dataset and the Kaggle fake news dataset from the UTK Machine Learning Club (2017).

This experiment focuses on an adversarial attack that removes degree adverbs from the input text.



In [None]:
from google.colab import drive
drive.mount('/content/drive')
!cp "/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/train.csv" "train.csv"
!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/full.csv" "full.csv"

Mounted at /content/drive


In [None]:
!pip install datasets
!pip install transformers

In [None]:
import torch
import numpy as np
import pandas as pd
import itertools
import string
import re
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback

# Kaggle Fake News Dataset

Use the `train.csv` file from the [Kaggle Fake News Dataset](https://www.kaggle.com/competitions/fake-news/data), which contains over 20,000 news articles labeled 0 (reliable) or 1 (unreliable).

In [None]:
df = pd.read_csv('train.csv')

## Pre-processing

For the **first experiment**, *adv_list* contains the list of degree adverbs from Flores et al. (2022).




In [None]:
adv_list = ['absolutely', 'amazingly', 'awfully', 'barely',
                'completely', 'considerably', 'decidedly', 'deeply', 
                'enormously', 'entirely', 'especially', 'exceptionally',
                'exclusively', 'extremely', 'fully', 'greatly', 'hardly',
                'hella', 'highly', 'hugely', 'incredibly', 'intensely',
                'majorly', 'overwhelmingly', 'really', 'remarkably',
                'substantially', 'thoroughly', 'totally', 'tremendously',
                'unbelievably', 'unusually', 'utterly', 'very']
special = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '“', '”', '‘', '’']

In [None]:
df = df[df['text'].notnull()]

The following cell creates a new dataframe whose text no longer contains the adverbs in *adv_list*.

In [None]:
df['text_new'] = df['text'].apply(lambda s: ' '.join([w for w in s.split() if w.lower() not in adv_list]))

df_old = df[['id','title','author','text','label']]
df_new  = df[['id','title','author','text_new','label']].rename(columns={'text_new':'text'})

Save the original and modified datasets to Drive, then copy them to local storage.

In [None]:
df_old.to_csv('/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/train_old.csv', index=False)
df_new.to_csv('/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/train_new.csv', index=False)

In [None]:
!cp "/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/train_old.csv" "train_old_kaggle.csv"
!cp "/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/train_new.csv" "train_new_kaggle.csv"

In [None]:
ids_old = df_old.text.str.contains('really$|really-|really ', flags = re.IGNORECASE, regex = True, na = False)
ids_new = df_new.text.str.contains('really$|really-|really ', flags = re.IGNORECASE, regex = True, na = False)

In the original dataframe, 3937 rows contain the adverb "really".
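
To make explicit what the search pattern in the cell above does and does not match, here is a small sketch that applies the same regular expression to a few hypothetical strings (the sample texts are made up for illustration and are not taken from the dataset).

```python
import re

# Same pattern as in the notebook cell: "really" followed by
# the end of the text, a hyphen, or a space (case-insensitive).
pattern = re.compile(r"really$|really-|really ", flags=re.IGNORECASE)

# Hypothetical sample texts, not taken from the dataset.
samples = [
    "I really like it",    # followed by a space
    "a really-big deal",   # followed by a hyphen
    "Not really",          # at the end of the text
    "Really, it is fine",  # followed by a comma
    "surreally odd",       # part of a longer word
]

# True where the pattern is found anywhere in the string.
matches = [bool(pattern.search(s)) for s in samples]
for text, found in zip(samples, matches):
    print(f"{found!s:5}  {text}")
```

The pattern has no leading word boundary, so it also matches inside longer words, and it does not match "really" followed by a comma or period. The row counts reported here should be read with that in mind.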

In [None]:
df_old[ids_old]

Unnamed: 0,id,title,author,text,label
9,9,"A Back-Channel Plan for Ukraine and Russia, Co...",Megan Twohey and Scott Shane,A week before Michael T. Flynn resigned as nat...,0
11,11,"BBC Comedy Sketch ""Real Housewives of ISIS"" Ca...",Chris Tomlinson,The BBC produced spoof on the “Real Housewives...,0
14,14,"Re: Yes, There Are Paid Government Trolls On S...",AnotherAnnie,"Yes, There Are Paid Government Trolls On Socia...",1
15,15,"In Major League Soccer, Argentines Find a Home...",Jack Williams,Guillermo Barros Schelotto was not the first A...,0
21,21,"Monica Lewinsky, Clinton Sex Scandal Set for ’...",Jerome Hudson,"Screenwriter Ryan Murphy, who has produced the...",0
...,...,...,...,...,...
20758,20758,Trump’s Opponents See Normal Americans as Depl...,pcr3,Trump’s Opponents See Normal Americans as Depl...,1
20765,20765,NFL Preview: Championship Match-Ups Prove Team...,Daniel Leberfeld,"The NFL is a league, so it should come as no...",0
20773,20773,Australia to hunt down anti-vax nurses and pro...,Vicki Batts,Australia to hunt down anti-vax nurses and pro...,1
20784,20784,Comment on World Heaves Sigh of Relief after T...,Debbie Menon,Finian Cunningham has written extensively on...,1


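84 rows in the modified dataframe still contain the adverb "really". Tokens in which the adverb is attached to punctuation or other special characters (such as a leading quotation mark or a trailing hyphen) do not match the whitespace-based filter and were therefore not removed.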
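
One possible way to also catch adverbs attached to punctuation is to match on word boundaries instead of whitespace tokens. The following is a sketch only and was not used to produce the results in this notebook; `remove_adverbs` and the short `adverbs` list are illustrative names standing in for the full *adv_list*.

```python
import re

# Short stand-in for the notebook's adv_list (assumption: lowercase adverbs).
adverbs = ["really", "very", "extremely"]

# \b marks a word boundary, so an adverb next to a quote, comma,
# or hyphen is matched as well; re.escape guards special characters.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, adverbs)) + r")\b",
    flags=re.IGNORECASE,
)

def remove_adverbs(text: str) -> str:
    """Delete the listed adverbs, then collapse leftover whitespace."""
    stripped = pattern.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()

print(remove_adverbs('She said "really?" and it was Very good.'))
```

Note that this approach leaves the surrounding punctuation in place (for example, a hyphen from "really-big" stays), so a further cleanup step may be wanted depending on the experiment.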

In [None]:
df_new[ids_new]

Unnamed: 0,id,title,author,text,label
194,194,Death of the ‘Two-State Solution’,Consortiumnews.com,"Death of the ‘Two-State Solution’ November 16,...",1
956,956,Indiana Parents Lose Their Baby and 2 Years of...,Admin - Orissa,Indiana Parents Lose Their Baby and 2 Years of...,1
1148,1148,Gay man finds it in himself to tolerate religi...,,Gay man finds it in himself to tolerate religi...,1
1230,1230,FBI Director Comey Asks President Putin: “Is A...,The European Union Times,An astonishing Security Council (SC) report ci...,1
1504,1504,Alabama Prison Officials Retaliate Against Pri...,Brian Sonenstein,Advocates say prison officials at the Kilby Co...,1
...,...,...,...,...,...
19895,19895,SpaceX Says It’s Ready to Launch Rockets Again...,Kenneth Chang,After the explosion in September of one of its...,0
20140,20140,Mary Jo White to Step Down as S.E.C. Chief - T...,Ben Protess and Alexandra Stevenson,Wall Street regulators began an exodus from Wa...,0
20182,20182,"Hot-Air Balloon Crash in Texas Kills 16, Offic...","David Montgomery, Maggie Astor and Christine H...","LOCKHART, Tex. — A balloon carrying 16 people ...",0
20278,20278,And Then There Was Trump - The New York Times,Thomas B. Edsall,How do you deal with an opponent immune to the...,0


## Finetuning

In [None]:
train, test = train_test_split(df_old, test_size=0.3)
train.to_csv('/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/train_old.csv', index=False)
test.to_csv('/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/test_old.csv', index=False)

!cp "/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/train_old.csv" "train_old_kaggle.csv"
!cp "/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/test_old.csv" "test_old_kaggle.csv"

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

In [None]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
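
As a reference for the four scores returned by `compute_metrics`, here is a minimal pure-Python sketch that derives them from confusion-matrix counts. `binary_metrics` is a hypothetical helper, not part of the notebook or of scikit-learn; it assumes binary labels with 1 as the positive class, which is also the scikit-learn default for these scores.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels and predictions for illustration.
scores = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(scores)
```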

In [None]:
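# `model`, `train_dataset`, and `val_dataset` are created in the next cell,
# which must be run before this one.
args = TrainingArguments(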
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=500,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    seed=0,
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [None]:
data = pd.read_csv('train_old_kaggle.csv')

# Load the pretrained model and its tokenizer
pretrained = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = AutoModelForSequenceClassification.from_pretrained(pretrained)

X = list(data["text"])
y = list(data["label"])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)

train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [None]:
trainer.train()

***** Running training *****
  Num examples = 10172
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3816
  Number of trainable parameters = 108311810


| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 500 | 0.138 | 0.036224 | 0.992202 | 0.995896 | 0.988683 | 0.992276 |
| 1000 | 0.0481 | 0.035997 | 0.992431 | 0.988769 | 0.996378 | 0.992559 |
| 1500 | 0.0349 | 0.027068 | 0.995642 | 0.992799 | 0.998642 | 0.995712 |
| 2000 | 0.0293 | 0.032587 | 0.994266 | 0.997721 | 0.990946 | 0.994322 |
| 2500 | 0.0162 | 0.02314 | 0.995183 | 0.995023 | 0.995473 | 0.995248 |
| 3000 | 0.0017 | 0.030926 | 0.995183 | 0.998179 | 0.992304 | 0.995233 |
| 3500 | 0.0022 | 0.030421 | 0.99633 | 0.99773 | 0.99502 | 0.996374 |


***** Running Evaluation *****
  Num examples = 4360
  Batch size = 8
Saving model checkpoint to output/checkpoint-500
Configuration saved in output/checkpoint-500/config.json
Model weights saved in output/checkpoint-500/pytorch_model.bin

TrainOutput(global_step=3816, training_loss=0.035778310470111215, metrics={'train_runtime': 3976.1934, 'train_samples_per_second': 7.675, 'train_steps_per_second': 0.96, 'total_flos': 8029096965365760.0, 'train_loss': 0.035778310470111215, 'epoch': 3.0})
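
The step count reported in the training log can be reproduced by hand. A small sketch, assuming no gradient accumulation (as the log states) and that the last partial batch of each epoch is kept, which is the Trainer default (`dataloader_drop_last=False`):

```python
import math

# Values taken from the training log.
num_examples = 10172   # "Num examples"
batch_size = 8         # "Total train batch size"
num_epochs = 3         # "Num Epochs"

# The final batch of an epoch may be smaller than batch_size,
# so the number of steps per epoch is rounded up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With evaluation every 500 steps, this is why the metrics table ends at step 3500: the next evaluation point (4000) lies beyond the last optimization step.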

Copy the finetuned model from local storage to Drive.

In [None]:
!cp -r "output" "/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/output"

## Evaluation

Use the best checkpoint, step 2500, which has the lowest validation loss.
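
The choice of checkpoint can be checked against the training log. This sketch assumes the selection criterion is the lowest validation loss, which is what the Trainer uses by default when `metric_for_best_model` is not set; the values are copied from the metrics table in the Finetuning section.

```python
# Validation loss per evaluation step, copied from the training log.
val_loss = {
    500: 0.036224,
    1000: 0.035997,
    1500: 0.027068,
    2000: 0.032587,
    2500: 0.02314,
    3000: 0.030926,
    3500: 0.030421,
}

# Pick the step whose validation loss is smallest.
best_step = min(val_loss, key=val_loss.get)
print(f"output/checkpoint-{best_step}")
```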

In [None]:
test_data = pd.read_csv("test_old_kaggle.csv")
X_test = list(test_data["text"])
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)
y_test = list(test_data["label"])

test_dataset = Dataset(X_test_tokenized)

model_path = "output/checkpoint-2500"
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

test_trainer = Trainer(model)

raw_pred, _, _ = test_trainer.predict(test_dataset)
y_pred = np.argmax(raw_pred, axis=1)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(accuracy, recall, precision, f1)

loading configuration file output/checkpoint-2500/config.json
Model config BertConfig {
  "_name_or_path": "output/checkpoint-2500",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file output/checkpoint-2500/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequ

0.9950232782148017 0.9968152866242038 0.9933354490637892 0.9950723255444286


In [None]:
train, test = train_test_split(df_new, test_size=0.3)
test.to_csv('/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/test_new.csv', index=False)

!cp "/content/drive/My Drive/198-adversarial-ml/Kaggle-Fake-News/test_new.csv" "test_new_kaggle.csv"
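
The split above is drawn independently of the earlier split of `df_old`, and neither call fixes a random seed, so the two test sets are not guaranteed to contain the same rows (and the new test set may include rows the model was trained on). A minimal sketch of one way to keep both test sets aligned is to draw the row indices once and apply them to both versions of the data. `split_indices` is a hypothetical helper written with the standard library; passing a fixed `random_state` to `train_test_split` on a shared index would serve the same purpose.

```python
import random

def split_indices(n_rows, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) from one seeded shuffle of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, reproducible
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Hypothetical stand-ins for the original and adverb-free texts.
original = [f"article {i} is really long" for i in range(10)]
modified = [text.replace(" really", "") for text in original]

# One split, applied to both versions of the data.
train_idx, test_idx = split_indices(len(original))
test_original = [original[i] for i in test_idx]
test_modified = [modified[i] for i in test_idx]

for before, after in zip(test_original, test_modified):
    print(before, "->", after)
```

With aligned test sets, any change in the scores can be attributed to the removed adverbs rather than to a different sample of articles.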

In [None]:
test_data = pd.read_csv("test_new_kaggle.csv")
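# Note: dropna() without `subset` drops rows with a missing value in any column
# (e.g. author), not only rows with missing text.
test_data = test_data.dropna()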
X_test = list(test_data["text"])
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)
y_test = list(test_data["label"])

test_dataset = Dataset(X_test_tokenized)

model_path = "output/checkpoint-2500"
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

test_trainer = Trainer(model)

raw_pred, _, _ = test_trainer.predict(test_dataset)
y_pred = np.argmax(raw_pred, axis=1)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(accuracy, recall, precision, f1)

0.9966887417218543 0.998262380538662 0.9939446366782007 0.9960988296488946


# Fake News Filipino Dataset

The provided dataset contains around 3,000 news articles in Filipino, evenly split between real and fake news.

In [None]:
df = pd.read_csv('full.csv')

## Pre-processing

For the **first experiment**, *adv_list* contains a list of degree adverbs commonly used in Filipino.

In [None]:
# Note: 'pa rin' is two words, so the whitespace-token filter below never matches it.
adv_list = ['masyado', 'medyo', 'tunay', 'kaagad', 'lubos', 'parang', 'bahagya', 'halos', 'lubhang', 'labis',
            'lalong', 'higit', 'talaga', 'totoo', 'pa rin', 'mabuti', 'mahirap', 'kamakailan', 'madalang', 'minsan']
special = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '“', '”', '‘', '’']

In [None]:
df = df[df['article'].notnull()]

The following cell creates a new dataframe whose articles no longer contain the adverbs in *adv_list*.

In [None]:
df['article_new'] = df['article'].apply(lambda s: ' '.join([w for w in s.split() if w.lower() not in adv_list]))

df_old = df[['label', 'article']]
df_new  = df[['label', 'article_new']].rename(columns={'article_new':'article'})

Save the original and modified datasets to Drive, then copy them to local storage.

In [None]:
df_old.to_csv('/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/full_old.csv', index=False)
df_new.to_csv('/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/full_new.csv', index=False)

In [None]:
!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/full_old.csv" "full_old_filipino.csv"
!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/full_new.csv" "full_new_filipino.csv"

In [None]:
ids_old = df_old.article.str.contains('kaagad$|kaagad-|kaagad ', flags = re.IGNORECASE, regex = True, na = False)
ids_new = df_new.article.str.contains('kaagad$|kaagad-|kaagad ', flags = re.IGNORECASE, regex = True, na = False)

In [None]:
df_old[ids_old]

Unnamed: 0,label,article
11,0,"Ayon kay SPO1 Jaycee Calma, may hawak ng kaso,..."
201,0,Pero hindi pa rin tumitigil ang pagkainis ng m...
241,0,"""Tingnan natin, pero may plano 'yan, alam niya..."
273,0,Ang resignation ni Darren Wilson ay kaagad na ...
339,0,Hawak na ngayon ng Mandaluyong City Police si ...
396,0,"Ayon kay Senior Supt. Bartolome Bustamante, he..."
491,0,Ang resignation ni Darren Wilson ay kaagad na ...
571,0,"Ayon sa mga magulang ng biktima, tatlong araw ..."
616,0,Si Bert ang naging gabay ng staff members ng n...
661,0,PINADAPA kaagad ng Dream Dad nina Zanjoe Marud...


Only one row containing the adverb *kaagad* is left in the modified dataframe.

In [None]:
df_new[ids_new]

Unnamed: 0,label,article
1763,1,Huli sa isinagawang entrapment operation ng Ph...


## Finetuning

The pretrained model is finetuned on the original dataset and then evaluated on both the original and the modified dataset. The pretrained model, *bert-tagalog-base-cased*, was trained on WikiText-TL-39, a corpus of 172,815 articles in Tagalog.

In [None]:
train, test = train_test_split(df_old, test_size=0.3)
train.to_csv('/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/train_old.csv', index=False)
test.to_csv('/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/test_old.csv', index=False)

!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/train_old.csv" "train_old_filipino.csv"
!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/test_old.csv" "test_old_filipino.csv"

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

In [None]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

In [None]:
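# `model`, `train_dataset`, and `val_dataset` are created in the next cell,
# which must be run before this one.
args = TrainingArguments(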
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=500,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    seed=0,
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [None]:
data = pd.read_csv('train_old_filipino.csv')

pretrained = 'jcblaise/bert-tagalog-base-cased'
tokenizer = AutoTokenizer.from_pretrained(pretrained)
model = AutoModelForSequenceClassification.from_pretrained(pretrained)

X = list(data["article"])
y = list(data["label"])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)

train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)

Some weights of the model checkpoint at jcblaise/bert-tagalog-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the mode

In [None]:
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 500 | 0.1729 | 0.280965 | 0.94362 | 0.9653 | 0.918919 | 0.941538 |


***** Running Evaluation *****
  Num examples = 674
  Batch size = 8
Saving model checkpoint to output/checkpoint-500
Configuration saved in output/checkpoint-500/config.json
Model weights saved in output/checkpoint-500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from output/checkpoint-500 (score: 0.2809652090072632).


TrainOutput(global_step=591, training_loss=0.14828649551533604, metrics={'train_runtime': 30487.6898, 'train_samples_per_second': 0.154, 'train_steps_per_second': 0.019, 'total_flos': 1239253070745600.0, 'train_loss': 0.14828649551533604, 'epoch': 3.0})

Copy the finetuned model to local storage

In [None]:
!cp -r "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/output" "output"
!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/train_old.csv" "train_old_filipino.csv"
!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/test_old.csv" "test_old_filipino.csv"

## Evaluation

Use the best checkpoint, saved at step 500.

In [None]:
test_data = pd.read_csv("test_old_filipino.csv")
X_test = list(test_data["article"])
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)
y_test = list(test_data["label"])

test_dataset = Dataset(X_test_tokenized)

model_path = "output/checkpoint-500"
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

test_trainer = Trainer(model)

raw_pred, _, _ = test_trainer.predict(test_dataset)
y_pred = np.argmax(raw_pred, axis=1)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(accuracy, recall, precision, f1)

loading configuration file output/checkpoint-500/config.json
Model config BertConfig {
  "_name_or_path": "output/checkpoint-500",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30101
}

loading weights file output

0.9438669438669439 0.9462151394422311 0.9462151394422311 0.9462151394422311


In [None]:
train, test = train_test_split(df_new, test_size=0.3)
test.to_csv('/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/test_new.csv', index=False)

!cp "/content/drive/My Drive/198-adversarial-ml/Fake-News-Filipino/test_new.csv" "test_new_filipino.csv"

In [None]:
test_data = pd.read_csv("test_new_filipino.csv")
X_test = list(test_data["article"])
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)
y_test = list(test_data["label"])

test_dataset = Dataset(X_test_tokenized)

model_path = "output/checkpoint-500"
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

test_trainer = Trainer(model)

raw_pred, _, _ = test_trainer.predict(test_dataset)

y_pred = np.argmax(raw_pred, axis=1)

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(accuracy, recall, precision, f1)

0.9688149688149689 0.9641350210970464 0.9723404255319149 0.9682203389830509


# Visualization of Results

In [3]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Table(
    header=dict(values=['Test Set', 'Accuracy (%)', 'Recall (%)', 'Precision (%)', 'F1-Score (%)'],
                line_color='darkslategray',
                fill_color='lightskyblue',
                align='left'),
    cells=dict(values=[['Kaggle Fake News (Original)', 'Kaggle Fake News (Adversarial)', 'Fake News Filipino (Original)', 'Fake News Filipino (Adversarial)'],
                       [99.50, 99.67, 94.39, 96.88],
                       [99.68, 99.83, 94.62, 96.41],
                       [99.33, 99.39, 94.62, 97.23],
                       [99.51, 99.61, 94.62, 96.82]],
               line_color='darkslategray',
               fill_color='lightcyan',
               align='left'))
])

fig.update_layout(width=1000, height=500)
fig.show()
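
To read the table as a comparison, the change from the original to the adversarial test set can be computed per metric. This is a small sketch using the percentages from the table above; the dictionary layout and the name `deltas` are illustrative choices, not part of the original notebook.

```python
metrics = ["accuracy", "recall", "precision", "f1"]

# Percentages copied from the results table.
results = {
    "Kaggle Fake News": {
        "original": [99.50, 99.68, 99.33, 99.51],
        "adversarial": [99.67, 99.83, 99.39, 99.61],
    },
    "Fake News Filipino": {
        "original": [94.39, 94.62, 94.62, 94.62],
        "adversarial": [96.88, 96.41, 97.23, 96.82],
    },
}

# Difference in percentage points (adversarial minus original).
deltas = {
    name: {
        m: round(adv - orig, 2)
        for m, orig, adv in zip(metrics, r["original"], r["adversarial"])
    }
    for name, r in results.items()
}

for name, d in deltas.items():
    print(name, d)
```

A positive value means the score on the adverb-free test set is higher than on the original test set.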

# Attribution


1.   [An Adversarial Benchmark for Fake News Detection Models](https://github.com/ljyflores/fake-news-adversarial-benchmark/blob/master/polarity_preprocessing.ipynb)
2.   [Fine-tuning pretrained NLP models with Huggingface’s Trainer](https://towardsdatascience.com/fine-tuning-pretrained-nlp-models-with-huggingfaces-trainer-6326a4456e7b)