In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **`Importing Libraries`**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import TrainingArguments , Trainer
import torch
from torch.utils.data import Dataset, DataLoader

## **`Workflow for Fine-Tuning RoBERTa`**

##### **Step 1: Load Dataset:**
* Split into train and test set.

##### **Step 2: Load Pretrained RoBERTa Model & Tokenizer:**
* Tokenization with Truncation & Padding.

##### **Step 3: Convert to PyTorch Dataset:**
* Create Dataloaders

##### **Step 4: Setup Training Configuration with Regularization**
* Setting LR, batch size, epochs, etc.

##### **Step 5: Train the Model**

##### **Step 6: Evaluate the Model on Test Set**

##### **Step 7: Save & Load Model for Inference**

##### **Step 8: Predict on New Articles**


## **`Step 1: Loading Data`**

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Fake_News_Classification/Data/titles_text_combined.csv')
df.head()

| | Unnamed: 0 | title | text | subject | date | label | title_length | text_length | full_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2619 | Ex-CIA head says Trump remarks on Russia inter... | Former CIA director John Brennan on Friday cri... | politicsNews | July 22, 2017 | 1 | 67 | 2733 | Ex-CIA head says Trump remarks on Russia inter... |
| 1 | 16043 | YOU WON’T BELIEVE HIS PUNISHMENT! HISPANIC STO... | How did this man come to OWN this store? There... | Government News | Jun 19, 2017 | 0 | 121 | 2630 | YOU WON’T BELIEVE HIS PUNISHMENT! HISPANIC STO... |
| 2 | 876 | Federal Reserve governor Powell's policy views... | President Donald Trump on Thursday tapped Fede... | politicsNews | November 2, 2017 | 1 | 64 | 4052 | Federal Reserve governor Powell's policy views... |
| 3 | 19963 | SCOUNDREL HILLARY SUPPORTER STARTS “TrumpLeaks... | Hillary Clinton ally David Brock is offering t... | left-news | Sep 17, 2016 | 0 | 72 | 1131 | SCOUNDREL HILLARY SUPPORTER STARTS “TrumpLeaks... |
| 4 | 10783 | NANCY PELOSI ARROGANTLY DISMISSES Questions on... | Pleading ignorance is a perfect ploy for Nancy... | politics | May 26, 2017 | 0 | 104 | 1061 | NANCY PELOSI ARROGANTLY DISMISSES Questions on... |


In [4]:
# Checking dimensions of the data (no. of rows, no. of cols)
print(f"Dataset shape: {df.shape}")

Dataset shape: (27209, 9)


In [5]:
#Checking Distribution of labels
df['label'].value_counts()   # 0: Fake, 1: Real

| label | count |
|---|---|
| 1 | 14422 |
| 0 | 12787 |


In [6]:
#Splitting Data into Training and Test Set
X = df['full_text'].tolist()
y = df['label'].tolist()

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [7]:
type(X_train)

list

In [8]:
print(f"Training Data Size: {len(X_train)}")
print(f"Testing Data Size: {len(X_test)}")
print()
# Counting real and fake news
print(f"Real news articles in Training data: {y_train.count(1)}")
print(f"Fake news articles in Training data: {y_train.count(0)}")

Training Data Size: 21767
Testing Data Size: 5442

Real news articles in Training data: 11537
Fake news articles in Training data: 10230
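
As a quick sanity check on `stratify=y`, the class balance can be recomputed from the counts printed above. This is only a sketch: the numbers are copied by hand from the notebook output, the test-set counts are derived by subtraction, and `real_share` is a helper name introduced here for illustration.

```python
# Counts copied from the notebook's printed output.
total_real, total_fake = 14422, 12787
train_real, train_fake = 11537, 10230

total = total_real + total_fake
train = train_real + train_fake

# The held-out counts follow by subtraction.
test_real = total_real - train_real
test_fake = total_fake - train_fake
test = test_real + test_fake


def real_share(real, size):
    """Fraction of articles labelled 1 (real) in a split (illustrative helper)."""
    return real / size


print(f"Real share, full data: {real_share(total_real, total):.4f}")
print(f"Real share, train:     {real_share(train_real, train):.4f}")
print(f"Real share, test:      {real_share(test_real, test):.4f}")
print(f"Test fraction:         {test / total:.4f}")
```

If the split is stratified, the three shares should agree to about three decimal places and the test fraction should be close to 0.2.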


## **`Step 2: Load Pretrained RoBERTa Model & Tokenizer:`**

* RoBERTa drops BERT's Next Sentence Prediction (NSP) objective and is pretrained with masked language modeling (MLM) only.

In [9]:
model_name = "roberta-base"

tokenizer = RobertaTokenizer.from_pretrained(model_name)

roberta_model = RobertaForSequenceClassification.from_pretrained(
                                    model_name,
                                    num_labels=2,
                                    hidden_dropout_prob=0.3,  # Default is 0.1, increased to 0.3
                                    attention_probs_dropout_prob=0.3
                                  )

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **`Tokenization with Truncation & Padding`**

In [10]:
# Tokenize texts
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(X_test, truncation=True, padding=True, max_length=512)
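
To make the effect of `truncation=True, padding=True` concrete, here is a toy sketch on plain lists of token ids. It is not the tokenizer's actual implementation: `pad_and_truncate` and the sample ids are hypothetical, and the real tokenizer also keeps the closing special token when it truncates, which this simplified slice does not.

```python
PAD_ID = 1  # RoBERTa's padding token id (matches padding_idx=1 in the model printout)


def pad_and_truncate(batch, max_length):
    """Toy stand-in for truncation=True, padding=True on lists of token ids."""
    # Truncation: drop everything past max_length.
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads to the longest sequence in the batch, not to max_length.
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (target - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up "articles": one longer than max_length, one shorter.
toy_batch = [[0, 100, 200, 300, 400, 2], [0, 100, 2]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])       # [[0, 100, 200, 300], [0, 100, 2, 1]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model ignore the padded positions.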

## **`Step 3: Convert to PyTorch Dataset:`**

In [11]:
class FakeNewsDataset(Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __len__(self):
    """Returns the number of samples in the dataset."""
    return len(self.labels)

  def __getitem__(self, idx):
    """Returns tokenized inputs and labels for a given index."""
    item = {key: torch.tensor(val[idx], dtype=torch.long) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
    return item
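
The indexing inside `__getitem__` can be illustrated without PyTorch. The sketch below uses made-up encodings and a hypothetical `get_item` function that mirrors the method above, minus the tensor conversion. It shows that each sample is a dict with one entry per tokenizer field plus a `labels` key.

```python
# Hypothetical tokenizer output for three articles (toy values).
encodings = {
    "input_ids": [[0, 11, 2], [0, 12, 2], [0, 13, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
labels = [1, 0, 1]


def get_item(idx):
    """Mirror of __getitem__ without the torch.tensor conversion."""
    # Pick row idx from every field of the encodings.
    item = {key: values[idx] for key, values in encodings.items()}
    # Attach the class label under the key "labels".
    item["labels"] = labels[idx]
    return item


sample = get_item(1)
print(sample)
# {'input_ids': [0, 12, 2], 'attention_mask': [1, 1, 1], 'labels': 0}
```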

In [12]:
# Create dataset objects
train_dataset = FakeNewsDataset(train_encodings, y_train)
val_dataset = FakeNewsDataset(val_encodings, y_test)

## **`Step 4: Setup Training Configuration with Regularization`**

1. **Using Label Smoothing:** Label smoothing is a regularization technique that improves generalization by softening the training targets. Instead of assigning a hard 0 or 1 to each class, it moves a small amount of probability mass from the true class to the other classes.

Why use label smoothing?

In standard classification, we use one-hot encoded labels, where:

* Fake News → [1, 0]
* Real News → [0, 1]

Training against these targets pushes the model toward overconfident predictions. Label smoothing softens the labels. With a smoothing factor of 0.1 and two classes, the usual formulation gives:

* Fake News → [0.95, 0.05]
* Real News → [0.05, 0.95]

This discourages the model from assigning excessive confidence to any one class, which reduces overfitting and improves robustness to noisy labels.
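
A small worked example may help here. It assumes the common formulation in which the one-hot target is mixed with a uniform distribution, `(1 - epsilon) * one_hot + epsilon / num_classes`. Whether the training code uses exactly this variant is an assumption. Under it, a factor of 0.1 with two classes yields targets of 0.95 and 0.05. The helper names and the example probabilities are illustrative only.

```python
import math


def smooth_targets(true_class, num_classes, epsilon):
    """Blend a one-hot target with a uniform distribution over the classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if c == true_class else uniform
        for c in range(num_classes)
    ]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


hard_fake = [1.0, 0.0]                      # one-hot target for Fake News
soft_fake = smooth_targets(0, 2, 0.1)       # approximately [0.95, 0.05]

overconfident = [0.999, 0.001]              # hypothetical model output
calibrated = [0.95, 0.05]                   # hypothetical model output

for name, probs in [("overconfident", overconfident), ("calibrated", calibrated)]:
    print(
        f"{name:>13}: hard-label loss = {cross_entropy(hard_fake, probs):.4f}, "
        f"smoothed loss = {cross_entropy(soft_fake, probs):.4f}"
    )
```

With hard labels the overconfident prediction has the lower loss. With smoothed labels the ordering flips, which is the regularizing effect described above.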



In [13]:
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Fake_News_Classification/Saved_Models/robertaResults",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,              # Decoupled weight decay (AdamW), similar in effect to L2 regularization
    label_smoothing_factor=0.1,     # Apply 10% smoothing
    logging_dir="/content/drive/MyDrive/Fake_News_Classification/Saved_Models/robertaResultslogs",
    logging_steps=10,
    save_total_limit=2,
    report_to="none",
    load_best_model_at_end=True
)
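
The batch settings above interact: with gradient accumulation, the optimizer only steps every second batch. The sketch below works out the effective batch size and the expected number of optimizer updates. It assumes a single GPU and that updates per epoch are rounded down. With those assumptions it reproduces the `global_step=4080` reported by `trainer.train()` in this notebook.

```python
import math

train_size = 21767      # training examples, from the split above
per_device_batch = 8    # per_device_train_batch_size
accumulation = 2        # gradient_accumulation_steps
epochs = 3              # num_train_epochs
num_devices = 1         # assumption: a single GPU

# Examples contributing to each optimizer update.
effective_batch = per_device_batch * accumulation * num_devices

# Mini-batches per epoch (the last batch may be smaller).
batches_per_epoch = math.ceil(train_size / (per_device_batch * num_devices))

# Optimizer updates per epoch, assuming rounding down.
updates_per_epoch = batches_per_epoch // accumulation
total_updates = updates_per_epoch * epochs

print(f"Effective batch size: {effective_batch}")      # 16
print(f"Batches per epoch:    {batches_per_epoch}")    # 2721
print(f"Updates per epoch:    {updates_per_epoch}")    # 1360
print(f"Total updates:        {total_updates}")        # 4080
```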

## **`Evaluation Metrics`**

In [14]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average=None, zero_division=0)

    macro_f1 = f1_score(labels, preds, average="macro")

    return {
        "accuracy": accuracy,
        "f1_fake": f1[0],      # F1-score for Fake News (Class 0)
        "f1_real": f1[1],      # F1-score for Real News (Class 1)
        "macro_f1": macro_f1
    }
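
To show what the per-class and macro F1 scores measure, here is a dependency-free sketch on a small set of made-up labels. `f1_for_class` is a hypothetical helper written for illustration. The notebook itself relies on scikit-learn for these numbers.

```python
def f1_for_class(labels, preds, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == cls and p == cls)
    fp = sum(1 for y, p in pairs if y != cls and p == cls)
    fn = sum(1 for y, p in pairs if y == cls and p != cls)
    if tp == 0:
        return 0.0  # same convention as zero_division=0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 fake (0) and 5 real (1) articles, with two mistakes.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
preds = [0, 0, 1, 1, 1, 1, 1, 0]

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
f1_fake = f1_for_class(labels, preds, 0)
f1_real = f1_for_class(labels, preds, 1)
macro_f1 = (f1_fake + f1_real) / 2  # unweighted mean over classes

print(f"accuracy={accuracy:.4f}  f1_fake={f1_fake:.4f}  "
      f"f1_real={f1_real:.4f}  macro_f1={macro_f1:.4f}")
# accuracy=0.7500  f1_fake=0.6667  f1_real=0.8000  macro_f1=0.7333
```

Macro F1 weights both classes equally, so it drops faster than accuracy when the smaller class is predicted poorly.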

## **`Step 5: Train the Model`**

In [15]:
trainer = Trainer(
    model = roberta_model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    compute_metrics = compute_metrics       # Custom function created above
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 Fake | F1 Real | Macro F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2 | 0.20586 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.1994 | 0.20574 | 1.0 | 1.0 | 1.0 | 1.0 |


TrainOutput(global_step=4080, training_loss=0.20568439037192102, metrics={'train_runtime': 1497.4482, 'train_samples_per_second': 43.608, 'train_steps_per_second': 2.725, 'total_flos': 1.717115369490432e+16, 'train_loss': 0.20568439037192102, 'epoch': 2.9981624402793092})
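
The losses stay near 0.2 even though accuracy is 1.0. One plausible explanation is `label_smoothing_factor=0.1`: against smoothed targets, the cross-entropy cannot go below the entropy of the target distribution. The sketch below computes that floor. It assumes the standard smoothing formulation, with targets of 0.95 and 0.05 for two classes, so treat it as a rough check rather than an exact account of the loss the Trainer computes.

```python
import math

epsilon, num_classes = 0.1, 2

# Smoothed target probabilities for the true and the other class.
on_target = (1.0 - epsilon) + epsilon / num_classes
off_target = epsilon / num_classes

# Cross-entropy is minimized when the prediction equals the target,
# and the minimum value is the entropy of the target distribution.
loss_floor = -(on_target * math.log(on_target) + off_target * math.log(off_target))

print(f"Smoothed-loss floor: {loss_floor:.4f}")  # 0.1985

# Reported values from the training table, for comparison.
reported_train_loss = 0.1994
reported_val_loss = 0.20574
print(f"Gap to reported training loss:   {reported_train_loss - loss_floor:.4f}")
print(f"Gap to reported validation loss: {reported_val_loss - loss_floor:.4f}")
```

Under this assumption, a loss of about 0.2 means the model is close to the best value the smoothed objective allows, not that it is underfitting.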

## **`Step 6: Evaluate the Model`**

In [16]:
eval_results = trainer.evaluate()
print(f"Test Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"\nFake News F1 Score: {eval_results['eval_f1_fake']:.4f}")
print(f"\nReal News F1 Score: {eval_results['eval_f1_real']:.4f}")
print(f"\nMacro F1 Score: {eval_results['eval_macro_f1']:.4f}")

Test Accuracy: 1.0000

Fake News F1 Score: 1.0000

Real News F1 Score: 1.0000

Macro F1 Score: 1.0000


## **`Step 7: Save Model and Tokenizer`**

In [17]:
trainer.save_model("/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_model")

tokenizer.save_pretrained("/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_tokenizer")

('/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_tokenizer/tokenizer_config.json',
 '/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_tokenizer/special_tokens_map.json',
 '/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_tokenizer/vocab.json',
 '/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_tokenizer/merges.txt',
 '/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_tokenizer/added_tokens.json')

## **`Step 8: Load the Saved Model and Use It for Prediction`**

In [18]:
model_path = "/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_model"

tokenizer_path = "/content/drive/MyDrive/Fake_News_Classification/Saved_Models/roberta_tokenizer"

In [19]:
# Loading the tokenizer
tokenizer = RobertaTokenizer.from_pretrained(tokenizer_path)

In [20]:
# Load the model
model = RobertaForSequenceClassification.from_pretrained(model_path)
model.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.3, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.3, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

In [21]:
def predict_news_article(article_text):
    # Tokenize the input text
    inputs = tokenizer(article_text, truncation=True, padding=True, max_length=512, return_tensors="pt")

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted class
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

    # Convert prediction to label
    label_map = {0: "Fake News", 1: "Real News"}
    return label_map[predicted_class]
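
`predict_news_article` returns only the arg-max label. If a confidence value is also wanted, the logits can be passed through a softmax. The sketch below does this in plain Python on hypothetical logits. In the notebook the equivalent call would be `torch.softmax(logits, dim=1)`, and the logit values here are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


label_map = {0: "Fake News", 1: "Real News"}

logits = [-1.2, 2.3]  # hypothetical model output for one article
probs = softmax(logits)

# Same decision rule as torch.argmax in the function above.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(f"Prediction: {label_map[predicted_class]} "
      f"(confidence {probs[predicted_class]:.3f})")
```

Because the model was trained with label smoothing, its probabilities are not expected to approach 1.0 even on easy examples.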

In [22]:
sample_news1 = '''
All England: How with long rallies and tall tosses, Malvika Bansod ousted World No 12 Yeo Jia Min
Malvika Bansod politely declined the offer of having legendary coach Irwansyah sit for her All England opening match against Singaporean Yeo Jia Min. The Indonesian perfectly understood, as Malvika and her regular coach for the last two years at Thane’s Shrikant Vad Academy, Vignesh Devlekar, had a plan to take down the World No 12.

The 23-year-old from Nagpur had lost to Jia Min previously and was coming off a first-round loss from Orleans. But pushing herself beyond limits of exhaustion, with both players utterly knackered by the end, Malvika recorded a stunning 21-13, 10-21, 21-17 victory to advance to the second round. Things were tricky at 11-9 in the decider, but Malvika did well to conserve energy, and mix her usually well-executed high tosses and lifts with attacking openings on cross shots as Jia Min tired out, fading off at the baseline.

Vignesh had won five national ranking titles in 2019 in doubles, and played Maldives Open, the only international trip he could afford on his parents’ salaries that year – mother a BMC teacher, and father a clerk in a PSU. “The lockdown ended my playing dreams and I had no funds anyway. But when coaching Malvika, I lean on my weaknesses – I never had a big attack or great physical strength. I was good at finding solutions,” Devlekar says. The long rallies and tall tosses that would strain Jia Min, the coach and athlete had analysed thoroughly.
In addition, Malvika dumped low serves in the large Birmingham hall for high serves -again pressuring Jia Min’s neck – and injected pace into her first stroke. The Indian was prepared for the long rallies, but had no intention of defending endlessly. She created openings, pouncing on the short lengths.
Malvika has compiled nearly a dozen thick diaries where she jots down details of fitness workouts she observes when at tournaments and national camps. “She must be the only player at this level (World No 28) who designs own fitness. We are desperately looking for an experienced trainer but we don’t have the funds for it yet. So, she plans it herself,” Vignesh notes.
Same shots, different paths
Malvika, a cerebral player, has spent last few months devising two strokes from the same position, guided by Vad and Vignesh, tweaking angles with wrist work, because a giant smash isn’t suddenly going to materialise. “We work within our limitations, but we are looking to get her stronger if we get a trainer,” he adds.

The engineering graduate who had beaten Olympic bronze medallist Gregoria Tunjung at China, has immersed herself even deeper into badminton, and shut out all noise that questions if she has a future when 23 already. “Qualifying for LA Olympics is the plan. We are working towards it,” Vignesh says, adding they hit it off, geeking out on the sport because he was in constant analysing/plotting mode as coach. A BWF Level 1 certified coach, he’s pursuing his MBA alongside, but the natural aptitude for coaching and the bond he struck with Malvika has brought good results.
All evenings see her work on fitness by herself. “She’s working hard like crazy. We don’t have funds for beyond the coach. Hopefully if we get results and few Top 10 wins, they will consider funding her,” Vignesh says.


'''

In [23]:
prediction = predict_news_article(sample_news1)
print("Prediction:", prediction)

Prediction: Real News


In [24]:
sample_news2 = '''
Their untold stories need to be told': Teens capture India's labourers in pictures
The elderly woman gazes wistfully into the distance, her hands curled over a basket of tobacco, surrounded by the hundreds of cigarettes she has spent hours rolling by hand.

The photograph is one of several snapped by student Rashmitha T in her village in Tamil Nadu, featuring her neighbours who make traditional Indian cigarettes called beedis.

"No-one knows about their work. Their untold stories need to be told," Rashmitha told the BBC.

Her pictures were featured in a recent exhibition about India's labourers titled The Unseen Perspective at the Egmore Museum in Chennai.

All the photographs were taken by 40 students from Tamil Nadu's government-run schools, who documented the lives of their own parents or other adults.

From quarry workers to weavers, welders to tailors, the pictures highlight the diverse, backbreaking work undertaken by the estimated 400 million labourers in India.
Many beedi rollers, for instance, are vulnerable to lung damage and tuberculosis due to their dangerous work, said Rashmitha.

"Their homes reek of tobacco, you cannot stay there long," she said, adding that her neighbours sit outside their homes for hours rolling beedis.

For every 1,000 cigarettes they roll, they only earn 250 rupees ($2.90; £2.20), she told the BBC.In the state's Erode district, Jayaraj S captured a photo of his mother Pazhaniammal at work as a brick maker. She is seen pouring a clay and sand mixture into moulds and shaping bricks by hand.

Jayaraj had to wake up at 2am to snap the picture, because his mother begins working in the middle of the night.

"She has to start early to avoid the afternoon sun," he said.

It was only when he embarked on his photography project that he truly realised the hardships she has to endure, he added.

"My mother frequently complains of headaches, leg pain, hip pain and sometimes faints," he said.In the Madurai district, Gopika Lakshmi M captured her father Muthukrishnan selling goods from an old van.

Her father has to get a dialysis twice a week after he lost a kidney two years ago.

"He drives to nearby villages to sell goods despite being on dialysis," Lakshmi says.

"We don't have the luxury of resting at home."

But despite his serious condition, her father "looked like a hero" as he carried on with his gruelling daily routine, said Gopika.Taking pictures with a professional camera was not easy initially, but it got easier after months of training with experts, said the students.

"I learned how to shoot at night, adjust shutter speed and aperture," said Keerthi, who lives in the Tenkasi district.

For her project, Keerthi chose to document the daily life of her mother, Muthulakshmi, who owns a small shop in front of their house.

"Dad is not well, so mum looks after both the shop and the house," she said. "She wakes up at 4am and works until 11pm."

Her photos depict her mother's struggles as she travels long distances via public buses to source goods for her store.

"I wanted to show through photographs what a woman does to improve her children's lives," she said.Mukesh K spent four days with his father, documenting his work at a quarry.

"My father stays here and comes home only once a week," he said.

Mukesh's father works from 3am till noon, and after a brief rest, works from 3pm to 7pm. He earns a meagre sum of about 500 rupees a day.

"There are no beds or mattresses in their room. My father sleeps on empty cardboard boxes in the quarry," he said. "He suffered a sunstroke last year because he was working under the hot sun."The students, aged 13 to 17, are learning various art forms, including photography, as part of an initiative by the Tamil Nadu School education department.

"The idea is to make students socially responsible," said Muthamizh Kalaivizhi, state lead of Holistic Development programme in Tamil Nadu's government schools and founder of non-government organisation Neelam Foundation.

"They documented the working people around them. Understanding their lives is the beginning of social change," he added.

'''

In [25]:
prediction = predict_news_article(sample_news2)
print("Prediction:", prediction)

Prediction: Real News
