# Fake News Classification – Preprocessing Notebook

### Goal
The goal of this notebook is to clean, process, and prepare the dataset for machine learning models to classify fake and real news articles. We will handle missing data, clean text fields, remove stopwords, and generate numerical embeddings to feed into ML classifiers.


## 4.1 Dataset Analysis – Step 1: Load Required Libraries and Dataset Files

In [None]:
# Mount Google Drive to access datasets

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
import os

In [None]:
# Load train, test, and submission datasets from drive
train_data = pd.read_csv('/content/drive/MyDrive/Annan Project/Datasets/train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/Annan Project/Datasets/test.csv')
submit_data = pd.read_csv('/content/drive/MyDrive/Annan Project/Datasets/submit.csv')

# Preview the first few rows of training data
train_data.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [None]:
# Preview the test data
test_data.head()

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


## 4.1 Dataset Summary and Feature Information

### Dataset Details:
- Dataset Source: Kaggle Fake News Dataset (Competition Dataset)

| Dataset | Rows | Description |
|---------|------|-------------|
| train.csv | 20,800 | Contains labeled data (Real/Fake) |
| test.csv  | 5,200  | Unlabeled data (Model will predict) |
| submit.csv | 5,200  | Submission file for competition |

---

### Columns Description:
| Column | Description |
|--------|-------------|
| id | Unique ID for each news article |
| title | Title of the article |
| author | Name of the author |
| text | Full content of the article |
| label | Target variable (1 = Fake, 0 = Real) |

## Why This Dataset Was Chosen

- It addresses a timely problem: the spread of misinformation.
- With 20,800 labeled training rows, it is large enough to train machine learning models.
- It offers several textual features (`title`, `text`, `author`).
- It maps to a real-world application in the news and media industry.

## Strengths & Weaknesses of Dataset:

### Strengths:
- Sufficient size & labeled data
- Real-world problem
- Good feature diversity (title, text, author)

### Weaknesses:
- Missing values in `author` column
- Noisy text data
- Slight class imbalance (10,413 fake vs. 10,387 real articles)

## Dataset Bias Analysis:

### Potential Biases:
- Certain authors or headlines may dominate class prediction.
- Source or style-based bias possible.
- Slight class imbalance (10,413 fake vs. 10,387 real articles in the training set).

### Impact of Bias:
- Model may overfit to stylistic patterns.
- Misclassification risk increases if not handled properly.
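
One way to probe the author-related bias listed above is to measure how one-sided each author's labels are. The sketch below is only an illustration: it runs on a small made-up list of `(author, label)` pairs rather than the real dataset, and `author_fake_share` is a hypothetical helper name.

```python
from collections import defaultdict

def author_fake_share(rows):
    """Return {author: (article_count, share_of_fake)} for (author, label) pairs."""
    counts = defaultdict(lambda: [0, 0])  # author -> [total, fake]
    for author, label in rows:
        counts[author][0] += 1
        counts[author][1] += int(label == 1)  # label 1 = fake
    return {a: (total, fake / total) for a, (total, fake) in counts.items()}

# Toy data for illustration only (not taken from train.csv)
rows = [("alice", 1), ("alice", 1), ("bob", 0), ("bob", 1), ("carol", 0)]
shares = author_fake_share(rows)
```

Authors whose share sits at 0.0 or 1.0 across many articles would let a model predict the label from the name alone, which is the kind of shortcut the bias notes warn about.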

## 4.2 Preprocessing Steps — Step 1: Handling Missing Values

→ We found missing values in the `author` column.  
To handle this, we filled missing author names with `Unknown`.
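
Before filling, it can help to count how many values are actually missing in each column. The following is a minimal sketch on a toy frame (the rows are made up, not taken from `train.csv`); it also shows filling by assigning the column back.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (not rows from train.csv)
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "author": ["Jane Doe", np.nan, np.nan],
})

# Count missing values per column before deciding how to fill them
missing = df.isna().sum()

# Assign the filled column back to the frame
df["author"] = df["author"].fillna("Unknown")
```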

In [None]:
# Fill missing values in 'author' with 'Unknown'
# (assign back rather than using inplace=True on a column selection)
train_data['author'] = train_data['author'].fillna('Unknown')
test_data['author'] = test_data['author'].fillna('Unknown')


## Step 2: Text Cleaning (Removing Special Characters, Numbers & Lowercasing)

→ Text cleaning reduces noise in the data and makes it uniform for model input.

In [None]:
# Function to clean text columns
def clean_text(text):
    text = re.sub(r'\W', ' ', str(text))  # Remove special characters
    text = re.sub(r'\d+', ' ', text)  # Remove numbers
    text = text.lower()  # Convert text to lowercase
    return text

# Apply cleaning to both 'title' and 'text'
train_data['title'] = train_data['title'].apply(clean_text)
train_data['text'] = train_data['text'].apply(clean_text)
test_data['title'] = test_data['title'].apply(clean_text)
test_data['text'] = test_data['text'].apply(clean_text)
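
The substitutions in `clean_text` replace each removed character with a space, so the cleaned string can contain runs of spaces. The sketch below shows a hypothetical variant, `clean_text_compact`, that additionally collapses whitespace; it is an illustration, not the function used in the rest of this notebook.

```python
import re

def clean_text_compact(text):
    """Variant of clean_text that also collapses repeated whitespace."""
    text = re.sub(r'\W', ' ', str(text))   # non-word characters -> space
    text = re.sub(r'\d+', ' ', text)       # digit runs -> space
    text = text.lower()
    return ' '.join(text.split())          # collapse whitespace runs and trim

sample = "Hello, World! 2016"
cleaned = clean_text_compact(sample)       # "hello world"
```

Two details are worth keeping in mind: `\W` does not match the underscore, so `snake_case` survives unchanged, and `str(text)` turns a missing value into the literal string `nan`.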

## Step 3: Remove Stopwords (Common Irrelevant Words)

→ Stopwords such as 'the', 'is', and 'in' carry little meaning for model learning. Only the `text` column is filtered here; `title` keeps its stopwords.

In [None]:
# Import and Download NLTK stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word not in stop_words)

# Apply stopword removal
train_data['text'] = train_data['text'].apply(remove_stopwords)
test_data['text'] = test_data['text'].apply(remove_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
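
To see the effect of stopword removal on a single sentence, the sketch below uses a small hand-picked stopword set (an assumption made so the example runs without downloading NLTK data) and a hypothetical helper name, `remove_stopwords_demo`.

```python
# Small hand-picked stopword set for illustration; the notebook uses NLTK's English list
demo_stop_words = {"the", "is", "in", "a", "of"}

def remove_stopwords_demo(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

before = "the mayor is in the city hall"
after = remove_stopwords_demo(before, demo_stop_words)  # "mayor city hall"
```

The membership test is case-sensitive, so this step relies on the text having been lowercased during cleaning.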


## Step 4: Check Class Balance (Distribution of Fake vs Real News)

→ Before training, we confirm whether the dataset is balanced between real and fake news, because class balance affects both model performance and evaluation.

If the dataset is **imbalanced**, we apply undersampling to avoid bias.


In [None]:
# Define a function to check label distribution
def check_balance(data):
  real_count = len(data[data['label'] == 0])  # label 0 = real
  fake_count = len(data[data['label'] == 1])  # label 1 = fake
  if real_count == fake_count:
    print('Dataset is balanced')
  else:
    print('Dataset is not balanced')

# Apply balance check on training data
check_balance(train_data)

Dataset is not balanced


In [None]:
# Print actual counts of both classes
print(len(train_data[train_data['label'] == 0]))
print(len(train_data[train_data['label'] == 1]))

10387
10413
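
A strict equality check reports "not balanced" even when the classes differ by only a handful of rows. As a rough sketch, the gap can instead be expressed as the majority-class share and compared against a tolerance. The helper names and the 5% default threshold below are assumptions chosen for illustration.

```python
def majority_share(counts):
    """Fraction of samples that belong to the largest class."""
    total = sum(counts.values())
    return max(counts.values()) / total

def is_roughly_balanced(counts, tolerance=0.05):
    """True when the largest class is within `tolerance` of an even split."""
    even_share = 1 / len(counts)
    return majority_share(counts) - even_share <= tolerance

# Class counts printed above: label 0 (real) and label 1 (fake)
counts = {0: 10387, 1: 10413}
share = majority_share(counts)  # 10413 / 20800, just over one half
```

With these counts the majority class holds barely more than half of the rows, so the dataset is close to an even split.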


## Step 6: Balance the Dataset (Undersampling)

Since the dataset is **not perfectly balanced** (10,413 fake vs. 10,387 real articles), we undersample the larger class.

We randomly sample 10,387 rows from each class, giving 20,774 rows in total.


In [None]:
# Perform undersampling to balance dataset
min_count = min(
    len(train_data[train_data['label'] == 0]),
    len(train_data[train_data['label'] == 1])
)
# Sample equal records from both classes
class_0 = train_data[train_data['label'] == 0].sample(n=min_count, random_state=42)
class_1 = train_data[train_data['label'] == 1].sample(n=min_count, random_state=42)

# Combine and shuffle the balanced data
balanced_data = pd.concat([class_0, class_1]).sample(frac=1, random_state=42).reset_index(drop=True)
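
The same undersampling can be written more compactly with a grouped sample. This is a sketch on a made-up toy frame, assuming a pandas version that provides `DataFrameGroupBy.sample` (1.1 or later).

```python
import pandas as pd

# Toy frame for illustration only: 3 real (0) and 5 fake (1) rows
df = pd.DataFrame({"id": range(8), "label": [0, 0, 0, 1, 1, 1, 1, 1]})

# Size of the smallest class
min_count = int(df["label"].value_counts().min())

# Draw min_count rows from every class, then shuffle the result
balanced = (
    df.groupby("label")
      .sample(n=min_count, random_state=42)
      .sample(frac=1, random_state=42)
      .reset_index(drop=True)
)
```

This form also extends to more than two classes without further changes.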

## Step 7: Sentence Embedding using Pretrained Transformer Models

We now move toward embedding-based feature engineering using the Sentence Transformer library.

We use:
- **`all-mpnet-base-v2`** → to score how semantically similar each article is to the labels "Fake" and "Real" (extra signals, since many author names are missing)
- **`all-MiniLM-L6-v2`** → to generate dense numerical features for model training

### Step 7.1: Install Required Libraries

In [None]:
# Install transformer and PyTorch libraries
!pip install -U sentence-transformers torch

Collecting sentence-transformers
  Downloading sentence_transformers-4.0.2-py3-none-any.whl.metadata (13 kB)
Downloading sentence_transformers-4.0.2-py3-none-any.whl (340 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-4.0.2


### Step 7.2: Import Required Libraries

In [None]:
import torch
from sentence_transformers import SentenceTransformer, util

## Step 8: Predicting Author Label Confidence using Sentence Embeddings

Because many `author` names are missing (filled as 'Unknown') or uninformative,  
we use Sentence Transformers to score how closely the content of each article resembles each label:

- "Fake" → `Fake_Confidence` score
- "Real" → `Real_Confidence` score

---

### Models Used:
| Model | Purpose |
|-------|---------|
| all-mpnet-base-v2 | To generate label confidence (Real/Fake) |
| all-MiniLM-L6-v2 | To generate numerical features from text for ML models |


In [None]:
# Import tqdm for progress visualization
from tqdm import tqdm
tqdm.pandas()  # Adding progress bar support for pandas apply

## Initialize the Sentence Transformer Models

In [None]:
# Load the sentence transformer models
label_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
feature_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


## Generate Static Embeddings for Target Labels
This helps us later to calculate cosine similarity between article content and label embeddings.

In [None]:
# Generate embeddings for both labels "Fake" and "Real"
fake_emb = label_model.encode('Fake', convert_to_tensor=True)
real_emb = label_model.encode('Real', convert_to_tensor=True)

## Validate Data Columns Before Processing

In [None]:
# Print all column names of balanced_data for reference
for col in balanced_data.columns:
  print(repr(col))

'id'
'title'
'author'
'text'
'label'


## Generate Confidence Scores using Cosine Similarity
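The function below embeds each article (title, text, and author combined) and returns two scores:
- `Fake_Confidence` → cosine similarity between the article embedding and the embedding of the word "Fake"
- `Real_Confidence` → cosine similarity between the article embedding and the embedding of the word "Real"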

In [None]:
# Define function to calculate label confidence
def get_label_confidence(row_vals):
  text = (f"Title: {row_vals['title']}. Text: {row_vals['text']} Published by: {row_vals['author']}")

  # Convert text into embeddings
  text_emb = label_model.encode(text, convert_to_tensor=True)

  # Calculate cosine similarity with fake and real embeddings
  cos_sim_fake = util.cos_sim(text_emb, fake_emb)
  cos_sim_real = util.cos_sim(text_emb, real_emb)

  # Return in (real, fake) order, matching how callers unpack into Real_Confidence, Fake_Confidence
  return cos_sim_real.item(), cos_sim_fake.item()

# Apply function to balanced dataset
balanced_data['Real_Confidence'], balanced_data['Fake_Confidence'] = zip(*balanced_data.progress_apply(get_label_confidence, axis=1))

100%|██████████| 20774/20774 [25:12<00:00, 13.73it/s]
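
For reference, the quantity that `util.cos_sim` computes is the cosine of the angle between two vectors. A plain-Python sketch on toy 2-D vectors (real embeddings have hundreds of dimensions) makes the range of the score concrete:

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors for illustration only
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
```

Scores near 1 mean the vectors point the same way, scores near 0 mean they are unrelated, and scores near -1 mean they point in opposite directions. Vector length does not affect the result.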


## View Sample Records with Generated Label Confidences

In [None]:
balanced_data.head()

Unnamed: 0,id,title,author,text,label,Real_Confidence,Fake_Confidence
0,13397,ex army sniper gets year sentence in murder ...,Benjamin Weiser,former united states army sergeant nickname ra...,0,0.121041,0.075157
1,609,more than chaos terror ties make venezuela ...,Frances Martel,international community act venezuela socialis...,0,0.041147,0.060388
2,1821,san francisco torn as some see street behavio...,Thomas Fuller,san francisco apartment foot celebrated zigzag...,0,0.12766,0.129394
3,20115,florida taco trucks used to lure democrat v...,admin,information liberation october video florida s...,1,0.196192,0.138637
4,123,taiwan responds after china sends carrier to t...,Michael Forsythe and Chris Buckley,hong kong taiwan scrambled fighter jets dispat...,0,0.033105,0.034631


## Step 9: Generate Final Feature Embeddings using MiniLM Model

### Purpose:
We now create dense numerical features for both training and test data using the Sentence Transformer model `all-MiniLM-L6-v2`.

This helps in converting the entire text into numbers for feeding into ML models.

---

### Strategy:
- Combine `title`, `text`, and `author` fields.
- Generate embeddings for each record.
- Store each embedding dimension as its own column (`feature_0`, `feature_1`, ..., `feature_383`; the model outputs 384 values per record)


In [None]:
# Make a copy of balanced dataset for feature creation
df_new = balanced_data.copy()

# Define function to generate embeddings using MiniLM
def get_features(row_vals):
  text = (f"For the given news: " +
          f"Title: {row_vals['title']}, " +
          f"The news is: {row_vals['text']}, " +
          f"which is published by: {row_vals['author']}")

  # Generate embeddings
  text_emb = feature_model.encode(text, convert_to_tensor=True)

  # Convert the embedding tensor to a flat numpy array
  text_emb = text_emb.detach().cpu().numpy().flatten()

  # Store embeddings as new features
  for i, value in enumerate(text_emb):
    row_vals[f'feature_{i}'] = value
  return row_vals


## Apply Embedding Function to Training Data

In [None]:
df_new = df_new.progress_apply(get_features, axis=1)

100%|██████████| 20774/20774 [1:04:12<00:00,  5.39it/s]
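
The row-wise `progress_apply` above encodes one article at a time and took roughly an hour. A possible alternative, sketched below, builds all prompts first, encodes them in a single call, and attaches the result as columns. The helper names `build_prompt` and `add_embedding_features` are hypothetical, and `fake_encode` is a stand-in so the sketch runs without downloading a model.

```python
import numpy as np
import pandas as pd

def build_prompt(row):
    """Same text layout as get_features."""
    return (f"For the given news: Title: {row['title']}, "
            f"The news is: {row['text']}, "
            f"which is published by: {row['author']}")

def add_embedding_features(df, encode):
    """Encode all rows in one call and append feature_0..feature_{d-1} columns."""
    prompts = df.apply(build_prompt, axis=1).tolist()
    emb = np.asarray(encode(prompts))               # shape: (n_rows, d)
    cols = [f"feature_{i}" for i in range(emb.shape[1])]
    features = pd.DataFrame(emb, columns=cols, index=df.index)
    return pd.concat([df, features], axis=1)

def fake_encode(texts):
    """Stand-in encoder for this sketch: 4 numbers per text, not a real model."""
    return np.array([[len(t), t.count(" "), 0.0, 1.0] for t in texts])

toy = pd.DataFrame({"title": ["t1", "t2"], "text": ["x y", "z"], "author": ["a", "b"]})
out = add_embedding_features(toy, fake_encode)
```

With the real model, the `encode` argument could be something like `lambda texts: feature_model.encode(texts, batch_size=64, show_progress_bar=True)`, which returns a NumPy array by default. The batch size of 64 is an arbitrary assumption, and any speed-up would need to be measured.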


## Drop Unnecessary Columns: title, text, author

In [None]:
df_new.drop(columns=[col for col in ['title', 'text', 'author'] if col in df_new.columns], inplace=True)

## Reorder Columns: Features First, Confidence Scores & Labels Last

In [None]:
right_order = ['Real_Confidence','Fake_Confidence','label']
left_order = [col for col in df_new.columns if col not in right_order]
df_new = df_new[left_order + right_order]
df_new.head()

Unnamed: 0,id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,...,feature_377,feature_378,feature_379,feature_380,feature_381,feature_382,feature_383,Real_Confidence,Fake_Confidence,label
0,13397,-0.069114,-0.055755,-0.073467,-0.016929,0.020035,0.060953,0.032014,0.020693,-0.083892,...,0.031493,-0.002904,-0.067265,0.024946,-0.080874,-0.012528,-0.046056,0.121041,0.075157,0
1,609,0.028742,-0.046899,-0.066271,-0.007598,0.03011,0.001222,0.003658,-0.005663,-0.05767,...,0.099533,-0.002136,-0.054731,0.018363,-0.074943,-0.000823,-0.033429,0.041147,0.060388,0
2,1821,0.0532,-0.024045,0.077612,0.066155,0.023577,0.030569,0.011022,0.028285,-0.077734,...,0.075666,0.02983,0.019771,0.088194,-0.006,-0.077202,0.026258,0.12766,0.129394,0
3,20115,-0.014612,-0.06853,-0.018928,0.048608,0.029969,-0.040724,0.024027,-0.047901,-0.021213,...,-0.056335,-0.015363,-0.000553,-0.009095,0.03275,0.058083,-0.014926,0.196192,0.138637,1
4,123,0.026727,0.015821,0.022083,0.016269,0.075496,-0.018269,0.02408,-0.041206,-0.130468,...,-0.008655,-0.03384,0.046211,0.001195,-0.057127,0.011764,0.028459,0.033105,0.034631,0


## Export Final Training Data with Generated Features

In [None]:
train_path = '/content/drive/MyDrive/Annan Project/Datasets/Feature Converted/train.csv'

# Remove existing file if exists
if os.path.exists(train_path):
    os.remove(train_path)

# Create path if not exists
path = os.path.dirname(train_path)
if not os.path.exists(path):
    os.makedirs(path)
# Save file
df_new.to_csv(train_path, index=False)

## Apply Same Process to Test Data

In [None]:
test_data.head()

Unnamed: 0,id,title,author,text
0,20800,specter of trump loosens tongues if not purse...,David Streitfeld,palo alto calif years scorning political proce...
1,20801,russian warships ready to strike terrorists ne...,Unknown,russian warships ready strike terrorists near ...
2,20802,nodapl native american leaders vow to stay a...,Common Dreams,videos nodapl native american leaders vow stay...
3,20803,tim tebow will attempt another comeback this ...,Daniel Victor,first succeed try different sport tim tebow he...
4,20804,keiser report meme wars e,Truth Broadcast Network,mins ago views comments likes first time histo...


### Generate Real & Fake Confidence for Test Data

In [None]:
test_data['Real_Confidence'], test_data['Fake_Confidence'] = zip(*test_data.progress_apply(get_label_confidence, axis=1))

100%|██████████| 5200/5200 [07:12<00:00, 12.02it/s]


### Generate Embeddings for Test Data using MiniLM

In [None]:
test_data = test_data.progress_apply(get_features, axis=1)

100%|██████████| 5200/5200 [16:05<00:00,  5.38it/s]


## Step 10: Preparing Final Test Data

Now that we have generated both:
- `Real_Confidence` and `Fake_Confidence`
- Embeddings using MiniLM model

We now clean the test data, reorder the columns, and export it as final clean test data.

---

### Columns Arrangement:
1. Feature Columns first
2. `Real_Confidence` and `Fake_Confidence` at the end

In [None]:
# Reorder columns of test data
right_order = ['Real_Confidence','Fake_Confidence']
left_order = [col for col in test_data.columns if col not in right_order]
test_data = test_data[left_order + right_order]

# Drop unnecessary columns: title, text, author
test_data.drop(columns=[col for col in ['title', 'text', 'author'] if col in test_data.columns], inplace=True)

# Preview final structure of test_data
test_data.head()

Unnamed: 0,id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,...,feature_376,feature_377,feature_378,feature_379,feature_380,feature_381,feature_382,feature_383,Real_Confidence,Fake_Confidence
0,20800,0.043848,-0.087044,0.038414,-0.058766,0.025619,-0.043383,0.021066,-0.017662,-0.047061,...,-0.00238,0.05962,0.009396,-0.033729,0.055025,-0.076543,0.038457,0.009483,0.082908,0.102194
1,20801,0.014846,0.00485,-0.093616,-0.020758,-0.000725,0.014329,0.030804,0.026781,-0.030054,...,0.009534,-0.056048,0.032425,0.013453,-0.050632,-0.025472,0.0501,0.057132,0.029318,0.069465
2,20802,-0.009574,0.029982,-0.006876,0.02763,0.088113,-0.02197,-0.008983,-0.017175,-0.097481,...,-0.033,0.005198,-0.033162,-0.02153,0.014571,-0.085396,-0.080934,0.040364,0.094428,0.048591
3,20803,-0.019143,-0.034643,-0.033826,-0.064742,0.00872,0.038987,-0.007195,0.031628,0.045901,...,-0.079527,0.030677,-0.009398,-0.146653,0.07596,-0.024999,0.026842,0.050363,0.058847,0.088894
4,20804,-0.006959,-0.048538,0.024193,0.025842,0.054738,0.056438,0.027011,0.047982,0.029612,...,0.045485,0.00181,-0.028531,-0.013817,0.039169,0.013829,-0.029217,0.078077,0.1737,0.187116


## Export Final Test Dataset to Drive

In [None]:
# Define path to save test data
test_path = '/content/drive/MyDrive/Annan Project/Datasets/Feature Converted/test.csv'

# Remove existing test file if it exists
if os.path.exists(test_path):
    os.remove(test_path)

# Create folder path if not exists
path = os.path.dirname(test_path)
if not os.path.exists(path):
    os.makedirs(path)
# Save final clean test data
test_data.to_csv(test_path, index=False)