# Data Cleaning and Preprocessing

This notebook contains the data cleaning and preprocessing steps for the Fake News Detection project.

## Overview
- Load raw data from CSV files
- Assign labels (1 for real news, 0 for fake news)
- Merge and resample datasets
- Perform text preprocessing
- Split data into train and test sets


## 1. Import Required Libraries


In [1]:
import numpy as np
import pandas as pd

## 2. Load Raw Data

Load the fake and true news datasets from the raw data directory.


In [2]:
# Load the raw data
fake_df = pd.read_csv('data/raw/Fake.csv')
true_df = pd.read_csv('data/raw/True.csv')

print(f"Fake news shape: {fake_df.shape}")
print(f"True news shape: {true_df.shape}")


Fake news shape: (23481, 4)
True news shape: (21417, 4)


## 3. Assign Labels and Merge Datasets

- Assign label `0` to fake news
- Assign label `1` to real news
- Merge both datasets into a single DataFrame


In [3]:
# Assign labels: 1 for real news, 0 for fake news
fake_df['label'] = 0  # Fake news gets label 0
true_df['label'] = 1  # Real news gets label 1

# Merge the datasets
merged_df = pd.concat([fake_df, true_df], ignore_index=True)

print(f"Merged dataset shape: {merged_df.shape}")
print(f"\nLabel distribution:")
print(merged_df['label'].value_counts())
print(f"\nFirst few rows:")
merged_df.head()


Merged dataset shape: (44898, 5)

Label distribution:
0    23481
1    21417
Name: label, dtype: int64

First few rows:


Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
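
The labelling and merge step can be checked on a small scale. The sketch below uses two toy frames (`fake_toy` and `true_toy` are hypothetical stand-ins, not rows from the dataset) and `assign`, which returns labelled copies instead of modifying the source frames in place as the cell above does.

```python
import pandas as pd

# Toy stand-ins for Fake.csv and True.csv (hypothetical rows, not from the dataset)
fake_toy = pd.DataFrame({"title": ["f1", "f2", "f3"], "text": ["a", "b", "c"]})
true_toy = pd.DataFrame({"title": ["t1", "t2"], "text": ["d", "e"]})

# assign() returns a labelled copy and leaves the source frames untouched
labelled = pd.concat(
    [fake_toy.assign(label=0), true_toy.assign(label=1)],
    ignore_index=True,  # rebuild a contiguous 0..n-1 index
)

# Sanity checks: matching columns before the merge, no rows lost after it
assert list(fake_toy.columns) == list(true_toy.columns)
assert len(labelled) == len(fake_toy) + len(true_toy)
print(labelled["label"].value_counts().to_dict())
```

Checking that both frames share the same columns before `pd.concat` is a cheap guard, since mismatched columns would be filled with `NaN` rather than raising an error.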


## 4. Resample (Shuffle) the Dataset

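Randomly shuffle the merged dataset so that fake and real articles are interleaved instead of grouped by source file. `sample(frac=1)` only reorders the rows; it does not change the class balance, as the unchanged label counts below confirm.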


In [4]:
# Resample (shuffle) the merged dataset
resampled_df = merged_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Resampled dataset shape: {resampled_df.shape}")
print(f"\nLabel distribution after resampling:")
print(resampled_df['label'].value_counts())
print(f"\nFirst few rows after resampling:")
resampled_df.head()


Resampled dataset shape: (44898, 5)

Label distribution after resampling:
0    23481
1    21417
Name: label, dtype: int64

First few rows after resampling:


Unnamed: 0,title,text,subject,date,label
0,Ben Stein Calls Out 9th Circuit Court: Committ...,"21st Century Wire says Ben Stein, reputable pr...",US_News,"February 13, 2017",0
1,Trump drops Steve Bannon from National Securit...,WASHINGTON (Reuters) - U.S. President Donald T...,politicsNews,"April 5, 2017",1
2,Puerto Rico expects U.S. to lift Jones Act shi...,(Reuters) - Puerto Rico Governor Ricardo Rosse...,politicsNews,"September 27, 2017",1
3,OOPS: Trump Just Accidentally Confirmed He Le...,"On Monday, Donald Trump once again embarrassed...",News,"May 22, 2017",0
4,Donald Trump heads for Scotland to reopen a go...,"GLASGOW, Scotland (Reuters) - Most U.S. presid...",politicsNews,"June 24, 2016",1
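
A minimal sketch of what the shuffle does and does not change, using a hypothetical ten-row frame (`toy` is an illustration, not part of the project data):

```python
import pandas as pd

# Hypothetical toy frame: 6 rows of class 0 followed by 4 rows of class 1
toy = pd.DataFrame({"id": range(10), "label": [0] * 6 + [1] * 4})

# frac=1 samples every row exactly once, i.e. a permutation of the frame
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Same seed gives the same order
same_order = shuffled_a.equals(shuffled_b)
# The class counts are unchanged by shuffling
same_counts = (
    shuffled_a["label"].value_counts().to_dict()
    == toy["label"].value_counts().to_dict()
)
# Every original row is still present exactly once
same_rows = sorted(shuffled_a["id"]) == list(toy["id"])

print(same_order, same_counts, same_rows)
```

Fixing `random_state` makes the row order reproducible across runs, which matters here because the later train/test split depends on it.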


## 5. Merge Title and Text Columns

Combine the `title` and `text` columns into a single `content` column for easier text processing.


In [5]:
# Merge the 'title' and 'text' columns into a new column 'content'
resampled_df['content'] = resampled_df['title'] + ' ' + resampled_df['text']
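
One detail of string concatenation in pandas is worth illustrating: if either operand is missing, the result is missing too. The sketch below shows this on a hypothetical two-row frame and one possible alternative that keeps the title when the body is missing; whether that is desirable for this project is an assumption, not something the notebook states.

```python
import pandas as pd

# Hypothetical toy frame; the second row has no body text
toy = pd.DataFrame({"title": ["Headline A", "Headline B"], "text": ["Body A", None]})

# Plain concatenation: a missing operand makes the whole result missing
naive = toy["title"] + " " + toy["text"]

# Alternative: treat missing values as empty strings, then trim stray spaces
safe = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```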




## 6. Remove Empty Content

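Remove rows whose `content` is null so that every remaining entry has text to process. Note that `notna()` only catches missing values; empty or whitespace-only strings pass this filter.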


In [6]:
# Remove rows with null content (empty strings are not affected by notna())
resampled_df = resampled_df[resampled_df['content'].notna()]

# Reset the index of resampled_df
resampled_df = resampled_df.reset_index(drop=True)
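
The difference between null and blank content can be seen on a small example. This is a sketch on a hypothetical frame; the stricter filter is one possible option and is not what the cell above applies.

```python
import pandas as pd

# Hypothetical toy frame with one valid row, one null, and two blank strings
toy = pd.DataFrame({"content": ["Real article text", None, "   ", ""]})

# The filter used above: drops only the null row
only_notna = toy[toy["content"].notna()]

# A stricter filter: also drops empty and whitespace-only strings
non_blank = toy[toy["content"].fillna("").str.strip().ne("")].reset_index(drop=True)

print(len(only_notna), len(non_blank))
```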


In [7]:
resampled_df

Unnamed: 0,title,text,subject,date,label,content
0,Ben Stein Calls Out 9th Circuit Court: Committ...,"21st Century Wire says Ben Stein, reputable pr...",US_News,"February 13, 2017",0,Ben Stein Calls Out 9th Circuit Court: Committ...
1,Trump drops Steve Bannon from National Securit...,WASHINGTON (Reuters) - U.S. President Donald T...,politicsNews,"April 5, 2017",1,Trump drops Steve Bannon from National Securit...
2,Puerto Rico expects U.S. to lift Jones Act shi...,(Reuters) - Puerto Rico Governor Ricardo Rosse...,politicsNews,"September 27, 2017",1,Puerto Rico expects U.S. to lift Jones Act shi...
3,OOPS: Trump Just Accidentally Confirmed He Le...,"On Monday, Donald Trump once again embarrassed...",News,"May 22, 2017",0,OOPS: Trump Just Accidentally Confirmed He Le...
4,Donald Trump heads for Scotland to reopen a go...,"GLASGOW, Scotland (Reuters) - Most U.S. presid...",politicsNews,"June 24, 2016",1,Donald Trump heads for Scotland to reopen a go...
...,...,...,...,...,...,...
44893,UNREAL! CBS’S TED KOPPEL Tells Sean Hannity He...,,politics,"Mar 27, 2017",0,UNREAL! CBS’S TED KOPPEL Tells Sean Hannity He...
44894,PM May seeks to ease Japan's Brexit fears duri...,LONDON/TOKYO (Reuters) - British Prime Ministe...,worldnews,"August 29, 2017",1,PM May seeks to ease Japan's Brexit fears duri...
44895,Merkel: Difficult German coalition talks can r...,BERLIN (Reuters) - Chancellor Angela Merkel sa...,worldnews,"November 16, 2017",1,Merkel: Difficult German coalition talks can r...
44896,Trump Stole An Idea From North Korean Propaga...,Jesus f*cking Christ our President* is a moron...,News,"July 14, 2017",0,Trump Stole An Idea From North Korean Propaga...


### Apply tokenization

In [8]:
# Apply tokenization on the 'content' column and create a new column 'tokens'
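import nltk

# Download the punkt tokenizer if not already present.
# Newer NLTK releases may require the 'punkt_tab' resource instead; if
# word_tokenize raises a LookupError, uncomment the next line.
# nltk.download('punkt_tab')
nltk.download('punkt', quiet=True)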

resampled_df['tokens'] = resampled_df['content'].apply(nltk.word_tokenize)
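
To make the idea of tokenization concrete without requiring the NLTK data files, here is a simplified regex-based stand-in. It is not `nltk.word_tokenize` and will split some inputs differently (contractions and abbreviations, for example); `simple_tokenize` is a hypothetical helper for illustration only.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Simplified stand-in for illustration; not equivalent to nltk.word_tokenize.
    """
    # \w+ matches runs of letters/digits/underscore; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Trump drops Steve Bannon, officials say.")
print(tokens)
```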


### Convert to lowercase

In [10]:
# Convert all tokens in the 'tokens' column to lower case
resampled_df['tokens'] = resampled_df['tokens'].apply(lambda tokens: [word.lower() for word in tokens])


### Remove stop words

In [11]:
# Remove stop words from the tokens
from nltk.corpus import stopwords

# Download stopwords if not already present
nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))

# Remove stop words in 'tokens' column
resampled_df['tokens'] = resampled_df['tokens'].apply(lambda tokens: [word for word in tokens if word.lower() not in stop_words])
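
The order of the lowercasing and stop word steps matters because the stop list is lowercase. The sketch below uses a tiny hypothetical stop list in place of NLTK's English list to show the effect.

```python
# Tiny hypothetical stop list; the notebook uses NLTK's full English list
stop_words = {"the", "a", "to", "of", "he", "is"}

raw_tokens = ["The", "president", "is", "heading", "to", "Scotland"]

# Case-sensitive check: "The" slips through because the stop list is lowercase
case_sensitive = [t for t in raw_tokens if t not in stop_words]

# Lowercasing first lets every stop word match
lowered = [t.lower() for t in raw_tokens]
filtered = [t for t in lowered if t not in stop_words]

print(case_sensitive)
print(filtered)
```

Using a `set` for the stop list keeps each membership test constant-time, which adds up over tens of thousands of articles.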


### Remove punctuation and digits

In [12]:
# Keep only purely alphabetic tokens; this drops punctuation, digits,
# and mixed tokens such as '9th' from the 'tokens' column
resampled_df['tokens'] = resampled_df['tokens'].apply(
    lambda tokens: [word for word in tokens if word.isalpha()]
)
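
The `isalpha()` filter is stricter than its heading suggests: any token containing a non-letter character is removed entirely. A short illustration on hand-picked tokens (the list is made up for the example, though `9th` and `u.s.` do appear in the titles shown earlier):

```python
# Hand-picked example tokens, already lowercased
tokens = ["puerto", "rico", "u.s.", "9th", "2017", ",", "n't", "brexit"]

# str.isalpha() is True only when every character is a letter
kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)
print(dropped)
```

This explains why abbreviations such as `u.s.` and ordinals such as `9th` are absent from the `tokens` column in the output further down.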


### Lemmatization

In [13]:
# Lemmatize the tokens in the 'tokens' column.
# lemmatize() defaults to pos='n' (noun), so verb forms such as 'confirmed' are left unchanged.
from nltk.stem import WordNetLemmatizer

# Download wordnet if not already present
nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()

resampled_df['tokens'] = resampled_df['tokens'].apply(
    lambda tokens: [lemmatizer.lemmatize(word) for word in tokens]
)


### Remove extra spaces and HTML tags

In [14]:
import re

def clean_token(token):
    # Remove HTML tags from individual token (just in case, usually tokens shouldn't have them)
    token = re.sub(r'<.*?>', '', token)
    # Replace multiple spaces, tabs, and newlines with a single space
    token = re.sub(r'\s+', ' ', token)
    # Strip leading and trailing spaces
    return token.strip()

# Clean extra spaces and html tags from each token in the 'tokens' column
resampled_df['tokens'] = resampled_df['tokens'].apply(
    lambda tokens: [clean_token(token) for token in tokens]
)
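
Because the earlier `isalpha()` filter keeps only alphabetic tokens, the tokens reaching this step can no longer contain tags or whitespace. The same two regular expressions have a visible effect when applied to raw text instead. The sketch below shows that variant; `clean_text` is a hypothetical helper and the sample string is made up.

```python
import re

def clean_text(text):
    """Strip HTML tags and collapse whitespace in a raw string."""
    # Non-greedy match removes each <...> tag separately
    text = re.sub(r"<.*?>", "", text)
    # Collapse runs of spaces, tabs, and newlines into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "<p>Breaking:\n\n  markets   <b>fall</b></p>"
cleaned = clean_text(raw)
print(cleaned)
```

Applying this kind of cleanup to `content` before tokenization would be one way to make the step effective, under the assumption that the raw articles contain markup at all.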


In [15]:
resampled_df

Unnamed: 0,title,text,subject,date,label,content,tokens
0,Ben Stein Calls Out 9th Circuit Court: Committ...,"21st Century Wire says Ben Stein, reputable pr...",US_News,"February 13, 2017",0,ben stein calls out 9th circuit court: committ...,"[ben, stein, call, circuit, court, committed, ..."
1,Trump drops Steve Bannon from National Securit...,WASHINGTON (Reuters) - U.S. President Donald T...,politicsNews,"April 5, 2017",1,trump drops steve bannon from national securit...,"[trump, drop, steve, bannon, national, securit..."
2,Puerto Rico expects U.S. to lift Jones Act shi...,(Reuters) - Puerto Rico Governor Ricardo Rosse...,politicsNews,"September 27, 2017",1,puerto rico expects u.s. to lift jones act shi...,"[puerto, rico, expects, lift, jones, act, ship..."
3,OOPS: Trump Just Accidentally Confirmed He Le...,"On Monday, Donald Trump once again embarrassed...",News,"May 22, 2017",0,oops: trump just accidentally confirmed he le...,"[oops, trump, accidentally, confirmed, leaked,..."
4,Donald Trump heads for Scotland to reopen a go...,"GLASGOW, Scotland (Reuters) - Most U.S. presid...",politicsNews,"June 24, 2016",1,donald trump heads for scotland to reopen a go...,"[donald, trump, head, scotland, reopen, golf, ..."
...,...,...,...,...,...,...,...
44893,UNREAL! CBS’S TED KOPPEL Tells Sean Hannity He...,,politics,"Mar 27, 2017",0,unreal! cbs’s ted koppel tells sean hannity he...,"[unreal, cbs, ted, koppel, tell, sean, hannity..."
44894,PM May seeks to ease Japan's Brexit fears duri...,LONDON/TOKYO (Reuters) - British Prime Ministe...,worldnews,"August 29, 2017",1,pm may seeks to ease japan's brexit fears duri...,"[pm, may, seek, ease, japan, brexit, fear, tra..."
44895,Merkel: Difficult German coalition talks can r...,BERLIN (Reuters) - Chancellor Angela Merkel sa...,worldnews,"November 16, 2017",1,merkel: difficult german coalition talks can r...,"[merkel, difficult, german, coalition, talk, r..."
44896,Trump Stole An Idea From North Korean Propaga...,Jesus f*cking Christ our President* is a moron...,News,"July 14, 2017",0,trump stole an idea from north korean propaga...,"[trump, stole, idea, north, korean, propaganda..."


### Join tokens back into text

In [18]:
# Convert the list of tokens back to text and store in a new column 'clean_text'
resampled_df['clean_text'] = resampled_df['tokens'].apply(lambda tokens: ' '.join(tokens))


### Split into train and test sets

In [19]:
from sklearn.model_selection import train_test_split

# Split the data into 80% train, 20% test
train_df, test_df = train_test_split(resampled_df, test_size=0.2, random_state=42, shuffle=True)

# Save the splits to the processed directory.
# Lists in the 'tokens' column are written to CSV as strings.
train_df.to_csv('data/processed/train.csv', index=False)
test_df.to_csv('data/processed/test.csv', index=False)

# To recover the lists when loading, parse the 'tokens' column with ast.literal_eval:
# import ast
# train_df = pd.read_csv('data/processed/train.csv')
# train_df['tokens'] = train_df['tokens'].apply(ast.literal_eval)
# Do the same for test_df.
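
The CSV round trip for list-valued columns can be checked in memory. This sketch writes a hypothetical two-row frame to a `StringIO` buffer instead of a file, then restores the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Hypothetical toy frame with a list-valued column
toy = pd.DataFrame({"tokens": [["trump", "drop"], ["merkel", "talk"]], "label": [1, 1]})

# Write to an in-memory buffer instead of data/processed/...
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# After loading, each cell is the string form of the list, not a list
type_before = type(reloaded.loc[0, "tokens"])

# literal_eval safely parses the string back into a Python list
reloaded["tokens"] = reloaded["tokens"].apply(ast.literal_eval)
type_after = type(reloaded.loc[0, "tokens"])

print(type_before.__name__, type_after.__name__)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from disk.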


In [16]:
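# Load a previously saved cleaned dataset.
# Note: no cell in this notebook writes 'data/processed/cleaned_data.csv', so the file must already exist.
df = pd.read_csv('data/processed/cleaned_data.csv')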