## Step 1 — Data Loading and Initial Inspection
**Objective:**  
Load the two raw CSV files (`Fake.csv` and `True.csv`), check their structure, confirm consistent column names, and identify missing or malformed entries.  
This step ensures data integrity before merging and labeling in the following phase.

### Data Loading

In [3]:
import pandas as pd
import os

# Define dataset directory and file paths
DATA_DIR = "/Users/erenburakgokpinar/Documents/b198project/datasets"
FAKE_PATH = os.path.join(DATA_DIR, "Fake.csv")
TRUE_PATH = os.path.join(DATA_DIR, "True.csv")

# Function to safely load CSV with consistent column formatting
def safe_read_csv(path):
    df = pd.read_csv(
        path,
        dtype=str,              # Read all columns as strings to avoid dtype conflicts
        encoding="utf-8",
        on_bad_lines="skip"     # Skip problematic lines gracefully
    )
    df.columns = [c.strip().lower() for c in df.columns]  # Normalize column names
    return df

# Load both datasets
fake = safe_read_csv(FAKE_PATH)
true = safe_read_csv(TRUE_PATH)
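
Because `on_bad_lines="skip"` drops malformed rows silently, it can be useful to know how many rows were lost. The following is a minimal sketch, not part of the original pipeline: the helper name `read_csv_logging_bad_lines` and the in-memory sample are hypothetical. It assumes pandas 1.4 or newer, where `on_bad_lines` accepts a callable when `engine="python"` is used.

```python
import io

import pandas as pd


def read_csv_logging_bad_lines(source):
    """Read a CSV as strings and collect malformed rows instead of dropping them silently."""
    bad_lines = []

    def remember(fields):
        # pandas passes the split fields of each malformed row to this callback.
        bad_lines.append(fields)
        return None  # returning None tells pandas to skip the row

    df = pd.read_csv(source, dtype=str, engine="python", on_bad_lines=remember)
    df.columns = [c.strip().lower() for c in df.columns]  # same normalization as safe_read_csv
    return df, bad_lines


# Invented sample: the second data row has one field too many.
sample = io.StringIO("Title,Text\na,b\nc,d,EXTRA\ne,f\n")
demo_df, demo_bad = read_csv_logging_bad_lines(sample)
print("rows kept:", len(demo_df))
print("rows skipped:", len(demo_bad))
```

The python engine is slower than the default C engine, so this variant is probably best used as a one-off diagnostic rather than as the regular loader.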

### Dataset Overview
Now we inspect both datasets to confirm their dimensions, column names, and a few example rows.  
This step also highlights missing or null values that need to be addressed before merging.

In [4]:
# Display basic structure information
print("Fake.csv shape:", fake.shape)
print("True.csv shape:", true.shape)
print("\nFake.csv columns:", fake.columns.tolist())
print("True.csv columns:", true.columns.tolist())

# Preview first few entries
print("\n--- Fake.csv (first 3 rows) ---")
print(fake.head(3))
print("\n--- True.csv (first 3 rows) ---")
print(true.head(3))

# Check for missing values
print("\nMissing values per column (Fake):")
print(fake.isna().sum())
print("\nMissing values per column (True):")
print(true.isna().sum())

Fake.csv shape: (23481, 4)
True.csv shape: (21417, 4)

Fake.csv columns: ['title', 'text', 'subject', 'date']
True.csv columns: ['title', 'text', 'subject', 'date']

--- Fake.csv (first 3 rows) ---
                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk...    News   

                date  
0  December 31, 2017  
1  December 31, 2017  
2  December 30, 2017  

--- True.csv (first 3 rows) ---
                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican s
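
Since every column is read as a string, `isna()` only flags fields that were empty in the CSV; cells that contain nothing but whitespace pass unnoticed. The sketch below shows one way to check both the column layout and blank cells in a single pass. The helper `inspect_frame` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def inspect_frame(df, expected_columns):
    """Return (columns_match, per-column count of missing or whitespace-only cells)."""
    columns_match = list(df.columns) == list(expected_columns)
    # A cell counts as blank if it is NaN/None or holds only whitespace.
    blank = df.apply(lambda col: col.isna() | col.str.strip().eq(""))
    return columns_match, blank.sum()


# Invented toy data with one whitespace-only title, one missing title, one empty text.
toy = pd.DataFrame(
    {
        "title": ["A headline", "   ", None],
        "text": ["Body", "", "More body"],
    }
)
ok, blanks = inspect_frame(toy, ["title", "text"])
print("columns match:", ok)
print(blanks.to_dict())
```

On the real data, the same call could be made as `inspect_frame(fake, ["title", "text", "subject", "date"])` and likewise for `true`.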

## Step 2 — Labeling and Merging

**Objective:**  
Assign binary labels to both source datasets (`0` = Fake, `1` = Real) and combine them into a single unified DataFrame.  
This consolidated dataset will form the foundation for all upcoming preprocessing and model training tasks.

### Labeling and Merging

In [5]:
import pandas as pd

# Assign binary labels
fake["label"] = 0   # 0 = Fake news
true["label"] = 1   # 1 = Real news

# Merge and shuffle the datasets
df = pd.concat([fake, true], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Display basic information
print("Combined dataset shape:", df.shape)
print("\nColumns:", df.columns.tolist())

# Label distribution
label_dist = df["label"].value_counts(normalize=True).round(3)
print("\nLabel distribution:\n", label_dist)

# Quick sanity check
df.head(3)

Combined dataset shape: (44898, 5)

Columns: ['title', 'text', 'subject', 'date', 'label']

Label distribution:
 label
0    0.523
1    0.477
Name: proportion, dtype: float64


| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Ben Stein Calls Out 9th Circuit Court: Committ... | 21st Century Wire says Ben Stein, reputable pr... | US_News | February 13, 2017 | 0 |
| 1 | Trump drops Steve Bannon from National Securit... | WASHINGTON (Reuters) - U.S. President Donald T... | politicsNews | April 5, 2017 | 1 |
| 2 | Puerto Rico expects U.S. to lift Jones Act shi... | (Reuters) - Puerto Rico Governor Ricardo Rosse... | politicsNews | September 27, 2017 | 1 |
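
A quick way to confirm that labeling, concatenation, and shuffling neither drop nor relabel rows is to compare counts before and after. This is a small sketch on invented toy frames (`fake_toy`, `true_toy`); it uses `assign` so the source frames are not modified, which is a variation on the in-place assignment used in the cell above.

```python
import pandas as pd

# Invented stand-ins for the two source datasets.
fake_toy = pd.DataFrame({"text": ["f1", "f2", "f3"]})
true_toy = pd.DataFrame({"text": ["t1", "t2"]})

# assign() returns labeled copies, so the source frames stay unmodified.
combined = pd.concat(
    [fake_toy.assign(label=0), true_toy.assign(label=1)],
    ignore_index=True,
)
shuffled = combined.sample(frac=1, random_state=42).reset_index(drop=True)

# Shuffling must not add, drop, or relabel rows.
assert len(shuffled) == len(fake_toy) + len(true_toy)
counts = shuffled["label"].value_counts().to_dict()
assert counts == {0: len(fake_toy), 1: len(true_toy)}

# The same seed reproduces the same row order.
again = combined.sample(frac=1, random_state=42).reset_index(drop=True)
assert shuffled.equals(again)
```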


## Step 3 — Data Cleaning and Normalization

**Objective:**  
Clean and normalize the combined dataset to remove noise, duplicated content, and irrelevant text patterns.  
This step ensures consistent formatting and prepares the text data for vectorization or transformer tokenization.

### Cleaning

In [6]:
import re

# Drop duplicates based on text content
before_dups = df.shape[0]
df = df.drop_duplicates(subset=["text"])
after_dups = df.shape[0]
print(f"Removed {before_dups - after_dups} duplicate rows.")

# Remove rows with extremely short text (50 characters or fewer)
before_short = df.shape[0]
df = df[df["text"].str.len() > 50]
after_short = df.shape[0]
print(f"Removed {before_short - after_short} short texts.")

# Normalize whitespace and remove URLs or special symbols
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r"http\S+", "", text)           # remove URLs
    text = re.sub(r"[^A-Za-z0-9.,!?'\s]", " ", text)  # remove special chars
    text = re.sub(r"\s+", " ", text).strip()      # normalize spaces
    return text

df["title"] = df["title"].apply(clean_text)
df["text"] = df["text"].apply(clean_text)

# Combine title and text for future modeling
df["content"] = df["title"] + " " + df["text"]

# Final structure overview
print("\nCleaned dataset shape:", df.shape)
print(df[["label", "content"]].head(3))

Removed 6252 duplicate rows.
Removed 144 short texts.

Cleaned dataset shape: (38502, 6)
   label                                            content
0      0  Ben Stein Calls Out 9th Circuit Court Committe...
1      1  Trump drops Steve Bannon from National Securit...
2      1  Puerto Rico expects U.S. to lift Jones Act shi...
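
To make the effect of the three substitutions in `clean_text` concrete, the sketch below runs the same regular expressions on a few invented strings. It is a standalone copy for illustration: the `None` check stands in for `pd.isna` so the block runs without pandas, and the sample inputs are assumptions, not rows from the dataset.

```python
import re


def clean_text(text):
    """Same three substitutions as the notebook's clean_text, minus the pandas NaN check."""
    if text is None:
        return ""
    text = re.sub(r"http\S+", "", text)               # remove URLs starting with http
    text = re.sub(r"[^A-Za-z0-9.,!?'\s]", " ", text)  # replace every other symbol with a space
    text = re.sub(r"\s+", " ", text).strip()          # collapse runs of whitespace
    return text


samples = [
    "Breaking: markets fell 5% today!  See https://example.com/story (via @desk)",
    "café’s menu",                # non-ASCII letters and curly apostrophes are dropped
    "Visit www.example.com now",  # URLs without an http prefix are left in place
]
cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two consequences are worth keeping in mind: the character whitelist removes accented letters and typographic apostrophes, and links that do not start with `http` survive the cleaning.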


### Text Normalization

In [7]:
# `re` and `clean_text` are already available from the previous cell (In [6]),
# which also applied the cleaning once. Re-running the steps below should leave
# already-cleaned text unchanged.

# Apply to both title and text columns
df["title"] = df["title"].apply(clean_text)
df["text"] = df["text"].apply(clean_text)

# Combine title and text into a new column for modeling
df["content"] = df["title"] + " " + df["text"]

# Preview cleaned content
df[["label", "content"]].head(3)

| | label | content |
|---|---|---|
| 0 | 0 | Ben Stein Calls Out 9th Circuit Court Committe... |
| 1 | 1 | Trump drops Steve Bannon from National Securit... |
| 2 | 1 | Puerto Rico expects U.S. to lift Jones Act shi... |


### Final Dataset Verification

Before moving on to modeling, verify the final structure, column names, and total record count.  
This confirms that no missing values remain, that the two classes are still roughly balanced after deduplication, and that the dataset is ready to be split into training and test sets.

In [8]:
# Final structure overview
print("Final dataset shape:", df.shape)
print("\nColumns:", df.columns.tolist())

# Check label balance again
print("\nLabel distribution:")
print(df["label"].value_counts(normalize=True).round(3))

# Sanity check for missing values
print("\nMissing values per column:")
print(df.isna().sum())

# Optional: preview random sample
df.sample(3, random_state=42)[["label", "content"]]

Final dataset shape: (38502, 6)

Columns: ['title', 'text', 'subject', 'date', 'label', 'content']

Label distribution:
label
1    0.55
0    0.45
Name: proportion, dtype: float64

Missing values per column:
title      0
text       0
subject    0
date       0
label      0
content    0
dtype: int64


| | label | content |
|---|---|---|
| 4796 | 0 | THIS LL MAKE YOUR DAY THE CHALKENING Continues... |
| 18400 | 0 | POWERFUL! FORMERLY OPPRESSED EX MUSLIM WARNS I... |
| 34062 | 0 | American Medical Association Says GOP s Make A... |
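
Since the next phase splits the data into training and test sets, one possible way to keep the label ratio the same in both parts is a per-class (stratified) sample. The sketch below uses only pandas (`groupby(...).sample`, available from pandas 1.1); the helper `stratified_split`, the 20% test fraction, and the toy frame are assumptions for illustration, and the notebook may well use a different splitting method later.

```python
import pandas as pd


def stratified_split(df, label_col="label", test_frac=0.2, seed=42):
    """Hold out test_frac of each class; the remaining rows form the training set.

    Assumes the index labels of df are unique.
    """
    test = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    train = df.drop(index=test.index)
    return train, test


# Invented toy data: 10 rows of class 0 and 15 rows of class 1.
toy = pd.DataFrame(
    {
        "content": [f"doc {i}" for i in range(25)],
        "label": [0] * 10 + [1] * 15,
    }
)
train, test = stratified_split(toy)
print("train:", train["label"].value_counts().to_dict())
print("test:", test["label"].value_counts().to_dict())
```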


In [9]:
## Save cleaned dataset

import os

output_dir = "../data/processed"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "cleaned_combined.csv")

df.to_csv(output_path, index=False, encoding="utf-8")
print(f"✅ Cleaned dataset saved successfully at: {output_path}")
print(f"Final shape: {df.shape}")

✅ Cleaned dataset saved successfully at: ../data/processed/cleaned_combined.csv
Final shape: (38502, 6)
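
One thing to watch when reloading the saved file: `clean_text` can return empty strings, and `pd.read_csv` turns empty fields back into `NaN` by default. The sketch below demonstrates the effect on an invented two-row frame written to a temporary directory; passing `keep_default_na=False` is one way to preserve the empty strings.

```python
import os
import tempfile

import pandas as pd

# Invented toy frame with one empty title, as clean_text could produce.
toy = pd.DataFrame(
    {
        "title": ["", "Some headline"],
        "content": [" Body text", "Some headline Body text"],
        "label": [0, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.csv")
    toy.to_csv(path, index=False, encoding="utf-8")
    default_read = pd.read_csv(path)                        # empty fields become NaN
    strict_read = pd.read_csv(path, keep_default_na=False)  # empty fields stay ""

print("NaN titles with defaults:", default_read["title"].isna().sum())
print("NaN titles with keep_default_na=False:", strict_read["title"].isna().sum())
```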
