# Data Preprocessing

In this notebook, we load and preprocess a labeled dataset of news articles for training a fake news classification model.  
We'll clean, label, shuffle, and split the dataset into training, validation, and test sets. 

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Load the Dataset

We are using the **ISOT Fake News Dataset**, which contains two CSV files:

- `isot_Fake.csv` → Contains fake news articles
- `isot_True.csv` → Contains real news articles

We'll load each into a DataFrame using pandas.


In [None]:
fake_df = pd.read_csv('../data/isot_Fake.csv')
real_df = pd.read_csv('../data/isot_True.csv')

## Label the Data

To train a binary classifier, we need to label the articles:
- Fake news → `label = 0`
- Real news → `label = 1`

We'll add a new column `label` to each DataFrame accordingly.


In [None]:
fake_df['label'] = 0
real_df['label'] = 1

## Merge and Shuffle the Dataset

We'll now combine both datasets into one DataFrame using `pd.concat()`.  
Because `pd.concat()` places every fake article before every real one, we shuffle the rows so that row order carries no information about the label.  
We'll also reset the index for a clean, continuous DataFrame.
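
As a minimal sketch of what the merge and shuffle do, the toy example below uses two small hypothetical DataFrames (`fake_toy` and `real_toy`, not part of the dataset) in place of the real ones. It shows that the labels are fully ordered before shuffling and that a fixed `random_state` makes the shuffle reproducible.

```python
import pandas as pd

# Hypothetical stand-ins for the labeled fake/real DataFrames.
fake_toy = pd.DataFrame({"title": ["f1", "f2", "f3"], "label": 0})
real_toy = pd.DataFrame({"title": ["r1", "r2", "r3"], "label": 1})

combined = pd.concat([fake_toy, real_toy], ignore_index=True)
# Before shuffling, all fake rows come first.
print(combined["label"].tolist())  # [0, 0, 0, 1, 1, 1]

# frac=1 samples every row once, i.e. a random permutation of the rows.
shuffled = combined.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.index.tolist())  # [0, 1, 2, 3, 4, 5]

# The same seed yields the same permutation on a second run.
again = combined.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.equals(again))  # True
```

Without `reset_index(drop=True)`, the shuffled rows would keep their original index values in scrambled order.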


In [None]:
df = pd.concat([fake_df, real_df], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
df['full_text'] = df['title'].fillna(' ') + ' ' + df['text'].fillna(' ')

## Clean the Text

We combine the `title` and `text` fields into a new `full_text` column.  
This gives the model richer context when making predictions.
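
The `fillna(' ')` calls matter when a title or body is missing. The sketch below uses a hypothetical two-row DataFrame (`toy`) to show that concatenating without filling turns the whole result into a missing value, while filling first keeps whatever text is available.

```python
import pandas as pd

# Hypothetical rows; the second one has no title.
toy = pd.DataFrame({
    "title": ["Headline A", None],
    "text": ["Body A", "Body B"],
})

# Without fillna, a missing title makes the combined value missing too.
naive = toy["title"] + " " + toy["text"]
print(naive.isna().tolist())  # [False, True]

# Filling missing values first preserves the body text.
safe = toy["title"].fillna(" ") + " " + toy["text"].fillna(" ")
print(safe.tolist())  # ['Headline A Body A', '  Body B']
```

The extra leading spaces in the second row are harmless here, since the later cleaning step strips leading and trailing whitespace.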

We also define a `clean_text()` function to:
- Lowercase the text
- Strip leading and trailing whitespace
- Replace newline and carriage-return characters with spaces

This step makes the text more consistent before tokenization.
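
To illustrate the effect of these three steps, the sketch below applies the same logic to two hypothetical sample strings. It is a standalone copy of the function for demonstration purposes only.

```python
def clean_text(text):
    # Lowercase and trim the ends, then flatten line breaks into spaces.
    text = text.lower().strip()
    text = text.replace("\n", " ").replace("\r", " ")
    return text

# Hypothetical input with mixed case, padding, and line breaks.
sample = "  BREAKING:\nSenate Passes Bill\r\n  "
print(repr(clean_text(sample)))  # 'breaking: senate passes bill'

# A Windows-style line break (\r\n) becomes two spaces, not one.
print(repr(clean_text("Line one\r\nLine two")))  # 'line one  line two'
```

If repeated spaces turn out to matter downstream, collapsing them with something like `" ".join(text.split())` would be one option. That is an assumption about later needs, not part of this notebook.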


In [None]:
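def clean_text(text):
    """Lowercase, trim, and flatten line breaks in a single string."""
    text = text.lower().strip()
    # Replace line breaks with spaces so each article stays on one line
    text = text.replace('\n', ' ').replace('\r', ' ')
    return text

df['full_text'] = df['full_text'].apply(clean_text)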

## Keep Only Relevant Columns

Now that we’ve created the `full_text` column, we no longer need the original `title` and `text` columns.  
We’ll keep only the `full_text` and `label` columns for simplicity.


In [None]:
df = df[['full_text', 'label']]

## Split the Dataset

We'll split the dataset into:
- **80%** Training set
- **10%** Validation set
- **10%** Test set

We use `train_test_split()` twice:
1. First to split the full dataset into `train` (80%) and `temp` (20%)
2. Then to split `temp` in half into `val` and `test` (10% each)

We stratify both splits on `label` to preserve class balance across all three sets.
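
The arithmetic behind the two-step split can be checked on a small example. The sketch below uses a hypothetical 100-row DataFrame (`toy`, 60 rows labeled 0 and 40 labeled 1) and confirms that splitting off 20% and then halving it gives 80/10/10, with the same class ratio in each part. Note that the second call stratifies on the labels of the held-out part, not of the full DataFrame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 rows of class 0, 40 rows of class 1.
toy = pd.DataFrame({
    "full_text": [f"article {i}" for i in range(100)],
    "label": [0] * 60 + [1] * 40,
})

# Step 1: hold out 20% of the rows.
train_toy, temp_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
# Step 2: halve the held-out rows, stratifying on their own labels.
val_toy, test_toy = train_test_split(
    temp_toy, test_size=0.5, random_state=42, stratify=temp_toy["label"]
)

print(len(train_toy), len(val_toy), len(test_toy))  # 80 10 10

# Count of class-1 rows in each split: 40% of each part.
for name, part in [("train", train_toy), ("val", val_toy), ("test", test_toy)]:
    print(name, int(part["label"].sum()), "of", len(part))
```

Passing a label series whose length differs from the data being split causes `train_test_split` to raise an error, which is why the second call uses `temp_toy["label"]`.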


In [None]:
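# 80% train, 20% held out for validation + test
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
# Split the held-out 20% in half; stratify on temp_df's own labels, not df's
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df['label'])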

## Save the Preprocessed Splits

Finally, we save the `train`, `val`, and `test` DataFrames to CSV files.  
These files will be used in the next stage: **tokenization and model training**.
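
Since a later notebook will read these files back, it may help to see what a CSV round trip does. The sketch below writes a hypothetical two-row DataFrame to an in-memory buffer instead of a file. It shows that `index=False` keeps the index out of the saved columns, and that an empty `full_text` value comes back as a missing value unless the reader is told otherwise.

```python
import io
import pandas as pd

# Hypothetical rows; the second has an empty text field.
toy = pd.DataFrame({"full_text": ["some article", ""], "label": [1, 0]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())              # ['full_text', 'label']
print(reloaded["full_text"].isna().tolist())  # [False, True]

# keep_default_na=False reads empty fields back as empty strings.
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)
print(kept["full_text"].tolist())             # ['some article', '']
```

Whether the real splits contain any empty `full_text` values is not known from this notebook; checking `isna().sum()` after loading would settle it.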


In [None]:
train_df.to_csv('../data/train.csv', index=False)
val_df.to_csv('../data/val.csv', index=False)
test_df.to_csv('../data/test.csv', index=False)

## Preprocessing Complete

- Loaded and labeled fake/real news articles
- Combined and shuffled the data
- Cleaned the text and merged title + body
- Split into train, validation, and test sets
- Saved the processed files to `../data/`
