## **NLP Project**

Status quo: two datasets are provided.
1) data.csv --> Used for training and testing. A label of 0 marks fake news and a label of 1 marks real news.
2) validation_data.csv --> Data without true labels. The label column is filled with 2 as a placeholder.

In each dataset, we have:
- Rows → one news article per row
- Columns → 5 columns, each with one piece of information (label, title, text, subject, and date)

## **1. Data Understanding & Set-up**

1.1 Importing libraries

In [1]:
import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\macat\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nl

True

1.2 Load the data

In [3]:
train = pd.read_csv("data.csv")      
val   = pd.read_csv("validation_data.csv")
print("Train shape:", train.shape, "| Validation shape:", val.shape)

Train shape: (39942, 5) | Validation shape: (4956, 5)


Train: 39,942 rows, each one news article in the training dataset, and 5 columns (label, title, text, subject, date).

Validation: 4,956 rows, each one news article that the model never sees during training, with the same 5 columns.

1.3 Verifying the schema. Confirm that the 5 expected columns exist, inspect their types, and check for missing values. This prevents downstream errors and gives a first view of data quality.

In [4]:
info_df = pd.DataFrame({
    "dtype": train.dtypes,
    "missing_count": train.isna().sum(),
    "missing_ratio": (train.isna().sum() / len(train)).round(4)
})

print("\nDataset schema & missing values:")
print(info_df)


Dataset schema & missing values:
          dtype  missing_count  missing_ratio
label     int64              0            0.0
title    object              0            0.0
text     object              0            0.0
subject  object              0            0.0
date     object              0            0.0
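
The check above prints a table for manual inspection. It can also be written as a guard that fails loudly when something is off. The sketch below is only an illustration and not part of the original notebook: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names, and the toy DataFrame stands in for `train`.

```python
import pandas as pd

# Hypothetical constant: the five columns described in the data overview.
EXPECTED_COLUMNS = ["label", "title", "text", "subject", "date"]

def check_schema(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> None:
    """Raise ValueError if a column is missing, unexpected, or contains nulls."""
    missing = [c for c in expected if c not in df.columns]
    extra = [c for c in df.columns if c not in expected]
    if missing or extra:
        raise ValueError(f"Schema mismatch. Missing: {missing}, unexpected: {extra}")
    nulls = df[expected].isna().sum()
    if nulls.any():
        raise ValueError(f"Null values found:\n{nulls[nulls > 0]}")

# Toy stand-in for the real training data.
toy = pd.DataFrame({
    "label": [0, 1],
    "title": ["t1", "t2"],
    "text": ["body one", "body two"],
    "subject": ["politics", "worldnews"],
    "date": ["2017-12-31", "2017-12-30"],
})

check_schema(toy)  # passes silently when the schema is as expected
```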


1.4 Checking duplicates and label balance

In [5]:
dupes = train.duplicated(subset=["title", "text"]).sum()
print("Duplicates:", dupes)

vc = train["label"].value_counts().sort_index()
print(pd.DataFrame({"count": vc, "ratio": (vc/len(train)).round(4)}))

Duplicates: 3513
       count   ratio
label               
0      19943  0.4993
1      19999  0.5007


Of the 39,942 articles in the training data, 3,513 rows have exactly the same title and text as an earlier row (about 8.8% of the dataset). The two labels are almost perfectly balanced (49.93% fake, 50.07% real).
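
The notebook counts the duplicates but does not remove them. If they stay in, the same article can end up in both the train and the test split, which may inflate test scores. Whether to drop them is a judgment call; the sketch below only shows one way it could be done, on a toy DataFrame rather than the real data.

```python
import pandas as pd

# Toy stand-in: rows 0 and 1 share the same title and text.
toy = pd.DataFrame({
    "label": [0, 0, 1, 1],
    "title": ["a", "a", "b", "c"],
    "text": ["same body", "same body", "other body", "third body"],
})

# keep="first" retains the first occurrence and drops later copies,
# mirroring what duplicated(subset=[...]) flags by default.
deduped = (
    toy.drop_duplicates(subset=["title", "text"], keep="first")
       .reset_index(drop=True)
)

print(len(toy), "->", len(deduped))  # 4 -> 3
```

Dropping duplicates before `train_test_split` is what keeps the two splits free of shared articles.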

## **2. Sentiment Analysis**

2.1 Cleaning and processing data

In [6]:
import re
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split

2.2 Tokenization / Stop Word Removal / Stemming / Lemmatization

In [7]:
STOP = set(ENGLISH_STOP_WORDS)
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

def tokenize_text(text):
    """Lowercase, strip URLs/HTML, regex-tokenize to letters/apostrophes, drop very short tokens."""
    if not isinstance(text, str):
        text = "" if text is None else str(text)
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', ' ', text)  
    text = re.sub(r'<.*?>', ' ', text)          
    tokens = re.findall(r"[a-z']+", text)           # regex tokenization (keeps letters and apostrophes; drops punctuation and digits)
    return [t for t in tokens if len(t) > 2]      
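
To see what the tokenizer keeps and drops, it helps to run it on one short sentence. The snippet below repeats the same steps under a hypothetical name, `tokenize_demo`, so that it runs on its own. The sample sentence is made up for illustration.

```python
import re

def tokenize_demo(text: str) -> list:
    """Same steps as tokenize_text, repeated so the snippet is self-contained."""
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', ' ', text)   # drop URLs
    text = re.sub(r'<.*?>', ' ', text)              # drop HTML tags
    tokens = re.findall(r"[a-z']+", text)           # letters and apostrophes only
    return [t for t in tokens if len(t) > 2]        # drop tokens of 1-2 characters

sample = "<b>Trump</b> won't sign the 2018 U.S. budget, see https://example.com/news"
tokens = tokenize_demo(sample)
print(tokens)
# ['trump', "won't", 'sign', 'the', 'budget', 'see']
```

In this sample, `won't` keeps its apostrophe, while `2018` and `U.S.` disappear entirely: digits never match the pattern, and the single letters `u` and `s` fall under the length filter.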

In [8]:
def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOP]

In [9]:
def lemmatize_tokens(tokens):
    return [LEMMATIZER.lemmatize(t) for t in tokens]

In [10]:
sample = str(train.iloc[0]["text"])
tokens_lemma = lemmatize_tokens(remove_stopwords(tokenize_text(sample)))
print(tokens_lemma)

['washington', 'reuters', 'head', 'conservative', 'republican', 'faction', 'congress', 'voted', 'month', 'huge', 'expansion', 'national', 'debt', 'pay', 'tax', 'cut', 'called', 'fiscal', 'conservative', 'sunday', 'urged', 'budget', 'restraint', 'keeping', 'sharp', 'pivot', 'way', 'republican', 'representative', 'mark', 'meadow', 'speaking', 'cbs', 'face', 'nation', 'drew', 'hard', 'line', 'federal', 'spending', 'lawmaker', 'bracing', 'battle', 'january', 'return', 'holiday', 'wednesday', 'lawmaker', 'begin', 'trying', 'pas', 'federal', 'budget', 'fight', 'likely', 'linked', 'issue', 'immigration', 'policy', 'november', 'congressional', 'election', 'campaign', 'approach', 'republican', 'seek', 'control', 'congress', 'president', 'donald', 'trump', 'republican', 'want', 'big', 'budget', 'increase', 'military', 'spending', 'democrat', 'want', 'proportional', 'increase', 'non', 'defense', 'discretionary', 'spending', 'program', 'support', 'education', 'scientific', 'research', 'infrastruct

In [11]:
def stem_tokens(tokens):
    return [STEMMER.stem(t) for t in tokens]

In [12]:
sample = str(train.iloc[0]["text"])
tokens_filtered = stem_tokens(remove_stopwords(tokenize_text(sample)))
print(tokens_filtered)

['washington', 'reuter', 'head', 'conserv', 'republican', 'faction', 'congress', 'vote', 'month', 'huge', 'expans', 'nation', 'debt', 'pay', 'tax', 'cut', 'call', 'fiscal', 'conserv', 'sunday', 'urg', 'budget', 'restraint', 'keep', 'sharp', 'pivot', 'way', 'republican', 'repres', 'mark', 'meadow', 'speak', 'cb', 'face', 'nation', 'drew', 'hard', 'line', 'feder', 'spend', 'lawmak', 'brace', 'battl', 'januari', 'return', 'holiday', 'wednesday', 'lawmak', 'begin', 'tri', 'pass', 'feder', 'budget', 'fight', 'like', 'link', 'issu', 'immigr', 'polici', 'novemb', 'congression', 'elect', 'campaign', 'approach', 'republican', 'seek', 'control', 'congress', 'presid', 'donald', 'trump', 'republican', 'want', 'big', 'budget', 'increas', 'militari', 'spend', 'democrat', 'want', 'proport', 'increas', 'non', 'defens', 'discretionari', 'spend', 'program', 'support', 'educ', 'scientif', 'research', 'infrastructur', 'public', 'health', 'environment', 'protect', 'trump', 'administr', 'will', 'say', 'go',

2.3 Visualizing the words after applying the different techniques

In [13]:
from termcolor import colored
from itertools import zip_longest
def _highlight_changes(before, after, n=25):
    """
    Show first n items, marking changes in red.
    Uses zip_longest so we can spot removals/insertions.
    """
    shown = []
    for b, a in zip_longest(before[:n], after[:n], fillvalue=None):
        if b is None and a is not None:
            shown.append(colored(a, 'red'))          
        elif a is None and b is not None:
            shown.append(colored(f"{b}⟂", 'red'))    
        elif a != b:
            shown.append(colored(a, 'red'))        
        else:
            shown.append(a)
    return shown
def visualize_pipeline(text, n=25):
    text = "" if text is None else str(text)

    tokens      = tokenize_text(text)
    no_stop     = remove_stopwords(tokens)
    stemmed     = stem_tokens(no_stop)
    lemmatized  = lemmatize_tokens(no_stop) 

    print(colored("\n--- VISUAL TOKEN TRANSFORMATIONS ---", "cyan"))
    print(colored("Original tokens:", "blue"), tokens[:n])
    print(colored("After stopword removal:", "green"), no_stop[:n])

    
    print(colored("After stemming (changes in red):", "yellow"),
          _highlight_changes(no_stop, stemmed, n))
    print(colored("After lemmatization (changes in red):", "yellow"),
          _highlight_changes(no_stop, lemmatized, n))


    removed = [t for t in tokens if t not in no_stop]
    if removed:
        print(colored("Removed stopwords (sample):", "magenta"), removed[:n])

def visualize_pipeline_from_df(df, row=0, text_col="text", n=25):
 
    if text_col not in df.columns:
        raise KeyError(f"Column '{text_col}' not found. Available: {list(df.columns)}")
    sample = df.iloc[row][text_col]
    visualize_pipeline(sample, n=n)

visualize_pipeline_from_df(train, row=0, text_col="text", n=25)



--- VISUAL TOKEN TRANSFORMATIONS ---
Original tokens: ['washington', 'reuters', 'the', 'head', 'conservative', 'republican', 'faction', 'the', 'congress', 'who', 'voted', 'this', 'month', 'for', 'huge', 'expansion', 'the', 'national', 'debt', 'pay', 'for', 'tax', 'cuts', 'called', 'himself']
After stopword removal: ['washington', 'reuters', 'head', 'conservative', 'republican', 'faction', 'congress', 'voted', 'month', 'huge', 'expansion', 'national', 'debt', 'pay', 'tax', 'cuts', 'called', 'fiscal', 'conservative', 'sunday', 'urged', 'budget', 'restraint', 'keeping', 'sharp']
After stemming (changes in red): ['washington', '\x1b[31mreuter\x1b[0m', 'head', '\x1b[31mconserv\x1b[0m', 'republican', 'faction', 'congress', '\x1b[31mvote\x1b[0m', 'month', 'huge', '\x1b[31mexpans\x1b[0m', '\x1b[31mnation\x1b[0m', 'debt', 'pay', 'tax', '\x1b[31mcut\x1b[0m', '\x1b[31mcall\x1b[0m', 'fiscal', '\x1b[31mconserv\x1b[0m', 'sunday', '\x1b[31murg\x1b[0m', 'budget', '

2.4 Cleaning pipeline: joins the tokens back into one cleaned string, so that any vectorizer (CountVectorizer, TfidfVectorizer) receives a continuous string as input rather than a Python list.

In [14]:
def clean_text(text: str) -> str:
    """tokenize -> remove stopwords -> stem -> join (no lemmatization)."""
    #text = "" if text is None else str(text)          # safety
    tokens = tokenize_text(text)
    tokens = remove_stopwords(tokens)
    tokens = stem_tokens(tokens)
    return " ".join(tokens)

def text_preprocessing(df_in: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Apply clean_text() to a dataframe column and return a new dataframe with 'text_clean'."""
    df_out = df_in.copy()
    if text_col not in df_out.columns:
        raise KeyError(f"Column '{text_col}' not found in the dataframe. Available: {list(df_out.columns)}")
    df_out["text_clean"] = df_out[text_col].fillna("").apply(clean_text)
    return df_out
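
The reason for joining tokens back into a string shows up at vectorization time. The notebook stops before that step, so the following is only a sketch with default `TfidfVectorizer` settings and made-up strings in place of real `text_clean` values. The vectorizer is fitted on the training strings only and then reused on the test strings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for text_clean values (already stemmed and space-joined).
train_texts = [
    "washington reuter budget vote",
    "crook lie hillari",
    "budget tax cut vote",
]
test_texts = ["senat budget vote", ""]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # learn vocabulary and IDF weights on train only
X_test = vectorizer.transform(test_texts)         # reuse the same vocabulary, learn nothing new

print(X_train.shape, X_test.shape)   # (3, 9) (2, 9)
print(X_test[1].nnz)                 # 0: an empty string becomes an all-zero row
```

Words that only appear in the test strings (here `senat`) are ignored, and an empty `text_clean` produces a row with no features at all.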

## **3. Applying Preprocessing to the Train/Test Split of data.csv and to the Validation Data**

3.1 Splitting the training data into train/test (80/20). 

In [15]:
train_split, test_split = train_test_split(
    train,
    test_size=0.2,
    random_state=42,
    stratify=train["label"] 
)
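
The sizes of the two splits can be worked out by hand. The arithmetic below assumes that scikit-learn rounds the test side up and gives the remainder to train, which is consistent with the shapes printed later in the notebook (31,953 and 7,989).

```python
import math

n_rows = 39942      # rows in data.csv, from the shape printed in 1.2
test_size = 0.2

n_test = math.ceil(n_rows * test_size)   # 7988.4 rounds up to 7989
n_train = n_rows - n_test                # the remaining rows go to train

print(n_train, n_test)   # 31953 7989
```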

Note: there are now three subsets of data (train, test, and validation), and all of them need to be cleaned before the model can use them.

3.2 Applying the same text-cleaning pipeline to every subset of the data. If the subsets were cleaned differently, the training and validation data would be "speaking different languages" and the model would perform poorly.

In [16]:
train_proc = text_preprocessing(train_split, text_col="text")
test_proc = text_preprocessing(test_split, text_col="text")
val_proc = text_preprocessing(val, text_col="text")

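3.3 Checking for empty rows. This step was added after finding that cleaning leaves some rows with an empty text_clean:
- 552 empty rows in train
- 144 empty rows in test
- 23 empty rows in val

Vectorization is more predictable with no empty rows, so it is worth knowing how many there are before modeling. The cells below reset the index and count the empty rows; they do not remove them.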

In [17]:
train_proc.reset_index(drop=True, inplace=True)
test_proc.reset_index(drop=True, inplace=True)
val_proc.reset_index(drop=True, inplace=True)

In [18]:
print("Empty rows in train:", (train_proc["text_clean"].fillna("").str.strip() == "").sum())
print("Empty rows in test :", (test_proc["text_clean"].fillna("").str.strip() == "").sum())
print("Empty rows in val  :", (val_proc["text_clean"].fillna("").str.strip() == "").sum(), "\n")

Empty rows in train: 552
Empty rows in test : 144
Empty rows in val  : 23 
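
The counts above show that empty rows are still present after cleaning. If the goal is to have none before vectorizing, they could be filtered out along these lines. `drop_empty_text` is a hypothetical helper, not something the notebook defines, and the toy DataFrame stands in for `train_proc`.

```python
import pandas as pd

def drop_empty_text(df: pd.DataFrame, col: str = "text_clean") -> pd.DataFrame:
    """Return a copy without rows whose cleaned text is empty or whitespace-only."""
    is_empty = df[col].fillna("").str.strip() == ""
    return df.loc[~is_empty].reset_index(drop=True)

# Toy stand-in: rows 1 and 2 have no usable text.
toy = pd.DataFrame({
    "label": [0, 1, 0, 1],
    "text_clean": ["crook lie hillari", "", "   ", "washington reuter"],
})

filtered = drop_empty_text(toy)
print(len(toy), "->", len(filtered))   # 4 -> 2
```

For the validation set, dropping rows would leave some articles without a prediction, so keeping them (and accepting an all-zero feature row) may be the better choice there.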



3.4 Simplifying the DataFrames for modeling. After preprocessing, each DataFrame carries six columns (the original five plus text_clean). The code below keeps only the two the model needs: label (the target variable) and text_clean (the processed input feature).
This avoids feeding the model irrelevant fields and reduces memory use.

In [19]:
cols_keep = ["label", "text_clean"]
train_ready = train_proc[cols_keep].copy()
test_ready  = test_proc[cols_keep].copy()
val_ready   = val_proc[cols_keep].copy()

In [20]:
print("Train:", train_ready.shape)
print(train_ready.head(), "\n")
print("Test:", test_ready.shape)
print(test_ready.head(), "\n")
print("Val:", val_ready.shape)
print(val_ready.head())

Train: (31953, 2)
   label                                         text_clean
0      0                                  crook lie hillari
1      0  justic scalia appear good health prior vacat c...
2      0  univers north texa student critic condit nra r...
3      0                                                   
4      1  washington reuter robert mueller special couns... 

Test: (7989, 2)
   label                                         text_clean
0      1  washington reuter senat major leader mitch mcc...
1      1  washington reuter state depart friday name rus...
2      0  report expos email share wikileak show result ...
3      1  washington reuter senat republican leader mitc...
4      0  democrat stood american peopl republican shout... 

Val: (4956, 2)
   label                                         text_clean
0      2  london reuter british prime minist theresa reg...
1      2  london reuter british counter terror polic mon...
2      2  wellington reuter south pacif island 

**Observations**
- Train: 31,953 rows
- Test: 7,989 rows
- Validation: 4,956 rows

From the sample rows, text_clean looks cleaned as intended:

- Lowercased text ✅
- Punctuation removed ✅
- Stopwords removed ✅
- Words appear stemmed (receiv, investig, violenc, defens, presid) ✅

Row 3 of the train sample has an empty text_clean, which is consistent with the empty rows counted in 3.3.

3.5 Checking the label distribution. This confirms that the train and test splits are balanced and comparable before modeling, and that the validation set contains only the placeholder label 2.

In [21]:
print("Unique labels — train:", sorted(train_proc["label"].unique()))
print("Unique labels — test :", sorted(test_proc["label"].unique()))
print("Unique labels — val  :", sorted(val_proc["label"].unique()), "\n")

print("Label distribution (counts)")
print("Train:\n", train_proc["label"].value_counts(), "\n")
print("Test:\n",  test_proc["label"].value_counts(),  "\n")
print("Val:\n",   val_proc["label"].value_counts(),   "\n")

print("Label distribution (percent)")
print("Train:\n", (train_proc["label"].value_counts(normalize=True) * 100).round(2), "\n")
print("Test:\n",  (test_proc["label"].value_counts(normalize=True) * 100).round(2),  "\n")
print("Val:\n",   (val_proc["label"].value_counts(normalize=True) * 100).round(2),   "\n")

Unique labels — train: [0, 1]
Unique labels — test : [0, 1]
Unique labels — val  : [2] 

Label distribution (counts)
Train:
 label
1    15999
0    15954
Name: count, dtype: int64 

Test:
 label
1    4000
0    3989
Name: count, dtype: int64 

Val:
 label
2    4956
Name: count, dtype: int64 

Label distribution (percent)
Train:
 label
1    50.07
0    49.93
Name: proportion, dtype: float64 

Test:
 label
1    50.07
0    49.93
Name: proportion, dtype: float64 

Val:
 label
2    100.0
Name: proportion, dtype: float64 



3.6 Saving the processed datasets after cleaning. This way the clean CSV files can be reloaded later to go straight to vectorization and modeling without repeating the cleaning steps.

In [22]:
train_ready.to_csv("train_ready.csv", index=False)
test_ready.to_csv("test_ready.csv", index=False)
val_ready.to_csv("validation_ready.csv", index=False)
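
One thing to watch when reloading these files: rows whose `text_clean` is an empty string come back from `pd.read_csv` as missing values (NaN) rather than as `""`. The sketch below reproduces this with an in-memory buffer instead of a real file and shows one possible fix.

```python
import io
import pandas as pd

# Toy stand-in for train_ready: the first row has an empty text_clean.
toy = pd.DataFrame({"label": [0, 1], "text_clean": ["", "washington reuter"]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # same call as above, written to memory
buffer.seek(0)

reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["text_clean"].isna().sum())
print(n_missing)   # 1: the empty string was read back as NaN

# Restore empty strings so that vectorizers receive text, not NaN.
reloaded["text_clean"] = reloaded["text_clean"].fillna("")
```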