**4. LEMMATIZATION**

**Text lemmatization** is performed as part of the preprocessing stage of our Fake News Detection project.

### Step 1: Importing Libraries & Resources
I used **NLTK** for tokenization, stopword removal, and lemmatization, along with regular expressions for text cleaning.  
Required NLTK resources (`punkt`, `punkt_tab`, `wordnet`, `omw-1.4`, and `stopwords`) were downloaded.

In [1]:
import zipfile

with zipfile.ZipFile("archive.zip", 'r') as zip_ref:
    zip_ref.extractall("unzipped_data")

print("Files extracted successfully!")


Files extracted successfully!


In [2]:
import pandas as pd

fake_df = pd.read_csv("unzipped_data/Fake.csv")
true_df = pd.read_csv("unzipped_data/True.csv")



In [3]:
# Merge and label

# Add a label column
fake_df["label"] = "FAKE"
true_df["label"] = "TRUE"

# Merge into one dataset
data = pd.concat([fake_df, true_df], ignore_index=True)

# Shuffle the rows so FAKE and TRUE are mixed
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Check the structure
print(data.shape)
print(data["label"].value_counts())
data.head()

(44898, 5)
label
FAKE    23481
TRUE    21417
Name: count, dtype: int64


| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Ben Stein Calls Out 9th Circuit Court: Committ... | 21st Century Wire says Ben Stein, reputable pr... | US_News | February 13, 2017 | FAKE |
| 1 | Trump drops Steve Bannon from National Securit... | WASHINGTON (Reuters) - U.S. President Donald T... | politicsNews | April 5, 2017 | TRUE |
| 2 | Puerto Rico expects U.S. to lift Jones Act shi... | (Reuters) - Puerto Rico Governor Ricardo Rosse... | politicsNews | September 27, 2017 | TRUE |
| 3 | OOPS: Trump Just Accidentally Confirmed He Le... | On Monday, Donald Trump once again embarrassed... | News | May 22, 2017 | FAKE |
| 4 | Donald Trump heads for Scotland to reopen a go... | GLASGOW, Scotland (Reuters) - Most U.S. presid... | politicsNews | June 24, 2016 | TRUE |


In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Download resources once
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("stopwords")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Step 2: Preprocessing + Lemmatization Function
I defined a function `preprocess_and_lemmatize` that:
1. Converts text to lowercase.  
2. Removes non-alphabetic characters.  
3. Tokenizes the text into words.  
4. Removes English stopwords.  
5. Lemmatizes each remaining word with `WordNetLemmatizer`, which treats words as nouns by default.  
6. Joins words back into a cleaned string.  

This ensures the text is standardized and ready for modeling.

In [11]:
# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

# Define preprocessing + lemmatization function
def preprocess_and_lemmatize(text):
    if isinstance(text, str):  # make sure it's a string
        # Lowercase
        text = text.lower()

        # Remove punctuation, numbers, special chars
        text = re.sub(r'[^a-z\s]', '', text)

        # Tokenize
        tokens = nltk.word_tokenize(text)

        # Remove stopwords + lemmatize
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

        return " ".join(tokens)
    else:
        return ""
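
The cleaning step can be checked in isolation, without any NLTK resources. The sketch below reuses the same regular expression as `preprocess_and_lemmatize`; the helper `strip_non_letters` and the sample sentence are illustrative names, not part of the notebook. It shows that replacing characters with an empty string fuses neighbouring letters ("U.S." becomes "us", "court's" becomes "courts") and leaves fragments such as "th" from "9th". A hypothetical variant that substitutes a space instead is included for comparison; whether it is preferable depends on the downstream features.

```python
import re


def strip_non_letters(text, replacement=""):
    """Lowercase the text, then replace every character that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", replacement, text.lower())


sample = "U.S. court's 9th ruling"

# Same pattern and empty replacement as in preprocess_and_lemmatize
fused = strip_non_letters(sample)

# Hypothetical variant: substitute a space, then collapse repeated whitespace
spaced = " ".join(strip_non_letters(sample, " ").split())

print(fused)   # us courts th ruling
print(spaced)  # u s court s th ruling
```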


### Step 3: Applying Lemmatization

I applied the function to the `title` column to create a new column, `clean_title`, and to the `text` column to create a new column, `clean_text`.

In [12]:
# Lemmatize the title column
data["clean_title"] = data["title"].apply(preprocess_and_lemmatize)

# Lemmatize the text column
data["clean_text"] = data["text"].apply(preprocess_and_lemmatize)

# Quick preview of original + cleaned
data[["title", "clean_title", "text", "clean_text"]].head()


| | title | clean_title | text | clean_text |
|---|---|---|---|---|
| 0 | Ben Stein Calls Out 9th Circuit Court: Committ... | ben stein call th circuit court committed coup... | 21st Century Wire says Ben Stein, reputable pr... | st century wire say ben stein reputable profes... |
| 1 | Trump drops Steve Bannon from National Securit... | trump drop steve bannon national security council | WASHINGTON (Reuters) - U.S. President Donald T... | washington reuters u president donald trump re... |
| 2 | Puerto Rico expects U.S. to lift Jones Act shi... | puerto rico expects u lift jones act shipping ... | (Reuters) - Puerto Rico Governor Ricardo Rosse... | reuters puerto rico governor ricardo rossello ... |
| 3 | OOPS: Trump Just Accidentally Confirmed He Le... | oops trump accidentally confirmed leaked israe... | On Monday, Donald Trump once again embarrassed... | monday donald trump embarrassed country accide... |
| 4 | Donald Trump heads for Scotland to reopen a go... | donald trump head scotland reopen golf resort | GLASGOW, Scotland (Reuters) - Most U.S. presid... | glasgow scotland reuters u presidential candid... |
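
Because `preprocess_and_lemmatize` returns an empty string for non-string input, and a title made up only of digits or stopwords is also reduced to nothing, it may be worth counting empty results before moving on. The sketch below is one possible check on a small made-up frame (`toy` is a stand-in for `data`, and its values are invented for illustration); the same mask could be applied to `data["clean_title"]` or `data["clean_text"]`.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` after cleaning (values are made up)
toy = pd.DataFrame({
    "title": ["Trump drops Steve Bannon", "2017", None],
    "clean_title": ["trump drop steve bannon", "", ""],
})

# Rows whose cleaned text is empty or contains only whitespace
empty_mask = toy["clean_title"].str.strip().eq("")
n_empty = int(empty_mask.sum())

print(n_empty)                                  # number of empty cleaned titles
print(toy.loc[empty_mask, "title"].tolist())    # the originals that produced them
```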


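Lemmatization reduces words to their dictionary base form. Because `lemmatize` is called here without a part-of-speech argument, `WordNetLemmatizer` treats every token as a noun: plurals are reduced (e.g., calls → call, heads → head), while verb forms such as "committed" and "expects" are left unchanged, as the preview shows. Reducing a verb form such as running → run requires passing `pos="v"`. Collapsing inflected forms shrinks the vocabulary, which can help downstream models.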

Now the dataset has both original and cleaned versions of the news articles and titles, ready for further feature extraction and modeling.