## **1. Import Required Libraries**

In [41]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## **2. Download NLTK Resources**

In [42]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/magnus/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/magnus/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/magnus/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **3. Load the Dataset**

In [43]:
df = pd.read_csv('../data/raw/WELFake_Dataset.csv')

## **4. Remove Unnecessary Index Column**

In [44]:
df = df.drop(columns=['Unnamed: 0'], errors='ignore')

## **5. Handle Missing Values in Text Columns**

In [45]:
df['title'] = df['title'].fillna('')
df['text'] = df['text'].fillna('')

## **6. Combine Title and Text into Single Content Column**

In [56]:
df['content'] = df['title'] + " " + df['text']
df['content']

0        LAW ENFORCEMENT ON HIGH ALERT Following Threat...
1           Did they post their votes for Hillary already?
2        UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...
3        Bobby Jindal, raised Hindu, uses story of Chri...
4        SATAN 2: Russia unvelis an image of its terrif...
                               ...                        
72129    Russians steal research on Trump in hack of U....
72130     WATCH: Giuliani Demands That Democrats Apolog...
72131    Migrants Refuse To Leave Train At Refugee Camp...
72132    Trump tussle gives unpopular Mexican leader mu...
72133    Goldman Sachs Endorses Hillary Clinton For Pre...
Name: content, Length: 72134, dtype: object

## **7. Convert Text to Lowercase**

In [47]:
df['content'] = df['content'].str.lower()

## **8. Remove Punctuation and Special Characters**

In [48]:
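def remove_punctuation(text):
    # Replace every character that is not an ASCII letter or whitespace with a space.
    # This also drops digits and accented letters, not only punctuation.
    return re.sub(r'[^a-zA-Z\s]', ' ', text)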

df['content'] = df['content'].apply(remove_punctuation)

## **9. Remove Extra White Spaces**

In [49]:
df['content'] = df['content'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())
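To make the effect of steps 8 and 9 concrete, the sketch below applies the same two regular expressions to a single made-up string. `sample` is an illustrative value, not a row from the dataset, and `collapse_whitespace` is a hypothetical name given to the lambda used above. Because the character class keeps only ASCII letters and whitespace, the apostrophe and the digit are replaced along with the punctuation.

```python
import re

def remove_punctuation(text):
    # Same pattern as step 8: anything that is not a-z, A-Z, or whitespace becomes a space.
    return re.sub(r'[^a-zA-Z\s]', ' ', text)

def collapse_whitespace(text):
    # Same pattern as step 9: squeeze runs of whitespace and trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

# Illustrative input only (already lowercased, as after step 7).
sample = "obama's attorney general: satan 2!"

step8 = remove_punctuation(sample)
step9 = collapse_whitespace(step8)

print(step9)  # obama s attorney general satan
```

The stray `s` left behind by the apostrophe is the same fragment visible in the `content` column of the comparison in section 13.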

## **10. Load Stopwords**

In [52]:
stop_words = set(stopwords.words('english'))

## **11. Remove Stopwords**

In [53]:
def remove_stopwords(text):
    # Tokenize the text, drop English stopwords, and rejoin the rest into one string.
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)

df['cleaned_content'] = df['content'].apply(remove_stopwords)

## **12. Tokenization**

In [54]:
df['tokens'] = df['cleaned_content'].apply(word_tokenize)
df[['cleaned_content', 'tokens']].head()

| | cleaned_content | tokens |
|---|---|---|
| 0 | law enforcement high alert following threats c... | [law, enforcement, high, alert, following, thr... |
| 1 | post votes hillary already | [post, votes, hillary, already] |
| 2 | unbelievable obama attorney general says charl... | [unbelievable, obama, attorney, general, says,... |
| 3 | bobby jindal raised hindu uses story christian... | [bobby, jindal, raised, hindu, uses, story, ch... |
| 4 | satan russia unvelis image terrifying new supe... | [satan, russia, unvelis, image, terrifying, ne... |


## **13. Comparison of Content**

In [55]:
df[['content', 'cleaned_content']].head(3)

| | content | cleaned_content |
|---|---|---|
| 0 | law enforcement on high alert following threat... | law enforcement high alert following threats c... |
| 1 | did they post their votes for hillary already | post votes hillary already |
| 2 | unbelievable obama s attorney general says mos... | unbelievable obama attorney general says charl... |


## **14. Check Final Null Values**

In [57]:
df.isnull().sum()

title              0
text               0
label              0
content            0
cleaned_content    0
tokens             0
dtype: int64

## **15. Save Clean Dataset**

In [58]:
df.to_csv("../data/processed/cleaned_news.csv", index=False)
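One thing to keep in mind when saving: CSV has no list type, so each list in the `tokens` column is written as its string representation and comes back as a plain string when the file is read again. The sketch below shows that round trip on a made-up token list, without pandas, using `ast.literal_eval` to turn the string back into a list. The variable names are illustrative.

```python
import ast

# Illustrative token list, shaped like one cell of the `tokens` column.
tokens = ["post", "votes", "hillary", "already"]

# A list stored in a CSV cell ends up as its string form.
serialized = str(tokens)

# literal_eval safely parses the Python literal back into a list.
restored = ast.literal_eval(serialized)

print(type(serialized).__name__, type(restored).__name__)  # str list
```

On a DataFrame loaded back with `pd.read_csv`, the same conversion would be `df['tokens'].apply(ast.literal_eval)`. Alternatively, the tokens can simply be recomputed from `cleaned_content`.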

## **16. Conclusion**

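The text preprocessing pipeline cleaned and normalized the news articles by:

*   **Normalization**: Converting the combined title and text to lowercase.
*   **Sanitization**: Replacing every character other than ASCII letters and whitespace (punctuation, digits, special characters) with a space.
*   **Noise Reduction**: Collapsing runs of whitespace into single spaces and trimming leading and trailing spaces.
*   **Filtering**: Removing English stopwords.
*   **Tokenization**: Splitting the cleaned text into word tokens.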
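
The steps summarized above can be condensed into one function. The following is a minimal sketch, not the notebook's exact code: so that it runs without NLTK data, it swaps in `str.split` for `word_tokenize` and a tiny hard-coded stopword set (`DEMO_STOP_WORDS`, a hypothetical name) for `stopwords.words('english')`. The function name `preprocess` is also an assumption.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's English stopword list instead.
DEMO_STOP_WORDS = {"did", "they", "their", "for", "on", "s", "the", "a", "of"}

def preprocess(title, text, stop_words=DEMO_STOP_WORDS):
    """Return (cleaned_content, tokens) for one article."""
    # Combine title and text, then lowercase.
    content = f"{title} {text}".lower()
    # Keep only ASCII letters and whitespace.
    content = re.sub(r"[^a-zA-Z\s]", " ", content)
    # Collapse repeated whitespace and trim.
    content = re.sub(r"\s+", " ", content).strip()
    # Whitespace split stands in for word_tokenize in this sketch.
    tokens = [word for word in content.split() if word not in stop_words]
    return " ".join(tokens), tokens

cleaned, tokens = preprocess("Did they post their votes for Hillary already?", "")
print(cleaned)  # post votes hillary already
```

Passing a different `stop_words` set, such as the NLTK list loaded in section 10, changes only the filtering step.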