# 📰 Fake News Classification Project  

## 📌 Project Overview  
The rapid spread of misinformation has become one of the biggest challenges in the digital age.  
This project aims to **classify news articles as either Fake (0) or Real (1)** using machine learning models.  
We combine **data preprocessing, feature extraction, and multiple ML algorithms** to identify which model performs best for the task.  



## 📊 Dataset  
We use two datasets:  

- **Fake.csv** → Contains fake news articles.  
- **True.csv** → Contains real news articles.  

Both datasets were combined into a single dataframe with a new **label column**:  
- `0` → Fake News  
- `1` → Real News  

After merging and shuffling, preprocessing steps were applied including:  
- Merging `title` and `text` columns.  
- Lowercasing all text.  
- Removing punctuation, numbers, and symbols.  
- Stopword removal (NLTK).  
- Lemmatization of the remaining tokens with **WordNet Lemmatizer**.  



## ⚙️ Project Features  
✔️ **Text Preprocessing:** Cleaning, tokenization, lemmatization, stopword removal.  
✔️ **Feature Extraction:** `CountVectorizer` used for Bag-of-Words representation.  
✔️ **Model Training:** Implemented and compared:  
   - **Naive Bayes**  
   - **Logistic Regression**  
   - **Random Forest**  
✔️ **Evaluation Metrics:** Accuracy, Precision, Recall, F1-Score, ROC-AUC.  
✔️ **Visualizations:** Top words, histograms, and word clouds for Fake vs Real news.  



## 🏆 Results & Best Model  
After comparing models on multiple metrics:  

| Model               | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---------------------|----------|-----------|--------|----------|---------|
| Logistic Regression | **0.98** | **0.98**  | **0.98** | **0.98** | **0.98** |
| Random Forest       | 0.97     | 0.97      | 0.97   | 0.97     | 0.97    |
| Naive Bayes         | 0.94     | 0.95      | 0.92   | 0.93     | 0.94    |

📌 **Best Performing Model:**  
**Logistic Regression** achieved the highest score on every metric (≈0.98 for Accuracy, F1-score, and ROC-AUC), making it the strongest of the three models for this fake news detection task.  




## Load and Combine the Datasets

In [None]:
# Import libraries
# Data handling
import pandas as pd

# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Model training
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, roc_auc_score,
)

# Visualization
import matplotlib.pyplot as plt
from collections import Counter

df_fake_news=pd.read_csv('/kaggle/input/fake-and-real-news-dataset/Fake.csv')
df_real_news=pd.read_csv('/kaggle/input/fake-and-real-news-dataset/True.csv')

# Add a label column to each dataset: 0 = fake, 1 = real
df_fake_news['label'] = 0
df_real_news['label'] = 1

# Concatenate both dataframes
df = pd.concat([df_fake_news, df_real_news], axis=0)

# Count missing values per column in the combined dataframe
print(df.isnull().sum())
print(df.shape)

# Shuffle the rows (fixed seed for reproducibility) and reset the index
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df.head(5))
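
Before modelling, it can help to confirm that the two classes are roughly balanced and that the combined frame contains no duplicate rows. The sketch below shows both checks on a small made-up frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the combined frame; these rows are invented for illustration
toy = pd.DataFrame({
    "title": ["a", "b", "b", "c"],
    "text":  ["x", "y", "y", "z"],
    "label": [0, 0, 0, 1],
})

# Class balance: share of each label
balance = toy["label"].value_counts(normalize=True)
print(balance)

# Exact duplicate rows (same title, text, and label)
n_dupes = int(toy.duplicated().sum())
print("Duplicate rows:", n_dupes)

# Drop duplicates before shuffling and splitting
toy_clean = toy.drop_duplicates().reset_index(drop=True)
print(toy_clean.shape)
```

On the real data, the same calls would be applied to `df` right after the `pd.concat` step.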




## Data Preprocessing

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Step 1: Merge title + text
df['text'] = df['title'] + " " + df['text']

# Step 2: Lowercase conversion
df['text'] = df['text'].str.lower()

# Step 3: Remove punctuation, numbers, symbols
df['text'] = df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

# Step 4: Preprocessing function using NLTK stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

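def preprocess(text):
    """Remove NLTK stopwords, then lemmatize the remaining tokens."""
    # Text is already lowercased and stripped of non-letters,
    # so a plain whitespace split is enough to tokenize it
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return " ".join(tokens)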

# Step 5: Print before vs after
print("Before preprocessing:\n", df['text'].head(5))

# Apply preprocessing
df['text'] = df['text'].apply(preprocess)

print("\nAfter preprocessing:\n", df['text'].head(5))
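
To see what the cleaning steps do to a single headline, here is a minimal sketch that applies the same sequence: lowercase, strip non-letters, drop stopwords. It uses a tiny hand-written stopword set (`TOY_STOPWORDS`, a stand-in for the NLTK list) and skips lemmatization so that it runs without downloading any corpora:

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"the", "in", "a", "of", "to", "and"}

def clean_text(text: str) -> str:
    """Lowercase, replace non-letters with spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)
    tokens = [w for w in text.split() if w not in TOY_STOPWORDS]
    return " ".join(tokens)

example = "BREAKING: 3 Senators Voted in the 2017 Budget!"
cleaned = clean_text(example)
print(cleaned)  # breaking senators voted budget
```

Note that digits such as "3" and "2017" disappear entirely, which is the intended effect of the non-letter filter.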

## Feature Extraction & Naive Bayes Prediction

In [None]:

# Features and target
X = df['text']
y = df['label']

# Convert text into Bag-of-Words count features
# (note: the vocabulary is learned from the full corpus, before the split)
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_counts, y, test_size=0.3, random_state=42)

# Model
model = MultinomialNB()
model.fit(X_train, y_train)

# Prediction
y_pred = model.predict(X_test)
print("===========Model Naive Bayes=============")
print("News is predicted as : ",y_pred[:10])

#metrics evaluation
nb_acc=accuracy_score(y_test, y_pred)
nb_pre=precision_score(y_test, y_pred)
nb_rec=recall_score(y_test, y_pred)
nb_f1=f1_score(y_test, y_pred)
nb_cm=confusion_matrix(y_test, y_pred)
nb_roc=roc_auc_score(y_test, y_pred)

print("==========================================")
print("============Model Performance=============")
print("Accuracy:",nb_acc)
print("Precision:",nb_pre)
print("Recall:",nb_rec)
print("F1-Score:",nb_f1)
print("Confusion Matrix:\n",nb_cm)
print("ROC-AUC:",nb_roc)
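
The cell above learns the vocabulary from the whole corpus before splitting. One common alternative, sketched here on an invented toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the vocabulary is learned from the training portion only. The texts and labels are illustrative assumptions, not data from this project:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus: 0 = fake, 1 = real
texts = [
    "shocking secret they hide", "you will not believe this secret",
    "shocking claim goes viral", "secret plot exposed viral",
    "senate passes budget bill", "minister announces budget plan",
    "court rules on trade bill", "officials confirm trade plan",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, stratifying so both classes appear in each part
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on X_tr only, then trains Naive Bayes
nb_pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", MultinomialNB()),
])
nb_pipeline.fit(X_tr, y_tr)

# At prediction time the test text is transformed with the training vocabulary
toy_pred = nb_pipeline.predict(X_te)
print(toy_pred)
```

Words that occur only in the test portion are simply ignored at prediction time, which mirrors how the model would treat unseen articles.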

## Logistic Regression Predictions

In [None]:
#logistic regression
reg = LogisticRegression(max_iter=1000)
reg.fit(X_train,y_train)
reg_pred=reg.predict(X_test)

#predictions
print("===========Model Logistic Regression=============")
print("News is predicted as : ",reg_pred[:10])

#metrics evaluation
lr_acc=accuracy_score(y_test, reg_pred)
lr_pre=precision_score(y_test, reg_pred)
lr_rec=recall_score(y_test, reg_pred)
lr_f1=f1_score(y_test, reg_pred)
lr_cm=confusion_matrix(y_test, reg_pred)
lr_roc=roc_auc_score(y_test, reg_pred)

print("==========================================")
print("============Model Performance=============")
print("Accuracy:",lr_acc)
print("Precision:",lr_pre)
print("Recall:",lr_rec)
print("F1-Score:",lr_f1)
print("Confusion Matrix:\n",lr_cm)
print("ROC-AUC:",lr_roc)
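
`roc_auc_score` is computed above from hard 0/1 predictions, which reduces the ROC curve to a single operating point. The metric is more commonly computed from continuous scores such as `predict_proba`. The following sketch contrasts the two on synthetic data generated with `make_classification` (an assumption for illustration; the numbers say nothing about this project's models):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the vectorized news data
X_syn, y_syn = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: equals (TPR + TNR) / 2 at the 0.5 threshold
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the predicted probability of the positive class (column 1)
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:       ", round(auc_from_labels, 3))
print("AUC from probabilities:", round(auc_from_proba, 3))
```

In the notebook, this would correspond to passing `reg.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of `reg_pred`.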

## Random Forest Predictions

In [None]:
# Random forest (fixed seed so results are reproducible across runs)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train,y_train)
rf_pred=rf.predict(X_test)

#predictions
print("===========Model Random Forest=============")
print("News is predicted as : ",rf_pred[:10])

#metrics evaluation
rf_acc=accuracy_score(y_test, rf_pred)
rf_pre=precision_score(y_test, rf_pred)
rf_rec=recall_score(y_test, rf_pred)
rf_f1=f1_score(y_test, rf_pred)
rf_cm=confusion_matrix(y_test, rf_pred)
rf_roc=roc_auc_score(y_test, rf_pred)

print("==========================================")
print("============Model Performance=============")
print("Accuracy:",rf_acc)
print("Precision:",rf_pre)
print("Recall:",rf_rec)
print("F1-Score:",rf_f1)
print("Confusion Matrix:\n",rf_cm)
print("ROC-AUC:",rf_roc)
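
The three evaluation cells repeat the same metric calls. A small helper can collect them in one place; `evaluate` below is a hypothetical name, and the example labels are made up so that the block runs on its own:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

def evaluate(y_true, y_pred):
    """Return the headline classification metrics as a dict."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1_score": f1_score(y_true, y_pred),
        "Confusion": confusion_matrix(y_true, y_pred),
    }

# Made-up labels: 3 true positives, 1 false negative,
# 1 false positive, 3 true negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]

scores = evaluate(y_true_toy, y_pred_toy)
for name, value in scores.items():
    print(name, value, sep=": ")
```

With such a helper, each model cell would reduce to a fit, a predict, and one call like `evaluate(y_test, rf_pred)`.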

## Comparing Models to Decide Which Performs Best

In [None]:
# Compare results of Logistic Regression, Random Forest, and Naive Bayes
results = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "Naive Bayes"],
    "Accuracy": [lr_acc, rf_acc, nb_acc],
    "Precision": [lr_pre, rf_pre, nb_pre],
    "Recall": [lr_rec, rf_rec, nb_rec],
    "F1_score": [lr_f1, rf_f1, nb_f1],
    "ROC_AUC": [lr_roc, rf_roc, nb_roc],
})

print("============== 📊 Model Comparison Table ==============")
print(results.to_string())

# ===================== Ranking Models =====================
# Rank based on Accuracy, F1 Score, and ROC-AUC
results["Acc_Rank"] = results["Accuracy"].rank(ascending=False)
results["F1_Rank"] = results["F1_score"].rank(ascending=False)
results["ROC_AUC_Rank"] = results["ROC_AUC"].rank(ascending=False)

# Calculate overall rank (lower = better)
results["Overall_Rank"] = results[["Acc_Rank","F1_Rank","ROC_AUC_Rank"]].sum(axis=1)

# Sort models by rank
results_sorted = results.sort_values("Overall_Rank")
print("\n============== 🏆 Ranked Models ==============")
print(results_sorted.to_string())

# Best model
best_model = results_sorted.iloc[0]["Model"]
print("\n🏆 Best Model Selected:", best_model)
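
As a worked example of the rank-sum rule, the sketch below applies the same ranking to the rounded scores from the results table in the overview. With no ties, each metric assigns ranks 1, 2, and 3, so the totals come out as 3, 6, and 9:

```python
import pandas as pd

# Rounded scores copied from the results table in the overview
table = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "Naive Bayes"],
    "Accuracy": [0.98, 0.97, 0.94],
    "F1_score": [0.98, 0.97, 0.93],
    "ROC_AUC": [0.98, 0.97, 0.94],
})

# Rank each metric so the highest score gets rank 1
metric_cols = ["Accuracy", "F1_score", "ROC_AUC"]
ranks = table[metric_cols].rank(ascending=False)

# Sum of ranks: lower totals indicate better overall standing
table["Overall_Rank"] = ranks.sum(axis=1)
print(table.sort_values("Overall_Rank").to_string(index=False))
```

By default, pandas assigns tied scores the average of the ranks they span, so two models tied for first on a metric would each receive 1.5.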


## Visualizations

## Top 20 Words in Fake News

In [None]:
#Top 20 words in Fake News
fake_words = " ".join(df[df['label'] == 0]['text']).split()
fake_counter = Counter(fake_words)
top_fake = fake_counter.most_common(20)

# top_fake is already a list of tuples [('word1', count1), ('word2', count2), ...]
words_fake = [w for w, c in top_fake]   # x-axis
counts_fake = [c for w, c in top_fake]  # y-axis

plt.figure(figsize=(10,6))
plt.bar(words_fake, counts_fake, color='red')
plt.xticks(rotation=45, ha='right')
plt.title("Top 20 Words in Fake News")
plt.ylabel("Frequency")
plt.show()



## Top 20 Words in Real News

In [None]:
# Top 20 words in Real News
real_words = " ".join(df[df['label'] == 1]['text']).split()
real_counter = Counter(real_words)
top_real = real_counter.most_common(20)

# top_real already list of tuples [('word1', count1), ('word2', count2), ...]
words_real = [w for w, c in top_real]   # x-axis
counts_real = [c for w, c in top_real]  # y-axis

plt.figure(figsize=(10,6))
plt.bar(words_real, counts_real, color='green')
plt.xticks(rotation=45, ha='right')
plt.title("Top 20 Words in Real News")
plt.ylabel("Frequency")
plt.show()
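
The two top-word cells differ only in the label they filter on and the bar colour. A minimal sketch of a shared helper (`top_words` is a hypothetical name, and the sample texts are invented) could look like this:

```python
from collections import Counter

def top_words(texts, n=20):
    """Return the n most frequent whitespace-separated tokens across texts."""
    counter = Counter()
    for doc in texts:
        counter.update(doc.split())
    return counter.most_common(n)

# Invented sample; in the notebook this would be df[df['label'] == 0]['text']
sample = ["trump said video", "video trump", "trump people"]
top = top_words(sample, n=2)
print(top)  # [('trump', 3), ('video', 2)]
```

Updating the counter one document at a time avoids building a single very large joined string for the whole class.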

## Histogram: Article Length (Number of Words)

In [None]:

df['word_count'] = df['text'].apply(lambda x: len(x.split()))

# Plot histogram for fake and real news
plt.figure(figsize=(10,5))
plt.hist(df[df['label'] == 0]['word_count'], bins=30, alpha=0.6, label='Fake News', color='pink')
plt.hist(df[df['label'] == 1]['word_count'], bins=30, alpha=0.6, label='Real News', color='purple')
plt.xlabel("Number of Words per Article")
plt.ylabel("Number of Articles")
plt.title("Article Length Distribution: Fake vs Real News", fontsize=16)
plt.legend()
plt.show()
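
A histogram can be complemented with per-class summary statistics. The sketch below computes the median article length per label on an invented frame with the same `label` and `text` columns:

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "label": [0, 0, 1, 1],
    "text": [
        "one two three",
        "one two three four five",
        "one two",
        "one two three four",
    ],
})

# Word count per article, using vectorized string methods
toy["word_count"] = toy["text"].str.split().str.len()

# Median article length per class
medians = toy.groupby("label")["word_count"].median()
print(medians)
```

The median is less sensitive than the mean to a handful of very long articles, which can otherwise dominate the right tail of the histogram.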

## Word Clouds for Real and Fake News

In [None]:
from wordcloud import WordCloud

# Select real and fake articles
real_texts = df[df['label']==1]['text']
fake_texts = df[df['label']==0]['text']

# Create one word cloud per class
real_wordcloud = WordCloud(colormap='PuRd').generate(' '.join(real_texts))
fake_wordcloud = WordCloud(colormap='plasma').generate(' '.join(fake_texts))


# Plot word clouds
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(real_wordcloud, interpolation='bilinear')
plt.title('Real News')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(fake_wordcloud, interpolation='bilinear')
plt.title('Fake News')
plt.axis('off')

plt.tight_layout()
plt.show()


## Submission.csv  
A `submission.csv` file is generated from the **Logistic Regression model**,  
which achieved the **best performance (≈98% accuracy)** on the held-out test split (30% of the data).  
The file contains the predicted labels (0 = Fake, 1 = Real) for the articles in that split.

In [None]:
# Save predictions of the best model (Logistic Regression) on the held-out test split

submission = pd.DataFrame({
    "id": range(len(reg_pred)),
    "label": reg_pred  # 0 = Fake, 1 = Real
})

submission.to_csv("submission.csv", index=False)
print("✅ submission.csv file generated!")
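
Because `id` is just `range(len(reg_pred))`, rows in the file cannot be traced back to rows of `df`. If that link matters, the index of `y_test` can serve as the identifier, since `train_test_split` preserves the pandas index of the target. A sketch under that assumption, using made-up values in place of `y_test` and `reg_pred`:

```python
import numpy as np
import pandas as pd

# Stand-ins: the target keeps the original row index; predictions are a plain array
y_test_toy = pd.Series([1, 0, 1], index=[17, 4, 42], name="label")
pred_toy = np.array([1, 0, 0])

submission_toy = pd.DataFrame({
    "id": y_test_toy.index,  # original row index in the shuffled frame
    "label": pred_toy,       # 0 = Fake, 1 = Real
})
print(submission_toy)
```

With this layout, a misclassified article can be looked up directly, for example via `df.loc[42]`.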