# **RESEARCH PROJECT**

# **Detecting Fake News Using Natural Language Processing**

Mini Research Project in Artificial Intelligence / Machine Learning

# By: **K.Eekshith Sai**

# 1. Introduction

Fake news has become a serious issue in the digital age, especially with the widespread use of social media.  
Misinformation can have harmful effects on public opinion, health, and political decisions.  
This research project aims to build an AI model that can automatically detect whether a news article is real or fake using Natural Language Processing (NLP).

We use a publicly available dataset from Kaggle, convert the text into TF-IDF features, and train a Passive Aggressive Classifier to detect patterns in the text.


# 2. Problem Statement & Objectives

**Problem:**  
With the increase in fake news content online, it's important to build automated systems that can classify news articles as real or fake accurately.

**Objectives:**  
- Preprocess and clean the dataset
- Extract features using NLP techniques
- Train a classification model
- Evaluate the performance using metrics like accuracy and F1-score


# 3. Dataset Information

We use the *Fake and Real News Dataset* from Kaggle, which contains labeled news articles.

- `Fake.csv` contains 23,481 fake news articles.
- `True.csv` contains 21,417 real news articles.

Each file has the following columns:
- `title` – the title of the article
- `text` – the full content
- `subject` – the topic category
- `date` – publication date
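
The two file sizes listed above imply a fairly balanced dataset. As a rough sanity check, the arithmetic below (a minimal sketch that uses only the counts stated in this section) works out the class shares and the accuracy of a trivial classifier that always predicts the larger class. The variable names are illustrative.

```python
# Article counts as stated in the dataset description.
n_fake = 23_481
n_real = 21_417
n_total = n_fake + n_real

# Share of each class in the combined dataset.
fake_share = n_fake / n_total
real_share = n_real / n_total

# A classifier that always predicts the larger class would be right this
# often, so this is a floor that any trained model should clearly beat.
majority_baseline = max(fake_share, real_share)

print(f"Total articles: {n_total}")
print(f"Fake share: {fake_share:.1%}, real share: {real_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any accuracy reported for the trained model is best read against this baseline rather than against 0%.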


In [None]:
import pandas as pd

# Load the datasets
fake = pd.read_csv('/content/Fake.csv')
real = pd.read_csv('/content/True.csv')

# Add labels
fake['label'] = 'FAKE'
real['label'] = 'REAL'

# Combine both datasets
df = pd.concat([fake, real], axis=0).reset_index(drop=True)
df.head()


# 4. Data Preprocessing

To prepare the data for model training, we:
- Remove the `subject` and `date` columns
- Combine `title` and `text` into a single `content` column
- Keep the labels as two classes, `FAKE` and `REAL`, which scikit-learn classifiers accept directly
- Check and handle missing values


In [None]:
# Drop unnecessary columns
df.drop(['subject', 'date'], axis=1, inplace=True)

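# Combine title and text; fill missing values first so that one
# missing field does not turn the whole combined value into NaN
df['content'] = df['title'].fillna('') + " " + df['text'].fillna('')
df.drop(['title', 'text'], axis=1, inplace=True)

# Confirm that no missing values remain
print("Missing values:\n", df.isnull().sum())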
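
When `title` and `text` are joined with `+`, a missing value in either column makes the combined value missing as well. The toy example below (made-up rows, not taken from the dataset) illustrates the effect and one possible guard using `fillna('')`. The frame `toy` and the names `naive` and `safe` are hypothetical.

```python
import pandas as pd

# Toy frame (invented for illustration): the second row has no title.
toy = pd.DataFrame({
    "title": ["Budget vote", None],
    "text": ["The senate passed the bill.", "Body without a headline."],
})

# Plain concatenation: a missing title makes the combined value missing too.
naive = toy["title"] + " " + toy["text"]

# Filling missing pieces with an empty string keeps the text that does exist;
# str.strip() removes the leftover leading space.
safe = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```

Whether to fill or drop such rows is a judgment call; filling keeps more training data, while dropping avoids very short articles.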


# 5. Feature Extraction using TF-IDF

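We use **TF-IDF Vectorization** to convert the text into numerical features.  
TF-IDF gives a higher weight to words that are frequent in one article but rare across the whole corpus. Very common words are handled by the vectorizer settings: `stop_words='english'` removes words like “the” and “is”, and `max_df=0.7` drops any term that appears in more than 70% of the articles.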
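
To make the weighting concrete, here is a minimal hand-rolled sketch of TF-IDF on three made-up token lists. It uses the smoothed IDF formula that scikit-learn documents as the `TfidfVectorizer` default, `ln((1 + n) / (1 + df)) + 1`, and leaves out the L2 normalization that the vectorizer also applies. The helper names `idf` and `tfidf` are illustrative, not part of any library.

```python
import math
from collections import Counter

# Three made-up "articles", already lowercased and tokenized.
docs = [
    ["the", "senate", "passed", "the", "bill"],
    ["the", "aliens", "built", "the", "pyramids"],
    ["the", "bank", "raised", "rates"],
]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    """Raw term count times IDF for every word in one document."""
    counts = Counter(doc)
    return {word: count * idf(word) for word, count in counts.items()}

weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} idf={idf(word):.3f} tfidf={weight:.3f}")
```

In this toy corpus "the" has the lowest IDF because it appears in every document, yet its raw weight in the first document is still the largest because it occurs twice. This is one reason stop-word removal is used alongside TF-IDF rather than relying on the weighting alone.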


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split features and labels
X = df['content']
y = df['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorization
tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)


# 6. Model Building

We use the **Passive Aggressive Classifier**, a fast and efficient linear model for large-scale classification tasks.  
It is an online learning algorithm: it leaves its weights unchanged (passive) when an example is classified correctly with a sufficient margin, and otherwise moves them toward classifying that example correctly (aggressive).
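
The update rule can be sketched in a few lines of plain Python. This is an illustrative version of the PA-I variant, which scikit-learn documents as the behavior of the default `loss='hinge'`; it is not the library's implementation. It assumes labels encoded as -1 and +1 and dense feature lists, and the function name `pa_update` is hypothetical.

```python
def pa_update(weights, features, label, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label in {-1, +1}."""
    score = sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - label * score)  # hinge loss
    if loss == 0.0:
        return weights, loss  # passive: margin already satisfied
    norm_sq = sum(x * x for x in features)
    if norm_sq == 0.0:
        return weights, loss  # all-zero features: nothing to update
    step = min(C, loss / norm_sq)  # aggressive, but capped by C
    new_weights = [w + step * label * x for w, x in zip(weights, features)]
    return new_weights, loss

# Toy run with two features; +1 could stand for REAL and -1 for FAKE.
w0 = [0.0, 0.0]
w1, loss1 = pa_update(w0, [1.0, 0.0], +1)  # margin violated: weights move
w2, loss2 = pa_update(w1, [1.0, 0.0], +1)  # margin satisfied: no change
w3, loss3 = pa_update(w2, [0.0, 2.0], -1)  # margin violated: weights move

print(w1, loss1)
print(w2, loss2)
print(w3, loss3)
```

The cap `C` limits how far a single example can move the weights, which makes the model less sensitive to mislabeled articles.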


In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier

# Model training (random_state is fixed so that results are reproducible)
model = PassiveAggressiveClassifier(max_iter=1000, random_state=42)
model.fit(X_train_vec, y_train)


# 7. Model Evaluation

We evaluate the model using:
- **Accuracy** – overall correctness
- **Confusion Matrix** – breakdown of real vs fake predictions
- **Classification Report** – precision, recall, and F1-score

> The F1-score is the harmonic mean of precision and recall, so it is high only when both precision and recall are high.
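
As a worked example of how these metrics relate, the sketch below computes them from four confusion-matrix counts. The counts are invented for illustration and are not results from this project.

```python
# Hypothetical confusion-matrix counts, invented for illustration only.
tp, fp, fn, tn = 90, 10, 30, 70

# Precision: of everything predicted positive, how much was correct.
precision = tp / (tp + fp)

# Recall: of everything actually positive, how much was found.
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1-score here lands below the simple average of precision and recall.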


In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predictions
y_pred = model.predict(X_test_vec)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


# 8. Results

The model achieved an accuracy of approximately **93%** on the held-out test set (20% of the data), which shows that it distinguishes real from fake news well on this dataset.

**Confusion Matrix Summary** (treating real news as the positive class):
- True Positives: Real news correctly identified
- True Negatives: Fake news correctly identified
- False Positives: Fake news predicted as real
- False Negatives: Real news predicted as fake
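
The four terms above can be read directly from scikit-learn's `confusion_matrix` output. The sketch below uses six made-up labels, not the project's predictions, and passes `labels=["FAKE", "REAL"]` explicitly so that the row and column order is unambiguous.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for six articles, only to show the matrix layout.
y_true = ["REAL", "REAL", "REAL", "FAKE", "FAKE", "FAKE"]
y_pred = ["REAL", "REAL", "FAKE", "FAKE", "FAKE", "REAL"]

# Rows are true classes and columns are predicted classes, both in the
# order given by `labels`. Listing FAKE first makes REAL the positive class.
matrix = confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

With this ordering, `fp` counts fake articles predicted as real and `fn` counts real articles predicted as fake, matching the definitions in the list.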

The precision and recall values are also balanced, making this a strong baseline model.


# 9. Conclusion

In this research project, we demonstrated how NLP techniques can be used to detect fake news.  
Using the Passive Aggressive Classifier along with TF-IDF feature extraction, we achieved over **90% accuracy**.

### Key Takeaways:
- Text data can be converted into machine-understandable format using vectorizers like TF-IDF
- Even simple ML models can achieve strong performance with good preprocessing
- There is room to further improve accuracy using deep learning (e.g., BERT, LSTM)

This system can potentially be used as a base for a real-time fake news detection application.
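
One possible starting point for such an application is to bundle the vectorizer and the classifier into a single scikit-learn pipeline, so that raw text goes in and a label comes out. The sketch below is an assumption about how this could be organized, not part of the project code: the four training sentences are invented so that the example runs on its own, and `classify_article` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, present only so the sketch is self-contained.
texts = [
    "Senate passes budget bill after lengthy debate",
    "Central bank keeps interest rates unchanged this quarter",
    "Shocking secret cure doctors do not want you to know",
    "You will not believe what this celebrity said about aliens",
]
labels = ["REAL", "REAL", "FAKE", "FAKE"]

# Same vectorizer settings and model type as in the sections above.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", max_df=0.7),
    PassiveAggressiveClassifier(max_iter=1000, random_state=42),
)
pipeline.fit(texts, labels)

def classify_article(title, text):
    """Return 'FAKE' or 'REAL' for one article (hypothetical helper)."""
    content = f"{title} {text}"
    return pipeline.predict([content])[0]

prediction = classify_article(
    "Budget vote", "The senate passed the budget bill on Tuesday."
)
print(prediction)
```

In practice the pipeline would be fitted on the full training split; a model trained on four sentences is only a placeholder and its predictions carry no weight.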


# 10. References

- Kaggle Dataset: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset  
- scikit-learn Documentation: https://scikit-learn.org  
- Research Papers on Fake News Detection using NLP
