This notebook builds a fake news prediction system using machine learning with Python. It focuses on Natural Language Processing (NLP) techniques and binary classification.

1. Workflow Overview
The project follows a structured data science pipeline:

Data Collection: Uses a labeled dataset (from Kaggle) of about 20,000 news articles, each labeled real or fake. The labels are encoded as fake (0) and real (1).

Pre-processing: Converting textual information into numerical data that a machine can interpret.

Model Training: Utilizing Logistic Regression for classification.

Evaluation: Using accuracy scores to determine how well the model identifies fake news.

2. Dataset Features
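The dataset loaded in this notebook contains 20,000 entries with seven columns:

Title: The headline of the article.

Text: The full body of the article.

Date: The publication date.

Source: The news outlet (for example, CNN or Reuters).

Author: The writer of the news.

Category: The topic of the article (for example, Politics or Science).

Label: Marks the news as real or fake; mapped to real (1) and fake (0) before training.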

3. Key Technical Steps
Handling Missing Values: Missing values (in this dataset, in the source and author columns) are filled with a string value so that every row can be processed. Without this step, concatenating a missing author with a title would produce a missing result.

Feature Engineering: The Author name and Title are merged into a single column called content. The idea is that this short combined text can serve as the feature for prediction with less computational load than processing the entire article text. Note that in the run recorded in this notebook, the vectorizer is fitted on the text column, not on content.

Text Stemming:

Stemming: Reducing words to their root form (e.g., "acting" becomes "act").

Stopwords Removal: Removing common words like "the," "a," and "is" that do not add significant meaning to the analysis.

Regular Expressions: Used to strip away numbers, punctuation, and special characters, leaving only alphabetic words.

Vectorization (TF-IDF): The TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency) converts text into numerical feature vectors. It gives higher weight to words that appear frequently in a specific document but rarely across the rest of the dataset.
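The TF-IDF weighting described above can be sketched in plain Python. This is a simplified illustration, not the notebook's code: it assumes whitespace tokenization, raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization. The function name tfidf and the three sample headlines are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document (simplified sketch)."""
    # assumption: lowercase + whitespace tokenization
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # term count times smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# hypothetical sample headlines, invented for illustration
docs = [
    "senate passes budget bill",
    "senate debates budget cuts",
    "aliens endorse budget bill",
]
vectors = tfidf(docs)
print(vectors[2])
```

In the third headline, "aliens" occurs in only one document and so receives a larger weight than "budget", which occurs in all three.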

4. Model Training & Mathematics
Logistic Regression: Chosen because it is a simple, widely used model for binary classification (Real vs. Fake).

Sigmoid Function: The model maps a weighted sum of the input features to a probability between 0 and 1 using the sigmoid function. With the label encoding used here (fake = 0, real = 1), a probability above 0.5 is classified as real; otherwise the article is classified as fake.

Train-Test Split: The data is split (80% for training, 20% for testing) to ensure the model's performance can be validated on unseen data.
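The sigmoid mapping and the 0.5 threshold from this section can be illustrated with a few lines of Python. This is a minimal sketch: the function names sigmoid and classify are hypothetical, the input z stands for the weighted sum of features, and the label convention (0 = fake, 1 = real) follows the label mapping applied in the notebook.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 (real) if the probability exceeds the threshold, else 0 (fake)."""
    return 1 if sigmoid(z) > threshold else 0

# hypothetical scores: negative, zero, positive
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

A score of 0 maps to a probability of exactly 0.5, so the sign of the weighted sum decides the class.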

5. Performance & Results
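Accuracy: In the run recorded in this notebook, the model reaches about 60% accuracy on the training data and about 52% on the test data. Since the two classes are almost evenly balanced, the test score is only slightly better than chance.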

Predictive System: The final section demonstrates a simple predictive step: a sample from the test set is passed to the trained model, and the predicted label is compared with the true label.

Importing the Dependencies

In [53]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [54]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [55]:
# printing the stopwords in English
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [56]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [57]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/drive/MyDrive/Data_Sets/fake_news_dataset.csv')

In [58]:
news_dataset.head()

Unnamed: 0,title,text,date,source,author,category,label
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake


In [59]:
news_dataset.shape

(20000, 7)

In [60]:
news_dataset.isnull().sum()

Unnamed: 0,0
title,0
text,0
date,0
source,1000
author,1000
category,0
label,0


In [61]:
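# replace missing values with empty strings
# (fillna('mean') would insert the literal text 'mean', not a computed mean)
news_dataset = news_dataset.fillna('')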

In [91]:
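# re-check for missing values (this cell was run after the merge step, so the output lists the 'content' column)
news_dataset.isnull().sum()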

Unnamed: 0,0
text,0
date,0
source,0
category,0
label,0
content,0


In [62]:
news_dataset.columns

Index(['title', 'text', 'date', 'source', 'author', 'category', 'label'], dtype='object')

In [63]:
print(news_dataset['label'].value_counts())

label
fake    10056
real     9944
Name: count, dtype: int64


In [64]:
news_dataset['label'] = news_dataset['label'].map({'fake': 0, 'real': 1})
print(news_dataset['label'].value_counts())

label
0    10056
1     9944
Name: count, dtype: int64


Merging the "author" name and news "title"

In [65]:
news_dataset['content'] = news_dataset['author']+':'+news_dataset['title']

In [66]:
news_dataset['content']

Unnamed: 0,content
0,Paula George:Foreign Democrat final.
1,Joseph Hill:To offer down resource great point.
2,Julia Robinson:Himself church myself carry.
3,Mr. David Foster DDS:You unit its should.
4,Austin Walker:Billion believe employee summer ...
...,...
19995,Gary Miles:House party born.
19996,Maria Mcbride:Though nation people maybe price...
19997,Kristen Franklin:Yet exist with experience unit.
19998,David Wise:School wide itself item.


In [67]:
news_dataset.drop(['author', 'title'], axis=1, inplace=True)
display(news_dataset.head())

Unnamed: 0,text,date,source,category,label,content
0,more tax development both store agreement lawy...,2023-03-10,NY Times,Politics,1,Paula George:Foreign Democrat final.
1,probably guess western behind likely next inve...,2022-05-25,Fox News,Politics,0,Joseph Hill:To offer down resource great point.
2,them identify forward present success risk sev...,2022-09-01,CNN,Business,0,Julia Robinson:Himself church myself carry.
3,phone which item yard Republican safe where po...,2023-02-07,Reuters,Science,0,Mr. David Foster DDS:You unit its should.
4,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Technology,0,Austin Walker:Billion believe employee summer ...


Separating the data & label

In [68]:
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [69]:
X

Unnamed: 0,text,date,source,category,content
0,more tax development both store agreement lawy...,2023-03-10,NY Times,Politics,Paula George:Foreign Democrat final.
1,probably guess western behind likely next inve...,2022-05-25,Fox News,Politics,Joseph Hill:To offer down resource great point.
2,them identify forward present success risk sev...,2022-09-01,CNN,Business,Julia Robinson:Himself church myself carry.
3,phone which item yard Republican safe where po...,2023-02-07,Reuters,Science,Mr. David Foster DDS:You unit its should.
4,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Technology,Austin Walker:Billion believe employee summer ...
...,...,...,...,...,...
19995,hit and television I change very our happy doo...,2024-12-04,BBC,Entertainment,Gary Miles:House party born.
19996,fear most meet rock even sea value design stan...,2024-05-26,Daily News,Entertainment,Maria Mcbride:Though nation people maybe price...
19997,activity loss very provide eye west create wha...,2023-04-17,BBC,Entertainment,Kristen Franklin:Yet exist with experience unit.
19998,term point general common training watch respo...,2024-06-30,Reuters,Health,David Wise:School wide itself item.


In [70]:
Y

Unnamed: 0,label
0,1
1,0
2,0
3,0
4,0
...,...
19995,0
19996,1
19997,1
19998,0


Stemming:

Stemming is the process of reducing a word to its root form.

Example:
acting, acted, acts --> act

In [71]:
port_stem = PorterStemmer()

In [72]:
# build the stopword set once; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # drop stopwords and stem the remaining words
    stemmed_words = [port_stem.stem(word) for word in words if word not in english_stopwords]
    return ' '.join(stemmed_words)

In [73]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [74]:
news_dataset['content']

Unnamed: 0,content
0,paula georg foreign democrat final
1,joseph hill offer resourc great point
2,julia robinson church carri
3,mr david foster dd unit
4,austin walker billion believ employe summer
...,...
19995,gari mile hous parti born
19996,maria mcbride though nation peopl mayb price box
19997,kristen franklin yet exist experi unit
19998,david wise school wide item


In [75]:
# separating the data and label
# note: the features come from the raw 'text' column here, not the stemmed 'content' column
X = news_dataset['text'].values
Y = news_dataset['label'].values

In [76]:
X

array(['more tax development both store agreement lawyer hear outside continue reach difference yeah figure your power fear identify there protect security great national nothing fast story why late nearly bit cost tough since question to power almost future young conference behind ahead building teach million box receive Mrs risk benefit month compare environment class imagine you vote community reason set once idea him answer many how purpose deep training game own true language garden of partner result face military discover discover data glass bed maintain test way development across top culture glass yes decision hope necessary as trade organization talk debate peace stay community development six wide write itself several fight teach billion for common fear we personal church establish store kind hundred debate hotel cut sister audience sound case that stay within information trouble be debate great themselves responsibility force people hundred bar miss others sometimes build ro

In [77]:
Y

array([1, 0, 0, ..., 1, 0, 0])

In [78]:
Y.shape

(20000,)

In [79]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)

In [80]:
X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4397197 stored elements and shape (20000, 969)>

In [81]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4397197 stored elements and shape (20000, 969)>
  Coords	Values
  (0, 2)	0.0603562207661069
  (0, 7)	0.05997771658524611
  (0, 8)	0.05976704775954439
  (0, 21)	0.06063179639427085
  (0, 26)	0.06020624531253104
  (0, 27)	0.0602490008918108
  (0, 31)	0.06004657985065417
  (0, 33)	0.06048011247571504
  (0, 45)	0.06043694788845418
  (0, 53)	0.060291831977671566
  (0, 60)	0.0603562207661069
  (0, 67)	0.06022761368049072
  (0, 79)	0.059584201796619676
  (0, 81)	0.06005188513254458
  (0, 86)	0.06058836124307728
  (0, 90)	0.12093862824932324
  (0, 92)	0.06038309996210357
  (0, 99)	0.06049631902230227
  (0, 100)	0.059851095826557274
  (0, 101)	0.060174228019346125
  (0, 108)	0.060078428963655714
  (0, 109)	0.06056125372257863
  (0, 115)	0.05963630298972909
  (0, 116)	0.06045852057928306
  (0, 132)	0.12149237031196183
  :	:
  (19999, 871)	0.061548800909775596
  (19999, 876)	0.06198700730854452
  (19999, 890)	0.061397233750172725
  (19

Splitting the dataset to training & test data

In [82]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
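The stratify argument used above keeps the class proportions the same in the training and test sets. A small sketch of that effect, using hypothetical toy labels with a 60/40 class balance (the variable names and data are invented for illustration):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# hypothetical toy data: 60 samples of class 0 and 40 of class 1
labels = [0] * 60 + [1] * 40
samples = list(range(len(labels)))

train_x, test_x, train_y, test_y = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=2
)

# count the classes in each split
train_counts = Counter(train_y)
test_counts = Counter(test_y)
print('train:', dict(train_counts))
print('test: ', dict(test_counts))
```

Both splits keep the 60/40 ratio of the full label list, which matters most when the classes are imbalanced.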

Training the Model: Logistic Regression

In [83]:
model = LogisticRegression()


In [84]:
model.fit(X_train, Y_train)

Evaluation

In [85]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [86]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.59975


In [87]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [88]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.523
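To put these scores in context, they can be compared with a majority-class baseline, i.e. the accuracy of always predicting the most frequent label. The sketch below uses the class counts and accuracy values printed earlier in this notebook; the variable names are hypothetical, and the baseline is computed on the full dataset rather than on the test split alone.

```python
# class counts copied from the value_counts() output earlier in the notebook
class_counts = {'fake': 10056, 'real': 9944}
total = sum(class_counts.values())

# accuracy obtained by always predicting the most frequent class
majority_baseline = max(class_counts.values()) / total
print(f'majority-class baseline: {majority_baseline:.4f}')

# accuracy scores copied from the evaluation cells above
reported = {'train': 0.59975, 'test': 0.523}
for split, accuracy in reported.items():
    gain = accuracy - majority_baseline
    print(f'{split}: accuracy={accuracy:.3f}, gain over baseline={gain:+.3f}')
```

The gap between training and test accuracy, together with the small gain over the baseline on the test data, suggests the model has learned little that generalizes.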


Making a Predictive System

In [89]:
# take one sample from the test set
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

# label encoding: 0 = fake, 1 = real
if prediction[0] == 0:
    print('The news is Fake')
else:
    print('The news is Real')

[1]
The news is Real


In [90]:
print(Y_test[3])

1
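The cell above predicts on a row that is already vectorized. To classify a new piece of raw text, the same vectorizer has to transform it before the model sees it. One way to keep the two steps together is a scikit-learn pipeline. This is a minimal sketch, not the notebook's method: the helper predict_news and the toy headlines and labels are hypothetical and invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical toy training data; 0 = fake, 1 = real as in the notebook
train_texts = [
    "aliens endorse secret budget bill",
    "miracle pill cures every disease overnight",
    "senate passes annual budget bill",
    "central bank keeps interest rate unchanged",
]
train_labels = [0, 0, 1, 1]

# the pipeline vectorizes the text and then applies the classifier
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

def predict_news(text):
    """Return a readable verdict for one raw news string."""
    label = pipeline.predict([text])[0]
    return 'The news is Real' if label == 1 else 'The news is Fake'

print(predict_news("senate debates new budget bill"))
```

With only four training examples the verdict itself is not meaningful; the point is that raw text goes in and the pipeline applies the fitted vectorizer and model in the right order.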
