# 1. Introduction
This notebook implements a **Fake News Classifier** with the **Scikit-learn** library. It uses a dataset from **Kaggle** and applies a **Multinomial Naive Bayes** classifier.

**Thanks to:**
* [Fake News Classification](https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification)
* [Fake News Classifier](https://www.kaggle.com/code/hassanamin/fake-news-classifier)
* [Loading Kaggle data directly into Google Colab](https://www.youtube.com/watch?v=yEXkEUqK52Q)
* [Importing Datasets from Kaggle to Google Colab](https://saturncloud.io/blog/importing-datasets-from-kaggle-to-google-colab/)
* [kaggle dataset download 403 Forbidden](https://stackoverflow.com/questions/75569191/kaggle-dataset-download-403-forbidden)

## 1.1 Definition
A **Fake News Classifier** is a machine learning model designed to distinguish between **real** and **fake** news articles. Using datasets from platforms like **Kaggle**, these classifiers are trained on labeled data to learn patterns and features that differentiate authentic news from fabricated ones.


# 2. Install required libraries

In [1]:
!pip install pandas numpy scikit-learn matplotlib seaborn



# 3. Import libraries

In [2]:
# for data manipulation
import pandas as pd

# for numerical operations
import numpy as np

# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for data preprocessing

# for splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# for converting text data into numerical features.
from sklearn.feature_extraction.text import TfidfVectorizer

# for the Naive Bayes classifier.
from sklearn.naive_bayes import MultinomialNB

# for evaluating the model.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 4. Load dataset

In [3]:
# Mount Google Drive so the notebook can read the Kaggle API key stored there
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Kaggle library to download datasets from Kaggle
!pip install kaggle



In [5]:
# Point the Kaggle CLI at the Drive directory that holds kaggle.json
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'

In [6]:
# Copy the Kaggle API key from Google Drive to ~/.kaggle
# (-p avoids an error if the directory already exists)
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle/kaggle.json ~/.kaggle/kaggle.json

In [7]:
# Restrict the API key file so that only its owner can read and write it
!chmod 600 ~/.kaggle/kaggle.json
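
The credential steps above can be sanity-checked before calling the Kaggle CLI. The following is a minimal sketch, not part of the original notebook: `check_kaggle_credentials` is a hypothetical helper, and the demonstration runs on a throwaway file with placeholder values rather than a real key. It assumes a POSIX file system (as in Colab) and that `kaggle.json` holds the `username` and `key` fields.

```python
import json
import os
import stat
import tempfile


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return [f"missing file: {path}"]
    with open(path) as fh:
        try:
            data = json.load(fh)
        except json.JSONDecodeError:
            return ["file is not valid JSON"]
    problems = []
    for field in ("username", "key"):
        if field not in data:
            problems.append(f"missing field: {field}")
    # Group and other permission bits should be cleared (mode 600).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems


# Demonstration on a temporary file with placeholder credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    with open(demo_path, "w") as fh:
        json.dump({"username": "demo-user", "key": "not-a-real-key"}, fh)
    os.chmod(demo_path, 0o644)
    before = check_kaggle_credentials(demo_path)
    os.chmod(demo_path, 0o600)
    after = check_kaggle_credentials(demo_path)

print(before)
print(after)
```

In the notebook itself, the same helper could be pointed at `~/.kaggle/kaggle.json` (expanded with `os.path.expanduser`) after the copy step.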

In [8]:
# download the dataset from Kaggle using the Kaggle API.
!kaggle datasets download -d saurabhshahane/fake-news-classification

Dataset URL: https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading fake-news-classification.zip to /content
 93% 86.0M/92.1M [00:01<00:00, 106MB/s]
100% 92.1M/92.1M [00:01<00:00, 86.2MB/s]


In [9]:
# pandas can read a zip archive directly when it contains a single CSV file
df = pd.read_csv('/content/fake-news-classification.zip')
df.head(10)

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 0 | LAW ENFORCEMENT ON HIGH ALERT Following Threat... | No comment is expected from Barack Obama Membe... | 1 |
| 1 | 1 | | Did they post their votes for Hillary already? | 1 |
| 2 | 2 | UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO... | Now, most of the demonstrators gathered last ... | 1 |
| 3 | 3 | Bobby Jindal, raised Hindu, uses story of Chri... | A dozen politically active pastors came here f... | 0 |
| 4 | 4 | SATAN 2: Russia unvelis an image of its terrif... | The RS-28 Sarmat missile, dubbed Satan 2, will... | 1 |
| 5 | 5 | About Time! Christian Group Sues Amazon and SP... | All we can say on this one is it s about time ... | 1 |
| 6 | 6 | DR BEN CARSON TARGETED BY THE IRS: “I never ha... | DR. BEN CARSON TELLS THE STORY OF WHAT HAPPENE... | 1 |
| 7 | 7 | HOUSE INTEL CHAIR On Trump-Russia Fake Story: ... | | 1 |
| 8 | 8 | Sports Bar Owner Bans NFL Games…Will Show Only... | The owner of the Ringling Bar, located south o... | 1 |
| 9 | 9 | Latest Pipeline Leak Underscores Dangers Of Da... | FILE – In this Sept. 15, 2005 file photo, the ... | 1 |


In [10]:
# df = pd.read_csv('/kaggle/input/fake-news-classification/WELFake_Dataset.csv')
# df.head(10)
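
Reading the downloaded `.zip` directly works because `pd.read_csv` infers the compression from the file extension, provided the archive contains exactly one data file. The sketch below illustrates this on a tiny archive built on the fly; the file names and rows are made up for the example and are not taken from the dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "toy.zip")
    # Build a zip archive that holds exactly one CSV file.
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy.csv", "title,text,label\nA,first body,1\nB,second body,0\n")
    # Compression is inferred from the ".zip" extension.
    toy_df = pd.read_csv(zip_path)

print(toy_df.shape)
print(list(toy_df.columns))
```

If an archive held several files, one option would be to extract it first (for example with `zipfile`) and pass the path of the CSV to `pd.read_csv`.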

# 5. Data Preprocessing

In [11]:
# Check for missing values in each column.
# print() is needed because a notebook cell only displays its last expression.
print(df.isnull().sum())

# Remove rows with missing values to ensure clean data.
df = df.dropna()

# Split the dataset into features and labels

# Extracts the text data (features/independent variable) from the DataFrame.
X = df['text']

# Extracts the labels (target/dependent variable) from the DataFrame.
y = df['label']
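
To make the effect of the cleaning step concrete, here is a small sketch on a made-up DataFrame (the rows are invented for illustration). It also shows, as an assumption about what one might prefer rather than what the notebook does, that `dropna(subset=["text"])` keeps rows whose only missing value is in a column the model never uses.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Headline A", None, "Headline C", "Headline D"],
        "text": ["Body A", "Body B", None, "Body D"],
        "label": [1, 1, 0, 0],
    }
)

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Drop rows with a missing value in any column (what the cell above does).
all_columns = toy.dropna()

# Alternative: drop only rows where the text itself is missing.
text_only = toy.dropna(subset=["text"])

print(missing_per_column)
print(len(all_columns), len(text_only))
```

On this toy data the first approach keeps two rows and the second keeps three, because one row lacks only a title.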

# 6. Train-Test Split
Splits the dataset into **training** and **testing** sets.
* **test_size=0.2:** Allocates 20% of the data for testing.
* **random_state=42:** Ensures reproducibility of the split.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
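
The two parameters can be illustrated on a toy list of ten documents (invented for this example). The sketch checks the 80/20 sizes and that repeating the call with the same `random_state` yields the same split. The `stratify` argument in the last call is not used in the notebook; it is shown only as an option for keeping class proportions equal in both sets.

```python
from sklearn.model_selection import train_test_split

texts = [f"document {i}" for i in range(10)]
labels = [0, 1] * 5

# Same arguments twice: the split is reproducible.
a_train, a_test, b_train, b_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
c_train, c_test, d_train, d_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Optional variant (an assumption, not in the notebook): stratified split.
s_train, s_test, t_train, t_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(a_train), len(a_test))
print(a_test == c_test)
print(sorted(t_test))
```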

# 7. Text Vectorization
* **TfidfVectorizer(stop_words='english', max_df=0.7):** Initializes the **TF-IDF** vectorizer.
* **stop_words='english':** Removes common English stop words.
* **max_df=0.7:** Ignores terms that appear in more than **70%** of the documents.
* **fit_transform(X_train):** Fits the vectorizer to the *training data* and transforms it into **TF-IDF** features.
* **transform(X_test):** Transforms the *test data* into **TF-IDF** features using the already fitted vectorizer.


In [13]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
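
The behaviour described in the list above can be seen on a handful of made-up sentences. In this sketch the word "news" appears in every training document, so `max_df=0.7` drops it, while "the" is removed as a stop word. The test sentence contains words that were never seen during fitting, and these are simply ignored by `transform`. All documents and variable names here are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "news senate passed the budget bill",
    "news shocking secret the media will not report",
    "news the budget vote was delayed",
    "news the shocking truth about the vote",
]
test_docs = ["unseen words about the budget"]

toy_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = toy_vectorizer.transform(test_docs)        # reuses that vocabulary

vocab = toy_vectorizer.vocabulary_  # dict: term -> column index

# For comparison: without max_df, the very frequent term is kept.
no_cutoff_vocab = TfidfVectorizer(stop_words="english").fit(train_docs).vocabulary_

print(sorted(vocab))
print(train_matrix.shape, test_matrix.shape)
```

Fitting only on the training data keeps document frequencies from the test set out of the features, which is why the test set goes through `transform` rather than `fit_transform`.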

# 8. Model Training
* **MultinomialNB():** Initializes the *Multinomial Naive Bayes classifier.*
* **fit(X_train_tfidf, y_train):** Trains the classifier using the **TF-IDF** *features* of the training data and their corresponding *labels*.


In [14]:
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
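
As a rough illustration of what the fitted classifier does, the sketch below trains `MultinomialNB` on four invented sentences and classifies a new one. The labels 0 and 1 are arbitrary in this toy example and say nothing about how the Kaggle dataset encodes real and fake articles. The `toy_*` names are hypothetical and separate from the notebook's variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_texts = [
    "senate passes budget bill after long debate",
    "court rules on election funding case",
    "shocking secret they do not want you to know",
    "unbelievable miracle cure doctors hate",
]
toy_labels = [0, 0, 1, 1]  # arbitrary class ids for the toy example

toy_vec = TfidfVectorizer(stop_words="english")
toy_model = MultinomialNB()
toy_model.fit(toy_vec.fit_transform(toy_texts), toy_labels)

# A new document must pass through the same fitted vectorizer.
new_features = toy_vec.transform(["shocking miracle cure secret"])
pred = toy_model.predict(new_features)
proba = toy_model.predict_proba(new_features)

print(pred)
print(proba.shape)
```

Every word in the new sentence occurs only in class-1 training texts, so the model assigns it to class 1. `predict_proba` returns one column per class, in the order given by `toy_model.classes_`.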