# Fake News Detection Project

This notebook implements a fake news detection system using machine learning. The goal is to classify news articles as real (1) or fake (0) based on their text. We’ll use a dataset of news articles, preprocess the text, train a Naïve Bayes model, and evaluate its performance.

### Steps:
1. Load, explore, and prepare the dataset.
2. Preprocess the text (cleaning, tokenization, lemmatization).
3. Convert text to numerical features using TF-IDF.
4. Split the data into training and test sets.
5. Train a Naïve Bayes classifier.
6. Evaluate the model.
7. Test it on a new headline.
8. Save the model for future use.

In [None]:
!pip install nltk scikit-learn pandas numpy




In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

## Step 1: Load, Explore, and Prepare the Dataset

We’ll load the dataset, which comes as two files: `True.csv` (real articles) and `Fake.csv` (fake articles). Each file has several columns; we’ll keep only `text`, add a `label` column, and drop the rest, leaving two columns:
- **text**: The body of the news article.
- **label**: 1 (real) or 0 (fake).

Upload the dataset to Colab and load it using pandas.
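
The labelling, combining, and shuffling done step by step in the cells that follow can also be written compactly. The sketch below is only an illustration: `true_df` and `fake_df` are small hypothetical stand-ins for the contents of `True.csv` and `Fake.csv`, and the fixed `random_state` is an assumption added so the shuffle is reproducible.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (two rows each).
true_df = pd.DataFrame({"title": ["t1", "t2"],
                        "text": ["real story one", "real story two"]})
fake_df = pd.DataFrame({"title": ["f1", "f2"],
                        "text": ["fake story one", "fake story two"]})

# Attach the label while combining: 1 = real, 0 = fake.
combined = pd.concat(
    [true_df.assign(label=1), fake_df.assign(label=0)],
    ignore_index=True,
)

# Keep only the columns the model needs.
combined = combined[["text", "label"]]

# Shuffle reproducibly and renumber the rows in one step.
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)

print(combined.shape)
```

Using `reset_index(drop=True)` avoids creating the temporary `index` column that would otherwise have to be dropped afterwards.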

In [None]:
true = pd.read_csv('True.csv')

In [None]:
fake = pd.read_csv('Fake.csv')

In [None]:
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [None]:
true['label']=1


In [None]:
true.head()

Unnamed: 0,title,text,subject,date,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [None]:
fake['label']=0

In [None]:
fake.head()

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [None]:
news= pd.concat([fake, true],axis=0)

In [None]:
news.isnull().sum()

Unnamed: 0,0
title,0
text,0
subject,0
date,0
label,0


In [None]:
news=news.drop(['title','subject','date'],axis=1)

In [None]:
news.head()

Unnamed: 0,text,label
0,Donald Trump just couldn t wish all Americans ...,0
1,House Intelligence Committee Chairman Devin Nu...,0
2,"On Friday, it was revealed that former Milwauk...",0
3,"On Christmas day, Donald Trump announced that ...",0
4,Pope Francis used his annual Christmas Day mes...,0


In [None]:
news = news.sample(frac=1)  # shuffle the rows (no seed is set, so the order differs between runs)

In [None]:
news.head()

Unnamed: 0,text,label
17506,LONDON (Reuters) - The World Health Organizati...,1
20222,SEOUL (Reuters) - South Korea said on Wednesda...,1
14947,In your face progressivism They re not even at...,0
788,WASHINGTON (Reuters) - U.S. Attorney General J...,1
23286,Episode #149 of SUNDAY WIRE SHOW resumes this ...,0


In [None]:
news.reset_index(inplace=True)

In [None]:
news.head()

Unnamed: 0,index,text,label
0,17506,LONDON (Reuters) - The World Health Organizati...,1
1,20222,SEOUL (Reuters) - South Korea said on Wednesda...,1
2,14947,In your face progressivism They re not even at...,0
3,788,WASHINGTON (Reuters) - U.S. Attorney General J...,1
4,23286,Episode #149 of SUNDAY WIRE SHOW resumes this ...,0


In [None]:
news.drop(['index'],axis=1,inplace=True)

In [None]:
news.head()

Unnamed: 0,text,label
0,LONDON (Reuters) - The World Health Organizati...,1
1,SEOUL (Reuters) - South Korea said on Wednesda...,1
2,In your face progressivism They re not even at...,0
3,WASHINGTON (Reuters) - U.S. Attorney General J...,1
4,Episode #149 of SUNDAY WIRE SHOW resumes this ...,0


In [None]:
print(news.shape)

(44898, 2)


## Step 2: Preprocess the Text

Text data needs to be cleaned before feeding it to a model. We’ll:
- Convert text to lowercase.
- Remove punctuation.
- Tokenize (split into words).
- Remove stopwords (e.g., "the", "is").
- Lemmatize (e.g., "running" → "run").

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import string

# Build these once instead of on every call to clean_text
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Split into words (tokenize)
    words = nltk.word_tokenize(text)
    # Remove common words (stopwords) like "the", "is"
    words = [word for word in words if word not in STOP_WORDS]
    # Reduce words to their base form (lemmatize)
    words = [LEMMATIZER.lemmatize(word) for word in words]
    # Join words back into a single string
    return ' '.join(words)

In [None]:
news['cleaned_text'] = news['text'].apply(clean_text)
print(news.head())  # Check the result

                                                text  label  \
0  LONDON (Reuters) - The World Health Organizati...      1   
1  SEOUL (Reuters) - South Korea said on Wednesda...      1   
2  In your face progressivism They re not even at...      0   
3  WASHINGTON (Reuters) - U.S. Attorney General J...      1   
4  Episode #149 of SUNDAY WIRE SHOW resumes this ...      0   

                                        cleaned_text  
0  london reuters world health organization said ...  
1  seoul reuters south korea said wednesday trace...  
2  face progressivism even attempting hide anymor...  
3  washington reuters u attorney general jeff ses...  
4  episode 149 sunday wire show resume sunday aug...  


## Step 3: Convert Text to Numerical Features

Machine learning models need numbers, not text. We’ll use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the cleaned text into a numerical matrix.
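
To make the weighting concrete, here is a minimal standard-library sketch of TF-IDF. It follows the smoothed IDF and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, but it uses plain whitespace tokenisation, so it is an approximation for illustration rather than a replacement for the vectorizer. The function name `tfidf` and the three toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalise the row so long and short documents are comparable.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

docs = ["trump said senate", "trump said aliens", "senate vote"]
weights = tfidf(docs)

# "aliens" occurs in one document, "trump" in two, so "aliens" is weighted higher.
print(weights[1]["aliens"] > weights[1]["trump"])
```

Terms that appear in few documents get a larger IDF, which is why rare, distinctive words end up with more weight than words shared across the corpus.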

In [None]:
vectorizer = TfidfVectorizer(max_features=5000)  # Keep only the 5000 most frequent terms in the corpus
X = vectorizer.fit_transform(news['cleaned_text'])
y = news['label']

## Step 4: Split the Data

We’ll split the data into training (80%) and testing (20%) sets to train and evaluate the model.
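
One thing to be aware of: fitting TF-IDF on the full dataset before splitting lets the test articles influence the vocabulary and IDF weights. A possible variant, sketched below under assumptions, splits the raw text first and wraps the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fitted on the training portion only. The toy `texts` and `labels` are hypothetical stand-ins for `news['cleaned_text']` and `news['label']`, and `stratify` is an optional extra that keeps the class balance equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = real, 0 = fake.
texts = [
    "reuters senate vote budget", "reuters court ruling appeal",
    "reuters minister trade talks", "reuters election result count",
    "shocking secret exposed video", "unbelievable hoax exposed watch",
    "shocking video goes viral", "secret plot revealed watch",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping class proportions the same in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier.
pipeline = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
pipeline.fit(texts_train, y_train)

# At prediction time the pipeline applies the already-fitted TF-IDF for us.
predictions = pipeline.predict(texts_test)
print(len(texts_train), len(texts_test))
```

With this layout a single object carries both the vectorizer and the model, which also simplifies saving and reloading.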

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Step 5: Train a Naïve Bayes Model

We’ll use a Multinomial Naïve Bayes classifier, which is effective for text classification tasks.
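
For reference, the decision rule behind Multinomial Naïve Bayes can be written as follows. The notation is introduced here for illustration: $x_t$ is the TF-IDF weight of term $t$ in the article, $N_{c,t}$ is the total weight of term $t$ over the training articles of class $c$, $N_c = \sum_t N_{c,t}$, $n$ is the number of features (at most 5000 in this notebook), and $\alpha$ is the smoothing parameter (`alpha`, which defaults to 1.0 in `MultinomialNB`).

$$
\hat{y} = \arg\max_{c \in \{0, 1\}} \left[ \log P(c) + \sum_{t} x_t \log \theta_{c,t} \right],
\qquad
\theta_{c,t} = \frac{N_{c,t} + \alpha}{N_c + \alpha\, n}
$$

The smoothing term keeps $\theta_{c,t}$ above zero, so a term never seen in one class does not force that class's score to $-\infty$.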

In [None]:
model = MultinomialNB()
model.fit(X_train, y_train)

## Step 6: Evaluate the Model

We’ll evaluate the model on the test set using accuracy and a classification report (precision, recall, F1-score).
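
As a reminder of what these metrics measure, the sketch below computes them by hand for one class from a pair of label lists. The helper name `binary_report` and the six toy labels are hypothetical; in the notebook itself `accuracy_score` and `classification_report` do this work.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
report = binary_report(y_true, y_pred)
print(report)
```

Passing `positive=0` gives the same metrics from the point of view of the fake class, which corresponds to the row labelled `0` in the classification report.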

In [None]:
y_pred = model.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.933630289532294
              precision    recall  f1-score   support

           0       0.93      0.94      0.94      4617
           1       0.94      0.92      0.93      4363

    accuracy                           0.93      8980
   macro avg       0.93      0.93      0.93      8980
weighted avg       0.93      0.93      0.93      8980



## Step 7: Test on a New Headline

Let’s test the model on a sample headline to see if it predicts correctly.
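
A single label hides how confident the model is. The sketch below shows one way to report class probabilities instead, using `predict_proba` and `classes_` from `MultinomialNB`. The helper `describe_prediction`, the four-document toy corpus, and the use of `str.lower` in place of `clean_text` are all assumptions made to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def describe_prediction(text, model, vectorizer, clean=str.lower):
    """Return {'Fake': p, 'Real': p} for one piece of text."""
    features = vectorizer.transform([clean(text)])
    probabilities = model.predict_proba(features)[0]
    names = {0: "Fake", 1: "Real"}  # same label convention as the notebook
    return {names[int(c)]: float(p)
            for c, p in zip(model.classes_, probabilities)}

# Hypothetical toy training data: 1 = real, 0 = fake.
toy_texts = ["reuters senate vote", "reuters court ruling",
             "aliens secret hoax", "shocking aliens video"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

scores = describe_prediction("Breaking: Aliens land in London!",
                             toy_model, toy_vectorizer)
print(scores)
```

If none of the words in the input are in the vectorizer's vocabulary, the feature vector is all zeros and the prediction falls back on the class priors, so short headlines deserve extra caution with a model trained on full article text.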

In [None]:
new_headline = "Breaking: Aliens land in London!"
cleaned_headline = clean_text(new_headline)
vectorized_headline = vectorizer.transform([cleaned_headline])
prediction = model.predict(vectorized_headline)
print("Prediction:", "Fake" if prediction[0] == 0 else "Real")

Prediction: Fake


## Step 8: Save the Model

We’ll save the model and vectorizer for future use.
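
For completeness, here is a sketch of the round trip: saving with `joblib.dump` and restoring with `joblib.load`. It uses a hypothetical toy model and a temporary directory so it can run on its own; the file names match the ones used in this notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy model standing in for the trained notebook model.
toy_texts = ["reuters senate vote", "reuters court ruling",
             "aliens secret hoax", "shocking aliens video"]
toy_labels = [1, 1, 0, 0]
toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "fake_news_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")

    # Save both objects: the model is useless without the matching vectorizer.
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Later (for example in another session): load them back.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

sample = ["aliens hoax"]
original_prediction = toy_model.predict(toy_vectorizer.transform(sample))[0]
reloaded_prediction = loaded_model.predict(loaded_vectorizer.transform(sample))[0]
print(original_prediction == reloaded_prediction)
```

When reusing the saved files, new text should go through the same `clean_text` preprocessing before `transform`. Pickle-based files should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.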

In [None]:
import joblib

joblib.dump(model, 'fake_news_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

['tfidf_vectorizer.pkl']

## Conclusion

We’ve built a fake news detection system using a Naïve Bayes classifier. The pipeline preprocesses article text, converts it to TF-IDF features, and predicts whether a news article is real or fake, reaching about 93% accuracy on the held-out test set. See the report for a summary of the approach, challenges, and performance.