# Fake News Detector NLP Project
#### Project Overview
This project classifies news articles as "Fake" or "Real" using Natural Language Processing (NLP) techniques and a machine learning model. A Random Forest Classifier is trained on a preprocessed dataset, and predictions are served through a Flask application.

#### Key Features
- Preprocessing raw text by cleaning, tokenizing, and removing stopwords.
- Vectorizing the text with TF-IDF to convert it into numerical features.
- Training a Random Forest Classifier on the resulting features.
- Serving predictions through a Flask-based API for user interaction.

This project addresses the growing challenge of misinformation by providing a tool for detecting fake news.


In [4]:
import pandas as pd

# Load the dataset
fake_news = pd.read_csv("./dataset/Fake.csv")  # Path to the Fake News CSV
real_news = pd.read_csv("./dataset/True.csv")  # Path to the Real News CSV

# Add labels: 0 for Fake News, 1 for Real News
fake_news['label'] = 0
real_news['label'] = 1

# Combine the datasets
data = pd.concat([fake_news, real_news], axis=0).reset_index(drop=True)

# Shuffle the rows so fake and real articles are interleaved (fixed seed for reproducibility)
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Check the data structure
print(data.head())


                                               title  \
0  Former Lebanese PM says Hariri will return to ...   
1    Hillary Supporters Explained In 6 BRUTAL Photos   
2  Eight bodies found in boat washed up on Japan ...   
3   The Gun Shop Owner That Banned Muslims Is Now...   
4  House Speaker Ryan to speak with Trump Wednesd...   

                                                text       subject  \
0  BEIRUT (Reuters) - Former Lebanese prime minis...     worldnews   
1  Hysterical With all the evidence available to ...      politics   
2  TOKYO (Reuters) - Eight bodies, which had been...     worldnews   
3  One of Florida s most infamous gun shop owners...          News   
4  WASHINGTON (Reuters) - U.S. House Speaker Paul...  politicsNews   

                 date  label  
0   November 7, 2017       1  
1         Mar 3, 2016      0  
2  November 27, 2017       1  
3        May 31, 2016      0  
4       May 25, 2016       1  
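Before cleaning the text, it can help to check the combined frame for missing values and repeated articles. The sketch below uses two tiny stand-in frames instead of `Fake.csv` and `True.csv`. Whether the real files contain empty or duplicated rows is not shown in this notebook, so treat these checks as optional.

```python
import pandas as pd

# Tiny stand-in frames (hypothetical rows, not taken from the real CSVs)
fake_news = pd.DataFrame({"title": ["a", "b", "b"], "text": ["x", "y", "y"]})
real_news = pd.DataFrame({"title": ["c", "d"], "text": ["z", None]})

fake_news["label"] = 0
real_news["label"] = 1
data = pd.concat([fake_news, real_news], axis=0).reset_index(drop=True)

# Count missing article bodies, then drop those rows
missing_text = int(data["text"].isna().sum())
data = data.dropna(subset=["text"])

# Drop exact repeats of the same title/text pair
before = len(data)
data = data.drop_duplicates(subset=["title", "text"]).reset_index(drop=True)
removed_duplicates = before - len(data)

# A fixed seed makes the shuffle reproducible between runs
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

print("missing:", missing_text,
      "duplicates removed:", removed_duplicates,
      "rows left:", len(data))
```

Dropping missing rows before `preprocess_text` also avoids passing non-string values to `re.sub`.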


In [43]:
import nltk
from nltk.corpus import stopwords
import re

# Download NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Text preprocessing function
def preprocess_text(text):
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Split on whitespace, drop stopwords, and rejoin with single spaces
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the dataset
data['cleaned_text'] = data['text'].apply(preprocess_text)

# Check the cleaned data
print(data[['text', 'cleaned_text']].head())


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                                text  \
0  MADRID (Reuters) - Catalan secessionist partie...   
1  NAIROBI (Reuters) - Kenya s opposition leader ...   
2  The Daily Show, be it under the hosting of Jon...   
3  LONDON (Reuters) - U.S. Secretary of State Rex...   
4  ANKARA/LONDON/WASHINGTON (Reuters) - Iran has ...   

                                        cleaned_text  
0  madrid reuters catalan secessionist parties tu...  
1  nairobi reuters kenya opposition leader raila ...  
2  daily show hosting jon stewart trevor noah nev...  
3  london reuters u secretary state rex tillerson...  
4  ankara london washington reuters iran suggeste...  
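The cleaned output above shows `U.S.` collapsing to `u`. A short trace explains why: the regex turns both periods into spaces, and the leftover `s` is then dropped as a stopword. The sketch below reproduces that behavior with a small hard-coded stopword set standing in for NLTK's English list (an assumption, so the snippet runs without a download). It also adds a guard for non-string input, which is a hypothetical extension and not part of the original function.

```python
import re

# Stand-in for set(stopwords.words('english')); only a few entries, for illustration
STOP_WORDS = {"of", "the", "s", "is", "a", "to", "in"}

def preprocess_text(text, stop_words=STOP_WORDS):
    # Hypothetical guard: treat missing or non-string values as empty text
    if not isinstance(text, str):
        return ""
    text = re.sub(r"[^a-zA-Z]", " ", text)  # non-letters become spaces
    text = text.lower()
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "LONDON (Reuters) - U.S. Secretary of State Rex Tillerson"
print(preprocess_text(sample))
# -> london reuters u secretary state rex tillerson
print(repr(preprocess_text(float("nan"))))
# -> ''
```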


In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=2000)  # Keep the 2000 most frequent terms in the corpus

# Fit on the cleaned text and transform it into a dense feature matrix
X = vectorizer.fit_transform(data['cleaned_text']).toarray()
y = data['label']

print("Shape of feature matrix:", X.shape)


Shape of feature matrix: (44898, 2000)
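To make the feature matrix less opaque, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as documented by scikit-learn: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy corpus and the helper name `tfidf_rows` are illustrative assumptions; the notebook itself relies on the library implementation. The library's default token pattern also skips single-character tokens, which this sketch does not replicate.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy corpus (hypothetical, already cleaned)
docs = ["trump speech reuters", "trump photos brutal", "reuters tokyo boat"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(v, 3) for v in rows[0]])
```

Terms that occur in fewer documents receive a larger weight, which is why `speech` outweighs `trump` in the first row.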


In [46]:
from collections import Counter
print("Label distribution:", Counter(y))

Label distribution: Counter({0: 23481, 1: 21417})


In [47]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

# Balance the dataset using oversampling
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)

# Check the new distribution
print("Balanced label distribution:", Counter(y_balanced))




Balanced label distribution: Counter({1: 23481, 0: 23481})
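The original counts differ by roughly 10% (23481 vs. 21417), so the imbalance is mild. One caveat worth keeping in mind: random oversampling duplicates minority rows, and when it runs before the train/test split, copies of the same article can land on both sides. The pure-Python sketch below illustrates the alternative order, splitting first and oversampling only the training part. The helper `oversample` and the toy row ids are hypothetical stand-ins for `RandomOverSampler` and the real feature rows.

```python
import random
from collections import Counter

def oversample(rows, rng):
    """Duplicate randomly chosen minority rows until both labels are equally frequent."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced

rng = random.Random(42)
# Toy rows: (row_id, label); 12 fake (0) and 8 real (1)
rows = [(i, 0) for i in range(12)] + [(i, 1) for i in range(12, 20)]
rng.shuffle(rows)

# Split first, so no article can appear on both sides ...
cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]
# ... then balance only the training portion
train_balanced = oversample(train, rng)

shared_ids = {r[0] for r in train_balanced} & {r[0] for r in test}
print("train label counts:", Counter(label for _, label in train_balanced))
print("ids shared between train and test:", len(shared_ids))
```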


In [48]:
from sklearn.model_selection import train_test_split

# Split the balanced data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.2, random_state=42)

print("Training data size:", X_train.shape)
print("Testing data size:", X_test.shape)


Training data size: (37569, 2000)
Testing data size: (9393, 2000)


In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

model_rf = RandomForestClassifier(class_weight='balanced', random_state=42)
model_rf.fit(X_train, y_train)

# Predictions
y_pred_rf = model_rf.predict(X_test)

# Evaluate
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))


Random Forest Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      4626
           1       1.00      1.00      1.00      4767

    accuracy                           1.00      9393
   macro avg       1.00      1.00      1.00      9393
weighted avg       1.00      1.00      1.00      9393
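A perfect score on every metric is unusual for a text classifier and is worth double-checking. Two things visible in this notebook could contribute: oversampling happened before the split, and the real-news samples shown earlier open with a dateline such as `WASHINGTON (Reuters) -`, which survives cleaning as the token `reuters`. Whether the model actually leans on that token is not established here. The sketch below only shows one way to strip such a dateline before preprocessing so the effect can be measured. The helper name `strip_dateline` and the 120-character limit are assumptions.

```python
import re

# Matches a leading "<PLACE> (Reuters) - " prefix near the start of an article
DATELINE = re.compile(r"^.{0,120}?\(Reuters\)\s*-\s*")

def strip_dateline(text):
    """Remove a leading Reuters dateline if present; otherwise return the text unchanged."""
    return DATELINE.sub("", text, count=1)

print(strip_dateline("TOKYO (Reuters) - Eight bodies"))
# -> Eight bodies
print(strip_dateline("ANKARA/LONDON/WASHINGTON (Reuters) - Iran has"))
# -> Iran has
print(strip_dateline("One of Florida s most infamous gun shop owners"))
# -> One of Florida s most infamous gun shop owners
```

Retraining on text cleaned this way, with the split done before any oversampling, would give a more conservative estimate to compare against the report above.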



In [50]:
import joblib

# Save the trained model
joblib.dump(model_rf, 'fake_news_detector_rf_model.pkl')

# Save the vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')



['tfidf_vectorizer.pkl']
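Saving the model and the vectorizer as two files works, but the two must always be loaded together and kept in sync. One alternative, sketched here as an assumption and not what this notebook does, is to wrap both steps in a scikit-learn pipeline and persist a single artifact. The toy texts, labels, and the file name `pipeline_demo.pkl` are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical); 0 = fake, 1 = real, as in the notebook
texts = ["shocking brutal photos", "reuters officials said",
         "you will not believe", "reuters minister said"]
labels = [0, 1, 0, 1]

pipeline = make_pipeline(
    TfidfVectorizer(max_features=2000),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
pipeline.fit(texts, labels)

# Persist vectorizer and model together, then load them back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_demo.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline takes cleaned text directly, with no separate transform step
original_pred = pipeline.predict(texts).tolist()
restored_pred = restored.predict(texts).tolist()
print(original_pred == restored_pred)
```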

In [51]:
#from sklearn.model_selection import cross_val_score

#cv_scores = cross_val_score(model_rf, X_train, y_train, cv=5, scoring='accuracy')
#print("Cross-Validation Scores:", cv_scores)
#print("Average CV Score:", cv_scores.mean())


In [52]:
#import pickle

# Save the model
#with open('model.pkl', 'wb') as model_file:
#    pickle.dump(model_rf, model_file)

# Save the vectorizer
#with open('vectorizer.pkl', 'wb') as vectorizer_file:
#    pickle.dump(vectorizer, vectorizer_file)


In [3]:
from flask import Flask, request, jsonify, render_template
import joblib
import re
from nltk.corpus import stopwords
import nltk
import logging
#from flask_ngrok import run_with_ngrok  # If testing on Colab or external access

# Initialize Flask app
app = Flask(__name__)
app.config['TEMPLATES_AUTO_RELOAD'] = True
#run_with_ngrok(app)  # Only if using Google Colab/Ngrok for external access

# Configure logging
logging.basicConfig(level=logging.INFO)

# Load model and vectorizer (joblib.load accepts a file path directly)
model = joblib.load('fake_news_detector_rf_model.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # silent=True returns None instead of raising when the body is not valid JSON
        payload = request.get_json(silent=True) or {}
        input_texts = payload.get('texts') if isinstance(payload, dict) else None
        if (not isinstance(input_texts, list) or not input_texts
                or not all(isinstance(t, str) for t in input_texts)):
            return jsonify({'error': 'Input must be a non-empty list of strings.'}), 400

        # Clean, vectorize, and classify the whole batch in one call
        cleaned_texts = [preprocess_text(t) for t in input_texts]
        predictions = model.predict(vectorizer.transform(cleaned_texts).toarray())

        results = [
            {'input_text': text, 'prediction': "Fake News" if pred == 0 else "Real News"}
            for text, pred in zip(input_texts, predictions)
        ]
        return jsonify({'results': results})

    except Exception as e:
        logging.exception("Prediction failed")
        return jsonify({'error': str(e)}), 500

# Run the Flask app
app.run()
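
The `/predict` route expects a JSON body with a `texts` list and returns one result per text. A minimal client sketch using only the standard library follows. It assumes the server is reachable at Flask's default development address (`http://127.0.0.1:5000`), and the helper names `build_predict_request` and `fetch_predictions` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # assumed: Flask's default development address

def build_predict_request(texts, base_url=BASE_URL):
    """Build a POST request carrying {"texts": [...]} as JSON."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_predictions(texts, base_url=BASE_URL):
    """Send the request and return the list stored under the 'results' key."""
    with urllib.request.urlopen(build_predict_request(texts, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))["results"]

# Building the request does not contact the server
req = build_predict_request(["Some headline to classify"])
print(req.get_method(), req.full_url)
```

With the app running, `fetch_predictions(["Some headline to classify"])` should return entries containing the `input_text` and `prediction` keys produced by the route.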


 * Serving Flask app '__main__'
 * Debug mode: off


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:51] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:51] "GET /static/style.css HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:51] "GET /static/script.js HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:51] "GET /get_accuracy HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:52] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:52] "GET /static/style.css HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:52] "GET /static/script.js HTTP/1.1" 304 -
INFO:werkzeug:127.0.0.1 - - [22/Dec/2024 16:36:52] "GET /get_accuracy HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - -