# Fake News Detection

## Importing Libraries

The necessary libraries are already imported for you, but feel free to add more if you see fit!

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import re
import string

## Importing Dataset

Our data comes from Kaggle (https://www.kaggle.com/code/therealsampat/fake-news-detection/input) and consists of two datasets: one that contains real news and one that contains fake news.

In this section, we shall import the datasets into this notebook and examine what each contains.

In [5]:
df_fake = pd.read_csv("./Fake.csv")
df_true = pd.read_csv("./True.csv")

In [6]:
# Examining the first 5 rows of the fake news dataset
df_fake.head()

,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [None]:
# Examining the first 5 rows of the true news dataset
df_true.head()

,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


## Inserting a column "class" as target feature

In machine learning, we usually try to predict a **class** or **label**.

Here, the classification is either **fake** or **real**, and we denote the two with the integers **0** and **1** respectively.

In [None]:
df_fake["class"] = 0
df_true["class"] = 1

In [None]:
df_fake.shape, df_true.shape

((23481, 5), (21417, 5))
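
The two shapes above also tell us how balanced the classes are, which matters when reading accuracy figures later. A minimal sketch, using only the row counts reported above (the variable names are illustrative):

```python
# Row counts taken from the df_fake.shape and df_true.shape output above.
n_fake, n_true = 23481, 21417

total = n_fake + n_true
fake_share = n_fake / total  # fraction of rows labelled 0
true_share = n_true / total  # fraction of rows labelled 1

print(f"total rows: {total}")
print(f"fake (0): {fake_share:.1%}, real (1): {true_share:.1%}")
```

With shares this close to one half each, a classifier that always guesses one class would score only slightly above 50% accuracy, so accuracy is a reasonable first metric here.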

## Merging True and Fake Dataframes

Now that we've added a `class` column to our datasets (which we call our `target` column), we can merge the two datasets, shuffle the rows, and define and train our ML models to differentiate between real and fake news.

In [None]:
# Merging the two dataframes
df_merge = pd.concat([df_fake, df_true], axis=0)
df_merge.head(10)

,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017",0
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017",0
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017",0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017",0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017",0


In [None]:
# Checking the columns of the new dataframe
df_merge.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

## Removing columns which are not required

We will now remove any data we don't need from our dataframe. Since we want our ML model to identify fake news from text, we will drop the `title`, `subject`, and `date` columns.

In [None]:
df = df_merge.drop(["title", "subject","date"], axis = 1)

In [None]:
df.isnull().sum()

text     0
class    0
dtype: int64

## Random Shuffling the dataframe

We randomly shuffle the rows so that the two classes are interleaved rather than stacked one after the other. After the merge, every fake article comes before every real one, and a model that sees a long run of one class before any example of the other may perform worse.

Here, the classes are `0` and `1`, the values in the `class` column we added earlier to distinguish fake news from real news.

In [None]:
# Shuffling the rows of the dataframe
df = df.sample(frac=1)

In [None]:
df.head()

,text,class
7974,DUBAI (Reuters) - The Saudi riyal fell against...,1
11368,(This version of the Dec. 25 story was refile...,1
3509,"If you haven t heard, Donald Trump has joined ...",0
9911,HAVANA (Reuters) - Cruise company Carnival Cor...,1
16528,MOSCOW (Reuters) - Russia has freed two promin...,1


In [None]:
df = df.reset_index(drop=True) # Reset the index and discard the old one in a single step

In [None]:
df.columns

Index(['text', 'class'], dtype='object')

In [None]:
df.head()

,text,class
0,DUBAI (Reuters) - The Saudi riyal fell against...,1
1,(This version of the Dec. 25 story was refile...,1
2,"If you haven t heard, Donald Trump has joined ...",0
3,HAVANA (Reuters) - Cruise company Carnival Cor...,1
4,MOSCOW (Reuters) - Russia has freed two promin...,1
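
The shuffle-and-reset steps can be sketched on a toy dataframe. The `random_state=42` argument is an assumption added here so the order is repeatable; the notebook's own call does not set it, so its row order changes on every run.

```python
import pandas as pd

# Toy stand-in for the merged dataframe: all class-0 rows first, then all class-1 rows.
toy = pd.DataFrame({"text": ["a", "b", "c", "d"], "class": [0, 0, 1, 1]})

# frac=1 draws every row exactly once, i.e. a full shuffle.
# random_state is a hypothetical addition that fixes the shuffle order.
# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```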


## Creating a function to process the texts

The `wordopt` function helps clean up text data before using it for natural language processing tasks. It performs the following operations:

- **Lowercase Conversion**: Ensures all text is in lowercase, providing uniformity.

- **Bracketed Text Removal**: Deletes content within square brackets, commonly used for citations or notes.

- **Non-Word Characters Removal**: Replaces every character that is not a letter, digit, or underscore with a space, reducing noise in the data.

- **URL Removal**: Removes links to prevent any web URLs from interfering with text analysis.

- **HTML Tag Removal**: Cleans out HTML tags often found in web-scraped text.

- **Punctuation Removal**: Strips out all punctuation to simplify word frequency analysis.

- **Newline Removal**: Joins all text into one line by removing newline characters.

- **Digit-Containing Word Removal**: Deletes words with numbers, like dates or product IDs, which might not contribute to text meaning.

This preprocessing function prepares the text for machine learning algorithms by removing potentially distracting elements.

In [None]:
def wordopt(text):
    text = text.lower() # Convert all text to lowercase for consistency.
    text = re.sub(r'\[.*?\]', '', text) # Remove text within square brackets.
    text = re.sub(r'\W', ' ', text) # Replace every non-word character (punctuation, whitespace) with a space.
    # The patterns below run after the \W step, so most of the characters they look for are already gone.
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # Remove URLs.
    text = re.sub(r'<.*?>+', '', text) # Remove HTML tags.
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # Remove remaining punctuation (e.g. underscores).
    text = re.sub(r'\n', '', text) # Remove newline characters.
    text = re.sub(r'\w*\d\w*', '', text) # Remove words containing digits.
    return text

In [None]:
# Applying the wordopt function to the text column
df["text"] = df["text"].apply(wordopt)
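
In `wordopt`, the `\W` substitution runs before the URL pattern, so by the time the URL pattern is tried, the `://` and `.` it relies on have already been turned into spaces. The sketch below compares the two orderings on a made-up sentence (the sample text and variable names are illustrative, and only two of the cleaning steps are shown):

```python
import re

sample = "Read more at https://example.com/story NOW!".lower()

URL_PATTERN = r"https?://\S+|www\.\S+"

# Order used in wordopt: non-word characters are replaced first,
# so the URL pattern no longer finds "://" and matches nothing.
w_first = re.sub(URL_PATTERN, "", re.sub(r"\W", " ", sample))

# Alternative order: strip the URL while it is still intact,
# then replace the remaining non-word characters.
url_first = re.sub(r"\W", " ", re.sub(URL_PATTERN, "", sample))

print(w_first.split())
print(url_first.split())
```

In the first ordering the URL fragments (`https`, `example`, `com`, `story`) survive as ordinary words; in the second they are removed. Whether that matters for the classifier is an empirical question, but it is worth knowing which behavior the cleaning function actually has.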

## Defining dependent and independent variables

In this step, we define our **independent** and **dependent variables** for the ML model.

- **Independent Variable (`x`)**: This is the data we use to make predictions. Here, `x` is set to `df["text"]`, meaning it contains the text content of each news article. This variable acts as the input that our model will analyze to find patterns.
  
- **Dependent Variable (`y`)**: This is the **target** variable or the label that we aim to predict. In this case, `y` is set to `df["class"]`, where each entry represents whether the text is classified as real or fake news. This variable serves as the outcome that the model learns to predict based on the patterns in `x`.

By assigning these variables, we’re setting up the data so the model can learn the relationship between the text content (`x`) and its classification (`y`).

In [None]:
x = df["text"]
y = df["class"]

## Splitting Training and Testing

Splitting data into **training** and **testing sets** is essential in machine learning to evaluate how well a model generalizes to new, unseen data. Here’s why this split is important:

- **Training Set**: This portion of the data is used to train the model, allowing it to learn patterns, relationships, and structures within the data. The model uses this set to adjust its internal parameters and optimize its performance on this specific dataset.

- **Testing Set**: After training, the model is evaluated on the testing set, which it hasn't seen before. This step is crucial because it helps us assess the model's accuracy and reliability on new data. By testing on a separate set, we can get an idea of how well the model would perform on real-world data and detect issues like **overfitting**, where the model performs well on training data but poorly on new data.

In short, splitting the data allows us to create a more robust model that performs well not just on known data, but also on previously unseen data, making it more reliable and effective for practical applications.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
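
The call above produces a different split on every run. A possible variant, shown on toy data, fixes the seed and keeps the class ratio equal in both parts. The `random_state=0` and `stratify` arguments are assumptions added for illustration and are not part of the notebook's call.

```python
from sklearn.model_selection import train_test_split

# Toy data: 20 documents, 10 per class.
texts = [f"doc {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# test_size=0.25 holds out a quarter of the rows (5 of 20 here).
# random_state makes the split repeatable; stratify keeps the 0/1 ratio similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

print(len(x_tr), len(x_te))
print("class-1 rows in test set:", sum(y_te))
```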

## Convert text to vectors

The code below converts text data into **numerical vectors** using a technique called **TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization**. Machine learning models require numerical input, so this process transforms the text into a form that the model can understand.

Here’s what each part does:

1. **TfidfVectorizer**: This is a tool from `scikit-learn` that transforms text data into a matrix of TF-IDF features. TF-IDF is a common text representation technique that reflects how important each word is within a document relative to the entire dataset.

   - **Term Frequency (TF)**: Measures how frequently a word appears in a document.
   - **Inverse Document Frequency (IDF)**: Reduces the weight of words that are common across many documents, emphasizing words unique to a specific document.

   By combining TF and IDF, TF-IDF gives higher weight to words that occur often in one document but rarely in the rest of the dataset, which helps the model distinguish between different topics or classes.

2. **Fit and Transform Training Data**:
   - `fit_transform` is used on the training data (`x_train`) to learn the vocabulary and calculate the TF-IDF scores, producing a numerical vector for each document in `x_train`.

3. **Transform Testing Data**:
   - `transform` is used on the test data (`x_test`) to convert it into the same TF-IDF feature space without altering the learned vocabulary. This ensures that the training and test data are represented consistently.

### Why Do This?
TF-IDF vectorization converts raw text into a structured numerical form, capturing the importance of each word within each document. This numerical matrix can then be fed into machine learning algorithms to identify patterns and make predictions.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
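
To make the fit/transform distinction concrete, here is a small sketch on a made-up corpus (the documents and variable names are illustrative). It shows that the vocabulary is learned from the training documents only, and that words seen only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "the senate passed the bill",
    "the president signed the bill",
    "shocking video goes viral",
]
test_docs = ["senate video about aliens"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; nothing new is learned

print(sorted(vec.vocabulary_))                # the 10 distinct training words
print(train_matrix.shape, test_matrix.shape)  # same number of columns in both

# "the" appears in two training documents, "senate" in one,
# so "the" receives the lower IDF weight.
idf_the = vec.idf_[vec.vocabulary_["the"]]
idf_senate = vec.idf_[vec.vocabulary_["senate"]]
print(idf_the < idf_senate)
```

The test sentence contributes only two non-zero entries (`senate` and `video`), because `about` and `aliens` never appeared in the training corpus.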

## Logistic Regression

**Logistic Regression** is a simple yet powerful algorithm used for binary classification tasks. Unlike linear regression, which predicts a continuous value, logistic regression predicts the probability that a data point belongs to a particular class. It passes a weighted sum of the input features through the **logistic (sigmoid) function** to obtain a probability between 0 and 1, which is then thresholded (typically at 0.5) into one of two outcomes (e.g., 0 or 1, true or false).

In summary, logistic regression is ideal for situations where the goal is to classify data into distinct categories, such as identifying spam emails or predicting if a tweet is about a natural disaster.
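
The link between the sigmoid and the final label can be sketched in plain Python. The names `sigmoid` and `predict_label` and the 0.5 threshold are illustrative assumptions; this is a hand-rolled illustration, not scikit-learn's internal code.

```python
import math

def sigmoid(z):
    """Squash any real-valued score z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn the probability into a hard 0/1 label by thresholding."""
    return 1 if sigmoid(z) > threshold else 0

# z stands for the weighted sum of features computed by the model.
for z in (-3.0, 0.0, 3.0):
    print(f"z={z:+.1f}  probability={sigmoid(z):.3f}  label={predict_label(z)}")
```

Strongly negative scores map to probabilities near 0 (class `0`), strongly positive scores to probabilities near 1 (class `1`), and a score of exactly zero sits on the decision boundary at 0.5.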

In [None]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(xv_train,y_train)

In [None]:
# Predicting the test results
pred_lr=LR.predict(xv_test)

In [None]:
# Checking the accuracy of the model
LR.score(xv_test, y_test)

0.9869875222816399

In [None]:
# Checking the classification report
print(classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5966
           1       0.98      0.99      0.99      5254

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220



## Decision Tree Classification

**Decision Tree Classification** is a model that classifies data by splitting it into branches based on feature values. At each split, it asks a yes/no question, creating a path from the root to a leaf, where each leaf represents a class. Decision trees are easy to interpret, but they can overfit if not carefully managed.
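
To see what such yes/no questions look like, the sketch below fits a tree on a tiny made-up dataset and prints its rules with `export_text`. The two features (hypothetical counts of the words "shocking" and "reuters") and all values are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row is one document: [count of "shocking", count of "reuters"] (hypothetical features).
X = [[2, 0], [1, 0], [3, 0], [0, 1], [0, 2], [0, 1]]
y = [0, 0, 0, 1, 1, 1]  # 0 = fake, 1 = real

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print the learned splits as indented if/else rules.
print(export_text(tree, feature_names=["shocking", "reuters"]))

print(tree.predict([[2, 0], [0, 3]]))
```

Because either feature separates the toy classes perfectly, a single question is enough and the tree has depth 1. On real TF-IDF features the tree grows much deeper, which is where the risk of overfitting comes from.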

In [None]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)

In [None]:
# Predicting the test results
pred_dt = DT.predict(xv_test)

In [None]:
# Checking the accuracy of the model
DT.score(xv_test, y_test)

0.9961675579322639

In [None]:
# Checking the classification report
print(classification_report(y_test, pred_dt))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5966
           1       1.00      1.00      1.00      5254

    accuracy                           1.00     11220
   macro avg       1.00      1.00      1.00     11220
weighted avg       1.00      1.00      1.00     11220



## Gradient Boosting Classifier

**Gradient Boosting Classifier** is a powerful model that builds a series of small, simple decision trees, each one correcting the errors of the previous one. By focusing on areas where the model performs poorly, it "boosts" its performance, making it highly accurate and effective at handling complex patterns in data.

**NOTE:** This approach is important because it combines the strengths of multiple weak learners (simple models) into a strong, robust model. Gradient Boosting is widely used in competitive and practical machine learning because it often achieves high accuracy, especially in tasks like classification for finance, healthcare, and customer behavior analysis.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

GBC = GradientBoostingClassifier(random_state=0)
GBC.fit(xv_train, y_train)

In [None]:
# Predicting the test results
pred_gbc = GBC.predict(xv_test)

In [None]:
# Checking the accuracy of the model
GBC.score(xv_test, y_test)

0.9950089126559715

In [None]:
# Checking the classification report
print(classification_report(y_test, pred_gbc))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      5966
           1       0.99      1.00      0.99      5254

    accuracy                           1.00     11220
   macro avg       0.99      1.00      0.99     11220
weighted avg       1.00      1.00      1.00     11220



## Random Forest Classifier

**Random Forest Classifier** is an ensemble learning model that creates multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. Each tree is trained on a random sample of the rows and considers only a random subset of the features at each split, which helps the model learn diverse patterns and become more robust.

In short, Random Forests are popular because they balance accuracy and stability, handling complex data well without easily overfitting.

In [None]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(random_state=0)
RFC.fit(xv_train, y_train)

In [None]:
# Predicting the test results
pred_rfc = RFC.predict(xv_test)

In [None]:
# Checking the accuracy of the model
RFC.score(xv_test, y_test)

0.9902852049910873

In [None]:
# Checking the classification report
print(classification_report(y_test, pred_rfc))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5966
           1       0.99      0.99      0.99      5254

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220
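
Since all four accuracies are close to 1, it can help to translate them into approximate error counts. The sketch below uses only the `.score()` values and the test-set size (11220) reported above; the variable names are illustrative.

```python
# Accuracies copied from the .score() outputs above.
reported = {
    "Logistic Regression": 0.9869875222816399,
    "Decision Tree": 0.9961675579322639,
    "Gradient Boosting": 0.9950089126559715,
    "Random Forest": 0.9902852049910873,
}
n_test = 11220  # support total from the classification reports

# Convert each accuracy into an approximate number of misclassified articles.
errors = {name: round((1 - acc) * n_test) for name, acc in reported.items()}

# Rank models from most to least accurate on this particular split.
ranking = sorted(reported, key=reported.get, reverse=True)
for name in ranking:
    print(f"{name:<20} accuracy={reported[name]:.4f}  errors~{errors[name]}")
```

These figures come from a single random split, so small differences between models may not hold on a different split.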



## Model Testing

Let's see how our models perform on individual news articles that we pass in by hand!

In [None]:
def output_label(n):
    # Map a predicted class integer to a readable label
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Real News"

def manual_testing(news):
    # Wrap the raw string in a one-row dataframe and clean it like the training text
    new_def_test = pd.DataFrame({"text": [news]})
    new_def_test["text"] = new_def_test["text"].apply(wordopt)
    # Reuse the already-fitted vectorizer so the features match the training vocabulary
    new_xv_test = vectorization.transform(new_def_test["text"])
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_GBC = GBC.predict(new_xv_test)
    pred_RFC = RFC.predict(new_xv_test)

    print("\n\nLR Prediction: {} \nDT Prediction: {} \nGBC Prediction: {} \nRFC Prediction: {}".format(
        output_label(pred_LR[0]),
        output_label(pred_DT[0]),
        output_label(pred_GBC[0]),
        output_label(pred_RFC[0])))

In [None]:
# Testing the model with real and fake news
input1_fake = "It would appear Fox News isn t even hiding their blatant racism anymore, nor doing anything about it.Appearing on the cable  news  network, Fox regular Eric Bolling decided he would go all in and call a black Congresswoman a crack addict, and in the worst way possible.While discussing Rep. Maxine Waters (D-CA), Bolling says, in the most repugnant way possible: You saw what happened to Whitney Houston. Step away from the crack pipe. Step away from the Xanax. Step away from the Lorazepam. It s gonna get you in trouble. Which also insinuates she s going to get herself killed because that s what happened to Whitney Houston.Watch here:Now, if you didn t already think Eric Bolling was a raging racist and absolutely horrible person before, you undoubtedly should now. And if you don t, maybe take a good, long look in the mirror and have so self-introspection to how horrible you may be yourself.Who the hell talks like this and is still allowed to keep their job? It s incredibly vile, and unfortunately, not at all surprising.Featured image via video screen capture"
input2_real = "WASHINGTON (Reuters) - Republican Senator John McCain, who is receiving treatment for brain cancer and has missed votes this week, will be available next week to vote on the tax compromise bill, John Cornyn, the No. 2 Republican in the U.S. Senate, said on Thursday. “He’s just resting up,” Cornyn said."
manual_testing(input1_fake)
manual_testing(input2_real)
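
One possible way to avoid repeating the vectorize-then-predict steps by hand is scikit-learn's `make_pipeline`, which bundles the vectorizer and the classifier into a single object. The sketch below is an alternative arrangement, not the notebook's method; the toy texts and labels are made up, and the text cleaning step is left out for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 0 = fake, 1 = real.
train_texts = [
    "shocking secret they do not want you to know",
    "you will not believe this outrageous video",
    "senate committee approves budget resolution",
    "central bank holds interest rates steady",
]
train_labels = [0, 0, 1, 1]

# fit() runs fit_transform on the vectorizer, then fits the classifier.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

# predict() applies transform and the classifier in one call on raw strings.
new_article = ["committee approves new budget"]
pred = pipe.predict(new_article)
proba = pipe.predict_proba(new_article)

print(pred, proba)
```

A single fitted pipeline can also be saved and reloaded as one object, which keeps the vectorizer and the model from drifting apart.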