# Fake News Detection
IE0005 Project by Bryan Noel, Sharyn Anastasia, Nicholas Sachio, Mayank Pallai <br>
Dataset: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

In [None]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import calendar
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

sb.set()


## Import the Dataset

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
Fakedata = pd.read_csv('Fake.csv', on_bad_lines = 'skip', engine = 'python')
Truedata = pd.read_csv('True.csv', on_bad_lines = 'skip', engine = 'python')

In [101]:
Fakedata.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [102]:
Fakedata.shape

(2738, 4)

In [None]:
Truedata.head()

In [103]:
Truedata.shape

(3307, 4)

## Joining the two datasets

There are two datasets: one contains fake news and the other contains real news.
Before we concatenate them, we create a new column called 'reliability' to tell the fake and real news apart.
We set it to 0 for fake news and 1 for real news. Finally, we concatenate the two datasets into a single dataset called `news_data`.

In [None]:
Fakedata['reliability'] = 0
Truedata['reliability'] = 1

In [None]:
news_data = pd.concat([Fakedata, Truedata])
news_data.sample(10)
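
The labelling and concatenation step can be sketched on two tiny made-up frames (the rows below are illustrative, not taken from the dataset). The sketch also shows `ignore_index=True`, an optional setting that the cell above does not use: without it, each file keeps its own row labels, so the labels repeat in the joined frame.

```python
import pandas as pd

# Toy stand-ins for Fake.csv and True.csv (hypothetical rows)
fake = pd.DataFrame({"title": ["a", "b"]})
true = pd.DataFrame({"title": ["c", "d", "e"]})

# Label each source before joining: 0 = fake, 1 = real
fake["reliability"] = 0
true["reliability"] = 1

kept = pd.concat([fake, true])                       # row labels of both frames are kept
fresh = pd.concat([fake, true], ignore_index=True)   # row labels are renumbered from 0

print(kept.index.tolist())
print(fresh.index.tolist())
print(fresh["reliability"].value_counts().to_dict())
```

Repeated row labels are harmless for the plots and the model later on, but they matter as soon as rows are selected by label with `.loc`.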

In [98]:
news_data.shape

(2418, 5)

In [None]:
print("Data type : ", type(news_data))
print("Data dims : ", news_data.shape)

In [None]:
print(news_data.dtypes)

In [None]:
news_data.info()

## Cleaning the dataset
Before we analyse the data, we clean it so that it contains no null values. Clean, consistent input can also help the machine learning model later on.

In [None]:
# Create a copy of the Dataset
news_data_clean = news_data.copy()

# Convert all Variable Names to UPPERCASE
news_data_clean.columns = news_data_clean.columns.str.upper()

# Print the Variable Information to check
news_data_clean.info()

In [None]:
news_data_clean.isnull().sum()

In [99]:
news_data_clean.fillna("")   # Preview only: fillna returns a new DataFrame and leaves news_data_clean unchanged
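
A short sketch of how `fillna` behaves, using a made-up two-row frame: the call returns a filled copy, and the original only changes if the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({"TITLE": ["headline", None]})

preview = df.fillna("")                                   # returns a new DataFrame
missing_before = int(df["TITLE"].isnull().sum())          # original still has the gap
missing_preview = int(preview["TITLE"].isnull().sum())    # the copy does not

df = df.fillna("")                                        # assign back to keep the change
missing_after = int(df["TITLE"].isnull().sum())

print(missing_before, missing_preview, missing_after)
```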

Unnamed: 0,TITLE,TEXT,SUBJECT,DATE,RELIABILITY,MONTH,YEAR
0,Donald Trump Sends Out Embarrassing New Year’...,donald trump send embarrass new year eve messa...,News,2017-12-31,0,December,2017
1,Drunk Bragging Trump Staffer Started Russian ...,drunk brag trump staffer start russian collus ...,News,2017-12-31,0,December,2017
2,Sheriff David Clarke Becomes An Internet Joke...,sheriff david clark becom internet joke threat...,News,2017-12-30,0,December,2017
3,Trump Is So Obsessed He Even Has Obama’s Name...,trump obsess even obama name code websit imag ...,News,2017-12-29,0,December,2017
4,Pope Francis Just Called Out Donald Trump Dur...,pope franci call donald trump christma speechp...,News,2017-12-25,0,December,2017
...,...,...,...,...,...,...,...
1212,Trump to meet Yellen Thursday in search for ne...,trump meet yellen thursday search new fed chai...,politicsNews,2017-10-16,1,October,2017
1213,Trump says he believes Cuba responsible for at...,trump say believ cuba respons attack hurt u di...,politicsNews,2017-10-16,1,October,2017
1214,U.S. condemns Venezuelan elections as neither ...,u condemn venezuelan elect neither free fairu ...,politicsNews,2017-10-16,1,October,2017
1215,EPA head seeks to avoid settlements with green...,epa head seek avoid settlement green groupsepa...,politicsNews,2017-10-16,1,October,2017


In [100]:
news_data_clean.shape

(2418, 7)

## Explore the dataset
Now, we explore the dataset. First, we count the number of fake and real articles in our dataset.

In [None]:
sb.catplot(x = 'RELIABILITY', data = news_data_clean, kind = "count")

Next, here is the number of articles (both real and fake) grouped by subject.

In [None]:
sb.catplot(y = 'SUBJECT', data = news_data_clean, kind = "count").fig.set_figwidth(15)   # catplot creates its own figure, so no plt.figure() call is needed

## New columns for month and year
We can also extract the month and the year of each article using the `to_datetime` function from pandas.

In [None]:
news_data_clean['DATE'] = pd.to_datetime(news_data_clean['DATE'], errors='coerce') # If 'coerce', then invalid parsing will be set as NaT.
news_data_clean['MONTH'] = news_data_clean['DATE'].dt.month_name()
news_data_clean['YEAR'] = news_data_clean['DATE'].dt.year.astype('Int32').astype(str)
news_data_clean.sample(10)
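
To make the effect of `errors='coerce'` concrete, here is a small sketch on two made-up date strings. The unparseable entry becomes `NaT`, so its month is missing and its year string is a placeholder rather than a year; such rows would not appear in a plot whose `order` lists only real years.

```python
import pandas as pd

dates = pd.Series(["December 31, 2017", "not a date"])   # hypothetical inputs
parsed = pd.to_datetime(dates, errors="coerce")          # invalid strings become NaT

months = parsed.dt.month_name()
years = parsed.dt.year.astype("Int32").astype(str)       # nullable int avoids '2017.0'

print(parsed.isna().tolist())
print(months.tolist())
print(years.tolist())
```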

Here, we show the number of articles grouped by year.

In [None]:
years = ['2015','2016','2017','2018']
sb.catplot(x='YEAR', data=news_data_clean, kind = "count", hue = 'RELIABILITY', order = years)

In [None]:
news_data_clean['YEAR'].value_counts()

Lastly, here is the number of articles grouped by month.

In [None]:
months = ['January','February','March','April','May','June','July','August','September','October','November','December']
sb.catplot(x='MONTH', data=news_data_clean, kind = "count", hue = 'RELIABILITY', order = months).fig.set_figwidth(15)

## Importing Essential Libraries

nltk : Natural Language Toolkit library <br>
re : regular expressions <br>
sklearn : scikit-learn library <br>
TfidfVectorizer : Term Frequency-Inverse Document Frequency vectorizer

In [None]:
import re                                                                          # Regular Expression Library
import nltk                                                                        # Natural language toolkit library
from nltk.corpus import stopwords                                                  #Removes word like "the", "where" etc
from nltk.stem.porter import PorterStemmer                                         #Gives root word for a word
from sklearn.feature_extraction.text import TfidfVectorizer                        #Convert text to TF-IDF features (Term Frequency and Inverse Document Frequency)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
nltk.download('stopwords')                                                        # Downloading the stopwords package

In [None]:
print(stopwords.words('english'))                                                  # Printing out all the stopwords in the english language listed in the package

## Pre-processing the Data for Training

In this section, we make sure there are no null values and extract the column that we will use for our predictive model.

In [None]:
news_data_clean.head()

In [None]:
news_data_clean.isnull().sum()

In [None]:
news_data_clean = news_data_clean.fillna('')

In [None]:
news_data_clean["TEXT"] = news_data_clean["TITLE"]                                  # Use only the title as model input (this overwrites the TEXT column)

In [None]:
news_data_clean["TEXT"].head()

### Stemming: reducing a word to its root form

Example:
words : sends, embarrassing, becomes --> stems : send, embarrass, becom (the Porter stemmer strips suffixes, so a stem is not always a dictionary word)

In [None]:
port_stem = PorterStemmer()                                                         # Create a PorterStemmer instance
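
A quick sketch of what the stemmer does to individual words. The word list is made up for illustration, and the instance is named `demo_stemmer` here so that it does not clash with `port_stem`.

```python
from nltk.stem.porter import PorterStemmer

demo_stemmer = PorterStemmer()

# Several inflected forms that share one stem, plus a plural verb form
words = ["connected", "connecting", "connection", "sends"]
stems = [demo_stemmer.stem(word) for word in words]

print(dict(zip(words, stems)))
```

Mapping different forms to the same stem shrinks the vocabulary that the vectorizer has to learn later.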

In [None]:
stop_words = set(stopwords.words("english"))                                        # Build the stopword set once instead of once per word

def stemming(text):                                                                 # Clean a string and reduce each word to its stem
  stem_text = re.sub('[^a-zA-Z]', ' ', text)                                        # Replace everything except ASCII letters with a space
  stem_text = stem_text.lower()
  stem_text = stem_text.split()
  stem_text = [port_stem.stem(word) for word in stem_text if word not in stop_words]
  stem_text = ' '.join(stem_text)
  return stem_text
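
The character class in the substitution decides which characters survive cleaning. The sketch below compares two similar-looking classes on a made-up string: `A-z` is a single ASCII range that also covers a few punctuation characters, whereas `a-zA-Z` covers letters only.

```python
import re

sample = "fake_news [2017] `quoted`"   # hypothetical input

# A-z spans ASCII codes 65 to 122, which also includes [ \ ] ^ _ `
wide = re.sub("[^a-zA-z]", " ", sample)

# a-z plus A-Z covers only the 52 ASCII letters
strict = re.sub("[^a-zA-Z]", " ", sample)

print(repr(wide))
print(repr(strict))
print(strict.split())
```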

In [None]:
news_data_clean["TEXT"] = news_data_clean["TEXT"].apply(stemming)                   # Applying the function to our data (Title)

In [None]:
print(news_data_clean["TEXT"])

In [None]:
X = news_data_clean.drop(columns= "RELIABILITY", axis = 1)
Y = news_data_clean["RELIABILITY"]

In [None]:
X= news_data_clean["TEXT"].values
Y= news_data_clean["RELIABILITY"].values

In [None]:
print(X)                                                                            # You can finally see the output after text cleaning

In [None]:
print(Y)

### Applying Vectorization to our Title Data

Vectorization maps each piece of text to a vector of real numbers. With TF-IDF, a word receives a higher weight when it is frequent in one title but rare across all titles.

In [None]:
vectorizer = TfidfVectorizer()                                                      # Applying vectorization methods to transform the text data into numbers
vectorizer.fit(X)

X = vectorizer.transform(X) 

In [None]:
print(X)
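
To show what the sparse matrix contains, here is a minimal sketch on three made-up, already-stemmed titles. The names `demo_vec` and `demo_matrix` are chosen for this example only. With the default settings each row is scaled to unit length, and a word that occurs in several documents ("trump" here) gets a lower weight than a word unique to one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["trump send new year messag", "pope call trump", "senat pass tax bill"]

demo_vec = TfidfVectorizer()
demo_matrix = demo_vec.fit_transform(docs)       # sparse matrix, one row per document

dense = demo_matrix.toarray()
row_norms = (dense ** 2).sum(axis=1)             # squared length of each row

# Compare two words inside the second document
trump_weight = dense[1, demo_vec.vocabulary_["trump"]]
pope_weight = dense[1, demo_vec.vocabulary_["pope"]]

print(demo_matrix.shape)                         # (documents, vocabulary size)
print(sorted(demo_vec.vocabulary_))
print(trump_weight < pope_weight)
```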

### Splitting and Training the Data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify =Y, random_state = 42)       # Splitting into training and test data (stratified by label)
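
The cells above fit the vectorizer on all rows before splitting. An alternative, shown here only as a sketch on made-up headlines, is to split first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vocabulary and IDF weights are learned from the training rows alone. This is a design variation, not what the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned headlines: 0 = fake, 1 = real
texts = ["shock secret expos", "senat pass budget bill",
         "you wont believ trick", "court uphold trade rule",
         "celebr hoax goe viral", "minist sign trade deal",
         "miracl cure doctor hate", "bank rais interest rate"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text first; stratify keeps the class ratio in both parts
txt_train, txt_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(txt_train, y_train)                 # the vectorizer sees training text only
test_accuracy = pipe.score(txt_test, y_test)

print(len(txt_train), len(txt_test), sorted(y_test), test_accuracy)
```

On a toy set this small the accuracy itself means little; the point is the order of operations.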

Training the model using logistic regression, a standard and easily interpreted choice for binary classification such as fake vs. real.

In [None]:
model = LogisticRegression()                                                           # Create the logistic regression model

In [None]:
model.fit(X_train, Y_train)

In [None]:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)                  # Argument order is accuracy_score(y_true, y_pred)

In [None]:
print(training_data_accuracy)                                                         # Printing train data accuracy

In [None]:
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)                        # Argument order is accuracy_score(y_true, y_pred)

In [None]:
print(test_data_accuracy)                                                           # Printing test data accuracy
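
Accuracy alone does not show which class the errors fall in. As a complement, the sketch below computes a confusion matrix on made-up labels (not results from this notebook); rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 0 = fake, 1 = real
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)    # cm[i, j]: true class i predicted as class j

print(acc)
print(cm)
```

Here `cm[0, 1]` counts fake articles labelled as real, which is usually the costlier mistake for a fake news detector.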

### Making a Predictive System based on our model

In [None]:
Z = int(input("Enter a row index from the test set (X_test):"))
X_new = X_test[Z]

predicter = model.predict(X_new)
print(predicter)

if(predicter[0]==0):
  print("The news is Fake")
else:
  print("The news is Real")

In [None]:
print (Y_test[Z])
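
The cell above looks up a row that is already in the test matrix. To classify a new headline, the text would need the same cleaning and the same fitted vectorizer. The sketch below shows the idea on a toy model; the helper `predict_headline`, the names `toy_vectorizer` and `toy_model`, and the training headlines are all hypothetical, and stemming is left out to keep it short.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data: 0 = fake, 1 = real
train_titles = ["shock secret expos", "senat pass budget bill",
                "miracl cure doctor hate", "court uphold trade rule"]
train_labels = [0, 1, 0, 1]

toy_vectorizer = TfidfVectorizer().fit(train_titles)
toy_model = LogisticRegression().fit(
    toy_vectorizer.transform(train_titles), train_labels)

def predict_headline(title):
    """Hypothetical helper: clean a headline, vectorize it, return a label."""
    cleaned = re.sub("[^a-zA-Z]", " ", title).lower()
    features = toy_vectorizer.transform([cleaned])   # transform only, never refit
    return "Real" if toy_model.predict(features)[0] == 1 else "Fake"

result = predict_headline("Senat pass trade bill!")
print(result)
```

Calling `transform` rather than `fit_transform` matters: refitting on the new text would change the vocabulary and break the match with the trained model's columns.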

## Combined Title + Text Columns of the Dataset

#### We also tried training on the title and text columns combined, and checked whether the model still works without overfitting.

In [None]:
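# TEXT now holds the stemmed titles, so re-read the article body from news_data (same row order as the copy)
news_data_clean["TEXT"] = news_data_clean["TITLE"] + " " + news_data["text"].fillna("").values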
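
When two text columns are joined directly, the last word of the first and the first word of the second fuse into one token, which may be what the `speechp...` fragment in the earlier table preview reflects. A small sketch on a made-up frame (the column name `BODY` is hypothetical) shows the difference a separator makes, and why missing values should be filled first.

```python
import pandas as pd

df = pd.DataFrame({"TITLE": ["Christmas Speech", "Budget vote"],
                   "BODY": ["Pope Francis used his message", None]})

# Fill missing values first, otherwise the combined cell becomes NaN
body = df["BODY"].fillna("")

glued = df["TITLE"] + body            # words at the boundary fuse together
spaced = df["TITLE"] + " " + body     # a space keeps them as separate tokens

print(glued[0])
print(spaced[0])
print(repr(spaced[1]))
```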

In [None]:
news_data_clean["TEXT"].head()

In [None]:
news_data_clean["TEXT"] = news_data_clean["TEXT"].apply(stemming)

In [None]:
print(news_data_clean["TEXT"])

In [None]:
X = news_data_clean.drop(columns= "RELIABILITY", axis = 1)
Y = news_data_clean["RELIABILITY"]

In [None]:
X= news_data_clean["TEXT"].values
Y= news_data_clean["RELIABILITY"].values

In [None]:
print(X)

In [None]:
print(Y)

In [None]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X) 

In [None]:
print(X)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify =Y, random_state = 42)

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)

In [None]:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print(training_data_accuracy)

In [None]:
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print(test_data_accuracy)
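
Comparing one training score with one test score gives a first indication of overfitting. A complementary check, sketched here on made-up headlines rather than the real data, is k-fold cross-validation: the model is trained and scored on several different splits, and a large spread between folds suggests an unstable fit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned headlines: 0 = fake, 1 = real
texts = ["shock secret expos", "senat pass budget bill",
         "you wont believ trick", "court uphold trade rule",
         "celebr hoax goe viral", "minist sign trade deal",
         "miracl cure doctor hate", "bank rais interest rate"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# The pipeline refits the vectorizer inside every fold
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(pipe, texts, labels, cv=4)   # one accuracy per fold

print(scores)
print(scores.mean(), scores.std())
```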

In [None]:
Z = int(input("Enter a row index from the test set (X_test):"))
X_new = X_test[Z]

predicter = model.predict(X_new)
print(predicter)

if(predicter[0]==0):
  print("The news is Fake")
else:
  print("The news is Real")

In [None]:
print (Y_test[Z])