In [2]:
# Import libraries
import pandas as pd # library for data manipulation and analysis
import numpy as np # library for large, multi-dimensional arrays and the mathematical functions that operate on them
import seaborn as sns # data visualization library, provides an interface for creating statistical graphics
import matplotlib.pyplot as plt # plotting library for creating static, animated, and interactive visualizations 

from sklearn.model_selection import train_test_split # splits a dataset into random train and test subsets
from sklearn.metrics import accuracy_score # fraction of correctly classified samples
from sklearn.metrics import classification_report # per-class precision, recall, f1-score and support

import re # module for regular expression operations
import string # provides a collection of common string operations and constants

In [3]:
# Import datasets
data_fake = pd.read_csv("Fake.csv")
data_real = pd.read_csv("Real.csv")

In [4]:
# Print the first 5 rows of the dataset with fake news
data_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [5]:
# Print the first 5 rows of the dataset with real news
data_real.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [6]:
# Insert a column "label" as target feature
data_fake["label"] = 0
data_real["label"] = 1

In [7]:
# Print the shape of each dataset: (nr. of rows, nr. of columns)
# i.e., (nr. of entries, nr. of features)
data_fake.shape, data_real.shape

((23481, 5), (21417, 5))

In [8]:
# Hold out the last 15 rows of each dataset for manual testing
data_fake_manual_testing = data_fake.tail(15)
data_fake = data_fake.iloc[:-15] # keep everything except the last 15 rows

data_real_manual_testing = data_real.tail(15)
data_real = data_real.iloc[:-15] # keep everything except the last 15 rows

In [9]:
# Print the shape of each dataset to make sure that it is correct
data_fake.shape, data_real.shape

((23466, 5), (21402, 5))

In [10]:
# Print the first 5 entries in the dataset with fake news for manual testing
data_fake_manual_testing.head()

Unnamed: 0,title,text,subject,date,label
23466,Boston Brakes? How to Hack a New Car With Your...,21st Century Wire says For those who still ref...,Middle-east,"January 22, 2016",0
23467,Oregon Governor Says Feds ‘Must Act’ Against P...,"21st Century Wire says So far, after nearly 20...",Middle-east,"January 21, 2016",0
23468,Ron Paul on Burns Oregon Standoff and Jury Nul...,21st Century Wire says If you ve been followin...,Middle-east,"January 21, 2016",0
23469,BOILER ROOM: As the Frogs Slowly Boil – EP #40,Tune in to the Alternate Current Radio Network...,Middle-east,"January 20, 2016",0
23470,Arizona Rancher Protesting in Oregon is Target...,RTOne of the most visible members of the armed...,Middle-east,"January 20, 2016",0


In [11]:
# Print the first 5 entries in the dataset with real news for manual testing
data_real_manual_testing.head()

Unnamed: 0,title,text,subject,date,label
21402,Exclusive: Trump's Afghan decision may increas...,ON BOARD A U.S. MILITARY AIRCRAFT (Reuters) - ...,worldnews,"August 22, 2017",1
21403,U.S. puts more pressure on Pakistan to help wi...,WASHINGTON (Reuters) - The United States sugge...,worldnews,"August 21, 2017",1
21404,Exclusive: U.S. to withhold up to $290 million...,WASHINGTON (Reuters) - The United States has d...,worldnews,"August 22, 2017",1
21405,Trump talks tough on Pakistan's 'terrorist' ha...,ISLAMABAD (Reuters) - Outlining a new strategy...,worldnews,"August 22, 2017",1
21406,"U.S., North Korea clash at U.N. forum over nuc...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1


In [12]:
# Merge fake and real datasets for manual testing and save the result in a separate csv file
data_manual_testing = pd.concat([data_fake_manual_testing, data_real_manual_testing], axis = 0)
data_manual_testing.to_csv("Manual_testing.csv")

In [13]:
# Merge fake and real datasets used for training
data_training = pd.concat([data_fake, data_real], axis = 0)
# Print the shape of the resulting dataset
data_training.shape

(44868, 5)

In [14]:
data_training.columns

Index(['title', 'text', 'subject', 'date', 'label'], dtype='object')

In [15]:
# Remove columns that are not needed
data_training = data_training.drop(["title", "subject", "date"], axis = 1)

In [16]:
# Check if there are values missing
data_training.isnull().sum()

text     0
label    0
dtype: int64

In [17]:
# Shuffle the dataset so that real and fake data are mixed
data_training = data_training.sample(frac = 1) # frac = fraction of axis items to return

In [18]:
# Print the first 5 rows of the resulting dataset to make sure everything's correct
data_training.head()

Unnamed: 0,text,label
12564,One can only imagine what kind of relationship...,0
17905,Has the Left and the leftist media managed to ...,0
10965,WASHINGTON (Reuters) - Senate Democrats teamed...,1
2705,Donald Trump hasn t even been president for a ...,0
5086,There s no doubt about it: Trump and his campa...,0


In [19]:
# Reset indices back to the default, i.e., 0, 1, 2...
data_training.reset_index(inplace = True)
data_training.head()

Unnamed: 0,index,text,label
0,12564,One can only imagine what kind of relationship...,0
1,17905,Has the Left and the leftist media managed to ...,0
2,10965,WASHINGTON (Reuters) - Senate Democrats teamed...,1
3,2705,Donald Trump hasn t even been president for a ...,0
4,5086,There s no doubt about it: Trump and his campa...,0


In [20]:
data_training.columns

Index(['index', 'text', 'label'], dtype='object')

In [21]:
# Remove the column "index" as it is not needed for the training
data_training.drop(["index"], axis = 1, inplace = True)
data_training.head()

Unnamed: 0,text,label
0,One can only imagine what kind of relationship...,0
1,Has the Left and the leftist media managed to ...,0
2,WASHINGTON (Reuters) - Senate Democrats teamed...,1
3,Donald Trump hasn t even been president for a ...,0
4,There s no doubt about it: Trump and his campa...,0


In [22]:
# Create a function to pre-process the textual data
def preprocess(text):
    text = text.lower() # all characters to lowercase
    text = re.sub(r'\[.*?\]', '', text) # remove square-bracketed text, including the brackets
    text = re.sub(r'\W', ' ', text) # replace every non-word character (anything except letters, digits and underscores) with a space
    # Note: the step above already turns ":", "/", ".", "<", ">" and newlines into spaces,
    # so the URL, HTML-tag and newline patterns below can no longer match anything
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # remove URLs starting with "http://", "https://" or "www."
    text = re.sub(r'<.*?>+', '', text) # remove HTML tags (the tags only, not the text between them)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove punctuation; at this point only underscores are left to remove
    text = re.sub(r'\n', '', text) # remove newline characters
    text = re.sub(r'\w*\d\w*', '', text) # remove any word that contains a digit
    return text
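
The order of the substitutions matters: once non-word characters have been replaced by spaces, a URL no longer contains "://" and the URL pattern cannot match it. The sketch below is a hypothetical variant (`preprocess_reordered` is not part of this notebook, and the sample string is made up) that strips URLs and HTML tags first and collapses the remaining non-word characters last. It is an illustration only; the scores reported further down were obtained with `preprocess` as defined above.

```python
import re
import string

def preprocess_reordered(text):
    """Hypothetical variant of preprocess(): URLs and tags are removed
    before the non-word characters they depend on are replaced."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                  # square-bracketed text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)    # URLs
    text = re.sub(r"<.*?>+", "", text)                   # HTML tags (tags only)
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\w*\d\w*", "", text)                 # words containing digits
    text = re.sub(r"\W+", " ", text)                     # collapse whitespace and leftovers
    return text.strip()

# With the original order, "://" is gone before the URL pattern runs
after_w = re.sub(r"\W", " ", "https://example.com/a1")
print(re.search(r"https?://\S+|www\.\S+", after_w))      # None: nothing left to match

sample = "Read <b>MORE</b> at https://example.com/a1 [video] in 2017!"
print(preprocess_reordered(sample))
```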

In [23]:
# Apply the function for the column "text" in our dataset
data_training["text"] = data_training["text"].apply(preprocess)

# Print to see the result
data_training.head()

Unnamed: 0,text,label
0,one can only imagine what kind of relationship...,0
1,has the left and the leftist media managed to ...,0
2,washington reuters senate democrats teamed...,1
3,donald trump hasn t even been president for a ...,0
4,there s no doubt about it trump and his campa...,0


In [24]:
# Define dependent and independent variables
x = data_training["text"] # independent variable
y = data_training["label"] # dependent variable = the one we want to predict

In [25]:
# Split training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

In [26]:
# Convert text to vectors
from sklearn.feature_extraction.text import TfidfVectorizer # tf-idf: term frequency–inverse document frequency

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train) # fit: learn the vocabulary and idf weights from the training texts; transform: convert those texts into tf-idf vectors
xv_test = vectorization.transform(x_test) # reuse the vocabulary and idf weights learned from the training texts
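
To make the difference between fitting and transforming concrete, here is a minimal pure-Python sketch of tf-idf. It assumes the scikit-learn defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation) and simplifies tokenisation to whitespace splitting. The helper names `fit_idf` and `transform` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn one idf weight per term from the training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def transform(doc, idf):
    """Turn one document into an L2-normalised tf-idf vector.
    Terms that were not seen during fitting are ignored."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {term: w / norm for term, w in weights.items()}

train_docs = ["trump said today", "reuters said today", "reuters reported"]
idf = fit_idf(train_docs)                               # "fit" on training text only
vec = transform("reuters said something new", idf)      # "transform" unseen text

print(sorted(vec))                   # only terms from the training vocabulary survive
print(idf["trump"] > idf["said"])    # rarer terms get a larger idf weight
```

Fitting on the training split only, as the cell above does, keeps information from the test texts out of the vocabulary and idf weights.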

# Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression # a statistical algorithm used to model the relationship between a binary dependent variable and one or more independent variables by estimating the probabilities of different outcomes

LR = LogisticRegression()
LR.fit(xv_train, y_train)
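
As a rough illustration of what the fitted model does at prediction time: logistic regression computes a weighted sum of the tf-idf features, passes it through the sigmoid function to obtain a probability for class 1 ("Real"), and predicts class 1 when that probability exceeds 0.5. The weights, bias and feature values below are invented for the example and are not taken from the trained `LR` model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features):
    """Return (label, probability of class 1) for one document.
    `features` maps a term to its tf-idf value; unknown terms get weight 0."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    p_real = sigmoid(z)
    return (1 if p_real > 0.5 else 0), p_real

# Hypothetical coefficients: positive pushes towards "Real", negative towards "Fake"
weights = {"reuters": 2.0, "wire": -1.5}
bias = 0.0

label_a, p_a = predict_label(weights, bias, {"reuters": 0.7, "said": 0.7})
label_b, p_b = predict_label(weights, bias, {"wire": 1.0})
print(label_a, round(p_a, 3))
print(label_b, round(p_b, 3))
```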

In [28]:
pred_LR = LR.predict(xv_test)

In [29]:
LR.score(xv_test, y_test) # mean accuracy: the fraction of test samples whose predicted label matches the true label

0.9871623428724258

In [30]:
print(classification_report(y_test, pred_LR))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5844
           1       0.98      0.99      0.99      5373

    accuracy                           0.99     11217
   macro avg       0.99      0.99      0.99     11217
weighted avg       0.99      0.99      0.99     11217



# Dummy Classifier

In [47]:
from sklearn.dummy import DummyClassifier

DC = DummyClassifier(strategy="most_frequent")
DC.fit(xv_train, y_train)

In [48]:
pred_DC = DC.predict(xv_test)

In [49]:
DC.score(xv_test, y_test)

0.520994918427387

In [50]:
print(classification_report(y_test, pred_DC))

              precision    recall  f1-score   support

           0       0.52      1.00      0.69      5844
           1       0.00      0.00      0.00      5373

    accuracy                           0.52     11217
   macro avg       0.26      0.50      0.34     11217
weighted avg       0.27      0.52      0.36     11217
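
The baseline figures in this report can be reproduced by hand. The sketch below assumes that the `most_frequent` dummy predicts class 0 for all 11217 test rows (5844 fake and 5373 real, as given in the support column) and treats an undefined ratio as 0, which is also what the report shows for class 1. The variable names are illustrative.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a metric is undefined (0 / 0)."""
    return numerator / denominator if denominator else 0.0

support = {0: 5844, 1: 5373}   # true class counts in the test set
total = sum(support.values())

# Every row is predicted as class 0
tp = {0: 5844, 1: 0}           # correctly predicted as the class
fp = {0: 5373, 1: 0}           # predicted as the class but belonging to the other one
fn = {0: 0, 1: 5373}           # belonging to the class but predicted as the other one

precision = {c: safe_div(tp[c], tp[c] + fp[c]) for c in support}
recall = {c: safe_div(tp[c], tp[c] + fn[c]) for c in support}
f1 = {c: safe_div(2 * precision[c] * recall[c], precision[c] + recall[c]) for c in support}

accuracy = sum(tp.values()) / total
macro_f1 = sum(f1.values()) / len(f1)                        # plain average over classes
weighted_f1 = sum(f1[c] * support[c] for c in support) / total  # average weighted by support

print(round(precision[0], 2), round(recall[0], 2), round(f1[0], 2))
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

Any useful classifier has to beat this accuracy of about 0.52, which is simply the share of the majority class in the test set.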





# Decision Tree Classification

In [31]:
from sklearn.tree import DecisionTreeClassifier # predicts the value of a target variable by learning simple decision rules inferred from the data features

DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)
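
A decision rule in such a tree has the form "feature value <= threshold". With the default `gini` criterion, candidate thresholds are compared by the weighted Gini impurity of the two groups they produce, and lower is better. The sketch below shows that calculation on six made-up tf-idf values for a single hypothetical term; it is not the tree-building algorithm itself.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, 0.5 for a 50/50 binary mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted impurity after splitting on `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical tf-idf values of one term in six documents, with their labels
tfidf_values = [0.0, 0.0, 0.1, 0.4, 0.5, 0.6]
labels = [0, 0, 0, 1, 1, 1]

print(gini(labels))                                  # impurity before splitting
print(split_impurity(tfidf_values, labels, 0.25))    # separates the classes perfectly
print(split_impurity(tfidf_values, labels, 0.05))    # leaves one mixed group
```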

In [32]:
pred_DT = DT.predict(xv_test)

In [33]:
DT.score(xv_test, y_test)

0.9947401265935634

In [34]:
print(classification_report(y_test, pred_DT))

# precision: measures the proportion of correctly predicted positive instances out of all instances predicted as positive
# recall: measures the proportion of correctly predicted positive instances out of all actual positive instances
# f1-score: the harmonic mean of precision and recall and provides a single metric that balances both measures
# support: represents the number of instances in each class

# accuracy: the overall proportion of correctly predicted instances out of all instances
# macro avg: calculates the average of precision, recall, and F1-score across both classes
# weighted avg: calculates the average of precision, recall, and F1-score, weighted by the support of each class

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5844
           1       1.00      0.99      0.99      5373

    accuracy                           0.99     11217
   macro avg       0.99      0.99      0.99     11217
weighted avg       0.99      0.99      0.99     11217



# Gradient Boosting Classifier

In [35]:
from sklearn.ensemble import GradientBoostingClassifier # builds an additive ensemble of small decision trees, where each new tree is fit to the negative gradient of the loss of the ensemble built so far

GBC = GradientBoostingClassifier(random_state = 0)
GBC.fit(xv_train, y_train)

In [36]:
pred_GBC = GBC.predict(xv_test)

In [37]:
GBC.score(xv_test, y_test)

0.9948292769902826

In [38]:
print(classification_report(y_test, pred_GBC))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      5844
           1       0.99      1.00      0.99      5373

    accuracy                           0.99     11217
   macro avg       0.99      0.99      0.99     11217
weighted avg       0.99      0.99      0.99     11217



# Random Forest Classifier

In [39]:
from sklearn.ensemble import RandomForestClassifier # an ensemble learning algorithm that combines multiple decision trees to make predictions by aggregating the results of individual trees, resulting in improved accuracy and robustness in classification tasks

RFC = RandomForestClassifier(random_state = 0)
RFC.fit(xv_train, y_train)

In [40]:
pred_RFC = RFC.predict(xv_test)

In [41]:
RFC.score(xv_test, y_test)

0.9896585539805652

In [42]:
print(classification_report(y_test, pred_RFC))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5844
           1       0.99      0.99      0.99      5373

    accuracy                           0.99     11217
   macro avg       0.99      0.99      0.99     11217
weighted avg       0.99      0.99      0.99     11217



# Model Testing

In [43]:
def output_label(n):
    if n == 0:
        return "Fake"
    elif n == 1:
        return "Real"
    
def manual_testing(news):
    testing_news = {"text": [news]}
    testing_news_df = pd.DataFrame(testing_news)
    testing_news_df["text"] = testing_news_df["text"].apply(preprocess)
    x_test = testing_news_df["text"]
    xv_test = vectorization.transform(x_test)
    pred_LR = LR.predict(xv_test)
    pred_DT = DT.predict(xv_test)
    pred_GBC = GBC.predict(xv_test)
    pred_RFC = RFC.predict(xv_test)
    
    # Print the prediction of each model; print() returns None, so there is nothing to return
    print("\n\nLR Prediction: {} \nDT Prediction: {} \nGB Prediction: {} \nRF Prediction: {}".format(output_label(pred_LR[0]), output_label(pred_DT[0]), output_label(pred_GBC[0]), output_label(pred_RFC[0])))


In [44]:
# Testing with real news from the dataset
news = str(input())
manual_testing(news)

BRUSSELS (Reuters) - NATO allies on Tuesday welcomed President Donald Trump s decision to commit more forces to Afghanistan, as part of a new U.S. strategy he said would require more troops and funding from America s partners. Having run for the White House last year on a pledge to withdraw swiftly from Afghanistan, Trump reversed course on Monday and promised a stepped-up military campaign against  Taliban insurgents, saying:  Our troops will fight to win .  U.S. officials said he had signed off on plans to send about 4,000 more U.S. troops to add to the roughly 8,400 now deployed in Afghanistan. But his speech did not define benchmarks for successfully ending the war that began with the U.S.-led invasion of Afghanistan in 2001, and which he acknowledged had required an   extraordinary sacrifice of blood and treasure .  We will ask our NATO allies and global partners to support our new strategy, with additional troops and funding increases in line with our own. We are confident they w

In [45]:
# Testing with fake news from the dataset
news = str(input())
manual_testing(news)

21st Century Wire says For those who still refuse to entertain any Michael Hastings  Boston Breaks  conspiracy theories, this latest story offers yet more proof that hacking a vehicle is not only possible   it s relatively easy to do. Everyone has seen or heard those annoying ads on TV and radio, for those new  smart  monitoring devices that insurance companies are trying hard to convince you to install in your car. They tell us,  If you are a safe driver, then we ll be sure to pass on premium discounts to safe drivers. To even the most mild skeptic, this little invention reeks of Big Brother tech.According to crafty insurance moguls, their revolutionary  smart  plug-in device connects to your car s computer port which is normally located just beneath the steering wheel. They claim that it only records drivers when they accelerate, brake and steer. They claim that this data will reveal how erratic a driver you are, which in turn will determine your road risk actuary   and how much you 

In [46]:
# Testing with real news from the Internet
news = str(input())
manual_testing(news)

Ron DeSantis’ decision to announce his 2024 White House bid in a conversation with Elon Musk on Twitter on Wednesday will make a typically blunt statement about his campaign, the unruly populism of the modern Republican Party and an accelerating conservative media revolution. Florida’s governor will finally jump into the race by throwing down a gauntlet to ex-President Donald Trump with a launch strategy that frames him as the true anti-establishment rebel in the race who is willing to crush the conventions of traditional presidential politics. His choice of venue on Twitter Spaces – the site’s audio platform – also exemplifies the Trump-era GOP’s transformation into a party that rewards gesture politics and whose activists respond to the unmoderated social media jungle while disdaining traditional standards of conduct and governance.  Ron DeSantis to kick off 2024 presidential campaign in conversation with Twitter owner Elon Musk But while Twitter’s attractiveness to conservative vote



LR Prediction: Fake 
DT Prediction: Fake 
GB Prediction: Fake 
RF Prediction: Fake


In [83]:
news = str(input())
manual_testing(news)

Patrick Henningsen  21st Century WireRemember when the Obama Administration told the world how it hoped to identify 5,000 reliable non-jihadist  moderate  rebels hanging out in Turkey and Jordan, who might want to fight for Washington in Syria? After all the drama over its infamous  train and equip  program to create their own Arab army in Syria, they want to give it another try.This week, Pentagon officials announced their new plan to train up to 7,000 more  moderate  fighters, but this time the project would take place inside Syria (and to hell with international law).We re told that this was requested by Ankara, and with all NATO allies singing the same hymn   claiming that this new effort will help in securing Turkey s porous border with Syria, or so the story goes. Washington s political cover for this is fashioned from the popular post-Paris theme: to protect civilized Europe from invading hordes and the terrorists who hide among them, as stated in the Wall Street Journal: The pr



LR Prediction: Fake 
DT Prediction: Fake 
GB Prediction: Fake 
RF Prediction: Fake


In [84]:
news = str(input())
manual_testing(news)

JAKARTA (Reuters) - Indonesia will buy 11 Sukhoi fighter jets worth $1.14 billion from Russia in exchange for cash and Indonesian commodities, two cabinet ministers said on Tuesday. The Southeast Asian country has pledged to ship up to $570 million worth of commodities in addition to cash to pay for the Suhkoi SU-35 fighter jets, which are expected to be delivered in stages starting in two years. Indonesian Trade Minister Enggartiasto Lukita said in a joint statement with Defence Minister Ryamizard Ryacudu that details of the type and volume of commodities were  still being negotiated . Previously he had said the exports could include palm oil, tea, and coffee. The deal is expected to be finalised soon between Indonesian state trading company PT Perusahaan Perdangangan Indonesia and Russian state conglomerate Rostec. Russia is currently facing a new round of U.S.-imposed trade sanctions. Meanwhile, Southeast Asia s largest economy is trying to promote its palm oil products amid threats