# Intro to ML Final Project
**Author: Bradley Sides**

Github Link:

Dataset: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


## Problem Description
In our age of information, where "news" is shown in bite-sized formats on every social media platform, it is becoming increasingly difficult to separate "real" news from "fake" news. The consequences of people believing fake news stories run deep and cause major problems for people across the world. Machine learning, paired with NLP techniques, can be valuable for identifying fake news when it occurs.

In this project, we will use BERTopic (a BERT-based topic model) to produce topic probabilities, combine them with TF-IDF features, and feed the result to a Random Forest to classify news articles as real or fake.

## Step 0: Installs and Imports for Colab

In [1]:
pip install bertopic



In [2]:
from bertopic import BERTopic
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
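from sklearn.model_selection import cross_val_score

# Download the NLTK stop-word corpus used in the cleaning step (needed on a fresh runtime)
nltk.download('stopwords')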


## Step 1: EDA

In [3]:
# IMPORTANT: Adjust the path to fit your file structure

real_df = pd.read_csv('/True.csv.zip')
fake_df = pd.read_csv('/Fake.csv.zip')

fake_df = fake_df[:5000]
real_df = real_df[:5000]

print("Rows in Fake DF: ", len(fake_df))
print("Rows in Real DF: ", len(real_df))

fake_df.head()

Rows in Fake DF:  5000
Rows in Real DF:  5000


| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [4]:
# Drop "subject" and "date" columns as we do not need them
# Add "Real" and "Fake" as labels before we shuffle and train our model

real_df = real_df.drop(['subject', 'date'], axis = 1)
fake_df = fake_df.drop(['subject', 'date'], axis = 1)
real_df = real_df.reset_index(drop=True)
fake_df = fake_df.reset_index(drop=True)
real_df['Label'] = 'Real'
fake_df['Label'] = 'Fake'

# Concatenate title with body into a single text column
real_df['full_text'] = real_df['title'] + ' ' + real_df['text']
fake_df['full_text'] = fake_df['title'] + ' ' + fake_df['text']
real_df = real_df.drop(['title', 'text'], axis = 1)
fake_df = fake_df.drop(['title', 'text'], axis = 1)

# Combine into one two-column frame (Label, full_text) and shuffle
full_df = pd.concat([real_df, fake_df], axis = 0)
full_df = full_df.sample(frac = 1).reset_index(drop = True)
full_df.head(20)

| | Label | full_text |
|---|---|---|
| 0 | Real | Factbox: Five facts about Tom Cotton, Trump's ... |
| 1 | Fake | Trump Declares NATO ‘Obsolete,’ Wants Europea... |
| 2 | Real | Trump: military solutions 'locked and loaded' ... |
| 3 | Fake | Trump Whines About Obamacare And Takes Credit... |
| 4 | Fake | Texas Republican Rejects Canadian Offer Of Re... |
| 5 | Real | Trump says 'his decision' to fire FBI chief, c... |
| 6 | Real | Trump to nominate Chatterjee, Powelson to FERC... |
| 7 | Fake | Trump Just Literally Gave The WWE A Cabinet P... |
| 8 | Fake | Michelle Obama TORCHES Donald Trump: ‘It’s No... |
| 9 | Real | Nuclear weapons will not bring security for No... |


In [5]:
# Save some examples for manual testing later on

real_examples = full_df[full_df['Label'] == 'Real'].sample(2)
fake_examples = full_df[full_df['Label'] == 'Fake'].sample(2)
test_examples = pd.concat([real_examples, fake_examples], axis = 0)
full_df = full_df.drop(test_examples.index)

In [6]:
# Turn columns to lists for processing

text = full_df['full_text'].tolist()
labels = full_df['Label'].tolist()
print(text[0])
print(labels[0])

Factbox: Five facts about Tom Cotton, Trump's likely pick for CIA WASHINGTON (Reuters) - U.S. Senator Tom Cotton, a hawkish Iraq war veteran who has said he did not consider waterboarding to be torture, is likely to be chosen by President Donald Trump as the next leader of the Central Intelligence Agency amid a Cabinet shake-up, senior administration officials said. Here are five facts about the Arkansas Republican: - Cotton, 40, is a staunch Trump ally who has vigorously opposed the Iran nuclear deal. He served one term in the House of Representatives before being elected to the Senate in 2014 as part of a Republican wave. Cotton sits on the Intelligence Committee but has no experience managing a large organization. - Cotton has given a qualified endorsement of the intelligence community’s assessment that Russia interfered in the 2016 presidential election to boost Trump’s prospects. “I have no doubts about the intelligence community’s assessment,” he said in an Oct. 5 interview with 

## Step 1.5: Data Cleaning

In [7]:
# String formatting, remove stop words

stop_words = set(stopwords.words('english'))
text = pd.Series(text)
text = text.str.replace(r'[^\w\s]', '', regex = True).str.lower()
text = text.apply(
        lambda x: " ".join([word for word in x.split() if word not in stop_words])
)
print(text[0])
print(labels[0])

factbox five facts tom cotton trumps likely pick cia washington reuters us senator tom cotton hawkish iraq war veteran said consider waterboarding torture likely chosen president donald trump next leader central intelligence agency amid cabinet shakeup senior administration officials said five facts arkansas republican cotton 40 staunch trump ally vigorously opposed iran nuclear deal served one term house representatives elected senate 2014 part republican wave cotton sits intelligence committee experience managing large organization cotton given qualified endorsement intelligence communitys assessment russia interfered 2016 presidential election boost trumps prospects doubts intelligence communitys assessment said oct 5 interview washington post columnist david ignatius cotton wants boost defense budget saying us military crisis pointed russia china growing threats accused moscow trying divide nato said china seeking control pacific rim decorated army veteran wars iraq afghanistan cot
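The cleaning step above can be reproduced on a single string. The following is a minimal sketch, assuming a tiny hand-written stop-word set in place of NLTK's full English list; `TOY_STOP_WORDS` and `clean_text` are hypothetical names used only for illustration.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list
TOY_STOP_WORDS = {"a", "about", "for", "the", "to", "is", "of"}


def clean_text(raw, stop_words=TOY_STOP_WORDS):
    """Strip punctuation, lowercase, and drop stop words."""
    no_punct = re.sub(r"[^\w\s]", "", raw).lower()
    return " ".join(w for w in no_punct.split() if w not in stop_words)


cleaned = clean_text("Factbox: Five facts about Tom Cotton, Trump's likely pick for CIA")
print(cleaned)  # factbox five facts tom cotton trumps likely pick cia
```

Because punctuation is removed before the stop-word filter runs, contractions lose their apostrophes ("Trump's" becomes "trumps"), so stop-word entries that contain an apostrophe may no longer match.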

In [20]:
# Split into train/test sets, train topic model

X_train, X_test, y_train, y_test = train_test_split(text, labels, test_size = 0.2)

topic_model = BERTopic(language = "english", calculate_probabilities = True)
X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

train_topics, train_probabilities = topic_model.fit_transform(X_train)
test_topics, test_probabilities = topic_model.transform(X_test)

In [21]:
# Look at the topic model
topic_model.visualize_topics()

In [22]:
print(topic_model.get_topic_freq())
print(topic_model.get_topic(0))

     Topic  Count
0       -1   2192
16       0    339
13       1    313
2        2    157
48       3    136
..     ...    ...
146    151     10
97     152     10
81     153     10
156    154     10
45     155     10

[157 rows x 2 columns]
[('tax', 0.05014021376798714), ('percent', 0.016941119810610715), ('reform', 0.01628601804852351), ('rate', 0.015432111534218921), ('corporate', 0.013706270457391798), ('bill', 0.012816609597588647), ('taxes', 0.012418738496464837), ('deduction', 0.011682178001643832), ('plan', 0.011421607272704757), ('income', 0.011090335772713867)]


### Pause Point:
This section is optional. The following code blocks reduce the number of topics to 20, which is a reasonable size for a topic model on a corpus like this one. After the reduction, the topic visualization clearly separates into 2 distinct clusters of topics, and the classifier fits at 100% accuracy, which isn't necessarily ideal. The results reported in Step 2 use the unreduced model (156 topic-probability columns), since the larger number of topics makes me more confident that the model is not overfitting to its training data.

In [11]:
# Reduce topics to 20

topic_model.reduce_topics(X_train, nr_topics = 20)
train_topics, train_probabilities = topic_model.transform(X_train)
test_topics, test_probabilities = topic_model.transform(X_test)

In [12]:
# Look at the topic model (again) - much better now!
topic_model.visualize_topics()

In [13]:
print(topic_model.get_topic_freq())
print(topic_model.get_topic(0))

    Topic  Count
0       0   2580
2      -1   2524
3       1    801
4       2    427
6       3    351
5       4    252
12      5    181
8       6    160
1       7    159
16      8    114
10      9    109
11     10     82
9      11     52
14     12     50
7      13     46
17     14     35
13     15     28
15     16     19
18     17     13
19     18     13
[('trump', 0.0345330709556434), ('said', 0.015980054128805498), ('president', 0.015273194332116708), ('donald', 0.01477044350355264), ('white', 0.013015089677068712), ('russia', 0.011911046659212013), ('election', 0.011874338535344323), ('house', 0.011639145398094032), ('campaign', 0.011392246349070573), ('us', 0.011323488241486034)]


## Step 2: Classification

In [23]:
# Feature matrices are built further down by stacking TF-IDF vectors with topic probabilities

# Initialize vectorizer to transform text into numerical representations
# Using 10k max features and uni/bigrams
vec = TfidfVectorizer(max_features = 10000, ngram_range = (1, 2))

# Vectorize text for classification, inspect shapes
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)
print("X_train_vec shape:", X_train_vec.shape)
print("Shape of training probabilities:", np.array(train_probabilities).shape)

# Horizontally stack the TF-IDF vectors with the topic probabilities
X_train_feats = np.hstack([X_train_vec.toarray(), np.array(train_probabilities)])
X_test_feats = np.hstack([X_test_vec.toarray(), np.array(test_probabilities)])

print("Training model...")
# Train and fit Random Forest Classifier
# Using params: 100 estimators, max depth of 10
clf = RandomForestClassifier(n_estimators = 100, max_depth = 10)
clf.fit(X_train_feats, y_train)
print("Model complete")
y_pred = clf.predict(X_test_feats)

X_train_vec shape: (7996, 10000)
Shape of training probabilities: (7996, 156)
Training model...
Model complete
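
To make the TF-IDF step more concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The `tfidf_rows` helper and the toy documents are hypothetical. The sketch splits on whitespace and uses unigrams only, so it does not reproduce the notebook's tokenizer, bigrams, or 10,000-feature cap.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


toy_docs = ["tax reform bill", "tax image via"]
vocab, rows = tfidf_rows(toy_docs)
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok}: {weight:.3f}")
```

In the first toy document, "tax" receives a lower weight than "bill" because it appears in both documents. In the notebook, rows like these are then widened with the topic-probability columns, giving 10,000 + 156 = 10,156 features per article.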


In [24]:
# Inspect results

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        Fake       0.99      1.00      0.99      1031
        Real       1.00      0.99      0.99       969

    accuracy                           0.99      2000
   macro avg       0.99      0.99      0.99      2000
weighted avg       0.99      0.99      0.99      2000
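
For readers less familiar with the columns in the report, the per-class numbers come from simple counts of true positives, false positives, and false negatives. The sketch below computes them by hand on a made-up five-item example; `per_class_scores` and the toy labels are hypothetical and are not taken from the project's predictions.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true_toy = ["Fake", "Fake", "Real", "Real", "Fake"]
y_pred_toy = ["Fake", "Real", "Real", "Real", "Fake"]
precision, recall, f1 = per_class_scores(y_true_toy, y_pred_toy, "Fake")
print(precision, round(recall, 3), round(f1, 3))
```

Here one "Fake" article is mislabeled as "Real", so precision for "Fake" stays at 1.0 while recall drops to 2/3.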



## Step 3: Analysis

In [26]:
# Cross Validation
cval = cross_val_score(clf, X_train_feats, y_train, cv = 5, scoring = "accuracy")
print("Cross Validation Accuracy: ", cval)
print("Mean Cross Val Accuracy: ", cval.mean())

Cross Validation Accuracy:  [0.989375   0.98936836 0.99499687 0.98686679 0.98561601]
Mean Cross Val Accuracy:  0.9892446060037523
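
The five scores above each come from one held-out fold of the 7,996 training rows. As a rough sketch of how the rows are divided, the snippet below computes fold sizes for plain, unstratified k-fold and re-derives the mean. `fold_sizes` is a hypothetical helper. For classifiers, `cross_val_score` uses a stratified splitter that also balances labels across folds, which this sketch does not reproduce.

```python
def fold_sizes(n_samples, n_folds):
    """Sizes of the held-out folds when rows are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds each take one additional row
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(7996, 5)
print(sizes)  # [1600, 1599, 1599, 1599, 1599]

# Mean of the five fold accuracies printed above
scores = [0.989375, 0.98936836, 0.99499687, 0.98686679, 0.98561601]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 4))  # 0.9892
```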


In [27]:
# Feature importance:

importance = clf.feature_importances_
names = vec.get_feature_names_out()
features_names = list(names) + [f"Topic_{i}" for i in range(train_probabilities.shape[1])]
idxs = np.argsort(importance)[::-1][:10]
top_feats = [(features_names[i], importance[i]) for i in idxs]
print("top features: ", [(feature[0], feature[1]) for feature in top_feats])

top features:  [('reuters', 0.05845737692731941), ('image', 0.047136045024518666), ('image via', 0.03832805931457969), ('president donald', 0.02960309273578443), ('images', 0.029346698540044125), ('washington reuters', 0.02806784865644236), ('via', 0.02709914325400352), ('said', 0.022598611389183024), ('us', 0.020748231705539057), ('washington', 0.01600600087306453)]
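
Several of the top features ("reuters", "image via") look like source markers rather than story content. One quick way to check a token like this is to count how often it appears under each label. The sketch below does that on made-up toy strings; `token_rate_by_label` and the toy data are hypothetical, and on the real data the inputs would be the cleaned `text` and `labels` lists.

```python
def token_rate_by_label(texts, labels, token):
    """Fraction of documents per label that contain `token` as a whole word."""
    totals, hits = {}, {}
    for doc, label in zip(texts, labels):
        totals[label] = totals.get(label, 0) + 1
        if token in doc.split():
            hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


# Toy stand-ins for cleaned articles, not rows from the dataset
toy_texts = [
    "washington reuters senator said tax bill",
    "new york reuters president said",
    "trump image via getty images",
    "obama torches trump featured image",
]
toy_labels = ["Real", "Real", "Fake", "Fake"]
rates = token_rate_by_label(toy_texts, toy_labels, "reuters")
print(rates)
```

A token that appears in nearly every document of one class and almost none of the other lets a classifier separate the classes without reading the rest of the article.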


In [28]:
topic_model.visualize_barchart(top_n_topics = 10)

In [29]:
for idx, row in test_examples.iterrows():
    story = row['full_text']
    print("_____________________________________")
    print("Story: ", story)
    # Apply the same cleaning that was used on the training text
    cleaned = re.sub(r'[^\w\s]', '', story).lower()
    cleaned = " ".join([word for word in cleaned.split() if word not in stop_words])
    # Vectorize the cleaned text so it matches what the vectorizer saw during training
    tfidf = vec.transform([cleaned])
    _, probs = topic_model.transform([cleaned])
    # Stack TF-IDF features with topic probabilities, as in training
    feats = np.hstack([tfidf.toarray(), np.array(probs)])
    predicted_label = clf.predict(feats)
    print("Predicted Label: ", predicted_label)
    print("Actual Label: ", row['Label'])

_____________________________________
Predicted Label:  ['Real']
Actual Label:  Real
_____________________________________
Story:  Republican green groups seek to temper Trump on climate change NEW YORK (Reuters) - President Donald Trump’s outspoken doubts about climate change and his administration’s efforts to roll back regulation to combat it have stirred a sleepy faction in U.S. politics: the Republican environmental movement.  The various groups represent conservatives, Catholics and the younger generation of Republicans who, unlike Trump, not only recognize the science of climate change but want to see their party wrest the initiative from Democrats and lead efforts to combat global warming.  Conservative green groups such as ConservAmerica and republicEn, along with politically neutral religious groups such as Catholic Climate Covenant and bipartisan groups such as the Citizens Climate Lobby, have ramped up efforts to recruit more congressional Republicans to work on addressing 

## Discussion/Conclusion

Using 5,000 examples of real news and 5,000 examples of fake news, I cleaned the text data and used BERTopic to create a topic model. I then vectorized the text using TF-IDF scores, stacked those vectors with the topic probabilities, and fed the result into a random forest classifier with 100 estimators and a maximum depth of 10. The classifier performed very well, with precision, recall, and mean cross-validation accuracy all at approximately 0.99. I was very impressed with how well the random forest performed, especially on vectorized text data. Binary classification definitely made the task easier for the model, and BERT-based embeddings are robust, so it makes sense that the model was able to perform well. That said, the feature importances show that the strongest signals are source markers such as "reuters" and "image via", so part of this performance likely reflects how the two sets of articles were sourced rather than the content of the stories. I'd like to experiment further with more nuanced examples of real versus fake news, as the line between the two isn't always clear. Overall, I am satisfied with how this project turned out. Thanks for reading!