# Regis University MSDS 686 Final Project
# Fake News Classification

This notebook trains a model to classify articles as real or fake news. This issue is more relevant than ever. While internet literacy has always been a problem, there has been a growing number of false stories being presented as truthful online. The goal of this notebook is to train a model that can successfully detect which stories are false.

The notebook is organized into the following sections: 
1. Importing dependencies and data
2. Creating dataset
3. Models
  * Bag of words with TF-IDF
  * LSTM
4. Summary & Conclusion

# Importing Dependencies and Data
## Dependencies

In [None]:
# utilities
import os
import shutil
# data processing
import pandas as pd
import numpy as np
# text processing
import string
import re
# machine learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import models, layers, backend, initializers, Input
from tensorflow.keras.layers import TextVectorization, Bidirectional, LSTM, LayerNormalization, MultiHeadAttention, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import train_test_split
# seed notebook
np.random.seed(1)

## Obtain data from Kaggle

The data used for this task is from the Fake and real news dataset on Kaggle, posted by Clement Bisaillon. The dataset can be found at: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset.

The data contains 4 features: the title, text, date, and subject. The dataset is (nearly) balanced between true and false articles, and the dates of the articles range from March 2015 to February 2018. However, the model will be trained solely on the text of the articles.

In [None]:
# Install kaggle
! pip install kaggle -q

# Upload previously downloaded Kaggle API token
from google.colab import files
files.upload()

# Make directory for kaggle and copy token there, change permissions so owner has read/write access
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

# Check that kaggle is installed correctly
! kaggle datasets list

Saving kaggle.json to kaggle.json
ref                                                             title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
meirnizri/covid19-dataset                                       COVID-19 Dataset                                      5MB  2022-11-13 15:47:17           7647        234  1.0              
madhurpant/world-deaths-and-causes-1990-2019                    World Deaths and Causes (1990 - 2019)               442KB  2022-11-29 07:09:27           1484         36  1.0              
thedevastator/jobs-dataset-from-glassdoor                       Salary Prediction                                     3MB  2022-11-16 13:52:31           4922        111  1.0              
swaptr/fifa-world-cup-2022

In [None]:
# Download data
! kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset

# Unzip dataset
! unzip fake-and-real-news-dataset.zip

Downloading fake-and-real-news-dataset.zip to /content
100% 41.0M/41.0M [00:01<00:00, 40.3MB/s]
100% 41.0M/41.0M [00:01<00:00, 26.8MB/s]
Archive:  fake-and-real-news-dataset.zip
  inflating: Fake.csv                
  inflating: True.csv                


In [None]:
true_stories = pd.read_csv('Fake.csv')
false_stories = pd.read_csv('True.csv')

In [21]:
true_stories.head()

Unnamed: 0,title,text,subject,date,target
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",True
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",True
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",True
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",True
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",True


In [22]:
false_stories.head()

Unnamed: 0,title,text,subject,date,target
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",fake
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",fake
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",fake
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",fake
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",fake


# Creating the Dataset 
While this dataset is not particularly large and could fit in memory, I'll store it in the session memory and pull data from the appropriate directories in batches. Separate directories are created for the training, validation, and testing data, and each directory contains folders for each class (true and false). Approximately 60% of the data is used for training, 20% for validation, and 20% for testing.

In [None]:
# Assign target label
true_stories['target'] = 'true'
false_stories['target'] = 'fake'

# Create train, validation, and test folders
os.makedirs('train')
os.makedirs('val')
os.makedirs('test')

# Create subfolders for true and fake news stories
for category in ['true', 'fake']:
  os.makedirs('train/'+category)
  os.makedirs('val/'+category)
  os.makedirs('test/'+category)

In [None]:
# Use 60% of data for training, 20% for validation, 20% for testing
train_ratio = int(0.6 * len(true_stories))
val_ratio = int(0.2 * len(true_stories))

# Slice data
train_true = true_stories[:train_ratio]
train_false = false_stories[:train_ratio]

val_true = true_stories[train_ratio:train_ratio + val_ratio]
val_false = false_stories[train_ratio:train_ratio + val_ratio]

test_true = true_stories[train_ratio + val_ratio:]
test_false = false_stories[train_ratio + val_ratio:]

# Utility function to place news articles into respective folders as .txt files
def move_articles(data, base_path):
  category = data['target'].iloc[0]
  i = 0
  for article in data['text']:
    with open(base_path + category + '/' + str(i) + '.txt', 'w', encoding='utf-8') as my_article:
      i += 1
      my_article.write(article)

# Move articles to respective folders
move_articles(train_true, 'train/')
move_articles(train_false, 'train/')
move_articles(val_true, 'val/')
move_articles(val_false, 'val/')
move_articles(test_true, 'test/')
move_articles(test_false, 'test/')

In [None]:
# Validate that files are in correct location
batch_size = 32
train_flow = keras.utils.text_dataset_from_directory('train/')
val_flow = keras.utils.text_dataset_from_directory('val/')
test_flow = keras.utils.text_dataset_from_directory('test/')

Found 28176 files belonging to 2 classes.
Found 9392 files belonging to 2 classes.
Found 7330 files belonging to 2 classes.


### Setting a Common-Sense Baseline

In [None]:
num_true = len(true_stories)
num_false = len(false_stories)
num_articles = num_true + num_false

print(f'Percent true: {num_true / num_articles * 100}%')
print(f'Percent false: {num_false / num_articles * 100}%')

Percent true: 52.29854336496058%
Percent false: 47.70145663503943%


With 52% of the stories being true and 48% of the stories being false, the data is nearly evenly split. A common-sense baseline to beat would be an accuracy of 50%. 

The data is now ready to be fed to the models. I'll start with a bag of words approach and implement TF-IDF.

# Model 1: Bag of Words with TF-IDF
This model takes a bag of words approach and implements TF-IDF. Before training, the data is vectorized with a vocabulary size of 20,000 words using the TextVectorization() function in Keras. This Keras implementation allows us to do a few things in one step:
* Typical text pre-processing steps such as removing whitespace and puctuation and converting to lowercase are handled in this step
* Bi-grams are generated from the text
* The text is represented as TF-IDF scores

In [None]:
# Define text vectorizer
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode='tf_idf',
    standardize='lower_and_strip_punctuation',
    split='whitespace')

# Compute vocabulary from train dataset
vocab = train_flow.map(lambda x, y: x)
text_vectorization.adapt(vocab)

# Apply vectorization to text data
tf_idf_bigram_train = train_flow.map(lambda x, y: (text_vectorization(x), y))
tf_idf_bigram_val = val_flow.map(lambda x, y: (text_vectorization(x), y))
tf_idf_bigram_test = test_flow.map(lambda x, y: (text_vectorization(x), y))

# Define model
def tf_idf_bigram(max_tokens=20000):
  backend.clear_session()
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(units=16, activation='relu') (inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation='sigmoid') (x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer='rmsprop',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  
  return model

# Initialize model with callbacks
m1 = tf_idf_bigram()
callbacks = [keras.callbacks.EarlyStopping(monitor= 'val_accuracy',
                                           patience = 3,
                                           restore_best_weights = True)]

# Fit data to model
m1.fit(tf_idf_bigram_train,
       validation_data = tf_idf_bigram_val,
       epochs = 20,
       callbacks = callbacks)

# Evaluate model on test data
m1.evaluate(tf_idf_bigram_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20


[0.22258391976356506, 0.9578444957733154]

# Model 2: LSTM with Pre-Trained GLOVE Embeddings
This model uses a LSTM recurrent neural network. This allows for a better understanding of context in the text and removes unnecessary data via the forget gate. 

This model leverages pre-trained GLOVE word embeddings. This choice was made due to the relatively small size of the dataset--it is unlikely that deriving word embeddings from the training data would provide a higher accuracy than using GLOVE embeddings. The text vectorizer is also re-defined to not use TF-IDF.

In [None]:
# Embeddings manually uploaded to session storage
path_to_glove = 'glove.6B.100d.txt'

# Dictionary to index embeddings
embeddings_index = {}
with open(path_to_glove) as f:
  # store words and array of vector coefficients as key-value pair
  for line in f:
    word, coefs = line.split(maxsplit=1)
    coefs = np.fromstring(coefs, "f", sep=" ")
    embeddings_index[word] = coefs

# Redefine text vectorizer
text_vectorization = TextVectorization(
    max_tokens=20000,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    output_mode='int',
)

# Vectorize text
vocab = train_flow.map(lambda x, y: x)
text_vectorization.adapt(vocab)
vocab = text_vectorization.get_vocabulary()

# Index words and save as dict
word_index = dict(zip(vocab, range(len(vocab))))

# Create embedding matrix with shape (max tokens, embedding dimension)
max_tokens = 20000
embedding_dim = 100
embedding_matrix = np.zeros((max_tokens, embedding_dim))
# Iterate through word index
for word, i in word_index.items():
  # Save the GLOVE embeddings to the embedding matrix for the first 20000 tokens 
  if i < max_tokens:
    embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector

# Apply vectorization to text data
lstm_train = train_flow.map(lambda x, y: (text_vectorization(x), y))
lstm_val = val_flow.map(lambda x, y: (text_vectorization(x), y))
lstm_test = test_flow.map(lambda x, y: (text_vectorization(x), y))

In [12]:
def lstm():
  inputs = keras.Input(shape=(None,), dtype='int64')
  embedded = layers.Embedding(max_tokens,
                              embedding_dim,
                              embeddings_initializer = tf.keras.initializers.Constant(embedding_matrix),
                              trainable=False,
                              mask_zero=True) (inputs)
  x = layers.Bidirectional(LSTM(32)) (embedded)
  x = layers.Dropout(0.5) (x)
  outputs = layers.Dense(1, activation='sigmoid') (x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer='rmsprop',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

m2 = lstm()

callbacks = [keras.callbacks.EarlyStopping(monitor= 'val_accuracy',
                                           patience = 3,
                                           restore_best_weights = True)]

m2.fit(lstm_train, validation_data=lstm_val, epochs=20, callbacks=callbacks)
m2.evaluate(lstm_test)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


[0.008170774206519127, 0.9986357688903809]

In [27]:
m1.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [28]:
m2.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         2000000   
                                                                 
 bidirectional (Bidirectiona  (None, 64)               34048     
 l)                                                              
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,034,113
Trainable params: 34,113
Non-trainable params: 2,000,000
____________________________________________

In [24]:
# Evaluate all models side by side
m1.evaluate(tf_idf_bigram_test)
m2.evaluate(lstm_test)



[0.008170774206519127, 0.9986357688903809]

# Summary & Conclusion
This notebook trained 2 models to distinguish between true and false news stories. The methods used included a bag of words approach utilizing bi-grams and TF-IDF and LSTM with GLOVE embeddings.

Both models performed well, outperforming the common-sense baseline of 50%. The bag of words approach achieved a score of 95.78%, and the LSTM model achieved a score of 99.86%