# Fake vs. Real News Classification using NLP

The goal of this project is to classify news articles as fake or real based on each article's title and text.

The dataset was pulled from Kaggle: https://www.kaggle.com/datasets/jillanisofttech/fake-or-real-news

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

## Reading in the Data

In [2]:
# Pull the data from my GitHub repository
!wget https://github.com/dallastealer/FakeNewsClassifier/blob/main/fake_or_real_news.csv.zip?raw=true

--2022-04-25 19:37:17--  https://github.com/dallastealer/FakeNewsClassifier/blob/main/fake_or_real_news.csv.zip?raw=true
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/dallastealer/FakeNewsClassifier/raw/main/fake_or_real_news.csv.zip [following]
--2022-04-25 19:37:17--  https://github.com/dallastealer/FakeNewsClassifier/raw/main/fake_or_real_news.csv.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dallastealer/FakeNewsClassifier/main/fake_or_real_news.csv.zip [following]
--2022-04-25 19:37:17--  https://raw.githubusercontent.com/dallastealer/FakeNewsClassifier/main/fake_or_real_news.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubuserco

In [3]:
from zipfile import ZipFile
# Use ZipFile to extract the contents of the downloaded zip
# (wget kept the "?raw=true" query string in the file name)
with ZipFile("fake_or_real_news.csv.zip?raw=true") as file:
  file.extractall()

## View Data

In [4]:
df = pd.read_csv("fake_or_real_news.csv")
df.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [5]:
# View a sample text
df["text"].iloc[0]



In [6]:
# Check average length of title in words

title_lengths = [len(title.split()) for title in df["title"]]
np.mean(title_lengths)

10.496448303078138

In [7]:
# Check average length of text in words
text_lengths = [len(text.split()) for text in df["text"]]
np.mean(text_lengths)

776.3007103393844

## Prepare Data For Model

In [8]:
# Prepend the title to each article and lowercase the result

X = (df["title"] + ": " + df["text"]).str.lower().tolist()
y = df["label"].tolist()

X[0], y[:5]

 ['FAKE', 'FAKE', 'REAL', 'FAKE', 'REAL'])

In [9]:
# Check what text length covers 95% of the data

lengths = [len(article.split()) for article in X]
coverage_length = np.percentile(lengths, 95)
print(f"95% of the articles are {int(coverage_length)} words in length or less")

95% of the articles are 2036 words in length or less


## Splitting Data into Training and Testing Sets

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
len(X_train), len(y_train), len(X_test), len(y_test)

(5384, 5384, 951, 951)
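
The sizes printed above are consistent with how `train_test_split` appears to size a fractional test set: the test share is rounded up and the remainder goes to training. A minimal arithmetic sketch, assuming the total of 6,335 articles implied by the two printed sizes:

```python
import math

n_samples = 5384 + 951   # total articles, inferred from the sizes printed above
test_size = 0.15

n_test = math.ceil(test_size * n_samples)   # fractional test share is rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)
```

Because no `random_state` is passed in the cell above, a rerun will likely put different articles in each split, so the accuracies reported later may shift slightly.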

In [11]:
# Turn labels into 1 and 0
y_labels = {"FAKE": 0, "REAL": 1}
y_train = [y_labels[label] for label in y_train]
y_test = [y_labels[label] for label in y_test]
y_train[:5], y_test[:5]

([1, 1, 0, 0, 1], [1, 1, 0, 1, 0])

## Text Vectorizer

In [12]:
from tensorflow.keras.layers import TextVectorization

In [13]:
vectorizer = TextVectorization(max_tokens=60000, #Vocabulary size, may need to increase
                               output_mode="int",
                               output_sequence_length=int(coverage_length))

In [14]:
# Fit vectorizer to training data
vectorizer.adapt(X_train)

In [15]:
# Test that vectorizer is working

print(f"10 most common words: {vectorizer.get_vocabulary()[:10]}")
print(f"10 least common words: {vectorizer.get_vocabulary()[-10:]}")
print(f"Vocabulary size: {len(vectorizer.get_vocabulary())}")

10 most common words: ['', '[UNK]', 'the', 'to', 'of', 'and', 'a', 'in', 'that', 'is']
10 least common words: ['‘beyonce', '‘better’', '‘bernie', '‘beneficiaries’', '‘beneath', '‘beloved’', '‘belgium', '‘behind', '‘becoming', '‘be']
Vocabulary size: 60000
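
The vocabulary printed above starts with `''` (padding) and `'[UNK]'` (out-of-vocabulary words), followed by tokens in descending order of frequency. The sketch below imitates that idea in plain Python. It is not the `TextVectorization` implementation: the helper names `build_vocabulary` and `vectorize` are hypothetical, and it only splits on whitespace, whereas the real layer also strips punctuation by default.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Toy vocabulary: index 0 is padding, index 1 is unknown, then tokens by frequency."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integer ids, then truncate or zero-pad to a fixed length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_texts = ["the news is fake", "the news is real", "the senate voted"]
vocab = build_vocabulary(toy_texts, max_tokens=6)
print(vocab)
print(vectorize("the senate is unknown", vocab, sequence_length=6))
```

With `max_tokens=6` only four real words fit, so rarer words such as "senate" fall back to id 1. The same thing happens in this notebook to any word outside the 60,000 most frequent tokens.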


## Custom Embedding Layer

In [16]:
from tensorflow.keras.layers import Embedding

embedding = Embedding(input_dim=60000, #Should match vocab size
                      output_dim=128,
                      input_length=int(coverage_length))

In [17]:
# Test Embedding
embedding(vectorizer(X_train[0]))

<tf.Tensor: shape=(2036, 128), dtype=float32, numpy=
array([[-0.02320634, -0.03135882, -0.03625693, ..., -0.01390151,
         0.02480343,  0.00113877],
       [ 0.02084053, -0.04840516, -0.00087418, ...,  0.01350318,
         0.04633057, -0.00526023],
       [ 0.04451228,  0.02061618, -0.01488979, ...,  0.00531194,
         0.04571616, -0.0433359 ],
       ...,
       [ 0.00754567, -0.0466529 , -0.02295011, ..., -0.03327689,
        -0.01854169,  0.04939835],
       [ 0.00754567, -0.0466529 , -0.02295011, ..., -0.03327689,
        -0.01854169,  0.04939835],
       [ 0.00754567, -0.0466529 , -0.02295011, ..., -0.03327689,
        -0.01854169,  0.04939835]], dtype=float32)>
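
The last rows of the tensor above are identical. That is what I would expect if the trailing positions are all the padding token (id 0), since every occurrence of a token id looks up the same row of the embedding matrix. A toy sketch of that lookup, using plain lists instead of tensors and tiny, made-up sizes (`embedding_table` and `embed` are hypothetical names):

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4

# One randomly initialised row per token id, standing in for the layer's weight matrix
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one row of the table per token id."""
    return [embedding_table[i] for i in token_ids]

padded_ids = [2, 5, 3, 0, 0, 0]   # a short article, zero-padded to length 6
vectors = embed(padded_ids)

print(len(vectors), len(vectors[0]))            # sequence length, embedding dimension
print(vectors[3] == vectors[4] == vectors[5])   # padding positions share one row
```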

## Create Baseline with Sklearn TF-IDF and Naive Bayes

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([
  ("tfidf", TfidfVectorizer()), # Words to numbers
  ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to the training data
model_0.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [19]:
# Evaluate Baseline Accuracy
baseline_accuracy = model_0.score(X_test, y_test)
print(f"Baseline model performed with accuracy: {baseline_accuracy.round(2)}")

Baseline model performed with accuracy: 0.8
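
The first step of the baseline pipeline weights each word by TF-IDF. The sketch below works through the smoothed formula that, to my knowledge, `TfidfVectorizer` uses by default: `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by scaling each document vector to unit length. The three-document corpus is invented for illustration.

```python
import math
from collections import Counter

docs = [
    "obama delays election".split(),
    "senate passes election bill".split(),
    "election results announced".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """Raw term count times idf, scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {weight:.3f}")
```

"election" appears in every toy document, so it receives the lowest weight. Words that are common across the whole corpus tell a classifier little about whether an article is fake or real.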


## Turn Data into a tf.data Dataset

In [20]:
AUTOTUNE = tf.data.AUTOTUNE # Public alias; avoids importing from TensorFlow's private modules
training_data = tf.data.Dataset.from_tensor_slices((X_train, y_train))
testing_data = tf.data.Dataset.from_tensor_slices((X_test, y_test))

# Turn data into batches and use prefetching to speed up training process
training_data = training_data.batch(32).prefetch(AUTOTUNE)
testing_data = testing_data.batch(32).prefetch(AUTOTUNE)

## First Model: Simple Dense Model

In [21]:
from tensorflow.keras import layers

inputs = layers.Input(shape=(1,), dtype="string") # Input layer

x = vectorizer(inputs) # Turn words into numbers
x = embedding(x) # Turn each vectorized word into a vector of length 128
x = layers.Dense(64, activation="relu")(x) # Dense Layer
x = layers.GlobalAveragePooling1D()(x) # Average pooling to reduce dimensionality
outputs = layers.Dense(1, activation="sigmoid")(x) # Output layer

model_1 = tf.keras.Model(inputs, outputs)
model_1.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 2036)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 2036, 128)         7680000   
                                                                 
 dense (Dense)               (None, 2036, 64)          8256      
                                                                 
 global_average_pooling1d (G  (None, 64)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense_1 (Dense)             (None, 1)                 65    
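
The parameter counts in the summary above can be reproduced by hand, which is a quick sanity check that the layers are wired as intended. A small arithmetic sketch using the sizes from this notebook:

```python
vocab_size, embedding_dim = 60000, 128   # values passed to Embedding in this notebook
hidden_units = 64

embedding_params = vocab_size * embedding_dim                # one vector per token id
dense_params = embedding_dim * hidden_units + hidden_units   # weights plus biases
output_params = hidden_units * 1 + 1                         # single sigmoid unit

print(embedding_params, dense_params, output_params)
```

Almost all of the model's weights sit in the embedding layer. The vectorizer and the pooling layer contribute none.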

In [22]:
# Compile first model
model_1.compile(loss="binary_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [23]:
# Fit first model
history_1 = model_1.fit(training_data,
                        epochs=5,
                        steps_per_epoch=len(training_data),
                        validation_data=testing_data,
                        validation_steps=int(0.5 * len(testing_data)))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
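
For reference, here is the arithmetic behind `steps_per_epoch` and `validation_steps` in the `fit` call above, assuming the batch size of 32 and the split sizes printed earlier:

```python
import math

batch_size = 32
n_train, n_test = 5384, 951   # split sizes printed earlier in this notebook

train_batches = math.ceil(n_train / batch_size)   # what len(training_data) reports
test_batches = math.ceil(n_test / batch_size)     # what len(testing_data) reports
validation_steps = int(0.5 * test_batches)        # batches evaluated after each epoch

print(train_batches, test_batches, validation_steps)
```

So the per-epoch validation accuracy is computed on roughly half of the test batches, while `model.evaluate(testing_data)` uses all of them.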


In [26]:
model_1_accuracy = model_1.evaluate(testing_data)



In [32]:
print(f"Baseline accuracy: {(baseline_accuracy * 100).round(2)}%")
print(f"Model 1 accuracy: {np.round(model_1_accuracy[1] * 100, 2)}%")

Baseline accuracy: 80.34%
Model 1 accuracy: 91.9%


## Sample Predictions from Model 1

In [44]:
labels_from_preds = {0: "FAKE", 1: "REAL"}

In [56]:
article_to_test = np.random.randint(0, len(X_test))
print(f"Article:\n {X_test[article_to_test][:200]}...")
pred_prob = model_1.predict([X_test[article_to_test]])
pred_prob_rounded = int(np.round(pred_prob).squeeze())
print(f"Model prediction: {labels_from_preds[pred_prob_rounded]}")
print(f"Actual Label: {labels_from_preds[y_test[article_to_test]]}")


Article:
 will barack obama delay or suspend the election if hillary is forced out by the new fbi email investigation?: in: government , government corruption , obama exposed , sleuth journal just when it looke...
Model prediction: FAKE
Actual Label: FAKE
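
The prediction cell above turns the sigmoid output into a class by rounding. `np.round` rounds halves to the nearest even number, so a probability of exactly 0.5 maps to 0 (FAKE). An explicit threshold makes that decision rule visible and easy to change. A minimal sketch, where `label_from_probability` is a hypothetical helper and not part of the notebook:

```python
labels_from_preds = {0: "FAKE", 1: "REAL"}

def label_from_probability(probability, threshold=0.5):
    """Return the label for a sigmoid output; values above the threshold count as REAL."""
    return labels_from_preds[int(probability > threshold)]

for p in (0.03, 0.5, 0.97):
    print(p, label_from_probability(p))
```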


## Model 2: LSTM

In [78]:
inputs = layers.Input(shape=(1,), dtype="string")
x = vectorizer(inputs)
x = embedding(x)

x = layers.LSTM(128)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model_2 = tf.keras.Model(inputs, outputs)
model_2.summary()

Model: "model_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_9 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 2036)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 2036, 128)         7680000   
                                                                 
 lstm_9 (LSTM)               (None, 2036, 128)         131584    
                                                                 
 lstm_10 (LSTM)              (None, 2036, 64)          49408     
                                                                 
 lstm_11 (LSTM)              (None, 16)                5184      
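
Note that the summary above lists three LSTM layers (128, 64, and 16 units), whereas the cell defines a single `LSTM(128)`, so the printed output probably comes from a different run of the cell. The parameter counts are consistent with the usual LSTM formula, `4 * (units * (input_dim + units) + units)`, as this small check shows:

```python
def lstm_params(input_dim, units):
    """Four weight sets (three gates plus the candidate state), each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

embedding_dim = 128
stack = [128, 64, 16]   # units per LSTM layer, read off the summary above

input_dim = embedding_dim
for units in stack:
    print(units, lstm_params(input_dim, units))
    input_dim = units   # the next layer consumes this layer's output
```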
                                                           

In [79]:
model_2.compile(loss="binary_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [80]:
history_2 = model_2.fit(training_data,
                        epochs=5,
                        validation_data=testing_data,
                        validation_steps=len(testing_data))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
model_2_accuracy = model_2.evaluate(testing_data)
print(f"Baseline accuracy: {(baseline_accuracy * 100).round(2)}%")
print(f"Model 1 accuracy: {np.round(model_1_accuracy[1] * 100, 2)}%")
print(f"Model 2 accuracy: {np.round(model_2_accuracy[1] * 100, 2)}%")

## Model 3: Universal Sentence Encoder from TensorFlow Hub

In [83]:
import tensorflow_hub as hub

In [91]:
universal_sentence_encoder = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4", #Link from tensorflow hub
                                            trainable=False,
                                            input_shape=[],
                                            dtype=tf.string)

In [94]:
model_3 = tf.keras.Sequential([
  universal_sentence_encoder, #Sentence encoder gets rid of need for vectorizer and embeddings
  layers.Dense(64, activation="relu"),
  layers.Dense(1, activation="sigmoid")
])

model_3.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_3 (KerasLayer)  (None, 512)               256797824 
                                                                 
 dense_20 (Dense)            (None, 64)                32832     
                                                                 
 dense_21 (Dense)            (None, 1)                 65        
                                                                 
Total params: 256,830,721
Trainable params: 32,897
Non-trainable params: 256,797,824
_________________________________________________________________


In [95]:
model_3.compile(loss="binary_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [96]:
history_3 = model_3.fit(training_data,
                        epochs=5,
                        steps_per_epoch=len(training_data),
                        validation_data=testing_data,
                        validation_steps=int(0.15* len(testing_data)))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [97]:
model_3_accuracy = model_3.evaluate(testing_data)
print(f"Baseline accuracy: {(baseline_accuracy * 100).round(2)}%")
print(f"Model 1 accuracy: {np.round(model_1_accuracy[1] * 100, 2)}%")
print(f"Model 3 accuracy: {np.round(model_3_accuracy[1] * 100, 2)}%")

Baseline accuracy: 80.34%
Model 1 accuracy: 91.9%
Model 3 accuracy: 87.7%


## Model 4: Same as model 3 but double the epochs

In [98]:
model_4 = tf.keras.Sequential([
  universal_sentence_encoder,
  layers.Dense(64, activation="relu"),
  layers.Dense(1, activation="sigmoid")
])

model_4.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_3 (KerasLayer)  (None, 512)               256797824 
                                                                 
 dense_22 (Dense)            (None, 64)                32832     
                                                                 
 dense_23 (Dense)            (None, 1)                 65        
                                                                 
Total params: 256,830,721
Trainable params: 32,897
Non-trainable params: 256,797,824
_________________________________________________________________


In [99]:
model_4.compile(loss="binary_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [100]:
history_4 = model_4.fit(training_data,
                        epochs=10,
                        steps_per_epoch=len(training_data),
                        validation_data=testing_data,
                        validation_steps=int(0.15*len(testing_data)))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [102]:
model_4_accuracy = model_4.evaluate(testing_data)



In [103]:
print(f"Baseline accuracy: {(baseline_accuracy * 100).round(2)}%")
print(f"Model 1 accuracy: {np.round(model_1_accuracy[1] * 100, 2)}%")
print(f"Model 3 accuracy: {np.round(model_3_accuracy[1] * 100, 2)}%")
print(f"Model 4 accuracy: {np.round(model_4_accuracy[1] * 100, 2)}%")

Baseline accuracy: 80.34%
Model 1 accuracy: 91.9%
Model 3 accuracy: 87.7%
Model 4 accuracy: 89.06%
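
The accuracies printed above are easier to compare side by side when gathered into one structure. A small sketch using only the numbers reported in this notebook (the descriptive model names are my own shorthand):

```python
results = {
    "Baseline (TF-IDF + Naive Bayes)": 80.34,
    "Model 1 (dense, custom embedding)": 91.9,
    "Model 3 (USE, 5 epochs)": 87.7,
    "Model 4 (USE, 10 epochs)": 89.06,
}

baseline = results["Baseline (TF-IDF + Naive Bayes)"]

# Print the models from best to worst, with the gain over the baseline
for name, accuracy in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:36s} {accuracy:6.2f}%  ({accuracy - baseline:+.2f} vs. baseline)")

best_model = max(results, key=results.get)
```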


## Model 5: First model but twice as many epochs

In [104]:
inputs = layers.Input(shape=(1,), dtype="string")

x = vectorizer(inputs)
x = embedding(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model_5 = tf.keras.Model(inputs, outputs)
model_5.summary()

Model: "model_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_13 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 2036)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 2036, 128)         7680000   
                                                                 
 dense_24 (Dense)            (None, 2036, 64)          8256      
                                                                 
 global_average_pooling1d_1   (None, 64)               0         
 (GlobalAveragePooling1D)                                        
                                                                 
 dense_25 (Dense)            (None, 1)                 65  
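
One caveat: the summary above lists the layer as `embedding`, the same instance used by the earlier models. Model 5 therefore starts from embedding weights that earlier training runs have already updated, not from a fresh initialisation, which makes it hard to attribute any change to the extra epochs alone. The toy sketch below shows the effect with plain Python objects. `ToyEmbedding` and `ToyModel` are hypothetical stand-ins. In Keras, the analogous fix would be to construct a new `Embedding(...)` layer for each model.

```python
class ToyEmbedding:
    """Stand-in for a trainable layer: just a list of weights."""
    def __init__(self, size):
        self.weights = [0.0] * size

class ToyModel:
    """Stand-in for a model that holds a reference to an embedding layer."""
    def __init__(self, embedding):
        self.embedding = embedding

    def train_step(self):
        # Pretend one training step nudges every weight
        self.embedding.weights = [w + 0.1 for w in self.embedding.weights]

shared = ToyEmbedding(3)
model_a = ToyModel(shared)
model_b = ToyModel(shared)            # reuses the same layer object
model_c = ToyModel(ToyEmbedding(3))   # gets its own fresh layer

model_a.train_step()
print(model_b.embedding.weights)   # already changed by model_a's training
print(model_c.embedding.weights)   # untouched
```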

In [105]:
model_5.compile(loss="binary_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [106]:
history_5 = model_5.fit(training_data,
                        epochs=10,
                        steps_per_epoch=len(training_data),
                        validation_data=testing_data,
                        validation_steps=int(0.15*len(testing_data)))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [107]:
model_5_accuracy = model_5.evaluate(testing_data)



In [108]:
#model_5.save("PUT SAVE PATH HERE")

INFO:tensorflow:Assets written to: drive/MyDrive/Models/dense_with_custom_embeddings_10epochs/assets



## Model 6: Testing whether the same (or better) results can be achieved with an even smaller model

In [189]:
inputs = layers.Input(shape=(1,), dtype="string")

x = vectorizer(inputs)
x = embedding(x)
x = layers.Dense(16, activation="relu")(x)
x = layers.Dense(8, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model_6 = tf.keras.Model(inputs, outputs)
model_6.summary()

Model: "model_34"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_38 (InputLayer)       [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, 2036)             0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 2036, 128)         7680000   
                                                                 
 dense_81 (Dense)            (None, 2036, 16)          2064      
                                                                 
 dense_82 (Dense)            (None, 2036, 8)           136       
                                                                 
 global_average_pooling1d_17  (None, 8)                0         
  (GlobalAveragePooling1D)                                

In [190]:
model_6.compile(loss="binary_crossentropy",
                optimizer="adam",
                metrics=["accuracy"])

In [191]:
history_6 = model_6.fit(training_data,
                        epochs=8,
                        steps_per_epoch=len(training_data),
                        validation_data=testing_data,
                        validation_steps=len(testing_data))

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [None]:
model_6_accuracy = model_6.evaluate(testing_data)
print(f"Baseline accuracy: {(baseline_accuracy * 100).round(2)}%")
print(f"Model 1 accuracy: {np.round(model_1_accuracy[1] * 100, 2)}%")
print(f"Model 3 accuracy: {np.round(model_3_accuracy[1] * 100, 2)}%")
print(f"Model 6 accuracy: {np.round(model_6_accuracy[1] * 100, 2)}%")

Overall, the simple dense model appears to outperform the LSTM-based models. Additionally, the custom embedding layer produced better results than the pretrained Universal Sentence Encoder: Model 1 reached 91.9% test accuracy, compared with 87.7% for Model 3 and 89.06% for Model 4.

In [193]:
#model_6.save("PUT SAVE PATH HERE")

INFO:tensorflow:Assets written to: drive/MyDrive/Models/final_fake_news_model/assets



## Writing a function to see the model's predictions

In [216]:
def predict_news(text, title=None, model=model_6):
  """
  Predict whether the passed text is fake or real news.

  Parameters:
  text: contents of the article
  title (optional): title of the article; if passed, it is prepended to the text
  model (optional): the model used to make predictions. Default is model_6

  Returns:
  A tuple whose first element is the predicted label and whose second element is the prediction probability
  """
  text = text.lower()
  labels = {0: "FAKE", 1: "REAL"}
  if title:
    title = title.lower()
    text = title + ": " + text
  prediction_probability = model.predict([text]).squeeze()
  prediction_rounded = int(np.round(prediction_probability)) # Cast to int so it matches the dict keys
  prediction_label = labels[prediction_rounded]
  return (prediction_label, np.round(prediction_probability, 4))

In [219]:
test_text = """ Try it out! Paste the text from an article you want to test here
"""
test_title = "Paste the title of the article you want to test here"

In [221]:
# If you've pasted in your article and title, uncomment the line below to see the prediction!

#predict_news(test_text, title=test_title)