# Classifying Fake News Articles

**Table of Contents:**

- Introduction
- Traditional feature engineering
    - Bag of Words model
    - Bag of N-Grams model
    - TF-IDF model
- Document embeddings from pre-trained word embeddings
    - Pre-trained Word2Vec
    - Pre-trained FastText
- Document embeddings from self-trained word embeddings
    - Word2Vec
    - FastText
- Document embeddings from BERT sentence embeddings
- Conclusion

## Introduction

In this project I am going to classify news articles as either fake or real, based on the linguistic features of the text. To classify these articles, I will use different feature engineering techniques, such as the Bag of Words model, the TF-IDF model, and document embeddings derived from several word embedding techniques and from the BERT sentence embedding model.

This project's main goal is to compare several feature engineering methods. Therefore, I chose to use only one machine learning algorithm to classify the news articles, namely logistic regression. I tried some others as well, but logistic regression performs well and works without hyperparameter tuning (although the regularization strength could be treated as a hyperparameter).

The data can be downloaded from [here](https://drive.google.com/file/d/1er9NJTLUA3qnRuyhfzuN0XUsoIC4a-_q/view). Let's get started!

In [1]:
import os
os.chdir("D:/Projects/fake news")

In [2]:
# import dependencies
import pandas as pd
import numpy as np
import spacy
import re
import nltk

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV

from gensim.models import word2vec
from gensim.models import FastText
from gensim.models import KeyedVectors

import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow as tf

In [3]:
# import the dataset
data = pd.read_csv("news.csv")

In [4]:
data.head(5)

|  | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [5]:
# data.shape[0] is the number of rows (articles), not columns
print("The dataset contains {} rows".format(data.shape[0]))

The dataset contains 6335 rows


The dataset consists of 6335 news articles. However, some news articles are empty, so let's delete those.

In [6]:
# remove empty articles
data = data[data["text"] != " "]

In [7]:
print("The dataset now consists of {} rows".format(data.shape[0]))

The dataset now consists of 6299 rows


The dataset now consists of 6299 articles. However, there are some duplicate articles in the dataset. Let's delete those duplicates (keep the first article, drop the others).

In [8]:
# remove duplicates
data = data.drop_duplicates(subset=["text"], keep="first")

In [9]:
print("The dataset now consists of {} rows".format(data.shape[0]))

The dataset now consists of 6059 rows


In [10]:
# number of fake and real articles
data["label"].value_counts()

FAKE    3070
REAL    2989
Name: label, dtype: int64

As you can see, the dataset is balanced: real and fake articles each make up about 50 percent of the data. This is desirable for training, and it also means that accuracy is a suitable performance metric. Another reason to use accuracy is that misclassifying a fake article is neither more nor less costly than misclassifying a real one. That is, our goal is simply to predict as accurately as possible.
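To put the later accuracy scores in perspective, the following is a small sketch of the majority-class baseline, using the label counts printed above. It is computed on the full dataset, so the baseline on the actual test split may differ slightly.

```python
from collections import Counter

# label counts as reported by value_counts() above
label_counts = Counter({"FAKE": 3070, "REAL": 2989})

# a naive model always predicts the most frequent label
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())

print(majority_label, round(baseline_accuracy, 4))
```

Any feature engineering method should clearly beat this value of roughly 0.51 to be considered useful.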

In [11]:
# first news article
data["text"][0]



In [12]:
# We only need the text column for generating the features (X), and we need the label column for our target variable (y)
y = data["label"]
X = data["text"]

In [13]:
# Split the dataset into a training set and a test set. 
# We will use the test set to evaluate the performance of the feature engineering techniques. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
train_articles_list = X_train.to_list()
test_articles_list = X_test.to_list()

For most of the feature engineering methods, we are going to use normalized text data. For this we need to take several pre-processing steps. These pre-processing steps include:
- Transform the token to the lemma of the word
- Get rid of stop words
- Get rid of punctuation marks
- Get rid of digits
- Everything to lowercase

In [15]:
# Load a spacy nlp model for text pre-processing
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"])

In [16]:
# define a normalize function for pre-processing the texts
def normalize(article):
    article = nlp(article)
    # transform to lemma and get rid of stopwords and punctuation marks
    article = [token.lemma_ for token in article if not (token.is_stop or token.is_punct)]
    # get rid of white spaces
    article = [token for token in article if not token.strip() == ""]
    # get rid of numbers
    article = [token for token in article if not re.search("[0-9]", token)]
    # everything to lowercase
    article = [token.lower() for token in article]
    article = " ".join(article)
    return article

In [17]:
# pre-processing the texts
train_articles_normalized = [normalize(article) for article in train_articles_list]
test_articles_normalized = [normalize(article) for article in test_articles_list]

In [18]:
# normalized first news article in training set
train_articles_normalized[0]

"des moines iowa cnn doubt donald trump pull major counter program feat compete gop debate expect draw million viewer thursday night dazzle crowd hundred enthusiastic supporter announce raise $ million veteran day $ million checkbook love vet say know theme america great go vet trump say trump bite surprise pull stunt look camera like academy awards real estate magnate say take stage auditorium drake university minute debate begin mile away actually tell camera bite know honor vet rally restrain performance trump standard dispense usual riff poll number avoid jab fellow candidate exception low energy shoot jeb bush instead deliver speech focus problem veteran face return iraq afghanistan inadequate healthcare house drug abuse mental health issue homelessness vet mistreat illegal immigrant treat well case vet go happen go happen clearly enjoy even away debate trump tell audience medium sensation campaign fact daughter ivanka pregnant ivanka say great baby iowa great definitely win somew

In [19]:
train_articles_normalized = np.array(train_articles_normalized, dtype="object")
test_articles_normalized = np.array(test_articles_normalized, dtype="object")

## Traditional feature engineering

In this section, I am going to use several traditional feature engineering methods to distinguish fake news articles from real ones. These methods are all bag of words models, which literally means that each article is represented as a bag of words, discarding the word order. These models lead to sparse article vectors: the size of each vector equals the size of the vocabulary, while a single article contains only a small fraction of those words. As simple as this approach might be, the resulting prediction accuracy is very good, as we will see!

### Bag of Words model

First, the standard Bag of Words model will be used. For each article, this method will create a numerical vector, where the size of the vector is equal to the size of the vocabulary, and where each element represents the count of a word that is in the article.
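As a rough illustration of the idea, here is a minimal pure-Python sketch of a Bag of Words representation. The two toy articles and the helper `bag_of_words` are made up for this example; `CountVectorizer` does the same job with a sparse matrix and its own tokenizer.

```python
from collections import Counter

# two toy "articles", already normalized (hypothetical examples)
docs = ["trump win debate", "debate debate poll"]

# the vocabulary is the sorted set of all words in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return the count of each vocabulary word in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each vector has one entry per vocabulary word, and most entries are zero once the vocabulary grows, which is why the real vectors are stored as sparse matrices.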

In [20]:
# instantiate the count vectorizer
count_vectorizer = CountVectorizer()

In [21]:
# fit the count vectorizer on the normalized training articles and transform
cv_train = count_vectorizer.fit_transform(train_articles_normalized)
# transform on the normalized test articles
cv_test = count_vectorizer.transform(test_articles_normalized)

In [22]:
# instantiate logistic regression model and fit on the training data
logreg_cv = LogisticRegression(max_iter=300)
logreg_cv.fit(cv_train, y_train)
# Predict on test set
predictions_logreg_cv = logreg_cv.predict(cv_test)
print("The accuracy on the test set is {}".format(accuracy_score(predictions_logreg_cv, y_test)))

The accuracy on the test set is 0.905940594059406


As you can see, the accuracy on the unseen data (the test set) is about 90.6 percent. This is very high, as a naive prediction model would have an accuracy of about 50 percent. One explanation for why this simple model seems to work very well is that there are certain words that are more used in fake articles than in real articles, and vice versa. 

Let's see which words are most associated with fake articles, and which words are most associated with real articles. For this we look at the coefficients of the logistic regression model and exponentiate them to obtain odds ratios.

In [23]:
vocabulary = count_vectorizer.get_feature_names()
cv_odds = pd.DataFrame({"words":vocabulary, "odds": np.exp(logreg_cv.coef_[0])})

In [24]:
# the 10 words with the highest odds ratios
# (coef_ refers to logreg_cv.classes_[1], which is "REAL" because labels are sorted alphabetically)
cv_odds.sort_values(by="odds", ascending=False).head(10)

|  | words | odds |
|---|---|---|
| 14378 | executive | 2.202659 |
| 37731 | saturday | 2.047612 |
| 8936 | convention | 1.977184 |
| 38473 | sen | 1.972079 |
| 34739 | race | 1.963441 |
| 16320 | friday | 1.954766 |
| 43810 | transition | 1.941896 |
| 8699 | conservative | 1.939244 |
| 16114 | fox | 1.931455 |
| 26406 | marriage | 1.892626 |


In [25]:
# the 10 words with the lowest odds ratios
# (values below 1 point away from classes_[1] = "REAL", i.e. towards "FAKE")
cv_odds.sort_values(by="odds", ascending=True).head(10)

|  | words | odds |
|---|---|---|
| 30365 | october | 0.191716 |
| 29984 | november | 0.36686 |
| 38834 | share | 0.417894 |
| 14038 | establishment | 0.442641 |
| 2431 | article | 0.448705 |
| 40244 | source | 0.452321 |
| 42010 | swipe | 0.486274 |
| 33772 | print | 0.490359 |
| 33304 | posted | 0.500167 |
| 30361 | oct | 0.529232 |


As you can see, some words are much more strongly associated with one class than others. Note that scikit-learn sorts the class labels alphabetically, so in this binary model the coefficients refer to the class `REAL` (the second entry of `logreg_cv.classes_`). The word "executive" is therefore the word most associated with real news: for every additional occurrence of "executive" in an article, the odds that the article is real are multiplied by about 2.2, holding all other word counts constant. Conversely, words such as "october", "share", and "posted" have odds ratios well below 1 and point towards fake news. Although this supplementary analysis is interesting, it is probably the combination of several words in an article that determines whether the article is fake news or real news.
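To make the odds interpretation concrete, the sketch below starts from the reported odds ratio of 2.202659 for "executive" and shows how the odds change with the word count. The baseline odds of 1.0 is an assumed value for illustration only. In a binary scikit-learn logistic regression, these odds are those of the class stored in the second entry of `classes_`.

```python
import math

odds_ratio = 2.202659          # reported for "executive" above
coef = math.log(odds_ratio)    # the underlying logistic regression coefficient

baseline_odds = 1.0            # assumed odds for an article without the word
for count in range(4):
    # each extra occurrence adds `coef` to the log-odds,
    # i.e. multiplies the odds by the odds ratio
    odds = baseline_odds * math.exp(coef * count)
    probability = odds / (1 + odds)
    print(count, round(odds, 3), round(probability, 3))
```

The effect is multiplicative in the odds, not additive in the probability, which is why a few occurrences of a strongly weighted word can shift the prediction considerably.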

### Bag of N-Grams model

Let's see whether we can improve on the standard Bag of Words model, by including bi-grams in the vocabulary.
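As an illustration of what including bi-grams means, here is a minimal sketch with a hypothetical helper `ngrams` and a made-up token list. `CountVectorizer(ngram_range=(1, 2))` builds the same kind of features internally.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "donald trump win debate".split()

# unigrams and bi-grams together form the feature set
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
```

Bi-grams such as "donald trump" keep a little of the word order that a plain Bag of Words model throws away, at the cost of a much larger vocabulary.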

In [26]:
# instantiate the count vectorizer with unigrams and bi-grams
count_vectorizer2 = CountVectorizer(ngram_range=(1,2))

In [27]:
# fit the count vectorizer on the normalized training articles and transform
cv2_train = count_vectorizer2.fit_transform(train_articles_normalized)
# transform on the normalized test articles
cv2_test = count_vectorizer2.transform(test_articles_normalized)

In [28]:
# instantiate logistic regression model and fit on the training data
logreg_cv2 = LogisticRegression(max_iter=300)
logreg_cv2.fit(cv2_train, y_train)
# Predict on test set
predictions_logreg_cv2 = logreg_cv2.predict(cv2_test)
print("The accuracy on the test set is {}".format(accuracy_score(predictions_logreg_cv2, y_test)))

The accuracy on the test set is 0.9183168316831684


Including bi-grams in the standard Bag of Words model slightly improves the performance.

### TF-IDF model

Simply counting the words, as in the standard Bag of Words model, might not be the best way to go. It could be better to use TF-IDF (term frequency - inverse document frequency), which down-weights words that occur in many articles. Let's see how this method performs.
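To show how the weighting works, here is a minimal sketch of TF-IDF on a made-up toy corpus. It assumes the smoothed formula that scikit-learn uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; the helper names are hypothetical. The `max_df=0.6` setting used below additionally drops words that occur in more than 60 percent of the documents.

```python
import math
from collections import Counter

# toy corpus (hypothetical, already normalized)
docs = ["trump win debate", "debate debate poll", "poll show trump lead"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(tokens):
    """Term count times idf, scaled to unit L2 norm."""
    counts = Counter(tokens)
    weights = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first toy document, "win" receives a higher weight than "trump" or "debate" because it appears in only one document.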

In [29]:
# instantiate the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.6)

In [30]:
tfidf_train = tfidf_vectorizer.fit_transform(train_articles_normalized)
tfidf_test = tfidf_vectorizer.transform(test_articles_normalized)

In [31]:
logreg_tfidf = LogisticRegression(max_iter=300)
logreg_tfidf.fit(tfidf_train, y_train)
predictions_logreg_tfidf = logreg_tfidf.predict(tfidf_test)
print("The accuracy on the test set is {}".format(accuracy_score(predictions_logreg_tfidf, y_test)))

The accuracy on the test set is 0.9075907590759076


The accuracy of the TF-IDF model is slightly higher than that of the standard Bag of Words model (about 90.8 versus 90.6 percent), but lower than that of the Bag of N-Grams model. Tuning some of the parameters can lead to slightly different results. Overall, traditional feature engineering combined with logistic regression does a very good job of predicting whether an article is fake news or not.

## Document embeddings from pre-trained word embeddings

Natural language processing is in constant development, and one of the newer popular methods for feature engineering is word embeddings. Word embeddings are vector representations of words and are trained with neural networks. The idea behind the approach is that words occurring in similar contexts tend to have similar meanings, and should therefore get similar vectors.

In this section I will use pre-trained word embeddings. There are two models that I will use: Word2Vec and FastText. From these word embeddings I create document embeddings by simply averaging the word embeddings for the respective article. Let's see how these document embeddings perform in classifying articles into the fake news and the real news categories.

### Pre-trained Word2Vec

In [32]:
# Loading model
word2vec_path = "D:/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz"
model = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [33]:
feature_size = len(model["fake"])
print("A word embedding vector has {} elements".format(feature_size))

A word embedding vector has 300 elements


Each word embedding has 300 elements. The higher the cosine similarity between two vectors, the more similar the corresponding words are. Let's see which words are most similar to the combination of the words "fake" and "news":

In [34]:
# The most similar words to fake news
model.most_similar(positive=["fake", "news"])[:5]

[('bogus', 0.5605027675628662),
 ('phony', 0.5478013753890991),
 ('site_FreakingNews.com', 0.5443505644798279),
 ('MCOT_online', 0.48262903094291687),
 ('Latest_Tanker_Operator', 0.47789376974105835)]

The highest cosine similarity occurs for the word "bogus". Hence, "bogus" is most similar to the combination of "fake" and "news". Now that we have seen that these word embeddings capture something relevant, let's calculate document embeddings by simply averaging the word embeddings for each article.
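The scores returned by `most_similar` are cosine similarities. As a minimal sketch of that measure, the example below uses made-up 3-dimensional vectors; the real embeddings have 300 dimensions, and the values here are purely illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical toy "embeddings", for illustration only
fake = [1.0, 0.5, 0.0]
bogus = [0.9, 0.6, 0.1]
banana = [0.0, 0.1, 1.0]

sim_bogus = cosine_similarity(fake, bogus)
sim_banana = cosine_similarity(fake, banana)
print(round(sim_bogus, 3), round(sim_banana, 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.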

In [35]:
# Create a function that aggregates the word embeddings into a document embedding
def get_document_embeddings(documents, model):
    
    document_embeddings=[]
    for document in documents:
        document_embedding = np.zeros((feature_size), dtype="float32")
        n_tokens = 0
        for token in document:
             if token in model:
                document_embedding = np.add(document_embedding, model[token])
                n_tokens += 1

        if n_tokens > 0:
            document_embedding = np.divide(document_embedding, n_tokens)

        document_embeddings.append(document_embedding)
    return document_embeddings
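One detail worth checking in the function above: it loops over `document` directly, and the normalized articles passed to it later are whitespace-joined strings, so the loop would visit single characters, not words. The following is a minimal sketch of a token-level variant, assuming whitespace tokenization. The toy `model` dict and the name `average_embedding` are hypothetical stand-ins; the pre-trained `KeyedVectors` object supports the same `in` and `[]` operations. Accuracies obtained with such a variant may differ from those reported in this notebook.

```python
def average_embedding(document, model, feature_size):
    """Average the embeddings of all whitespace-separated tokens found in `model`."""
    tokens = [token for token in document.split() if token in model]
    if not tokens:
        # no known token: fall back to a zero vector
        return [0.0] * feature_size
    return [
        sum(model[token][i] for token in tokens) / len(tokens)
        for i in range(feature_size)
    ]

# toy stand-in for a word embedding model (hypothetical values)
model = {"fake": [1.0, 0.0], "news": [0.0, 1.0], "f": [5.0, 5.0]}
document = "fake news"

token_level = average_embedding(document, model, feature_size=2)

# for comparison: iterating over the string itself yields characters,
# of which only "f" happens to be in the toy model
char_hits = [char for char in document if char in model]

print(token_level, char_hits)
```

In the toy example, the token-level average uses the vectors for "fake" and "news", whereas character-level iteration would only pick up the unrelated vector for "f".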

In [36]:
# Create document embeddings for the training and test set
train_document_embeddings = get_document_embeddings(train_articles_normalized, model)
test_document_embeddings = get_document_embeddings(test_articles_normalized, model)

In [37]:
# Instantiate a logistic regression model and fit on training data
logreg_word2vec_pretrained = LogisticRegression(max_iter=300)
logreg_word2vec_pretrained.fit(train_document_embeddings, y_train)
# Generate predictions on the test data
predictions_logreg_word2vec_pretrained = logreg_word2vec_pretrained.predict(test_document_embeddings)
print("The accuracy on the test set is {}".format(accuracy_score(predictions_logreg_word2vec_pretrained, y_test)))

The accuracy on the test set is 0.6468646864686468


Although the accuracy is higher than the naive prediction model accuracy of about 50 percent, the accuracy of these pre-trained Word2Vec word embeddings is much lower compared with the accuracy of the traditional feature engineering methods. Let's see whether FastText performs better.

### Pre-trained FastText

In [38]:
# Loading model
fasttext_path = "D:/gensim-data/fasttext-wiki-news-subwords-300/fasttext-wiki-news-subwords-300.gz"
model = KeyedVectors.load_word2vec_format(fasttext_path, binary=False)

In [39]:
feature_size = len(model["fake"])
print("A word embedding vector has {} elements".format(feature_size))

A word embedding vector has 300 elements


In [40]:
# The most similar words to fake news
model.most_similar(positive=["fake", "news"])[:5]

[('fake-news', 0.7599742412567139),
 ('pseudo-news', 0.7420461773872375),
 ('good-news', 0.7235771417617798),
 ('fakey', 0.7220522165298462),
 ('bad-news', 0.7200027704238892)]

As you can see, the FastText model has different similar words to "fake news" compared with the Word2Vec model. 

In [41]:
# Create document embeddings for the training and test set
train_document_embeddings = get_document_embeddings(train_articles_normalized, model)
test_document_embeddings = get_document_embeddings(test_articles_normalized, model)

In [42]:
# Instantiate a logistic regression model and fit on training data
logreg_fasttext_pretrained = LogisticRegression(max_iter=300)
logreg_fasttext_pretrained.fit(train_document_embeddings, y_train)
# Generate predictions on the test data
predictions_logreg_fasttext_pretrained = logreg_fasttext_pretrained.predict(test_document_embeddings)
print("The accuracy on the test set is {}".format(accuracy_score(predictions_logreg_fasttext_pretrained, y_test)))

The accuracy on the test set is 0.6204620462046204


The use of FastText word embeddings leads to a lower accuracy compared to the use of Word2Vec embeddings. This is somewhat unexpected, as FastText is a newer model. Nevertheless, in this use case, traditional feature engineering methods outperform the pre-trained word embedding methods.

## Document embeddings from self-trained word embeddings

In this section, I am going to create my own word embeddings, and see whether these perform better or worse compared with the pre-trained word embeddings. I will use the Gensim module to create these word embeddings. Again I will use the Word2Vec model and the FastText model. The size of the embeddings is treated as a hyperparameter that is tuned.

In [43]:
# Create a new normalize function, as the Gensim models expect a different data structure as input
# More specifically, Gensim models learn from context words, and therefore the input should consist of tokenized sentences
def normalize2(article):
    article_doc = nlp(article)
    sentences = list(article_doc.sents)
    article_normalized = []
    for sentence in sentences:
        # transform to lemma and get rid of stopwords and punctuation marks
        sentence = [token.lemma_ for token in sentence if not (token.is_stop or token.is_punct)]
        # get rid of white spaces
        sentence = [token for token in sentence if not token.strip() == ""]
        # get rid of numbers
        sentence = [token for token in sentence if not re.search("[0-9]", token)]
        # everything to lowercase
        sentence= [token.lower() for token in sentence]
        article_normalized.append(sentence)
    return article_normalized
        

In [44]:
# get normalized articles
train_articles_normalized2 = [normalize2(article) for article in train_articles_list]
test_articles_normalized2 = [normalize2(article) for article in test_articles_list]

In [45]:
# the first five sentences normalized for the first article
train_articles_normalized2[0][:5]

[['des', 'moines', 'iowa', 'cnn'],
 ['doubt',
  'donald',
  'trump',
  'pull',
  'major',
  'counter',
  'program',
  'feat',
  'compete',
  'gop',
  'debate',
  'expect',
  'draw',
  'million',
  'viewer'],
 ['thursday',
  'night',
  'dazzle',
  'crowd',
  'hundred',
  'enthusiastic',
  'supporter',
  'announce',
  'raise',
  '$',
  'million',
  'veteran',
  'day'],
 ['$', 'million', 'checkbook'],
 ['love', 'vet', 'say']]

In [46]:
train_articles_normalized2 = np.array(train_articles_normalized2, dtype="object")
test_articles_normalized2 = np.array(test_articles_normalized2, dtype="object")

In [47]:
# set parameters for the Gensim models
window_context = 10
min_word_count = 1
sample = 1e-3

In [48]:
# This object generates document embeddings based on self-created word embeddings
class GetDocumentEmbedding(BaseEstimator, TransformerMixin):
    def __init__(self, modeltype, feature_size):
        self.modeltype = modeltype
        self.feature_size = feature_size
        self.vocabulary = None
        self.model = None
        
    def fit(self, X, y=None):
        return self      
    
    def fit_transform(self, X, y=None):
        
        if self.modeltype == "word2vec":
            corpus = X.tolist()
            corpus = [sentence for article in corpus for sentence in article]
            model = word2vec.Word2Vec(corpus, vector_size=self.feature_size, window=window_context, min_count=min_word_count, sample=sample)
            self.model = model
            vocabulary = set(model.wv.index_to_key)
        
        elif self.modeltype == "fasttext":
            corpus = X.tolist()
            corpus = [sentence for article in corpus for sentence in article]
            model = FastText(corpus, vector_size=self.feature_size, window=window_context, min_count=min_word_count, sample=sample, sg=1)
            self.model = model
            vocabulary = set(model.wv.index_to_key)
        else:
            vocabulary = set()

        self.vocabulary = vocabulary
            
        document_embeddings=[]
        for document in X:
            document_embedding = np.zeros((self.feature_size), dtype="float32")
            n_tokens = 0
            for sentence in document:
                for token in sentence:
                    if token in vocabulary:
                        document_embedding = np.add(document_embedding, self.model.wv[token])
                        n_tokens += 1

            if n_tokens > 0:
                document_embedding = np.divide(document_embedding, n_tokens)

            document_embeddings.append(document_embedding)
        return document_embeddings        
        
    def transform(self, X, y=None):
        
        document_embeddings=[]
        for document in X:
            document_embedding = np.zeros((self.feature_size), dtype="float32")
            n_tokens = 0
            for sentence in document:
                for token in sentence:
                    if token in self.vocabulary:
                        document_embedding = np.add(document_embedding, self.model.wv[token])
                        n_tokens += 1

            if n_tokens > 0:
                document_embedding = np.divide(document_embedding, n_tokens)

            document_embeddings.append(document_embedding)
        return document_embeddings 


### Word2Vec

In [49]:
# create pipeline
full_pipeline = Pipeline([
    ("embedding", GetDocumentEmbedding("word2vec", 50)),
    ("logreg", LogisticRegression(max_iter=300))
])

In [50]:
# create grid search
param_grid = [{"embedding__feature_size": [50,100,250,500]}]

grid_search = GridSearchCV(full_pipeline, param_grid, cv=5, scoring="accuracy")

In [51]:
grid_search.fit(train_articles_normalized2, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('embedding',
                                        GetDocumentEmbedding(feature_size=50,
                                                             modeltype='word2vec')),
                                       ('logreg',
                                        LogisticRegression(max_iter=300))]),
             param_grid=[{'embedding__feature_size': [50, 100, 250, 500]}],
             scoring='accuracy')

In [52]:
# accuracy scores on validation sets for each of the feature sizes
grid_search.cv_results_["mean_test_score"]

array([0.86321747, 0.86569447, 0.86590001, 0.86383667])

In [53]:
# best model feature size
grid_search.best_params_

{'embedding__feature_size': 250}
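As a quick check on how close the candidates are, the sketch below recomputes the best feature size from the mean validation scores printed above. The scores are copied from that output; the variable names are only for illustration.

```python
feature_sizes = [50, 100, 250, 500]
# mean cross-validated accuracy per feature size, copied from the output above
mean_scores = [0.86321747, 0.86569447, 0.86590001, 0.86383667]

# index of the highest mean score
best_index = max(range(len(mean_scores)), key=mean_scores.__getitem__)
best_size = feature_sizes[best_index]

# how much better is the best size than a feature size of 100?
gap_to_100 = mean_scores[best_index] - mean_scores[feature_sizes.index(100)]
print(best_size, round(gap_to_100, 5))
```

The gap between 250 and 100 is about 0.0002, which is small compared with the variation one would expect between cross-validation folds.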

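The grid search selects a feature size of 250, but the cross-validated accuracy for a feature size of 100 is nearly identical (0.8657 versus 0.8659). The final pipeline below uses a feature size of 100. Let's use it to predict on the test set and to evaluate how well the self-trained Word2Vec word embeddings perform compared with the pre-trained embeddings.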

In [54]:
# create final pipeline
full_pipeline = Pipeline([
    ("embedding", GetDocumentEmbedding("word2vec", 100)),
    ("logreg", LogisticRegression(max_iter=300))
])

In [55]:
full_pipeline.fit(train_articles_normalized2, y_train)

Pipeline(steps=[('embedding',
                 GetDocumentEmbedding(feature_size=100, modeltype='word2vec')),
                ('logreg', LogisticRegression(max_iter=300))])

In [56]:
# create predictions on test set
predictions = full_pipeline.predict(test_articles_normalized2)

In [57]:
print("The accuracy on the test set is {}".format(accuracy_score(predictions, y_test)))

The accuracy on the test set is 0.8646864686468647


The accuracy is quite high. It is higher than using the pre-trained word embeddings. But traditional feature engineering still outperforms the use of self-trained Word2Vec word embeddings. Let's see whether the use of FastText makes a difference.

### FastText

In [58]:
# create pipeline
full_pipeline = Pipeline([
    ("embedding", GetDocumentEmbedding("fasttext", 50)),
    ("logreg", LogisticRegression(max_iter=300))
])

In [59]:
# create grid search
param_grid = [{"embedding__feature_size": [50,100,250,500]}]

grid_search = GridSearchCV(full_pipeline, param_grid, cv=5, scoring="accuracy")

In [60]:
grid_search.fit(train_articles_normalized2, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('embedding',
                                        GetDocumentEmbedding(feature_size=50,
                                                             modeltype='fasttext')),
                                       ('logreg',
                                        LogisticRegression(max_iter=300))]),
             param_grid=[{'embedding__feature_size': [50, 100, 250, 500]}],
             scoring='accuracy')

In [61]:
# accuracy scores on validation sets for each of the feature sizes
grid_search.cv_results_["mean_test_score"]

array([0.87023289, 0.8737406 , 0.8772466 , 0.87848457])

In [62]:
# best model feature size
grid_search.best_params_

{'embedding__feature_size': 500}

The best feature size is 500. Let's use this to predict on the test set and to evaluate how well the use of self-trained FastText word embeddings performs compared with the pre-trained embeddings.

In [63]:
# create final pipeline
full_pipeline = Pipeline([
    ("embedding", GetDocumentEmbedding("fasttext", 500)),
    ("logreg", LogisticRegression(max_iter=300))
])

In [64]:
full_pipeline.fit(train_articles_normalized2, y_train)

Pipeline(steps=[('embedding',
                 GetDocumentEmbedding(feature_size=500, modeltype='fasttext')),
                ('logreg', LogisticRegression(max_iter=300))])

In [65]:
predictions = full_pipeline.predict(test_articles_normalized2)

In [66]:
print("The accuracy on the test set is {}".format(accuracy_score(predictions, y_test)))

The accuracy on the test set is 0.8853135313531353


The accuracy is quite high. It is higher than when using the pre-trained FastText word embeddings. It is also higher than when using self-trained Word2Vec word embeddings. However, the traditional feature engineering methods still work best.

## Document embeddings from BERT sentence embeddings

The final feature engineering method we are going to examine is BERT sentence embeddings. These are also embeddings, but they are generated with a different neural network model. One of its main advantages over the other embedding methods is that the representation of a word depends on the surrounding words. For example, under the older embedding models, the word "bank" has the same vector irrespective of the context. Under BERT, however, the representation of "bank" differs depending on the context, so it can capture different meanings.

Let's load the BERT model from TensorFlow Hub.

In [67]:
# load BERT model
preprocess_url = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
encoder_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
os.environ['TFHUB_CACHE_DIR'] = 'D:/tensorflow hub/tf_cache'
bert_preprocess_model = hub.KerasLayer(preprocess_url)
bert_model = hub.KerasLayer(encoder_url)

In the next few cells I am going to create BERT document embeddings from BERT sentence embeddings, by averaging the sentence embeddings over the documents. Then, I will estimate a logistic regression model with the document embeddings as input. 

In [68]:
# create a function to get BERT document embeddings  from BERT sentence embeddings

sentence_tokenizer = nltk.sent_tokenize

def get_document_bert_embedding(article):
    # get rid of newline separators
    article = article.replace("\n", "")
    # create sentences
    sentences = sentence_tokenizer(article)
    # preprocess sentences
    text_preprocessed = bert_preprocess_model(sentences)
    # get sentence embeddings
    bert_results = bert_model(text_preprocessed)
    # get document embeddings
    document_embedding = tf.reduce_mean(bert_results["pooled_output"], 0).numpy()
    return document_embedding   


In [69]:
# clear up memory
del full_pipeline, predictions, grid_search, train_articles_normalized2, test_articles_normalized2, logreg_fasttext_pretrained, predictions_logreg_fasttext_pretrained, train_document_embeddings, test_document_embeddings, model, logreg_word2vec_pretrained, predictions_logreg_word2vec_pretrained, logreg_tfidf, predictions_logreg_tfidf, tfidf_test, tfidf_train, predictions_logreg_cv2, logreg_cv2, cv2_test, cv2_train, predictions_logreg_cv, logreg_cv, cv_test, cv_train, cv_odds, vocabulary, train_articles_normalized, test_articles_normalized, nlp, X_train, X_test, y, X, data

In [70]:
# get BERT document embeddings
train_articles_document_bert_embeddings = [get_document_bert_embedding(article) for article in train_articles_list]
test_articles_document_bert_embeddings = [get_document_bert_embedding(article) for article in test_articles_list]

In [71]:
train_articles_document_bert_embeddings = np.array(train_articles_document_bert_embeddings)
test_articles_document_bert_embeddings = np.array(test_articles_document_bert_embeddings)

In [72]:
# fit logistic regression model with BERT document embeddings
logreg = LogisticRegression(max_iter=1000)
logreg.fit(train_articles_document_bert_embeddings, y_train)

LogisticRegression(max_iter=1000)

In [73]:
# get predictions and print accuracy
predictions = logreg.predict(test_articles_document_bert_embeddings)
print("The accuracy on the test set is {}".format(accuracy_score(predictions, y_test)))

The accuracy on the test set is 0.9240924092409241


The document embeddings from the BERT sentence embeddings generate the highest accuracy on the test set!

## Conclusion

In this project I distinguished fake news from real news with a logistic regression model, using different feature engineering techniques, such as the Bag of Words model, the TF-IDF model, and document embeddings derived from several word embedding techniques and from the BERT sentence embedding model. Traditional feature engineering techniques, such as the Bag of Words model, work very well on this data. However, the BERT sentence embeddings generate the best predictions on the test set.