# Fake News Detector

In this notebook we show how to predict whether a news article is fake, using Word2Vec to obtain word embeddings (vector representations of words) and a simple ML model.

## Load the data

The dataset (WELFake) was downloaded from Kaggle.

In [1]:
import pandas as pd

DATA_PATH = "data/WELFake_Dataset.csv"

df = pd.read_csv(DATA_PATH)
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


## Clean the data

First, we check how many NaN values each column contains.

In [2]:
# check missing values by column
df.isnull().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

## Remove missing values

In [3]:
# drop every row that contains a NaN value

df = df.dropna()
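
Note that `dropna()` with no arguments drops a row if any column is missing, so rows with a missing title are discarded even though only the `text` and `label` columns are used later in this notebook. A minimal sketch of a narrower alternative, on a small made-up frame (`toy_df` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy_df = pd.DataFrame(
    {
        "title": ["A headline", None, "Another headline"],
        "text": ["Some body text", "Text without a title", None],
        "label": [1, 1, 0],
    }
)

# Drop a row when any column is missing (what the cell above does)
dropped_any = toy_df.dropna()

# Drop a row only when the column that gets vectorized is missing
dropped_text_only = toy_df.dropna(subset=["text"])

print(len(dropped_any), len(dropped_text_only))
```

Whether keeping title-less rows is desirable depends on how the title is used downstream, so treat this as an option rather than a correction.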

## Clean the text

Since we are working with text data, we need to clean it first. Cleaning means removing the characters and words that are not useful for our model. In this case, we lowercase every word and remove punctuation; digits are kept (for example, `RS-28` becomes `rs28`). Some very common words carry little information, so we remove them as well. These are called stop words, and we take the list from the nltk library.

### Load Stop Words

In [4]:
import nltk

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words = set(stop_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/sngular/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Lowercase a single word and strip its punctuation

In [5]:
import string

def clean_word(word: str) -> str:
    """Remove punctuation and lowercase a word

    Args:
        word (str): the word to clean

    Returns:
        str: the cleaned word
    """
    word = word.lower()
    word = word.strip()

    for letter in word:
        if letter in string.punctuation:
            word = word.replace(letter, '')

    return word

clean_word("Hello!?.")

'hello'
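
`clean_word` strips ASCII punctuation but leaves digits in place. As a sketch of an alternative, `str.translate` can delete punctuation (and, optionally, digits) in a single pass. The function name `clean_word_fast` and the `drop_digits` flag are illustrative assumptions, not part of the notebook:

```python
import string

# Translation tables that map every unwanted character to None (i.e. delete it)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
_PUNCT_DIGIT_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def clean_word_fast(word: str, drop_digits: bool = False) -> str:
    """Lowercase a word and delete punctuation (optionally digits too)."""
    table = _PUNCT_DIGIT_TABLE if drop_digits else _PUNCT_TABLE
    return word.lower().strip().translate(table)


print(clean_word_fast("Hello!?."))
print(clean_word_fast("RS-28", drop_digits=True))
```

Like the original function, this only covers the ASCII characters in `string.punctuation`; typographic characters such as `’` pass through unchanged.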

### Clean text by removing punctuation and stop words

In [6]:
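def clean_text(text: str) -> list[str]:
    """Split a text into cleaned words, dropping stop words and empty tokens."""
    clean_text_list = []
    for word in text.split():
        cleaned_word = clean_word(word)
        # skip tokens that were pure punctuation (now empty) and stop words
        if cleaned_word and cleaned_word not in stop_words:
            clean_text_list.append(cleaned_word)

    return clean_text_list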

clean_text("Hello!, How are You today?")

['hello', 'today']

We add a new column to the dataframe holding the cleaned text: lowercased, with punctuation and stop words removed.

In [7]:
df["clean_text"] = df["text"].apply(clean_text)

In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label,clean_text
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1,"[comment, expected, barack, obama, members, fy..."
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1,"[demonstrators, gathered, last, night, exercis..."
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0,"[dozen, politically, active, pastors, came, pr..."
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1,"[rs28, sarmat, missile, dubbed, satan, 2, repl..."
5,5,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1,"[say, one, time, someone, sued, southern, pove..."


## Vectorize Words

We will use Word2Vec to vectorize the words. Word2Vec is a model that takes as input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.

In [9]:
import gensim


EMBEDDING_DIM = 100


sentences = df["clean_text"]  # a pandas Series in which each element is a list of words


model = gensim.models.Word2Vec(
    sentences=sentences,
    vector_size=EMBEDDING_DIM,
    window=5,  # the number of words before and after a word to consider
    min_count=1,  # ignore words with frequency lower than this
)

In [10]:
# check words close to Rajoy

model.wv.most_similar("rajoy")

[('puigdemont', 0.856658935546875),
 ('mariano', 0.8286176323890686),
 ('madrid', 0.8090261816978455),
 ('catalan', 0.8077750205993652),
 ('catalonia', 0.765256404876709),
 ('secessionists', 0.7488061189651489),
 ('carles', 0.7243034243583679),
 ('krg', 0.7059369683265686),
 ('plebiscite', 0.6874198317527771),
 ('gentiloni', 0.6853983402252197)]
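
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that computation with NumPy, on made-up 3-dimensional vectors (the values are illustrative, not taken from the trained model):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy vectors
v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 4.0, 0.0])  # same direction as v1
v3 = np.array([0.0, 0.0, 5.0])  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Because the measure ignores vector length, `v1` and `v2` score as identical even though their magnitudes differ.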

## Vectorize articles

Now that we know how to convert a word into a vector, we need to reduce a whole article to a single vector. We will do this by summing the vectors of all the words in the article.

In [11]:
import numpy as np


def vectorize_text(text: list[str]) -> np.ndarray:
    """Vectorize a text by summing its word vectors

    Args:
        text (list[str]): the cleaned words of the text to vectorize

    Returns:
        np.ndarray: the vectorized text
    """
    text_vector = np.zeros(EMBEDDING_DIM, np.float32)
    for word in text:
        word_vector = model.wv[word]
        text_vector += word_vector  # equivalent to text_vector = text_vector + word_vector

    return text_vector
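
`vectorize_text` accumulates a sum, so longer articles produce vectors with larger magnitude. If a length-independent representation is preferred, the sum can be divided by the number of words. The sketch below uses a plain dictionary (`toy_vectors`, hypothetical) in place of `model.wv` so that it runs on its own, and it skips words that have no vector:

```python
import numpy as np

TOY_DIM = 3  # stand-in for EMBEDDING_DIM

# Hypothetical word vectors standing in for model.wv
toy_vectors = {
    "fake": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "news": np.array([0.0, 2.0, 0.0], dtype=np.float32),
}


def mean_vectorize(text: list[str], vectors: dict[str, np.ndarray], dim: int) -> np.ndarray:
    """Average the vectors of the known words; return zeros if none are known."""
    known = [vectors[word] for word in text if word in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)


print(mean_vectorize(["fake", "news", "unknownword"], toy_vectors, TOY_DIM))
```

With gensim, the membership test `word in model.wv` plays the same role as the dictionary lookup here.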

In [12]:
# convert every article to a single vector

X = df["clean_text"].apply(vectorize_text)

In [17]:
X = np.array(X.tolist(), dtype=np.float32)
y = np.array(df["label"].to_list(), dtype=np.float32)

## Save the Processed Dataset

In [18]:
np.save("data/features.npy", X)
np.save("data/labels.npy", y)
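
To reuse the arrays in a later notebook, `np.load` reads them back. A small round-trip sketch with made-up arrays written to a temporary directory (the shapes and values are illustrative only):

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for X and y
toy_X = np.arange(6, dtype=np.float32).reshape(2, 3)
toy_y = np.array([1.0, 0.0], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = os.path.join(tmp_dir, "features.npy")
    labels_path = os.path.join(tmp_dir, "labels.npy")

    np.save(features_path, toy_X)
    np.save(labels_path, toy_y)

    loaded_X = np.load(features_path)
    loaded_y = np.load(labels_path)

# There should be one feature row per label
assert loaded_X.shape[0] == loaded_y.shape[0]
print(loaded_X.shape, loaded_y.shape)
```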