# Fake News Detector <a id='top'></a>

## Data Exploration

_Using Amazon SageMaker for Training | Deployment_

---

This project builds a classification model that examines the text of a news article and performs binary classification, labeling the article as either fake or real. The model was trained on a dataset from [kaggle](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset). The dataset consists of about 21,000 articles labeled as true and about 23,000 labeled as fake.

The project is inspired by [research](https://blogs.scientificamerican.com/beautiful-minds/liberals-and-conservatives-are-both-susceptible-to-fake-news-but-for-different-reasons/?WT.mc_id=send-to-friend) suggesting that both liberals and conservatives are motivated to believe fake news and to dismiss real news that contradicts their ideologies. Fake news spread through social media has become a serious problem, and this study aims to build an unbiased model that can classify news as fake or real.

The first step in working with any dataset is to load the data and note what information it contains. This tells us what kinds of features we have to work with as we transform and group the data. This notebook is therefore all about exploring the data and noting patterns in the given features and in the distribution of the data.

## General Outline

1. [Import Libraries](#import)
2. [Read in the Data](#read)
3. [Prepare and Process the Data](#prepare)
4. [Check Data](#check)
5. [Explore the Data](#explore)
    1. [Distribution of the Classes](#classes)
    2. [Distribution of the News Topics](#topics)
    3. [Distribution of the News Topics and Classes](#topics-classes)
    4. [Explore Date](#date)
    5. [Explore Text](#text)
    6. [Explore Words](#words)
        1. [Word Clouds in Fake and True News](#wordcloud)
        2. [Most Frequent Words in Fake and True News](#word-frequency)
6. [Next Notebook](#next)

## Import Libraries <a id='import'></a>

In the next cells we will import the required libraries for this analysis and set some global variables.

In [2]:
# Install libraries.
# !pip install chart_studio
# !pip install wordcloud

# Import libraries.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from PIL import Image
# import chart_studio.plotly as py  # Not used in this notebook; needs the optional chart_studio package.
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

import re
from bs4 import BeautifulSoup
from tqdm import tqdm
from collections import Counter

# Set Plotly theme.
pio.templates.default = "gridon"

# Set global variables.
RANDOM_STATE = 5

[Go to the top](#top)

## Read in the Data <a id='read'></a>

The cell below will load the data into `pandas` dataframes.

> **Acknowledgements for data**:
>- Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.
>- Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

The dataset comes as two `csv` files, `Fake.csv` and `True.csv`, which hold the text of each article along with other fields such as its title, subject, and date. We can read them in using `pandas`.

In [3]:
true = pd.read_csv("data/True.csv")
fake = pd.read_csv("data/Fake.csv")

# Show first rows for each dataset.
display(true.head())
display(fake.head())

# Print the number of real and fake news.
print('\nThere are {} real and {} fake news'.format(true.shape[0], fake.shape[0]))

[Go to the top](#top)

## Prepare and Process the Data <a id='prepare'></a>

It is more convenient to merge the two datasets and then apply any processing tasks and text transformations to the new dataframe. First, we need to create a new column `label` to save the labels of each text. We will also shuffle the final dataframe and then delete the previous datasets to free up some memory.

In [None]:
# Create the 'label' column.
true['label'] = 'True'
fake['label'] = 'Fake'

# Concatenate the 2 dfs.
df = pd.concat([true, fake])

# To save a bit of memory we can set fake and true to None.
fake = true = None

#  Shuffle data.
df = df.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

# Show first rows.
df.head()

[Go to the top](#top)

## Check Data <a id='check'></a>

The dataframe will be examined for the quality of its data. The types and shape of the data will be checked, as well as whether any records are missing.

In [None]:
# Check df.
df.info()

**Inference**

- There are no missing values.
- Dates were recognized as strings. Later, we will use a `pandas` date parser function to recognize dates correctly.
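
The two points above can be checked explicitly. The sketch below is only an illustration on a small made-up dataframe (the column names mirror the dataset, the rows are invented): it counts missing values per column and shows how `errors='coerce'` turns an unparseable date into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy rows standing in for the real dataframe (all values are made up).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'text': ['some text', None, 'more text'],
    'date': ['December 31, 2017', 'not a date', 'January 5, 2016'],
})

# Count the missing values in each column.
missing = toy.isna().sum()
print(missing)

# Parse the date strings; entries that cannot be parsed become NaT.
parsed = pd.to_datetime(toy['date'], errors='coerce')
print(parsed)
```

On the real data, a non-zero count in `missing` or a large number of `NaT` values after parsing would be worth investigating before moving on.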

[Go to the top](#top)

## Explore the Data <a id='explore'></a>

Next, let's look at the distribution of data.

### Distribution of the Classes <a id='classes'></a>

We need to check how evenly our data is distributed between the two classes.

In [None]:
# Show counts for each class.
fig = px.bar(df.groupby('label').count().reset_index(), x='label', y='title', text='title', opacity=0.6)
fig.update_layout(title_text='Distribution of News')
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.update_yaxes(showticklabels=False)
fig.show()

**Inference**

- It seems that the dataset is balanced, so no action is needed to handle any imbalance between the classes.
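
Balance can also be expressed as a number rather than read off a chart. The following is a minimal sketch, assuming toy labels in roughly the 21,000 to 23,000 proportion mentioned in the introduction (scaled down by a factor of 1,000); `imbalance_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Toy labels in roughly the same proportion as the dataset (scaled down).
labels = pd.Series(['True'] * 21 + ['Fake'] * 23, name='label')

# Share of each class as a fraction of all rows.
shares = labels.value_counts(normalize=True)
print(shares.round(3))

# Ratio of the largest class to the smallest; values close to 1 indicate balance.
imbalance_ratio = shares.max() / shares.min()
print(round(imbalance_ratio, 2))
```

On the real dataframe the same idea would be `df['label'].value_counts(normalize=True)`.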

[Go to the top](#top)

### Distribution of the News Topics <a id='topics'></a>

It may also be helpful to look at the `subject` distribution.

In [None]:
# Show counts for each subject.
fig = px.bar(df.groupby('subject').count()['title'].reset_index().sort_values(by='title'),
             x='subject', y='title', text='title', opacity=0.6)
fig.update_layout(title_text='Distribution of News Subjects')
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.update_yaxes(showticklabels=False)
fig.show()

**Inference**

- We have 5 categories with a lot of `True` and `Fake` news and 3 with only a few hundred articles each.

[Go to the top](#top)

### Distribution of the News Topics and Classes <a id='topics-classes'></a>

Let's dig deeper and see the distribution of labels inside each subject.

In [None]:
df_sum = df.groupby(['label', 'subject']).count().reset_index()
fig = px.bar(df_sum, x='label', y='title', color='subject', text='title', opacity=0.6)
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.update_yaxes(showticklabels=False)
fig.show()

**Inference**

- It seems that the `Fake` and `True` news datasets do not share the same topics in the `subject` column. Possibly, `politics` is similar to `politicsNews`, but since subjects are mapped differently between `True` and `Fake` news, it would be better to exclude this feature from our model.
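
One way to confirm this kind of overlap (or lack of it) is a cross-tabulation of subjects against labels. The sketch below uses a small made-up dataframe; apart from `politics` and `politicsNews`, the subject values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data: subject values other than 'politics' and 'politicsNews' are invented.
toy = pd.DataFrame({
    'label':   ['True', 'True', 'True', 'Fake', 'Fake', 'Fake'],
    'subject': ['politicsNews', 'worldnews', 'politicsNews', 'politics', 'News', 'politics'],
})

# Rows are subjects, columns are labels, cells are article counts.
table = pd.crosstab(toy['subject'], toy['label'])
print(table)

# Subjects that occur under both labels.
shared = table[(table['Fake'] > 0) & (table['True'] > 0)].index.tolist()
print(shared)
```

An empty `shared` list means every subject belongs to exactly one class, so a model could recover the label from `subject` alone without learning anything about the text.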

[Go to the top](#top)

### Explore Date <a id='date'></a>

It would be interesting to see if there are any patterns in `date`.

In [None]:
# Convert date str into date object. Take care of any errors for invalid dates.
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_date = df.groupby(['label', 'date'])['title'].count().reset_index()

fig = px.line(df_date, x='date', y='title', color='label')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()

**Inference**

- Not too much to infer from this time series plot.
- Let's extract the day of the week, week of the year, month, and quarter to check for any seasonality. To keep the comparison fair, both classes should be compared over the same time window. The larger number of `True` news after August 2017 does not appear to be caused by the date itself, and neither does the lack of `True` news before January 2016. In other words, we cannot infer that all news before January 2016 was `Fake`.

In [None]:
# Filter df based on date.
df_filtered = df[(df['date'] < '2017-08-31') & (df['date'] > '2016-02-01')].copy()
df_filtered.loc[:, 'weekday'] = df_filtered['date'].dt.dayofweek
# `Series.dt.weekofyear` was removed in pandas 2.0; `dt.isocalendar().week` replaces it.
df_filtered.loc[:, 'week'] = df_filtered['date'].dt.isocalendar().week
df_filtered.loc[:, 'month'] = df_filtered['date'].dt.month
df_filtered.loc[:, 'quarter'] = df_filtered['date'].dt.quarter

df_weekday = df_filtered.groupby(['label', 'weekday']).count()['title'].reset_index()

fig = px.line(df_weekday, x='weekday', y='title', color='label')
fig.update_layout(title_text='Day of Week')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()

In [None]:
df_week = df_filtered.groupby(['label', 'week']).count()['title'].reset_index()

fig = px.line(df_week, x='week', y='title', color='label')
fig.update_layout(title_text='Week of the Year')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()

In [None]:
df_month = df_filtered.groupby(['label', 'month']).count()['title'].reset_index()

fig = px.line(df_month, x='month', y='title', color='label')
fig.update_layout(title_text='Monthly')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()

In [None]:
df_quarter = df_filtered.groupby(['label', 'quarter']).count()['title'].reset_index()

fig = px.line(df_quarter, x='quarter', y='title', color='label')
fig.update_layout(title_text='Quarterly')
fig.update_xaxes(title_text=None)
fig.update_yaxes(title_text=None)
fig.update_layout(legend_title_text=None)
fig.show()

**Inference**

- There is no clear distinction between date features in `Fake` and `True` news.

[Go to the top](#top)

### Explore Text <a id='text'></a>

Let's print a couple of `True` and `Fake` examples to check if there are any differences.

In [None]:
print('Fake News\n')
print(df[df.label == 'Fake']['text'].tolist()[3])
print()
print(df[df.label == 'Fake']['text'].tolist()[5])
print()
print('\n\nTrue News\n')
print(df[df.label == 'True']['text'].tolist()[0])
print()
print(df[df.label == 'True']['text'].tolist()[2])

**Inference**

- `True` news may contain more dates, numbers, and names than `Fake` news. Later, we will extract these features and use them as input to the models.
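
To make this hypothesis testable, such cues can be counted with a few regular expressions. This is only a rough sketch, not the feature extraction used later in the project: `surface_features` is a hypothetical helper, and the patterns are crude proxies (weekday names for dates, mid-sentence capitalized words for names).

```python
import re

def surface_features(text):
    """Count simple surface cues in a raw text string (hypothetical helper)."""
    return {
        # Runs of digits, e.g. '2017' or '31'.
        'numbers': len(re.findall(r'\d+', text)),
        # Weekday names, a rough proxy for date mentions.
        'weekdays': len(re.findall(
            r'\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b', text)),
        # Capitalized words preceded by a lowercase letter or comma and a space,
        # a crude proxy for names (weekday names are counted here as well).
        'capitalized': len(re.findall(r'(?<=[a-z,] )[A-Z][a-z]+', text)),
    }

sample = "The senator said on Tuesday that 25 members met Donald Trump in Washington."
features = surface_features(sample)
print(features)
```

Averaging these counts per label would show whether the difference is systematic or just an impression from a few examples.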

[Go to the top](#top)

### Explore Words <a id='words'></a>

It is interesting to see what kinds of words and phrases are commonly used in `Fake` and `True` news. Before this exploration, a good strategy is to clean the text by applying the following steps:

- Read a text file as a string of raw text.
- Lowercase all words, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
- Normalize numbers, replacing them with the text `number`.
- Remove non-words and punctuation, and collapse all whitespace (tabs, newlines, spaces) into a single space character.
- Tokenize the raw text string into a list of words. Tokenization breaks a string into pieces such as words, keywords, phrases, symbols, and other elements called tokens. Tokens can be individual words, phrases, or even whole sentences, and some characters, such as punctuation marks, are discarded in the process. The tokens become the input to further steps such as parsing and text mining. They can then be encoded as integers or floating point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).
- Use lemmatization or stemming to consolidate closely related word forms. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount". Sometimes the stemmer strips off additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".
- Remove stop words. Stop words are used so frequently that for many tasks (but not all) they don't carry much information. Examples are "any", "all", "what", etc. NLTK has a built-in corpus of English stop words that can be loaded and used.
- Apply additional text preparation steps, such as normalizing links and emails: all https and http links are replaced with the text "link" and all emails are replaced with the text "email".
- Render either a word cloud or a bar chart with the most frequent unigrams, bigrams, trigrams, etc.
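
The normalization steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `normalize` and `STOPWORDS` are hypothetical names, the stop-word set is a tiny stand-in for the NLTK list, and stemming is left out. Note that links and emails are replaced first, while their punctuation is still intact.

```python
import re

# Tiny illustrative stop-word set; the NLTK English list is much longer.
STOPWORDS = {'the', 'a', 'an', 'to', 'at', 'and', 'or', 'for', 'all', 'any', 'what'}

def normalize(text):
    """Apply the normalization steps to one raw string; stemming is omitted."""
    # Replace links and emails first, while '://' and '@' are still present.
    text = re.sub(r'https?://\S+', ' link ', text)
    text = re.sub(r'\S+@\S+', ' email ', text)
    # Replace every run of digits with a placeholder.
    text = re.sub(r'\d+', ' number ', text)
    # Lowercase, then turn anything that is not an ASCII letter or whitespace into a space.
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    # split() with no arguments collapses tabs, newlines, and repeated spaces.
    return [w for w in text.split() if w not in STOPWORDS]

raw = "Write to tips@example.com or visit https://example.com/news for all 3 REPORTS!"
tokens = normalize(raw)
print(tokens)
```

The order matters: if punctuation were stripped before the link and email patterns ran, those patterns would no longer find anything to match.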

#### Word Clouds in Fake and True News <a id='wordcloud'></a>

First, we can use the `WordCloud` class directly, before doing any heavy text processing, to get an idea of the most important words or phrases in fake and true news.

In [None]:
# Create a function to create a word cloud.
def make_wordcloud(text, mask, color):
    wordcloud = WordCloud(max_words=200, mask=mask,
                          background_color='white',
                          contour_width=2,
                          contour_color=color).generate(text)
    plt.figure(figsize=(17,12))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

# Read an image in order to use it as a shape for our word cloud.
fake_mask = np.array(Image.open("data/fake.png"))
true_mask = np.array(Image.open("data/true.png"))

# Get the fake and true news.
fake_text = " ".join(text for text in df[df.label == 'Fake']['text'])
true_text = " ".join(text for text in df[df.label == 'True']['text'])

# Render word clouds.
make_wordcloud(fake_text, fake_mask, 'blue')
make_wordcloud(true_text, true_mask, 'orange')

**Inference**

- Fake news contains a lot of phrases like Donald Trump, Hillary Clinton, White House, and United States.
- True news contains many of the phrases found in fake news, but also a lot of date expressions such as on Tuesday, on Monday, on Sunday, last week, etc.

[Go to the top](#top)

#### Most Frequent Words in Fake and True News <a id='word-frequency'></a>

Let's apply the text processing steps described before and then print the most frequent words.

In [None]:
# Register `progress_apply` with pandas so that long-running `apply` calls show a progress bar.
tqdm.pandas()

# Build the stop-word set and the stemmer once, rather than on every call.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = SnowballStemmer('english')

# Create a function to clean and prepare text.
def clean_text(text):
    """ Strip HTML tags, normalize links, emails, and numbers, remove
    punctuation and stopwords, convert to lower case, and stem each word.
    Args:
        text: text, string
    Returns:
        words: cleaned and stemmed words, list
    """

    # Remove HTML tags. This has to run before punctuation is stripped,
    # otherwise the angle brackets are already gone.
    text = BeautifulSoup(text, "html.parser").get_text()

    # Replace links with the str 'link'. This also has to run before
    # punctuation is stripped, otherwise '://' no longer exists.
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
                  'link', text, flags=re.MULTILINE)

    # Replace emails with the str 'email' while '@' is still present.
    text = re.sub(r'\S+@\S+', 'email', text, flags=re.MULTILINE)

    # Replace numbers with the str 'number'.
    text = re.sub(r'\d+', 'number', text)

    # Replace newlines with spaces.
    text = re.sub(r'\n', ' ', text)

    # Replace punctuation with spaces.
    text = re.sub(r'[^\w\s]', ' ', text)

    # Convert all letters to lower case.
    text = text.lower()

    # Split text into words.
    words = text.split()

    # Remove stopwords.
    words = [w for w in words if w not in STOP_WORDS]

    # Stem words.
    words = [STEMMER.stem(w) for w in words]

    return words

# Apply the cleaning function to the dataset.
df.text = df.text.progress_apply(clean_text)

Create a function to count and return the most frequent words and then plot them in a horizontal bar chart.

In [None]:
# Create a function to count and return the most frequent words.
def frequent_words(label, max_words):
    # Gather text and concatenate.
    text = df[df['label'] == label]['text'].values
    text = np.concatenate(text)
    
    # Count words.
    counts = Counter(text)
    
    # Create a pandas df from the Counter dictionary.
    df_counts = pd.DataFrame.from_dict(counts, orient='index')
    df_counts = df_counts.rename(columns={0:'counts'})
    
    # Return a df with the most frequent words.
    return df_counts.sort_values(by='counts', ascending=False).head(max_words).sort_values(by='counts')

# Get the 50 most frequent words.
df_fake_counts = frequent_words(label='Fake', max_words=50)
df_true_counts = frequent_words(label='True', max_words=50)

# Plot horizontal bar charts.
fig = make_subplots(rows=1, cols=2, subplot_titles=("Fake News", "True News"))

fig.add_trace(go.Bar(x=df_fake_counts.counts.tolist(),
                     y=df_fake_counts.index.values.tolist(),
                     orientation='h', opacity=0.6), 1, 1)

fig.add_trace(go.Bar(x=df_true_counts.counts.tolist(),
                     y=df_true_counts.index.values.tolist(),
                     orientation='h', opacity=0.6), 1, 2)

fig.update_layout(height=900, width=900, title_text="Most Frequent Words", showlegend=False)
fig.update_xaxes(showgrid=False, title_text=None)
fig.update_yaxes(showgrid=False, title_text=None)
fig.show()

**Inference**

- Together, the word clouds and the bar charts gave us an idea of the most frequent unigrams and bigrams. We could extend the analysis to trigrams or even higher-order n-grams, but the purpose of this study is to build a classification model.
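
For reference, counting higher-order n-grams needs very little extra code. The sketch below is an illustration only: `top_ngrams` is a hypothetical helper and the token list is made up to look like stemmed output.

```python
from collections import Counter

def top_ngrams(tokens, n, k):
    """Return the k most common n-grams in a list of tokens (hypothetical helper)."""
    # Zipping n shifted views of the list yields every run of n consecutive tokens.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(' '.join(gram) for gram in ngrams).most_common(k)

# Made-up tokens that resemble stemmed words.
tokens = ['white', 'hous', 'said', 'white', 'hous', 'offici', 'said', 'white', 'hous']

bigrams = top_ngrams(tokens, n=2, k=2)
trigrams = top_ngrams(tokens, n=3, k=1)
print(bigrams)
print(trigrams)
```

When applied to a whole corpus, the helper should be called per article and the counts summed, so that n-grams do not span the boundary between two articles.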

[Go to the top](#top)

## Next Notebook <a id='next'></a>

In the next [notebook](https://github.com/gtraskas/fake-news-detector/blob/master/fake-news-detector.ipynb), we will use these datasets to train a complete fake news classifier. We will extract meaningful features from the text, which we will use to train and deploy a classification model in an AWS SageMaker notebook instance.