# Fakebook
### by Bea Adajar, Gab Barbudo, and Irene Bermejo

# Problem

Recently, the propagation of fake news through social media has become a rampant issue all around the world. In the Philippines, it manifests itself through the spread of unverified reports through various Facebook pages and users, even from popular personalities.

It is therefore important to be aware that fake news is being spread, and to know which characteristics to look out for when spotting fake news articles and pages.


# Research Questions:

To answer the problem, we have come up with the following research questions:

<ol>
    <li> What are the top fake news articles of 2017 published on Facebook, based on the number of shares and reactions? </li>
    <li>What are the most common topics of these fake news articles?</li>
    <li>How do people respond to (react to, comment on, or share) these fake news articles?</li>
    <li>Are there common trends and behaviours by Facebook pages that propagate fake news?</li>
</ol>


# Dataset

## I. Description

The dataset will consist of fake news articles published on Facebook in 2017. These posts were obtained through Facebook’s Graph API. 

## II. Columns 

+ ```(str) name```: Name of the page that originally created the post; only appears when the post is shared by the current page
+ ```(datetime) created_time```: Time post was created
+ ```(str) message```:  Message of the post
+ ```(int) like```: Number of like reactions for the post
+ ```(int) love```: Number of love reactions for the post
+ ```(int) wow```: Number of wow reactions for the post
+ ```(int) haha```: Number of haha reactions for the post
+ ```(int) sad```: Number of sad reactions for the post
+ ```(int) angry```: Number of angry reactions for the post
+ ```(int) thankful```: Number of thankful reactions for the post
+ ```(int) total_reacts```: Total number of reactions for the post
+ ```(int) comments```: Number of comments on the post
+ ```(int) shares```: Number of shares of the post
+ ```(str) id```: The unique identifier of a Facebook post
+ ```(str) page_id```: The unique identifier of the Facebook page

## III. Scope of the dataset

The scope of the dataset covers all 2017 posts from 11 Facebook pages "carrying fake or unverified contents," based on the CBCP Pastoral Guidelines on the Use of Social Media issued on January 30. The said pages are also included in the [CBCP’s list of fake news sites](https://www.rappler.com/nation/173832-cbcp-list-websites-fake-news):

+ ClassifiedTrends
+ i.am.filipino
+ DuterteNewsInfo
+ FilipiNewsPH
+ benign0
+ NetizenPHOfficial
+ okd2ads
+ PinoyViralNewsPH
+ PinoyWorld.co
+ PublicTrendingOfficial
+ TheVolatilian


# Data Preparation: FB Fake News Scraper (via Graph API)

## I. Data Gathering

First, we gathered the data using Facebook's Graph API, which is used for querying information from Facebook users and groups. These are the steps we took to obtain the necessary information.

In [1]:
# import necessary libraries
import datetime
import json
import pandas as pd
import requests
import time
from tqdm import tqdm
import urllib.request as urllib

To be able to use the API, an access token is needed. We used an app access token, which should not be shared with anyone; thus, the value of ```access_token``` indicated below was removed for security purposes.

```page_ids``` is an array containing the page IDs of all the Facebook pages to be scraped. These sites were selected from <a href='https://www.rappler.com/nation/173832-cbcp-list-websites-fake-news'>CBCP's list of fake news sites</a>, and filtered to include only those with Facebook pages that could be accessed via the Graph API.

In [2]:
access_token = '######'    # do not share access token with anyone!!
page_ids = ['ClassifiedTrends','i.am.filipino','Dutertenewsinfo','FilipiNewsPH','benign0','NetizenPHOfficial','okd2ads','pinoyviralnewsph','PinoyWorld.co','PublicTrendingOfficial','TheVolatilian']

Next, we defined the functions used for scraping the necessary data.

```request_until_succeed``` is a helper function that retries failed requests. This is necessary because scraping large amounts of data at once tends to trigger errors such as HTTP Error 400, which would otherwise interrupt the scraper. The helper catches any error, pauses for 5 seconds, then tries again until the URL is successfully opened.
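Because the helper retries indefinitely, a request that can never succeed (for example, one with an invalid token) would loop forever. As a minimal sketch, and not the version used for this dataset, a bounded variant could look like the following. The name `request_with_retries` and the `max_attempts`, `opener`, and `sleep` parameters are hypothetical; they exist only to make the sketch configurable and easy to try without network access.

```python
import time
import urllib.request


def request_with_retries(url, max_attempts=5, wait_seconds=5,
                         opener=urllib.request.urlopen, sleep=time.sleep):
    """Open `url`, retrying on any error, and give up after `max_attempts`."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = opener(url)
            if response.getcode() == 200:
                return response.read()
            last_error = RuntimeError('unexpected status %s' % response.getcode())
        except Exception as e:  # same broad catch as the original helper
            last_error = e
        # wait before the next attempt, but not after the final one
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise RuntimeError('giving up on %s after %d attempts' % (url, max_attempts)) from last_error
```

Raising after a fixed number of attempts lets the calling loop log the failure and move on to the next page instead of stalling.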

```get_feed_data``` is the main function. Given a page ID and access token, it returns the following data:
+ ```name```: Name of the page that originally created the post; only appears when the post is shared by the current page
+ ```created_time```: Time post was created
+ ```message```:  Message of the post
+ ```like```: Number of like reactions for the post
+ ```love```: Number of love reactions for the post
+ ```wow```: Number of wow reactions for the post
+ ```haha```: Number of haha reactions for the post
+ ```sad```: Number of sad reactions for the post
+ ```angry```: Number of angry reactions for the post
+ ```thankful```: Number of thankful reactions for the post
+ ```total_reacts```: Total number of reactions for the post
+ ```comments```: Number of comments on the post
+ ```shares```: Number of shares of the post
+ ```id```: The unique identifier of a Facebook post
+ ```page_id```: The unique identifier of the Facebook page

Only posts from 2017 are returned.
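The year-based pagination can be sketched independently of the API. The version below is an illustration under stated assumptions, not the code used for the dataset: it assumes the feed is returned newest first, parses `created_time` in the timestamp format shown in this notebook, and takes a hypothetical `fetch_next` callable that returns the next page for a given `next` URL.

```python
from datetime import datetime


def post_year(post):
    """Year of a post, parsed from a timestamp such as 2017-12-24T20:29:19+0000."""
    return datetime.strptime(post['created_time'], '%Y-%m-%dT%H:%M:%S%z').year


def collect_year(first_page, fetch_next, year=2017):
    """Walk pages (assumed newest first) and keep only posts from `year`."""
    kept = []
    page = first_page
    while page.get('data'):
        kept += [p for p in page['data'] if post_year(p) == year]
        # stop once an entire page is older than the target year
        if all(post_year(p) < year for p in page['data']):
            break
        next_url = page.get('paging', {}).get('next')
        if not next_url:
            break
        page = fetch_next(next_url)
    return kept
```

Comparing years numerically means the walk also stops if a page's posts jump straight from 2017 to a year earlier than 2016, and an empty page ends the loop instead of raising an error.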

For now, we've excluded scraping for the names of users and pages who have shared a post. Currently, the node for that request in Graph API is unreliable, so we may do it only once we've determined the top fake news posts.

Note that some cleaning is required to produce the desired output format. Here are the functions and the returned data before cleaning and formatting are applied:

In [3]:
# helper function that retries on any request error (e.g. HTTP Error 400, Bad Request)
def request_until_succeed(url):
    success = False
    while not success:
        try:
            response = urllib.urlopen(url)
            
            if response.getcode() == 200:
                success = True
        except Exception as e:
            print(e)
            time.sleep(5)
            print('Error for URL %s: %s' % (url, datetime.datetime.now()))
        
    return response.read()


# returns information about posts on a page
def get_feed_data(page_id, access_token):
    # create url
    base = 'https://graph.facebook.com/v2.12'
    node = '/' + page_id + '/feed'
    params = '/?fields=name,created_time,message,reactions.type(LIKE).limit(0).summary(total_count).as(like),reactions.type(LOVE).limit(0).summary(total_count).as(love),reactions.type(WOW).limit(0).summary(total_count).as(wow),reactions.type(HAHA).limit(0).summary(total_count).as(haha),reactions.type(SAD).limit(0).summary(total_count).as(sad),reactions.type(ANGRY).limit(0).summary(total_count).as(angry),reactions.type(THANKFUL).limit(0).summary(total_count).as(thankful),reactions.limit(0).summary(total_count).as(total_reacts),comments.limit(0).summary(true),shares&access_token=%s' % (access_token)
    url = base + node + params
    
    # retrieve data
    feed_data = json.loads(request_until_succeed(url))
    data = []
    
    # keep paginating until all 2017 posts have been scraped
    # note: some posts from 2016 may be included as well
    while feed_data['data'][0]['created_time'][:4]!='2016':
        try:
            data += feed_data['data']
            feed_data = requests.get(feed_data['paging']['next']).json()
        except KeyError:
            break
    
    # final filtering for 2017 posts only
    data = [post for post in data if post['created_time'][:4] == '2017']

    print('Number of posts scraped:', len(data))
    return data


# only print the first result
print(json.dumps(get_feed_data(page_ids[0], access_token)[0], indent=4, ensure_ascii=False))

Number of posts scraped: 278
{
    "name": "Poor Old Man Accidentally Scratches Luxurious Car, Leaves Note That Touches Everyone’s Heart",
    "created_time": "2017-12-24T20:29:19+0000",
    "like": {
        "data": [],
        "summary": {
            "total_count": 52
        }
    },
    "love": {
        "data": [],
        "summary": {
            "total_count": 2
        }
    },
    "wow": {
        "data": [],
        "summary": {
            "total_count": 0
        }
    },
    "haha": {
        "data": [],
        "summary": {
            "total_count": 0
        }
    },
    "sad": {
        "data": [],
        "summary": {
            "total_count": 2
        }
    },
    "angry": {
        "data": [],
        "summary": {
            "total_count": 0
        }
    },
    "thankful": {
        "data": [],
        "summary": {
            "total_count": 0
        }
    },
    "total_reacts": {
        "data": [],
        "summary": {
            "total_count": 56
        }

## II. Data Cleaning

Much of the data is unnecessarily nested, and not all fields are required. We therefore added code to drop what is not needed and to flatten the output format.

Below are the new functions:

In [4]:
# helper function that retries on any request error (e.g. HTTP Error 400, Bad Request)
def request_until_succeed(url):
    success = False
    while not success:
        try:
            response = urllib.urlopen(url)
            
            if response.getcode() == 200:
                success = True
        except Exception as e:
            print(e)
            time.sleep(5)
            print('Error for URL %s: %s' % (url, datetime.datetime.now()))
        
    return response.read()


# returns information about posts on a page
def get_feed_data(page_id, access_token):
    # create url
    base = 'https://graph.facebook.com/v2.12'
    node = '/' + page_id + '/feed'
    params = '/?fields=name,created_time,message,reactions.type(LIKE).limit(0).summary(total_count).as(like),reactions.type(LOVE).limit(0).summary(total_count).as(love),reactions.type(WOW).limit(0).summary(total_count).as(wow),reactions.type(HAHA).limit(0).summary(total_count).as(haha),reactions.type(SAD).limit(0).summary(total_count).as(sad),reactions.type(ANGRY).limit(0).summary(total_count).as(angry),reactions.type(THANKFUL).limit(0).summary(total_count).as(thankful),reactions.limit(0).summary(total_count).as(total_reacts),comments.limit(0).summary(true),shares&access_token=%s' % (access_token)
    url = base + node + params
    
    # retrieve data
    feed_data = json.loads(request_until_succeed(url))
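    data = []
    # caution: the loop below appends this first page again,
    # so 2017 posts on the first page appear twice in the result
    data += feed_data['data']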
    
    # keep paginating until all 2017 posts have been scraped
    while feed_data['data'][0]['created_time'][:4]!='2016':
        try:
            data += feed_data['data']
            feed_data = requests.get(feed_data['paging']['next']).json()
        except KeyError:
            break
    
    # final filtering for 2017 posts
    data = [post for post in data if post['created_time'][:4] == '2017']
    
    # ADDITIONAL CODE NOT IN PREVIOUS CELL
    # fix output format
    for x in data:
        try:
            # add page_id
            x['page_id'] = page_id
            # remove unnecessary data
            x['like'] = x['like']['summary']['total_count']
            x['love'] = x['love']['summary']['total_count']
            x['wow'] = x['wow']['summary']['total_count']
            x['haha'] = x['haha']['summary']['total_count']
            x['sad'] = x['sad']['summary']['total_count']
            x['angry'] = x['angry']['summary']['total_count']
            x['thankful'] = x['thankful']['summary']['total_count']
            x['total_reacts'] = x['total_reacts']['summary']['total_count']
            x['comments'] = x['comments']['summary']['total_count']
            x['shares'] = x['shares']['count']
        except Exception:
            # if attribute does not exist, do nothing
            pass
    
    print('Number of posts scraped:', len(data))
    return data


# only print the first result
print(json.dumps(get_feed_data(page_ids[0], access_token)[0], indent=4, ensure_ascii=False))

Number of posts scraped: 302
{
    "name": "Poor Old Man Accidentally Scratches Luxurious Car, Leaves Note That Touches Everyone’s Heart",
    "created_time": "2017-12-24T20:29:19+0000",
    "like": 52,
    "love": 2,
    "wow": 0,
    "haha": 0,
    "sad": 2,
    "angry": 0,
    "thankful": 0,
    "total_reacts": 56,
    "comments": 1,
    "shares": 4,
    "id": "979290192127575_1645818285474759",
    "page_id": "ClassifiedTrends"
}
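In the cell above, a single `try` block covers every field, so the first missing field on a post stops the remaining conversions for that post. As a hedged alternative sketch (not the code that produced the dataset), each field can be flattened independently. The helper name `flatten_post` and the choice to default missing counts to 0 are assumptions made for illustration.

```python
COUNT_FIELDS = ['like', 'love', 'wow', 'haha', 'sad', 'angry',
                'thankful', 'total_reacts', 'comments']


def flatten_post(post, page_id):
    """Return a copy of a raw post with nested counts replaced by plain integers."""
    flat = dict(post)
    flat['page_id'] = page_id
    for field in COUNT_FIELDS:
        # each count sits under <field>.summary.total_count in the raw response
        flat[field] = post.get(field, {}).get('summary', {}).get('total_count', 0)
    # 'shares' may be absent from a post; treating that as 0 is an assumption
    flat['shares'] = post.get('shares', {}).get('count', 0)
    return flat
```

Returning a copy leaves the raw response untouched, which makes it easier to compare the nested and flattened forms while debugging.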


We then applied the function to all the Facebook pages in the ```page_ids``` array. Afterwards, we dumped the data into a JSON file.

In [5]:
fb_data = []

for page in tqdm(page_ids):
    try:
        # get_feed_data already tags each post with its page_id
        temp = get_feed_data(page, access_token)

        print('Scraped:', page)
        fb_data += temp
    except Exception as e:
        print('Error:', e)

with open('data.json', 'w', encoding='utf-8') as outfile:
    json.dump(fb_data, outfile, ensure_ascii=False, indent=4)
    print('Successfully saved data to JSON file.')

  0%|                                                                 | 0/11 [00:00<?, ?it/s]

Number of posts scraped: 302
Scraped: ClassifiedTrends


  9%|█████▏                                                   | 1/11 [00:17<02:52, 17.22s/it]

Number of posts scraped: 981
Scraped: i.am.filipino


 18%|██████████▏                                             | 2/11 [05:35<25:11, 167.95s/it]

Number of posts scraped: 115
Scraped: Dutertenewsinfo


 27%|███████████████▎                                        | 3/11 [05:48<15:28, 116.03s/it]

Number of posts scraped: 340
Scraped: FilipiNewsPH


 36%|████████████████████▋                                    | 4/11 [06:19<11:04, 94.88s/it]

Number of posts scraped: 615
Scraped: benign0


 45%|█████████████████████████▉                               | 5/11 [07:38<09:10, 91.71s/it]

Number of posts scraped: 570
Scraped: NetizenPHOfficial


 55%|███████████████████████████████                          | 6/11 [09:48<08:10, 98.09s/it]

Number of posts scraped: 607
Scraped: okd2ads


 64%|████████████████████████████████████▎                    | 7/11 [10:54<06:13, 93.49s/it]

Number of posts scraped: 592
Scraped: pinoyviralnewsph


 73%|█████████████████████████████████████████▍               | 8/11 [12:12<04:34, 91.56s/it]

Number of posts scraped: 57
Scraped: PinoyWorld.co


 82%|██████████████████████████████████████████████▋          | 9/11 [12:18<02:44, 82.09s/it]

Number of posts scraped: 34
Scraped: PublicTrendingOfficial


 91%|██████████████████████████████████████████████████▉     | 10/11 [12:22<01:14, 74.26s/it]

Number of posts scraped: 409
Scraped: TheVolatilian


100%|████████████████████████████████████████████████████████| 11/11 [13:01<00:00, 71.05s/it]


Successfully saved data to JSON file.


Here is a preview of the JSON file as a DataFrame:

In [6]:
df = pd.read_json('data.json', encoding='utf-8')
df = df[['name','created_time','like','love','wow','haha','sad','angry','thankful','total_reacts','comments','shares','id','page_id']]
df = df.rename(columns={'id':'post_id'})
df.tail()

|  | name | created_time | like | love | wow | haha | sad | angry | thankful | total_reacts | comments | shares | post_id | page_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4617 | Investment advice: ignore the doomsayers - The... | 2017-01-01 23:07:42 | 439 | 15 | 3 | 4 | 0 | 0 | 0 | 461 | 10 | 69.0 | 2.935174e+29 | TheVolatilian |
| 4618 | Big actors waiting in the wings - The Volatilian™ | 2017-01-01 23:05:09 | 558 | 3 | 2 | 0 | 0 | 0 | 0 | 563 | 14 | 27.0 | 2.935174e+29 | TheVolatilian |
| 4619 | The Philippines’ Web-less masses - The Volatil... | 2017-01-01 23:03:06 | 522 | 3 | 2 | 1 | 1 | 0 | 0 | 529 | 8 | 23.0 | 2.935174e+29 | TheVolatilian |
| 4620 | Philippine trade opportunities from Brexit - T... | 2017-01-01 23:00:34 | 426 | 11 | 13 | 0 | 0 | 0 | 0 | 450 | 5 | 40.0 | 2.935174e+29 | TheVolatilian |
| 4621 | Plan to get ship-shape and competitive - The V... | 2017-01-01 22:57:38 | 536 | 4 | 2 | 2 | 0 | 0 | 0 | 544 | 6 | 23.0 | 2.935174e+29 | TheVolatilian |
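The `post_id` values in the preview print as `2.935174e+29`, which suggests the IDs were parsed as floating-point numbers rather than kept as text. A likely cause, stated here as an assumption rather than something verified against the original environment, is that Python's float conversion accepts underscores as digit separators, so an ID such as the one in the sample output can be read as one very large number. The sketch below uses only the standard library to show why that conversion loses information.

```python
post_id = '979290192127575_1645818285474759'  # ID from the sample output above

as_float = float(post_id)               # underscores are accepted as digit separators
exact = int(post_id.replace('_', ''))   # the same digits as an exact integer

print(as_float)
print(int(as_float) == exact)           # False: a float cannot hold all 31 digits
```

When loading the file with pandas, one way to avoid this is to request a string column explicitly, for example `pd.read_json('data.json', dtype={'id': str})`, so that post IDs stay usable as identifiers.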


# Description of Data Visualization Process

## I. Word Bar Graphs and Pie Charts

In order to count the words, we want to get the base form of the words and remove unnecessary stop words.

### String Cleaning and Lemmatization

<ol>
<li>For each message in the data, URLs, special characters, extra white space, and numbers are removed from the message. </li>
<br />
<li> The spaCy library is then used to preprocess and lemmatize the English words. Since Tagalog is currently not supported by NLP libraries in Python, the Tagalog Words Stemmer (https://github.com/crlwingen/TagalogStemmerPython/blob/master/TglStemmer.py) is used to process the Tagalog words. It removes the affixes of each Tagalog word and returns the root word. Since the stemmer only returns the roots of Tagalog words and does not remove stop words, the Tagalog stop words are added to spaCy's stop words.
<br />
<br />
Tagalog stop words are from https://github.com/stopwords-iso/stopwords-tl/blob/master/stopwords-tl.json
<br />
<br />
</li>

<li>
Map each word to its respective list. Three separate lists are used because each list is cleaned differently, since a message may be in Tagalog, Taglish, or English. *(Words from other Filipino languages, e.g. Bisaya, are included in the Tagalog list.)*

Since word count is the only concern, the order of the words does not matter.

<br />

Lists:
<ul>
    <li> ```english``` is parsed using the standard lemma_ of spaCy's nlp. </li>
    <br />
    <li>```proper``` is also parsed using the standard lemma_ of spaCy's nlp.</li>
    <br />
        <ul>
            <li>It contains the proper nouns (words not in the English vocabulary whose first letter is uppercase, checked using .isupper) </li>
            <li>The reason for this is so that the proper nouns would not be parsed with the tagalog stemmer </li>
        </ul>
    <br />
    <li>```tagalog``` parsed using the TagalogStemmer. </li>   
        <ul> 
            <li> ```tagalog_str``` will serve as input to the TagalogStemmer. </li>
            <li>Then the TagalogStemmer will output a list of the root words. </li>
        </ul>
</ul>

</li>

</ol>
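The string-cleaning step in item 1 can be sketched with regular expressions. This is a minimal illustration, not the cleaning code used for the project: the patterns, the decision to keep only the letters A-Z plus ñ, and the function name `clean_message` are all assumptions.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
NON_LETTER_PATTERN = re.compile(r'[^A-Za-zñÑ\s]')   # digits and special characters
WHITESPACE_PATTERN = re.compile(r'\s+')


def clean_message(message):
    """Strip URLs, special characters, numbers, and extra white space from a message."""
    text = URL_PATTERN.sub(' ', message)          # remove URLs first, while they are intact
    text = NON_LETTER_PATTERN.sub(' ', text)      # then drop anything that is not a letter
    return WHITESPACE_PATTERN.sub(' ', text).strip()


print(clean_message('Breaking!!!  Visit http://example.com/a?b=1 now, 100% totoo'))
# prints: Breaking Visit now totoo
```

URLs are removed before the other characters because stripping punctuation first would break them into ordinary-looking word fragments.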

In [None]:
# Assumes `nlp` is a loaded spaCy pipeline, `messages` is the list of cleaned
# message strings, and `TglStemmer` is the imported Tagalog stemmer module.

# Instantiate bag of words
bows = []

for message in messages:
    english = []
    proper = []
    tagalog_str = ''  # This is a string because the stemmer requires string input

    # Tokenize the message; for each token that is not a stop word,
    # append it to its respective list
    for word in nlp(message):
        if len(word.text) > 2 and not word.is_stop:
            if word.text in nlp.vocab:
                english.append(word.lemma_)
            elif word.text[0].isupper():
                proper.append(word.lemma_)
            else:
                tagalog_str += word.text + ' '

    try:
        tagalog = TglStemmer.stemmer('2', tagalog_str, '2')[1]
    except Exception:
        # Skip the whole message if the stemmer fails (e.g. when tagalog_str is empty)
        continue

    # Append the list of all the clean words to the bag of words
    bows.append(english + proper + tagalog)

# bows is then stored to a JSON file and a text file (bows.json and bows.txt, respectively)

### Bar Graph (Frequency of Word)

In order to get the top 10 words:  
- ```cat bows.txt | tr ' ' '\n' | sort | uniq -c | sort > freq.txt``` command is run on the terminal.

This generated the final list of top 10 words/category (joining similar words that refer to the same person/thing into one category):

- word_list = [['medium'], ['share'], ['filipinos', 'filipino'], ['country'], ['illegal'], ['robredo', 'leni'], ['jueteng', 'uete'], ['drugs', 'drug', 'droga'], ['pilipinas','philippines', 'philippine', 'philippines'],  ['duterte', 'president', 'rodrigo']]

- word_occurences = [178, 181, 184, 187, 187, 187, 304, 354, 695, 971]
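The frequency count produced by the shell pipeline above can also be sketched in Python with `collections.Counter`. This is an illustrative equivalent, not the method used to produce the numbers listed here; the helper names are hypothetical, and summing the counts of the words in a category is an assumption about how similar words were joined.

```python
from collections import Counter


def word_counts(bows):
    """Count every word across all bags of words."""
    return Counter(word for bow in bows for word in bow)


def category_totals(counts, word_list):
    """Sum the counts of the words in each category, ignoring repeated entries."""
    return [sum(counts[word] for word in set(group)) for group in word_list]


sample_bows = [['duterte', 'drug'], ['duterte', 'president'], ['drug', 'duterte']]
counts = word_counts(sample_bows)
print(counts.most_common(1))   # [('duterte', 3)]
print(category_totals(counts, [['duterte', 'president'], ['drug', 'droga']]))   # [4, 2]
```

`Counter.most_common(n)` returns the `n` most frequent words directly, which avoids sorting the whole frequency file by hand.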



### Pie Chart (No. of Occurrences)

For each of the top 10 words/categories in word_list, count the number of messages in bows.json in which it appears.


In [2]:
word_count = [0] * len(word_list)  # one counter per word/category
for bow in bows:
    for i in range(0, len(word_list)):
        for word in word_list[i]:
            if word in bow:
                word_count[i] += 1

This resulted in the final word_count = [101, 170, 155, 132, 34, 141, 16, 97, 733, 705]
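The loop above adds one to a category's counter for every listed word found in a message, so a message that contains two words from the same category (or a word that is listed twice, as 'philippines' is) contributes more than once. As a hedged sketch of a stricter count, and not the code that produced the figures above, the variant below counts each message at most once per category. The function name is hypothetical.

```python
def messages_per_category(bows, word_list):
    """Count, for each category, the messages containing at least one of its words."""
    counts = [0] * len(word_list)
    for bow in bows:
        words = set(bow)
        for i, group in enumerate(word_list):
            # a message counts once, no matter how many category words it contains
            if words.intersection(group):
                counts[i] += 1
    return counts
```

With this variant a category's count can never exceed the number of messages, which makes the resulting pie chart easier to interpret.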

## II. Scatter Plots

### Scatter Plot (All Articles)

The scatter plot contains all the fake news articles from the 11 chosen Facebook pages, plotted by number of reactions (x-axis) and shares (y-axis). This shows which topics have the highest reach.

### Bubble Chart (By Page)

The bubble chart contains the total number of reactions (x axis), shares (y axis), and articles (radius) each page has to show which page has the highest reach.
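The per-page totals behind the bubble chart can be sketched as a simple aggregation. This is an illustration under assumptions, not the project's charting code: it takes the flattened post records described earlier, treats a missing `shares` value as 0, and uses a hypothetical function name.

```python
from collections import defaultdict


def totals_by_page(posts):
    """Sum reactions and shares, and count articles, for each page_id."""
    totals = defaultdict(lambda: {'reactions': 0, 'shares': 0, 'articles': 0})
    for post in posts:
        page = totals[post['page_id']]
        page['reactions'] += post.get('total_reacts', 0)
        page['shares'] += post.get('shares', 0)   # missing shares treated as 0 (assumption)
        page['articles'] += 1
    return dict(totals)
```

Each resulting entry maps directly onto one bubble: reactions for the x-axis, shares for the y-axis, and the article count for the radius.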

## III. Reactions/Shares/Comments Bar Graph

This bar graph displays the top ten posts based on either number of reactions, shares, or comments. There is a dropdown that allows the user to toggle between criteria to see whether or not the posts that rank highest in one criterion are the same in other criteria.

The x-axis previews the title of the post while the y-axis is the number of reactions/shares/comments. Each bar has a tooltip that displays the complete title of the post, the content of the post's caption or message, and the number of reactions/shares/comments.
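Selecting the top ten posts for a chosen criterion amounts to a sort on one field. The helper below is a minimal sketch with a hypothetical name; it assumes the flattened post records and treats a missing value as 0.

```python
def top_posts(posts, criterion, n=10):
    """Return the n posts with the highest value for criterion, e.g. 'shares'."""
    return sorted(posts, key=lambda post: post.get(criterion, 0), reverse=True)[:n]
```

Calling it once each with `'total_reacts'`, `'shares'`, and `'comments'` gives the three rankings that the dropdown switches between.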

## IV. Trend Graph

The trend graph plots all posts by date and score/reach for each page, to see whether there is a trend or pattern in how the page posts.

For each post, the score/reach points are calculated with weights: likes * 0.25 + (total_reacts - likes) * 0.5 + comments * 0.75 + shares. Each score is pushed into an array to be plotted chronologically.

``` javascript
var arr = [];

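// Convert one page's posts into {date, score, title} points and append them to arr.
// Assumes moment.js is loaded for date formatting.
var convertPoints = function(data, label1, label2) {
    var points = Object.values(data).map(function(c) {
        // weighted reach score; falls back to 0 when a field is missing (NaN)
        var score = (c.like * 0.25 + (c.total_reacts - c.like) * 0.5 + c.comments * 0.75 + c.shares * 1) || 0;

        return {
            [label1]: moment(c.created_time).format('YYYY-MM-DD HH:mm:ss'),
            [label2]: score,
            title: c.name
        };
    });
    points.forEach(function(point) {
        arr.push(point);
    });
};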

convertPoints(ClassifiedTrends, "date1", "y1");
convertPoints(IAmFilipino, "date2", "y2");
convertPoints(DuterteNewsInfo, "date3", "y3");
convertPoints(FilipiNewsPH, "date4", "y4");
convertPoints(Benigno, "date5", "y5");
convertPoints(NetizenPHOfficial, "date6", "y6");
convertPoints(Okd2Ads, "date7", "y7");
convertPoints(PinoyViralNewsPH, "date8", "y8");
convertPoints(PinoyWorld, "date9", "y9");
convertPoints(PublicTrendingOfficial, "date10", "y10");
convertPoints(TheVolatilian, "date11", "y11");

```
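As a worked example of the weighting, the sketch below recomputes the score in Python for the sample ClassifiedTrends post shown earlier (52 likes, 56 total reactions, 1 comment, 4 shares). The function name is hypothetical; returning 0 when a field is missing mirrors the `|| 0` fallback in the JavaScript.

```python
def reach_score(post):
    """Weighted reach: likes 0.25, other reactions 0.5, comments 0.75, shares 1."""
    try:
        like = post['like']
        other_reacts = post['total_reacts'] - like
        return like * 0.25 + other_reacts * 0.5 + post['comments'] * 0.75 + post['shares'] * 1
    except KeyError:
        # a missing field yields 0, as in the JavaScript version
        return 0.0


sample = {'like': 52, 'total_reacts': 56, 'comments': 1, 'shares': 4}
print(reach_score(sample))   # 13 + 2 + 0.75 + 4 = 19.75
```

The weights make a share worth four likes, so a post with few reactions but many shares can still rank above a heavily liked one.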




# Findings

- The most common fake news topics and most frequently used words are related to politics and crime ("Duterte," "president," "Rodrigo," "jueteng," "drugs/droga").
- However, the posts with the highest reach are articles that have nothing to do with politics. In fact, they are not actually classified as "fake news" but more appropriately as "clickbait."
- The topics with the highest number of shares and reactions are clickbait. Meanwhile, the topics with the highest number of comments are related to politics.
- The common trend among pages is that most posts have steadily low reach points, with a few dominant spikes of high reach points. The posts with higher reach points are the clickbait articles.

# Critical Evaluation

- These pages could be part of a propaganda strategy. Clickbait is used to gain reach and followers, which gives the pages an avenue to flood their audience with posts on political agendas. This possibility is supported by the fact that there is higher interaction with articles that are considered clickbait, but more articles about politics.
- Although clickbait articles garner more attention through reactions and shares, the political posts generate more discussion, as they receive more comments. This means that although their reach may not be as high in terms of reactions and shares, more people are vocal about their thoughts regarding political issues.

# Problems, Issues, and Limitations

+ The dataset is limited to handpicked pages. It is difficult to personally identify fake news pages, which would involve a lot of searching through Facebook as well as establishing concrete criteria.
+ Topic analysis is limited to the most commonly used words.
+ The data that could be acquired was also limited by what the Graph API could retrieve. Scraping Facebook for data through any other means is not allowed.