# Project 1: Scraping a Facebook Page

*Note: This project is based on Max Woolf's tutorial (http://minimaxir.com/2015/07/facebook-scraper/) and his GitHub documentation (https://github.com/minimaxir/facebook-page-post-scraper/blob/master/examples/how_to_build_facebook_scraper.ipynb). Do check out his other works!*

For this project, we are using Facebook's Graph API v2.9. To access the API, please register here: https://developers.facebook.com/.

In [1]:
# Import Python libraries

import urllib2
import json
import datetime
import csv
import time

Accessing Facebook page data requires an access token.

Since a user access token expires within an hour, we create a dummy application for the sole purpose of scraping and use the app ID and app secret generated there, neither of which expires.

In [2]:
app_id = "1940006869546648"
app_secret = "7b27f0f69664660892edf6ee9fea0f8b"

access_token = app_id + "|" + app_secret
print access_token

1940006869546648|7b27f0f69664660892edf6ee9fea0f8b
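
The token is just the app ID and app secret joined by a pipe character. As a minimal sketch, the same string could be built from environment variables so the secret stays out of the notebook. The variable names `FB_APP_ID` and `FB_APP_SECRET` and the helper `build_app_access_token` are hypothetical names chosen for this example, not something Facebook defines.

```python
import os


def build_app_access_token(environ=os.environ):
    """Return "<app_id>|<app_secret>" built from environment variables.

    FB_APP_ID and FB_APP_SECRET are hypothetical variable names used
    only for this sketch.
    """
    app_id = environ.get("FB_APP_ID")
    app_secret = environ.get("FB_APP_SECRET")
    if not app_id or not app_secret:
        raise ValueError("Set FB_APP_ID and FB_APP_SECRET first")
    # Same format as the cell above: app ID, a pipe, then the app secret
    return app_id + "|" + app_secret
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.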


Now we can access public Facebook data without the token expiring. I'm going to analyze Jokowi's page, because I'm interested in seeing how his popularity has risen or declined since the beginning of his political career.

In [3]:
page_id = 'Jokowi'

First, we need to write a function to ping Jokowi's Facebook page to verify that the access_token works and the page_id is valid.

In [4]:
def testFacebookPageData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.9"
    node = "/" + page_id
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    print url
    
    # retrieve data
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = json.loads(response.read())
    
    print json.dumps(data, indent=4, sort_keys=True)

testFacebookPageData(page_id, access_token)

https://graph.facebook.com/v2.9/Jokowi/?access_token=1940006869546648|7b27f0f69664660892edf6ee9fea0f8b
{
    "id": "390581294464059", 
    "name": "Presiden Joko Widodo"
}


When scraping large amounts of data from public APIs, there's a high probability that you'll hit an HTTP Error 500 (Internal Server Error) at some point. There is no way to avoid that on our end.

Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code, so it kills two birds with one stone.

In [5]:
def request_until_succeed(url):
    req = urllib2.Request(url)
    success = False
    while success is False:
        try: 
            response = urllib2.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception, e:
            # log the failure first, then wait a few seconds before retrying
            print e
            print "Error for URL %s: %s" % (url, datetime.datetime.now())
            time.sleep(5)

    return response.read()
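
The helper above retries forever with a fixed five-second pause. As a hedged alternative, here is a sketch that gives up after a fixed number of attempts and doubles the wait each time. The name `request_with_retries` and its parameters are assumptions made for this example; `fetch` stands in for whatever function actually performs the HTTP request.

```python
import time


def request_with_retries(fetch, url, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    fetch is any callable that returns the response body and raises an
    exception on failure. The wait doubles after every failed attempt.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * (2 ** attempt))
```

Taking `sleep` as a parameter is a design choice for this sketch: it lets you swap in a stub when experimenting, so nothing actually waits.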

So far, the data returned is only the Facebook Page metadata. To get the posts themselves, we need to change the endpoint to /feed.

In [6]:
def testFacebookPageFeedData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.9"
    node = "/" + page_id + "/feed" # changed
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url))
    
    print json.dumps(data, indent=4, sort_keys=True)
    

testFacebookPageFeedData(page_id, access_token)

{
    "data": [
        {
            "created_time": "2017-05-08T08:04:11+0000", 
            "id": "390581294464059_698944980294354", 
            "message": "Mari bersaing dan berlomba-lomba untuk kemajuan bangsa...."
        }, 
        {
            "created_time": "2017-05-08T02:00:01+0000", 
            "id": "390581294464059_698848546970664", 
            "message": "Sebuah kehormatan dari lembaga adat Tanah Bumbu, Kalimantan Selatan, tersampir di pundak saya kini: gelar Kapiteng Lau Pulo.\n\nDisematkan pada Puncak Budaya Maritim Pesta Laut Mappanretasi di Pantai Pagatan, kemarin, gelar ini mengandung harapan untuk menjaga kedaulatan laut dan pulaunya. Pesta Adat Mappanretasi di Kabupaten Tanah Bumbu ini menjadi bukti bahwa jati diri kita, karakter kita, budaya kita adalah kodrat dari bangsa dan negara kita Indonesia, yaitu bangsa maritim.\n\nKita telah lama memunggungi lautan, padahal kekayaan kita ada di laut. Bahkan diperkirakan sumber daya alam laut Indonesia memiliki poten

In v2.9, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation the user will request the necessary fields.

We don't need data on every status yet. Let's reduce the requested fields to exactly what we need, and the number of stories returned to 1 so we can process it.

In [7]:
def getFacebookPageFeedData(page_id, access_token, num_statuses):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.9"
    node = "/" + page_id + "/feed" 
    parameters = "/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares,reactions.limit(1).summary(true)&limit=%s&access_token=%s" % (num_statuses, access_token) # changed
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url))
    
    return data
    

test_status = getFacebookPageFeedData(page_id, access_token, 1)["data"][0]
print json.dumps(test_status, indent=4, sort_keys=True)

{
    "comments": {
        "data": [
            {
                "created_time": "2017-05-08T08:04:43+0000", 
                "from": {
                    "id": "2014035415493059", 
                    "name": "Zulkifli"
                }, 
                "id": "698944980294354_698945110294341", 
                "message": "Cool pakde \ud83d\ude0d"
            }
        ], 
        "paging": {
            "cursors": {
                "after": "WTI5dGJXVnVkRjlqZAFhKemIzSTZAOams0T1RRMU1URXdNamswTXpReE9qRTBPVFF5TXpBMk9EUT0ZD", 
                "before": "WTI5dGJXVnVkRjlqZAFhKemIzSTZAOams0T1RRMU1URXdNamswTXpReE9qRTBPVFF5TXpBMk9EUT0ZD"
            }, 
            "next": "https://graph.facebook.com/v2.9/390581294464059_698944980294354/comments?access_token=1940006869546648%7C7b27f0f69664660892edf6ee9fea0f8b&summary=true&limit=1&after=WTI5dGJXVnVkRjlqZAFhKemIzSTZAOams0T1RRMU1URXdNamswTXpReE9qRTBPVFF5TXpBMk9EUT0ZD"
        }, 
        "summary": {
            "can_comment": false, 
     

Now that we have a sample Facebook page status, we can write a function to process each field individually.

In [21]:
def processFacebookPageFeedStatus(status):
    
    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.
    
    # Additionally, some items may not always exist,
    # so we must check for existence first
    
    status_id = status['id']
    status_type = status['type']
    
    status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')
    link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')
    status_link = '' if 'link' not in status.keys() else status['link']
    
    
    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.
    
    status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    status_published = status_published + datetime.timedelta(hours=-5) # EST
    status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs
    
    # Nested items require chaining dictionary keys.
    num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']
    num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']
    num_reactions = 0 if 'reactions' not in status.keys() else status['reactions']['summary']['total_count']

    # return a tuple of all processed data
    return (status_id, status_message, link_name, status_type, status_link,
           status_published, num_comments, num_shares, num_reactions)

processed_test_status = processFacebookPageFeedStatus(test_status)
print processed_test_status

(u'390581294464059_698944980294354', 'Mari bersaing dan berlomba-lomba untuk kemajuan bangsa....', 'Timeline Photos', u'photo', u'https://www.facebook.com/Jokowi/photos/a.395824410606414.1073741848.390581294464059/698944980294354/?type=3', '2017-05-08 03:04:11', 716, 1506, 22757)
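
The cell above shifts every timestamp by a fixed five hours to get EST. For an Indonesian page, Western Indonesia Time (UTC+7) may be a more natural choice. The sketch below makes the offset a parameter; `convert_created_time` is a hypothetical helper name, and it assumes a fixed offset with no daylight saving time.

```python
import datetime


def convert_created_time(created_time, utc_offset_hours):
    """Convert a Graph API created_time string (UTC) to a local time string.

    utc_offset_hours is a fixed offset, e.g. -5 for EST or 7 for WIB.
    """
    utc_time = datetime.datetime.strptime(
        created_time, '%Y-%m-%dT%H:%M:%S+0000')
    local_time = utc_time + datetime.timedelta(hours=utc_offset_hours)
    # Same spreadsheet-friendly format as processFacebookPageFeedStatus
    return local_time.strftime('%Y-%m-%d %H:%M:%S')
```

With an offset of -5, the test status's `2017-05-08T08:04:11+0000` becomes `2017-05-08 03:04:11`, matching the output above; with an offset of 7 it becomes `2017-05-08 15:04:11`.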


Before writing our scrape function, we need a helper function that constructs the URL string we'll use to query Facebook Page statuses.

In [22]:
def getFacebookPageFeedUrl(base_url):

    # Construct the URL string; see http://stackoverflow.com/a/37239851 for
    # Reactions parameters
    fields = "&fields=message,link,created_time,type,name,id," + \
        "comments.limit(0).summary(true),shares,reactions.limit(0).summary(true)"
        
    return base_url + fields
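
The helper above builds the query string by plain concatenation. As a sketch of an alternative, the same URL can be assembled with `urlencode`, which percent-encodes characters such as `|`, commas and parentheses. The function name `build_feed_url` is hypothetical, and the sketch assumes the Graph API decodes the encoded values the same way as the raw ones.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_feed_url(page_id, access_token, limit=100, after=None):
    """Build the /posts query URL with percent-encoded parameters."""
    base = "https://graph.facebook.com/v2.9"
    fields = ("message,link,created_time,type,name,id,"
              "comments.limit(0).summary(true),shares,"
              "reactions.limit(0).summary(true)")
    # A list of pairs keeps the parameter order stable
    params = [("limit", limit),
              ("access_token", access_token),
              ("fields", fields)]
    if after:
        params.append(("after", after))
    return "{}/{}/posts?{}".format(base, page_id, urlencode(params))
```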

We need another helper function to process Reactions metadata. The code is taken from https://github.com/minimaxir/facebook-page-post-scraper/blob/master/get_fb_posts_fb_page.py, another work by Max Woolf. Told you this dude is cool!

In [23]:
def getReactionsForStatuses(base_url):

    reaction_types = ['like', 'love', 'wow', 'haha', 'sad', 'angry']
    reactions_dict = {}   # dict of {status_id: tuple<6>}

    for reaction_type in reaction_types:
        fields = "&fields=reactions.type({}).limit(0).summary(total_count)".format(
            reaction_type.upper())

        url = base_url + fields

        data = json.loads(request_until_succeed(url))['data']

        data_processed = set()  # set() removes rare duplicates in statuses
        for status in data:
            id = status['id']
            count = status['reactions']['summary']['total_count']
            data_processed.add((id, count))

        for id, count in data_processed:
            if id in reactions_dict:
                reactions_dict[id] = reactions_dict[id] + (count,)
            else:
                reactions_dict[id] = (count,)

    return reactions_dict
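
The function above appends one count per reaction type to each status's tuple, so the tuple only lines up with the CSV columns if every status appears in all six responses. As a sketch of a more defensive variant, the counts can be placed by position with a default of zero. `merge_reaction_counts` and its input shape are assumptions made for this example; it works on already-fetched data and makes no network calls.

```python
REACTION_TYPES = ['like', 'love', 'wow', 'haha', 'sad', 'angry']


def merge_reaction_counts(counts_by_type):
    """Merge per-type counts into one six-element tuple per status.

    counts_by_type maps a reaction type to a list of (status_id, count)
    pairs. Missing counts default to 0, so every tuple has six entries
    in REACTION_TYPES order.
    """
    merged = {}
    for position, reaction_type in enumerate(REACTION_TYPES):
        for status_id, count in counts_by_type.get(reaction_type, []):
            row = merged.setdefault(status_id, [0] * len(REACTION_TYPES))
            row[position] = count
    return {status_id: tuple(row) for status_id, row in merged.items()}
```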

Surprisingly, we're almost done! Now we just need to:

1. Query each page of Facebook Page statuses (100 statuses per page) using getFacebookPageFeedUrl.
2. Process all statuses on that page using processFacebookPageFeedStatus and write the output to a CSV file.
3. Navigate to the next page, and repeat until there are no more statuses.

The next block implements both the CSV writing and the page navigation.


In [24]:
def scrapeFacebookPageFeedStatus(page_id, access_token):
    with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
           "status_published", "num_comments", "num_shares", "num_reactions", "num_likes",
           "num_loves", "num_wows", "num_hahas", "num_sads", "num_angrys"])
        
        has_next_page = True
        num_processed = 0   # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()
        
        after = ''
        base = "https://graph.facebook.com/v2.9"
        node = "/{}/posts".format(page_id)
        parameters = "/?limit={}&access_token={}".format(100, access_token)

        print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)
        
        
        while has_next_page:
            after = '' if after == '' else "&after={}".format(after)
            base_url = base + node + parameters + after

            url = getFacebookPageFeedUrl(base_url)
            statuses = json.loads(request_until_succeed(url))
            reactions = getReactionsForStatuses(base_url)

            for status in statuses['data']:

                # Ensure it is a status with the expected metadata
                if 'reactions' in status:
                    status_data = processFacebookPageFeedStatus(status)
                    reactions_data = reactions[status_data[0]]
                    w.writerow(status_data + reactions_data)

                num_processed += 1
                if num_processed % 100 == 0:
                    print("{} Statuses Processed: {}".format(num_processed, datetime.datetime.now()))

            # if there is no next page, we're done.
            if 'paging' in statuses:
                after = statuses['paging']['cursors']['after']
            else:
                has_next_page = False
                print "\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime)
        
        
if __name__ == '__main__':
    scrapeFacebookPageFeedStatus(page_id, access_token)

Scraping Jokowi Facebook Page: 2017-05-08 21:35:10.813113

100 Statuses Processed: 2017-05-08 21:35:14.209514
200 Statuses Processed: 2017-05-08 21:35:18.020457
300 Statuses Processed: 2017-05-08 21:35:23.839714
400 Statuses Processed: 2017-05-08 21:35:29.960010
500 Statuses Processed: 2017-05-08 21:35:34.705224
600 Statuses Processed: 2017-05-08 21:35:38.842685

Done!
639 Statuses Processed in 0:00:32.253749
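
The page navigation in the scrape function can also be sketched on its own, separate from the CSV writing. In the generator below, `fetch_page` is a hypothetical callable that takes the current `after` cursor (or `None` for the first page) and returns the decoded JSON dictionary. As an assumption of this sketch, paging stops when a page has no `after` cursor or no data.

```python
def iterate_pages(fetch_page):
    """Yield each page of results, following 'after' cursors.

    fetch_page(after) is a hypothetical callable returning a decoded
    JSON dictionary for the page that starts at the given cursor.
    """
    after = None
    while True:
        page = fetch_page(after)
        yield page
        cursors = page.get('paging', {}).get('cursors', {})
        next_after = cursors.get('after')
        # Assumed stop condition: no further cursor, or an empty page
        if not next_after or not page.get('data'):
            break
        after = next_after
```

Because the network call is passed in, the loop can be tried against a small dictionary of fake pages before pointing it at the real API.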
