# Facebook data retrieval

This small notebook was used to gather the data needed for the main part of the project.

### What kind of data do we need?

For our problem we need data about each event: how many people said they were going, how many were interested, the date the Facebook event was created, and the date of the event itself. To estimate how much promotion helps an event, it is also useful to collect all the posts related to it.

### How do we get the data?

After some (hopefully) rigorous research, we found that there is no 'easy way', such as a ready-made library, to do this. We had to use the Facebook Graph API, which meant registering our account as a developer and following several setup steps (taken from tutorials found on Google). Eventually we found that what we needed were the **feed** and **events** edges.
We also used the **tagged** and **visitor_posts** edges to gather further statistics about the page.

In [1]:
import requests
import json
import os

Useful documentation: \
https://developers.facebook.com/docs/graph-api/reference/v13.0/post \
https://developers.facebook.com/docs/graph-api/reference/event \
https://developers.facebook.com/docs/graph-api/results \
https://developers.facebook.com/docs/graph-api/reference/v13.0/page \
https://developers.facebook.com/docs/graph-api/reference/page/visitor_posts \
https://developers.facebook.com/docs/graph-api/reference/page/tagged 

In [3]:
# Create data folder if it does not exist
data_folder = "data"

try: 
    os.mkdir(data_folder) 
except OSError as error: 
    print(error)

[WinError 183] Cannot create a file when that file already exists: 'data'
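
The message above is printed on every run after the first, because the folder already exists. As a minimal sketch of an alternative (not what the notebook ran), `os.makedirs` with `exist_ok=True` creates the folder only when it is missing and stays silent otherwise. The `ensure_folder` helper and the temporary directory below are hypothetical and only keep the example self-contained.

```python
import os
import tempfile

def ensure_folder(path):
    """Create the folder if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Hypothetical location, used so the example does not touch the real data folder
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "data")
    ensure_folder(folder)
    ensure_folder(folder)  # the second call raises no error
    folder_exists = os.path.isdir(folder)
```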


In [103]:
# Placeholder value: replace it with a real access token before running
access_token = "access_token"
access_token_field = "access_token=" + access_token
# Set number of returned elements limit to maximum allowed by API
limit_token_field = "limit=100"

events_api_gate = "https://graph.facebook.com/v13.0/1581916451935422/events"
events_api_fields = "fields=id,attending_count,created_time,interested_count,name,noreply_count,place{name},start_time"
events_url = events_api_gate + "?" + events_api_fields + "&" + access_token_field + "&" + limit_token_field

feed_api_gate = "https://graph.facebook.com/v13.0/1581916451935422/feed"
feed_api_fields = "fields=id,created_time,is_popular,message,status_type,shares"
feed_url = feed_api_gate + "?" + feed_api_fields + "&" + access_token_field + "&" + limit_token_field

tagged_api_gate = "https://graph.facebook.com/v13.0/1581916451935422/tagged"
tagged_api_fields = "fields=id,message,tagged_time,from,status_type,story"
tagged_url = tagged_api_gate + "?" + tagged_api_fields + "&" + access_token_field + "&" + limit_token_field

visitor_posts_api_gate = "https://graph.facebook.com/v13.0/1581916451935422/visitor_posts"
visitor_posts_api_fields = "fields=id,message,created_time,status_type,story"
visitor_posts_url = visitor_posts_api_gate + "?" + visitor_posts_api_fields + "&" + access_token_field + "&" + limit_token_field
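
The four URLs above follow the same pattern: a page edge, a `fields` list, the access token and the page-size limit. As a rough sketch, that pattern could be wrapped in one helper. `build_edge_url` and its parameters are hypothetical names, not part of the Graph API; `urlencode` comes from the Python standard library, and `safe=",{}"` is passed so the field list keeps the same readable form as the hand-built strings.

```python
from urllib.parse import urlencode

GRAPH_ROOT = "https://graph.facebook.com/v13.0"

def build_edge_url(page_id, edge, fields, access_token, limit=100):
    """Build the URL for one edge of a page (hypothetical helper)."""
    query = urlencode(
        {"fields": ",".join(fields), "access_token": access_token, "limit": limit},
        safe=",{}",  # keep commas and braces unescaped, as in the hand-built URLs
    )
    return f"{GRAPH_ROOT}/{page_id}/{edge}?{query}"

# Same page id and placeholder token as in the cell above
example_url = build_edge_url(
    "1581916451935422", "events", ["id", "name", "place{name}"], "access_token"
)
```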

In [5]:
def read_all_entries(url):
    """
    Read all the entries from a given Facebook API url.
    The returned data can arrive either in full (if it does not exceed a certain number of entries)
    or paged (limited to a number of entries per page).
    """
    all_entries = []

    while url:
        # Parse the response body once per page
        page = requests.get(url).json()
        # Append entries on page to entries we want to store
        all_entries.extend(page['data'])
        # 'paging' or its 'next' link may be missing; in that case this was the last page
        url = page.get('paging', {}).get('next')

    return all_entries
    

def preprocess_events(events):
    """
    For all the events store only the place name, instead of a json containing place_id and place_name.
    """
    for event in events:
        # Guard against events that come back without a 'place' field
        if 'place' in event:
            event['place'] = event['place']['name']
    return events

def preprocess_feed(feed_posts):
    """
    For all the feed posts store only the share count, if available.
    """
    for feed_post in feed_posts:
        if 'shares' in feed_post:
            share_count = feed_post['shares']['count']
            feed_post['shares'] = share_count
    return feed_posts

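def write_to_file(file_name, entries, folder_name=data_folder):
    """
    Function to write json entries nicely, so they can be easily read.
    The file is created inside folder_name (the data folder by default).
    """
    # os.path.join keeps the path portable; 'with' closes the file once writing is done
    with open(os.path.join(folder_name, file_name), 'w') as output_file:
        json.dump(entries, output_file, indent=4)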
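
The paging behaviour handled by `read_all_entries` can be tried out without network access. The sketch below is a variant written for illustration, not the code that was run: `iter_entries` is a hypothetical generator that takes the page fetcher as a parameter, and `fake_pages` is made-up data that only mimics the `data` / `paging.next` layout the function above relies on.

```python
def iter_entries(url, fetch_json):
    """Yield entries page by page, following 'paging.next' links until none is left."""
    while url:
        page = fetch_json(url)
        yield from page["data"]
        # The last page carries no 'next' link, which ends the loop
        url = page.get("paging", {}).get("next")

# Made-up two-page response, keyed by URL
fake_pages = {
    "page-1": {"data": [{"id": "a"}, {"id": "b"}], "paging": {"next": "page-2"}},
    "page-2": {"data": [{"id": "c"}], "paging": {}},
}

entries = list(iter_entries("page-1", fake_pages.__getitem__))
entry_ids = [entry["id"] for entry in entries]
```

With the real API, `fetch_json` could be something like `lambda u: requests.get(u).json()`.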
    

For further reference, to get the number of likes of a Facebook 'object' (e.g. an event or a post):
https://developers.facebook.com/docs/graph-api/reference/v13.0/object/reactions

In [105]:
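def retrieve_like_count(post_id, access_token_field):
    """
    Return the number of likes for a certain post, taken from the total count of its reactions summary.
    """
    url = "https://graph.facebook.com/v13.0/" + post_id + "?fields=reactions.summary(total_count)" + "&" + access_token_field
    # Parse the response body once
    response = requests.get(url).json()
    if 'reactions' in response:
        return response['reactions']['summary']['total_count']
    else:
        # No reactions data was returned: print the response for inspection and count it as 0
        print(response)
        return 0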

def retrieve_like_counts_for_posts(feed_posts, access_token_field):
    """
    Retrieve the number of likes for the given feed_posts, using access_token_field, and add them to the feed_posts
    in a field called 'likes'.
    """
    for feed_post in feed_posts:
        number_of_likes = retrieve_like_count(feed_post['id'], access_token_field)
        feed_post['likes'] = number_of_likes
    return feed_posts
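
The lookup in `retrieve_like_count` can be checked offline against the two response shapes it has to handle. In this sketch `extract_reaction_count` is a hypothetical helper and both payloads are hand-written examples: one with a reactions summary and one with only an `id`, for which the function above falls back to 0.

```python
def extract_reaction_count(payload):
    """Return the total reaction count from a parsed response, or 0 if it is missing."""
    return payload.get("reactions", {}).get("summary", {}).get("total_count", 0)

# Hand-written payloads, not real API output
with_reactions = {"id": "1_2", "reactions": {"data": [], "summary": {"total_count": 9}}}
without_reactions = {"id": "1_3"}

counts = [
    extract_reaction_count(with_reactions),
    extract_reaction_count(without_reactions),
]
```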

In [106]:
# Gather all event data
event_entries = read_all_entries(events_url)
event_entries = preprocess_events(event_entries)
write_to_file("event_entries.json", event_entries)

In [107]:
event_entries[1:3]

[{'id': '2758405671134506',
  'attending_count': 5,
  'created_time': '2022-03-14T16:07:33+0000',
  'interested_count': 34,
  'name': 'Unscripted - Începători - Program de dezvoltare prin improvizație',
  'noreply_count': 690,
  'place': 'Centrul de Voluntariat Cluj-Napoca',
  'start_time': '2022-03-30T18:30:00+0300'},
 {'id': '875247883267346',
  'attending_count': 6,
  'created_time': '2022-03-14T16:29:47+0000',
  'interested_count': 11,
  'name': 'UnscriptED - Avansați - Program de dezvoltare prin improvizație',
  'noreply_count': 42,
  'place': 'Centrul de Voluntariat Cluj-Napoca',
  'start_time': '2022-03-28T18:30:00+0300'}]
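
Both `created_time` and `start_time` are timestamps with a UTC offset, so the time between creating an event and its start can be computed directly. A minimal sketch, assuming every timestamp uses the `+HHMM` offset format seen in the output above; `lead_time` is a hypothetical helper, and the values are copied from the first event shown.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

def lead_time(event):
    """Return the timedelta between an event's creation and its start."""
    created = datetime.strptime(event["created_time"], TIME_FORMAT)
    start = datetime.strptime(event["start_time"], TIME_FORMAT)
    # Both datetimes are timezone-aware, so differing offsets are handled correctly
    return start - created

sample_event = {
    "created_time": "2022-03-14T16:07:33+0000",
    "start_time": "2022-03-30T18:30:00+0300",
}
lead = lead_time(sample_event)
print(lead.days)  # whole days between creation and start
```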

In [108]:
# Gather all feed data
feed_entries = read_all_entries(feed_url)
feed_entries = preprocess_feed(feed_entries)
feed_entries = retrieve_like_counts_for_posts(feed_entries, access_token_field)
write_to_file('feed_entries.json', feed_entries)

{'id': '1581916451935422_2411050642255166'}


In [109]:
feed_entries[1:3]

[{'id': '1581916451935422_977906799511742',
  'created_time': '2022-04-05T18:14:02+0000',
  'is_popular': False,
  'status_type': 'created_event',
  'likes': 9},
 {'id': '1581916451935422_4774228602704175',
  'created_time': '2022-04-05T09:16:26+0000',
  'is_popular': False,
  'message': 'Stați tunați că vine și evenimentul. În curând. Nu doar la final de săptămână...',
  'status_type': 'added_photos',
  'shares': 3,
  'likes': 20}]

In [110]:
def preprocess_tagged(tagged_posts):
    """
    For all entries in tagged_posts store only 'name' field of 'from' field of the entry.
    If there is no 'from' field present in entry, assume the post is _Internal_.
    """
    for tagged_post in tagged_posts:
        if 'from' in tagged_post:
            from_value = tagged_post['from']['name']
            tagged_post['from'] = from_value
        else:
            tagged_post['from'] = 'Internal'
    return tagged_posts

In [111]:
# Gather all tagged posts data
tagged_entries = read_all_entries(tagged_url)
tagged_entries = preprocess_tagged(tagged_entries)
write_to_file('tagged_entries.json', tagged_entries)

In [112]:
# Gather all visitor posts data
visitor_posts_entries = read_all_entries(visitor_posts_url)
write_to_file('visitor_posts_entries.json', visitor_posts_entries)