# Facebook Data Crawling
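In this notebook, we crawl posts from a Facebook fan page using the facebook-scraper library, which scrapes Facebook's web pages directly rather than calling the Facebook Graph API. We then clean the crawled data with pandas.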

## Install the required library
We install the facebook-scraper library with pip, together with pandas and numpy.

In [4]:
# ast, copy, datetime and re are part of the Python standard library, so they are not
# installed with pip; requesting them makes pip fetch unrelated PyPI packages
# (the build of AST-0.0.2 failed in the original run).
%pip install facebook_scraper pandas numpy

In [8]:
from facebook_scraper import get_posts  # used by the crawling cell below
import pandas as pd
import numpy as np
import ast
import copy
import datetime
import re

## Crawl the data using facebook_scraper
We now fetch the data with the facebook_scraper library. Its get_posts function yields the posts of the fan page one at a time, each as a dictionary. We collect these dictionaries in a list, which can then be saved to a file for later sessions. More information about what you can do with the facebook_scraper library can be found here: https://github.com/kevinzg/facebook-scraper

## Define variables
First we define the variables used throughout the notebook.
- FANPAGE_LINK: The fan page to crawl data from. Open the fan page and copy its link from the address bar. For example, the link to the fan page of the [Nintendo Switch](https://www.facebook.com/NintendoSwitch/) is https://www.facebook.com/NintendoSwitch/. The code cell below sets FANPAGE_LINK to the page name only, which is the last part of that link.

- COOKIE_PATH: The path to the cookie file used to authenticate with Facebook. To obtain it, log into Facebook and export the cookies from the browser. For example, in Chromium, use the extension [Get cookies.txt LOCALLY](https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid) to get the cookie file. Save it and use its path as the value for COOKIE_PATH. <span style="color:red; font-weight:bold">USE COOKIES FROM A FAKE ACCOUNT, OTHERWISE YOUR REAL ACCOUNT MIGHT GET BANNED.</span>

- FOLDER_PATH: The folder that the data is saved to and read from, located in the same directory as this notebook.

- PAGES_NUMBER: The number of pages of posts to crawl.
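
Since the address bar shows a full link while the code cell below uses only the page name, a small helper can convert one into the other. The following is a minimal sketch: `page_name_from_link` is a hypothetical helper, not part of facebook_scraper, and it assumes the page name is the first path segment of the link.

```python
from urllib.parse import urlparse


def page_name_from_link(link: str) -> str:
    """Return the page name from a fan page link (hypothetical helper)."""
    # Accept bare page names as well as full links
    if "://" not in link:
        return link.strip("/")
    path = urlparse(link).path
    # Assumption: the first non-empty path segment is the page name
    segments = [segment for segment in path.split("/") if segment]
    return segments[0] if segments else ""


print(page_name_from_link("https://www.facebook.com/NintendoSwitch/"))  # NintendoSwitch
```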

In [9]:
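FANPAGE_LINK = "EliudKipchogeOfficial"  # page name, i.e. the last part of the page link
FOLDER_PATH = "Data/"  # folder holding the crawled data
COOKIE_PATH = "cookies.txt"  # cookie file exported from the browser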

PAGES_NUMBER = 30 # Number of pages to crawl

In [None]:
post_list = []
for post in get_posts(FANPAGE_LINK,
                    options={"comments": True, "reactions": True, "allow_extra_requests": True},
                    extra_info=True, pages=PAGES_NUMBER, cookies=COOKIE_PATH):
    print(post)
    post_list.append(post)
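
At this point the crawled posts exist only in memory. A minimal sketch for writing such a list of dictionaries to a JSON file follows. `save_posts_json`, the file name `posts.json` and the sample post are hypothetical, and `default=str` is an assumption that turns values the json module cannot serialize (such as datetime objects) into strings.

```python
import datetime
import json
import os
import tempfile


def save_posts_json(posts, path):
    """Write a list of post dictionaries to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        # default=str converts values json cannot serialize, e.g. datetime objects
        json.dump(posts, f, ensure_ascii=False, indent=2, default=str)


# Made-up sample standing in for post_list
sample_posts = [{"post_id": "1", "time": datetime.datetime(2023, 1, 1, 12, 0)}]

# Written to the temp directory only for this demonstration
path = os.path.join(tempfile.gettempdir(), "posts.json")
save_posts_json(sample_posts, path)
```

Reading the file back with `json.load` returns the time as a plain string, so it would need to be parsed again if a datetime is required.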

## Convert list of dicts to df

Next we convert the list of dictionaries to a pandas dataframe. We also save the dataframe to an xlsx or csv file (the code below uses csv), so that later sessions can load the data without crawling again.

In [3]:
# # Only for initial scraping of website

# # Initialize dataframe to scrape Facebook post
# post_df_full = pd.DataFrame(columns=post_list[0].keys(), index=range(len(post_list)), data=post_list)

# # Save the dataframe to a csv file
# path=FOLDER_PATH + FANPAGE_LINK + "copy" + ".csv"
# post_df_full.to_csv(path, index=False)
# print(path)

In [10]:
# Used for subsequent sessions:
# load the previously crawled data into a dataframe
path = FOLDER_PATH + FANPAGE_LINK + ".csv"
print(path)
post_df_full = pd.read_csv(path, low_memory=False)
LENGTH = len(post_df_full)  # number of posts

Data/EliudKipchogeOfficial.csv


In [11]:
# Read the names of the fields to omit from the analysis data (one per line)
with open("fields.txt", "r") as file:
    words = [line.strip() for line in file if line.strip()]
print(words)

# Remove those fields from the dataframe
post_df_full = post_df_full.drop(words, axis=1)

['post_id', 'text', 'shared_text', 'original_text', 'timestamp', 'likes', 'image', 'image_lowquality', 'images', 'images_lowquality', 'images_lowquality_description', 'video', 'video_duration_seconds', 'video_height', 'video_id', 'video_quality', 'video_size_MB', 'video_thumbnail', 'video_watches', 'video_width', 'post_url', 'link', 'links', 'user_id', 'username', 'user_url', 'is_live', 'factcheck', 'shared_post_id', 'shared_time', 'shared_user_id', 'shared_username', 'shared_post_url', 'available', 'w3_fb_url', 'with', 'page_id', 'sharers', 'image_id', 'image_ids', 'video_ids', 'videos', 'was_live', 'fetched_time']


In [16]:
# Clean the text of each post: drop emoji and other symbols, keeping only
# ASCII letters, digits, commas, periods and whitespace
post_df_full['post_text'] = post_df_full['post_text'].str.replace(r'[^a-zA-Z0-9,.\s]', '', regex=True)
post_df_full['text_length'] = post_df_full['post_text'].str.len()
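
To see what this pattern keeps and drops, here is a small standalone sketch that applies the same character class with Python's re module to made-up strings. `clean_text` is a hypothetical helper name. Note that the pattern also removes accented and other non-ASCII letters.

```python
import re

# Same character class as in the cell above: anything that is not an ASCII
# letter, a digit, a comma, a period or whitespace is removed
PATTERN = re.compile(r"[^a-zA-Z0-9,.\s]")


def clean_text(text: str) -> str:
    """Strip every character the pattern does not allow (hypothetical helper)."""
    return PATTERN.sub("", text)


print(clean_text("Great run! 2:01:09, wow."))  # Great run 20109, wow.
print(clean_text("Caf\u00e9 #1"))              # Caf 1
```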


In [17]:
# Replace missing values and empty lists with the marker "no_data".
# comments_full is filled as well, so that the comment-parsing cell never receives NaN.
post_df_full = post_df_full.fillna({'reactions': "no_data", "reactors": "no_data", "comments_full": "no_data"})
post_df_full['comments_full'] = post_df_full["comments_full"].replace({"[]": "no_data"})
post_df_full['reactors'] = post_df_full["reactors"].replace({"[]": "no_data"})

In [18]:
# Extract the count of each reaction type into its own column
emojis = ['like', 'love', 'haha', 'wow', 'care', 'angry', 'sad']
emojis_count = {emoji: [] for emoji in emojis}
for i in range(LENGTH):
    raw = post_df_full.loc[i, "reactions"]
    # Each non-empty cell holds the text of a dict mapping reaction name to count
    react = (ast.literal_eval(raw) or {}) if raw != "no_data" else {}
    for emoji in emojis:
        # Reaction types missing from a post count as 0;
        # reaction types not listed in emojis are ignored
        emojis_count[emoji].append(react.get(emoji, 0))
for emoji in emojis:
    post_df_full[emoji] = emojis_count[emoji]
post_df_full = post_df_full.drop(["reactions"], axis=1)
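
The reaction counts are stored in the csv file as the text of a Python dict, which is why ast.literal_eval is needed to get the numbers back. A minimal standalone sketch of that step on a made-up cell value is shown here. `count_reactions` is a hypothetical helper, and the "no_data" marker follows the convention used in this notebook.

```python
import ast

REACTION_TYPES = ["like", "love", "haha", "wow", "care", "angry", "sad"]


def count_reactions(cell: str) -> dict:
    """Parse one 'reactions' cell into a count per reaction type (hypothetical helper)."""
    # literal_eval only accepts Python literals, so it cannot run arbitrary code
    parsed = ast.literal_eval(cell) if cell != "no_data" else {}
    # Reaction types that are absent from the cell count as 0
    return {name: parsed.get(name, 0) for name in REACTION_TYPES}


print(count_reactions("{'like': 120, 'love': 15}"))
print(count_reactions("no_data"))
```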


In [19]:
# Collect the links to the Facebook accounts of everyone who reacted to each post
reactor_links = []
for i in range(LENGTH):
    if post_df_full.loc[i, "reactors"] != "no_data":
        reactors = ast.literal_eval(post_df_full.loc[i, "reactors"])
        reactor_links.append([person['link'] for person in reactors])
    else:
        reactor_links.append("no_data")
post_df_full = post_df_full.drop(["reactors"], axis=1)

# Extract the numeric account ID from each link.
# Note: the pattern captures exactly 15 digits after "id=", so links without such
# an ID are skipped and longer IDs are cut to their first 15 digits.
pattern = r'id=(\d{15})'
reactors_ids = []
for links in reactor_links:
    if links != "no_data":
        matches = []
        for link in links:
            found = re.findall(pattern, link)
            if found:
                matches.append(found[0])
        reactors_ids.append(matches)
    else:
        reactors_ids.append("no_data")

post_df_full['reactor_ids'] = reactors_ids
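
The effect of the ID pattern is easiest to see on a couple of links. The sketch below is standalone and uses made-up links whose format is an assumption. `extract_ids` is a hypothetical helper that mirrors the loop above.

```python
import re

# Same pattern as in the cell above: exactly 15 digits following "id="
ID_PATTERN = re.compile(r"id=(\d{15})")


def extract_ids(links):
    """Return the IDs found in a list of profile links (hypothetical helper)."""
    ids = []
    for link in links:
        match = ID_PATTERN.search(link)
        if match:  # links without a numeric ID are skipped
            ids.append(match.group(1))
    return ids


# Made-up links for illustration only
links = [
    "https://facebook.com/profile.php?id=100012345678901&fref=pb",
    "https://facebook.com/some.user?fref=pb",
]
print(extract_ids(links))  # ['100012345678901']
```

Accounts that are linked by user name rather than by numeric ID produce no match, so the resulting list can be shorter than the list of reactors.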


In [20]:
# Extract the relevant information from the comments of each post:
# the cleaned comment text, the commenters' IDs and the average comment length
comments_texts = []
commenters_ids = []
comment_avg_length = []
for i in range(LENGTH):
    if post_df_full.loc[i, "comments_full"] != "no_data":
        # eval (rather than ast.literal_eval) also accepts non-literal values such as
        # datetime.datetime(...); only run it on data you crawled yourself
        commenters = eval(post_df_full.loc[i, "comments_full"])
        commenters_ids.append([person['commenter_id'] for person in commenters])
        texts = [person['comment_text'] for person in commenters]
        # Join the comments into one string, each followed by ". "
        comment_text = "".join(text + ". " for text in texts)
        # Keep only ASCII letters, digits, commas, periods and whitespace, then flatten newlines.
        # Cleaning here instead of on the whole column keeps the "no_data" marker intact
        # (the pattern would strip its underscore).
        comment_text = re.sub(r'[^a-zA-Z0-9,.\s]', '', comment_text).replace('\n', ' ')
        comments_texts.append(comment_text)
        # The average length is computed on the raw comments, before cleaning
        comment_avg_length.append(sum(len(text) for text in texts) / len(texts))
    else:
        commenters_ids.append("no_data")
        comments_texts.append("no_data")
        comment_avg_length.append(0)
post_df_full["comments_text"] = comments_texts
post_df_full["commenters_ids"] = commenters_ids
post_df_full["avg_comment_length"] = comment_avg_length
post_df_full = post_df_full.drop(["comments_full"], axis=1)
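
The loop above reduces the comments of each post to three values. A minimal standalone sketch of that reduction on made-up comments follows. `summarize_comments` is a hypothetical helper, and the keys `commenter_id` and `comment_text` are the ones this notebook reads.

```python
def summarize_comments(comments):
    """Return (commenter IDs, joined text, average comment length) for one post."""
    ids = [comment["commenter_id"] for comment in comments]
    texts = [comment["comment_text"] for comment in comments]
    # Each comment is followed by ". " in the joined text
    joined = "".join(text + ". " for text in texts)
    # Guard against an empty list to avoid dividing by zero
    average = sum(len(text) for text in texts) / len(texts) if texts else 0
    return ids, joined, average


# Made-up comments for illustration only
comments = [
    {"commenter_id": "1001", "comment_text": "Legend"},
    {"commenter_id": "1002", "comment_text": "No human is limited"},
]
ids, joined, average = summarize_comments(comments)
print(ids, joined, average)
```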


In [21]:
# Export the cleaned dataframe to a csv file
post_df_full.to_csv("cleaned.csv")