# Facebook Collector Workflow
The CASM Lab Facebook Collector projects follow our internal standard approach to social media data:

1. collect
2. cache
3. parse
4. analyze

Under the collect step live scripts for getting data in "raw" form. Here, raw means whatever the default format for the data is. Usually this means JSON dumped by an API, but for scrapers it's whatever data structure and format we decided to use. We are greedy in collection, meaning we pull whatever data the API will let us have. In the EveryBlock projects, it means data returned by the ever-changing and often-unavailable EveryBlock Content API.

Once we have "raw" data, we cache it by storing a read-only copy somewhere accessible to the whole team. Usually this storage step is handled by the collection script and isn't an extra scripting step. I call it out here though because it's conceptually important - social media data changes all the time, and caching lets us keep track of what the data looked like at the time of collection (e.g., what was returned, what structure was standard then).
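
The caching idea above can be sketched as follows. This is a minimal illustration, not the collector's actual code: the function name `cache_raw`, its parameters, and the file naming scheme are all assumptions made for the example.

```python
import json
import os
import stat
import time
from pathlib import Path


def cache_raw(data, cache_dir, label):
    """Write one raw API response to a timestamped file and mark it read-only.

    `cache_dir` and `label` are hypothetical names chosen for this sketch.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp in the file name records when the data was collected.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    path = cache_dir / f"{label}-{stamp}.json"
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False)
    # Clear the write bits so later scripts cannot change the cached copy by accident.
    os.chmod(path, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    return path
```

Calling `cache_raw` twice within the same second with the same label would try to reopen a read-only file and fail, which is arguably the behavior you want from a cache that should never be overwritten.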

Next, we parse. Parsing scripts pull data from the read-only caches and put them in formats that are appropriate for analysis or whatever comes next. For instance, some of our Twitter user timeline tools collect data from the search API, cache it, then parse it into a MySQL database for display on our Django-backed website. This leaves us with two related, but not identical, copies of the data - one in JSON from Twitter, and one in MySQL. Parsing scripts also do any data transformations that are necessary for analysis (e.g., converting timestamps, calculating user stats).
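
As one example of such a transformation, here is a small sketch of timestamp conversion. It assumes timestamps arrive as strings like `2016-03-08T17:00:00+0000`; that format and the function name `parse_created_time` are assumptions for illustration, so adjust the format string to whatever the cached data actually contains.

```python
from datetime import datetime, timezone


def parse_created_time(value):
    """Convert a string such as '2016-03-08T17:00:00+0000' to a UTC datetime."""
    # %z accepts a numeric UTC offset like +0000 or -0600.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    # Normalize to UTC so entries from different offsets compare cleanly.
    return parsed.astimezone(timezone.utc)
```

Normalizing everything to UTC at parse time keeps later sorting and grouping independent of the offset the API happened to return.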

Finally, we get to analyze the data. Often analysis is included in the same script as parsing, but sometimes analysis steps will live on their own. Some of the analysis will involve machine learning or natural language processing, but some will be simple word clouds or descriptive statistics.
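
For the simpler end of that range, a word-frequency count (the usual input to a word cloud) might look like the sketch below. The function name `word_counts` and the tokenization rule are assumptions for illustration, not part of the project.

```python
import re
from collections import Counter


def word_counts(messages, top_n=10):
    """Return the `top_n` most frequent words across a list of message strings."""
    counts = Counter()
    for message in messages:
        # Lowercase, then treat runs of letters, digits, and underscores as words.
        counts.update(re.findall(r"\w+", message.lower()))
    return counts.most_common(top_n)
```

A real analysis would probably also drop stop words such as "the" and "at" before counting.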

## Setup
Note: This code has been tested on OS X 10.11.3 and Windows 10.


1. Create a Facebook Web App on [https://developers.facebook.com](https://developers.facebook.com). Alternatively, have your Facebook account granted at least Tester permission on an existing web app so that you can modify and run it.
2. Clone the FacebookGroupCollector repo.
3. Use Python 3.
4. Install Ruby 2.2.3 or newer and use it to start a localhost server before you run the .py files. (Skip this step if you are running the project in Jupyter Notebook.)

### fetchGUI.py
- appId: the App ID on your Developer Dashboard
- GroupID: the ID of the group you would like to collect data from, e.g., 160475740743826 for [Asian American Chicago Network (AACN)](https://www.facebook.com/groups/asianamericanchicagonetwork/). Look up the GroupID at https://lookup-id.com/ and enter it in the GUI.

### parse.py
- path: the directory where you put your raw data
- outputFile: the directory where you put your parsed data

### toCSV.py
- f: the directory where you put your parsed data
- outputFile: the directory where you put your target CSV data

## Collect and Cache
1. Download fetchGUI.py, and make sure that you have the Firefox browser installed.
2. Open Command Prompt in the folder where fetchGUI.py is located.
3. Serve the project on localhost by running ```serve```. (Skip this step if you are using Jupyter Notebook, since it is already a host.)


4. Run fetchGUI.py with Python 3 (in Jupyter Notebook, ```%run fetchGUI.py```), type in the group ID, and click the Get Group Feed button. A new tab will open in Firefox.

5. The first time you click the button on the webpage, a Facebook verification page will show up. Log in and click Okay. If an error shows up, it means you have not been granted permission.
6. Click the button again, and windows for saving the data webpages will pop up.
7. Check the box "Do this automatically for files like this from now on" and save the webpages to one empty folder on your computer.
8. After the download is complete, type in a name for the combined JSON file, and click the Combine Files button.
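
For readers who want to see what a combine step like the one in step 8 could do, here is a hedged sketch. It is not the code behind the Combine Files button: it assumes each saved page is a `.json` file holding an object with a top-level `"data"` list, and the function name `combine_json_files` is hypothetical.

```python
import json
from pathlib import Path


def combine_json_files(folder, output_path):
    """Merge the "data" lists of every .json file in `folder` into one file."""
    output_path = Path(output_path)
    combined = []
    for path in sorted(Path(folder).glob("*.json")):
        # Skip a combined file left over from an earlier run in the same folder.
        if path.resolve() == output_path.resolve():
            continue
        with open(path, encoding="utf-8") as handle:
            page = json.load(handle)
        # Assumed layout: each page is an object with a "data" list of entries.
        combined.extend(page.get("data", []))
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump({"data": combined}, handle, ensure_ascii=False, indent=2)
    return len(combined)
```

The function returns the number of entries written, which is a quick sanity check against the number of pages you saved.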


## Parse and Prepare for Qualitative Analysis
After downloading all data, we do a little curation with ```parse.py``` and ```toCSV.py```.

#### Step 1. Parse data into the desired schema.

With all raw data in the directory RawData/, run ```parse.py``` first to extract the targeted information for future analysis, including:

-	message [the content of the entry]
-	postId  [the id of the entry]
-	parentPostId  [if the current entry is a post, this is the id of its parent]
-	parentCommentId  [if the current entry is a comment, this is the id of its parent]
-	authorName  [the author name of the current entry]
-	metaData [includes hasLink, hasEvent, hasPhoto, hasVideo, and hasTags; each is a Boolean that defaults to False]

Run the command:

	%run parse.py

The output will be one JSON file composed of all entries, each of which looks like the example below.

	{
  		"hasVideo": false,
  		"hasPhoto": false,
  		"hasLink": true,
  		"parentPostId": "",
  		"authorName": "Shenyun Shenny",
  		"hasEvent": false,
  		"message": "Please join AACN this Saturday morning at 11 AM  for yummy dim sum at Ming Hin (2168 South Archer Avenue, Chicago, IL) in Chinatown. \n\nRSVP on our Meetup page: http:\/\/meetu.ps\/3mPFG",
  		"postId": "160475740743826_167108120080588",
  		"hasTags": false,
  		"parentCommentId": ""
	}

Now we have all the clean data we need.
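
To make the schema concrete, the sketch below maps one raw entry onto the ten fields listed in Step 1. It is not the contents of ```parse.py```. The raw keys it reads (`id`, `message`, `from`, `link`, `type`, `message_tags`) and the rules for the Boolean flags are assumptions about what the cached JSON looks like, and `to_entry` is a hypothetical name.

```python
def to_entry(raw, parent_post_id="", parent_comment_id=""):
    """Map one raw entry (assumed layout) onto the parsed schema."""
    entry_type = raw.get("type", "")
    return {
        "postId": raw.get("id", ""),
        # Parent ids are supplied by the caller; they stay empty when not given.
        "parentPostId": parent_post_id,
        "parentCommentId": parent_comment_id,
        "authorName": raw.get("from", {}).get("name", ""),
        "message": raw.get("message", ""),
        # Every flag is False unless the assumed raw key says otherwise.
        "hasLink": bool(raw.get("link")),
        "hasEvent": entry_type == "event",
        "hasPhoto": entry_type == "photo",
        "hasVideo": entry_type == "video",
        "hasTags": bool(raw.get("message_tags")),
    }
```

Because every lookup uses a default, a raw entry with missing keys still yields a complete record with empty strings and False flags, so downstream code does not have to handle absent fields.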

#### Step 2. Convert JSON data into an Excel-friendly file (CSV).
The next step is to put the data in an Excel-friendly form so that we can analyze the entries one by one and take notes. The goal is to classify different topics and discover new issues or questions from the feed.

The easiest way to do this is to convert our JSON data into CSV so that Excel can open it in a nice format. That's what ```toCSV.py``` does. Run the command:

	%run toCSV.py

Then open the output file in Excel. It should look like the table below:

| postId  | parentPostId | parentCommentId | authorName | message | hasVideo | hasPhoto | hasEvent | hasLink | hasTags |
|---|---|---|---|---|---|---|---|---|---|
| 160475740743826_167108120080588  |   |   | Shenyun Shenny  | Please join AACN this Saturday morning at 11 AM  for yummy dim sum at Ming Hin (2168 South Archer Avenue, Chicago, IL) in Chinatown. RSVP on our Meetup page: http://meetu.ps/3mPFG  | FALSE  | FALSE  | FALSE  | TRUE  | FALSE  |


And now you can add your own columns, such as "notes" or "categories", to do further qualitative analysis.
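
A conversion along these lines can be sketched with Python's standard `csv` module. This is an illustration, not the contents of ```toCSV.py```: it assumes the parsed entries are available as a list of dictionaries with the ten fields shown above, and the function name `entries_to_csv` and the `extra_columns` parameter are hypothetical.

```python
import csv

# Column order follows the table above.
COLUMNS = [
    "postId", "parentPostId", "parentCommentId", "authorName", "message",
    "hasVideo", "hasPhoto", "hasEvent", "hasLink", "hasTags",
]


def entries_to_csv(entries, csv_path, extra_columns=("notes",)):
    """Write parsed entries to a CSV file, adding empty columns for annotation."""
    fieldnames = COLUMNS + list(extra_columns)
    # newline="" lets the csv module manage line endings;
    # utf-8-sig writes a byte order mark that helps Excel detect UTF-8.
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        for entry in entries:
            writer.writerow(entry)
    return fieldnames
```

Messages that contain line breaks or commas are quoted automatically by the `csv` module, so each entry stays on a single spreadsheet row.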