# Facebook Collector Workflow
The CASM Lab Facebook Collector projects follow our internal standard approach to social media data:

1. collect
2. cache
3. parse
4. analyze

Under the collect step live scripts for getting data in "raw" form. Here, raw means whatever the default format for the data is. Usually this means JSON dumped by an API, but for scrapers it's whatever data structure and format we decided to use. We are greedy in collection, meaning we pull whatever data the API will let us have. In the EveryBlock projects, it means data returned by the ever-changing and often-unavailable EveryBlock Content API.

Once we have "raw" data, we cache it by storing a read-only copy somewhere accessible to the whole team. Usually this storage step is handled by the collection script and isn't an extra scripting step. I call it out here, though, because it's conceptually important: social media data changes all the time, and caching lets us keep track of what the data looked like at the time of collection (e.g., what was returned, what structure was standard then).
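As a rough illustration of the caching idea, here is a minimal Python sketch that copies a raw download into a cache folder under a timestamped name and removes write permission. The function name `cache_raw_file` and the file-naming scheme are assumptions made for this example; they are not taken from the project's own scripts.

```python
import shutil
import stat
from datetime import datetime, timezone
from pathlib import Path


def cache_raw_file(raw_file, cache_dir):
    """Copy a raw download into cache_dir and mark the copy read-only.

    Hypothetical helper: the naming scheme below is an assumption.
    """
    raw_file = Path(raw_file)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Put the collection time (UTC) in the file name so that
    # later collections never overwrite earlier ones.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    cached = cache_dir / f"{raw_file.stem}-{stamp}{raw_file.suffix}"
    shutil.copy2(raw_file, cached)

    # Remove write permission so later steps cannot change the cache by accident.
    cached.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return cached
```

Keeping the timestamp in the name is one simple way to record what the data looked like at collection time; a shared drive or object store would work just as well as a local folder.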

Next, we parse. Parsing scripts pull data from the read-only caches and put it in formats that are appropriate for analysis or whatever comes next. For instance, some of our Twitter user timeline tools collect data from the search API, cache it, then parse it into a MySQL database for display on our Django-backed website. This leaves us with two related, but not identical, copies of the data: one in JSON from Twitter, and one in MySQL. Parsing scripts also do any data transformations that are necessary for analysis (e.g., converting timestamps, calculating user stats).
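Timestamp conversion is a typical parsing-time transformation. The sketch below assumes timestamps arrive as strings like `2016-05-01T14:30:00+0000` (an assumption about the raw format, not something this README specifies) and turns them into timezone-aware UTC datetimes. The helper name `parse_created_time` is made up for the example.

```python
from datetime import datetime, timezone


def parse_created_time(value):
    """Parse a string such as '2016-05-01T14:30:00+0000' into an aware UTC datetime.

    Assumes the '%Y-%m-%dT%H:%M:%S%z' layout; adjust if the raw data differs.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    # Normalize to UTC so timestamps from different offsets compare cleanly.
    return parsed.astimezone(timezone.utc)
```

Normalizing to one timezone at parse time means analysis scripts can sort and bucket entries without worrying about offsets.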

Finally, we get to analyze the data. Often analysis is included in the same script as parsing, but sometimes analysis steps will live on their own. Some of the analysis will involve machine learning or natural language processing, but some will be simple word clouds or descriptive statistics.
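For the simpler end of analysis, descriptive statistics can be a few lines of Python. This sketch assumes a list of parsed entries with `authorName` and `message` keys, as in the parsed schema described later in this document; the function names are hypothetical.

```python
import re
from collections import Counter


def posts_per_author(entries):
    """Count how many entries each author wrote."""
    return Counter(entry.get("authorName", "") for entry in entries)


def top_words(entries, n=10):
    """Return the n most frequent lowercase words across all messages."""
    words = Counter()
    for entry in entries:
        message = entry.get("message") or ""
        # Very rough tokenizer: runs of letters and apostrophes only.
        words.update(re.findall(r"[a-z']+", message.lower()))
    return words.most_common(n)
```

The word counts from `top_words` are also the kind of input a word cloud would need.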

## Setup
Note: This code has been tested on OS X 10.11.3 and Windows 10.


1. Create a Facebook Web App at [https://developers.facebook.com](https://developers.facebook.com). Alternatively, have your Facebook account granted at least Tester permission on an existing web app so that you can modify and run it;
2. Clone the FacebookGroupCollector repo;
3. Use Python 3;
4. Open Command Prompt in the folder where ```fetch.html``` is located;
5. Start a local server at localhost:4000. In this guide we use Ruby 2.2.3 or newer and start the server by entering "serve".


In [None]:
% serve

In [25]:
%%javascript

 window.fbAsyncInit = function() {
    FB.init({
      appId      : '1065187026859191',
      xfbml      : true,
      version    : 'v2.6'
    });
     console.log("init");
  };

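  // Load the Facebook JavaScript SDK asynchronously.
  (function(d, s, id){
     var js, fjs = d.getElementsByTagName(s)[0];
     if (d.getElementById(id)) {return;}
     js = d.createElement(s); js.id = id;
     // Without a src the script element loads nothing and fbAsyncInit never runs.
     js.src = "https://connect.facebook.net/en_US/sdk.js";
     fjs.parentNode.insertBefore(js, fjs);
   }(document, 'script', 'facebook-jssdk'));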

  var pageCount = 1;
  var groupID;
  var json = [];

  // Work out which Graph API path to query from a group link or a page link.
  function readData(link) {
//     var link = document.getElementById("GET-link").value;
    // Dots are escaped so they match literally. Both patterns expect a trailing slash.
    var regroup = /https:\/\/www\.facebook\.com\/groups\/(.*?)\//;
    var repage = /https:\/\/www\.facebook\.com\/(.*?)\//;
    var name;
    var url;

    if (link.match(regroup)) {
      // Group link: search for the group by name to find its ID.
      name = link.match(regroup)[1];
      url = "search?q=" + name + "&type=group";
    } else if (link.match(repage)) {
      // Page link: the page name can be queried directly.
      name = link.match(repage)[1];
      url = name;
    } else {
      console.log("Wrong link");
      var pro = document.createElement('p');
      pro.id = "wrong";
      pro.textContent = "Wrong link";
      document.getElementById('content').appendChild(pro);
      return;
    }
      
    console.log("before log in");
      
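    // Use the current Facebook session if there is one; otherwise ask the user to log in.
    FB.getLoginStatus(function(response) {
      console.log("try log in");
      if (response.status === 'connected') {
        console.log("connected");
        var accessToken = response.authResponse.accessToken;
        grabID(url, accessToken);
      } else {
        // After logging in, the collection has to be started again.
        console.log("not logged in");
        FB.login();
      }
    });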
  }

  // Resolve the group or page ID, then start collecting its feed.
  function grabID(url, accessToken) {
    FB.api(
      url,
      {access_token : accessToken},
      function (response) {
        if (!response || response.error) {
          console.log("some error when grab id!");
          return;  // without an ID there is nothing to collect
        }

        if (response["data"]) {
          // Search results: take the first match, if any.
          if (!response["data"].length) {
            console.log("no group matched the search!");
            return;
          }
          groupID = response["data"][0]["id"];
        } else {
          groupID = response["id"];
        }

        var url2 = groupID + "/feed?fields=caption,created_time,description,from,id,link,message,message_tags,name,story,type,updated_time,comments{comments{object,parent,message,from,id,created_time},from,id,message,created_time,object},likes{id,name}";
        var pro = document.createElement('p');
        pro.id = "process";
        pro.textContent = "Grabbing data, please wait.";
        document.getElementById('content').appendChild(pro);

        getPosts(url2, accessToken);
      }
    );
  }

  // Fetch one page of the feed, then follow paging.next until no pages are left.
  function getPosts(url2, accessToken) {
    FB.api(
      url2,
      // The requested fields are already part of url2, so only the token is passed here.
      {access_token : accessToken},
      function (response) {
        if (response && !response.error) {
          json.push(response);
        } else {
          console.log("some error when get posts!");
        }

        // Wait 10 seconds before requesting the next page.
        setTimeout(function() {
          var nextPage = response && response["paging"] && response["paging"]["next"];

          if (nextPage) {
            console.log(pageCount);
            console.log("nextPage exists! trying to grab next page");
            getPosts(nextPage, accessToken);

            pageCount += 1;
            document.getElementById('process').textContent = "Grabbing page " + pageCount.toString();
          } else {
            console.log("the end of pages!");
            console.log(pageCount);

            // Offer everything collected so far as a downloadable JSON file.
            var blob = new Blob([JSON.stringify(json)], {type: "application/json"});
            var url3 = URL.createObjectURL(blob);

            var a = document.createElement('a');
            a.download = "backup.json";
            a.href = url3;
            a.textContent = "Download backup.json";
            document.getElementById('content').appendChild(a);

            var pro = document.getElementById('process');
            pro.textContent = "Complete. Click the link to download.";
          }
        }, 10000);
      }
    );
  }

readData("https://www.facebook.com/AsianAmericanChicagoNetwork/");
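
The paging loop in `getPosts` can also be written as a plain loop, which may be easier to reason about. The Python sketch below is not part of the collector; `fetch` is a hypothetical callable that takes a URL and returns the decoded JSON response as a dict, and `max_pages` is an added safety limit.

```python
def collect_pages(fetch, first_url, max_pages=1000):
    """Collect response pages by following each page's paging.next link.

    `fetch` is a hypothetical function: URL in, decoded JSON dict out.
    """
    pages = []
    url = first_url
    while url and len(pages) < max_pages:
        response = fetch(url)
        # Stop on an empty or error response, as the JavaScript version does.
        if not response or "error" in response:
            break
        pages.append(response)
        # The last page has no "next" link, which ends the loop.
        url = response.get("paging", {}).get("next")
    return pages
```

Passing `fetch` in as a parameter keeps the loop testable with canned responses, without touching the network.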


### fetch.html
- Page link: open ```fetch.html``` in the browser and enter the link to your page or group.

### grabdata.js
- appId: the App ID on your Developer Dashboard

### cache.py
- raw-path: the directory where you put your raw data
- cache-path: the directory where you put your cached data

### parse.py
- path: the directory where you put your cached data
- outputFile: the directory where you put your parsed data

### toCSV.py
- f: the directory where you put your parsed data
- outputFile: the directory where you put your target CSV data


## Collect
1. After starting a local server on port 4000, open ```fetch.html``` in the browser and enter your link. It can point to a Facebook page or a Facebook group.
2. Click the button and wait while the JSON data is grabbed from the feed; the page shows the progress.
3. When the data is ready, click the link shown on the page to download the raw data.
4. Check whether the downloaded file is in your raw-data-sample folder. If it is not, move it there for the next steps.

Note:
The first time you click the button on the webpage, a Facebook login and verification page will show up. Log in and repeat the previous steps.
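
For reference, the way the collector tells a group link from a page link can be sketched in Python. This mirrors the regular expressions in the JavaScript above; the function name `graph_path_for` is made up for this example.

```python
import re

# Same patterns as the JavaScript, with the dots escaped.
GROUP_LINK = re.compile(r"https://www\.facebook\.com/groups/(.*?)/")
PAGE_LINK = re.compile(r"https://www\.facebook\.com/(.*?)/")


def graph_path_for(link):
    """Return the path to query for a group or page link, or None if it matches neither."""
    group = GROUP_LINK.search(link)
    if group:
        # Groups are looked up by name through a search query.
        return "search?q=" + group.group(1) + "&type=group"
    page = PAGE_LINK.search(link)
    if page:
        # Pages can be queried directly by name.
        return page.group(1)
    return None
```

Both patterns require a trailing slash after the name, so a link copied without one is reported as a wrong link.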


## Cache
1. After the download is complete, run ```cache.py``` to clean the raw file and add metadata, including the ```Facebook group ID```, ```Group name```, ```last post ID```, and ```last post```, alongside all other content ("[data]").

Run command

In [None]:
% run cache.py
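
To make the shape of a cached record concrete, here is a hedged Python sketch of wrapping the downloaded pages with metadata. It is not the code in ```cache.py```: the key names (`groupId`, `groupName`, `lastPostId`, `lastPost`) and the rule for picking the last post (greatest `created_time`) are assumptions made for illustration.

```python
def build_cache_record(pages, group_id, group_name):
    """Merge the downloaded pages into one record with metadata on top.

    Hypothetical structure; key names are assumptions, not cache.py's actual output.
    """
    # Each downloaded page is expected to carry its posts under "data".
    posts = [post for page in pages for post in page.get("data", [])]

    # Assumption: the "last post" is the most recent one. Timestamps in the same
    # ISO-style format and UTC offset sort correctly as plain strings.
    last = max(posts, key=lambda post: post.get("created_time", "")) if posts else None

    return {
        "groupId": group_id,
        "groupName": group_name,
        "lastPostId": last["id"] if last else None,
        "lastPost": last,
        "data": posts,
    }
```

Recording the last post is what would let a later collection run tell which posts are new since the previous cache.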

## Parse and Prepare for Qualitative Analysis
After downloading all data, we do a little curation with ```parse.py``` and ```toCSV.py```.

#### Step 1. Parse data into the target schema.

With all raw data in the directory RawData/, first run ```parse.py``` to extract the information needed for later analysis, including:

-	message [the content of the entry]
-	postId  [the id of the entry]
-	parentPostId  [if the current entry is a comment, this is the id of the post it belongs to; empty for posts]
-	parentCommentId  [if the current entry is a reply to a comment, this is the id of that comment; empty otherwise]
-	authorName  [the author name of the current entry]
-	metaData [hasLink, hasEvent, hasPhoto, hasVideo, and hasTags; each is a Boolean whose default value is False]

Run command

In [None]:
% run parse.py

The output is a single JSON file containing all entries. Each entry looks like the example below.

	{
  		"hasVideo": false,
  		"hasPhoto": false,
  		"hasLink": true,
  		"parentPostId": "",
  		"authorName": "Shenyun Shenny",
  		"hasEvent": false,
  		"message": "Please join AACN this Saturday morning at 11 AM  for yummy dim sum at Ming Hin (2168 South Archer Avenue, Chicago, IL) in Chinatown. \n\nRSVP on our Meetup page: http:\/\/meetu.ps\/3mPFG",
  		"postId": "160475740743826_167108120080588",
  		"hasTags": false,
  		"parentCommentId": ""
	}
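
An entry like the one above could be produced by flattening each post together with its comments and replies. The Python sketch below is an illustration only, not the code in ```parse.py```: the rules for the `has*` flags and for filling in the parent ids are guesses based on the field names, and both function names are hypothetical.

```python
def make_entry(item, parent_post_id="", parent_comment_id=""):
    """Build one flat entry from a raw post, comment, or reply.

    The flag rules below are assumptions, not parse.py's actual logic.
    """
    return {
        "postId": item.get("id", ""),
        "parentPostId": parent_post_id,
        "parentCommentId": parent_comment_id,
        "authorName": (item.get("from") or {}).get("name", ""),
        "message": item.get("message", ""),
        "hasLink": bool(item.get("link")),
        "hasEvent": item.get("type") == "event",
        "hasPhoto": item.get("type") == "photo",
        "hasVideo": item.get("type") == "video",
        "hasTags": bool(item.get("message_tags")),
    }


def flatten_post(post):
    """Return the post, its comments, and their replies as a flat list of entries."""
    entries = [make_entry(post)]
    for comment in (post.get("comments") or {}).get("data", []):
        entries.append(make_entry(comment, parent_post_id=post["id"]))
        for reply in (comment.get("comments") or {}).get("data", []):
            entries.append(
                make_entry(reply, parent_post_id=post["id"], parent_comment_id=comment["id"])
            )
    return entries
```

Flattening the nested structure into one entry per row is what makes the later CSV conversion straightforward.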

Now we have all the clean data we need.

#### Step 2. Convert JSON data into an Excel-friendly file (CSV).
The next step is to put the data into a form that Excel can open, so that we can go through the entries one by one and take notes. The goal is to classify topics and discover new issues or questions from the feed.

The easiest way to do this is to convert our JSON data into CSV, which Excel opens in a nice tabular format. That's what ```toCSV.py``` does. Run command

In [None]:
% run toCSV.py

Then open the output file in Excel. It will look like the table below:

| postId  | parentPostId | parentCommentId | authorName | message | hasVideo | hasPhoto | hasEvent | hasLink | hasTags |
|---|---|---|---|---|---|---|---|---|---|
| 160475740743826_167108120080588  |   |   | Shenyun Shenny  | Please join AACN this Saturday morning at 11 AM  for yummy dim sum at Ming Hin (2168 South Archer Avenue, Chicago, IL) in Chinatown. RSVP on our Meetup page: http://meetu.ps/3mPFG  | FALSE  | FALSE  | FALSE  | TRUE  | FALSE  |
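
A conversion like this needs little more than Python's standard `csv` and `json` modules. The sketch below is not ```toCSV.py``` itself; the function name `json_to_csv`, the whitespace flattening, and the choice of the `utf-8-sig` encoding are assumptions for illustration. The column order follows the table above.

```python
import csv
import json

COLUMNS = [
    "postId", "parentPostId", "parentCommentId", "authorName", "message",
    "hasVideo", "hasPhoto", "hasEvent", "hasLink", "hasTags",
]


def json_to_csv(json_path, csv_path):
    """Write a list of parsed entries from json_path to csv_path; return the row count."""
    with open(json_path, encoding="utf-8") as source:
        entries = json.load(source)

    # "utf-8-sig" writes a byte-order mark, which helps Excel detect UTF-8.
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as target:
        writer = csv.DictWriter(target, fieldnames=COLUMNS)
        writer.writeheader()
        for entry in entries:
            row = {column: entry.get(column, "") for column in COLUMNS}
            # Collapse line breaks and repeated spaces so each message stays on one line.
            row["message"] = " ".join(str(row["message"]).split())
            writer.writerow(row)
    return len(entries)
```

Fixing the column order in one list keeps the spreadsheet layout stable even if the key order in the JSON changes.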


Now you can add your own columns, such as "notes" or "categories", for further qualitative analysis.