#### https://stackoverflow.com/questions/4598572/facebook-api-how-to-get-count-of-group-members?noredirect=1&lq=1

## Major Challenges:
  -No way to query for a simple COUNT of total group members. You have to keep paging through the member list until it runs out.
  
  -No guarantee that query results are truly random, which breaks capture-recapture. The sampling has to be done in memory over the streamed results.
  
  -No known way to reliably generate a Facebook User Access Token that never expires. Tokens have to be obtained through the Graph API Explorer.


## Initialization/Constants:

  -Obtain a new access token each time from the Graph API Explorer: https://developers.facebook.com/tools/explorer/145634995501895/?method=GET&path=227570797388782%2Fmembers%2F&version=v2.10 Use the "Get Token" button in the top right. Not sure yet how to request one from Python code.

## Todo:

  -Implement more complex mark and recapture techniques
  
  -Clean up code -- split into more functions
  
  -Incorporate Panos' propensity model

In [1]:
import pandas as pd
import json
import requests 
import random

In [2]:
access_token = r"EAACEdEose0cBAI0lGIGLj2ffd6XIXIu7WZAG7uANnv34XAmD7YUAUENaVkOIZB4Fza4LZBdlz8dSsmMvVYeqsdzcDOaeIyAC7B8e0u9SxXFtASTZB9j8Ah9ivYZAOJiK6l9ObrzKT5gYF3ZCDZC6crDdqUDzim4Ef1iPD6ZAgfRnjEW1gPZBVOHvzjAonZCKNqJZC8ZD"
group_id = "227570797388782"  #Look this up from Facebook Group meta tags
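
Since the token has to be copied from the Explorer for every session, one option is to keep it out of the notebook and read it from the environment. The helper below is a minimal sketch of that idea; the function name `load_access_token` and the variable name `FB_ACCESS_TOKEN` are hypothetical choices, not part of the original code or of any Facebook tooling.

```python
import os

def load_access_token(env_var="FB_ACCESS_TOKEN"):
    """Return the token stored in an environment variable.

    FB_ACCESS_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a hint instead of sending an empty token to the API
        raise RuntimeError("Set %s to a token copied from the Graph API Explorer" % env_var)
    return token
```

With the variable exported in the shell that launches the notebook, `access_token = load_access_token()` could replace the hard-coded string.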

In [3]:
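def reservoir_sampler(data, num_samples, current_list=None, samples_seen=0):
    '''
    One step of reservoir sampling, a streaming sampling algorithm.
    Offers `data` to the reservoir `current_list`, which holds at most `num_samples` items.
    Returns list of current state of samples, plus the number of elements it has seen so far.
    '''
    if current_list is None:
        #avoid a mutable default argument, which would be shared between calls
        current_list = []
    if samples_seen < num_samples:
        #reservoir is not full yet, so keep everything
        current_list.append(data)
    elif random.random() < (num_samples / float(samples_seen + 1)):
        #keep the new item with probability num_samples / (samples_seen + 1), replacing a random slot,
        #to guarantee that every data point has an equal chance of ending up in the current_list
        replace = random.randint(0, len(current_list) - 1)
        current_list[replace] = data
    return current_list, samples_seen + 1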

#if we haven't seen num_samples items yet, keep collecting everything
#once we have, each new data point replaces a random entry in our current list with probability num_samples / (samples_seen + 1),
#so every data point has an equal chance of ending up in our list
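
The equal-chance claim can be checked empirically. The sketch below is a self-contained variant of the sampler above (it takes a whole iterable instead of one item per call) and measures how often each element of a small population ends up in the reservoir. The names `reservoir_sample` and `inclusion_frequencies` and the population size, sample size and trial count are illustrative assumptions.

```python
import random
from collections import Counter

def reservoir_sample(stream, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for seen, item in enumerate(stream):
        if seen < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        elif rng.random() < k / float(seen + 1):
            # Keep the new item with probability k / (seen + 1), evicting a random slot
            reservoir[rng.randrange(k)] = item
    return reservoir

def inclusion_frequencies(population_size, k, trials, seed=0):
    """Fraction of trials in which each population element was sampled."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(population_size), k, rng))
    return [counts[i] / float(trials) for i in range(population_size)]

freqs = inclusion_frequencies(population_size=20, k=5, trials=20000)
print(min(freqs), max(freqs))  # both should be close to 5 / 20 = 0.25
```

If the sampler is uniform, early and late elements of the stream show the same inclusion frequency, which is what capture-recapture needs when the API returns members in a fixed order.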

In [None]:
#observe activity from a certain period of time
#group 1 and two are based 
#bias toward people discussing in the channel... compensate for different levels of activity
#lifetime --> what is the time difference between groups... how does it change when activity periods move farther apart

In [None]:
#join groups and find new groups to get

In [4]:
def mark_from_group(num_to_mark, group_id, access_token, get_all_group_members = False, api_limit = 1500):
    '''
    Inputs:
        -num_to_mark: Number of samples to mark from group
        -group_id: Facebook Group ID
        -access_token: Facebook access Token
        -get_all_group_members: Boolean flag for whether or not to keep pulling for all group members
        -api_limit: Number of members to request per page. Requests above about 2000 fail, so use 1500 to be safe.
    
    Outputs:
        -Set() of "num_to_mark" samples -- contains IDs of members marked.
        -Number of members seen (the total group size only when get_all_group_members is True)
    '''
    base_url = r"https://graph.facebook.com/"
    request_url = base_url + r"v2.10/" + group_id + r"/members"
    parameters = { "access_token" : access_token, "limit" : api_limit }  #Max limit is about 2000 before it fails
    #request the first page of members from the Graph API
    r = requests.get(request_url, params = parameters)
    r.raise_for_status()  #fail loudly on HTTP errors, e.g. an expired token
    
    #print(r.url)
    payload = r.json() #parsed JSON response
    #if the 'paging' block has a 'next' key, it holds the URL for the next chunk of people,
    #which means there are more members in the group
    next_url = payload.get('paging', {}).get('next')
    data = payload.get('data') or [] #empty list if the 'data' key doesn't exist
    
    marked_samples = []
    num_seen = 0
    for person in data:
        #only the member ID is needed for marking
        marked_samples, num_seen = reservoir_sampler(person['id'], num_to_mark, marked_samples, num_seen)
    #at this point only the first page (up to api_limit people) has been sampled
    
    if get_all_group_members:
        while(next_url is not None):  #can be dangerous, code might run for a very long time!!
            r = requests.get(next_url)
            r.raise_for_status()
            payload = r.json()
            data = payload.get('data') or []
            next_url = payload.get('paging', {}).get('next') #reassigned so that every page of members is visited

            for person in data:
                marked_samples, num_seen = reservoir_sampler(person['id'], num_to_mark, marked_samples, num_seen)

    return (set(marked_samples), num_seen)
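
Part of the Todo is splitting the code into more functions. One possible split is to move the paging loop into a generator so that sampling and fetching are separate. The sketch below assumes the same response shape used above (a `data` list plus an optional `paging.next` URL); `iter_member_ids` and the `fetch_json` parameter are hypothetical names, and the fake pages stand in for real API responses so the example runs offline.

```python
def iter_member_ids(first_url, fetch_json):
    """Yield member IDs page by page, following 'paging.next' until it is absent.

    fetch_json is any callable that maps a URL to the decoded JSON dict,
    for example lambda url: requests.get(url).json() against the live API.
    """
    url = first_url
    while url is not None:
        payload = fetch_json(url)
        for person in payload.get("data") or []:
            yield person["id"]
        # A missing 'paging' block or 'next' key ends the loop
        url = payload.get("paging", {}).get("next")

# Offline demonstration with two fake pages shaped like the member responses
fake_pages = {
    "page1": {"data": [{"id": "1"}, {"id": "2"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"id": "3"}], "paging": {}},
}
member_ids = list(iter_member_ids("page1", fake_pages.__getitem__))
print(member_ids)  # ['1', '2', '3']
```

Each yielded ID could then be passed to `reservoir_sampler` one at a time, which keeps memory use bounded by `num_to_mark` regardless of group size.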

In [9]:
#Capture (first sample) 
first_group, true_group_size = mark_from_group(250, group_id, access_token, True, 1500) #TRUE to get all group members

In [10]:
#Re-Capture (second sample)
second_group, true_group_size_2 = mark_from_group(250, group_id, access_token, True, 1500)

In [11]:
#Estimating Group Size

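def lincoln_estimator(initial_group, recaptured_group):
    #Lincoln-Petersen estimate: (size of first sample * size of second sample) / size of overlap
    intersect = set.intersection(initial_group, recaptured_group)
    print("Intersection size: ", len(intersect))
    if len(intersect) == 0:
        #no recaptures means the estimate is undefined (division by zero)
        raise ValueError("No overlap between the two samples; cannot estimate group size")
    return len(initial_group) * len(recaptured_group) / len(intersect)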

estimated_group_size = lincoln_estimator(first_group, second_group)
print(estimated_group_size, true_group_size)
print("Error: ", (abs(estimated_group_size - true_group_size) / true_group_size)* 100, "%" )

Intersection size:  11
5681.818181818182 6180
Error:  8.061194468961457 %
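
As one small step toward the more complex techniques listed in the Todo, the Chapman estimator is a commonly used bias-corrected variant of the Lincoln estimate that also stays defined when the intersection is empty. The sketch below applies it to the counts from the run above (two samples of 250 with 11 recaptures). The function names are illustrative, and the standard error uses Seber's variance approximation, which assumes both samples are drawn at random from a closed population.

```python
import math

def chapman_estimator(n1, n2, m):
    """Chapman's bias-corrected variant of the Lincoln-Petersen estimate.

    n1, n2: sizes of the first and second samples; m: number recaptured.
    """
    return (n1 + 1) * (n2 + 1) / float(m + 1) - 1

def chapman_std_error(n1, n2, m):
    """Approximate standard error of the Chapman estimate (Seber's variance formula)."""
    variance = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / float((m + 1) ** 2 * (m + 2))
    return math.sqrt(variance)

estimate = chapman_estimator(250, 250, 11)
std_error = chapman_std_error(250, 250, 11)
print(round(estimate, 1), round(std_error, 1))
```

With only 11 recaptures the standard error is large relative to the estimate, which suggests that bigger samples (and so more recaptures) would tighten the result more than switching estimators.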


In [None]:
#next step would be implement
#there isn't a simple way to pull everyone
#have to do it a chunk at a time
#got help from brother with the random sampling part
#have this data that comes in one chunk at a time
#what kinds of visualizations do we need?
#more complicated estimator?

#add requests.get(url) get the user based upon the