# Contribution scraping, workbook 3

Tasks:
* Decide whether to scrape more info
* Scrape a subset of cryptocurrency data + save; what statistics can I get for each user?
* Look into parsing usernames; perhaps image recognition on avatars?

Start by looking at the full range of what's available to see if we should include more than just contributions.

In [4]:
import requests
import json

In [5]:
contributors_url = 'https://api.github.com/repos/recursecenter/blaggregator/contributors'

In [12]:
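# Same Accept header the scraper further down uses; defined here so this cell runs on its own
headers = {'Accept': 'application/vnd.github.v3+json'}


def get_contributors_test(url):
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        return response.json()
    else:
        # report the URL that was actually requested (`contrib_url` is not defined here)
        print('[!] HTTP {0} calling repo [{1}]'.format(response.status_code, url))
        return None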

In [13]:
blag = get_contributors_test(contributors_url)

In [16]:
blag[1]

{'login': 'punchagan',
 'id': 315678,
 'node_id': 'MDQ6VXNlcjMxNTY3OA==',
 'avatar_url': 'https://avatars0.githubusercontent.com/u/315678?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/punchagan',
 'html_url': 'https://github.com/punchagan',
 'followers_url': 'https://api.github.com/users/punchagan/followers',
 'following_url': 'https://api.github.com/users/punchagan/following{/other_user}',
 'gists_url': 'https://api.github.com/users/punchagan/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/punchagan/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/punchagan/subscriptions',
 'organizations_url': 'https://api.github.com/users/punchagan/orgs',
 'repos_url': 'https://api.github.com/users/punchagan/repos',
 'events_url': 'https://api.github.com/users/punchagan/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/punchagan/received_events',
 'type': 'User',
 'site_admin': False,
 'contributions': 249}

It might be useful to collect the profile URL and avatar URL, since the avatar can be processed as a picture of the user. But should we look this up in a separate scrape?

In fact, if we follow the API `url` field, we can see there's an optional `name` field that gives us additional information about the person.

In [23]:
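def find_contributor_name(url):
    # returns the full user profile as a dict; the human name is under 'name'
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        return response.json()
    else:
        # report the URL that was actually requested (`contrib_url` is not defined here)
        print('[!] HTTP {0} looking up user [{1}]'.format(response.status_code, url))
        return None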

In [24]:
punchagan = find_contributor_name('https://api.github.com/users/punchagan')

In [27]:
punchagan['name']

'Puneeth Chaganti'

So, we'll need to write a second scraper to get the human name from a list of usernames. Perhaps this could be an additional step to build into the repo scraper, rather than collect all of the repos first, then go back to collect all of the names.

ALSO: At some point I might need to download profile images in order to pass them through an algorithm.
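
A minimal sketch of that download step, assuming a `requests`-style session (such as the `session` created later in this workbook) is passed in. The function name `download_avatar` and the `avatars` output directory are hypothetical, and no attempt is made to detect the image format.

```python
import os


def download_avatar(session, username, avatar_url, out_dir='avatars'):
    """Save the raw avatar bytes to out_dir/<username>; return the path, or None on failure."""
    response = session.get(avatar_url)
    if response.status_code != 200:
        print('[!] HTTP {0} downloading avatar [{1}]'.format(response.status_code, avatar_url))
        return None

    # create the output directory on first use
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, username)
    with open(path, 'wb') as f:
        f.write(response.content)
    return path
```

Passing the session in keeps the function usable with or without the Tor proxy settings.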

### Gender parsing steps:
1. Split profiles into those with a human name and those without
    * No human name goes into the 'uncertain' bucket
2. Parse human names for gender
    * Names with ambiguous gender also go into the 'uncertain' bucket
3. Scrape avatars for every username in 'uncertain'
4. Apply a gender classification algorithm to the avatars.
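
Step 1 could be sketched roughly as below. This assumes each contributor is a dict whose `name` key is `None` or empty when the profile has no human name; `split_by_name` is a hypothetical helper, and the name-to-gender step is deliberately left out.

```python
def split_by_name(contributors):
    """Split contributor dicts into (named, uncertain) lists."""
    named, uncertain = [], []
    for contributor in contributors:
        name = contributor.get('name')
        if name and name.strip():
            named.append(contributor)
        else:
            # missing, None, or blank name: decide later from the avatar
            uncertain.append(contributor)
    return named, uncertain


people = [
    {'username': 'punchagan', 'name': 'Puneeth Chaganti'},
    {'username': 'sursh', 'name': None},
]
named, uncertain = split_by_name(people)
print([p['username'] for p in named])      # ['punchagan']
print([p['username'] for p in uncertain])  # ['sursh']
```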


## Tweak scraper file to collect profile URL and avatar link


In [15]:
blag[0]

{'login': 'sursh',
 'id': 719590,
 'node_id': 'MDQ6VXNlcjcxOTU5MA==',
 'avatar_url': 'https://avatars1.githubusercontent.com/u/719590?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/sursh',
 'html_url': 'https://github.com/sursh',
 'followers_url': 'https://api.github.com/users/sursh/followers',
 'following_url': 'https://api.github.com/users/sursh/following{/other_user}',
 'gists_url': 'https://api.github.com/users/sursh/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/sursh/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/sursh/subscriptions',
 'organizations_url': 'https://api.github.com/users/sursh/orgs',
 'repos_url': 'https://api.github.com/users/sursh/repos',
 'events_url': 'https://api.github.com/users/sursh/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/sursh/received_events',
 'type': 'User',
 'site_admin': False,
 'contributions': 278}

In [67]:
# %load tor_get_repo_contributions.py
import json
import requests

session = requests.session()
session.proxies = {}

session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'

api_url_base = 'https://api.github.com/'
headers = {'Content-Type': 'application/json',
           'User-Agent': 'python-requests/3.1.7',
           'Accept': 'application/vnd.github.v3+json'}


def get_repos(orgname):
    api_url = '{}orgs/{}/repos'.format(api_url_base, orgname)
    # use session.get instead of request
    response = session.get(api_url, headers=headers)

    if response.status_code == 200:
        return (response.json())
    else:
        print('[!] HTTP {0} calling [{1}]'.format(response.status_code, api_url))
        return None


def get_contributors(repo):
    name = repo['name']
    contrib_url = repo['contributors_url']
    
    response = session.get(contrib_url, headers=headers)

    if response.status_code == 200:
        return (
            # returns `contribution_response`
            {'name': name,
             'data': response.json()}
        )
    else:
        print('[!] HTTP {0} calling repo [{1}]'.format(response.status_code, contrib_url))
        return None


def build_contribution_list(contribution_response):
    all_repo_contributions = list()

    # one dict per contributor in this repo
    for contributor in contribution_response['data']:
        ctr = dict()
        ctr["repo"] = contribution_response['name']
        ctr["username"] = contributor['login']
        ctr["contributions"] = contributor['contributions']
        ctr["avatar_url"] = contributor['avatar_url']
        ctr["profile_url"] = contributor['url']

        all_repo_contributions.append(ctr)

    return all_repo_contributions


all_org_contributions = list()


def get_all_contributions(org):
    repos = get_repos(org)
    if repos is None:
        # the org lookup failed; nothing to iterate over
        return

    for repo in repos:
        contributors = get_contributors(repo)
        if contributors is None:
            # request failed for this repo; skip it instead of crashing
            continue
        contribution_list = build_contribution_list(contributors)
        all_org_contributions.append(contribution_list)

    # need to do real name lookup next.
    # add_names_to_list(all_org_contributions)
    # print("Done!")

In [70]:
get_all_contributions('recursecenter')

In [72]:
# We now have a list with one item per repo
all_org_contributions[0]

[{'repo': 'hs-cli',
  'username': 'zachallaun',
  'contributions': 51,
  'avatar_url': 'https://avatars0.githubusercontent.com/u/503938?v=4',
  'profile_url': 'https://api.github.com/users/zachallaun'},
 {'repo': 'hs-cli',
  'username': 'davidbalbert',
  'contributions': 2,
  'avatar_url': 'https://avatars2.githubusercontent.com/u/123350?v=4',
  'profile_url': 'https://api.github.com/users/davidbalbert'}]

In [46]:
# Access just the url with nested for loops
for repo_contributions in all_org_contributions:
    for contributor in repo_contributions:
        # this would be the find_name lookup
        print(contributor['profile_url'])

https://api.github.com/users/zachallaun
https://api.github.com/users/davidbalbert
https://api.github.com/users/danielmendel
https://api.github.com/users/astrieanna
https://api.github.com/users/zachallaun
https://api.github.com/users/chuckha
https://api.github.com/users/ncollins
https://api.github.com/users/maxlikely
https://api.github.com/users/StefanKarpinski
https://api.github.com/users/jroes
https://api.github.com/users/sursh
https://api.github.com/users/punchagan
https://api.github.com/users/davidbalbert
https://api.github.com/users/kenyavs
https://api.github.com/users/stanzheng
https://api.github.com/users/akaptur
https://api.github.com/users/porterjamesj
https://api.github.com/users/santialbo
https://api.github.com/users/strugee
https://api.github.com/users/danluu
https://api.github.com/users/PuercoPop
https://api.github.com/users/pnf
https://api.github.com/users/alliejones
https://api.github.com/users/nnja
https://api.github.com/users/graue
https://api.github.com/users/thomasboy

In [78]:
def lookup_human_name(profile_url):
    # restate headers in case I need to change
    lookup_headers = {'Content-Type': 'application/json',
           'User-Agent': 'python-requests/3.6.1',
           'Accept': 'application/vnd.github.v3+json'}
    
    response = session.get(profile_url, headers=lookup_headers)

    if response.status_code == 200:
        return (response.json()['name'])
    else:
        print('[!] HTTP {0} looking up user [{1}]'.format(response.status_code, profile_url))
        return None

In [79]:
# Test for profiles without a name filled in 
a = lookup_human_name('https://api.github.com/users/sursh')

In [80]:
print(a) # this won't throw an error so should be ok

None


In [81]:
# This for loop would be in last stage of repo scrape
for repo_contributions in all_org_contributions:
    for contributor in repo_contributions:
        human_name = lookup_human_name(contributor['profile_url'])
        contributor['name'] = human_name

[!] HTTP 403 looking up user [https://api.github.com/users/thomasballinger]
[!] HTTP 403 looking up user [https://api.github.com/users/margo73465]
[!] HTTP 403 looking up user [https://api.github.com/users/maccman]
[!] HTTP 403 looking up user [https://api.github.com/users/josh]
[!] HTTP 403 looking up user [https://api.github.com/users/alex-stripe]
[!] HTTP 403 looking up user [https://api.github.com/users/zachallaun]
[!] HTTP 403 looking up user [https://api.github.com/users/davidbalbert]
[!] HTTP 403 looking up user [https://api.github.com/users/tmm1]
[!] HTTP 403 looking up user [https://api.github.com/users/raggi]
[!] HTTP 403 looking up user [https://api.github.com/users/jakedouglas]
[!] HTTP 403 looking up user [https://api.github.com/users/sodabrew]
[!] HTTP 403 looking up user [https://api.github.com/users/igrigorik]
[!] HTTP 403 looking up user [https://api.github.com/users/smparkes]
[!] HTTP 403 looking up user [https://api.github.com/users/dj2]
[!] HTTP 403 looking up user 
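
Several profile URLs repeat across repos (for example `zachallaun` and `davidbalbert`), so one way to cut down on requests is to remember names that were already looked up. A minimal sketch, assuming the lookup function is passed in as `lookup` (for instance `lookup_human_name`); `add_names_cached` is a hypothetical name. Failed lookups are cached as `None` too, which may not be desirable once the 403s start.

```python
def add_names_cached(all_org_contributions, lookup):
    """Add a 'name' key to every contributor, calling lookup once per profile URL."""
    names_by_url = {}
    for repo_contributions in all_org_contributions:
        for contributor in repo_contributions:
            url = contributor['profile_url']
            if url not in names_by_url:
                # first time we see this user: do the real lookup
                names_by_url[url] = lookup(url)
            contributor['name'] = names_by_url[url]
    return names_by_url
```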

## To do
Preserve progress so that we don't lose everything if there's an error halfway through.
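
One possible approach is to write results to a JSON file after each repo and skip repos that were already saved on the next run. A minimal sketch, assuming the per-repo work is passed in as a function `scrape_repo(repo_name)` that returns a list of contributor dicts; the function names and the `progress.json` filename are hypothetical.

```python
import json
import os


def load_progress(path='progress.json'):
    """Return the saved {repo_name: contributions} dict, or {} if nothing was saved."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def scrape_with_checkpoints(repo_names, scrape_repo, path='progress.json'):
    """Scrape each repo not already saved, writing to disk after every repo."""
    progress = load_progress(path)
    for repo_name in repo_names:
        if repo_name in progress:
            continue  # already scraped in an earlier run
        progress[repo_name] = scrape_repo(repo_name)
        # save after every repo so an error later loses at most one repo
        with open(path, 'w') as f:
            json.dump(progress, f, indent=2)
    return progress
```

If `scrape_repo` raises partway through, everything scraped before the failure is still on disk, and rerunning picks up from the first unsaved repo.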

In [86]:
all_org_contributions[0] 
# This works for getting names, except I'm getting rate limited

[{'repo': 'hs-cli',
  'username': 'zachallaun',
  'contributions': 51,
  'avatar_url': 'https://avatars0.githubusercontent.com/u/503938?v=4',
  'profile_url': 'https://api.github.com/users/zachallaun',
  'name': 'Zach Allaun'},
 {'repo': 'hs-cli',
  'username': 'davidbalbert',
  'contributions': 2,
  'avatar_url': 'https://avatars2.githubusercontent.com/u/123350?v=4',
  'profile_url': 'https://api.github.com/users/davidbalbert',
  'name': 'David Albert'}]

In [73]:
a = lookup_human_name('https://api.github.com/users/sursh')

[!] HTTP 403 looking up user [https://api.github.com/users/sursh]


Got a message from GitHub:
```
{
  "message": "API rate limit exceeded for 66.108.11.123. (But here's the good news: 
  Authenticated requests get a higher rate limit. Check out the documentation for more details.)",
  "documentation_url": "https://developer.github.com/v3/#rate-limiting"
}
```
=> Need to look up how to do authentication without putting my client ID & key on a public repo.
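
Until authentication is sorted out, the scraper could at least wait for the limit to reset. GitHub reports the remaining quota and the reset time (in UTC epoch seconds) in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. A minimal sketch of reading them; `seconds_until_reset` is a hypothetical helper, and treating missing headers as "not limited" is an assumption.

```python
import time


def seconds_until_reset(response_headers, now=None):
    """Return how many seconds to wait before retrying, or 0 if requests remain."""
    remaining = int(response_headers.get('X-RateLimit-Remaining', 1))
    if remaining > 0:
        return 0
    reset_at = int(response_headers.get('X-RateLimit-Reset', 0))
    now = time.time() if now is None else now
    # never return a negative wait if the reset time has already passed
    return max(0, reset_at - now)
```

In the lookup loop this could be used as `time.sleep(seconds_until_reset(response.headers))` before retrying a request that returned 403.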

GitHub documents authentication [here](https://developer.github.com/v3/#authentication).

The authentication token can be put into a request header:
`curl -H "Authorization: token OAUTH-TOKEN" https://api.github.com`

Example code from [StackOverflow](https://stackoverflow.com/questions/17622439/how-to-use-github-api-token-in-python-for-requesting):
```python
self.headers = {'Authorization': 'token %s' % self.api_token}
r = requests.post(url, headers=self.headers)
```
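
One way to keep the token out of a public repo is to read it from an environment variable instead of hard-coding it. A minimal sketch; the variable name `GITHUB_TOKEN` and the helper `build_auth_headers` are assumptions, not anything GitHub requires.

```python
import os


def build_auth_headers(env_var='GITHUB_TOKEN'):
    """Build request headers, adding Authorization only if a token is set."""
    auth_headers = {'Accept': 'application/vnd.github.v3+json'}
    token = os.environ.get(env_var)
    if token:
        # same 'token <value>' format as the curl example
        auth_headers['Authorization'] = 'token {}'.format(token)
    return auth_headers
```

The lookups would then pass these headers along, e.g. `session.get(profile_url, headers=build_auth_headers())`, with the token exported in the shell before starting the notebook.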

## OAuth Apps
To make this work I need an OAuth application token, which I can get here:
https://developer.github.com/apps/building-oauth-apps/