# Data Collection from GitHub

In this notebook, I'll use the <strong>GitHub API</strong> with <strong>GitHub OAuth authentication</strong> to extract information from my user profile, such as repositories, commits, and the languages used. I will also save this data to <strong>comma-separated values (CSV)</strong> files so that I can draw insights such as the number of commits per repository.

# Import Libraries

In [1]:
import json
import requests
import pandas as pd

I will fetch the required credentials, such as the username and token, from a JSON file and create an authentication header.<br>
<strong>Note:</strong> <strong>GitHub</strong> is deprecating authentication with a username and password, so I use OAuth token authentication instead.

In [26]:
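# Read the credentials file and close it as soon as it has been parsed
with open('credentials.json') as credentials_file:
    user_credentials = json.load(credentials_file)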
token =  user_credentials['token']
headers = {'Authorization': 'token ' + token}
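
As a minimal sketch of the step above, the helper below builds the same header from the text of a credentials file and fails early if the token is missing. The function name `build_auth_headers`, the `username` key, and the placeholder values are assumptions made for illustration; only the `token` key is read by the notebook.

```python
import json

def build_auth_headers(credentials_text):
    """Parse credentials JSON text and return request headers for the GitHub API."""
    credentials = json.loads(credentials_text)
    token = credentials.get('token')
    if not token:
        raise ValueError("credentials file has no 'token' entry")
    return {'Authorization': 'token ' + token}

# Hypothetical file contents; the token value is a placeholder, not a real token
example_text = '{"username": "octocat", "token": "PLACEHOLDER_TOKEN"}'
example_headers = build_auth_headers(example_text)
print(example_headers)
```

Keeping the token in a separate file that is not committed to the repository avoids leaking it when the notebook is shared.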

# User Information

I will fetch the profile data of the authenticated user, such as the login name, the profile URL, and the repository URL (repos_url) used in the next section.

In [28]:
response = requests.get('https://api.github.com/user', headers=headers)
user_data = response.json()
user_data

{'login': 'animeshsingh04',
 'id': 32563403,
 'node_id': 'MDQ6VXNlcjMyNTYzNDAz',
 'avatar_url': 'https://avatars3.githubusercontent.com/u/32563403?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/animeshsingh04',
 'html_url': 'https://github.com/animeshsingh04',
 'followers_url': 'https://api.github.com/users/animeshsingh04/followers',
 'following_url': 'https://api.github.com/users/animeshsingh04/following{/other_user}',
 'gists_url': 'https://api.github.com/users/animeshsingh04/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/animeshsingh04/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/animeshsingh04/subscriptions',
 'organizations_url': 'https://api.github.com/users/animeshsingh04/orgs',
 'repos_url': 'https://api.github.com/users/animeshsingh04/repos',
 'events_url': 'https://api.github.com/users/animeshsingh04/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/animeshsingh04/received_events',
 'type

# Repository Information

I will fetch all repositories for the user.

In [29]:
repos_url = user_data['repos_url']

In [31]:
page_no = 1
repository_data = []
while True:
    # Request one page at a time; the API returns 30 repositories per page by default
    repos_response = requests.get(repos_url, headers=headers, params={'page': page_no})
    repos_page = repos_response.json()
    repository_data = repository_data + repos_page
    repos_fetched = len(repos_page)
    print("Total repositories fetched: {}".format(repos_fetched))
    # A full page means there may be more repositories on the next page
    if repos_fetched == 30:
        page_no = page_no + 1
    else:
        break

Total repositories fetched: 15
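
The page-by-page logic can be tried without network access. The sketch below is an illustration only: `fetch_all_pages`, `fake_fetch_page`, and the 65 fake repositories are hypothetical stand-ins, and the loop simply stops at the first page that is shorter than the page size.

```python
def fetch_all_pages(fetch_page, per_page=30):
    """Collect items from a paged source until a short page signals the end."""
    items = []
    page_no = 1
    while True:
        page = fetch_page(page_no)
        items.extend(page)
        # A page with fewer than per_page items is the last one
        if len(page) < per_page:
            break
        page_no += 1
    return items

# Hypothetical stand-in for the API: 65 fake repositories served 30 per page
fake_repos = [{'id': n} for n in range(65)]

def fake_fetch_page(page_no, per_page=30):
    start = (page_no - 1) * per_page
    return fake_repos[start:start + per_page]

all_repos = fetch_all_pages(fake_fetch_page)
print(len(all_repos))
```

Against the real API, `fetch_page` could be a small function that calls `requests.get` with `params={'page': page_no}` and returns the parsed JSON list.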


I have 15 repositories in total to date.

I will pick one repository and inspect it to see which fields are worth keeping.

In [32]:
repository_data[5]

{'id': 272244192,
 'node_id': 'MDEwOlJlcG9zaXRvcnkyNzIyNDQxOTI=',
 'name': 'AnnotationTool-Yolo',
 'full_name': 'animeshsingh04/AnnotationTool-Yolo',
 'private': False,
 'owner': {'login': 'animeshsingh04',
  'id': 32563403,
  'node_id': 'MDQ6VXNlcjMyNTYzNDAz',
  'avatar_url': 'https://avatars3.githubusercontent.com/u/32563403?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/animeshsingh04',
  'html_url': 'https://github.com/animeshsingh04',
  'followers_url': 'https://api.github.com/users/animeshsingh04/followers',
  'following_url': 'https://api.github.com/users/animeshsingh04/following{/other_user}',
  'gists_url': 'https://api.github.com/users/animeshsingh04/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/animeshsingh04/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/animeshsingh04/subscriptions',
  'organizations_url': 'https://api.github.com/users/animeshsingh04/orgs',
  'repos_url': 'https://api.github.com/users/ani

There are a number of fields that we can keep track of, such as the following:

<ol>
  <li>id : Unique id of each repository</li>
  <li>name : Name of the repository</li>
  <li>description : Description of the repository</li>
  <li>created_at : Date and time when the repository was created</li>
  <li>updated_at : Date and time when the repository was last updated</li>
  <li>owner : Username of the owner of the repository</li>
  <li>watchers_count : Number of watchers of the repository</li>
  <li>url : API URL of the repository</li>
  <li>commits_url : API URL for listing the commits in the repository</li>
  <li>languages_url : API URL for listing the languages used in the repository</li>
</ol>

In [34]:
repos_information = []
for repo in repository_data:
    data = []
    data.append(repo['id'])
    data.append(repo['name'])
    data.append(repo['description'])
    data.append(repo['created_at'])
    data.append(repo['updated_at'])
    data.append(repo['owner']['login'])
    data.append(repo['watchers_count'])
    data.append(repo['url'])
    data.append(repo['commits_url'].split("{")[0])
    data.append(repo['languages_url'])
    repos_information.append(data)
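
The `split("{")[0]` call above is there because `commits_url` is returned as a URL template with a `{/sha}` placeholder at the end. A small sketch of that step follows; the helper name `strip_uri_template` and the example URLs are illustrative assumptions, not values from my profile.

```python
def strip_uri_template(url):
    """Drop any trailing '{...}' template expression from an API URL."""
    return url.split('{')[0]

# Hypothetical example values in the shape the API returns
template_url = 'https://api.github.com/repos/octocat/Hello-World/commits{/sha}'
plain_url = 'https://api.github.com/repos/octocat/Hello-World/languages'

commits_list_url = strip_uri_template(template_url)
languages_list_url = strip_uri_template(plain_url)
print(commits_list_url)
print(languages_list_url)
```

A URL without a placeholder, such as `languages_url`, passes through unchanged, so the same helper is safe to apply to either field.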

In [36]:
repo_df_column = ['Id','Name','Description','Created on','Updated on','Owner','Watchers count', 
                    'Url','Commits url','Languages url']

In [37]:
repository_info_df = pd.DataFrame(repos_information,columns=repo_df_column)
repository_info_df.head(5)

|   | Id | Name | Description | Created on | Updated on | Owner | Watchers count | Url | Commits url | Languages url |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 234071231 | Angular | | 2020-01-15T12:01:40Z | 2020-06-30T15:56:42Z | animeshsingh04 | 1 | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... |
| 1 | 239988172 | AngularProject-to-showApilog-from-file | | 2020-02-12T10:43:17Z | 2020-06-30T15:56:39Z | animeshsingh04 | 1 | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... |
| 2 | 233563609 | AngularProject-to-showApilog-from-logfile | This Repo showcase API log using Angular 6 fro... | 2020-01-13T09:55:31Z | 2020-06-30T15:56:26Z | animeshsingh04 | 1 | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... |
| 3 | 233396617 | Animation-using-css3 | Simple animation using pure css3 and html | 2020-01-12T13:26:55Z | 2020-06-30T15:56:45Z | animeshsingh04 | 1 | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... | https://api.github.com/repos/animeshsingh04/An... |
| 4 | 233679163 | animeshsingh04.github.io | | 2020-01-13T19:46:12Z | 2020-06-30T15:56:43Z | animeshsingh04 | 1 | https://api.github.com/repos/animeshsingh04/an... | https://api.github.com/repos/animeshsingh04/an... | https://api.github.com/repos/animeshsingh04/an... |


# Languages used in each Repository

I will iterate over the 'Languages url' value of each repository to get the languages used in it, and save the result back into the dataframe (repository_info_df).

In [38]:
for i in range(len(repository_info_df)):
    language_url = repository_info_df.loc[i, 'Languages url']
    language_response = requests.get(language_url, headers=headers).json()
    print(language_response)
    # Join the language names; an empty response yields an empty string
    repository_info_df.loc[i, 'Language'] = ', '.join(language_response.keys())

{'JavaScript': 9510143, 'HTML': 1360}
{'JavaScript': 9510143, 'HTML': 1391}
{'TypeScript': 9822, 'HTML': 2219, 'JavaScript': 1834, 'CSS': 283}
{'CSS': 3569, 'HTML': 2981, 'JavaScript': 965}
{'CSS': 3764, 'HTML': 3072, 'JavaScript': 988}
{'Jupyter Notebook': 357157, 'Python': 9388}
{'JavaScript': 6203, 'CSS': 3351, 'HTML': 2270}
{'Jupyter Notebook': 224490}
{'JavaScript': 12625, 'HTML': 6185, 'CSS': 1778}
{'Jupyter Notebook': 246501}
{'HTML': 1470}
{'Jupyter Notebook': 572859}
{'CSS': 20698, 'HTML': 20646, 'JavaScript': 704}
{'Java': 6170, 'HTML': 3532, 'Scala': 1268, 'JavaScript': 1227, 'CSS': 701}
{'Jupyter Notebook': 169087}
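
Each response maps a language name to the number of bytes of code written in that language, so the numbers can be turned into shares. The sketch below is one possible way to do that; `language_shares` is a hypothetical helper, and the sample is one of the responses printed above.

```python
def language_shares(language_bytes):
    """Convert a {language: bytes} mapping into percentages of the total."""
    total = sum(language_bytes.values())
    if total == 0:
        return {}
    return {
        name: round(100 * count / total, 1)
        for name, count in language_bytes.items()
    }

# One of the responses printed above
sample = {'CSS': 3569, 'HTML': 2981, 'JavaScript': 965}
shares = language_shares(sample)
dominant_language = max(shares, key=shares.get)
print(shares)
print(dominant_language)
```

Storing the dominant language alongside the joined names would make later grouping by language simpler, since each repository then has a single label.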


I will save all the data captured in the dataframe to a .csv file called <strong>repository_info.csv</strong>

In [40]:
repository_info_df.to_csv('repository_info.csv', index = False)

# Commit Data of Each Repository

Now I will get the commit data for each repository using the <strong>Commits url</strong> column saved earlier in <strong>repository_info_df</strong>

For each commit I will save the Id (the repository id, kept as a reference to relate the rows back to repository_info_df), the Commit Id (the commit's sha), the Commit Message, and the Commit Date

In [12]:
commit_information = []
for index, row in repository_info_df.iterrows():
    commit_url = row['Commits url']
    commit_response = requests.get(commit_url, headers=headers)
    # Skip repositories whose commits cannot be listed (for example, an empty repository)
    if not commit_response.ok:
        continue
    # Only the first page of commits is read here (30 per page by default)
    for commit in commit_response.json():
        commit_data = []
        commit_data.append(row['Id'])
        commit_data.append(commit['sha'])
        commit_data.append(commit['commit']['message'])
        commit_data.append(commit['commit']['author']['date'])
        commit_information.append(commit_data)


In [22]:
commit_df_column = ['Id','Commit Id','Commit Message','Commited on']
commit_df = pd.DataFrame(commit_information,columns=commit_df_column)

I will save the commit dataframe to <strong>commit_info.csv</strong>

In [23]:
commit_df.to_csv('commit_info.csv',index = False)

In [25]:
commit_df.head(5)

|   | Id | Commit Id | Commit Message | Commited on |
|---|---|---|---|---|
| 0 | 234071231 | 834843d5307f829103a185e29ea9df0de279bda2 | Updated the Index.html | 2020-01-15T12:13:41Z |
| 1 | 234071231 | b69b1517f299958fc24a3a897890197d7eb74dd5 | READMe.md file commited | 2020-01-15T12:04:52Z |
| 2 | 234071231 | 9072be80ea4de086796e72291627b21024e21e13 | Initial Commit | 2020-01-15T12:03:13Z |
| 3 | 239988172 | 89fe55999449365d4d026c00ce7bfcf587ed59bc | updated the index.html | 2020-02-12T11:20:24Z |
| 4 | 239988172 | 3411437305d486593b14a01693fe92bec2b61208 | Merge branch 'master' of https://github.com/an... | 2020-02-12T10:52:02Z |
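
Because every commit row carries the repository Id, the number of commits per repository mentioned in the introduction can be counted directly. The sketch below uses only the five rows shown above as sample data, so the counts are for this sample and not for the full dataset.

```python
from collections import Counter

# (repository Id, commit date) pairs taken from the five rows shown above
sample_commits = [
    (234071231, '2020-01-15T12:13:41Z'),
    (234071231, '2020-01-15T12:04:52Z'),
    (234071231, '2020-01-15T12:03:13Z'),
    (239988172, '2020-02-12T11:20:24Z'),
    (239988172, '2020-02-12T10:52:02Z'),
]

# Count how many commit rows belong to each repository Id
commits_per_repo = Counter(repo_id for repo_id, _ in sample_commits)
print(commits_per_repo.most_common())
```

On the full dataframe, `commit_df.groupby('Id').size()` should give the same kind of count, and merging the result with repository_info_df on 'Id' would attach the repository names.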

