## Processing Data

Let us process the data before storing it in DynamoDB.

* Here are the fields we are interested in:
  * id
  * node_id
  * name
  * full_name
  * owner.login
  * owner.id
  * owner.node_id
  * owner.type
  * owner.site_admin
  * html_url
  * description
  * fork
  * created_at
* We will define `owner` as a map (a Python dict) that holds only the owner fields listed above.
* Read up to 100 records from the list public repositories API.
* Get all the fields for each repository from the get repository API.
* Build a collection so that we can write to the target database.
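The field selection described above can be sketched in a few lines. This is only an illustration: the names `REPO_FIELDS`, `OWNER_FIELDS`, `pick` and `to_item` are hypothetical helpers that are not part of the notebook, and the sample record is a shortened copy of a response shown later in this section.

```python
# Hypothetical helper names; the notebook builds the same dict by hand later on.
REPO_FIELDS = ['id', 'node_id', 'name', 'full_name',
               'html_url', 'description', 'fork', 'created_at']
OWNER_FIELDS = ['login', 'id', 'node_id', 'type', 'site_admin']


def pick(record, fields):
    """Return a new dict with only the requested keys of record."""
    return {field: record[field] for field in fields}


def to_item(repo):
    """Keep the top-level fields and nest the owner fields as a dict."""
    item = pick(repo, REPO_FIELDS)
    item['owner'] = pick(repo['owner'], OWNER_FIELDS)
    return item


# Shortened sample record with a few extra keys that should be dropped
sample = {
    'id': 758759532,
    'node_id': 'R_kgDOLTnAbA',
    'name': 'Henriquetx06',
    'full_name': 'Henriquetx06/Henriquetx06',
    'private': False,
    'owner': {
        'login': 'Henriquetx06',
        'id': 159830970,
        'node_id': 'U_kgDOCYbTug',
        'gravatar_id': '',
        'type': 'User',
        'site_admin': False,
    },
    'html_url': 'https://github.com/Henriquetx06/Henriquetx06',
    'description': None,
    'fork': False,
    'created_at': '2024-02-17T02:21:37Z',
}

item = to_item(sample)
```

Keeping the field names in lists makes it easy to add or remove a field later without touching the extraction logic.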
  

In [3]:
import requests

### 1. Read list of 100 repos

In [4]:
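import os

# Read the token from the environment (GITHUB_TOKEN is an assumed variable name) instead of hard-coding it
res = requests.get(
    'https://api.github.com/repositories?since=758759529',
    headers={'Authorization': f"token {os.environ['GITHUB_TOKEN']}"}
)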

In [5]:
import json
repos = json.loads(res.content.decode('utf-8'))

### 2. Use a loop to process the repos one by one

In [6]:
%%time

# Use this cell-level magic to measure cell and loop performance
for i in range(1000000):
    _ = i * 2


CPU times: user 151 ms, sys: 8.24 ms, total: 159 ms
Wall time: 158 ms
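Outside a notebook the `%%time` magic is not available. A rough equivalent, shown here only as a sketch, uses the standard `time` module: `time.perf_counter` for wall time and `time.process_time` for CPU time. The numbers will differ from the output above on every run.

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time used by this process

for i in range(1000000):
    _ = i * 2

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print(f'CPU time: {cpu_elapsed:.3f} s, wall time: {wall_elapsed:.3f} s')
```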


In [7]:
%%time
import os
token = os.environ['GITHUB_TOKEN']  # assumed environment variable; avoids hard-coding the token
repos_details = []
for repo in repos:
    try:
        owner = repo['owner']['login']
        name = repo['name']
        rd = json.loads(requests.get(
            f'https://api.github.com/repos/{owner}/{name}',
            headers={'Authorization': f'token {token}'}
        ).content.decode('utf-8'))    # 2.1 Read details of this repo
        repo_details = {
            'id': rd['id'],
            'node_id': rd['node_id'],
            'name': rd['name'],
            'full_name': rd['full_name'],
            'owner': {
                'login': rd['owner']['login'],
                'id': rd['owner']['id'],
                'node_id': rd['owner']['node_id'],
                'type': rd['owner']['type'],
                'site_admin': rd['owner']['site_admin']
            },
            'html_url': rd['html_url'],
            'description': rd['description'],
            'fork': rd['fork'],
            'created_at': rd['created_at']     #2.2 Reformat details and put into list
        }
        repos_details.append(repo_details)
    except Exception:
        # Skip repos whose details cannot be read or are missing expected fields
        pass

CPU times: user 3.11 s, sys: 159 ms, total: 3.27 s
Wall time: 47.9 s
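The wall time (47.9 s) is much larger than the CPU time (3.27 s), so the loop spends most of its time waiting for HTTP responses. One possible way to overlap those waits is a small thread pool from the standard library. The sketch below is not part of the original notebook: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real request so that the example runs without network access. GitHub applies rate limits, so the number of workers should stay small.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(repos, fetch_one, max_workers=8):
    """Call fetch_one(repo) for every repo using a small pool of threads.

    fetch_one is a hypothetical callable that performs one request and
    returns the parsed details, or None if that repo could not be read.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input list
        results = list(pool.map(fetch_one, repos))
    # Drop the repos that could not be read
    return [result for result in results if result is not None]


def fake_fetch(repo):
    """Stand-in for a real HTTP call; returns None for incomplete records."""
    if 'owner' not in repo:
        return None
    return {'full_name': f"{repo['owner']['login']}/{repo['name']}"}


sample = [
    {'name': 'a', 'owner': {'login': 'x'}},
    {'name': 'broken'},
    {'name': 'b', 'owner': {'login': 'y'}},
]
details = fetch_all(sample, fake_fetch)
```

In real use, `fetch_one` would wrap the request and the field extraction for a single repo.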


In [9]:
len(repos_details) # Fewer than 100 because repos that raised an error were skipped

98
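The count alone does not tell us which repos were skipped or why. A minimal sketch that keeps track of them is shown below. The names `collect_details` and `fake_fetch` are hypothetical, and `fake_fetch` only imitates an error payload without an `id` key, which is one plausible reason for a skip (for example a repo that was deleted between the two API calls).

```python
def collect_details(repos, fetch_one):
    """Return (details, skipped); skipped holds (full_name, reason) pairs."""
    details, skipped = [], []
    for repo in repos:
        try:
            details.append(fetch_one(repo))
        except Exception as exc:
            # Remember what failed instead of passing silently
            reason = f'{type(exc).__name__}: {exc}'
            skipped.append((repo.get('full_name'), reason))
    return details, skipped


def fake_fetch(repo):
    """Stand-in for the request plus extraction step (no network access)."""
    # Assumption: an unavailable repo yields a payload without an 'id' key
    payload = {'message': 'Not Found'} if repo.get('gone') else {'id': repo['id']}
    return {'id': payload['id']}


sample = [
    {'id': 1, 'full_name': 'x/a'},
    {'id': 2, 'full_name': 'x/b', 'gone': True},
    {'id': 3, 'full_name': 'y/c'},
]
details, skipped = collect_details(sample, fake_fetch)
```

Printing `skipped` after the loop shows exactly which records are missing from the result.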

In [10]:
repos_details[0]

{'id': 758759532,
 'node_id': 'R_kgDOLTnAbA',
 'name': 'Henriquetx06',
 'full_name': 'Henriquetx06/Henriquetx06',
 'owner': {'login': 'Henriquetx06',
  'id': 159830970,
  'node_id': 'U_kgDOCYbTug',
  'type': 'User',
  'site_admin': False},
 'html_url': 'https://github.com/Henriquetx06/Henriquetx06',
 'description': None,
 'fork': False,
 'created_at': '2024-02-17T02:21:37Z'}

### 3. Modularize the previous loop into functions

In [18]:
# 3.1 Get the list of all public repos

import json
import requests
def list_repos(token, since='758759529'):
    res = requests.get(
        f'https://api.github.com/repositories?since={since}',
        headers={'Authorization': f'token {token}'}
    )
    return json.loads(res.content.decode('utf-8'))
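The `since` parameter is a repository ID, and the API returns repositories with an ID greater than it. To read the following page, we can pass the largest ID of the current page as the next `since`. The helper below is a sketch with a hypothetical name (`next_since`); it works on plain lists, so it can be tried without calling the API.

```python
def next_since(page, current_since):
    """Return the `since` value to use for the next request.

    page is the list of repos returned by the previous call.
    """
    if not page:
        # Nothing new was returned, so keep the current position
        return current_since
    return max(repo['id'] for repo in page)


# IDs taken from the first and last records shown in this section
page = [{'id': 758759532}, {'id': 758759785}]
since = next_since(page, 758759529)
```

A later call could then look like `list_repos(token, since=since)`.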

In [12]:
import os
repos = list_repos(os.environ['GITHUB_TOKEN'])  # GITHUB_TOKEN is an assumed environment variable name

In [13]:
repos[0]

{'id': 758759532,
 'node_id': 'R_kgDOLTnAbA',
 'name': 'Henriquetx06',
 'full_name': 'Henriquetx06/Henriquetx06',
 'private': False,
 'owner': {'login': 'Henriquetx06',
  'id': 159830970,
  'node_id': 'U_kgDOCYbTug',
  'avatar_url': 'https://avatars.githubusercontent.com/u/159830970?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/Henriquetx06',
  'html_url': 'https://github.com/Henriquetx06',
  'followers_url': 'https://api.github.com/users/Henriquetx06/followers',
  'following_url': 'https://api.github.com/users/Henriquetx06/following{/other_user}',
  'gists_url': 'https://api.github.com/users/Henriquetx06/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/Henriquetx06/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/Henriquetx06/subscriptions',
  'organizations_url': 'https://api.github.com/users/Henriquetx06/orgs',
  'repos_url': 'https://api.github.com/users/Henriquetx06/repos',
  'events_url': 'https://api.github.com/use

In [22]:
# 3.3 Helper function : Get details of a single repo

def get_repo_details(owner, name, token):
    repo_details = json.loads(requests.get(
        f'https://api.github.com/repos/{owner}/{name}',
        headers={'Authorization': f'token {token}'}
    ).content.decode('utf-8'))
    return repo_details

In [23]:
# 3.4 Helper function : Change format of a single repo

def extract_repo_fields(repo_details):
    repo_fields = {
        'id': repo_details['id'],
        'node_id': repo_details['node_id'],
        'name': repo_details['name'],
        'full_name': repo_details['full_name'],
        'owner': {
            'login': repo_details['owner']['login'],
            'id': repo_details['owner']['id'],
            'node_id': repo_details['owner']['node_id'],
            'type': repo_details['owner']['type'],
            'site_admin': repo_details['owner']['site_admin']
        },
        'html_url': repo_details['html_url'],
        'description': repo_details['description'],
        'fork': repo_details['fork'],
        'created_at': repo_details['created_at']
    }
    return repo_fields

In [24]:
# 3.2 Loop over every repo in repos

def get_repos(repos, token):
    repos_details = []
    for repo in repos:
        try:
            owner = repo['owner']['login']
            name = repo['name']
            repo_details = get_repo_details(owner, name, token) # Get details of a single repo from GitHub
            repo_fields = extract_repo_fields(repo_details) # Keep only the fields we need
            repos_details.append(repo_fields) # Add the arranged result to the final list
        except Exception:
            # Skip repos whose details cannot be read or are missing expected fields
            pass
    return repos_details

In [25]:
import os
repos_details = get_repos(repos, os.environ['GITHUB_TOKEN'])  # GITHUB_TOKEN is an assumed environment variable name

In [26]:
len(repos_details)

98

In [27]:
repos_details[0]

{'id': 758759532,
 'node_id': 'R_kgDOLTnAbA',
 'name': 'Henriquetx06',
 'full_name': 'Henriquetx06/Henriquetx06',
 'owner': {'login': 'Henriquetx06',
  'id': 159830970,
  'node_id': 'U_kgDOCYbTug',
  'type': 'User',
  'site_admin': False},
 'html_url': 'https://github.com/Henriquetx06/Henriquetx06',
 'description': None,
 'fork': False,
 'created_at': '2024-02-17T02:21:37Z'}

In [28]:
repos_details[-1]

{'id': 758759785,
 'node_id': 'R_kgDOLTnBaQ',
 'name': 'myportofolio',
 'full_name': 'nivedeepika/myportofolio',
 'owner': {'login': 'nivedeepika',
  'id': 118369201,
  'node_id': 'U_kgDOBw4rsQ',
  'type': 'User',
  'site_admin': False},
 'html_url': 'https://github.com/nivedeepika/myportofolio',
 'description': None,
 'fork': False,
 'created_at': '2024-02-17T02:22:44Z'}

In [24]:
repos_details[-1]['id']

333256115