# Data Retrieval

This notebook walks through retrieving metadata for all packages on PyPI, then retrieving GitHub data for every package that can be identified as linking to a GitHub repository. From GitHub we pull dependency graph information along with README data.

In [1]:
# All the data from PyPI
pypi_data_path = 'data_retrieval/data/pypi_data.json'

# Only packages from PyPI that link to a GitHub repo
pypi_github_pkgs_path = 'data_retrieval/data/pypi_github_data.json'

# Data from GitHub API for packages identified in step 2
github_data_path = 'data_retrieval/data/github_data.json'

## 1.) Retrieve All PyPI Data

In [3]:
from data_retrieval.async_pypi_retrieval import get_all_packages

# There will be many 404 errors, but this is expected - watch for any strange errors
get_all_packages(save_to=pypi_data_path)

  0%|          | 0/1 [00:00<?, ?it/s]

Scanning 100 packages
Getting all packages...
404 Client Error: Not Found for url: https://pypi.org/pypi/biaozhun/json
404 Client Error: Not Found for url: https://pypi.org/pypi/cdssh/json


100%|██████████| 1/1 [00:12<00:00, 12.29s/it]

There were 2 exceptions out of 100 requests.
Saving data to /data...
Saved data to: data_retrieval/data/pypi_data.json
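
For reference, here is a minimal sketch of what a single lookup against PyPI's JSON API might look like. It is not the code in `data_retrieval.async_pypi_retrieval`, which is not shown here; the helper names `pypi_json_url` and `fetch_package` are hypothetical, and the sketch is synchronous and uses only the standard library. A 404, as in the output above, is treated as "no data" rather than as a failure.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def pypi_json_url(name: str) -> str:
    """Build the JSON API URL for a package (same URL shape as in the log above)."""
    return f"https://pypi.org/pypi/{name}/json"


def fetch_package(name: str, timeout: float = 10.0) -> Optional[dict]:
    """Return the package's `info` dict, or None if PyPI answers 404."""
    try:
        with urllib.request.urlopen(pypi_json_url(name), timeout=timeout) as resp:
            return json.load(resp)["info"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # expected for some names; skip quietly
        raise  # anything else is a "strange error" worth seeing
```

The `info` dict is where the `project_urls` field used in the next step lives.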





## 2.) Find All PyPI Data That Links to a GitHub Repository

In [4]:
import ujson
import time
from urllib.parse import urlparse

In [5]:
# Load data we just collected
with open(pypi_data_path, 'r', encoding='utf-8') as f:
    json = ujson.loads(f.read())

In [6]:
verbose = False
github_packages = []

for package_data in json['data']:
    urls = package_data['project_urls']
    if urls is None:
        continue

    is_github = False
    for url in urls.values():
        parsed = urlparse(url)
        p_split = parsed.path.split('/')

        # More than two path segments means both a user and a repo are listed
        if 'github.com' in str(parsed.netloc) and len(p_split) > 2:
            # Build link to just repo if it goes any further
            cleaned_github = 'https://github.com/' + p_split[1] + '/' + p_split[2] + '/'
            package_data['github_link'] = cleaned_github
            # Save
            github_packages.append(package_data)
            is_github = True
            break

    if not is_github and verbose:
        # View all links associated with packages NOT identified as having a GitHub link
        # Helpful to ensure we don't miss any
        print(f'Is not github: {urls} from {package_data["name"]}')
        
# Log and save at completion
print(f'There were  {len(github_packages)}/{len(json["data"])} packages with github links found.')
print('Saving data to /data...')
with open(pypi_github_pkgs_path, 'w', encoding='utf-8') as f:
    ujson.dump({
        "data": github_packages,
        "timestamp": time.time()
    }, f)

print(f'Saved data to: {pypi_github_pkgs_path}')

There were  64/98 packages with github links found.
Saving data to /data...
Saved data to: data_retrieval/data/pypi_github_data.json
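
The link-cleaning logic in the loop above can also be pulled out into a small function, which makes it easier to test on its own. The sketch below is a stricter variant, not a drop-in copy: the name `extract_github_repo` is hypothetical, it matches the host exactly instead of using a substring check, and it ignores empty path segments so that a bare profile link such as `https://github.com/user/` is not treated as a repository. Counts produced with it may therefore differ from the output above.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_github_repo(url: str) -> Optional[str]:
    """Return 'https://github.com/<user>/<repo>/' or None if url is not a repo link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return None
    # Drop empty segments so 'github.com/user/' is not mistaken for a repo
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    return f"https://github.com/{parts[0]}/{parts[1]}/"


print(extract_github_repo("https://github.com/psf/requests/issues"))
```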


## 3.) Retrieve data from GitHub API

This is the trickiest data to collect because GitHub's API is rate limited. To avoid manually re-running the script every hour, a `while` loop checks whether requests will be served or rate limited, and keeps retrieving data until it is stopped.

This means **the looping cell will need to be terminated by hand after a couple of days or so** (it takes that long to go through all the data).
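
To illustrate the kind of check involved: GitHub exposes the current quota at `GET https://api.github.com/rate_limit`, and the response contains a `resources` object with one entry per API (for example `core` and `graphql`), each with `remaining` and `reset` fields. How `GitHubUtils.within_github_rate_limit` is actually implemented is not shown in this notebook, so the following is only a sketch under that assumption; `seconds_until_allowed` is a hypothetical helper that works on an already-parsed response.

```python
import time
from typing import Optional


def seconds_until_allowed(rate_limit: dict, resource: str = "graphql",
                          now: Optional[float] = None) -> float:
    """Return 0 if requests remain for `resource`, else seconds until the reset time."""
    bucket = rate_limit["resources"][resource]
    if bucket["remaining"] > 0:
        return 0.0
    now = time.time() if now is None else now
    # `reset` is a Unix timestamp (seconds) at which the quota is restored
    return max(0.0, bucket["reset"] - now)
```

Sleeping for the returned number of seconds would be an alternative to polling at a fixed five-minute interval.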

In [7]:
from environs import Env

env = Env()
env.read_env()

github_auth = env("GITHUB_AUTH")

In [11]:
import time
from data_retrieval.async_github_retrieval import RetrieveGitHubData
from data_retrieval.github_utils import GitHubUtils

github_utils = GitHubUtils(github_auth)
github_retrieval = RetrieveGitHubData(github_data_path, pypi_github_pkgs_path, github_auth)

In [12]:
# Initialize file with structure to allow us to update a state
# Will only need to run this cell once!
github_retrieval.init_data_map()

Initialized Data Map to: data_retrieval/data/github_data.json


The cell below **will need to be terminated by hand after a couple of days or so** (it takes that long to go through all the data).

Expect to see errors such as "loading", "403", and "rate_limited" while it runs.
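
As a rough illustration of why such errors can be retried, the sketch below partitions GraphQL-style error dicts by their `message` field. Treating `"loading"` as transient is an assumption on my part, and `RETRYABLE_MESSAGES` and `split_errors` are hypothetical names; the retry logic inside `RetrieveGitHubData` is not shown in this notebook and may work differently.

```python
from typing import List, Tuple

# Assumption: a 'loading' message means the dependency graph is not ready yet
RETRYABLE_MESSAGES = {"loading"}


def split_errors(errors: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Partition error dicts into (retryable, other) based on their 'message'."""
    retryable = [e for e in errors if e.get("message") in RETRYABLE_MESSAGES]
    other = [e for e in errors if e.get("message") not in RETRYABLE_MESSAGES]
    return retryable, other
```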

In [13]:
while True:
    # Clear errors at the start, then again after every 20 successful passes
    github_retrieval.clear_error()
    successes = 0
    while successes < 20:
        if github_utils.within_github_rate_limit():
            print('\nWITHIN RATE LIMIT!')
            successes += 1
            github_retrieval.get_all_github_data()
        else:
            print('Waiting 5 Minutes...')
            time.sleep(60 * 5)

  0%|          | 0/1 [00:00<?, ?it/s]

CLEARING ERRORS...
Saving data to /data...
Saved data to: data_retrieval/data/github_data.json

------------- SOME STATS -------------------
There were 0 packages with data
There were 0 packages that throw errors
There are 64 packages total
--------------------------------------------



WITHIN RATE LIMIT!
Setting up packages to be distributed to workers...
Getting all github data...

Error: {'path': ['repository', 'dependencyGraphManifests'], 'locations': [{'line': 5, 'column': 5}], 'message': 'loading'} 
 URL: https://github.com/tarkatronic/django-excel-response/


Error: {'path': ['repository', 'dependencyGraphManifests'], 'locations': [{'line': 5, 'column': 5}], 'message': 'loading'} 
 URL: https://github.com/hchasestevens/astpath/


Error: {'path': ['repository', 'dependencyGraphManifests'], 'locations': [{'line': 5, 'column': 5}], 'message': 'loading'} 
 URL: https://github.com/cjrh/sqllogformatter/


Error: {'path': ['repository', 'dependencyGraphManifests'], 'locations': [{'lin

100%|██████████| 1/1 [01:02<00:00, 62.96s/it]
100%|██████████| 1/1 [00:00<00:00, 4236.67it/s]

There were 14 exceptions out of 64 requests.
Saving data...
Saved data to: data_retrieval/data/github_data.json

------------- SOME STATS -------------------
There were 50 packages with data
There were 14 packages that throw errors
There are 64 packages total
--------------------------------------------



WITHIN RATE LIMIT!
Setting up packages to be distributed to workers...
Getting all github data...



100%|██████████| 1/1 [00:00<00:00, 6636.56it/s]

There were 0 exceptions out of 64 requests.
Saving data...
Saved data to: data_retrieval/data/github_data.json

------------- SOME STATS -------------------
There were 50 packages with data
There were 14 packages that throw errors
There are 64 packages total
--------------------------------------------



[... 18 further passes omitted; each reported 0 exceptions and the same stats as above ...]


CLEARING ERRORS...
Saving data to /data...
Saved data to: data_retrieval/data/github_data.json

------------- SOME STATS -------------------
There were 50 packages with data
There were 7 packages that throw errors
There are 64 packages total
--------------------------------------------





  0%|          | 0/1 [00:00<?, ?it/s]


WITHIN RATE LIMIT!
Setting up packages to be distributed to workers...
Getting all github data...


100%|██████████| 1/1 [00:05<00:00,  5.18s/it]


There were 0 exceptions out of 64 requests.
Saving data...
Saved data to: data_retrieval/data/github_data.json

------------- SOME STATS -------------------
There were 57 packages with data
There were 7 packages that throw errors
There are 64 packages total
--------------------------------------------



[... 17 further passes omitted; each reported 0 exceptions and the same stats as above ...]







WEIRD ERROR: 
Waiting 5 Minutes...


KeyboardInterrupt: 