## Remove Transitives

In this notebook, we explore how to remove transitive dependencies ("transitives") from the data used to train the Python HPF model.

In [1]:
from pip._internal.req.req_file import parse_requirements
from pip._internal.download import PipSession
import requests
import requests_cache
from datetime import timedelta

In [2]:
from pip._vendor.distlib.util import normalize_name

In [3]:
requests_cache.install_cache(expire_after=timedelta(hours=5))

#### Let's start by seeing how to get the transitives for a package from the PyPI JSON API

In [4]:
session = PipSession()

In [5]:
r = requests.get(url='https://pypi.org/pypi/flask/json', headers={"Accept": "application/json"})

In [6]:
r.status_code

200

In [7]:
c = r.json()

In [8]:
requires_dist = c.get('info', {}).get('requires_dist', [])

In [9]:
## These are the transitives for flask. Remember, this is not the complete dependency graph that flask requires.
## These are only the first-level transitives.
requires_dist

['Werkzeug (>=0.15)',
 'Jinja2 (>=2.10.1)',
 'itsdangerous (>=0.24)',
 'click (>=5.1)',
 "pytest ; extra == 'dev'",
 "coverage ; extra == 'dev'",
 "tox ; extra == 'dev'",
 "sphinx ; extra == 'dev'",
 "pallets-sphinx-themes ; extra == 'dev'",
 "sphinxcontrib-log-cabinet ; extra == 'dev'",
 "sphinx-issues ; extra == 'dev'",
 "sphinx ; extra == 'docs'",
 "pallets-sphinx-themes ; extra == 'docs'",
 "sphinxcontrib-log-cabinet ; extra == 'docs'",
 "sphinx-issues ; extra == 'docs'",
 "python-dotenv ; extra == 'dotenv'"]

**We now parse the transitives to get their names, and normalize those names as well. We do not filter out the dependencies declared under extras, because we also want to remove those when somebody installs the package with that extra.**

In [10]:
for d in requires_dist:
    print(normalize_name(d.split()[0]))

werkzeug
jinja2
itsdangerous
click
pytest
coverage
tox
sphinx
pallets-sphinx-themes
sphinxcontrib-log-cabinet
sphinx-issues
sphinx
pallets-sphinx-themes
sphinxcontrib-log-cabinet
sphinx-issues
python-dotenv
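
The `d.split()[0]` approach relies on whitespace between the name and the version specifier, as in `'Werkzeug (>=0.15)'`. For an entry written without whitespace, such as `'Werkzeug>=0.15'`, the specifier would stay attached to the name. Below is a minimal sketch of a stricter parser, assuming we only need the leading project name. The helper `requirement_name` is hypothetical and not part of pip or distlib; it applies the PEP 503 normalization rule directly instead of calling `normalize_name`.

```python
import re

# Leading run of characters that may appear in a project name.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(spec):
    """Return the normalized project name at the start of a requires_dist entry.

    Hypothetical helper for illustration only.
    """
    match = _NAME_RE.match(spec)
    if match is None:
        raise ValueError("no project name found in {!r}".format(spec))
    # PEP 503 normalization: collapse runs of '-', '_' and '.' into '-', then lowercase.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


print(requirement_name("Werkzeug (>=0.15)"))                 # werkzeug
print(requirement_name("Werkzeug>=0.15"))                    # werkzeug
print(requirement_name("python-dotenv ; extra == 'dotenv'"))  # python-dotenv
```

For the flask output shown above, this gives the same names as the split-based version.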


**The split-based parsing above works fine for this output. Let's start looking at the BigQuery (BQ) data and filter out the transitives.**

In [11]:
import json

In [12]:
manifest_json = [{
    "ecosystem": "pypi",
    "package_list": []
}]

manifest_json_without_transitives = [{
    "ecosystem": "pypi",
    "package_list": []
}]

In [13]:
length_tuple_list = []

In [14]:
def remove_transitives(req_names, trans_names):
    req_names = set(req_names) - trans_names
    return list(req_names)
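
As a quick sanity check, here is the helper applied to a toy manifest. The package lists are made up for illustration and do not come from the BQ data; the function is repeated so the snippet runs on its own.

```python
def remove_transitives(req_names, trans_names):
    # Same helper as in the cell above.
    req_names = set(req_names) - trans_names
    return list(req_names)


# Hypothetical manifest: flask and requests are listed together with two of flask's transitives.
req_names = ["flask", "werkzeug", "jinja2", "requests"]
# Hypothetical union of first-level transitives collected for the listed packages.
trans_names = {"werkzeug", "jinja2", "itsdangerous", "click"}

direct = remove_transitives(req_names, trans_names)
print(sorted(direct))  # ['flask', 'requests']
```

Because the helper goes through a `set`, the order of the original manifest is not preserved and duplicate entries are collapsed.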

#### Let's get to the actual work now

Here we do the following:
1. We open the data gathered from BQ and process each requirements.txt.
2. For each requirement in requirements.txt, we get the first-level transitives using the above logic.
3. We now have the first-level transitives for every dependency mentioned in requirements.txt.
4. We remove all of these transitives from requirements.txt and save the remaining direct requirements.

Here is an example of why computing only the first-level transitives is enough:

Let's assume the following package list: {a, b, c, d}. We now take the following cases:

1. a->b, b->c, c->d. The union of the first-level transitives is {b, c, d}, so b, c and d are eliminated and only a remains as a direct.
2. a->b, b->d. The union of the first-level transitives is {b, d}, so a and c remain as directs.

Other cases work out similarly. One edge case I can think of is a cycle: a->b, b->c and then c->a. I am not sure how common this is in Python, but it seems possible, because a, b and c can each be installed as long as the version constraints are satisfied. In that case we end up removing all three dependencies. In my opinion we are not losing much, because such cases are rare.
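
The cases above can be checked with a small sketch. The function `direct_packages` and the toy graphs are hypothetical and only mirror the reasoning in this cell: a package is kept as a direct only if it does not appear among the first-level transitives of any listed package.

```python
def direct_packages(first_level):
    """Return the listed packages that are nobody's first-level transitive.

    first_level maps each listed package to its first-level transitives.
    """
    listed = set(first_level)
    transitives = set()
    for deps in first_level.values():
        transitives.update(deps)
    return listed - transitives


chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}    # case 1
partial = {"a": ["b"], "b": ["d"], "c": [], "d": []}     # case 2
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}             # cyclic edge case

print(sorted(direct_packages(chain)))    # ['a']
print(sorted(direct_packages(partial)))  # ['a', 'c']
print(sorted(direct_packages(cycle)))    # []
```

The last line shows the cost of the cyclic case: every package in the cycle is treated as a transitive, so nothing is kept.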

In [None]:
with open("python-bigquery-data.json", "r") as f, open('error-log-pip-trans.txt', 'a') as log:
    content = json.load(f)
    for x in content:
        if x.get('content'):
            # parse_requirements reads from a file, so write each manifest to a temporary file first
            with open("temp-requirements.txt", "w") as w:
                w.write(x.get('content'))
            req_names = []
            trans_names = set()
            try:
                for p in parse_requirements("temp-requirements.txt", session=session):
                    if p.name:
                        name = normalize_name(p.name)
                        r = requests.get(url='https://pypi.org/pypi/{}/json'.format(name), headers={"Accept": "application/json"})
                        if r.status_code == 200:
                            print("Package: {} done".format(name))
                            req_names.append(name)
                            response = r.json()
                            # requires_dist can be null in the response, so fall back to an empty list
                            requires_dist = response.get('info', {}).get('requires_dist') or []
                            # Collect the first-level transitives (extras included); use `d` so `r` is not shadowed
                            for d in requires_dist:
                                trans_names.add(normalize_name(d.split()[0]))
            except Exception as e:
                # One error per line keeps the log readable
                log.write(str(e) + "\n")
            manifest_json[0].get("package_list").append(req_names)
            req_names_direct = remove_transitives(req_names, trans_names)
            manifest_json_without_transitives[0].get("package_list").append(req_names_direct)
            # Record (size with transitives, size without) for each manifest
            length_tuple_list.append((len(req_names), len(req_names_direct)))

#### We save the files below

In [22]:
with open('manifest-list-with-trans.json', 'w') as f:
    json.dump(manifest_json, f)

In [23]:
with open('manifest-list-without-trans.json', 'w') as f:
    json.dump(manifest_json_without_transitives, f)

In [25]:
len(length_tuple_list)

161562

In [26]:
import pickle

In [33]:
with open('length-tuple.pkl', 'wb') as f:
    pickle.dump(length_tuple_list, f)
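
The saved tuples have the form (number of listed packages, number of direct packages). A minimal sketch of how they might be summarized later is shown below. The function `summarize` and the sample values are hypothetical and are not results from this notebook.

```python
def summarize(length_pairs):
    """Summarize (before, after) manifest sizes.

    Hypothetical helper for illustration only.
    """
    total = len(length_pairs)
    # Manifests that got shorter after filtering.
    changed = sum(1 for before, after in length_pairs if after < before)
    # Total number of entries dropped across all manifests.
    removed = sum(before - after for before, after in length_pairs)
    return {
        "manifests": total,
        "manifests_with_removals": changed,
        "mean_removed": removed / total if total else 0.0,
    }


# Made-up sample in the same shape as length_tuple_list.
sample = [(5, 3), (2, 2), (10, 4), (0, 0)]
summary = summarize(sample)
print(summary)
```

Note that the first element of each tuple counts duplicate entries in a manifest, while the second comes from a set, so the difference also includes any duplicates that were collapsed.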