add experimental design for dockerfile parsing
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed May 16, 2024
1 parent 5165ae8 commit 9577f49
Showing 11 changed files with 444 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

We want to study the impact of pull times (at different scales) for containers of different sizes, meaning we will vary both individual layer sizes and total container size. This entire study is going to be powered by dd, which will allow us to generate files of arbitrary sizes.

- [test](test): early testing of the design below. What I learned is that we likely need to evaluate the container landscape (average sizes) before deciding on our experiment design.
- [test](experiments/test): early testing of the design below. What I learned is that we likely need to evaluate the container landscape (average sizes) before deciding on our experiment design.


Next steps are to sample a set of Dockerfiles to get the container URIs and sizes that people are actually using.
58 changes: 58 additions & 0 deletions experiments/dockerfile/README.md
@@ -0,0 +1,58 @@
# Dockerfile

## Design

This could be a really interesting study. I want to look at:

- Across a set of ML orgs and research software containers, get a list of unique containers. For each:
  - Assess the current total size and the sizes of individual layers, and try to describe the distribution (not just an average).
  - Look at containers over time (represented by sorted version tags; I have a library that does this) and see how the size and layer metrics change.

What we want to know from the above is what the "average" container looks like in terms of size, both total and per layer, and how that has changed over time. As a sub-analysis, we can say something about the percentage of research software projects (from the RSEPedia) that provide containers at all.
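
As a rough sketch of the measurement step, the compressed size of each layer can be read from a registry manifest. The example below is a minimal sketch using the Docker Registry v2 API against Docker Hub with an anonymous pull token; multi-arch images would need an extra step to resolve the manifest list to a single platform, and the repository name is only a placeholder.

```python
#!/usr/bin/env python3
# Sketch: read per-layer (compressed) sizes for one image tag from Docker Hub.
# Assumes the Docker Registry v2 API; multi-arch images return a manifest list
# that would need an extra resolution step not shown here.

import requests


def get_layer_sizes(repo, tag="latest"):
    # Anonymous pull token scoped to this repository
    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io", "scope": f"repository:{repo}:pull"},
    ).json()["token"]

    manifest = requests.get(
        f"https://registry-1.docker.io/v2/{repo}/manifests/{tag}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.docker.distribution.manifest.v2+json",
        },
    ).json()
    return [layer["size"] for layer in manifest.get("layers", [])]


if __name__ == "__main__":
    # "library/ubuntu" is only a placeholder for the containers we will sample
    sizes = get_layer_sizes("library/ubuntu", "22.04")
    print(f"{len(sizes)} layers, {sum(sizes) / 1e6:.1f} MB compressed total")
```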

Given the above, we can then design an experiment that measures just outside the range of what people are actually building, at both the minimum and the maximum. Questions we want to answer:

1. What is the tradeoff between total container size and how that size is split across layers?
2. Is it better to have fewer large layers or many smaller ones?
3. Can we determine the redundancy of layers across containers (arguably a good thing)?
4. Given the costs of storage and time to pull, what strategies can reduce them?

For the third point, it would be really useful to measure the overlap _between_ layers of different containers. For example, if I'm running a bunch of ML containers on a cluster, it would obviously be better for storage if many of those containers shared the same layers. But what does it mean, design-wise, to do that? For the fourth point, I'd like to investigate and test some caching or optimization strategies so that, for example, we don't spend a ton of money on pulling containers alone.
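
As a sketch of that overlap measurement, layer sharing between two images can be scored as the Jaccard overlap of their layer digests, taking manifests of the form fetched in the sketch above:

```python
# Sketch: score layer sharing between two images as the Jaccard overlap of
# their compressed layer digests (1.0 = identical layer sets, 0.0 = none shared).
def layer_overlap(manifest_a, manifest_b):
    a = {layer["digest"] for layer in manifest_a.get("layers", [])}
    b = {layer["digest"] for layer in manifest_b.get("layers", [])}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```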


## Parsing

Instead of trying to sample across all Dockerfiles, let's sample those we find in GitHub organizations that we know do a lot of machine learning. We will take two approaches:

**GitHub orgs**

- start with a list of GitHub orgs (e.g., nvidia, Hugging Face) and list their repositories
- clone all repos and find all Dockerfiles
- get the FROM image in each Dockerfile (see the parsing sketch after the command below)
- for each unique FROM image, get the container tags and sort them across time
- determine how size has changed over time (is it getting larger?)
- create a distribution of sizes

```bash
export GITHUB_TOKEN=xxxxxxxxx
python parse_repos.py
```
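
The script above only collects Dockerfiles; the FROM extraction is a follow-up step that is not part of parse_repos.py yet. A minimal sketch, reusing the recursive_find helper from rse, might look like this (build-arg bases and multi-stage aliases are skipped naively):

```python
# Sketch: collect base image references (FROM lines) from the Dockerfiles
# that parse_repos.py saves under data/orgs.
import re

import rse.utils.file as utils

FROM_RE = re.compile(r"^\s*FROM\s+(?P<image>\S+)", re.IGNORECASE)


def collect_base_images(root="data/orgs"):
    images = set()
    for path in utils.recursive_find(root, "Dockerfile*"):
        with open(path, errors="ignore") as fd:
            for line in fd:
                match = FROM_RE.match(line)
                # Skip bases defined by build args, e.g. FROM $BASE_IMAGE
                if match and not match.group("image").startswith("$"):
                    images.add(match.group("image"))
    return images


print(sorted(collect_base_images()))
```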

**Software Databases**

We can do the same procedure, but instead search repositories in the [Research Software Encyclopedia](https://rseng.github.io/software) to get a sampling of projects from the research software (and closer to HPC) ecosystem. And if BioConda containers give us a sampling of yet another community, I have almost 10k known container URIs in the shpc-registry.
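
For the shpc-registry route, a sketch of gathering those known container URIs from a local clone could look like the following, assuming each registry entry provides a container.yaml with a docker: field (the clone path is a placeholder):

```python
# Sketch: collect container URIs from a local clone of shpc-registry,
# assuming each entry has a container.yaml with a "docker:" field.
from pathlib import Path

import yaml


def shpc_container_uris(registry_root="shpc-registry"):
    uris = set()
    for path in Path(registry_root).rglob("container.yaml"):
        entry = yaml.safe_load(path.read_text())
        if entry and "docker" in entry:
            uris.add(entry["docker"])
    return uris


print(len(shpc_container_uris()))
```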


## Attempt 1

> This did not work (still) because of API limits.

We are going to programmatically get Dockerfiles from GitHub. While we could clone each repository to search for them, it's more of a convention to keep one at the top level, so we are going to guess the path, trying both master and main. Since the GitHub API limits searches to 1K results, we are going to search one week at a time.

```bash
pip install rse
export GITHUB_TOKEN=xxxxxxxxx
python search.py
```
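
For reference, the path-guessing step itself is simple and avoids cloning; a minimal sketch (with a placeholder repository name) follows.

```python
# Sketch: try to fetch a top-level Dockerfile from raw.githubusercontent.com,
# checking the main branch first and then master.
import requests


def fetch_dockerfile(uri):
    """uri is "owner/repo"; returns the Dockerfile text or None."""
    for branch in ("main", "master"):
        url = f"https://raw.githubusercontent.com/{uri}/{branch}/Dockerfile"
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    return None


# Placeholder repository name for illustration
print(fetch_dockerfile("some-org/some-repo") is not None)
```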


195 changes: 195 additions & 0 deletions experiments/dockerfile/parse_repos.py
@@ -0,0 +1,195 @@
#!/usr/bin/env python3

import argparse
import os
import shutil
import sys
import tempfile
from datetime import datetime, timedelta

import requests
from rse.utils.command import Command
import rse.utils.file as utils

here = os.path.abspath(os.path.dirname(__file__))

token = os.environ.get("GITHUB_TOKEN")
if not token:
    sys.exit("Please export GITHUB_TOKEN")


def clone(url, dest):
    """
    Shallow clone a repository into dest, returning the clone path or None.
    """
    dest = os.path.join(dest, os.path.basename(url))
    cmd = Command("git clone --depth 1 %s %s" % (url, dest))
    cmd.execute()
    if cmd.returncode != 0:
        print("Issue cloning %s" % url)
        return
    return dest


def get_range(dt, days=360):
    """
    Get a (start, end) date range as YYYY-MM-DD strings.

    Intended for searching by date window; not wired into main() yet.
    """
    # Allow function to be used for both strings and datetimes
    if isinstance(dt, str):
        dt = datetime.strptime(dt, "%Y-%m-%d")
    next_dt = dt + timedelta(days=days)
    return str(dt).split(" ")[0], str(next_dt).split(" ")[0]


def get_parser():
    parser = argparse.ArgumentParser(description="Dockerfile Scraper")
    parser.add_argument(
        "--start-date",
        default="2013-04-11",
        help="starting date",
    )
    parser.add_argument(
        "--outdir",
        default=os.path.join(here, "data", "orgs"),
        help="output data directory for results",
    )
    parser.add_argument(
        "--days",
        default=100,
        help="days to search for repos over",
    )
    return parser


def main():
    parser = get_parser()
    args, _ = parser.parse_known_args()

    if not os.path.exists(args.outdir):
        os.makedirs(args.outdir)

    # Parse these known ML orgs
    orgs = [
        "nvidia",
        "huggingface",
        "pytorch",
        "tensorflow",
        "azure",
        "udacity",
        "scikitlearn",
    ]

    # Create a base temporary folder to work from
    tempdir = tempfile.mkdtemp()

    # Prepare headers
    headers = {
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
        "Accept": "application/vnd.github+json",
    }

    # Single pass over the orgs for now; the --start-date / --days window
    # (see get_range) is not wired in yet
    while True:
        for org in orgs:
            # Create an output directory for the org
            outdir = os.path.join(args.outdir, org)
            if not os.path.exists(outdir):
                os.makedirs(outdir)

            url = f"https://api.github.com/orgs/{org}/repos"

            print(f"Looking for repos for {org}")
            # Keep track of repos seen so we don't do them twice
            seen = set()

            # Get paginated listing of repos
            total_results = 1
            page = 1
            repos = []
            while total_results > 0:
                response = requests.get(
                    url, headers=headers, params={"per_page": 100, "page": page}
                )

                # Stop for interactive debugging (e.g., rate limiting)
                if response.status_code != 200:
                    print(
                        f"Issue with request: {response.status_code} at {datetime.now()}"
                    )
                    import IPython

                    IPython.embed()

                new_repos = response.json()
                repos += new_repos
                total_results = len(new_repos)
                print(f"Found {total_results} new repos")
                page += 1

            print(f"Found a total of {len(repos)} repositories for {org}")

            for item in repos:
                user, repo = item["html_url"].split("/")[-2:]
                uri = f"{user}/{repo}"
                if uri in seen:
                    continue

                seen.add(uri)
                path = os.path.join(outdir, user, repo)
                if os.path.exists(path):
                    continue

                os.makedirs(path)
                dest = None
                try:
                    # Try clone (and cut out early if not successful)
                    dest = clone(item["html_url"], tempdir)
                    if not dest:
                        continue

                    # Recursively find Dockerfiles and copy them to keep
                    files = list(utils.recursive_find(dest, "Dockerfile*"))
                    print(f" Found {len(files)} Dockerfiles in {uri}")
                    if not files and os.path.exists(path):
                        shutil.rmtree(path)

                    for filename in files:
                        # Preserve the path relative to the cloned repo
                        relpath = os.path.relpath(filename, dest)
                        file_dest = os.path.join(path, relpath)
                        dirname = os.path.dirname(file_dest)
                        if not os.path.exists(dirname):
                            os.makedirs(dirname)
                        shutil.copyfile(filename, file_dest)
                except Exception as exc:
                    print(f"Issue with {item['html_url']}, skipping: {exc}")

                if dest:
                    cleanup(dest)

        # Inspect results interactively, then break so the temporary
        # clone space below is removed
        import IPython

        IPython.embed()
        break

    if os.path.exists(tempdir):
        shutil.rmtree(tempdir)


def cleanup(dest):
    """
    Remove a cloned repository, dropping into IPython on failure.
    """
    try:
        if dest and os.path.exists(dest):
            shutil.rmtree(dest)
    except Exception:
        print("Likely too many files, check with ulimit -n and set with ulimit -n 4096")
        import IPython

        IPython.embed()


if __name__ == "__main__":
    main()
