add experimental design for dockerfile parsing
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed May 16, 2024
1 parent 5165ae8 commit 9577f49
Showing 11 changed files with 444 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@

We want to study the impact of pull times (at different scales) for containers of different sizes, meaning we will vary both individual layer sizes and total container size. This entire study is going to be powered by dd, which will allow us to generate files of arbitrary sizes.

- [test](test): early testing of the design below. What I learned is that we likely need to evaluate the container landscape (average sizes) before deciding on our experiment design.
- [test](experiments/test): early testing of the design below. What I learned is that we likely need to evaluate the container landscape (average sizes) before deciding on our experiment design.


Next steps are to sample a set of Dockerfiles to get the container URIs and sizes that people are actually using.
58 changes: 58 additions & 0 deletions experiments/dockerfile/README.md
@@ -0,0 +1,58 @@
# Dockerfile

## Design

This could be a really interesting study. I want to look at:

- Across a set of ML orgs and research software containers, get a list of unique containers. For each:
  - Assess the current total size and the sizes of individual layers, and try to describe the distribution (not just an average).
  - Look at containers over time (represented by sorted version tags; I have a library that does this) and see how the size and layer metrics change.

What we want to know from the above is what the "average" container looks like in terms of size, both total and per layer, and how that has changed over time. As a sub-analysis, we can say something about the percentage of research software projects (from the RSEPedia) that provide containers at all.
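
As a rough sketch of the measurement step, the compressed size of each layer can be read from a registry manifest. The example below is a minimal sketch using the Docker Registry v2 API against Docker Hub with an anonymous pull token; multi-arch images would need an extra step to resolve the manifest list to a single platform, and the repository name is only a placeholder.

```python
#!/usr/bin/env python3
# Sketch: read per-layer (compressed) sizes for one image tag from Docker Hub.
# Assumes the Docker Registry v2 API; multi-arch images return a manifest list
# that would need an extra resolution step not shown here.

import requests


def get_layer_sizes(repo, tag="latest"):
    # Anonymous pull token scoped to this repository
    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io", "scope": f"repository:{repo}:pull"},
    ).json()["token"]

    manifest = requests.get(
        f"https://registry-1.docker.io/v2/{repo}/manifests/{tag}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.docker.distribution.manifest.v2+json",
        },
    ).json()
    return [layer["size"] for layer in manifest.get("layers", [])]


if __name__ == "__main__":
    # "library/ubuntu" is only a placeholder for the containers we will sample
    sizes = get_layer_sizes("library/ubuntu", "22.04")
    print(f"{len(sizes)} layers, {sum(sizes) / 1e6:.1f} MB compressed total")
```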

Given the above, we can then design an experiment that measures just outside the range of what people are actually building, at both the minimum and the maximum. Questions we want to answer:

1. What is the tradeoff between total container size and how that size is split across layers?
2. Is it better to have fewer large layers or many smaller ones?
3. Can we determine the redundancy of layers across containers (arguably a good thing)?
4. Given the costs of storage and time to pull, what strategies can reduce them?

For the third point, it would be really useful to measure the overlap _between_ layers of different containers. For example, if I'm running a bunch of ML containers on a cluster, it would obviously be better for storage if many of those containers shared the same layers. But what does it mean, design-wise, to do that? For the fourth point, I'd like to investigate and test some caching or optimization strategies so that, for example, we don't spend a ton of money on pulling containers alone.
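
As a sketch of that overlap measurement, layer sharing between two images can be scored as the Jaccard overlap of their layer digests, taking manifests of the form fetched in the sketch above:

```python
# Sketch: score layer sharing between two images as the Jaccard overlap of
# their compressed layer digests (1.0 = identical layer sets, 0.0 = none shared).
def layer_overlap(manifest_a, manifest_b):
    a = {layer["digest"] for layer in manifest_a.get("layers", [])}
    b = {layer["digest"] for layer in manifest_b.get("layers", [])}
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```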


## Parsing

Instead of trying to sample across all Dockerfiles, let's sample those we find in GitHub organizations that we know do a lot of machine learning. We will take two approaches:

**GitHub orgs**

- start with a list of GitHub orgs (e.g., nvidia, Hugging Face) and list their repositories
- clone all repos and find all Dockerfiles
- get the FROM image in each Dockerfile (see the parsing sketch after the command below)
- for each unique FROM image, get the container tags and sort them across time
- determine how size has changed over time (is it getting larger?)
- create a distribution of sizes

```bash
export GITHUB_TOKEN=xxxxxxxxx
python parse_repos.py
```
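
The script above only collects Dockerfiles; the FROM extraction is a follow-up step that is not part of parse_repos.py yet. A minimal sketch, reusing the recursive_find helper from rse, might look like this (build-arg bases and multi-stage aliases are skipped naively):

```python
# Sketch: collect base image references (FROM lines) from the Dockerfiles
# that parse_repos.py saves under data/orgs.
import re

import rse.utils.file as utils

FROM_RE = re.compile(r"^\s*FROM\s+(?P<image>\S+)", re.IGNORECASE)


def collect_base_images(root="data/orgs"):
    images = set()
    for path in utils.recursive_find(root, "Dockerfile*"):
        with open(path, errors="ignore") as fd:
            for line in fd:
                match = FROM_RE.match(line)
                # Skip bases defined by build args, e.g. FROM $BASE_IMAGE
                if match and not match.group("image").startswith("$"):
                    images.add(match.group("image"))
    return images


print(sorted(collect_base_images()))
```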

**Software Databases**

We can do the same procedure, but instead search repositories in the [Research Software Encyclopedia](https://rseng.github.io/software) to get a sampling of projects from the research software (and closer to HPC) ecosystem. And if BioConda containers give us a sampling of yet another community, I have almost 10k known container URIs in the shpc-registry.
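
For the shpc-registry route, a sketch of gathering those known container URIs from a local clone could look like the following, assuming each registry entry provides a container.yaml with a docker: field (the clone path is a placeholder):

```python
# Sketch: collect container URIs from a local clone of shpc-registry,
# assuming each entry has a container.yaml with a "docker:" field.
from pathlib import Path

import yaml


def shpc_container_uris(registry_root="shpc-registry"):
    uris = set()
    for path in Path(registry_root).rglob("container.yaml"):
        entry = yaml.safe_load(path.read_text())
        if entry and "docker" in entry:
            uris.add(entry["docker"])
    return uris


print(len(shpc_container_uris()))
```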


## Attempt 1

> This did not work (still) because of API limits.

We are going to programmatically get Dockerfiles from GitHub. While we could clone each repository to search for them, it's more of a convention to keep one at the top level, so we are going to guess the path, trying both master and main. Since the GitHub API limits searches to 1K results, we are going to search one week at a time.

```bash
pip install rse
export GITHUB_TOKEN=xxxxxxxxx
python search.py
```
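
For reference, the path-guessing step itself is simple and avoids cloning; a minimal sketch (with a placeholder repository name) follows.

```python
# Sketch: try to fetch a top-level Dockerfile from raw.githubusercontent.com,
# checking the main branch first and then master.
import requests


def fetch_dockerfile(uri):
    """uri is "owner/repo"; returns the Dockerfile text or None."""
    for branch in ("main", "master"):
        url = f"https://raw.githubusercontent.com/{uri}/{branch}/Dockerfile"
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    return None


# Placeholder repository name for illustration
print(fetch_dockerfile("some-org/some-repo") is not None)
```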


195 changes: 195 additions & 0 deletions experiments/dockerfile/parse_repos.py
@@ -0,0 +1,195 @@
#!/usr/bin/env python3

import argparse
import os
import shutil
import sys
import tempfile
from datetime import datetime, timedelta

import requests
from rse.utils.command import Command
import rse.utils.file as utils

here = os.path.abspath(os.path.dirname(__file__))

token = os.environ.get("GITHUB_TOKEN")
if not token:
    sys.exit("Please export GITHUB_TOKEN")


def clone(url, dest):
    """
    Shallow clone a repository into dest, returning the clone path or None.
    """
    dest = os.path.join(dest, os.path.basename(url))
    cmd = Command("git clone --depth 1 %s %s" % (url, dest))
    cmd.execute()
    if cmd.returncode != 0:
        print("Issue cloning %s" % url)
        return
    return dest


def get_range(dt, days=360):
    """
    Get a (start, end) date range as YYYY-MM-DD strings.

    Intended for searching by date window; not wired into main() yet.
    """
    # Allow function to be used for both strings and datetimes
    if isinstance(dt, str):
        dt = datetime.strptime(dt, "%Y-%m-%d")
    next_dt = dt + timedelta(days=days)
    return str(dt).split(" ")[0], str(next_dt).split(" ")[0]


def get_parser():
    parser = argparse.ArgumentParser(description="Dockerfile Scraper")
    parser.add_argument(
        "--start-date",
        default="2013-04-11",
        help="starting date",
    )
    parser.add_argument(
        "--outdir",
        default=os.path.join(here, "data", "orgs"),
        help="output data directory for results",
    )
    parser.add_argument(
        "--days",
        default=100,
        help="days to search for repos over",
    )
    return parser


def main():
    parser = get_parser()
    args, _ = parser.parse_known_args()

    if not os.path.exists(args.outdir):
        os.makedirs(args.outdir)

    # Parse these known ML orgs
    orgs = [
        "nvidia",
        "huggingface",
        "pytorch",
        "tensorflow",
        "azure",
        "udacity",
        "scikitlearn",
    ]

    # Create a base temporary folder to work from
    tempdir = tempfile.mkdtemp()

    # Prepare headers
    headers = {
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
        "Accept": "application/vnd.github+json",
    }

    # Single pass over the orgs for now; the --start-date / --days window
    # (see get_range) is not wired in yet
    while True:
        for org in orgs:
            # Create an output directory for the org
            outdir = os.path.join(args.outdir, org)
            if not os.path.exists(outdir):
                os.makedirs(outdir)

            url = f"https://api.github.com/orgs/{org}/repos"

            print(f"Looking for repos for {org}")
            # Keep track of repos seen so we don't do them twice
            seen = set()

            # Get paginated listing of repos
            total_results = 1
            page = 1
            repos = []
            while total_results > 0:
                response = requests.get(
                    url, headers=headers, params={"per_page": 100, "page": page}
                )

                # Stop for interactive debugging (e.g., rate limiting)
                if response.status_code != 200:
                    print(
                        f"Issue with request: {response.status_code} at {datetime.now()}"
                    )
                    import IPython

                    IPython.embed()

                new_repos = response.json()
                repos += new_repos
                total_results = len(new_repos)
                print(f"Found {total_results} new repos")
                page += 1

            print(f"Found a total of {len(repos)} repositories for {org}")

            for item in repos:
                user, repo = item["html_url"].split("/")[-2:]
                uri = f"{user}/{repo}"
                if uri in seen:
                    continue

                seen.add(uri)
                path = os.path.join(outdir, user, repo)
                if os.path.exists(path):
                    continue

                os.makedirs(path)
                dest = None
                try:
                    # Try clone (and cut out early if not successful)
                    dest = clone(item["html_url"], tempdir)
                    if not dest:
                        continue

                    # Recursively find Dockerfiles and copy them to keep
                    files = list(utils.recursive_find(dest, "Dockerfile*"))
                    print(f" Found {len(files)} Dockerfiles in {uri}")
                    if not files and os.path.exists(path):
                        shutil.rmtree(path)

                    for filename in files:
                        # Preserve the path relative to the cloned repo
                        relpath = os.path.relpath(filename, dest)
                        file_dest = os.path.join(path, relpath)
                        dirname = os.path.dirname(file_dest)
                        if not os.path.exists(dirname):
                            os.makedirs(dirname)
                        shutil.copyfile(filename, file_dest)
                except Exception as exc:
                    print(f"Issue with {item['html_url']}, skipping: {exc}")

                if dest:
                    cleanup(dest)

        # Inspect results interactively, then break so the temporary
        # clone space below is removed
        import IPython

        IPython.embed()
        break

    if os.path.exists(tempdir):
        shutil.rmtree(tempdir)


def cleanup(dest):
    """
    Remove a cloned repository, dropping into IPython on failure.
    """
    try:
        if dest and os.path.exists(dest):
            shutil.rmtree(dest)
    except Exception:
        print("Likely too many files, check with ulimit -n and set with ulimit -n 4096")
        import IPython

        IPython.embed()


if __name__ == "__main__":
    main()
