# Scraping Repositories

Before we can analyse public Google Earth Engine (GEE) repositories, we need to download them. We'll use `git clone` to make a local copy of every public GEE repository.

In [1]:
import requests
import json
import os
from tqdm.auto import tqdm
import pandas as pd

## Repository Stats

A JSON directory of every public repository is available at [earthengine.googlesource.com](https://earthengine.googlesource.com/?format=JSON). We'll use it to get the name and URL of each repository we need to grab.
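
The notebook reads the directory from `../data/directory.json` without showing how the file got there. Here is a minimal sketch of that download step using only the standard library. The helper names `parse_directory_json` and `download_directory` are hypothetical, and the `)]}'` prefix handling is an assumption: Gitiles-style servers usually put that guard line in front of their JSON, so the sketch strips it if it is present and otherwise parses the text as-is.

```python
import json
import os
import urllib.request

DIRECTORY_URL = "https://earthengine.googlesource.com/?format=JSON"
# Assumed guard line that some servers prepend to JSON responses
XSSI_PREFIX = ")]}'"


def parse_directory_json(text):
    """Parse the directory listing, dropping the guard prefix if present."""
    text = text.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


def download_directory(path, url=DIRECTORY_URL):
    """Fetch the repository directory and save it as plain JSON at `path`."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    directory = parse_directory_json(text)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as dst:
        json.dump(directory, dst)
    return directory
```

Calling `download_directory(os.path.join("..", "data", "directory.json"))` once should leave a file that `json.load` can read directly.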

In [4]:
directory_path = os.path.join("..", "data", "directory.json")

with open(directory_path) as src:
    directory = json.load(src)

repos = list(directory.keys())

How many public GEE repositories are there?

In [5]:
len(repos)

11175

And how many unique users is that? First, let's select only the repos that begin with `users/`, which excludes official examples.

In [6]:
user_repos = [repo for repo in repos if repo.startswith("users/")]

Now let's parse the usernames from the repository names. We'll use a `set` to remove duplicates from users who have multiple public repos.

In [8]:
usernames = {repo.split("/")[1] for repo in user_repos}
n_users = len(usernames)

In [9]:
print(f"There are {n_users} unique users with public GEE repositories!")

There are 8344 unique users with public GEE repositories!
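
The set tells us how many users there are, but not how many repos each one owns. As a rough sketch, the same `split("/")` parsing can feed a `collections.Counter`. The repository names below are made up for illustration, and `repos_per_user` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

# Hypothetical repository names, for illustration only
sample_repos = [
    "users/alice/default",
    "users/alice/tools",
    "users/bob/default",
    "examples/demo",
]


def repos_per_user(repo_names):
    """Count public repositories per username among `users/` repos."""
    return Counter(
        name.split("/")[1] for name in repo_names if name.startswith("users/")
    )


counts = repos_per_user(sample_repos)
print(counts.most_common())  # [('alice', 2), ('bob', 1)]
```

The number of keys in the counter matches the size of the `usernames` set, so `len(repos_per_user(repos))` should agree with `n_users`.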


## Cloning Repositories

To do a deeper analysis, we'll need to download every public repository. The `directory.json` file contains the URL of each repository, which we'll use to `git clone` a local copy.

In [16]:
clone_urls = {repo["name"]: repo["clone_url"] for repo in directory.values()}

In [36]:
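import subprocess

def clone_repository(name, url):
    # repo_dir is a global that must be defined before this function is called
    path = os.path.join(repo_dir, name)

    # If the path already exists, skip cloning
    if os.path.exists(path):
        return

    # Create only the parent directories and let git create the final one,
    # so a failed clone is less likely to leave behind an empty directory
    # that the check above would skip on the next run
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Pass arguments as a list to avoid shell quoting problems; check=True
    # raises CalledProcessError if git exits with an error
    subprocess.run(["git", "clone", url, path], check=True)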

This will take some time (about 8 hours on my laptop) and download 10+ GB of data.

In [1]:
repo_dir = os.path.join("..", "data", "repos")

for name, url in tqdm(clone_urls.items(), desc="Cloning"):
    try:
        clone_repository(name, url)
    # Some repositories contain files with invalid filenames, which can raise
    # an error. Skip those repositories and keep going.
    except Exception:
        pass
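
Because the loop swallows errors silently, it can be worth checking afterwards which repositories actually made it to disk. This is a minimal sketch, assuming a normal (non-bare) clone, which leaves a `.git` directory inside each repository folder. `find_missing_clones` is a hypothetical helper, not part of the original notebook.

```python
import os


def find_missing_clones(names, repo_dir):
    """Return repository names that have no `.git` directory under `repo_dir`."""
    return [
        name for name in names
        if not os.path.isdir(os.path.join(repo_dir, name, ".git"))
    ]
```

Running `find_missing_clones(clone_urls, repo_dir)` after the loop gives a list of names to retry or inspect by hand.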