# Parsing the Code

To learn more about how Earth Engine users interact with different datasets and modules, we'll have to start digging further into source code by parsing each file, line-by-line to look for patterns. This script will do the heavy lifting of parsing our code, saving the results out to CSVs for further analysis and visualization in other notebooks.

In [1]:
import re
import json
import os
import glob
from tqdm.auto import tqdm

## Load Repos and Source Files

First, we need to re-load our list of source files like we did in the Summarizing Code notebook.

In [8]:
directory_path = os.path.join("..", "data", "directory.json")

with open(directory_path) as src:
    directory = json.load(src)

repos = list(directory.keys())

repo_dir = os.path.join("..", "data", "repos")
repo_paths = [os.path.join(repo_dir, repo) for repo in repos]

# Filter out missing and empty repos
invalid_repos = [repo for repo in repo_paths if not os.path.exists(repo) or os.listdir(repo) == []]

valid_repos = [repo for repo in repo_paths if repo not in invalid_repos]

In [11]:
def get_source_files(directory):
    """Find all the files that are probably Earth Engine source code in a given directory.
    Files should have no extension or .js extension, not be included in a hidden directory
    like .git, and not be in a list of commonly excluded files.
    
    This function also renames all non-js source files to .js to
    allow code analysis later on.
    """
    # Common non-source files that might be included in a repository and don't have a file extension.
    exclude_files = ["LICENSE"]
                     
    # List all files and folders recursively, excluding hidden stuff (e.g. .git)
    everything = glob.glob(os.path.join(directory, "**", "*"), recursive=True)
    
    source_files = []
    for file in everything:
        # Exclude subdirectories
        if not os.path.isfile(file):
            continue
        # Exclude specific files
        if os.path.basename(file) in exclude_files:
            continue
        # Exclude any filetypes other than JS
        if "." in file and not file.endswith(".js"):
            continue
        # Add .js extension to all source files for later analysis
        if not file.endswith(".js"):
            os.rename(file, file + ".js")
            file = file + ".js"
        
        source_files.append(file)
        
    return source_files

In [12]:
source_files = []

for repo in valid_repos:
    source_files += get_source_files(repo)

## Parse the Code

Now we can start looking through individual source code files for specific patterns to track.

In [16]:
from collections import Counter

We'll need a function that can take a module name or a file path and parse out the username.

In [17]:
def get_user(s):
    """Take a path-like string and return the username associated"""
    user_pattern = r"""users\/([a-zA-Z]*[0-9]*)\/"""
    try:
        return re.findall(user_pattern, s)[0]
    # A few repos don't have a username
    except IndexError:
        return None

We'll use the regex patterns below to parse source code and see how often datasets, modules, and other parameters are used.

In [18]:
# Match image collection names
collection_pattern = re.compile(r"""ee\.ImageCollection\(['"](\S+?)['"]\)\.?""")
# Match image names
image_pattern = re.compile(r"""ee\.Image\(['"](\S+?)['"]\)\.?""")
# Match latitude and longitude coordinates from ee.Geometry.Point constructors
point_pattern = re.compile(r"""ee\.Geometry\.Point\((\[[^a-zA-Z]*?\])\)\.?""")

# Match CRS parameters
crs_pattern = re.compile(r"""crs:\s*['"]([a-zA-Z]{4}:[0-9]{4})['"]""")
# Match module imports
module_pattern = re.compile(r"""require\(['"](\S+?)['"]\)""")
# Match dates of the format yyyy-mm-dd
date_pattern = re.compile(r"""([0-9]{4}-[0-9]{2}-[0-9]{2})""")

Iterate over every line of every source code file and count unique datasets, modules, etc. This takes a while.

In [None]:
input("Are you sure you want to run?")

collections = Counter()
images = Counter()
crs = Counter()
modules = Counter()
points = []
dates = Counter()

for file in tqdm(source_files, desc="Parsing"):
    try:
        file_user = get_user(file)
        with open(file, encoding="utf-8") as src:
            for line in src:
                collection_match = collection_pattern.search(line)
                if collection_match:
                    collections.update([collection_match.group(1)])
                    
                image_match = image_pattern.search(line)
                if image_match:
                    images.update([image_match.group(1)])
                
                crs_match = crs_pattern.search(line)
                if crs_match:
                    crs.update([crs_match.group(1)])

                point_match = point_pattern.search(line)
                if point_match:
                    points.append(point_match.group(1))

                module_match = module_pattern.search(line)
                if module_match:
                    mod = module_match.group(1)
                    # Filter out when a user requires their own modules
                    if get_user(mod) != file_user:
                        modules.update([mod])
                        
                date_match = date_pattern.search(line)
                if date_match:
                    dates.update([date_match.group(1)])

    # A very small number of files aren't readable. Skip those.
    except UnicodeDecodeError:
        pass

Are you sure you want to run?


Parsing:   0%|          | 0/213393 [00:00<?, ?it/s]

## Export Raw Data

Since it took a while to process this data, let's save it to disk for further analysis.

In [25]:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

Export a dataframe showing the number of imports for each collection.

In [113]:
collection_df = pd.DataFrame.from_dict(collections, orient="index").reset_index().rename(columns={"index": "collection", 0: "imports"})
collection_df.to_csv(os.path.join("..", "data", "collections.csv"), index=False)

Export a dataframe showing the number of imports for each image.

In [116]:
images_df = pd.DataFrame.from_dict(images, orient="index").reset_index().rename(columns={"index": "image", 0: "imports"})
images_df.to_csv(os.path.join("..", "data", "images.csv"), index=False)

Export a dataframe showing the number of imports for each module.

In [117]:
modules_df = pd.DataFrame.from_dict(modules, orient="index").reset_index().rename(columns={"index": "module", 0: "imports"})
modules_df.to_csv(os.path.join("..", "data", "modules.csv"), index=False)

Export a dataframe showing the number of time each date was used.

In [26]:
dates_df = pd.DataFrame.from_dict(dates, orient="index").reset_index().rename(columns={"index": "date", 0: "used"})
dates_df.to_csv(os.path.join("..", "data", "dates.csv"), index=False)

Convert all the CRS codes to upper case, re-count them, and export a dataframe.

In [195]:
# Convert all CRS's to uppercase.
c = Counter()
for k, v in crs.items():
    c.update({k.upper(): v})
    
crs_df = pd.DataFrame.from_dict(c, orient="index").reset_index().rename(columns={"index": "crs", 0: "exports"})
crs_df.to_csv(os.path.join("..", "data", "crs.csv"), index=False)

Parse the point coordinates to geometries and export as a geopackage.

In [178]:
point_geoms = []

coord_pattern = r"""^\[['"]*?\s*?(\s?[\-0-9.]+)\s*?,\s*?([\-0-9.]+)\s*?['"]*?\]"""

for point in set(points):
    match = re.search(coord_pattern, point)
    try:
        lng, lat = match.group(1, 2)
        try:
            lng = float(lng)
            lat = float(lat)
        except ValueError:
            continue
        if lng < -180 or lng > 180 or lat < -180 or lat > 180:
            continue
        point_geoms.append(Point([float(lng), float(lat)]))
        
    # So many syntax errors!
    except AttributeError:
        pass
    
pt_gdf = gpd.GeoDataFrame(geometry=point_geoms, crs="EPSG:4326")
pt_gdf.to_file(os.path.join("..", "data", "points.gpkg"), driver="GPKG")