# Summarizing the Code

Let's starting digging into the public repositories. This will just be a surface level look to see how much code there really is.

In [1]:
import os
import glob
import json
from tqdm.auto import tqdm

## Load Repositories

Load the repository directory.

In [2]:
directory_path = os.path.join("..", "data", "directory.json")

with open(directory_path) as src:
    directory = json.load(src)

repos = list(directory.keys())

Remember, we have about 11,000 public repositories.

In [3]:
len(repos)

11175

Build paths to each repository.

In [4]:
repo_dir = os.path.join("..", "data", "repos")
repo_paths = [os.path.join(repo_dir, repo) for repo in repos]

### Filter Invalid Repos

Some repositories may have been skipped due to invalid paths. Let's see how many of the repos in the directory weren't cloned.

In [5]:
missing_repos = [repo for repo in repo_paths if not os.path.exists(repo)]
len(missing_repos)

3

Other repositories may have failed to clone due to bad filenames within the repository or connection issues. There's not much we can do if a repository partially failed to clone and is missing one or two files, but let's at least see how many failed completely and have *no* files.

In [6]:
empty_repos = [repo for repo in repo_paths if os.path.exists(repo) and os.listdir(repo) == []]
len(empty_repos)

41

So out of our original ~11,000 repositories, only about 0.4% are missing. Not bad. Let's exclude those from future analysis.

In [7]:
invalid_repos = missing_repos + empty_repos

valid_repos = [repo for repo in repo_paths if repo not in invalid_repos]
len(valid_repos)

11131

## Locating Code

GEE is written in Javascript, but source files created in the Code Editor do not have a `.js` extension, which will make them harder to find. We'll write a function to recursively look through each repository directory and pull out anything that looks like source code. At the same time, we'll add that `.js` extension to make it easier to find and analyze the code in the future.

In [8]:
def get_source_files(directory):
    """Find all the files that are probably Earth Engine source code in a given directory.
    Files should have no extension or .js extension, not be included in a hidden directory
    like .git, and not be in a list of commonly excluded files.
    
    This function also renames all non-js source files to .js to
    allow code analysis later on.
    """
    # Common non-source files that might be included in a repository and don't have a file extension.
    exclude_files = ["LICENSE"]
                     
    # List all files and folders recursively, excluding hidden stuff (e.g. .git)
    everything = glob.glob(os.path.join(directory, "**", "*"), recursive=True)
    
    source_files = []
    for file in everything:
        # Exclude subdirectories
        if not os.path.isfile(file):
            continue
        # Exclude specific files
        if os.path.basename(file) in exclude_files:
            continue
        # Exclude any filetypes other than JS
        if "." in file and not file.endswith(".js"):
            continue
        # Add .js extension to all source files for later analysis
        if not file.endswith(".js"):
            os.rename(file, file + ".js")
            file = file + ".js"
        
        source_files.append(file)
        
    return source_files

Find and add `.js` extension to all source code files. This can't be undone, so back up your cloned repos if you're concerned about changing filenames.

In [9]:
source_files = []

for repo in valid_repos:
    source_files += get_source_files(repo)

## Code Stats

So, how many source code files do we have?

In [10]:
f"There are {len(source_files):,} source code files!"

'There are 213,393 source code files!'

How much storage space does this source code take up? We'll use `os.path.getsize` to return the size in bytes of each source file.

In [11]:
source_sizes = {}

for file in source_files:
    source_sizes[file] = os.path.getsize(file)

Get the total byes and convert to gigabytes.

In [12]:
sum(source_sizes.values()) / 1_000_000_000

3.756135441

Almost 4 gigabytes of source code!

### Lines of Code

One way to analyze source code is to count lines of code (LOC). This excludes whitespace and documentation to give a good idea of how much executable code actually exists within files. We'll use the [pygount](https://github.com/roskakori/pygount) package to run this analysis.

In [13]:
from pygount import SourceAnalysis

Count the lines of code and documentation in all source files. This will take some time.

In [None]:
input("Are you sure you want to run?")

lines_of_code = 0
lines_of_doc = 0

for file in tqdm(source_files, desc="Analyzing"):
    analysis = SourceAnalysis.from_file(source_path=file, encoding="utf-8", group="ee")
    
    lines_of_code += analysis.code_count
    lines_of_doc += analysis.documentation_count

In [None]:
print(lines_of_code)
print(lines_of_doc)