This notebook assigns scenes to train / val / test splits for the three datasets that we have, based on geographic location. The approach is to (1) map each lake ID in the label file to a split via its sub-basin, and then (2) match the file names in the raw data directory against those lake IDs. This is only possible because we named each scene after the lake around which the download was centered.
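
The two steps can be sketched without any geospatial dependencies, roughly as follows. The lake IDs, file names, and the helper names `build_id_to_split` and `assign_files` are made up for illustration; the notebook itself reads the IDs from the label file and the scenes from the raw data directory.

```python
from pathlib import PurePath


def build_id_to_split(records, basin_mapping):
    """Step 1: map each lake ID to a split via its sub-basin."""
    basin_to_split = {
        basin: split
        for split, basins in basin_mapping.items()
        for basin in basins
    }
    return {
        lake_id: basin_to_split[sub_basin]
        for lake_id, sub_basin in records
        if sub_basin in basin_to_split
    }


def assign_files(filenames, id_to_split):
    """Step 2: give each file the split of the lake ID found in its name."""
    assignment = {}
    for name in filenames:
        stem = PurePath(name).stem
        for lake_id, split in id_to_split.items():
            if lake_id in stem:
                assignment[name] = split
                break
    return assignment


# Hypothetical (lake ID, sub-basin) records and file names.
records = [("GL001", "Arun"), ("GL002", "Karnali"), ("GL003", "Tamor")]
basin_mapping = {"train": ["Arun"], "val": ["Karnali"], "test": ["Tamor"]}
filenames = ["GL001_2015.tif", "GL003_2015.tif", "unrelated.tif"]

id_to_split = build_id_to_split(records, basin_mapping)
assignment = assign_files(filenames, id_to_split)
print(assignment)
```

Files whose names contain no known lake ID (here `unrelated.tif`) are simply left out of every split.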

The main parameters are:
* What directory contains all the Lake ID-named scenes?
* What is the path to the label file (a geojson or shapefile) that contains the lake IDs and their sub-basins?
* Which sub-basins should be assigned to which splits?
* Where should we save the result?

Note that I'm using sub-basins instead of basins because, among the data we have actually downloaded, the number of scenes per basin is highly skewed.
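
One rough way to check for this kind of skew is to count scenes per grouping and compare the largest group with the smallest. The sketch below uses a handful of made-up (basin, sub-basin) labels rather than the real data, and the `imbalance` helper is hypothetical.

```python
from collections import Counter

# Hypothetical (basin, sub-basin) labels, one per downloaded scene.
scenes = [
    ("Koshi", "Arun"),
    ("Koshi", "Dudh Koshi"),
    ("Koshi", "Tamor"),
    ("Koshi", "Tama Koshi"),
    ("Gandaki", "Seti"),
    ("Karnali", "Humla"),
]


def imbalance(counts):
    """Ratio of the largest group size to the smallest."""
    return max(counts.values()) / min(counts.values())


per_basin = Counter(basin for basin, _ in scenes)
per_sub_basin = Counter(sub_basin for _, sub_basin in scenes)

print(dict(per_basin), imbalance(per_basin))
print(dict(per_sub_basin), imbalance(per_sub_basin))
```

In this toy example the basin-level counts are dominated by a single basin, while the sub-basin counts are even, which is the situation that motivates splitting at the sub-basin level.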

To run this notebook on a different data source, you can use, for example:

Bing:

```
papermill -p in_dir /datadrive/glaciers/bing_glaciers/bing_glacial_lakes -p out_dir /datadrive/glaciers/bing_glaciers/bing_glacial_lakes/splits create_splits.ipynb -
```


In [None]:
from pathlib import Path
import json
import geopandas as gpd
import shutil

In [None]:
in_dir = "/datadrive/snake/lakes/le7-2015/"
label_path = "/datadrive/snake/lakes/GL_3basins_2015.shp"
out_dir = "/datadrive/snake/lakes/le7-2015/splits/"

In [None]:
basin_mapping = {
    "train": ["Arun", "Bheri", "Budhi Gandaki", "Dudh Koshi", "Humla", "Indrawati", "Kali", "Kali Gandaki"], 
    "val": ["Karnali", "Kawari", "Likhu", "Marsyangdi", "Mugu", "Seti"], 
    "test": ["Sun Koshi", "Tama Koshi", "Tamor", "Tila", "Trishuli", "West Seti"]
}

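paths = {"in": Path(in_dir), "label": Path(label_path), "out": Path(out_dir)}

# Start from a clean output directory. Note that this deletes any existing splits.
if paths["out"].exists():
    shutil.rmtree(paths["out"])

# Create one subdirectory per split.
for split in basin_mapping:
    (paths["out"] / split).mkdir(parents=True)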

In [None]:
# Read the labels and collect the lake IDs that belong to each split.
y = gpd.read_file(paths["label"])
ids = {}
for split, basins in basin_mapping.items():
    ids[split] = list(y[y.Sub_Basin.isin(basins)].GL_ID.values)

# Copy each scene into every split whose lake IDs appear in its file name.
for path in paths["in"].glob("*tif"):
    for split, split_ids in ids.items():
        if any(i in path.stem for i in split_ids):
            shutil.copy2(path, paths["out"] / split)
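
After the copy finishes, it may be worth confirming that no scene landed in more than one split. A minimal check, assuming the `train` / `val` / `test` subdirectory layout created above; the helper name `find_overlaps` is hypothetical, and the demo runs on a temporary directory standing in for `out_dir`.

```python
import tempfile
from pathlib import Path


def find_overlaps(out_dir, splits=("train", "val", "test")):
    """Return file names that appear in more than one split directory."""
    seen = {}
    for split in splits:
        for path in (Path(out_dir) / split).glob("*.tif"):
            seen.setdefault(path.name, []).append(split)
    return {name: found for name, found in seen.items() if len(found) > 1}


# Demo on a temporary directory with made-up file names.
demo_layout = {"train": ["a.tif", "b.tif"], "val": ["b.tif"], "test": ["c.tif"]}
with tempfile.TemporaryDirectory() as tmp:
    for split, names in demo_layout.items():
        (Path(tmp) / split).mkdir()
        for name in names:
            (Path(tmp) / split / name).touch()
    overlaps = find_overlaps(tmp)

print(overlaps)
```

An empty result means the splits are disjoint at the file level; any entry names a scene whose file name matched lake IDs from more than one split.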