In [None]:
import rasterio
import geopandas as gpd
import data as dt
import csv
import numpy as np
from pathlib import Path

This script takes three parameters:

* Which directory contains the scenes for which we want to extract preprocessing summaries and masks?
* Where do we want to save our results?
* What is the full path to the labels from which we will construct our masks?

For the first parameter, we use one of the following directories on the Azure machine:

* Bing Recent: `/datadrive/glaciers/bing_glaciers/bing_glacial_lakes`
* Landsat 2015: `/datadrive/snake/lakes/le7-2015`
* Landsat all: `/datadrive/snake/lakes/imagery`

Strictly speaking, we only need Landsat all, but Landsat 2015 is convenient because it is the only data we can train on; the remaining scenes are used purely for inference.
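
One way to switch between these directories without retyping paths is a small lookup table. This is only a sketch: the dictionary name and its keys are made up for illustration, and only the paths themselves come from the list above.

```python
from pathlib import Path

# Hypothetical lookup table; the keys are illustrative names, the paths are
# the Azure directories listed above.
SCENE_DIRS = {
    "bing_recent": Path("/datadrive/glaciers/bing_glaciers/bing_glacial_lakes"),
    "landsat_2015": Path("/datadrive/snake/lakes/le7-2015"),
    "landsat_all": Path("/datadrive/snake/lakes/imagery"),
}

# Pick the training data (Landsat 2015) as the input directory.
selected_dir = SCENE_DIRS["landsat_2015"]
print(selected_dir.name)
```

Keeping the choice in one place makes it easier to rerun the notebook once per directory.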

In [None]:
in_dir = Path("data/raw")
out_dir = Path("data/processed")
label_path = in_dir / "GL_3basins_2015.shp"

Next, we read in the label data and set up the writer to which we will save summary statistics.
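
The statistics file has seven columns: the scene name, followed by one mean and one standard deviation for each of three channels. A minimal sketch of that layout is shown here, writing to an in-memory buffer instead of `statistics.csv`; the example row and its values are made up.

```python
import csv
import io

# Same header as the notebook builds, written as one flat comprehension.
fields = ["scene"] + [f"{s}_{i}" for s in ["mean", "sdev"] for i in range(3)]

# Write to an in-memory buffer rather than a real file (illustration only).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(fields)
# Hypothetical row: the scene name and the numbers are invented.
writer.writerow(["example_scene.tif", 0.1, 0.2, 0.3, 1.0, 1.1, 1.2])

header_line = buffer.getvalue().splitlines()[0]
print(header_line)
```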

In [None]:
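# Materialize the glob so the scene list can be reused and has a stable order
scene_list = sorted(in_dir.glob("*.tif"))
y = gpd.read_file(label_path)
# Header: scene name, then mean_0..2 and sdev_0..2
fields = ["scene"] + [f"{s}_{i}" for s in ["mean", "sdev"] for i in range(3)]
out_dir.mkdir(parents=True, exist_ok=True)
stat_path = out_dir / "statistics.csv"
# Append mode: rerunning this cell adds another header row to an existing file
f = open(stat_path, "a", newline="")
writer = csv.writer(f)
writer.writerow(fields)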

Finally, we can loop over all the scenes in `in_dir` and save the relevant statistics and masks.
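
`dt.preprocessor` lives in the local `data` module and is not shown in this notebook. Judging from how its result is used, the first two entries appear to hold per-channel means and standard deviations, and the third holds the masks. As a rough illustration only, per-channel statistics for a channels-first array could be computed as below; `channel_stats` and the random input are hypothetical stand-ins, not the module's actual code.

```python
import numpy as np

def channel_stats(x):
    """Per-channel mean and standard deviation of a (channels, height, width) array."""
    # NaN-aware reductions are an assumption, so that missing pixels
    # would not propagate into the statistics.
    means = np.nanmean(x, axis=(1, 2))
    sdevs = np.nanstd(x, axis=(1, 2))
    return means, sdevs

# Synthetic 3-channel image standing in for a real scene.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
means, sdevs = channel_stats(x)

# Assemble a CSV row the same way the loop does: name, then stacked statistics.
row = ["example_scene.tif"] + list(np.hstack([means, sdevs]))
print(len(row))
```

With three channels the row has seven entries, matching the header written to `statistics.csv`.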

In [3]:
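for scene in scene_list:
    # The context manager closes each raster once it has been processed
    with rasterio.open(scene) as img:
        result = dt.preprocessor(img, y)
    np.save(out_dir / f"{scene.stem}-labels.npy", result[2])
    writer.writerow([str(scene)] + list(np.hstack(result[:2])))
f.close()  # flush the statistics to disk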

In [None]:
import matplotlib.pyplot as plt
# `result` still holds the last scene from the loop; plot each of its masks
for s in result[2]:
    plt.imshow(s)
    plt.show()