## Make all training mask files from all geojson building masks from SN2 datasets -- Sept 18, 2021

This notebook:
- installs conda
- installs geospatial Python libraries
- attaches to my Google Drive, specifically the Khartoum directory containing all sample chips and building masks
- creates byte building masks from the geojson files corresponding to each input image
- writes them to a subdirectory of the Khartoum directory


#### Miniconda installation.

This setup process follows instructions given in this very good and clear article: [Conda + Google Colab](https://towardsdatascience.com/conda-google-colab-75f7c867a522).

In [None]:
%%bash

MINICONDA_INSTALLER_SCRIPT=Miniconda3-py37_4.10.3-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - _openmp_mutex==4.5=1_gnu
    - brotlipy==0.7.0=py37h27cfd23_1003
    - ca-certificates==2021.7.5=h06a4308_1
    - certifi==2021.5.30=py37h06a4308_0
    - cffi==1.14.6=py37h400218f_0
    - chardet==4.0.0=py37h06a4308_1003
    - conda-package-handling==1.7.3=py37h27cfd23_1
    - conda==4.10.3=py37h06a4308_0
    - cryptography==3.4.7=py37hd23ed53_0
    - idna==2.10=pyhd3eb1b0_0
    - ld_impl_linux-64==2.35.1=h7274673_9
    - libffi==3.3=he6710b0_2
    - libgcc-ng==9.3.0=h5101ec6_17
    - libgomp==9.3.0=h5101ec6_17
    - libstdcxx-ng==9.3.0=hd4cf53a_17
    - ncurses==6.2=he6710b0_1
    - openssl==1.1.1k=h27cfd23_0
    - pip==21.1.3=py37h06a4308_0
    - pycosat==0.6.3=py37h27cfd23_0
    - pycparser==2.20=py_2
    - pyopenssl=

--2021-09-20 19:50:20--  https://repo.continuum.io/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.201.79, 104.18.200.79, 2606:4700::6812:c94f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.201.79|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh [following]
--2021-09-20 19:50:20--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89026327 (85M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.10.3-Linux-x86_64.sh’


In [None]:
!which conda # should return /usr/local/bin/conda

/usr/local/bin/conda


In [None]:
!conda --version #should return 4.10.3

conda 4.10.3


In [None]:
!which python # still returns /usr/local/bin/python

/usr/local/bin/python


In [None]:
!python --version

Python 3.7.10


In [None]:
%%bash
# Now that Conda is installed, update Conda and all its dependencies to their most recent versions without updating Python to 3.8+.
# These commands update everything while holding Python constant at 3.7.
# (The %%bash cell magic has to be the first line of the cell, so the comments go below it.)

conda install --channel defaults conda python=3.7 --yes
conda update --channel defaults --all --yes

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - conda
    - python=3.7


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1l             |       h7f8727e_0         2.5 MB
    python-3.7.11              |       h12debd9_0        45.3 MB
    ------------------------------------------------------------
                                           Total:        47.9 MB

The following packages will be UPDATED:

  openssl                                 1.1.1k-h27cfd23_0 --> 1.1.1l-h7f8727e_0
  python                                  3.7.10-h12debd9_4 --> 3.7.11-h12debd9_0



Downloading and Extracting Packages
python-3.7.11        | 45.3 MB   |            |   0%

The update ran, but conda's version number did not change: it is still 4.10.3. The Python installation, however, moved from 3.7.10 to 3.7.11.


In [None]:
!conda --version # still returns 4.10.3

conda 4.10.3


In [None]:
!python --version

Python 3.7.11


Now you need to modify your path settings so that packages installed with Conda can be imported. The initial sys.path looks like the one given in the writeup I'm following.

Note that the packages preinstalled with Google Colab are installed into the /usr/local/lib/python3.7/dist-packages directory. You can get an idea of what packages are available by listing the contents of this directory.

(The ls returns gobs of stuff, so I've commented it out.)

Any package that you install with Conda goes into the directory /usr/local/lib/python3.7/site-packages, so you need to add this directory to sys.path for these packages to be available for import.

Note that because the /usr/local/lib/python3.7/dist-packages directory containing the preinstalled Google Colab packages appears ahead of the /usr/local/lib/python3.7/site-packages directory where Conda installs packages, the version of a package available via Google Colab takes precedence over any version of the same package installed via Conda.


In [None]:
import sys
sys.path

['',
 '/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '/usr/local/lib/python3.7/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython']

In [None]:
# !ls /usr/local/lib/python3.7/dist-packages

In [None]:
import sys
_ = (sys.path
        .append("/usr/local/lib/python3.7/site-packages"))
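
To check which copy of a package Python will actually import after the path change, a minimal sketch like the following may help. The helper names `append_once` and `module_origin` are hypothetical (they are not from the writeup I'm following); `importlib.util.find_spec` is the standard-library call that reports where a module would be loaded from.

```python
import importlib.util
import sys


def append_once(path, search_path=None):
    """Append path to the search path (sys.path by default) unless it is already there."""
    if search_path is None:
        search_path = sys.path
    if path not in search_path:
        search_path.append(path)
    return search_path


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Illustration on a throwaway list rather than the real sys.path.
demo_path = ["/usr/local/lib/python3.7/dist-packages"]
append_once("/usr/local/lib/python3.7/site-packages", demo_path)
append_once("/usr/local/lib/python3.7/site-packages", demo_path)  # second call is a no-op
```

Calling `module_origin("some_package")` after the append shows whether the dist-packages copy or the site-packages copy wins; since site-packages is appended at the end, the Colab copy is found first whenever both exist.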

#### Installing the python geospatial libraries you'll need.

First, mount your Google Drive in Colab. You have to complete the authorization prompt every time you run this cell.


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
!ls /content/drive/MyDrive/Khartoum

geojson  pansharp     test_masks    train_masks  val_masks
masks	 test_frames  train_frames  val_frames


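Install a minimal set of geospatial libraries into the base environment. In this environment, installing geopandas from conda-forge pulls in GDAL as a dependency; GDAL's Python bindings are exposed as the osgeo package, which provides gdal and ogr, the modules you need for burning raster masks.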

In [None]:
!conda install --channel conda-forge geopandas geojson --yes

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - done
Solving environment: | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | /

In [None]:
from osgeo import ogr, gdal
import geojson

In [None]:
# Support function for making a blank mask given a raster to match (ds) and a path to the mask file.
def make_blank_mask_from_img(ds, mask_path):
    '''
    Create a blank single-band byte raster with the size and georeferencing of ds.

    ds: gdal raster dataset (we'll match its georeferencing and size in the byte mask)
    mask_path: where to write the mask file.
    Returns the new gdal dataset, or None if the file could not be created.
    '''
    # reuse the source raster's driver so the mask is written in the same format
    dr = ds.GetDriver()

    # create a 1-band raster!
    ds_new = dr.Create(mask_path, ds.RasterXSize, ds.RasterYSize, 1, gdal.GDT_Byte)
    if ds_new is None:
        print(f"Could not create new mask file: {mask_path}")
    else:
        ds_new.SetGeoTransform(ds.GetGeoTransform())
        ds_new.SetProjection(ds.GetProjection())
    return ds_new
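
The geotransform copied by `SetGeoTransform` is GDAL's six-number affine mapping from pixel (column, row) positions to map coordinates. Here is a minimal pure-Python sketch of that mapping. The function names and the sample values are made up for illustration, and the inverse assumes a north-up image (both rotation terms zero).

```python
def pixel_to_map(gt, col, row):
    """Map coordinates of the top-left corner of pixel (col, row)."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


def map_to_pixel(gt, x, y):
    """Pixel containing map point (x, y); valid only when gt[2] == gt[4] == 0."""
    col = int((x - gt[0]) // gt[1])
    row = int((y - gt[3]) // gt[5])
    return col, row


# Made-up geotransform: origin at (500000, 1700000), 0.5-unit pixels, north-up.
gt = (500000.0, 0.5, 0.0, 1700000.0, 0.0, -0.5)
corner = pixel_to_map(gt, 10, 20)
pixel = map_to_pixel(gt, 500005.2, 1699989.9)
```

Because the mask shares this mapping (and the projection) with its source chip, pixel (col, row) in the mask covers the same ground as pixel (col, row) in the image.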

The next function burns the building footprints from a geojson file into a byte mask file.

In [None]:
# WHAT a hassle it was to get this to work. But this did finally work.
# It will throw an error if it can't open the geojson file.
def burn_bldgs_to_mask(src_raster_path, bldg_gjson_path, mask_path):
  '''
  src_raster_path: image chip whose size and georeferencing the mask will match.
  bldg_gjson_path: geojson file with the building footprints for that chip.
  mask_path: where to write the mask file.
  '''

  # get vector layer; ogr.Open is handed the geojson text itself, not the path
  with open(bldg_gjson_path) as f:
    the_json = f.read()
  ds = ogr.Open(the_json)
  assert ds is not None, f"Cannot read geojson {bldg_gjson_path}"
  lyr = ds.GetLayer()
  assert lyr is not None

  #todo: log number of bldgs in the layer
  # len(lyr)

  # Open raster source file, make a mask with equal pixel spacing and georeferencing,
  # and burn the vector layer into the mask.
  ras_ds = gdal.Open(src_raster_path)
  if ras_ds is None:
    print(f"Cannot open raster {src_raster_path}: can't write mask file")
    return

  mask_ds = make_blank_mask_from_img(ras_ds, mask_path)
  if mask_ds is None:
    # make_blank_mask_from_img has already reported the failure
    return
  gdal.RasterizeLayer(mask_ds, [1], lyr, burn_values=[1])
  # mask_ds.GetRasterBand(1).SetNoDataValue(0.0)

  # Nullify the mask_ds variable at the end, to flush the image to disk.
  mask_ds = None
  return

Now for processing:
- get a list of source raster files 
- loop over all source files to create mask chips. 
  - check for matching building files.
    - if there isn't one, should I assume there are no buildings in that chip? No -- all images have associated geojson files, even if they are empty.
  - create matching mask and burn buildings to it
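
The "check for matching building files" step can be sketched as a simple existence test run before the main loop. This is only an illustration: `find_missing` is a hypothetical helper, and the file names in the demo are made up.

```python
import os
import tempfile


def find_missing(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Demo in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "buildings_img1.geojson")
    absent = os.path.join(tmp, "buildings_img2.geojson")
    with open(present, "w") as f:
        f.write('{"type": "FeatureCollection", "features": []}')
    missing = find_missing([present, absent])
    missing_names = [os.path.basename(p) for p in missing]
```

In the notebook, the list of geojson paths built in the next cell would be passed to the helper; an empty result confirms that every chip has a geojson file.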


In [None]:
import os
import re

# define paths
chip_base = r'/content/drive/MyDrive/Khartoum/pansharp' #example: RGB-PanSharpen_AOI_5_Khartoum_img1.tif
mask_base = r'/content/drive/MyDrive/Khartoum/masks' #example: RGB-PanSharpen_AOI_5_Khartoum_mask1.tif
json_base = r'/content/drive/MyDrive/Khartoum/geojson/buildings' #example: buildings_AOI_5_Khartoum_img1.geojson


# get source raster files
ps_files = os.listdir(chip_base)

# extract image numbers and construct the paths of the other files
ps_pattern = re.compile(r"img(?P<numbers>[0-9]+)\.tif$")
matches_pre = [ps_pattern.search(filename) for filename in ps_files]

# remove null matches (e.g. the .tif.aux.xml files)
matches = [m for m in matches_pre if m is not None]

print(f"there are {len(matches)} chips in the training dataset")


# Sort the file numbers (as integers) so that mask files get created in sort order.
# That way, if the instance dies, you can tell which
# mask files did not get created.
numbers_s = sorted(int(match.group("numbers")) for match in matches)

psharp_files = [f"RGB-PanSharpen_AOI_5_Khartoum_img{num}.tif" for num in numbers_s]
mask_files = [f"RGB-PanSharpen_AOI_5_Khartoum_mask{num}.tif" for num in numbers_s]
json_files = [f"buildings_AOI_5_Khartoum_img{num}.geojson" for num in numbers_s]

# each entry is (image number, chip path, mask path, geojson path)
all_files_index = list(zip(numbers_s,
                      [os.path.join(chip_base, chip_file) for chip_file in psharp_files],
                      [os.path.join(mask_base, mask_file) for mask_file in mask_files],
                      [os.path.join(json_base, json_file) for json_file in json_files]))

# list the first 10 entries for a sanity check. Note some integers are missing for some reason.
for afi in all_files_index[:10]:
  print(f"{afi[0]}\n {afi[1]}\n {afi[2]}\n {afi[3]}\n\n")


there are 1012 chips in the training dataset
1
 /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img1.tif
 /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask1.tif
 /content/drive/MyDrive/Khartoum/geojson/buildings/buildings_AOI_5_Khartoum_img1.geojson


2
 /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img2.tif
 /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask2.tif
 /content/drive/MyDrive/Khartoum/geojson/buildings/buildings_AOI_5_Khartoum_img2.geojson


3
 /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img3.tif
 /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask3.tif
 /content/drive/MyDrive/Khartoum/geojson/buildings/buildings_AOI_5_Khartoum_img3.geojson


4
 /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img4.tif
 /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask4.tif
 /content/drive/MyDrive/Khartoum/g
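
The image numbers are converted to integers before sorting because sorting them as strings would put img10 ahead of img2. A small sketch with made-up file names, using the same pattern as the cell above:

```python
import re

ps_pattern = re.compile(r"img(?P<numbers>[0-9]+)\.tif$")

# Made-up directory listing, including a sidecar file that should be skipped.
filenames = [
    "RGB-PanSharpen_AOI_5_Khartoum_img10.tif",
    "RGB-PanSharpen_AOI_5_Khartoum_img2.tif",
    "RGB-PanSharpen_AOI_5_Khartoum_img2.tif.aux.xml",
    "RGB-PanSharpen_AOI_5_Khartoum_img1.tif",
]

# The trailing $ in the pattern rejects the .tif.aux.xml name.
matches = [m for m in map(ps_pattern.search, filenames) if m is not None]

as_strings = sorted(m.group("numbers") for m in matches)    # lexicographic order
as_ints = sorted(int(m.group("numbers")) for m in matches)  # numeric order
```

Here `as_strings` comes out as `['1', '10', '2']` while `as_ints` is `[1, 2, 10]`, which is the order the masks should be written in.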

### Factoid: iterating over the zip object exhausts it; this is different behavior from a list. 

From [this web page](https://stackoverflow.com/questions/35395860/looping-zipped-list-in-python):

As pointed out in the first comment, the zip object is consumed with the first print(*zipped_list). However, you can convert the zip object to a list first, like so, to be able to use the values of zip object again:
```
zipped_list = list(zip(list1, list2))
```
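
A short demonstration of the difference, with made-up lists:

```python
letters = ["a", "b", "c"]
numbers = [1, 2, 3]

# A zip object is a one-shot iterator: the first full pass uses it up.
zipped = zip(letters, numbers)
first_pass = list(zipped)
second_pass = list(zipped)  # nothing left to iterate over

# Converting to a list up front gives something that can be looped over repeatedly.
zipped_list = list(zip(letters, numbers))
pass_a = [pair for pair in zipped_list]
pass_b = [pair for pair in zipped_list]
```

This is why `all_files_index` is wrapped in `list(...)`: it gets iterated once for the sanity check and again for the mask-writing loop.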

In [None]:
print(numbers_s) # I note 94 is in there, but Windows Explorer thinks it's not syncing.

[1, 2, 3, 4, 5, 7, 9, 10, 11, 14, 15, 16, 17, 20, 21, 22, 23, 24, 25, 27, 28, 31, 33, 37, 38, 39, 40, 43, 45, 46, 47, 48, 49, 50, 51, 54, 56, 58, 59, 61, 64, 66, 69, 70, 72, 73, 74, 77, 78, 79, 82, 83, 86, 87, 90, 91, 94, 95, 96, 97, 98, 99, 100, 104, 105, 106, 108, 109, 110, 111, 112, 114, 115, 117, 118, 119, 120, 121, 122, 124, 125, 126, 128, 132, 133, 134, 136, 138, 139, 141, 144, 145, 146, 147, 148, 149, 150, 152, 153, 155, 157, 158, 159, 160, 161, 163, 165, 166, 168, 169, 171, 173, 174, 175, 176, 179, 180, 181, 182, 183, 185, 186, 187, 188, 189, 193, 194, 195, 196, 197, 199, 200, 202, 204, 206, 208, 209, 210, 211, 214, 215, 217, 221, 223, 224, 226, 227, 229, 231, 233, 234, 235, 236, 237, 238, 239, 241, 242, 244, 245, 246, 248, 249, 250, 251, 253, 257, 259, 260, 261, 263, 264, 265, 266, 268, 271, 272, 275, 277, 278, 279, 280, 281, 282, 285, 286, 287, 291, 292, 293, 296, 297, 298, 299, 300, 301, 302, 305, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 321, 322, 323, 325, 326, 328

In [None]:
for fileset in all_files_index:
  # each entry is (image number, chip path, mask path, geojson path)
  _, src_raster_path, mask_path, bldg_gjson_path = fileset
  try:
    burn_bldgs_to_mask(src_raster_path, bldg_gjson_path, mask_path)
    print(f"Wrote file: {mask_path}")
  except OSError as e:
    # the f prefix is needed for {e} to be filled in
    print(f"File open error: {e}")
    break


Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask1.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask2.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask3.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask4.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask5.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask7.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask9.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask10.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask11.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask14.tif
Wrote file: /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask15.tif
Wrote 
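

If the instance dies partway through, one way to resume is to keep only the entries whose mask file is not on disk yet. The sketch below assumes entries laid out like those in `all_files_index` (number, chip path, mask path, geojson path); `pending_entries` is a hypothetical helper and the paths in the demo are made up.

```python
import os
import tempfile


def pending_entries(index):
    """Keep only the entries whose mask file (third element) does not exist yet."""
    return [entry for entry in index if not os.path.exists(entry[2])]


# Demo in a temporary directory: mask 1 already exists, mask 2 does not.
with tempfile.TemporaryDirectory() as tmp:
    index = [
        (1, "img1.tif", os.path.join(tmp, "mask1.tif"), "img1.geojson"),
        (2, "img2.tif", os.path.join(tmp, "mask2.tif"), "img2.geojson"),
    ]
    with open(index[0][2], "wb") as f:
        f.write(b"")  # stand-in for a mask written by an earlier run
    todo_numbers = [entry[0] for entry in pending_entries(index)]
```

A mask that was being written when the instance died may exist but be incomplete, so the last file reported as written is worth rechecking by hand before resuming.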