### In this document, I'll stage the Khartoum pansharpened-RGB data for processing.

1. Upload the pansharp-RGB chips and the building model JSONs to Google Drive. This happens offline. (The files are now in MyDrive/Khartoum/{pansharp, geojson/buildings}).
2. Convert all the building model JSONs to equivalent segmentation masks. This is done in MakeMaskFiles_large.ipynb.

I've done both of these things -- now I'll do a sanity check to make sure every image chip has a mask chip.

But first, below are 5 blocks you need to run every time you do anything in Colab. They are taken from MakeMaskFiles_large.ipynb and compressed as far as I could get them.

In [None]:
# Mount the drive: THIS NEEDS A USER RESPONSE
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%capture
%%bash
# Install conda: shell block (cell magics must be on the first lines of the cell)

MINICONDA_INSTALLER_SCRIPT=Miniconda3-py37_4.10.3-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX

conda install --channel defaults conda python=3.7 --yes
conda update --channel defaults --all --yes

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - _openmp_mutex==4.5=1_gnu
    - brotlipy==0.7.0=py37h27cfd23_1003
    - ca-certificates==2021.7.5=h06a4308_1
    - certifi==2021.5.30=py37h06a4308_0
    - cffi==1.14.6=py37h400218f_0
    - chardet==4.0.0=py37h06a4308_1003
    - conda-package-handling==1.7.3=py37h27cfd23_1
    - conda==4.10.3=py37h06a4308_0
    - cryptography==3.4.7=py37hd23ed53_0
    - idna==2.10=pyhd3eb1b0_0
    - ld_impl_linux-64==2.35.1=h7274673_9
    - libffi==3.3=he6710b0_2
    - libgcc-ng==9.3.0=h5101ec6_17
    - libgomp==9.3.0=h5101ec6_17
    - libstdcxx-ng==9.3.0=hd4cf53a_17
    - ncurses==6.2=he6710b0_1
    - openssl==1.1.1k=h27cfd23_0
    - pip==21.1.3=py37h06a4308_0
    - pycosat==0.6.3=py37h27cfd23_0
    - pycparser==2.20=py_2
    - pyopenssl=

--2021-09-20 17:45:36--  https://repo.continuum.io/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.201.79, 104.18.200.79, 2606:4700::6812:c84f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.201.79|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh [following]
--2021-09-20 17:45:36--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89026327 (85M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.10.3-Linux-x86_64.sh’

     0K .......... .......... .......... .......... ..........  0% 4.63M 18s
    50K .......... .......... .......... 

In [None]:
# Python block
import sys
_ = (sys.path
        .append("/usr/local/lib/python3.7/site-packages"))

In [None]:
%%capture
# shell block (the cell magic must be on the first line of the cell)
!conda install --channel conda-forge geopandas geojson --yes

Collecting package metadata (current_repodata.json): done
Solving environment: ...

In [None]:
# Python block
from osgeo import ogr, gdal

### Check that the images and mask files are in 1-1 correspondence.

Sanity check to make sure all image files have matching masks. They do. Also collect any .tif.aux.xml sidecar files found in the image directory so they can be removed.

In [None]:
chip_base = r'/content/drive/MyDrive/Khartoum/pansharp' #example: RGB-PanSharpen_AOI_5_Khartoum_img1.tif
mask_base = r'/content/drive/MyDrive/Khartoum/masks' #example: RGB-PanSharpen_AOI_5_Khartoum_mask1.tif

import os, sys
ps_files = os.listdir(chip_base)

import re
ps_pattern = re.compile(r"img(?P<numbers>[0-9]+)\.tif$")

# iterate over files in the directory of images
missing_masks = []
remove_files = []
for ps_file in ps_files:
  match = ps_pattern.search(ps_file)

  # discard null matches to tif.aux.xml files
  if match is None:
    remove_files.append(ps_file)
    continue
  img_num = match.group("numbers")

  mask_filename = os.path.join(mask_base, f"RGB-PanSharpen_AOI_5_Khartoum_mask{img_num}.tif")

  mask_file = gdal.Open(mask_filename)
  if mask_file is None:
    print(f"Unable to open mask file: {mask_filename}")
    missing_masks.append(ps_file)
  mask_file = None

print(f"Image files with missing masks: {len(missing_masks)}")

Image files with missing masks: 0


In [None]:
for rmfile in remove_files: 
  print(rmfile)
 


In [None]:
# Now check for missing image files corresponding to mask files. 
mask_pattern = re.compile(r"mask(?P<numbers>[0-9]+)\.tif$")

mask_files = os.listdir(mask_base)

# iterate over files in the directory of masks
missing_imgs = []
unmatch_files = []
for mask_file in mask_files:
  match = mask_pattern.search(mask_file)

  # discard null matches
  if match is None:
    unmatch_files.append(mask_file)
    continue
  num = match.group("numbers")

  img_filename = os.path.join(chip_base, f"RGB-PanSharpen_AOI_5_Khartoum_img{num}.tif")

  img_file = gdal.Open(img_filename)
  if img_file is None:
    print(f"Unable to open img file: {img_filename}")
    missing_imgs.append(mask_file)
  img_file = None

print(f"Mask files with missing images: {len(missing_imgs)}")
print(f"Mask files that don't match the mask file regex pattern: {len(unmatch_files)}")


Mask files with missing images: 0
Mask files that don't match the mask file regex pattern: 0
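
The two directory scans above can also be expressed as one set comparison on the chip numbers. The following is a minimal sketch under the assumption that the filenames follow the `img<N>.tif` / `mask<N>.tif` pattern used above; the helper names `chip_numbers` and `compare_chips` are hypothetical, and the sketch compares names only, so unlike the `gdal.Open` checks it does not confirm that a file is readable.

```python
import re

# Same filename patterns as the checks above.
IMG_RE = re.compile(r"img(?P<num>[0-9]+)\.tif$")
MASK_RE = re.compile(r"mask(?P<num>[0-9]+)\.tif$")

def chip_numbers(filenames, pattern):
    """Return the set of chip numbers from the filenames that match pattern."""
    return {int(m.group("num")) for m in map(pattern.search, filenames) if m}

def compare_chips(image_names, mask_names):
    """Return (chip numbers with no mask, chip numbers with no image), sorted."""
    img_nums = chip_numbers(image_names, IMG_RE)
    mask_nums = chip_numbers(mask_names, MASK_RE)
    return sorted(img_nums - mask_nums), sorted(mask_nums - img_nums)

# Made-up listing for illustration; in the notebook these would be
# os.listdir(chip_base) and os.listdir(mask_base).
images = ["RGB-PanSharpen_AOI_5_Khartoum_img1.tif",
          "RGB-PanSharpen_AOI_5_Khartoum_img2.tif",
          "RGB-PanSharpen_AOI_5_Khartoum_img2.tif.aux.xml"]
masks = ["RGB-PanSharpen_AOI_5_Khartoum_mask2.tif",
         "RGB-PanSharpen_AOI_5_Khartoum_mask3.tif"]

no_mask, no_image = compare_chips(images, masks)
print(no_mask, no_image)
```

In this made-up listing, chip 1 has no mask and chip 3 has no image; the `.tif.aux.xml` file is ignored because it does not end in `.tif`.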


### Added September 27, 2021: jointly normalize all the image files by channel.

Go through the full image set and calculate channel statistics. These may include percentiles, mean, and standard deviation depending on what I decide to do in the end. 

Then go through the image set again and write the normalized images to a new directory.

In [None]:
upper_perc = .95
lower_perc = .05

from PIL import Image

DATA_PATH = '/content/drive/MyDrive/Khartoum'
FRAME_PATH = os.path.join(DATA_PATH,'pansharp')
NORM_FRAME_PATH = os.path.join(DATA_PATH, 'pansharp-norm')
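
The cell above stops after defining the paths, so the normalization itself is not shown. One possible way to do the two passes described above is sketched here with NumPy on in-memory arrays. The function names `channel_quantiles` and `normalize` are hypothetical, the clip-and-rescale scheme is an assumption about what "jointly normalize by channel" could mean, and reading and writing the chips is left out. It also assumes that all pixels fit in memory at once and that the upper quantile exceeds the lower one in every channel.

```python
import numpy as np

def channel_quantiles(images, lower=0.05, upper=0.95):
    """Per-channel lower/upper quantiles over the pixels of all images.

    images: iterable of arrays shaped (height, width, channels).
    lower/upper are fractions, matching lower_perc/upper_perc above.
    """
    # Flatten every image to (pixels, channels) and stack them.
    pixels = np.concatenate([img.reshape(-1, img.shape[-1]) for img in images])
    lo = np.quantile(pixels, lower, axis=0)
    hi = np.quantile(pixels, upper, axis=0)
    return lo, hi

def normalize(img, lo, hi):
    """Rescale each channel so lo maps to 0 and hi to 1, clipping the tails."""
    scaled = (img.astype(np.float64) - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0)

# Synthetic stand-ins for image chips (random values, not real data).
rng = np.random.default_rng(0)
chips = [rng.integers(0, 2048, size=(4, 4, 3)) for _ in range(3)]

lo, hi = channel_quantiles(chips)        # first pass: joint statistics
normed = [normalize(c, lo, hi) for c in chips]  # second pass: apply them
print(lo.shape, normed[0].shape)
```

Because the statistics are computed once over the whole set, every chip is scaled with the same per-channel bounds, which keeps relative brightness between chips comparable.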



### Set up training, validation, and test splits.

This code is derived from examples at [A Keras pipeline for image segmentation](https://towardsdatascience.com/a-keras-pipeline-for-image-segmentation-part-1-6515a421157d).

In [None]:
import random
from PIL import Image

DATA_PATH = '/content/drive/MyDrive/Khartoum'
FRAME_PATH = os.path.join(DATA_PATH,'pansharp')
MASK_PATH = os.path.join(DATA_PATH,'masks')

# Create folders to hold images and masks

folders = ['train_frames', 'train_masks', 'val_frames', 'val_masks', 'test_frames', 'test_masks']
for folder in folders:
  new_folder = os.path.join(DATA_PATH,folder)
  if not os.path.exists(new_folder):
    os.mkdir(new_folder)
  
# Get all frames and masks, sort them, shuffle them to generate data sets.

all_frames = os.listdir(FRAME_PATH)
all_masks = os.listdir(MASK_PATH)

#sort in place
all_frames.sort(key=lambda var:[int(x) if x.isdigit() else x 
                                for x in re.findall(r'[^0-9]|[0-9]+', var)])
all_masks.sort(key=lambda var:[int(x) if x.isdigit() else x 
                               for x in re.findall(r'[^0-9]|[0-9]+', var)])
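
The sort key above gives a natural (numeric-aware) ordering, which matters because a plain string sort places `img10` before `img2`. Here is a small standalone illustration; `natural_key` is a hypothetical helper that groups runs of non-digits rather than splitting them into single characters, which should order filenames like these the same way.

```python
import re

def natural_key(name):
    """Split a name into text runs and integer runs so digits compare numerically."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.findall(r"[0-9]+|[^0-9]+", name)]

names = ["img10.tif", "img2.tif", "img1.tif"]

plain = sorted(names)                      # lexicographic order
natural = sorted(names, key=natural_key)   # numeric-aware order
print(plain)    # ['img1.tif', 'img10.tif', 'img2.tif']
print(natural)  # ['img1.tif', 'img2.tif', 'img10.tif']
```

Sorting both the frame list and the mask list with the same key is what lets the later `zip` line up image N with mask N.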

In [None]:
# sanity check 1-1 correspondence of sorted filenames
print(all_frames[-1])
print(all_masks[-1])

RGB-PanSharpen_AOI_5_Khartoum_img1686.tif
RGB-PanSharpen_AOI_5_Khartoum_mask1686.tif


In [None]:
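# Pair each image path with its mask path; this relies on both lists being in the same sorted order.
all_pairs = list(zip(
    [os.path.join(chip_base, img_file) for img_file in all_frames],
    [os.path.join(mask_base, mask_file) for mask_file in all_masks],
))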

In [None]:
random.seed(230) # for reproducibility
random.shuffle(all_pairs)


# Generate train, val, and test sets in the ratio: 7:2:1
train_split = int(0.7*len(all_pairs))
val_split = int(0.9 * len(all_pairs))

train_pairs = all_pairs[:train_split]
val_pairs = all_pairs[train_split:val_split]
test_pairs = all_pairs[val_split:]
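
A quick way to confirm that the split behaves as intended is to check that the three sets cover every pair exactly once and that the same seed reproduces the same split. The sketch below wraps the slicing above in a hypothetical function `split_pairs` and runs it on made-up pairs; it uses a private `random.Random` instance, so its shuffle order will not necessarily match the one produced by the cell above.

```python
import random

def split_pairs(pairs, train_end=0.7, val_end=0.9, seed=230):
    """Shuffle a copy of pairs and cut it at the cumulative fractions given."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # private RNG, leaves global state alone
    i = int(train_end * len(shuffled))
    j = int(val_end * len(shuffled))
    return shuffled[:i], shuffled[i:j], shuffled[j:]

# Made-up (image, mask) pairs standing in for all_pairs.
pairs = [(f"img{n}.tif", f"mask{n}.tif") for n in range(1, 21)]

train, val, test = split_pairs(pairs)
again = split_pairs(pairs)

# Every pair lands in exactly one set, and the split is repeatable.
covers_all = sorted(train + val + test) == sorted(pairs)
no_overlap = not (set(train) & set(val) or set(train) & set(test) or set(val) & set(test))
repeatable = (train, val, test) == again
print(len(train), len(val), len(test), covers_all, no_overlap, repeatable)
```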

In [None]:
dir_pair_train = ['/content/drive/MyDrive/Khartoum/train_frames', 
                  '/content/drive/MyDrive/Khartoum/train_masks']
dir_pair_val = ['/content/drive/MyDrive/Khartoum/val_frames', 
                  '/content/drive/MyDrive/Khartoum/val_masks']
dir_pair_test = ['/content/drive/MyDrive/Khartoum/test_frames', 
                  '/content/drive/MyDrive/Khartoum/test_masks']

In [None]:

from shutil import copy

def make_data_split(pair, dir_pair):
  """Copy one (image, mask) pair into the (frames, masks) directory pair."""
  print(f"Copying {pair[0]} to {dir_pair[0]}")
  print(f"Copying {pair[1]} to {dir_pair[1]}")
  print("")
  copy(pair[0], dir_pair[0])
  copy(pair[1], dir_pair[1])

### Copy training, validation, and test images and masks to the appropriate directories.

- train_frames
- train_masks
- val_frames
- val_masks
- test_frames
- test_masks

In [None]:
%%capture
for item in train_pairs:
  make_data_split(item, dir_pair_train)

print("------------------------------------------------------------")

for item in val_pairs:
  make_data_split(item, dir_pair_val)

print("------------------------------------------------------------")

for item in test_pairs:
  make_data_split(item, dir_pair_test)

Copying /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img244.tif to /content/drive/MyDrive/Khartoum/train_frames
Copying /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask244.tif to /content/drive/MyDrive/Khartoum/train_masks

Copying /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img1338.tif to /content/drive/MyDrive/Khartoum/train_frames
Copying /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask1338.tif to /content/drive/MyDrive/Khartoum/train_masks

Copying /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img147.tif to /content/drive/MyDrive/Khartoum/train_frames
Copying /content/drive/MyDrive/Khartoum/masks/RGB-PanSharpen_AOI_5_Khartoum_mask147.tif to /content/drive/MyDrive/Khartoum/train_masks

Copying /content/drive/MyDrive/Khartoum/pansharp/RGB-PanSharpen_AOI_5_Khartoum_img1456.tif to /content/drive/MyDrive/Khartoum/train_frames
Copying /content/drive/MyDrive/Kharto