# Lesson 3: Training a U-Net Segmentation Model

### Getting set up with the data

Create a Drive shortcut to the tiled imagery in your own My Drive folder by right-clicking the shared folder `terrabio`. The folder will then be available at the following path, which is accessible through the google.colab `drive` module: `'/content/gdrive/My Drive/servir-tf/terrabio/'`

We'll be working with the following folders in the `tiled` folder:
```
tiled/
├── images/
├── images_bright/
├── indices/
├── indices_800/
├── labels/
└── labels_800/
```

pip version 21.1.3 and Python 3.7 are required, or else TensorFlow Addons will not install. Python 3.8 won't work because the pinned TensorFlow Addons version is too old.
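
The same requirement can be checked from Python before installing anything. This is a minimal sketch, assuming only the two versions stated above; `environment_problems` is a hypothetical helper name, not something the notebook defines.

```python
import sys

REQUIRED_PYTHON = (3, 7)   # from the text above
REQUIRED_PIP = "21.1.3"    # from the text above

def environment_problems(python_version=None, pip_version=None):
    """Return a list of human-readable version mismatches (empty if none)."""
    problems = []
    # Default to the interpreter that is running this code.
    py = tuple(python_version) if python_version else tuple(sys.version_info[:2])
    if py != REQUIRED_PYTHON:
        problems.append(f"Python {py[0]}.{py[1]} found, 3.7 expected")
    # The pip check is skipped when no pip version is supplied.
    if pip_version is not None and pip_version != REQUIRED_PIP:
        problems.append(f"pip {pip_version} found, {REQUIRED_PIP} expected")
    return problems

# Example usage:
# import pip
# print(environment_problems(pip_version=pip.__version__))
```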

In [3]:
!python --version

Python 3.7.11


In [19]:
!pip --version

pip 21.1.3 from /home/rave/miniconda3/envs/servir/lib/python3.7/site-packages/pip (python 3.7)


In [18]:
!pip install pip==21.1.3



In [4]:
!pip install -q rasterio
!pip install -q geopandas
!pip install git+https://github.com/tensorflow/examples.git
!pip install -U tfds-nightly
!pip install focal-loss
!pip install matplotlib
!pip install opencv-python
!pip install scikit-learn # already on Colab, but maybe not installed locally
!pip install tensorflow-addons==0.8.3

Collecting git+https://github.com/tensorflow/examples.git
  Cloning https://github.com/tensorflow/examples.git to /tmp/pip-req-build-w80gv59t
  Running command git clone -q https://github.com/tensorflow/examples.git /tmp/pip-req-build-w80gv59t
Collecting absl-py
  Using cached absl_py-1.0.0-py3-none-any.whl (126 kB)
Building wheels for collected packages: tensorflow-examples
  Building wheel for tensorflow-examples (setup.py) ... done
  Created wheel for tensorflow-examples: filename=tensorflow_examples-6e73f65d7883e301a4115b65988a4f893c47430d_-py3-none-any.whl size=268415 sha256=d6d85e6e9bdbb8a702db0164e2e285e8ce7a5645f0c8505553d0132c11d7a05b
  Stored in directory: /tmp/pip-ephem-wheel-cache-g3khc3kp/wheels/eb/19/50/2a4363c831fa12b400af86325a6f26ade5d2cdc5b406d552ca
Failed to build tensorflow-examples
Installing collected packages: absl-py, tensorflow-examples
    Running setup.py install for tensorflow-examples ... done
  DEPRECATION: tensorflow-examples was in

In [11]:
import os, glob, functools, fnmatch
from zipfile import ZipFile
from itertools import product

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['axes.grid'] = False
mpl.rcParams['figure.figsize'] = (12,12)

from sklearn.model_selection import train_test_split
import matplotlib.image as mpimg
import pandas as pd
from PIL import Image

import rasterio
from rasterio import features, mask, windows

import geopandas as gpd

import tensorflow as tf
# Use the public Keras API; tensorflow.python.* is a private namespace
from tensorflow.keras import layers, losses, models
from tensorflow.keras import backend as K
import tensorflow_addons as tfa

from tensorflow_examples.models.pix2pix import pix2pix

import tensorflow_datasets as tfds
tfds.disable_progress_bar()

from IPython.display import clear_output

from focal_loss import SparseCategoricalFocalLoss
from sklearn.metrics import confusion_matrix, f1_score

import cv2

In [None]:
import os
if 'google.colab' in str(get_ipython()):
    root_dir = '/content/gdrive/My Drive/servir-tf/terrabio/' 
    print('Running on CoLab')
else:
    root_dir = './data/' 
    print(f'Not running on CoLab, data needs to be downloaded locally at {os.path.abspath(root_dir)}')

img_dir = root_dir+'tiled/indices/' # or root_dir+'tiled/images_bright/' if using the optical tiles
label_dir = root_dir+'tiled/labels/'

### Enabling GPU

This notebook can use a GPU and runs faster with one. The following code checks whether a GPU is available.

If no GPU is found, you can change your session/notebook runtime to use one. See the [instructions](https://colab.research.google.com/notebooks/gpu.ipynb#scrollTo=sXnDmXR7RDr2).

In [12]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


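
The check above raises an error when no GPU is visible. As a gentler alternative (a sketch, not the notebook's own code), `tf.config.list_physical_devices('GPU')` returns a list that is simply empty on a CPU-only machine. `visible_gpu_names` is a hypothetical helper name.

```python
def visible_gpu_names():
    """Return the names of the GPUs TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]

gpus = visible_gpu_names()
if gpus:
    print("Found GPU(s):", gpus)
else:
    print("No GPU found; training will run on the CPU and be slower.")
```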

As a recap, let's check out the classes we are working with.

In [16]:
# Read the classes
class_index = pd.read_csv(root_dir+'terrabio_classes.csv')
class_names = class_index.class_name.unique()
print(class_index) 

   class_id   class_name
0         0   Background
1         1     Bushland
2         2      Pasture
3         3        Roads
4         4        Cocoa
5         5   Tree cover
6         6    Developed
7         7        Water
8         8  Agriculture


### Read the image files and label files into tensorflow dataset Python objects

Now we will compile the spectral index image and label tiles into training, validation, and test datasets for use with TensorFlow.

In [22]:
# get lists of image and label tile pairs for training and testing

def get_train_test_lists(imdir, lbldir):
    imgs = glob.glob(imdir+"/*.png")
    dset_list = []
    for img in imgs:
        filename_split = os.path.splitext(img) 
        filename_zero, fileext = filename_split
        basename = os.path.basename(filename_zero)
        dset_list.append(basename)
    x_filenames = []
    y_filenames = []
    for img_id in dset_list:
        x_filenames.append(os.path.join(imdir, "{}.png".format(img_id)))
        y_filenames.append(os.path.join(lbldir, "{}.png".format(img_id)))
    print("number of images: ", len(dset_list))
    print("number of labels: ", len(y_filenames))
    return dset_list, x_filenames, y_filenames

In [23]:
train_list, x_train_filenames, y_train_filenames = get_train_test_lists(img_dir, label_dir)

number of images:  37350
number of labels:  37350
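
Note that `get_train_test_lists` builds each label path from the image name, so the two printed counts are equal by construction, whether or not the label files actually exist. A minimal sketch of an explicit pairing check follows; `find_unpaired_tiles` is a hypothetical helper, not part of the notebook.

```python
import glob
import os

def find_unpaired_tiles(imdir, lbldir, ext=".png"):
    """Return base names of image tiles that have no same-named label tile."""
    unpaired = []
    for img_path in sorted(glob.glob(os.path.join(imdir, "*" + ext))):
        basename = os.path.splitext(os.path.basename(img_path))[0]
        if not os.path.exists(os.path.join(lbldir, basename + ext)):
            unpaired.append(basename)
    return unpaired

# Example usage with the directories defined earlier:
# print(find_unpaired_tiles(img_dir, label_dir))
```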


Let's check the proportion of background tiles. This takes a while, so set `skip = True` to load the saved results instead.

In [27]:
skip = False

if not skip:
    background_list_train = []
    for i in train_list: 
        # read in each label image
        #print(label_dir+"{}.png".format(i))
        img = np.array(Image.open(label_dir+"{}.png".format(i))) 
        # check if no values in image are greater than zero (background value)
        if img.max()==0:
            background_list_train.append(i)
    print("number of background training images: ", len(background_list_train))
    with open(os.path.join(root_dir,'background_list_train.txt'), 'w') as f:
        for item in background_list_train:
            f.write("%s\n" % item)
else:
    # Use a context manager so the file is closed after reading
    with open(os.path.join(root_dir, 'background_list_train.txt'), 'r') as f:
        background_list_train = [line.strip() for line in f]
    print("number of background training images: ", len(background_list_train))

number of background training images:  36489
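
With the counts reported so far, 36489 of 37350 tiles (about 97.7%) contain only background. The test used in the loop can be tried on small synthetic arrays, as in the sketch below; `is_background_tile` is a hypothetical name and the arrays are made up.

```python
import numpy as np

def is_background_tile(label_array):
    """True when no pixel is greater than zero, i.e. the tile is all background."""
    return int(np.asarray(label_array).max()) == 0

empty_tile = np.zeros((4, 4), dtype=np.uint8)   # all background
mixed_tile = empty_tile.copy()
mixed_tile[0, 0] = 4                            # one pixel of class 4

print(is_background_tile(empty_tile), is_background_tile(mixed_tile))

# Share of background-only tiles, using the two counts printed above.
background_share = 36489 / 37350
print(round(background_share, 3))
```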


We will keep only 10% of the background tiles, since too many of them can cause a form of class imbalance.

In [None]:
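background_removal = len(background_list_train) * 0.9
print(background_removal)
# Build the set of background tiles to drop once, so each membership test is fast
background_to_drop = set(background_list_train[0:int(background_removal)])
train_list_clean = [y for y in train_list if y not in background_to_drop]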

x_train_filenames = []
y_train_filenames = []
for img_id in train_list_clean: 
    x_train_filenames.append(os.path.join(img_dir, "{}.png".format(img_id)))
    y_train_filenames.append(os.path.join(label_dir, "{}.png".format(img_id)))
    
print(f"Remaining training set size : {len(train_list_clean)}")

32840.1
Remaining training set size : 4510
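
The remaining size follows from the numbers above: 37350 - int(36489 × 0.9) = 37350 - 32840 = 4510. The cell drops the first 90% of `background_list_train` in whatever order the files were listed. If that order is not random (worth checking, since `glob` order depends on the file system), the kept tiles may not be representative. A seeded random draw is one alternative; this is a swapped-in technique, not what the notebook does, and `drop_background_tiles` is a hypothetical name.

```python
import random

def drop_background_tiles(all_tiles, background_tiles, drop_fraction=0.9, seed=42):
    """Drop a random `drop_fraction` of the background tiles; keep everything else."""
    rng = random.Random(seed)                      # seeded for reproducibility
    n_drop = int(len(background_tiles) * drop_fraction)
    to_drop = set(rng.sample(list(background_tiles), n_drop))
    return [tile for tile in all_tiles if tile not in to_drop]

# Toy example: 100 tile ids, the first 50 of which are background-only.
tiles = [f"tile_{i}" for i in range(100)]
background = tiles[:50]
kept = drop_background_tiles(tiles, background)
print(len(kept))
```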


Now that we have the set of files we want to use for developing our model, we need to split them into three sets:
* the training set, which the model learns from
* the validation set, which allows us to evaluate models and decide how to change them
* the test set, which we will use to communicate the results of the best-performing model (as determined by the validation set)

We will split index tiles and label tiles into train, validation and test sets: 70%, 20% and 10%, respectively.

In [32]:
x_train_filenames, x_val_filenames, y_train_filenames, y_val_filenames = train_test_split(x_train_filenames, y_train_filenames, test_size=0.3, random_state=42)
x_val_filenames, x_test_filenames, y_val_filenames, y_test_filenames = train_test_split(x_val_filenames, y_val_filenames, test_size=0.33, random_state=42)

num_train_examples = len(x_train_filenames)
num_val_examples = len(x_val_filenames)
num_test_examples = len(x_test_filenames)

print("Number of training examples: {}".format(num_train_examples))
print("Number of validation examples: {}".format(num_val_examples))
print("Number of test examples: {}".format(num_test_examples))

Number of training examples: 2209
Number of validation examples: 635
Number of test examples: 313


0.09914475768134305
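
The two chained calls give roughly, not exactly, 70/20/10: the second call holds out 33% of the 30% remainder, so the test share is about 0.3 × 0.33 ≈ 9.9% and the validation share about 20.1%. The final value printed above matches 313 / 3157. The reported counts sum to 3157 (which equals 70% of the 4510 tiles kept earlier) rather than 4510; one possible explanation, offered only as a guess, is that the cell was re-run after `x_train_filenames` had already been overwritten by a previous split. The sketch below reproduces the reported counts from 3157 samples, assuming scikit-learn rounds each held-out size up; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, first_test_size=0.3, second_test_size=0.33):
    """Sizes of (train, val, test) for the two chained splits in the cell above.

    Assumption: the held-out size of each split is rounded up (ceil).
    """
    n_holdout = math.ceil(first_test_size * n_samples)   # val + test together
    n_train = n_samples - n_holdout
    n_test = math.ceil(second_test_size * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(3157)
total = n_train + n_val + n_test
print(n_train, n_val, n_test)
print(n_train / total, n_val / total, n_test / total)
```

Assigning the results of a split to new variable names, instead of reusing `x_train_filenames`, would make the cell safe to re-run.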