# Dealing with limited data for semantic segmentation
> Strategies for efficiently collecting more data to target the areas where a model underperforms, and techniques for maximizing the utility of the data already in hand



After evaluating how well a model has performed, we do one of two things:

* Decide we are happy with the model's performance on the validation set, and report its performance on the test set (and validation set). Hooray!
* Diagnose the model's issues in terms of false positives and false negatives, and make a plan for improving performance on the classes that are underperforming.
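To make the second option concrete, here is a minimal sketch of how per-class false positives and false negatives could be tallied from a confusion matrix. The function name `per_class_errors` and the toy arrays are hypothetical and not part of this notebook's pipeline; the sketch assumes the true and predicted masks hold integer class ids in the range `0` to `num_classes - 1`.

```python
import numpy as np

def per_class_errors(y_true, y_pred, num_classes):
    """Return a confusion matrix plus per-class false positives and false negatives."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Encode each (true, predicted) pair as one index, then count occurrences.
    # Rows of the matrix are true classes, columns are predicted classes.
    pair_index = y_true * num_classes + y_pred
    cm = np.bincount(pair_index, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as class c, but truly another class
    fn = cm.sum(axis=1) - tp  # truly class c, but predicted as another class
    return cm, fp, fn

# Toy 2x3 "masks" with three hypothetical classes (0, 1, 2).
y_true = np.array([[0, 1, 1], [2, 2, 0]])
y_pred = np.array([[0, 1, 2], [2, 0, 0]])
cm, fp, fn = per_class_errors(y_true, y_pred, num_classes=3)
print("false positives per class:", fp)
print("false negatives per class:", fn)
```

A class with many false negatives is being missed, while a class with many false positives is being over-predicted; the two cases can call for different remedies.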

One of the most fundamental and high-impact practices for improving model performance, particularly with deep learning, is to increase the overall size of the training dataset, focusing on the classes that are underperforming. However, in remote sensing it is difficult and time-consuming to acquire high-quality training labels, particularly compared to other domains where computer vision and machine learning techniques are used.

Because of this unique difficulty when annotating geospatial imagery, we need to do two things:

* Closely inspect our original labeled dataset for quality issues, such as mismatch with the imagery due to date, incorrect class labels, and incorrect label boundaries.
* Weigh the costs and benefits of annotating new labels against trying other approaches that maximize our model's performance with the data we already have.
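One of the quality issues listed above, incorrect class labels, can be partly screened automatically. The sketch below flags pixel values in a label tile that are not known class ids. The helper `check_label_tile` and the toy tile are hypothetical; the set of valid ids (0 to 8) is an assumption taken from the class table printed later in this notebook. Date mismatches and wrong boundaries still need visual inspection against the imagery.

```python
import numpy as np

def check_label_tile(label, valid_ids):
    """Return {pixel_value: pixel_count} for values that are not known class ids."""
    values, counts = np.unique(np.asarray(label), return_counts=True)
    return {
        int(value): int(count)
        for value, count in zip(values, counts)
        if int(value) not in valid_ids
    }

# Assumption: class ids run from 0 to 8, as in the class table in this notebook.
valid_ids = set(range(9))

# Toy label tile containing one stray value (12) that is not a valid class id.
tile = np.array([[0, 4, 4], [5, 12, 4]])
unexpected = check_label_tile(tile, valid_ids)
print(unexpected)
```

An empty dictionary means every pixel carries a known class id; it does not mean the labels are correct.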



## Specific concepts that will be covered



**Audience:** This post is geared towards intermediate users who are comfortable with basic machine learning concepts. 

**Time Estimated:** 60-120 min



## Setup Notebook

In [None]:
# install required libraries
!pip install -q rasterio==1.2.10
!pip install -q geopandas==0.10.2
!pip install -q git+https://github.com/tensorflow/examples.git
!pip install -q -U tfds-nightly
!pip install -q focal-loss
!pip install -q tensorflow-addons==0.8.3
#!pip install -q matplotlib==3.5 # UNCOMMENT if running on LOCAL
!pip install -q scikit-learn==1.0.1
!pip install -q scikit-image==0.18.3

In [3]:
# mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
# set your root directory and tiled data folders
import os  # needed for os.path below

if 'google.colab' in str(get_ipython()):
    root_dir = '/content/gdrive/My Drive/servir-tf/terrabio/' 
    print('Running on CoLab')
else:
    root_dir = './data/' 
    print(f'Not running on CoLab, data needs to be downloaded locally at {os.path.abspath(root_dir)}')

img_dir = os.path.join(root_dir,'tiled/indices/') # or root_dir+'tiled/images_bright/' if using the optical tiles
label_dir = os.path.join(root_dir,'tiled/labels/')

Running on CoLab


In [5]:
# go to root directory
%cd $root_dir 

/content/gdrive/My Drive/servir-tf/terrabio


### Enabling GPU

This notebook can use a GPU and runs faster with one. The following code checks whether a GPU is available.

If no GPU is found, you can change your session/notebook runtime to use one. See the [instructions](https://colab.research.google.com/notebooks/gpu.ipynb#scrollTo=sXnDmXR7RDr2).

In [None]:
# this is a google colab specific command to ensure TF version 2 is used. 
# it won't work in a regular jupyter notebook, for a regular notebook make sure you install TF version 2
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


### Check out the labels

In [27]:
# Read the classes
import pandas as pd  # needed for pd.read_csv below

class_index = pd.read_csv(root_dir+'tiled/terrabio_classes.csv')
class_names = class_index.class_name.unique()
print(class_index)


   class_id   class_name
0         0   Background
1         1     Bushland
2         2      Pasture
3         3        Roads
4         4        Cocoa
5         5   Tree cover
6         6    Developed
7         7        Water
8         8  Agriculture
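
With the class table in hand, one quick check is how many labeled pixels each class actually has, since thinly represented classes are natural candidates for new annotation. The sketch below is illustrative only: `class_pixel_share` and the toy tiles are hypothetical, the class names are copied from the table above, and it assumes every label tile is an integer array whose values are valid class ids (0 to 8).

```python
import numpy as np

# Class names in class_id order, copied from the table above.
class_names = ["Background", "Bushland", "Pasture", "Roads", "Cocoa",
               "Tree cover", "Developed", "Water", "Agriculture"]

def class_pixel_share(label_tiles, class_names):
    """Return {class_name: fraction of all labeled pixels} across the tiles."""
    n_classes = len(class_names)
    counts = np.zeros(n_classes, dtype=np.int64)
    for tile in label_tiles:
        # Assumption: tile values are integers in [0, n_classes - 1].
        counts += np.bincount(np.asarray(tile).ravel(), minlength=n_classes)
    share = counts / counts.sum()
    return {name: float(s) for name, s in zip(class_names, share)}

# Two toy 2x2 label tiles standing in for real tiles read from label_dir.
tiles = [np.array([[0, 4], [4, 5]]), np.array([[4, 4], [2, 0]])]
share = class_pixel_share(tiles, class_names)
for name, fraction in share.items():
    print(f"{name:<12} {fraction:.3f}")
```

In practice the tiles would be read from `label_dir` (for example with `rasterio`, which is installed above) rather than typed in by hand.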
