# Supervised Machine Learning ~ Classification of NYC Water
Author: Connor Finn <br>
Date: June 18, 2020 <br>
Description: <dir>
    This script will perform classification on images of the Greater NYC area in order to distinguish between water and land pixels. The ground truth data was provided by Dr. Narayanaswamy, who detailed polygons of land and water in Google Earth, and stored these regions as json files. To use this notebook, please store these files in known locations in your working environment. <br>

    
References: To develop this code, I referenced Arvind's classification notebook as well as the Google ee [documentation](https://developers.google.com/earth-engine/classification)
    
    


# Part 1: Set UP

### Initialize google earth engine

In [223]:
import ee
import geojson
import json
import pygeoj
import numpy as np
from IPython.display import Image

# Trigger the authentication flow.
ee.Authenticate()

Enter verification code: 4/1AE8F4SNfaQrRTchpAHpBxrrvGsraySYFYUfq7HAKuoCX5FpToU14yo

Successfully saved authorization token.


In [225]:
ee.Initialize()

# Location of interest
This region of manhattan is where we are building our classifier

In [255]:
nycsite = ee.Geometry.Rectangle([-74.04, 40.69, -73.82, 40.94]);

image0 = image_list_nyc[5] #one image from the list of images

parameters = {'min': 0.0,
              'max': 16000.0,
              'dimensions': 768,
              'bands': ['B4', 'B3', 'B2']}

Image(url = image0.clip(nycsite).getThumbUrl(parameters))

# Part 2: Collect Images

### 2.a Function used to collect a list of google ee image objects

In [256]:
def get_images(path_list, row_list, satellite, start_date, end_date, max_cloud_percentage):
  # This function will get a list of image objects ~ according to the provided information

    # get image collection object
    coll = ee.ImageCollection(satellite)\
        .filterDate(start_date, end_date)\
        .filter(ee.Filter.inList('WRS_PATH', path_list))\
        .filter(ee.Filter.inList('WRS_ROW', row_list))\
        .filter(ee.Filter.lt('CLOUD_COVER' , max_cloud_percentage))  # note ~ not less than or equal to

    # get image_id's
    image_ids = list( map( lambda x : x['id'] , coll.getInfo()['features'] ) ) 
    
    # get image objects
    images = list( map( lambda x: ee.Image(x) , image_ids ) )
    
    return images


### 2.b Collect the images

In [257]:
# Get a list of images to work with
p = [14]
r = [32]
sat = 'LANDSAT/LC08/C01/T1'
sd = '2013-05-01'
ed = '2020-05-01'
cc= 10
image_list_nyc = get_images(p, r, sat, sd, ed, cc)

# Part 3: Ground Truth Data

Below is code from Arvind, this is used to take the ground truth data he created with google earth, and create dictionaries. <br>

I have converted it to a function in order to allow for expansion


### 3.a: Function to create featurevector of labeled geolocations

In [258]:
"""
This Function is compiled using code provided by Arvind. It is essential that we can 
Figure out a way to create 2 sets of data points which have no overlap. 
    1. Training ~ 80% 
    2. Testing  ~ 20%
"""

def get_labeled_data(jsonfiles , classalloc , num_points):
    '''
    Input:
        jsonfiles = list of regions in json format
        classalloc = list of integer classifiers 
                + for now, 1 = land, 0 = water
    Output:
        fc = feature collection to train on
        nycfc = another feature collection  
    '''


    # Dictionaries to store intermediate objects before we get to features and feature collections.
    coords_dict = {}
    ee_dict = {}
    randomPts_dict = {}
    features_dict = {}

    # Build the dictionaries
    n = 0
    for jsonfile in jsonfiles:
        jsonfilepath =  jsonfile +'.json'
        with open(jsonfilepath) as f:
            data = geojson.load(f)

        #creating a dictionary of coordinates
        coords_dict[jsonfile + 'coords'] = np.array(data['features'][0]['geometry']['coordinates'][0])[:,0:2].tolist()

        #creating a polygon from coordinate list
        ee_dict[jsonfile + 'ee'] = ee.Geometry.Polygon(coords_dict[jsonfile + 'coords'])

        """
        BIG PROBLEM:
            How can we make the random points below into two EXCLUSIVE GROUPS

        """
        randomPoints = ee.FeatureCollection.randomPoints(region=ee_dict[jsonfile + 'ee'],points=num_points)


        randomPoints = randomPoints.map(lambda x: x.set({'landcover': classalloc[n]})) #This is to add a property named

        randomPts_dict[jsonfile+'Pts'] = randomPoints



        #randomPts_dict[jsonfile + 'rdnmPts'] = ee.FeatureCollection.randomPoints(ee_dict[jsonfile + 'ee'], 100)
        features_dict[jsonfile + 'feature'] = ee.Feature(ee_dict[jsonfile + 'ee'], {'name': jsonfile, 'landcover': classalloc[n]})

        n = n+1
    '''
    The individual features are combined as shown below to create a feature collection.

    You can get some information about the features in the collection using commands as shown below.
    '''
    nycFC = ee.FeatureCollection(list(features_dict.values()))

    nycFCpts = ee.FeatureCollection(list(randomPts_dict.values()))

    fc = ee.FeatureCollection([])
    for x in randomPts_dict.keys():
        fc = fc.merge(randomPts_dict[x])

    
    return fc



### 3.b Collect Ground Truth

In [259]:
# Ground Truth Data
jsonfiles = ['Hudson01', 'Hudson02', 'Bronx01', 'Astoria01'];
classalloc = [1, 1, 0, 0];
num_points = 10000
num_points_test = 2000

'''
Current problem ~ There easily could be overlap between these two samples. need to do a stratafied sample of sorts
'''

training_feature_collection = get_labeled_data(jsonfiles, classalloc , num_points )
testing_feature_collection  = get_labeled_data(jsonfiles, classalloc , num_points_test )

# Part 4: Build ML Classifier

### 3.a ML Class

In [260]:
class Regression_Tree_Classifier():
    
    def __init__(self):
        self.classifier = None  # EE Classifier object
        self.validated = None   # FeatureCollection object
    
    def set_classifier(self, maxNodes = None , minLeafPopulation = 1):
        '''
            Input:
                maxNodes: The maximum number of nodes ~ i.e. if maxNodes = 3, the decision tree
                          will split your dataset two times. (defaults to no limit)
                minLeafPopulation: The minimum number of datapoints in each node:

            Output: None ~ initializes self.classifer
        '''
        
        self.classifier = ee.Classifier.smileCart(maxNodes = 2 , minLeafPopulation = 100)
    
    def train(self, train_coll, image, bands , label ):
        '''
            Input:
                train_coll ~ featureCollection object with labeled data
                image: image object you will train on
                bands: list of strings ('B1' ect) the bands we are training on
                label: what you have named the labeled data (for us it is 'landcover')

            Output: None ~ trains self.classifier
        '''
        
        # prepare the training data
        training = image.select(bands).sampleRegions(\
                   collection =  train_coll,\
                   properties = [label],\
                   scale = 30.0)            # I am currently unsure what this is for ~ leave as 30
    
        # Train the classifier
        self.classifier = self.classifier.train(training, label, bands)
    
    
    def get_training_accuracy(self):
        '''
            Input: None

            Output: 
                float ~ the training accuracy of the classifier
        '''

        return self.classifier.confusionMatrix().accuracy().getInfo()
    
    def test(self, test_coll , image, bands, label):
        '''
            Input:
                test_coll ~ featureCollection object with labeled data
                image: image object you will test on
                bands: list of strings ('B1' ect) the bands we are testing on - needs to be the same as train
                label: what you have named the labeled data (for us it is 'landcover')

            Output: sets the self.validated feature collection 
        '''
        
        
        # test on data not trained on
        validation = image.select(bands).sampleRegions(\
                   collection =  test_coll,\
                   properties = [label],\
                   scale = 30.0)            # I am currently unsure what this is for ~ leave as 30
         

        self.validated = validation.classify(self.classifier)    
    
    def get_testing_accuracy(self , label):
        '''
            Input:
                label: what you have named the labeled data (for us it is 'landcover')
            Output:
                float ~ the training accuracy of the classifier
        '''        
        return self.validated.errorMatrix(label, 'classification').accuracy().getInfo()
        
    
    def apply_to_image(self, image , bands):
        # function to apply classifier to an entire image
        return image.select(bands).classify(self.classifier)
  

### 3.b Train model

In [261]:
training_image =  image_list_nyc[5]  # Randomly select the fifth image
testing_image = image_list_nyc[0]

b =  ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B10', 'B11']
l = 'landcover'
cls = Regression_Tree_Classifier()

cls.set_classifier(50 , 1)
cls.train(training_feature_collection, training_image , b, l)
#cls.train(training_feature_collection, testing_image , b, l) # including this makes a big difference!

In [262]:
# I am not sure why this takes longer than the actual training
cls.get_training_accuracy()

0.999975

### 3.c Test model with Image we Trained on

In [263]:
cls.test(testing_feature_collection, training_image, b , l)
cls.get_testing_accuracy(l)

1

In [264]:
result = cls.apply_to_image(training_image , bands)
parameters = {'min': 0.0,
              'max': 1,
              'dimensions': 768,
              'palette': ['white', 'blue']}

Image(url = result.clip(nycsite).getThumbUrl(parameters))

### 3.d Test model with different image

In [265]:
cls.test(testing_feature_collection, testing_image, b , l)
cls.get_testing_accuracy(l)

0.59175

In [266]:
result2 = cls.apply_to_image(testing_image , bands)
Image(url = result2.clip(nycsite).getThumbUrl(parameters))

# Part 4. Takaways
As of now, the model accurately fits the training data (probably overfits). It performs well when used on the image from with the training data came from. it does not extrapolate well to other locations. 

### ideas.  
<dir>
    1. Use a cloud mask ~ remove all pixels with clouds <br>
    2. Train over all images <br>
    3. Choose a model with higher capacity <br>