# Workflow for training dataset generation for ML UNICEF school detection

This notebook walks through the steps we took to generate the training dataset for the machine learning work on school detection.

- The Development Seed Data Team, a group of eight expert mappers, validated and cleaned the training dataset.
  - The expert mappers reviewed each school against the DG Vivid base satellite layer and classified it into one of three groups:
    - confirmed schools
    - unrecognized schools
    - not-schools
- We added a few more object types (hospitals, courthouses, marketplaces, parks, and farms) to the not-school class to balance the training classes.

- We generated tile grids from the school and not-school points with DevSeed Geokit.

- We wrote a Python function to download the tiles for both classes.

In the end, these are the numbers we have:

| Tasks | Confirmed | Unrecognized | Not-schools | Total |
| ----- | --------- | ------------ | ----------- | ----- |
| Data cleaning | 6,663 | 11,774 | 2,268 | 20,705 |
| Tile generation | 5,452 | N/A | 3,953 | 9,405 |


## Adding more objects from OSM


- Step one: download the [OSM Colombia dataset](http://download.geofabrik.de/south-america/colombia.html) from Geofabrik.

- Step two: split the validated and cleaned points of the training dataset into confirmed schools and not-schools.

- Step three: use [Development Seed Geokit](https://github.com/developmentseed/geokit) to extract hospitals, courthouses, marketplaces, parks, and farms.
  - Taking the extraction of hospitals from OSM as an example:
  
    - run `docker run --rm -v ${PWD}:/app developmentseed/geokit osmfilter colombia.osm --keep="amenity=hospital" -o=hospital_r_colombia.osm`
    
    - run `docker run --rm -v ${PWD}:/app developmentseed/geokit osmtogeojson hospital_r_colombia.osm > hospitals_c.geojson`
    
    - in QGIS, select only the point features and save them as `hospitals_c_final.geojson` for the training dataset.
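
The point-selection step done in QGIS could also be scripted. The following is a minimal sketch, assuming the `osmtogeojson` output is a standard GeoJSON FeatureCollection. The helper names `keep_points` and `filter_points_file` are hypothetical and are not part of Geokit or QGIS.

```python
import json


def keep_points(feature_collection):
    """Return a new FeatureCollection holding only the Point features."""
    points = [
        feature
        for feature in feature_collection.get("features", [])
        # geometry may be null in GeoJSON, so guard against None
        if (feature.get("geometry") or {}).get("type") == "Point"
    ]
    return {"type": "FeatureCollection", "features": points}


def filter_points_file(src_path, dst_path):
    """Read a GeoJSON file, keep its Point features, and write the result."""
    with open(src_path, "r") as src:
        collection = json.load(src)
    filtered = keep_points(collection)
    with open(dst_path, "w") as dst:
        json.dump(filtered, dst)
    # number of point features written
    return len(filtered["features"])
```

For the hospital example this would be called as `filter_points_file("hospitals_c.geojson", "hospitals_c_final.geojson")`. Features mapped as polygons or lines are dropped, which mirrors selecting only points by hand.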


## Create tiles from points

- ### Merge all the geojsons

After adding all the objects (hospitals, parks, farms, courthouses, and marketplaces), merge their GeoJSON files into `not_schools_final.geojson`.

Run `docker run --rm -v ${PWD}:/app developmentseed/geokit geojson-merge input1.geojson input2.geojson > output.geojson`, replacing the GeoJSON file names with your own.
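
To make clear what the merge produces, here is a minimal Python sketch that concatenates the features of several FeatureCollections into one. It only approximates the result of `geojson-merge` and is not its implementation. The function names are hypothetical, and the sketch assumes every input file is a FeatureCollection.

```python
import json


def merge_feature_collections(collections):
    """Concatenate the features of several FeatureCollections into one."""
    merged = []
    for collection in collections:
        merged.extend(collection.get("features", []))
    return {"type": "FeatureCollection", "features": merged}


def merge_geojson_files(input_paths, output_path):
    """Load each GeoJSON file, merge the features, and write one file."""
    collections = []
    for input_path in input_paths:
        with open(input_path, "r") as src:
            collections.append(json.load(src))
    merged = merge_feature_collections(collections)
    with open(output_path, "w") as dst:
        json.dump(merged, dst)
    # number of features in the merged file
    return len(merged["features"])
```

The input order is preserved, so the merged file lists all features of the first input, then the second, and so on.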

- ### Generate not-school tile-grid from not-school points

Run `docker run --rm -v ${PWD}:/app developmentseed/geokit point2tile data/combined_not_schools_final.geojson --zoom=17 --buffer=0.001 > data/not_schools_tiles_1m.geojson`

- ### Generate school tile-grid from school points

Run `docker run --rm -v ${PWD}:/app developmentseed/geokit point2tile data/confirmed_schools_final.geojson --zoom=17 --buffer=0.001 > data/schools_tiles_1m.geojson`
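
For intuition about what a tile index at zoom 17 means, the sketch below converts a longitude/latitude pair to slippy-map tile indices with the standard Web Mercator tiling formula. This is an illustration only, not Geokit's `point2tile` implementation. It ignores the `--buffer` option, and the function name `lonlat_to_tile` is hypothetical.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Convert WGS84 lon/lat in degrees to slippy-map (x, y, z) indices."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # clamp so that points on the eastern or southern edge stay in range
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y, zoom


# a point in Colombia (approximate coordinates, for illustration only)
x, y, z = lonlat_to_tile(-74.07, 4.71, 17)
print(x, y, z)
```

The returned order `(x, y, z)` matches the order in which the download function below reads the `tiles` property of each feature.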


## Download all the tiles for school and not-school

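Use the following script to download the tiles. Remember to replace `TOKEN` in the URLs below with your own access token. The `%%file` magic has to be the first line of its cell, and everything after it is written to the JSON file, so the cell must not contain comments.

In [None]:
%%file unicef_school_tiles.json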

{"school": "schools_tiles_1m.geojson", 
"not_school": "not_schools_tiles_1m.geojson",
"school_url":"https://a.tiles.mapbox.com/v4/digitalglobe.2lnpeioh/{z}/{x}/{y}.tif?access_token=TOKEN",
"not_school_url": "https://a.tiles.mapbox.com/v4/digitalglobe.2lnpeioh/{z}/{x}/{y}.png?access_token=TOKEN"}

In [None]:
import json
from urllib.parse import urlparse, parse_qs
import requests
import os
from os import makedirs, path as op
import rasterio

In [None]:
def get_tile(geojson, base_url):
    """
    Download the tiles for the school or not-school class.
    The tile index was created with DevSeed Geokit (point2tile), using a 1m buffer
    around the geolocation points of the school and not-school classes.

    :param geojson: path to the tile-index geojson produced by geokit (point2tile);
    :param base_url: DG Vivid tile URL template, including the access token.

    :return tiles: a list of paths to the downloaded tiles
    """
    # open geojson and get tile index
    with open(geojson, 'r') as data:
        tile_geojson = json.load(data)
    features = tile_geojson["features"]
    # get the tile index of every feature as [x, y, z]
    xyz = [feature['properties']['tiles'] for feature in features]
    
    # create the tile folder, named after the geojson file (without its
    # directory and extension) so that each class gets its own folder
    tiles_folder = op.splitext(op.basename(geojson))[0]
    makedirs(tiles_folder, exist_ok=True)
        
    # download every tile and collect the local file paths
    tiles = list()
    for tile_index in xyz:
        x, y, z = (str(v) for v in tile_index[:3])
        url = base_url.replace('{x}', x).replace('{y}', y).replace('{z}', z)
        # take the image extension (e.g. .png) from the URL path
        _, image_format = op.splitext(urlparse(url).path)
        tile_bn = "{}-{}-{}{}".format(z, x, y, image_format)
        tile = op.join(tiles_folder, tile_bn)
        r = requests.get(url)
        # fail loudly instead of saving an error response as an image
        r.raise_for_status()
        with open(tile, 'wb') as w:
            w.write(r.content)
        tiles.append(tile)
    return tiles

In [None]:
with open("unicef_school_tiles.json", 'r') as config:
    all_data = json.load(config)
    
school_geojson = all_data["school"]
school_turl = all_data["school_url"]
not_school_geojson= all_data["not_school"]
not_school_turl = all_data["not_school_url"]

In [None]:
# download all the school tiles
school_tiles = get_tile(school_geojson, school_turl)

In [None]:
# download all the not-school tiles
not_school_tiles = get_tile(not_school_geojson, not_school_turl)
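
After the downloads finish, it may be worth checking that every saved file is really an image, because a request made with an invalid token can return an error body instead of a tile. The sketch below is one possible check, assuming the tiles are PNG or TIFF files as the URL extensions in the config suggest. The helper names `looks_like_image` and `find_bad_tiles` are hypothetical.

```python
import os

# file signatures: PNG, little-endian TIFF, big-endian TIFF
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")


def looks_like_image(path):
    """Return True if the file starts with a PNG or TIFF signature."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    return header.startswith(PNG_MAGIC) or header.startswith(TIFF_MAGICS)


def find_bad_tiles(tile_paths):
    """Return the paths that are missing, empty, or not PNG/TIFF files."""
    bad = []
    for tile_path in tile_paths:
        if not os.path.isfile(tile_path) or os.path.getsize(tile_path) == 0:
            bad.append(tile_path)
        elif not looks_like_image(tile_path):
            bad.append(tile_path)
    return bad
```

Calling `find_bad_tiles(school_tiles)` and `find_bad_tiles(not_school_tiles)` returns the files that should be downloaded again or dropped before training.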