# The Wealth of Cities
## Predicting the Wealth of a City from Satellite Imagery

Accurate measurements of the economic characteristics of populations critically influence research and policy. Such measurements shape how governments allocate resources and provide infrastructure to improve human livelihoods in a wide range of situations. Although economic data is readily available for some developing nations, many regions of the world lack key measures of economic development and efficiency and so remain cut off from the benefits of economic analysis. Parts of Africa, for example, conduct few or no economic surveys and have no other means of collecting data on their financial situations. We attempt to address this problem by using publicly available satellite imagery to predict the wealth of a city (or, more generally, a geographic region): we identify fundamental features in these images and run them through a convolutional neural network. This method would apply not only to regions that lack economic data, but also to cities with a wealth of economic information at the macro level and a dearth at the micro level. For example, cities in America, despite having plenty of economic data at the state and county levels, could benefit from more granular information to improve policy decisions about infrastructure and public support.

For this approach to work, we need to extract relevant features from the images to train our machine learning model. Our model will not be able to predict the wealth of individual houses (i.e., families), but it will work on clusters of houses (i.e., neighborhoods), both because wealth is complex to measure and because neighborhoods tend to sit at a nearly homogeneous economic level. As a result, we will need to extract "cluster" features to process with our neural network.

Thinking about the kinds of features that would reveal the wealth of a region, we can start to identify what we need to extract from the images. One of the first and most common ideas is to get satellite imagery of the region at night and observe the night-light intensity: more lights at night tend to correspond to wealthier areas, while fewer lights tend to correspond to poorer areas. Our group has also come up with the following features as indicators of wealth:
- Number of cars
- Percentage of green-space
- Number of high-rises
- Time of day at which traffic occurs
- Housing density
- Aerospace/nautical infrastructure

The number of cars tends to be a good indicator of whether a city has passed a certain threshold of wealth. Some poorer cities will have more cars than wealthier ones, but cities that have no cars tend to be the poorest, so we can establish a baseline level for the wealth of a city if we can extract the number of cars from the image.

Percentage of green-space is perhaps even less reliable than the number of cars, but it can also help establish relative rankings of wealth between cities. Cities with lots of public funds, and consequently wealth, tend to spend money on maintaining public green-spaces. Granted, some rural regions also have a lot of green-space in the form of farms or undeveloped land, and in that case green-space does not correspond to higher wealth. However, if we can ensure that the imagery we are looking at represents an urban area, we could take green-space into account to predict the level of wealth.

The number of high-rises is a critical feature of a city's wealth. However, extracting this information from satellite imagery is tricky because the images are flat. One way around this is to analyze the shadows produced by buildings at different times of the day. Tall buildings cast long shadows for much of the day, not only briefly in the early morning and late evening.
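
The shadow idea reduces to simple trigonometry: on flat ground, a building of height h under a sun at elevation angle θ casts a shadow of length h / tan(θ). Below is a minimal sketch, assuming the sun elevation is known and the shadow length has already been measured in meters. The function name is hypothetical and is not part of the notebook's pipeline.

```python
import math

def building_height_from_shadow(shadow_length_m, sun_elevation_deg):
    """Estimate building height from its shadow length.

    Assumes a vertical building, level terrain, and a known sun elevation.
    """
    if not 0 < sun_elevation_deg < 90:
        raise ValueError("sun elevation must be between 0 and 90 degrees")
    return shadow_length_m * math.tan(math.radians(sun_elevation_deg))

# With the sun 45 degrees above the horizon, height and shadow length are equal,
# so a 30 m shadow implies a building roughly 30 m tall.
height = building_height_from_shadow(30.0, 45.0)
print("estimated height: %.1f m" % height)
```

In practice the hard part is measuring the shadow reliably in the image; the conversion itself is the easy step.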

Housing density is highly correlated with the "urban-ness" of a region, which in turn is suggestive of the wealth of a city. Rural areas (generally poorer) have a lower housing density, while urban areas (generally wealthier) have a higher housing density. There are exceptions to this trend, but it generally holds, and housing density is one of the easier features to extract from satellite imagery.

We will be getting our images from Planet.com, a publicly available database of satellite imagery from the last few years that covers most of the world. Unfortunately, API access is limited to California, so we will only be able to run our model using data from California, but there is no reason this method would not work given more input data from around the world.

In this notebook, we'll take you through the entire process, from setting up the program to download images and extract features to running the data through the machine learning pipeline and getting a predicted wealth score for input data.

First, we'll import the necessary modules. `json` and `io` are used only to load our Planet.com API key. You can sign up for a free account at https://www.planet.com/. The approval process takes a few days, but once you have your API key, this entire notebook can be completed in one sitting. We will use the `requests` module to make API requests for the satellite imagery, which requires authorization using the `requests.auth` module.
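
The loading code in this notebook expects a JSON file with a single `API_KEY` entry. Here is a small sketch of that layout, written to a throwaway temporary directory so a real `config_secret.json` is left untouched. The key value is a placeholder.

```python
import io
import json
import os
import tempfile

# Placeholder credentials; a real config_secret.json holds your own key.
example_config = {"API_KEY": "YOUR_API_KEY_HERE"}

# Write the example file into a temporary directory.
tmp_dir = tempfile.mkdtemp()
config_path = os.path.join(tmp_dir, "config_secret.json")
with open(config_path, "w") as f:
    json.dump(example_config, f)

# Read it back the same way the notebook does.
with io.open(config_path) as cred:
    loaded_key = json.load(cred)["API_KEY"]
```

Keeping the key in a separate file like this makes it easy to exclude from version control.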

In [55]:
import numpy as np
import json, io, math
import requests
from requests.auth import HTTPBasicAuth
from PIL import Image
from PIL import ImageColor
from scipy.misc import toimage
import scipy.ndimage
# with Anaconda run 'conda install -c https://conda.binstar.org/menpo opencv' to get cv2
import cv2

In [43]:
# LOADS in your API_KEY from the config_secret.json file
with io.open("config_secret.json") as cred:
    API_KEY = json.load(cred)["API_KEY"]

BASE_URL = "https://maps.googleapis.com/maps/api/staticmap"
MAX_RGB = 255
DEFAULT_SIZE = 600
DEFAULT_VISIBILITY = "off"

# (lat/pixel, lon/pixel)
# Multiply by pixels to see lat/lon span of image
# Only want zoom levels 13-20
# City zoom = 12
ZOOMS = {
    12: (0.06/230, 0.06/175),
    13: (0.02/153, 0.02/116),
    14: (0.008/122, 0.008/122),
    15: (0.008/245, 0.008/187),
    16: (0.003/183, 0.003/140),
    17: (0.002/245, 0.002/186),
    18: (0.001/245, 0.001/186),
    19: (0.0005/245, 0.0005/186),
    20: (0.0002/196, 0.0002/149)
}
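
As the comment above the table says, multiplying a scale entry by the image size gives the span of one image in degrees. Here is a quick sketch for zoom level 13, with the scale values copied from the table and the 600-pixel size taken from `DEFAULT_SIZE`. The helper name `image_span` is hypothetical.

```python
# (degrees of latitude per pixel, degrees of longitude per pixel) for zoom 13,
# copied from the ZOOMS table.
ZOOM_13 = (0.02 / 153, 0.02 / 116)
IMAGE_SIZE = 600  # pixels per side, matching DEFAULT_SIZE

def image_span(per_pixel, size=IMAGE_SIZE):
    """Return the (latitude, longitude) extent in degrees covered by one image."""
    lat_per_px, lon_per_px = per_pixel
    return lat_per_px * size, lon_per_px * size

lat_span, lon_span = image_span(ZOOM_13)
print("lat span: %.4f deg, lon span: %.4f deg" % (lat_span, lon_span))
# lat span: 0.0784 deg, lon span: 0.1034 deg
```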

In [45]:
def find_roads(roadmap):
    # Returns a list of (row, col) pixel coordinates that belong to roads.
    # `roadmap` is a NumPy array of a Google Static Maps image with maptype roadmap,
    # styled so that roads are green; a pixel counts as road when green is its largest channel.
    roads = []
    for i in range(roadmap.shape[0]):
        for j in range(roadmap.shape[1]):
            if roadmap[i][j][1] > roadmap[i][j][0] and roadmap[i][j][1] > roadmap[i][j][2]:
                roads.append((i, j))
    return roads

def load_image(content):
    return Image.open(io.BytesIO(content)).convert("RGBA").convert("RGB")

def rgb_to_hex(rgb_tuple):
    return "0x%02x%02x%02x" % rgb_tuple

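def create_payload(mode, center, zoom, params=None, ret_colors=True):
    # center is a (lat, lon) tuple; params optionally overrides the image size and colors
    lat, lon = center
    params = params or {}
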
    size = params.get("size", (DEFAULT_SIZE, DEFAULT_SIZE))
    padding = params.get("padding", 0)
    road_color = params.get("road_color", (0, MAX_RGB, 0))
    road_color_hex = rgb_to_hex(road_color)
    man_made_color = params.get("man_made_color", (0, 0, 0))
    man_made_color_hex = rgb_to_hex(man_made_color)
    poi_color = params.get("poi_color", (MAX_RGB, 0, 0))
    poi_color_hex = rgb_to_hex(poi_color)
    water_color = params.get("water_color", (0, 0, MAX_RGB))
    water_color_hex = rgb_to_hex(water_color)
    natural_color = params.get("natural_color", (MAX_RGB, 0, MAX_RGB))
    natural_color_hex = rgb_to_hex(natural_color)
    label_visibility = params.get("label_visibility", DEFAULT_VISIBILITY)

    base_payload = [("size", "{}x{}".format(size[0],size[1])), ("key", API_KEY)]
    style_payload = [("style", "feature:road|element:geometry|color:{}".format(road_color_hex)),
                     ("style", "feature:landscape.man_made|element:geometry.fill|color:{}".format(man_made_color_hex)),
                     ("style", "element:labels|visibility:{}".format(label_visibility)),
                     ("style", "feature:poi|element:geometry|color:{}".format(poi_color_hex)),
                     ("style", "feature:water|element:geometry|color:{}".format(water_color_hex)),
                     ("style", "feature:landscape.natural|element:geometry.fill|color:{}".format(natural_color_hex))]
    satellite_payload = base_payload + [("maptype", "satellite")]
    road_payload = base_payload + style_payload + [("maptype", "roadmap")]
    
    if mode == "satellite": payload = satellite_payload 
    elif mode == "road": payload = road_payload
    else: raise ValueError("Unrecognized mode '{}'. Mode can either be 'satellite' or 'road'.".format(mode))
        
    payload += [("zoom", zoom)] + [("center", "{},{}".format(lat, lon))]
    colors = {
        "road": np.array(road_color),
        "man_made": np.array(man_made_color), 
        "poi": np.array(poi_color),
        "water": np.array(water_color),
        "natural": np.array(natural_color)
    }
    
    return (payload, colors) if ret_colors else payload
    

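# Takes the bottom-left and top-right corners as (lat, lon) tuples and returns
# the latitudes and longitudes of the image centers needed to cover the box.
def bounding_box(bottom_left, top_right, zoom):
    lat1, lon1 = bottom_left
    lat2, lon2 = top_right
    w = lon2 - lon1
    h = lat2 - lat1
    padding = (0, 0)

    # Degrees spanned by one image at this zoom level
    w_per_image = (DEFAULT_SIZE - padding[0]) * ZOOMS[zoom][1]
    h_per_image = (DEFAULT_SIZE - padding[1]) * ZOOMS[zoom][0]

    # np.linspace needs an integer count, and math.ceil may return a float
    num_width = int(math.ceil(w / w_per_image))
    num_height = int(math.ceil(h / h_per_image))

    lons = np.linspace(lon1 + w_per_image/2, lon2 - w_per_image/2, num=num_width)
    lats = np.linspace(lat1 + h_per_image/2, lat2 - h_per_image/2, num=num_height)

    return lats, lons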

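def get_image(lat, lon, payload, zoom):
    # lat, lon, and zoom are already encoded in payload; they are kept so call sites stay readable
    r = requests.get(BASE_URL, params=payload)
    r.raise_for_status()  # fail on a bad key or quota error instead of decoding an error page
    image = load_image(r.content)
    return image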
    

def get_images(bottom_left, top_right, zoom, mode, ret_colors=True):
    lats, lons = bounding_box(bottom_left, top_right, zoom)
    images = []
    colors = None  # stays None if the bounding box yields no images

    # Request one image per grid cell, row by row
    for lat in lats:
        for lon in lons:
            payload, colors = create_payload(mode, (lat, lon), zoom)
            images.append(get_image(lat, lon, payload, zoom))

    return (images, colors) if ret_colors else images
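
To see how the tiling plays out, the grid arithmetic in `bounding_box` can be reproduced without NumPy. Below is a sketch, assuming the zoom-13 scale from `ZOOMS` and the Pittsburgh corners used later in this notebook. `tile_centers` is a hypothetical helper that spaces centers evenly in the same way `np.linspace` does.

```python
import math

SIZE = 600                                        # pixels per side (DEFAULT_SIZE)
LAT_PER_PX, LON_PER_PX = 0.02 / 153, 0.02 / 116   # ZOOMS[13]

def tile_centers(lo, hi, span):
    """Evenly spaced tile centers covering [lo, hi] with tiles `span` degrees wide."""
    count = int(math.ceil((hi - lo) / span))
    first, last = lo + span / 2, hi - span / 2
    if count == 1:
        return [first]
    step = (last - first) / (count - 1)
    return [first + i * step for i in range(count)]

bottom_left = (40.417268, -80.036749)   # (lat, lon)
top_right = (40.503523, -79.823013)

lats = tile_centers(bottom_left[0], top_right[0], SIZE * LAT_PER_PX)
lons = tile_centers(bottom_left[1], top_right[1], SIZE * LON_PER_PX)
print("%d rows x %d columns = %d images" % (len(lats), len(lons), len(lats) * len(lons)))
# 2 rows x 3 columns = 6 images
```

Because the centers are squeezed between the edges of the box, adjacent tiles overlap whenever the box is not an exact multiple of the tile span.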

In [44]:
def count_pixels(im_arr, color, tolerance=10):
    lower_bound = color - tolerance
    lower_bound[lower_bound < 0] = 0
    upper_bound = color + tolerance
    upper_bound[upper_bound > MAX_RGB] = MAX_RGB
    return np.sum(np.all((im_arr >= lower_bound) & (im_arr <= upper_bound), axis=2))

def extract_features(im_arr, colors, tolerance=10):
    # Count the pixels of each feature kind (road, water, ...) in one image
    pixels = {
        "total": im_arr.shape[0] * im_arr.shape[1]
    }
    
    for kind, color in colors.items():
        pixels[kind] = count_pixels(im_arr, color, tolerance=tolerance)

    print("PIXEL KINDS: {}".format(pixels))
    return pixels
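
The tolerance test in `count_pixels` can be checked by hand on a tiny image. Below is a pure-Python sketch of the same per-channel band check, using nested lists instead of a NumPy array. `count_pixels_py` is a hypothetical stand-in for illustration, not the notebook's function.

```python
MAX_RGB = 255

def count_pixels_py(image, color, tolerance=10):
    """Count pixels whose every channel lies within `tolerance` of `color`."""
    lower = [max(c - tolerance, 0) for c in color]
    upper = [min(c + tolerance, MAX_RGB) for c in color]
    return sum(
        all(lo <= channel <= hi for channel, lo, hi in zip(pixel, lower, upper))
        for row in image
        for pixel in row
    )

# 2x2 image: two near-green road pixels, one blue water pixel, one black pixel.
tiny = [
    [(0, 255, 0), (5, 250, 8)],
    [(0, 0, 255), (0, 0, 0)],
]
road_pixels = count_pixels_py(tiny, (0, 255, 0))    # both green-ish pixels match
water_pixels = count_pixels_py(tiny, (0, 0, 255))   # only the blue pixel matches
```

The tolerance absorbs small color shifts from image compression, which is why the slightly off-green pixel still counts as road.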

In [42]:
### RUN THIS CELL ###
# zoom = 13 -> 6 images (1.5 seconds)
pgh_images, colors = get_images((40.417268, -80.036749), (40.503523, -79.823013), 13, "road")

In [31]:
### THEN RUN THIS CELL ###
for image in pgh_images:
    image_array = np.asarray(image)
    extract_features(image_array, colors)


In [6]:
### OLD CODE ###
# SHIPPING_LANE_COLOR = (120, 160, 248)

# payload += [("markers", "color:red|color:red|label:B|{},{}".format(lat + 0.0002, lon + 0.0002))]
# payload += [("markers", "color:red|color:red|label:A|{},{}".format(lat, lon))]
# print str(lat) + "," + str(lon) + "," + ZOOM_OUT[0][1] + "z"

# lat = lat1 + (lat2 - lat1) / 2
# lon = lon1 + (lon2 - lon1) / 2
# payload =  SATELLITE_PAYLOAD + [("zoom", zoom)] + [("center", "{},{}".format(lat, lon))]
# r = requests.get(BASE_URL, params=payload)
# im = open_image(r.content)
# im.show()

# road_payload =  ROADMAP_PAYLOAD + [("center", "40.714728,-73.9988"), ("zoom", "20")]
# satellite_payload = SATELLITE_PAYLOAD + [("center", "40.714728,-73.9988"), ("zoom", "20")]
# r = requests.get(BASE_URL, params=road_payload)
# print r.url
# r2 = requests.get(BASE_URL, params=satellite_payload)
# im = open_image(r.content)
# roadmap_ar = np.asarray(im)
# roads = find_roads(roadmap_ar)
# roadsonly = np.zeros(roadmap_ar.shape)
# for (i, j) in roads:
#     roadsonly[i][j] = np.array([255,255,255])
# toimage(roadsonly).show()
# sat = open_image(r2.content)
# sat.show()

# venice_payload = ROADMAP_PAYLOAD + [("center", "45.4355638,12.31794"), ("zoom", 14)]
# r = requests.get(BASE_URL, params=venice_payload)
# im = open_image(r.content)
# im.show()

In [46]:
road_payload, colors = create_payload("road", (40.714728,-73.9988), 20)
sat_payload, colors = create_payload("satellite", (40.714728,-73.9988), 20)
r = requests.get(BASE_URL, params=road_payload)
im = load_image(r.content)
r2 = requests.get(BASE_URL, params=sat_payload)
im2 = load_image(r2.content)
sat_ar = np.asarray(im2)
roadmap_ar = np.asarray(im)
roads = find_roads(roadmap_ar)
roadsonly = np.zeros(roadmap_ar.shape)
for (i, j) in roads:
    roadsonly[i][j] = sat_ar[i][j]
cv2.imwrite("thing.png", sat_ar)

True

In [8]:
def road_variance(center, zoom):
    # Fetch the styled roadmap and the satellite image for the same (lat, lon) center
    lat, lon = center
    road_payload, colors = create_payload('road', (lat, lon), zoom)
    sat_payload, colors = create_payload('satellite', (lat, lon), zoom)
    r = requests.get(BASE_URL, params=road_payload)
    im = load_image(r.content)
    r2 = requests.get(BASE_URL, params=sat_payload)
    im2 = load_image(r2.content)
    sat_ar = np.asarray(im2)
    roadmap_ar = np.asarray(im)
    roads = find_roads(roadmap_ar)
    # Look up the satellite color under every road pixel
    road_pixels = [sat_ar[x][y] for (x, y) in roads]
    # Sum, over road pixels, of the standard deviation across each pixel's RGB channels
    return sum(np.std(road_pixels, axis=1))

In [64]:
def highlight_cars(template_files, image, angular_granularity = 8):
    # Accept either a file path or an already-loaded BGR image array
    if(isinstance(image, (str, unicode))):
        img_rgb = cv2.imread(image)
    else:
        img_rgb = image
    img_gray = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)
    loc = (np.array([]), np.array([]))
    threshold = 0.4
    for template_file in template_files:
        base_template = cv2.imread(template_file,0)
        w, h = base_template.shape[::-1]
        for i in range(angular_granularity):
            # Rotate the original template by each angle so interpolation error does not accumulate
            angle = i * 360.0 / angular_granularity
            template = scipy.ndimage.rotate(base_template, angle, mode = 'constant')
            res = cv2.matchTemplate(img_gray,template,cv2.TM_CCOEFF_NORMED)
            found = np.where(res > threshold)
            loc = (np.append(loc[0], found[0]).astype(int), np.append(loc[1], found[1]).astype(int))
    # np.where returns (rows, cols); reverse to get (x, y) points for cv2.rectangle
    for pt in zip(*loc[::-1]):
        cv2.rectangle(img_rgb, pt, (pt[0] + w, pt[1] + h), (0,0,255), 2)
    return img_rgb
 
type(cv2.imread("car_template.png").dtype)
#cv2.imwrite('res.png', highlight_cars(['car_template2.png', 'car_template.png'], 'thing2.png', angular_granularity=16))

numpy.dtype

In [67]:
lat, lon = (40.714728,-73.9988)
zoom = 20
road_payload, colors = create_payload('road', (lat, lon), zoom)
sat_payload, colors = create_payload('satellite', (lat, lon), zoom)
r = requests.get(BASE_URL, params=road_payload)
im = load_image(r.content)
r2 = requests.get(BASE_URL, params=sat_payload)
im2 = load_image(r2.content)
sat_ar = np.asarray(im2)
roadmap_ar = np.asarray(im)
roads = find_roads(roadmap_ar)
roadsonly = np.zeros(roadmap_ar.shape, dtype=np.uint8)
for (i, j) in roads:
    roadsonly[i][j] = sat_ar[i][j]
#toimage(roadmap_ar).show()
#toimage(sat_ar).show()
#cv2.imread(roadsonly,0)
# cv2.Canny only accepts 8-bit input, so keep the array as uint8 and convert it to grayscale
edges = cv2.Canny(cv2.cvtColor(roadsonly, cv2.COLOR_RGB2GRAY), 100, 200)
toimage(edges).show()


In [50]:
x = np.array([3, 4, 5, 6, 7])
print x[(x > 3) & (x < 7)]


[4 5 6]


To get an idea of what these satellite images look like, we will show you how to download a single image and then proceed to what Planet calls an Area of Interest, or AOI. First, we define a geometry, which is a collection of latitude and longitude points that forms a polygon around the area you would like to get pictures from. Remember that the Planet API only works with California right now, so if you change the coordinates, make sure they remain within the state. Our example geometry is centered on a reservoir in Redding, CA. Next, we'll need to define filters for the Planet API; these include the geometry filter discussed above, as well as date range filters (only getting images within a specified date range), cloud cover filters (perhaps you only want to look at images taken on a clear day), and many more. We then send this request to the Stats API endpoint to see how many images fit our criteria. In our example, there are 30 images of Redding, CA within the date range that have less than 50% cloud cover.
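
The filter structure described above can be sketched as plain Python dictionaries. This follows the filter layout of Planet's Data API v1 as we understand it (`GeometryFilter`, `DateRangeFilter`, and `RangeFilter`, combined with `AndFilter`). The coordinates, dates, and item type below are illustrative placeholders rather than the values used in this notebook, and the request body is only built here, not sent.

```python
import json

# Hypothetical AOI: a small rectangle near Redding, CA.
# GeoJSON polygons list longitude first and repeat the first point to close the ring.
aoi = {
    "type": "Polygon",
    "coordinates": [[
        [-122.45, 40.55],
        [-122.35, 40.55],
        [-122.35, 40.62],
        [-122.45, 40.62],
        [-122.45, 40.55],
    ]],
}

geometry_filter = {"type": "GeometryFilter", "field_name": "geometry", "config": aoi}

# Placeholder date range; only images acquired inside it are counted.
date_filter = {
    "type": "DateRangeFilter",
    "field_name": "acquired",
    "config": {"gte": "2016-07-01T00:00:00.000Z", "lte": "2016-08-01T00:00:00.000Z"},
}

# Keep images with less than 50% cloud cover (cloud_cover is a 0-1 fraction).
cloud_filter = {"type": "RangeFilter", "field_name": "cloud_cover", "config": {"lt": 0.5}}

# All three conditions must hold at once.
combined_filter = {"type": "AndFilter", "config": [geometry_filter, date_filter, cloud_filter]}

# Body for a POST to the Stats endpoint; the item type is an assumption.
stats_request = {
    "interval": "day",
    "item_types": ["PSScene"],
    "filter": combined_filter,
}
print(json.dumps(stats_request, indent=2))
```

Sending this body with `requests.post`, authenticated with your API key, should return image counts per interval, which is how a figure like the 30 images mentioned above would be obtained.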