# Homework Exercise 8: Mapping Areas

In this exercise we explore ways to map area data. We will create choropleth maps of the January 2016 San Francisco crime data that we used in HW7. However, instead of mapping crime incidents as points, we will aggregate the point data by three different area types: San Francisco Police Districts, neighborhoods, and census tracts. Two key questions that we will explore are: (1) how does the area type used to aggregate the data change the mapped visualizations of the data? and (2) how do the mapped visualizations change when we map counts, densities, and rates within the different areas?

The geographic data we explore are all from the [San Francisco OpenData Portal](https://data.sfgov.org). The population data for San Francisco census tracts was downloaded from the [US Census American Factfinder](http://factfinder.census.gov/) website.

In [None]:
# Import libraries -  run but don't change

from datascience import *   # The basics
import numpy as np

import timeit # to time our functions

import json  # for loading geodata and creating shapely geometries and testing spatial relationships
from shapely import wkt
from shapely.geometry import Polygon, Point, LineString, shape
from shapely import speedups
speedups.enable()

from shapely.ops import transform  # for projection transformations
from functools import partial
import pyproj


In [None]:
# The files we will use - run, don't change

crime_file = 'sfcrimes_jan2016.csv'                     # crime incidents - with neighborhoods

pdist_geofile = 'CurrentPoliceDistricts2s.geojson'      # SF police districts 

sfhoods_geofile = 'SFNeighborhoods_s2.json'             # SF neighborhoods  

sftracts_geofile = 'SF_nhoods_census_s2.json'           # SF census tracts 
sftracts_popdata_file = 'SF_ACS15_POPEST2.csv'

## San Francisco Crime Incident Data, January 2016
First, let's load the SF crime incident data into a table and display it. This should look familiar as it is the same data used in HW7. We can see that the longitude and latitude for each crime location are in the columns labeled X and Y. We can also see from the Address column that these locations are approximate, placed at either the nearest intersection or the middle of the block. This is a common practice to protect the privacy of the residents in these locations.

We can also see that the sfcrimes data includes a PdDistrict column indicating the SF Police Department district either responsible for or within which the crime incident occurred.

In [None]:
# load crime data into a table
sfcrimes = Table.read_table(crime_file)
sfcrimes

### SF Police Department (SFPD) Districts
Let's load the GeoJSON file with the polygons representing the different SFPD districts. You will notice we use a slightly different syntax to load a GeoJSON file. A GeoJSON file is a special type of JSON file that contains geographic data. We explored the GeoJSON data format in a previous exercise.

Once we load the file we can take a look at the properties, or attributes, present in the data that describe each area. We don't want to view all of the data because the coordinate values for the polygons are huge.
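To make the structure concrete, here is a minimal sketch of a GeoJSON FeatureCollection and a way to summarize its features without printing the coordinates. The two square "districts" and the names `TOY_A` and `TOY_B` are made up for illustration; the real district file has many more vertices per polygon.

```python
import json

# A tiny, made-up FeatureCollection (hypothetical data, not the SFPD file)
toy_geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"district": "TOY_A"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]
      }
    },
    {
      "type": "Feature",
      "properties": {"district": "TOY_B"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]
      }
    }
  ]
}
"""

toy_geojson = json.loads(toy_geojson_text)

# Summarize each feature: property names, geometry type, and vertex count
feature_summaries = []
for feature in toy_geojson['features']:
    ring = feature['geometry']['coordinates'][0]   # exterior ring of the polygon
    feature_summaries.append((sorted(feature['properties'].keys()),
                              feature['geometry']['type'],
                              len(ring)))

for summary in feature_summaries:
    print(summary)
```

Each feature pairs a `properties` dictionary (the attributes) with a `geometry` (the coordinates), which is why we can inspect the attributes on their own.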

In [None]:
# Load police districts geojson file 
with open(pdist_geofile) as json_file:
    sfpd_json = json.load(json_file)
    
# Let's check out the properties, or attributes that describe the geographic data, in the geojson file
print(sfpd_json['features'][1]['properties'])

If you look at the output from printing the properties for one feature, you will see that the property **district** contains the name of the SFPD district. We will use the values in this property to link crime incidents to district polygons. Note that the property labels are *case sensitive*.

Now let's use the **Maps** module of the datascience library to map the districts. The Maps module uses the **folium** mapping library but is simpler and has fewer options. We will use it so we can focus on the data.

In [None]:
# Create Map from SF Police Districts
sfpd_map = Map.read_geojson(pdist_geofile)
sfpd_map


## Choropleth Maps

In HW7 we mapped the crime incident locations as points. One thing we noticed is that there are a lot of points, and this data is only for Jan 2016. With a large number of point objects, such as crime incidents, it is common to aggregate those data to an area of interest and then visualize the aggregations as a **choropleth** map.  Choropleth maps are data maps - data values summarized by geographic regions. Note, these regions are also called zones, areas and polygons - but calling them polygons confuses the representation method with the representation model! The goal for the map symbology is to communicate the data values within the different regions.  This objective is similar to that of a histogram. But unlike a histogram where each bin is of the same width, choropleth maps usually depict different sized regions. This variation in area makes communicating the data values more complex.


### Summing Counts By Regions
In order to map the crime incident data by SFPD district we need to sum the counts of crimes per district. With geographic data this is done in one of two ways:
- **Sum values by attribute**: count the number of objects by a shared attribute value. This is a *group by* operation if the data are in the same table. Otherwise, it could be accomplished with a *join* and *group by* operation.
- **Sum values by location**: count the number of objects at or within the same location. In this case we call upon the **spatial relationship queries** that we explored in HW4. 
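As a rough illustration of the first approach, summing by attribute amounts to counting rows that share a label. The sketch below uses the standard library's `Counter` on made-up incident records; the district names are hypothetical, and this is not how the datascience `group` method is implemented.

```python
from collections import Counter

# Hypothetical incident records: (incident id, district label)
toy_incidents = [
    (1, 'TOY_A'), (2, 'TOY_B'), (3, 'TOY_A'),
    (4, 'TOY_C'), (5, 'TOY_A'), (6, 'TOY_B'),
]

# Group by the shared attribute value and count the rows in each group
toy_counts = Counter(district for _, district in toy_incidents)

# Sort from most to fewest incidents
toy_ranked = toy_counts.most_common()
print(toy_ranked)
```

Summing by location differs only in how the label is obtained: it comes from a spatial test against the region geometry rather than from a column that already exists in the table.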

Let's first count the number of crime incidents by grouping the data in the **sfcrimes** table based on the values in the **PdDistrict** column.

In [None]:
# Count the number of crime incidents within each police district based on the values in the column PdDistrict
crime_sfpd_dist = sfcrimes.select(('IncidntNum','PdDistrict')).group('PdDistrict',len).sort('IncidntNum len', descending=True)
crime_sfpd_dist = crime_sfpd_dist.relabel('IncidntNum len', 'crime_count')
crime_sfpd_dist.show()

### Point in Polygon Queries

Now, let's count the incidents by spatial location. When the location is a region, this is called a **point in polygon (PIP)** query. Below, the **geojsonPointInPoly** function is a simple implementation of a PIP query. The features of the input GeoJSON data are checked to see which polygon **contains** the input point, specified by its coordinates. The specified property (propertyName) of the matched feature is returned.

In [None]:
# Point in Polygon function
def geojsonPointInPoly(x,y, jsonData,propertyName):
   
    property = 'unknown' # we want to find the property of the polygon we intersect
                         # before we check the property is unknown
        
    # construct shapely point based on x,y
    point = Point(x,y)
    
    # check each polygon to see if it contains the point
    for feature in jsonData['features']:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            property = feature['properties'][propertyName]
            return property  # return as soon as we find a match

    return property  # and return if we don't find a match


# TEST and TIME it
%timeit geojsonPointInPoly(-122.416, 37.7612, sfpd_json, 'district')

Note that at the bottom of the code block above we test the function with one point and we also time it using the Jupyter notebook magic function **%timeit**. We time the function because PIP queries can be super slow if the polygon geometries have a lot of data points and the data are not spatially indexed.  Those are the kinds of things you need to worry about if you are programming these operations. If you are using desktop GIS software this is less of a concern. The queries can still take a lot of time but the software will build the spatial indices so you don't have to.
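To see why vertex counts and indexing matter, here is a minimal pure-Python sketch of one common PIP technique, ray casting, combined with a bounding-box check as a very crude stand-in for a spatial index. This is an illustration only: it is not how shapely implements `contains`, it ignores holes and points exactly on a boundary, and the region names and coordinates are made up.

```python
def bounding_box(ring):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_ring(x, y, ring):
    """Ray casting: toggle 'inside' each time a ray to the right crosses an edge."""
    inside = False
    n = len(ring)
    for i in range(n):                      # one pass over every edge
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):            # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_region(x, y, regions):
    """regions: list of (name, ring, bbox). Return the first containing region's name."""
    for name, ring, (min_x, min_y, max_x, max_y) in regions:
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue                        # cheap rejection, skips the vertex loop
        if point_in_ring(x, y, ring):
            return name
    return 'unknown'


# Hypothetical regions: a square and a triangle
toy_rings = {
    'TOY_A': [(0, 0), (2, 0), (2, 2), (0, 2)],
    'TOY_B': [(2, 0), (4, 0), (3, 3)],
}
toy_regions = [(name, ring, bounding_box(ring)) for name, ring in toy_rings.items()]

print(find_region(1, 1, toy_regions))
print(find_region(3, 1, toy_regions))
print(find_region(9, 9, toy_regions))
```

The inner loop visits every edge of a polygon, so the cost of one query grows with the number of vertices, and without some way to skip distant polygons every region must be tested for every point.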

To speed up these PIP queries we first simplified the polygon data using a handy online tool called [Mapshaper](http://www.mapshaper.org). Simplification throws away some of the geographic detail in order to speed things up.


### Apply the PIP Query
Let's apply the **geojsonPointInPoly** function to spatially determine the district in which each crime incident is located. Following our usual pattern, we will **apply** the function to the sfcrimes table and save the result in a new column. We will use this pattern over and over in this exercise.

In [None]:
# Use the Point In Polygon spatial relationship function to determine the police district in which each crime occurs
sfcrimes['district'] = sfcrimes.apply(lambda x, y: geojsonPointInPoly(x,y, sfpd_json,'district'),['X','Y'])
sfcrimes

If you look at the table above you will see that some of the **PdDistrict** values do not match the **district** values. That is interesting, if not disconcerting. Let's count the number of mismatches.

In [None]:
print("Number of mismatches: ", sfcrimes.where((sfcrimes['PdDistrict'] != sfcrimes['district'])).num_rows)
print("Number of crime incidents: ", sfcrimes.num_rows)


### QUESTION 1
Overall, that's a small percentage of mismatches. But think for a minute about what may have caused it. There are three possible reasons (maybe more). These include:
- A: The SFPD criteria for assignment of district
- B: The polygon simplification process
- C: The spatial relationship used in the PIP function (see above)

In the cell below, explain briefly how each of these may have contributed to the mismatch.

**Double-click here to replace this line with your answer.**

An important takeaway from the section above is that it is important to **CYA**, or cover your answers, when reporting the results of your geospatial analysis. If you document your assumptions, data, and methods and share your code, it will be easier for you and others to understand the results and any discrepancies with other reports.

### Mapping Counts by Region

In order to map the counts by SFPD district, first create a table of the count of crimes within each district. We will use the column **district** that we calculated with our PIP query because we know that the crime incident points are spatially contained within the district polygons that we are mapping.

In [None]:
# Count the number of crimes within each police district
crime_sfpd = sfcrimes.select(('IncidntNum','district')).group('district',len).sort('IncidntNum len', descending=True)
crime_sfpd = crime_sfpd.relabel('IncidntNum len', 'crime_count')
crime_sfpd.show()

#### Join the Crime Counts to the Map
We now join the crime counts to the map data to symbolize the color of each SFPD district based on the crime counts. 
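Conceptually, this join looks up each polygon's value by matching a feature property against a key column in the table. The sketch below shows that idea with plain dictionaries; it is an assumption about the general pattern, not the Maps module's actual implementation, and the features, names, and counts are made up.

```python
# Hypothetical features (geometry omitted) and a lookup table of counts
toy_features = [
    {'properties': {'district': 'TOY_A'}},
    {'properties': {'district': 'TOY_B'}},
    {'properties': {'district': 'TOY_C'}},
]
toy_crime_counts = {'TOY_A': 30, 'TOY_B': 10}   # TOY_C has no row in the table


def value_for_feature(feature, lookup, property_name):
    """Return the lookup value whose key matches the feature's property, else None."""
    key = feature['properties'][property_name]
    return lookup.get(key)


matched = [(f['properties']['district'],
            value_for_feature(f, toy_crime_counts, 'district'))
           for f in toy_features]
print(matched)
```

A feature with no matching key gets no value, and because the match is an exact string comparison, a key that differs only in case (for example `toy_a`) would not match either.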

In [None]:
#Join the crime counts to the geojson file by district so we can map it
crimes_per_dist = crime_sfpd.select(['district','crime_count'])

sfpd_map.color(crimes_per_dist, key_on='feature.properties.district', palette='YlOrRd')


### QUESTION 2
According to the map above, 
- A: What SFPD District has the most crime incidents? 
- B: The Tenderloin has a reputation for being a high crime area. According to the map above, is the Tenderloin in the high, medium, or low end of the crime count spectrum?

Unfortunately our map doesn't have a way to add labels, so you can use this [map](sfpd_districts_map.png) to associate the regions with the district names.

**Double-click here to replace this line with your answer.**

### QUESTION  3
Complete the code cells below to create a choropleth map of the crime counts within SF Neighborhoods. Use the method we used above for SFPD districts.

In [None]:
# Load the SF neighborhoods geojson file
with open(sfhoods_geofile) as json_file:
    sfhoods_json = json.load(json_file)
    
# Complete the line below to print out the names of the fields in the geojson file
print(...)

## Add a print statement that notes the field that you can use to identify the neighborhood name of each polygon

In [None]:
# Create Map from SF Neighborhoods
sfhoods_map = .... # add your code here to read in the geojson file to a map
sfhoods_map  

In [None]:
# Use the Point In Polygon spatial relationship function to determine the neighborhood in which each crime occurs
sfcrimes['nhood'] = sfcrimes.apply(...) # add your code

In [None]:
# Count the number of crimes within each neighborhood
crime_nhood = sfcrimes.select(...) # add your code to this line
crime_nhood = crime_nhood.relabel('IncidntNum len', 'crime_count')
crime_nhood

If all went well for QUESTION 3 above, the following cell should display a map of crime counts by SF neighborhood.

In [None]:
#Join the crime counts to the geojson file so we can map it
crimes_per_nhood = crime_nhood.select(['nhood','crime_count'])
sfhoods_map.color(crimes_per_nhood, key_on='feature.properties.nhood', palette='YlOrRd')

Now let's create a choropleth map of the crime counts within **SF Census Tracts**.  We will use the method we used above for SFPD districts and neighborhoods.

First we load the geojson data from the file and display the map. This is shown below.

In [None]:
# Load the SF census tracts geojson file
with open(sftracts_geofile) as json_file:
    sftracts_json = json.load(json_file)
    
# Let's check out the fields in the geojson file - the field of interest is tractce10
print(sftracts_json['features'][1]['properties'])

In [None]:
# Create Map from SF Census Tracts GeoJSON
sftracts_map = Map.read_geojson(sftracts_geofile)
sftracts_map


### QUESTION 4
Complete the cells below to create a choropleth map of the crime counts within SF Census Tracts.

In [None]:
# Use the Point In Polygon spatial relationship function to determine the tract in which each crime occurs 
# COULD TAKE A FEW MINUTES!
sfcrimes['tractce10'] = sfcrimes.apply(...) # your code here

In [None]:
# Count the number of crimes within each census tract
crime_tracts = sfcrimes.select(...) # your code here
crime_tracts = crime_tracts.relabel('IncidntNum len', 'crime_count')
crime_tracts

In [None]:
# Join the crime counts to the geojson file by census tract so we can map it
crimes_per_tract = crime_tracts.select(...) # your code here

sftracts_map.color(crimes_per_tract, key_on='feature.properties.tractce10', palette='YlOrRd')

### QUESTION 5

Based on the maps above,

- A: Do all three maps communicate the same thing about the distribution of crime in SF? If not, how do they differ?
- B: What map indicates that most of the crime is south of Market Street?
- C: What map would you use to show that there is relatively little crime in the Bayview Hunters Point area?
- D: What map would you use to show that Golden Gate Park is a high crime area?
- E: How do these maps indicate the MAUP problem?
- F: What relationship do you observe between the size of each area and the perception of high crime?

NOTE: Refer to this [neighborhood map](sf_nhoods.png) for neighborhood names.

**Double-click here to replace this line with your answer.**

## Mapping Density
When geographic events like crime incidents or objects like tree locations are aggregated by a region like a neighborhood they usually are **NOT** symbolized by the **counts** within each region. Instead they are more often symbolized by **density** or **rate**.  A **density map** divides the count by a unit of area, for example, crimes per square mile. A **rate map** divides the count by the population of interest, for example, crimes per 1,000 residents. This transformation of the data is called standardization or normalization.  See this nice, simple writeup from indiemapper.com on [data standardization](http://indiemapper.com/app/learnmore.php?l=standardize).
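The arithmetic behind the three measures can be sketched with a few made-up numbers. The region names, counts, areas, and populations below are hypothetical and chosen only to show that the "highest" region can change depending on the measure.

```python
# Hypothetical regions: name -> (crime count, area in km^2, resident population)
toy_areas = {
    'TOY_A': (300, 10.0, 60000),
    'TOY_B': (200, 2.0, 50000),
    'TOY_C': (250, 5.0, 25000),
}


def density_per_km2(count, area_km2):
    """Count divided by a unit of area."""
    return count / area_km2


def rate_per_1000(count, population):
    """Count divided by population, expressed per 1,000 residents."""
    return count / (population / 1000)


toy_summary = {
    name: {
        'count': count,
        'density': density_per_km2(count, area),
        'rate': rate_per_1000(count, pop),
    }
    for name, (count, area, pop) in toy_areas.items()
}


def top_region(measure):
    """Name of the region with the largest value for the given measure."""
    return max(toy_summary, key=lambda name: toy_summary[name][measure])


for measure in ('count', 'density', 'rate'):
    print(measure, top_region(measure))
```

With these toy values a different region ranks first under each measure, which is why the choice of standardization matters for what a choropleth map appears to say.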

In this section we will map the crime counts per unit of area. We first need to compute the area for each region. In order to compute area we must **transform** the geographic coordinates to a **projected coordinate reference system** that preserves **area**.
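One way to see why longitude and latitude values cannot be used directly for area is that a degree of longitude covers less ground the farther you are from the equator. The sketch below uses a spherical Earth with a mean radius of 6371 km, which is an approximation chosen for illustration and not the ellipsoid used by real projections.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean radius; a spherical approximation


def km_per_degree(lat_deg):
    """Approximate ground length of one degree of latitude and longitude at lat_deg."""
    km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(lat_deg))
    return km_per_deg_lat, km_per_deg_lon


for lat in (0.0, 37.77, 60.0):   # equator, roughly San Francisco, a high latitude
    lat_km, lon_km = km_per_degree(lat)
    print(lat, round(lat_km, 1), round(lon_km, 1))
```

Because the two axes have different ground scales, and the scale changes with latitude, a polygon's area in "square degrees" is not meaningful. An equal-area projection converts the coordinates to meters in a way that keeps areas comparable.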

Let's explore this first with the SF Police Districts data. Look again at the table of crime counts per police district, called **crime_sfpd** above. What we want to do is add a column to the table that has the area in square kilometers and then use the crime_count and area_km2 columns to compute the density of crime incidents per square kilometer in each district.

In [None]:
crime_sfpd

### Computing area
Below we define a projection transformation object and a function for computing area in square kilometers for a geographic feature in a geojson data object based on the input property name and value.
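Once coordinates are in a projected system measured in meters, the area of a simple polygon can be computed from its vertices. The sketch below uses the shoelace formula on a made-up rectangle; it assumes the coordinates are already projected, ignores holes, and is meant to illustrate the idea rather than reproduce shapely's `area` property.

```python
def ring_area_km2(ring_m):
    """Shoelace formula for a simple polygon whose (x, y) vertices are in meters."""
    twice_area = 0.0
    n = len(ring_m)
    for i in range(n):
        x1, y1 = ring_m[i]
        x2, y2 = ring_m[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    area_m2 = abs(twice_area) / 2
    return round(area_m2 / (1000 * 1000), 3)   # square meters to square kilometers


# Hypothetical rectangle, 2000 m wide and 1500 m tall
toy_rectangle_m = [(0, 0), (2000, 0), (2000, 1500), (0, 1500)]
print(ring_area_km2(toy_rectangle_m))
```

The division by 1000 * 1000 is the same unit conversion used when turning a projected area in square meters into square kilometers.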

In [None]:
# Compute area of each region

# First define the projection transformation object
transformTo3310 = partial(
    pyproj.transform,
    pyproj.Proj(init='epsg:4326'),  # source coordinate system - WGS 84, EPSG:4326
    pyproj.Proj(init='epsg:3310'))  # destination coordinate system California Albers
                                    # We want to use this projection because it is a California equal area proj 
                                    # and we are doing area calculations in california


# Function to compute the area of a feature in a geojson file
# Inputs: the loaded geojson data, the name of the property we are searching on to find a match, eg district,
# and the value for the property that we are trying to match. 

def getPolyAreaKM2(jsonData, propertyName, propertyValue):
    
    for feature in jsonData['features']:
        if (feature['properties'][propertyName] == propertyValue):  
            # Did we find the polygon with the input
            # property value for the property name?
            # If yes, then compute its area
            polygon = shape(feature['geometry'])              # create a shapely polygon
            polygon = transform(transformTo3310,polygon)      #transform the coordinates 
            area_km2 = round(polygon.area / (1000 * 1000),3)  # compute the area
           
            return area_km2

# TEST
# Find the area of the polygon in the sfpd_json file where the district equals Mission.
getPolyAreaKM2(sfpd_json,'district','MISSION')

Now that we have our function we can compute the area of each region in the **crime_sfpd** crime counts table.

In [None]:
# Compute the area in sq KM for each region and add it to the crime count table
crime_sfpd['area_km2'] = crime_sfpd.apply(lambda val: getPolyAreaKM2(sfpd_json,'district',val), ['district'])
crime_sfpd.show()

Compute density per sq kilometer as follows:

In [None]:
# simple function to compute density
def getDensity(count, area):
    return (count / area) 

# test
getDensity(crime_sfpd['crime_count'][0], crime_sfpd['area_km2'][0])

Now apply the **getDensity** function to the **crime_sfpd** table which has the crime counts and areas for each SFPD district.

In [None]:
# Compute the crime density per region and add it to the crime count table
crime_sfpd = crime_sfpd.where(crime_sfpd['district']!= 'unknown')   # don't change this - we are removing the row 
                                                                    # where the district is unknown
crime_sfpd['crime_per_km2'] = crime_sfpd.apply(lambda count, area: getDensity(count,area), ['crime_count', 'area_km2'])
crime_sfpd

#### Create a map of crime density per SFPD District


In [None]:
# Create the lookup table for the map
crime_density_per_dist = crime_sfpd.select(['district','crime_per_km2'])

# Color Map the police districts by crime density
sfpd_map.color(crime_density_per_dist, key_on='feature.properties.district', palette='YlOrRd')

### QUESTION 6
- A. Does the density map for SF Police Districts convey the same information about the geographic distribution of crime incidents as the crime count map for these areas?
- B. What district(s) have the highest crime densities?
- C. What district(s) have the lowest crime densities?

**Double-click here to replace this line with your answer.**

### QUESTION 7
Update the cells below to create a choropleth map of **crime density** within **SF Neighborhoods**. Use the method we used above for mapping crime density in SFPD districts.

In [None]:
# Compute the area in sq KM for each region and add it to the crime count table
crime_nhood['area_km2'] = ... # your code here
crime_nhood

In [None]:
# Compute the crime density per region and add it to the crime count table
crime_nhood = crime_nhood.where(crime_nhood['nhood']!= 'unknown') # KEEP THIS!!
crime_nhood['crime_per_km2'] = crime_nhood.apply(...) # YOUR CODE HERE
crime_nhood.sort('crime_per_km2', descending=True)

In [None]:
# Create the lookup table for the map
crime_density_per_nhood = crime_nhood.select(['nhood','crime_per_km2'])

# Color Map the neighborhoods by crime density
sfhoods_map.color(...) # your code here

### QUESTION 8
- A. Based on the map above, what two **neighborhoods** have the highest crime density in San Francisco? Make sure your map is showing neighborhoods not districts! Refer to this [neighborhood map](sf_nhoods.png) for neighborhood names.
- B. Using the data in the table used to create the map symbology above, how many times higher is the density of the neighborhood with the highest density than that of the neighborhood with the second highest? Does the map convey this magnitude?
- C. How might you change the map **symbology** to convey this magnitude of difference?

**Double-click here to replace this line with your answer.**

### QUESTION 9
In the cells below enter code to create a choropleth map of crime density within **SF Census Tracts**. Use the method we used above for mapping crime density in SFPD districts.

In [None]:
# Compute the area in sq KM for each region and add it to the crime count table

In [None]:
# Compute the crime density per region and add it to the crime count table

In [None]:
# Create the lookup table for the map

# Color Map the census tracts by crime density


### QUESTION 10
How does the [KDE Density map of SF Crimes for Jan 2016](kde_sfcrime_density_jan2016.png), created in HW7, compare with the one created in this section for SF Census Tracts? Do they tell the same story? If not, how do they differ?

**Double-click here to replace this line with your answer.**

## Mapping Rates

A limitation of density maps is that they tend to follow the population, so what looks like a high crime area may really be a high population area. Often we also want to know where the crime incident count is high relative to the resident population. In order to compute that we need to get population data.

The most common source of population data is the US Census. Since our crime incident data are from 2016, we want recent population data for the city. We downloaded these data from the [US Census American Factfinder](http://factfinder.census.gov) website so you don't have to! Below we will join these data to the sftracts data so that we can map crime rate per capita (or per person).




In [None]:
# Let's take a look at the crime count per census tract table that we created previously
crimes_per_tract.sort('tractce10')

In [None]:
# Now read in the census tract population data from the **sftracts_popdata_file**
sfpop = Table.read_table(sftracts_popdata_file, dtype={'SFTRACT': str})
sfpop

In the above table POPEST is the census estimate of population counts for each census tract. Also, we can see from the above table that we can join the population data to the census tract polygons based on the census tract identifier. This value is in the **SFTRACT** column of the sfpop table and the **tractce10** column of **crimes_per_tract** table. Let's join them so we can compute crime rate.
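The join matches rows whose identifier values are equal, so the identifiers must have the same type and formatting in both tables. The sketch below imitates a join that keeps only matching rows, using plain Python and made-up tract identifiers and populations; it is an illustration of the idea, not the datascience `join` implementation.

```python
# Hypothetical rows; tract identifiers are kept as strings so leading zeros survive
toy_crime_rows = [('010100', 12), ('010200', 7), ('980900', 3)]
toy_pop_by_tract = {'010100': 4000, '010200': 3500}   # no population row for '980900'

# Keep only tracts present in both tables
toy_joined = [(tract, count, toy_pop_by_tract[tract])
              for tract, count in toy_crime_rows
              if tract in toy_pop_by_tract]
print(toy_joined)

# If the identifier had been read as a number, the leading zero would be lost
# and the key would no longer match.
print(str(int('010100')) in toy_pop_by_tract)
```

This is one plausible reason to read the identifier column as a string, as the `dtype` argument in the cell above does: a string identifier keeps its exact formatting, so it can match the identifier stored in the GeoJSON properties.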

In [None]:
# Join pop data to crime_count table
crime_rate_per_tract = crimes_per_tract.join('tractce10',sfpop,'SFTRACT')
crime_rate_per_tract

Let's join the neighborhood names to the **crime_rate_per_tract** table so we can see the areas in which high crime rates are occurring.

In [None]:
def getNhoodForTract(tract):
    
    for feature in sftracts_json['features']:
        if (feature['properties']['tractce10'] == tract):  
            return feature['properties']['nhood']

# TEST
# Find the neighborhood for the census tract with identifier 980900.
getNhoodForTract('980900')

In [None]:
# Define the getRatePer1K function
# NOTE: If we just compute per capita crime rate the rates are too low to map well with defaults.
def getRatePer1K(count,pop):
    rate = count / (pop / 1000)
    return rate

### QUESTION 11
Complete the next cell to compute the crime rate per 1,000 residents for each census tract and add it to the crime_rate_per_tract table.

In [None]:
# Apply the getRatePer1K function
crime_rate_per_tract['crime_rate'] = crime_rate_per_tract.apply(...) # your code here
crime_rate_per_tract.sort('crime_rate', descending=True)

Now we can map crime rate per census tract!

In [None]:
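# Color Map the census tracts by crime rate (not by crime count)
sftracts_map.color(crime_rate_per_tract.select(['tractce10','crime_rate']), key_on='feature.properties.tractce10', palette='YlOrRd')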

### QUESTION 12
- A. What census tract has the highest crime rate? What tract has the second highest? Name these by neighborhood plus census tract identifier.
- B. By how many times greater is it than the tract with the second highest rate? Does the map convey the magnitude of this difference?
- C. Would you recommend increasing the police presence in both of these tracts? Explain your answer.
- D. Do the three density maps convey the same information about geographic distribution of crime rate in San Francisco? If not why?

**Double-click here to replace this line with your answer.**

### QUESTION 13
Consider the crime rate within SF Census tracts above as well as the data in the **crime_rate_per_tract** table. Describe how the information in the table helps you better interpret the map. 

**Double-click here to replace this line with your answer.**

### QUESTION 14
Summarize how the count, density and rate maps depict the geographic distribution of crime in San Francisco differently. Do the maps of the different neighborhoods demonstrate the [MAUP](http://gispopsci.org/maup) problem?

**Double-click here to replace this line with your answer.**

## References
SF Current Police Districts
https://data.sfgov.org/api/geospatial/wkhw-cjsf?method=export&format=GeoJSON

SF Analysis Neighborhoods
https://data.sfgov.org/api/geospatial/p5b7-5n3h?method=export&format=GeoJSON

SF Census tracts w/neighborhoods
https://data.sfgov.org/api/geospatial/bwbp-wk3r?method=export&format=GeoJSON

SF crime incidents
https://data.sfgov.org/api/geospatial/wkhw-cjsf?method=export&format=GeoJSON

SF Population data
http://factfinder.census.gov

## WHAT TO SUBMIT

- Download your completed notebook, zip it and upload on bcourses by 5pm Tuesday, **April 19, 2016**.
- **Please add your name to your notebook title!**