## ESPM 88a: Exploring Geospatial Data

## Final Project: A Spatial Analysis of John Snow's Cholera Map

### Due Date:  Wednesday, May 11, at 5pm PST.

#### What to turn in: 
Submit your zipped IPython Notebook, titled with your LastnameFirstname and the exercise (ex: **FrontieraPatty_finalproj.ipynb**).  

#### How to turn in: 
Upload the file to bCourses before the due date. bCourses will indicate the day and time you submit the assignment. Late assignments will be penalized 10% per day up to 50%. 

#### Grading:
- This project is worth 25% of your final grade (25 points).
- Read the questions carefully and be sure to answer them completely.
- Whenever possible your answers should make reference to your work in this notebook.
- Refer to the readings, lecture slides and your HW exercises for guidance.

#### Honor Code
This is not a group project. The work you submit must be your own. If you have questions about the final project please ask or email your instructors.

*Double-click in the cell below to add your name and indicate that you have read this section.*

### NAME:   
<hr/>

In [None]:
# Import libraries -  run but don't change
from datascience import *  
import numpy as np
import math, random

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from functools import partial
import pyproj
from shapely.ops import transform
from shapely.geometry import Polygon, Point, LineString, shape
from shapely import speedups
speedups.enable()

import json  # for loading geodata and creating shapely geometries and testing spatial relationships
import seaborn as sns

import folium
import fileinput # to create folium maps as html files
from class_intervals_only import *  # Import class binning functions

def inline_map(m):
    from IPython.display import HTML, IFrame
    from folium import Map
    if isinstance(m, Map):
        m._build_map()
        srcdoc = m.HTML.replace('"', '&quot;')
        embed = HTML('<iframe srcdoc="{srcdoc}" '
                     'style="width: 100%; height: 100%; position:relative;'
                     'border: none"></iframe>'.format(srcdoc=srcdoc))
    elif isinstance(m, str):
        embed = IFrame(m, width=800, height=650)
    return embed



## Introduction

#### Purpose:
The purpose of this activity is to review and apply the concepts and methods of exploratory spatial data analysis that we covered in the course this semester. These are explored in the context of a single spatial analysis project to better appreciate their usefulness.  

#### Data:
The data for these activities can be downloaded as a single zip file. 
Download and unzip these data to use in this exercise.

#### Contents:
- Section 1.  Exploring the data
- Section 2.  Visual analysis
- Section 3.  Basic Summary Statistics
- Section 4.  Proximity analysis and spatial relationships
- Section 5.  Density analysis
- Section 6.  Additional Questions


This set of exercises will spatially explore GIS data that was created based on John Snow’s map of the London cholera outbreak of 1854. Dr. John Snow was a physician (not the Game of Thrones character) interested in public health and hygiene. He believed that contaminated water was the vector for the transmission of cholera, although the dominant theory at the time was that cholera was an airborne contagion. Snow mapped the addresses of people who had died from the disease and traced its source to the Broad Street water pump. He is considered a pioneer in modern epidemiology and geographic information analysis.  There are many online resources to read more details on this famous case. See, for example, [Wikipedia](https://en.wikipedia.org/wiki/John_Snow) or http://www.ph.ucla.edu/epi/snow.html.





## Section 1. Exploring the Data

A geospatial exploration begins with finding or creating the data that you will use in your analysis. Fortunately, there are many datasets online on Snow's investigation of the London cholera outbreak of 1854. I found the following four sources.

- [NCGIA Dataset](http://www.ncgia.ucsb.edu/pubs/snow/snow.html) - from the National Center for Geographic Information and Analysis (NCGIA) at UC Santa Barbara.
- [Don Boyes’ Dataset](http://donboyes.com/2011/10/14/john-snow-and-serendipity/) - from Don Boyes’ website.
- [RT Wilson’s Dataset](http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/) - from RT Wilson’s website.
- [Yale Tutorial Dataset](http://guides.library.yale.edu/GIS/gisworkshoparchive) - from the Yale Library's *Intermediate Data Analysis with ArcGIS* tutorial.

Each dataset includes slightly different files. This is typical of geospatial data analysis - what do you do when you find multiple versions of what appear to be the same thing? How do you decide which dataset to use? Here are some questions you can ask to help you make your decision:
- What dataset is best documented?
- Why was the data created? Was it created for a similar purpose as the one for which you intend to use it?
- Does the documentation or data indicate the coordinate reference system (CRS) used by the geographic data?
- Is the CRS arbitrary, or is it referenced to the Earth (geographic or projected)? If it is arbitrary, you can't display or analyze it with other geographic datasets and you don't know the units of analysis.
- What attributes are included with the geographic data? Which dataset has attributes that will be most helpful?
- Which dataset appears to be most comprehensive (has the most data)?
- What source seems to be the most reliable?

In summary, you find the data. If you find multiple versions, you review and compare the data. Then, you use your judgement to select one dataset or mix data from the various datasets. And you document your sources and procedure!
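One quick check when comparing datasets is whether the coordinate values could plausibly be longitude and latitude. The sketch below is only a heuristic and does not replace reading the dataset's documentation or CRS metadata. The function name `looks_geographic` and the sample values are hypothetical and are not taken from any of the four datasets.

```python
def looks_geographic(xs, ys):
    """Return True if every x is a valid longitude and every y a valid latitude.

    Heuristic only: small projected or arbitrary coordinates can also
    fall inside these ranges, so a True result is not proof of a
    geographic CRS.
    """
    return (all(-180.0 <= x <= 180.0 for x in xs)
            and all(-90.0 <= y <= 90.0 for y in ys))


# Hypothetical sample coordinates for illustration
degree_like = looks_geographic([-0.1367, -0.1340], [51.5133, 51.5120])
meter_like = looks_geographic([698671.0, 698700.0], [5710800.0, 5710850.0])

print(degree_like)  # values fit within lon/lat ranges
print(meter_like)   # values are far outside lon/lat ranges
```

Values far outside the longitude/latitude ranges suggest projected or arbitrary units, which is a cue to look for the CRS definition before combining the layer with other data.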



### QUESTION 1
Review the four web sites linked above. In one sentence each, summarize why you think each of these organizations or individuals created their John Snow datasets below in 1.A-1.D. Then answer 1.E. 
- A. NCGIA
- B. Don Boyes
- C. RT Wilson
- D. Yale University

- E. Based on these descriptions alone which version would you be inclined to use and why?

** Double-click here to input your answer.**

### QUESTION 2
Download these four datasets by clicking this [link](https://drive.google.com/open?id=0B-813oF9w22Mc0JsQTkwaXd4Y1U). Open each dataset in **QGIS**.  Note: the Yale Tutorial vector data are in an ESRI file geodatabase. To open this file format in QGIS:
    - Select Add vector layer 
    - Set Source Type = Directory 
        - If needed, set Source to either UK.NTF2, OpenFileGDB or ESRI FileGDB. 
    - Browse to the location of the *.gdb folder, press Open 
        - Available layers will be listed.
        - Select all to open in QGIS.

Then answer the following questions:

- A. How many geographic data layers are in each of the following datasets?
    - NCGIA
    - Don Boyes
    - RT Wilson
    - Yale University

- B. What is the name of the data layer that contains the representations of the cholera deaths?
    - NCGIA
    - Don Boyes
    - RT Wilson
    - Yale University
    
- C. What is the number of features in the deaths data layer?
    - NCGIA
    - Don Boyes
    - RT Wilson
    - Yale University
    
- D. What is the number of deaths represented by those features?
    - NCGIA
    - Don Boyes
    - RT Wilson
    - Yale University
    
- E. What attributes are used to describe each feature in the deaths layer?
    - NCGIA
    - Don Boyes
    - RT Wilson
    - Yale University

- F. What is the code for the CRS for the deaths layer? Give the EPSG code and/or the projection description.
    - NCGIA
    - Don Boyes
    - RT Wilson
    - Yale University
    
** Double-click here to input your answer.**

### QUESTION 3
- a. What is a geographic feature? What are the components of a geographic feature?
- b. Why do the number of features and the number of deaths differ in the Yale death data layer?
- c. Based on your review of the four datasets above, which one would you decide to use and why?

** Double-click here to input your answer.**

### A high resolution image of the John Snow Cholera Map.
*Source: https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Snow-cholera-map-1.jpg/823px-Snow-cholera-map-1.jpg*
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Snow-cholera-map-1.jpg/823px-Snow-cholera-map-1.jpg"/>

### QUESTION 4

The map above depicts John Snow's geographic representation of the Cholera epidemic. Describe the visual variables John Snow used to map Cholera deaths.

** Double-click here to input your answer.**

### Mapping the Data

Getting spatial data, understanding its suitability for an application and preparing it for analysis are the largest time components of a geospatial analysis workflow. It is often hard to understand the source of data discrepancies and know if they are deliberate or the result of error. This is where having deeper knowledge of John Snow and the cholera epidemic would be helpful. In most GIS analysis, domain knowledge (knowledge of the problem and its context) is just as important as knowledge of methods, tools, and techniques.

The *Yale Tutorial* data have the same number of deaths and only one fewer pump than the data from NCGIA, a well-established institution and the original source of these digital data. Since the Yale data are in a projected CRS, we will use them as we move forward with the rest of our analysis in the Jupyter notebook environment. 

Let's start by reading the data into tables. These data have been slightly transformed: 1) they are in CSV files and 2) the coordinate data have been reprojected to the WGS84 (epsg:4326) CRS.

In [None]:
# The Snow vector data, run but don't change
snow_deaths = 'snow_deaths.csv'
snow_pumps = 'snow_pumps.csv'
 

In [None]:
# read in and view the water pumps data
pumps = Table.read_table(snow_pumps)
pumps

In [None]:
# read in and view the deaths data
deaths = Table.read_table(snow_deaths)
deaths

### QUESTION 5
Look at the coordinates for the deaths and pumps data. Describe in words what these values tell you about where the study area (our area of interest) is located on the surface of the Earth in terms of its position relative to the equator, north or south pole and the prime meridian. 

** Double-click here to input your answer.**

## Section 2. Visual Analysis

Let’s begin a visual analysis of the Snow data.  The goal of this analysis is to see if we can notice any potential patterns that will spark further inquiry.


#### IMPORTANT NOTES

- We are using the **folium** library to map the data in this project rather than the datascience Maps class that we used in several of our exercises. Refer to HW4 and HW7 for syntax examples and functions that you may want to copy and reuse, perhaps with slight edits. 


- Reminder: If you want to see the available folium Map methods you can enter **folium.Map.** in a code cell and hit the tab key. To get help on the specific methods, like **circle_marker**, you can enter **folium.Map.circle_marker?** in a code cell and run it.


- Read any comments within the code cells in case there are any hints or additional directions therein.


### QUESTION 6

Complete the code cells below to create a map of the Snow data that is centered on the cholera deaths. Show the deaths as circle markers and the pumps as simple markers. Each marker should have pop-up text that describes the name of the water pump or the address + number of cases for the deaths. Make sure pumps and deaths have different colors that contrast well.

- A. Update the deaths table with a description column that includes the address plus the number of deaths as a character string, like "3 Broad St (10 deaths)". You will use this column as text for marker popups.
- B. Define a function to add points to a map as **simple markers** with popup text.  The function should allow the user to change the color of the markers.
- C. Define a function to add points to a map as **circle markers** with popup text.  The function should allow the user to change the color of the markers and the radius of the circle. Note that the radius is expressed in the units of the map CRS.
- D. Add the pump and death locations to a folium map and display it.
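As a hint for part A, the popup text needs to be a single string per row, which means converting the numeric count to text before combining it with the address. The sketch below shows the idea with plain Python lists. The addresses and counts are made up, and how you attach the result to the deaths table is left to you.

```python
# Made-up example values, not rows from the deaths table
addresses = ["3 Broad St", "12 Example Ln"]
num_cases = [10, 2]


def describe(address, cases):
    """Build popup text such as '3 Broad St (10 deaths)' as one string."""
    return "{} ({} deaths)".format(address, cases)


# One description string per address
descriptions = [describe(a, c) for a, c in zip(addresses, num_cases)]
print(descriptions)
```

Each element of `descriptions` is a plain `str`, not a list or array, which is what a marker popup expects.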

In [None]:
# Question 6.A
# First, update the deaths table with a description column that includes the address 
# plus the number of deaths. 
# You will use this column as text for marker popups.
# MAKE SURE THAT this column contains a string and not an array or list of strings!

## INPUT YOUR CODE BELOW

In [None]:
# Question 6.B
# Define a function to add points to a map as simple markers with popup text.  
# The function should allow the user to change the color of the markers.

## INPUT YOUR CODE BELOW


In [None]:
# Question 6.C
# Define a function to add points to a map as circle markers with popup text.  
# The function should allow the user to change the color and radius of the markers.
# NOTE/HINT: Given the small geographic extent of the study area, 
# the default radius value should be pretty small to make the map informative.
# To improve the look of your markers you may want to add the following arguments to your function definition
# fill_opacity=0.8, line_color=None

## INPUT YOUR CODE BELOW


In [None]:
# Question 6.D
## Add the pump and death locations to a folium map and display it.
## INPUT YOUR CODE BELOW

# Calculate the mean center of death locations

# Create the map, centered on the mean center of the death locations

# Add circle markers for the deaths

# Add simple markers for the pumps

# Draw the map using the lines below (uncomment)
#
# Create an HTML version of the map
#m.create_map('snow_map.html')

# Draw the map
# inline_map('snow_map.html')


### QUESTION 6.E
How do we know that the units of the map CRS are meters? What is the map CRS?

** Double-click here to input your answer.**

### QUESTION 7
Re-run your map code above, first setting the zoom_start value to 16 and then to 18. Take a look at the pattern of the death points in the map you just made when you set the zoom_start value to 16 vs 18. Do the death points look more clustered or random:
- A. At zoom_start=16?
- B. At zoom_start=18?

- C. Assume the full extent of the area displayed for each zoom_start value above indicates your study area of interest.  How does the change in study area, or scale, change your impression of the pattern of the death points? Why is this important?

- D. In your map, do the death points seem to be clustered around any one water pump?  If yes, what is its name?

** Double-click here to input your answer.**


### QUESTION 8
Make a few adjustments to the map you made for Question 6 to create a **proportional symbol map** of the death locations where the circle **radius** is proportional to the number of deaths at the address (Num_Cases). Plot the pumps on top of the death points.  To simplify the map, use a different basemap by adding **tiles='CartoDB Positron'** to the folium.Map method. 
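A proportional symbol scales the symbol size directly with the data value. A minimal sketch of that scaling is shown below, assuming a hypothetical `radius_per_case` factor that you would tune for your own map. The counts are made up.

```python
def proportional_radius(count, radius_per_case=2.0):
    """Return a circle radius that grows linearly with the number of cases.

    radius_per_case is a hypothetical tuning factor: the radius
    assigned to a single case.
    """
    return radius_per_case * count


# Made-up case counts for three addresses
example_counts = [1, 3, 10]
example_radii = [proportional_radius(c) for c in example_counts]
print(example_radii)  # [2.0, 6.0, 20.0]
```

Note that this follows the question's wording and scales the radius. Some cartographers prefer to scale the circle area instead, which would use the square root of the count.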


In [None]:
### Your code for Question 8 here

### QUESTION 8.B and 8.C
- B. Does any pattern in the relationship between the pumps and the deaths data seem evident with the proportional symbol map? If yes, describe.
- C. What is the difference between a proportional symbol map and a graduated symbol map? Why would you want to use one versus the other?

** Double-click here to input your answer.**

## Section 3. Basic Spatial Summary Statistics

The visual analysis above supports what John Snow found - that the Broad Street water pump was the source of the cholera deaths. But we know what we are looking for. Without foreknowledge, plausible theories are hard to formulate and causality is extremely hard to determine. What if we were examining the death address locations and had no information about the source of cholera and the possible role of the water pumps? Let’s see if we can use spatial analysis to come up with it based on the death data alone. 


### QUESTION 9
In the code cell below, 
- A. create a histogram of the deaths data using 20 bins.
- B. Compute and print the mean number of deaths per household rounded up to the nearest integer.
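Keep in mind that rounding up is not the same as ordinary rounding. The sketch below contrasts the two using made-up counts that are not from the deaths table.

```python
import math

# Made-up counts for five households
example_cases = [1, 1, 1, 2, 1]

mean_cases = sum(example_cases) / len(example_cases)

print(mean_cases)             # 1.2
print(round(mean_cases))      # 1 (rounds to the nearest integer)
print(math.ceil(mean_cases))  # 2 (always rounds up)
```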

In [None]:
### Input your Answer to Question 9 here
# display histogram
# print('Mean number of deaths per household: ', ...)
# print('Mean number of deaths per household, rounded: ', ...)

### Mapping the mean center

Assume we have no knowledge of the water pump theory. Let's plot the mean center of the deaths on top of the image of the original Snow Map and see what that reveals.

But first, in order to add the scanned map to a folium map we need to: 
- use folium to first save the map as an **html** file,
- define and run a function to add the image to the html map,
- display our html map in the notebook.

This process is shown below. Here we first define some code to make it possible to add the image to a folium Map. We then add a marker on top of the Map to demonstrate how that is done and the order in which it should be done.

In [None]:
# Code to add the snow image map to the folium web map - just run, DO NOT CHANGE
#
# After: https://ocefpaf.github.io/python4oceanographers/blog/2015/07/13/interactive_geo/
# Image source: https://www.udel.edu/johnmack/frec682/cholera/snow_map.png

snow_map_js_code_str = """
var imageUrl = './snow_map_udel50.png',
imageBounds = [[51.508726, -0.144670], [51.51725, -0.13015]];
var my_image_overlay = L.imageOverlay(imageUrl, imageBounds);
my_image_overlay.addTo(map).bringToBack();

var layerControl = L.control.layers(baseLayer, layer_list).addTo(map);
layerGroup = L.layerGroup()
        .addLayer(my_image_overlay)
        .addTo(map);
layerControl.addOverlay(layerGroup , 'Snow Map');
</script>
"""

def addSnowImageOverlayToMap(the_map):

    add_lines = False
   
    for line in fileinput.input(the_map, inplace=True):
       
        if 'L.control.layers(baseLayer, layer_list).addTo(map);' in line:
            # If we already have a layer control attached to the map, remove it
            line = ''
            
        if '   </script>' in line:
            line = '' #remove
            # Add the extra lines after the clustered marker line
            add_lines = True
            
        else:
            if add_lines:
                print(snow_map_js_code_str)
            add_lines = False
            
        print(line, end='')  # write the line back to the file without adding an extra newline

In [None]:
# Code to draw image of Snow's map on a folium map
# and then add other data to the map as well.

# Center map on the Broad street pump
ctr_lat = pumps.where('Name','43 Broadwick Street')['lat'][0]
ctr_lon = pumps.where('Name','43 Broadwick Street')['lon'][0]

# Draw the map
m = folium.Map([ctr_lat, ctr_lon], zoom_start=15)

# Add a marker for the Broad street Pump
m.simple_marker(location=(ctr_lat,ctr_lon), popup="Broad St Pump")

# NEW:
# In order to display a scanned image on a folium Map we need to save it to an HTML file.
# And then use the addSnowImageOverlayToMap function to insert it in the HTML file.
# Use this code as a template when you are asked to map locations on top of the Snow Map.

# Create the HTML version of the map - for better inline formatting
m.create_map("snow_map.html")

# Add the image of the snow map to the folium HTML map
addSnowImageOverlayToMap("snow_map.html")

# Display the folium HTML map
inline_map("snow_map.html")


If you zoom in to the Broad St (now Broadwick St) pump on the map above and turn off the image layer (using the layer control icon in the upper right of the map), you will see that the memorial to John Snow's cholera research is close to, but not at, the location of the Broad Street pump. It is at the corner of Broadwick and Poland Street.

### QUESTION 10
In the cell below create a map of the mean center of the death locations on top of the Snow map. Color the marker green (the default) and give it a descriptive popup.

In [None]:
## Your code for Question 10 here


### QUESTION 10.B and 10.C
- B. Is the image of the Snow Map raster or vector data?
- C. Briefly describe raster data and vector data representations. Give an example of a type of geographic entity or phenomenon that might be best represented as a vector and one as raster to illustrate.

** Double-click here to input your answer.**

### Field work
Put yourself in John Snow's mindset. You are a physician who believes cholera is a waterborne disease and not an airborne one, as was commonly thought at the time. If you zoom in on the map you just made to where the mean center point is overlaid, you will see two named locations from which those who died may have gotten water or water-based beverages and that might be worth investigating: (1) the Broad Street Pump and (2) the Brewery.  The mean center is closer to the Broad Street pump. Let's see if we can get more evidence to support the idea that it might be the cause.



### QUESTION 11
In the cell below, re-create the map you just made above but add the mean center of deaths at locations where the number of deaths is greater than two. Color the marker orange and add a descriptive popup.  In a print statement, say whether the marker moved closer to the brewery or to the Broad Street Pump.

In [None]:
## Insert your code for Q11 here

### Transforming the death data
One problem with mapping the mean centers of these locations is that there are 578 deaths and 342 locations. Let's create a table called **each_death** that has one death per row and then map the mean centers for that table.
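To see why one row per death matters, the sketch below compares three mean centers on made-up data: one per address, one per death after repeating each address by its count, and one weighted by the count. The function names and coordinates are hypothetical and only illustrate the idea.

```python
# Made-up rows: (lon, lat, num_cases)
example_locations = [(-0.1370, 51.5130, 1), (-0.1360, 51.5140, 3)]


def mean_center(points):
    """Unweighted mean of (lon, lat) pairs."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return sum(lons) / len(lons), sum(lats) / len(lats)


def weighted_mean_center(rows):
    """Mean of (lon, lat) weighted by the case count in each row."""
    total = sum(w for _, _, w in rows)
    lon = sum(x * w for x, _, w in rows) / total
    lat = sum(y * w for _, y, w in rows) / total
    return lon, lat


# Repeat each address once per death
expanded = [(lon, lat) for lon, lat, n in example_locations for _ in range(n)]

per_address = mean_center([(lon, lat) for lon, lat, _ in example_locations])
per_death = mean_center(expanded)
weighted = weighted_mean_center(example_locations)

print(per_address)
print(per_death)
print(weighted)  # matches per_death up to floating point error
```

The per-death center is pulled toward the address with more cases, and it matches the weighted mean center. Expanding the table is one way to get that weighting without changing the mean center code.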

In [None]:
each_death = Table(['lon','lat','Address','Num_Cases'])
deaths_sorted = deaths.sort('Address')
for lon, lat, address, cases in deaths_sorted.select(['lon','lat', "Address", "Num_Cases"]).rows:
    for i in range(cases):
        temp_table= Table().with_columns(
            [
            'lon', [lon],
            'lat', [lat],
            'Address', [address],
            'Num_Cases', [cases]
            ])
        each_death.append(temp_table)
    
#view it
each_death

In [None]:
# Check our totals to make sure they add up!
print(sum(deaths['Num_Cases']))
print(deaths.num_rows)
print(each_death.num_rows)

### QUESTION 12
In the cell below create a map that extends your map from Question 11 by adding the mean center of **each_death** locations (color the marker black) and of **each_death** where the number of cases is greater than two (color the marker red). Your map should have four markers with different colors, each with descriptive popups. Update the *zoom_start* value to zoom in on those markers.

In [None]:
## Input your code for Question 12 here.


### QUESTION 13
Describe how the data transformation affected the locations of the mean center points. Which version of the mean center points of the deaths do you think is more accurate?

** Double-click here to input your answer.**


### Mapping the Spread around the Mean

Now that we have a theory that the Broad Street Pump may be the source of the cholera outbreak, let's map the distribution of the deaths around the mean using the **standard deviation of the X and Y coordinates**.
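As a reminder of the idea, a standard deviation box extends a chosen number of standard deviations on either side of the mean in X and in Y. The sketch below is not the HW7 function. It is a minimal stand-alone illustration on made-up coordinates, and it assumes the population standard deviation. The name `std_box` and its return order are hypothetical.

```python
import statistics


def std_box(xs, ys, num_sd=1):
    """Return (min_x, min_y, max_x, max_y) for mean +/- num_sd standard deviations.

    Assumes the population standard deviation (statistics.pstdev).
    """
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mean_x - num_sd * sd_x, mean_y - num_sd * sd_y,
            mean_x + num_sd * sd_x, mean_y + num_sd * sd_y)


# Made-up coordinates
example_xs = [1.0, 3.0]
example_ys = [10.0, 14.0]

box_1sd = std_box(example_xs, example_ys)
box_2sd = std_box(example_xs, example_ys, num_sd=2)
print(box_1sd)
print(box_2sd)
```

Both boxes share the same center, and the two standard deviation box is twice as wide and twice as tall as the one standard deviation box.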

### QUESTION 14
In the cell below copy the function from HW7 to create the standard deviation of the x and y coordinates. **Revise** this function to take a parameter for the number of standard deviations and set the default to 1.

In [None]:
## INPUT YOUR CODE for Question 14 below


### QUESTION 15

Create one map that displays boxes of the standard deviation of the X and Y coordinates for the **each_death** table for one and two standard deviations.  Give them different colors and descriptive popups. Display them on top of the image of Snow's map. Center the map on the mean center of the **each_death** locations. Add the pumps to the map as well.

In [None]:
## Input your code for Question 15 here.


### QUESTION 16
- A. What pumps are within 1 standard deviation distance of the X and Y coordinates? 
- B. What pumps are within 2?  
- C. What can you say about the deaths based on this that would support our theory that the Broad Street pump is the source of the epidemic?

** Double-click here to input your answer.**

### Nearest Neighbor Distance

Let's explore the clustering of the death locations by computing the nearest neighbor index (**NNI**) for locations in the **each_death** table.  Since NNI is based on metric distance we first need to add projected map coordinates to the each_death table. We can then use these projected coordinates to compute NNI. 

- Note, it is a best practice to label columns with geographic coordinates *lon* or *longitude* and *lat* or *latitude*  and columns with projected coordinates *X* and *Y*.
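To illustrate why distances should not be computed directly on longitude and latitude values, the sketch below estimates how many meters one degree covers at roughly the latitude of London. It assumes a spherical Earth with a mean radius of 6,371 km, so the numbers are approximate. The function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def meters_per_degree(lat_deg):
    """Approximate meters covered by one degree of latitude and of longitude."""
    m_per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    # Meridians converge toward the poles, so a degree of longitude shrinks
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat_deg))
    return m_per_deg_lat, m_per_deg_lon


lat_m, lon_m = meters_per_degree(51.5)
print(round(lat_m), round(lon_m))
```

Because a degree of longitude covers far fewer meters than a degree of latitude at this latitude, a distance computed on raw degrees would be distorted. Projected X and Y coordinates in meters avoid this problem.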

### QUESTION 17
- A. In general, what is the difference between a projected coordinate reference system and a geographic coordinate reference system?
- B. Describe the key property of a conformal map projection. 
- C. Give an example of when you would use a conformal map projection rather than an equal area map projection.

** Double-click here to input your answer.**

### QUESTION 18

In the cell below complete the function to transform geographic (lat/lon) coordinates to projected coordinates using the coordinate transformation function **transformTo32630**. Then apply this function to the **each_death** table and the **pumps** table so that you end up with **X** and **Y** columns in those tables.


In [None]:
## Input your code for Question 18 below
## Look for the lines that say - YOUR CODE HERE

transformTo32630 = partial(
    pyproj.transform,
    pyproj.Proj(init='epsg:4326'),   # source coordinate system - WGS 84, EPSG:4326
    pyproj.Proj(init='epsg:32630'))  # destination coordinate system - UTM zone 30N, WGS84 

    
def get_transformed_coords(lon, lat, transform_function):
    
    # YOUR CODE HERE
    ...
    ...
    
    # return the coordinates
    return  ... #YOUR CODE HERE



# Apply the function to the each_death table so that you add an X and Y column
# YOUR CODE HERE



# Add the XY coords to the pumps table
# YOUR CODE HERE



# View the tables to make sure you got it right
#each_death
#pumps

### Computing NNI

You can revise the code for the **getNNI** function in **HW7** to compute **NNI** for the **each_death** table, or for any table. The limitation of the **HW7** version is that the function determines the study area from the bounding box of the set of points for which it is determining NNI.  Since the extent of the point set is not always the same as the extent of the study area, the function should allow you to input a different bounding box for the study area.
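To see how much the choice of study area matters, the sketch below computes an NNI for the same made-up points using two different study areas. It is a plain-Python illustration, not the **getNNI** function: it takes the study area directly as a number, and the name `nearest_neighbor_index` is hypothetical.

```python
import math


def nearest_neighbor_index(points, study_area):
    """Ratio of observed to expected mean nearest neighbor distance.

    points: list of (x, y) tuples in a projected CRS.
    study_area: area of the study region in squared map units.
    """
    n = len(points)
    nn_dists = []
    for i, (x1, y1) in enumerate(points):
        # Distance from this point to its closest other point
        nn_dists.append(min(math.hypot(x1 - x2, y1 - y2)
                            for j, (x2, y2) in enumerate(points) if j != i))
    observed = sum(nn_dists) / n
    expected = 0.5 / math.sqrt(n / study_area)
    return observed / expected


# Made-up points whose bounding box is 10 x 10 units
example_pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (10, 10)]

nni_tight = nearest_neighbor_index(example_pts, 10 * 10)  # bounding box of the points
nni_wide = nearest_neighbor_index(example_pts, 20 * 20)   # a larger study area
print(nni_tight, nni_wide)
```

Quadrupling the study area doubles the expected distance and halves the index, so the same points look more clustered simply because the study area is larger.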

### QUESTION 19

Revise the function below to take as input the study area bounding coordinates and compute the study area.
You need only update the lines that say ** YOUR CODE HERE **.

In [None]:
from scipy import spatial as sp  # provides distance_matrix, used below

def getNNI(x_coords, y_coords, ....... ): # ADD YOUR CODE HERE
    
    ## Calculate the Nearest Neighbor Index of a set of points in a data_table
    ## ASSUMPTION: X, Y and study area bounding coords all in same projected CRS
    
    # number of points in study area
    numpts = x_coords.size
    
    # Get the area of the study area from the bounding box coordinates  
    study_area_height = ... # YOUR CODE HERE
    study_area_width = ...  # YOUR CODE HERE
    
    study_area = study_area_height * study_area_width
    
    # Calculate the Expected Nearest Neighbor distance from the number of points in the study area
    expected_nn_dist = .5 / math.sqrt(numpts/study_area)
    
    # Compute the distance matrix for all point locations
    # This can take a while if lots of points!
    pts = np.column_stack((x_coords,y_coords))
    my_distance_matrix = sp.distance_matrix(pts,pts) #comparing a point set to itself

    # Compute OBSERVED mean nn distance using scipy distance_matrix
    # Important - We only want the 1 nearest neighbor! not all NN
    dist_matrix_1nn=[np.sort(elem)[1] for elem in my_distance_matrix ] # sort and grab the second lowest value
                                                                   # as lowest is always dist to self
    dm_len = len(dist_matrix_1nn) # length of the distance matrix
    dm_sum = sum(dist_matrix_1nn) # the sum of the nearest neighbor distances
    
    obs_mean_nndist = dm_sum/dm_len # the average observed nn distance
    
    
    # Compute the Nearest Neighbor Index which is ratio of observed to expected mean nn distance
    # The NNI measures the spatial distribution from 0 (clustered pattern) to 
    # 1 (randomly dispersed pattern) to 2.15 (regularly dispersed /uniform pattern):
    nni = obs_mean_nndist / expected_nn_dist
    
    return nni


### QUESTION 20

In the cell below, input your code to apply the **getNNI** function and complete the print statements to show the **NNI** for
- A. all points in the **each_death** table, where the study area is the extent of the each_death points.
- B. all points in the **each_death** table, where the study area is the extent of the pump points.
- C. points in the **each_death** table where the number of cases was **greater than 1**, and the study area is the extent of the each_death points.

In [None]:
## Input your code for Question 20 here
## Add code to compute the NNI either in or above these print statements

print("A. Nearest Neighbor Index for each death (each_death extent): ", )

print("B. Nearest Neighbor Index for each death (pump extent): ", )

print("C. NNI for each death where Num Cases > 2 (each_death extent):", ) 

### QUESTION 21

- A. Briefly summarize what the NNI is. What information about the point set does it consider and how? 
- B. What is the range of values for the NNI, and what do these values mean on that scale?
- C. Can you name a weakness of the NNI?
- D. What do the values from Question 20 tell you about the overall pattern of the death locations?
- E. What do the values from Question 20 tell you about the impact of the study area on the NNI?

** Double-click here to input your answer.**

## Section 4. Proximity Analysis and Spatial Relationships

In all of the previous questions we explored the death address points, using the Snow map for visual context and exploring basic spatial summary statistics. In doing so, we have come up with our theory about the Broad Street water pump. In this section we will explore the relationship between two data sets - death addresses and water pumps.

Some questions we can consider include “Are most of the cholera deaths around the Broad St water pump?” and “How many of the deaths occurred near the other pumps?” 

These questions speak to notions of proximity. Proximity is an extremely important spatial concept and drives many methods of spatial analysis. Everyday terms for proximity include around, near and neighborhood. Geospatial analysts additionally use terms like nearest neighbor, neighborhood, zone, sphere of influence, and service area. When we think about proximity, there is no single distance that defines it. Rather, it is context dependent. For example, our sense of distance in considering the nearest airport is much larger than for the nearest grocery store.

If you lived in London in 1854 and had to get your water from a public water pump, how far could you or would you want to carry that bucket or jug? A five-gallon jug would weigh about 40 pounds. Not very far would be my answer, maybe the length of a football field (120 yards) or about 100 meters. So let’s count the number of deaths within 100 meters of each pump and see if any one pump has a lot more nearby deaths. To do this, you will create buffer polygons around each water pump and then use a spatial relationship query to count the number of deaths within those areas.
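For a circular buffer, asking whether a point falls inside the buffer is essentially the same as asking whether its distance to the center is no more than the buffer radius. The sketch below shows that idea with plain Python on made-up projected coordinates in meters. The function name `count_within` is hypothetical. A buffer polygon only approximates a circle, so results could differ for points lying almost exactly on the edge.

```python
import math


def count_within(center_xy, points, radius_m=100.0):
    """Count the points that lie within radius_m of center_xy."""
    cx, cy = center_xy
    return sum(1 for x, y in points
               if math.hypot(x - cx, y - cy) <= radius_m)


# Made-up projected coordinates in meters
example_pump = (1000.0, 2000.0)
example_deaths = [(1010.0, 2000.0),   # 10 m away
                  (1000.0, 2099.0),   # 99 m away
                  (1200.0, 2000.0)]   # 200 m away

n_near = count_within(example_pump, example_deaths)
print(n_near)  # 2
```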

To get you started, below is a function for computing the count of deaths within 100m of each water pump.

In [None]:
def getDeathCountForPumps(x, y, num_deaths): 
    # Get the number of deaths within 100meters of a water pump
    
    # create a polygon of the buffer around a pump point
    pump_poly = Point(x, y).buffer(100) # 100 meter buffer

    # set the count of deaths to the current number of deaths
    count=num_deaths
    
    # Compare the pump buffer polygon to each death point
    for x2, y2, num_cases in each_death.select(["X", "Y", "Num_Cases"]).rows:
    
        death_point = Point(x2, y2)
        
        if pump_poly.contains(death_point):
            # If the point is within the polygon 
            count = count + 1    #increment the death count for that pump
            
    return count


# Test it
getDeathCountForPumps(698671, 5.7108e+06, 0)



### QUESTION 22
Complete the code below to use the **getDeathCountForPumps** function to compute the number of deaths within 100 meters of each pump and add it to the pumps table.

In [None]:
## Complete this code for QUESTION 22

# create an empty column in pumps called death_count and set all values to zero
pumps['death_count'] = [0 for x in pumps['Name'] ]

# Compute the number of deaths within 100m of each pump and add that value to the pumps table

# Show() the pumps table sorted in descending order by the number of deaths in the buffer

### QUESTION 23
In the cell below create a sorted (descending) horizontal bar chart showing the names of the pumps on the Y axis and the count of deaths within 100 meters of each pump on the X axis.

In [None]:
# Input your code for Question 23 here


### QUESTION 24
- A. What is the total number of deaths within 100 meters of all pumps?
- B. Name the two pumps with the most deaths within 100 meters. Are these the same pumps that were within two standard deviation distance of the X and Y coords of the mean center?
- C. How many times more deaths were within 100 meters of the Broad Street pump than the pump with the second highest number of deaths?
- D. What percent of the total number of deaths is not within 100 meters of any pump?

** Double-click here to input your answer.**

In [None]:
## Code cell for Q24 work

### QUESTION 25

Use the code cells below to create a **graduated color map** of the pump buffers, symbolized by the number of deaths within 100 meters of the pump. Set the radius of the pump to be 100 so that we can see the buffer polygons. Recall that the radius is in the units of the map CRS, which is meters. Specifically, you will need to:
- A. Determine the colors for each pump buffer polygon.
- B. Draw a map showing the pump buffers on top of the death points and the Snow map.
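To make the notion of classifying data values concrete, here is a minimal sketch of one classification method, equal interval, applied to made-up counts. The helper names `equal_interval_breaks` and `classify` are hypothetical and are not the functions in `class_intervals_only`; the counts are invented for illustration.

```python
def equal_interval_breaks(values, k):
    """Return the k upper class limits that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k + 1)]

def classify(value, breaks):
    """Return the index of the first class whose upper limit is >= value."""
    for i, upper in enumerate(breaks):
        if value <= upper:
            return i
    return len(breaks) - 1

# Hypothetical, strongly skewed counts
counts = [0, 2, 3, 5, 8, 100]
breaks = equal_interval_breaks(counts, 5)
classes = [classify(v, breaks) for v in counts]

print(breaks)
print(classes)
```

With skewed values like these, the choice of classification method changes which observations share a color, which is why it is worth looking at the histogram before deciding.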


In [None]:
# Your code for Question 25.A below

# First, define a function to set symbol COLOR based on the class of attribute data values
 
# Classify the data values into 5 classes 
# Select a classification method that best displays the death hotspots
# Refer to the histogram for insights
 
# Add the colors for each class to pump table in a column labeled 'class'


In [None]:
# Input your code for Question 25.B below
# Your code should display the color-coded water pump buffers on top of the death points
# and the Snow map


### QUESTION 26
- What classification method did you use for your colors above and why?
- What do the horizontal bar chart and graduated color map above tell you about the strengths and limitations of computing the number of deaths within 100 meters?

** Double-click here to input your answer.**

### Voronoi Polygons
Since many death address locations were not within 100 meters of a pump, let's consider the spatial relationship between death locations and water pumps with a different notion of **near**.  We can create Voronoi polygons around each pump and count the number of deaths within those polygons. [Voronoi polygons](https://en.wikipedia.org/wiki/Voronoi_diagram) divide the extent of the 2D space around a set of points into polygons where each location within the polygon is closer to that point than to any other point.
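The definition above implies that counting points inside Voronoi polygons is equivalent to assigning each point to its nearest site. A minimal sketch of that equivalence, using only the standard library and invented coordinates (the `nearest_site` helper and the site names are hypothetical):

```python
import math

def nearest_site(point, sites):
    """Return the name of the site closest to point (straight-line distance)."""
    px, py = point
    return min(
        sites,
        key=lambda name: math.hypot(px - sites[name][0], py - sites[name][1]),
    )

# Two hypothetical sites and four hypothetical points (same planar units)
sites = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
points = [(1.0, 2.0), (4.0, -3.0), (9.0, 1.0), (6.0, 6.0)]

counts = {name: 0 for name in sites}
for p in points:
    counts[nearest_site(p, sites)] += 1

print(counts)
```

In this toy setup every point is assigned to exactly one site, ignoring exact ties. Polygons loaded from a file may be clipped to a bounding extent, so the same is not guaranteed for real data.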

Our goal will be to create a choropleth map of the Voronoi polygons for the water pumps. The data value that we will use to symbolize the polygons is the number of deaths within each Voronoi polygon. Folium requires GeoJSON as input for choropleth mapping. You can load a GeoJSON file with the Voronoi polygons using the code below.

In [None]:
# Read in the geo_json data and take a look at the properties (attributes) associated with the voronoi polygons
pumps_voronoi_geojson_data = json.load(open('pumps_vorpolys_geoe.geojson', 'r'))   

# Let's checkout the properties, or attributes that describe the geographic data, in the geojson file
print(pumps_voronoi_geojson_data['features'][1]['properties'])
 

If you look at the properties for the Voronoi polygons above you will see the name (**Name**) of the property that you can use as a key to join the GeoJSON data to the pumps data table.
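For readers unfamiliar with joining on a key, the sketch below shows the general pattern with a tiny, invented GeoJSON-like dictionary: each feature's `Name` property is looked up in a table of values. The feature names, the `value` property, and the numbers are all hypothetical, and the geometries are left as `None` to keep the example short.

```python
# A tiny, made-up FeatureCollection (geometry omitted for brevity)
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"Name": "Pump A"}, "geometry": None},
        {"type": "Feature", "properties": {"Name": "Pump B"}, "geometry": None},
        {"type": "Feature", "properties": {"Name": "Pump C"}, "geometry": None},
    ],
}

# A made-up attribute table keyed by the same names
table_rows = [("Pump A", 12), ("Pump B", 3)]
value_by_name = dict(table_rows)

# Join: copy the table value onto the feature whose Name matches
for feature in geojson["features"]:
    name = feature["properties"]["Name"]
    feature["properties"]["value"] = value_by_name.get(name)  # None if no match

print([f["properties"] for f in geojson["features"]])
```

The join only works when the key values match exactly, so differences in spelling, capitalization, or trailing spaces leave a feature without a value.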

If you take a look at the pumps table it already has a death_count column but that column is for the number of deaths within 100 meters of the pump. Let's relabel the name of that column so we don't get confused.

In [None]:
pumps.relabel('death_count','death_count_100m') 
pumps

Now we need to define a function to compute the number of death points within each GeoJSON polygon. For that we can borrow from the **getDeathCountForPumps** function we just used above and the **geojsonPointInPoly** function from HW8.

In [None]:
def getDeathCountForPumpVoronoiPolys(x, y, num_deaths, pump_name): 
    # Get the number of deaths inside the voronoi polygon of a water pump
   
    # Set the count of deaths to the current number of deaths
    count = num_deaths
    
    found_poly = False
    # Loop over the voronoi polygons to find the match by pump name
    for feature in pumps_voronoi_geojson_data['features']:
        if feature['properties']['Name'] == pump_name:
 
            pump_poly = shape(feature['geometry'])
            found_poly = True
            break  # no need to continue in this for loop as we found it

    if found_poly == True:
        found_poly = False  # reset just in case
 
        # Compare the pump Voronoi polygon to each death point
        # (lon/lat names avoid shadowing the function's x, y arguments)
        for lon, lat, num_cases in each_death.select(["lon", "lat", "Num_Cases"]).rows:
            death_point = Point(lon, lat)

            if pump_poly.contains(death_point):
                
                # If the point is within the polygon 
                count = count + 1    #increment the death count for that pump
 
    return count

 

### QUESTION 27
In the code cell below use the **getDeathCountForPumpVoronoiPolys** function to add the number of deaths within each Voronoi polygon to the pumps table in a column labelled **'death_count_vorpoly'**.

In [None]:
## INPUT your code for Question 27 below
## Hint refer back to Q22


### QUESTION 28.A
In the code cell below add code to review the count of deaths within the Voronoi polygon of each pump. Use print statements where appropriate to display the results for the questions listed below in comments.

In [None]:
## INPUT your code for Question 28 here

# Does the sum of the deaths per pump Voronoi polygon equal the total number of deaths in the each_death table?
 

# Display all rows in the pumps table in sorted order by number of deaths in the Voronoi polygon, high to low
 

# Display a sorted (descending) horizontal bar chart showing the names of the pumps on the Y axis 
# and the count of deaths within the Voronoi Polygon of the pump.


# Percent of total deaths in the top 3 Voronoi polygons with the most deaths


# Percent of total deaths in each of the top 3 Voronoi polygons with the most deaths


 

### QUESTION 28.B
- A. Are the 3 pumps with the highest death counts within the Voronoi polygons the same pumps as those for the buffer polygons? 
- B. Are they in the same order with the same relative percentages of deaths?
- C. Explain any differences you observe.

** Double-click here to input your answer.** 

### Choropleth Map of the Voronoi Polygons
You can use the pump Voronoi polygons to create a choropleth map of the deaths near each pump where the colors represent the count of deaths.  This process is a bit more complex using folium.Map rather than the datascience.Map class.  Therefore, the basic code for this is shown below.

In [None]:
## Code to draw folium choropleth map

# Define our map center point
ctr_lat = np.mean(each_death['lat'])
ctr_lon = np.mean(each_death['lon'])

# Draw the map
m = folium.Map([ctr_lat, ctr_lon], zoom_start=15, width=650, height=500)

# folium requires a pandas data frame
pumps_df = pumps.select(['Name','death_count_vorpoly']).to_df()

m.geo_json(
    geo_str=json.dumps(pumps_voronoi_geojson_data), 
    data=pumps_df, 
    #threshold_scale=[],
    columns=['Name', 'death_count_vorpoly'], 
    fill_color='YlOrRd', 
    fill_opacity=0.85,
    key_on='feature.properties.Name',
)

# Create the map. 
m.create_map('vor_map.html')

# Show the map in-line
m  # DON'T CHANGE THIS LINE - folium geojson maps render differently in notebook

### QUESTION 29
In the code cell below revise the above code to enhance the Voronoi polygon choropleth map. You need to:
- A. Add circle markers for the death locations under the Voronoi polygons
- B. Add circle markers with descriptive popups for the pumps above the polygons 
- C. Add a custom threshold scale of your choice (in terms of number of classes and classification method)
- D. Display a map with thoughtfully selected colors that highlights the trend in the data.

In [None]:
## INPUT YOUR Code for Question 29 below

 

## Section 5. Density Analysis

In a sense, the buffer and voronoi choropleth maps we created above represent ways in which we can begin to consider the density of cholera deaths near each point. But they are maps of counts not density. In order to create a density map we must normalize the counts by a unit of area. 
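As a small worked illustration of normalizing a count by area, the sketch below computes deaths per hectare for two invented rectangular zones. The `polygon_area` helper (shoelace formula), the zone names, and all numbers are hypothetical; the coordinates are assumed to be in a projected CRS measured in meters, since areas computed from longitude and latitude would not be in usable units.

```python
def polygon_area(coords):
    """Area of a simple polygon via the shoelace formula (coords in meters)."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Two made-up zones with made-up counts
zones = {
    "small": {"coords": [(0, 0), (100, 0), (100, 100), (0, 100)], "count": 20},
    "large": {"coords": [(0, 0), (400, 0), (400, 200), (0, 200)], "count": 40},
}

for name, zone in zones.items():
    area_ha = polygon_area(zone["coords"]) / 10_000  # square meters to hectares
    zone["density"] = zone["count"] / area_ha
    print(name, zone["count"], area_ha, zone["density"])
```

In this toy case the zone with the larger count has the lower density, which is the kind of reversal that a map of raw counts can hide.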



### QUESTION 30

In the code cell below use comments to outline the coding steps (pseudocode) for creating a density map of the count of deaths in the Voronoi polygons.


In [None]:
### Input your comments below for Question 30



### QUESTION 31

Another way to map density, as we saw in HW7, is with Kernel Density Estimation, or KDE, which creates a continuous probability density surface from the data.

Complete the code block below to use Seaborn's KDEplot method to create a map of the density of all cholera deaths in the **each_death** table. The most important parameter that the data scientist needs to supply is the **bandwidth**, or **bw**. The bandwidth is a smoothing parameter that indicates the distance within which we want to consider neighboring points. Be sure to give your KDEplot a bandwidth that reveals the hotspots of deaths without oversmoothing the data.
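To build intuition for what the bandwidth does, here is a minimal one-dimensional Gaussian KDE written with the standard library only. It is a simplified sketch, not Seaborn's implementation: the map in this question is two-dimensional, and the function name `gaussian_kde_1d` and the data values are invented for illustration.

```python
import math

def gaussian_kde_1d(x, data, bw):
    """Density estimate at x: the average of Gaussian bumps of width bw,
    one centered on each data value."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / bw) ** 2) for d in data)

# Made-up data with two clusters (near 1 and near 5) and a gap around 3
data = [1.0, 1.2, 0.9, 5.0, 5.1]

for bw in (0.2, 3.0):
    at_cluster = gaussian_kde_1d(1.0, data, bw)
    at_gap = gaussian_kde_1d(3.0, data, bw)
    print("bw =", bw, "cluster:", round(at_cluster, 4), "gap:", round(at_gap, 4))
```

With the small bandwidth the estimate in the gap is essentially zero and the two clusters stay distinct; with the large bandwidth the bumps merge and the gap scores at least as high as a cluster, which is what oversmoothing looks like.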


In [None]:
## Input your code below where it says YOUR CODE HERE

# Set up the figure
fig, ax = plt.subplots(1, figsize=(7,5))
 
# Set the bandwidth parameter
bandwidth= ... # YOUR CODE HERE  

# add all death points as our background context using ax.scatter and color them red
## YOUR CODE HERE
 

# Create the KDE map of the death locations 
# with the following parameters cmap="Blues", shade=True, shade_lowest=False
## YOUR CODE HERE

# Add the Broad Street Pump ONLY - color it yellow
## YOUR CODE HERE

##---------------------------------
## NO CHANGES BELOW
##---------------------------------

# Remove axes ticks (e.g., coordinates displayed on x and y axes)
ax.set_xticks([])
ax.set_yticks([])

# Add title
ax.set_title("Density map of Cholera Deaths")

# Keep axes proportionate
plt.axis('equal')
fig.tight_layout()

# Draw map
plt.show() 
 

### QUESTION 31.B
Does the Broad Street Pump display on the plot near the location of highest density?

** Double-click here to input your answer.** 

## Section 6. Additional Questions

### QUESTION 32
Discuss the different approaches to calculating and visualizing the distribution of the death locations in the study area - buffers, Voronoi polygons, and KDE plots. How do the different methods use the death location data differently?
How do they implement different notions of *near*? What advantage does the KDE method have that the others don't in terms of pinpointing hotspots without any knowledge of their cause?

** Double-click here to input your answer.** 

### QUESTION 33
Mean centers, standard deviational distances, buffers, and Voronoi polygons are based on straight-line, or Euclidean, distance. With reference to the John Snow data, explain the shortcomings of geospatial analysis based on Euclidean distance. Can you suggest a better measure of distance for these data?

** Double-click here to input your answer.** 


### QUESTION 34
In the above questions we use the count of deaths at the address locations to determine the density of deaths per unit area. We assume that a higher density indicates closer proximity to the source of the cholera outbreak.  But we are missing some critical information about the study location. What might that data be and how could you use that data to improve this analysis?

** Double-click here to input your answer.** 

### QUESTION 35
Our maps of the **deaths** and **each_death** tables look the same but the death locations are represented differently. Why do they look the same? What new visual variables, or techniques, do we use to overcome this shortcoming and understand the number of deaths at a specific location?

** Double-click here to input your answer.** 



### Congratulations! 
You're done! You made it to the end of the final project and of ESPM88a.