# HW7 Exploring Point Data

In this exercise we try several methods for exploring geographic entities represented as points. Points can be used to represent fixed entities like fire hydrants, buildings, storm drains, street light poles, trees, mountain summits, etc. Points can also be used to represent the location of events or occurrences, like where a crime occurred, a person lives, lightning struck, an earthquake was epicentered, or an endangered kit fox was spotted. The methods we review in this exercise focus on understanding the location of the points in space. Questions we explore include: Where are the points centered? What is the spread of the points in the study area? What is the density of points throughout the study area? Do the point locations seem to be clustered, random, or dispersed?

This exercise is a gentle introduction to **point pattern analysis**, an important area within spatial statistics.  The goal is to get you to see the potential of these methods and inspire you to want to learn more!

We will explore these questions by looking at San Francisco crime incidents from January 2016. We obtained these data from the [San Francisco OpenData Portal](https://data.sfgov.org/).

In [None]:
# Import libraries - don't change
from datascience import *
from osgeo import ogr
import numpy as np
from scipy import spatial as sp
from scipy import reshape, sqrt, identity
import math as math
import folium

# Simple function to show folium maps inline
from IPython.display import HTML

def inline_map(m, height=500):
    """Takes a folium instance and embed HTML."""
    m._build_map()
    srcdoc = m.HTML.replace('"', '&quot;')
    embed = HTML('<iframe srcdoc="{0}" '
                 'style="width: 100%; height: {1}px; '
                 'border: none"></iframe>'.format(srcdoc, height))
    return embed

## Working with Shapefiles in Python
Much of the freely available spatial data that you find online is in the ESRI Shapefile format. There are a number of different modules for reading shapefiles in Python. We will work with the OGR/GDAL libraries. While not the easiest to use, they are the most flexible in that you can use them to read just about any spatial data file format.

Let's begin by opening the shapefile of crime data, seeing what coordinate reference system it uses, and what attribute data it includes to describe the points. Attribute data are the values of fields, also called columns, that describe each record - also called a row.

In [None]:
# Identify the name of the shapefile of point data
shpfile=r'sfcrime_jan2016.shp' # Crime incidents in San Francisco, Jan 2016

In [None]:
# Open the file and see the CRS and attribute data (columns)
ds=ogr.Open(shpfile)
lyr=ds.GetLayer()
lyr_name = lyr.GetName()
the_proj = lyr.GetSpatialRef().ExportToWkt()
print("The Coordinate reference system for [%s] is: \n\n %s" % (lyr_name,the_proj))

dfn=lyr.GetLayerDefn()
nfields=dfn.GetFieldCount()
fields=[]

for i in range(nfields):
    field_name = dfn.GetFieldDefn(i).GetName()
    fields.append(field_name)
    
ds.Destroy() # close the file handle

print("\nHere are the fields in the layer [%s]:" % (lyr_name))
fields

We can see from above that the **coordinate reference system**, or CRS, for these data is **WGS84**. WGS84 is a geographic coordinate system that represents locations on a 3D ellipsoidal model of the earth. We need to keep that in mind as some operations on these data will require 2D projected coordinates.

We also see that these data include a number of interesting attributes including crime **category** which indicates the type of crime, police district (**PdDistrict**) and day of week, among others. Any of these attributes can be used to explore the data. Reviewing the attribute data associated with the spatial data is always a good place to start your analysis. You can see if there are attributes that suggest certain questions you may want to consider or if the attributes you wished were there are not. For example, we could not use these data to explore repeat offenders because these data don't include that information.

What is missing from the attribute data are the geographic coordinates - **latitude** and **longitude**. Since we will want to store the data in a table and map the data using folium, we need to extract the coordinates from the shapefile. We do this below when we read the data into a table.



## Read shapefile data into a Table
We can use OGR to read each feature (geometry and attributes) from the shapefile and save those data to an array that we will then use to create a table. We will add the latitude and longitude coordinates to the table when we load the data into the table. We will call these columns **lat** and **lon** which is a common convention. 

In [None]:
ds=ogr.Open(shpfile)
lyr=ds.GetLayer()
my_attributes = [] # create a list to hold the attribute data
for feature in lyr: # for each feature in the shapefile
    att_data=feature.items()  # get the attribute data 
    geometry = feature.GetGeometryRef() # get the geometry data 
 
    att_data['lon'] = geometry.GetX()   # get the X coordinate - longitude, since the CRS is WGS84 
    att_data['lat'] = geometry.GetY()   # get the Y coordinate - latitude
 
    my_attributes.append(att_data)  # add the current point feature to our list of features

# close the file
ds.Destroy() 

# Now load the data into a table called sfcrimes
sfcrimes = Table.from_records(my_attributes)  
 

Now take a look at the table to make sure the latitude and longitude values were properly added to the table. A common mistake is to put lats in the lon column etc.

In [None]:
# View the table
sfcrimes

### QUESTION 0
Approximately, what is the range of latitude and longitude coordinates for San Francisco?

**Double-click here to input your answer**

## Exploring the point data

Exploring spatial data begins with mapping it. This gives a visual summary of the extent and concentration of locations. We can also describe the point locations by calculating the center of the locations and how the locations are spread around this center.


## Mapping the Center
Let's calculate and map the **mean center** of the crime data for San Francisco during this time period.  The mean center is simply the point whose coordinates are equal to the average of the X or longitude values and the average of the Y or latitude values.

We will use **folium** rather than the datascience Maps module to map our data because of its greater functionality.

In [None]:
# Calculate the mean center of all SF Crimes
mean_ctr_lat = np.mean(sfcrimes['lat'])  # The average of all latitude values in the array
mean_ctr_lon = np.mean(sfcrimes['lon'])  # The average of all longitude values in the array
print("Mean center of SF Crimes is at (lon, lat): %.5f, %.5f" % (mean_ctr_lon, mean_ctr_lat))

### Adding Points to the Map
First, let's create a function to add points to the map. We can then apply this function to all rows in the table. This function will allow us to specify text for the popup box and a color for the marker icon. We can also indicate if we want to use cluster markers.

In [None]:
def mapMyPoint(the_map, lat,lon, popupContent, m_color='blue', isClustered=False):
 
    the_map.simple_marker(location=(lat,lon), popup=popupContent, marker_color=m_color, clustered_marker=isClustered)
    

### Mapping Crime Locations and the Mean Center
Now we can create a map showing crime locations and the mean center point of these locations. Since there are a lot of crimes we will map the locations using cluster markers. *Note: this can be slow when there are a lot of points!*

In [None]:
# Map crime locations as cluster markers, then map the mean center point
m = folium.Map([mean_ctr_lat, mean_ctr_lon], zoom_start=12) #center map on the mean center
 
# Map the crimes as cluster markers
sfcrimes.apply(lambda lat,lon , thePopup: mapMyPoint(m, lat,lon, thePopup, 'blue', True), ['lat','lon','Descript'])

# Add a marker for the mean center
m.simple_marker(location=(mean_ctr_lat,mean_ctr_lon), popup="mean center of sf crimes")
 
# Draw the map
inline_map(m)

### Exploring Crime Locations by Category

We see from the map above that the mean center of all crimes in January 2016 was in the northeast section of the Mission District. Since this is the average of the coordinates for all crimes, it is interesting but not very informative. We don't yet know what crime categories are included in this data set. Most people are interested in the location of specific types of crimes, such as violent crimes, rather than crimes in general. To get a better idea of where crime is located in San Francisco, let's explore crime by category. First, let's summarize the crime data to see the counts of crime incidents by category for this time period.

In [None]:
# What are the different categories of crime and the count of incidents for each one?
crime_cats = sfcrimes.select(('IncidntNum','Category')).group('Category',len).sort('IncidntNum len', descending=True)
crime_cats = crime_cats.relabel('IncidntNum len', 'cat_count')
crime_cats.show()

We see above that LARCENY/THEFT, VEHICLE THEFT, ASSAULT, BURGLARY, DRUG/NARCOTIC, MISSING PERSON, ROBBERY, and PROSTITUTION are some of the violent and property crimes that are often the focus of news reports and the ones people are most concerned about.  Let's subset the data on these *bad crimes* and see how the mean center for this subset differs from that for all crime incident locations.  

Note: in this exercise we use the terms crimes and crime incidents interchangeably.

In [None]:
# Use where function to subset the data
badcrimes = sfcrimes.where((sfcrimes['Category']=='LARCENY/THEFT') 
                           | (sfcrimes['Category']=='VEHICLE THEFT')  # or category equals vehicle theft
                           | (sfcrimes['Category']=='ASSAULT')        # when where has compound clauses we use this
                           | (sfcrimes['Category']=='BURGLARY')       # syntax
                           | (sfcrimes['Category']=='DRUG/NARCOTIC') 
                           | (sfcrimes['Category']=='MISSING PERSON')
                           | (sfcrimes['Category']=='ROBBERY')
                           | (sfcrimes['Category']=='PROSTITUTION'))

In [None]:
# Calculate the mean center of badcrimes
bad_mean_ctr_lat = np.mean(badcrimes['lat'])  # find the average of all latitude values in the array
bad_mean_ctr_lon = np.mean(badcrimes['lon'])  # find the average of all longitude values in the array

# center map on the mean center of all crimes
m = folium.Map([mean_ctr_lat, mean_ctr_lon], zoom_start=15)

# Add a marker for the mean center of all SF crimes
m.simple_marker(location=(mean_ctr_lat,mean_ctr_lon), popup="mean center of sf crimes")

# Add a marker for the mean center of all BAD SF crimes
m.simple_marker(location=(bad_mean_ctr_lat,bad_mean_ctr_lon), popup="mean center of BAD crimes", marker_color='black')

# Draw the map
inline_map(m)

You can see above that there is a slight movement in the mean center of *bad crimes* towards Market Street.

### Comparing Mean Centers
We drill down in the data by mapping and comparing the mean centers of specific crime types. Below we map the mean center of all crime incidents and the mean center of LARCENY/THEFT crime incidents.  We use different map symbology, **marker colors**, and **marker popups** to keep track of the different crime types being mapped.

In [None]:
# Map the mean center of all crime locations and the mean center of LARCENY/THEFT crime incidents
m = folium.Map([mean_ctr_lat, mean_ctr_lon], zoom_start=14) #center map on the mean center

larceny_ctr_lat = np.mean(sfcrimes.where('Category','LARCENY/THEFT')['lat'])  # Note simple where syntax
larceny_ctr_lon = np.mean(sfcrimes.where('Category','LARCENY/THEFT')['lon'])

# Add a marker for the mean center of LARCENY/THEFT crime incidents
m.simple_marker(location=(larceny_ctr_lat,larceny_ctr_lon), popup="mean center of LARCENY/THEFT crimes", marker_color='red')

# Add a marker for the mean center of all crime incidents
m.simple_marker(location=(mean_ctr_lat,mean_ctr_lon), popup="mean center of sf crimes")

# Add a marker for the mean center of all BAD SF crime incidents
m.simple_marker(location=(bad_mean_ctr_lat,bad_mean_ctr_lon), popup="mean center of BAD crimes", marker_color='black')
inline_map(m)


The map above indicates that the center of LARCENY/THEFT crimes is north of Market Street while the centers of all crimes and of bad crimes are south of Market.

### QUESTION 1

In the code blocks below
- A. Complete the function **mapMeanCenterPoint** to calculate the mean center and add it to a map. 
- B. Use the function to create ONE map that shows the mean centers of the following crime categories: all SF crimes, DRUG/NARCOTIC crimes, ASSAULT crimes, and MISSING PERSON CRIMES. Give a different color to each marker and add a popup box with text that indicates the crime category.  
- C. Zoom in on the map to see the named place closest to the center of MISSING PERSON crimes. Use a print statement to display the name of that place. 

In [None]:
# Your code here for Question 1.A

def mapMeanCenterPoint(the_map, lats ,lons, popupContent, m_color='blue'):
    #mc_lat = ....
    #mc_lon = ...    

    the_map.simple_marker(location=(mc_lat,mc_lon), popup=popupContent, marker_color=m_color)

In [None]:
# Your code here for Question 1.B and 1.C 

# Create Map to show mean center points
m = folium.Map([mean_ctr_lat,mean_ctr_lon], zoom_start=12) #center map on the mean center of all SF crimes

# Mean center of all crimes
#mapMeanCenterPoint(....)

# Mean center of DRUG/NARCOTIC crimes
#mapMeanCenterPoint(....)

# Mean Center of ASSAULT crimes
#mapMeanCenterPoint(....)

# Mean center of MISSING PERSON crimes
#mapMeanCenterPoint(....)

# Display the map
inline_map(m)

print("The named place (building/location) on the map closest to the mean center of MISSING PERSON crimes is....")

## Mapping Spread

The mean center tells us the center of a set of locations in terms of the average of the X and Y coordinates. The **minimum bounding box** or **MBB** describes the distribution or spread of the locations around the mean center. It is similar to the range for a set of numbers. The MBB is the smallest bounding rectangle that encloses all of the points in the set. It can be calculated from the minimum and maximum values for the X and Y coordinates.

### Mapping the Minimum Bounding Box

First, we define a function to get the minimum bounding box coordinates from a set of points input as arrays of X and Y coordinates.

In [None]:
def getMBB(x_coords,y_coords):
    # Get the coordinates of the Minimum Bounding box
    
    # Get the min and max coordinate values
    x_max = max(x_coords)
    x_min = min(x_coords)
    y_max = max(y_coords)
    y_min = min(y_coords)

    # Determine the bounding points from the min and max coordinate values
    # ul = upper left point, ur = upper right
    # ll = lower left, lr = lower right
    # Each point is stored as (y, x), i.e. (lat, lon), the order folium expects
    ul_xy =  (y_max, x_min)
    ur_xy =  (y_max, x_max)
    ll_xy =  (y_min, x_min)
    lr_xy =  (y_min, x_max)

    # Return the MBB coordinates
    return [ul_xy, ur_xy, lr_xy, ll_xy, ul_xy]

We can then map the MBB and the Mean Center of all crime incidents in SF.

In [None]:
#Map of mean center and MBB of all SF Crimes

m = folium.Map([mean_ctr_lat, mean_ctr_lon], zoom_start=12) #center map on the mean center

# Add MEAN CENTER point for all SF crime- default blue
m.simple_marker(location=(mean_ctr_lat, mean_ctr_lon), popup="mean center of sf crimes")

# Compute and then add the MBB for all SF crimes
box_coords = getMBB(sfcrimes['lon'], sfcrimes['lat'])
m.line(locations=box_coords, popup="BBox of All SF Crime", line_color='blue', line_opacity=1.0)

inline_map(m)

### QUESTION 2

In the code block below create ONE map showing the minimum bounding boxes of the following four crime categories: all SF crimes, DRUG/NARCOTIC crimes, ASSAULT crimes, and MISSING PERSON crimes. Use the same colors that you used above. Add a popup box that indicates the crime category.


In [None]:
# INPUT YOUR CODE FOR Question 2 here

## Standard Deviations of the X and Y Coordinates

Another way to describe the distribution of points around the mean center is to create a bounding box based on the standard deviations of the X and Y coordinates. This box can give us an estimate of where most of the crimes are occurring. We can compute these values as we would compute the standard deviation of any set of numbers. We can then create a bounding polygon, or bounding box, based on these values and the mean center. This process is shown below.
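To see why a standard deviation box can be a useful complement to the minimum bounding box, here is a small sketch using made-up one-dimensional coordinates (the values and the variable names are assumptions for illustration, not part of the crime data). It compares how much the range and the standard deviation change when a single far-away point is added.

```python
import numpy as np

# Hypothetical projected X coordinates in meters (not the SF crime data)
rng = np.random.default_rng(42)
x = rng.normal(0, 100, 500)

# Add one distant point, e.g. a single incident far from all the others
x_with_outlier = np.append(x, 1000.0)

# The range is what the minimum bounding box is built from
range_before = x.max() - x.min()
range_after = x_with_outlier.max() - x_with_outlier.min()

# The standard deviation is what the standard deviation box is built from
std_before = x.std()
std_after = x_with_outlier.std()

print("Range grew by a factor of   %.2f" % (range_after / range_before))
print("Std dev grew by a factor of %.2f" % (std_after / std_before))
```

In this toy case the range changes far more than the standard deviation does. The standard deviation is still affected by extreme values, just less dramatically than the range.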

In [None]:
# Function to calculate the bounding coordinates (or bounding box) 
# from the standard deviations of the x and y coordinates
def getStdDevBox(x_coords,y_coords):
    x_mean = np.mean(x_coords)
    y_mean= np.mean(y_coords)
    x_std = np.std(x_coords)  
    y_std = np.std(y_coords)
    
    ul_xy = (y_mean+y_std, x_mean-x_std)
    ur_xy = (y_mean+y_std, x_mean+x_std)
    ll_xy = (y_mean-y_std, x_mean-x_std)
    lr_xy = (y_mean-y_std, x_mean+x_std)
    
    return [ul_xy,ur_xy,lr_xy,ll_xy,ul_xy]


Now that we have a function to easily compute a standard deviation bounding box we can apply it to our badcrimes table.

In [None]:
# Draw the mean center, MBB, and standard deviation bounding box for all sf bad crimes

m = folium.Map([mean_ctr_lat, mean_ctr_lon], zoom_start=12) #center map on the mean center

# Add MEAN CENTER point for bad crime - black
m.simple_marker(location=(bad_mean_ctr_lat, bad_mean_ctr_lon), popup="mean center of sf bad crimes", marker_color='black')

# Compute and then add the MBB for bad crimes
box_coords = getMBB(badcrimes['lon'], badcrimes['lat'])
m.line(locations=box_coords, popup="BBox of Bad Crime", line_color='black', line_opacity=1.0)

# Compute and then add the STD Dev Box for bad crimes
std_box = getStdDevBox(badcrimes['lon'] ,badcrimes['lat'])
m.line(locations=std_box, line_color="black", popup="StdDev Box of bad crimes")

# Draw the map
inline_map(m)


The map above is starting to give us a much richer description of *bad crime* locations in San Francisco.

### QUESTION 3

We can use the standard deviation bounding box to compare different crime types and get a sense of which ones are more concentrated vs dispersed throughout the city.

In the cell below create ONE map showing the **mean centers** and **standard deviation bounding boxes** for the following crime types: DRUG/NARCOTIC, ASSAULT, PROSTITUTION and BURGLARY crimes. Use different colors for each one - the same colors that you used above if applicable. Add popups to both the centers and boxes.

In [None]:
# Your code for Question 3 here


### QUESTION 4

Make some summary observations about the distribution of the locations of the different crime types you mapped above. Which crime type seems to be most clustered? Which one the most dispersed? Do any of the boxes indicate a directional trend in the crime incident locations?

**Double-click here to input your answer**

## Exploring Distance
### Are the crime incident locations clustered?

The map above suggests that some of the crime categories are more tightly clustered than others. This is potentially very useful information for a police department and the city residents as a whole.

However, this process can be very time consuming. Let's say we want to consider more crime types or weekdays vs weekends or seasons or night versus day. That's a lot of maps to make and compare. It would be useful to have a single-value summary statistic that could help us narrow down areas for further investigation and analysis.

The **Nearest Neighbor Index (NNI)** can be used to describe whether the locations of a set of points, such as crime locations by crime type, are clustered. The **NNI** is a numeric summary statistic that ranges from 0 to 2.15. NNI is based on the average distance between each point and its nearest neighbor. This observed average distance is compared to the average distance we would expect if the points were randomly distributed in the study area. By study area we mean the boundary of the area for which we have data and within which any of the data points could have been located if the process generating the points were random. In other words, our study area is the city of San Francisco (excluding islands) and the assumption is that a crime could occur anywhere in San Francisco.

If the observed point pattern is random then the NNI statistic would be close to 1. If the points are clustered, the average observed nearest neighbor distance would be smaller than the expected average distance and thus the NNI statistic would move closer to 0 with increased clustering. On the other hand, if the locations were dispersed, the NNI statistic would be larger than 1 and move toward its maximum of about 2.15 with increased dispersion. 
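To make these ranges concrete, the following sketch computes the NNI for three synthetic point sets in a hypothetical 1000 m by 1000 m square: random, tightly clustered, and a regular grid. The helper name `toy_nni` and all of the points are assumptions for illustration. Unlike the crime analysis in this notebook, the study area here is known in advance rather than taken from a bounding box.

```python
import numpy as np
from scipy import spatial

def toy_nni(points, area):
    """Observed mean nearest-neighbor distance divided by the expected one."""
    dm = spatial.distance_matrix(points, points)
    np.fill_diagonal(dm, np.inf)      # ignore each point's distance to itself
    observed = dm.min(axis=1).mean()  # mean distance to the nearest other point
    expected = 0.5 / np.sqrt(len(points) / area)
    return observed / expected

rng = np.random.default_rng(0)
side = 1000.0          # hypothetical square study area, in meters
area = side * side

# 400 points in each pattern
random_pts = rng.uniform(0, side, size=(400, 2))
clustered_pts = rng.normal(loc=side / 2, scale=10.0, size=(400, 2))
gx, gy = np.meshgrid(np.linspace(25, 975, 20), np.linspace(25, 975, 20))
grid_pts = np.column_stack((gx.ravel(), gy.ravel()))

nni_random = toy_nni(random_pts, area)
nni_clustered = toy_nni(clustered_pts, area)
nni_grid = toy_nni(grid_pts, area)

print("random:    %.2f" % nni_random)
print("clustered: %.2f" % nni_clustered)
print("grid:      %.2f" % nni_grid)
```

The random pattern should land near 1, the clustered pattern near 0, and the square grid at about 2. A square grid does not reach the theoretical maximum of 2.15, which corresponds to a hexagonal arrangement.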

#### Adding projected coordinates to the Table

The NNI is based on distance calculations. These distance calculations require 2D coordinates so we cannot use latitude and longitude. We need to compute X and Y coordinates for each point in a projected coordinate reference system and add those coordinates to the sfcrimes table.  We can do this using the process we used in Homework exercise 4 with the pyproj and shapely libraries.  By convention we label the columns with projected coordinates **X** and **Y** and geographic coordinates **lon** and **lat**.
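One way to see why degrees are a poor unit for distance is to estimate how many meters one degree covers at San Francisco's latitude. The sketch below assumes a spherical earth with a mean radius of 6,371 km, which is only an approximation, and the function name `meters_per_degree` is made up for this illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation

def meters_per_degree(lat_deg):
    """Approximate ground length of one degree of latitude and of longitude."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    # Meridians converge toward the poles, so a degree of longitude shrinks with cos(latitude)
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lon

lat_m, lon_m = meters_per_degree(37.77)  # roughly the latitude of San Francisco
print("1 degree of latitude  ~ %.0f m" % lat_m)
print("1 degree of longitude ~ %.0f m" % lon_m)
```

Because a degree of longitude is noticeably shorter than a degree of latitude here, a Euclidean distance computed directly on lon/lat values would distort east-west distances relative to north-south ones.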

### QUESTION 5
Complete the function below to compute X and Y coordinates in a projected coordinate reference system (CRS). Update only the lines that say YOUR CODE HERE. Below we are using the CRS UTM Zone 10N, WGS84 (EPSG: 32610). Look back to how you did this in Homework Exercise 4 for reference. You can look up the details of this CRS by EPSG code on [spatialreference.org](http://spatialreference.org).

In [None]:
from functools import partial
import pyproj
from shapely.geometry import Point
from shapely.ops import transform

transformTo32610 = partial(
    pyproj.transform,
    pyproj.Proj(init='epsg:4326'),  # source coordinate system - WGS 84, EPSG:4326
    pyproj.Proj(init='epsg:32610'))  # destination coordinate system - UTM zone 10N, WGS84 


def get_utm_coords(lon, lat):
    
    # Create a Shapely Point from the lat and lon coordinates
    geo_geom = # YOUR CODE HERE 
    
    # Transform lat, lon point to UTM 10N
    utm_geom = # YOUR CODE HERE
    
    # return the x and y coordinates as a tuple
    return (utm_geom.x, utm_geom.y)


Before applying the function to the sfcrimes table, you can test it on one or two points in the table to make sure it works.

In [None]:
# Test the function on one point
get_utm_coords(sfcrimes['lon'][0], sfcrimes['lat'][0])

Now we add the UTM10 coordinates to the sfcrimes table.

In [None]:
# Add the XY coords to the sfcrimes table
sfcrimes['XY_coords'] = sfcrimes.apply(lambda x, y: get_utm_coords(x,y), ['lon', 'lat']) # add a column with XY values
sfcrimes['X'] = sfcrimes['XY_coords'][:,0] # now add a column containing just the X coordinates
sfcrimes['Y'] = sfcrimes['XY_coords'][:,1] # and a Y column
sfcrimes # view the table

### Computing the NNI
The first step in computing the NNI is to compute the **expected nearest neighbor distance**. This is the value we would expect if points were randomly distributed in the study area. It is based on the number of points within the study area and the area of the study area. Its calculation is shown below.
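Written as a formula, with $n$ for the number of points and $A$ for the size of the study area (notation chosen here for illustration), the expected distance and the index that is built from it are:

$$
\bar{d}_{\text{exp}} = \frac{0.5}{\sqrt{n / A}}, \qquad \text{NNI} = \frac{\bar{d}_{\text{obs}}}{\bar{d}_{\text{exp}}}
$$

As a worked example with made-up numbers, 400 points in a study area of 1,000,000 square meters give a density of $n / A = 0.0004$ points per square meter, so $\bar{d}_{\text{exp}} = 0.5 / \sqrt{0.0004} = 0.5 / 0.02 = 25$ meters.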

In [None]:
# Compute expected nearest neighbor distance 
# This is .5 / math.sqrt(number of pts/area)
# DON'T CHANGE THIS CELL

# What is the density of events in the study area?
# This is equal to the number of points divided by the study area size (area sq meters)
# Data must be in projected coordinates!

numpts = sfcrimes.num_rows  # Tells us the number of points (assuming no empty rows)
                            # np_array.size is what we would use to get the length of an array

# Get the area of the study area from the bounding box coordinates of the PROJECTED coordinates
study_area_height = max(sfcrimes['Y']) - min(sfcrimes['Y'])
study_area_width = max(sfcrimes['X']) - min(sfcrimes['X'])
sarea = study_area_height * study_area_width

# Calculate the Expected Nearest Neighbor distance from the number of points in the study area
expected_nn_dist = .5 / math.sqrt(numpts/sarea)

print('Expected NN distance: ', expected_nn_dist)

The next step in computing NNI is to create a **distance matrix** of the distance between *each point and all other points*. As you can imagine, that's a lot of distances with a large point set, so this operation can be slow. Because this metric is based on Euclidean distances we must use projected map coordinates and not longitude and latitude values. For this data set, the projected coordinates are in the **X** and **Y** columns. We use the **[distance_matrix](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html#scipy.spatial.distance_matrix)** function from the **Scipy Spatial** library to compute the distance matrix.

In [None]:
# Compute the distance matrix for all crime locations
# This can take a while if lots of points!
x = sfcrimes['X']
y = sfcrimes['Y']
pts = np.column_stack((x,y)) # make a stack of points
my_distance_matrix = sp.distance_matrix(pts,pts) # comparing the point set to itself 
                                                 # to create a distance matrix

# Take a look at the Distance matrix
my_distance_matrix

Now that we have the distance matrix we need to select the distance between each point and its *NEAREST NEIGHBOR* point. We do this by sorting the matrix and selecting the first value for each point. We then compute the average observed nearest neighbor distance by dividing the sum of nearest neighbor distances by the number of points.
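The full distance matrix grows with the square of the number of points, so for larger data sets a spatial index can be a lighter alternative. The sketch below is not the method used in this notebook. It uses made-up points to check that `scipy.spatial.cKDTree` returns the same nearest neighbor distances as the sort-the-matrix approach.

```python
import numpy as np
from scipy import spatial

# Hypothetical projected coordinates in meters (not the SF crime data)
rng = np.random.default_rng(7)
pts = rng.uniform(0, 1000, size=(200, 2))

# Approach used in this notebook: full distance matrix,
# then the second-smallest value in each row (the smallest is distance to self)
dm = spatial.distance_matrix(pts, pts)
nn_from_matrix = np.sort(dm, axis=1)[:, 1]

# Alternative: query a k-d tree for the 2 closest points (self plus nearest neighbor)
tree = spatial.cKDTree(pts)
dists, _ = tree.query(pts, k=2)
nn_from_tree = dists[:, 1]

same = np.allclose(nn_from_matrix, nn_from_tree)
print("Both approaches agree:", same)
```

The tree query avoids storing every pairwise distance, which matters once the point count reaches the tens of thousands.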

In [None]:
# Compute observed mean nn distance using scipy distance_matrix
# Important - We only want the 1 nearest neighbor! not all NN
dist_matrix_1nn=[np.sort(elem)[1] for elem in my_distance_matrix ] # sort and grab the second lowest value
                                                                   # as lowest is always dist to self
dm_len = len(dist_matrix_1nn) # number of nearest neighbor distances (one per point)
dm_sum = sum(dist_matrix_1nn) # the sum of the nearest neighbor distances
obs_mean_nndist = dm_sum/dm_len # the average observed nn distance

print("Number of points ", dm_len)
print("Sum of nearest 1 distances: ", dm_sum)
print("Observed mean nearest neighbor distance: ", obs_mean_nndist, "meters.")

Now that we have computed the **observed** nearest neighbor distance and the **expected** nearest neighbor distance we can compute the **Nearest Neighbor Index**.

In [None]:
# Compute the Nearest Neighbor Index which is the ratio of observed to expected mean nn distance
# The NNI measures the spatial distribution from 0 (clustered pattern) to 
# 1 (random pattern) to 2.15 (regularly dispersed / uniform pattern):
nni = obs_mean_nndist / expected_nn_dist
print('Nearest Neighbor Index (NNI) for all SF Crimes: ', nni)

We can re-write the code above as a function so that we can apply it to the different categories of crime.
This is shown below in the **getNNI** function. This function takes arrays of PROJECTED x and y coordinates as input and returns the NNI. You can test it with the SF crime data to make sure you get the same result as above.

In [None]:
## Calculating the Nearest Neighbor Index of a set of points
def getNNI(x_coords, y_coords):
    
    # number of points in study area
    numpts = x_coords.size
    
    # Get the area of the study area from the bounding box coordinates of the PROJECTED coordinates
    # For these data this is the ENTIRE STUDY AREA (MBB) of San Francisco
    # regardless of the crime category
    study_area_height = max(sfcrimes['Y']) - min(sfcrimes['Y'])
    study_area_width = max(sfcrimes['X']) - min(sfcrimes['X'])
    study_area = study_area_height * study_area_width
    
    # Calculate the Expected Nearest Neighbor distance from the number of points in the study area
    expected_nn_dist = .5 / math.sqrt(numpts/study_area)
    
    # Compute the distance matrix for all crime locations
    # This can take a while if lots of points!
    pts = np.column_stack((x_coords,y_coords))
    my_distance_matrix = sp.distance_matrix(pts,pts) #comparing a point set to itself

    # Compute OBSERVED mean nn distance using scipy distance_matrix
    # Important - We only want the 1 nearest neighbor! not all NN
    dist_matrix_1nn=[np.sort(elem)[1] for elem in my_distance_matrix ] # sort and grab the second lowest value
                                                                   # as lowest is always dist to self
    dm_len = len(dist_matrix_1nn) # number of nearest neighbor distances (one per point)
    dm_sum = sum(dist_matrix_1nn) # the sum of the nearest neighbor distances
    
    obs_mean_nndist = dm_sum/dm_len # the average observed nn distance
    
    
    # Compute the Nearest Neighbor Index which is the ratio of observed to expected mean nn distance
    # The NNI measures the spatial distribution from 0 (clustered pattern) to 
    # 1 (random pattern) to 2.15 (regularly dispersed / uniform pattern):
    nni = obs_mean_nndist / expected_nn_dist
    
    return nni

### QUESTION 6
In the cell below use the getNNI function to compute the NNI for the following SF crime categories:
- DRUG/NARCOTIC
- ASSAULT
- BURGLARY
- PROSTITUTION

Use print statements to output the values.


In [None]:
# Input your answer to Question 6 here

# First compute the NNI for each crime type listed below

# Now print out the NNIs

### QUESTION 7

Comment on the different NNI results you computed above. What crime category is most clustered? Which is the least?

**Double-click here to input your answer to this question.**

## Mapping Density

The NNI values give us a sense of the overall pattern of point locations. They can indicate if the crime incident locations are clustered, but they don't tell us where the locations are clustered. For that we need a map, and we cannot map a summary statistic. However, we can create a density map, or **heat map**, using the method of [kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation), or **KDE**. KDE computes the density of incidents - the number per unit area. When applied to spatial data, a KDE map is like a smooth 2D histogram that estimates the probability density of locations.
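To illustrate the idea behind KDE, the sketch below evaluates a Gaussian kernel density estimate on a grid using only NumPy. It is a simplified illustration with made-up points, not what Seaborn does internally, and the function name `gaussian_kde_grid` is an assumption. Here the bandwidth is the standard deviation of each Gaussian kernel, in the same units as the coordinates.

```python
import numpy as np

def gaussian_kde_grid(x, y, bandwidth, grid_x, grid_y):
    """Evaluate a 2D Gaussian kernel density estimate on a regular grid."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    density = np.zeros_like(gx, dtype=float)
    # Place one Gaussian "bump" on every point and add the bumps together
    for px, py in zip(x, y):
        sq_dist = (gx - px) ** 2 + (gy - py) ** 2
        density += np.exp(-sq_dist / (2 * bandwidth ** 2))
    # Normalize so that the surface integrates to approximately 1
    density /= len(x) * 2 * np.pi * bandwidth ** 2
    return density

# Hypothetical projected coordinates in meters, concentrated around (500, 500)
rng = np.random.default_rng(1)
x = rng.normal(500, 50, 200)
y = rng.normal(500, 50, 200)

grid = np.linspace(0, 1000, 101)  # grid nodes every 10 m
dens = gaussian_kde_grid(x, y, bandwidth=40, grid_x=grid, grid_y=grid)

cell_area = (grid[1] - grid[0]) ** 2
total = dens.sum() * cell_area    # should be close to 1
peak_row, peak_col = np.unravel_index(dens.argmax(), dens.shape)

print("Density integrates to about %.3f" % total)
print("Highest density near X=%.0f, Y=%.0f" % (grid[peak_col], grid[peak_row]))
```

The resulting surface is highest where the points are concentrated, which is the kind of information a heat map displays.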

We will use the Seaborn visualization library to create the density maps. Seaborn makes pretty plots with few lines of code.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


### Density map of SF
Let's use Seaborn's `kdeplot` function to create a map of the density of all crime incidents in San Francisco. Run the code below to make the plot and then take a closer look at it (and the comments).

The most important parameter that the data scientist needs to supply is the **bandwidth**, or **bw**. The bandwidth is a smoothing parameter that indicates the distance within which we want to consider neighboring points. For example, if I am mapping hot spots of earthquake epicenters I might want a bandwidth of several kilometers. But when mapping crime I might want a bandwidth of one or two street blocks.

Since the distance calculations are Euclidean, we need to use projected, not geographic, coordinates. Similarly, the bandwidth parameter is specified in the units of the coordinate data.

In [None]:
# Set up the figure
fig, ax = plt.subplots(1, figsize=(6, 6))

# Add all crime points as our background context
ax.scatter(sfcrimes['X'],sfcrimes['Y'],color="white" )

# Set the bandwidth parameter
bandwidth=400 

# Create the KDE map
ax = sns.kdeplot(sfcrimes['X'], sfcrimes['Y'], bw=bandwidth, cmap="Blues", shade=True, shade_lowest=False)

# Remove axes ticks (e.g., coordinates displayed on x and y axes)
ax.set_xticks([])
ax.set_yticks([])

# Add title
ax.set_title("Density map of SF Crime Incidents, Jan 2016")

# Keep axes proportionate
plt.axis('equal')

# Draw map
plt.show() 
 
# Ignore warning if one displays below!

The KDE map confirms what we were seeing above with the mean center and standard deviation bounding boxes as well as the NNI values. Crime incidents appear to radiate out from the Market/Duboce Street area and seem more clustered than the white points showing the locations of all crime incidents suggest.

### Comparing KDE Maps

Let's create KDE maps for three crime types to see how they reveal the location of hot spots, or clustering. We will look at Burglary, Assault and Drug/Narcotic crimes.

In [None]:
# Display Three Crime KDEplots (called subplots)
 
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)

ax1.set_title('Burglary')
ax2.set_title('Assaults')
ax3.set_title('Drugs')
 

# Remove axes ticks (coordinates displayed on x and y axes)
ax1.set_xticks([])
ax1.set_yticks([])

# Set the bandwidth parameter
bandwidth=400 

# add all crime points as our background context
ax1.scatter(sfcrimes['X'],sfcrimes['Y'],color="white" )
ax2.scatter(sfcrimes['X'],sfcrimes['Y'],color="white" )
ax3.scatter(sfcrimes['X'],sfcrimes['Y'],color="white" )

sns.kdeplot(sfcrimes.where('Category','BURGLARY')['X'], sfcrimes.where('Category','BURGLARY')['Y'],  
            bw=bandwidth, cmap="Blues", shade=True, shade_lowest=False, ax=ax1)

 
sns.kdeplot(sfcrimes.where('Category','ASSAULT')['X'], sfcrimes.where('Category','ASSAULT')['Y'], 
            bw=bandwidth, cmap="Reds", shade=True, shade_lowest=False, ax=ax2 )
 
sns.kdeplot(sfcrimes.where('Category','DRUG/NARCOTIC')['X'], sfcrimes.where('Category','DRUG/NARCOTIC')['Y'], 
            bw=bandwidth, cmap="Greens", shade=True, shade_lowest=False, ax= ax3 )



### Question 8

- A. Are the KDE plots above consistent with the NNI values? 
- B. What more do the KDE plots indicate that the NNI, standard deviation boxes and mean centers did not? 
- C. What did those other measures and maps tell us that the KDE maps do not?

**Double-click here to input your answer to question 8**

### QUESTION 9

- A. Create the same KDE maps as above but try different bandwidth values. Show one of your maps in the cell below. 
- B. Describe what happens when you increase or decrease the value. How do you decide the appropriate bandwidth for a KDE map? 
- C. Why do you think I chose a bandwidth of 400 above? Is it a good choice or did you find a better one?
- D. What are the units of the projected coordinates for the point data?

**Double-click here to input your answers to 9.B - 9.D.**

In [None]:
# Input your answer to question 9.A here


## Drawing Conclusions 

It's tempting to say that we have located the hot spots of criminal activity in San Francisco as well as those areas where there is not much crime. In the latter case, the data mapped above suggest that there is very little crime happening in Golden Gate Park, in the areas near Lake Merced (southwest San Francisco), or in the Inner Sunset near UCSF and Twin Peaks. Yet these are also areas with lower population densities. Consequently, our maps and measures may be capturing population density rather than hot and cold spots of criminal activity. Next steps in this type of spatial analysis would include normalizing by population to see where there is a higher rate (rather than count or density) of crime incidents relative to the population.
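As a minimal sketch of what normalizing by population means, the example below converts incident counts to rates per 1,000 residents. The district names, counts, and populations are invented for illustration and do not come from the San Francisco data.

```python
# All numbers below are invented for illustration only
incident_counts = {"District A": 300, "District B": 120}
populations = {"District A": 60000, "District B": 8000}

# Rate = incidents per 1,000 residents
rates_per_1000 = {
    name: 1000.0 * incident_counts[name] / populations[name]
    for name in incident_counts
}

for name, rate in rates_per_1000.items():
    print("%s: %d incidents, %.1f per 1,000 residents"
          % (name, incident_counts[name], rate))
```

In this made-up case the district with fewer incidents has the higher rate, which is the kind of reversal a count-based density map cannot show.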