# Daredevil Demo
---

In this lab, we will explore the potential privacy concerns regarding location data that is supposedly anonymous. We will use a modified version of NYC Taxi data (which is made public and can be found [here](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)) and modified NYC complaints data (found [here](https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Map-Year-to-Date-/2fra-mtpn)).

Based on the fictional Marvel superhero Daredevil, we will use these two datasets to find the identity/location of Daredevil (if you do not know the background of the superhero, do not worry).

While this is a seemingly trivial example, it turns out that just a little outside knowledge can be combined with a dataset to discover much more than [intended](https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/).

**We will look at past crime data. Because Daredevil is blind and thus cannot drive himself (assume Uber does not yet exist), he must use a taxi to reach crimes far from his home.**

*Estimated Time: 60 minutes*

---

**Topics Covered:**
- Loading/Processing Data
- Data Visualization
- Combining, Exploring, and Using Data

**Dependencies:**
*if you are running this through JupyterHub, you do not need to worry about installing these*
- numpy
- datascience
- folium
- datetime


In [1]:
# Just run this cell. It imports all of the packages we will use
import numpy as np
from datascience import *
import folium
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

*Quick note: if you ever want to know more about a certain function, you can add a **?** after the function name to pull up its docstring.*

In [2]:
Table.read_table?

## Loading and processing data

To start, we will load in the raw CSV data (remember we are using two datasets) and view each one individually. Observe the column names and try to make note of what each name means. For some datasets, these names can be obscure, and you will need to look directly at the source of the data to learn more about each column. In our case, most of the column names are easy to interpret. A few columns are not very clear, but none of them affects our search for Daredevil in any significant way, so we will ignore them (at least in our demo).

In [3]:
# The lines below will load the data
taxis_raw = Table.read_table("taxi_data_draft.csv")
complaints_raw = Table.read_table("DATA/NY_complaints.csv")

# Use .show(x) function to show the first x lines of a table
print("Taxi Data:")
taxis_raw.show(5)
print("Complaints Data:")
complaints_raw.show(5)

Taxi Data:


VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
2,2016-01-06 12:04:53,2016-01-06 12:14:21,N,1,-73.8396,40.7222,-73.8637,40.7326,1,1.58,8.5,0.0,0.5,0.46,0,,0.3,9.76,1,1
2,2016-01-30 22:45:09,2016-01-30 23:04:45,N,1,-73.9437,40.7117,-73.9634,40.6759,1,3.36,15.0,0.5,0.5,3.5,0,,0.3,19.8,1,1
1,2016-01-05 19:36:41,2016-01-05 19:44:39,N,1,-73.94,40.6928,-73.9809,40.6899,1,2.0,8.0,1.0,0.5,0.0,0,,0.3,9.8,2,1
2,2016-01-31 22:56:29,2016-01-31 23:06:23,N,1,-73.9349,40.8476,-73.9425,40.8277,1,1.72,9.0,0.5,0.5,0.0,0,,0.3,10.3,2,1
2,2016-01-09 13:35:55,2016-01-09 13:52:11,N,1,-73.9924,40.6894,-73.9501,40.6939,1,2.87,12.5,0.0,0.5,2.0,0,,0.3,15.3,1,1


Complaints Data:


CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,RPT_DT,KY_CD,OFNS_DESC,PD_CD,PD_DESC,CRM_ATPT_CPTD_CD,LAW_CAT_CD,JURIS_DESC,BORO_NM,ADDR_PCT_CD,LOC_OF_OCCUR_DESC,PREM_TYP_DESC,PARKS_NM,HADEVELOPT,X_COORD_CD,Y_COORD_CD,Latitude,Longitude,Lat_Lon
845348933,03/31/2017,23:30:00,,,03/31/2017,578,HARRASSMENT 2,638,"HARASSMENT,SUBD 3,4,5",COMPLETED,VIOLATION,N.Y. POLICE DEPT,BROOKLYN,69,INSIDE,RESIDENCE - APT. HOUSE,,,1012420.0,171737,40.638,-73.8985,"(40.638018389, -73.898491201)"
886921338,03/31/2017,23:25:00,03/31/2017,23:30:00,03/31/2017,344,ASSAULT 3 & RELATED OFFENSES,101,ASSAULT 3,COMPLETED,MISDEMEANOR,N.Y. POLICE DEPT,MANHATTAN,14,,STREET,,,987466.0,215861,40.7592,-73.9884,"(40.759172699, -73.988392793)"
893265998,03/31/2017,23:15:00,03/31/2017,23:25:00,03/31/2017,105,ROBBERY,394,"ROBBERY,LICENSED FOR HIRE VEHICLE",COMPLETED,FELONY,N.Y. POLICE DEPT,BRONX,42,FRONT OF,TAXI (LIVERY LICENSED),,,1010500.0,245411,40.8402,-73.9051,"(40.84024096, -73.905125257)"
518511851,03/31/2017,23:00:00,03/31/2017,23:10:00,03/31/2017,364,OTHER STATE LAWS (NON PENAL LA,809,TAX LAW,COMPLETED,MISDEMEANOR,N.Y. POLICE DEPT,BRONX,49,INSIDE,GROCERY/BODEGA,,,1023620.0,253318,40.8619,-73.8577,"(40.861894559, -73.85766248)"
541009476,03/31/2017,22:55:00,03/31/2017,22:59:00,03/31/2017,235,DANGEROUS DRUGS,511,"CONTROLLED SUBSTANCE, POSSESSI",COMPLETED,MISDEMEANOR,N.Y. POLICE DEPT,BROOKLYN,68,FRONT OF,STREET,,,977956.0,167273,40.6258,-74.0227,"(40.625808217, -74.022675222)"


We see that the data gives us a lot of information! In particular, there is a wealth of information in the form of times and locations. This sets up our general approach to finding our target. We will assume that for some of the crimes committed (there would be no way for Daredevil to get to all of them), Daredevil took a taxi and was dropped off near the location of the crime. Thus, we can try to determine which taxis Daredevil took and then look at the original pickup locations. However, this is complicated by the fact that we have much more data than we need and that we cannot expect Daredevil to have been dropped off at exactly the location of the crime at exactly the time it was reported.

---
Before we move on to actually analyzing the data, we must process the data into a more usable form. Raw data is often very messy and can be a pain to work with. Missing values or NaNs are often scattered throughout the dataset, and values can be in a form that is difficult to use. By processing the data now, we will make our lives much easier later.

To start, let's make the tables we are working with smaller so they only include columns of interest. While this helps to focus our analysis, note that this also discards potentially useful information. If you finish the demo early and want to try some of your own analysis, feel free to use more columns than we do here (in creating the mock data, we use many more columns).

For our taxi dataset, we will only select the columns for pickup/dropoff times, pickup/dropoff locations, and the passenger count. For our complaints data, we will only select the complaint date and time, the offense descriptions (OFNS_DESC and PD_DESC), the level of offense (LAW_CAT_CD), and the latitude and longitude.

In [4]:
# Selecting the columns of Taxi Data according to column index
taxis = taxis_raw.select([1,2,5,6,7,8,9])
taxis.relabel(['lpep_pickup_datetime','Lpep_dropoff_datetime'], ['Pickup_dt','Dropoff_dt']) # renames column
print("Taxi Data:")
taxis.show(5)

# Selecting the columns of Complaints Data according to column index
complaints = complaints_raw.select([1,2,7,9,11,21,22])
print("Complaints Data:")
complaints.show(5)

Taxi Data:


Pickup_dt,Dropoff_dt,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count
2016-01-06 12:04:53,2016-01-06 12:14:21,-73.8396,40.7222,-73.8637,40.7326,1
2016-01-30 22:45:09,2016-01-30 23:04:45,-73.9437,40.7117,-73.9634,40.6759,1
2016-01-05 19:36:41,2016-01-05 19:44:39,-73.94,40.6928,-73.9809,40.6899,1
2016-01-31 22:56:29,2016-01-31 23:06:23,-73.9349,40.8476,-73.9425,40.8277,1
2016-01-09 13:35:55,2016-01-09 13:52:11,-73.9924,40.6894,-73.9501,40.6939,1


Complaints Data:


CMPLNT_FR_DT,CMPLNT_FR_TM,OFNS_DESC,PD_DESC,LAW_CAT_CD,Latitude,Longitude
03/31/2017,23:30:00,HARRASSMENT 2,"HARASSMENT,SUBD 3,4,5",VIOLATION,40.638,-73.8985
03/31/2017,23:25:00,ASSAULT 3 & RELATED OFFENSES,ASSAULT 3,MISDEMEANOR,40.7592,-73.9884
03/31/2017,23:15:00,ROBBERY,"ROBBERY,LICENSED FOR HIRE VEHICLE",FELONY,40.8402,-73.9051
03/31/2017,23:00:00,OTHER STATE LAWS (NON PENAL LA,TAX LAW,MISDEMEANOR,40.8619,-73.8577
03/31/2017,22:55:00,DANGEROUS DRUGS,"CONTROLLED SUBSTANCE, POSSESSI",MISDEMEANOR,40.6258,-74.0227


Now that we have selected our columns, let's remove all of the rows that have a missing or null value. We will also remove zero values because some of the taxi locations have 0,0 as their coordinates (which is [clearly](https://www.google.com/maps/place/0%C2%B000'00.0%22N+0%C2%B000'00.0%22E/) not correct).

In [5]:
def remove_nan(t):
    """
    Removes all rows with nan values checking each column
    Note you should use this AFTER stripping the table of columns you do not need
    so you do not remove rows when given a column without much information

    Will remove most nan values but may not work with some other default missing values
    (specifically, will not remove -999, etc. values)

    Parameters:
    t: a table whose rows with nan values you want to remove

    returns a table identical to t but without rows containing nan values
    """
    def checkNotnan(val):
        if (val!=val)|(val=='nan')|(val=='NAN')|(val=='NaN')|(val==0):
            return False
        return True
    for i in range(t.num_columns):
        t = t.where(i, checkNotnan)
    return t

taxis = remove_nan(taxis)
complaints = remove_nan(complaints)
taxis.show(5)
complaints.show(5)

Pickup_dt,Dropoff_dt,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count
2016-01-06 12:04:53,2016-01-06 12:14:21,-73.8396,40.7222,-73.8637,40.7326,1
2016-01-30 22:45:09,2016-01-30 23:04:45,-73.9437,40.7117,-73.9634,40.6759,1
2016-01-05 19:36:41,2016-01-05 19:44:39,-73.94,40.6928,-73.9809,40.6899,1
2016-01-31 22:56:29,2016-01-31 23:06:23,-73.9349,40.8476,-73.9425,40.8277,1
2016-01-09 13:35:55,2016-01-09 13:52:11,-73.9924,40.6894,-73.9501,40.6939,1


CMPLNT_FR_DT,CMPLNT_FR_TM,OFNS_DESC,PD_DESC,LAW_CAT_CD,Latitude,Longitude
03/31/2017,23:30:00,HARRASSMENT 2,"HARASSMENT,SUBD 3,4,5",VIOLATION,40.638,-73.8985
03/31/2017,23:25:00,ASSAULT 3 & RELATED OFFENSES,ASSAULT 3,MISDEMEANOR,40.7592,-73.9884
03/31/2017,23:15:00,ROBBERY,"ROBBERY,LICENSED FOR HIRE VEHICLE",FELONY,40.8402,-73.9051
03/31/2017,23:00:00,OTHER STATE LAWS (NON PENAL LA,TAX LAW,MISDEMEANOR,40.8619,-73.8577
03/31/2017,22:55:00,DANGEROUS DRUGS,"CONTROLLED SUBSTANCE, POSSESSI",MISDEMEANOR,40.6258,-74.0227
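The `val!=val` test in `remove_nan` works because NaN is the only floating-point value that is not equal to itself. The minimal sketch below applies the same checks to a few plain Python values. The name `is_missing` is a hypothetical helper used only for illustration; it is not part of the `datascience` package.

```python
def is_missing(val):
    """Mirror of the checks in remove_nan, applied to a single value."""
    # NaN is the only value for which val != val is True
    if val != val:
        return True
    # string spellings of NaN, plus the 0 placeholder seen in bad coordinates
    return val in ('nan', 'NAN', 'NaN') or val == 0

samples = [float('nan'), 'NaN', 0, 40.7222, '2016-01-06 12:04:53']
flags = [is_missing(v) for v in samples]
print(flags)
```

Note that this rule also drops rows containing a legitimate value of 0 in any column, such as a passenger count of 0.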


We will now convert the data to a more usable format. Currently the taxi dates are strings, and we would like to change them to datetime objects. If you do not fully know what this means, do not worry too much about it. Basically, the times are stored as text, but we would like to convert them to a format that has some built-in functionality.

To do this, we will use *apply* to call a function on each value in the datetime columns of the taxi data.

In [6]:
# This function will conveniently convert a specific format of string to a datetime
def to_datetime(string_date):
    '''will strip a date in a string format and return a datetime format'''
    if type(string_date)==dt.datetime:
        return string_date
    return dt.datetime.strptime(string_date, '%Y-%m-%d %H:%M:%S')

print("Before:", taxis.column(0))
print()

converted_pickup_col = taxis.apply(to_datetime, 'Pickup_dt')
converted_dropoff_col = taxis.apply(to_datetime, 'Dropoff_dt')
taxis = taxis.with_column('Pickup_dt',converted_pickup_col)
taxis = taxis.with_column('Dropoff_dt',converted_dropoff_col)
print("After:", taxis.column(0))

Before: ['2016-01-06 12:04:53' '2016-01-30 22:45:09' '2016-01-05 19:36:41' ...,
 '2016-01-14 05:51:22' '2016-01-17 23:10:31' '2016-01-05 20:50:35']

After: [datetime.datetime(2016, 1, 6, 12, 4, 53)
 datetime.datetime(2016, 1, 30, 22, 45, 9)
 datetime.datetime(2016, 1, 5, 19, 36, 41) ...,
 datetime.datetime(2016, 1, 14, 5, 51, 22)
 datetime.datetime(2016, 1, 17, 23, 10, 31)
 datetime.datetime(2016, 1, 5, 20, 50, 35)]
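To see what the "built-in functionality" of a datetime object buys us, here is a small sketch using only the standard library and the first taxi row shown above. Subtracting two datetimes gives a `timedelta`, which can be compared directly with another `timedelta`; this is the operation the later time filters rely on.

```python
import datetime as dt

fmt = '%Y-%m-%d %H:%M:%S'
pickup = dt.datetime.strptime('2016-01-06 12:04:53', fmt)
dropoff = dt.datetime.strptime('2016-01-06 12:14:21', fmt)

# Subtracting two datetimes gives a timedelta (a length of time)
trip_length = dropoff - pickup
print(trip_length)  # 0:09:28

# timedeltas can be compared, which strings cannot do meaningfully
is_short_trip = trip_length <= dt.timedelta(minutes=10)
print(is_short_trip)
```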


Now we will combine the dates and times of the complaints data, once again by applying a function, but this time to two columns! The complaint dates and times are also stored as strings.

In [7]:
# The function we will apply to the table. Do not worry too much about the details of it
def combine_date_time(date_string, time_string):
    '''function that takes a date in the format of a string and a 
    time in the format of a string and then combines the two into a new datetime format'''
    if type(date_string)==dt.date:
        date = date_string
    elif type(date_string)==dt.datetime:
        date = date_string.date()
    else:
        date = dt.datetime.strptime(date_string, '%m/%d/%Y').date()
        
    if type(time_string)==dt.time:
        time = time_string
    elif type(time_string)==dt.datetime:
        time = time_string.time()
    else:
        time = dt.datetime.strptime(time_string, '%H:%M:%S').time()
    return dt.datetime.combine(date, time)

# applies the function above
combined_times = complaints.apply(combine_date_time, ["CMPLNT_FR_DT","CMPLNT_FR_TM"])
complaints = complaints.with_column("Complaint_dt", combined_times)
# drops the first two columns and reorders the table
complaints = complaints.drop([0,1]).select([5,3,4,0,1,2])
complaints.show(5)

Complaint_dt,Latitude,Longitude,OFNS_DESC,PD_DESC,LAW_CAT_CD
2017-03-31 23:30:00,40.638,-73.8985,HARRASSMENT 2,"HARASSMENT,SUBD 3,4,5",VIOLATION
2017-03-31 23:25:00,40.7592,-73.9884,ASSAULT 3 & RELATED OFFENSES,ASSAULT 3,MISDEMEANOR
2017-03-31 23:15:00,40.8402,-73.9051,ROBBERY,"ROBBERY,LICENSED FOR HIRE VEHICLE",FELONY
2017-03-31 23:00:00,40.8619,-73.8577,OTHER STATE LAWS (NON PENAL LA,TAX LAW,MISDEMEANOR
2017-03-31 22:55:00,40.6258,-74.0227,DANGEROUS DRUGS,"CONTROLLED SUBSTANCE, POSSESSI",MISDEMEANOR


We will also combine the latitude/longitude data into one column so we can more easily apply functions using the datascience package.

In [8]:
def combine_coordinates(latitude, longitude):
    """
    returns a tuple of the given latitude and longitude
    """
    return (latitude, longitude)

# Apply the function above
taxis_combined_pickup_loc = taxis.apply(combine_coordinates, ['Pickup_latitude','Pickup_longitude'])
taxis_combined_dropoff_loc = taxis.apply(combine_coordinates, ['Dropoff_latitude','Dropoff_longitude'])
complaints_combined_loc = complaints.apply(combine_coordinates, ['Latitude','Longitude'])

# combine with tables and drop previous columns
taxis = taxis.with_column("Pickup_location", taxis_combined_pickup_loc)
taxis = taxis.with_column("Dropoff_location", taxis_combined_dropoff_loc)
taxis = taxis.drop([2,3,4,5])

complaints = complaints.with_column("Location", complaints_combined_loc)
complaints = complaints.drop([1,2])

taxis.show(2)
complaints.show(2)

Pickup_dt,Dropoff_dt,Passenger_count,Pickup_location,Dropoff_location
2016-01-06 12:04:53,2016-01-06 12:14:21,1,[ 40.72217941 -73.83955383],[ 40.73259735 -73.86370087]
2016-01-30 22:45:09,2016-01-30 23:04:45,1,[ 40.7117157 -73.94367218],[ 40.67586517 -73.96343231]


Complaint_dt,OFNS_DESC,PD_DESC,LAW_CAT_CD,Location
2017-03-31 23:30:00,HARRASSMENT 2,"HARASSMENT,SUBD 3,4,5",VIOLATION,[ 40.63801839 -73.8984912 ]
2017-03-31 23:25:00,ASSAULT 3 & RELATED OFFENSES,ASSAULT 3,MISDEMEANOR,[ 40.7591727 -73.98839279]


Now let's move on from the boring (yet important) preprocessing of data!

## Visualization

Before we begin trying to find Daredevil, we will explore some of the visualization tools that we can use to easily see the data. For technical reasons, we will use folium for this purpose rather than the built-in mapping function in the datascience package. You can look through the folium [quickstart guide](https://folium.readthedocs.io/en/latest/) or use some of the helper functions we provide.

In [38]:
# This is the syntax to create an empty map centered at coordinates 40.7128,-74.0059
# These are the coordinates of NYC, so you can simply use them in any other maps for this lab
map_example = folium.Map(width=700,height=500,location=[40.7128,-74.0059], zoom_start=10)

# to display the map simply type the name
map_example

To plot points for the lab, folium uses a class called `Marker`. You can read more documentation [here](https://folium.readthedocs.io/en/latest/quickstart.html#markers). The basics of folium markers are shown below.

In [46]:
# Creating a new marker at coordinates (40.72, -73.9633)
marker_example = folium.Marker([40.72, -73.9633])
# adds the marker to the map
marker_example.add_to(map_example)
# Note that there is no easy way to remove a marker once you add it to the map
# If you want to reset a map, simply run map_example = folium.Map(location=[40.7128,-74.0059])
# in order to create a new one instead

# display the map
map_example

We define a function addMarkers below. This function will automatically add markers to a map from a given table, using the column you name (or index) as the location of each marker.

In [58]:
def addMarkers(fol_map, mark, location_col, color="blue",icon='star',max_num=25, popup_cols=[]):
    """
    adds markers to folium fol_map based on a table mark
    Parameters:
    fol_map: a folium.Map object that you want to add markers to
    mark: a table with a column holding (latitude, longitude) locations
    location_col: the label or index of the location column in mark
    color: color of the markers added (default: blue)
    icon: icon of the markers added (default: star)
    max_num: the maximum number of markers added. Use to not overload folium map (default: 25)
    popup_cols: indices of the columns of mark shown in each marker's popup (default: none)
    returns nothing. Will modify fol_map directly
    """
    if type(location_col)==str:
        location_col = mark.column_index(location_col)
    for i in range(mark.num_rows):
        # stop once max_num markers have been added
        if i >= max_num:
            return
        row = mark.row(i)
        popup = None
        if len(popup_cols)>0:
            popup = ""
            for col in popup_cols:
                popup += mark.column_labels[col] + ": " + str(row[col]) + '  '
        folium.Marker(row[location_col],icon=folium.Icon(color=color, icon=icon),popup=popup).add_to(fol_map)

In [59]:
# reset the map_example variable
map_example = folium.Map(width=700,height=500,location=[40.7128,-74.0059], zoom_start=10)

# You can also change the color and icon of the markers
addMarkers(map_example, taxis, 'Dropoff_location', color='red', icon='cloud', popup_cols=[0,1])
# type help(folium.Icon) to get some details of what you can put in color and icon

map_example

## Analyzing Data

We start to actually look into how we are going to analyze the data. We will be looking at the latitude and longitude data from complaints and taxi as well as the times of each table (so if you dropped these columns earlier, go back and change your selection so these columns are included).

Our reasoning is that Daredevil uses complaints sent to the NYPD to find the location of a crime and then travels there. Thus, if we look at a crime at which Daredevil was present, we expect to find a corresponding taxi ride that ends in the general area. Then, if we look at where this taxi ride started, we should (in theory) be able to find where Daredevil sets out from and thus get closer to identifying him.

In the real world, you can imagine that we would use a variety of ideas to begin looking for specific people or narrow our search (e.g. photos of taxis celebrities emerged out of, knowledge of where someone lives, etc.)

In [60]:
# Run this cell to display the tables
taxis.show(1)
complaints.show(1)

Pickup_dt,Dropoff_dt,Passenger_count,Pickup_location,Dropoff_location
2016-01-06 12:04:53,2016-01-06 12:14:21,1,[ 40.72217941 -73.83955383],[ 40.73259735 -73.86370087]


Complaint_dt,OFNS_DESC,PD_DESC,LAW_CAT_CD,Location
2017-03-31 23:30:00,HARRASSMENT 2,"HARASSMENT,SUBD 3,4,5",VIOLATION,[ 40.63801839 -73.8984912 ]


Let's look at the times first. Perhaps we know that Daredevil took a taxi sometime near 11:00 pm (23:00) on January 4th. We can try to find the destination by looking at all taxi rides around that time in the taxi dataset. To do so, we write a function that checks whether an event occurs within x minutes of a certain time. Then, we can use this function to select only the rows of a table that correspond to these times.

We will use the Table.where function. This function takes in a column name and a function that returns a boolean value (True or False) and returns a table with only the rows where the function returned true when applied to that column.
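Conceptually, this kind of filtering keeps a row only when the predicate returns True for that row's value in the chosen column. The sketch below imitates the idea with plain Python lists and dictionaries. The `where` function and the `rides` data here are hypothetical stand-ins for illustration; they are not how the `datascience` package is implemented.

```python
def where(rows, column, predicate):
    """Keep only the rows whose value in `column` makes predicate return True."""
    return [row for row in rows if predicate(row[column])]

# hypothetical miniature table: each row is a dict
rides = [
    {'Passenger_count': 1, 'Trip_distance': 1.58},
    {'Passenger_count': 5, 'Trip_distance': 0.40},
    {'Passenger_count': 1, 'Trip_distance': 3.36},
]

def is_solo(count):
    """Predicate: True when a ride carried exactly one passenger."""
    return count == 1

solo_rides = where(rides, 'Passenger_count', is_solo)
print(len(solo_rides))
```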

In [61]:
def near_11pm_jan_4(time):
    '''
    Returns a boolean (True or False): whether a time is within 5 minutes of January 4, 2016 at 11pm.
    '''
    jan_4_11_pm = dt.datetime(2016,1,4,23)
    return abs(time - jan_4_11_pm) <= dt.timedelta(minutes=5)

# Now we use the .where function to select rows from taxi where the pickup time was within 5 minutes of 11pm
# on January 4th!
near_11_taxis = taxis.where('Pickup_dt', near_11pm_jan_4)
near_11_taxis.show()

Pickup_dt,Dropoff_dt,Passenger_count,Pickup_location,Dropoff_location
2016-01-04 23:00:22,2016-01-05 01:20:31,1,[ 40.67131042 -73.88010406],[ 40.71093369 -73.96182251]
2016-01-04 23:04:10,2016-01-04 23:14:42,2,[ 40.6439209 -73.98931885],[ 40.6598587 -73.95677185]
2016-01-04 22:56:05,2016-01-04 22:58:27,5,[ 40.7487793 -73.8728714],[ 40.74436569 -73.87328339]


You may have noticed that this function is pretty restrictive. It only allows us to check a table for one specific time! Below we will define a new function that is more general and will allow us to compare 2 times with each other. This will prove very useful to us later.

In [62]:
def time_near(time1, time2, error=5):
    '''
    Returns a boolean (true or false) whether 2 times are within error time of each other
    error time is a number representing the minutes in between the two times (default 5 minutes)
    '''
    return abs(time1-time2) <= dt.timedelta(minutes=error)

# Do not worry exactly what this line below does. It essentially creates the same 
# function we had before (and of the same name) using the more general function
near_11pm_jan_4 = lambda x: time_near(x, dt.datetime(2016,1,4,23))
near_11_taxis = taxis.where('Pickup_dt', near_11pm_jan_4)
near_11_taxis.show()

Pickup_dt,Dropoff_dt,Passenger_count,Pickup_location,Dropoff_location
2016-01-04 23:00:22,2016-01-05 01:20:31,1,[ 40.67131042 -73.88010406],[ 40.71093369 -73.96182251]
2016-01-04 23:04:10,2016-01-04 23:14:42,2,[ 40.6439209 -73.98931885],[ 40.6598587 -73.95677185]
2016-01-04 22:56:05,2016-01-04 22:58:27,5,[ 40.7487793 -73.8728714],[ 40.74436569 -73.87328339]
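The `lambda` line in the cell above fixes the second argument of `time_near` so that the result takes a single time, which is the shape a `.where` predicate needs. As a sketch of the same idea, the standard library's `functools.partial` builds an equivalent one-argument function. The names `near_lambda`, `near_partial`, and the `late_pickup` value are chosen here for illustration only.

```python
import datetime as dt
from functools import partial

def time_near(time1, time2, error=5):
    """True when the two times are within `error` minutes of each other."""
    return abs(time1 - time2) <= dt.timedelta(minutes=error)

target = dt.datetime(2016, 1, 4, 23)

# Two equivalent ways to "pre-fill" the second argument
near_lambda = lambda x: time_near(x, target)
near_partial = partial(time_near, time2=target)

pickup = dt.datetime(2016, 1, 4, 23, 4, 10)      # a pickup time from the table above
late_pickup = dt.datetime(2016, 1, 4, 23, 5, 1)  # hypothetical, just outside the window

print(near_lambda(pickup), near_partial(pickup))
print(near_lambda(late_pickup), near_partial(late_pickup))
```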


Using this method, we can look at the time of a crime we believe Daredevil to have gone to and look at dropoff or pickup times for taxis around the same time. Below, we will create a list of tables that correspond to the first 5 rows of crimes/complaints. These tables will be all the taxis that have dropoffs within 10 minutes of the crime being reported.

In [63]:
table_list = []
for i in range(5):
    time = complaints.column('Complaint_dt')[i] #gets the ith datetime of the complaint
    temp_function = lambda x: time_near(x, time, 10)
    table_list.append(taxis.where('Dropoff_dt', temp_function))
table_list[4].show()

Pickup_dt,Dropoff_dt,Passenger_count,Pickup_location,Dropoff_location
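The table above is empty. One likely reason, assuming the taxi table only covers January 2016 as the displayed rows suggest, is that the first five complaints are all dated March 31, 2017, so no dropoff can fall within 10 minutes of them. The sketch below checks the gap for one complaint time and one dropoff time taken from the outputs shown earlier.

```python
import datetime as dt

complaint_time = dt.datetime(2017, 3, 31, 22, 55)    # fifth complaint shown earlier
dropoff_time = dt.datetime(2016, 1, 31, 23, 6, 23)   # a taxi dropoff shown earlier

# The gap is more than a year, far outside a 10 minute window
gap = abs(complaint_time - dropoff_time)
within_window = gap <= dt.timedelta(minutes=10)
print(gap.days, within_window)
```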


Now let's look at location data. As with time, we can create a function that selects all of the rows of a table whose location is close to a given coordinate.

Before we do, we write a function that computes (roughly) the distance in km between two coordinates to make our lives easier.

In [64]:
def dist_coord(loc1,loc2):
    """
    returns distance in km between 2 coordinates
    loc1 and loc2 should be a tuple of coordinates corresponding to the latitude and longitudes
    of 2 locations
    Not entirely accurate (assumes perfectly spherical earth) but works for our purposes
    """
    R = 6373.0
    lat1, lon1 = loc1
    lat2, lon2 = loc2
    lat1 = np.radians(lat1)
    lon1 = np.radians(lon1)
    lat2 = np.radians(lat2)
    lon2 = np.radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (np.sin(dlat/2))**2 + np.cos(lat1) * np.cos(lat2) * (np.sin(dlon/2))**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    distance = R * c
    return distance
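The function above is the haversine formula for great-circle distance. As a quick sanity check, the sketch below re-implements it with the standard `math` module (so it runs without numpy) and applies it to the first complaint location and one taxi dropoff location taken from the tables in this lab. The name `haversine_km` is chosen for this sketch only.

```python
import math

def haversine_km(loc1, loc2, radius=6373.0):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, loc1)
    lat2, lon2 = map(math.radians, loc2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.atan2(math.sqrt(a), math.sqrt(1 - a))

complaint = (40.638018389, -73.898491201)   # first complaint location
dropoff = (40.63340378, -73.88950348)       # a taxi dropoff location

d = haversine_km(complaint, dropoff)
print(round(d, 2))   # a little under 1 km

# One degree of latitude should be roughly 111 km
one_degree = haversine_km((40.0, -74.0), (41.0, -74.0))
print(round(one_degree, 1))
```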

And now we write a general function (like we did for time) to be used in the `.where` call of a table. That is, we will write a function that returns True if a set of coordinates is within x km of another set of coordinates. We then use the function to generate a table of taxis that dropped off a passenger within 1 km of the first complaint location (40.638, -73.8985).

In [65]:
def dist_near(loc1, loc2, error=1):
    """
    returns a boolean (True or False) of whether the coordinates (lat1,lon1) and (lat2,lon2) are
    within error km (default 1 km) of each other
    """
    return dist_coord(loc1, loc2) <= error

complaint_loc = complaints.column('Location')[0]
near_first_complaint_func = lambda x: dist_near(x, complaint_loc)
near_first_complaint_table = taxis.where('Dropoff_location', near_first_complaint_func)
near_first_complaint_table.show()

Pickup_dt,Dropoff_dt,Passenger_count,Pickup_location,Dropoff_location
2016-01-16 03:07:23,2016-01-16 03:11:14,2,[ 40.64406586 -73.89997101],[ 40.63340378 -73.88950348]
2016-01-30 22:29:44,2016-01-30 22:29:54,4,[ 40.64283752 -73.89810944],[ 40.64283371 -73.89808655]
2016-01-18 21:29:14,2016-01-18 21:56:28,1,[ 40.68370438 -73.96775055],[ 40.63659668 -73.88986969]
2016-01-26 22:29:47,2016-01-26 22:53:29,1,[ 40.67927551 -73.93848419],[ 40.63418579 -73.8886261 ]
2016-01-16 17:32:49,2016-01-16 17:53:10,2,[ 40.66906357 -73.93128204],[ 40.6318512 -73.89125061]


Let's visualize this! We plot the location of the complaint in red, and the locations of the taxi dropoffs in green. In addition, we plot the pickup location for each of these taxis in blue so you can see where the taxis picked up passengers who were dropped near the location.

In [79]:
distance_example_map = folium.Map(width=700,height=500,location=complaint_loc.tolist(), zoom_start=12)

addMarkers(distance_example_map, near_first_complaint_table, 'Dropoff_location', color='green',popup_cols=[3,4])
addMarkers(distance_example_map, near_first_complaint_table, 'Pickup_location', color='blue',popup_cols=[3,4])
folium.Marker(complaint_loc, icon=folium.Icon(color='red'),popup='Complaint_location: '+str(complaint_loc)).add_to(distance_example_map)

distance_example_map

Now let's try to find Daredevil! We will go through the complaints data and find the complaints we believe (or know) that Daredevil has gone to. Then we will match each crime with taxi rides that are near in time and dropoff location. Finally, we will look for common pickup locations and suspect that these are where Daredevil sets out from.

For the sake of speed, we will use only January's data. We also set the distance threshold to 1 km and require the dropoff time to be within 15 minutes of the complaint time. You are free to change any of these parameters as you see fit; for example, since Daredevil travels alone, you could also require that the passenger count in taxis be 1. In addition, if you are comfortable, try to experiment by including some of the columns we removed earlier in the pre-processing stage!

*Note that some of the visualizations can be time-consuming, so we encourage you to try to find a faster way that does not need visualizations (hint: try to use the distance function on the pickup locations you get).*
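The matching step described above (a dropoff that is close to the complaint in both distance and time) can be sketched in plain Python. This is a minimal illustration, not the lab's implementation: the complaint comes from the January data, while the three rides, the `id` field, and the helper names `dist_km` and `matching_rides` are hypothetical.

```python
import datetime as dt
import math

def dist_km(loc1, loc2, radius=6373.0):
    """Haversine distance in km between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*loc1, *loc2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.atan2(math.sqrt(a), math.sqrt(1 - a))

def matching_rides(complaint, rides, distance_err=1, time_err=15):
    """Rides dropped off within distance_err km and time_err minutes of the complaint."""
    window = dt.timedelta(minutes=time_err)
    return [
        ride for ride in rides
        if dist_km(ride['dropoff_loc'], complaint['loc']) <= distance_err
        and abs(ride['dropoff_dt'] - complaint['dt']) <= window
    ]

complaint = {'dt': dt.datetime(2016, 1, 7, 16, 27), 'loc': (40.76011556, -73.97898212)}

# hypothetical rides: A is close in space and time, B is too late, C is too far
rides = [
    {'id': 'A', 'dropoff_dt': dt.datetime(2016, 1, 7, 16, 20), 'dropoff_loc': (40.7610, -73.9800)},
    {'id': 'B', 'dropoff_dt': dt.datetime(2016, 1, 7, 18, 0), 'dropoff_loc': (40.7610, -73.9800)},
    {'id': 'C', 'dropoff_dt': dt.datetime(2016, 1, 7, 16, 25), 'dropoff_loc': (40.8000, -73.9500)},
]

matched_ids = [ride['id'] for ride in matching_rides(complaint, rides)]
print(matched_ids)
```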

In [119]:
distance_err = 1
time_err = 15

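#felonies = complaints.where("LAW_CAT_CD", "FELONY")
# This version is for the january/reduced size data
# Despite its name, the felonies table below keeps every complaint (not only felonies)
# dated from January 1 through the start (midnight) of January 31, 2016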
felonies = complaints.where("Complaint_dt", 
        are.between_or_equal_to(dt.datetime(2016,1,1), dt.datetime(2016,1,31)))
felonies.show(5)

# taxi_tables will be an array of tables corresponding to the table of taxis that
# are suspected to be related to the associated felony
taxi_tables = []
for row_num in range(felonies.num_rows):
    # This should all look familiar
    complaint_loc = felonies.column('Location')[row_num]
    complaint_desc = felonies.column('OFNS_DESC')[row_num]
    complaint_dt = felonies.column('Complaint_dt')[row_num]
    
    near_complaint_loc_func = lambda x: dist_near(x, complaint_loc, distance_err)
    near_complaint_time_func = lambda x: time_near(x, complaint_dt, time_err)
    
    temp_table = taxis.where('Dropoff_location', near_complaint_loc_func)
    temp_table = temp_table.where('Dropoff_dt', near_complaint_time_func)
    
    taxi_tables.append(temp_table)
    

# For visualization. Note that this can take awhile to run
NY_map = folium.Map(width=700,height=500,location=[40.7128,-74.0059], zoom_start=10)

for i in range(len(taxi_tables)):
    table = taxi_tables[i]
    if (table.num_rows>0):
        complaint_loc = felonies.column('Location')[i]
        complaint_desc = felonies.column('OFNS_DESC')[i]
        complaint_dt = felonies.column('Complaint_dt')[i]
        
        felony_marker = folium.Marker(complaint_loc, icon=folium.Icon(color='red'), popup=complaint_desc)
        felony_marker.add_to(NY_map)
        addMarkers(NY_map, table, 'Dropoff_location',color='green',popup_cols=[3,4])
        addMarkers(NY_map, table, 'Pickup_location',color='blue',popup_cols=[3,4])
        
NY_map

Complaint_dt,OFNS_DESC,PD_DESC,LAW_CAT_CD,Location
2016-01-20 08:00:00,FRAUDS,"FRAUD,UNCLASSIFIED-MISDEMEANOR",MISDEMEANOR,[ 40.65363867 -73.74077279]
2016-01-19 06:00:00,PETIT LARCENY,"LARCENY,PETIT FROM BUILDING,UN",MISDEMEANOR,[ 40.88806094 -73.8586372 ]
2016-01-07 16:27:00,FORGERY,"FORGERY,ETC.,UNCLASSIFIED-FELO",FELONY,[ 40.76011556 -73.97898212]
2016-01-07 16:27:00,FRAUDS,"FRAUD,UNCLASSIFIED-MISDEMEANOR",MISDEMEANOR,[ 40.76011556 -73.97898212]
2016-01-04 09:00:00,GRAND LARCENY,"LARCENY,GRAND BY DISHONEST EMP",FELONY,[ 40.76226873 -73.98798435]


Remember that in the visualization, red indicates the crime, green indicates the dropoff location, and blue indicates the pickup location. Do you see anywhere that seems to be the origin of many of the blue clusters? In particular you should find that there is a small cluster of blue near the coordinates (40.76, -74.00).
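One way to look for such a cluster without a map is to round each pickup location to a coarse grid and count how many pickups land in each cell. The sketch below assumes a grid of two decimal places (cells roughly a kilometer across at this latitude) and uses made-up pickup coordinates; `grid_cell` is a hypothetical helper, and real results depend on the rides you matched.

```python
from collections import Counter

# hypothetical pickup locations (latitude, longitude) of rides matched to complaints
pickups = [
    (40.7612, -74.0012),
    (40.7588, -73.9991),
    (40.6712, -73.8801),
    (40.7603, -74.0034),
    (40.8476, -73.9349),
]

def grid_cell(loc, decimals=2):
    """Round a (latitude, longitude) pair so nearby points share one cell."""
    return (round(loc[0], decimals), round(loc[1], decimals))

# Count pickups per cell and report the busiest one
cell_counts = Counter(grid_cell(loc) for loc in pickups)
top_cell, top_count = cell_counts.most_common(1)[0]
print(top_cell, top_count)
```

A fixed grid can split a cluster that straddles a cell boundary, so comparing pickup locations pairwise with the distance function is a reasonable alternative.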

However, how can we be sure that this did not occur by random coincidence? A variety of factors could affect the data (e.g., population density, time of day, or tourist destinations). We recommend exploring the data yourself to find out more! You might even find more than just the location of Daredevil.

---

## Credits
This module was created as part of the DSEP Modules team for the Spring 2018 offering of INFO 290.

---