# Daredevil Demo
---

In this lab, we will explore the potential privacy concerns regarding location data that is supposedly anonymous. We will use a modified version of NYC Taxi data (which is made public and can be found [here](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)) and modified NYC complaints data (found [here](https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Map-Year-to-Date-/2fra-mtpn)).

Based on the fictional Marvel superhero Daredevil, we will use these two datasets to find the identity/location of Daredevil (if you do not know the background of the superhero, do not worry).

While this is a seemingly trivial example, it turns out that just a little bit of outside knowledge can be combined with a dataset to discover much more than [intended](https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/).

**We will look at past crime data. Because Daredevil is blind and cannot drive himself (assume Uber does not yet exist), he must take a taxi to reach crimes far from his home.**

*Estimated Time: 60 minutes*

---

**Topics Covered:**
- Loading/Processing Data
- Data Visualization
- Combining, Exploring, and Using Data

**Dependencies:**
*If you are running this through JupyterHub, you do not need to install these.*
- numpy
- datascience
- folium
- helpers.py
- datetime


In [1]:
# Just run this cell. It imports all of the packages we will use
import numpy as np
from datascience import *
import folium
import helpers
import datetime as dt

## Loading and processing data

To start, we will load in the raw csv data (remember we are using 2 datasets) and view each one individually. Observe the column names and try to make note of what each name means. For some datasets, these names can be obscure and you will need to look directly at the source of the data to learn more about each column. In our case, most of the column names are easy to interpret. A few columns are unclear, but none of them affects our search for Daredevil in any significant way, so we will ignore them (at least in our demo).

In [7]:
# The lines below will load the data
taxis = Table.read_table("taxi_data_draft.csv")
complaints = Table.read_table("january_complaints.csv")

# Use the .show(x) method to show the first x rows of a table
print("Taxi Data:")
taxis.show(5)
print("Complaints Data:")
complaints.show(5)

Taxi Data:


VendorID,lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,improvement_surcharge,Total_amount,Payment_type,Trip_type
2,2016-01-06 12:04:53,2016-01-06 12:14:21,N,1,-73.8396,40.7222,-73.8637,40.7326,1,1.58,8.5,0.0,0.5,0.46,0,,0.3,9.76,1,1
2,2016-01-30 22:45:09,2016-01-30 23:04:45,N,1,-73.9437,40.7117,-73.9634,40.6759,1,3.36,15.0,0.5,0.5,3.5,0,,0.3,19.8,1,1
1,2016-01-05 19:36:41,2016-01-05 19:44:39,N,1,-73.94,40.6928,-73.9809,40.6899,1,2.0,8.0,1.0,0.5,0.0,0,,0.3,9.8,2,1
2,2016-01-31 22:56:29,2016-01-31 23:06:23,N,1,-73.9349,40.8476,-73.9425,40.8277,1,1.72,9.0,0.5,0.5,0.0,0,,0.3,10.3,2,1
2,2016-01-09 13:35:55,2016-01-09 13:52:11,N,1,-73.9924,40.6894,-73.9501,40.6939,1,2.87,12.5,0.0,0.5,2.0,0,,0.3,15.3,1,1


Complaints Data:


OFNS_DESC,PD_DESC,LAW_CAT_CD,BORO_NM,Longitude,Latitude,TIME
FRAUDS,"FRAUD,UNCLASSIFIED-MISDEMEANOR",MISDEMEANOR,QUEENS,-73.7408,40.6536,2016-01-20 08:00:00
PETIT LARCENY,"LARCENY,PETIT FROM BUILDING,UN",MISDEMEANOR,BRONX,-73.8586,40.8881,2016-01-19 06:00:00
FORGERY,"FORGERY,ETC.,UNCLASSIFIED-FELO",FELONY,MANHATTAN,-73.979,40.7601,2016-01-07 16:27:00
FRAUDS,"FRAUD,UNCLASSIFIED-MISDEMEANOR",MISDEMEANOR,MANHATTAN,-73.979,40.7601,2016-01-07 16:27:00
GRAND LARCENY,"LARCENY,GRAND BY DISHONEST EMP",FELONY,MANHATTAN,-73.988,40.7623,2016-01-04 09:00:00


We see that the data gives us a lot of information! In particular, there is a wealth of information in the form of times and locations. This sets up our general approach to finding our target. We will assume that for some of the crimes committed (there would be no way for Daredevil to get to all crimes), Daredevil took a taxi and was dropped off near the location of the crime. Thus, we can try to determine which taxis Daredevil took and then look at the original pickup location. However, this is complicated by the fact that we have much more data than we want, and that we cannot expect Daredevil to have been dropped off at exactly the same location and at exactly the same time as the crime.

---
Before we move on to actually analyzing the data, we must process the data to be in a more usable form. Raw data is often very messy and can be a pain to work with. Things such as missing values or NaNs are often scattered throughout the dataset, and values can often be in a difficult form to use. Thus, by processing the data now, we will make our lives much easier later.

To start, let's make the tables we are working with smaller so they only include columns of interest. While this helps to focus our analysis, note that this also discards potentially useful information. If you finish the demo early and want to try some of your own analysis, feel free to use more columns than we do here.

For our taxi dataset, we will only select the columns for pickup/dropoff times, pickup/dropoff locations and the passenger count. For our complaints data, we will only select the level of offense (LAW_CAT_CD).
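One way to spot the missing values mentioned above is to check which columns are blank in every row. The sketch below is only an illustration using the standard library: it copies two rows from the taxi output shown earlier, trimmed to five columns for brevity, and the helper name `empty_columns` is hypothetical (it is not part of `helpers.py` or the datascience package).

```python
import csv
import io

# Two rows copied from the taxi sample shown earlier, trimmed to five columns.
SAMPLE = """lpep_pickup_datetime,Passenger_count,Tolls_amount,Ehail_fee,Total_amount
2016-01-06 12:04:53,1,0,,9.76
2016-01-30 22:45:09,1,0,,19.8
"""


def empty_columns(csv_text):
    """Return the names of columns whose value is blank in every row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return []
    return [
        name
        for name in rows[0]
        if all(row[name].strip() == "" for row in rows)
    ]


print(empty_columns(SAMPLE))  # ['Ehail_fee']
```

A column that is blank everywhere carries no information for our search, which is one reason to keep only the columns of interest.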

In [4]:
def to_datetime(string_date):
    '''Parse a date string such as "2016-01-06 12:04:53" and return a datetime object'''
    # The format matches the timestamps shown in the table output above
    return dt.datetime.strptime(string_date, '%Y-%m-%d %H:%M:%S')
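As a small illustration of why converting strings to datetimes is useful, the sketch below parses the pickup and dropoff timestamps of the first taxi row shown earlier and computes the trip duration. The helper name `parse_timestamp` and the constant `ISO_FORMAT` are hypothetical, and the format string is an assumption based on how the timestamps appear in the displayed output.

```python
import datetime as dt

# Assumed format, based on values such as "2016-01-06 12:04:53" in the output.
ISO_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(text):
    """Convert a timestamp string into a datetime object."""
    return dt.datetime.strptime(text, ISO_FORMAT)


# Pickup and dropoff times copied from the first taxi row shown earlier.
pickup = parse_timestamp("2016-01-06 12:04:53")
dropoff = parse_timestamp("2016-01-06 12:14:21")

# Subtracting two datetimes gives a timedelta, which strings cannot do.
duration = dropoff - pickup
print(duration)  # 0:09:28
```

Once the columns hold datetime objects, comparisons such as "within 10 minutes of each other" become simple arithmetic.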

## Visualization

Before we begin trying to find Daredevil, we will explore some of the visualization tools that we can use to easily see the data. For technical reasons, we will be using folium for this purpose rather than the built-in mapping function in the datascience package. You can look through the folium [quickstart guide](https://folium.readthedocs.io/en/latest/) or use some of the helper functions we provide.

In [4]:
# This is the syntax to create an empty map centered at coordinates 40.7128,-74.0059
# These are the coordinates of NYC, so you can simply reuse them in any other maps for this lab
map_example = folium.Map(location=[40.7128,-74.0059])

# to display the map simply type the name
map_example

To plot points for the lab, folium provides a class called `Marker`. You can read more in the documentation [here](https://folium.readthedocs.io/en/latest/quickstart.html#markers). The basics are shown below.

In [5]:
# Creating a new marker at coordinates (40.8436, -73.5633)
marker_example = folium.Marker([40.8436, -73.5633])
# adds the marker to the map
marker_example.add_to(map_example)
# Note that there is no easy way to remove a marker once you add it to the map
# If you want to reset a map, simply run map_example = folium.Map(location=[40.7128,-74.0059])
# to create a new one instead

# display the map
map_example

We have provided a function addMarkers in the helpers.py file (already imported) that you may find useful. This function will automatically add markers to a map from a given table assuming the table has 2 columns called 'Latitude' and 'Longitude'.

In [6]:
# helpers.addMarkers will automatically add up to 100 points from a table
# here we add the first 100 complaints to map_example
helpers.addMarkers(map_example, complaints)

map_example



In [7]:
# You can also change the color and icon of the markers using the syntax below
helpers.addMarkers(map_example, taxis, color='red', icon='cloud')
# type help(folium.Icon) to get some details of what you can put in color and icon

map_example



## Analyzing Data

We now turn to how we will analyze the data. We will use the latitude and longitude columns from the complaints and taxi tables as well as the time columns of each table (so if you dropped these columns earlier, go back and change your selection so these columns are included).

Our reasoning is that Daredevil uses complaints sent to the NYPD to find the location of a crime and then goes there. Thus, if we look at a crime at which Daredevil was present, we expect to find a corresponding taxi ride that ends in the general area. Then, if we look at where this ride originated, we should (in theory) be able to find where Daredevil comes from and thus get closer to identifying him.

In the real world, you can imagine that we would use a variety of outside information to begin looking for specific people or to narrow our search (e.g. photos of celebrities getting out of taxis, knowledge of where someone lives, etc.).
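The matching idea described above (find rides that end near a crime, close to the time of the crime, then look at where they started) can be sketched in plain Python. This is only one possible approach, not the lab's solution: the function names, dictionary keys, the 0.5 km and 15 minute thresholds, and the three rides are all made up for illustration. Only the crime's coordinates and time come from the complaints sample shown earlier (the FORGERY row in Manhattan).

```python
import datetime as dt
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def rides_near_crime(rides, crime, max_km=0.5, max_minutes=15):
    """Return rides dropped off within max_km and max_minutes of the crime."""
    window = dt.timedelta(minutes=max_minutes)
    matches = []
    for ride in rides:
        distance = haversine_km(ride["drop_lat"], ride["drop_lon"],
                                crime["lat"], crime["lon"])
        gap = abs(ride["dropoff_dt"] - crime["time"])
        if distance <= max_km and gap <= window:
            matches.append(ride)
    return matches


# Crime location and time taken from the complaints sample shown earlier.
crime = {"lat": 40.7601, "lon": -73.979,
         "time": dt.datetime(2016, 1, 7, 16, 27, 0)}

# Hypothetical rides, invented purely to exercise the function.
rides = [
    {"pickup": (40.765, -73.99), "drop_lat": 40.7605, "drop_lon": -73.9795,
     "dropoff_dt": dt.datetime(2016, 1, 7, 16, 20, 0)},  # near in space and time
    {"pickup": (40.70, -73.95), "drop_lat": 40.6899, "drop_lon": -73.9809,
     "dropoff_dt": dt.datetime(2016, 1, 7, 16, 25, 0)},  # right time, too far away
    {"pickup": (40.75, -73.97), "drop_lat": 40.7602, "drop_lon": -73.9791,
     "dropoff_dt": dt.datetime(2016, 1, 7, 9, 0, 0)},    # right place, wrong time
]

candidates = rides_near_crime(rides, crime)
print([ride["pickup"] for ride in candidates])
```

Only the first ride satisfies both conditions, so its pickup location is the only candidate origin. Repeating this over many crimes and looking for pickup locations that keep reappearing is what narrows the search.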

In [11]:
# Run this cell to display the tables
taxis.show(10)
complaints.show(1)

Pickup_latitude,Pickup_longitude,Dropoff_latitude,Dropoff_longitude,Passenger_count,pickup_dt,dropoff_dt
40.83428192138672,-73.852294921875,40.83419799804688,-73.87448120117188,1,2016-01-15 17:58:18,2016-01-15 18:04:28
40.67536163330078,-73.98857879638672,40.67994689941406,-73.99510192871094,1,2016-01-15 20:26:20,2016-01-15 20:33:29
40.79811477661133,-73.95232391357422,40.78320693969727,-73.95670318603516,1,2016-01-09 08:43:10,2016-01-09 08:46:54
40.82011032104492,-73.93660736083984,40.80520629882813,-73.93932342529298,1,2016-01-17 17:48:42,2016-01-17 17:52:24
40.787330627441406,-73.9540786743164,40.79885864257813,-73.96993255615234,1,2016-01-29 15:14:41,2016-01-29 15:24:53
40.713600158691406,-73.95164489746094,40.70684814453125,-73.95042419433594,1,2016-01-04 23:03:56,2016-01-04 23:06:05
40.693626403808594,-73.98712921142578,40.69648742675781,-73.97052001953125,1,2016-01-29 17:50:39,2016-01-29 17:57:09
40.66464614868164,-73.99004364013672,40.66852188110352,-73.95066833496094,1,2016-01-17 00:12:45,2016-01-17 00:22:56
40.72539520263672,-73.95175170898438,40.71218872070313,-73.96368408203125,1,2016-01-25 19:46:44,2016-01-25 19:54:19
40.79849243164063,-73.9417953491211,40.79045867919922,-73.94763946533203,1,2016-01-28 11:51:35,2016-01-28 11:54:29


OFNS_DESC,PD_DESC,LAW_CAT_CD,BORO_NM,Longitude,Latitude,TIME
FRAUDS,"FRAUD,UNCLASSIFIED-MISDEMEANOR",MISDEMEANOR,QUEENS,-73.7408,40.6536,2016-01-20 08:00:00


Let's look at the taxi data first. Suppose we know that Daredevil took a taxi sometime near 11:00 pm (23:00) on January 4th. We can try to find the destination by looking at all taxi rides around that time in the taxi dataset. This is done below for you. Make sure you look at the code and understand what it is doing.

In [14]:
def time_near(time1, time2, error):
    '''
    Returns a boolean (True or False) indicating whether two datetimes are
    within error minutes of each other, in either direction.
    error is an integer number of minutes.
    '''
    # abs() makes the check symmetric; timedelta takes the keyword "minutes"
    return abs(time1 - time2) <= dt.timedelta(minutes=error)
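To see how a time-window check behaves, here is a small self-contained sketch that filters pickup times around 23:00 on January 4th. The helper name `within_minutes` and the 10 minute window are assumptions made for this illustration; the three pickup times are copied from the taxi table displayed above. A one-sided comparison such as `time1 - time2 <= limit` accepts any `time1` earlier than `time2`, because the difference is then negative, so the sketch compares the absolute difference.

```python
import datetime as dt


def within_minutes(time1, time2, minutes):
    """True when the two datetimes are at most `minutes` apart, either way."""
    return abs(time1 - time2) <= dt.timedelta(minutes=minutes)


target = dt.datetime(2016, 1, 4, 23, 0, 0)

# Pickup times copied from three rows of the taxi table displayed above.
pickups = [
    dt.datetime(2016, 1, 15, 17, 58, 18),
    dt.datetime(2016, 1, 4, 23, 3, 56),
    dt.datetime(2016, 1, 17, 0, 12, 45),
]

near = [p for p in pickups if within_minutes(p, target, 10)]
print(near)  # only the 2016-01-04 23:03:56 pickup remains

# A time 15 hours before the target passes a one-sided check but not this one.
morning = dt.datetime(2016, 1, 4, 8, 0, 0)
one_sided = (morning - target) <= dt.timedelta(minutes=10)
print(one_sided, within_minutes(morning, target, 10))  # True False
```

The same predicate can be applied to a table column to keep only the rides whose pickup time falls inside the window.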
