## Lab 2

Knowledge discovery and data mining allow us to gather information about customers. While the information might help us to make management decisions, we should avoid violating the privacy of customers. In this lab, we will explore the need for anonymity in datasets with personally identifying information.   

In [1]:
# importing some packages

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import zipfile

import os

from datetime import datetime as dt

import folium
import folium.plugins

# changing some settings

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 8)

%matplotlib inline
plt.rcParams['figure.figsize'] = (9,7)

#### License plates

We're going to look at some data collected by the Oakland Police Department. They have automated license plate readers on their police cars, and they've built up a database of license plates that they've seen -- and where and when they saw each one. It turns out the data is publicly available on the Oakland public records site. 

In [3]:
path_to_file =  os.environ["HOME"] + "/shared/lab-2/all-lprs.zip"

lprs_zip = zipfile.ZipFile(path_to_file, 'r')

We can download the compressed data in a zip file. Since the contents of the uncompressed data might require a large storage capacity, we should avoid unzipping the file until we inspect its content. 

In [7]:
list_names = [f.filename for f in lprs_zip.filelist]
list_names

['all-lprs.csv']

Here we use the `zipfile` pakackage. We learn that the zipped file contains a single comma separated value (csv) file. 

In [8]:
file_name = lprs_zip.filelist[0]

print("The compressed size of the file is {}MB".format(file_name.compress_size/1e6))
print("The uncompressed size of the file is {}MB".format(file_name.file_size/1e6))

The compressed size of the file is 27.428453MB
The uncompressed size of the file is 158.19924MB


If we used `lprs_zip.extractall()` to unzip the file, then we would need 185MB of storage. Instead, we can try to read the data directly from `all-lprs.csv` into memory.

In [None]:
!top

The command `top` allows us to monitor the system from the command line interface. We should not try to read a file of size `N`MB into a table unless we have at least `2N`MB of memory available on the system.

In [11]:
file_handle = lprs_zip.open(list_names[0])

lprs = pd.read_csv(file_handle)

file_handle.close()

Note that we need to

- open a handle to the file
- read the data with the `pandas` package
- close the handle to the file

Now we have a table containing the data. 

In [14]:
lprs

Unnamed: 0,red_VRM,red_Timestamp,Location
0,1275226,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
1,27529C,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
2,1158423,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
3,1273718,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
4,1077682,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
...,...,...,...
2742096,4HMN225,12/19/2013 08:28:00 PM,"(37.804198, -122.285053)"
2742097,5CQR629,12/19/2013 08:28:00 PM,"(37.804171, -122.284955)"
2742098,5X10319,12/19/2013 08:28:00 PM,"(37.804148, -122.284861)"
2742099,7D56240,12/19/2013 08:28:00 PM,"(37.804096, -122.284635)"


Let's start by renaming some columns, and then take a look at it.

In [17]:
lprs.rename(columns={'red_VRM' : 'Plate', 'red_Timestamp' : 'Timestamp'}, inplace = True)
lprs

Unnamed: 0,Plate,Timestamp,Location
0,1275226,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
1,27529C,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
2,1158423,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
3,1273718,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
4,1077682,01/19/2011 02:06:00 AM,"(37.798304999999999, -122.27574799999999)"
...,...,...,...
2742096,4HMN225,12/19/2013 08:28:00 PM,"(37.804198, -122.285053)"
2742097,5CQR629,12/19/2013 08:28:00 PM,"(37.804171, -122.284955)"
2742098,5X10319,12/19/2013 08:28:00 PM,"(37.804148, -122.284861)"
2742099,7D56240,12/19/2013 08:28:00 PM,"(37.804096, -122.284635)"


We have a lot of records with about 2.7 million license plate reads here.

Let's start by seeing what can be learned about someone, using this data -- assuming you know their license plate.

#### Personally Identifying Information

A former mayor of Oakland is Jean Quan. Her license plate number is 6FCH845.  (How did I learn that?  Turns out she was in the news for getting $1000 of parking tickets, and [the news article](http://www.sfgate.com/bayarea/matier-ross/article/Jean-Quan-Oakland-s-new-mayor-gets-car-booted-3164530.php) included a picture of her car, with the license plate visible.  You'd be amazed by what's out there on the Internet...)

In [25]:
lprs[lprs['Plate'] == '6FCH845']

Unnamed: 0,Plate,Timestamp,Latitude,Longitude
1301320,6FCH845,11/01/2012 09:04:00 AM,37.79871,-122.276221
1369630,6FCH845,10/24/2012 11:15:00 AM,37.799695,-122.274868
1369854,6FCH845,10/24/2012 11:01:00 AM,37.799693,-122.274806
1369967,6FCH845,10/24/2012 10:20:00 AM,37.799735,-122.274893
2242582,6FCH845,05/08/2014 07:30:00 PM,37.797558,-122.26935
2648779,6FCH845,12/31/2013 10:09:00 AM,37.807556,-122.278485


Her car shows up 6 times in this data set.  However, it's hard to make sense of those coordinates. So, let's work out a way to show where her car has been seen on a map.  

In [19]:
def getlatitude(s):
    before, after = s.split(',') # Break it into two parts
    latstring = before[1:] # Get rid of the annoying '('
    return float(latstring) # Convert the string to a number

def getlongitude(s):
    before, after = s.split(',') # Break it into two parts
    longstring = after[1:-1] # Get rid of the ' ' and the ')'
    return float(longstring) # Convert the string to a number

We'll need to extract the latitude and longitude. We can split the string into two pieces: the stuff before the comma (the latitude) and the stuff after (the longitude).

Let's test it to make sure it works correctly.

In [20]:
getlatitude('(37.797558, -122.26935)')

37.797558

In [21]:
getlongitude('(37.797558, -122.26935)')

-122.26935

Now we're ready to add extra columns to the table.

In [24]:
lprs['Latitude'] = lprs["Location"].apply(getlatitude)
lprs['Longitude'] = lprs["Location"].apply(getlongitude)
lprs.drop(columns=['Location'], inplace = True)
lprs

Unnamed: 0,Plate,Timestamp,Latitude,Longitude
0,1275226,01/19/2011 02:06:00 AM,37.798305,-122.275748
1,27529C,01/19/2011 02:06:00 AM,37.798305,-122.275748
2,1158423,01/19/2011 02:06:00 AM,37.798305,-122.275748
3,1273718,01/19/2011 02:06:00 AM,37.798305,-122.275748
4,1077682,01/19/2011 02:06:00 AM,37.798305,-122.275748
...,...,...,...,...
2742096,4HMN225,12/19/2013 08:28:00 PM,37.804198,-122.285053
2742097,5CQR629,12/19/2013 08:28:00 PM,37.804171,-122.284955
2742098,5X10319,12/19/2013 08:28:00 PM,37.804148,-122.284861
2742099,7D56240,12/19/2013 08:28:00 PM,37.804096,-122.284635


Now we can draw a map with a marker everywhere that her car has been seen.

In [26]:
jeanquan = lprs[lprs['Plate'] == '6FCH845']
jeanquan

Unnamed: 0,Plate,Timestamp,Latitude,Longitude
1301320,6FCH845,11/01/2012 09:04:00 AM,37.79871,-122.276221
1369630,6FCH845,10/24/2012 11:15:00 AM,37.799695,-122.274868
1369854,6FCH845,10/24/2012 11:01:00 AM,37.799693,-122.274806
1369967,6FCH845,10/24/2012 10:20:00 AM,37.799735,-122.274893
2242582,6FCH845,05/08/2014 07:30:00 PM,37.797558,-122.26935
2648779,6FCH845,12/31/2013 10:09:00 AM,37.807556,-122.278485


In [93]:
OAKLAND_COORDINATES = (37.798710, -122.276221)
oakland_map = folium.Map(location = OAKLAND_COORDINATES, zoom_start=14)
oakland_map.add_child(folium.LatLngPopup())

locations = jeanquan[['Latitude', 'Longitude']].astype('float').values.tolist()

for coordinates in locations:    
    oakland_map.add_child(folium.Marker(
        coordinates
    ));

display(oakland_map)

Her car has been sited near the Oakland police department.  This should make you suspect we might be getting a bit of a biased sample.  Why might the Oakland PD be the most common place where her car is seen?  Can you come up with a plausible explanation for this?

#### Date and Time

We can try to incorporate the other column of the table into the map. We want to distinguish between 

- morning or afternoon on a weekday 
- evening on a weekday 
- weekend 

We will encode the different times with colors.

In [67]:
def get_color(ts):
    t = dt.strptime(ts, '%m/%d/%Y %I:%M:%S %p')

    output = "green" # Weekend
    
    if t.weekday() < 6:
        if t.hour >= 6 and t.hour <= 17:
            output = 'blue' # Weekday daytime
        else: 
            output = 'red' # Weekday evening
    
    return output

lprs['Color'] = lprs['Timestamp'].apply(get_color)
lprs

Unnamed: 0,Plate,Timestamp,Latitude,Longitude,Color
0,1275226,01/19/2011 02:06:00 AM,37.798305,-122.275748,red
1,27529C,01/19/2011 02:06:00 AM,37.798305,-122.275748,red
2,1158423,01/19/2011 02:06:00 AM,37.798305,-122.275748,red
3,1273718,01/19/2011 02:06:00 AM,37.798305,-122.275748,red
4,1077682,01/19/2011 02:06:00 AM,37.798305,-122.275748,red
...,...,...,...,...,...
2742096,4HMN225,12/19/2013 08:28:00 PM,37.804198,-122.285053,red
2742097,5CQR629,12/19/2013 08:28:00 PM,37.804171,-122.284955,red
2742098,5X10319,12/19/2013 08:28:00 PM,37.804148,-122.284861,red
2742099,7D56240,12/19/2013 08:28:00 PM,37.804096,-122.284635,red


Here we use the `datetime` package for checking day of the week and time of day. 

In [68]:
firechief = lprs[lprs['Plate'] == '1328354']
firechief

Unnamed: 0,Plate,Timestamp,Latitude,Longitude,Color
873561,1328354,09/13/2012 11:17:00 PM,37.837506,-122.268958,red
1105846,1328354,11/19/2012 04:36:00 PM,37.806201,-122.270495,blue
1214085,1328354,11/09/2012 03:27:00 PM,37.805775,-122.270203,blue
1225361,1328354,11/07/2012 06:00:00 PM,37.827588,-122.271633,red
1323440,1328354,10/29/2012 01:21:00 PM,37.806306,-122.271138,blue
1323465,1328354,10/29/2012 01:19:00 PM,37.80531,-122.270435,blue
2215439,1328354,05/13/2014 05:50:00 PM,37.80898,-122.274521,blue
2240326,1328354,05/09/2014 04:59:00 PM,37.805856,-122.274986,blue
2478496,1328354,01/19/2014 11:52:00 PM,37.773185,-122.138043,green
2519124,1328354,01/15/2014 03:29:00 PM,37.806376,-122.270555,blue


Let's look at the data for the car of the Oakland Fire Chief. We can add the date and time along with the corresponding color to the map.

In [91]:
OAKLAND_COORDINATES = (37.798710, -122.276221)
oakland_map = folium.Map(location = OAKLAND_COORDINATES, zoom_start=14)
oakland_map.add_child(folium.LatLngPopup())

locations = firechief[['Latitude', 'Longitude']].astype('float').values.tolist()
timestamps = firechief["Timestamp"].astype('str').values.tolist()
colors = firechief["Color"].astype('str').values.tolist()

for coordinates, time, color in zip(locations, timestamps, colors): 
        oakland_map.add_child(
            folium.Marker(
            coordinates,
            popup = time,
            icon=folium.Icon(color=color)))
    
display(oakland_map)

We can see a blue cluster in downtown Oakland, where the Fire Chief's car was seen on weekdays during business hours.  I bet we've found her office.  In fact, if you happen to know downtown Oakland, then we realize those are mostly clustered right near City Hall.  

Also, her car was seen twice in northern Oakland on weekday evenings.  One can only speculate what that indicates.  Maybe dinner with a friend?  Or running errands?  Off to the scene of a fire?  Who knows.  And then the car has been seen once more, late at night on a weekend, in a residential area in the hills.  Her home address, maybe?

In [94]:
unknown = lprs[lprs['Plate'] == '5AJG153']
unknown

Unnamed: 0,Plate,Timestamp,Latitude,Longitude,Color
118606,5AJG153,07/31/2011 02:45:00 AM,37.792756,-122.250411,green
120255,5AJG153,07/30/2011 01:38:00 AM,37.792836,-122.250491,red
121127,5AJG153,07/29/2011 10:01:00 PM,37.792888,-122.250558,red
122113,5AJG153,07/29/2011 12:46:00 AM,37.793130,-122.250885,red
123648,5AJG153,07/28/2011 04:46:00 AM,37.792766,-122.250466,red
...,...,...,...,...,...
2608433,5AJG153,01/04/2014 10:26:00 PM,37.757575,-122.164038,red
2619837,5AJG153,01/03/2014 02:15:00 PM,37.801806,-122.243988,blue
2671407,5AJG153,12/27/2013 08:31:00 AM,37.793115,-122.250221,blue
2691532,5AJG153,12/25/2013 12:52:00 AM,37.793110,-122.250880,red


Here we have an license plate corresponding to an unknown person. What can we infer from the locations?

In [95]:
OAKLAND_COORDINATES = (37.798710, -122.276221)
oakland_map = folium.Map(location = OAKLAND_COORDINATES, zoom_start=14)
oakland_map.add_child(folium.LatLngPopup())

locations = unknown[['Latitude', 'Longitude']].astype('float').values.tolist()
timestamps = unknown["Timestamp"].astype('str').values.tolist()
colors = unknown["Color"].astype('str').values.tolist()

for coordinates, time, color in zip(locations, timestamps, colors): 
        oakland_map.add_child(
            folium.Marker(
            coordinates,
            popup = time,
            icon=folium.Icon(color=color)))
    
display(oakland_map)

What can we tell from this?  Looks to me like this person lives on International Blvd and 9th, roughly.  On weekdays they've seen in a variety of locations in west Oakland.  It's fun to imagine what this might indicate -- delivery person? taxi driver? someone running errands all over the place in west Oakland?

In [97]:
another_unknown = lprs[lprs['Plate'] == '6UZA652']
another_unknown

Unnamed: 0,Plate,Timestamp,Latitude,Longitude,Color
248107,6UZA652,06/21/2012 03:51:00 AM,37.793061,-122.257675,red
248437,6UZA652,06/21/2012 01:44:00 AM,37.793005,-122.257695,red
278040,6UZA652,06/16/2012 12:21:00 AM,37.793185,-122.257441,red
329076,6UZA652,06/03/2012 12:08:00 PM,37.793866,-122.256443,green
383934,6UZA652,05/24/2012 03:27:00 AM,37.788281,-122.244280,red
...,...,...,...,...,...
2095188,6UZA652,03/21/2014 06:40:00 PM,37.821738,-122.281783,red
2241614,6UZA652,05/09/2014 09:44:00 AM,37.788465,-122.235518,blue
2241652,6UZA652,05/09/2014 08:58:00 AM,37.788260,-122.235413,blue
2339790,6UZA652,04/19/2014 11:00:00 AM,37.787948,-122.243615,blue


Here we have an license plate corresponding to an unknown person. What can we infer from the locations?

In [98]:
OAKLAND_COORDINATES = (37.798710, -122.276221)
oakland_map = folium.Map(location = OAKLAND_COORDINATES, zoom_start=14)
oakland_map.add_child(folium.LatLngPopup())

locations = another_unknown[['Latitude', 'Longitude']].astype('float').values.tolist()
timestamps = another_unknown["Timestamp"].astype('str').values.tolist()
colors = another_unknown["Color"].astype('str').values.tolist()

for coordinates, time, color in zip(locations, timestamps, colors): 
        oakland_map.add_child(
            folium.Marker(
            coordinates,
            popup = time,
            icon=folium.Icon(color=color)))
    
display(oakland_map)

What can we learn from this map?  First, it's pretty easy to guess where this person lives: 16th and International, or pretty near there.  And then we can see them spending some nights and a weekend near Laney College.  Did they have an apartment there briefly?  A relationship with someone who lived there?

#### Privacy Concerns

As we can see, this kind of data can potentially reveal a fair bit about people.  Someone with access to the data can draw inferences.  Take a moment to think about what someone might be able to infer from this kind of data.
 
As we've seen here, it's not too hard to make a pretty good guess at roughly where some lives, from this kind of information: their car is probably parked near their home most nights.  Also, it will often be possible to guess where someone works: if they commute into work by car, then on weekdays during business hours, their car is probably parked near their office, so we'll see a clear cluster that indicates where they work.

But it doesn't stop there.  If we have enough data, it might also be possible to get a sense of what they like to do during their downtime (do they spend time at the park?).  And in some cases the data might reveal that someone is in a relationship and spending nights at someone else's house.  That's arguably pretty sensitive stuff.

This gets at one of the challenges with privacy.  Data that's collected for one purpose (fighting crime, or something like that) can potentially reveal a lot more.  It can allow the owner of the data to draw inferences -- sometimes about things that people would prefer to keep private.  And that means that, in a world of "big data", if we're not careful, privacy can be collateral damage.

#### Anonymizing Data

If we want to protect people's privacy, what can be done about this?  That's a lengthy subject.  But at risk of over-simplifying, there are a few simple strategies that data owners can take:

1. Minimize the data they have.  Collect only what they need, and delete it after it's not needed.

2. Control who has access to the sensitive data.  Perhaps only a handful of trusted insiders need access; if so, then one can lock down the data so only they have access to it.  One can also log all access, to deter misuse.

3. Anonymize the data, so it can't be linked back to the individual who it is about.  Unfortunately, this is often harder than it sounds.

4. Engage with stakeholders.  Provide transparency, to try to avoid people being taken by surprise.  Give individuals a way to see what data has been collected about them.  Give people a way to opt out and have their data be deleted, if they wish.  Engage in a discussion about values, and tell people what steps you are taking to protect them from unwanted consequences.

This only scratches the surface of the subject.  My main goal was to make you aware of privacy concerns, so that if you are ever a steward of a large data set, you can think about how to protect people's data and use it responsibly.