## Surveillance Society - class 1

- Our data is logged through Google Location History: https://www.google.com/maps/timeline
- And exported using Google Takeout: https://takeout.google.com/settings/takeout

In [None]:
# Let's load it into a Python dictionary

import json

floc = 'data/'
fname = 'gilad.json'

# a context manager closes the file once the JSON is parsed
with open(floc + fname, 'r') as f:
    j = json.load(f)

In [None]:
j

In [None]:
print('number of entries: %d' % len(j['locations']))

In [None]:
# different fields are filled at different points in time

for i in range(10):
    for k, v in j['locations'][i].items():
        print('%s %s' % (k, v))
    print('')


## Pandas Dataframe

In [None]:
import pandas as pd

df = pd.DataFrame.from_dict(j['locations'])
print('There are %s rows' % len(df))

In [None]:
df.head()

In [None]:
df.fillna(0).head()

In [None]:
# convert lat, lon to decimal degrees (the E7 fields store degrees multiplied by 10^7)

df['lat'] = df['latitudeE7'] / 10.**7
df['lon'] = df['longitudeE7'] / 10.**7
df.head()

## Exploring our data

In [None]:
%pylab inline

df.altitude.plot()
title('altitude')

In [None]:
# we can use the seaborn library for prettier plotting (many more examples here - http://seaborn.pydata.org/)
import seaborn as sns
sns.set(color_codes=True)

df.accuracy.plot(figsize=(10,4))
title('accuracy')

In [None]:
df.accuracy[:15]

In [None]:
# Let's try to understand the underlying data here
print('N: %s' % len(df.accuracy))
print('mean: %s' % df.accuracy.mean())
print('median: %s' % df.accuracy.median())
print('mode: %s' % df.accuracy.mode().tolist())
print('std: %s' % df.accuracy.std())
print('max: %s' % df.accuracy.max())
print('min: %s' % df.accuracy.min())

In [None]:
# here's a histogram of this data -> what kind of distribution is this?

sns.distplot(df.accuracy, bins=100, kde=False) # <- try changing the number of bins in the histogram
#df.accuracy.hist() # --> similar way to plot a histogram

A histogram is a graphical representation of the distribution of data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.
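
The binning step described above can be sketched by hand in plain Python. This is only an illustration under stated assumptions: `histogram_counts` is a hypothetical helper (not part of pandas or seaborn), it uses equal-width bins, and the sample is synthetic data standing in for `df.accuracy`.

```python
import random

def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width intervals and count values per bin.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(n_bins)
    counts = [0] * n_bins
    for v in values:
        # the maximum value would fall just past the last bin, so clamp it
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# synthetic stand-in for df.accuracy (made-up numbers, not real location data)
random.seed(0)
sample = ([random.gauss(20, 5) for _ in range(500)] +
          [random.gauss(60, 10) for _ in range(500)])

edges, counts = histogram_counts(sample, n_bins=10)
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print('%6.1f to %6.1f: %d' % (left, right, c))
```

Changing `n_bins` here has the same effect as changing `bins` in the plotting call: fewer bins smooth the shape, more bins expose finer structure along with more noise.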

In [None]:
# now let's only look at values under 1000
df_accuracy = df.accuracy[df.accuracy < 1000]

In [None]:
sns.distplot(df_accuracy, bins=500, kde=False)

- Multi-modal distribution (here bimodal): a continuous distribution with two distinct modes (peaks)
- Why would mobile data look like this? Perhaps it reflects the nature of triangulation?

In [None]:
# Box Plot (a.k.a. box-and-whisker plot)

df.accuracy.plot(kind='box', vert=False, sym='k.', figsize=(8,4))

In [None]:
df_accuracy.plot(kind='box', vert=False, sym='k.', figsize=(8,4))

In [None]:
df.accuracy[df.accuracy < 120].plot(kind='box', vert=False, sym='k.', figsize=(8,4))
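
To see what the box plots above are drawing, here is a minimal sketch of the numbers behind them: the quartiles, the interquartile range (IQR), and the 1.5 x IQR fences beyond which points are drawn as outliers (1.5 is matplotlib's default `whis`). The helper names `quantile` and `box_stats` are hypothetical, and the readings are made up for illustration.

```python
def quantile(sorted_vals, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    pos = (len(sorted_vals) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] + (sorted_vals[upper] - sorted_vals[lower]) * frac

def box_stats(values, whis=1.5):
    """Return the quartiles, IQR, fences and outliers that a box plot displays."""
    vals = sorted(values)
    q1 = quantile(vals, 0.25)
    med = quantile(vals, 0.5)
    q3 = quantile(vals, 0.75)
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # anything outside the fences is plotted as an individual outlier point
    outliers = [v for v in vals if v < low_fence or v > high_fence]
    return {'q1': q1, 'median': med, 'q3': q3, 'iqr': iqr,
            'low_fence': low_fence, 'high_fence': high_fence,
            'outliers': outliers}

# made-up accuracy readings, with one extreme value
readings = [5, 8, 10, 12, 15, 20, 22, 25, 30, 1500]
stats = box_stats(readings)
for name in ['q1', 'median', 'q3', 'iqr', 'low_fence', 'high_fence', 'outliers']:
    print('%s: %s' % (name, stats[name]))
```

A single extreme reading barely moves the median or the quartiles, which is why the box stays readable while the mean and standard deviation printed earlier are pulled far upward.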

In [None]:
sns.boxplot(df.altitude)
title('altitude')

In [None]:
# remove the outlier
sns.boxplot(df.altitude[df.altitude < 10000])

In [None]:
df.corr()

In [None]:
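# sns.corrplot was deprecated in favor of sns.heatmap in newer seaborn releases;
# a heatmap of the numeric columns' correlation matrix is a close substitute
sns.heatmap(df.select_dtypes(include='number').corr(), annot=False)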

### Pickle the data

In [None]:
import pickle

# a context manager makes sure the file is flushed and closed
with open('ssoc_df_1.p', 'wb') as f:
    pickle.dump(df, f)

### Google APIs

Instructions to set up a Google API key:
- https://console.developers.google.com/apis/dashboard

- You need to enable the Google Maps JavaScript API under APIs in the Google API Console
    - https://console.developers.google.com/apis/api/maps_backend/overview?project=genuine-cirrus-115405


### Visualize on a map

In [None]:
# pip install gmaps
import gmaps
import gmaps.datasets

api_key = ''

# insert Google API key
gmaps.configure(api_key=api_key)

In [None]:
# array of (latitude, longitude) pairs
data = list(zip(df.lat, df.lon))

In [None]:
# instantiate a gmaps object
m = gmaps.Map()

# add a layer (heatmap) to it using our data
heatmap_layer = gmaps.Heatmap(data=data)
heatmap_layer.gradient = ['white', 'red']
heatmap_layer.point_radius = 3
heatmap_layer.max_intensity = 2
m.add_layer(heatmap_layer)

m

## Questions

1. Describe the data fields collected by the phone
    - Why are some filled at times and others not?
    - Why are there times when there are more data entries?
    - What do the 'heading' or 'verticalAccuracy' fields represent?
2. Where do you observe outliers?
    - Describe the outlier. What was your target doing?
    - What is noise that you need to filter out, and how do you go about making that choice?
3. Describe the top locations of your target. 
    - Can you identify home vs. work? Use a plot/map to show.
    - Share a screenshot of this map in our shared Slack channel.
4. Use external data sources to describe the top areas visited by your target in terms of demographics, average income, race, and any other salient information you think is important.
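
As one possible starting point for question 3, the sketch below finds the most visited places by rounding coordinates to a coarse grid and counting points per cell. It is an illustration, not a prescribed method: `top_locations` is a hypothetical helper, the coordinates are made up, and rounding to 3 decimal places (roughly 100 m in latitude) is an assumed grid size you may want to change.

```python
from collections import Counter

def top_locations(points, decimals=3, n=3):
    """Snap (lat, lon) pairs to a grid by rounding, then count points per cell."""
    cells = Counter((round(lat, decimals), round(lon, decimals))
                    for lat, lon in points)
    return cells.most_common(n)

# made-up coordinates for illustration only
points = [
    (40.7292, -73.9964), (40.7293, -73.9963), (40.7291, -73.9962),
    (40.7531, -73.9822), (40.7533, -73.9821),
    (40.6892, -74.0443),
]

for (lat, lon), count in top_locations(points):
    print('%.3f, %.3f: %d points' % (lat, lon, count))
```

With the notebook's dataframe, the same function could be fed `zip(df.lat, df.lon)`. Note that a place sitting on a grid boundary can be split across two cells, so the counts are approximate.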