# Workflow example 1: To what extent are you a global citizen?

### Steps ###
0. **Formulate Research Question**
1. **Determine which variable should be computed to answer RQ**
2. **Select raw data source**
3. **List available information in raw data source**
4. **Select required information to compute variable**
5. **Compute variable**
6. **Send result**
____________________________________________________________________________________________________________

In [1]:
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt
from pathlib import Path
import reverse_geocoder as rg

In [2]:
#proj_lib = Path.home().joinpath('anaconda3','share','proj')
proj_lib = '/home/mgdevos/anaconda3/share/proj'
os.environ['PROJ_LIB'] = str(proj_lib)
from mpl_toolkits.basemap import Basemap

### 0. Research Question

--> To what extent are you a global citizen?

### 1. Variable

* Number of cities visited over time
* Distance traveled over time
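
This section does not say how "distance traveled" is measured. One common choice, assumed here rather than taken from the notebook, is the great-circle (haversine) distance between consecutive GPS fixes. The sketch below uses only the standard library; the names `haversine_km`, `total_distance_km`, and the sample `track` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def total_distance_km(points):
    """Sum the distances between consecutive (lat, lon) points."""
    return sum(haversine_km(*p, *q) for p, q in zip(points, points[1:]))


# Hypothetical track: three points along the equator, one degree apart.
track = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
track_length = total_distance_km(track)
```

Applied to the location history, the same function could be fed the `lat` and `lon` columns in chronological order. GPS jitter while stationary inflates the total, so some smoothing or a minimum-step threshold may be needed.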

### 2. Raw data source

_TODO: Interface to support the selection of a suitable raw data source_

--> Google Location History

### 3. List available information

--> Google Location History

**Load data set**

In [3]:
one_drive = Path('/media/sf_vos00076/OneDrive - Universiteit Utrecht')
data_dir = one_drive / Path('Projects', 'data_donation','data')

In [4]:
json_file = data_dir / Path('Location History', 'Location History.json')

In [5]:
with open(json_file) as f:
    data = json.load(f)

In [6]:
df = pd.DataFrame(data['locations'])

### 4. Select required information

_TODO: The above-mentioned interface should also support selecting the required information_

In [7]:
df_gps = pd.read_json(json_file)

In [8]:
df_gps.head()

| | locations |
|---|---|
| 0 | {'timestampMs': '1397133815401', 'latitudeE7':... |
| 1 | {'timestampMs': '1397133865013', 'latitudeE7':... |
| 2 | {'timestampMs': '1397133925137', 'latitudeE7':... |
| 3 | {'timestampMs': '1397133986640', 'latitudeE7':... |
| 4 | {'timestampMs': '1397134046643', 'latitudeE7':... |


In [9]:
# parse lat, lon, and timestamp from the dict inside the locations column
df_gps['lat'] = df_gps['locations'].map(lambda x: x['latitudeE7'])
df_gps['lon'] = df_gps['locations'].map(lambda x: x['longitudeE7'])
df_gps['timestamp_ms'] = df_gps['locations'].map(lambda x: x['timestampMs'])

In [10]:
# convert lat/lon from E7 integers to decimal degrees and the timestamp to a date-time string (local time zone)
df_gps['lat'] = df_gps['lat'] / 10.**7
df_gps['lon'] = df_gps['lon'] / 10.**7
df_gps['timestamp_ms'] = df_gps['timestamp_ms'].astype(float) / 1000
df_gps['datetime'] = df_gps['timestamp_ms'].map(lambda x: dt.fromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S'))
date_range = '{}-{}'.format(df_gps['datetime'].min()[:4], df_gps['datetime'].max()[:4])
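
The conversion above can be checked on a single record. The sketch below is a standard-library illustration, not the notebook's own code: `parse_location` is a hypothetical helper, and the latitude and longitude in `record` are made-up E7 values, since the real ones are truncated in the output above. It also fixes the time zone to UTC, whereas `dt.fromtimestamp(x)` without a `tz` argument returns the local time of the machine running the notebook.

```python
from datetime import datetime, timezone


def parse_location(record):
    """Return (lat, lon, utc_datetime) for one Location History record."""
    lat = record["latitudeE7"] / 1e7   # E7 integer -> decimal degrees
    lon = record["longitudeE7"] / 1e7
    seconds = int(record["timestampMs"]) / 1000  # milliseconds -> seconds
    return lat, lon, datetime.fromtimestamp(seconds, tz=timezone.utc)


# Timestamp taken from the first row above; coordinates are hypothetical.
record = {
    "timestampMs": "1397133815401",
    "latitudeE7": 526166190,
    "longitudeE7": 46231730,
}
lat, lon, ts = parse_location(record)
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # 2014-04-10 12:43:35 (UTC)
```

The two-hour offset from the `14:43:35` shown later in this notebook is consistent with the original having been run in a UTC+2 local time zone.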

In [11]:
# drop columns we don't need
df_gps = df_gps.drop(labels=['locations', 'timestamp_ms'], axis=1, inplace=False)

In [12]:
# Convert date-time strings to pandas datetimes and set them as the index to facilitate time-series manipulations
df_gps['dt_datetime'] = pd.to_datetime(df_gps['datetime'])
df_gps = df_gps.set_index(['dt_datetime'])
df_gps.drop(['datetime'],axis=1,inplace=True)

In [13]:
df_gps['address'] = rg.search(list(zip(df_gps['lat'].values,df_gps['lon'].values)))

Loading formatted geocoded file...


In [14]:
df_gps['country'] = df_gps['address'].map(lambda x: x['cc'])
df_gps['city'] = df_gps['address'].map(lambda x: x['name'])
df_gps.drop(['address'],axis=1,inplace=True)

In [15]:
df_gps.head()

| dt_datetime | lat | lon | country | city |
|---|---|---|---|---|
| 2014-04-10 14:43:35 | 52.616619 | 4.623173 | NL | Egmond aan Zee |
| 2014-04-10 14:44:25 | 52.616636 | 4.623175 | NL | Egmond aan Zee |
| 2014-04-10 14:45:25 | 52.616623 | 4.623125 | NL | Egmond aan Zee |
| 2014-04-10 14:46:26 | 52.616603 | 4.623285 | NL | Egmond aan Zee |
| 2014-04-10 14:47:26 | 52.616626 | 4.623207 | NL | Egmond aan Zee |


### 5. Compute variable


**Number of visited locations over time**

In [16]:
# assumes the index is sorted chronologically (oldest record first)
first = df_gps.index[0]
last = df_gps.index[-1]
timespan = last - first

In [18]:
cities = df_gps.groupby(['city']).city.agg('count')

In [19]:
countries = df_gps.groupby(['country']).country.agg('count')

In [20]:
"You have visited {} cities in {} countries over a period of {} days"\
.format(cities.size,countries.size,timespan.days)

'You have visited 368 cities in 8 countries over a period of 1757 days'
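
The totals above summarise the whole period. To look at the variable "over time", one option is to count, per month, the cities seen for the first time and accumulate those counts. The following is a minimal sketch under that assumption; `cumulative_new_cities` and the `sample` frame are hypothetical stand-ins for `df_gps`, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample shaped like df_gps: datetime index plus a 'city' column.
sample = pd.DataFrame(
    {"city": ["Egmond aan Zee", "Egmond aan Zee", "Utrecht",
              "Egmond aan Zee", "Amsterdam"]},
    index=pd.to_datetime([
        "2014-04-10 14:43:35",
        "2014-04-11 09:00:00",
        "2014-05-02 12:00:00",
        "2014-05-20 08:30:00",
        "2014-07-01 18:15:00",
    ]),
)


def cumulative_new_cities(df):
    """Cumulative number of distinct cities, indexed by calendar month."""
    ordered = df.sort_index()
    # Keep only the first record for each city.
    first_seen = ordered.loc[~ordered["city"].duplicated(), "city"]
    # Count first visits per month.
    per_month = first_seen.groupby(first_seen.index.to_period("M")).count()
    # Fill months without new cities so the series has no gaps.
    months = pd.period_range(per_month.index.min(), per_month.index.max(), freq="M")
    return per_month.reindex(months, fill_value=0).cumsum()


cities_over_time = cumulative_new_cities(sample)
print(cities_over_time)
```

Replacing `"city"` with `"country"` would give the corresponding curve for countries, and the resulting series can be passed to `plot()` for a quick visual check.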