## Class 12

Throughout our classes we have mainly worked with open datasets. Because these datasets are publicly available, we learned that the data are already anonymized when they are loaded onto open data platforms. But the fact that the data do not include PII such as phone numbers and email addresses does not mean we can’t learn a lot about individuals who use Citi Bike. Patterns in the data can create loopholes in the privacy of people’s data.

In the NYC DoITT (Department of Information Technology and Telecommunications) open data technical manual, a "dataset" and a "public data set" are defined as follows:

Dataset: *A named collection of related records on a storage device, with the
collection containing individual data units organized or formatted in a
specific and prescribed way, often in tabular form, and accessed by a
specific access method that is based on the data set organization.*

Public data set: *A comprehensive collection of interrelated data that is available for
inspection by the public in accordance with any provision of law and is
maintained on a computer system by, or on behalf of, an Agency, excluding
any data to which an Agency may deny access pursuant to the Public
Officers Law or any other provision of law or any federal or state rule or
regulation.*

source: https://www1.nyc.gov/assets/doitt/downloads/pdf/nyc_open_data_tsm.pdf

We will see that the line between related and unrelated data can be blurry.

In [None]:
import numpy as np
import pandas as pd 

### Citi Bike trip data
Citi Bike data is probably one of the most fun open datasets. Released monthly, the trip data records each and every trip taken by individual riders. As we will see soon, each row in the data represents one trip. For each trip we have a lot of information, such as the date, the duration of the trip (in seconds), the gender of the user, and the Citi Bike stations where the bike was picked up and docked.


Today we are working with the most up-to-date data from March 2020:

https://www.citibikenyc.com/system-data

Columns in the data: 
1. Trip Duration (seconds)
2. Start Time and Date
3. Stop Time and Date
4. Start Station Name
5. End Station Name
6. Station ID
7. Station Lat/Long
8. Bike ID
9. User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
10. Gender (Zero=unknown; 1=male; 2=female)
11. Year of Birth
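
Two of these columns are coded rather than spelled out: gender is an integer code, and age has to be derived from the year of birth. Here is a minimal sketch on a few made-up rows. The column names `gender` and `birth year` follow the March 2020 file; the values and the names `GENDER_LABELS` and `REFERENCE_YEAR` are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real trip records.
toy = pd.DataFrame({
    "gender": [0, 1, 2],
    "birth year": [1969, 1972, 1992],
})

GENDER_LABELS = {0: "unknown", 1: "male", 2: "female"}  # codes as listed above
REFERENCE_YEAR = 2020  # assumption: we measure age in the year of the data

# Translate the integer codes into readable labels.
toy["gender label"] = toy["gender"].map(GENDER_LABELS)

# Approximate age: the year of birth alone cannot tell us
# whether the birthday has already passed this year.
toy["approx age"] = REFERENCE_YEAR - toy["birth year"]

print(toy)
```

Going the other way, a rider described as 48 years old in 2020 would be looked up with a birth year of 1972.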

### Load data:

In [None]:
citi_bike_march = pd.read_csv('202003-citibike-tripdata.csv')

In [None]:
citi_bike_march.head() 

In [None]:
len(citi_bike_march)

In [None]:
citi_bike_march['start station name'].unique()

In [None]:
citi_bike_march['end station name'].unique()

Remember that we are working with the March 2020 data. By the end of the second week of March, many people had already started working from home and stopped commuting.
Therefore, the first two weeks of the month are likely to reveal the more “normal” patterns. To emphasize this point, let’s compare the number of rides in the first two weeks of March with the rest of the month (which is actually about 2.5 weeks). We see that rides before March 15th account for about 70% of the total rides in March:

In [None]:
print('Total rides', len(citi_bike_march))
# starttime is stored as year-first text, so string comparison orders it chronologically
before_15th = citi_bike_march['starttime'] < '2020-03-15'
print('March 1st - March 14th rides', before_15th.sum())
print('March 15th and later', (~before_15th).sum())
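
The cell above compares `starttime` as text, which works only because the timestamps are written year-first. A slightly more robust variant parses the column with `pd.to_datetime` and computes the share directly. This is a sketch on made-up timestamps; the toy values and the names `cutoff`, `before` and `share_before` are assumptions for illustration.

```python
import pandas as pd

# Made-up start times in the same text format as the trip file.
toy = pd.DataFrame({
    "starttime": [
        "2020-03-01 00:00:03.6400",
        "2020-03-09 08:15:00.0000",
        "2020-03-14 23:59:59.0000",
        "2020-03-20 17:30:00.0000",
    ]
})

# Parse the text into real timestamps so comparisons are chronological by construction.
start = pd.to_datetime(toy["starttime"])
cutoff = pd.Timestamp("2020-03-15")

before = start < cutoff        # boolean Series: True for rides before March 15th
share_before = before.mean()   # the mean of booleans is the fraction of True values

print("Rides before March 15th:", before.sum())
print("Share of all rides:", round(share_before, 2))
```

On the real data, the same two lines applied to `citi_bike_march['starttime']` should give the share discussed above.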

## Task: 

In the first two weeks of March 2020 there was a 48-year-old commuter who traveled from the Port Authority to their work at the UN headquarters. This commuter picked up their bike at the W 41 St & 8 Ave station and dropped it off at the 1 Ave & E 44 St station.

Use pandas queries and find out these details about the commuter: 

1. Their gender 
2. The time of day at which they got to work in March
3. On which days in March this person commuted to work
4. Bonus: what was the average Citi Bike trip duration (in minutes) for this individual?

In [None]:
#your code

### Hierarchical data
More broadly, companies use data about individuals to look for patterns. Working with a hierarchical data structure can help reveal these patterns. `groupby` can transform a dataset: applying it to multiple columns reorganizes the data hierarchically. In this example I create a new DataFrame grouped by birth year (a proxy for age), gender, and start and end stations. The “count” operation then counts the trips that riders of each birth year and gender took on the same route (start and end stations):

In [None]:
grouped = citi_bike_march.groupby(["birth year", "gender", "start station name", "end station name"]).count()

In [None]:
grouped.head()
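
Because `count()` tallies non-missing values column by column, every column of `grouped` holds essentially the same number: the number of trips in that group. The sketch below uses `size()` on made-up rows to get a single trip count per group instead. The toy data, the placeholder station names `A` and `B`, and the name `trips` are assumptions for illustration.

```python
import pandas as pd

# Made-up trips; "A" and "B" stand in for real station names.
toy = pd.DataFrame({
    "birth year": [1972, 1972, 1972, 1992],
    "gender": [1, 1, 1, 2],
    "start station name": ["A", "A", "B", "A"],
    "end station name": ["B", "B", "A", "B"],
    "tripduration": [600, 660, 630, 900],
})

keys = ["birth year", "gender", "start station name", "end station name"]

# size() returns one number per group: how many rows (trips) fall into it.
trips = toy.groupby(keys).size().rename("trips")

print(trips)
```

The result is a Series with the same four-level index as `grouped`, so the selections shown below would work on it in the same way.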

Note that this data now has multiple index levels (a MultiIndex, or hierarchical index), so we can filter it by locating specific values in each level.


In [None]:
grouped.loc[ [1972], :,'W 41 St & 8 Ave','1 Ave & E 44 St']


This query selects the trips taken by riders born in 1972, of any gender, with W 41 St & 8 Ave as the start station and 1 Ave & E 44 St as the end station.
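
When a row selection spans several index levels, the pandas documentation recommends passing the row indexer and the column indexer separately so that the two cannot be confused. Here is a sketch of the same kind of selection written with `pd.IndexSlice`, on made-up data. The station names `A` and `B` are placeholders, and `grouped_toy` and `subset` are names chosen for illustration.

```python
import pandas as pd

# Made-up trips; "A" and "B" stand in for real station names.
toy = pd.DataFrame({
    "birth year": [1972, 1972, 1972, 1992],
    "gender": [1, 2, 1, 2],
    "start station name": ["A", "A", "B", "A"],
    "end station name": ["B", "B", "A", "B"],
    "tripduration": [600, 660, 630, 900],
})

keys = ["birth year", "gender", "start station name", "end station name"]
grouped_toy = toy.groupby(keys).count()  # groupby sorts the index, which slicing relies on

idx = pd.IndexSlice
# Rows: born in 1972, any gender, from A to B. Columns: all of them.
subset = grouped_toy.loc[idx[1972, :, "A", "B"], :]

print(subset)
```

The `:` inside `idx[...]` plays the same role as the `:` in the query above: it leaves the gender level unrestricted.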

In [None]:
# all trips done by females born in 1992
grouped.loc[ [1992], 2,:]

We can create a new DataFrame from this subset:


In [None]:
female92 = grouped.loc[ [1992], 2,:]

In [None]:
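# and check out the common pairs of stations for 28-year-old females (born in 1992):
# after count(), 'tripduration' holds the number of trips, so this keeps pairs with more than 10 trips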
female92[female92['tripduration']>10]