# Today's goals

- **Sprint 1: Scrubbing**
    - Calculate deltas
    - Handle inconsistent data
- **Sprint 2: Exploratory analysis**
    - Most traffic on average, by day: choose top 20 areas
    - Traffic flows each time period
    - Do the days change over time
    - Get it on a map: try GeoPandas
    - Tech hubs
- **Sprint 3: External data**
    - NYC Census, median income
    - Safety? 
    - Number of nearby technology companies
- **Sprint 4: Presentation** 
    - Make initial recommendations
    - Throw some graphs up on the slides
    - Start putting presentation together
    

# To do

- Use Pandas-Profiling

In [2]:
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data from txt files

In [89]:
# def get_data(folder):
#     """
#     Reads in turnstile data from a specified folder in Data
    
#     Input: turnstile data file i.e. 2016-2017_turnstile_data
#     Output: a DataFrame with all rows from all files in folder
#     """
    
#     col_names = ['C/A',
#                  'UNIT',
#                  'SCP',
#                  'STATION',
#                  'LINENAME',
#                  'DIVISION',
#                  'DATE',
#                  'TIME',
#                  'DESC',
#                  'ENTRIES',
#                  'EXITS                                                               ']

#     ## absolute path to Data folder
#     data_dir = os.getcwd()+"/Data/" 
    
#     return_df = pd.DataFrame(columns=col_names)
#     for file in os.listdir(data_dir+folder):
#         if not file.startswith('.'):
#             file_path = "Data/"+folder+'/'+file
#             return_df = pd.concat([return_df, pd.read_csv(file_path)],axis=0)
            
#     return_df.rename(columns={return_df.columns[10]:'EXITS'},inplace=True)
    
#     return(return_df)

# df = get_data("2016-2019_turnstile_data")
# df['ENTRIES'] = df['ENTRIES'].astype(np.int64)  # np.int was removed in NumPy 1.24
# df['EXITS'] = df['EXITS'].astype(np.int64)

# df.shape

(10303675, 11)

In [90]:
# df.to_csv("2016-2019_turnstile_data.csv", index=False)

# Load data from csv

In [91]:
df = pd.read_csv("2016-2019_turnstile_data.csv")
df.head()

| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 00:00:00 | REGULAR | 6989774 | 2370411 |
| 1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 04:00:00 | REGULAR | 6989795 | 2370413 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 08:00:00 | REGULAR | 6989813 | 2370436 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 12:00:00 | REGULAR | 6989924 | 2370512 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 16:00:00 | REGULAR | 6990200 | 2370573 |


In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10303675 entries, 0 to 10303674
Data columns (total 11 columns):
C/A         object
UNIT        object
SCP         object
STATION     object
LINENAME    object
DIVISION    object
DATE        object
TIME        object
DESC        object
ENTRIES     int64
EXITS       int64
dtypes: int64(2), object(9)
memory usage: 864.7+ MB


# Extract date and time features

In [98]:
date = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
time = pd.to_datetime(df['TIME'], format='%H:%M:%S')

df['year'] = date.dt.year
df['month'] = date.dt.month
df['day_of_week'] = date.dt.weekday
df['hour'] = time.dt.hour
df['minute'] = time.dt.minute

df.head()
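
A quick check of the weekday convention used above may help when reading `day_of_week` later. The sketch below uses a toy Series rather than the real `df`; the variable names are illustrative only. In pandas, `dt.weekday` counts Monday as 0 and Sunday as 6, and `dt.day_name()` gives the label.

```python
import pandas as pd

# Toy stand-in for the DATE column (the real frame is `df`).
dates = pd.to_datetime(pd.Series(["03/23/2019", "03/25/2019"]), format="%m/%d/%Y")

weekday_num = dates.dt.weekday      # Monday=0 ... Sunday=6
weekday_name = dates.dt.day_name()  # human-readable label

print(pd.DataFrame({"date": dates, "day_of_week": weekday_num, "day_name": weekday_name}))
```

So a `day_of_week` of 5 on 03/23/2019 corresponds to a Saturday.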

# Calculate number of entries in each time period (`entries_delta`)

- Group DataFrame by turnstile and sort by date
- Define an appropriate time period
- Check for inconsistent data

In order to identify unique turnstiles, we use two key definitions from the [MTA transit toolkit](http://transitdatatoolkit.com/lessons/subway-turnstile-data/):

- `UNIT`: The remote unit is a collection of turnstiles... there can be multiple remote units at one station
- `SCP`: Subunit channel position represents a turnstile... the same number can be used at different stations

Together, `UNIT` and `SCP` make a unique identifier for NYC turnstiles. 
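
As a rough illustration of why both columns are needed, here is a minimal sketch on made-up rows (the `UNIT`/`SCP` values and the `turnstile_id` column name are assumptions for the example, not taken from the data). Counting `SCP` alone undercounts, because the same value recurs under different units.

```python
import pandas as pd

# Toy rows: the same SCP value appears under two different units.
toy = pd.DataFrame({
    "UNIT": ["R051", "R051", "R051", "R022", "R022"],
    "SCP": ["02-00-00", "02-00-01", "02-00-00", "02-00-00", "02-00-00"],
    "ENTRIES": [10, 20, 15, 5, 9],
})

# SCP on its own collapses distinct turnstiles together.
n_scp_only = toy["SCP"].nunique()

# The (UNIT, SCP) pair is the identifier described above.
n_turnstiles = toy.groupby(["UNIT", "SCP"]).ngroups

# Optional single-column key (hypothetical name) for convenience.
toy["turnstile_id"] = toy["UNIT"] + "_" + toy["SCP"]

print(n_scp_only, n_turnstiles)
```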

In [100]:
df.head()

| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS | year | month | day_of_week | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 00:00:00 | REGULAR | 6989774 | 2370411 | 2019 | 3 | 5 | 0 | 0 |
| 1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 04:00:00 | REGULAR | 6989795 | 2370413 | 2019 | 3 | 5 | 4 | 0 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 08:00:00 | REGULAR | 6989813 | 2370436 | 2019 | 3 | 5 | 8 | 0 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 12:00:00 | REGULAR | 6989924 | 2370512 | 2019 | 3 | 5 | 12 | 0 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 03/23/2019 | 16:00:00 | REGULAR | 6990200 | 2370573 | 2019 | 3 | 5 | 16 | 0 |


We want to sort by `UNIT` and `SCP`, then put each turnstile's rows in chronological order.

In [109]:
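timestamp = pd.to_datetime(df['DATE'] + ' ' + df['TIME'], format='%m/%d/%Y %H:%M:%S')  # raw DATE strings do not sort chronologically
df_clean = df.assign(datetime=timestamp).sort_values(['UNIT', 'SCP', 'datetime']).reset_index(drop=True)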
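
One caveat when ordering these rows: `DATE` is stored as an `MM/DD/YYYY` string, so sorting on it directly is lexicographic rather than chronological once the data spans more than one year. The toy example below (made-up rows, with a hypothetical `datetime` column) sketches the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "UNIT": ["R051"] * 3,
    "SCP": ["02-00-00"] * 3,
    "DATE": ["01/05/2019", "12/29/2018", "03/23/2019"],
    "TIME": ["00:00:00", "04:00:00", "08:00:00"],
})

# String sort: January 2019 lands before December 2018.
by_string = toy.sort_values(["UNIT", "SCP", "DATE"])["DATE"].tolist()

# Parse DATE and TIME into a single timestamp, then sort on that instead.
toy["datetime"] = pd.to_datetime(toy["DATE"] + " " + toy["TIME"],
                                 format="%m/%d/%Y %H:%M:%S")
by_time = toy.sort_values(["UNIT", "SCP", "datetime"])["DATE"].tolist()

print(by_string)
print(by_time)
```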

In [63]:
turnstiles = df.groupby(['UNIT', 'SCP'], sort=False)
len(turnstiles.groups)

5073

In [None]:
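# Difference between consecutive ENTRIES readings; jumps of 5000 or more become NaN.
# Note: the window runs over the whole frame, not per turnstile, so the first
# row of each turnstile is differenced against the previous turnstile's last row.
df['entries_delta'] = df['ENTRIES'] \
    .rolling(2) \
    .apply(lambda x: x[1] - x[0] if abs(x[1] - x[0]) < 5000 else np.nan, raw=True)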
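
An alternative worth trying is to take the difference inside each `(UNIT, SCP)` group, so that a delta never spans two turnstiles. The sketch below runs on made-up numbers and assumes the rows are already in chronological order within each turnstile; it reuses the 5000 cutoff from the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "UNIT": ["R051"] * 4 + ["R022"] * 3,
    "SCP": ["02-00-00"] * 7,
    "ENTRIES": [100, 130, 9_000_000, 9_000_040, 7, 12, 20],
})

# diff() within each turnstile: the first row of every group is NaN.
delta = toy.groupby(["UNIT", "SCP"], sort=False)["ENTRIES"].diff()

# Keep only deltas under the cutoff; anything else becomes NaN.
toy["entries_delta"] = delta.where(delta.abs() < 5000)

print(toy)
```

`GroupBy.diff` is vectorised, so this should also be considerably faster on ten million rows than a Python lambda passed to `rolling().apply`.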

In [69]:
turnstiles.sort_values()

AttributeError: Cannot access callable attribute 'sort_values' of 'DataFrameGroupBy' objects, try using the 'apply' method
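
The error arises because a `DataFrameGroupBy` object has no `sort_values`. One way around it, sketched here on toy rows, is to sort the DataFrame first and group afterwards; pandas keeps the row order inside each group, so the groups come out already sorted.

```python
import pandas as pd

toy = pd.DataFrame({
    "UNIT": ["R051", "R022", "R051", "R022"],
    "SCP": ["02-00-00"] * 4,
    "TIME": ["04:00:00", "04:00:00", "00:00:00", "00:00:00"],
    "ENTRIES": [130, 12, 100, 7],
})

# Sort the frame, then group; sort=False keeps groups in order of appearance.
ordered = toy.sort_values(["UNIT", "SCP", "TIME"]).reset_index(drop=True)
groups = ordered.groupby(["UNIT", "SCP"], sort=False)

# The first row of each group is now that turnstile's earliest reading.
first_rows = groups.head(1)["ENTRIES"].tolist()

print(first_rows)
```

The `apply` route suggested by the error message also works, but sorting once up front avoids calling a Python function per group.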

# Develop a method to count turnstile entries

In [57]:
sample_df = df[df['STATION'] == '59 ST']

In [63]:
sample_df.columns

Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS'],
      dtype='object')

In [65]:
sample_df[['DATE','TIME','ENTRIES']]

| | DATE | TIME | ENTRIES |
|---|---|---|---|
| 0 | 03/23/2019 | 00:00:00 | 6989774 |
| 1 | 03/23/2019 | 04:00:00 | 6989795 |
| 2 | 03/23/2019 | 08:00:00 | 6989813 |
| 3 | 03/23/2019 | 12:00:00 | 6989924 |
| 4 | 03/23/2019 | 16:00:00 | 6990200 |
| ... | ... | ... | ... |
| 157647 | 03/09/2018 | 11:00:00 | 232801 |
| 157648 | 03/09/2018 | 12:12:19 | 232807 |
| 157649 | 03/09/2018 | 15:00:00 | 232818 |
| 157650 | 03/09/2018 | 19:00:00 | 232889 |


In [75]:
sample_df['ENTRIES']

0         6989774
1         6989795
2         6989813
3         6989924
4         6990200
           ...   
157647     232801
157648     232807
157649     232818
157650     232889
157651     232905
Name: ENTRIES, Length: 115306, dtype: int64

In [42]:
df.groupby(['STATION'])['ENTRIES'].mean()

STATION
59 ST    3.803133e+07
Name: ENTRIES, dtype: float64

In [None]:
df.shape

In [8]:
# No missing values
df.isnull().sum()

C/A                                                                     0
UNIT                                                                    0
SCP                                                                     0
STATION                                                                 0
LINENAME                                                                0
DIVISION                                                                0
DATE                                                                    0
TIME                                                                    0
DESC                                                                    0
ENTRIES                                                                 0
EXITS                                                                   0
dtype: int64

In [76]:
df['C/A'].nunique()

751

## Reduce scope by plotting