## Data Wrangling

### Citi Bike Data

In [1]:
%load_ext sql

In [2]:
from pyspark.sql import SparkSession
import os
import configparser
import pandas as pd

In [3]:
#config = configparser.ConfigParser()

#config.read_file(open('dwh.cfg'))

#os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['AWS_ACCESS_KEY_ID']
#os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['AWS_SECRET_ACCESS_KEY']

In [4]:
spark = SparkSession.builder\
                     .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.0")\
                     .getOrCreate()

In [5]:
import requests, zipfile, io

r_citi = requests.get('https://s3.amazonaws.com/tripdata/201701-citibike-tripdata.csv.zip')
z_citi = zipfile.ZipFile(io.BytesIO(r_citi.content))
z_citi.extractall('C:/Users/David/Desktop/DE/citi_data')
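
The download cell above assumes the request succeeds and that the response really is a zip archive. A minimal sketch of a more defensive extraction step follows; the helper name `extract_zip_bytes`, the sample file name, and the temporary directory are assumptions made for illustration and are not part of the original notebook.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_zip_bytes(payload: bytes, target_dir: str) -> list:
    """Extract an in-memory zip archive and return the extracted file paths."""
    if not zipfile.is_zipfile(io.BytesIO(payload)):
        raise ValueError("payload is not a zip archive")
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(target)
        return [target / name for name in archive.namelist()]


# Build a tiny archive in memory so the sketch runs without network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample-tripdata.csv", "Trip Duration,Bike ID\n680,25542\n")

with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_zip_bytes(buffer.getvalue(), tmp)
    first_line = extracted[0].read_text().splitlines()[0]
```

In the notebook, the payload would be `r_citi.content`, ideally after calling `r_citi.raise_for_status()` so that an HTTP error is not mistaken for a corrupt archive.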

In [6]:
df_citi = spark.read.csv("C:/Users/David/Desktop/DE/citi_data/201701-citibike-tripdata.csv")
df_citi.printSchema()
df_citi.show(5)

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)

+-------------+-------------------+-------------------+----------------+--------------------+--------------------+--------------------+--------------+--------------------+--------------------+--------------------+-------+----------+----------+------+
|          _c0|                _c1|                _c2|             _c3|                 _c4|                 _c5|                 _c6|           _c7|                 _c8|                 _c9|                _c10|

In [7]:
df_citi = spark.read.csv("C:/Users/David/Desktop/DE/citi_data/201701-citibike-tripdata.csv", inferSchema=True, header=True)
df_citi.printSchema()
df_citi.show(5)

root
 |-- Trip Duration: integer (nullable = true)
 |-- Start Time: timestamp (nullable = true)
 |-- Stop Time: timestamp (nullable = true)
 |-- Start Station ID: integer (nullable = true)
 |-- Start Station Name: string (nullable = true)
 |-- Start Station Latitude: double (nullable = true)
 |-- Start Station Longitude: double (nullable = true)
 |-- End Station ID: integer (nullable = true)
 |-- End Station Name: string (nullable = true)
 |-- End Station Latitude: double (nullable = true)
 |-- End Station Longitude: double (nullable = true)
 |-- Bike ID: integer (nullable = true)
 |-- User Type: string (nullable = true)
 |-- Birth Year: integer (nullable = true)
 |-- Gender: integer (nullable = true)

+-------------+-------------------+-------------------+----------------+--------------------+----------------------+-----------------------+--------------+--------------------+--------------------+---------------------+-------+----------+----------+------+
|Trip Duration|         Start T

In [8]:
from pyspark.sql.types import StructType as R, StructField as Fld, DoubleType as Dbl, StringType as Str, IntegerType as Int, TimestampType as Time
df_citiSchema = R([
    Fld("duration_sec",Int()),
    Fld("start_time",Time()),
    Fld("end_time",Time()),
    Fld("start_station_id",Int()),
    Fld("start_station_name",Str()),
    Fld("start_station_latitude",Dbl()),
    Fld("start_station_longitude",Dbl()),
    Fld("end_station_id",Int()),
    Fld("end_station_name",Str()),
    Fld("end_station_latitude",Dbl()),
    Fld("end_station_longitude",Dbl()),
    Fld("bike_id",Int()),
    Fld("user_type",Str()),
    Fld("birth_year",Int()),
    Fld("gender",Str())
])

In [9]:
df_citiwithSchema = spark.read.csv("C:/Users/David/Desktop/DE/citi_data/201701-citibike-tripdata.csv", schema=df_citiSchema, header=True)
df_citiwithSchema.printSchema()
df_citiwithSchema.show(5)

root
 |-- duration_sec: integer (nullable = true)
 |-- start_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)
 |-- start_station_id: integer (nullable = true)
 |-- start_station_name: string (nullable = true)
 |-- start_station_latitude: double (nullable = true)
 |-- start_station_longitude: double (nullable = true)
 |-- end_station_id: integer (nullable = true)
 |-- end_station_name: string (nullable = true)
 |-- end_station_latitude: double (nullable = true)
 |-- end_station_longitude: double (nullable = true)
 |-- bike_id: integer (nullable = true)
 |-- user_type: string (nullable = true)
 |-- birth_year: integer (nullable = true)
 |-- gender: string (nullable = true)

+------------+-------------------+-------------------+----------------+--------------------+----------------------+-----------------------+--------------+--------------------+--------------------+---------------------+-------+----------+----------+------+
|duration_sec|         start_time| 

In [10]:
from pyspark.sql.functions import isnan, when, count, col

df_citiwithSchema.select([count(when(isnan(c), c)).alias(c) for c in df_citiwithSchema.columns[3:-1]]).show()

+----------------+------------------+----------------------+-----------------------+--------------+----------------+--------------------+---------------------+-------+---------+----------+
|start_station_id|start_station_name|start_station_latitude|start_station_longitude|end_station_id|end_station_name|end_station_latitude|end_station_longitude|bike_id|user_type|birth_year|
+----------------+------------------+----------------------+-----------------------+--------------+----------------+--------------------+---------------------+-------+---------+----------+
|               0|                 0|                     0|                      0|             0|               0|                   0|                    0|      0|        0|         0|
+----------------+------------------+----------------------+-----------------------+--------------+----------------+--------------------+---------------------+-------+---------+----------+
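
Note that `isnan` flags only floating-point NaN values. A null (missing) entry is not NaN, so a row of zeros here does not by itself rule out nulls. The plain-Python sketch below illustrates the distinction on a few made-up rows; `count_missing` and the sample records are hypothetical and only mirror the column names used above.

```python
import math


def count_missing(rows, columns):
    """Count None (null) and NaN values separately for each column."""
    counts = {c: {"null": 0, "nan": 0} for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is None:
                counts[c]["null"] += 1
            elif isinstance(value, float) and math.isnan(value):
                counts[c]["nan"] += 1
    return counts


# Made-up rows for illustration only
sample_rows = [
    {"birth_year": 1965, "start_station_latitude": 40.75},
    {"birth_year": None, "start_station_latitude": 40.71},
    {"birth_year": 1987, "start_station_latitude": float("nan")},
]
missing = count_missing(sample_rows, ["birth_year", "start_station_latitude"])
```

In PySpark, the null side of this check is usually written with `col(c).isNull()` alongside `isnan(c)`.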



In [11]:
df_citiwithSchema = df_citiwithSchema.na.drop()  # na.drop() returns a new DataFrame, so keep the result
df_citiwithSchema

DataFrame[duration_sec: int, start_time: timestamp, end_time: timestamp, start_station_id: int, start_station_name: string, start_station_latitude: double, start_station_longitude: double, end_station_id: int, end_station_name: string, end_station_latitude: double, end_station_longitude: double, bike_id: int, user_type: string, birth_year: int, gender: string]

### NYC Bicycle Routes Data

In [12]:
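r_nycbike = requests.get('https://data.cityofnewyork.us/resource/cc5c-sm6z.json')
# read_json already returns a DataFrame; wrapping the text in StringIO avoids
# the deprecation of passing a literal JSON string in recent pandas versions
df_nycbike = pd.read_json(io.StringIO(r_nycbike.text))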
df_nycbike

Unnamed: 0,the_geom,street,boro,segmentid,facilitycl,fromstreet,tostreet,onoffst,allclasses,instdate,moddate,bikedir,lanecount,tf_facilit,ft_facilit,comments
0,"{'type': 'MultiLineString', 'coordinates': [[[...",63 AVE,4,150483,III,WOODHAVEN BLVD,82 PLACE,ON,III,2016-11-25T00:00:00,2016-11-25T00:00:00,L,1,Sharrows,,
1,"{'type': 'MultiLineString', 'coordinates': [[[...",NEPTUNE AV,3,9009151,II,W 37 ST,BRIGHTON 8 ST,ON,II,2005-08-01T00:00:00,2005-08-01T00:00:00,L,1,Standard,,
2,"{'type': 'MultiLineString', 'coordinates': [[[...",84 ST,4,252570,II,SHORE PKWY SR,157 AV,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,,Standard,
3,"{'type': 'MultiLineString', 'coordinates': [[[...",P P BARTEL PRITCHARD SQ APPR,3,253073,I,PROSPECT PARK W,WEST DR,OFF,I,1980-07-01T00:00:00,1980-07-01T00:00:00,2,2,Greenway,Greenway,Prospect Park Auto-Free Hours: Closed to Cars
4,"{'type': 'MultiLineString', 'coordinates': [[[...",GREENWICH ST,1,313087,II,CANAL ST,GANSEVOORT ST,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,,Standard,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,"{'type': 'MultiLineString', 'coordinates': [[[...",CARLTON AV,3,163863,II,ATLANTIC AV,FLUSHING AV,ON,II,2007-05-01T00:00:00,2007-05-01T00:00:00,L,1,Standard,,
996,"{'type': 'MultiLineString', 'coordinates': [[[...",PACIFIC ST,3,43013,II,BROOKLYN AV,HOPKINSON AV,ON,II,2003-06-01T00:00:00,2003-06-01T00:00:00,L,1,Standard,,
997,"{'type': 'MultiLineString', 'coordinates': [[[...",FT WASHINGTON PARK BICYCLE TRAIL,1,238472,I,W 145 ST,W 181 ST,OFF,I,1999-07-01T00:00:00,1999-07-01T00:00:00,2,2,Greenway,Greenway,
998,"{'type': 'MultiLineString', 'coordinates': [[[...",PARSONS BLVD,4,9005876,II,65 AVE,71 AVE,ON,II,2017-12-15T00:00:00,2017-12-15T00:00:00,R,1,,Standard,


In [13]:
df_nycbike = df_nycbike.join(pd.json_normalize(df_nycbike.pop('the_geom')))
df_nycbike

Unnamed: 0,street,boro,segmentid,facilitycl,fromstreet,tostreet,onoffst,allclasses,instdate,moddate,bikedir,lanecount,tf_facilit,ft_facilit,comments,type,coordinates
0,63 AVE,4,150483,III,WOODHAVEN BLVD,82 PLACE,ON,III,2016-11-25T00:00:00,2016-11-25T00:00:00,L,1,Sharrows,,,MultiLineString,"[[[-73.87218201068114, 40.72315861141582], [-7..."
1,NEPTUNE AV,3,9009151,II,W 37 ST,BRIGHTON 8 ST,ON,II,2005-08-01T00:00:00,2005-08-01T00:00:00,L,1,Standard,,,MultiLineString,"[[[-74.00066694563638, 40.57717211991796], [-7..."
2,84 ST,4,252570,II,SHORE PKWY SR,157 AV,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,,Standard,,MultiLineString,"[[[-73.84937839467118, 40.662347802483495], [-..."
3,P P BARTEL PRITCHARD SQ APPR,3,253073,I,PROSPECT PARK W,WEST DR,OFF,I,1980-07-01T00:00:00,1980-07-01T00:00:00,2,2,Greenway,Greenway,Prospect Park Auto-Free Hours: Closed to Cars,MultiLineString,"[[[-73.97950974891293, 40.661046003203786], [-..."
4,GREENWICH ST,1,313087,II,CANAL ST,GANSEVOORT ST,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,,Standard,,MultiLineString,"[[[-74.00921397183593, 40.72529744997435], [-7..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,CARLTON AV,3,163863,II,ATLANTIC AV,FLUSHING AV,ON,II,2007-05-01T00:00:00,2007-05-01T00:00:00,L,1,Standard,,,MultiLineString,"[[[-73.97346341544264, 40.69623993366978], [-7..."
996,PACIFIC ST,3,43013,II,BROOKLYN AV,HOPKINSON AV,ON,II,2003-06-01T00:00:00,2003-06-01T00:00:00,L,1,Standard,,,MultiLineString,"[[[-73.94146280685176, 40.6771992792158], [-73..."
997,FT WASHINGTON PARK BICYCLE TRAIL,1,238472,I,W 145 ST,W 181 ST,OFF,I,1999-07-01T00:00:00,1999-07-01T00:00:00,2,2,Greenway,Greenway,,MultiLineString,"[[[-73.94593853784582, 40.850268289612984], [-..."
998,PARSONS BLVD,4,9005876,II,65 AVE,71 AVE,ON,II,2017-12-15T00:00:00,2017-12-15T00:00:00,R,1,,Standard,,MultiLineString,"[[[-73.81092794798596, 40.73088608520875], [-7..."
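
The `pop` / `json_normalize` / `join` step above replaces the nested `the_geom` dictionary with two flat columns, `type` and `coordinates`. A dependency-free sketch of the same idea for a single record is shown below; `flatten_geometry` is a hypothetical helper, and the record is abbreviated to one coordinate pair from the first row of the output.

```python
def flatten_geometry(record, key="the_geom"):
    """Return a copy of the record with the nested geometry dict lifted to top level."""
    flat = {k: v for k, v in record.items() if k != key}
    flat.update(record.get(key) or {})
    return flat


# Abbreviated record: only the first coordinate pair of the first row is kept
record = {
    "street": "63 AVE",
    "boro": 4,
    "the_geom": {
        "type": "MultiLineString",
        "coordinates": [[[-73.87218201068114, 40.72315861141582]]],
    },
}
flat_record = flatten_geometry(record)
```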


In [14]:
df_cleaning = df_nycbike.coordinates.apply(pd.Series)
df_cleaning.columns = ['iter1']
df_cleaning

Unnamed: 0,iter1
0,"[[-73.87218201068114, 40.72315861141582], [-73..."
1,"[[-74.00066694563638, 40.57717211991796], [-74..."
2,"[[-73.84937839467118, 40.662347802483495], [-7..."
3,"[[-73.97950974891293, 40.661046003203786], [-7..."
4,"[[-74.00921397183593, 40.72529744997435], [-74..."
...,...
995,"[[-73.97346341544264, 40.69623993366978], [-73..."
996,"[[-73.94146280685176, 40.6771992792158], [-73...."
997,"[[-73.94593853784582, 40.850268289612984], [-7..."
998,"[[-73.81092794798596, 40.73088608520875], [-73..."


In [15]:
print(df_cleaning.iter1[120])

[[-74.19139610614188, 40.51869542717229], [-74.19130020638164, 40.51881119446912], [-74.19121192733395, 40.51893044034696], [-74.19113148413363, 40.51905287269325], [-74.1910590753775, 40.51917819131708], [-74.1909948772289, 40.51930608976009], [-74.19093904695487, 40.51943625438982], [-74.19089172057579, 40.51956836800548], [-74.19085301639471, 40.51970210622992], [-74.19082302674674, 40.519837141125194], [-74.19080182626415, 40.51997314297997], [-74.19078946597244, 40.52010977941856], [-74.19078597800784, 40.520246714492565], [-74.19079136854883, 40.52038361319528], [-74.19080562725614, 40.520520141445814], [-74.19082871664624, 40.52065596430598], [-74.19085920981601, 40.52072526796795], [-74.1908826166739, 40.52079615843378], [-74.19089880029054, 40.52086821448722], [-74.19090766265305, 40.520941006741126], [-74.1909091517639, 40.521014103929616], [-74.1909171871006, 40.521070425053225], [-74.19093176364804, 40.52112597614869], [-74.19095276451425, 40.5211803152475], [-74.1909800244

In [16]:
num_nodes = []
start_street_latitude = []
start_street_longitude = []
end_street_latitude = []
end_street_longitude = []
# Each point is a GeoJSON pair ordered [longitude, latitude]
for i in df_cleaning.iter1:
    num_nodes.append(len(i))
    start_street_latitude.append(float(i[0][1]))
    start_street_longitude.append(float(i[0][0]))
    end_street_latitude.append(float(i[-1][1]))
    end_street_longitude.append(float(i[-1][0]))

d = {'num_nodes' : num_nodes, 'start_street_latitude' : start_street_latitude,
    'start_street_longitude' : start_street_longitude, 'end_street_latitude' : end_street_latitude,
    'end_street_longitude' : end_street_longitude}
df_clean = pd.DataFrame(d)
df_clean

Unnamed: 0,num_nodes,start_street_latitude,start_street_longitude,end_street_latitude,end_street_longitude
0,2,40.723159,-73.872182,40.723523,-73.871378
1,2,40.577172,-74.000667,40.577121,-74.001105
2,2,40.662348,-73.849378,40.662175,-73.849319
3,4,40.661046,-73.979510,40.661156,-73.978835
4,2,40.725297,-74.009214,40.725410,-74.009194
...,...,...,...,...,...
995,2,40.696240,-73.973463,40.696142,-73.973453
996,2,40.677199,-73.941463,40.677051,-73.938695
997,2,40.850268,-73.945939,40.850328,-73.945897
998,2,40.730886,-73.810928,40.730766,-73.810935
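
The loop above reduces each route segment to its node count and its first and last points. The same extraction can be sketched as a small function that is easy to test on one segment; `segment_endpoints` is a hypothetical name, and the sample uses only the first three nodes of the segment printed earlier.

```python
def segment_endpoints(line):
    """Summarise one line of [longitude, latitude] pairs by its endpoints."""
    start_lon, start_lat = line[0]
    end_lon, end_lat = line[-1]
    return {
        "num_nodes": len(line),
        "start_street_latitude": float(start_lat),
        "start_street_longitude": float(start_lon),
        "end_street_latitude": float(end_lat),
        "end_street_longitude": float(end_lon),
    }


# First three nodes of the printed segment (the full segment is longer)
sample_line = [
    [-74.19139610614188, 40.51869542717229],
    [-74.19130020638164, 40.51881119446912],
    [-74.19121192733395, 40.51893044034696],
]
summary = segment_endpoints(sample_line)
```

Unpacking each pair by name makes the GeoJSON longitude-first ordering explicit, which is easy to get backwards with positional indexing.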


In [17]:
df_nycbike = df_nycbike.join(df_clean)
df_nycbike

Unnamed: 0,street,boro,segmentid,facilitycl,fromstreet,tostreet,onoffst,allclasses,instdate,moddate,...,tf_facilit,ft_facilit,comments,type,coordinates,num_nodes,start_street_latitude,start_street_longitude,end_street_latitude,end_street_longitude
0,63 AVE,4,150483,III,WOODHAVEN BLVD,82 PLACE,ON,III,2016-11-25T00:00:00,2016-11-25T00:00:00,...,Sharrows,,,MultiLineString,"[[[-73.87218201068114, 40.72315861141582], [-7...",2,40.723159,-73.872182,40.723523,-73.871378
1,NEPTUNE AV,3,9009151,II,W 37 ST,BRIGHTON 8 ST,ON,II,2005-08-01T00:00:00,2005-08-01T00:00:00,...,Standard,,,MultiLineString,"[[[-74.00066694563638, 40.57717211991796], [-7...",2,40.577172,-74.000667,40.577121,-74.001105
2,84 ST,4,252570,II,SHORE PKWY SR,157 AV,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,...,,Standard,,MultiLineString,"[[[-73.84937839467118, 40.662347802483495], [-...",2,40.662348,-73.849378,40.662175,-73.849319
3,P P BARTEL PRITCHARD SQ APPR,3,253073,I,PROSPECT PARK W,WEST DR,OFF,I,1980-07-01T00:00:00,1980-07-01T00:00:00,...,Greenway,Greenway,Prospect Park Auto-Free Hours: Closed to Cars,MultiLineString,"[[[-73.97950974891293, 40.661046003203786], [-...",4,40.661046,-73.979510,40.661156,-73.978835
4,GREENWICH ST,1,313087,II,CANAL ST,GANSEVOORT ST,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,...,,Standard,,MultiLineString,"[[[-74.00921397183593, 40.72529744997435], [-7...",2,40.725297,-74.009214,40.725410,-74.009194
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,CARLTON AV,3,163863,II,ATLANTIC AV,FLUSHING AV,ON,II,2007-05-01T00:00:00,2007-05-01T00:00:00,...,Standard,,,MultiLineString,"[[[-73.97346341544264, 40.69623993366978], [-7...",2,40.696240,-73.973463,40.696142,-73.973453
996,PACIFIC ST,3,43013,II,BROOKLYN AV,HOPKINSON AV,ON,II,2003-06-01T00:00:00,2003-06-01T00:00:00,...,Standard,,,MultiLineString,"[[[-73.94146280685176, 40.6771992792158], [-73...",2,40.677199,-73.941463,40.677051,-73.938695
997,FT WASHINGTON PARK BICYCLE TRAIL,1,238472,I,W 145 ST,W 181 ST,OFF,I,1999-07-01T00:00:00,1999-07-01T00:00:00,...,Greenway,Greenway,,MultiLineString,"[[[-73.94593853784582, 40.850268289612984], [-...",2,40.850268,-73.945939,40.850328,-73.945897
998,PARSONS BLVD,4,9005876,II,65 AVE,71 AVE,ON,II,2017-12-15T00:00:00,2017-12-15T00:00:00,...,,Standard,,MultiLineString,"[[[-73.81092794798596, 40.73088608520875], [-7...",2,40.730886,-73.810928,40.730766,-73.810935


In [18]:
df_nycbike.columns

Index(['street', 'boro', 'segmentid', 'facilitycl', 'fromstreet', 'tostreet',
       'onoffst', 'allclasses', 'instdate', 'moddate', 'bikedir', 'lanecount',
       'tf_facilit', 'ft_facilit', 'comments', 'type', 'coordinates',
       'num_nodes', 'start_street_latitude', 'start_street_longitude',
       'end_street_latitude', 'end_street_longitude'],
      dtype='object')

In [19]:
df_nycbike = df_nycbike.drop(['tf_facilit', 'ft_facilit', 'comments', 'type', 'coordinates'], axis=1)
df_nycbike

Unnamed: 0,street,boro,segmentid,facilitycl,fromstreet,tostreet,onoffst,allclasses,instdate,moddate,bikedir,lanecount,num_nodes,start_street_latitude,start_street_longitude,end_street_latitude,end_street_longitude
0,63 AVE,4,150483,III,WOODHAVEN BLVD,82 PLACE,ON,III,2016-11-25T00:00:00,2016-11-25T00:00:00,L,1,2,40.723159,-73.872182,40.723523,-73.871378
1,NEPTUNE AV,3,9009151,II,W 37 ST,BRIGHTON 8 ST,ON,II,2005-08-01T00:00:00,2005-08-01T00:00:00,L,1,2,40.577172,-74.000667,40.577121,-74.001105
2,84 ST,4,252570,II,SHORE PKWY SR,157 AV,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,2,40.662348,-73.849378,40.662175,-73.849319
3,P P BARTEL PRITCHARD SQ APPR,3,253073,I,PROSPECT PARK W,WEST DR,OFF,I,1980-07-01T00:00:00,1980-07-01T00:00:00,2,2,4,40.661046,-73.979510,40.661156,-73.978835
4,GREENWICH ST,1,313087,II,CANAL ST,GANSEVOORT ST,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,2,40.725297,-74.009214,40.725410,-74.009194
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,CARLTON AV,3,163863,II,ATLANTIC AV,FLUSHING AV,ON,II,2007-05-01T00:00:00,2007-05-01T00:00:00,L,1,2,40.696240,-73.973463,40.696142,-73.973453
996,PACIFIC ST,3,43013,II,BROOKLYN AV,HOPKINSON AV,ON,II,2003-06-01T00:00:00,2003-06-01T00:00:00,L,1,2,40.677199,-73.941463,40.677051,-73.938695
997,FT WASHINGTON PARK BICYCLE TRAIL,1,238472,I,W 145 ST,W 181 ST,OFF,I,1999-07-01T00:00:00,1999-07-01T00:00:00,2,2,2,40.850268,-73.945939,40.850328,-73.945897
998,PARSONS BLVD,4,9005876,II,65 AVE,71 AVE,ON,II,2017-12-15T00:00:00,2017-12-15T00:00:00,R,1,2,40.730886,-73.810928,40.730766,-73.810935


In [20]:
df_nycbike.rename(columns = {'street' : 'route_name', 'boro' : 'borough', 'segmentid' : 'route_id', 
                             'facilitycl' : 'facility_cl', 'fromstreet' : 'start_street_name', 'tostreet' : 'end_street_name', 
                             'onoffst' : 'on_off_set', 'allclasses' : 'all_classes',
                             'instdate' : 'inst_date', 'moddate' : 'mod_date', 'bikedir' : 'bike_direction',
                             'lanecount' : 'lane_count', 'num_nodes' : 'number_nodes'}, inplace = True) 
df_nycbike

Unnamed: 0,route_name,borough,route_id,facility_cl,start_street_name,end_street_name,on_off_set,all_classes,inst_date,mod_date,bike_direction,lane_count,number_nodes,start_street_latitude,start_street_longitude,end_street_latitude,end_street_longitude
0,63 AVE,4,150483,III,WOODHAVEN BLVD,82 PLACE,ON,III,2016-11-25T00:00:00,2016-11-25T00:00:00,L,1,2,40.723159,-73.872182,40.723523,-73.871378
1,NEPTUNE AV,3,9009151,II,W 37 ST,BRIGHTON 8 ST,ON,II,2005-08-01T00:00:00,2005-08-01T00:00:00,L,1,2,40.577172,-74.000667,40.577121,-74.001105
2,84 ST,4,252570,II,SHORE PKWY SR,157 AV,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,2,40.662348,-73.849378,40.662175,-73.849319
3,P P BARTEL PRITCHARD SQ APPR,3,253073,I,PROSPECT PARK W,WEST DR,OFF,I,1980-07-01T00:00:00,1980-07-01T00:00:00,2,2,4,40.661046,-73.979510,40.661156,-73.978835
4,GREENWICH ST,1,313087,II,CANAL ST,GANSEVOORT ST,ON,II,2008-04-01T00:00:00,2008-04-01T00:00:00,R,1,2,40.725297,-74.009214,40.725410,-74.009194
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,CARLTON AV,3,163863,II,ATLANTIC AV,FLUSHING AV,ON,II,2007-05-01T00:00:00,2007-05-01T00:00:00,L,1,2,40.696240,-73.973463,40.696142,-73.973453
996,PACIFIC ST,3,43013,II,BROOKLYN AV,HOPKINSON AV,ON,II,2003-06-01T00:00:00,2003-06-01T00:00:00,L,1,2,40.677199,-73.941463,40.677051,-73.938695
997,FT WASHINGTON PARK BICYCLE TRAIL,1,238472,I,W 145 ST,W 181 ST,OFF,I,1999-07-01T00:00:00,1999-07-01T00:00:00,2,2,2,40.850268,-73.945939,40.850328,-73.945897
998,PARSONS BLVD,4,9005876,II,65 AVE,71 AVE,ON,II,2017-12-15T00:00:00,2017-12-15T00:00:00,R,1,2,40.730886,-73.810928,40.730766,-73.810935


In [21]:
df_nycbike.describe()

Unnamed: 0,borough,route_id,lane_count,number_nodes,start_street_latitude,start_street_longitude,end_street_latitude,end_street_longitude
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,2.585,893469.1,1.521,3.27,40.731159,-73.934968,40.731124,-73.934699
std,1.218302,2475363.0,0.509724,6.882621,0.083749,0.077097,0.083947,0.077152
min,1.0,0.0,0.0,2.0,40.498045,-74.248933,40.498717,-74.249156
25%,1.0,49869.0,1.0,2.0,40.676778,-73.984466,40.676239,-73.984374
50%,3.0,149938.0,2.0,2.0,40.728068,-73.939992,40.728434,-73.939004
75%,4.0,243886.0,2.0,2.0,40.793322,-73.889777,40.793225,-73.889487
max,5.0,9024256.0,3.0,154.0,40.910202,-73.706073,40.911417,-73.705169


In [22]:
df_nycbike.isnull().values.any()

False

In [23]:
df_nycbike.nunique()

route_name                488
borough                     5
route_id                  999
facility_cl                 4
start_street_name         518
end_street_name           527
on_off_set                  2
all_classes                 9
inst_date                 277
mod_date                  307
bike_direction              3
lane_count                  4
number_nodes               30
start_street_latitude     997
start_street_longitude    997
end_street_latitude       994
end_street_longitude      994
dtype: int64

In [24]:
df_nycbike.duplicated().values.any()

False
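
`duplicated()` compares entire rows, so it can return `False` even when a key column repeats. Here `route_id` has 999 distinct values across the 1000 rows reported by `describe()`, so at least one identifier appears more than once. A minimal sketch of a key-level check is below; `repeated_keys` is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import Counter


def repeated_keys(rows, key):
    """Return the key values that occur more than once, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up rows: the last two share a route_id but differ elsewhere,
# so a whole-row duplicate check would not flag them
sample_routes = [
    {"route_id": 1, "route_name": "A ST"},
    {"route_id": 2, "route_name": "B ST"},
    {"route_id": 2, "route_name": "B ST EXTENSION"},
]
repeats = repeated_keys(sample_routes, "route_id")
```

With pandas, the equivalent check would be `df_nycbike.duplicated(subset=['route_id'])`.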

## Data Modeling

![](img/DataModel.png)

Our data comprises two independent sets of facts: the trips made by Citi Bike customers and the bike routes established by the NYC local government. With this in mind, the data model is a galaxy schema with two fact tables, the Trips table and the Bike Routes table. The dimension tables for the Trips table are Trip's time, Stations, and Trips Through Routes, a derived dimension that records whether a trip passes through the bike routes. The dimension tables for the Bike Routes table are Route Details, Streets, and the same Trips Through Routes dimension used by the Trips fact table. These two star schemas, connected through one shared dimension table, make up our galaxy schema.
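
The layout described above can be sketched as a mapping from each fact table to its dimensions. The snake_case table names below are assumptions derived from the prose, not identifiers taken from the diagram.

```python
# Fact tables mapped to their dimension tables (names are illustrative)
galaxy_schema = {
    "trips": {"trip_time", "stations", "trips_through_routes"},
    "bike_routes": {"route_details", "streets", "trips_through_routes"},
}

# A dimension shared by every fact table is what links the star schemas
shared_dimensions = set.intersection(*galaxy_schema.values())
```

Under these assumed names, `trips_through_routes` is the only dimension common to both fact tables, which is what makes the model a galaxy schema rather than two unrelated stars.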

The schema was not set up as a snowflake schema, with the Bike Routes table as a dimension extending from the Trips fact table and joined on station_id and street_id, because the two data sources lack a direct connection: Citi Bike's stations are not located according to the bike routes set up by the city. One use of this data model could be to better coordinate stations and routes for cyclists in the city. In addition, this design keeps the model flexible enough to accommodate a new data source from a new provider. For example, if Uber starts its own bike and scooter program in NYC, or if a provider like Revel would like to add its data to this model, it would be as easy as creating a new fact table of that provider's trips, connected through the Trips Through Routes dimension table. An even better option is to follow the structure of the Citi Bike database, so that the new provider's trips are simply added to the Trips fact table along with a new provider attribute that keeps track of each trip's provider.
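
The second option, a single Trips fact table with a provider attribute, might look like the sketch below. The `Trip` class, the `to_trips` helper, the provider labels, and all sample values are hypothetical; only the field names follow the Citi Bike schema defined earlier.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Trip:
    provider: str
    duration_sec: int
    start_time: datetime
    end_time: datetime
    start_station_id: int
    end_station_id: int


def to_trips(records, provider):
    """Tag each raw trip record with the provider that produced it."""
    return [Trip(provider=provider, **record) for record in records]


def sample_record(duration_sec, start_station_id, end_station_id):
    """Build a made-up trip record for illustration."""
    start = datetime(2017, 1, 1, 8, 0, 0)
    return {
        "duration_sec": duration_sec,
        "start_time": start,
        "end_time": start + timedelta(seconds=duration_sec),
        "start_station_id": start_station_id,
        "end_station_id": end_station_id,
    }


# Trips from two providers end up in one list with the same shape
all_trips = to_trips([sample_record(680, 1, 2), sample_record(1282, 2, 3)], "citi_bike")
all_trips += to_trips([sample_record(300, 10, 11)], "new_provider")
trips_per_provider = Counter(trip.provider for trip in all_trips)
```

Keeping the provider as an attribute of each trip means existing queries on the Trips table keep working, and per-provider analysis becomes a simple group-by.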