<img src="../images/bikes_banner.jpg" width="1000" />

## <span style="color:#37535e">Bicycle Share Usage</span>

##  <span style='color:#3b748a'>Cleaning Philadelphia Indego data</span>

<span style='color:#4095b5'>This notebook loads and cleans 12 months (July 2017 - June 2018) of trip data from the Philadelphia Indego bicycle share. Data is available going back to 2015, but only this 12-month window is used here.</span>

<span style='color:#4095b5'>Each row (observation) describes one bike rental. Each rental includes a starting place and time, an ending place and time, as well as duration, user, and bike information.</span>

## <span style='color:#3b748a'>Table of contents</span>
* <span style='color:#4095b5'>I.  <a href="#checking"><span style='color:#4095b5'>Data checking functions.</span></a></span>
* <span style='color:#4095b5'>II. <a href="#cleaning"><span style='color:#4095b5'>Data cleaning functions.</span></a></span>
* <span style='color:#4095b5'>III. <a href="#import"><span style='color:#4095b5'>Import all data.</span></a></span>
* <span style='color:#4095b5'>IV. <a href="#clean"><span style='color:#4095b5'>Clean all data.</span></a></span>
* <span style='color:#4095b5'>V. <a href="#merge"><span style='color:#4095b5'>Merge the dataframes into 1 big one.</span></a></span>
* <span style='color:#4095b5'>VI. <a href="#explore"><span style='color:#4095b5'>Explore the data.</span></a></span>
* <span style='color:#4095b5'>VII. <a href="#write"><span style='color:#4095b5'>Write the full dataframe to a csv file.</span></a></span>

## <span style='color:#3b748a'>External data required</span>
<ul>
    <li><span style='color:#4095b5'>../data/phl/indego-trips-20xx-qx.csv for each quarter in (2017-q3 to 2018-q2); available in GitHub</span></li>
</ul>

## <span style='color:#3b748a'>Links</span>
<ul>
    <li><a href="https://www.rideindego.com/about/data/"><span style='color:#4095b5'>Philadelphia Indego data</span></a></li>
    <li><a href="plot_phl.ipynb"><span style='color:#4095b5'>Plotting Philadelphia data.</span></a></li>
</ul>
<hr>

In [1]:
# Let's get the administrative stuff done first
# import all the libraries and set up the plotting

import pandas as pd
import numpy as np
from datetime import datetime,timedelta
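# vincenty was removed in geopy 2.0; geodesic is its replacement, aliased here so later cells still work
from geopy.distance import geodesic as vincenty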

# Global variables to track the number of rows dropped
trivial_duration = 0
trivial_distance = 0
outliers_latlon = 0
outliers_duration = 0
outliers_distance = 0

# GnBu_d
colors = ['#37535e', '#3b748a', '#4095b5', '#52aec9', '#72bfc4', '#93d0bf']

<hr>
<a name="checking"> </a>
## <span style='color:#3b748a'>I. Data checking functions</span>

In [2]:
# For each object column, print its unique values; for each numeric column, print its min/max range.
# In both cases, report how many values are missing.

def check_cols(df):
    # np.object was removed in NumPy 1.24; the 'object' string selects the same columns
    cols = df.select_dtypes(include=['object']).columns
    for col in cols:
        print("{} is {} and values are {}.".format(col,df[col].dtype,df[col].unique()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
            
    cols = df.select_dtypes([np.int64,np.float64,np.uint64]).columns
    for col in cols:
        print("{} is {} and values are {} to {}.".format(col,df[col].dtype,df[col].min(),df[col].max()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
    return

In [3]:
# Print the DataFrame shape, the total number of NaNs, and which columns contain them

def check_data(df):
    s = df.shape

    # Check for null values
    null_data = df.isnull().sum()
    null_data_count = sum(df.isnull().sum())
    print("Rows: {}\t Cols: {}\t NaNs: {}".format(s[0],s[1],null_data_count))
    if  null_data_count > 0:
        print("Columns with NaN: {}".format(list(null_data[null_data > 0].index)))

    return
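
<span style='color:#52aec9'>As a minimal sketch of what `check_data` reports, the per-column NaN count can be reproduced on a tiny, made-up DataFrame. The `sample` frame and its values are illustrative only.</span>

```python
import numpy as np
import pandas as pd

# Made-up sample: one row is missing its latitude
sample = pd.DataFrame({
    'Start Hub': [3000, 3160],
    'Start Latitude': [np.nan, 39.956619],
})

# Count NaNs per column, then keep only the columns that have any
null_counts = sample.isnull().sum()
cols_with_nan = list(null_counts[null_counts > 0].index)
print("NaNs: {}\t Columns with NaN: {}".format(null_counts.sum(), cols_with_nan))
```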

<hr>
<a name="cleaning"></a>
## <span style='color:#3b748a'> II. Data cleaning functions</span>

<span style='color:#4095b5'>These functions clean the trip data.</span>

### <span style='color:#4095b5'>Drop columns *NOT* in Atlanta data.</span>
<span style='color:#52aec9'>I might want to add some back at some point.</span>

In [4]:
def drop_columns(df):
    cols_drop = ['trip_id', 'plan_duration', 'trip_route_category', 'passholder_type']

    # Can't drop a column that isn't there
    cols_drop = list(set(df.columns) & set(cols_drop))
    df.drop(cols_drop, axis=1, inplace=True)

    return df

### <span style='color:#4095b5'>Rename columns to match Atlanta data names.</span>

In [5]:
def rename_columns(df):
    df.rename(columns={'start_station' : 'Start Hub', 
                       'start_lat' : 'Start Latitude',
                       'start_lon' : 'Start Longitude',
                       'start_time' : 'Start Time', 
                       'end_station' : 'End Hub', 
                       'end_lat' :'End Latitude', 
                       'end_lon' : 'End Longitude', 
                       'end_time' : 'End Time', 
                       'bike_id' :'Bike Name',
                       'duration' : 'Duration'
                      }, inplace=True)
    return df

### <span style='color:#4095b5'>Merge with hub data.</span>
<span style='color:#52aec9'>We may have to use the start/end hubs to get start/end lat/long.</span>

In [6]:
def calc_latlong(df, df_hubs):
    return df
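
<span style='color:#52aec9'>The function above is a placeholder. If a hub table were needed, one possible approach is sketched below. It assumes a hypothetical `df_hubs` with columns `Hub`, `Latitude`, and `Longitude`; those names and the helper `fill_latlong_from_hubs` are not part of this notebook.</span>

```python
import pandas as pd

def fill_latlong_from_hubs(df, df_hubs):
    """Fill missing start/end coordinates by looking up the hub id.

    Assumes df_hubs has columns 'Hub', 'Latitude', 'Longitude' (hypothetical names).
    """
    hubs = df_hubs.set_index('Hub')
    for prefix in ('Start', 'End'):
        for coord in ('Latitude', 'Longitude'):
            col = '{} {}'.format(prefix, coord)
            looked_up = df['{} Hub'.format(prefix)].map(hubs[coord])
            # Keep existing values; only fill the gaps
            df[col] = df[col].fillna(looked_up)
    return df

trips = pd.DataFrame({
    'Start Hub': [3160, 3046],
    'Start Latitude': [39.956619, None],
    'Start Longitude': [-75.198624, None],
    'End Hub': [3046, 3160],
    'End Latitude': [None, 39.956619],
    'End Longitude': [None, -75.198624],
})
hubs = pd.DataFrame({
    'Hub': [3160, 3046],
    'Latitude': [39.956619, 39.950119],
    'Longitude': [-75.198624, -75.144722],
})
filled = fill_latlong_from_hubs(trips, hubs)
```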

### <span style='color:#4095b5'>Drop rows with nulls.</span>
<span style='color:#52aec9'>Hub 3000 does not have lat/long information as it is a virtual hub.</span>

In [7]:
def drop_nans(df):
    # Hub 3000 is a virtual hub with no coordinates, so assign it a fixed placeholder location
    latitude_3000 = 39.9526
    longitude_3000 = -75.1652

    # Boolean masks do the same work as a row-by-row apply, but much faster
    start_3000 = df['Start Hub'] == 3000
    end_3000 = df['End Hub'] == 3000
    df.loc[start_3000, 'Start Latitude'] = latitude_3000
    df.loc[end_3000, 'End Latitude'] = latitude_3000
    df.loc[start_3000, 'Start Longitude'] = longitude_3000
    df.loc[end_3000, 'End Longitude'] = longitude_3000

    rows = df.shape[0]
    df.dropna(subset=['Start Latitude', 'Start Longitude', 
                      'End Latitude', 'End Longitude'], 
              inplace=True)
    print("Drop lat/lon rows: {}".format(rows-df.shape[0]))
    return df

### <span style='color:#4095b5'>Use appropriate datatypes.</span>
<span style='color:#52aec9'>For example, fix Date/Time objects and cast Latitude and Longitude to floats.</span>

In [8]:
def clean_datatypes(df):
    df['Start Latitude'] = df['Start Latitude'].astype(float)
    df['Start Longitude'] = df['Start Longitude'].astype(float)
    df['End Latitude'] = df['End Latitude'].astype(float)
    df['End Longitude'] = df['End Longitude'].astype(float)

    # Convert the time columns to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['End Time'] = pd.to_datetime(df['End Time'])

    # Create date-only columns from the datetimes
    df['Start Date'] = df['Start Time'].dt.date
    df['End Date'] = df['End Time'].dt.date

    # Fix the durations
    df['Duration'] = pd.to_timedelta(df['Duration'], unit='m')
    
    return df
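
<span style='color:#52aec9'>A small sketch of the conversions above, using the times from one ride in this dataset. It assumes, as `clean_datatypes` does, that the raw duration is in minutes; the `sample` frame is illustrative only.</span>

```python
import pandas as pd

sample = pd.DataFrame({
    'Start Time': ['2017-07-01 00:04:00'],
    'End Time': ['2017-07-01 00:16:00'],
    'Duration': [12],   # raw value, assumed to be minutes
})
sample['Start Time'] = pd.to_datetime(sample['Start Time'])
sample['End Time'] = pd.to_datetime(sample['End Time'])
sample['Duration'] = pd.to_timedelta(sample['Duration'], unit='m')

# Sanity check: the parsed duration should equal end time minus start time
matches = (sample['End Time'] - sample['Start Time']) == sample['Duration']
print(matches.all())
```

<span style='color:#52aec9'>A check like this is a quick way to confirm the duration unit before cleaning a whole quarter.</span>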

### <span style='color:#4095b5'>Calculate distances.</span>
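<span style='color:#52aec9'>This is a poor approximation: it is the straight-line distance between the start and end stations, not the route actually ridden. If a bike was checked out and returned to the same station, the distance will be trivial.</span>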

In [9]:
def distance_calc (row):
    start = (row['Start Latitude'], row['Start Longitude'])
    stop = (row['End Latitude'], row['End Longitude'])

    return vincenty(start, stop).miles

In [10]:
def calc_distances(df):
    df['Distance [Miles]'] = df.apply (lambda row: distance_calc (row),axis=1)
    return df
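
<span style='color:#52aec9'>For intuition, the station-to-station distance can also be approximated without geopy using the haversine formula. This is a swapped-in technique, not what the notebook uses: it treats the Earth as a sphere (mean radius assumed to be 3958.8 miles), so results differ slightly from geopy's ellipsoidal calculation. The helper name `haversine_miles` is hypothetical.</span>

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles on a spherical Earth."""
    earth_radius_miles = 3958.8  # assumed mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# Coordinates of one ride in this dataset (hub 3160 to hub 3163)
d = haversine_miles(39.956619, -75.198624, 39.949741, -75.180969)
print(round(d, 2))
```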

### <span style='color:#4095b5'>Reorder columns.</span>
<span style='color:#52aec9'>Make order same as Atlanta data.</span>

In [11]:
def reorder_cols(df):
    columns = ['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration']

    df = df.reindex(columns=columns)
    return df

### <span style='color:#4095b5'>Drop trivial trips.</span>
<span style='color:#52aec9'>Trivial trips last less than 3 minutes. Trips are not dropped for trivial distance, because the distance is computed from the station coordinates rather than measured.</span>

In [12]:
def drop_trivial_trips_distance(df):
    df = df[df["Distance [Miles]"] > 0.02].copy()
    return df

In [13]:
def drop_trivial_trips_duration(df):
    df = df[df["Duration"] >= pd.to_timedelta('00:03:00')].copy()
    return df

In [14]:
def drop_trivial_trips(df):
    global trivial_duration
    global trivial_distance

    rows = df.shape[0]
    df = drop_trivial_trips_duration(df)
    rows_duration = df.shape[0]
    trivial_duration += rows-rows_duration

    # Calculated distance, don't drop
    # df = drop_trivial_trips_distance(df)
    rows_distance = df.shape[0]
    trivial_distance += rows_duration-rows_distance

    return df
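
<span style='color:#52aec9'>The duration filter can be illustrated on three made-up rides of 1, 3, and 30 minutes. Only the 1-minute ride falls below the 3-minute threshold.</span>

```python
import pandas as pd

# Made-up durations, in minutes
rides = pd.DataFrame({'Duration': pd.to_timedelta([1, 3, 30], unit='m')})

min_duration = pd.to_timedelta('00:03:00')
kept = rides[rides['Duration'] >= min_duration].copy()
dropped = len(rides) - len(kept)
print("Kept: {}\t Dropped: {}".format(len(kept), dropped))
```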

### <span style='color:#4095b5'>Drop outliers.</span>
<ul>
    <li><span style='color:#52aec9'>Only use trips near Philadelphia.</span></li> 
    <li><span style='color:#52aec9'>Don't keep trips longer than 24 hours.</span></li> 
    <li><span style='color:#52aec9'>Don't keep trips farther than 100 miles.</span></li> 
</ul>

In [15]:
def drop_outliers_latlon(df):
    # No-op for now: the bounds commented out below are for Atlanta, not Philadelphia
#     df = df[df["Start Latitude"] < 33.9].copy()
#     df = df[df["End Latitude"] < 33.9].copy()
#     df = df[df["Start Latitude"] > 33.5].copy()
#     df = df[df["End Latitude"] > 33.5].copy()

#     df = df[df["Start Longitude"] < -83.0].copy()
#     df = df[df["End Longitude"] < -83.0].copy()

    return df
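
<span style='color:#52aec9'>A bounding-box filter for Philadelphia might look like the sketch below. The box limits are assumptions chosen to loosely surround the station coordinates seen in this data, not values from the notebook, and `drop_outliers_latlon_phl` is a hypothetical name. The first sample row reuses the coordinates of a ride whose end latitude has the wrong sign.</span>

```python
import pandas as pd

# Assumed bounding box around Philadelphia; not taken from the notebook
LAT_MIN, LAT_MAX = 39.8, 40.1
LON_MIN, LON_MAX = -75.4, -75.0

def drop_outliers_latlon_phl(df):
    """Keep only rides that start and end inside the assumed box."""
    in_box = (
        df['Start Latitude'].between(LAT_MIN, LAT_MAX)
        & df['End Latitude'].between(LAT_MIN, LAT_MAX)
        & df['Start Longitude'].between(LON_MIN, LON_MAX)
        & df['End Longitude'].between(LON_MIN, LON_MAX)
    )
    return df[in_box].copy()

rides = pd.DataFrame({
    'Start Latitude': [39.89307, 39.956619],
    'Start Longitude': [-75.171677, -75.198624],
    'End Latitude': [-39.951759, 39.949741],   # first row has a sign error
    'End Longitude': [-75.158318, -75.180969],
})
near_phl = drop_outliers_latlon_phl(rides)
```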

In [16]:
def drop_outliers_duration(df):
    df = df[df["Duration"] < pd.to_timedelta('24:00:00')].copy()
    return df

In [17]:
def drop_outliers_distance(df):
    df_temp = df[df["Distance [Miles]"] >= 100.0]
    if df_temp.shape[0]:
        print("Long trip: ", df_temp[['Start Latitude','Start Longitude', 'Start Time', 
                                     'End Latitude', 'End Longitude', 'End Time', 
                                     'Distance [Miles]', 'Duration']])
    df = df[df["Distance [Miles]"] < 100.0].copy()
    return df

In [18]:
def drop_outliers(df):
    global outliers_latlon
    global outliers_duration
    global outliers_distance
    
    rows = df.shape[0]
    df = drop_outliers_latlon(df)
    rows_latlon = df.shape[0]
    outliers_latlon += rows - rows_latlon
    
    df = drop_outliers_duration(df)
    rows_duration = df.shape[0]
    outliers_duration += rows_latlon - rows_duration
    
    df = drop_outliers_distance(df)
    rows_distance = df.shape[0]
    outliers_distance += rows_duration - rows_distance
    
    return df

### <span style='color:#4095b5'>Pull all of the cleaning together.</span>

In [19]:
def clean_df(df, df_hubs=None):
    global trivial_duration
    global trivial_distance
    global outliers_latlon
    global outliers_duration
    global outliers_distance

    df = drop_columns(df)
    df = rename_columns(df)
    df = calc_latlong(df, df_hubs)
    df = drop_nans(df)
    df = clean_datatypes(df)
    df = calc_distances(df)
    df = reorder_cols(df)
    df = drop_trivial_trips(df)
    df = drop_outliers(df)

    # Running totals of rows dropped (the counters are global,
    # so they accumulate across every DataFrame cleaned so far)
    print("Trivial dur: {} dist: {}".format(trivial_duration,
                                            trivial_distance))
    print("Outlier loc: {} dur: {} dist: {}".format(outliers_latlon,
                                                     outliers_duration,
                                                     outliers_distance))
    return df
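
<span style='color:#52aec9'>The global counters work, but the same bookkeeping could be done with a dictionary passed alongside the data. This is only a sketch of that alternative; `apply_and_count` and `keep_under_24h` are hypothetical helpers, and the sample durations are made up.</span>

```python
import pandas as pd

def apply_and_count(df, step, counts, label):
    """Run one filtering step and record how many rows it removed."""
    before = df.shape[0]
    df = step(df)
    counts[label] = counts.get(label, 0) + before - df.shape[0]
    return df

def keep_under_24h(df):
    return df[df['Duration'] < pd.to_timedelta('24:00:00')].copy()

counts = {}
# Made-up rides of 10 minutes and 2000 minutes (over 33 hours)
rides = pd.DataFrame({'Duration': pd.to_timedelta([10, 2000], unit='m')})
rides = apply_and_count(rides, keep_under_24h, counts, 'outliers_duration')
print(counts)
```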

<hr>
<a name="import"></a>
## <span style='color:#3b748a'> III. Import all data from Philadelphia.</span>


In [20]:
# Philadelphia data is quarterly
# For now, just load one year
# indego-trips-2018-q2.csv
trip_data = ['2017-q3', '2017-q4',
             '2018-q1', '2018-q2']

In [21]:
# Dictionary of DataFrames, one for each quarter
df_data = dict()
for d in trip_data:
    df_data[d] = pd.read_csv("../data/phl/indego-trips-"+str(d)+".csv")

<hr>
<a name="clean"></a>

## <span style='color:#3b748a'>IV. Clean all data from Philadelphia.</span>
<ul>
    <li><span style='color:#4095b5'>Reform the data to match Atlanta data.</span></li>
    <li><span style='color:#4095b5'>Drop the trivial trips.</span></li>
    <li><span style='color:#4095b5'>Drop the outliers.</span></li>
    <li><span style='color:#4095b5'>Use appropriate column types.</span></li>
</ul>

In [22]:
# For each quarter, clean the DataFrame
print("Cleaning the data:")
for d in trip_data:
    print("Month: {} \nRows: {}\t Cols: {}\t NaNs: {}".format(d, 
                                                    df_data[d].shape[0], 
                                                    df_data[d].shape[1], 
                                                    sum(df_data[d].isnull().sum())))
    df_data[d] = clean_df(df_data[d])
    check_data(df_data[d])
    

Cleaning the data:
Month: 2017-q3 
Rows: 276785	 Cols: 14	 NaNs: 3980
Drop lat/lon rows: 0
Trivial dur: 6398 dist: 0
Outlier loc: 0 dur: 143 dist: 0
Rows: 270244	 Cols: 13	 NaNs: 0
Month: 2017-q4 
Rows: 183909	 Cols: 14	 NaNs: 3022
Drop lat/lon rows: 0
Trivial dur: 10711 dist: 0
Outlier loc: 0 dur: 251 dist: 0
Rows: 179488	 Cols: 13	 NaNs: 0
Month: 2018-q1 
Rows: 98993	 Cols: 14	 NaNs: 2342
Drop lat/lon rows: 925
Trivial dur: 13329 dist: 0
Outlier loc: 0 dur: 285 dist: 0
Rows: 95416	 Cols: 13	 NaNs: 0
Month: 2018-q2 
Rows: 201624	 Cols: 14	 NaNs: 3576
Drop lat/lon rows: 1708
Long trip:          Start Latitude  Start Longitude          Start Time  End Latitude  \
171810        39.89307       -75.171677 2018-06-19 16:47:00    -39.951759   
171815        39.89307       -75.171677 2018-06-19 16:47:00    -39.951759   
171818        39.89307       -75.171677 2018-06-19 16:47:00    -39.951759   

        End Longitude            End Time  Distance [Miles] Duration  
171810     -75.158318 2018

<hr>
<a name="merge"></a>

## <span style='color:#3b748a'> V. Merge the DataFrames into 1 big DataFrame</span>
<ul>
    <li><span style='color:#4095b5'>Concatenate the quarterly DataFrames and check that no rows were lost.</span></li>
</ul>

In [23]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job in one call
n_rows = sum(df_data[d].shape[0] for d in trip_data)
df = pd.concat([df_data[d] for d in trip_data])

if n_rows != df.shape[0]:
    print("There is a problem with the DataFrame merge!")
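
<span style='color:#52aec9'>If the quarter each row came from is worth keeping, `pd.concat` can take a dictionary and record the keys in the index. The sketch below uses made-up two-row frames in place of the cleaned data; the level names `Quarter` and `Row` are arbitrary choices.</span>

```python
import pandas as pd

# Made-up stand-ins for the cleaned quarterly DataFrames
quarters = {
    '2017-q3': pd.DataFrame({'Start Hub': [3160, 3046]}),
    '2017-q4': pd.DataFrame({'Start Hub': [3006]}),
}

# The dictionary keys become the outer level of the index
merged = pd.concat(quarters, names=['Quarter', 'Row'])
expected_rows = sum(frame.shape[0] for frame in quarters.values())

if merged.shape[0] != expected_rows:
    print("There is a problem with the DataFrame merge!")
```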

<hr>
<a name="explore"></a>

## <span style='color:#3b748a'> VI. Explore the data.</span>

In [24]:
df.head()    

|  | Start Hub | Start Latitude | Start Longitude | Start Date | Start Time | End Hub | End Latitude | End Longitude | End Date | End Time | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3160 | 39.956619 | -75.198624 | 2017-07-01 | 2017-07-01 00:04:00 | 3163 | 39.949741 | -75.180969 | 2017-07-01 | 2017-07-01 00:16:00 | 11883 | 1.0507 | 00:12:00 |
| 1 | 3046 | 39.950119 | -75.144722 | 2017-07-01 | 2017-07-01 00:06:00 | 3101 | 39.942951 | -75.159554 | 2017-07-01 | 2017-07-01 00:37:00 | 5394 | 0.930008 | 00:31:00 |
| 2 | 3006 | 39.952202 | -75.20311 | 2017-07-01 | 2017-07-01 00:06:00 | 3101 | 39.942951 | -75.159554 | 2017-07-01 | 2017-07-01 00:21:00 | 3331 | 2.399353 | 00:15:00 |
| 3 | 3006 | 39.952202 | -75.20311 | 2017-07-01 | 2017-07-01 00:06:00 | 3101 | 39.942951 | -75.159554 | 2017-07-01 | 2017-07-01 00:21:00 | 3515 | 2.399353 | 00:15:00 |
| 4 | 3046 | 39.950119 | -75.144722 | 2017-07-01 | 2017-07-01 00:07:00 | 3101 | 39.942951 | -75.159554 | 2017-07-01 | 2017-07-01 00:37:00 | 11913 | 0.930008 | 00:30:00 |


In [25]:
df.shape

(739755, 13)

In [26]:
df.columns

Index(['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration'],
      dtype='object')

In [27]:
df.describe()

|  | Start Hub | Start Latitude | Start Longitude | End Hub | End Latitude | End Longitude | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|
| count | 739755.0 | 739755.0 | 739755.0 | 739755.0 | 739755.0 | 739755.0 | 739755.0 | 739755.0 | 739755 |
| mean | 3062.888332 | 39.95218 | -75.16956 | 3062.093444 | 39.951789 | -75.16879 | 6139.16474 | 1.043121 | 0 days 00:20:01.069408 |
| std | 45.8133 | 0.011574 | 0.016412 | 45.849313 | 0.011443 | 0.016325 | 3645.971521 | 0.628013 | 0 days 00:50:44.939802 |
| min | 3000.0 | 39.889938 | -75.223991 | 3000.0 | 39.889938 | -75.223991 | 14.0 | 0.0 | 0 days 00:03:00 |
| 25% | 3026.0 | 39.945271 | -75.17971 | 3024.0 | 39.945271 | -75.17971 | 3392.0 | 0.619469 | 0 days 00:08:00 |
| 50% | 3053.0 | 39.951118 | -75.169022 | 3052.0 | 39.95071 | -75.169022 | 5156.0 | 0.946538 | 0 days 00:11:00 |
| 75% | 3098.0 | 39.959229 | -75.159554 | 3098.0 | 39.95694 | -75.158127 | 11039.0 | 1.372862 | 0 days 00:18:00 |
| max | 3188.0 | 39.991791 | -75.129936 | 3188.0 | 39.991791 | -75.129936 | 11965.0 | 5.256906 | 0 days 23:59:00 |


In [28]:
if False:
    check_data(df)
    check_cols(df)

<hr>
<a name="write"></a>

## <span style='color:#3b748a'>VII. Write the full DataFrame to a csv file.</span>

In [29]:
df.to_csv('../data/phl/trips_all.csv', index=False)
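
<span style='color:#52aec9'>One thing to keep in mind when reading this file back: CSV stores the datetime and timedelta columns as text, so they need converting again on load. The sketch below round-trips a one-row, made-up frame through an in-memory buffer rather than the real file.</span>

```python
import io
import pandas as pd

original = pd.DataFrame({
    'Start Time': pd.to_datetime(['2017-07-01 00:04:00']),
    'Duration': pd.to_timedelta([12], unit='m'),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Datetimes can be parsed on load; timedeltas need an explicit conversion
restored = pd.read_csv(buffer, parse_dates=['Start Time'])
restored['Duration'] = pd.to_timedelta(restored['Duration'])
```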

In [31]:
df[df['Start Hub'] == 3000]

|  | Start Hub | Start Latitude | Start Longitude | Start Date | Start Time | End Hub | End Latitude | End Longitude | End Date | End Time | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7214 | 3000 | 39.9526 | -75.1652 | 2017-07-03 | 2017-07-03 21:16:00 | 3097 | 39.978882 | -75.133392 | 2017-07-03 | 2017-07-03 22:08:00 | 3532 | 2.477785 | 00:52:00 |
| 13749 | 3000 | 39.9526 | -75.1652 | 2017-07-05 | 2017-07-05 21:26:00 | 3097 | 39.978882 | -75.133392 | 2017-07-05 | 2017-07-05 22:06:00 | 2704 | 2.477785 | 00:40:00 |
| 16944 | 3000 | 39.9526 | -75.1652 | 2017-07-07 | 2017-07-07 14:52:00 | 3106 | 39.991791 | -75.186371 | 2017-07-07 | 2017-07-07 15:12:00 | 2706 | 2.928164 | 00:20:00 |
| 22881 | 3000 | 39.9526 | -75.1652 | 2017-07-09 | 2017-07-09 14:54:00 | 3108 | 39.953159 | -75.165512 | 2017-07-09 | 2017-07-09 15:20:00 | 2559 | 0.041975 | 00:26:00 |
| 33026 | 3000 | 39.9526 | -75.1652 | 2017-07-12 | 2017-07-12 14:48:00 | 3000 | 39.952600 | -75.165200 | 2017-07-12 | 2017-07-12 14:59:00 | 5277 | 0.000000 | 00:11:00 |
| 33053 | 3000 | 39.9526 | -75.1652 | 2017-07-12 | 2017-07-12 15:00:00 | 3097 | 39.978882 | -75.133392 | 2017-07-12 | 2017-07-12 15:10:00 | 2587 | 2.477785 | 00:10:00 |
| 39017 | 3000 | 39.9526 | -75.1652 | 2017-07-14 | 2017-07-14 16:35:00 | 3013 | 39.963169 | -75.147919 | 2017-07-14 | 2017-07-14 17:07:00 | 5359 | 1.171986 | 00:32:00 |
| 44076 | 3000 | 39.9526 | -75.1652 | 2017-07-16 | 2017-07-16 14:29:00 | 3097 | 39.978882 | -75.133392 | 2017-07-16 | 2017-07-16 14:57:00 | 2521 | 2.477785 | 00:28:00 |
| 71771 | 3000 | 39.9526 | -75.1652 | 2017-07-25 | 2017-07-25 21:20:00 | 3103 | 39.977139 | -75.179398 | 2017-07-25 | 2017-07-25 22:14:00 | 5168 | 1.853235 | 00:54:00 |
| 75423 | 3000 | 39.9526 | -75.1652 | 2017-07-26 | 2017-07-26 21:31:00 | 3097 | 39.978882 | -75.133392 | 2017-07-26 | 2017-07-26 22:06:00 | 3510 | 2.477785 | 00:35:00 |
