# <span style='color:#3b748a'>The rental data for DC is TOO large to upload to GitHub.</span>

<img src="../images/bikes_banner.jpg" width="1000" />

# <span style="color:#37535e">Bicycle Share Usage</span>

##  <span style='color:#3b748a'>Cleaning Capital Bikeshare data</span>

<span style='color:#4095b5'>In this notebook, we load and clean 12 months (Sep 2017 - Aug 2018) of data from the Washington, DC bicycle share. The 2017 data is quarterly and the 2018 data is monthly. Data goes back to 2010 and could be cleaned the same way if it proves useful. For now we use 12 months of data.</span>

<span style='color:#4095b5'>Each row (observation) of data describes one bike trip. Each trip includes a starting place and time, an ending place and time, as well as duration, user, and bike information.</span>

<span style='color:#4095b5'>We start with TODO rentals and TODO columns and clean to TODO rows and TODO columns. Almost 8K of the observations dropped were rentals of trivial duration (less than 3 minutes).</span>


## <span style='color:#3b748a'>Table of contents</span>
* <span style='color:#4095b5'>I.  <a href="#checking"><span style='color:#4095b5'>Data checking functions.</span></a></span>
* <span style='color:#4095b5'>II. <a href="#cleaning"><span style='color:#4095b5'>Data cleaning functions.</span></a></span>
* <span style='color:#4095b5'>III. <a href="#hubs"><span style='color:#4095b5'>Create hub data.</span></a></span>
* <span style='color:#4095b5'>IV. <a href="#convert"><span style='color:#4095b5'>Extract July, Aug, Sep 2017 from 2017 Q3 data.</span></a></span>
* <span style='color:#4095b5'>V. <a href="#import"><span style='color:#4095b5'>Import all data.</span></a></span>
* <span style='color:#4095b5'>VI. <a href="#clean"><span style='color:#4095b5'>Clean all data.</span></a></span>
* <span style='color:#4095b5'>VII. <a href="#merge"><span style='color:#4095b5'>Merge the dataframes into 1 big one.</span></a></span>
* <span style='color:#4095b5'>VIII. <a href="#explore"><span style='color:#4095b5'>Explore the data.</span></a></span>
* <span style='color:#4095b5'>IX. <a href="#write"><span style='color:#4095b5'>Write the full DataFrame to a csv file.</span></a></span>

## <span style='color:#3b748a'>Links</span>
* <a href="https://www.capitalbikeshare.com/system-data"><span style='color:#4095b5'>DC Capital Bikeshare data</span></a>
* <a href="main.ipynb">Main notebook</a>
<hr>

In [27]:
# Let's get the administrative stuff done first
# import all the libraries and set up the plotting

import pandas as pd
import numpy as np
from datetime import datetime,timedelta
# geopy 2.0 removed `vincenty`; alias `geodesic` so the name used below still resolves
from geopy.distance import geodesic as vincenty

# Global counters tracking how many rows each cleaning step drops
trivial_duration = 0
trivial_distance = 0
outliers_latlon = 0
outliers_duration = 0
outliers_distance = 0

# GnBu_d
colors = ['#37535e', '#3b748a', '#4095b5', '#52aec9', '#72bfc4', '#93d0bf']

<hr>
<a name="checking"> </a>
## <span style='color:#3b748a'>I. Data checking functions</span>

In [28]:
# Check which non-numeric columns are missing values and what the possible values are for each object column

def check_cols(df):
    cols = df.select_dtypes([object]).columns  # np.object was removed in NumPy 1.24
    for col in cols:
        print("{} is {} and values are {}.".format(col,df[col].dtype,df[col].unique()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
            
    cols = df.select_dtypes([np.int64,np.float64,np.uint64]).columns
    for col in cols:
        print("{} is {} and values are {} to {}.".format(col,df[col].dtype,df[col].min(),df[col].max()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
    return

In [29]:
# Report the shape, the total number of NaNs, and which columns contain them

def check_data(df):
    s = df.shape

    # Check for null values
    null_data = df.isnull().sum()
    null_data_count = sum(df.isnull().sum())
    print("Rows: {}\t Cols: {}\t NaNs: {}".format(s[0],s[1],null_data_count))
    if  null_data_count > 0:
        print("Columns with NaN: {}".format(list(null_data[null_data > 0].index)))

    return
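
A minimal sketch of the same null-count summary on a toy table, in case it helps to see what `check_data` reports. The column values are made up for illustration; only the counting logic mirrors the function above.

```python
import numpy as np
import pandas as pd

# Toy trip table (hypothetical values) with two missing cells
toy = pd.DataFrame({
    "Start station": ["A", "B", "C"],
    "End station": ["B", None, "A"],
    "Duration": [300, 620, np.nan],
})

# Per-column NaN counts, as computed in check_data
null_counts = toy.isnull().sum()
total_nans = int(null_counts.sum())
cols_with_nans = list(null_counts[null_counts > 0].index)

print("Rows: {}\t Cols: {}\t NaNs: {}".format(toy.shape[0], toy.shape[1], total_nans))
print("Columns with NaN: {}".format(cols_with_nans))
```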

<hr>
<a name="cleaning"></a>
## <span style='color:#3b748a'> II. Data cleaning functions</span>

<span style='color:#4095b5'>These functions clean the trip data.</span>

### <span style='color:#4095b5'>Drop columns *NOT* in Atlanta data.</span>
<span style='color:#52aec9'>I might want to add some back at some point.</span>

In [30]:
def drop_columns(df):
    cols_drop = ['Start station number', 'End station number', 'Member type']

    # Can't drop a column that isn't there
    cols_drop = list(set(df.columns) & set(cols_drop))
    df.drop(cols_drop, axis=1, inplace=True)

    return df
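
An alternative sketch, not what the notebook does: pandas can skip absent labels itself through `errors="ignore"`, which avoids the set intersection. The toy frame below is hypothetical.

```python
import pandas as pd

# Toy frame that lacks 'End station number' (hypothetical values)
toy = pd.DataFrame({
    "Start station": ["A"],
    "Start station number": [31000],
    "Member type": ["Member"],
})

cols_drop = ["Start station number", "End station number", "Member type"]

# errors="ignore" silently skips labels that are not present
trimmed = toy.drop(columns=cols_drop, errors="ignore")
print(list(trimmed.columns))
```

This version returns a new DataFrame and leaves `toy` untouched, whereas `drop_columns` above modifies its argument in place.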

### <span style='color:#4095b5'>Rename columns to match Atlanta data names.</span>

In [31]:
def rename_columns(df):
    df.rename(columns={'Start station' : 'Start Hub', 
                   'Start date' : 'Start Time', 
                   'End station' : 'End Hub', 
                   'End date' : 'End Time', 
                   'Bike number' :'Bike Name',
                   'Duration' : 'Duration'
                  }, inplace=True)
    return df

### <span style='color:#4095b5'>Merge with hub data.</span>
<span style='color:#52aec9'>The DC trip data has no coordinates, so we look up each start/end hub in the hub table to get the start/end lat/long.</span>

In [32]:
def calc_latlong(df, df_hubs):
    df = df.merge(df_hubs, left_on='Start Hub', right_on='Hub', how='left')
    df.drop('Hub', axis = 1, inplace=True)
    df.rename(columns={'Latitude' : 'Start Latitude', 
                       'Longitude' : 'Start Longitude'
                      }, inplace=True)

    df = df.merge(df_hubs, left_on='End Hub', right_on='Hub', how='left')
    df.drop('Hub', axis = 1, inplace=True)
    df.rename(columns={'Latitude' : 'End Latitude', 
                       'Longitude' : 'End Longitude'
                      }, inplace=True)

    return df
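
A small sketch of the same lookup on toy data. The helper name `add_coords` and all values are hypothetical; it renames the hub columns before merging so one function serves both the start and end side. The left join keeps trips whose hub is missing from the hub table, which then show up as NaN coordinates.

```python
import pandas as pd

# Hypothetical hub table and trips; hub 'Z' has no coordinates on file
hubs = pd.DataFrame({
    "Hub": ["A", "B"],
    "Latitude": [38.90, 38.88],
    "Longitude": [-77.04, -77.00],
})
trips = pd.DataFrame({"Start Hub": ["A", "B", "Z"], "End Hub": ["B", "A", "A"]})

def add_coords(df, df_hubs, side):
    """Attach '<side> Latitude/Longitude' by looking up '<side> Hub' in df_hubs."""
    renamed = df_hubs.rename(columns={
        "Hub": side + " Hub",
        "Latitude": side + " Latitude",
        "Longitude": side + " Longitude",
    })
    return df.merge(renamed, on=side + " Hub", how="left")

located = add_coords(add_coords(trips, hubs, "Start"), hubs, "End")

# Trips whose start hub was not found in the hub table
unmatched = located[located["Start Latitude"].isnull()]
print(unmatched)
```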

### <span style='color:#4095b5'>Drop rows with nulls.</span>
<span style='color:#52aec9'>There are none to drop right now, so this function only prints any rows that are missing a start or end latitude.</span>

In [33]:
def drop_nans(df):

    if (sum(df.isnull().sum()) > 0):
        print(df[df['End Latitude'].isnull() | df['Start Latitude'].isnull()])
    return df

### <span style='color:#4095b5'>Use appropriate datatypes.</span>
<span style='color:#52aec9'>For example, fix Date/Time objects and cast Latitude and Longitude to floats.</span>

In [34]:
def clean_datatypes(df):
    df['Start Latitude'] = df['Start Latitude'].astype(float)
    df['Start Longitude'] = df['Start Longitude'].astype(float)
    df['End Latitude'] = df['End Latitude'].astype(float)
    df['End Longitude'] = df['End Longitude'].astype(float)

    # Convert times to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['End Time'] = pd.to_datetime(df['End Time'])

    # Derive calendar dates from the datetimes
    df['Start Date'] = df['Start Time'].dt.date
    df['End Date'] = df['End Time'].dt.date

    # Durations are in seconds; strip thousands separators if they were read as text
    if df['Duration'].dtype == object:
        df['Duration'] = df['Duration'].map(lambda cell: cell.replace(',',''))
    df['Duration'] = df['Duration'].astype(float)
    df['Duration'] = pd.to_timedelta(df['Duration'], unit='s')
    
    return df
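
To illustrate the duration fix in isolation, here is a sketch on a made-up Series of second counts stored as text with thousands separators. It uses the vectorized `.str.replace` instead of `map`, which is a stylistic choice and not the notebook's own code.

```python
import pandas as pd

# Hypothetical second counts that were read in as text
raw = pd.Series(["1,234", "60", "3,600"])

# Strip the separators, cast to float, then interpret the numbers as seconds
seconds = raw.str.replace(",", "", regex=False).astype(float)
durations = pd.to_timedelta(seconds, unit="s")
print(durations)
```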

### <span style='color:#4095b5'>Calculate distances.</span>
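<span style='color:#52aec9'>This is a poor approximation: it is the distance between the start and end stations, not the route actually ridden. If a bike was checked out and returned to the same station, the distance will be trivial.</span>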

In [35]:
def distance_calc (row):
    start = (row['Start Latitude'], row['Start Longitude'])
    stop = (row['End Latitude'], row['End Longitude'])
    return vincenty(start, stop).miles

In [36]:
def calc_distances(df):
    df['Distance [Miles]'] = df.apply (lambda row: distance_calc (row),axis=1)
    return df
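
For readers without geopy installed, the sketch below swaps in the haversine (great-circle) formula, which treats the Earth as a sphere. This is a different technique from the ellipsoidal calculation geopy performs, so results differ slightly. The function name and the radius constant are assumptions of this sketch; the coordinates are copied from one of the sample rows shown in section VI.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles on a sphere of mean Earth radius."""
    earth_radius_miles = 3958.8  # assumed mean radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# 3rd & Tingey St SE to M St & New Jersey Ave SE
d = haversine_miles(38.87489, -77.00003, 38.876539, -77.004185)
print(round(d, 3))
```

At station-to-station scale the spherical result should land within a few thousandths of a mile of the 0.251264 miles reported for that row.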

### <span style='color:#4095b5'>Reorder columns.</span>
<span style='color:#52aec9'>Make order same as Atlanta data.</span>

In [37]:
def reorder_cols(df):
    columns = ['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration']

    df = df.reindex(columns=columns)
    return df

### <span style='color:#4095b5'>Drop trivial trips.</span>
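<span style='color:#52aec9'>A trip is trivial if it lasts less than 3 mins and starts and ends at the same hub, or if its computed distance is 0.02 miles or less. The duration filter is currently commented out in `drop_trivial_trips`, so only the distance filter is applied.</span>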

In [38]:
def drop_trivial_trips_distance(df):
    df = df[df["Distance [Miles]"] > 0.02].copy()
    return df

In [55]:
def drop_trivial_trips_duration(df):
    df = df[(df["Duration"] >= pd.to_timedelta('00:03:00')) | (df['Start Hub'] != df['End Hub'])].copy()
    return df

In [56]:
def drop_trivial_trips(df):
    global trivial_duration
    global trivial_distance

    rows = df.shape[0]
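    # Duration filter is currently disabled, so trivial_duration stays at 0
#    df = drop_trivial_trips_duration(df)
    rows_duration = df.shape[0]
    trivial_duration += rows-rows_duration

    # Distance is computed from the hub coordinates; drop trips of 0.02 miles or less
    df = drop_trivial_trips_distance(df)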
    rows_distance = df.shape[0]
    trivial_distance += rows_duration-rows_distance

    return df
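
A toy comparison of the two filters, using made-up trips. It suggests one consequence worth keeping in mind: because distance is computed between stations, a round trip that returns to its starting hub has zero distance and is removed by the distance filter regardless of how long it lasted.

```python
import pandas as pd

# Hypothetical trips: row 1 is a 10-minute round trip back to hub A
toy = pd.DataFrame({
    "Start Hub": ["A", "A", "A", "B"],
    "End Hub": ["A", "A", "B", "B"],
    "Duration": pd.to_timedelta([60, 600, 60, 179], unit="s"),
    "Distance [Miles]": [0.0, 0.0, 0.4, 0.0],
})

# Same conditions as drop_trivial_trips_duration
long_enough = toy["Duration"] >= pd.to_timedelta("00:03:00")
changed_hub = toy["Start Hub"] != toy["End Hub"]
kept_by_duration = toy[long_enough | changed_hub]

# Same condition as drop_trivial_trips_distance
kept_by_distance = toy[toy["Distance [Miles]"] > 0.02]

print(list(kept_by_duration.index), list(kept_by_distance.index))
```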

### <span style='color:#4095b5'>Drop outliers.</span>
<ul>
    <li><span style='color:#52aec9'>Only use trips near DC.</span></li> 
    <li><span style='color:#52aec9'>Only use trips no longer than 24 hours.</span></li> 
</ul>

In [57]:
def drop_outliers_latlon(df):
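#     NOTE: the bounds below look like leftovers from the Atlanta notebook; no lat/lon filter is applied to DC yet
#     df = df[df["Start Latitude"] < 33.9].copy()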
#     df = df[df["End Latitude"] < 33.9].copy()
#     df = df[df["Start Latitude"] > 33.5].copy()
#     df = df[df["End Latitude"] > 33.5].copy()

#     df = df[df["Start Longitude"] < -83.0].copy()
#     df = df[df["End Longitude"] < -83.0].copy()

    return df

In [58]:
def drop_outliers_duration(df):
    df = df[df["Duration"] <= pd.to_timedelta('24:00:00')].copy()
    return df

In [59]:
def drop_outliers_distance(df):
    df_temp = df[df["Distance [Miles]"] >= 100.0]
    if df_temp.shape[0]:
        print("Long trip: ", df_temp[['Start Latitude','Start Longitude', 'Start Time', 
                                     'End Latitude', 'End Longitude', 'End Time', 
                                     'Distance [Miles]', 'Duration']])
    df = df[df["Distance [Miles]"] < 100.0].copy()
    return df

In [60]:
def drop_outliers(df):
    global outliers_latlon
    global outliers_duration
    global outliers_distance
    
    rows = df.shape[0]
    df = drop_outliers_latlon(df)
    rows_latlon = df.shape[0]
    outliers_latlon += rows - rows_latlon
    
    df = drop_outliers_duration(df)
    rows_duration = df.shape[0]
    outliers_duration += rows_latlon - rows_duration
    
    df = drop_outliers_distance(df)
    rows_distance = df.shape[0]
    outliers_distance += rows_duration - rows_distance
    
    return df
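
One possible way to do the same bookkeeping without module-level globals is to pass a dictionary of counters through each filter. This is only a sketch; `apply_filter` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def apply_filter(df, mask, counts, label):
    """Keep rows where mask is True and record how many rows were dropped."""
    kept = df[mask].copy()
    counts[label] = counts.get(label, 0) + len(df) - len(kept)
    return kept

# Hypothetical trips: one longer than 24 hours, one longer than 100 miles
toy = pd.DataFrame({
    "Duration": pd.to_timedelta([0.5, 30, 2], unit="h"),
    "Distance [Miles]": [1.2, 3.0, 150.0],
})

dropped = {}
toy = apply_filter(toy, toy["Duration"] <= pd.to_timedelta("24:00:00"), dropped, "duration")
toy = apply_filter(toy, toy["Distance [Miles]"] < 100.0, dropped, "distance")
print(dropped)
```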

### <span style='color:#4095b5'>Pull all of the cleaning together.</span>


In [61]:
def clean_df(df, df_hubs=None):
    global trivial_duration
    global trivial_distance
    global outliers_latlon
    global outliers_duration
    global outliers_distance

    df = drop_columns(df)
    df = rename_columns(df)
    df = calc_latlong(df, df_hubs)
    df = drop_nans(df)
    df = clean_datatypes(df)
    df = calc_distances(df)
    df = reorder_cols(df)
    df = drop_trivial_trips(df)
    df = drop_outliers(df)

    # Rows dropped so far; the counters are global, so the totals accumulate across calls
    print("Trivial dur: {} dist: {}".format(trivial_duration, trivial_distance))
    print("Outlier loc: {} dur: {} dist: {}".format(outliers_latlon,
                                                     outliers_duration,
                                                     outliers_distance))
    return df

<hr>
<a name="hubs"></a>
## <span style='color:#3b748a'>III. Create hubs</span>
<ul>
    <li><span style='color:#4095b5'>Load all rental data.</span></li>
    <li><span style='color:#4095b5'>Create set of all hubs.</span></li>
    <li><span style='color:#4095b5'>Write to csv.</span></li>
    <li><span style='color:#4095b5'>Use <a href="http://www.mapdevelopers.com/batch_geocode_tool.php">geocode tool</a> to determine lat/lon for each hub.</span></li>
</ul>

In [62]:
def clean_hubs(df, hubs):
    hubs = hubs.union(set(df['Start station']), set(df['End station']))
    
    return hubs

In [63]:
if False:
    trip_data = ['2017Q3', '2017Q4',
               '201801', '201802', '201803',
               '201804', '201805', '201806',
               '201807', '201808']

    hubs = set()
    for d in trip_data:
        df_temp = pd.read_csv("../data/dc/"+str(d)+"-capitalbikeshare-tripdata.csv")
        hubs = clean_hubs(df_temp,hubs)
        print(hubs)

    df_hubs = pd.DataFrame(np.array(list(hubs)))
    df_hubs.to_csv('../data/dc/hub-names.csv', index=False)
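
The hub collection step can be sketched on toy frames as follows. Sorting the names and naming the column `Hub` are assumptions of this sketch (the notebook writes an unnamed column in set order); sorting simply makes the output file reproducible between runs.

```python
import pandas as pd

# Two hypothetical periods of trip data
frames = [
    pd.DataFrame({"Start station": ["A", "B"], "End station": ["B", "C"]}),
    pd.DataFrame({"Start station": ["C", "D"], "End station": ["A", "D"]}),
]

# Accumulate every station name seen as a start or an end
hubs = set()
for frame in frames:
    hubs |= set(frame["Start station"]) | set(frame["End station"])

# Sorted names give a stable row order for the CSV handed to the geocoder
df_hub_names = pd.DataFrame({"Hub": sorted(hubs)})
print(df_hub_names)
```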

<hr>
<a name="convert"></a>
## <span style='color:#3b748a'>IV. Extract July, Aug, Sep 2017 from 2017 Q3 data.</span>
<ul>
    <li><span style='color:#4095b5'>The 2017 trip data is quarterly.</span></li>
    <li><span style='color:#4095b5'>The Q3 file is too large to use easily, so we split it into one file per month.</span></li>
</ul>

In [64]:
if False:
    d = "2017Q3"
    df = pd.read_csv("../data/dc/"+str(d)+"-capitalbikeshare-tripdata.csv")

    # Convert times to datetime (temporary columns, dropped again before writing)
    df['Start Time'] = pd.to_datetime(df['Start date'])
    df['End Time'] = pd.to_datetime(df['End date'])

    # July: trips starting before 2017-08-01
    df7 = df[df['Start Time'] < datetime.strptime('2017-08-01 00:00:00', '%Y-%m-%d %H:%M:%S')].copy()
    df7.drop(['Start Time', 'End Time'], axis=1, inplace=True)
    print(df7.shape)
    df7.to_csv('../data/dc/201707-capitalbikeshare-tripdata.csv', index=False)
    
    # August: trips starting on or after 2017-08-01 and before 2017-09-01
    df8 = df[df['Start Time'] < datetime.strptime('2017-09-01 00:00:00', '%Y-%m-%d %H:%M:%S')].copy()
    df8 = df8[df8['Start Time'] >= datetime.strptime('2017-08-01 00:00:00', '%Y-%m-%d %H:%M:%S')].copy()
    df8.drop(['Start Time', 'End Time'], axis=1, inplace=True)
    print(df8.shape)
    df8.to_csv('../data/dc/201708-capitalbikeshare-tripdata.csv', index=False)
    
    # September: trips starting on or after 2017-09-01
    df9 = df[df['Start Time'] >= datetime.strptime('2017-09-01 00:00:00', '%Y-%m-%d %H:%M:%S')].copy()
    df9.drop(['Start Time', 'End Time'], axis=1, inplace=True)
    print(df9.shape)
    df9.to_csv('../data/dc/201709-capitalbikeshare-tripdata.csv', index=False)
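
A more compact way to express the same split, sketched on made-up rows: derive a `YYYYMM` key from the start time and group on it. This is an alternative formulation, not the notebook's code, and the file-writing step is left as a comment.

```python
import pandas as pd

# Hypothetical rows spanning the three months of a quarter
toy = pd.DataFrame({
    "Start date": ["2017-07-31 23:59:59", "2017-08-01 00:00:00",
                   "2017-08-15 12:00:00", "2017-09-30 08:30:00"],
    "Duration": [300, 450, 900, 120],
})

# Key such as '201708', matching the monthly file-name prefix
month = pd.to_datetime(toy["Start date"]).dt.strftime("%Y%m").rename("month")

monthly = {key: part for key, part in toy.groupby(month)}
for key, part in monthly.items():
    # Each part could be written to '<key>-capitalbikeshare-tripdata.csv'
    print(key, part.shape)
```

Because the key is a separate Series, the original columns are written out unchanged and no temporary columns need to be dropped.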

<hr>
<a name="import"></a>
## <span style='color:#3b748a'>V. Import all data from Washington DC</span>
<ul>
    <li><span style='color:#4095b5'>Need hub data to calculate latitude/longitude.</span></li>
</ul>

In [65]:
# 201808-capitalbikeshare-tripdata
# DC data is quarterly for 2017 and monthly for 2018
trip_data = ['201709', '2017Q4',
               '201801', '201802', '201803',
               '201804', '201805', '201806',
               '201807', '201808']

In [66]:
# Dictionary of DataFrames, one for each time period
df_data = dict()
for d in trip_data:
    df_data[d] = pd.read_csv("../data/dc/"+str(d)+"-capitalbikeshare-tripdata.csv")

In [67]:
# We need the hubs in order to look up latitude/longitude
df_hubs = pd.read_csv("../data/dc/hubs.csv")

<hr>
<a name="clean"></a>

## <span style='color:#3b748a'>VI. Clean all data from Washington DC.</span>
<ul>
    <li><span style='color:#4095b5'>For now, drop most of the columns.</span></li>
    <li><span style='color:#4095b5'>Drop the trivial trips.</span></li>
    <li><span style='color:#4095b5'>Drop the outliers.</span></li>
    <li><span style='color:#4095b5'>Use appropriate column types.</span></li>
</ul>

In [68]:
# Clean the DataFrame for each period (a month, or the quarter 2017Q4)
print("Cleaning the data:")
for d in trip_data:
    print("Month: {} \nRows: {}\t Cols: {}\t NaNs: {}".format(d, 
                                                    df_data[d].shape[0], 
                                                    df_data[d].shape[1], 
                                                    sum(df_data[d].isnull().sum())))
    df_data[d] = clean_df(df_data[d], df_hubs)
    check_data(df_data[d])

Cleaning the data:
Month: 201709 
Rows: 391371	 Cols: 9	 NaNs: 0
Trivial dur: 0 dist: 15648
Outlier loc: 0 dur: 0 dist: 0
Rows: 375723	 Cols: 13	 NaNs: 0
Month: 2017Q4 
Rows: 815264	 Cols: 9	 NaNs: 0
Trivial dur: 0 dist: 40559
Outlier loc: 0 dur: 0 dist: 0
Rows: 790353	 Cols: 13	 NaNs: 0
Month: 201801 
Rows: 168590	 Cols: 9	 NaNs: 0
Trivial dur: 0 dist: 44346
Outlier loc: 0 dur: 0 dist: 0
Rows: 164803	 Cols: 13	 NaNs: 0
Month: 201802 
Rows: 182378	 Cols: 9	 NaNs: 0
Trivial dur: 0 dist: 48997
Outlier loc: 0 dur: 0 dist: 0
Rows: 177727	 Cols: 13	 NaNs: 0
Month: 201803 
Rows: 238998	 Cols: 9	 NaNs: 0
Trivial dur: 0 dist: 57879
Outlier loc: 0 dur: 0 dist: 0
Rows: 230116	 Cols: 13	 NaNs: 0
Month: 201804 
Rows: 328907	 Cols: 9	 NaNs: 0
Trivial dur: 0 dist: 73433
Outlier loc: 0 dur: 0 dist: 0
Rows: 313353	 Cols: 13	 NaNs: 0
Month: 201805 
Rows: 374115	 Cols: 9	 NaNs: 0
Trivial dur: 0 dist: 91406
Outlier loc: 0 dur: 0 dist: 0
Rows: 356142	 Cols: 13	 NaNs: 0
Month: 201806 
Rows: 392338	 Cols: 9

In [69]:
df_data['201808'].sort_values(by='Duration').head()

| | Start Hub | Start Latitude | Start Longitude | Start Date | Start Time | End Hub | End Latitude | End Longitude | End Date | End Time | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 129350 | 17th & K St NW | 38.902755 | -77.038638 | 2018-08-10 | 2018-08-10 18:30:41 | 17th & K St NW / Farragut Square | 38.901957 | -77.038995 | 2018-08-10 | 2018-08-10 18:31:42 | W21702 | 0.058342 | 00:01:00 |
| 61460 | 3rd & Tingey St SE | 38.87489 | -77.00003 | 2018-08-05 | 2018-08-05 17:18:37 | M St & New Jersey Ave SE | 38.876539 | -77.004185 | 2018-08-05 | 2018-08-05 17:19:37 | W23344 | 0.251264 | 00:01:00 |
| 204607 | Columbia & Ontario Rd NW | 38.922184 | -77.040177 | 2018-08-16 | 2018-08-16 16:47:59 | Adams Mill & Columbia Rd NW | 38.923044 | -77.042555 | 2018-08-16 | 2018-08-16 16:48:59 | W23291 | 0.141233 | 00:01:00 |
| 172363 | 11th & F St NW | 38.897857 | -77.054845 | 2018-08-14 | 2018-08-14 10:44:14 | 10th & G St NW | 38.8987 | -77.026488 | 2018-08-14 | 2018-08-14 10:45:15 | W20761 | 1.529667 | 00:01:00 |
| 99991 | 20th St & Virginia Ave NW | 38.894264 | -77.047115 | 2018-08-08 | 2018-08-08 17:51:35 | 20th & E St NW | 38.895321 | -77.044278 | 2018-08-08 | 2018-08-08 17:52:35 | W20604 | 0.169445 | 00:01:00 |


<hr>
<a name="merge"></a>

## <span style='color:#3b748a'>VII. Merge the DataFrames into 1 big DataFrame</span>

In [None]:
# DataFrame.append was removed in pandas 2.0; concatenate all periods in one call
n_rows = sum(df_data[d].shape[0] for d in trip_data)
df = pd.concat([df_data[d] for d in trip_data])

if n_rows != df.shape[0]:
    print("There is a problem with the DataFrame merge!")

<hr>
<a name="explore"></a>

## <span style='color:#3b748a'> VIII. Explore the data.</span>

In [None]:
df.head()    

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
if False:
    check_data(df)
    check_cols(df)

<hr>
<a name="write"></a>

## <span style='color:#3b748a'>IX. Write the full DataFrame to a csv file.</span>

In [None]:
df.to_csv('../data/dc/trips_all.csv', index=False)