# <span style='color:#3b748a'>The rental data for Boston is TOO large to upload to GitHub.</span>

<img src="../images/bikes_banner.jpg" width="1000" />

## <span style="color:#37535e">Bicycle Share Usage</span>

##  <span style='color:#3b748a'>Cleaning Boston Bluebikes data</span>

<span style='color:#4095b5'>This notebook loads and cleans 12 months (September 2017 - August 2018) of data from the Boston Bluebikes bicycle share. There is data going back to 2015 that could be cleaned and used.</span>

<span style='color:#4095b5'>Each row (observation) describes one rental. Each rental includes a starting place and time, an ending place and time, as well as duration, user, and bike information.</span>

## <span style='color:#3b748a'>Table of contents</span>
* <span style='color:#4095b5'>I.  <a href="#checking"><span style='color:#4095b5'>Data checking functions.</span></a></span>
* <span style='color:#4095b5'>II. <a href="#cleaning"><span style='color:#4095b5'>Data cleaning functions.</span></a></span>
* <span style='color:#4095b5'>III. <a href="#import"><span style='color:#4095b5'>Import all data.</span></a></span>
* <span style='color:#4095b5'>IV. <a href="#clean"><span style='color:#4095b5'>Clean all data.</span></a></span>
* <span style='color:#4095b5'>V. <a href="#merge"><span style='color:#4095b5'>Merge the dataframes into 1 big one.</span></a></span>
* <span style='color:#4095b5'>VI. <a href="#explore"><span style='color:#4095b5'>Explore the data.</span></a></span>
* <span style='color:#4095b5'>VII. <a href="#write"><span style='color:#4095b5'>Write the full dataframe to a csv file.</span></a></span>

## <span style='color:#3b748a'>External data required</span>
<ul>
    <li><span style='color:#4095b5'>../data/bos/&lt;month&gt;-bluebikes-tripdata.csv for each month in (201709 to 201808); NOT available in GitHub</span></li>
</ul>


## <span style='color:#3b748a'>Links</span>
<ul>
   <li><a href="https://www.bluebikes.com/system-data"><span style='color:#4095b5'>Boston Bluebikes data</span></a></li>
    <li><a href="plot_bos.ipynb"><span style='color:#4095b5'>Plotting Boston data.</span></a></li>
</ul>
<hr>

In [1]:
# Let's get the administrative stuff done first
# import all the libraries and set up the plotting

import pandas as pd
import numpy as np
from datetime import datetime,timedelta
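# geopy 2.0 removed vincenty; fall back to geodesic under the same name
try:
    from geopy.distance import vincenty
except ImportError:
    from geopy.distance import geodesic as vincenty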

# Global counters tracking how many rows each cleaning step drops
trivial_duration = 0
trivial_distance = 0
outliers_latlon = 0
outliers_duration = 0
outliers_distance = 0

# GnBu_d
colors = ['#37535e', '#3b748a', '#4095b5', '#52aec9', '#72bfc4', '#93d0bf']

<hr>
<a name="checking"> </a>
## <span style='color:#3b748a'>I. Data checking functions</span>

In [2]:
# Check which non-numeric columns are missing values 
# and what the possible values are for each object column

def check_cols(df):
    # np.object was removed in NumPy 1.24; the builtin object selects the same columns
    cols = df.select_dtypes(include=[object]).columns
    for col in cols:
        print("{} is {} and values are {}.".format(col,df[col].dtype,df[col].unique()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
            
    cols = df.select_dtypes([np.int64,np.float64,np.uint64]).columns
    for col in cols:
        print("{} is {} and values are {} to {}.".format(col,df[col].dtype,df[col].min(),df[col].max()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
    return

In [3]:
# Report the shape, the total number of NaNs, and which columns contain NaNs

def check_data(df):
    s = df.shape

    # Check for null values
    null_data = df.isnull().sum()
    null_data_count = sum(null_data)
    print("Rows: {}\t Cols: {}\t NaNs: {}".format(s[0],s[1],null_data_count))
    if  null_data_count > 0:
        print("Columns with NaN: {}".format(list(null_data[null_data > 0].index)))

    return

<hr>
<a name="cleaning"></a>
## <span style='color:#3b748a'> II. Data cleaning functions</span>

<span style='color:#4095b5'>These functions clean the rental data.</span>

### <span style='color:#4095b5'>Drop columns *NOT* in Atlanta data.</span>
<span style='color:#52aec9'>I might want to add some back at some point.</span>

In [4]:
def drop_columns(df):
    cols_drop = ['start station id', 'end station id', 'usertype', 'birth year', 'gender']

    # Can't drop a column that isn't there
    cols_drop = list(set(df.columns) & set(cols_drop))
    df.drop(cols_drop, axis=1, inplace=True)

    return df

### <span style='color:#4095b5'>Rename columns to match Atlanta data names.</span>

In [5]:
def rename_columns(df):
    df.rename(columns={'start station name' : 'Start Hub', 
                       'start station latitude' : 'Start Latitude',
                       'start station longitude' : 'Start Longitude',
                       'starttime' : 'Start Time', 
                       'end station name' : 'End Hub', 
                       'end station latitude' :'End Latitude', 
                       'end station longitude' : 'End Longitude', 
                       'stoptime' : 'End Time', 
                       'bikeid' :'Bike Name',
                       'tripduration' : 'Duration'
                      }, inplace=True)
    return df

### <span style='color:#4095b5'>Merge with hub data.</span>
<span style='color:#52aec9'>We may have to use the start/end hubs to get start/end lat/long.</span>

In [6]:
def calc_latlong(df, df_hubs):
    # BOS: all rentals, even those from hubs, have lat/lon
    return df

### <span style='color:#4095b5'>Drop rows with nulls.</span>
<span style='color:#52aec9'>Don't have any to drop right now.</span>

In [7]:
def drop_nans(df):
    return df

### <span style='color:#4095b5'>Use appropriate datatypes.</span>
<span style='color:#52aec9'>For example, fix Date/Time objects and cast Latitude and Longitude to floats.</span>

In [8]:
def clean_datatypes(df):
    df['Start Latitude'] = df['Start Latitude'].astype(float)
    df['Start Longitude'] = df['Start Longitude'].astype(float)
    df['End Latitude'] = df['End Latitude'].astype(float)
    df['End Longitude'] = df['End Longitude'].astype(float)

    # Convert times to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['End Time'] = pd.to_datetime(df['End Time'])

    # CREATE dates in datetime
    df['Start Date'] = df['Start Time'].dt.date
    df['End Date'] = df['End Time'].dt.date

    # Fix the durations
    df['Duration'] = pd.to_timedelta(df['Duration'], unit='s')
    
    return df
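
<span style='color:#52aec9'>As a quick illustration of the conversions above, here is a minimal sketch on a toy DataFrame. The two rows are made-up values shaped like the renamed columns, not real rentals; the raw duration is taken to be in seconds, as the `unit='s'` argument above implies.</span>

```python
import pandas as pd

# Toy rows shaped like the renamed columns (illustrative values only)
toy = pd.DataFrame({
    "Start Time": ["2017-09-01 00:00:56", "2017-09-01 00:01:08"],
    "Duration": [759, 372],  # seconds
})

# Same conversions as clean_datatypes, applied to the toy frame
toy["Start Time"] = pd.to_datetime(toy["Start Time"])
toy["Start Date"] = toy["Start Time"].dt.date
toy["Duration"] = pd.to_timedelta(toy["Duration"], unit="s")

print(toy.dtypes)
```

<span style='color:#52aec9'>After the conversion, durations can be compared directly against values such as `pd.to_timedelta('00:03:00')`, which the trivial-trip and outlier filters rely on.</span>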

### <span style='color:#4095b5'>Calculate distances.</span>
<span style='color:#52aec9'><b>Poor</b> approximation: the distance is computed between the start and end station coordinates, not along the route ridden. If a bike was taken from and returned to the same station, the distance is zero.</span>

In [9]:
def distance_calc (row):
    start = (row['Start Latitude'], row['Start Longitude'])
    stop = (row['End Latitude'], row['End Longitude'])

    return vincenty(start, stop).miles

In [10]:
def calc_distances(df):
    df['Distance [Miles]'] = df.apply(distance_calc, axis=1)
    return df
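
<span style='color:#52aec9'>For readers without geopy, the station-to-station distance can be approximated with the haversine formula. This is a swapped-in technique, not what the notebook uses: it treats the Earth as a sphere, so results differ slightly from geopy's ellipsoidal distance. The function name and the radius constant are assumptions of this sketch.</span>

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# University Park -> MIT Vassar St, coordinates taken from the data preview
d = haversine_miles(42.362648, -71.100061, 42.355601, -71.103945)
print(round(d, 2))
```

<span style='color:#52aec9'>For this pair of stations the result is within about a hundredth of a mile of the 0.525436 miles shown in the data preview.</span>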

### <span style='color:#4095b5'>Reorder columns.</span>
<span style='color:#52aec9'>Make order same as Atlanta data.</span>

In [11]:
def reorder_cols(df):
    columns = ['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration']

    df = df.reindex(columns=columns)
    return df

### <span style='color:#4095b5'>Drop trivial trips.</span>
<span style='color:#52aec9'>Trivial trips last less than 3 minutes. We do not drop trips for trivial distance, because the distance is computed from station coordinates rather than measured.</span>

In [12]:
def drop_trivial_trips_distance(df):
    df = df[df["Distance [Miles]"] > 0.02].copy()
    return df

In [13]:
def drop_trivial_trips_duration(df):
    df = df[df["Duration"] >= pd.to_timedelta('00:03:00')].copy()
    return df

In [14]:
def drop_trivial_trips(df):
    global trivial_duration
    global trivial_distance

    rows = df.shape[0]
    df = drop_trivial_trips_duration(df)
    rows_duration = df.shape[0]
    trivial_duration += rows-rows_duration

    # Calculated distance, don't drop
    # df = drop_trivial_trips_distance(df)
    rows_distance = df.shape[0]
    trivial_distance += rows_duration-rows_distance

    return df
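
<span style='color:#52aec9'>Because the counters are global and updated with `+=`, the totals printed during cleaning accumulate across months rather than resetting. A minimal sketch of an alternative that returns the per-call count instead is below; `drop_short_trips` is a hypothetical helper, not part of the notebook.</span>

```python
import pandas as pd


def drop_short_trips(df, min_duration="00:03:00"):
    """Return (filtered copy, number of rows dropped). Hypothetical helper."""
    keep = df["Duration"] >= pd.to_timedelta(min_duration)
    return df[keep].copy(), int((~keep).sum())


# Toy durations of 45 s, 3 min and 15 min
toy = pd.DataFrame({"Duration": pd.to_timedelta([45, 180, 900], unit="s")})
kept, n_dropped = drop_short_trips(toy)
print(len(kept), n_dropped)
```

<span style='color:#52aec9'>Returning the count lets the caller decide whether to report it per month or add it to a running total.</span>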

### <span style='color:#4095b5'>Drop outliers.</span>
<ul>
    <li><span style='color:#52aec9'>Only use rentals near Boston.</span></li> 
    <li><span style='color:#52aec9'>Don't keep rentals longer than 24 hours.</span></li> 
    <li><span style='color:#52aec9'>Don't keep rentals of 100 miles or more.</span></li> 
</ul>

In [15]:
def drop_outliers_latlon(df):
    df = df[df["Start Latitude"] < 42.45].copy()
    df = df[df["End Latitude"] < 42.45].copy()
    df = df[df["Start Latitude"] > 42.25].copy()
    df = df[df["End Latitude"] > 42.25].copy()

    return df
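
<span style='color:#52aec9'>The filter above bounds latitude only. A sketch of a full bounding box that also checks longitude is shown below. The latitude bounds are the ones used above; the longitude bounds and the `in_box` helper are hypothetical and would need to be checked against the station map before use.</span>

```python
import pandas as pd

LAT_BOUNDS = (42.25, 42.45)    # same latitude limits as the filter above
LON_BOUNDS = (-71.20, -70.95)  # hypothetical; not part of the original filter


def in_box(df, prefix):
    """Boolean mask: True where the '<prefix> Latitude/Longitude' pair is inside the box."""
    lat = df[prefix + " Latitude"]
    lon = df[prefix + " Longitude"]
    return ((lat > LAT_BOUNDS[0]) & (lat < LAT_BOUNDS[1]) &
            (lon > LON_BOUNDS[0]) & (lon < LON_BOUNDS[1]))


# Toy rows: one valid trip, one ending far away, one with zeroed coordinates
toy = pd.DataFrame({
    "Start Latitude": [42.36, 42.36, 0.0],
    "Start Longitude": [-71.10, -71.10, 0.0],
    "End Latitude": [42.35, 45.50, 0.0],
    "End Longitude": [-71.10, -73.57, 0.0],
})
near_boston = toy[in_box(toy, "Start") & in_box(toy, "End")].copy()
print(len(near_boston))
```

<span style='color:#52aec9'>Building one mask and filtering once also avoids the four intermediate copies made by the chained version.</span>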

In [16]:
def drop_outliers_duration(df):
    df = df[df["Duration"] <= pd.to_timedelta('24:00:00')].copy()
    return df

In [17]:
def drop_outliers_distance(df):
    df_temp = df[df["Distance [Miles]"] >= 100.0]
    if df_temp.shape[0]:
        print("Long trip: ", df_temp[['Start Latitude','Start Longitude', 'Start Time', 
                                     'End Latitude', 'End Longitude', 'End Time', 
                                     'Distance [Miles]', 'Duration']])
    df = df[df["Distance [Miles]"] < 100.0].copy()
    return df

In [18]:
def drop_outliers(df):
    global outliers_latlon
    global outliers_duration
    global outliers_distance
    
    rows = df.shape[0]
    df = drop_outliers_latlon(df)
    rows_latlon = df.shape[0]
    outliers_latlon += rows - rows_latlon
    
    df = drop_outliers_duration(df)
    rows_duration = df.shape[0]
    outliers_duration += rows_latlon - rows_duration
    
    df = drop_outliers_distance(df)
    rows_distance = df.shape[0]
    outliers_distance += rows_duration - rows_distance
    
    return df

### <span style='color:#4095b5'>Pull all of the cleaning together.</span>


In [19]:
def clean_df(df, df_hubs=None):
    global trivial_duration
    global trivial_distance
    global outliers_latlon
    global outliers_duration
    global outliers_distance

    df = drop_columns(df)
    df = rename_columns(df)
    df = calc_latlong(df, df_hubs)
    df = drop_nans(df)
    df = clean_datatypes(df)
    df = calc_distances(df)
    df = reorder_cols(df)
    df = drop_trivial_trips(df)
    df = drop_outliers(df)

    # Running totals of rows dropped across all months cleaned so far
    print("Trivial dur: {} dist: {}".format(trivial_duration,
                                            trivial_distance))
    print("Outlier loc: {} dur: {} dist: {}".format(outliers_latlon,
                                                     outliers_duration,
                                                     outliers_distance))
    return df

<hr>
<a name="import"></a>
## <span style='color:#3b748a'> III. Import all data from Boston.</span>
<span style='color:#4095b5'>Boston monthly data is too large to upload to GitHub.</span>

In [20]:
# Boston data is by month from Jan 2015 on
# For now, just load one year

trip_data = ['201709',
             '201710', '201711', '201712',
             '201801', '201802', '201803',
             '201804', '201805', '201806',
             '201807', '201808']

In [21]:
# Dictionary of DataFrames, one for each month
df_data = dict()
for d in trip_data:
    df_data[d] = pd.read_csv("../data/bos/"+str(d)+"-bluebikes-tripdata.csv")
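
<span style='color:#52aec9'>Since the monthly files are not in GitHub, it can help to check that they are present before loading. The sketch below builds the same file names as the loop above and lists any that are missing; `monthly_paths` is a hypothetical helper.</span>

```python
from pathlib import Path

DATA_DIR = Path("../data/bos")  # same relative directory the notebook reads from


def monthly_paths(months, data_dir=DATA_DIR):
    """Map each month string (e.g. '201709') to its expected CSV path."""
    return {m: data_dir / f"{m}-bluebikes-tripdata.csv" for m in months}


paths = monthly_paths(["201709", "201710"])
missing = [m for m, p in paths.items() if not p.exists()]
if missing:
    print("Missing monthly files:", missing)
```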

<hr>
<a name="clean"></a>

## <span style='color:#3b748a'>IV. Clean all data from Boston.</span>
<ul>
    <li><span style='color:#4095b5'>Reform the data to match Atlanta data.</span></li>
    <li><span style='color:#4095b5'>Calculate distances.</span></li>
    <li><span style='color:#4095b5'>Drop the trivial trips.</span></li>
    <li><span style='color:#4095b5'>Drop the outliers.</span></li>
    <li><span style='color:#4095b5'>Use appropriate column types.</span></li>
</ul>

In [22]:
# For each month, clean the DataFrame
print("Cleaning the data:")
for d in trip_data:
    print("Month: {} \nRows: {}\t Cols: {}\t NaNs: {}".format(d, 
                                                    df_data[d].shape[0], 
                                                    df_data[d].shape[1], 
                                                    sum(df_data[d].isnull().sum())))
    df_data[d] = clean_df(df_data[d])
    check_data(df_data[d])

Cleaning the data:
Month: 201709 
Rows: 165386	 Cols: 15	 NaNs: 0
Trivial dur: 4901 dist: 0
Outlier loc: 2 dur: 54 dist: 0
Rows: 160429	 Cols: 13	 NaNs: 0
Month: 201710 
Rows: 163662	 Cols: 15	 NaNs: 0
Trivial dur: 10401 dist: 0
Outlier loc: 46 dur: 123 dist: 0
Rows: 158049	 Cols: 13	 NaNs: 0
Month: 201711 
Rows: 105463	 Cols: 15	 NaNs: 0
Trivial dur: 15047 dist: 0
Outlier loc: 46 dur: 151 dist: 0
Rows: 100789	 Cols: 13	 NaNs: 0
Month: 201712 
Rows: 55072	 Cols: 15	 NaNs: 0
Trivial dur: 17262 dist: 0
Outlier loc: 46 dur: 169 dist: 0
Rows: 52839	 Cols: 13	 NaNs: 0
Month: 201801 
Rows: 40932	 Cols: 15	 NaNs: 1539
Trivial dur: 18806 dist: 0
Outlier loc: 46 dur: 184 dist: 0
Rows: 39373	 Cols: 13	 NaNs: 0
Month: 201802 
Rows: 62817	 Cols: 15	 NaNs: 3497
Trivial dur: 21258 dist: 0
Outlier loc: 47 dur: 212 dist: 0
Rows: 60336	 Cols: 13	 NaNs: 0
Month: 201803 
Rows: 62986	 Cols: 15	 NaNs: 4556
Trivial dur: 23708 dist: 0
Outlier loc: 47 dur: 233 dist: 0
Rows: 60515	 Cols: 13	 NaNs: 0
Month: 201

<hr>
<a name="merge"></a>

## <span style='color:#3b748a'> V. Merge the DataFrames into 1 big DataFrame</span>

In [23]:
n_rows = 0
frames = []
for d in trip_data:
    n_rows += df_data[d].shape[0]
    frames.append(df_data[d])

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
df = pd.concat(frames)

if n_rows != df.shape[0]:
    print("There is a problem with the DataFrame merge!")

<hr>
<a name="explore"></a>

## <span style='color:#3b748a'> VI. Explore the data.</span>

In [24]:
df.head()    

| | Start Hub | Start Latitude | Start Longitude | Start Date | Start Time | End Hub | End Latitude | End Longitude | End Date | End Time | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | University Park | 42.362648 | -71.100061 | 2017-09-01 | 2017-09-01 00:00:56 | MIT Vassar St | 42.355601 | -71.103945 | 2017-09-01 | 2017-09-01 00:13:36 | 1572 | 0.525436 | 00:12:39 |
| 1 | Watermark Seaport - Boston Wharf Rd at Seaport... | 42.351586 | -71.045693 | 2017-09-01 | 2017-09-01 00:01:08 | Cambridge St at Joy St | 42.361304 | -71.06521 | 2017-09-01 | 2017-09-01 00:07:20 | 1 | 1.203407 | 00:06:12 |
| 2 | MIT Stata Center at Vassar St / Main St | 42.362131 | -71.091156 | 2017-09-01 | 2017-09-01 00:03:07 | MIT Vassar St | 42.355601 | -71.103945 | 2017-09-01 | 2017-09-01 00:07:40 | 995 | 0.794817 | 00:04:32 |
| 3 | University Park | 42.362648 | -71.100061 | 2017-09-01 | 2017-09-01 00:04:07 | MIT Vassar St | 42.355601 | -71.103945 | 2017-09-01 | 2017-09-01 00:12:42 | 635 | 0.525436 | 00:08:35 |
| 4 | Longwood Ave at Binney St | 42.338629 | -71.1065 | 2017-09-01 | 2017-09-01 00:06:00 | Coolidge Corner - Beacon St @ Centre St | 42.341598 | -71.123338 | 2017-09-01 | 2017-09-01 00:15:16 | 1862 | 0.886225 | 00:09:16 |


In [25]:
df.shape

(1568358, 13)

In [26]:
df.columns

Index(['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration'],
      dtype='object')

In [27]:
df.describe()

| | Start Latitude | Start Longitude | End Latitude | End Longitude | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|
| count | 1568358.0 | 1568358.0 | 1568358.0 | 1568358.0 | 1568358.0 | 1568358.0 | 1568358 |
| mean | 42.35767 | -71.08617 | 42.35767 | -71.08587 | 1668.7 | 1.222374 | 0 days 00:17:52.370378 |
| std | 0.01445596 | 0.02493741 | 0.01440004 | 0.02493921 | 1021.028 | 0.8056844 | 0 days 00:35:44.760398 |
| min | 42.2679 | -71.16649 | 42.2679 | -71.16649 | 1.0 | 0.0 | 0 days 00:03:00 |
| 25% | 42.34876 | -71.10394 | 42.34876 | -71.10394 | 797.0 | 0.6441472 | 0 days 00:07:08 |
| 50% | 42.3581 | -71.08809 | 42.3581 | -71.08799 | 1650.0 | 1.026823 | 0 days 00:11:45 |
| 75% | 42.36567 | -71.06526 | 42.36567 | -71.06461 | 2517.0 | 1.633256 | 0 days 00:19:25 |
| max | 42.4063 | -71.0061 | 42.4063 | -71.0061 | 4219.0 | 7.039046 | 0 days 23:58:30 |


In [28]:
if False:
    check_data(df)
    check_cols(df)

In [29]:
# Check dates (Sep 2017 - Aug 2018 has 365 days)
print("Min start date: {}".format(df['Start Date'].min()))
print("Min end date: {}".format(df['End Date'].min()))
print("Max start date: {}".format(df['Start Date'].max()))
print("Max end date: {}".format(df['End Date'].max()))
print("Number of days: {}".format(len(set(df['Start Date']))))

Min start date: 2017-09-01
Min end date: 2017-09-01
Max start date: 2018-08-31
Max end date: 2018-09-01
Number of days: 365
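
<span style='color:#52aec9'>The range 2017-09-01 to 2018-08-31 spans 365 calendar days, so a count of 365 distinct start dates means every day has at least one rental that survived cleaning. A sketch for listing any calendar days with no rentals is below; `missing_dates` is a hypothetical helper and the sample dates are made up.</span>

```python
from datetime import date, timedelta


def missing_dates(seen, first, last):
    """Return the sorted calendar days in [first, last] that do not appear in seen."""
    n_days = (last - first).days + 1
    expected = {first + timedelta(days=i) for i in range(n_days)}
    return sorted(expected - set(seen))


# Toy input: rentals on 12 and 14 March, none on 13 March
seen = [date(2018, 3, 12), date(2018, 3, 14)]
gaps = missing_dates(seen, date(2018, 3, 12), date(2018, 3, 14))
print(gaps)
```

<span style='color:#52aec9'>With the full data, the call would be `missing_dates(df['Start Date'], date(2017, 9, 1), date(2018, 8, 31))`, since the `Start Date` column holds `datetime.date` values.</span>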


#### <span style='color:#4095b5'>Fewest rentals</span>
<ul>
    <li><span style='color:#4095b5'>13 Mar 2018 - NO rentals; BLIZZARD</span></li>
</ul>
    
#### <span style='color:#4095b5'>Most rentals</span>
<ul>
    <li><span style='color:#4095b5'>31 Jul 2018 - Not clear</span></li>
</ul>

#### <span style='color:#4095b5'>Outliers on Upper side of Total or Avg Duration per day of week</span>
<ul>
    <li><span style='color:#4095b5'>28 May 2018 - Memorial Day</span></li>
    <li><span style='color:#4095b5'>4 Jul 2018</span></li>
</ul>

#### <span style='color:#4095b5'>Outliers on Lower side of Total or Avg Duration per day of week</span>
<ul>
    <li><span style='color:#4095b5'>13 Mar 2018 - NO rentals; BLIZZARD</span></li>
</ul>


<hr>
<a name="write"></a>

## <span style='color:#3b748a'>VII. Write the full DataFrame to a csv file.</span>

In [30]:
df.to_csv('../data/bos/trips_all.csv', index=False)
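
<span style='color:#52aec9'>CSV stores everything as text, so the datetime and timedelta columns need to be converted again when the file is read back. The sketch below shows the round trip on a one-row toy frame using an in-memory buffer instead of the real file; the row is illustrative.</span>

```python
import io

import pandas as pd

# One toy row with the same column types as the cleaned data
toy = pd.DataFrame({
    "Start Time": pd.to_datetime(["2017-09-01 00:00:56"]),
    "Duration": pd.to_timedelta([759], unit="s"),
})

# Write to an in-memory buffer in place of '../data/bos/trips_all.csv'
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the types that CSV cannot carry
restored = pd.read_csv(buffer, parse_dates=["Start Time"])
restored["Duration"] = pd.to_timedelta(restored["Duration"])
print(restored.dtypes)
```

<span style='color:#52aec9'>The same two steps, `parse_dates` for the time columns and `pd.to_timedelta` for `Duration`, would apply when a downstream notebook loads the written file.</span>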