<img src="../images/bikes_banner.jpg" width="1000" />

## <span style="color:#37535e">Bicycle Share Usage</span>

##  <span style='color:#3b748a'>Cleaning Los Angeles Metro Bike Share data</span>

<span style='color:#4095b5'>This notebook loads and cleans 12 months (July 2017 - June 2018) of trip data from the Los Angeles Metro Bike Share program. Data is available going back to 2016, but only these four quarters are loaded here.</span>

<span style='color:#4095b5'>Each row (observation) describes one bike rental. Each rental includes a starting place and time, an ending place and time, as well as duration, user, and bike information.</span>

## <span style='color:#3b748a'>Table of contents</span>
* <span style='color:#4095b5'>I.  <a href="#checking"><span style='color:#4095b5'>Data checking functions.</span></a></span>
* <span style='color:#4095b5'>II. <a href="#cleaning"><span style='color:#4095b5'>Data cleaning functions.</span></a></span>
* <span style='color:#4095b5'>III. <a href="#import"><span style='color:#4095b5'>Import all data.</span></a></span>
* <span style='color:#4095b5'>IV. <a href="#clean"><span style='color:#4095b5'>Clean all data.</span></a></span>
* <span style='color:#4095b5'>V. <a href="#merge"><span style='color:#4095b5'>Merge the DataFrames into one big DataFrame.</span></a></span>
* <span style='color:#4095b5'>VI. <a href="#explore"><span style='color:#4095b5'>Explore the data.</span></a></span>
* <span style='color:#4095b5'>VII. <a href="#write"><span style='color:#4095b5'>Write the full dataframe to a csv file.</span></a></span>

## <span style='color:#3b748a'>External data required</span>
<ul>
    <li><span style='color:#4095b5'>../data/la/metro-bike-share-trips-20XX-qX.csv for each quarter in (2017-q3 to 2018-q2); available in GitHub</span></li>
</ul>

## <span style='color:#3b748a'>Links</span>
* <a href="https://bikeshare.metro.net/about/data/"><span style='color:#4095b5'>Los Angeles Metro Bike Share data</span></a>
* <a href="plot_la.ipynb"><span style='color:#4095b5'>Plotting Los Angeles data.</span></a>
<hr>

In [1]:
# Let's get the administrative stuff done first
# import all the libraries and set up the plotting

import pandas as pd
import numpy as np
from datetime import datetime,timedelta
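# geopy 2.0 removed vincenty; fall back to geodesic, its replacement
try:
    from geopy.distance import vincenty
except ImportError:
    from geopy.distance import geodesic as vincenty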

# Global counters for the number of rows dropped at each cleaning step
trivial_duration = 0
trivial_distance = 0
outliers_latlon = 0
outliers_duration = 0
outliers_distance = 0

# GnBu_d
colors = ['#37535e', '#3b748a', '#4095b5', '#52aec9', '#72bfc4', '#93d0bf']

<hr>
<a name="checking"> </a>
## <span style='color:#3b748a'>I. Data checking functions</span>

In [2]:
# For each object (non-numeric) column, print its dtype and unique values;
# for each numeric column, print its dtype and range. Report NaN counts for both.

def check_cols(df):
    cols = df.select_dtypes([object]).columns
    for col in cols:
        print("{} is {} and values are {}.".format(col,df[col].dtype,df[col].unique()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
            
    cols = df.select_dtypes([np.int64,np.float64,np.uint64]).columns
    for col in cols:
        print("{} is {} and values are {} to {}.".format(col,df[col].dtype,df[col].min(),df[col].max()))
        n_nan = df[col].isnull().sum()
        if n_nan > 0:
            print("{} has {} NaNs".format(col,n_nan))
    return

In [3]:
# Print the shape, the total NaN count, and the names of columns containing NaNs

def check_data(df):
    s = df.shape

    # Check for null values
    null_data = df.isnull().sum()
    null_data_count = sum(null_data)
    print("Rows: {}\t Cols: {}\t NaNs: {}".format(s[0],s[1],null_data_count))
    if  null_data_count > 0:
        print("Columns with NaN: {}".format(list(null_data[null_data > 0].index)))

    return

<hr>
<a name="cleaning"></a>
## <span style='color:#3b748a'> II. Data cleaning functions</span>

<span style='color:#4095b5'>These functions clean the trip data.</span>

### <span style='color:#4095b5'>Drop columns *NOT* in Atlanta data.</span>
<span style='color:#52aec9'>I might want to add some back at some point.</span>

In [4]:
def drop_columns(df):
    cols_drop = ['trip_id', 'plan_duration', 'trip_route_category', 'passholder_type']

    # Can't drop a column that isn't there
    cols_drop = list(set(df.columns) & set(cols_drop))
    df.drop(cols_drop, axis=1, inplace=True)

    return df

### <span style='color:#4095b5'>Rename columns to match Atlanta data names.</span>

In [5]:
def rename_columns(df):
    df.rename(columns={'start_station' : 'Start Hub', 
                       'start_lat' : 'Start Latitude',
                       'start_lon' : 'Start Longitude',
                       'start_time' : 'Start Time', 
                       'end_station' : 'End Hub', 
                       'end_lat' :'End Latitude', 
                       'end_lon' : 'End Longitude', 
                       'end_time' : 'End Time', 
                       'bike_id' :'Bike Name',
                       'duration' : 'Duration'
                      }, inplace=True)
    return df

### <span style='color:#4095b5'>Merge with hub data.</span>
<span style='color:#52aec9'>We may have to use the start/end hubs to get start/end lat/long.</span>

In [6]:
def calc_latlong(df, df_hubs):
    return df
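
<span style='color:#52aec9'>The hub lookup described above could be sketched as follows. This is only an illustration under stated assumptions: `fill_latlong_from_hubs` is a hypothetical helper, and the hub table schema (`Hub`, `Latitude`, `Longitude`) is assumed rather than taken from the Metro Bike Share files. The two hub locations are the ones that appear in the sample rows in section VI.</span>

```python
import pandas as pd

def fill_latlong_from_hubs(df, df_hubs):
    """Fill missing trip coordinates from a hub table (hypothetical schema)."""
    if df_hubs is None:
        return df
    hubs = df_hubs.set_index('Hub')
    for side in ('Start', 'End'):
        for coord in ('Latitude', 'Longitude'):
            col = '{} {}'.format(side, coord)
            # Look up each trip's hub, then fill only the missing values
            looked_up = df['{} Hub'.format(side)].map(hubs[coord])
            df[col] = df[col].fillna(looked_up)
    return df

trips = pd.DataFrame({
    'Start Hub': [3018, 3055],
    'Start Latitude': [34.043732, None],
    'Start Longitude': [-118.260139, None],
    'End Hub': [3055, 3018],
    'End Latitude': [None, 34.043732],
    'End Longitude': [None, -118.260139],
})
hubs = pd.DataFrame({
    'Hub': [3018, 3055],
    'Latitude': [34.043732, 34.044159],
    'Longitude': [-118.260139, -118.251579],
})
filled = fill_latlong_from_hubs(trips, hubs)
```

<span style='color:#52aec9'>Coordinates that are already present are left untouched; only NaN entries are replaced.</span>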

### <span style='color:#4095b5'>Drop rows with nulls.</span>
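<span style='color:#52aec9'>Hub 3000 is a virtual hub with no lat/long information, so trips that start or end there are first assigned a placeholder location in downtown Los Angeles (34.0522, -118.2437). Any rows still missing coordinates are then dropped.</span>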

In [7]:
def drop_nans(df):
    # Hub 3000 is virtual: assign it a placeholder downtown Los Angeles location
    latitude_3000 = 34.0522
    longitude_3000 = -118.2437
    for side in ('Start', 'End'):
        is_virtual = df[side + ' Hub'] == 3000
        df.loc[is_virtual, side + ' Latitude'] = latitude_3000
        df.loc[is_virtual, side + ' Longitude'] = longitude_3000

    df.dropna(subset=['Start Latitude', 'Start Longitude', 
                      'End Latitude', 'End Longitude'], 
              inplace=True)
    return df

### <span style='color:#4095b5'>Use appropriate datatypes.</span>
<span style='color:#52aec9'>For example, fix Date/Time objects and cast Latitude and Longitude to floats.</span>

In [8]:
def clean_datatypes(df):
    df['Start Latitude'] = df['Start Latitude'].astype(float)
    df['Start Longitude'] = df['Start Longitude'].astype(float)
    df['End Latitude'] = df['End Latitude'].astype(float)
    df['End Longitude'] = df['End Longitude'].astype(float)

    # Convert times to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['End Time'] = pd.to_datetime(df['End Time'])

    # Derive date-only columns from the datetimes
    df['Start Date'] = df['Start Time'].dt.date
    df['End Date'] = df['End Time'].dt.date

    # Durations are recorded in minutes; convert them to timedeltas
    df['Duration'] = pd.to_timedelta(df['Duration'], unit='m')
    
    return df
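
<span style='color:#52aec9'>One way to confirm that `unit='m'` is the right choice is to compare the converted duration with the elapsed time between start and end. The sketch below uses the first sample trip shown in section VI (00:09 to 00:45, reported as 00:36:00); the raw value of 36 is inferred from that output, and `unit_is_consistent` is just an illustrative name.</span>

```python
import pandas as pd

# First sample trip from this data set: 00:09 to 00:45, raw Duration of 36
trip = pd.DataFrame({
    'Start Time': ['2017-07-01 00:09:00'],
    'End Time': ['2017-07-01 00:45:00'],
    'Duration': [36],
})
trip['Start Time'] = pd.to_datetime(trip['Start Time'])
trip['End Time'] = pd.to_datetime(trip['End Time'])
trip['Duration'] = pd.to_timedelta(trip['Duration'], unit='m')

# If the unit is right, Duration equals the elapsed time
elapsed = trip['End Time'] - trip['Start Time']
unit_is_consistent = bool((elapsed == trip['Duration']).all())
```

<span style='color:#52aec9'>Not every row needs to match exactly (other sample rows differ by a minute), so on the full data this is better used as a rough check than as a strict test.</span>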

### <span style='color:#4095b5'>Calculate distances.</span>
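<span style='color:#52aec9'>This is a poor approximation: it is the point-to-point distance between the start and end hubs, not the distance actually ridden. If a bike was checked out and returned to the same station, the distance will be trivial. Also, because hub 3000 doesn't have a true location, distances for its trips are not accurate.</span>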

In [9]:
def distance_calc(row):
    start = (row['Start Latitude'], row['Start Longitude'])
    stop = (row['End Latitude'], row['End Longitude'])

    return vincenty(start, stop).miles

In [10]:
def calc_distances(df):
    df['Distance [Miles]'] = df.apply(distance_calc, axis=1)
    return df
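
<span style='color:#52aec9'>As a dependency-free cross-check on the computed distances, a great-circle (haversine) distance can be sketched with the standard library alone. This is a swapped-in spherical approximation, not the ellipsoidal calculation used above; `haversine_miles` and `EARTH_RADIUS_MILES` are names introduced here for illustration. The two hub locations are taken from the sample rows in section VI.</span>

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of the spherical model

def haversine_miles(start, stop):
    """Great-circle distance in miles between two (latitude, longitude) pairs."""
    lat1, lon1 = map(radians, start)
    lat2, lon2 = map(radians, stop)
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

hub_3055 = (34.044159, -118.251579)
hub_3082 = (34.04652, -118.237411)
approx = haversine_miles(hub_3055, hub_3082)
```

<span style='color:#52aec9'>For this pair of hubs the notebook reports 0.829014 miles, and the spherical approximation should agree to within about one percent.</span>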

### <span style='color:#4095b5'>Reorder columns.</span>
<span style='color:#52aec9'>Make order same as Atlanta data.</span>

In [11]:
def reorder_cols(df):
    columns = ['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration']

    df = df.reindex(columns=columns)
    return df

### <span style='color:#4095b5'>Drop trivial trips.</span>
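<span style='color:#52aec9'>Trivial trips last less than 3 minutes. We do not drop trips for trivial distance, because the distance is computed from hub locations rather than measured, so a round trip to the same hub has a distance of zero.</span>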

In [12]:
def drop_trivial_trips_distance(df):
    df = df[df["Distance [Miles]"] > 0.02].copy()
    return df

In [13]:
def drop_trivial_trips_duration(df):
    df = df[df["Duration"] >= pd.to_timedelta('00:03:00')].copy()
    return df

In [14]:
def drop_trivial_trips(df):
    global trivial_duration
    global trivial_distance

    rows = df.shape[0]
    df = drop_trivial_trips_duration(df)
    rows_duration = df.shape[0]
    trivial_duration += rows-rows_duration

    # Distance is computed from hub locations, so the distance filter stays disabled
    # df = drop_trivial_trips_distance(df)
    rows_distance = df.shape[0]
    trivial_distance += rows_duration-rows_distance
    return df
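
<span style='color:#52aec9'>A small sketch of how the 3-minute threshold behaves, using made-up durations: because the comparison is `>=`, a trip of exactly 3 minutes is kept. The names `threshold`, `kept`, and `n_dropped` are illustrative only.</span>

```python
import pandas as pd

# Made-up durations of 1, 3 and 12 minutes
trips = pd.DataFrame({'Duration': pd.to_timedelta([1, 3, 12], unit='m')})

threshold = pd.to_timedelta('00:03:00')
kept = trips[trips['Duration'] >= threshold]
n_dropped = len(trips) - len(kept)
```

<span style='color:#52aec9'>This is consistent with the minimum duration of 00:03:00 in the summary statistics in section VI.</span>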

### <span style='color:#4095b5'>Drop outliers.</span>
<ul>
    <li><span style='color:#52aec9'>Only use trips near Los Angeles.</span></li> 
    <li><span style='color:#52aec9'>Don't keep trips 24 hours or longer.</span></li> 
    <li><span style='color:#52aec9'>Don't keep trips of 100 miles or longer.</span></li> 
</ul>

In [15]:
def drop_outliers_latlon(df):
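    # The location filter is currently disabled, so no rows are dropped here.
    # The commented-out bounds below look like Atlanta's and would need
    # Los Angeles values before being re-enabled.
#     df = df[df["Start Latitude"] < 33.9].copy()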
#     df = df[df["End Latitude"] < 33.9].copy()
#     df = df[df["Start Latitude"] > 33.5].copy()
#     df = df[df["End Latitude"] > 33.5].copy()

#     df = df[df["Start Longitude"] < -83.0].copy()
#     df = df[df["End Longitude"] < -83.0].copy()

    return df

In [16]:
def drop_outliers_duration(df):
    df = df[df["Duration"] < pd.to_timedelta('24:00:00')].copy()
    return df

In [17]:
def drop_outliers_distance(df):
    df_temp = df[df["Distance [Miles]"] >= 100.0]
    if df_temp.shape[0]:
        print("Long trip: ", df_temp[['Start Latitude','Start Longitude', 'Start Time', 
                                     'End Latitude', 'End Longitude', 'End Time', 
                                     'Distance [Miles]', 'Duration']])
    df = df[df["Distance [Miles]"] < 100.0].copy()
    return df

In [18]:
def drop_outliers(df):
    global outliers_latlon
    global outliers_duration
    global outliers_distance
    
    rows = df.shape[0]
    df = drop_outliers_latlon(df)
    rows_latlon = df.shape[0]
    outliers_latlon += rows - rows_latlon
    
    df = drop_outliers_duration(df)
    rows_duration = df.shape[0]
    outliers_duration += rows_latlon - rows_duration
    
    df = drop_outliers_distance(df)
    rows_distance = df.shape[0]
    outliers_distance += rows_duration - rows_distance
    
    return df
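
<span style='color:#52aec9'>The global counters above accumulate across every call. If per-call counts are wanted instead, one possible alternative is to return them alongside the DataFrame. This is a sketch, not the approach used in this notebook: `apply_filters` is a hypothetical helper, and the rows are made up for illustration.</span>

```python
import pandas as pd

def apply_filters(df, filters):
    """Apply (name, keep_mask_function) pairs in order and count dropped rows."""
    dropped = {}
    for name, keep in filters:
        before = len(df)
        df = df[keep(df)].copy()
        dropped[name] = before - len(df)
    return df, dropped

# Made-up rows: a normal trip, a 30-hour trip, and a 150-mile trip
trips = pd.DataFrame({
    'Duration': pd.to_timedelta([10, 30 * 60, 45], unit='m'),
    'Distance [Miles]': [0.8, 1.2, 150.0],
})
filters = [
    ('duration', lambda d: d['Duration'] < pd.to_timedelta('24:00:00')),
    ('distance', lambda d: d['Distance [Miles]'] < 100.0),
]
cleaned, dropped = apply_filters(trips, filters)
```

<span style='color:#52aec9'>Each entry in `dropped` counts only the rows removed by that step, in the same order the filters were applied.</span>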

### <span style='color:#4095b5'>Pull all of the cleaning together.</span>


In [19]:
def clean_df(df, df_hubs=None):
    global trivial_duration
    global trivial_distance
    global outliers_latlon
    global outliers_duration
    global outliers_distance

    df = drop_columns(df)
    df = rename_columns(df)
    df = calc_latlong(df, df_hubs)
    df = drop_nans(df)
    df = clean_datatypes(df)
    df = calc_distances(df)
    df = reorder_cols(df)
    df = drop_trivial_trips(df)
    df = drop_outliers(df)

    # Rows dropped so far (the counters are global, so they accumulate across calls)
    print("Trivial dur: {} dist: {}".format(trivial_duration,
                                            trivial_distance))
    print("Outlier loc: {} dur: {} dist: {}".format(outliers_latlon,
                                                     outliers_duration,
                                                     outliers_distance))
    return df

<hr>
<a name="import"></a>
## <span style='color:#3b748a'> III. Import all data from Los Angeles.</span>


In [20]:
# Los Angeles data is quarterly
# For now, just load one year
# metro-bike-share-trips-2018-q2.csv
trip_data = ['2017-q3', '2017-q4',
             '2018-q1', '2018-q2']

In [21]:
# Dictionary of DataFrames, one for each quarter
df_data = dict()
for d in trip_data:
    df_data[d] = pd.read_csv("../data/la/metro-bike-share-trips-"+str(d)+".csv")

<hr>
<a name="clean"></a>

## <span style='color:#3b748a'>IV. Clean all data from Los Angeles.</span>
<ul>
    <li><span style='color:#4095b5'>Reformat the data to match Atlanta data.</span></li>
    <li><span style='color:#4095b5'>Drop the trivial trips.</span></li>
    <li><span style='color:#4095b5'>Drop the outliers.</span></li>
    <li><span style='color:#4095b5'>Use appropriate column types.</span></li>
</ul>

In [22]:
# For each quarter, clean the DataFrame
print("Cleaning the data:")
for d in trip_data:
    print("Month: {} \nRows: {}\t Cols: {}\t NaNs: {}".format(d, 
                                                    df_data[d].shape[0], 
                                                    df_data[d].shape[1], 
                                                    sum(df_data[d].isnull().sum())))
    df_data[d] = clean_df(df_data[d])
    check_data(df_data[d])


Cleaning the data:
Month: 2017-q3 
Rows: 72337	 Cols: 14	 NaNs: 1598
Trivial dur: 2283 dist: 0
Outlier loc: 0 dur: 149 dist: 0
Rows: 69905	 Cols: 13	 NaNs: 0
Month: 2017-q4 
Rows: 71214	 Cols: 14	 NaNs: 3750
Trivial dur: 4294 dist: 0
Outlier loc: 0 dur: 361 dist: 0
Rows: 68991	 Cols: 13	 NaNs: 0
Month: 2018-q1 
Rows: 65387	 Cols: 14	 NaNs: 2226
Trivial dur: 6064 dist: 0
Outlier loc: 0 dur: 568 dist: 0
Rows: 63410	 Cols: 13	 NaNs: 0
Month: 2018-q2 
Rows: 77357	 Cols: 14	 NaNs: 2486
Trivial dur: 8151 dist: 0
Outlier loc: 0 dur: 847 dist: 0
Rows: 74991	 Cols: 13	 NaNs: 0


<hr>
<a name="merge"></a>

## <span style='color:#3b748a'> V. Merge the DataFrames into 1 big DataFrame</span>

In [23]:
# Concatenate the quarterly DataFrames and count the rows we expect to see
n_rows = sum(df_data[d].shape[0] for d in trip_data)
df = pd.concat([df_data[d] for d in trip_data])

if n_rows != df.shape[0]:
    print("There is a problem with the DataFrame merge!")
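
<span style='color:#52aec9'>If it is useful to remember which quarter each row came from, the merge can also label the rows. The sketch below is an optional variation, not what this notebook does: the `Quarter` and `Row` level names are assumptions, and the frames are made-up stand-ins for the cleaned quarterly DataFrames.</span>

```python
import pandas as pd

# Made-up frames standing in for the cleaned quarterly DataFrames
frames = {
    '2017-q3': pd.DataFrame({'Bike Name': [5996, 5777]}),
    '2017-q4': pd.DataFrame({'Bike Name': [6342]}),
}

# Passing a dict uses its keys as an outer index level
merged = pd.concat(frames, names=['Quarter', 'Row'])

# Move the quarter label out of the index and into a regular column
merged = merged.reset_index(level='Quarter')

# Same sanity check as above: no rows gained or lost
n_expected = sum(len(f) for f in frames.values())
rows_match = len(merged) == n_expected
```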

<hr>
<a name="explore"></a>

## <span style='color:#3b748a'> VI. Explore the data.</span>

In [24]:
df.head()    

|   | Start Hub | Start Latitude | Start Longitude | Start Date | Start Time | End Hub | End Latitude | End Longitude | End Date | End Time | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3018 | 34.043732 | -118.260139 | 2017-07-01 | 2017-07-01 00:09:00 | 3018 | 34.043732 | -118.260139 | 2017-07-01 | 2017-07-01 00:45:00 | 5996 | 0.0 | 00:36:00 |
| 1 | 3055 | 34.044159 | -118.251579 | 2017-07-01 | 2017-07-01 00:10:00 | 3082 | 34.04652 | -118.237411 | 2017-07-01 | 2017-07-01 00:23:00 | 5777 | 0.829014 | 00:13:00 |
| 2 | 3018 | 34.043732 | -118.260139 | 2017-07-01 | 2017-07-01 00:11:00 | 3018 | 34.043732 | -118.260139 | 2017-07-01 | 2017-07-01 00:45:00 | 6342 | 0.0 | 00:34:00 |
| 3 | 3018 | 34.043732 | -118.260139 | 2017-07-01 | 2017-07-01 00:11:00 | 3018 | 34.043732 | -118.260139 | 2017-07-01 | 2017-07-01 00:45:00 | 6478 | 0.0 | 00:34:00 |
| 4 | 3055 | 34.044159 | -118.251579 | 2017-07-01 | 2017-07-01 00:11:00 | 3082 | 34.04652 | -118.237411 | 2017-07-01 | 2017-07-01 00:23:00 | 6411 | 0.829014 | 00:12:00 |


In [25]:
df.shape

(277297, 13)

In [26]:
df.columns

Index(['Start Hub', 'Start Latitude', 'Start Longitude', 'Start Date',
       'Start Time', 'End Hub', 'End Latitude', 'End Longitude', 'End Date',
       'End Time', 'Bike Name', 'Distance [Miles]', 'Duration'],
      dtype='object')

In [27]:
df.describe()

|   | Start Hub | Start Latitude | Start Longitude | End Hub | End Latitude | End Longitude | Bike Name | Distance [Miles] | Duration |
|---|---|---|---|---|---|---|---|---|---|
| count | 277297.0 | 277297.0 | 277297.0 | 277297.0 | 277297.0 | 277297.0 | 277297.0 | 277297.0 | 277297 |
| mean | 3421.994645 | 34.045874 | -118.271888 | 3416.667789 | 34.045822 | -118.271845 | 7905.866602 | 0.776316 | 0 days 00:31:43.726906 |
| std | 539.001449 | 0.064987 | 0.097883 | 537.971768 | 0.064107 | 0.097998 | 2720.007958 | 1.082006 | 0 days 01:28:03.058314 |
| min | 3000.0 | 33.710979 | -118.491341 | 3000.0 | 33.710979 | -118.491341 | 4727.0 | 0.0 | 0 days 00:03:00 |
| 25% | 3031.0 | 34.039982 | -118.262733 | 3031.0 | 34.039982 | -118.262733 | 6053.0 | 0.3388 | 0 days 00:07:00 |
| 50% | 3064.0 | 34.046822 | -118.252441 | 3063.0 | 34.04681 | -118.252441 | 6403.0 | 0.62146 | 0 days 00:12:00 |
| 75% | 4157.0 | 34.0532 | -118.237411 | 4157.0 | 34.051941 | -118.237411 | 12021.0 | 0.989826 | 0 days 00:26:00 |
| max | 4254.0 | 34.165291 | -118.11653 | 4254.0 | 34.165291 | -118.11653 | 12456.0 | 24.545519 | 0 days 23:59:00 |


In [28]:
if False:
    check_data(df)
    check_cols(df)

In [29]:
# Check dates (Jul 2017 - June 2018)
print("Min start date: {}".format(df['Start Date'].min()))
print("Min end date: {}".format(df['End Date'].min()))
print("Max start date: {}".format(df['Start Date'].max()))
print("Max end date: {}".format(df['End Date'].max()))
print("Number of days: {}".format(len(set(df['Start Date']))))

Min start date: 2017-07-01
Min end date: 2017-07-01
Max start date: 2018-06-30
Max end date: 2018-07-01
Number of days: 365
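
<span style='color:#4095b5'>The count of 365 matches the number of calendar days from 1 July 2017 to 30 June 2018. If the count ever came up short, the missing days could be listed with a sketch like the one below. `missing_days` is a hypothetical helper; it assumes the dates are `datetime.date` objects, which is what `.dt.date` produces for the `Start Date` column.</span>

```python
from datetime import date, timedelta

def missing_days(dates, first, last):
    """Return the calendar days in [first, last] that never appear in dates."""
    seen = set(dates)
    n_days = (last - first).days + 1
    all_days = (first + timedelta(days=i) for i in range(n_days))
    return [day for day in all_days if day not in seen]

# Number of calendar days the 12-month window should contain
first, last = date(2017, 7, 1), date(2018, 6, 30)
n_expected = (last - first).days + 1

# Made-up example: a three-day window with the middle day absent
gaps = missing_days([date(2017, 7, 1), date(2017, 7, 3)],
                    date(2017, 7, 1), date(2017, 7, 3))
```

<span style='color:#4095b5'>On the real data, the call would be `missing_days(df['Start Date'], first, last)`, which should return an empty list when every day has at least one rental.</span>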


#### <span style='color:#4095b5'>Fewest rentals</span>
<ul>
    <li><span style='color:#4095b5'>22 Mar 2018 - Rain</span></li>
</ul>

#### <span style='color:#4095b5'>Most rentals</span>
<ul>
    <li><span style='color:#4095b5'>8 Oct 2017 - CicLAvia: Heart of LA "Open streets"</span></li>
</ul>

#### <span style='color:#4095b5'>Outliers on the upper side of total rentals per day of week</span>
<ul>
    <li><span style='color:#4095b5'>8 Oct 2017 - CicLAvia: Heart of LA "Open streets"</span></li>
    <li><span style='color:#4095b5'>24 Dec 2017</span></li>
</ul>

#### <span style='color:#4095b5'>Outliers on the lower side of total rentals per day of week</span>
<ul>
    <li><span style='color:#4095b5'>8 Jan 2018 - Rain</span></li>
    <li><span style='color:#4095b5'>9 Jan 2018 - Rain</span></li>
    <li><span style='color:#4095b5'>2 Mar 2018 - Rain</span></li>
    <li><span style='color:#4095b5'>10 Mar 2018 - Rain</span></li>
    <li><span style='color:#4095b5'>22 Mar 2018 - Rain</span></li>
</ul>

#### <span style='color:#4095b5'>Outliers on the upper side of total or average duration per day of week</span>
<ul>
    <li><span style='color:#4095b5'>8 Oct 2017 - CicLAvia: Heart of LA "Open streets"</span></li>
    <li><span style='color:#4095b5'>25 Dec 2017</span></li>
    <li><span style='color:#4095b5'>1 Jan 2018</span></li>
</ul>

#### <span style='color:#4095b5'>Outliers on the lower side of total or average duration per day of week</span>
<ul>
    <li><span style='color:#4095b5'>8 Jul 2017 - Cause unknown</span></li>
</ul>
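
<span style='color:#4095b5'>The busiest and quietest days listed above can also be found directly from the trip table by counting rentals per start date. The sketch below shows the idea on made-up dates; the names `rentals_per_day`, `busiest_day`, and `quietest_day` are illustrative and do not appear elsewhere in this notebook.</span>

```python
from datetime import date

import pandas as pd

# Made-up start dates, for illustration only
trips = pd.DataFrame({'Start Date': [
    date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 1),
    date(2020, 1, 2),
    date(2020, 1, 3), date(2020, 1, 3),
]})

# One row per trip, so the group size is the number of rentals that day
rentals_per_day = trips.groupby('Start Date').size()
busiest_day = rentals_per_day.idxmax()
quietest_day = rentals_per_day.idxmin()
```

<span style='color:#4095b5'>Applied to the merged DataFrame, the same `groupby('Start Date').size()` call gives the daily totals behind the notes above.</span>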

<hr>
<a name="write"></a>

## <span style='color:#3b748a'>VII. Write the full DataFrame to a csv file.</span>

In [30]:
df.to_csv('../data/la/trips_all.csv', index=False)