*JSC270, Winter 2020 - Assignment 1, Zhiyuan Lu*

# <center>  Assignment 1 - An Analysis of Toronto Bike Share Trip Lengths </center>

### <center> Zhiyuan Lu, February 9 </center>
<center>  Explore the trip lengths of Toronto Bike Share, and how they are affected by variables like user type, season, and weather  </center>


***

## Introduction

One advantage of living in downtown Toronto is the variety of public transportation options available, including bike sharing systems. Although I have not used Toronto Bike Share yet, there is a station right next to my residence building on campus, and another at the intersection of Willcocks Street and St. George Street, which I pass every day. I often see people using bikes at these stations, and it would be interesting to investigate the distribution of trip lengths, as well as how they vary with user type, season, and weather.

For my analysis, I use the Toronto Bikeshare Ridership Data from Open Data Catalog, and weather data from the Historical Climate Data website.

## About the Data
### What we can and can't do with the data from the Historical Climate Data website

We are allowed to use and reproduce the data for non-commercial purposes, provided that we ensure the accuracy of materials reproduced, cite the resources with the complete title and authors (where available), and provide the URL where the original document is available. We are allowed to redistribute the data if we do not charge any fee, acknowledge the source, and make sure that any other party agrees to the same redistribution restrictions.

We are not allowed to reproduce materials on this site, in whole or in part, for the purposes of commercial redistribution without prior written permission from the copyright administrator. We cannot rent, lease, lend, sub-licence or transfer the data product or any of our rights under this agreement to anyone else, except under the stated terms and conditions. In addition, we cannot reproduce the official symbols of the Government of Canada, including the Canada wordmark, the Arms of Canada, and the flag symbol, whether for commercial or non-commercial purposes, without prior written authorization.


## Data Cleaning

### Toronto Bikeshare Ridership Data

The data I use is the 2016 Quarter 3, 2016 Quarter 4, and 2017 Toronto Bikeshare Ridership Data from Open Data Catalog. While combining the datasets, I noticed the following problems and limitations:
- Inconsistent datetime format between datasets.
- Only station names, without station IDs, are found in the 2016 Q3, 2016 Q4, 2017 Q3, and 2017 Q4 data.
- Inconsistent station names (e.g. abbreviation, space between words).
- No geospatial data recorded about the stations.

I will deal with these problems through data cleaning and wrangling.

In [1]:
import datetime
from datetime import timedelta
from fuzzywuzzy import fuzz

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
import random
import requests
import seaborn as sns

import geopy.distance

import json
import os


In [None]:
# Read 2016 data
bike_2016_q3 = pd.read_excel('bikeshare-ridership-2016-q3.xlsx')
bike_2016_q4 = pd.read_excel('bikeshare-ridership-2016-q4.xlsx')

In [None]:
# Parse each 2017 file with its own date format, so that trip dates can be matched to weather data
date_formats = {
    'Bikeshare Ridership (2017 Q1).csv': ['%d/%m/%Y %H:%M', -4],
    'Bikeshare Ridership (2017 Q2).csv': ['%d/%m/%Y %H:%M', -4],
    'Bikeshare Ridership (2017 Q3).csv': ['%m/%d/%Y %H:%M', 0],
    'Bikeshare Ridership (2017 Q4).csv': ['%m/%d/%y %H:%M:%S', 0],
}
df_2017 = pd.DataFrame() # Initiate an empty DataFrame

for fn, fmt in date_formats.items():
    tmp = pd.read_csv(fn)
    
    # Read the datetime in the specified format
    tmp['trip_start_time'] = pd.to_datetime(tmp['trip_start_time'], format=fmt[0], errors='coerce')
    # Convert the input time to the Eastern timezone
    tmp['trip_start_time'] = tmp['trip_start_time'] + timedelta(hours=fmt[1])

    df_2017 = pd.concat([df_2017, tmp], sort=False).reset_index(drop=True)
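
The loop above shifts the first two quarters by a fixed 4 hours. Toronto observes UTC-5 in winter and UTC-4 in summer, so if those files are recorded in UTC (an assumption implied by the offset, not confirmed here), a timezone-aware conversion would handle both cases. A minimal sketch on toy timestamps:

```python
import pandas as pd

# Toy timestamps, assumed (like the fixed-offset code above) to be recorded in UTC.
raw = pd.Series(pd.to_datetime(["2017-01-15 12:00", "2017-06-15 12:00"]))

# Localize to UTC, convert to Toronto local time, then drop the timezone
# so the result stays comparable with the naive timestamps of the other files.
local = (
    raw.dt.tz_localize("UTC")
       .dt.tz_convert("America/Toronto")
       .dt.tz_localize(None)
)
print(local.dt.hour.tolist())  # winter date shifts by 5 hours, summer date by 4
```

Because only the trip date is used later, the one-hour difference matters mainly for trips that start close to midnight.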

In [None]:
data_list = [bike_2016_q3, bike_2016_q4, df_2017]
df = pd.concat(data_list, sort=False, ignore_index=True)
df.sample(5)

I will use the Bike Share API endpoint to retrieve location information for the stations, and match the station IDs and names to resolve inconsistencies between names.

In [None]:
r = requests.get('https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information')
bikeshare_stations = json.loads(r.content)['data']['stations']
bikeshare_stations = pd.DataFrame(bikeshare_stations)[['station_id', 'name', 'lat', 'lon']].astype({
    'station_id': 'float64',
})

In [None]:
bikeshare_stations.sample(5)

In [None]:
stations_start = df[['from_station_id', 'from_station_name']]
stations_end = df[['to_station_id', 'to_station_name']]
stations_start.columns = stations_end.columns = ['station_id', 'name']

# Extracts the unique station ID and name combination from the from_station and to_station columns
stations = pd.concat([stations_start, stations_end]).dropna(how='all').drop_duplicates().reset_index(drop=True)

In [None]:
stations.sample(5)

For stations without IDs, we need to find the corresponding ID from the API data using the station name. I will use the fuzzywuzzy library to match similar station name strings.

In [None]:
# Separate the stations without station IDs (copy so the assignments below modify an independent frame)
no_ids = stations[stations['station_id'].isnull()].copy()
for idx, miss in no_ids.iterrows():
    max_score = 0
    
    # Compare the similarity of the station without ID to each station in the API data
    for i, exist in bikeshare_stations[['station_id', 'name']].iterrows():
        score = fuzz.ratio(miss['name'], exist['name'])
        
        if score > 80 and score > max_score:
            max_score = score
            no_ids.at[idx, 'station_id'] = exist['station_id']
    
    # Warn if the station was not able to be matched
    if max_score <= 80:
        print('WARN: {0} station could not be matched to an existing station'.format(miss['name']))

# Remove all stations that were not matched
no_ids = no_ids.dropna()
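
The matching loop above keeps the best-scoring candidate above a threshold. The same idea can be sketched with only the standard library; `difflib.SequenceMatcher` is a stand-in here, not the fuzzywuzzy scorer used in this notebook, and the function name, threshold scale (0 to 1), and station names are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.8):
    """Return (candidate, score) for the most similar name, or (None, score) if no score exceeds the threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        # Compare case-insensitively; ratio() is in [0, 1]
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best_score <= threshold:
        return None, best_score
    return best, best_score

# Hypothetical station names, for illustration only
api_names = ["Bay St / College St", "King St W / Spadina Ave", "Union Station"]
matched, score = best_match("Bay St. / College St", api_names)
unmatched, low_score = best_match("Zzz Qqq", api_names)
print(matched, round(score, 2), unmatched)
```

Returning the score along with the match makes it easier to review borderline cases by hand before dropping unmatched stations.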

In [None]:
stations = pd.concat([stations[~stations['station_id'].isnull()], no_ids])\
             .merge(bikeshare_stations[['station_id', 'lat', 'lon']], how='inner', on='station_id')\
             .drop_duplicates()

In [None]:
df = df.merge(stations, how='inner', left_on='from_station_name', right_on='name') \
       .merge(stations, how='inner', left_on='to_station_name', right_on='name', suffixes=['_from', '_to']) \
       .drop_duplicates()

df = df[[x for x in df.columns if not x.endswith('_station_id') and not x.endswith('_station_name') and x != 'trip_stop_time']]

I can then merge the station dataset with the bikeshare trip data to get the starting and ending location of each trip. I use the geopy library to calculate the distance between the two stations from their latitudes and longitudes. Since the dataset is huge, I keep only the variables that are useful for my analysis (`Date`, `user_type`, `trip_duration_seconds`, `distance`).

In [None]:
def calc_distance(lat_from, lon_from, lat_to, lon_to):
    coords_1 = (lat_from, lon_from)
    coords_2 = (lat_to, lon_to)
    distance = geopy.distance.distance(coords_1, coords_2).m
    
    return distance

In [None]:
df['distance'] = df.apply(lambda x: calc_distance(x['lat_from'],x['lon_from'],x['lat_to'],x['lon_to']),axis=1)
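
Applying a Python function row by row can be slow on millions of trips. As a rough cross-check, the haversine formula can be vectorized with NumPy. This is a sketch on a spherical Earth (the radius value is an assumed mean radius), whereas geopy's default distance uses an ellipsoidal model, so results will differ slightly. Either way, the value is the straight-line distance between stations, not the route actually ridden.

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371008.8):
    """Great-circle distance in metres; accepts scalars or NumPy arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(a))

# Toy coordinates: identical points, and points one degree of latitude apart
lat_from = np.array([43.66, 0.0])
lon_from = np.array([-79.39, 0.0])
lat_to = np.array([43.66, 1.0])
lon_to = np.array([-79.39, 0.0])
dist = haversine_m(lat_from, lon_from, lat_to, lon_to)
print(dist)
```

In the notebook, the four arguments would be the `lat_from`, `lon_from`, `lat_to`, and `lon_to` columns of `df`.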

In [None]:
df['trip_start_time'] = pd.to_datetime(df['trip_start_time'])
df['Date'] = df['trip_start_time'].apply(lambda x: x.strftime('%Y-%m-%d'))

In [None]:
df = df[['Date', 'user_type', 'trip_duration_seconds', 'distance']] # only keep useful columns
df = df.rename(columns={'user_type': 'user', 'trip_duration_seconds': 'duration'})

In [None]:
df.sample(5)

### Toronto Weather Data

The daily weather information for Toronto in 2017 is scraped from a Government of Canada website. Since the data are displayed by month on the website, I ran a for-loop to scrape the weather data for each month. During data cleaning, I need to solve the following problems:
- Convert the type of numeric values.
- Remove legends such as 'Estimated' or 'Missing'.

After cleaning the bikeshare data and the weather data, I joined them to a single dataframe by matching on trip date.

In [None]:
# scrape weather data
weather = pd.DataFrame() # Initiate an empty DataFrame

for i in range(1,13): # loop to scrape data of all 12 months in 2017
    # form the link to data of a month
    url = 'https://climate.weather.gc.ca/climate_data/daily_data_e.html?StationID=51459&timeframe=2&StartYear=1840&EndYear=2019&Day=1&Year=2017&Month=' + str(i)
    tmp = pd.read_html(url, header=0)
    month = tmp[0]
    month['month'] = i # create month column
    month = month[:-4] # remove rows of summary
    month = month.replace('LegendTT', 'T') # clean text display
    
    weather = pd.concat([weather, month], sort=False).reset_index(drop=True)

In [None]:
# Create date column
weather['year'] = 2017
weather['Date'] = pd.to_datetime(weather[['year', 'month', 'DAY']])
weather['Date'] = weather['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))

# Rename columns to use more concise names
weather = weather.rename(columns={'Max Temp Definition°C': 'Max Temp', 
                                  'Min Temp Definition°C': 'Min Temp',
                                  'Mean Temp Definition°C': 'Mean Temp',
                                  'Total Rain Definitionmm': 'Rain',
                                  'Total Snow Definitioncm': 'Snow',
                                  'Snow on Grnd Definitioncm': 'Snow depth',
                                  'Spd of Max Gust Definitionkm/h': 'Gust'})

# Only keep useful variables
weather = weather[['Date', 'Max Temp', 'Min Temp', 'Mean Temp', 'Rain', 'Snow', 'Snow depth', 'Gust']]
weather

In [None]:
weather = weather.replace('Legend..', '', regex=True) # remove LegendEE/LegendMM/etc.
weather = weather.replace('T', 0.05) # use 0.05 to represent small value
weather = weather.replace('<31', 10) # use 10 to represent gust <31
weather = weather.replace('', np.nan)

# Convert type to numeric
num_columns = ['Max Temp', 'Min Temp', 'Mean Temp', 'Rain', 'Snow', 'Snow depth', 'Gust']
for col in num_columns:
    weather[col] = pd.to_numeric(weather[col], errors = 'coerce')

Now we can merge the two datasets.

In [None]:
df_w = df.merge(weather, on='Date', how='inner') # merge trip data and weather data to one dataframe

In [None]:
df_w.sample(10)

With the cleaned datasets available, we can start our analysis.

## Two Ways to Define Trip Length

When asked about how long a bike trip is, we usually think about the time taken to finish the trip, or the distance that we actually travelled. Hence, I will use the **trip duration time** and **distance** as two ways to define trip length. `Trip duration` (measured in seconds) is recorded in the original dataset, and I have created a new variable for `distance` in data cleaning. 

First, we can use a box plot to visualize the distribution of trip duration time and distance.

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(14, 6))

ax = sns.boxplot(x=df['duration'], ax=axs[0])
ax.set(xlabel='duration time(seconds)',title='Distribution of trip duration')

ax = sns.boxplot(x=df['distance'], ax=axs[1], color="#ff7c43")
ax.set(xlabel='distance(meters)',title='Distribution of trip distance')

plt.show()

There are a lot of outliers in duration, and the maximum value is over 6 million seconds (more than 73 days). The distribution of trip distance is also right-skewed. To have a better understanding of the distribution, I choose to remove outliers in each variable.

In [None]:
# remove outliers in duration
q1 = df['duration'].quantile(0.25)
q3 = df['duration'].quantile(0.75)
interquartile_range = q3 - q1
lower = q1 - 1.5 * interquartile_range
upper = q3 + 1.5 * interquartile_range

df = df[~((df['duration'] < lower)|(df['duration'] > upper))].reset_index(drop=True)

In [None]:
# remove outliers in distance
q1 = df['distance'].quantile(0.25)
q3 = df['distance'].quantile(0.75)
interquartile_range = q3 - q1
lower = q1 - 1.5 * interquartile_range
upper = q3 + 1.5 * interquartile_range

df = df[~((df['distance'] < lower)|(df['distance'] > upper))].reset_index(drop=True)
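
The two cells above apply the same 1.5 × IQR rule to different columns. One way to avoid the repetition is a small helper; the name `drop_iqr_outliers` and the toy values are hypothetical, and the mask mirrors the expression used above.

```python
import pandas as pd

def drop_iqr_outliers(frame, column, k=1.5):
    """Drop rows whose value in `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    outside = (frame[column] < q1 - k * iqr) | (frame[column] > q3 + k * iqr)
    return frame[~outside].reset_index(drop=True)

# Toy durations in seconds, with one extreme value
toy = pd.DataFrame({"duration": [300, 400, 500, 600, 700, 6_000_000]})
cleaned = drop_iqr_outliers(toy, "duration")
print(cleaned["duration"].tolist())
```

Note that the second filter in the notebook is computed on data already filtered by duration, so the order of the two calls affects the result.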

After removing the outliers, I will plot the histogram of distributions of trip duration and distance.

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(14, 6))

ax = sns.distplot(df['duration'], ax=axs[0])
ax.set(xlabel='duration time(seconds)',title='Distribution of trip duration')

ax = sns.distplot(df['distance'], ax=axs[1], color="#ff7c43")
ax.set(xlabel='distance(meters)',title='Distribution of trip distance')

plt.show()

In the distribution of duration, the peak is at 500-750 seconds, and in the distribution of trip distance, the peak is at 1000-2000 m. This suggests that there are many short-to-medium length trips. In both distributions, there is a lower peak at the left end, where duration is 1 second and distance is 0. It is interesting that the two distributions have very similar shapes, both slightly right-skewed, and we can use a scatter plot to further investigate the relationship between duration and distance.

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(14, 6))

ax = sns.regplot(x='duration',y='distance',data=df, ax=axs[0])
ax.set(xlabel='duration time(seconds)', ylabel='distance(meters)',
       title='Trip distance vs. duration on whole dataset')

# randomly select 5000 trips to visualize; random_state makes the sample reproducible
# (random.seed does not affect pandas sampling)
small_df = df.sample(n=5000, random_state=10)
ax = sns.regplot(x='duration',y='distance',data=small_df, ax=axs[1])
ax.set(xlabel='duration time(seconds)', ylabel='distance(meters)',
       title='Trip distance vs. duration on 5000 trips')

plt.show()

The graph on the left plots all of the trips. There are too many points to draw any insight about the relationship between distance and duration. However, since the points cover almost the whole grid, the plot shows that speed varies widely between individual trips. Another interesting observation is that there are visible lines of points at duration = 1 second and at distance = 0 m. These points may come from different types of errors, such as a failure in docking the bike or registration problems in the database.

To better visualize the trend, I randomly selected 5,000 points from the dataset to plot the scatter plot on the right. It shows a positive linear relationship between distance and duration.

In [None]:
x = df['duration']
y = df['distance']

pearson_coef, p_value = stats.pearsonr(x, y)
print("Pearson correlation coefficient: ", pearson_coef, "and p-value: ", p_value)

The correlation coefficient is 0.66, which suggests a relatively strong positive relationship between distance and duration. The p-value is very low, so the relationship is statistically significant. This is what we would expect, as a longer-distance trip takes more time to finish. However, the correlation coefficient is not very close to 1, probably because the speed of a bike trip depends on many variables, such as the weather, the rider's physical ability, and how familiar the rider is with the route.

#### Limitations 
- We exclude outliers from this analysis, but we don't know if some of them are valid trips.
- Due to limited time and data, we do not assess which of the two definitions of trip length is better. This could be a question for future investigation.

## Trip Length Differences between Casual Users and Members

We would like to know whether trip length differs between casual users and members, as the results may help Bike Share Toronto adjust its pricing or promotional events to increase usage. We will compare trip lengths under both definitions, `duration` and `distance`, and we will also use the `Date` variable to select trips in July 2017. On Wednesdays in July, riders could rent a bike for up to 30 minutes without charge, and I would like to explore how this promotion affects trip lengths for casual users and members.

First, I will visualize the distribution of duration and distance for both user types.

In [None]:
plt.figure(figsize=(8,6))

member = df[df['user'] == 'Member']
casual = df[df['user'] == 'Casual']

# Draw the density plot
graph = sns.distplot(member['duration'], hist = False, kde = True, kde_kws = {'shade': True,'linewidth': 3}, 
             label = 'Member')
graph = sns.distplot(casual['duration'], hist = False, kde = True, kde_kws = {'shade': True,'linewidth': 3}, 
             label = 'Casual')
graph.axvline(1800, ls = '--', c='grey', lw=0.8)
    
# Plot formatting
plt.legend(prop={'size': 16}, title = 'User Type')
plt.title('Distribution of trip durations for members and casual users')
plt.xlabel('Duration (seconds)')
plt.ylabel('Density')
plt.show()

In [None]:
len(member)

In [None]:
len(member[member['duration'] > 1800])

In [None]:
2816/1566760 * 100

In [None]:
len(casual)

In [None]:
df['duration'] = pd.to_numeric(df['duration'], errors = 'coerce')
len(casual[casual['duration'] > 1800])

In [None]:
over = casual[casual['duration'] > 1800]
over

In [None]:
7677/354591 * 100

In [None]:
# remove outliers in duration
q1 = df_w['duration'].quantile(0.25)
q3 = df_w['duration'].quantile(0.75)
interquartile_range = q3 - q1
lower = q1 - 1.5 * interquartile_range
upper = q3 + 1.5 * interquartile_range

df_w = df_w[~((df_w['duration'] < lower)|(df_w['duration'] > upper))].reset_index(drop=True)

In [None]:
df_w.head()

In [None]:
member_w = df_w[df_w['user'] == 'Member']
casual_w = df_w[df_w['user'] == 'Casual']

print(len(member_w))
print(len(casual_w))

In [None]:
member_w.describe()

In [None]:
casual_w.describe()

In [None]:
plt.figure(figsize=(8,6))


# Draw the density plot
graph = sns.distplot(member_w['duration'], hist = False, kde = True, kde_kws = {'shade': True,'linewidth': 3}, 
             label = 'Member')
graph = sns.distplot(casual_w['duration'], hist = False, kde = True, kde_kws = {'shade': True,'linewidth': 3}, 
             label = 'Casual')
graph.axvline(1800, ls = '--', c='grey', lw=0.8)
    
# Plot formatting
plt.legend(prop={'size': 16}, title = 'User Type')
plt.title('Distribution of trip durations for members and casual users in 2017')
plt.xlabel('Duration (seconds)')
plt.ylabel('Density')

plt.subplots_adjust(hspace = 0, wspace = 0)
#plt.margins(0,0)
#plt.gca().xaxis.set_major_locator(plt.NullLocator())
#plt.gca().yaxis.set_major_locator(plt.NullLocator())
plt.savefig("duration.png", transparent=True)

In [None]:
print(len(member_w[member_w['duration'] > 1800]))
print(len(casual_w[casual_w['duration'] > 1800]))

In [None]:
overtime = casual_w[casual_w['duration'] > 1800]
overtime

In [None]:
print("percentage of 2017 member trips with duration > 30min: ", 3230/1156634 * 100, "%")
print("percentage of 2017 casual user trips with duration > 30min: ", 5762/245182 * 100, "%")
print("percentage of 2017 overtime casual trips happened in July/August/September: ", 3585/5672 * 100, "%")
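
The counts in the print statements above are typed in by hand from earlier cell outputs, which makes them easy to get out of sync with the data. A hedged sketch of computing the shares directly, on a toy frame with hypothetical values (in the notebook the frame would be `df_w`):

```python
import pandas as pd

# Toy trips; durations are in seconds
toy = pd.DataFrame({
    "user": ["Member", "Member", "Member", "Member", "Casual", "Casual"],
    "duration": [500, 700, 1900, 600, 2000, 900],
})

# The mean of a boolean column is the fraction of True values
share_over_30 = (
    toy.assign(over_30=toy["duration"] > 1800)
       .groupby("user")["over_30"]
       .mean()
       .mul(100)
)
print(share_over_30)
```

The same pattern works for the summer share by replacing the boolean condition with a month filter.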

In [None]:
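# 'Month' is needed here; derive it from 'Date' so this cell runs on its own
df_w['Month'] = pd.to_datetime(df_w['Date']).dt.month
# Keep July, August, and September
summer = df_w[(df_w['Month'] > 6) & (df_w['Month'] < 10)]
summer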

In [None]:
member_s = summer[summer['user'] == 'Member']
casual_s = summer[summer['user'] == 'Casual']

print(len(member_s))
print(len(casual_s))
print(442292/1156634)
print(153658/245182)

In [None]:
print("casual user trips with >30 min duration: ",len(casual_s[casual_s['duration'] > 1800]))
print()

In [None]:
3585/5672 * 100

We can see that the trip durations of members have a very high peak at 500 seconds. In contrast, trip durations of casual users have a flatter distribution, with a peak at about 900 seconds. The distributions of both user types also show a peak just above 0 seconds, but the peak for members is tiny, while the peak for casual users is much higher.

In [None]:
plt.figure(figsize=(8,6))

# Draw the density plot
sns.distplot(member['distance'], hist = False, kde = True, kde_kws = {'shade': True,'linewidth': 3}, 
             label = 'Member')
sns.distplot(casual['distance'], hist = False, kde = True, kde_kws = {'shade': True,'linewidth': 3}, 
             label = 'Casual')
    
# Plot formatting
plt.legend(prop={'size': 16}, title = 'User Type')
plt.title('Distribution of trip distances for members and casual users')
plt.xlabel('Distance (meters)')
plt.ylabel('Density')
plt.show()

As for the distribution of trip distances, it is remarkable that the distribution for casual users has a very high peak at distance = 0, i.e. the start station and the stop station are the same. Possible reasons include:
- A larger proportion of casual users are first-time users who are not familiar with the renting and docking facilities. They may rent and dock the bike at the same station several times.
- Casual users only use the bikes on special occasions, so their travel plans may change more often.

We also notice that although casual users take more long-distance trips than members, the gap is smaller than the gap in duration. This suggests that, on average, casual users ride more slowly than members. On the one hand, casual users may use the bikes more often for travel, while members may usually ride to commute, so casual users would make more stops along the way; on the other hand, casual users may be less familiar with the route, so they would take longer even when the trip distances are similar.
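
The speed comparison above is inferred from the two density plots. A more direct check would be to compute an average speed per trip and compare user types. The sketch below uses a toy frame with hypothetical values; because `distance` is the straight-line distance between stations, the result is only a lower bound on actual riding speed.

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["Member", "Member", "Casual", "Casual", "Casual"],
    "duration": [600, 300, 1200, 900, 400],             # seconds
    "distance": [2400.0, 1200.0, 2400.0, 1800.0, 0.0],  # metres
})

# Round trips (distance 0) say nothing about riding speed, so exclude them
moving = toy[toy["distance"] > 0]

# metres per second -> kilometres per hour
speed_kmh = moving["distance"] / moving["duration"] * 3.6
median_speed = speed_kmh.groupby(moving["user"]).median()
print(median_speed)
```

The median is used here rather than the mean so that a few very slow or very fast trips do not dominate the comparison.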

Now we will take a closer look at the data from July 2017. I compare the distribution of trip distance in July 2017 with the overall distribution.

In [None]:
# select trips in July by string comparison
july = df.loc[(df['Date'] > '2017-06-30') & (df['Date'] < '2017-08-01')] 
july

In [None]:
plt.figure(figsize=(10,6))

# Draw the density plot
sns.distplot(july[july['user'] == 'Member']['distance'], hist = False, kde = True, 
             kde_kws = {'shade': False,'linewidth': 3}, 
             label = 'Member-2017 July', color='#003f5c')
sns.distplot(member['distance'], hist = False, kde = True, kde_kws = {'shade': False,'linewidth': 3}, 
             label = 'Member-all quarters', color='#7a5195')
sns.distplot(july[july['user'] == 'Casual']['distance'], hist = False, kde = True, 
             kde_kws = {'shade': False,'linewidth': 3}, 
             label = 'Casual-2017 July', color='#ef5675')
sns.distplot(casual['distance'], hist = False, kde = True, kde_kws = {'shade': False,'linewidth': 3}, 
             label = 'Casual-all quarters', color='#ffa600')
    
# Plot formatting
plt.legend(prop={'size': 12}, title = 'User Type')
plt.title('Distribution of trip distances for members and casual users in 2017 July and in all quarters')
plt.xlabel('Distance (meters)')
plt.ylabel('Density')
plt.show()

We can see that the distribution of trip distances in July 2017, the month of 'Free Ride Wednesday', differs little from the overall distribution across all quarters. This is probably because most bikeshare stations are in downtown Toronto, and the popular routes have similar lengths. Free trips may increase the total number of trips, but are unlikely to have a large impact on trip length.
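
The comparison above uses every trip in July, although the promotion applied only on Wednesdays. To isolate those days, one could also filter on the day of the week. A minimal sketch on a toy frame with hypothetical distances:

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2017-07-04", "2017-07-05", "2017-07-12", "2017-08-02", "2017-06-28"],
    "distance": [1500.0, 1700.0, 1600.0, 1400.0, 1300.0],
})
dates = pd.to_datetime(toy["Date"])

# pandas counts Monday as 0, so Wednesday is 2
is_july_2017 = (dates.dt.year == 2017) & (dates.dt.month == 7)
free_wednesdays = toy[is_july_2017 & (dates.dt.dayofweek == 2)]
print(free_wednesdays["Date"].tolist())
```

Comparing these trips with the other July weekdays would separate the effect of the promotion from the effect of the season.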

#### Limitations
- The analysis only compares trip lengths of members and casual users on a broad timespan. To gain more insights on trip length differences between user types, we could compare the trip lengths by each month and by weekdays/weekends in future analysis.
- The dataset is from more than 2 years ago, and we don't know if Bikeshare Toronto user behaviours have changed in the past 2 years. It would be better if we perform the same analysis on more up-to-date data.

## Effects of Season and Weather on Trip Length

For this part of the analysis, since weather data is available only for 2017, we consider only the ridership data from 2017.

### Seasonal Change

In [None]:
# Extract month for 2017 data
df_w['Month'] = pd.to_datetime(df_w['Date']).apply(lambda x: x.strftime('%m')).astype(int)

I use a bar chart to visualize the mean trip distance, and a line chart to visualize the mean trip duration, for every month in 2017.

In [None]:
plt.figure(figsize=(10,6))
dis = df_w.groupby(['Month', 'user'])['distance'].mean().reset_index().sort_values('user', ascending=False)
dis = pd.DataFrame(dis)

sns.barplot(x='Month', y='distance', hue='user', data=dis)

plt.title('Distribution of mean trip distances for members and casual users in 2017 by month')
plt.ylabel('Distance (meters)')
plt.xlabel('Month')
plt.show()

We can see that the trip distance does not vary much throughout the year. The mean trip distance of casual users in July is somewhat lower than in other months, probably because the hot weather leads to shorter trips.

In [None]:
plt.figure(figsize=(10,6))
dur = df_w.groupby(['Month', 'user'])['duration'].mean().reset_index().sort_values('user', ascending=False)
dur = pd.DataFrame(dur)

graph = sns.lineplot(x='Month', y='duration', hue='user', data=dur)
graph.axhline(1800, ls = '--', c='red', lw=0.8)

plt.title('Distribution of mean trip durations for members and casual users in 2017 by month')
plt.ylim(0,2700)
plt.ylabel('Durations (seconds)')
plt.xlabel('Month')
plt.show()

For trip durations, the mean travel time of members does not vary much by month either, but the trip durations of casual users fluctuate more by season. The mean trip duration for casual users is shortest in winter (Jan - March) and longest in summer (July - Sept). A possible explanation is that in summer, more people take a bike trip across the city as a casual user.

### Weather Change

We will use a scatterplot to visualize the distribution of mean trip duration and distance with weather conditions of rain, snow, and gust.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(14, 6), sharey=True)
 
def addScatter(col, color, h):
    """
    create a scatterplot with 3 subplots.
    'col' is the weather to be displayed, 'color' is the color of plot,
    'h' is the horizontal index of subplot.
    """
    # Select multiple columns with a list; tuple-style selection fails on newer pandas
    data = df_w.groupby(['Date'])[['duration', col]].mean()
    data = pd.DataFrame(data)
    
    axs[h].scatter(data[col], data['duration'], c=color, alpha=0.6)
    axs[h].set_xlabel(col)
    axs[h].set_ylabel("Duration(seconds)")

addScatter('Rain',"#ff7c43", 0)
addScatter('Snow', 'lightgray',1)
addScatter('Gust', 'darkgreen',2)

plt.show()

Since many days have no rain, snow, or gust, a lot of points are clustered at the left of every subplot. This means that on days without bad weather, trip duration has a wide range that depends on many other variables. We can see a general decreasing trend in trip duration as the amount of rain increases. When there is snow, the amount of snowfall does not matter much, and the trip duration on snowy days is generally low. In contrast, there seems to be little relationship between gust and trip duration.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(14, 6), sharey=True)
 
def addScatter(col, color, h):
    """
    create a scatterplot with 3 subplots.
    'col' is the weather to be displayed, 'color' is the color of plot,
    'h' is the horizontal index of subplot.
    """
    # Select multiple columns with a list; tuple-style selection fails on newer pandas
    data = df_w.groupby(['Date'])[['distance', col]].mean()
    data = pd.DataFrame(data)
    
    axs[h].scatter(data[col], data['distance'], c=color, alpha=0.6)
    axs[h].set_xlabel(col)
    axs[h].set_ylabel("Distance(meters)")

addScatter('Rain',"#ff7c43", 0)
addScatter('Snow', 'lightgray',1)
addScatter('Gust', 'darkgreen',2)

plt.show()

When measuring trip length using distance, we can get similar conclusions as measuring with duration. However, the negative correlation between distance and total precipitation of snow is more obvious.
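
The trends described above are read from the scatter plots by eye. One hedged way to quantify them is a rank (Spearman) correlation between daily means, computed here as the Pearson correlation of ranks. The toy frame and its values are hypothetical; in the notebook the input would be `df_w`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2017-05-01"] * 2 + ["2017-05-02"] * 2 + ["2017-05-03"] * 2,
    "duration": [700, 900, 600, 700, 400, 500],  # seconds
    "Rain": [0.0, 0.0, 5.0, 5.0, 20.0, 20.0],    # mm
})

# One row per day: mean trip duration and that day's rainfall
daily = toy.groupby("Date")[["duration", "Rain"]].mean()

# Spearman correlation = Pearson correlation of the ranks
rho = daily["duration"].rank().corr(daily["Rain"].rank())
print(rho)
```

A rank correlation is less sensitive than Pearson's to the many dry days clustered at zero, though it still treats each day as one observation regardless of how many trips it contains.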

#### Limitations
- The effect of weather may differ between seasons. In future analysis, we can repeat the analysis on each quarter separately.
- The weather conditions on a day may not be independent. We may need to analyze using multiple variables such as temperature and precipitation at the same time to determine what might affect the mean trip length on a specific day.
- The mean distance/duration may not be representative of the total ridership activity on one day.

## Conclusions

### Results
We can measure the length of Toronto Bikeshare trips using two definitions: trip duration, and the distance between the start station and the stop station. The two measures are positively correlated, but they are not equivalent because the speed of a trip depends on many variables.

The trips by casual users tend to have longer duration time than members, which may indicate different purposes to use the bike. 

On days of bad weather such as rain or snow, the trip duration and distance would decrease.

### Ideas for Bike Share Toronto
- The high number of trips starting and ending at the same station, especially for casual users, is an interesting phenomenon worth investigating. If it is caused by difficulty using the facility for the first time, Bike Share Toronto may want to provide clearer instructions or improve the docking facility.
- It would be interesting to further investigate what causes the difference in trip duration between casual users and members. It may be possible to infer the purpose of trips by looking at the most popular routes on the map, and to explore whether casual users use the bike share system more often for travel.