# Bike Share Analysis

## Introduction

Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles for short trips, typically 30 minutes or less. Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.

In this project, we will perform an exploratory analysis on data provided by [Motivate](https://www.motivateco.com/), a bike-share system provider for many major cities in the United States. We will compare the system usage between three large cities: New York City, Chicago, and Washington, DC. We will also see if there are any differences within each system for those users that are registered, regular users and those users that are short-term, casual users.

## Objective behind analysis

**To answer certain questions and notice trends in the data collected by Motivate. 
To visualize those trends using matplotlib, pandas**

**A glimpse of few of the many questions to be answered**: 
-  What is number of working days and working hours as well as number of holidays?
-  What is the present price per kilometer rate of local transports?
-  Which month of year or day of week has more demand of bike?

<a id='wrangling'></a>
## Data Collection and Wrangling

Now it's time to collect and explore our data. In this project, we will focus on the record of individual trips taken in 2016 from our selected cities: New York City, Chicago, and Washington, DC. Each of these cities has a page where we can freely download the trip data.:

- New York City (Citi Bike): [Link](https://www.citibikenyc.com/system-data)
- Chicago (Divvy): [Link](https://www.divvybikes.com/system-data)
- Washington, DC (Capital Bikeshare): [Link](https://www.capitalbikeshare.com/system-data)

After visiting these pages, I noticed that each city has a different way of delivering its data. Chicago updates with new data twice a year, Washington DC is quarterly, and New York City is monthly. The data has already been collected in the `/data/` folder of the project files. While the original data for 2016 is spread among multiple files for each city, the files in the `/data/` folder collect all of the trip data for the year into one file per city. Some data wrangling of inconsistencies in timestamp format within each city has already been performed. In addition, a random 2% sample of the original data is taken to make the exploration more manageable. 
 

In [1]:
## import all necessary packages and functions.
import csv # read and write csv files
from datetime import datetime # operations to parse dates
from pprint import pprint # use to print data structures like dictionaries in
                          # a nicer way than the base print function.
import pandas             

In [2]:
def print_first_point(filename):
    """
    This function prints and returns the first data point (second row) from
    a csv file that includes a header row.
    """
    # print city name for reference
    city = filename.split('-')[0].split('/')[-1]
    print('\nCity: {}'.format(city))
    
    with open(filename, 'r') as f_in:
        # make DictReader object t read from the file
        trip_reader =csv.DictReader(f_in)
        # the __next__() function allows to read from the second line (ignoring the header of file)
        first_trip =trip_reader.__next__()
        pprint(first_trip)
    # output city name and first trip for later testing
    return (city, first_trip)

# list of files for each city
data_files = ['./data/NYC-CitiBike-2016.csv',
              './data/Chicago-Divvy-2016.csv',
              './data/Washington-CapitalBikeshare-2016.csv',]

# print the first trip from each file, store in dictionary
example_trips = {}
for data_file in data_files:
    city, first_trip = print_first_point(data_file)
    example_trips[city] = first_trip
# print first_trip just to check
print(first_trip)


City: NYC
OrderedDict([('tripduration', '839'),
             ('starttime', '1/1/2016 00:09:55'),
             ('stoptime', '1/1/2016 00:23:54'),
             ('start station id', '532'),
             ('start station name', 'S 5 Pl & S 4 St'),
             ('start station latitude', '40.710451'),
             ('start station longitude', '-73.960876'),
             ('end station id', '401'),
             ('end station name', 'Allen St & Rivington St'),
             ('end station latitude', '40.72019576'),
             ('end station longitude', '-73.98997825'),
             ('bikeid', '17109'),
             ('usertype', 'Customer'),
             ('birth year', ''),
             ('gender', '0')])

City: Chicago
OrderedDict([('trip_id', '9080545'),
             ('starttime', '3/31/2016 23:30'),
             ('stoptime', '3/31/2016 23:46'),
             ('bikeid', '2295'),
             ('tripduration', '926'),
             ('from_station_id', '156'),
             ('from_station_name', 'Clar

The above piece of code will be useful since we can refer to quantities by an easily-understandable label instead of just a numeric index. For example, if we have a trip stored in the variable `row`, then we would rather get the trip duration from `row['duration']` instead of `row[0]`.

### Condensing the Trip Data

It should also be observable from the above printout that each city provides different information. Even where the information is the same, the column names and formats are sometimes different. To make things as simple as possible when we get to the actual exploration, we should trim and clean the data. Cleaning the data makes sure that the data formats across the cities are consistent, while trimming focuses only on the parts of the data we are most interested in to make the exploration easier to work with.

We will generate new data files with five values of interest for each trip: trip duration, starting month, starting hour, day of the week, and user type. Each of these may require additional wrangling depending on the city:

- **Duration**: This has been given to us in seconds (New York, Chicago) or milliseconds (Washington). A more natural unit of analysis will be if all the trip durations are given in terms of minutes.
- **Month**, **Hour**, **Day of Week**: Ridership volume is likely to change based on the season, time of day, and whether it is a weekday or weekend. Use the start time of the trip to obtain these values. The New York City data includes the seconds in their timestamps, while Washington and Chicago do not. The [`datetime`](https://docs.python.org/3/library/datetime.html) package will be very useful here to make the needed conversions.
- **User Type**: It is possible that users who are subscribed to a bike-share system will have different patterns of use compared to users who only have temporary passes. Washington divides its users into two types: 'Registered' for users with annual, monthly, and other longer-term subscriptions, and 'Casual', for users with 24-hour, 3-day, and other short-term passes. The New York and Chicago data uses 'Subscriber' and 'Customer' for these groups, respectively. For consistency, we will convert the Washington labels to match the other two.


In [3]:
def duration_in_mins(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the trip duration in units of minutes.
    
    Remember that Washington is in terms of milliseconds while Chicago and NYC
    are in terms of seconds. 
    """
    duration={}
    if city=='Washington':
        duration[city]=float(datum['Duration (ms)'])/(60*1000)    #because Washington measures in ms
    else:
        duration[city]=float(datum['tripduration'])/60     #because other two cities measures time in seconds
    
    return duration[city]
        
# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': 13.9833,
         'Chicago': 15.4333,
         'Washington': 7.1231}

for city in tests:
    assert abs(duration_in_mins(example_trips[city], city) - tests[city]) < .001


In [4]:
def time_of_trip(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the month, hour, and day of the week in
    which the trip was made.
    
    Remember that NYC includes seconds, while Washington and Chicago do not.
    """

    if city=='Washington':
        datee=datetime.strptime(datum['Start date'],"%m/%d/%Y %H:%M")
    elif city=='NYC':
        datee=datetime.strptime(datum['starttime'],"%m/%d/%Y %H:%M:%S")
    else:
        datee=datetime.strptime(datum['starttime'],"%m/%d/%Y %H:%M")
        
    month= datee.month
    hour = datee.hour
    day_of_week =datee.strftime('%A')
    return (month, hour, day_of_week)


# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': (1, 0, 'Friday'),
         'Chicago': (3, 23, 'Thursday'),
         'Washington': (3, 22, 'Thursday')}

for city in tests:
    assert time_of_trip(example_trips[city], city) == tests[city]

In [5]:
def type_of_user(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the type of system user that made the
    trip.
    
    Remember that Washington has different category names compared to Chicago
    and NYC. 
    """
    # declaring a dictionary to store informations about given city
    user_type={}
    
    if city=='Washington':
        if datum['Member Type']=='Registered':
            user_type[city]='Subscriber'
        else:
            user_type[city]='Customer'
    
    elif city=='NYC' or city=='Chicago':
        user_type[city]=datum['usertype']
    
    return user_type[city]

# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': 'Customer',
         'Chicago': 'Subscriber',
         'Washington': 'Subscriber'}

for city in tests:
    assert type_of_user(example_trips[city], city) == tests[city]

**Now we are going to export the cleaned data into another csv files so that things won't get messy**

In [6]:
def condense_data(in_file, out_file, city):
    """
    This function takes full data from the specified input file
    and writes the condensed data to a specified output file. The city
    argument determines how the input file will be parsed.
    """
    
    with open(out_file, 'w') as f_out, open(in_file, 'r') as f_in:
        # set up csv DictWriter object - writer requires column names for the
        # first row as the "fieldnames" argument
        out_colnames = ['duration', 'month', 'hour', 'day_of_week', 'user_type']        
        trip_writer = csv.DictWriter(f_out, fieldnames = out_colnames)
        trip_writer.writeheader()
        
        ## set up csv DictReader object ##
        trip_reader = csv.DictReader(f_in)

        # collect data from and process each row
        for row in trip_reader:
            # set up a dictionary to hold the values for the cleaned and trimmed
            # data point
            new_point = {}
            new_point['duration']=duration_in_mins(row,city)
            new_point['month'],new_point['hour'],new_point['day_of_week']=time_of_trip(row,city)
            new_point['user_type']=type_of_user(row,city)
            # write the cleaned information in the file
            trip_writer.writerow(new_point)
            

In [7]:
# Run this cell to check your work
city_info = {'Washington': {'in_file': './data/Washington-CapitalBikeshare-2016.csv',
                            'out_file': './data/Washington-2016-Summary.csv'},
             'Chicago': {'in_file': './data/Chicago-Divvy-2016.csv',
                         'out_file': './data/Chicago-2016-Summary.csv'},
             'NYC': {'in_file': './data/NYC-CitiBike-2016.csv',
                     'out_file': './data/NYC-2016-Summary.csv'}}
#iterate over dictionary and printing the output
for city, filenames in city_info.items():
    condense_data(filenames['in_file'], filenames['out_file'], city)
    print_first_point(filenames['out_file'])


City: Washington
OrderedDict([('duration', '7.123116666666666'),
             ('month', '3'),
             ('hour', '22'),
             ('day_of_week', 'Thursday'),
             ('user_type', 'Subscriber')])

City: Chicago
OrderedDict([('duration', '15.433333333333334'),
             ('month', '3'),
             ('hour', '23'),
             ('day_of_week', 'Thursday'),
             ('user_type', 'Subscriber')])

City: NYC
OrderedDict([('duration', '13.983333333333333'),
             ('month', '1'),
             ('hour', '0'),
             ('day_of_week', 'Friday'),
             ('user_type', 'Customer')])


## Exploratory Data Analysis

Now that we have the data collected and wrangled, we're ready to start exploring the data. In this section we will write some code to compute descriptive statistics from the data. We will also be introduced to the `matplotlib` library to create some basic histograms of the data.


### Statistics

First, let's compute some basic counts. The first cell below contains a function that uses the csv module to iterate through a provided data file, returning the number of trips made by subscribers and customers. The second cell runs this function on the example Bay Area data in the `/examples/` folder.

**Which city has the highest number of trips? Which city has the highest proportion of trips made by subscribers? Which city has the highest proportion of trips made by short-term customers?**

In [8]:
def number_of_trips(filename):
    """
    This function reads in a file with trip data and reports the number of
    trips made by subscribers, customers, and total overall along with subscriber and customer proportions.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        data=[]
        # initialize count variables
        n_subscribers = 0
        n_customers = 0
        
        # tally up ride types
        for row in reader:
            if row['user_type'] == 'Subscriber':
                n_subscribers += 1
            else:
                n_customers += 1
        
        # compute total number of rides
        n_total = n_subscribers + n_customers
        data.append(n_total)
        data.append(n_subscribers)
        data.append(n_customers)
        city_subscriber_prop=float(n_subscribers)/n_total
        city_customer_prop=float(n_customers)/n_total
        data.append(city_subscriber_prop)
        data.append(city_customer_prop)
        # return tallies as a tuple
        return(data)

In [9]:
## Modify this and the previous cell to answer Question 4a. Remember to run ##
## the function on the cleaned data files you created from Question 3.      ##
output_list=[]
data_file = ['./data/Chicago-2016-Summary.csv',
             './data/NYC-2016-Summary.csv',
             './data/Washington-2016-Summary.csv']
for filename in data_file:
    output_list.append(number_of_trips(filename))
#using pandas library to form table type output from the obtained data
df1=pandas.DataFrame(output_list,columns=['Total Riders','Total Subscribers','Total Customer','Subs proportion','Cust proportion'],index=['Chicago','NYC','Washington'])
df1

Unnamed: 0,Total Riders,Total Subscribers,Total Customer,Subs proportion,Cust proportion
Chicago,72131,54982,17149,0.762252,0.237748
NYC,276798,245896,30902,0.888359,0.111641
Washington,66326,51753,14573,0.780282,0.219718


- City with highest number of trip: **New York City (NYC)**  (Total: 276798).
- City with highest proportion of trips made by subscribers: **New York City (NYC)** (prop.: 88.8%).
- City with highest proportion of trips made by customers: **Chicago** (prop.: 23.7%).

Now, we will write code to continue investigating properties of the data.

Bike-share systems are designed for riders to take short trips. Most of the time, users are allowed to take trips of 30 minutes or less with no additional charges, with overage charges made for trips of longer than that duration. What is the average trip length for each city? What proportion of rides made in each city are longer than 30 minutes?

In [10]:
def duration_of_trips(filename):
    """
    This function reads in a file with trip data and reports the average
    duration and proportion of rides that are greater then 30 mins.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        data=[]
        # initialize variables
        trip_greater30 = 0
        total=0
        #to exclude the header of our .csv file, initialized no_of_trips with -1
        no_of_trips=-1
        
        # count variables
        for row in reader:
            if float(row['duration']) > 30.0:
                trip_greater30 += 1
            no_of_trips += 1
            total+=float(row['duration'])
        
        # compute total number of rides
        data.append(float(total) / no_of_trips)
        data.append((float(trip_greater30) / no_of_trips)*100)
        # return the list of duration
        return(data)

In [11]:
data1=[]
for filename in data_file:
    data1.append(duration_of_trips(filename))
df2=pandas.DataFrame(data1,columns=['Avg_duration','% (>30mins)'],index=['Chicago','NYC','Washington'])
df2

Unnamed: 0,Avg_duration,% (>30mins)
Chicago,16.563859,8.332178
NYC,15.81265,7.302464
Washington,18.933159,10.83905


Dig deeper into the question of trip duration based on ridership. Choose one city. Within that city, which type of user takes longer rides on average: Subscribers or Customers?

In [12]:
def long_duration_check(filename):
    """
    This function reads in a file with trip data and returns the avg time
    taken by different usertypes for given city
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        # initialize variables
        duration_subscribers=0
        duration_customers=0
        # count duration
        for row in reader:
            if row['user_type']=='Subscriber':
                duration_subscribers += float(row['duration'])
            else:
                duration_customers += float(row['duration'])
        #calculate avg duration for both type
        data=number_of_trips(filename)
        avg_subs=float(duration_subscribers)/data[1]
        avg_cust=float(duration_customers)/data[2]
        print("for the given city\nsubscribers on avg: {} and customers on avg: {}".format(avg_subs,avg_cust))
        
        #return type of customer which have more average
        if avg_cust > avg_subs:
            return('Customers')
        else:
            return("Subscribers")


In [13]:
print(long_duration_check('./data/Washington-2016-Summary.csv'))

for the given city
subscribers on avg: 12.528120499294745 and customers on avg: 41.67803139252976
Customers


In Washington city, Customer type of users (avg = 41.67) are more prevalent compared to subscribers (avg = 12.52).


### Visualizations

The last set of values that we computed should have pulled up an interesting result. While the mean trip time for Subscribers is well under 30 minutes, the mean trip time for Customers is actually _above_ 30 minutes! It will be interesting for us to look at how the trip times are distributed. In order to do this, we will use a new library named, `matplotlib`.

In [14]:
# load library
import matplotlib.pyplot as plt
import plotly.plotly as py

# this is a 'magic word' that allows for plots to be displayed
# inline with the notebook.
%matplotlib inline 

# example histogram, data taken from bay area sample
data = [ 7.65,  8.92,  7.42,  5.50, 16.17,  4.20,  8.98,  9.62, 11.48, 14.33,
        19.02, 21.53,  3.90,  7.97,  2.62,  2.67,  3.08, 14.40, 12.90,  7.83,
        25.12,  8.30,  4.93, 12.43, 10.60,  6.17, 10.88,  4.78, 15.15,  3.53,
         9.43, 13.32, 11.72,  9.85,  5.22, 15.10,  3.95,  3.17,  8.78,  1.88,
         4.55, 12.68, 12.38,  9.78,  7.63,  6.45, 17.38, 11.90, 11.52,  8.63,]
plt.hist(data)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (m)')
plt.show()

ModuleNotFoundError: No module named 'plotly'

In the above cell, we collected fifty trip times in a list, and passed this list as the first argument to the `.hist()` function. This function performs the computations and creates plotting objects for generating a histogram, but the plot is actually not rendered until the `.show()` function is executed. The `.title()` and `.xlabel()` functions provide some labeling for plot context.

We will now use these functions to create a histogram of the trip times for the city we selected above. We won't separate the Subscribers and Customers for now: just collect all of the trip times and plot them.

In [None]:
with open('./data/Washington-2016-Summary.csv', 'r') as f_in:
    # set up csv reader object
    reader = csv.DictReader(f_in)
    data=[]
    for row in reader:
        data.append(float(row['duration']))

plt.hist(data)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (m)')
plt.show()

We will use the parameters of the `.hist()` function to plot the distribution of trip times for the Subscribers in our selected city. We will do the same thing for only the Customers. Add limits to the plots so that only trips of duration less than 75 minutes are plotted. As a bonus, we will set the plots up so that bars are in five-minute wide intervals. For each group, where is the peak of each distribution? How would the shape of each distribution will be described? 

In [None]:
with open('./data/Washington-2016-Summary.csv', 'r') as f_in:
    # set up csv reader object
    reader = csv.DictReader(f_in)
    data_subs=[]
    data_cust=[]
    # iterating over .csv file row-wise
    for row in reader:
        if float(row['duration']) < 75:
            if row['user_type']=='Subscriber':
                data_subs.append(float(row['duration']))
            else:
                data_cust.append(float(row['duration']))
# plotting histogram for subscribers
plt.hist(data_subs,rwidth=0.5)
plt.title('Distribution of Trip Durations for Subscribers')
plt.xlabel('Duration (min)')
plt.ylabel('No of trips')
plt.show()
# plotting histogram for customers
plt.hist(data_cust,rwidth=0.5)
plt.title('Distribution of Trip Durations for Customers')
plt.xlabel('Duration (min)')
plt.ylabel('No of trips')
plt.show()

**For Graph 1 (Subscribers):**
- Unimodal around 3-8 minute duration.
- Asymmetric obviously.
- Right Skewed around the mean and mode.
- Peak at 3-8 minute duration then decreasing.

**For Graph 2 (Customers):**
- Unimodal around 18-21 minute duration.
- Asymmetric obviously.
- Right Skewed around the mean and mode.
- Increasing till 18-21 minute duration then decreasing.


## Performing Your Own Analysis

So far, we've performed an initial exploration into the data available. We have compared the relative volume of trips made between three U.S. cities and the ratio of trips made by Subscribers and Customers. For one of these cities, we have investigated differences between Subscribers and Customers in terms of how long a typical trip lasts. Now we will continue the exploration in a direction not specified in the given dataset. Here we make a few suggestions for questions to explore:

- How does ridership differ by month or season? Which month / season has the highest ridership? Does the ratio of Subscriber trips to Customer trips change depending on the month or season?
- Is the pattern of ridership different on the weekends versus weekdays? On what days are Subscribers most likely to use the system? What about Customers? Does the average duration of rides change depending on the day of the week?
- During what time of day is the system used the most? Is there a difference in usage patterns for Subscribers and Customers?

We will continue the investigation by exploring another question that could be answered by the data available. We will document a question below. Our investigation should involve at least two variables and should compare at least two groups. We should also use at least one visualization as part of our explorations.

Which month of year has more demand for bikes in each city?

In [None]:
def month_wise_duration(filename):
    with open(filename, 'r') as f_in:
    # set up csv reader object
        reader = csv.DictReader(f_in)
        month_duration={1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0}
    # iterate over file row-wise    
        for row in reader:
            month_duration[int(row['month'])]+=1
    return month_duration

In [None]:
#declaring dictionary to store the information
data1={}
city=['Chicago','New York City','Washington']
l=0
for filename in data_file:
    data1=month_wise_duration(filename)
    dictionary = plt.figure()
    plt.bar(range(len(data1)), data1.values(), align='center')
    plt.title('Month wise total trip duration for {}'.format(city[l]))
    plt.xlabel('Month')
    plt.ylabel('Total Duration (min)')
    plt.xticks(range(len(data1)), data1.keys())
    l+=1

- In Chicago city, the demand starts increasing is in the month of June and in the month of July in reaches it's peak and by the start of August, it's start decreasing again.
- In NYC, we can clearly see that demand starts increasing from August and in September it reaches peak and then gradually starts decreasing.
- In Wahinsgton city, the demand mainly increases after year-mid and reaches it's peak in June. The demand remains almost same but after October, the demand starts decreasing very rapidly. 


## Conclusions

This is only a sampling of the data analysis process: from generating questions, wrangling the data, and to exploring the data. Normally, at this point in the data analysis process, one might want to draw conclusions about the data by performing a statistical test or fitting the data to a model for making predictions. There are also a lot of potential analyses that could be performed on the data which are not possible with only the data provided. For example, detailed location data has not been investigated. Where are the most commonly used docks? What are the most common routes? As another example, weather has potential to have a large impact on daily ridership. How much is ridership impacted when there is rain or snow? Are subscribers or customers affected more by changes in weather?