## Dependencies and Read CSV

* Most recent CSV taken from https://data.austintexas.gov/Transportation-and-Mobility/Austin-B-Cycle-Trips/tyfh-5r8s 

* Data Available
    * Trip ID
    * Membership Type
    * Bicycle ID
    * Checkout Date
    * Checkout Time
    * Checkout Kiosk ID
    * Checkout Kiosk
    * Return Kiosk ID
    * Return Kiosk
    * Trip Duration Minutes
    * Month
    * Year

In [1]:
# Import Dependencies
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

# Ignore warnings, as we are rewriting values
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read the CSV file with pandas and create a dataframe
df_bike = pd.read_csv("Data/austin_B-Cycle_Trips.csv")


In [3]:
# Display the top rows of the dataframe 
df_bike.head()


Unnamed: 0,Trip ID,Membership Type,Bicycle ID,Checkout Date,Checkout Time,Checkout Kiosk ID,Checkout Kiosk,Return Kiosk ID,Return Kiosk,Trip Duration Minutes,Month,Year
0,9900285854,Annual (San Antonio B-cycle),207.0,10/26/2014,13:12:00,2537.0,West & 6th St.,2707.0,Rainey St @ Cummings,76,10.0,2014.0
1,9900285855,24-Hour Kiosk (Austin B-cycle),969.0,10/26/2014,13:12:00,2498.0,Convention Center / 4th St. @ MetroRail,2566.0,Pfluger Bridge @ W 2nd Street,58,10.0,2014.0
2,9900285856,Annual Membership (Austin B-cycle),214.0,10/26/2014,13:12:00,2537.0,West & 6th St.,2496.0,8th & Congress,8,10.0,2014.0
3,9900285857,24-Hour Kiosk (Austin B-cycle),745.0,10/26/2014,13:12:00,,Zilker Park at Barton Springs & William Barton...,,Zilker Park at Barton Springs & William Barton...,28,10.0,2014.0
4,9900285858,24-Hour Kiosk (Austin B-cycle),164.0,10/26/2014,13:12:00,2538.0,Bullock Museum @ Congress & MLK,,Convention Center/ 3rd & Trinity,15,10.0,2014.0


## Initial Data Exploration 

An initial look at the data shows that the Month, Year, Membership Type, Bicycle ID, Return Kiosk ID, and Checkout Kiosk ID columns have missing values. This does not necessarily mean the information is unavailable: some of it, such as the month and year, can be extracted from other columns and reformatted.
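As a minimal sketch of how missing values can be counted directly, the snippet below uses `isna().sum()` on a small made-up frame (the `sample` data is illustrative, not taken from the CSV). It gives the number of missing entries per column, which is the complement of what `count()` reports.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a few gaps
sample = pd.DataFrame({
    "Checkout Date": ["10/26/2014", "07/20/2018"],
    "Checkout Kiosk ID": [2537.0, np.nan],
    "Month": [10.0, np.nan],
})

# Number of missing entries per column
missing = sample.isna().sum()

# Fraction of rows that are missing per column
share = sample.isna().mean()

print(missing)
print(share)
```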

In [4]:
# Count rows and columns 
df_bike.shape

(991271, 12)

In [5]:
# Check for missing values 
df_bike.count()

Trip ID                  991271
Membership Type          984960
Bicycle ID               990548
Checkout Date            991271
Checkout Time            991271
Checkout Kiosk ID        968117
Checkout Kiosk           991271
Return Kiosk ID          966858
Return Kiosk             991271
Trip Duration Minutes    991271
Month                    618479
Year                     618479
dtype: int64

In [6]:
# We see that the last values of our dataframe are missing the month and year
df_bike.tail()

Unnamed: 0,Trip ID,Membership Type,Bicycle ID,Checkout Date,Checkout Time,Checkout Kiosk ID,Checkout Kiosk,Return Kiosk ID,Return Kiosk,Trip Duration Minutes,Month,Year
991266,18013687,Local30,2828.0,07/20/2018,20:24:42,2566.0,Pfluger Bridge @ W 2nd Street,3685.0,Henderson & 9th,6,,
991267,18048048,Local30,447.0,07/24/2018,15:12:54,2707.0,Rainey St @ Cummings,2552.0,3rd & West,7,,
991268,17988798,U.T. Student Membership,472.0,07/18/2018,9:59:11,3838.0,Nueces & 26th,3798.0,21st & Speedway @PCL,6,,
991269,17902063,U.T. Student Membership,220.0,07/09/2018,11:50:26,2547.0,Guadalupe & 21st,3792.0,22nd & Pearl,2,,
991270,18007218,Local30,666.0,07/20/2018,8:38:40,3621.0,Nueces & 3rd,2495.0,4th & Congress,5,,


In [7]:
# Check missing membership type count
df_bike["Membership Type"].isnull().sum()

6311

In [8]:
# Check missing bicycle id count
df_bike["Bicycle ID"].isnull().sum()

723

In [9]:
# Check the data types 
df_bike.dtypes

Trip ID                    int64
Membership Type           object
Bicycle ID               float64
Checkout Date             object
Checkout Time             object
Checkout Kiosk ID        float64
Checkout Kiosk            object
Return Kiosk ID          float64
Return Kiosk              object
Trip Duration Minutes      int64
Month                    float64
Year                     float64
dtype: object

## Initial Data Clean Up

In this section we create a new data frame with only the values we need.

1. Kiosk IDs are missing for some rows, but the kiosk names are always available, so we keep the name columns and do not need to drop the rows with missing kiosk IDs.

2. We exclude the bicycle ID from the new data frame, since it will not contribute to the analysis.

3. We extract the year, month, and day from the checkout date to fill in the missing values in the Month and Year columns.

4. We extract the hour from the checkout time for future analysis.
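Steps 3 and 4 can be tried on a couple of rows before running them on the full data. The following is a minimal sketch, assuming the `MM/DD/YYYY` date and `H:MM:SS` time formats seen in the rows above. The `sample` frame is made up for illustration, and the hour is taken by splitting the string rather than by parsing the time.

```python
import pandas as pd

# Two illustrative rows in the same format as the raw CSV
sample = pd.DataFrame({
    "Checkout Date": ["10/26/2014", "07/18/2018"],
    "Checkout Time": ["13:12:00", "9:59:11"],
})

# An explicit format avoids ambiguous day/month guessing
dates = pd.to_datetime(sample["Checkout Date"], format="%m/%d/%Y")
sample["Year"] = dates.dt.year
sample["Month"] = dates.dt.month
sample["Trip Date"] = dates.dt.day
sample["Trip Day of Week"] = dates.dt.day_name()

# The hour is the text before the first colon
sample["Hour"] = sample["Checkout Time"].str.split(":").str[0].astype(int)

print(sample)
```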



In [72]:
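# Keep only the columns needed for the analysis; .copy() makes an independent
# frame, so the assignments below do not act on a view of df_bike
df_bike_clean = df_bike[["Trip ID", "Membership Type", "Checkout Date","Checkout Time", "Checkout Kiosk", "Return Kiosk", "Trip Duration Minutes", "Month", "Year"]].copy()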

In [73]:
# Every trip has a checkout date, so Month and Year can be rebuilt from it
df_bike_clean.count()

Trip ID                  991271
Membership Type          984960
Checkout Date            991271
Checkout Time            991271
Checkout Kiosk           991271
Return Kiosk             991271
Trip Duration Minutes    991271
Month                    618479
Year                     618479
dtype: int64

In [74]:
# Take the Checkout Date and extract the Year, Month, Day of the Month, and Day of the Week

df_bike_clean['Checkout Date'] = pd.to_datetime(df_bike_clean['Checkout Date'])
df_bike_clean['Year'] = df_bike_clean['Checkout Date'].dt.year
df_bike_clean['Month'] = df_bike_clean['Checkout Date'].dt.month
df_bike_clean['Trip Date'] = df_bike_clean['Checkout Date'].dt.day
# .dt.day_name() replaces .dt.weekday_name, which was removed in pandas 1.0
df_bike_clean['Trip Day of Week'] = df_bike_clean['Checkout Date'].dt.day_name()



In [75]:
# Inspect the filled in values 
df_bike_clean.tail()

Unnamed: 0,Trip ID,Membership Type,Checkout Date,Checkout Time,Checkout Kiosk,Return Kiosk,Trip Duration Minutes,Month,Year,Trip Date,Trip Day of Week
991266,18013687,Local30,2018-07-20,20:24:42,Pfluger Bridge @ W 2nd Street,Henderson & 9th,6,7,2018,20,Friday
991267,18048048,Local30,2018-07-24,15:12:54,Rainey St @ Cummings,3rd & West,7,7,2018,24,Tuesday
991268,17988798,U.T. Student Membership,2018-07-18,9:59:11,Nueces & 26th,21st & Speedway @PCL,6,7,2018,18,Wednesday
991269,17902063,U.T. Student Membership,2018-07-09,11:50:26,Guadalupe & 21st,22nd & Pearl,2,7,2018,9,Monday
991270,18007218,Local30,2018-07-20,8:38:40,Nueces & 3rd,4th & Congress,5,7,2018,20,Friday


In [76]:
# Split the hour from the checkout time 
Checkout_Time = pd.to_datetime(df_bike_clean['Checkout Time'])
df_bike_clean['Checkout Time'] = Checkout_Time.dt.time
df_bike_clean['Hour'] = Checkout_Time.dt.hour


In [77]:
# Inspect the filled in values 
df_bike_clean.head()

Unnamed: 0,Trip ID,Membership Type,Checkout Date,Checkout Time,Checkout Kiosk,Return Kiosk,Trip Duration Minutes,Month,Year,Trip Date,Trip Day of Week,Hour
0,9900285854,Annual (San Antonio B-cycle),2014-10-26,13:12:00,West & 6th St.,Rainey St @ Cummings,76,10,2014,26,Sunday,13
1,9900285855,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Convention Center / 4th St. @ MetroRail,Pfluger Bridge @ W 2nd Street,58,10,2014,26,Sunday,13
2,9900285856,Annual Membership (Austin B-cycle),2014-10-26,13:12:00,West & 6th St.,8th & Congress,8,10,2014,26,Sunday,13
3,9900285857,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Zilker Park at Barton Springs & William Barton...,Zilker Park at Barton Springs & William Barton...,28,10,2014,26,Sunday,13
4,9900285858,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Bullock Museum @ Congress & MLK,Convention Center/ 3rd & Trinity,15,10,2014,26,Sunday,13


In [78]:
# Check for missing values 
df_bike_clean.count()

Trip ID                  991271
Membership Type          984960
Checkout Date            991271
Checkout Time            991271
Checkout Kiosk           991271
Return Kiosk             991271
Trip Duration Minutes    991271
Month                    991271
Year                     991271
Trip Date                991271
Trip Day of Week         991271
Hour                     991271
dtype: int64

In [79]:
# Verify Data types
df_bike_clean.dtypes

Trip ID                           int64
Membership Type                  object
Checkout Date            datetime64[ns]
Checkout Time                    object
Checkout Kiosk                   object
Return Kiosk                     object
Trip Duration Minutes             int64
Month                             int64
Year                              int64
Trip Date                         int64
Trip Day of Week                 object
Hour                              int64
dtype: object

## Cleaning Check out ID and Trip Duration 
Based on Tinku's notebook
* To be added.

## Cleaning membership data 

While there are missing values in the membership data, we want to take a closer look to understand the types of memberships. Several memberships have similar names and should be grouped together, for example U.T. Student Membership and UT Student Membership. It is also more useful to group the memberships into day, weekend, week, month, year, 3 year, and student categories.
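The grouping idea can be sketched on a handful of labels with a lookup table and `Series.replace`. The labels and the partial `categories` table below are illustrative only, and `"Some New Pass"` is a hypothetical label added to show how names without a mapping can be spotted before they slip through.

```python
import pandas as pd

# Illustrative labels only; the real column has many more variants
labels = pd.Series([
    "U.T. Student Membership",
    "UT Student Membership",
    "Walk Up",
    "Local365",
    "Some New Pass",  # hypothetical label with no mapping yet
])

# Partial lookup table: raw label -> category
categories = {
    "U.T. Student Membership": "student",
    "UT Student Membership": "student",
    "Walk Up": "day",
    "Local365": "year",
}

# Labels not covered by the lookup table
unmapped = sorted(set(labels.dropna()) - set(categories))

# Labels without a mapping are left unchanged by replace
grouped = labels.replace(categories)

print(unmapped)
print(grouped.value_counts())
```

Checking `unmapped` after each pass is one way to confirm that every raw label has been assigned to a category.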


In [80]:
df_bike_clean["Membership Type"].value_counts()

Walk Up                                          368322
Local365                                         167363
U.T. Student Membership                          158480
24-Hour Kiosk (Austin B-cycle)                   108672
Local30                                           54774
Weekender                                         43880
Annual Membership (Austin B-cycle)                30306
Explorer                                          14860
Local365+Guest Pass                               10331
Local365 ($80 plus tax)                            4005
Founding Member                                    3550
7-Day                                              3137
Founding Member (Austin B-cycle)                   2764
7-Day Membership (Austin B-cycle)                  2760
Semester Membership (Austin B-cycle)               2426
Annual                                             1087
Semester Membership                                 900
Local30 ($11 plus tax)                          

In [81]:
# Examine Prohibited and Restricted
test = df_bike_clean.loc[df_bike_clean["Membership Type"] == "RESTRICTED", :]
#test = df_bike_clean.loc[df_bike_clean["Membership Type"] == "PROHIBITED", :]
test.head()

Unnamed: 0,Trip ID,Membership Type,Checkout Date,Checkout Time,Checkout Kiosk,Return Kiosk,Trip Duration Minutes,Month,Year,Trip Date,Trip Day of Week,Hour
326221,8482474,RESTRICTED,2016-01-20,15:35:09,Capitol Station / Congress & 11th,Capitol Station / Congress & 11th,0,1,2016,20,Wednesday,15
329490,8368765,RESTRICTED,2016-01-11,11:30:08,Capitol Station / Congress & 11th,Capitol Station / Congress & 11th,8,1,2016,11,Monday,11
329491,8370726,RESTRICTED,2016-01-11,14:12:09,Capitol Station / Congress & 11th,Capitol Station / Congress & 11th,1,1,2016,11,Monday,14
329492,8370745,RESTRICTED,2016-01-11,14:13:32,Capitol Station / Congress & 11th,Capitol Station / Congress & 11th,0,1,2016,11,Monday,14
329493,8370747,RESTRICTED,2016-01-11,14:13:50,Capitol Station / Congress & 11th,Capitol Station / Congress & 11th,1,1,2016,11,Monday,14


In [82]:
# Replace all 24-hour with same name == day 
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {"24-Hour Kiosk (Austin B-cycle)": "day",
     "24-Hour-Online (Austin B-cycle)": "day",
     "24-Hour Membership (Austin B-cycle)": "day",
    "Explorer": "day", 
    "Explorer ($8 plus tax)":"day",
    "Walk Up": "day",
    "Try Before You Buy Special": "day",
    "RideScout Single Ride": "day", 
    "Aluminum Access":"day"})

# Replace all weekend membership == weekend 
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Weekender": "weekend", 
        "Weekender ($15 plus tax)": "weekend", 
        "ACL Weekend Pass Special (Austin B-cycle)": "weekend", 
        "FunFunFun Fest 3 Day Pass": "weekend"
    })

# Replace all 7-day memberships == week
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "7-Day": "week", 
        "7-Day Membership (Austin B-cycle)": "week", 
    })


# Replace all monthly memberships == month
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Local30": "month", 
        "Local30 ($11 plus tax)": "month",
        "Madtown Monthly":"month", 
    })


# Combine all student memberships
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "U.T. Student Membership": "student",
        "UT Student Membership": "student", 
        "Semester Membership (Austin B-cycle)":"student", 
        "Semester Membership": "student"
    })

# Replace all annual membership == year
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Annual Membership (Austin B-cycle)": "year",
         "Annual Member": "year",
         "Annual Membership":"year",
         "Annual (San Antonio B-cycle)": "year",
         "Annual Member (Houston B-cycle)":"year",
         "Annual Membership (Fort Worth Bike Sharing)":"year",
         "Annual (Denver B-cycle)":"year",
         "Republic Rider (Annual)":"year",
         "Republic Rider": "year",
         "Annual Plus":"year",
         "Annual (Madison B-cycle)":"year",
         "Annual (Broward B-cycle)":"year",
         "Annual (Denver Bike Sharing)":"year",
         "Annual (Boulder B-cycle)":"year",
         "Annual Membership (GREENbike)":"year",
         "Annual Pass":"year",
         "Annual (Kansas City B-cycle)":"year",
         "Annual (Cincy Red Bike)":"year",
         "Annual (Nashville B-cycle)":"year",
         "Annual Plus Membership":"year",
         "Annual Membership (Charlotte B-cycle)":"year",
         "Annual Membership (Indy - Pacers Bikeshare )":"year",
         "Annual (Omaha B-cycle)":"year",
         "Annual":"year",
         "Annual ": "year",
         "Local365": "year", 
         "Local365+Guest Pass":"year",
         "Local365 ($80 plus tax)": "year",
         "Local365 Youth with helmet (age 13-17 riders)": "year", 
         "Local365 Youth (age 13-17 riders)":"year",
         "Membership: pay once  one-year commitment":"year"
        
    })

# Replace all founding membership == 3 year
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Founding Member": "3 year",
        "Founding Member (Austin B-cycle)": "3 year",
        "Denver B-cycle Founder": "3 year"
    })


In [83]:
# Create a new data frame that does not include restricted and prohibited
bike_trips= df_bike_clean.loc[(df_bike_clean["Membership Type"] != "RESTRICTED") & (df_bike_clean["Membership Type"] != "PROHIBITED"), :]
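A comparison with `!=` is true for missing values, so a filter like the one above keeps rows whose membership type is NaN. The sketch below shows the same behavior with `isin` on a made-up frame; the `"unknown"` fill label is an assumption for illustration, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Illustrative frame: one valid row, two excluded labels, one missing value
trips = pd.DataFrame({
    "Trip ID": [1, 2, 3, 4],
    "Membership Type": ["day", "RESTRICTED", np.nan, "PROHIBITED"],
})

excluded = ["RESTRICTED", "PROHIBITED"]

# NaN is not in the excluded list, so the row with a missing label is kept
kept = trips.loc[~trips["Membership Type"].isin(excluded)].copy()

# Fill only the membership column, with a hypothetical text label
kept["Membership Type"] = kept["Membership Type"].fillna("unknown")

print(kept)
```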


In [84]:
# Verify clean up 
bike_trips["Membership Type"].value_counts()

day        493728
year       216726
student    161815
month       55650
weekend     44802
3 year       6324
week         5897
Name: Membership Type, dtype: int64

In [85]:
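# Fill the NA values
# Note: this starts again from df_bike_clean, so the RESTRICTED and PROHIBITED rows
# filtered out above are back in bike_trips (hence the full 991271 rows below);
# use bike_trips.fillna(0) instead to keep that filter
bike_trips = df_bike_clean.fillna(0)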
bike_trips.count()

Trip ID                  991271
Membership Type          991271
Checkout Date            991271
Checkout Time            991271
Checkout Kiosk           991271
Return Kiosk             991271
Trip Duration Minutes    991271
Month                    991271
Year                     991271
Trip Date                991271
Trip Day of Week         991271
Hour                     991271
dtype: int64

In [56]:
# Export to csv (a forward slash avoids the backslash being read as an escape character)
bike_trips.to_csv("Clean_Data/out.csv")