# Austin B-Cycle Data Cleanup and Exploration

### Introduction 

This notebook cleans up and explores the publicly available data of Austin B-Cycle, a bike-sharing program. It examines inconsistencies in the columns caused by missing data, duplicates, typos, and other anomalies. The result is a reorganized CSV file with data ready for analysis. 

### Data Extraction 

* The Austin B-Cycle data comes from the City of Austin Open Data Portal (AODP). We obtained the most current dataset through the portal's API, the Socrata Open Data API (SODA API), which hosts the AODP datasets. 

* As of early September 2018, the dataset contains 991271 rows, one per ride. It runs from Austin B-Cycle's inception in December 2013 through July 2018. 

* Data Provided 
    * Trip ID 
    * Membership Type 
    * Bicycle ID 
    * Checkout Time 
    * Checkout Kiosk ID
    * Checkout Kiosk
    * Return Kiosk ID
    * Trip Duration Minutes
    * Month
    * Year


##  Dependencies and API


* AODP Dataset Access: https://data.austintexas.gov/Transportation-and-Mobility/Austin-B-Cycle-Trips/tyfh-5r8s
* API Endpoint:  https://data.austintexas.gov/resource/cwi3-ckqi.json
* API Documentation: https://dev.socrata.com/foundry/data.austintexas.gov/cwi3-ckqi

To access the dataset host, first install sodapy, a Python client for the SODA API: 

> pip install sodapy

The script works without a token and password because the Austin B-Cycle data is a public dataset. An individual API token and password may be created to avoid throttling limits. To create them, visit https://data.austintexas.gov/profile/app_tokens and enter the credentials in the commented-out authenticated client in the code below. 

In [1]:
# Uncomment the command below if sodapy is not currently installed in your libraries 
#!pip install sodapy

In [2]:
# Import Dependencies
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import csv
import requests
from pprint import pprint
from sodapy import Socrata

# Ignore warnings, as we are rewriting values 
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password
client = Socrata("data.austintexas.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.austintexas.gov",
#                  "MyAppToken",
#                  username="user@example.com",
#                  password="AFakePassword")


# First 991271 results, returned as JSON from the API and converted to a Python
# list of dictionaries by sodapy.
# The limit parameter sets how many rows the response returns.
# When the portal publishes updated data, the limit will need to be increased.
results = client.get("cwi3-ckqi", limit= 991271)

# Convert to pandas DataFrame
df_bike= pd.DataFrame.from_records(results)

# Requests without an app token are allowed but are subject to much lower throttling limits than requests that include one. 
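
The hard-coded `limit` above has to be raised by hand whenever the portal publishes new rows. The following is a minimal paging sketch, assuming the endpoint honors the standard SODA `limit` and `offset` parameters, which sodapy's `client.get` accepts as keyword arguments. `fetch_all_rows` and `fake_get_page` are hypothetical helper names introduced here, and the stand-in function replaces the real API so the sketch runs offline.

```python
def fetch_all_rows(get_page, page_size=50000):
    """Collect rows page by page until a short (or empty) page comes back."""
    rows = []
    offset = 0
    while True:
        page = get_page(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        offset += page_size
    return rows


def fake_get_page(limit, offset):
    """Hypothetical stand-in for the API: serves slices of 7 fake rows."""
    data = [{"trip_id": str(i)} for i in range(7)]
    return data[offset:offset + limit]


all_rows = fetch_all_rows(fake_get_page, page_size=3)
print(len(all_rows))
```

With a live client, the same helper could be called as `fetch_all_rows(lambda limit, offset: client.get("cwi3-ckqi", limit=limit, offset=offset, order="trip_id"))`. Ordering by a stable column is meant to keep the pages consistent between requests.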



In [4]:
# Display the top rows of the dataframe 
df_bike.head()


Unnamed: 0,bicycle_id,checkout_date,checkout_kiosk,checkout_kiosk_id,checkout_time,membership_type,month,return_kiosk,return_kiosk_id,trip_duration_minutes,trip_id,year
0,207,2014-10-26T00:00:00.000,West & 6th St.,2537.0,13:12:00,Annual (San Antonio B-cycle),10,Rainey St @ Cummings,2707.0,76,9900285854,2014
1,969,2014-10-26T00:00:00.000,Convention Center / 4th St. @ MetroRail,2498.0,13:12:00,24-Hour Kiosk (Austin B-cycle),10,Pfluger Bridge @ W 2nd Street,2566.0,58,9900285855,2014
2,214,2014-10-26T00:00:00.000,West & 6th St.,2537.0,13:12:00,Annual Membership (Austin B-cycle),10,8th & Congress,2496.0,8,9900285856,2014
3,745,2014-10-26T00:00:00.000,Zilker Park at Barton Springs & William Barton...,,13:12:00,24-Hour Kiosk (Austin B-cycle),10,Zilker Park at Barton Springs & William Barton...,,28,9900285857,2014
4,164,2014-10-26T00:00:00.000,Bullock Museum @ Congress & MLK,2538.0,13:12:00,24-Hour Kiosk (Austin B-cycle),10,Convention Center/ 3rd & Trinity,,15,9900285858,2014


## Initial Data Exploration 

An initial exploration of the data shows that several columns are missing values. This does not necessarily mean the information is absent; in some cases it can be extracted from other columns and reformatted. The API returns every field as a string, which pandas stores with the `object` dtype. Currently, there are 991271 rows and 12 columns. A small number of bicycle_id values are missing as well; the columns with the most gaps are: 

* month 
* year 
* membership_type 
* checkout_kiosk_id 
* return_kiosk_id  
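
As a compact alternative to counting each column separately, the missing values can be summarized in one table. This is a minimal sketch on a toy frame (the `toy` data and the `missing_summary` name are made up for illustration); on the real data the same two expressions would be applied to `df_bike`.

```python
import pandas as pd

# Toy frame standing in for df_bike; None marks a value the API did not return
toy = pd.DataFrame({
    "trip_id": ["1", "2", "3", "4"],
    "month": ["10", None, None, "7"],
    "checkout_kiosk_id": ["2537", None, "2498", "2497"],
})

# One row per column: absolute count and share of missing values
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps
missing_summary = missing_summary[missing_summary["missing"] > 0]
print(missing_summary)
```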

In [5]:
# Count rows and columns 
df_bike.shape

(991271, 12)

In [6]:
# Check for missing values 
df_bike.count()

bicycle_id               990548
checkout_date            991271
checkout_kiosk           991271
checkout_kiosk_id        972240
checkout_time            991271
membership_type          984960
month                    618479
return_kiosk             991271
return_kiosk_id          971487
trip_duration_minutes    991271
trip_id                  991271
year                     618479
dtype: int64

In [7]:
# We see that the last values of our dataframe are missing the month and year
df_bike.tail()

Unnamed: 0,bicycle_id,checkout_date,checkout_kiosk,checkout_kiosk_id,checkout_time,membership_type,month,return_kiosk,return_kiosk_id,trip_duration_minutes,trip_id,year
991266,1524,2018-07-31T00:00:00.000,Capitol Station / Congress & 11th,2497,23:24:32,U.T. Student Membership,,City Hall / Lavaca & 2nd,2499,29,18122309,
991267,95,2018-07-31T00:00:00.000,Long Center @ South 1st & Riverside,2549,23:29:21,Local365+Guest Pass,,East 6th & Pedernales St.,2544,18,18122312,
991268,252,2018-07-31T00:00:00.000,Rio Grande & 28th,3793,23:39:16,U.T. Student Membership,,22nd & Pearl,3792,4,18122330,
991269,2228,2018-07-31T00:00:00.000,Dean Keeton & Whitis,3795,23:42:50,U.T. Student Membership,,Nueces & 26th,3838,2,18122340,
991270,576,2018-07-31T00:00:00.000,Rio Grande & 28th,3793,23:44:32,U.T. Student Membership,,Rio Grande & 28th,3793,4,18122349,


In [8]:
# Count how many rows are missing year, month, membership_type, bicycle_id 
missing_year = df_bike["year"].isnull().sum()
missing_month = df_bike["month"].isnull().sum()
missing_membership_type = df_bike["membership_type"].isnull().sum()
missing_bike_id = df_bike["bicycle_id"].isnull().sum()

# Create summary of the missing values 
print(f"There are {missing_year} missing year values.")
print(f"There are {missing_month} missing month values.")
print(f"There are {missing_membership_type} missing membership type values.")
print(f"There are {missing_bike_id} missing bike id values.")

There are 372792 missing year values.
There are 372792 missing month values.
There are 6311 missing membership type values.
There are 723 missing bike id values.


In [9]:
# Verify the data types
df_bike.dtypes

bicycle_id               object
checkout_date            object
checkout_kiosk           object
checkout_kiosk_id        object
checkout_time            object
membership_type          object
month                    object
return_kiosk             object
return_kiosk_id          object
trip_duration_minutes    object
trip_id                  object
year                     object
dtype: object

## Initial Data Clean Up

In this section we address the missing values found in the initial exploration. 

* First, we use the checkout date to extract the year, month, day of the month, and day of the week. This is done with pandas `to_datetime` and the `.dt` accessor. 

* Next, we split the hour from the checkout time using an anonymous (lambda) function. 

* We also change the data type of trip duration from object to integer. After this change, the data types are ready for use in the analysis notebook. 

* Finally, we rename the columns before exploring the remaining columns with missing and conflicting values: 

    * membership_type
    * checkout_kiosk_id
    * return_kiosk_id
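
The steps above can be sketched on a two-row toy frame. This is an illustration under assumptions, not the notebook's exact code: the `toy` frame and the lowercase derived column names are made up, and the hour is cast to an integer here, whereas the cells below keep it as a string.

```python
import pandas as pd

# Two rows in the same string format the API returns
toy = pd.DataFrame({
    "checkout_date": ["2014-10-26T00:00:00.000", "2018-07-31T00:00:00.000"],
    "checkout_time": ["13:12:00", "23:24:32"],
    "trip_duration_minutes": ["76", "29"],
})

# Parse the date once, then derive the calendar parts from it
toy["checkout_date"] = pd.to_datetime(toy["checkout_date"])
toy["year"] = toy["checkout_date"].dt.year
toy["month"] = toy["checkout_date"].dt.month
toy["day_of_week"] = toy["checkout_date"].dt.day_name()

# Vectorized string split instead of a row-by-row lambda
toy["hour"] = toy["checkout_time"].str.split(":").str[0].astype(int)

# Strings to integers so durations can be summed and averaged
toy["trip_duration_minutes"] = toy["trip_duration_minutes"].astype(int)
print(toy.dtypes)
```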

In [10]:
# Work on a copy so the raw data frame stays unchanged 
df_bike_clean = df_bike.copy()

In [11]:
# Take the checkout date and extract the year, month, day of the month, and day of the week 

df_bike_clean['checkout_date'] = pd.to_datetime(df_bike_clean['checkout_date']) 
df_bike_clean['year'] = df_bike_clean['checkout_date'].dt.year
df_bike_clean['month'] = df_bike_clean['checkout_date'].dt.month
df_bike_clean['Trip Date'] = df_bike_clean['checkout_date'].dt.day
# day_name() replaces the weekday_name attribute, which was removed in pandas 1.0
df_bike_clean['Trip Day of Week'] = df_bike_clean['checkout_date'].dt.day_name()

In [12]:
# Inspect the filled in values 
df_bike_clean.tail()

Unnamed: 0,bicycle_id,checkout_date,checkout_kiosk,checkout_kiosk_id,checkout_time,membership_type,month,return_kiosk,return_kiosk_id,trip_duration_minutes,trip_id,year,Trip Date,Trip Day of Week
991266,1524,2018-07-31,Capitol Station / Congress & 11th,2497,23:24:32,U.T. Student Membership,7,City Hall / Lavaca & 2nd,2499,29,18122309,2018,31,Tuesday
991267,95,2018-07-31,Long Center @ South 1st & Riverside,2549,23:29:21,Local365+Guest Pass,7,East 6th & Pedernales St.,2544,18,18122312,2018,31,Tuesday
991268,252,2018-07-31,Rio Grande & 28th,3793,23:39:16,U.T. Student Membership,7,22nd & Pearl,3792,4,18122330,2018,31,Tuesday
991269,2228,2018-07-31,Dean Keeton & Whitis,3795,23:42:50,U.T. Student Membership,7,Nueces & 26th,3838,2,18122340,2018,31,Tuesday
991270,576,2018-07-31,Rio Grande & 28th,3793,23:44:32,U.T. Student Membership,7,Rio Grande & 28th,3793,4,18122349,2018,31,Tuesday


In [13]:
# Split the hour from the checkout time
df_bike_clean['Trip Hour'] = df_bike_clean['checkout_time'].apply(lambda x: x.split(":")[0])

In [14]:
# Inspect the filled in values 
df_bike_clean.head()

Unnamed: 0,bicycle_id,checkout_date,checkout_kiosk,checkout_kiosk_id,checkout_time,membership_type,month,return_kiosk,return_kiosk_id,trip_duration_minutes,trip_id,year,Trip Date,Trip Day of Week,Trip Hour
0,207,2014-10-26,West & 6th St.,2537.0,13:12:00,Annual (San Antonio B-cycle),10,Rainey St @ Cummings,2707.0,76,9900285854,2014,26,Sunday,13
1,969,2014-10-26,Convention Center / 4th St. @ MetroRail,2498.0,13:12:00,24-Hour Kiosk (Austin B-cycle),10,Pfluger Bridge @ W 2nd Street,2566.0,58,9900285855,2014,26,Sunday,13
2,214,2014-10-26,West & 6th St.,2537.0,13:12:00,Annual Membership (Austin B-cycle),10,8th & Congress,2496.0,8,9900285856,2014,26,Sunday,13
3,745,2014-10-26,Zilker Park at Barton Springs & William Barton...,,13:12:00,24-Hour Kiosk (Austin B-cycle),10,Zilker Park at Barton Springs & William Barton...,,28,9900285857,2014,26,Sunday,13
4,164,2014-10-26,Bullock Museum @ Congress & MLK,2538.0,13:12:00,24-Hour Kiosk (Austin B-cycle),10,Convention Center/ 3rd & Trinity,,15,9900285858,2014,26,Sunday,13


In [15]:
# Convert trip duration minutes from object to integer
df_bike_change1 = df_bike_clean["trip_duration_minutes"].astype(int)
df_bike_clean["trip_duration_minutes"] = df_bike_change1

In [16]:
# Verify Data types
df_bike_clean.dtypes

bicycle_id                       object
checkout_date            datetime64[ns]
checkout_kiosk                   object
checkout_kiosk_id                object
checkout_time                    object
membership_type                  object
month                             int64
return_kiosk                     object
return_kiosk_id                  object
trip_duration_minutes             int32
trip_id                          object
year                              int64
Trip Date                         int64
Trip Day of Week                 object
Trip Hour                        object
dtype: object

In [17]:
# Rename the columns 

df_bike_clean = df_bike_clean.rename(columns = {
    "bicycle_id": "Bicycle ID",
    "checkout_date": "Checkout Date",
    "checkout_kiosk": "Checkout Station",
    "checkout_kiosk_id": "Checkout Station ID",
    "checkout_time": "Checkout Time",
    "membership_type": "Membership Type",
    "month": "Trip Month",
    "return_kiosk":"Return Station",
    "return_kiosk_id": "Return Station ID",
    "trip_duration_minutes": "Trip Duration Minutes",
    "trip_id": "Trip ID",
    "year":"Trip Year"
                                         })

df_bike_clean.head(1) 

Unnamed: 0,Bicycle ID,Checkout Date,Checkout Station,Checkout Station ID,Checkout Time,Membership Type,Trip Month,Return Station,Return Station ID,Trip Duration Minutes,Trip ID,Trip Year,Trip Date,Trip Day of Week,Trip Hour
0,207,2014-10-26,West & 6th St.,2537,13:12:00,Annual (San Antonio B-cycle),10,Rainey St @ Cummings,2707,76,9900285854,2014,26,Sunday,13


## Trip Duration Column Exploration

Although the trip duration column has no missing values, it is important to understand why some trips last zero minutes. Below is a summary of the trips whose bikes were reported as stolen or missing, and of the trips with zero minutes traveled. These records are valid, but quantifying them helps gauge the potential margin of error in the dataset. 

In [18]:
# Count the trips whose bike was reported as stolen
# These trips have unusually large trip durations
df_bike_stolen = df_bike_clean.loc[df_bike_clean["Return Station"] == "Stolen"]
number_bike_stolen = df_bike_stolen["Return Station"].count()


# Count the trips whose bike was reported as missing
# These trips have unusually large trip durations
df_bike_missing = df_bike_clean.loc[df_bike_clean["Return Station"] == "Missing"]
number_bike_missing = df_bike_missing["Return Station"].count()


# Count the trips with a trip duration of zero minutes
df_bike_trip_minutes_zero = df_bike_clean.loc[df_bike_clean["Trip Duration Minutes"] == 0]
number_bike_trip_minutes_zero = df_bike_trip_minutes_zero["Trip ID"].count()

# Summary of bike trips with distinct data 
print(f"There are {number_bike_stolen} bikes reported as stolen.")
print(f"There are {number_bike_missing} bikes reported as missing.")
print(f"There are {number_bike_trip_minutes_zero} bikes that had a trip duration of zero.")

There are 23 bikes reported as stolen.
There are 25 bikes reported as missing.
There are 19033 bikes that had a trip duration of zero.
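
To put these counts in proportion, they can be expressed as a share of all 991271 rows. The sketch below only redoes the arithmetic with the figures printed above; the variable names are illustrative.

```python
# Figures taken from the outputs above
total_trips = 991271
counts = {"stolen": 23, "missing": 25, "zero-minute": 19033}

# Share of all trips, in percent
shares = {label: n / total_trips * 100 for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.4f}% of all trips")
```

Zero-minute trips make up roughly 1.9% of the rows, while stolen and missing bikes together account for well under 0.01%.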


##  Station ID and Station Name Columns Exploration and Cleanup 

A significant number of values are missing in the Checkout and Return Station ID columns, although every station name is available. In this section, we seek to understand and quantify the remaining missing values. 

* The missing values are summarized below. The only column with missing data that will be used in the analysis notebook is membership_type. For the station ID columns, we fill the missing values with zero so that we can iterate over them and identify which stations they belong to. 
* Besides missing values, we find cases where more than one ID is assigned to the same station name, and where two station names share one ID.
* Ultimately, we make a new data frame that reorganizes the column order and keeps only the columns used in the analysis section. 

In [19]:
# Check for missing values 
df_bike_clean.count()

Bicycle ID               990548
Checkout Date            991271
Checkout Station         991271
Checkout Station ID      972240
Checkout Time            991271
Membership Type          984960
Trip Month               991271
Return Station           991271
Return Station ID        971487
Trip Duration Minutes    991271
Trip ID                  991271
Trip Year                991271
Trip Date                991271
Trip Day of Week         991271
Trip Hour                991271
dtype: int64

In [20]:
# To check how many Checkout Station ID are blank
number_df_bike_checkout_id_blank  = df_bike_clean["Checkout Station ID"].isnull().sum()
print(f"There are {number_df_bike_checkout_id_blank} Checkout Station IDs that are blank.")

There are 19031 Checkout Station IDs that are blank.


In [21]:
# Filling the Na values with zero for exploration
df_bike_na = df_bike_clean.fillna(0)

In [22]:
# Find which checkout stations have a blank checkout ID (zero or #N/A)
df_bike_checkout_id_blank = df_bike_na.loc[(df_bike_na["Checkout Station ID"] == 0) | (df_bike_na["Checkout Station ID"] == "#N/A")]
df_bike_checkout_id_blank["Checkout Station"].value_counts()

Zilker Park at Barton Springs & William Barton Drive    11534
Dean Keeton & Speedway                                   3825
ACC - West & 12th                                        2462
Convention Center/ 3rd & Trinity                         1292
Mobile Station                                           1183
East 11th Street at Victory Grill                        1030
Red River @ LBJ Library                                   584
Mobile Station @ Bike Fest                                516
Main Office                                               300
Bullock Museum @ Congress & MLK                           172
State Capitol @ 14th & Colorado                           111
MapJam at Pan Am Park                                      32
MapJam at French Legation                                  27
MapJam at Hops & Grain Brewery                             19
Repair Shop                                                15
MapJam at Scoot Inn                                        11
Shop    

In [23]:
# List checkout station IDs
checkout_station_id_list =  df_bike_na["Checkout Station ID"].value_counts().index
checkout_station_id_list

# The index below lists every checkout station ID, including the placeholders 0 and '#N/A'

Index(['3798', '2575', '2499', '2494', '2501', '2707', '2495', '2498', '2563',
       '2497', '2566', '2552', '2548', '2549', '2567', '2574', '2711', '2502',
            0, '2503', '2547', '2570', '2539', '2572', '2496', '2504', '3841',
       '2537', '3792', '2542', '3377', '2565', '3390', '2571', '2538', '3793',
       '3838', '2550', '2569', '3794', '2562', '3795', '3513', '2540', '3797',
       '2822', '2564', '3619', '3621', '2561', '3799', '2536', '3455', '2544',
       '3292', '2568', '2541', '3687', '1007', '1008', '3291', '3684', '3293',
       '#N/A', '3686', '3660', '2712', '2823', '2576', '3294', '3685', '2546',
       '2545', '3635', '1006', '1002', '3464', '3790', '1003', '3381', '2500',
       '3791', '3456', '1005', '1001'],
      dtype='object')

In [24]:
# Number of stations which have no checkout ID
blank_stations = len(df_bike_checkout_id_blank["Checkout Station"].value_counts())

# Number of unique Checkout Stations
unique_checkout_stations = df_bike_clean["Checkout Station"].unique().size

# Number of stations with unique checkout station ids other than zero
unique_stations_nonnull = unique_checkout_stations - blank_stations 

# Number of unique checkout Station IDs
unique_checkout_id = df_bike_clean["Checkout Station ID"].unique().size
unique_checkout_id_not_zero = unique_checkout_id - 2
# Subtract 2 for the missing (NaN) and "#N/A" checkout IDs


# Summary Checkout Station ID findings 
print(f"There are {blank_stations} stations without a checkout ID.")
print(f"There are {unique_checkout_stations} unique checkout stations.")
print(f"There are {unique_stations_nonnull} stations with unique checkout station ID's other than zero.")
print(f"There are {unique_checkout_id_not_zero} unique checkout station ID.")

There are 24 stations without a checkout ID.
There are 104 unique checkout stations.
There are 80 stations with unique checkout station ID's other than zero.
There are 83 unique checkout station ID.


In [25]:
# There are more unique checkout station IDs (83) than checkout stations that have an ID (80)
# This implies that a few checkout stations have more than one checkout station ID

# Build a dictionary that maps each checkout station to the set of its checkout station IDs
Checkout_station_id = dict()
for index, row in df_bike_na.iterrows():
    if row['Checkout Station'] not in Checkout_station_id:
        Checkout_station_id[row['Checkout Station']] = set()
    # Add the ID on every row, including the first row seen for a station
    Checkout_station_id[row['Checkout Station']].add(row['Checkout Station ID'])

In [26]:
# Check which station has more than one checkout id
Checkout_id_check = Checkout_station_id
for key in Checkout_id_check:
    if len(Checkout_id_check[key]) > 1:
        print("{} has the following ids: {}".format(key, Checkout_id_check[key]))

Bullock Museum @ Congress & MLK has the following ids: {'#N/A', '2538'}
State Capitol @ 14th & Colorado has the following ids: {'#N/A', '2541'}
Main Office has the following ids: {0, '#N/A', '1001'}
Lavaca & 6th has the following ids: {'3294', '1007'}
Re-branding has the following ids: {0, '#N/A'}
Repair Shop has the following ids: {0, '#N/A'}
Republic Square @ 5th & Guadalupe has the following ids: {'3455', '3456'}
Dean Keeton & Speedway  has the following ids: {'#N/A', '3794'}
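
Row-by-row iteration over nearly a million rows is slow. A vectorized alternative, sketched here on a toy frame, drops the placeholder IDs first and then counts the distinct IDs per station with `groupby(...).nunique()`. The toy rows and the `placeholders` list are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy rows: one station with two real IDs, one with a real ID plus a gap
toy = pd.DataFrame({
    "Checkout Station": ["Lavaca & 6th", "Lavaca & 6th", "Main Office",
                         "Main Office", "8th & Congress"],
    "Checkout Station ID": ["3294", "1007", "1001", None, "2496"],
})

# Values that stand for "no ID" and should not count as a second ID
placeholders = ["#N/A"]
has_real_id = (toy["Checkout Station ID"].notna()
               & ~toy["Checkout Station ID"].isin(placeholders))
valid = toy[has_real_id]

# Number of distinct IDs per station, then keep stations with more than one
ids_per_station = valid.groupby("Checkout Station")["Checkout Station ID"].nunique()
multi_id_stations = ids_per_station[ids_per_station > 1]
print(multi_id_stations)
```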


In [27]:
# Check if there is an overlap of checkout station ID :

# Remove stations with zero checkout station ID 
for key in Checkout_id_check:
    if 0 in Checkout_id_check[key]:
        Checkout_id_check[key].remove(0)

# Remove stations with "#N/A" checkout station ID 
for key in Checkout_id_check:
    if "#N/A" in Checkout_id_check[key]:
        Checkout_id_check[key].remove("#N/A")
        
# Check for overlapping of station ID
for key1 in Checkout_id_check:
    for key2 in Checkout_id_check:
        if key1 == key2:
            continue
        intersect = Checkout_id_check[key1].intersection(Checkout_id_check[key2])
        if len(intersect) > 0:
            print("{} and {} share id: {}".format(key1, key2, intersect))

Lavaca & 6th and Guadalupe & 6th share id: {'3294'}
Republic Square @ Federal Courthouse Plaza and Republic Square @ 5th & Guadalupe share id: {'3455'}
Guadalupe & 6th and Lavaca & 6th share id: {'3294'}
Republic Square @ 5th & Guadalupe and Republic Square @ Federal Courthouse Plaza share id: {'3455'}
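
The pairwise comparison above reports every overlap twice, once in each direction. One way to list each shared ID only once is to invert the mapping so that each ID points to the stations that use it. This is a minimal sketch: the `station_ids` dictionary is a hand-written excerpt for illustration, not the full mapping built by the notebook.

```python
from collections import defaultdict

# Illustrative excerpt of a station -> IDs mapping (placeholders already removed)
station_ids = {
    "Lavaca & 6th": {"3294", "1007"},
    "Guadalupe & 6th": {"3294"},
    "Republic Square @ 5th & Guadalupe": {"3455", "3456"},
    "Republic Square @ Federal Courthouse Plaza": {"3455"},
    "8th & Congress": {"2496"},
}

# Invert the mapping: ID -> set of station names that carry it
id_stations = defaultdict(set)
for station, ids in station_ids.items():
    for station_id in ids:
        id_stations[station_id].add(station)

# An ID is shared when more than one station name maps to it
shared_ids = {sid: sorted(names)
              for sid, names in id_stations.items() if len(names) > 1}

for sid, names in shared_ids.items():
    print(f"ID {sid} is shared by: {', '.join(names)}")
```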


## Reorganized and Updated Data Frame

After the initial exploration of the data and initial clean up, we have a better grasp of the data. The columns that were missing significant data and had anomalies are accounted for. 

In the end, we won't need the bicycle ID, checkout station ID, or return station ID columns for our analysis. We therefore keep only the most complete subset of the data, which is the subset used for analysis.

In [28]:
# Organize columns and take columns with relevant data which we will use for analysis
df_bike_clean[["Trip ID", "Membership Type", "Checkout Date","Checkout Time", "Checkout Station", "Return Station", "Trip Duration Minutes", "Trip Month", "Trip Year","Trip Date", "Trip Day of Week", "Trip Hour"]] 

Unnamed: 0,Trip ID,Membership Type,Checkout Date,Checkout Time,Checkout Station,Return Station,Trip Duration Minutes,Trip Month,Trip Year,Trip Date,Trip Day of Week,Trip Hour
0,9900285854,Annual (San Antonio B-cycle),2014-10-26,13:12:00,West & 6th St.,Rainey St @ Cummings,76,10,2014,26,Sunday,13
1,9900285855,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Convention Center / 4th St. @ MetroRail,Pfluger Bridge @ W 2nd Street,58,10,2014,26,Sunday,13
2,9900285856,Annual Membership (Austin B-cycle),2014-10-26,13:12:00,West & 6th St.,8th & Congress,8,10,2014,26,Sunday,13
3,9900285857,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Zilker Park at Barton Springs & William Barton...,Zilker Park at Barton Springs & William Barton...,28,10,2014,26,Sunday,13
4,9900285858,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Bullock Museum @ Congress & MLK,Convention Center/ 3rd & Trinity,15,10,2014,26,Sunday,13
5,9900285859,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Zilker Park at Barton Springs & William Barton...,ACC - Rio Grande & 12th,26,10,2014,26,Sunday,13
6,9900285860,Annual Membership (Austin B-cycle),2014-10-26,13:12:00,8th & Congress,State Capitol Visitors Garage @ San Jacinto & ...,35,10,2014,26,Sunday,13
7,9900285861,Annual Membership (Austin B-cycle),2014-10-26,13:12:00,East 11th St. & San Marcos,City Hall / Lavaca & 2nd,11,10,2014,26,Sunday,13
8,9900285862,Annual Membership (Austin B-cycle),2014-10-26,13:12:00,8th & Congress,8th & Congress,0,10,2014,26,Sunday,13
9,9900285863,24-Hour Kiosk (Austin B-cycle),2014-10-26,13:12:00,Zilker Park at Barton Springs & William Barton...,ACC - Rio Grande & 12th,25,10,2014,26,Sunday,13


## Cleaning and Grouping Membership Data 

Although the date-derived columns are now complete, further cleaning needs to happen before we can successfully analyze the data. 

In this last part, we take a closer look at the types of memberships. Some memberships have similar names and should be categorized together, for example U.T. Student Membership and UT Student Membership. It is also more helpful to group the memberships by duration: day, weekend, week, month, year, and 3 year, plus a separate student category.
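
One way to keep such a grouping maintainable is to write it as category-to-names lists and flatten it into a single lookup, which also makes it easy to spot names that were never assigned a category. This is a hedged sketch with a handful of membership names; the `groups`, `lookup`, and `unmapped` names are illustrative, and the notebook itself uses one `replace` call per category.

```python
import pandas as pd

# Category -> raw membership names (a small illustrative subset)
groups = {
    "day": ["Walk Up", "24-Hour Kiosk (Austin B-cycle)", "Explorer"],
    "student": ["U.T. Student Membership", "UT Student Membership"],
    "year": ["Local365", "Annual Membership (Austin B-cycle)"],
}

# Flatten into raw name -> category
lookup = {name: group for group, names in groups.items() for name in names}

memberships = pd.Series(
    ["Walk Up", "UT Student Membership", "Local365", "RESTRICTED", None]
)
grouped = memberships.replace(lookup)

# Any value that is not a category name was missed by the mapping
unmapped = sorted(set(grouped.dropna()) - set(groups))
print(unmapped)
```

Checking `unmapped` after each run is a cheap way to notice new membership names when the portal publishes updated data.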



In [29]:
df_bike_clean["Membership Type"].value_counts()

Walk Up                                          368322
Local365                                         167363
U.T. Student Membership                          158480
24-Hour Kiosk (Austin B-cycle)                   108672
Local30                                           54774
Weekender                                         43880
Annual Membership (Austin B-cycle)                30306
Explorer                                          14860
Local365+Guest Pass                               10331
Local365 ($80 plus tax)                            4005
Founding Member                                    3550
7-Day                                              3137
Founding Member (Austin B-cycle)                   2764
7-Day Membership (Austin B-cycle)                  2760
Semester Membership (Austin B-cycle)               2426
Annual                                             1087
Semester Membership                                 900
Local30 ($11 plus tax)                          

In [30]:
# Examine Prohibited and Restricted
test = df_bike_clean.loc[df_bike_clean["Membership Type"] == "RESTRICTED", :]
#test = df_bike_clean.loc[df_bike_clean["Membership Type"] == "PROHIBITED", :]
test.head()

Unnamed: 0,Bicycle ID,Checkout Date,Checkout Station,Checkout Station ID,Checkout Time,Membership Type,Trip Month,Return Station,Return Station ID,Trip Duration Minutes,Trip ID,Trip Year,Trip Date,Trip Day of Week,Trip Hour
326221,158,2016-01-20,Capitol Station / Congress & 11th,2497,15:35:09,RESTRICTED,1,Capitol Station / Congress & 11th,2497,0,8482474,2016,20,Wednesday,15
329490,421,2016-01-11,Capitol Station / Congress & 11th,2497,11:30:08,RESTRICTED,1,Capitol Station / Congress & 11th,2497,8,8368765,2016,11,Monday,11
329491,407,2016-01-11,Capitol Station / Congress & 11th,2497,14:12:09,RESTRICTED,1,Capitol Station / Congress & 11th,2497,1,8370726,2016,11,Monday,14
329492,407,2016-01-11,Capitol Station / Congress & 11th,2497,14:13:32,RESTRICTED,1,Capitol Station / Congress & 11th,2497,0,8370745,2016,11,Monday,14
329493,407,2016-01-11,Capitol Station / Congress & 11th,2497,14:13:50,RESTRICTED,1,Capitol Station / Congress & 11th,2497,1,8370747,2016,11,Monday,14


In [31]:
# Group all single-day memberships under "day" 
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {"24-Hour Kiosk (Austin B-cycle)": "day",
     "24-Hour-Online (Austin B-cycle)": "day",
     "24-Hour Membership (Austin B-cycle)": "day",
    "Explorer": "day", 
    "Explorer ($8 plus tax)":"day",
    "Walk Up": "day",
    "Try Before You Buy Special": "day",
    "RideScout Single Ride": "day", 
    "Aluminum Access":"day"})

# Replace all weekend membership == weekend 
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Weekender": "weekend", 
        "Weekender ($15 plus tax)": "weekend", 
        "ACL Weekend Pass Special (Austin B-cycle)": "weekend", 
        "FunFunFun Fest 3 Day Pass": "weekend"
    })

# Group all weekly memberships under "week"
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "7-Day": "week", 
        "7-Day Membership (Austin B-cycle)": "week", 
    })


# Group all monthly memberships under "month"
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Local30": "month", 
        "Local30 ($11 plus tax)": "month",
        "Madtown Monthly":"month", 
    })


# Combine all student memberships
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "U.T. Student Membership": "student",
        "UT Student Membership": "student", 
        "Semester Membership (Austin B-cycle)":"student", 
        "Semester Membership": "student"
    })

# Replace all annual membership == year
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Annual Membership (Austin B-cycle)": "year",
         "Annual Member": "year",
         "Annual Membership":"year",
         "Annual (San Antonio B-cycle)": "year",
         "Annual Member (Houston B-cycle)":"year",
         "Annual Membership (Fort Worth Bike Sharing)":"year",
         "Annual (Denver B-cycle)":"year",
         "Republic Rider (Annual)":"year",
         "Republic Rider": "year",
         "Annual Plus":"year",
         "Annual (Madison B-cycle)":"year",
         "Annual (Broward B-cycle)":"year",
         "Annual (Denver Bike Sharing)":"year",
         "Annual (Boulder B-cycle)":"year",
         "Annual Membership (GREENbike)":"year",
         "Annual Pass":"year",
         "Annual (Kansas City B-cycle)":"year",
         "Annual (Cincy Red Bike)":"year",
         "Annual (Nashville B-cycle)":"year",
         "Annual Plus Membership":"year",
         "Annual Membership (Charlotte B-cycle)":"year",
         "Annual Membership (Indy - Pacers Bikeshare )":"year",
         "Annual (Omaha B-cycle)":"year",
         "Annual":"year",
         "Annual ": "year",
         "Local365": "year", 
         "Local365+Guest Pass":"year",
         "Local365 ($80 plus tax)": "year",
         "Local365 Youth with helmet (age 13-17 riders)": "year", 
         "Local365 Youth (age 13-17 riders)":"year",
         "Membership: pay once  one-year commitment":"year"
        
    })

# Replace all founding membership == 3 year
df_bike_clean["Membership Type"] = df_bike_clean["Membership Type"].replace(
    {
        "Founding Member": "3 year",
        "Founding Member (Austin B-cycle)": "3 year",
        "Denver B-cycle Founder": "3 year"
    })


In [32]:
# Create a new data frame that does not include restricted and prohibited as per observation 
bike_trips= df_bike_clean.loc[(df_bike_clean["Membership Type"] != "RESTRICTED") & (df_bike_clean["Membership Type"] != "PROHIBITED"), :]

In [33]:
# Verify clean up 
bike_trips["Membership Type"].value_counts()

day        493728
year       216726
student    161815
month       55650
weekend     44802
3 year       6324
week         5897
Name: Membership Type, dtype: int64
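
As a sanity check, the category counts above can be reconciled with the earlier totals. The sketch below only redoes the arithmetic with figures already printed in this notebook. Reading the difference as the excluded RESTRICTED and PROHIBITED rows is an inference, not something the outputs state directly.

```python
# Figures copied from the value_counts() output above
category_counts = {
    "day": 493728, "year": 216726, "student": 161815, "month": 55650,
    "weekend": 44802, "3 year": 6324, "week": 5897,
}
non_null_memberships = 984960  # Membership Type count from the earlier count() output
total_rows = 991271            # rows in the full data frame

categorized = sum(category_counts.values())
excluded_rows = non_null_memberships - categorized
still_missing = total_rows - non_null_memberships

print(f"Categorized rows: {categorized}")
print(f"Non-null rows not in any category: {excluded_rows}")
print(f"Rows with no membership type: {still_missing}")
```

Note that rows with a missing membership type pass the `!=` filter and therefore remain in `bike_trips`; `value_counts()` simply omits them by default.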

In [34]:
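# Export to csv (a forward slash in the path works on every operating system)
# Note: df_bike_clean still contains the RESTRICTED and PROHIBITED rows; export bike_trips to leave them out
df_bike_clean.to_csv("Clean_Data/out.csv", index=False)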
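
The export fails if the `Clean_Data` folder does not exist yet. A minimal sketch of a more defensive export, assuming `pathlib` is acceptable here: it creates the folder first and reads the file back to confirm the round trip. The toy frame and the temporary directory are stand-ins so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data
toy = pd.DataFrame({"Trip ID": ["1", "2"], "Membership Type": ["day", "year"]})

# A temporary base directory keeps the example side-effect free
base_dir = Path(tempfile.mkdtemp())
out_path = base_dir / "Clean_Data" / "out.csv"

# Create the output folder if it is missing, then write without the index
out_path.parent.mkdir(parents=True, exist_ok=True)
toy.to_csv(out_path, index=False)

# Read the file back as strings to confirm what was written
round_trip = pd.read_csv(out_path, dtype=str)
print(round_trip.shape)
```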