# Analysis of the San Francisco Bay Bike-Sharing System 2017
## by Muhammad Adipurna Kusumawardana

## Preliminary Wrangling

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.
https://www.lyft.com/bikes/bay-wheels/system-data

The Data
Each trip is anonymized and includes:

* Trip Duration (seconds)
* Start Time and Date
* End Time and Date
* Start Station ID
* Start Station Name
* Start Station Latitude
* Start Station Longitude
* End Station ID
* End Station Name
* End Station Latitude
* End Station Longitude
* Bike ID
* User Type (Subscriber or Customer – “Subscriber” = Member or “Customer” = Casual)

In [1]:
# import all packages and set plots to be embedded inline
import re 
import glob
import zipfile
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

%matplotlib inline

In [2]:
# Make a list containing all the zip-file names
zip_list = glob.glob('./data/*.zip')

In [3]:
# Extract every zip archive into ./data/ using the zipfile library
for zip_file in zip_list:
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall("./data/")

In [4]:
csv_list = glob.glob('./data/*.csv')
csv_list      

['./data\\2017-fordgobike-tripdata.csv',
 './data\\201801-fordgobike-tripdata.csv',
 './data\\201802-fordgobike-tripdata.csv',
 './data\\201803-fordgobike-tripdata.csv',
 './data\\201804-fordgobike-tripdata.csv',
 './data\\201805-fordgobike-tripdata.csv',
 './data\\201806-fordgobike-tripdata.csv',
 './data\\201807-fordgobike-tripdata.csv',
 './data\\201808-fordgobike-tripdata.csv',
 './data\\201809-fordgobike-tripdata.csv',
 './data\\201810-fordgobike-tripdata.csv',
 './data\\201811-fordgobike-tripdata.csv',
 './data\\201812-fordgobike-tripdata.csv',
 './data\\201901-fordgobike-tripdata.csv',
 './data\\201902-fordgobike-tripdata.csv',
 './data\\201903-fordgobike-tripdata.csv',
 './data\\201904-fordgobike-tripdata.csv',
 './data\\201905-baywheels-tripdata.csv',
 './data\\201906-baywheels-tripdata.csv',
 './data\\201907-baywheels-tripdata.csv',
 './data\\201908-baywheels-tripdata.csv',
 './data\\201909-baywheels-tripdata.csv',
 './data\\201910-baywheels-tripdata.csv',
 './data\\201911-ba

In [5]:
df_2017 = pd.read_csv('./data/2017-fordgobike-tripdata.csv')
df_2017

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type
0,80110,2017-12-31 16:57:39.6540,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,37.776435,-122.426244,43,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96,Customer
1,78800,2017-12-31 15:56:34.8420,2018-01-01 13:49:55.6170,284,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,96,Dolores St at 15th St,37.766210,-122.426614,88,Customer
2,45768,2017-12-31 22:45:48.4110,2018-01-01 11:28:36.8830,245,Downtown Berkeley BART,37.870348,-122.267764,245,Downtown Berkeley BART,37.870348,-122.267764,1094,Customer
3,62172,2017-12-31 17:31:10.6360,2018-01-01 10:47:23.5310,60,8th St at Ringold St,37.774520,-122.409449,5,Powell St BART Station (Market St at 5th St),37.783899,-122.408445,2831,Customer
4,43603,2017-12-31 14:23:14.0010,2018-01-01 02:29:57.5710,239,Bancroft Way at Telegraph Ave,37.868813,-122.258764,247,Fulton St at Bancroft Way,37.867789,-122.265896,3167,Subscriber
...,...,...,...,...,...,...,...,...,...,...,...,...,...
519695,435,2017-06-28 10:00:54.5280,2017-06-28 10:08:10.4380,81,Berry St at 4th St,37.775880,-122.393170,45,5th St at Howard St,37.781752,-122.405127,400,Subscriber
519696,431,2017-06-28 09:56:39.6310,2017-06-28 10:03:51.0900,66,3rd St at Townsend St,37.778742,-122.392741,321,5th at Folsom,37.780146,-122.403071,316,Subscriber
519697,424,2017-06-28 09:47:36.3470,2017-06-28 09:54:41.1870,21,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,48,2nd St at S Park St,37.782411,-122.392706,240,Subscriber
519698,366,2017-06-28 09:47:41.6640,2017-06-28 09:53:47.7150,58,Market St at 10th St,37.776619,-122.417385,59,S Van Ness Ave at Market St,37.774814,-122.418954,669,Subscriber


In [6]:
# Check column dtype for each column
df_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 519700 entries, 0 to 519699
Data columns (total 13 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             519700 non-null  int64  
 1   start_time               519700 non-null  object 
 2   end_time                 519700 non-null  object 
 3   start_station_id         519700 non-null  int64  
 4   start_station_name       519700 non-null  object 
 5   start_station_latitude   519700 non-null  float64
 6   start_station_longitude  519700 non-null  float64
 7   end_station_id           519700 non-null  int64  
 8   end_station_name         519700 non-null  object 
 9   end_station_latitude     519700 non-null  float64
 10  end_station_longitude    519700 non-null  float64
 11  bike_id                  519700 non-null  int64  
 12  user_type                519700 non-null  object 
dtypes: float64(4), int64(4), object(5)
memory usage: 51.5+ MB


In [7]:
# Check null value for each column
df_2017.isna().sum()

duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
dtype: int64

In [8]:
# Check duplicated row
df_2017.duplicated().sum()

0
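
One further consistency check that could be run at this stage is whether `duration_sec` agrees with the gap between `start_time` and `end_time`. The sketch below uses two rows copied from the `In [5]` output and assumes that `duration_sec` is the elapsed time rounded down to whole seconds. That assumption holds for these two rows but is not stated in the source data description. The names `sample`, `elapsed_sec`, and `inconsistent` are introduced here for illustration only.

```python
import pandas as pd

# Two rows copied from the df_2017 preview (first and 519695th row)
sample = pd.DataFrame({
    'duration_sec': [80110, 435],
    'start_time': ['2017-12-31 16:57:39.6540', '2017-06-28 10:00:54.5280'],
    'end_time': ['2018-01-01 15:12:50.2450', '2017-06-28 10:08:10.4380'],
})

# Elapsed time between start and end of each trip
elapsed = pd.to_datetime(sample['end_time']) - pd.to_datetime(sample['start_time'])

# Whole seconds elapsed (assumption: the fractional part is dropped)
elapsed_sec = elapsed // pd.Timedelta(seconds=1)

# Rows whose recorded duration differs from the computed one
inconsistent = sample[elapsed_sec != sample['duration_sec']]
print(len(inconsistent))
```

On the full dataframe, the same comparison would flag any trips whose recorded duration does not match their timestamps.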

### Data Assessment Result

#### Quality Issues

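* `start_time` and `end_time` are stored as `object` (string) dtype rather than datetime
* `start_station_id` and `end_station_id` are stored as `int64`, although they are identifiers rather than quantities
* `start_station_latitude`, `start_station_longitude`, `end_station_latitude`, and `end_station_longitude` are stored as `float64` and are converted to `object` in the cleaning step
* `bike_id` is stored as `int64`, although it is an identifier rather than a quantity
* `user_type` is stored as `object`, although it holds only two values (Subscriber, Customer) and suits the `category` dtype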
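
The issues above could also be listed programmatically by comparing each column's current dtype with a target dtype. The sketch below is a minimal illustration on a two-row stand-in frame. The helper `find_dtype_mismatches` and the `target_dtypes` mapping are hypothetical names made up for this example and are not part of the original notebook.

```python
import pandas as pd


def find_dtype_mismatches(df, target_dtypes):
    """Return {column: (current dtype, target dtype)} for columns that differ."""
    mismatches = {}
    for column, target in target_dtypes.items():
        current = str(df[column].dtype)
        if current != target:
            mismatches[column] = (current, target)
    return mismatches


# Tiny stand-in for df_2017 (values copied from two rows of the preview)
sample = pd.DataFrame({
    'duration_sec': [435, 431],
    'start_time': ['2017-06-28 10:00:54.5280', '2017-06-28 09:56:39.6310'],
    'bike_id': [400, 316],
    'user_type': ['Subscriber', 'Subscriber'],
})

# Hypothetical target dtypes, mirroring the cleaning plan
target_dtypes = {
    'duration_sec': 'int64',
    'start_time': 'datetime64[ns]',
    'bike_id': 'object',
    'user_type': 'category',
}

mismatches = find_dtype_mismatches(sample, target_dtypes)
for column, (current, target) in mismatches.items():
    print(f'{column}: {current} -> {target}')
```

Only columns whose dtype differs from the target are reported, so `duration_sec` does not appear in the result.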

## Data Cleaning

The programmatic data cleaning process:

* Define
* Code
* Test

For this dataframe, we will change each column's dtype to match its values.  
As always, we copy the dataframe before any cleaning step so that we can refer back to the original.
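
The Define, Code, Test cycle can be sketched on a small stand-in frame as follows. The frame `raw` and its two rows are illustrative assumptions, and `pd.to_datetime` is used here as one possible way to do the conversion; it is not necessarily the call the notebook uses.

```python
import pandas as pd

# Stand-in for the raw dataframe (two illustrative rows)
raw = pd.DataFrame({
    'start_time': ['2017-06-28 10:00:54.5280', '2017-06-28 09:56:39.6310'],
    'user_type': ['Subscriber', 'Customer'],
})

# Define: store start_time as datetime and user_type as category

# Code: work on a copy so that the raw frame stays untouched
clean = raw.copy()
clean['start_time'] = pd.to_datetime(clean['start_time'])
clean['user_type'] = clean['user_type'].astype('category')

# Test: the cleaned copy has the new dtypes, the raw frame does not
assert pd.api.types.is_datetime64_any_dtype(clean['start_time'])
assert isinstance(clean['user_type'].dtype, pd.CategoricalDtype)
assert not pd.api.types.is_datetime64_any_dtype(raw['start_time'])
```

Writing the test as assertions rather than reading `info()` by eye makes the check repeatable when the cleaning code changes.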

In [9]:
# Define: Make a new copy before doing any operation, so we can refer back to the old ones.
df_2017_clean = df_2017.copy()

In [10]:
# Define: Change each column's dtype based on its values

# Code
# Use the explicit 'datetime64[ns]' unit; recent pandas versions reject the unit-less 'datetime64'
dtype = {'start_time': 'datetime64[ns]',
         'end_time': 'datetime64[ns]',

         'start_station_id': 'object',
         'end_station_id': 'object',

         'start_station_latitude': 'object',
         'start_station_longitude': 'object',
         'end_station_latitude': 'object',
         'end_station_longitude': 'object',

         'bike_id': 'object',
         'user_type': 'category'}

df_2017_clean = df_2017_clean.astype(dtype)

# Test
df_2017_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 519700 entries, 0 to 519699
Data columns (total 13 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             519700 non-null  int64         
 1   start_time               519700 non-null  datetime64[ns]
 2   end_time                 519700 non-null  datetime64[ns]
 3   start_station_id         519700 non-null  object        
 4   start_station_name       519700 non-null  object        
 5   start_station_latitude   519700 non-null  object        
 6   start_station_longitude  519700 non-null  object        
 7   end_station_id           519700 non-null  object        
 8   end_station_name         519700 non-null  object        
 9   end_station_latitude     519700 non-null  object        
 10  end_station_longitude    519700 non-null  object        
 11  bike_id                  519700 non-null  object        
 12  user_type       


In [11]:
df_2017_clean

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type
0,80110,2017-12-31 16:57:39.654,2018-01-01 15:12:50.245,74,Laguna St at Hayes St,37.7764,-122.426,43,San Francisco Public Library (Grove St at Hyde...,37.7788,-122.416,96,Customer
1,78800,2017-12-31 15:56:34.842,2018-01-01 13:49:55.617,284,Yerba Buena Center for the Arts (Howard St at ...,37.7849,-122.401,96,Dolores St at 15th St,37.7662,-122.427,88,Customer
2,45768,2017-12-31 22:45:48.411,2018-01-01 11:28:36.883,245,Downtown Berkeley BART,37.8703,-122.268,245,Downtown Berkeley BART,37.8703,-122.268,1094,Customer
3,62172,2017-12-31 17:31:10.636,2018-01-01 10:47:23.531,60,8th St at Ringold St,37.7745,-122.409,5,Powell St BART Station (Market St at 5th St),37.7839,-122.408,2831,Customer
4,43603,2017-12-31 14:23:14.001,2018-01-01 02:29:57.571,239,Bancroft Way at Telegraph Ave,37.8688,-122.259,247,Fulton St at Bancroft Way,37.8678,-122.266,3167,Subscriber
...,...,...,...,...,...,...,...,...,...,...,...,...,...
519695,435,2017-06-28 10:00:54.528,2017-06-28 10:08:10.438,81,Berry St at 4th St,37.7759,-122.393,45,5th St at Howard St,37.7818,-122.405,400,Subscriber
519696,431,2017-06-28 09:56:39.631,2017-06-28 10:03:51.090,66,3rd St at Townsend St,37.7787,-122.393,321,5th at Folsom,37.7801,-122.403,316,Subscriber
519697,424,2017-06-28 09:47:36.347,2017-06-28 09:54:41.187,21,Montgomery St BART Station (Market St at 2nd St),37.7896,-122.401,48,2nd St at S Park St,37.7824,-122.393,240,Subscriber
519698,366,2017-06-28 09:47:41.664,2017-06-28 09:53:47.715,58,Market St at 10th St,37.7766,-122.417,59,S Van Ness Ave at Market St,37.7748,-122.419,669,Subscriber


### What is the structure of your dataset?

> The 2017 dataset consists of 519700 rows × 13 columns.

### What is/are the main feature(s) of interest in your dataset?

> The main features of interest in this dataset are the `duration_sec` and `start_time` columns.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Given the main features above, I think the features that will help our investigation are station name / location, bike ID, and user type. We may also be able to extract month, day, and hour features from `start_time`.
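
Extracting month, day, and hour from `start_time` could look like the sketch below, which relies on the pandas `.dt` accessor. The frame `trips` holds two timestamps copied from the preview. The new column names `start_month`, `start_day`, and `start_hour` are assumptions chosen for this example.

```python
import pandas as pd

# Two start times copied from the dataframe preview
trips = pd.DataFrame({
    'start_time': pd.to_datetime([
        '2017-12-31 16:57:39.654',
        '2017-06-28 10:00:54.528',
    ]),
})

# Derive calendar features from the datetime column
trips['start_month'] = trips['start_time'].dt.month      # 1 to 12
trips['start_day'] = trips['start_time'].dt.day_name()   # weekday name
trips['start_hour'] = trips['start_time'].dt.hour        # 0 to 23

print(trips)
```

The `.dt` accessor only works once the column has a datetime dtype, which is why the dtype conversion in the cleaning step comes first.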

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, you
include a Markdown cell with comments about what you observed and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
