# FordGoBike_DataVisualization
This is an exploration of the FordGoBike dataset from San Francisco in 2018.   
Analysis and Visualization: Liz Herdter  
March 2019

**Objective**: Use visualizations to gain insight into trends in rider usage. Specifically, identify when and where rider usage is highest, and identify possible locations for additional bikes (or for removal of stations). 

1. Perform any necessary wrangling  
2. Explore the dataset visually
3. Gain insight about trends in rider usage

**About the Data**  
Ford GoBike is a bike share system in the San Francisco Bay Area. The program was piloted in 2013, and as of 2018 there were 7000 bikes in the Ford GoBike fleet spread across the Bay Area, East Bay, and San Jose. The bikes are locked into a network of docking stations around the city. They can be unlocked from one station and returned to any other station, making them ideal for one-way trips. The bikes are accessible 24/7/365. More about this program can be found [here](https://www.fordgobike.com/about). 

Sources used:  
https://stackoverflow.com/questions/41514173/change-multiple-columns-in-pandas-dataframe-to-datetime  
https://stackoverflow.com/questions/30405413/python-pandas-extract-year-from-datetime-dfyear-dfdate-year-is-not/33757291  
https://stackoverflow.com/questions/9847213/how-do-i-get-the-day-of-week-given-a-date-in-python

In [11]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import glob

%matplotlib inline

In [None]:
#define path to the data folder, relative to the Jupyter Notebook
path = r'Data'
all_files = glob.glob(path + "/*.csv")

# Join all dataframes together 
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)
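The loading step can also be wrapped in a small helper. The following is a minimal sketch, not the notebook's own code: the function name `load_trip_csvs` and the two tiny monthly files are made up for illustration, and it assumes every CSV in the folder shares the same columns.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_trip_csvs(data_dir):
    """Read every CSV in data_dir and stack the rows into one DataFrame."""
    files = sorted(Path(data_dir).glob("*.csv"))  # sorted for a stable row order
    if not files:
        raise FileNotFoundError(f"no CSV files found in {data_dir}")
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, axis=0, ignore_index=True)


# Demonstration on two made-up monthly files (hypothetical data)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "201801.csv").write_text("duration_sec,bike_id\n598,1035\n943,1673\n")
    Path(tmp, "201802.csv").write_text("duration_sec,bike_id\n885,1839\n")
    demo = load_trip_csvs(tmp)

print(demo.shape)
```

Raising an error on an empty folder is a design choice here: `pd.concat` on an empty list fails with a less helpful message.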

In [7]:
#explore shape and features within dataset
df.shape

(1863721, 16)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1863721 entries, 0 to 1863720
Data columns (total 16 columns):
duration_sec               int64
start_time                 object
end_time                   object
start_station_id           float64
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    int64
user_type                  object
member_birth_year          float64
member_gender              object
bike_share_for_all_trip    object
dtypes: float64(7), int64(2), object(7)
memory usage: 227.5+ MB


In [12]:
df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,598,2018-02-28 23:59:47.0970,2018-03-01 00:09:45.1870,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,114.0,Rhode Island St at 17th St,37.764478,-122.40257,1035,Subscriber,1988.0,Male,No
1,943,2018-02-28 23:21:16.4950,2018-02-28 23:36:59.9740,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,324.0,Union Square (Powell St at Post St),37.7883,-122.408531,1673,Customer,1987.0,Male,No
2,18587,2018-02-28 18:20:55.1900,2018-02-28 23:30:42.9250,93.0,4th St at Mission Bay Blvd S,37.770407,-122.391198,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,3498,Customer,1986.0,Female,No
3,18558,2018-02-28 18:20:53.6210,2018-02-28 23:30:12.4500,93.0,4th St at Mission Bay Blvd S,37.770407,-122.391198,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,3129,Customer,1981.0,Male,No
4,885,2018-02-28 23:15:12.8580,2018-02-28 23:29:58.6080,308.0,San Pedro Square,37.336802,-121.89409,297.0,Locust St at Grant St,37.32298,-121.887931,1839,Subscriber,1976.0,Female,Yes


In [9]:
df.isna().sum()

duration_sec                    0
start_time                      0
end_time                        0
start_station_id            11771
start_station_name          11771
start_station_latitude          0
start_station_longitude         0
end_station_id              11771
end_station_name            11771
end_station_latitude            0
end_station_longitude           0
bike_id                         0
user_type                       0
member_birth_year          110718
member_gender              110367
bike_share_for_all_trip         0
dtype: int64
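Raw counts are easier to judge as a share of all rows. From the counts above, 11771 / 1863721 is about 0.6% for the station columns, and 110718 / 1863721 is about 5.9% for `member_birth_year`. A minimal sketch of the same calculation on a made-up frame (`toy` is hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the trip table
toy = pd.DataFrame({
    "start_station_id": [284.0, np.nan, 93.0, 6.0],
    "member_birth_year": [1988.0, np.nan, np.nan, 1981.0],
    "bike_id": [1035, 1673, 3498, 3129],
})

# isna() gives booleans; their mean is the fraction of missing rows per column
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```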

In [10]:
df.duplicated().sum()

0

### Structure of the dataset

This dataset has 16 features and 1,863,721 records (nearly 2 million). Each record corresponds to a single trip made. 

### Main features of interest

This dataset can be used to explore the total number of rides at an hourly, daily, weekly, and monthly temporal resolution. It can also be used to learn about peak rides from each station. Other interesting features include the interaction between the number of rides made in each hour and the day of the week, as well as the average duration across hours in each day. Additionally, this dataset will provide information about which stations are most traveled to and from, and help identify areas where more bikes might be used or where new bikeshare stations could be placed. 


### Features to support investigation

This dataset contains a wealth of information that can be used to explore rider patterns. Specific features include duration_sec, start_time, end_time, start_station_id, end_station_id, and user_type. There are some missing records for start and end station id, but we can try to fill these in using stations whose latitude and longitude match those of the rows with missing ids. Age of riders (member_birth_year) may also be informative, but ~6% of the records are missing information for this feature, most likely because only a portion of the users are members. 
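The coordinate-matching idea can be sketched as a lookup table built from rows where the id is known, then merged back onto the rows where it is missing. This is only an illustration on made-up rows (`trips`, `lookup`, and `matched_id` are hypothetical names), and it assumes coordinates match exactly. Rows whose coordinates belong to no known station stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical trips: rows 1 and 3 lack a station id
trips = pd.DataFrame({
    "start_station_id": [93.0, np.nan, 6.0, np.nan],
    "start_station_latitude": [37.770407, 37.770407, 37.804770, 37.39],
    "start_station_longitude": [-122.391198, -122.391198, -122.403234, -121.96],
})
coords = ["start_station_latitude", "start_station_longitude"]

# One known station id per distinct coordinate pair
lookup = (trips.dropna(subset=["start_station_id"])
               .drop_duplicates(subset=coords)[coords + ["start_station_id"]]
               .rename(columns={"start_station_id": "matched_id"}))

# Left merge keeps every trip; fill only where the id was missing
filled = trips.merge(lookup, on=coords, how="left")
filled["start_station_id"] = filled["start_station_id"].fillna(filled["matched_id"])
filled = filled.drop(columns="matched_id")
print(filled)
```

In this toy example row 1 picks up id 93 from its matching coordinates, while row 3 remains missing because no known station shares its location.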





## Perform necessary wrangling

**Quality Issues**
1. start_time and end_time are stored as type object
2. start_station_id and end_station_id are stored as floats
3. bike_id is stored as an int
4. member_birth_year is stored as a float
5. Missing values for start and end station

**Structural Issues**  
Technically speaking, this dataset is already quite tidy. To explore rider preference on a temporal and spatial scale, however, new features derived from the start and end times will be needed. Duration will also need to be converted and binned. 

>start time
1. new column for hour
2. new column for day 
3. new column for month 
4. new column for day of week

>end time
1. new column for hour
2. new column for day 
3. new column for month 
4. new column for day of week


> duration
1. convert to minutes
2. cut bins 

**Deal with quality issues first**

In [57]:
df1 = df.copy()

1. Change start and end time to type datetime

In [58]:
df1[['start_time', 'end_time']]=df1[['start_time', 'end_time']].apply(pd.to_datetime)

2-4. Change start and end station id, bike id, and member_birth_year to type object.

In [59]:
df1[['start_station_id', 'end_station_id', 'bike_id', "member_birth_year"]]=df1[['start_station_id', 'end_station_id', 'bike_id', 'member_birth_year']].astype(object)
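As a possible alternative to `object`, pandas' nullable integer dtype `"Int64"` keeps ids and birth years as whole numbers while still allowing missing values. This is a sketch on made-up rows, not what the notebook does, and it assumes a pandas version with nullable integer support (0.24 or later) and float values that are all whole numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical columns stored as floats because of the NaNs
ids = pd.DataFrame({
    "start_station_id": [284.0, np.nan, 93.0],
    "member_birth_year": [1988.0, 1987.0, np.nan],
})

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>
as_int = ids.astype("Int64")
print(as_int.dtypes)
print(as_int)
```

Whether this is preferable depends on the later steps: `object` treats the ids as plain labels, while `"Int64"` drops the trailing `.0` and still supports numeric operations on birth year.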

5. Explore missing values for start and end station id 

In [60]:
#get unique lat longs to see which bikes are not recording station 
lat_long = df1.drop_duplicates(subset= ['start_station_latitude', 'start_station_longitude'])
lat_long.sample(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
315,278,2018-02-28 19:51:24.334,2018-02-28 19:56:03.297,336.0,Potrero Ave and Mariposa St,37.763281,-122.407377,124.0,19th St at Florida St,37.760447,-122.410807,3016,Subscriber,1984,Male,No
112298,335,2018-11-30 06:42:41.154,2018-11-30 06:48:16.342,271.0,San Pablo Park,37.855783,-122.283127,265.0,Ninth St at Parker St,37.858868,-122.291209,3597,Subscriber,1966,Male,No
478040,816,2018-06-23 13:45:35.985,2018-06-23 13:59:12.384,,,37.39,-121.96,,,37.38,-121.94,4240,Subscriber,1987,Male,No
106745,905,2018-11-30 23:29:52.456,2018-11-30 23:44:58.094,371.0,Lombard St at Columbus Ave,37.802746,-122.413579,17.0,Embarcadero BART Station (Beale St at Market St),37.792251,-122.397086,2464,Subscriber,1977,Male,No
106748,369,2018-11-30 23:36:17.314,2018-11-30 23:42:27.248,370.0,Jones St at Post St,37.787327,-122.413278,19.0,Post St at Kearny St,37.788975,-122.403452,2752,Subscriber,1992,Male,No


In [61]:
#which bikes are not recording
lat_long[lat_long.start_station_id.isna()].bike_id.value_counts()

4102    7
4147    3
4281    3
4250    3
4184    3
4289    2
4163    2
4165    2
4259    2
4127    2
4095    2
4105    2
4240    2
4099    2
4238    2
4111    2
4181    2
4193    2
4254    1
4144    1
4257    1
4243    1
4247    1
3975    1
4260    1
4263    1
4136    1
4425    1
4140    1
4270    1
4245    1
3769    1
4276    1
4120    1
4288    1
4110    1
4207    1
4132    1
4122    1
4202    1
4201    1
4196    1
4190    1
4185    1
4277    1
4179    1
4171    1
4297    1
4168    1
4295    1
4160    1
4284    1
4155    1
3758    1
4097    1
Name: bike_id, dtype: int64

In [62]:
#test
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1863721 entries, 0 to 1863720
Data columns (total 16 columns):
duration_sec               int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id           object
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             object
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bike_id                    object
user_type                  object
member_birth_year          object
member_gender              object
bike_share_for_all_trip    object
dtypes: datetime64[ns](2), float64(4), int64(1), object(9)
memory usage: 227.5+ MB


**Now deal with structural items**

In [63]:
#start time
#1. new column for hour
#2. new column for day 
#3. new column for month 
#4. new column for day of week

df1['start_hour'] = df1.start_time.dt.hour
df1['start_day'] = df1.start_time.dt.day
df1['start_month'] = df1.start_time.dt.month
df1['start_weekday'] = df1.start_time.dt.weekday


df1['end_hour'] = df1.end_time.dt.hour
df1['end_day'] = df1.end_time.dt.day
df1['end_month'] = df1.end_time.dt.month
df1['end_weekday'] = df1.end_time.dt.weekday
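`dt.weekday` returns integers with Monday as 0 and Sunday as 6. For plot labels, `dt.day_name()` gives the name directly. The sketch below uses two timestamps taken from the sample rows above and a loop (an illustrative choice, not the notebook's code) so that each new column is derived from its own source column. The first trip crosses midnight, so its start and end weekdays differ.

```python
import pandas as pd

times = pd.DataFrame({
    "start_time": pd.to_datetime(["2018-02-28 23:59:47.097", "2018-06-23 13:45:35.985"]),
    "end_time": pd.to_datetime(["2018-03-01 00:09:45.187", "2018-06-23 13:59:12.384"]),
})

# Derive the same features for both timestamps, each from its own column
for prefix in ("start", "end"):
    col = times[f"{prefix}_time"]
    times[f"{prefix}_hour"] = col.dt.hour
    times[f"{prefix}_weekday"] = col.dt.weekday          # Monday=0 ... Sunday=6
    times[f"{prefix}_weekday_name"] = col.dt.day_name()  # e.g. "Wednesday"

print(times[["start_weekday_name", "end_weekday_name"]])
```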

In [None]:
# duration
#1. convert to minutes
#2. cut bins 

In [75]:
df1['duration_mins'] = df1.duration_sec/60
df1['duration_hours']=df1.duration_sec/3600


In [83]:
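#cut bins: one-hour-wide edges, rounding the longest trip up so it always falls inside the last bin
bins = np.arange(0, int(np.ceil(df1.duration_hours.max()))+1, 1)
bins

#replace the numeric hours with the (left, right] interval each trip falls in
df1['duration_hours'] = pd.cut(df1['duration_hours'], bins)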

In [103]:
#check to see it worked
df1.loc[:, ['duration_sec', 'duration_mins', 'duration_hours']].sample(10)

Unnamed: 0,duration_sec,duration_mins,duration_hours
752212,395,6.583333,"(0, 1]"
632717,411,6.85,"(0, 1]"
107806,740,12.333333,"(0, 1]"
1052630,591,9.85,"(0, 1]"
745612,425,7.083333,"(0, 1]"
789026,1925,32.083333,"(0, 1]"
514280,4188,69.8,"(1, 2]"
38305,1268,21.133333,"(0, 1]"
1158817,124,2.066667,"(0, 1]"
889564,2075,34.583333,"(0, 1]"
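To make the binning behavior concrete, here is a small sketch on four durations taken from the rows shown in this notebook. `pd.cut` uses right-closed intervals by default, so a value of exactly 1.0 hours would land in `(0, 1]`. The variable names (`hours`, `edges`, `binned`) are illustrative only.

```python
import numpy as np
import pandas as pd

duration_sec = pd.Series([395, 1925, 4188, 18587])
hours = duration_sec / 3600

# Integer edges 0, 1, ..., ceil(max) so the longest trip is always covered
edges = np.arange(0, int(np.ceil(hours.max())) + 1)
binned = pd.cut(hours, bins=edges)

# sort=False keeps the bins in their natural order, including empty ones
print(binned.value_counts(sort=False))
```

Any value outside the edges (or exactly 0) would come back as `NaN`, which is worth checking with `binned.isna().sum()` after cutting.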


## Univariate Exploration


**Univariate:**
1. How many rides for each hour
2. How many rides for each month
3. How many rides on day of week 
4. How many rides from each start station
5. How many rides to each end station
6. Distribution of duration
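The count-based items in this list reduce to `value_counts()` on a single column. A minimal sketch on made-up rides follows (`rides` is hypothetical; the column names follow the ones created during wrangling). Seaborn's `countplot` can draw the same counts directly from the column.

```python
import pandas as pd

# Hypothetical rides with the derived start_hour column
rides = pd.DataFrame({
    "start_hour": [8, 8, 17, 17, 17, 23],
    "start_station_id": [93, 6, 93, 93, 284, 6],
})

# Rides per hour, ordered by hour rather than by frequency
rides_per_hour = rides["start_hour"].value_counts().sort_index()

# Rides per start station, busiest first (the value_counts default)
rides_per_station = rides["start_station_id"].value_counts()

print(rides_per_hour)
print(rides_per_station)
```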




### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration


**Bivariate**
1. Total number of rides vs hour in day
2. Total number of rides vs day of week 
3. Total number of rides vs month
4. Total number of rides per station

5. Average duration vs hour in day
6. Average duration vs day of week
7. Average duration vs month
8. Average duration per station

9. Ratio of subscriber to customer over month in the year  (make a ratio for each month)
10. Ratio of subscriber to customer over day of the week faceted by the month (make a ratio for each week)
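Two of the planned summaries, average duration per period and the subscriber-to-customer ratio, can be sketched with `groupby` and `pd.crosstab`. The rides below are made up for illustration, and the ratio assumes every month has at least one customer ride (otherwise the division yields `inf`).

```python
import pandas as pd

# Hypothetical rides with derived month and duration columns
rides = pd.DataFrame({
    "start_month": [1, 1, 1, 2, 2, 2],
    "user_type": ["Subscriber", "Subscriber", "Customer",
                  "Subscriber", "Customer", "Customer"],
    "duration_mins": [10.0, 12.0, 30.0, 8.0, 40.0, 20.0],
})

# Average duration per month
mean_duration = rides.groupby("start_month")["duration_mins"].mean()

# Ride counts per month and user type, then the ratio between the two types
counts = pd.crosstab(rides["start_month"], rides["user_type"])
ratio = counts["Subscriber"] / counts["Customer"]

print(mean_duration)
print(ratio)
```

Swapping `start_month` for `start_hour` or `start_weekday` gives the other temporal resolutions in the list.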



### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration


**Multivariate**
1. Total number of rides for each hour of the day faceted by day of the week
2. Total number of rides for each day of the week faceted by month 
3. Total number of rides by user type (subscribers vs customers ) per station 
4. Total number of rides by hour faceted by start_station and day of the week 
5. Average duration per station by day of the week 
6. Most frequent start and stop station (what is the path most traveled), most frequent station
7. By user type - total number of trips per start station by day of week (clustered bar chart) 
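For the faceted counts, one possible approach is to reshape the trips into an hour-by-weekday table first. That table can then feed a heatmap (for example seaborn's `heatmap`) or a grid of bar charts. The sketch below uses made-up rides; `hour_by_weekday` is an illustrative name.

```python
import pandas as pd

# Hypothetical rides with derived hour and weekday columns (Monday=0)
rides = pd.DataFrame({
    "start_hour": [8, 8, 8, 17, 17, 12],
    "start_weekday": [0, 0, 1, 0, 1, 5],
})

# Count rides for each (hour, weekday) pair, then spread weekdays into columns;
# combinations with no rides become 0 instead of NaN
hour_by_weekday = (rides.groupby(["start_hour", "start_weekday"])
                        .size()
                        .unstack(fill_value=0))

print(hour_by_weekday)
```

Adding `user_type` or `start_station_id` to the `groupby` keys extends the same pattern to the per-user-type and per-station items.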




### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
