# Ford GoBike System Data Exploration
## by Victoria Perez Mola

## Introduction 

The dataset chosen for this analysis is the [Bay Wheels (formerly Ford GoBike) System Data](https://www.lyft.com/bikes/bay-wheels/system-data).
It contains information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area.

## The Data

Each trip is anonymized and includes:

* Trip Duration (seconds)
* Start Time and Date
* End Time and Date
* Start Station ID
* Start Station Name
* Start Station Latitude
* Start Station Longitude
* End Station ID
* End Station Name
* End Station Latitude
* End Station Longitude
* Bike ID
* User Type (“Subscriber” = Member, “Customer” = Casual)

## Preliminary Wrangling

In [17]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from os import listdir

%matplotlib inline

Read all files from the Data folder and join them into one single dataset.

In [20]:
files_list = []
data_dir = 'Data'

# read and append each file in the folder
for file in listdir(data_dir):
    files_list.append(pd.read_csv(data_dir+'/'+file))
    
# assign the files data to a data frame    
df = pd.concat(files_list)
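The loop above reads every entry in the folder, so a stray non-CSV file would break it, and the combined frame keeps each file's own row index (which is why `df.info()` reports an index running only from 0 to 176798). A more defensive variant could look like the sketch below. The helper name `load_trip_files` is hypothetical, and the sketch assumes that all data files end in `.csv`.

```python
from pathlib import Path

import pandas as pd


def load_trip_files(data_dir):
    """Read every .csv file in data_dir and return one combined DataFrame."""
    # Only pick up CSV files; sorting makes the row order reproducible
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"no .csv files found in {data_dir!r}")

    frames = [pd.read_csv(path, low_memory=False) for path in csv_paths]

    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    # instead of repeating each file's own row numbers
    return pd.concat(frames, ignore_index=True)
```

With this helper, the cell above would reduce to `df = load_trip_files('Data')`. Note that the index reported by `df.info()` would then differ from the output shown in this notebook.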

In [23]:
# Check dataframe
df.sample(3)

| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | bike_share_for_all_trip | rental_access_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87968 | 335 | 2019-06-17 09:51:55.9190 | 2019-06-17 09:57:31.6990 | 19.0 | Post St at Kearny St | 37.788975 | -122.403452 | 14.0 | Clay St at Battery St | 37.795001 | -122.39997 | 2410 | Subscriber | No | |
| 179011 | 258 | 2019-11-01 16:49:43.8360 | 2019-11-01 16:54:02.1150 | 182.0 | 19th Street BART Station | 37.809369 | -122.267951 | 196.0 | Grand Ave at Perkins St | 37.808894 | -122.25646 | 11569 | Subscriber | No | |
| 121314 | 1082 | 2019-06-11 17:05:07.4360 | 2019-06-11 17:23:10.2770 | 350.0 | 8th St at Brannan St | 37.771431 | -122.405787 | 119.0 | 18th St at Noe St | 37.761047 | -122.432642 | 419 | Subscriber | No | |


In [21]:
# Check the amount of data
df.shape

(3036496, 15)

In [22]:
# get information about the combined dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3036496 entries, 0 to 176798
Data columns (total 15 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   duration_sec             int64  
 1   start_time               object 
 2   end_time                 object 
 3   start_station_id         float64
 4   start_station_name       object 
 5   start_station_latitude   float64
 6   start_station_longitude  float64
 7   end_station_id           float64
 8   end_station_name         object 
 9   end_station_latitude     float64
 10  end_station_longitude    float64
 11  bike_id                  int64  
 12  user_type                object 
 13  bike_share_for_all_trip  object 
 14  rental_access_method     object 
dtypes: float64(6), int64(2), object(7)
memory usage: 370.7+ MB


In [24]:
df.nunique()

duration_sec                 15717
start_time                 2974901
end_time                   2974562
start_station_id               455
start_station_name             473
start_station_latitude      319513
start_station_longitude     339226
end_station_id                 455
end_station_name               473
end_station_latitude        322392
end_station_longitude       343102
bike_id                      14382
user_type                        2
bike_share_for_all_trip          2
rental_access_method             2
dtype: int64

In [26]:
df.describe()

| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id |
|---|---|---|---|---|---|---|---|---|
| count | 3036496.0 | 2469905.0 | 3036496.0 | 3036496.0 | 2468257.0 | 3036496.0 | 3036496.0 | 3036496.0 |
| mean | 815.904 | 153.0668 | 37.75906 | -122.3513 | 148.4574 | 37.75821 | -122.3474 | 125409.8 |
| std | 1913.268 | 128.6385 | 0.1730132 | 0.4490454 | 127.6466 | 0.2606699 | 0.7751218 | 243107.2 |
| min | 60.0 | 3.0 | 0.0 | -122.5143 | 3.0 | 0.0 | -122.5758 | 4.0 |
| 25% | 367.0 | 50.0 | 37.7675 | -122.4163 | 43.0 | 37.76835 | -122.4143 | 2410.0 |
| 50% | 584.0 | 109.0 | 37.77877 | -122.3997 | 104.0 | 37.77922 | -122.3991 | 7041.0 |
| 75% | 910.0 | 246.0 | 37.79429 | -122.3889 | 243.0 | 37.795 | -122.3889 | 13000.0 |
| max | 912110.0 | 521.0 | 45.51 | 0.0 | 521.0 | 45.51 | 0.0 | 999960.0 |


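- The `start_time` and `end_time` fields are stored as `object` (strings) and should be converted to datetime.
- `start_station_id` and `end_station_id` are stored as `float64` although they are identifiers. The `describe()` counts show that both columns have missing values (2,469,905 and 2,468,257 non-null values out of 3,036,496 rows).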
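The float dtype of the station id columns is most likely a side effect of their missing values, since a plain NumPy integer column cannot hold `NaN`. One possible way to keep them as integers, sketched here on a toy frame rather than on the full dataset, is pandas' nullable `Int64` dtype. The three ids come from the sample rows above; the missing entry is added purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: three station ids from the sample rows plus one missing id
# (the NaN row is illustrative, not an actual record)
stations = pd.DataFrame({"start_station_id": [19.0, 182.0, np.nan, 350.0]})

# "Int64" (capital I) is pandas' nullable integer dtype:
# whole-number floats become integers and NaN becomes <NA>
stations["start_station_id"] = stations["start_station_id"].astype("Int64")

# Count how many rides have no station id
n_missing = stations["start_station_id"].isna().sum()
```

Whether this conversion is worthwhile depends on how the ids are used later; for grouping and counting, the float values work as well.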

In [5]:
# convert the start_time and end_time columns to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
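An optional sanity check after the conversion is to compare `duration_sec` with the elapsed time between `start_time` and `end_time`. The sketch below does this on the three sample rows shown earlier instead of the full dataset; the names `check`, `elapsed_sec` and `max_gap` are illustrative.

```python
import pandas as pd

# The three rows printed by df.sample(3) above
check = pd.DataFrame({
    "duration_sec": [335, 258, 1082],
    "start_time": ["2019-06-17 09:51:55.9190",
                   "2019-11-01 16:49:43.8360",
                   "2019-06-11 17:05:07.4360"],
    "end_time": ["2019-06-17 09:57:31.6990",
                 "2019-11-01 16:54:02.1150",
                 "2019-06-11 17:23:10.2770"],
})

# Same conversion as in the cell above
check["start_time"] = pd.to_datetime(check["start_time"])
check["end_time"] = pd.to_datetime(check["end_time"])

# Elapsed time in seconds, computed from the two timestamps
elapsed_sec = (check["end_time"] - check["start_time"]).dt.total_seconds()

# Largest disagreement between the recorded and the computed duration
max_gap = (elapsed_sec - check["duration_sec"]).abs().max()
```

For these rows the recorded duration and the computed one differ by less than a second, which suggests `duration_sec` is the elapsed time with the fractional part dropped.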

### What is the structure of your dataset?
The dataset contains 3,036,496 ride records with 15 features of different types: 8 numeric columns and 7 text columns, two of which (`start_time` and `end_time`) hold timestamps.

### What is/are the main feature(s) of interest in your dataset?

* How long is the average trip?
* Does the season of the year influence the average trip duration?
* Do rides peak at any particular time of day?
* Is this related to the user type?
* Are some stations more popular than others?
* Does this vary between weekdays and weekends?

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The main features of interest are the station IDs, the user type, and the start and end times of the rides.
To answer the questions above, the start and end times are without doubt the most important features, and the ones that will need the most analysis and transformation. From them I can derive the trip length, the time of day, the day of the week, and the season in which each trip takes place.

I will also need the user type and the station IDs to add context to these derived variables and to find relationships in the data.
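The time-based features mentioned above could be derived roughly as follows. This is a minimal sketch on the three sample rows shown earlier, not the final feature engineering: the column names (`duration_min`, `start_hour`, `day_of_week`, `is_weekend`, `season`) are illustrative, and the month-to-season mapping assumes meteorological seasons in the northern hemisphere.

```python
import pandas as pd

# Timestamps of the three rows printed by df.sample(3) above
trips = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-06-17 09:51:55.9190",
                                  "2019-11-01 16:49:43.8360",
                                  "2019-06-11 17:05:07.4360"]),
    "end_time": pd.to_datetime(["2019-06-17 09:57:31.6990",
                                "2019-11-01 16:54:02.1150",
                                "2019-06-11 17:23:10.2770"]),
})

# Assumption: meteorological seasons, northern hemisphere
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Trip length in minutes
trips["duration_min"] = (
    (trips["end_time"] - trips["start_time"]).dt.total_seconds() / 60
)

# Hour of the day (0-23) in which the trip started
trips["start_hour"] = trips["start_time"].dt.hour

# Day name plus a weekend flag (dayofweek: Monday=0 ... Sunday=6)
trips["day_of_week"] = trips["start_time"].dt.day_name()
trips["is_weekend"] = trips["start_time"].dt.dayofweek >= 5

# Season looked up from the month number
trips["season"] = trips["start_time"].dt.month.map(SEASON_BY_MONTH)
```

Keeping the derived values as separate columns means later plots can group by `start_hour`, `day_of_week` or `season` without reparsing the timestamps.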

## Univariate Exploration

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
