# Ford GoBike System Data Exploration
## by *Furawa*

## Data Wrangling

The Ford GoBike System Data includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area. We will use the data for the whole of 2018, which is split into 12 separate datasets (one per month).
We will download the 12 monthly datasets, combine them into a single dataset, assess it, and clean it where necessary.
First, let us import the libraries needed for the process.

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt  
import requests
import os
import zipfile
import re
import datetime

%matplotlib inline

Let us build the download URLs of the monthly files listed on the [Bay Wheels trip history data](https://s3.amazonaws.com/baywheels-data/index.html) page.

In [4]:
# Build the 12 monthly URLs; the month is zero-padded to two digits (01 to 12)
base_url = 'https://s3.amazonaws.com/baywheels-data/'
baywheels_urls = [f'{base_url}2018{month:02d}-fordgobike-tripdata.csv.zip' for month in range(1, 13)]
baywheels_urls

['https://s3.amazonaws.com/baywheels-data/201801-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201802-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201803-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201804-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201805-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201806-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201807-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201808-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201809-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201810-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201811-fordgobike-tripdata.csv.zip',
 'https://s3.amazonaws.com/baywheels-data/201812-fordgobike-tripdata.csv.zip']

Now we download all the zip files from the URLs and store them in a dedicated folder.

In [5]:
zip_folder = 'baywheels_monthly_data_zip'   # folder to store the zip files
# Create zip_folder if it does not exist
os.makedirs(zip_folder, exist_ok=True)

# Download each zip file from the URLs
for url in baywheels_urls:
    response = requests.get(url)
    response.raise_for_status()   # stop on HTTP errors instead of saving an error page
    # The file name is the last part of the URL, after the final /
    with open(os.path.join(zip_folder, url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)
sorted(os.listdir(zip_folder))

['201801-fordgobike-tripdata.csv.zip',
 '201802-fordgobike-tripdata.csv.zip',
 '201803-fordgobike-tripdata.csv.zip',
 '201804-fordgobike-tripdata.csv.zip',
 '201805-fordgobike-tripdata.csv.zip',
 '201806-fordgobike-tripdata.csv.zip',
 '201807-fordgobike-tripdata.csv.zip',
 '201808-fordgobike-tripdata.csv.zip',
 '201809-fordgobike-tripdata.csv.zip',
 '201810-fordgobike-tripdata.csv.zip',
 '201811-fordgobike-tripdata.csv.zip',
 '201812-fordgobike-tripdata.csv.zip']
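
Re-running the download cell fetches all twelve archives again. The sketch below is one possible variant that skips files already on disk. The names `target_path`, `download_missing` and `fetch_with_requests` are hypothetical helpers introduced here for illustration, and passing the download step in as a `fetch` argument is an assumption made so that the logic can be tried without network access.

```python
import os
from urllib.parse import urlparse


def target_path(url, folder):
    """Local path for a URL: the folder plus the last component of the URL path."""
    return os.path.join(folder, os.path.basename(urlparse(url).path))


def download_missing(urls, folder, fetch):
    """Download only the files that are not already in `folder`.

    `fetch` is any callable that takes a URL and returns the content as bytes.
    Returns the list of paths that were written.
    """
    os.makedirs(folder, exist_ok=True)
    written = []
    for url in urls:
        path = target_path(url, folder)
        if os.path.exists(path):
            continue  # already downloaded on a previous run
        content = fetch(url)
        with open(path, mode='wb') as file:
            file.write(content)
        written.append(path)
    return written


def fetch_with_requests(url):
    """Fetch one URL with requests and fail loudly on HTTP errors."""
    import requests  # imported here so the helpers above work without it
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.content
```

With these helpers, the download step would read `download_missing(baywheels_urls, zip_folder, fetch_with_requests)`.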

All the zip files are in the folder, so we can proceed to unzip them.

In [6]:
import glob

csv_folder = 'baywheels_monthly_data_csv'   # folder to store the unzipped csv files
# Create csv_folder if it does not exist
os.makedirs(csv_folder, exist_ok=True)

# Extract every archive of zip_folder into csv_folder
files = glob.glob(os.path.join(zip_folder, '*.zip'))
for file in files:
    with zipfile.ZipFile(file, 'r') as my_zip:
        my_zip.extractall(csv_folder)

In [7]:
sorted(os.listdir(csv_folder))

['201801-fordgobike-tripdata.csv',
 '201802-fordgobike-tripdata.csv',
 '201803-fordgobike-tripdata.csv',
 '201804-fordgobike-tripdata.csv',
 '201805-fordgobike-tripdata.csv',
 '201806-fordgobike-tripdata.csv',
 '201807-fordgobike-tripdata.csv',
 '201808-fordgobike-tripdata.csv',
 '201809-fordgobike-tripdata.csv',
 '201810-fordgobike-tripdata.csv',
 '201811-fordgobike-tripdata.csv',
 '201812-fordgobike-tripdata.csv']
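
`extractall` writes whatever each archive contains, so it can be useful to look inside an archive before extracting it. The sketch below lists the members with `ZipFile.namelist()` and uses `ZipFile.testzip()` to detect a corrupted member. `csv_members` is a hypothetical helper name, not part of the original notebook.

```python
import zipfile


def csv_members(zip_path):
    """Return the names of the .csv members of an archive after an integrity check."""
    with zipfile.ZipFile(zip_path, 'r') as my_zip:
        bad_member = my_zip.testzip()  # name of the first corrupted member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f'{zip_path}: corrupted member {bad_member}')
        return [name for name in my_zip.namelist() if name.endswith('.csv')]
```

Calling `csv_members(file)` inside the extraction loop would show, for each monthly archive, which csv file is about to be written to `csv_folder`.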

Let us read one file to check its columns and data types.

In [8]:
test = pd.read_csv(os.path.join(csv_folder, '201801-fordgobike-tripdata.csv'))
print(test.dtypes)
test.sample(6)

duration_sec                 int64
start_time                  object
end_time                    object
start_station_id             int64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id               int64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object


| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12566 | 253 | 2018-01-29 09:09:33.9510 | 2018-01-29 09:13:47.1780 | 223 | 16th St Mission BART Station 2 | 37.764765 | -122.420091 | 120 | Mission Dolores Park | 37.76142 | -122.426435 | 2730 | Subscriber | 1972.0 | Male | No |
| 11919 | 507 | 2018-01-29 11:01:48.0720 | 2018-01-29 11:10:15.1800 | 163 | Lake Merritt BART Station | 37.79732 | -122.26532 | 195 | Bay Pl at Vernon St | 37.812314 | -122.260779 | 257 | Subscriber | 1979.0 | Female | No |
| 75051 | 442 | 2018-01-09 18:21:16.7050 | 2018-01-09 18:28:39.3590 | 37 | 2nd St at Folsom St | 37.785 | -122.395936 | 61 | Howard St at 8th St | 37.776513 | -122.411306 | 3642 | Subscriber | 1991.0 | Male | No |
| 39276 | 459 | 2018-01-20 21:45:27.3410 | 2018-01-20 21:53:07.0910 | 45 | 5th St at Howard St | 37.781752 | -122.405127 | 27 | Beale St at Harrison St | 37.788059 | -122.391865 | 1059 | Subscriber | 1989.0 | Male | No |
| 30651 | 758 | 2018-01-23 17:12:42.0910 | 2018-01-23 17:25:20.5200 | 33 | Golden Gate Ave at Hyde St | 37.78165 | -122.415408 | 67 | San Francisco Caltrain Station 2 (Townsend St... | 37.776639 | -122.395526 | 1773 | Subscriber | 1963.0 | Male | No |
| 54266 | 1333 | 2018-01-16 16:40:48.5510 | 2018-01-16 17:03:02.1870 | 37 | 2nd St at Folsom St | 37.785 | -122.395936 | 88 | 11th St at Bryant St | 37.77003 | -122.411726 | 1782 | Subscriber | 1974.0 | Male | No |


Now that we have all the files, we can read them and combine them into a single dataframe.

In [15]:
# List all the csv files in csv_folder
list_files = glob.glob(csv_folder+'/*.csv')
# Read every file in the list and concatenate them into a single dataframe
baywheels_data = pd.concat(map(pd.read_csv, list_files))
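
Two details of this step are worth knowing. `glob.glob` returns paths in arbitrary order, and `pd.concat` keeps each file's own row index by default, so index labels repeat from one month to the next. The sketch below shows a possible variant that reads the files in name order and builds a fresh index. `read_all_csv` is a hypothetical helper, and the variant assumes nothing later in the analysis relies on the original row labels.

```python
import glob
import os

import pandas as pd


def read_all_csv(folder):
    """Concatenate every .csv in `folder`, in file-name order, with a fresh 0..n-1 index."""
    paths = sorted(glob.glob(os.path.join(folder, '*.csv')))
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Because the monthly file names start with the year and month, sorting the paths puts the rows in chronological order by month.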

In [19]:
print(baywheels_data.shape)  
print(baywheels_data.dtypes)
baywheels_data.head()

(1863721, 16)
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object


| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 71766 | 2018-03-31 16:58:33.1490 | 2018-04-01 12:54:39.2630 | 4.0 | Cyril Magnin St at Ellis St | 37.785881 | -122.408915 | 6.0 | The Embarcadero at Sansome St | 37.80477 | -122.403234 | 341 | Customer | 1964.0 | Female | No |
| 1 | 62569 | 2018-03-31 19:03:35.9160 | 2018-04-01 12:26:25.0350 | 78.0 | Folsom St at 9th St | 37.773717 | -122.411647 | 47.0 | 4th St at Harrison St | 37.780955 | -122.399749 | 536 | Subscriber | 1984.0 | Male | No |
| 2 | 56221 | 2018-03-31 20:13:13.5640 | 2018-04-01 11:50:14.8400 | 258.0 | University Ave at Oxford St | 37.872355 | -122.266447 | 239.0 | Bancroft Way at Telegraph Ave | 37.868813 | -122.258764 | 3245 | Customer | 1983.0 | Male | No |
| 3 | 85844 | 2018-03-31 11:28:07.6580 | 2018-04-01 11:18:52.6130 | 186.0 | Lakeside Dr at 14th St | 37.801319 | -122.262642 | 340.0 | Harmon St at Adeline St | 37.849735 | -122.270582 | 3722 | Customer | NaN | NaN | No |
| 4 | 1566 | 2018-03-31 23:37:56.6400 | 2018-04-01 00:04:02.8930 | 193.0 | Grand Ave at Santa Clara Ave | 37.812744 | -122.247215 | 196.0 | Grand Ave at Perkins St | 37.808894 | -122.25646 | 2355 | Subscriber | 1979.0 | Male | No |


All the files are now in a single dataframe, `baywheels_data`, but several issues must be fixed before using it.
Several columns do not have the correct data type (start_time, end_time, start_station_id, end_station_id,
user_type, member_birth_year, member_gender).
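
The station id columns were `int64` in the January file but are `float64` in the combined dataframe, which suggests that some months contain missing ids: pandas stores an integer column as float as soon as it holds a NaN. Counting missing values per column before converting shows where they are. A minimal sketch on toy data (the values are made up for illustration and are not taken from the real files):

```python
import numpy as np
import pandas as pd

# Toy data: made-up values standing in for two monthly files
january = pd.DataFrame({'start_station_id': [223, 163]})
june = pd.DataFrame({'start_station_id': [37.0, np.nan]})

combined = pd.concat([january, june], ignore_index=True)
print(combined.dtypes)        # start_station_id is float64 because of the NaN
print(combined.isna().sum())  # number of missing values per column
```

On the real data the same check would be `baywheels_data.isna().sum()`.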

In [28]:
# Change the data type of start_time and end_time from object to datetime
baywheels_data.start_time = pd.to_datetime(baywheels_data.start_time, format = '%Y-%m-%d %H:%M:%S.%f')
baywheels_data.end_time = pd.to_datetime(baywheels_data.end_time, format = '%Y-%m-%d %H:%M:%S.%f')

In [31]:
# Assert that the changes are correct, no output means it is correct
assert pd.api.types.is_datetime64_any_dtype(baywheels_data.start_time)
assert pd.api.types.is_datetime64_any_dtype(baywheels_data.end_time)

In [49]:
# Change member_birth_year from float to int 
baywheels_data.member_birth_year = baywheels_data.member_birth_year.fillna(0) # fill all the NaN with 0 
baywheels_data.member_birth_year = baywheels_data.member_birth_year.astype(int)

In [52]:
# Check if the data type is correct, no output means it is correct 
assert baywheels_data.member_birth_year.dtypes == 'int64'
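
Filling the missing birth years with 0 makes the column an integer, but the 0 then takes part in any later statistic such as a mean or a minimum. An alternative, which is not the approach used in this notebook, is the pandas nullable integer dtype `Int64`, which keeps missing values missing. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

birth_year = pd.Series([1972.0, np.nan, 1991.0])  # made-up sample

filled = birth_year.fillna(0).astype(int)  # approach used above: 0 acts as a sentinel
nullable = birth_year.astype('Int64')      # alternative: the missing value stays missing

print(filled.mean())    # the sentinel 0 pulls the mean down
print(nullable.mean())  # the missing value is skipped
```

With the sentinel approach, rows where `member_birth_year == 0` need to be filtered out before computing ages or plotting distributions.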

In [63]:
# Change the data type of member_gender from string to category
baywheels_data.member_gender = baywheels_data.member_gender.fillna('Other') # Replace the missing values with 'Other'
baywheels_data.member_gender = baywheels_data.member_gender.astype('category')

In [64]:
# Check if the data type is correct, no output means it is correct
assert baywheels_data.member_gender.dtypes == 'category'

In [66]:
# Change the user type from string to category  
baywheels_data.user_type = baywheels_data.user_type.astype('category')

In [67]:
# Check the changes, no output means it is correct
assert baywheels_data.user_type.dtypes == 'category'

In [71]:
baywheels_data.start_station_id = baywheels_data.start_station_id.fillna(0) # Replace the NaN values with 0
baywheels_data.end_station_id = baywheels_data.end_station_id.fillna(0)     # Replace the NaN values with 0
# Change data type of start_station_id and end_station_id from float to int
baywheels_data.start_station_id = baywheels_data.start_station_id.astype(int)
baywheels_data.end_station_id = baywheels_data.end_station_id.astype(int)

In [74]:
# Check the changes, no output means it is ok
assert baywheels_data.start_station_id.dtypes == 'int64'
assert baywheels_data.end_station_id.dtypes == 'int64'
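
The separate assertions above can also be gathered into one table of expected dtypes, which gives a single place to update if a conversion changes. The sketch below is one way to do it. `check_dtypes` is a hypothetical helper, and the toy frame with made-up values stands in for `baywheels_data`.

```python
import pandas as pd


def check_dtypes(df, expected):
    """Return {column: actual dtype} for every column whose dtype differs from `expected`."""
    return {
        column: str(df[column].dtype)
        for column, dtype in expected.items()
        if str(df[column].dtype) != dtype
    }


# Toy frame with made-up values, standing in for baywheels_data
toy = pd.DataFrame({
    'start_station_id': [223],
    'member_birth_year': [1972],
    'user_type': pd.Series(['Subscriber'], dtype='category'),
})
expected = {
    'start_station_id': 'int64',
    'member_birth_year': 'int64',
    'user_type': 'category',
}
mismatches = check_dtypes(toy, expected)
print(mismatches)  # an empty dict means every column has the expected dtype
```

Returning the mismatches instead of asserting inside the helper makes it possible to see every wrong column at once.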