# Ford GoBike System Data Exploration
## by Luca(MingCong) Zhou

## Preliminary Wrangling

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.

In [25]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from io import StringIO
from geopy.distance import distance

%matplotlib inline

In [24]:
import sys
!{sys.executable} -m pip install geopy

Collecting geopy
  Downloading geopy-2.1.0-py3-none-any.whl (112 kB)
     |████████████████████████████████| 112 kB 1.7 MB/s
Collecting geographiclib<2,>=1.49
  Downloading geographiclib-1.50-py3-none-any.whl (38 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.1.0


In [4]:
# GitHub does not allow large files, so the data is loaded from a URL instead
file_url = "https://video.udacity-data.com/topher/2020/October/5f91cf38_201902-fordgobike-tripdata/201902-fordgobike-tripdata.csv"
res = requests.get(file_url)

In [11]:
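# load the downloaded bytes into a pandas dataframe; stop if the download failed
if res.status_code != requests.codes.ok:
    raise RuntimeError(f"Download failed with status code {res.status_code}")

# decode the response body and wrap it in a file-like object for read_csv
s = str(res.content, 'utf-8')
data = StringIO(s)
df = pd.read_csv(data)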

In [22]:
# high-level overview of data shape and composition
df.sample()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
177955,377,2019-02-01 18:23:33.4110,2019-02-01 18:29:50.7950,26.0,1st St at Folsom St,37.78729,-122.39438,321.0,5th St at Folsom,37.780146,-122.403071,5444,Subscriber,1933.0,Female,Yes


In [29]:
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

(183412, 16)
duration_sec                        int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id                  float64
start_station_name                 object
start_station_latitude            float64
start_station_longitude           float64
end_station_id                    float64
end_station_name                   object
end_station_latitude              float64
end_station_longitude             float64
bike_id                             int64
user_type                          object
member_birth_year                 float64
member_gender                      object
bike_share_for_all_trip            object
dtype: object
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_l

### Issues

#### Issue 1 Missing Values

1. 197 values are missing for each of the features start_station_id, start_station_name, end_station_id, and end_station_name.
2. 8265 values are missing for each of the features member_birth_year and member_gender.

#### Solve Issue 1

In [30]:
# drop the rows with null values
df.dropna(inplace=True)

#### Issue 2 Misused Data Types

1. The start_time and end_time columns are stored as strings.
2. The start_station_id and member_birth_year columns are stored as floats.

#### Solve Issue 2

In [31]:
# convert the 'start_time' and 'end_time' columns to datetime format
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

# convert the 'start_station_id' and 'member_birth_year' columns to int format
# (safe only because rows with missing values were dropped beforehand)
df['start_station_id'] = df['start_station_id'].astype('int')
df['member_birth_year'] = df['member_birth_year'].astype('int')

#### Inspect the cleaned dataframe

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  int64         
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  float64       
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  int64         
 12  user_type       

### What is the structure of your dataset?

The raw dataset contains 183412 rides with 16 features (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip). After dropping the rows with missing values, 174952 rides remain.

Most variables are numeric. The variables start_time and end_time are timestamps, and start_station_name, end_station_name, user_type, member_gender and bike_share_for_all_trip are nominal categorical variables.

### What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out:

1. The average distance of a ride.
2. The average speed of riders during a ride.
3. The most popular location for renting a bike.
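
The three questions above can be sketched on a handful of made-up rides. This is only an illustration using the standard library: the `rides` list, the station names, and the helper `speed_kmh` are hypothetical, and the distance is assumed to be a station-to-station distance in meters, so it is a lower bound on what a rider actually covered (a round trip to the same station counts as 0 m).

```python
from collections import Counter
from statistics import mean

def speed_kmh(distance_m, duration_sec):
    """Average speed in km/h, or None when the duration is not positive."""
    if duration_sec <= 0:
        return None
    return distance_m * 3.6 / duration_sec

# Hypothetical toy rides: (start_station_name, distance in meters, duration in seconds)
rides = [
    ("Station A", 1100.0, 377),
    ("Station B", 2500.0, 600),
    ("Station A", 0.0, 900),  # round trip: same start and end station
]

# 1. average distance of a ride
average_distance_m = mean(d for _, d, _ in rides)

# 2. average speed, one value per ride
speeds = [speed_kmh(d, t) for _, d, t in rides]

# 3. most popular start station as a (name, ride count) pair
most_popular = Counter(name for name, _, _ in rides).most_common(1)[0]

print(average_distance_m, speeds, most_popular)
```

On the real dataframe the same quantities would come from column operations instead of Python loops; the sketch only fixes the definitions and the units.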

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

To get the distance of a ride, we need to know its starting and ending locations. The features start_station_latitude, start_station_longitude, end_station_latitude, and end_station_longitude are provided in the dataset. Together, these features give us two GPS coordinates:

```text
                                        |
                                        |   (x1,y1)
                                        |
                                        |
                                        |                        (x2, y2)
                                --------|-----------------------------------------
                                        |
```

Recall from high school that the straight-line distance between two points in a plane follows from the [Pythagorean theorem](https://en.wikipedia.org/wiki/Pythagorean_theorem). However, latitude and longitude describe positions on the curved surface of the Earth, so we cannot simply apply the planar formula here. Fortunately, Python makes this easy: we will use the [geopy](https://github.com/geopy/geopy) library to solve the problem.
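
To make the role of the Earth's curvature concrete, here is a minimal sketch of the haversine formula, which gives the great-circle distance between two points on a sphere. It is not the method used in this notebook, which relies on geopy. The names `haversine_m` and `EARTH_RADIUS_M` are introduced only for illustration, and treating the Earth as a perfect sphere with a mean radius of 6371 km is an assumption, so results may differ slightly from geopy's.

```python
import math

# Assumption: spherical Earth with the commonly used mean radius, in meters
EARTH_RADIUS_M = 6_371_000

def haversine_m(start, end):
    """Great-circle distance in meters between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine of the central angle between the two points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Station coordinates from the sample row shown earlier in this notebook
start = (37.78729, -122.39438)
end = (37.780146, -122.403071)
print(haversine_m(start, end))
```

Note that a degree of longitude covers less ground the farther you are from the equator, which is why the formula scales the longitude term by the cosines of the latitudes.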


In [42]:
# use geopy to measure distances, store it as a new feature
# for your reference: https://geopy.readthedocs.io/en/stable/#module-geopy.distance
def getDistance(startCoord, endCoord):
    """
    Return the distance (in meters) between two coordinates.

    :param[float tuple] startCoord the starting coordinate as (latitude, longitude)
    :param[float tuple] endCoord the ending coordinate as (latitude, longitude)
    """
    return distance(startCoord, endCoord).m

In [39]:
distances = list()

# calculate distance for all rows
for ride in df.itertuples():
    startCoord = (ride.start_station_latitude, ride.start_station_longitude)
    endCoord = (ride.end_station_latitude, ride.end_station_longitude)
    distances.append(getDistance(startCoord, endCoord))

df['distance'] = distances
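
Many rides share the same pair of start and end stations, so a row-by-row loop likely recomputes the same distance many times. One possible way to avoid this, sketched below with the standard library, is to memoize the distance function on its coordinate pair. The names `make_cached_distance` and `toy_distance` are hypothetical; `toy_distance` is only a stand-in that records its calls, and in the notebook the wrapped function would be `getDistance`.

```python
from functools import lru_cache

def make_cached_distance(distance_fn):
    """Wrap a two-argument distance function so each coordinate pair is computed once."""
    @lru_cache(maxsize=None)
    def cached(start_coord, end_coord):
        # arguments must be hashable, e.g. (latitude, longitude) tuples
        return distance_fn(start_coord, end_coord)
    return cached

calls = []

def toy_distance(start_coord, end_coord):
    """Stand-in distance: records each call and returns a dummy planar value."""
    calls.append((start_coord, end_coord))
    return abs(start_coord[0] - end_coord[0]) + abs(start_coord[1] - end_coord[1])

cached_distance = make_cached_distance(toy_distance)

# Three rides, two of which share the same station pair
pairs = [
    ((0.0, 0.0), (1.0, 1.0)),
    ((0.0, 0.0), (1.0, 1.0)),
    ((2.0, 2.0), (0.0, 0.0)),
]
results = [cached_distance(s, e) for s, e in pairs]
print(results, len(calls))
```

The cache treats (A, B) and (B, A) as different keys, which is harmless here but means a symmetric distance is still computed twice for opposite directions.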

## Univariate Exploration

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
