# 👩🏻‍💻🏫 Bixi! 🏫👩🏻‍💻

**Note to the reader:** This analysis explores my city's bikeshare service using Kaggle's datasets. I thought it would be interesting to see what insights we can uncover about bike riders from this data. As a frequent user of the bikeshare and a fan of cycling (I have even biked across half of Europe), I wanted to bring my passion into this project. Hope you enjoy the ride!

### 📥🚲 Import packages and Bixi Data 🚲📥

In [None]:
!pip install pandas numpy matplotlib geopy seaborn
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from geopy.distance import geodesic
import seaborn as sns

# Both CSVs were downloaded from Kaggle and are read from a local folder.
df_2017_Rides = pd.read_csv('C:/Users/AzureVirtualDesktopU/Documents/CodeSample/OD_2017.csv')
df_Stations = pd.read_csv('C:/Users/AzureVirtualDesktopU/Documents/CodeSample/Stations_2017_Data.csv')

### 🔗🧹 Cleaning, merging, and manipulations 🧹🔗

In [None]:
# Drop rows with non-numeric placeholders and convert the affected columns to numeric
# types so the merges and arithmetic below work. .copy() avoids SettingWithCopyWarning.
df_2017_Rides = df_2017_Rides[df_2017_Rides['end_station_code'] != "Tabletop (RMA)"].copy()
df_2017_Rides['end_station_code'] = df_2017_Rides['end_station_code'].astype('int64')
df_Stations['code'] = df_Stations['code'].astype('int64')
df_Stations = df_Stations[df_Stations['elevation'] != "e"].copy()
df_Stations['elevation'] = pd.to_numeric(df_Stations['elevation'])

# Merge the two data sets
df_2017_Rides = pd.merge(df_2017_Rides, df_Stations[['latitude','longitude','code','elevation']],
                          left_on = 'start_station_code', right_on = 'code', how = 'left')
df_2017_Rides.rename(columns = {'latitude' : 'StartLatitude','longitude' : 'StartLongitude', 
                                'elevation' : 'StartElevation'}, inplace = True)
df_2017_Rides = df_2017_Rides.drop('code', axis=1)

df_2017_Rides_Merge = pd.merge(df_2017_Rides, df_Stations[['latitude','longitude','code','elevation']], 
                               left_on = 'end_station_code', right_on = 'code', how = 'left')
df_2017_Rides_Merge.rename(columns = {'latitude' : 'EndLatitude','longitude' : 'EndLongitude',
                                       'elevation' : 'EndElevation'}, inplace = True)
# Drop the join key here too, as was done after the first merge
df_2017_Rides_Merge = df_2017_Rides_Merge.drop('code', axis=1)

# Basic manipulations
# Elevation change over the ride (end minus start), computed column-wise
df_2017_Rides_Merge['VariationElevation'] = (
    df_2017_Rides_Merge['EndElevation'] - df_2017_Rides_Merge['StartElevation'])
df_2017_Rides_Merge['DurationMinutes'] = df_2017_Rides_Merge['duration_sec']/60
df_2017_Rides_Merge = df_2017_Rides_Merge.dropna(subset=['VariationElevation'])
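
The cleaning step above filters out two specific stray strings by name. As a possible alternative, here is a minimal sketch that coerces any non-numeric value to `NaN` and then drops those rows, so new placeholder strings would not need to be listed one by one. The small `stations` table and its values are invented for illustration and are not taken from the Kaggle files.

```python
import pandas as pd

# Hypothetical miniature of the stations table (values invented for illustration)
stations = pd.DataFrame({
    "code": ["6001", "6002", "6003"],
    "elevation": ["25", "e", "41"],
})

# errors="coerce" turns any entry that cannot be parsed as a number into NaN
stations["elevation"] = pd.to_numeric(stations["elevation"], errors="coerce")
stations["code"] = pd.to_numeric(stations["code"], errors="coerce")

# Drop the rows where a conversion failed, then make the join key an integer
stations = stations.dropna(subset=["code", "elevation"]).copy()
stations["code"] = stations["code"].astype("int64")

print(stations)
```

The trade-off is that coercion silently discards anything unparseable, so it is worth checking how many rows were dropped before moving on.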


### 🗺️🏁⏱️ Calculate Distance & Speed ⏱️🏁🗺️

In [None]:
# Station 6046 is used as the reference point for downtown (its location can be
# checked on Google Maps).
StationDowntown = df_Stations[df_Stations['code'] == 6046].iloc[0]
EndCoordsDowntown = (StationDowntown['latitude'], StationDowntown['longitude'])
DowntownElevation = StationDowntown['elevation']

# Define a function to calculate the distance between the two stations of a ride, 
# accounting for the elevation difference. Apply this result to the dataset. Also,
# calculate the distance from the starting station to the downtown station.
def calculate_distance(row):
    StartCoords = (row['StartLatitude'], row['StartLongitude'])
    EndCoordsOther = (row['EndLatitude'], row['EndLongitude'])
    HorizontalDistance = geodesic(StartCoords, EndCoordsOther).kilometers
    VerticalDistance = row['VariationElevation']/1000
    PythagoreDistance = np.sqrt(HorizontalDistance**2 + VerticalDistance**2)
    HorizontalDistanceDowntown = geodesic(StartCoords, EndCoordsDowntown).kilometers
    # Elevation gap between the start station and the downtown station, in km
    VerticalDistanceDowntown = (row['StartElevation'] - DowntownElevation)/1000
    PythagoreDistanceDowntown = np.sqrt(HorizontalDistanceDowntown**2 + VerticalDistanceDowntown**2)

    return pd.Series([PythagoreDistance, PythagoreDistanceDowntown])

df_2017_Rides_Merge[['DistanceKM', 'DistanceToDowntown']] = df_2017_Rides_Merge.apply(
    calculate_distance, axis=1)

# Average ride speed in km/h: distance (km) divided by duration (s), times 3600 s/h.
def speed(row):
    distance = row['DistanceKM']
    time_in_sec = row['duration_sec']
    return (distance/(time_in_sec))*3600

df_2017_Rides_Merge['RideSpeed'] = df_2017_Rides_Merge.apply(speed, axis=1)
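
To make the distance and speed arithmetic above easier to check by hand, here is a small standalone sketch for a single ride. It assumes elevations are in metres (as the division by 1000 suggests) and swaps in the haversine formula on a spherical Earth for geopy's `geodesic`, so the numbers will differ slightly from the notebook's. The function names, coordinates, elevation change, and duration are all made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)

def haversine_km(start, end):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def ride_distance_km(start, end, elevation_change_m):
    """Combine horizontal and vertical distance with the Pythagorean theorem."""
    horizontal = haversine_km(start, end)
    vertical = elevation_change_m / 1000  # metres to kilometres
    return math.hypot(horizontal, vertical)

def ride_speed_kmh(distance_km, duration_sec):
    """Average speed in km/h; NaN for non-positive durations instead of dividing by zero."""
    if duration_sec <= 0:
        return float("nan")
    return distance_km / duration_sec * 3600

# Hypothetical ride: about 1.1 km due north, 30 m of climb, 10 minutes
start = (45.5000, -73.5700)
end = (45.5100, -73.5700)
distance = ride_distance_km(start, end, elevation_change_m=30)
speed = ride_speed_kmh(distance, duration_sec=600)
print(f"{distance:.3f} km at {speed:.2f} km/h")
```

Note that this is station-to-station straight-line distance, so the resulting speed is a lower bound on how fast the rider actually travelled.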

### 📊📈 Data illustration and insights 📈📊

In [None]:
# To see how the distance from downtown affects riders, split DistanceToDowntown into
# 20 equal-width bins and compute the average ride distance and average speed per bin.
# The results are shown in a combined bar (distance) and line (speed) chart.

df_2017_Rides_Merge['DistanceRange'] = pd.cut(df_2017_Rides_Merge['DistanceToDowntown'], bins=20)
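# observed=False keeps empty bins in the result and avoids the pandas FutureWarning
AvgDistanceSlice = df_2017_Rides_Merge.groupby('DistanceRange', observed=False)['DistanceKM'].mean().reset_index()
AvgSpeedSlice = df_2017_Rides_Merge.groupby('DistanceRange', observed=False)['RideSpeed'].mean().reset_index()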

AvgDistanceSlice['RangeMiddle'] = AvgDistanceSlice['DistanceRange'].apply(lambda x: x.mid)
AvgSpeedSlice['RangeMiddle'] = AvgSpeedSlice['DistanceRange'].apply(lambda x: x.mid)

plt.bar(AvgDistanceSlice['RangeMiddle'], AvgDistanceSlice['DistanceKM'], 
        width=0.6, label='Average Ride Distance')
plt.plot(AvgSpeedSlice['RangeMiddle'], AvgSpeedSlice['RideSpeed'], color='red', label='Avg. Speed')
plt.title('Average Ride Distance and Average Speed vs. Distance from Downtown')
plt.ylabel('Average Ride Distance (km) / Average Speed (km/h)')
plt.xlabel('Distance from Downtown (km)')
plt.legend()
plt.show()
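
To show what the binning step does on a scale that can be verified by eye, here is a tiny sketch with two bins instead of twenty. The `rides` table and its values are invented for illustration. The column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical rides (values invented for illustration)
rides = pd.DataFrame({
    "DistanceToDowntown": [0.0, 1.0, 2.0, 3.0, 4.0],
    "DistanceKM": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# bins=2 splits the observed range (0 to 4 km) into two equal-width intervals
rides["DistanceRange"] = pd.cut(rides["DistanceToDowntown"], bins=2)

# Average ride distance per interval; observed=False keeps empty intervals too
summary = (
    rides.groupby("DistanceRange", observed=False)["DistanceKM"]
    .mean()
    .reset_index()
)

# Midpoint of each interval, used as the x position when plotting
summary["RangeMiddle"] = [interval.mid for interval in summary["DistanceRange"]]

print(summary)
```

With equal-width bins, the outer bins may hold very few rides, so their averages can be noisy. `pd.qcut` is one option if equal-sized groups would suit the question better.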