## Analysis
After our exploratory analysis, we can start digging deeper into the data.
 
For this analysis, we'll take a closer look at Citibike operations in Brooklyn, NY. We'll try to answer the following:
- How do average ride distance and duration compare with Manhattan?
- Which stations are the most popular?
- What are the most popular routes?
- Are there any problems with certain stations?
- Do people prefer electric bikes over classic bikes?
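
The trip data used in this notebook carries start and end coordinates but, as far as the columns used here show, no distance field, so the distance question needs an estimate. A minimal sketch, assuming straight-line (great-circle) distance is an acceptable proxy; `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names, and the coordinates in the usage line are illustrative only:

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Illustrative coordinates only, not rows from the dataset
print(round(haversine_km(40.66366, -73.96301, 40.7126, -73.9653), 2))
```

Straight-line distance underestimates the distance actually ridden, and it is zero for trips that start and end at the same station, so it is best read as a lower bound.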
  
#### More detail
We'll also attempt a more granular analysis. For example, counting the entries (arrivals) and exits (departures) at each station could help illustrate how bikes are used. Would a station near Prospect Park show the same usage pattern as one in a more residential area?
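
One way to make entries and exits concrete is a net flow per station and hour: arrivals minus departures. The following is a plain-Python sketch on hypothetical sample rows that mirror the column names used in this notebook. `hourly_net_flow` is an illustrative helper, not part of the analysis code, and the same idea could be expressed as a polars aggregation.

```python
from collections import Counter
from datetime import datetime

def hourly_net_flow(trips):
    """Return {(station, hour): arrivals - departures} for an iterable of trip dicts."""
    flow = Counter()
    for trip in trips:
        # a departure removes one bike from the start station
        flow[(trip['start_station_name'], trip['started_at'].hour)] -= 1
        # trips with no recorded end station contribute no arrival
        if trip.get('end_station_name') is not None:
            flow[(trip['end_station_name'], trip['ended_at'].hour)] += 1
    return dict(flow)

# Hypothetical sample rows; station names 'A' and 'B' are placeholders
sample_trips = [
    {'start_station_name': 'A', 'end_station_name': 'B',
     'started_at': datetime(2022, 6, 1, 8, 5), 'ended_at': datetime(2022, 6, 1, 8, 25)},
    {'start_station_name': 'A', 'end_station_name': 'B',
     'started_at': datetime(2022, 6, 1, 8, 40), 'ended_at': datetime(2022, 6, 1, 9, 10)},
    {'start_station_name': 'B', 'end_station_name': 'A',
     'started_at': datetime(2022, 6, 1, 17, 0), 'ended_at': datetime(2022, 6, 1, 17, 30)},
    {'start_station_name': 'A', 'end_station_name': None,
     'started_at': datetime(2022, 6, 1, 8, 50), 'ended_at': None},
]

net_flow = hourly_net_flow(sample_trips)
for (station, hour), value in sorted(net_flow.items()):
    print(station, hour, value)
```

A station that is strongly negative in the morning and positive in the evening behaves like a residential origin; a park-side station would more plausibly show balanced flows concentrated around midday and weekends, though that is a hypothesis to check against the data.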

#### Data source
Our exploratory analysis used 2013-2020 data, but this analysis uses 2022 data because it is the most recent and complete data available.

In [31]:
import polars as pl
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

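# read the Mapbox token once; the context manager closes the file handle
with open('./tokens/.mapbox_token') as token_file:
    px.set_mapbox_access_token(token_file.read().strip())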

In [32]:
path = '2021-2023/{}.parquet'
year = '2022'

# read into polars lazyframe
df = pl.scan_parquet(path.format(year))

In [None]:
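# trips with no recorded end coordinates; drop the end-station columns
# ('route' is only created in a later cell, so it is not dropped here)
noend_df = df.filter(pl.col('end_lat').is_null() | pl.col('end_lng').is_null()) \
    .drop(['end_station_id', 'end_station_name', 'end_lat', 'end_lng']) \
    .collect()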

# get counts of each start_station_name
noend_counts_df = noend_df.group_by('start_station_name') \
    .agg(
        pl.count('start_station_name').alias('count'),
        pl.col('start_lat').first().alias('start_lat'),
        pl.col('start_lng').first().alias('start_lng')) \
    .sort('count', descending=True) \
    .to_pandas()

top_noend_counts = noend_counts_df.head(50)

def show_map_noends(df):
    fig = px.scatter_mapbox(df, 
                            lat='start_lat', 
                            lon='start_lng', 
                            hover_name='start_station_name', 
                            size='count', 
                            hover_data=['count'],
                            height=800,
                            zoom=12)
    fig.update_layout(mapbox_style='carto-positron', mapbox_accesstoken=open('./tokens/.mapbox_token').read())
    fig.show()

show_map_noends(top_noend_counts)

In [34]:
# create columns for filtering
df = df.with_columns([
    (pl.col('ended_at') - pl.col('started_at')).dt.seconds().alias('duration'),
    pl.col('started_at').dt.date().alias('date'),
    pl.col('started_at').dt.month().alias('month'),
    pl.col('started_at').dt.hour().alias('hour'),
    # route key in the form 'start_lat,start_lng|end_lat,end_lng'
    (
        pl.col('start_lat').cast(pl.Utf8, strict=False) + ','
        + pl.col('start_lng').cast(pl.Utf8, strict=False) + '|'
        + pl.col('end_lat').cast(pl.Utf8, strict=False) + ','
        + pl.col('end_lng').cast(pl.Utf8, strict=False)
    ).alias('route')
])

In [54]:
popular_routes_df = df.group_by('route') \
    .agg(
        pl.count('route').alias('count'),
        pl.col('start_lat').first().alias('start_lat'),
        pl.col('start_lng').first().alias('start_lng'),
        pl.col('start_station_name').first().alias('start_station_name'),
        pl.col('end_lat').first().alias('end_lat'),
        pl.col('end_lng').first().alias('end_lng'),
        pl.col('end_station_name').first().alias('end_station_name'),
        pl.col('duration').mean().alias('avg_duration')
        ) \
    .sort('count', descending=True) \
    .drop_nulls() \
    .collect() \
    .to_pandas()

def show_map_popular_routes(df):
    fig = px.scatter_mapbox(df, 
                            lat='start_lat', 
                            lon='start_lng', 
                            hover_name='start_station_name', 
                            size='count', 
                            hover_data=['count'],
                            height=800,
                            zoom=11,
                            color='avg_duration',
                            color_continuous_scale=px.colors.sequential.Agsunset)
    fig.update_layout(mapbox_style='dark', title='50 Most Popular Citibike Route Starts', mapbox_accesstoken=open('./tokens/.mapbox_token').read())
    fig.show()
    
top_popular_routes_df = popular_routes_df.head(50)
show_map_popular_routes(top_popular_routes_df)

The most popular routes start and end at the same station. Most of these stations are near waterfronts or parks; the rest are in high-traffic areas of Brooklyn and Manhattan. This suggests that a significant share of Citibike users ride for recreation, not just for commuting. Recreational riders (inferred here from proximity to parks or waterfronts) are also likely to keep a bike out for longer.
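
The recreation reading can be checked more directly by comparing round trips with one-way trips. Below is a plain-Python sketch on hypothetical sample rows; `round_trip_summary` is an illustrative helper, and `duration` is assumed to be in seconds, as in the column created earlier in this notebook.

```python
from statistics import median

def round_trip_summary(trips):
    """Share of round trips and median duration (seconds) for round vs. one-way trips."""
    # ignore trips with no recorded end station
    completed = [t for t in trips if t['end_station_name'] is not None]
    round_trips = [t['duration'] for t in completed
                   if t['start_station_name'] == t['end_station_name']]
    one_way = [t['duration'] for t in completed
               if t['start_station_name'] != t['end_station_name']]
    return {
        'round_trip_share': len(round_trips) / len(completed) if completed else 0.0,
        'round_trip_median_s': median(round_trips) if round_trips else None,
        'one_way_median_s': median(one_way) if one_way else None,
    }

# Hypothetical sample rows; 'Park' and 'Office' are placeholder station names
sample_rides = [
    {'start_station_name': 'Park', 'end_station_name': 'Park', 'duration': 1800},
    {'start_station_name': 'Park', 'end_station_name': 'Park', 'duration': 2400},
    {'start_station_name': 'Park', 'end_station_name': 'Office', 'duration': 600},
    {'start_station_name': 'Office', 'end_station_name': 'Park', 'duration': 900},
    {'start_station_name': 'Park', 'end_station_name': None, 'duration': 300},
]

summary = round_trip_summary(sample_rides)
print(summary)
```

If round trips in the real data show a clearly longer median duration than one-way trips, that would support the recreational interpretation; the sample values here are made up and prove nothing on their own.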

In [None]:
fig = go.Figure()

for row in top_popular_routes_df.itertuples():
    fig.add_trace(go.Scattermapbox(
        lat=[row.start_lat, row.end_lat],
        lon=[row.start_lng, row.end_lng],
        mode='lines',
        line=dict(width=2, color='yellow'),
        hoverinfo='none'
    ))

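fig.update_layout(
    margin={'r': 0, 't': 0, 'l': 0, 'b': 0},
    mapbox={
        'style': 'dark',
        'zoom': 11,
        # Mapbox-hosted styles such as 'dark' need an access token in the layout
        'accesstoken': open('./tokens/.mapbox_token').read(),
        'center': {'lon': top_popular_routes_df.start_lng.mean(), 'lat': top_popular_routes_df.start_lat.mean()},
    }
)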

fig.show()

In [None]:
biketype_df = df.group_by('rideable_type') \
    .agg(
        pl.count('rideable_type').alias('count'),
        # this is a median, so name the column accordingly
        pl.col('duration').median().alias('median_duration'),
        pl.col('start_lat').first().alias('start_lat'),
        pl.col('start_lng').first().alias('start_lng'),
    ) \
    .sort('count', descending=True) \
    .collect() \
    .to_pandas()


In [None]:
biketype_df

In [None]:
def show_biketype(df):
    fig = px.bar(df,
                 x='rideable_type',
                 y='count',
                 color='rideable_type',
                 title='Bike Type Popularity')
    fig.show()

show_biketype(biketype_df)

##### Which stations are the most common starting points?
- For each starting point, get the most common ending points.
##### Which stations are the most common ending points?
- For each ending point, get the most common starting points.
##### Top 10 most popular routes
- Combine start and end station to form a route, then count.
##### Average ride duration for electric vs. docked vs. classic bikes
- Also compute the proportion of each type.
##### Starting stations with the most null ending stations

Grand Army Plaza & Plaza St West is one of the most popular stations in Brooklyn. 

In [47]:
def show_endstations(stationname):
    endstation_df = df.filter(pl.col('start_station_name') == stationname) \
    .group_by('end_station_name') \
    .agg(
        pl.count('end_station_name').alias('count'),
        pl.col('start_lat').first().alias('start_lat'),
        pl.col('start_lng').first().alias('start_lng'),
        pl.col('end_lat').first().alias('end_lat'),
        pl.col('end_lng').first().alias('end_lng'),
    ) \
    .collect() \
    .to_pandas()

    fig = px.scatter_mapbox(endstation_df, 
                            lat='end_lat', 
                            lon='end_lng', 
                            hover_name='end_station_name', 
                            size='count',
                            height=800,
                            zoom=11,
                            center={'lat': 40.66366, 'lon': -73.96301},
                            title='Destinations from {}'.format(stationname))
    fig.update_traces(marker_color='yellow')
    fig.update_layout(mapbox_style='dark', mapbox_accesstoken=open('./tokens/.mapbox_token').read())
    fig.show()

def show_startstations(stationname):
    startstation_df = df.filter(pl.col('end_station_name') == stationname) \
    .group_by('start_station_name') \
    .agg(
        pl.count('start_station_name').alias('count'),
        pl.col('start_lat').first().alias('start_lat'),
        pl.col('start_lng').first().alias('start_lng'),
        pl.col('end_lat').first().alias('end_lat'),
        pl.col('end_lng').first().alias('end_lng'),
    ) \
    .collect() \
    .to_pandas()

    fig = px.scatter_mapbox(startstation_df, 
                            lat='start_lat', 
                            lon='start_lng', 
                            hover_name='start_station_name', 
                            size='count',
                            height=800,
                            zoom=11,
                            center={'lat': 40.66366, 'lon': -73.96301},
                            title='Departure Stations to {}'.format(stationname))
    fig.update_traces(marker_color='yellow')
    fig.update_layout(mapbox_style='dark', mapbox_accesstoken=open('./tokens/.mapbox_token').read())
    fig.show()

In [52]:
show_endstations('S 4 St & Wythe Ave')

In [53]:
show_startstations('S 4 St & Wythe Ave')