# Nice Ride MN Exploratory Data Analysis

description / intro...

In [64]:
import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt
#import seaborn as sb
#%matplotlib inline
#sb.set()

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, Circle, HoverTool, ColorBar, LinearColorMapper
from bokeh.palettes import Viridis256
from bokeh.tile_providers import CARTODBPOSITRON_RETINA
fig_height = 500
fig_width = 800
output_notebook()

## Data Loading and Cleaning

First, let's load the data and take a look at the data to see if it needs any cleaning.

In [28]:
# Load data
stations = pd.read_csv('../input/Nice_Ride_2017_Station_Locations.csv')
trips = pd.read_csv('../input/Nice_ride_trip_history_2017_season.csv')



In [29]:
# Show some of the stations data
stations.sample(n=5)

Unnamed: 0,Number,Name,Latitude,Longitude,Total docks
23,30023,Chicago & 27th Street,44.95355,-93.26259,15
179,30182,Sanford Hall,44.980831,-93.240282,23
93,30096,University Ave NE & 12th Ave NE,44.999874,-93.262955,15
1,30001,25th Street & 33rd Ave S,44.957341,-93.223374,15
38,30039,10th Street & Nicollet Mall,44.973839,-93.274544,19


In [30]:
# Print info about each column
for col in stations:
    print('\n',col,'\nNulls:',stations[col].isnull().sum(),'\n',stations[col].describe())


 Number 
Nulls: 0 
 count       202
unique      202
top       30077
freq          1
Name: Number, dtype: object

 Name 
Nulls: 0 
 count                    202
unique                   202
top       Washington & Cedar
freq                       1
Name: Name, dtype: object

 Latitude 
Nulls: 0 
 count    202.000000
mean      44.965178
std        0.023582
min       44.890527
25%       44.948514
50%       44.969745
75%       44.980757
max       45.042435
Name: Latitude, dtype: float64

 Longitude 
Nulls: 0 
 count    202.000000
mean     -93.229178
std        0.064316
min      -93.322066
25%      -93.274872
50%      -93.251619
75%      -93.200025
max      -93.083433
Name: Longitude, dtype: float64

 Total docks 
Nulls: 0 
 count    202.000000
mean      18.059406
std        4.682606
min        7.000000
25%       15.000000
50%       15.000000
75%       19.000000
max       41.000000
Name: Total docks, dtype: float64


The stations dataset looks pretty clean - no missing values, and the latitude, longitude, and number of docks loaded as expected.  The only anomaly is that the last station ID number is a string ('NRHQ'), while all the others are integers.  But that's fine - pandas will just load that column with the generic `object` dtype, and we want to treat station IDs as categorical labels rather than as numbers anyway.
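
One way to confirm which station IDs are non-numeric is to coerce the column to numbers and see which entries fail.  This is only a sketch: the `sample` frame below is a hypothetical stand-in for `stations` (the station names are made up, only the `'NRHQ'` value comes from the data described above).

```python
import pandas as pd

# Hypothetical stand-in for the stations table; only 'NRHQ' comes from the real data
sample = pd.DataFrame({'Number': ['30001', '30023', 'NRHQ'],
                       'Name': ['Station A', 'Station B', 'Station C']})

# Coerce IDs to numbers; anything non-numeric becomes NaN
as_number = pd.to_numeric(sample['Number'], errors='coerce')

# IDs which could not be parsed as a number
non_numeric = sample.loc[as_number.isnull(), 'Number'].tolist()
print(non_numeric)
```

The same two lines should work on `stations['Number']` directly.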

What about the trip data?

In [31]:
# Show some of the trip data
trips.sample(n=5)

Unnamed: 0,Start date,Start station,Start station number,End date,End station,End station number,Account type,Total duration (Seconds)
450416,4/9/2017 17:46,S 5th Street & Nicollet Mall,30186,4/9/2017 17:57,University & Bank Street SE,30047,Casual,655
215910,7/22/2017 13:48,4th Street & 17th Ave SE,30032,7/22/2017 14:12,U of M St. Paul Student Center,30113,Casual,1478
139210,8/20/2017 18:14,Nicollet Island,30170,8/20/2017 19:05,Social Sciences,30019,Member,3065
424936,4/27/2017 12:00,IDS Center,30042,4/27/2017 12:07,West 15th Street & Willow,30093,Member,373
3807,10/29/2017 21:03,4th Street & 13th Ave SE,30009,10/29/2017 21:07,4th Street & 17th Ave SE,30032,Member,261


In [32]:
# Print info about each column
for col in trips:
    print('\n',col,'\nNulls:',trips[col].isnull().sum(),'\n',trips[col].describe())


 Start date 
Nulls: 0 
 count             460718
unique            171626
top       7/3/2017 13:06
freq                  20
Name: Start date, dtype: object

 Start station 
Nulls: 0 
 count                       460718
unique                         202
top       Lake Street & Knox Ave S
freq                         10747
Name: Start station, dtype: object

 Start station number 
Nulls: 0 
 count     460718
unique       401
top        30115
freq        7350
Name: Start station number, dtype: object

 End date 
Nulls: 0 
 count             460718
unique            170180
top       9/9/2017 14:58
freq                  25
Name: End date, dtype: object

 End station 
Nulls: 0 
 count                       460718
unique                         202
top       Lake Street & Knox Ave S
freq                         11658
Name: End station, dtype: object

 End station number 
Nulls: 0 
 count     460718
unique       401
top        30158
freq        7769
Name: End station number, dtype: object

 

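The trips dataset looks mostly clean too.  The station number columns report 401 unique values for only 202 stations, which most likely means the IDs were read as a mix of integers and strings, but we only use the station names below.  The main problem is that the trip start and end dates were loaded as generic objects, so we'll convert them to datetimes: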

In [None]:
# Convert start and end times to datetime
for col in ['End date', 'Start date']:
    trips[col] = pd.to_datetime(trips[col])
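
Passing an explicit `format` to `pd.to_datetime` can speed up the conversion and removes any day/month ambiguity.  A minimal sketch, assuming every timestamp follows the month/day/year pattern seen in the sample rows above (the `sample` series just reuses two of those values in place of a full `trips` column):

```python
import pandas as pd

# Two timestamps copied from the sample rows shown earlier
sample = pd.Series(['4/9/2017 17:46', '10/29/2017 21:03'])

# Month/day/year with a 24-hour clock
parsed = pd.to_datetime(sample, format='%m/%d/%Y %H:%M')

print(parsed.dt.month.tolist())
print(parsed.dt.hour.tolist())
```

If any row deviated from this pattern, the call would raise an error rather than silently guessing, which is usually what we want during cleaning.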

## Station Locations

Let's plot the station locations on a map.  We'll use [Bokeh](https://bokeh.pydata.org/en/latest/) to display the map.  Bokeh's map tiles use [Web Mercator coordinates](https://en.wikipedia.org/wiki/Web_Mercator_projection) (not to be confused with UTM), so we'll first have to transform the station locations from latitude and longitude to Web Mercator x and y (in meters), and then make a function which generates a plot of points on the map.

In [56]:
def lat_to_mercY(lat):
    """Convert Latitude to Mercator Y"""
    return np.log(np.tan(np.pi / 4 + np.radians(lat) / 2)) * 6378137.0

def lon_to_mercX(lon):
    """Convert Longitude to Mercator X"""
    return np.radians(lon) * 6378137.0

def MapPoints(lat, lon, size=10, color="green", alpha=0.8, padding=0.1, 
              tooltips=None, title=None, width=None, height=None):
    """Bokeh plot of points overlaid on a map"""
    
    # Convert lat,lon to Web Mercator coordinates
    X = lon_to_mercX(lon)
    Y = lat_to_mercY(lat)
    
    # Set marker sizes
    if isinstance(size, (int, float)):
        size = size*np.ones(len(lat))
    
    # Data source table for Bokeh
    source = ColumnDataSource(data=dict(
        X = X,
        Y = Y,
        size = size
    ))
    
    # Set marker colors
    if type(color) is not str: #map colors to a colormap
        source.add(color, 'color') #add to source table
        mapper = LinearColorMapper(palette=Viridis256, low=min(color), high=max(color))
        color = {'field': 'color', 'transform': mapper}
            
    # Plot the points
    p = figure(tools="pan,wheel_zoom,reset,hover,save", active_scroll="wheel_zoom")
    p.add_tile(CARTODBPOSITRON_RETINA) #set background map
    p.circle('X', 'Y', source=source, size='size', #plot each station
             fill_color=color, fill_alpha=alpha, line_color=None)
    p.axis.visible = False
    
    # Colorbar
    if type(color) is not str: #if plotting a range of colors,
        color_bar = ColorBar(color_mapper=mapper, location=(0, 0))
        p.add_layout(color_bar, 'right')
        
    # Tool tips
    if tooltips is not None:
        for T in tooltips: #add to Bokeh data source
            source.add(T[1].values.tolist(), name=T[0])
        hover = p.select_one(HoverTool) #set hover values
        hover.tooltips=[(T[0], "@"+T[0]) for T in tooltips]
        
    # Title
    if title is not None:
        p.title.text = title
        
    # Figure height
    if height is not None:
        p.plot_height = height
        
    # Figure width
    if width is not None:
        p.plot_width = width
    
    return p
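
The two conversion functions above can be sanity-checked against known values of the Web Mercator projection: the equator should map to y = 0, and a round trip through the inverse formula should recover the original latitude.  The sketch below repeats the two functions so it runs on its own; `mercY_to_lat` is a hypothetical helper added only for this check.

```python
import numpy as np

R = 6378137.0  # Earth radius in meters, the same constant used above

def lat_to_mercY(lat):
    """Convert Latitude to Mercator Y"""
    return np.log(np.tan(np.pi / 4 + np.radians(lat) / 2)) * R

def lon_to_mercX(lon):
    """Convert Longitude to Mercator X"""
    return np.radians(lon) * R

def mercY_to_lat(y):
    """Inverse of lat_to_mercY (hypothetical helper, for checking only)"""
    return np.degrees(2 * np.arctan(np.exp(y / R)) - np.pi / 2)

# Chicago & 27th Street, from the stations sample shown earlier
lat, lon = 44.95355, -93.26259
x, y = lon_to_mercX(lon), lat_to_mercY(lat)

print(x, y)             # projected coordinates in meters
print(mercY_to_lat(y))  # should recover the original latitude
```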

In [57]:
# On hover, show Station name and the number of docks
tooltips = [("Station", stations['Name']), 
            ("Docks", stations['Total docks'])]

# Plot the stations
p = MapPoints(stations.Latitude, stations.Longitude, 
              tooltips=tooltips, color=stations['Total docks'],
              title="Nice Ride Locations (color indicates # docks)",
              height=fig_height, width=fig_width)

show(p)

## Station Demand

Let's also take a look at the demand at each station - that is, the number of bikes which users take from each station and the number they leave at each station.  First, how many trips are started from each station?

In [60]:
# Count incoming and outgoing trips for each station
demand_df = pd.DataFrame({'Outgoing trips': trips.groupby('Start station').size(),
                          'Incoming trips': trips.groupby('End station').size()
                      })
demand_df['Name'] = demand_df.index
sdf = stations.merge(demand_df, on='Name')

# On hover, show Station name, the number of docks, and the number of outgoing trips
tooltips = [("Station", sdf['Name']), 
            ("Docks", sdf['Total docks']),
            ("Num_Outgoing", sdf['Outgoing trips'])]

# Plot the stations
p = MapPoints(sdf.Latitude, sdf.Longitude, 
              tooltips=tooltips, color=sdf['Outgoing trips'],
              title="Nice Ride Locations (color indicates # Outgoing trips)",
              height=fig_height, width=fig_width)

show(p)

And the number of trips which *end* at each station?

In [61]:
# On hover, show Station name, the number of docks, and the number of INCOMING trips
tooltips = [("Station", sdf['Name']), 
            ("Docks", sdf['Total docks']),
            ("Num_Incoming", sdf['Incoming trips'])]

# Plot the stations
p = MapPoints(sdf.Latitude, sdf.Longitude, 
              tooltips=tooltips, color=sdf['Incoming trips'],
              title="Nice Ride Locations (color indicates # Incoming trips)",
              height=fig_height, width=fig_width)

show(p)

Nice Ride MN has to re-distribute bikes from stations which have extra bikes to stations which don't have enough.  Stations which have more rides ending at that station than starting there will end up with extra bikes, and Nice Ride will have to re-distribute those extra bikes to the stations which are more empty!  What does this distribution look like? That is, which stations have more rides ending at that station than starting there, or vice versa?  Again we'll use Bokeh to plot this difference in demand.

In [63]:
# Compute the DIFFERENCE between #incoming and #outgoing trips
demand_diff = sdf['Incoming trips']-sdf['Outgoing trips']

# On hover, show Station name, #docks, and DIFFERENCE between #incoming and #outgoing trips
tooltips = [("Station", sdf['Name']), 
            ("Docks", sdf['Total docks']),
            ("Incoming_minus_Outgoing", demand_diff)]

# Plot the stations
p = MapPoints(sdf.Latitude, sdf.Longitude, 
              tooltips=tooltips, color=demand_diff,
              title="Nice Ride Locations (color indicates #Incoming - #Outgoing trips)",
              height=fig_height, width=fig_width)

show(p)

There are clearly some stations at which more people are ending their trips than starting them (e.g. the station at Lake St & Knox Ave, at the northeast corner of Lake Bde Maka Ska), and also stations at which more people are *starting* their trips than ending them (e.g. the station at Coffman Union on the University of Minnesota campus).  Is this difference in demand going to be a problem for Nice Ride?

Ideally, Nice Ride will want to have more docks at stations where there is a large difference between the number of incoming and outgoing rides.  This is because if more rides are starting at a given station than ending there, the number of bikes at that station will decrease as the day goes on.  So, there need to be enough docks at that station to hold enough bikes so the station isn't empty by the end of the day!  On the other hand, if more rides are *ending* at a station than are beginning there, all the docks at that station will fill up and people won't be able to end their rides there!

Stations which have a good balance of the number of rides coming in to the number of rides going out don't need quite as many docks - because about as many bikes are being taken from that station as are being left there.  Although, there's also the issue of time (some stations may see different demand depending on the time of day, week, or season). With a good match of the number of docks at each station to the difference between incoming and outgoing trips, Nice Ride won't have to spend as much time during prime riding hours re-distributing bikes from low-demand stations (with extra unused bikes) to high-demand stations (with not enough bikes!).  How well does this distribution of demand differences match up with the distribution of the number of docks at each station? 

In [None]:
# TODO: COMPARE #DOCKS TO difference in DEMAND DISTRIBUTION
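
One possible way to approach this comparison is to look at the absolute imbalance between incoming and outgoing trips, both per dock and as a rank correlation with the number of docks.  This is only a sketch under stated assumptions: the `toy` table is a hypothetical stand-in for `sdf` with made-up numbers, and the `Imbalance` and `Imbalance per dock` columns are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the merged sdf table; all numbers are made up
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Total docks': [15, 15, 19, 23],
    'Incoming trips': [1200, 900, 3000, 5200],
    'Outgoing trips': [1180, 1000, 2700, 4500],
})

# Absolute imbalance: how far incoming and outgoing trips are from cancelling out
toy['Imbalance'] = (toy['Incoming trips'] - toy['Outgoing trips']).abs()

# Imbalance per dock: large values flag stations which may have too few docks
toy['Imbalance per dock'] = toy['Imbalance'] / toy['Total docks']

# Spearman-style rank correlation between dock count and imbalance
rho = toy['Total docks'].rank().corr(toy['Imbalance'].rank())

ranked = toy.sort_values('Imbalance per dock', ascending=False)
print(ranked[['Name', 'Total docks', 'Imbalance per dock']])
print(rho)
```

A rank correlation near 1 would suggest docks are already allocated roughly in line with imbalance, while the per-dock ranking points at individual stations worth a closer look.  Note that season totals hide the time-of-day effects mentioned above.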

# Notes / TODO

Data cleaning first

Use KeplerGL maybe to visualize:
- Station locations and spots per station
- Ride density to and from stations, and vs time
- Difference in to vs from, and vs time (need to know where to move bikes)
- Flow (overall)
- Flow to vs from
- Flow to and from by day/time
  - E.g. to/from downtown on a weekday morning vs night
  - Or to/from the lakes/parks weekday vs weekend
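
The "difference in to vs from, and vs time" idea above could be sketched by counting arrivals and departures per station and hour of day, then subtracting.  The `toy` trips below are hypothetical (made-up times and station names) and only mimic the column names of `trips` after the datetime conversion.

```python
import pandas as pd

# Hypothetical toy trips; times and stations are made up for illustration
toy = pd.DataFrame({
    'Start date': pd.to_datetime(['2017-07-03 08:05', '2017-07-03 08:40',
                                  '2017-07-03 17:10']),
    'Start station': ['A', 'A', 'B'],
    'End date': pd.to_datetime(['2017-07-03 08:20', '2017-07-03 08:55',
                                '2017-07-03 17:30']),
    'End station': ['B', 'B', 'A'],
})

# Count departures and arrivals per (station, hour of day)
out_counts = (toy.groupby(['Start station', toy['Start date'].dt.hour]).size()
                 .rename_axis(['Station', 'Hour']))
in_counts = (toy.groupby(['End station', toy['End date'].dt.hour]).size()
                .rename_axis(['Station', 'Hour']))

# Net flow = arrivals - departures; missing combinations count as zero
net_flow = in_counts.sub(out_counts, fill_value=0)

# Stations as rows, hours as columns
print(net_flow.unstack(fill_value=0))
```

Negative entries mark hours when a station is draining bikes, positive entries hours when it is filling up.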
  
Compute # bikes at each station? If that's not provided
- Then visualize bike availability vs station

Does the distribution of rides from each station match the distribution of available bikes at each station? (This will change over time, see if there's a time when it's especially bad)

Does the distribution of rides to each station match the distribution of empty slots? (Again, this will change over time)

Is there a difference in the bike patterns between people with a membership vs people without one?
- Compute the transition (or trip) probability matrix for members and separately for non members
- Does it significantly differ?
- Or could bootstrap to get distributions for each trip, compute ROC to get the probability that members take that trip more often (don't bother with significance testing unless there's one or two you are really interested in...)
- Then find the M best stations to put ads at (the M stations which maximize the probability that a person with a riding pattern similar to people with memberships will see it)
- Mention that this assumes that people who have bike patterns that are similar to those of people with memberships are more likely to actually get memberships. And say that in fact what you would really want to do is A/B testing or a bandit algorithm.
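
The transition (trip) probability matrices mentioned above could be computed with `pd.crosstab`, normalizing each row so that it gives P(end station | start station).  A minimal sketch on hypothetical toy trips (made-up stations; only the column names and the `Member`/`Casual` labels come from the data); `transition_matrix` is a helper name introduced here.

```python
import pandas as pd

# Hypothetical toy trips; station names and counts are made up for illustration
toy = pd.DataFrame({
    'Start station': ['A', 'A', 'A', 'B', 'B', 'A', 'B', 'B'],
    'End station':   ['B', 'B', 'A', 'A', 'B', 'B', 'A', 'A'],
    'Account type':  ['Member'] * 5 + ['Casual'] * 3,
})

def transition_matrix(df):
    """Row-normalized start -> end trip counts: P(end station | start station)"""
    return pd.crosstab(df['Start station'], df['End station'], normalize='index')

# One matrix per account type
P_member = transition_matrix(toy[toy['Account type'] == 'Member'])
P_casual = transition_matrix(toy[toy['Account type'] == 'Casual'])

print(P_member)
print(P_casual)
```

On the real data the two matrices may not share the same set of stations, so they would need to be reindexed to a common station list before being compared.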

How does riding activity depend on the weather?
- Temperature, cloud cover, and rain
 
Is there a seasonal dependence in riding activity independent of the weather?
- Look at residuals over season after regressing out weather
- The idea is: do people just not wanna ride bikes as much in the fall even if it's nice out?
