# Nice Ride MN Exploratory Data Analysis

description...

In [13]:
import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt
#import seaborn as sb
#%matplotlib inline
#sb.set()

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, Circle, HoverTool, ColorBar, ColorMapper, BasicTicker
from bokeh.tile_providers import CARTODBPOSITRON_RETINA
output_notebook()

In [4]:
# Load data
stations = pd.read_csv('Nice_Ride_2017_Station_Locations.csv')
trips = pd.read_csv('Nice_ride_trip_history_2017_season.csv')



In [3]:
# Show some of the stations data
stations.head()

Unnamed: 0,Number,Name,Latitude,Longitude,Total docks
0,30000,100 Main Street SE,44.984892,-93.256551,27
1,30001,25th Street & 33rd Ave S,44.957341,-93.223374,15
2,30002,Riverside Ave & 23rd Ave S,44.967115,-93.240149,15
3,30003,Plymouth Ave N & N Oliver Ave,44.991412,-93.306269,15
4,30004,11th Street & Hennepin,44.97534,-93.27869,23


Notes: 
- last entry in stations['Number'] is NRWH instead of a number!
- Lat, lon, total docks are OK

Let's plot the station locations on a map using [Bokeh](https://bokeh.pydata.org/en/latest/).  Bokeh's map tiles use [Web Mercator coordinates](https://en.wikipedia.org/wiki/Web_Mercator_projection) (not UTM), so we'll first have to transform the station locations from latitude and longitude to Web Mercator X and Y in meters, and then make a function which plots points on top of the map tiles.

In [5]:
def lat_to_mercY(lat):
    """Convert Latitude to Mercator Y"""
    return np.log(np.tan(np.pi / 4 + np.radians(lat) / 2)) * 6378137.0

def lon_to_mercX(lon):
    """Convert Longitude to Mercator X"""
    return np.radians(lon) * 6378137.0

def MapPoints(lat, lon, size=10, color="green", alpha=0.8, padding=0.1, tooltips=None, cmap='plasma'):
    """Bokeh plot of points overlaid on a map"""
    
    # Convert lat,lon to Web Mercator coordinates (meters)
    X = lon_to_mercX(lon)
    Y = lat_to_mercY(lat)
    
    # Set marker sizes (broadcast a scalar size to every point)
    if isinstance(size, (int, float)):
        size = size*np.ones(len(lat))
    
    # Set marker colors
    if isinstance(color, str): #just use same color for all points
        color_str = color
        color_hex = X #placeholder so the 'color' column exists; unused in this case
    else: #map colors to a colormap
        import matplotlib.cm as cm
        colormap = cm.get_cmap(cmap) #matplotlib colormap
        caxis = [min(color), max(color)]
        cmapped = np.interp(color, caxis, [0,1], left=0, right=1)
        colors = colormap(cmapped, 1 ,True) #map values to the colormap
        color_hex = ["#%02x%02x%02x" % (r, g, b) for r, g, b in colors[:,0:3]]
        color_str = 'color'
        
            
    # Data source table for Bokeh
    source = ColumnDataSource(data=dict(
        X = X,
        Y = Y,
        size = size,
        color = color_hex
    ))
                
    # Plot the points
    p = figure(tools="pan,wheel_zoom,reset,hover,save", active_scroll="wheel_zoom")
    p.add_tile(CARTODBPOSITRON_RETINA) #set background map
    p.circle('X', 'Y', source=source, size='size', #plot each station
             fill_color=color_str, fill_alpha=alpha, line_color=None)
    p.axis.visible = False
    
    # Tool tips
    if tooltips is not None:
        for T in tooltips: #add to Bokeh data source
            source.add(T[1].values.tolist(), name=T[0])
        hover = p.select_one(HoverTool) #set hover values
        hover.tooltips=[(T[0], "@"+T[0]) for T in tooltips]
    
    return p
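
As a quick sanity check of the two conversion helpers, the sketch below re-defines them so it runs on its own and evaluates a few points with known Web Mercator values: the equator should map to Y = 0, and longitude 180° should map to π × 6378137 m. The constant name `EARTH_RADIUS_M` and the variable names are introduced here for illustration only.

```python
import numpy as np

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis in meters, same value as above

def lat_to_mercY(lat):
    """Convert latitude (degrees) to Web Mercator Y (meters)."""
    return np.log(np.tan(np.pi / 4 + np.radians(lat) / 2)) * EARTH_RADIUS_M

def lon_to_mercX(lon):
    """Convert longitude (degrees) to Web Mercator X (meters)."""
    return np.radians(lon) * EARTH_RADIUS_M

# The equator maps to Y = 0 (up to floating-point error)
equator_y = lat_to_mercY(0.0)

# Longitude 180 degrees maps to half the equatorial circumference
edge_x = lon_to_mercX(180.0)

# First station in the table above (100 Main Street SE)
station_x = lon_to_mercX(-93.256551)
station_y = lat_to_mercY(44.984892)

print(equator_y, edge_x, station_x, station_y)
```

Because Minneapolis is west of the prime meridian and north of the equator, the station's X should come out negative and its Y positive.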

Now we can plot the stations!  Each marker corresponds to a station, and the marker color indicates the number of docks at that station.  You can also hover the mouse over a marker to see the name of that station and how many docks it has.

In [11]:
# On hover, show Station name and the number of docks
tooltips = [("Station", stations['Name']), 
            ("Docks", stations['Total docks'])]

# Plot the stations
p = MapPoints(stations.Latitude, stations.Longitude, 
              tooltips=tooltips, color=stations['Total docks'])
p.title.text = "Nice Ride Locations"

# TODO: add color bar, and maybe just use default bokeh color map?
# https://bokeh.pydata.org/en/latest/docs/user_guide/annotations.html#color-bars
# https://bokeh.pydata.org/en/latest/docs/reference/palettes.html#bokeh-palettes
# https://bokeh.pydata.org/en/latest/docs/reference/models/mappers.html
# https://bokeh.pydata.org/en/latest/docs/reference/models/tickers.html
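# ColorMapper is an abstract base class, so use LinearColorMapper instead.
# Plasma256 and the min/max range are chosen to match the marker colors,
# which MapPoints scales from min to max with matplotlib's 'plasma' colormap.
from bokeh.models import LinearColorMapper
color_mapper = LinearColorMapper(palette="Plasma256",
                                 low=stations['Total docks'].min(),
                                 high=stations['Total docks'].max())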
color_bar = ColorBar(color_mapper=color_mapper, ticker=BasicTicker(),
                     label_standoff=12, border_line_color=None, location=(0,0))
p.add_layout(color_bar, 'right')

show(p)

In [5]:
# Show some of the trip data
trips.head()

Unnamed: 0,Start date,Start station,Start station number,End date,End station,End station number,Account type,Total duration (Seconds)
0,11/5/2017 21:45,Hennepin Ave & S Washington Ave,30184,11/5/2017 22:02,Logan Park,30104,Member,1048
1,11/5/2017 21:45,Broadway Street N & 4th Street E,30122,11/5/2017 22:26,Broadway Street N & 4th Street E,30122,Member,2513
2,11/5/2017 21:43,Dale Street & Grand Ave.,30106,11/5/2017 22:13,N Milton Street & Summit Ave,30101,Member,1817
3,11/5/2017 21:41,Weisman Art Museum,30183,11/5/2017 22:05,22nd Ave S & Franklin Ave,30014,Casual,1399
4,11/5/2017 21:38,South 2nd Street & 3rd Ave S,30030,11/5/2017 21:44,6th Ave SE & University Ave,30088,Member,370


In [9]:
# TODO: join tables on station number

Unnamed: 0,Total duration (Seconds)
count,460718.0
mean,2276.507
std,43932.44
min,60.0
25%,408.0
50%,764.0
75%,1483.0
max,11354800.0
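
For the join mentioned in the TODO above, one possible approach is sketched below on toy data. It assumes the station-number columns hold a mix of integers and strings (the `NRWH` entry in `stations['Number']` suggests this), so both keys are cast to strings before merging. The helper name `add_station_coords` and the toy frames are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables (two stations taken from the head() output)
stations_toy = pd.DataFrame({
    "Number": [30000, "30001"],  # mixed int/str on purpose
    "Latitude": [44.984892, 44.957341],
    "Longitude": [-93.256551, -93.223374],
})
trips_toy = pd.DataFrame({
    "Start station number": [30000, "30001", 30000],
    "End station number": ["30001", 30000, "99999"],  # last one has no match
})

def add_station_coords(trips, stations, which="Start"):
    """Left-join station coordinates onto the Start or End station number."""
    key = f"{which} station number"
    coords = stations[["Number", "Latitude", "Longitude"]].copy()
    coords["Number"] = coords["Number"].astype(str)  # normalize key type
    coords = coords.rename(columns={
        "Number": key,
        "Latitude": f"{which} latitude",
        "Longitude": f"{which} longitude",
    })
    out = trips.copy()
    out[key] = out[key].astype(str)  # normalize key type on the trip side too
    return out.merge(coords, on=key, how="left")

joined = add_station_coords(trips_toy, stations_toy, "Start")
joined = add_station_coords(joined, stations_toy, "End")
print(joined)
```

A left join keeps every trip, so trips whose station number has no match in the stations table end up with missing coordinates rather than being dropped silently.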


In [6]:
dir(trips)



In [7]:
# Get info about each column
for col in trips:
    print('\n',col,'\n',trips[col].describe())


 Start date 
 count              460718
unique             171626
top       6/16/2017 12:31
freq                   20
Name: Start date, dtype: object

 Start station 
 count                       460718
unique                         202
top       Lake Street & Knox Ave S
freq                         10747
Name: Start station, dtype: object

 Start station number 
 count     460718
unique       401
top        30115
freq        7350
Name: Start station number, dtype: object

 End date 
 count             460718
unique            170180
top       9/9/2017 14:58
freq                  25
Name: End date, dtype: object

 End station 
 count                       460718
unique                         202
top       Lake Street & Knox Ave S
freq                         11658
Name: End station, dtype: object

 End station number 
 count     460718
unique       401
top        30158
freq        7769
Name: End station number, dtype: object

 Account type 
 count     460718
unique         3
top    

In [8]:
# Convert datatypes
for col in ['End date', 'Start date']:
    trips[col] = pd.to_datetime(trips[col])

In [9]:
# Info about memberships
trips['Account type'].value_counts()

Member     290070
Casual     170646
Inconnu         2
Name: Account type, dtype: int64

# Notes / TODO

Data cleaning first
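
A first cleaning pass might look something like the sketch below. It is only an illustration on toy rows: the one-day duration cutoff and the decision to drop the two `Inconnu` accounts are assumptions, and `clean_trips` and `MAX_DURATION_S` are hypothetical names.

```python
import pandas as pd

# Toy rows assembled from values that appear in the outputs above
trips_toy = pd.DataFrame({
    "Start date": ["11/5/2017 21:45", "11/5/2017 21:43",
                   "6/16/2017 12:31", "9/9/2017 14:58"],
    "Account type": ["Member", "Casual", "Inconnu", "Member"],
    "Total duration (Seconds)": [1048, 1817, 370, 11354800],
})

MAX_DURATION_S = 24 * 60 * 60  # assumed cutoff: drop trips longer than one day

def clean_trips(df, max_duration_s=MAX_DURATION_S):
    """Parse dates, keep known account types, and drop very long trips."""
    out = df.copy()
    # An explicit format avoids per-row format inference
    out["Start date"] = pd.to_datetime(out["Start date"],
                                       format="%m/%d/%Y %H:%M")
    known_account = out["Account type"].isin(["Member", "Casual"])
    plausible = out["Total duration (Seconds)"] <= max_duration_s
    return out.loc[known_account & plausible].reset_index(drop=True)

cleaned = clean_trips(trips_toy)
print(cleaned)
```

The maximum duration in the summary above (11354800 seconds, roughly 131 days) is the kind of value this cutoff is meant to catch.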

Use KeplerGL maybe to visualize:
- Station locations and spots per station
- Ride density to and from stations, and vs time
- Difference in to vs from, and vs time (need to know where to move bikes)
- Flow (overall)
- Flow to vs from
- Flow to and from by day/time
  - Eg to/from downtown on a weekday morning vs night
  - Or to/from the lakes/parks weekday vs weekend
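
The "difference in to vs from" idea could be computed roughly as follows. This is a minimal sketch on toy data with made-up station labels; `net_flow` is a hypothetical helper.

```python
import pandas as pd

# Toy trips between three made-up stations
trips_toy = pd.DataFrame({
    "Start station": ["A", "A", "B", "C", "A"],
    "End station":   ["B", "C", "A", "A", "B"],
})

def net_flow(df):
    """Arrivals minus departures per station (negative = station drains)."""
    departures = df["Start station"].value_counts()
    arrivals = df["End station"].value_counts()
    # fill_value=0 handles stations that only appear on one side
    flow = arrivals.sub(departures, fill_value=0).astype(int)
    return flow.sort_values()

flow = net_flow(trips_toy)
print(flow)
```

Stations with strongly negative values lose bikes over the period and would need restocking. Applying the same function to a subset of trips (for example weekday mornings only) would give the time-dependent version.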
  
Compute the number of bikes at each station over time, if that's not provided?
- Then visualize bike availability vs station

Does the distribution of rides from each station match the distribution of available bikes at each station? (This will change over time; see if there's a time when the mismatch is especially bad.)

Does the distribution of rides to each station match the distribution of empty slots? (Again, this will change over time.)

Is there a difference in the bike patterns between people with a membership vs people without one?
- Compute the transition (or trip) probability matrix for members and separately for non members
- Does it significantly differ?
- Or bootstrap to get a distribution of each trip's probability for members and for non-members, then compute the ROC AUC between the two to estimate the probability that members take that trip more often (don't bother with significance testing unless there are one or two trips you are really interested in...)
- Then find the M best stations to put ads at (the M stations which maximize the probability that a person with a riding pattern similar to that of members will see the ad)
- Mention that this assumes that people whose riding patterns are similar to those of members are more likely to actually get memberships. What you would really want to do is A/B testing or a bandit algorithm.
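
The trip probability matrices could be built along these lines. This is a sketch on toy data, assuming `pd.crosstab` counts are normalized per start station; the station labels and the `transition_matrix` helper are hypothetical.

```python
import pandas as pd

# Toy trips: four by members, four by casual riders
trips_toy = pd.DataFrame({
    "Start station": ["A", "A", "A", "B", "A", "A", "B", "B"],
    "End station":   ["B", "B", "A", "A", "A", "A", "B", "A"],
    "Account type":  ["Member"] * 4 + ["Casual"] * 4,
})

def transition_matrix(df, stations):
    """P[i, j] = probability that a trip starting at i ends at j."""
    counts = pd.crosstab(df["Start station"], df["End station"])
    # Use the same station order for every group so matrices are comparable
    counts = counts.reindex(index=stations, columns=stations, fill_value=0)
    row_sums = counts.sum(axis=1)
    # Stations with no departures get an all-zero row instead of NaN
    return counts.div(row_sums.where(row_sums > 0), axis=0).fillna(0.0)

all_stations = sorted(set(trips_toy["Start station"])
                      | set(trips_toy["End station"]))
is_member = trips_toy["Account type"] == "Member"
P_member = transition_matrix(trips_toy[is_member], all_stations)
P_casual = transition_matrix(trips_toy[~is_member], all_stations)

# Positive entries are trips that members take relatively more often
diff = P_member - P_casual
print(diff)
```

Whether a given difference is meaningful would still need the bootstrap (or another resampling scheme) described above, since rarely used station pairs give very noisy probabilities.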

How does riding activity depend on the weather?
- Temperature, cloud cover, and rain
 
Is there a seasonal dependence in riding activity independent of the weather?
- Look at residuals over season after regressing out weather
- The idea is: do people just not wanna ride bikes as much in the fall even if it's nice out?
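
"Regressing out weather" could be sketched with ordinary least squares as below. No weather data appears in this notebook, so everything here is synthetic: the daily temperature, rain, and ride counts are generated with made-up coefficients purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 200
day = np.arange(n_days)

# Synthetic daily weather and ride counts (all values are made up)
temp = 15 + 10 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 3, n_days)
rain = (rng.random(n_days) < 0.3).astype(float)
rides = 1000 + 40 * temp - 300 * rain + rng.normal(0, 50, n_days)

# Design matrix: intercept, temperature, rain indicator
X = np.column_stack([np.ones(n_days), temp, rain])
coef, *_ = np.linalg.lstsq(X, rides, rcond=None)

# Residuals: ride counts with the fitted weather effect removed
residuals = rides - X @ coef

# A trend in the residuals over the season would suggest an effect
# beyond weather; here a straight line over the day index is fitted
slope, intercept = np.polyfit(day, residuals, 1)
print(coef, slope)
```

With real data, the residuals could be plotted against the date, or averaged by month, to see whether ridership in the fall falls below what the weather alone predicts.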
