# CitiBike data exploration
The data here are [publicly available](https://www.citibikenyc.com/system-data) and include the anonymized trip data from 2013/7 until 2016/12. There are more than 10 million trips in total, so it should be fascinating to explore!

ACTION ITEM: What's the goal of this exploration? Possible answers: 
* Build a model to predict bike use for next hour/day
* Build a bike-transfer protocol, if the current one is not optimized

In [1]:
import os, sys
import pandas as pd
# load the files from a local path
# I will only load a few months of data
files = ['201307 - Citi Bike tripdata.csv', 
         '201308 - Citi Bike tripdata.csv',
         '201309 - Citi Bike tripdata.csv',
         '201310 - Citi Bike tripdata.csv',
         '201311 - Citi Bike tripdata.csv',
         '201312 - Citi Bike tripdata.csv']
# files = ['all_trips.csv']
trips = pd.DataFrame()
for file in files:
    df_temp = pd.read_csv(file, parse_dates=['starttime', 'stoptime'],
                    infer_datetime_format=True, low_memory=False)
    trips = trips.append(df_temp, ignore_index=True)
# drop every trip at or above the 99.5th percentile of duration (about 90 mins)
trips = trips[trips.tripduration < trips.tripduration.quantile(0.995)]
ind = pd.DatetimeIndex(trips.starttime)  # this is very convenient!
trips['date'] = ind.date.astype('datetime64')
trips['hour'] = ind.hour
trips['trip_id'] = trips.index.values
trips['weekday'] = ind.weekday < 5

total_days = (max(ind) - min(ind)).days
print('Total number of trips: {}'.format(trips.shape[0]))
trips.columns

Total number of trips: 5011999


Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender', 'date', 'hour', 'trip_id', 'weekday'],
      dtype='object')
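
A note for newer pandas: `DataFrame.append`, used in the loading loop above, was removed in pandas 2.0. Below is a minimal sketch of the same load with `pd.concat`. The helper name `load_trips` and the two tiny in-memory CSVs are made up for illustration; they only stand in for the monthly files.

```python
import io
import pandas as pd

def load_trips(sources):
    """Read each CSV source and stack the rows into one DataFrame."""
    frames = [pd.read_csv(src, parse_dates=['starttime', 'stoptime'])
              for src in sources]
    # concatenate once at the end instead of appending inside the loop
    return pd.concat(frames, ignore_index=True)

# two made-up "monthly files" standing in for the real CSVs
july = io.StringIO(
    "tripduration,starttime,stoptime\n"
    "600,2013-07-01 08:00:00,2013-07-01 08:10:00\n")
august = io.StringIO(
    "tripduration,starttime,stoptime\n"
    "300,2013-08-01 09:00:00,2013-08-01 09:05:00\n")

sample = load_trips([july, august])
print(sample.shape)
```

Collecting the frames in a list and concatenating once also avoids re-copying the growing DataFrame on every iteration.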

### TODO: systematic overview

In [61]:
# percentage of trips taken by subscriber / customer
per = trips.groupby('usertype')['trip_id'].count()['Subscriber'] / trips.shape[0] * 100
print('The percentage of trips taken by subscribers is {:.3f} %'.format(per))

The percentage of trips taken by subscribers is 87.001 %
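
The share for both user types can also be read off in one step with `value_counts(normalize=True)`. This is a small sketch on a made-up Series standing in for `trips['usertype']`, not the real data.

```python
import pandas as pd

# made-up user types standing in for trips['usertype']
usertype = pd.Series(['Subscriber', 'Subscriber', 'Customer', 'Subscriber'])

# normalize=True returns fractions that sum to 1; scale them to percentages
share = usertype.value_counts(normalize=True) * 100
print(share.round(3))
```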


## Time-series exploration

In [2]:
import bokeh.plotting as bkp
import bokeh.models as bkm
import bokeh.charts as bkc
from bokeh.layouts import gridplot
import numpy as np

# what's the distribution of trip duration
bkp.output_notebook()
masks, count, bin_edges, source = [0, 0], [0, 0], [0, 0], [0, 0]  # initialization
colors = ['navy', 'red']
masks[0] = (trips.usertype == 'Customer')
masks[1] = ~masks[0]
ps = [bkp.figure(plot_height = 300, plot_width = 350, toolbar_location = 'above',
                 tools = 'pan,box_zoom,reset,resize,save',
                 x_axis_label='Trip duration (min)', y_axis_label = 'Total trips')
      for _ in range(2)]
for i, item in enumerate(['Customer', 'Subscriber']):
    count[i], bin_edges[i] = np.histogram(trips[masks[i]].tripduration/60, bins=200)
    source[i] = bkp.ColumnDataSource(data = dict(bin_value = bin_edges[i][:-1],
                                                 count = count[i], 
                                                 per = 100*np.cumsum(count[i])/sum(count[i])))
    ps[i].vbar(x = 'bin_value', top = 'count', width = 0.5, alpha = 0.7, line_alpha = 0.0,
           source = source[i], color = colors[i])
    ps[i].title.text = item
    hover = bkm.HoverTool(tooltips=[('Duration','@bin_value{0.0}'), 
                                    ('Count','@count{int}'), 
                                    ('Percentile','@per{0.0}')])
    ps[i].add_tools(hover)
    
grid = gridplot([ps])
bkp.show(grid)

In [3]:
# now I can compute some statistics on the daily trip counts
by_date = trips.pivot_table(values = 'tripduration', index = 'date', 
                            columns = ['usertype'], aggfunc='count')
by_date['dayofweek'] = by_date.index.weekday
weekdays = {0:'Mon', 1:'Tue', 2:'Wed',3:'Thur',4:'Fri',5:'Sat',6:'Sun'}
by_date['dayofweek_str'] = by_date['dayofweek'].map(lambda x: weekdays[x])

In [4]:
source = bkp.ColumnDataSource(by_date)
p = bkp.figure(plot_height = 300, plot_width = 750, toolbar_location = 'above',
               x_axis_type="datetime", tools = 'pan,box_zoom,reset,resize,save,crosshair',
              x_axis_label='Date', y_axis_label = 'Total trips')

p.line(x = 'date', y = 'Customer', source = source, 
       legend = 'Customer ',  # the trailing space is important!
         line_width = 3, color='navy')
#p.circle(x = 'date', y = 'Customer', source = source, 
#        size = 10, alpha = 0.7, color = 'navy', line_color=None,
#        hover_line_color='white')

p.line(x = 'date', y = 'Subscriber', source = source, 
         line_width = 3, color='red', legend = 'Subscriber ') # the trailing space is important!
#p.circle(x = 'date', y = 'Subscriber', source = source, 
#        size = 10, alpha = 0.7, color = 'red', line_color=None,
#        hover_line_color='white')
p.legend.location = "top_left"
hover = bkm.HoverTool(tooltips=[('Day','@dayofweek_str')], mode='vline')
p.add_tools(hover)
bkp.show(p)

In [5]:
# distribution over weekdays and weekends
by_day = by_date.pivot_table(index = 'dayofweek', 
                             columns = by_date.index.weekofyear + 100*by_date.index.year,
                             values = ['Customer', 'Subscriber'])


In [6]:
ps = [bkp.figure(plot_height = 300, plot_width = 350, toolbar_location = 'above',
                 x_range = [elem for elem in weekdays.values()],
                 tools = 'pan,box_zoom,reset,resize,save',
                  x_axis_label='Day of the week', y_axis_label = 'Daily trips') 
      for _ in range(2)]
colors = ['navy', 'red']
   
for j, item in enumerate(['Customer', 'Subscriber']):
    for i in range(1, by_day[item].shape[1]):
        ps[j].line(x = by_day[item].index.map(lambda x: weekdays[x]), 
                   y = by_day[item].iloc[:,i], 
                   line_width = 2, alpha = 0.1, color = colors[j])
    # plot the mean
    mean = ps[j].line(x = by_day[item].index.map(lambda x: weekdays[x]),
               y = by_day[item].loc[:, 1:].mean(axis = 1), 
               line_width = 3, alpha = 1, color = colors[j])
    ps[j].title.text = item

grid = gridplot([ps])
bkp.show(grid)

In [7]:
# which week holds the Thursday outlier in the left (Customer) plot?
by_day['Customer'].iloc[3,:].idxmax()
# week 27 of 2013: that Thursday is July 4th!

201327
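
The label `201327` is built as `100 * year + week of year`, so it reads as week 27 of 2013. Here is a quick sanity check with the standard library, as a sketch. `week_key` and `iso_week_key` are hypothetical helper names, and I am assuming that pandas' `weekofyear` is the ISO week number.

```python
from datetime import date

def week_key(d):
    """Mirror the notebook's key: calendar year * 100 + ISO week number."""
    return 100 * d.year + d.isocalendar()[1]

def iso_week_key(d):
    """Variant that pairs the ISO week with the ISO year instead."""
    iso_year, iso_week, _ = d.isocalendar()
    return 100 * iso_year + iso_week

july_4th = date(2013, 7, 4)
# weekday() counts Monday as 0, so Thursday is 3
print(week_key(july_4th), july_4th.weekday())

# Dec 30, 2014 belongs to ISO week 1 of 2015, so the calendar-year key
# lands on the same value as the first days of January 2014
print(week_key(date(2014, 12, 30)), week_key(date(2014, 1, 2)))
print(iso_week_key(date(2014, 12, 30)))
```

For the six months loaded here the two keys never collide, but with several years of data the ISO-year variant looks like the safer grouping key.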

In [8]:
by_hour = pd.pivot_table(trips, values = 'tripduration', index = 'hour', 
                         columns = ['usertype', 
                                    ind.weekofyear + 100*ind.year,
                                    ind.weekday < 5], aggfunc = 'count')

In [9]:
ps = [bkp.figure(plot_height = 300, plot_width = 350, toolbar_location = 'above',
                 x_range = (0, 24),
                 tools = 'pan,box_zoom,reset,resize,save',
                  x_axis_label='Hour of the day', y_axis_label = 'Daily trips') 
      for _ in range(2)]  # [weekdays, weekends]
colors = ['navy', 'red']  # ['Customer', 'Subscriber']
days = [5, 2]

for j, item in enumerate(['Customer', 'Subscriber']):
    weeks = by_hour[item].columns.get_level_values(0)
    # plot every week
    for week in weeks:  
        if True in by_hour[item][week].columns:  # plot weekdays
            ps[0].line(x = by_hour[item][week].index, 
                       y = by_hour[item][week][True]/5,
                       line_width = 1, alpha = 0.05, color = colors[j])
        if False in by_hour[item][week].columns:  # plot weekends
            ps[1].line(x = by_hour[item][week].index, 
                       y = by_hour[item][week][False]/2,
                       line_width = 1, alpha = 0.05, color = colors[j])
    # plot mean value in the subplots
    for k, elem in enumerate([True, False]):        
        mean = ps[k].line(x = by_hour[item][week].index, 
                   y = by_hour[item].loc[:,(slice(None), elem)].mean(axis = 1)/days[k],
                   line_width = 3, alpha = 1, color = colors[j], legend = item,
                   name = 'mean')
        ps[k].legend.location = 'top_left'
        hover = bkm.HoverTool(tooltips=[('Hour','@x{int}'), ('Number','@y{0.0}')], 
                              mode='vline', renderers=[mean])
        ps[k].add_tools(hover)
        ps[k].title.text = ['Weekdays', 'Weekends'][k]
    
grid = gridplot([ps])
bkp.show(grid)

## Find the busiest stations

In [10]:
# populate a station dataframe
import os
if os.path.exists('stations.csv'):
    stations = pd.read_csv('stations.csv', index_col='id')
else:    
    # select the columns with a list; tuple-style selection fails on newer pandas
    stations = trips.groupby('start station id')[['start station name', 'start station latitude',
           'start station longitude']].aggregate(lambda x: x.value_counts().index[0])
    stations.columns = ['name','lat','long']
    stations.index.name = 'id'
    stations.sort_index(inplace = True)
    stations.to_csv('stations.csv')

In [11]:
# departures are stored as negative counts so that they plot below the axis
station_count = -1*trips.groupby(['start station id'])['tripduration'].count().to_frame()
station_count.rename(columns = {'tripduration' : 'out'}, inplace = True)
station_count.index.rename('station_id', inplace = True)
station_count['in'] = trips.groupby(['end station id'])['tripduration'].count()
# 'out' is negative, so in - out is the total traffic and in + out is the net change
station_count['total'] = station_count['in'] - station_count['out']
station_count['diff'] = station_count['in'] + station_count['out']
station_count.sort_values(by = 'total', ascending = False, inplace = True)
station_count['name'] = [stations.loc[x, 'name'] for x in station_count.index]
station_count.rename(index = str, inplace = True)
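
The bookkeeping in the cell above is easy to misread because `out` holds negative counts. Below is a toy version with positive counts, as a sketch on four made-up trips. Unlike the cell above, it builds the table from the union of start and end stations, so a station that only ever receives bikes would still get a row; `flows` and `net` are names I made up for this example.

```python
import pandas as pd

# four made-up trips between three made-up stations
toy = pd.DataFrame({'start station id': [1, 1, 2, 3],
                    'end station id':   [2, 3, 3, 1]})

departures = toy.groupby('start station id').size()
arrivals = toy.groupby('end station id').size()

# building from a dict of Series aligns on the union of both indexes
flows = pd.DataFrame({'out': departures, 'in': arrivals}).fillna(0).astype(int)
flows['total'] = flows['in'] + flows['out']   # overall traffic
flows['net'] = flows['in'] - flows['out']     # positive: the station gains bikes
print(flows)
```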

In [12]:
p = bkp.figure(plot_height = 400, plot_width = 750, toolbar_location = 'above',
               tools = 'pan,box_zoom,reset,resize,save',
               x_range = station_count.index.tolist(),
               y_range = [station_count['out'].min(), station_count['in'].max()],
               x_axis_label='Stations ', y_axis_label = 'Total ins and outs',
               title = 'Total station number: {}. Total days: {}'
               .format(len(station_count), total_days))
source= bkp.ColumnDataSource(station_count)
p.vbar(x = 'station_id', top = 'in', width = 0.9, alpha = 0.5, line_alpha = 0.0,
           source = source, color = 'red', legend = 'In:   +')
p.vbar(x = 'station_id', top = 'out', width = 0.9, alpha = 0.5, line_alpha = 0.0,
           source = source, color = 'navy', legend = 'Out: -')
p.vbar(x = 'station_id', top = 'diff', width = 0.9, alpha = 1.0, line_alpha = 0.0,
           source = source, color = 'black',)
hover = bkm.HoverTool(tooltips=[('Station','@name'), ('Net change:','@diff{0.0}')])
p.add_tools(hover)
# visual optimization
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.ygrid.band_fill_alpha = 0.05
p.ygrid.band_fill_color = "navy"
p.xaxis.major_tick_line_color = None
p.xaxis.major_tick_line_width = None
p.xaxis.minor_tick_line_color = None
p.yaxis.minor_tick_line_width = None
p.xaxis.major_label_text_font_size = '0pt'
bkp.show(p)

## Spatial exploration

In [13]:
# assign the route as AAAABBBB, where AAAA is the start station id and BBBB is the end station id
trips['route'] = 10000 * trips['start station id'] + trips['end station id']
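
The AAAABBBB packing only works if every end station id fits in four digits. A minimal sketch of the encoding and its inverse follows; `encode_route` and `decode_route` are hypothetical helper names, and the station ids are made up.

```python
def encode_route(start_id, end_id):
    """Pack two station ids into one integer, AAAABBBB style."""
    if not 0 <= end_id < 10000:
        raise ValueError('end station id must fit in four digits')
    return 10000 * start_id + end_id

def decode_route(route):
    """Recover (start_id, end_id) from a packed route key."""
    return divmod(route, 10000)

route = encode_route(519, 497)  # made-up station ids
print(route, decode_route(route))
```

`divmod` does the job of `int(x/10000)` and `x%10000` in one call, and it stays in integer arithmetic.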

In [14]:
pop_route = trips.groupby('route')['usertype'].value_counts().unstack()
pop_route.fillna(0, inplace = True)

In [15]:
# to visualize the most popular routes
ps = [bkp.figure(plot_height = 300, plot_width = 350, toolbar_location = 'above',
                 tools = 'pan,box_zoom,reset,resize,save',
                 x_axis_label='Distinct routes', y_axis_label = 'Daily trips')
      for _ in range(2)]
for i, item in enumerate(['Customer', 'Subscriber']):
    pop_route.sort_values(by = item, ascending = False, inplace = True)
    ps[i].line(x = range(1, len(pop_route)+1), y = pop_route[item].values/total_days,
               #width = 0.5, alpha = 0.5, line_alpha = 0.0,
               color = colors[i])
    ps[i].title.text = item
    hover = bkm.HoverTool(tooltips=[('Route','@x{int}'), ('Count','@y{0.000}')], mode = 'vline')
    ps[i].add_tools(hover)
    
grid = gridplot([ps])
bkp.show(grid)


In [16]:
N = 20  # number of N most popular routes
pop_route_list = [0, 0]
for i, item in enumerate(['Customer', 'Subscriber']):
    pop_route.sort_values(by = item, ascending = False, inplace = True)
    pop_route_list[i] = pop_route.index[:N]

ps = [0, 0]  # initialize the list
source = [0, 0]  # initialize the list
colors = ['navy', 'red']  # ['Customer', 'Subscriber']

for i, item in enumerate(['Customer', 'Subscriber']):
    source[i] = bkp.ColumnDataSource(
    data = dict(route = list(map(str, pop_route_list[i])),
                count = pop_route[item].loc[pop_route_list[i]].values/total_days,
                start = [stations.loc[int(x/10000), 'name'] for x in pop_route_list[i]],
                end = [stations.loc[int(x%10000), 'name'] for x in pop_route_list[i]])
    )
    x = list(map(str, source[i].data['route']))
    ps[i] = bkp.figure(plot_height = 300, plot_width = 350, toolbar_location = 'above',
                 tools = 'pan,box_zoom,reset,resize,save', x_range = x,
                 x_axis_label='Route', y_axis_label = 'Daily trips', title = item)
    ps[i].vbar('route', top = 'count', width = 0.8, alpha = 0.7, line_alpha = 0.0,
               source = source[i], color = colors[i])
    ps[i].xaxis.major_label_orientation = np.pi/4   
    hover = bkm.HoverTool(tooltips=[('Start','@start'), ('End','@end'), ('Number','@count{0.0}')])
    ps[i].add_tools(hover)
grid = gridplot([ps])
bkp.show(grid) 

In [17]:
# find trip ids where the bike's *previous* end station is not the current start station,
# i.e. the bike was transferred in by something other than a recorded trip
# applied per bikeid via groupby
# for debugging purposes, the most used bikeid is 16049
def find_xfer_trip_id_pre_trip(df):
    '''the first trip has no predecessor, so a False value is prepended for it'''
    mask = (df['start station id'].values[1:] != df['end station id'].values[:-1])
    # prepend False for the first trip, and return a single array
    mask = np.hstack((False, mask))
    return df[mask]['trip_id'].values  # return value is an array, hence capable of groupby

# find trip ids where the bike's current end station is not the *next* start station,
# i.e. the bike was transferred out
def find_xfer_trip_id_post_trip(df):
    '''the last trip has no successor, so a False value is appended for it'''
    mask = (df['start station id'].values[1:] != df['end station id'].values[:-1])
    # append False for the last trip, and return a single array
    mask = np.hstack((mask, False))
    return df[mask]['trip_id'].values  # return value is an array, hence capable of groupby
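
To convince myself the masks do what I want, here is a toy check on five made-up trips. It is a sketch that swaps in `groupby().shift()` for the array slicing above, so it is an alternative formulation rather than the same code. Both versions assume the rows are in chronological order within each bike.

```python
import pandas as pd

# five made-up trips, already in time order within each bike
toy = pd.DataFrame({
    'trip_id':          [0, 1, 2, 3, 4],
    'bikeid':           [1, 1, 1, 2, 2],
    'start station id': [10, 20, 40, 10, 30],
    'end station id':   [20, 30, 50, 30, 10],
})

# previous trip's end station for the same bike (NaN for a bike's first trip)
prev_end = toy.groupby('bikeid')['end station id'].shift(1)
moved_in = prev_end.notna() & (toy['start station id'] != prev_end)

# next trip's start station for the same bike (NaN for a bike's last trip)
next_start = toy.groupby('bikeid')['start station id'].shift(-1)
moved_out = next_start.notna() & (toy['end station id'] != next_start)

print(toy.loc[moved_in, 'trip_id'].tolist())
print(toy.loc[moved_out, 'trip_id'].tolist())
```

Bike 1 ends trip 1 at station 30 but starts trip 2 at station 40, so trip 2 is flagged as transferred in and trip 1 as transferred out.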


In [18]:
# find which stations receive the most transferred-in bikes
from itertools import chain
# collect *ALL* transferred-in trip ids
by_bike = trips.groupby('bikeid').apply(find_xfer_trip_id_pre_trip)
xfer_trip_ids = list(chain.from_iterable(by_bike.values))
# here I assume the trip_id equals the original index
trips_xfer_in = trips[trips['trip_id'].isin(xfer_trip_ids)]

In [19]:
# find which stations donate the most transferred-out bikes
# collect *ALL* transferred-out trip ids
by_bike = trips.groupby('bikeid').apply(find_xfer_trip_id_post_trip)
xfer_trip_ids = list(chain.from_iterable(by_bike.values))
trips_xfer_out = trips[trips['trip_id'].isin(xfer_trip_ids)]

In [20]:
# let's visualize it!
N = 20  # number of top choices
station_lists = [0, 0]
station_lists[0] = trips_xfer_in.groupby('start station id').apply(
                        lambda x: x['trip_id'].count()).sort_values(ascending = False)[:N]
station_lists[1] = trips_xfer_out.groupby('end station id').apply(
                        lambda x: x['trip_id'].count()).sort_values(ascending = False)[:N]

ps = [0, 0]  # initialize the list
source = [0, 0]  # initialize the list
colors = ['navy', 'red']  # ['Received', 'Donated']
labels = ['Out - In','In - Out']

for i, item in enumerate(['Received', 'Donated']):
    source[i] = bkp.ColumnDataSource(
    data = dict(station = list(map(str, station_lists[i].index)),
                count = station_lists[i].values,
                name = [stations.loc[x, 'name'] for x in station_lists[i].index],
                diff = [(-1)**(i+1)*station_count.loc[str(x), 'diff'] for x in station_lists[i].index]))
                        
    ps[i] = bkp.figure(plot_height = 300, plot_width = 350, toolbar_location = 'above',
                 tools = 'pan,box_zoom,reset,resize,save', 
                 x_range = list(map(str, station_lists[i].index)),
                 x_axis_label='Station id', y_axis_label = 'Counts', title = item)
    show = ps[i].vbar('station', top = 'count', width = 0.8, alpha = 0.5, line_alpha = 0.0,
               source = source[i], color = colors[i])
    ps[i].vbar('station', top = 'diff', width = 0.8, alpha = 1, line_alpha = 0.0,
               source = source[i], color = 'black')
    ps[i].xaxis.major_label_orientation = np.pi/4
    hover = bkm.HoverTool(tooltips=[('Station','@name'), 
                                    (item, '@count'),
                                    (labels[i],'@diff')], renderers=[show] )
    ps[i].add_tools(hover)
grid = gridplot([ps])
bkp.show(grid) 

In [21]:
# get distances, first read the matrix
dist_mat = pd.read_csv('dist_matrix.csv', index_col = 'id')
# then build a dictionary of every possible route, for fast lookup later
dist_table = {}
# the two for-loops below are expected to be slow... but I will let it pass this time
for i in range(dist_mat.shape[0]):
    for j in range(0, dist_mat.shape[1]):
        dist_table[10000*dist_mat.index[i] + int(dist_mat.columns[j])] = dist_mat.iloc[i,j]
        if dist_mat.iloc[i,j] == 0:  # in the case of round trip, mark it as negative
            dist_table[10000*dist_mat.index[i] + int(dist_mat.columns[j])] = -1
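
If the double loop ever becomes a bottleneck, NumPy broadcasting can build the same table in one shot. This is a sketch on a made-up 2x2 matrix. It assumes the same layout the loop implies (row index `id`, column labels that are station ids stored as strings), and the distances are invented.

```python
import numpy as np
import pandas as pd

# made-up matrix; assumed layout: row index 'id', columns are station ids as strings
dist_mat = pd.DataFrame([[0, 1200], [1250, 0]],
                        index=pd.Index([72, 79], name='id'),
                        columns=['72', '79'])

row_ids = dist_mat.index.to_numpy()
col_ids = dist_mat.columns.astype(int).to_numpy()

# broadcasting builds the AAAABBBB key for every start/end pair at once
keys = (10000 * row_ids[:, None] + col_ids[None, :]).ravel()
values = dist_mat.to_numpy().ravel()
values = np.where(values == 0, -1, values)  # same round-trip marker as the loop

dist_table = dict(zip(keys.tolist(), values.tolist()))
print(dist_table)
```

Both arrays are flattened in row-major order, so each key lines up with its own distance.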

In [22]:
# I need to use .apply to speed things up (by a lot!)
trips['distance'] = trips['route'].apply(lambda x: dist_table[x])  

# calculate the speed in mph along the Google Maps route
# (assumes distances are in meters; 1600 approximates the 1609.344 m in a mile)
trips['speed'] = trips['distance'] / trips['tripduration']* 3600.0 / 1600
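
A worked example of the unit conversion, as a sketch. It assumes the distance matrix is in meters and the trip duration in seconds; `speed_mph` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
METERS_PER_MILE = 1609.344  # exact definition of the international mile

def speed_mph(distance_m, duration_s, meters_per_mile=METERS_PER_MILE):
    """Convert meters and seconds into miles per hour."""
    return distance_m / duration_s * 3600.0 / meters_per_mile

# 3200 m in 12 minutes, first with the rounded 1600 m/mile, then with the exact value
print(speed_mph(3200, 720, meters_per_mile=1600))
print(round(speed_mph(3200, 720), 3))

# round trips carry the -1 distance marker, so their speed comes out negative
print(speed_mph(-1, 600) < 0)
```

The rounded constant overstates speeds by a bit more than half a percent, which hardly matters for the histograms, but the negative speeds from round trips are worth masking out before any statistics.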

In [23]:
# now let's see some distribution
masks, count, bin_edges, source = [0, 0], [0, 0], [0, 0], [0, 0]  # initialization
colors = ['navy', 'red']
masks[0] = (trips.usertype == 'Customer')
masks[1] = ~masks[0]
ps = [bkp.figure(plot_height = 300, plot_width = 350, toolbar_location = 'above',
                 tools = 'pan,box_zoom,reset,resize,save,crosshair',
                 x_range = (0, 20),
                 x_axis_label='Speed (miles per hour)',
                 y_range = (0, 1),
                 y_axis_label = 'Total count (Normalized)')
      for _ in range(2)]
for i, item in enumerate(['Customer', 'Subscriber']):
    count[i], bin_edges[i] = np.histogram(trips[masks[i]].speed, bins=5000)
    source[i] = bkp.ColumnDataSource(data = dict(bin_value = bin_edges[i][:-1],
                                                 count = count[i]/max(count[i][1:]), 
                                                 per = 100*np.cumsum(count[i])/sum(count[i])))
    ps[i].vbar(x = 'bin_value', top = 'count', 
               width = 0.1, alpha = 0.5, line_alpha = 0.0,
           source = source[i], color = colors[i])
    ps[i].title.text = item
    
grid = gridplot([ps])
bkp.show(grid)

## Prediction models - given route, guess usertype

In [24]:
# the goal of this first toy model is to predict whether a route is taken by a subscriber
# or a customer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
features = ['tripduration', 'hour', 'distance', 'speed', 'weekday']
target = 'usertype'

df_raw = trips[[*features, target]].copy()
target_mapping = {'Customer':0, 'Subscriber':1}
df_raw['usertype'] = df_raw['usertype'].map(target_mapping)

In [25]:
df = df_raw.sample(500000)
df.columns

Index(['tripduration', 'hour', 'distance', 'speed', 'weekday', 'usertype'], dtype='object')

In [26]:
# standardize the input values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(df[features])  # for later scaling of test data
df[features] = scaler.transform(df[features])
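
One caveat on the cell above: the scaler is fit on the whole sample before the train/test split, so the test rows influence the scaling. With 500,000 rows the effect is probably tiny, but the leak-free order is easy to sketch. The data below are synthetic, and `make_pipeline` is a convenience I am swapping in, not what the rest of this notebook uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-ins for (tripduration, hour); short trips get label 1
rng = np.random.default_rng(0)
X = rng.normal(loc=[600.0, 12.0], scale=[300.0, 5.0], size=(1000, 2))
y = (X[:, 0] < 600.0).astype(int)

# split first, so the scaler only ever sees the training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)          # fits the scaler, then the classifier
test_score = model.score(X_test, y_test)
print(test_score)
```

The pipeline reapplies the training-set mean and variance to the test rows automatically when `score` or `predict` is called.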

In [27]:
# Logistic regression
train, test = train_test_split(df, test_size = 0.2)
features = df.columns.tolist()
features.remove(target)

usertype_logistic_model = LogisticRegression()
usertype_logistic_model.fit(train[features], train[target].values)
print('========== Logistic regression ==========')
print('training score is {}'.format(
            usertype_logistic_model.score(train[features], train[target].values)))
print('x-validation score is {}'.format(
            usertype_logistic_model.score(test[features], test[target].values)))

training score is 0.87057
x-validation score is 0.87332


In [28]:
# Neural network
from sklearn.neural_network import MLPClassifier
train, test = train_test_split(df, test_size = 0.2)
features = df.columns.tolist()
features.remove(target)

usertype_mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=100)
usertype_mlp.fit(train[features], train[target].values)
print('========== Neural network ==========')
print('training score is {}'.format(
            usertype_mlp.score(train[features], train[target].values)))
print('x-validation score is {}'.format(
            usertype_mlp.score(test[features], test[target].values)))

training score is 0.87595
x-validation score is 0.87687


In [29]:
# add the start and end station ids
df['start_id'] = trips['start station id']
df['end_id'] = trips['end station id']

In [30]:
# Logistic regression with station ids
train, test = train_test_split(df, test_size = 0.2)
features = df.columns.tolist()
features.remove(target)

usertype_logistic_model_2 = LogisticRegression()
usertype_logistic_model_2.fit(train[features], train[target].values)
print('========== Logistic regression with station ids ==========')
print('training score is {}'.format(
            usertype_logistic_model_2.score(train[features], train[target].values)))
print('x-validation score is {}'.format(
           usertype_logistic_model_2.score(test[features], test[target].values)))

training score is 0.8713275
x-validation score is 0.87171


In [32]:
# Neural network with station ids
train, test = train_test_split(df, test_size = 0.2)
features = df.columns.tolist()
features.remove(target)

usertype_mlp_2 = MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, verbose = 10)
usertype_mlp_2.fit(train[features], train[target].values)
print('========== Neural network with station ids ==========')
print('training score is {}'.format(
            usertype_mlp_2.score(train[features], train[target].values)))
print('x-validation score is {}'.format(
            usertype_mlp_2.score(test[features], test[target].values)))

Iteration 1, loss = 0.43153075
Iteration 2, loss = 0.36885925
Iteration 3, loss = 0.37152925
Iteration 4, loss = 0.36898762
Iteration 5, loss = 0.37113127
Training loss did not improve more than tol=0.000100 for two consecutive epochs. Stopping.
training score is 0.869025
x-validation score is 0.86822


In [38]:
# one-hot encoding
df = pd.get_dummies(df, columns=['start_id', 'end_id'])
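
Why this step matters: as raw integers, station ids impose an ordering that means nothing (station 519 is not "more" than station 72), which may be why adding them directly barely changed the scores above. One-hot encoding gives each station its own indicator column. Here is a toy illustration with made-up ids:

```python
import pandas as pd

# three made-up trips between made-up station ids
toy = pd.DataFrame({'start_id': [72, 519, 72],
                    'end_id':   [519, 72, 497]})

# each distinct id becomes its own indicator column, prefixed by the source column
encoded = pd.get_dummies(toy, columns=['start_id', 'end_id'])
print(encoded.columns.tolist())
print(encoded.shape)
```

Every row has exactly one active `start_id_*` column and one active `end_id_*` column, so the column count grows with the number of distinct stations.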

In [39]:
# Logistic regression with station ids, one-hot encoded
train, test = train_test_split(df, test_size = 0.2)
features = df.columns.tolist()
features.remove(target)

usertype_logistic_model_3 = LogisticRegression()
usertype_logistic_model_3.fit(train[features], train[target].values)
print('========== Logistic regression with station ids one-hot encoded ==========')
print('training score is {}'.format(
            usertype_logistic_model_3.score(train[features], train[target].values)))
print('x-validation score is {}'.format(
           usertype_logistic_model_3.score(test[features], test[target].values)))

training score is 0.8787925
x-validation score is 0.87971


In [40]:
# Neural network with station ids, one-hot encoded
train, test = train_test_split(df, test_size = 0.2)
features = df.columns.tolist()
features.remove(target)

usertype_mlp_3 = MLPClassifier(hidden_layer_sizes=(50,), max_iter=100)
usertype_mlp_3.fit(train[features], train[target].values)
print('========== Neural network with station ids one-hot encoded ==========')
print('training score is {}'.format(
            usertype_mlp_3.score(train[features], train[target].values)))
print('x-validation score is {}'.format(
            usertype_mlp_3.score(test[features], test[target].values)))

Iteration 1, loss = 0.31517901
Iteration 2, loss = 0.29691747
Iteration 3, loss = 0.29370178
Iteration 4, loss = 0.29152600
Iteration 5, loss = 0.28978947
Iteration 6, loss = 0.28839360
Iteration 7, loss = 0.28706646
Iteration 8, loss = 0.28602885
Iteration 9, loss = 0.28489866
Iteration 10, loss = 0.28393359
Iteration 11, loss = 0.28297554
Iteration 12, loss = 0.28211479
Iteration 13, loss = 0.28119144
Iteration 14, loss = 0.28035975
Iteration 15, loss = 0.27960338
Iteration 16, loss = 0.27861858
Iteration 17, loss = 0.27781326
Iteration 18, loss = 0.27705708
Iteration 19, loss = 0.27623210
Iteration 20, loss = 0.27544971
Iteration 21, loss = 0.27466441
Iteration 22, loss = 0.27384956
Iteration 23, loss = 0.27315562
Iteration 24, loss = 0.27240227
Iteration 25, loss = 0.27170411
Iteration 26, loss = 0.27087087
Iteration 27, loss = 0.27018514
Iteration 28, loss = 0.26935460
Iteration 29, loss = 0.26865405
Iteration 30, loss = 0.26798800
Iteration 31, loss = 0.26732712
Iteration 32, los



training score is 0.9072475
x-validation score is 0.87606


In [42]:
# Random Forest with station ids, one-hot encoded
from sklearn.ensemble import RandomForestClassifier
train, test = train_test_split(df, test_size = 0.2)
features = df.columns.tolist()
features.remove(target)

usertype_rf_3 = RandomForestClassifier(n_estimators=1000, 
                                       criterion='gini',
                                       max_depth=5,)
usertype_rf_3.fit(train[features], train[target].values)
print('========== Random Forest with station ids one-hot encoded ==========')
print('training score is {}'.format(
            usertype_rf_3.score(train[features], train[target].values)))
print('x-validation score is {}'.format(
            usertype_rf_3.score(test[features], test[target].values)))

training score is 0.8699825
x-validation score is 0.87053


### TODO: is there a better way to identify a route, say from one area to another? Also, what about round trips, and how many are in and out of Manhattan?

### TODO: make a heat map of all the trips, say, with Google Maps?

### TODO: find the route information, such as distances, then build some statistics such as speed, elevation changes, etc

### TODO: bike-centric exploration: how many times a day is one bike used? What's the 'lifetime' of a bike?

### TODO: find correlation between trips and weather

### TODO: find the correlations between travel patterns and other geographic attributes, such as median income?

## TODO: prediction model? 
* Idea: given a route, predict whether the rider is a subscriber or a customer
* Idea: given a start station (and other features), predict where the rider will go
* Idea: predict the bike inflow and outflow for the next hour
* Idea: where to put new stations?
* Idea: can Citi Bike make more profit from the customers?
