In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import json

In [None]:
df = pd.read_pickle("citibikedata/9000timeslots.pickle")
# Please note that some unused columns were deleted.
# However, in all other respects the pickled data was "raw".
# Row cleaning had not yet taken place.

In [None]:
TARGETSTATION = 465   # Must match the dtype of df['station_id']; use "465" if the IDs are stored as strings

In [None]:
df.dtypes

In [None]:
# Houston we have a problem!
# We are getting files with "last_reported" of ZERO so those must be filtered out of the dataframe first.
df = df[df['last_reported'] > 1000]

In [None]:
# OK now the df is clean of bad timestamps in the "last_reported" column.
# Converting from typical second-granularity epoch timestamp requires unit='s'
df['most_recent_conn_DT'] = pd.to_datetime(df['last_reported'], unit='s')

In [None]:
# Re-check the column dtypes: most_recent_conn_DT should now be a datetime64 column
df.dtypes

In [None]:
# Plot bike availability over time for just one station:
df[df['station_id']==TARGETSTATION].plot(x='most_recent_conn_DT', y=['num_bikes_available'])

# 1: STATIONS WITH MOST "volatility"

Every station sends its reports to HQ only sporadically, not on a fixed schedule.

So as a quick measure of volatility of station S, we could take the time-sorted signatures for station S, and determine the velocity between each adjacent pair S[i] and S[i+1], and compute the sum of the velocities.

The velocity could simply be the sum, across all columns c, of abs(S[i][c] - S[i+1][c]).  This will, as desired, produce a velocity of zero if two adjacent reports actually had no net change to report.

We could scale each velocity down by the duration between S[i] and S[i+1], but this isn't really necessary: the most active stations send more reports, so max(i) is higher for them and they naturally have more velocities being summed.
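The velocity idea above can be sketched in pandas as follows. This is only an illustration on toy data, not the computation used later in this notebook: the helper name `velocity_volatility` is hypothetical, and the `num_docks_available` column is an assumption (only `num_bikes_available` appears in the cells above).

```python
import pandas as pd

def velocity_volatility(reports, value_cols,
                        station_col="station_id",
                        time_col="most_recent_conn_DT"):
    """Hypothetical helper: sum of per-report 'velocities' for each station."""
    # Put each station's reports in time order.
    ordered = reports.sort_values([station_col, time_col])
    # Change in every value column between each report and the previous
    # report of the same station (a station's first report gets NaN).
    deltas = ordered.groupby(station_col)[value_cols].diff()
    # Velocity of one adjacent pair: sum over columns of the absolute change.
    # NaNs are skipped, so a station's first report contributes zero.
    velocity = deltas.abs().sum(axis=1)
    # Volatility: total velocity per station, most volatile first.
    return velocity.groupby(ordered[station_col]).sum().sort_values(ascending=False)

# Toy data; the column names other than num_bikes_available are assumptions.
toy = pd.DataFrame({
    "station_id": [1, 1, 1, 2, 2],
    "most_recent_conn_DT": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:05", "2020-01-01 00:09",
        "2020-01-01 00:00", "2020-01-01 00:30",
    ]),
    "num_bikes_available": [5, 3, 3, 10, 10],
    "num_docks_available": [5, 7, 7, 0, 0],
})

vol = velocity_volatility(toy, ["num_bikes_available", "num_docks_available"])
print(vol)
```

In the toy data, station 1 has one pair of reports that differ and one pair that are identical, while station 2 never changes, so station 1 ranks first and station 2 has a volatility of zero.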

## 1.1:  "QuickVolatility"

The quickest approach to computing volatility would be to simply produce the count of individual report rows per station.  That count alone already varies widely from station to station.

Let's compute a histogram based on QuickVolatility!


In [None]:
df = df.drop(columns=['is_installed','is_renting','is_returning','ts'], 
             errors='ignore')

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df = df.drop_duplicates()

In [None]:
df.shape

In [None]:
df.sort_values(by=['station_id','most_recent_conn_DT'], inplace=True)

In [None]:
# This will create an obj of type DataFrameGroupBy
per_station_info = df.groupby('station_id')

In [None]:
# OK so there are 845 actual listed stations in the official station DB.
# But not all are online as you can see here:
per_station_info.ngroups

In [None]:
# TRIVIAL  "QuickVolatility" is just the count per station.
per_station_info.count().hist(column='last_reported',bins=10)

In [None]:
# What are the top 20 stations in QuickVolatility?
df_station_to_quickvol = per_station_info.count()

In [None]:
df_station_to_quickvol.sort_values(by='last_reported', ascending=False)[:20]

In [None]:
df.dtypes

Let's find out exactly how much data we have.  The granularity is per minute, but what is the range?


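In [None]:
# The 'ts' column was dropped earlier, so df['ts'] would raise a KeyError here.
# Use the parsed report time instead.
df['most_recent_conn_DT'].min()

In [None]:
df['most_recent_conn_DT'].max()

In [None]:
df['most_recent_conn_DT'].max() - df['most_recent_conn_DT'].min()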