- Sunday, October 21, 2018

# Plotly like a Pro
- The purpose of this notebook is to display some of my most frequently used plotly plotting patterns.

# Let's begin!
***

### First are libraries, these include the usual suspects:
    - pandas
    - numpy
    
### Second are some specific module imports from plotly to streamline usage
    - init_notebook_mode: necessary to view plotly plots in a jupyter notebook
        - must include `init_notebook_mode(connected=True)` or else plots will not be visible!
    - plotly.graph_objs: the classes are imported directly, so there is no need for the typical `go.` prefix

In [1]:
import pandas as pd
import numpy as np


from plotly.offline import init_notebook_mode, plot, iplot
from plotly.graph_objs import *
init_notebook_mode(connected=True)

## Before we can create any plots, we need to get some data first!
    - I have game & view data scraped from Twitch.tv every ~30 minutes for the month of June 2018
    - Second is a dataset of NYC bus breakdowns and delays (courtesy of Kaggle)

In [2]:
twitch = pd.read_csv('Data/twitch_scrape.csv',encoding='utf-8') # Import csv file as DataFrame

twitch.head() # Display first 5 records

Unnamed: 0,batch,game,views
0,2018-06-03 00:23:00,Fortnite,235619
1,2018-06-03 00:23:00,Dota 2,102896
2,2018-06-03 00:23:00,League of Legends,91233
3,2018-06-03 00:23:00,IRL,51908
4,2018-06-03 00:23:00,Hearthstone,46972


## It's useful to set the timestamp ('batch') field as the DataFrame's index
- This simplifies aggregating over different time periods - hourly/daily/weekly/etc

In [3]:
twitch = twitch.set_index('batch') #change dataframe index to the timestamp field
twitch.index = pd.to_datetime(twitch.index) # set index dtype to datetime64
twitch.head()

batch,game,views
2018-06-03 00:23:00,Fortnite,235619
2018-06-03 00:23:00,Dota 2,102896
2018-06-03 00:23:00,League of Legends,91233
2018-06-03 00:23:00,IRL,51908
2018-06-03 00:23:00,Hearthstone,46972
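
To illustrate why a datetime index simplifies aggregation, here is a minimal sketch on a made-up three-row sample (the `sample` frame and its values are assumptions for illustration, not the scraped data). Once the index is a `DatetimeIndex`, switching between daily and weekly aggregation only means changing the resample rule.

```python
import pandas as pd

# Hypothetical sample shaped like the scraped data: one row per (batch, game)
sample = pd.DataFrame({
    'batch': ['2018-06-04 00:23:00', '2018-06-04 00:53:00', '2018-06-05 00:23:00'],
    'game': ['Fortnite', 'Fortnite', 'Fortnite'],
    'views': [235619, 240000, 198000],
})

# Same two steps as above: move the timestamp into the index, then parse it
sample = sample.set_index('batch')
sample.index = pd.to_datetime(sample.index)

# With a DatetimeIndex in place, the time period is just the resample rule
daily_max = sample['views'].resample('D').max()
weekly_max = sample['views'].resample('W').max()
```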


## Some basic info
- How many batches are there?
- How many days does this include?
- How many batches per day?
    - on avg?
    - max?
    - min?


In [43]:
# How many batches are there?
print "There are {0} separate batches included in the dataframe 'twitch'\n\n".format(twitch.index.nunique())

# How many days does this include?
print "There are {0} separate days included in the dataframe 'twitch'\n".format(twitch.resample('D').count().index.shape[0])

# How many batches per day?
# Note: counting index values per day counts rows (one per game per batch), not distinct batch timestamps
per_day = twitch.to_period('D').index.value_counts()
print "\nThe most batches scraped in a single day is - {0}".format(int(per_day.max()))
print "\nThe least batches scraped in a single day is - {0}".format(int(per_day.min()))
print "\nThere are {0} batches per day on average in the dataframe 'twitch'".format(int(per_day.mean()))


There are 1785 separate batches included in the dataframe 'twitch'


There are 23 separate days included in the dataframe 'twitch'


The most batches scraped in a single day is - 119991

The least batches scraped in a single day is - 9686

There are 95422 batches per day on average in the dataframe 'twitch'
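
The per-day figures above are far larger than the 1785 batches in total, which suggests they count rows (one per game per batch) rather than batches. The sketch below, on a made-up sample, contrasts the two counts; `sample`, `rows_per_day`, and `batches_per_day` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample: 2 batches on June 3 (3 games each), 1 batch on June 4 (2 games)
stamps = (['2018-06-03 00:23:00'] * 3 + ['2018-06-03 00:53:00'] * 3
          + ['2018-06-04 00:23:00'] * 2)
sample = pd.DataFrame({'views': list(range(8))}, index=pd.to_datetime(stamps))

# Rows per day: what counting index values per day returns
rows_per_day = sample.groupby(sample.index.date).size()

# Distinct batch timestamps per day: drop repeated timestamps first
unique_stamps = pd.Series(sample.index.unique())
batches_per_day = unique_stamps.groupby(unique_stamps.dt.date).size()
```

If the intent is batches per day, the second count is probably the one to report.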


## Plot 1: Number of batches scraped per day

In [47]:
batches_per_day = twitch.resample('D').count()['game'] # rows per day; results are a Pandas Series
type(batches_per_day)

pandas.core.series.Series

### This Plotly 'structure' closely matches the documentation
- Note that `Bar`, `Layout`, `Figure`, and `iplot` are called directly because they have been imported without a namespace prefix like `py` or `go`
    - Also, most of these objects behave like plain old dictionaries, so the constructors don't need to be called at all!

In [259]:
data = [Bar(x=batches_per_day.index,
            y=batches_per_day.values,
            opacity=.7)]

layout = Layout(title="Twitch.tv Scraped batches per day")

figure = Figure(data=data,layout=layout)

#iplot(figure)
#############################################################

## more dictionaries!

data = [{'type':'bar',
         'x':batches_per_day.index,
         'y':batches_per_day.values,
         'opacity':.7}]

layout = {'title':"Twitch.tv Scraped batches per day"}

figure = {'data':data,'layout':layout}

#iplot(figure)
############################################################

## Combine objects
#- Only need to call `iplot`

iplot(
        {
            'data':[
                    {'type':'bar',
                     'x':batches_per_day.index,
                     'y':batches_per_day.values,
                     'opacity':.7}
            ],
            'layout':{'title':"Twitch.tv Scraped batches per day"}
        }
)
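
Because a figure is just nested dictionaries, it can be assembled by an ordinary function. The helper below is a hypothetical sketch (`bar_figure` is not part of plotly, and the sample values are made up); it needs no plotly import until the figure is rendered.

```python
def bar_figure(x, y, title, opacity=0.7):
    """Return a figure as nested plain dicts (hypothetical helper, not a plotly API)."""
    trace = {'type': 'bar', 'x': list(x), 'y': list(y), 'opacity': opacity}
    return {'data': [trace], 'layout': {'title': title}}

# Made-up values standing in for batches_per_day
figure = bar_figure(['2018-06-03', '2018-06-04'], [48, 47],
                    "Twitch.tv Scraped batches per day")

# In the notebook this would be rendered with: iplot(figure)
```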

## Let's change the plot from bar to scatter
    - `type` needs to be changed from `bar` to `scatter`
    - the `mode` key:value pair lets users select whether they want markers (points), lines, or both

In [71]:
iplot(
        {
            'data':[
                    {'type':'scatter',
                     'x':batches_per_day.index,
                     'y':batches_per_day.values,
                     'opacity':.7,
                     'mode':'markers+lines'}
            ],
            'layout':{'title':"Twitch.tv Scraped batches per day"}
        }
)

## Bars and Scatter
- Combining different plot types is as simple as combining their respective data fields
- Note: Including the `name` key:value pair lets you customize the name of each plot for the axis labels and the legend
    - If the `name` field is omitted, the legend will default to trace 0, trace 1, ... trace n

In [72]:
iplot(
        {
            'data':[
                    {'type':'scatter',
                     'x':batches_per_day.index,
                     'y':batches_per_day.values,
                     'opacity':.7,
                     'mode':'markers+lines'},
                {'type':'bar',
                     'x':batches_per_day.index,
                     'y':batches_per_day.values,
                     'opacity':.7}
            ],
            'layout':{'title':"Bars and Scatterplot<br>Twitch.tv Scraped batches per day"}
        }
)
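
As a small sketch of the `name` key, the two traces below share the same made-up data and differ only in type and legend label. The values and labels are assumptions for illustration.

```python
# Made-up values standing in for batches_per_day
x = ['2018-06-03', '2018-06-04', '2018-06-05']
y = [48, 47, 45]

# Two traces over the same data; 'name' sets the legend entry of each
traces = [
    {'type': 'scatter', 'mode': 'markers+lines', 'x': x, 'y': y, 'name': 'Batches (line)'},
    {'type': 'bar', 'x': x, 'y': y, 'opacity': 0.7, 'name': 'Batches (bars)'},
]
figure = {'data': traces, 'layout': {'title': 'Bars and Scatterplot'}}

# The legend labels are simply the 'name' values, in trace order
legend_labels = [trace['name'] for trace in figure['data']]
```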

## So far so good. But we can do better.
- Customize colors - htmlcolorcodes.com is a good place to start
- Add trace names
- Adjust the size of the scatter line
- Add axis labels
- Customize text fields with HTML tags

In [101]:
colorcodes = "#48C9B0 #2037C8 #20C86C #4DA8D9 #0A0D6F #DAF7A6 #FFC300 #FF5733 #C70039".split()

In [118]:
iplot(
        {
            'data':[
                    {'type':'scatter',
                     'x':batches_per_day.index,
                     'y':batches_per_day.values,
                     'opacity':.7,
                     'mode':'markers+lines',
                     'name':'Scrapes per day - Scatter',
                     'marker':{'line':{'width':5,
                                       'color':colorcodes[0]},
                              'size':9,
                              'color':colorcodes[1]}},
                    {'type':'bar',
                     'x':batches_per_day.index,
                     'y':batches_per_day.values,
                     'opacity':.4,
                     'name':'Scrapes per day - Bar',
                     'marker':{'color':colorcodes[3]}}
            ],
            'layout':{'title':"<i>Bars</i> and <i>Scatterplot</i><br><b>Twitch.tv</b><br>Scraped batches per day",
                     'xaxis':{'title':'<b>Days'},'yaxis':{'title':'<b>Frequency'}}
        }
)

## Now let's look at the top 3 games

In [119]:
top_3_games = twitch                                \
                .groupby('game')                      \
                .mean()                               \
                .sort_values('views',ascending=False) \
                .index[:3].tolist()
top_3_games

[u'Fortnite', u'League of Legends', u'Dota 2']

### How do their views change over time?
- Unlike grammar of graphics style plotting libraries, you need to manually specify any particular values you wish to subset across
- For example, we are only interested in plotting the timeseries values for: Fortnite, League of Legends, and Dota 2
    - In plotly, we must specify a data object for each of these games. We can manually assign these values to multiple separate DataFrames
    - When the number of traces increases sufficiently, it becomes impractical to create a separate DataFrame for each value, but this is a prime use case for a list comprehension

In [156]:
gamecols = dict(zip(top_3_games,np.random.choice(colorcodes,3,False))) # randomly assign a color from the list of colors

iplot(
        {
            'data':[
                    {'type':'scatter',
                     'x':twitch[twitch['game'] == game].sort_index().index,
                     'y':twitch[twitch['game'] == game].sort_index()['views'],
                     'opacity':.7,
                     'mode':'markers+lines',
                     'name':game,
                     'marker':{'line':{'width':5,
                                       'color':gamecols[game]},
                              'size':1,
                              'color':gamecols[game]}} for game in top_3_games
                   
            ],
            'layout':{'title':"Top 3 twitch games<br>Views over time",
                     'xaxis':{'title':'<b>Days'},'yaxis':{'title':'<b>Frequency'}}
        }
)
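
The list-comprehension pattern can be sketched on a tiny made-up sample. One possible refinement, shown here as an assumption rather than the notebook's method, is to filter each game once into a dict (`subsets`, a hypothetical name) and reuse that subset for both `x` and `y`.

```python
import pandas as pd

# Hypothetical sample: two batches, two games each
sample = pd.DataFrame({
    'game': ['Fortnite', 'Dota 2', 'Fortnite', 'Dota 2'],
    'views': [235619, 102896, 240000, 99000],
}, index=pd.to_datetime(['2018-06-03 00:23', '2018-06-03 00:23',
                         '2018-06-03 00:53', '2018-06-03 00:53']))

games = ['Fortnite', 'Dota 2']

# Filter once per game, then reuse the subset for both x and y
subsets = {game: sample[sample['game'] == game].sort_index() for game in games}

# One trace per game, built with a list comprehension
traces = [{'type': 'scatter',
           'mode': 'markers+lines',
           'name': game,
           'x': subsets[game].index,
           'y': subsets[game]['views']}
          for game in games]
```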

In [263]:
## How about Daily?
gamecols = dict(zip(top_3_games,np.random.choice(colorcodes,3,False))) # randomly assign a color from the list of colors


_ = [iplot(
        {
            'data':[
                    {'type':'scatter',
                     'x':twitch[twitch['game'] == game]\
                                 .resample('D')['views']\
                                 .agg([np.max,np.sum,np.mean,np.min])\
                                 .rename(columns={'amax':'Max','amin':'Min','mean':'Average','sum':'Total'})\
                                 .sort_index().index,
                     'y':twitch[twitch['game'] == game]\
                                 .resample('D')['views']\
                                 .agg([np.max,np.sum,np.mean,np.min])\
                                 .rename(columns={'amax':'Max','amin':'Min','mean':'Average','sum':'Total'})\
                                 .sort_index()[AGG],
                     'opacity':.7,
                     'mode':'markers+lines',
                     'name':"{0} - {1}".format(game,AGG),
                     'marker':{'line':{'width':5,
                                       'color':gamecols[game]},
                              'size':2,
                              'color':gamecols[game]}} 
                for game in top_3_games      
            ],
            'layout':{'title':"Top 3 twitch games<br>Views over time - <b>Daily</b>",
                     'xaxis':{'title':'<b>Day'},'yaxis':{'title':'<b>Views'}}
        })
     for AGG in ['Max','Min','Average','Total']]
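
The cell above recomputes the same aggregation for `x` and for `y` in every trace. A possible alternative, sketched on made-up data, is to aggregate once and read each statistic from the result. Passing the aggregations as strings also avoids relying on column names such as `amax`, which may vary with the pandas and NumPy versions in use.

```python
import pandas as pd

# Hypothetical views for a single game
idx = pd.to_datetime(['2018-06-03 00:23', '2018-06-03 12:23', '2018-06-04 00:23'])
views = pd.Series([100, 300, 200], index=idx, name='views')

# One resample/agg call yields every statistic as a column
daily = views.resample('D').agg(['max', 'min', 'mean', 'sum'])
daily = daily.rename(columns={'max': 'Max', 'min': 'Min', 'mean': 'Average', 'sum': 'Total'})

# One figure per statistic, all reading from the same precomputed table
figures = [{'data': [{'type': 'scatter', 'x': daily.index, 'y': daily[agg], 'name': agg}],
            'layout': {'title': "Views over time - <b>Daily</b> ({0})".format(agg)}}
           for agg in ['Max', 'Min', 'Average', 'Total']]

# In the notebook each figure would be rendered with: iplot(fig)
```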

In [260]:
## How about weekly?
gamecols = dict(zip(top_3_games,np.random.choice(colorcodes,3,False))) # randomly assign a color from the list of colors


_ = [iplot(
        {
            'data':[
                    {'type':'scatter',
                     'x':twitch[twitch['game'] == game]\
                                 .resample('W')['views']\
                                 .agg([np.max,np.sum,np.mean,np.min])\
                                 .rename(columns={'amax':'Max','amin':'Min','mean':'Average','sum':'Total'})\
                                 .sort_index().index,
                     'y':twitch[twitch['game'] == game]\
                                 .resample('W')['views']\
                                 .agg([np.max,np.sum,np.mean,np.min])\
                                 .rename(columns={'amax':'Max','amin':'Min','mean':'Average','sum':'Total'})\
                                 .sort_index()[AGG],
                     'opacity':.7,
                     'mode':'markers+lines',
                     'name':"{0} - {1}".format(game,AGG),
                     'marker':{'line':{'width':5,
                                       'color':gamecols[game]},
                              'size':1,
                              'color':gamecols[game]}} 
                for game in top_3_games      
            ],
            'layout':{'title':"Top 3 twitch games<br>Views over time - <b>Weekly</b>",
                     'xaxis':{'title':'<b>Weeks'},'yaxis':{'title':'<b>Views'}}
        })
     for AGG in ['Max','Min','Average','Total']]

In [243]:
## How about hourly?
gamecols = dict(zip(top_3_games,np.random.choice(colorcodes,3,False))) # randomly assign a color from the list of colors



_ = [iplot(
        {
            'data':[
                    {'type':'scatter',
                     'x':twitch[twitch['game'] == game]\
                                 .groupby(twitch[twitch['game'] == game].index.hour)\
                                 .agg([np.max,np.sum,np.mean,np.min])\
                                 .rename(columns={'amax':'Max','amin':'Min','mean':'Average','sum':'Total'})\
                                 .sort_index().index,
                     'y':twitch[twitch['game'] == game]\
                                 .groupby(twitch[twitch['game'] == game].index.hour)\
                                 .agg([np.max,np.sum,np.mean,np.min])\
                                 .rename(columns={'amax':'Max','amin':'Min','mean':'Average','sum':'Total'})\
                                 .sort_index()['views'][AGG],
                     'opacity':.7,
                     'mode':'markers+lines',
                     'name':"{0} - {1}".format(game,AGG),
                     'marker':{'line':{'width':5,
                                       'color':gamecols[game]},
                              'size':1,
                              'color':gamecols[game]}} 
                for game in top_3_games      
            ],
            'layout':{'title':"Top 3 twitch games<br>Views over time - <b>Hourly</b>",
                     'xaxis':{'title':'<b>Hours'},'yaxis':{'title':'<b>Views'}}
        }
) for AGG in ['Max','Min','Average','Total']]

## Another personal favorite - the humble Boxplot

In [242]:
# Top 10 games
top_10_games = twitch                                \
                .groupby('game')                      \
                .mean()                               \
                .sort_values('views',ascending=False) \
                .index[:10].tolist()
top_10_games
colorcodes= "#F5B7B1 #D7BDE2 #A9CCE3 #AED6F1 #A3E4D7 #A9DFBF #FAD7A0 #F5CBA7 #E5E7E9 #AEB6BF".split()
gamecols = dict(zip(top_10_games,np.random.choice(colorcodes,10,False))) # randomly assign a color from the list of colors


iplot(
        {
            'data':[
                    {'type':'box',
                     'x':twitch[twitch['game'] == game].game,
                     'y':twitch[twitch['game'] == game]['views'],
                     'opacity':.7,
                     'name':"{0}".format(game),
                     'boxpoints':'all',
                     'marker':{'color':gamecols[game]}} 
                for game in top_10_games      
            ],
            'layout':{'title':"<b>Top 10 twitch games</b>",
                     'xaxis':{'title':'<br><br><b>Game'},'yaxis':{'title':'<b>Views'},
                     'margin':{'b':150}}
        }
)

## Change box order by median
- In this example, the easiest way to re-order the box plots is to re-order the `top_10_games` list by median
- The second approach is to set the `game` field as type Categorical where the values are ordered by median
- Use the `margin` key:value parameter to add border space for labels that get cut off
    - in this case, increase the bottom margin to 150 pixels: `'layout':{'margin':{'b':150}}`

In [293]:
# Re-order top 10 games (by avg views) by median views in ascending order
twitch[twitch['game'].isin(top_10_games)].groupby('game').median().sort_values('views')

game,views
E3 2018,598
Bangai-O!,6517
Realm Royale,25817
Hearthstone,37076
PLAYERUNKNOWN'S BATTLEGROUNDS,41489
IRL,53025
Dota 2,61332
Wacky Races,67714
League of Legends,96513
Fortnite,116955


In [256]:
top_10_games_median = twitch[twitch['game'].isin(top_10_games)].groupby('game').median().sort_values('views').index
gamecols = dict(zip(top_10_games_median,np.random.choice(colorcodes,10,False))) # randomly assign a color from the list of colors



iplot(
        {
            'data':[
                    {'type':'box',
                     'x':twitch[twitch['game'] == game].game,
                     'y':twitch[twitch['game'] == game]['views'],
                     'opacity':.7,
                     'name':"{0}".format(game),
                     'boxpoints':'all',
                     'marker':{'color':gamecols[game]}} 
                for game in top_10_games_median      
            ],
            'layout':{'title':"<b>Top 10 twitch games - by Average</b><br>Ordered by Median",
                     'xaxis':{'title':'<br><br><b>Game'},'yaxis':{'title':'<b>Views'},
                     'margin':{'b':150}}
        }
)
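
The second approach mentioned above, an ordered Categorical, might look like the following sketch on made-up data. The frame and its values are assumptions; the `categoryorder` and `categoryarray` axis settings are an optional extra for pinning the axis order in the layout.

```python
import pandas as pd

# Hypothetical sample: two observations per game
sample = pd.DataFrame({
    'game': ['Fortnite', 'IRL', 'Dota 2', 'Fortnite', 'IRL', 'Dota 2'],
    'views': [116955, 53025, 61332, 120000, 50000, 60000],
})

# Category order = games sorted by median views, ascending
order = sample.groupby('game')['views'].median().sort_values().index.tolist()
sample['game'] = pd.Categorical(sample['game'], categories=order, ordered=True)

# Sorting on the categorical column now follows the median order, not the alphabet
ordered_games = sample.sort_values('game')['game'].astype(str).unique().tolist()

# Optionally pin the same order on the x axis of the layout
layout = {'xaxis': {'categoryorder': 'array', 'categoryarray': order}}
```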

## Max views by day
- Fix overlapping axis labels with the `dtick` key:value parameter
    - Set `'xaxis':{'dtick':'2'}` to display every other x-axis tick (the tick labels here are strings, so the axis is categorical and `dtick` counts categories)

In [291]:
iplot(
        {
            'data':[
                    {'type':'bar',
                     'x':twitch.resample('D').max().index.map(lambda day: "June {}<br>{}".format(day.day,day.day_name()[:3])),
                     'y':twitch.resample('D')['views'].max(),
                     'opacity':.7,} 
            ],
            'layout':{'title':"<b>Daily Max Views</b>",
                     'xaxis':{'title':'<b>Day','dtick':'2'},'yaxis':{'title':'<b>Views'},
                     }        }
)
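
A minimal sketch of the label and `dtick` combination, using only the standard library and a made-up three-day span. The `days`, `labels`, and `layout` names are illustrative.

```python
from datetime import date, timedelta

# Hypothetical three-day span starting Sunday, June 3, 2018
days = [date(2018, 6, 3) + timedelta(days=n) for n in range(3)]

# String labels make the x axis categorical, so dtick counts categories
labels = ["{0:%B} {1}<br>{0:%a}".format(d, d.day) for d in days]

layout = {'title': "<b>Daily Max Views</b>",
          'xaxis': {'title': '<b>Day</b>', 'dtick': 2},   # label every other day
          'yaxis': {'title': '<b>Views</b>'}}
```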

## Bus Breakdown Questions:
- Are bus breakdowns increasing or decreasing year to year? In duration and quantity?
- Are bus breakdowns seasonal? When?
- Which borough has the most bus breakdowns? Relative and absolute?
- Which bus number has the most breakdowns? Relative and absolute?
- Are buses that service multiple schools more susceptible to breakdowns?
- Which route has the most breakdowns? Is it seasonal?
- Which bus company has the most breakdowns? Which has the least?
- Which bus company is the worst at notifying parents of delays?
- What is the longest delay? When did it occur?
- What is the most frequent reason for delays?

In [294]:
busbreak = pd.read_csv('Data/bus-breakdown-and-delays.csv',encoding='utf-8')


Columns (17) have mixed types. Specify dtype option on import or set low_memory=False.



In [324]:
for i,col in enumerate(busbreak.columns):
    print i,col,busbreak[col].dtype

0 School_Year object
1 Busbreakdown_ID int64
2 Run_Type object
3 Bus_No object
4 Route_Number object
5 Reason object
6 Schools_Serviced object
7 Occurred_On object
8 Created_On object
9 Boro object
10 Bus_Company_Name object
11 How_Long_Delayed object
12 Number_Of_Students_On_The_Bus int64
13 Has_Contractor_Notified_Schools object
14 Has_Contractor_Notified_Parents object
15 Have_You_Alerted_OPT object
16 Informed_On object
17 Incident_Number object
18 Last_Updated_On object
19 Breakdown_or_Running_Late object
20 School_Age_or_PreK object


In [322]:
# set NAN incident number to -1
busbreak.loc[busbreak['Incident_Number'].isna(),'Incident_Number'] = -1
# cast every incident number to a string
busbreak['Incident_Number'] = busbreak['Incident_Number'].astype(str) 
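
The mixed-type warning can also be handled at import time by declaring the column's dtype, as the warning itself suggests. The sketch below reads a made-up two-row extract from memory; the CSV text and the incident number in it are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row extract; Incident_Number mixes blanks and text IDs
csv_text = u"Busbreakdown_ID,Incident_Number\n1212699,\n1212700,90323827\n"

# Declaring the column as str up front avoids mixed-type inference on import
sample = pd.read_csv(io.StringIO(csv_text), dtype={'Incident_Number': str})

# Missing values are still NaN; fill them with the same '-1' sentinel used above
sample['Incident_Number'] = sample['Incident_Number'].fillna('-1')
```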

In [337]:
# drop Created On field - offers no additional insights
busbreak = busbreak.drop('Created_On',axis=1)
busbreak.head()

Unnamed: 0,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Occurred_On,Boro,Bus_Company_Name,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK
0,2015-2016,1212699,Special Ed AM Run,48186,N758,Other,75485,2015-09-02T06:27:00,Nassau County,"BORO TRANSIT, INC.",25 minutes,0,Yes,No,No,2015-09-02T06:29:00,-1,2015-09-02T06:29:16,Running Late,School-Age
1,2015-2016,1212700,Special Ed AM Run,2518,L530,Mechanical Problem,21854,2015-09-02T06:24:00,Brooklyn,"RELIANT TRANS, INC. (B232",,0,Yes,Yes,Yes,2015-09-02T06:30:00,-1,2015-09-02T06:30:19,Breakdown,School-Age
2,2015-2016,1212701,Special Ed AM Run,235,K168,Other,18366,2015-09-02T06:45:00,Brooklyn,"NEW DAWN TRANSIT, LLC (B2",30MINS,0,Yes,Yes,No,2015-09-02T06:47:00,-1,2015-09-02T08:05:39,Running Late,School-Age
3,2015-2016,1212703,Special Ed AM Run,2102,K216,Other,21501,2015-09-02T06:55:00,Brooklyn,EMPIRE STATE BUS CORP.,20 min,1,Yes,Yes,No,2015-09-02T07:02:00,-1,2015-09-02T07:02:01,Running Late,School-Age
4,2015-2016,1212704,Special Ed AM Run,48162,N861,Mechanical Problem,75485,2015-09-02T06:55:00,Nassau County,"BORO TRANSIT, INC.",30 min,0,Yes,Yes,No,2015-09-02T07:04:00,-1,2015-09-02T07:04:25,Running Late,School-Age


In [338]:
# update timestamp fields as Pandas Datetime
for timefield in ["Occurred_On","Informed_On","Last_Updated_On"]:
    busbreak[timefield] = pd.to_datetime(busbreak[timefield])


In [349]:
busbreak['Informed_gap'] = (busbreak['Informed_On'] - busbreak['Occurred_On'])
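
Subtracting two datetime columns yields a timedelta column. For plotting or bucketing it is often handier as a number of minutes, which can be sketched as follows using the first two timestamp pairs from the table above.

```python
import pandas as pd

# First two Occurred_On / Informed_On pairs from the table above
occurred = pd.to_datetime(pd.Series(['2015-09-02T06:27:00', '2015-09-02T06:24:00']))
informed = pd.to_datetime(pd.Series(['2015-09-02T06:29:00', '2015-09-02T06:30:00']))

gap = informed - occurred                      # timedelta Series
gap_minutes = gap.dt.total_seconds() / 60.0    # plain floats, easy to plot or bucket
```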

## Clean up the `How_Long_Delayed` field

In [424]:
def bucket_delay(delay):
    """Bucket a free-text delay (e.g. '25 minutes', '20-30 min') into a range label."""
    def dc(delay):
        # strip out 'Minutes' and any variation by removing all non-digit or '-' characters
        strip_letters = ''.join([x for x in delay if (x.isdigit() or x == "-")])
        # no digits at all (e.g. 'nan'): return "-1", whose lower bound below is '' -> 'No Value'
        return strip_letters if len(strip_letters) > 0 else "-1"

    # for a range such as '20-30', bucket on the lower bound
    delay_strip = dc(delay).split('-')[0]
    if delay_strip == '':
        return 'No Value'
    minutes = int(delay_strip)
    if minutes < 16:
        return "0-15"
    elif minutes < 31:
        return "16-30"
    elif minutes < 46:
        return "31-45"
    elif minutes < 61:
        return "46-60"
    elif minutes < 121:  # include 120 so the bucket matches its "61-120" label
        return "61-120"
    else:
        return "other"

In [426]:
busbreak['How_Long_Delayed'] = busbreak['How_Long_Delayed'].astype(str).map(bucket_delay)
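
For comparison, here is a compact alternative sketch of the same bucketing idea using `re` and `bisect` from the standard library. It is a hypothetical variant (`bucket_minutes`, `EDGES`, and `LABELS` are illustrative names) and not a drop-in replacement: it buckets on the first run of digits, so it can differ from `bucket_delay` on unusual inputs such as text that starts with a dash.

```python
import bisect
import re

EDGES = [15, 30, 45, 60, 120]    # inclusive upper bound of each bucket
LABELS = ["0-15", "16-30", "31-45", "46-60", "61-120", "other"]

def bucket_minutes(text):
    """Hypothetical variant: bucket on the first run of digits found in the text."""
    match = re.search(r'\d+', text)
    if match is None:
        return 'No Value'
    # bisect_left returns the index of the first edge >= the value
    return LABELS[bisect.bisect_left(EDGES, int(match.group()))]
```

Keeping the edges and labels in two parallel lists means a bucket boundary can be changed in one place.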

## Are bus breakdowns increasing or decreasing year to year? In duration and quantity?

In [495]:
busbreak = busbreak.loc[busbreak['School_Year'] != '2019-2020',:] # drop value from 2019-2020

# limit to full school years
full_schoolyear = ['2015-2016','2016-2017','2017-2018']

buscolors = "#A569BD #D98880 #F5B7B1 #D7BDE2 #A9CCE3 #AED6F1 #A3E4D7 #A9DFBF #FAD7A0 #F5CBA7 #E5E7E9 #AEB6BF".split()

borocols = dict(zip(busbreak.Boro.unique(),np.random.choice(buscolors,12,False))) # randomly assign a color from the list of colors

# count delays per full school year once and reuse the result for both traces
yearly = busbreak[busbreak.School_Year.isin(full_schoolyear)].groupby('School_Year')['Bus_No'].count()
palette = list(borocols.values())  # list() keeps indexing valid on Python 3 as well

data = []

data.append({'type':'scatter',
             'x':yearly.index,
             'y':yearly.values,
             'opacity':.9,
             'name':'',
             'mode':'markers+lines',
             'marker':{'line':{'width':5,
                               'color':palette[5]},
                       'size':2,
                       'color':palette[5]},
             'showlegend':False})
data.append({'type':'bar',
             'x':yearly.index,
             'y':yearly.values,
             'opacity':.7,
             'name':'',
             'marker':{'color':palette[4]},
             'showlegend':False})

iplot(
        {
            'data':data,
            'layout':{'title':"<b>Schoolbus Delays by School Year</b>",
                     'xaxis':{'title':'<b>School Year',},'yaxis':{'title':'<b>Delays'},
                     }        }
)

## There is a clear year-over-year increase in schoolbus delays
- Note: the pace of increase slowed down from 2016-2017 to 2017-2018

### How does this look on a Boro by Boro basis?

In [496]:
data = []
for b in busbreak.Boro.unique():
    data.append({'type':'scatter',
                             'x':busbreak[(busbreak.School_Year.isin(full_schoolyear)) & (busbreak.Boro == b)].groupby('School_Year').count().index,
                             'y':busbreak[(busbreak.School_Year.isin(full_schoolyear)) & (busbreak.Boro == b)].groupby('School_Year').count()['Bus_No'],
                     'opacity':.7,
                     'name':b,
                     'mode':'lines',
                     'marker':{'line':{'width':5,
                                       'color':borocols[b]},
                               'size':1,
                               'color':borocols[b]}})
    data.append({'type':'bar',
                             'x':busbreak[(busbreak.School_Year.isin(full_schoolyear)) & (busbreak.Boro == b)].groupby('School_Year').count().index,
                             'y':busbreak[(busbreak.School_Year.isin(full_schoolyear)) & (busbreak.Boro == b)].groupby('School_Year').count()['Bus_No'],
                     'opacity':.7,
                     'name':b,
                     'marker':{'color':borocols[b]},
                     'showlegend':False})


iplot(
        {
            'data':data,
            'layout':{'title':"<b>Schoolbus Delays by School Year</b>",
                     'xaxis':{'title':'<b>School Year',},'yaxis':{'title':'<b>Delays'},
                     }        
        }
)

## The worst offender for bus delays is Manhattan, which shot past the Bronx in the 2017-2018 school year

In [539]:
schoolyear_delays = busbreak.pivot_table(values='Bus_No',columns='How_Long_Delayed',index=['School_Year'],aggfunc=np.size)
schoolyear_delays.columns.name = 'Delay in Min'
schoolyear_delays

School_Year,0-15,16-30,31-45,46-60,61-120,No Value,other
2015-2016,17450.0,27937.0,5633.0,1417.0,275.0,9830.0,642.0
2016-2017,19796.0,39526.0,10260.0,835.0,123.0,12015.0,586.0
2017-2018,11291.0,30997.0,22889.0,10728.0,3187.0,10326.0,15.0
2018-2019,1596.0,3998.0,6246.0,2817.0,1158.0,1644.0,
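
The counts above are floats, with a blank where a combination never occurs. If integer counts with zeros are preferred, `pd.crosstab` is one option. The sketch below uses a made-up five-row sample, not the bus data.

```python
import pandas as pd

# Hypothetical sample of already-bucketed delays
sample = pd.DataFrame({
    'School_Year': ['2015-2016', '2015-2016', '2016-2017', '2016-2017', '2016-2017'],
    'How_Long_Delayed': ['0-15', '16-30', '0-15', '0-15', 'No Value'],
})

# crosstab counts rows per (index, column) pair and fills missing pairs with 0
counts = pd.crosstab(sample['School_Year'], sample['How_Long_Delayed'])
```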


In [545]:
iplot(
        {
            'data':[{'type':'scatter',
                    'x':schoolyear_delays[schoolyear_delays.index.isin(full_schoolyear)][col].index,
                    'y':schoolyear_delays[schoolyear_delays.index.isin(full_schoolyear)][col].values,
                    'opacity':.7,
                    'name':"{} Minutes".format(col) if col not in ['No Value','other'] else col}
                   for col in schoolyear_delays.columns],
            'layout':{'title':"<b>Schoolbus Delays by School Year</b>",
                     'xaxis':{'title':'<b>School Year',},'yaxis':{'title':'<b>Delays'},
                     }        
        }
)

## Sadly, the shortest delays (0-15 minutes) are decreasing, while the 31-45, 46-60, and 61-120 minute delays all increase substantially going into 2017-2018

### How does this picture change across Boros?
    - Bronx
    - Manhattan
    - Brooklyn
    - Queens
    - Staten Island
    - Westchester (County)
    - Rockland County

In [657]:
boro7 = ['Bronx','Manhattan','Brooklyn','Queens','Staten Island','Westchester','Rockland County']
boro_delays = busbreak.pivot_table(values='Bus_No',columns='How_Long_Delayed',index=['School_Year','Boro'],aggfunc=np.size)
boro_delays.loc[(full_schoolyear,boro7),:]

School_Year,Boro,0-15,16-30,31-45,46-60,61-120,No Value,other
2015-2016,Bronx,7066.0,7104.0,655.0,168.0,13.0,2094.0,20.0
2015-2016,Brooklyn,2609.0,6797.0,2634.0,600.0,122.0,2713.0,164.0
2015-2016,Manhattan,2594.0,6150.0,825.0,300.0,88.0,1423.0,190.0
2015-2016,Queens,1432.0,4452.0,967.0,240.0,36.0,2703.0,44.0
2015-2016,Rockland County,81.0,124.0,13.0,2.0,1.0,9.0,5.0
2015-2016,Staten Island,1619.0,771.0,93.0,21.0,7.0,251.0,3.0
2015-2016,Westchester,926.0,837.0,94.0,9.0,2.0,44.0,201.0
2016-2017,Bronx,7649.0,10598.0,1131.0,136.0,43.0,3767.0,65.0
2016-2017,Brooklyn,2807.0,8926.0,4476.0,274.0,28.0,2902.0,224.0
2016-2017,Manhattan,3042.0,10096.0,1805.0,246.0,33.0,1348.0,191.0


In [581]:
sy_delay_boro = boro_delays.loc[(full_schoolyear,boro7),:]

syd_boro = sy_delay_boro.reset_index()

delaycols = dict(zip(syd_boro.columns.tolist()[2:-2],borocols.values()))

buscat = busbreak.loc[busbreak['Boro'].isin(boro7),:].groupby('Boro')['Bus_No'].count().sort_values(ascending=False).index
syd_boro['Boro'] = pd.Categorical(syd_boro['Boro'],categories = buscat, ordered=True)

In [684]:
iplot(
        {
            'data':[{'type':'scatter3d',
                    'y':syd_boro[(syd_boro.School_Year == s)].sort_values('Boro').School_Year,
                    'z':syd_boro[(syd_boro.School_Year == s)].sort_values('Boro')[col],
                    'x':syd_boro[(syd_boro.School_Year == s)].sort_values('Boro').Boro,
                    'opacity':.9,
                    'name':"{} Minute Delay".format(col),
                    'marker':{'line':{'width':4,
                                       'color':delaycols[col]},
                               'size':5,
                               'color':delaycols[col]},
                    'showlegend':True if s == '2015-2016' else False}
                   for col in syd_boro.columns.tolist()[2:-2] for s in full_schoolyear[::-1]],
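            'layout':{'title':"<b>Schoolbus Delays by School Year</b>",
                     # 3D axis titles are set under 'scene'; x is Boro, y is School Year, z is the delay count
                     'scene':{'xaxis':{'title':'Boro'},'yaxis':{'title':'School Year'},'zaxis':{'title':'Delays'}},
                     }        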
        }
)

## Across the 5 Boros of NYC, there is a notable increase in 46-60 minute delays in 2017-2018 compared with prior years
### Manhattan has a massive uptick in 31-45 minute delays compared to previous years
### On a positive note, the 16-30 minute delays have decreased back to the 2015-2016 levels

# Are bus breakdowns seasonal?

In [705]:
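# bbtime: busbreak indexed by the 'Occurred_On' timestamp (definition inferred from the bbtime.head() output further down)
bbtime = busbreak.set_index('Occurred_On')
bbtime.resample('W').count()['Bus_No'].head()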

Occurred_On
2015-09-06      89
2015-09-13    1786
2015-09-20    1462
2015-09-27    1135
2015-10-04    1814
Freq: W-SUN, Name: Bus_No, dtype: int64

In [711]:
bbtime.head()

Occurred_On,School_Year,Busbreakdown_ID,Run_Type,Bus_No,Route_Number,Reason,Schools_Serviced,Boro,Bus_Company_Name,How_Long_Delayed,Number_Of_Students_On_The_Bus,Has_Contractor_Notified_Schools,Has_Contractor_Notified_Parents,Have_You_Alerted_OPT,Informed_On,Incident_Number,Last_Updated_On,Breakdown_or_Running_Late,School_Age_or_PreK,Informed_gap
2015-09-02 06:27:00,2015-2016,1212699,Special Ed AM Run,48186,N758,Other,75485,Nassau County,"BORO TRANSIT, INC.",16-30,0,Yes,No,No,2015-09-02 06:29:00,-1,2015-09-02 06:29:16,Running Late,School-Age,00:02:00
2015-09-02 06:24:00,2015-2016,1212700,Special Ed AM Run,2518,L530,Mechanical Problem,21854,Brooklyn,"RELIANT TRANS, INC. (B232",No Value,0,Yes,Yes,Yes,2015-09-02 06:30:00,-1,2015-09-02 06:30:19,Breakdown,School-Age,00:06:00
2015-09-02 06:45:00,2015-2016,1212701,Special Ed AM Run,235,K168,Other,18366,Brooklyn,"NEW DAWN TRANSIT, LLC (B2",16-30,0,Yes,Yes,No,2015-09-02 06:47:00,-1,2015-09-02 08:05:39,Running Late,School-Age,00:02:00
2015-09-02 06:55:00,2015-2016,1212703,Special Ed AM Run,2102,K216,Other,21501,Brooklyn,EMPIRE STATE BUS CORP.,16-30,1,Yes,Yes,No,2015-09-02 07:02:00,-1,2015-09-02 07:02:01,Running Late,School-Age,00:07:00
2015-09-02 06:55:00,2015-2016,1212704,Special Ed AM Run,48162,N861,Mechanical Problem,75485,Nassau County,"BORO TRANSIT, INC.",16-30,0,Yes,Yes,No,2015-09-02 07:04:00,-1,2015-09-02 07:04:25,Running Late,School-Age,00:09:00


In [757]:
yearcolors = dict(zip(busbreak.School_Year.unique(),buscolors))

bbtime = busbreak.set_index('Occurred_On').resample('W')['Bus_No'].count()

iplot(
      {'data':[{'type':'bar',
           'x':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('W')['Bus_No']\
                    .count().index.map(lambda x: "week - {0} - {1}".format(x.week,x.month_name()[:3])),
           'y':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('W')['Bus_No']\
                    .count().values,
               'name':yr,
               'opacity':0.7,
               'marker':{'color':yearcolors[yr]}}
              for yr in busbreak.School_Year.unique()],
       'layout':{'title':'Bus Breakdowns by Week','barmode':'stack',
                'xaxis':{'title':'Week of the year','dtick':4}}
      }

)


In [758]:
iplot(
      {'data':[{'type':'bar',
           'x':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('M')['Bus_No']\
                    .count().index.month_name(),
           'y':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('M')['Bus_No']\
                    .count().values,
               'name':yr,
               'opacity':0.7,
               'marker':{'color':yearcolors[yr]}}
              for yr in busbreak.School_Year.unique()],
       'layout':{'title':'Bus Breakdowns by Month','barmode':'stack',
                'xaxis':{'title':'Month','dtick':2}}
      }

)


In [760]:
iplot(
      {'data':[{'type':'bar',
           'x':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('D')['Bus_No']\
                    .count().index.day_name(),
           'y':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('D')['Bus_No']\
                    .count().values,
               'name':yr,
               'opacity':0.7,
               'marker':{'color':yearcolors[yr]}}
              for yr in busbreak.School_Year.unique()],
       'layout':{'title':'Bus Breakdowns by Day',
                'xaxis':{'title':'Day of the week','dtick':2}}
      }

)


In [761]:
iplot(
      {'data':[{'type':'bar',
           'x':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('H')['Bus_No']\
                    .count().index.hour,
           'y':busbreak[busbreak.School_Year == yr]\
                    .set_index('Occurred_On')\
                    .resample('H')['Bus_No']\
                    .count().values,
               'name':yr,
               'opacity':0.7,
               'marker':{'color':yearcolors[yr]}}
              for yr in busbreak.School_Year.unique()],
       'layout':{'title':'Bus Breakdowns by hour',
                'xaxis':{'title':'Hour','dtick':2}}
      }

)
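
`resample('H')` keeps one bin per hour for every day in the range, so the hour values on the x axis repeat. To get a single bar per hour of the day, one option is to group on the hour component of the index, as the hourly Twitch plot did. A minimal sketch on made-up timestamps:

```python
import pandas as pd

# Hypothetical incident timestamps spread over two days
stamps = pd.to_datetime(['2015-09-02 06:27', '2015-09-02 06:55',
                         '2015-09-03 06:10', '2015-09-03 14:30'])
sample = pd.Series(1, index=stamps)

# Group on the hour component so every day contributes to the same 24 bins
by_hour = sample.groupby(sample.index.hour).count()

trace = {'type': 'bar', 'x': by_hour.index.tolist(), 'y': by_hour.tolist()}
```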


In [741]:
G.month_name()

Index([u'September', u'September', u'September', u'September', u'October',
       u'October', u'October', u'October', u'November', u'November',
       ...
       u'August', u'August', u'September', u'September', u'September',
       u'September', u'September', u'October', u'October', u'October'],
      dtype='object', name=u'Occurred_On', length=164)