### Usage Histogram Interactive Plot

Similar in usage to: https://github.com/bokeh/bokeh/blob/master/examples/app/selection_histogram.py

In [1]:
import pandas as pd
import numpy as np

In [2]:
import gzip
import pickle

This is the library data, processed into usage per computer per hour; it is multiplied by 100 below to express it as a percentage.

In [3]:
with gzip.open(r'../data/LibData.pkl.gz') as f:
    libraryData = pickle.load(f)
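
A minimal sketch of the gzip-plus-pickle round trip that the cell above relies on. The file name `example.pkl.gz` and the toy dictionary are hypothetical stand-ins, since the real data file is not reproduced here. `gzip.open` defaults to binary read mode (`'rb'`), which is what `pickle.load` needs.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for the real usage data.
toy_data = {"BL001": [0.0, 0.5, 1.0], "TL902": [1.0, 0.0, 0.25]}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.pkl.gz")

# Write: open in binary write mode and pickle into the compressed stream.
with gzip.open(path, "wb") as f:
    pickle.dump(toy_data, f)

# Read: the default mode of gzip.open is 'rb', so no mode argument is needed.
with gzip.open(path) as f:
    restored = pickle.load(f)
```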

The data run from 3/24/10 to 10/19/17, with one row per hour and one column per computer:

In [4]:
libraryData.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 66381 entries, 2010-03-24 12:00:00 to 2017-10-19 08:00:00
Freq: H
Columns: 312 entries, BL001 to TL902
dtypes: float64(312)
memory usage: 158.5 MB


The mask below selects an arbitrary date window (here, calendar year 2017); the hourly averages over that window are computed further down.

In [5]:
startDate = pd.to_datetime("2017-01-01")
endDate = pd.to_datetime("2017-12-31")
dateMask = (libraryData.index > startDate) & (libraryData.index < endDate)
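
A small sketch of how such a date mask behaves, using a hypothetical four-row frame rather than the real data. Note that the comparisons are strict, so a reading stamped exactly at the start date is excluded.

```python
import pandas as pd

# Hypothetical hourly readings straddling the start of 2017.
idx = pd.to_datetime(["2016-12-31 23:00", "2017-01-01 00:00",
                      "2017-01-01 01:00", "2017-01-01 02:00"])
toy = pd.DataFrame({"BL001": [0.0, 0.25, 0.5, 0.75]}, index=idx)

start = pd.to_datetime("2017-01-01")
end = pd.to_datetime("2017-12-31")

# Boolean array, one entry per row; `&` combines the two conditions element-wise.
mask = (toy.index > start) & (toy.index < end)
selected = toy[mask]
```

Switching `>` to `>=` would be one way to include the midnight reading, if that is wanted.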

The computer attributes need to be loaded into a separate dataframe:

In [6]:
compAttrs = pd.read_csv(r'../data/computerAttributes.csv',header=0)

In [7]:
compAttrs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 14 columns):
dbID                  312 non-null int64
computerName          312 non-null object
requiresLogon         311 non-null float64
isDesktop             311 non-null float64
inJackson             311 non-null float64
location              306 non-null object
is245                 311 non-null float64
floor                 245 non-null object
numMonitors           304 non-null float64
largeMonitor          238 non-null float64
adjacentWindow        238 non-null float64
collaborativeSpace    238 non-null float64
roomIsolated          238 non-null float64
inQuietArea           238 non-null float64
dtypes: float64(10), int64(1), object(3)
memory usage: 34.2+ KB


In [8]:
booleanCols = ["requiresLogon",
               "isDesktop",
               "inJackson",
               "is245",
               "largeMonitor",
               "adjacentWindow",
               "collaborativeSpace",
               "roomIsolated",
               "inQuietArea"]

Treating the attributes above as booleans, build a mask over the `compAttrs` dataframe and return the matching computer names. Various attribute combinations are being tested here; the commented-out lines are conditions that are currently switched off.

In [9]:
attrsNamesMask = compAttrs[(compAttrs.requiresLogon       == True)
                         & (compAttrs.isDesktop           == True)
                         & (compAttrs.inJackson           == False)
#                          & (compAttrs.is245               == True)
#                          & (compAttrs.floor               == 2)  # doesn't work yet: floor is an object (string) column, so comparing to the integer 2 likely matches nothing.
#                          & (compAttrs.largeMonitor        == True)
#                          & (compAttrs.adjacentWindow      == True)
#                          & (compAttrs.collaborativeSpace  == True)
#                          & (compAttrs.roomIsolated        == True)
#                          & (compAttrs.inQuietArea         == True)
                           ].computerName
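
The masking pattern can be sketched on a tiny, made-up attribute table (the machine names and values below are hypothetical). It also shows what happens to rows with missing attributes: NaN compares unequal to both `True` and `False`, so those rows drop out.

```python
import pandas as pd

# Hypothetical attribute table; None becomes NaN in a float column.
attrs = pd.DataFrame({
    "computerName": ["A001", "A002", "A003", "A004"],
    "requiresLogon": [1.0, 1.0, 0.0, None],
    "isDesktop": [1.0, 1.0, 1.0, 1.0],
    "inJackson": [0.0, 1.0, 0.0, 0.0],
})

# Each comparison yields a boolean Series; `&` combines them row by row.
keep = ((attrs.requiresLogon == True)
        & (attrs.isDesktop == True)
        & (attrs.inJackson == False))

# Select the matching rows, then keep only the name column.
names = attrs[keep].computerName
```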

In [10]:
libraryMeans = libraryData[dateMask].groupby(libraryData[dateMask].index.hour).mean()*100

In [11]:
libraryMeansNameMask = libraryMeans.loc[:,attrsNamesMask.values]
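
A sketch of the two steps above on hypothetical data: average by hour of day, scale to percent, then keep only the named columns. The sketch assumes the stored values are fractions of each hour in use, which is what the multiplication by 100 suggests.

```python
import pandas as pd

# Two hypothetical days of readings at hours 0 and 1.
idx = pd.to_datetime(["2017-01-02 00:00", "2017-01-02 01:00",
                      "2017-01-03 00:00", "2017-01-03 01:00"])
toy = pd.DataFrame({"A001": [0.0, 0.5, 1.0, 0.5],
                    "A002": [0.25, 0.25, 0.75, 0.25]}, index=idx)

# Group rows by hour of day, average across days, and scale to percent.
hourly_pct = toy.groupby(toy.index.hour).mean() * 100

# Keep only the columns for a chosen list of machine names.
subset = hourly_pct.loc[:, ["A001"]]
```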

In [12]:
libraryMeansNameMask.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 18 columns):
MLC0001    24 non-null float64
MLC0002    24 non-null float64
MLC0003    24 non-null float64
MLC0004    24 non-null float64
MLC0005    24 non-null float64
MLC0006    24 non-null float64
MLC0007    24 non-null float64
MLC0008    24 non-null float64
MLC0009    24 non-null float64
MLC0010    24 non-null float64
MLC0011    24 non-null float64
MLC0012    24 non-null float64
MLC0013    24 non-null float64
MLC0014    24 non-null float64
MLC0015    24 non-null float64
MLC0016    24 non-null float64
MLC0017    24 non-null float64
MLC0018    24 non-null float64
dtypes: float64(18)
memory usage: 3.6 KB


Since the result is a 24 (hours) x N (computers) matrix (24 x 18 for the mask above), and the scatter plot expects one-dimensional arrays, the data needs to be unstacked into long format, with one row per computer and hour.

In [13]:
meansUnstacked = libraryMeansNameMask.unstack().reset_index()
meansUnstacked.columns = ["comps","hour","means"]
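
To make the reshaping concrete, here is a sketch on a hypothetical 2 x 2 table. Calling `unstack()` on a dataframe with a flat index returns a Series indexed by (column, row), so each machine's hours end up in one contiguous block.

```python
import pandas as pd

# Hypothetical hours-by-machines table: 2 hours, 2 machines.
wide = pd.DataFrame({"A001": [10.0, 20.0], "A002": [30.0, 40.0]}, index=[0, 1])

# Series with a (column, row) MultiIndex, then flattened into three columns.
long = wide.unstack().reset_index()
long.columns = ["comps", "hour", "means"]
```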

In [14]:
meansUnstacked

     comps  hour      means
0  MLC0001     0   0.343643
1  MLC0001     1   0.342466
2  MLC0001     2   0.342466
3  MLC0001     3   0.342466
4  MLC0001     4   0.342466
5  MLC0001     5   0.342466
6  MLC0001     6   0.342466
7  MLC0001     7   0.685845
8  MLC0001     8   6.337003
9  MLC0001     9  15.463194


In [15]:
#meansUnstackedMerged = meansUnstacked.merge(compAttrs, left_on='comps',right_on='computerName')

Since the number of machines in the later graph might change, set variables here based on the count of machines in the dataframe above:

In [16]:
machineCount = meansUnstacked.comps.unique().size
recordCount = meansUnstacked.index.size
hourCount = 24
print(machineCount * hourCount)
print(recordCount)

432
432


The Bokeh libraries necessary for this graph:

In [17]:
from bokeh.layouts import row, column
from bokeh.models import BoxSelectTool, LassoSelectTool, Spacer, FuncTickFormatter, FixedTicker, HoverTool, ColumnDataSource
from bokeh.plotting import figure, output_file, output_notebook, show, save
output_notebook()

Bokeh provides a number of tools that can be included in the toolbar adjacent to the graph. This cell tests the available tools and their configuration. Hover, in this case, uses the data from the ColumnDataSource to populate the tooltips.

In [18]:
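hover = HoverTool(
    tooltips=[
        ("Computer", "@comps"),
        ("Hour", "@hour{0}:00"),  # read the hour from the data column rather than the cursor position
        ("Pct Use", "@means"),
        ("x", "$x"),  # cursor coordinates, kept for debugging
        ("y", "$y")
    ]
    # formatters dropped: its keys must name a data field, not a tooltip label,
    # and hour is an integer here, not a datetime.
)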
#TOOLS=[hover,"crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select"]
TOOLS=[hover,"crosshair,pan,wheel_zoom,box_zoom,reset,tap,save,box_select,poly_select,lasso_select"]

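A ColumnDataSource is needed to pass the dataframe values to the scatter plot later. `ColumnDataSource.from_df` returns a plain dict of columns, so the dataframe is passed to the constructor instead; its index is kept as a column named `index`.

In [19]:
source = ColumnDataSource(meansUnstacked)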
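
A small sketch of how a dataframe maps onto a `ColumnDataSource` when passed to the constructor directly. The toy frame is hypothetical. The unnamed dataframe index becomes a column called `index`, which is the x value the scatter call reads later.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Hypothetical long-format frame with the same column names as meansUnstacked.
long = pd.DataFrame({"comps": ["A001", "A001"],
                     "hour": [0, 1],
                     "means": [10.0, 20.0]})

# Each dataframe column becomes a named array in source.data.
source = ColumnDataSource(long)
columns = sorted(source.column_names)
```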

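These are the basic commands to create the graph known as `mainGraph`. Setting `select_every_mousemove = False` on the box and lasso select tools defers the selection computation until the selection region is complete, which should help performance on large datasets.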

In [20]:
mainGraph = figure(tools=TOOLS, plot_width=900, plot_height=600,
                     min_border=10, min_border_left=50,
                     toolbar_location="above",
                     x_axis_location=None, # this is left in, as the x-axis ticks are hard to read zoomed out.
                     #y_axis_location=None, 
                     title="Library Usage: Average Percent Utilization per Hour")
mainGraph.background_fill_color = "#fafafa"
mainGraph.select(BoxSelectTool).select_every_mousemove = False
mainGraph.select(LassoSelectTool).select_every_mousemove = False

Formatting the tickers requires some finesse. This first example uses a JavaScript snippet to format the tick values as a 24-hour clock, and a `FixedTicker` to constrain the ticks to the integers 0 through 23.

In [21]:
mainGraph.yaxis.formatter = FuncTickFormatter(code="""return Math.floor(tick)+':00'""")
mainGraph.yaxis.ticker = FixedTicker(ticks=list(range(0, 24)))  # list() so this also works where range is lazy (Python 3)
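
For readers less familiar with JavaScript, the formatter logic can be mirrored in Python. This is only an illustration of what the snippet computes; `format_hour_tick` is a hypothetical helper and is not used by Bokeh, which runs the JavaScript in the browser.

```python
import math

def format_hour_tick(tick):
    """Python mirror of the JavaScript formatter: floor the tick and append ':00'."""
    return "%d:00" % math.floor(tick)

# The labels the y-axis would show for the 24 fixed tick positions.
labels = [format_hour_tick(t) for t in range(0, 24)]
```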

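Formatting the x-axis requires aligning the computer names with the row positions in the unstacked dataframe: each machine's block of rows starts at a multiple of `hourCount`. Because `mainGraph` was created with `x_axis_location=None`, the figure has no x-axis, so these settings have no visible effect unless that argument is removed.

In [22]:
keys = list(range(0, recordCount, hourCount))
# meansUnstackedMerged is never created (the merge above is commented out), so use meansUnstacked.
values = list(meansUnstacked.comps.unique())
graphCompIndex = dict(zip(keys, values))
mainGraph.xaxis.ticker = FixedTicker(ticks=keys)
mainGraph.xaxis.major_label_overrides = graphCompIndex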
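
The tick-to-name mapping itself needs nothing beyond plain Python. A sketch with three hypothetical machines, assuming 24 rows per machine in unstacked order:

```python
# Hypothetical machine names, in the order they appear in the unstacked frame.
machine_names = ["A001", "A002", "A003"]
hour_count = 24
record_count = len(machine_names) * hour_count

# Each machine's block of rows starts at a multiple of hour_count.
tick_positions = list(range(0, record_count, hour_count))

# Map each starting row position to the machine name shown at that tick.
label_overrides = dict(zip(tick_positions, machine_names))
```

Placing each tick at the first row of a block means the label marks where a machine's hours begin, not the middle of the block.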

In [None]:
mainGraph.scatter("index","hour",radius=5,color="blue",alpha=.4,source=source)
#output_file("./AvgPercentUtil.html", title='Library Usage: Average Percent Utilization per Hour')
show(mainGraph)

In [None]:
from IPython.display import Image

![title](./avgPercentUtil.png)