### Usage Histogram Interactive Plot

Similar in usage to the Bokeh selection histogram example: https://github.com/bokeh/bokeh/blob/master/examples/app/selection_histogram.py

In [1]:
import pandas as pd
import numpy as np

In [2]:
import gzip
import pickle

This is the library data, already processed into percent usage per hour for each computer.

In [3]:
with gzip.open(r'../data/LibData.pkl.gz') as f:
    libraryData = pickle.load(f)
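
As a minimal sketch of the same load pattern, the round trip below writes and reads a gzip-compressed pickle. The `sample` dictionary and the temporary path are hypothetical stand-ins for the real dataframe and `../data/LibData.pkl.gz`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical stand-in for the real DataFrame: any picklable object works.
sample = {"BL001": [0.10, 0.25], "BL002": [0.00, 0.40]}

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.pkl.gz")

# Write: open the gzip stream in binary mode and pickle into it.
with gzip.open(path, "wb") as f:
    pickle.dump(sample, f, protocol=2)

# Read: "rb" is the default mode of gzip.open, spelled out here for clarity.
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)
```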

The data covers 3/24/10 through 10/19/17:

In [4]:
libraryData.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 66381 entries, 2010-03-24 12:00:00 to 2017-10-19 08:00:00
Freq: H
Columns: 312 entries, BL001 to TL902
dtypes: float64(312)
memory usage: 158.5 MB


Build a date mask with arbitrary endpoints. It is used further down to restrict the library data before averaging per hour.

In [5]:
startDate = pd.to_datetime("2014-01-01")
endDate = pd.to_datetime("2017-12-31")
dateMask = (libraryData.index > startDate) & (libraryData.index < endDate)
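
The mask uses strict inequalities, so a reading that falls exactly on either endpoint is excluded. The sketch below illustrates this on a hypothetical five-row series (the timestamps and values are made up for the example) and contrasts it with an inclusive mask.

```python
import pandas as pd

# Hypothetical readings placed on and around the two endpoints.
idx = pd.to_datetime([
    "2013-12-31 23:00",
    "2014-01-01 00:00",   # exactly the start endpoint
    "2014-01-01 01:00",
    "2017-12-30 23:00",
    "2017-12-31 00:00",   # exactly the end endpoint
])
usage = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=idx)

start = pd.to_datetime("2014-01-01")
end = pd.to_datetime("2017-12-31")

# Strict comparison, as in the cell above: both endpoints are dropped.
strict = (usage.index > start) & (usage.index < end)

# Inclusive variant, shown only for comparison.
inclusive = (usage.index >= start) & (usage.index <= end)

print(usage[strict])
print(usage[inclusive])
```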

The computer attributes need to be loaded into a separate dataframe:

In [6]:
compAttrs = pd.read_csv(r'../data/computerAttributes.csv',header=0)

In [7]:
compAttrs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 14 columns):
dbID                  312 non-null int64
computerName          312 non-null object
requiresLogon         311 non-null float64
isDesktop             311 non-null float64
inJackson             311 non-null float64
location              306 non-null object
is245                 311 non-null float64
floor                 245 non-null object
numMonitors           304 non-null float64
largeMonitor          238 non-null float64
adjacentWindow        238 non-null float64
collaborativeSpace    238 non-null float64
roomIsolated          238 non-null float64
inQuietArea           238 non-null float64
dtypes: float64(10), int64(1), object(3)
memory usage: 34.2+ KB


In [8]:
booleanCols = ["requiresLogon",
               "isDesktop",
               "inJackson",
               "is245",
               "largeMonitor",
               "adjacentWindow",
               "collaborativeSpace",
               "roomIsolated",
               "inQuietArea"]

Using the attributes above as booleans, build a mask on the `compAttrs` dataframe and return the matching computer names. Conditions are commented in or out below to test different attribute combinations.

In [9]:
attrsNamesMask = compAttrs[(compAttrs.requiresLogon       == True)
                         & (compAttrs.isDesktop           == True)
                         & (compAttrs.inJackson           == True)
#                          & (compAttrs.is245               == True)
#                          & (compAttrs.floor               == 2)  #this one doesn't work yet.
#                          & (compAttrs.largeMonitor        == True)
#                          & (compAttrs.adjacentWindow      == True)
#                          & (compAttrs.collaborativeSpace  == True)
#                          & (compAttrs.roomIsolated        == True)
#                          & (compAttrs.inQuietArea         == True)
                           ].computerName
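
The flag columns are stored as `float64` with some missing values, so `== True` matches `1.0` and silently drops rows where the flag is `NaN`. The sketch below shows that behavior on a hypothetical four-row frame and builds the mask from a list of column names instead of chained conditions. It also assumes that `floor` holds strings (the `info()` output above only shows it as `object`), which would explain why comparing it to the integer `2` matches nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of compAttrs: flags stored as floats, with gaps.
attrs = pd.DataFrame({
    "computerName": ["A1", "A2", "A3", "A4"],
    "requiresLogon": [1.0, 1.0, 0.0, np.nan],
    "isDesktop": [1.0, 0.0, 1.0, 1.0],
    "floor": ["2", "2", "1", None],   # assumption: floors stored as strings
})

wanted = ["requiresLogon", "isDesktop"]

# 1.0 == True holds, while NaN == True is False, so rows with a missing
# flag never pass the filter.
mask = np.logical_and.reduce([attrs[col] == True for col in wanted])
names = attrs.loc[mask, "computerName"]

# An object column needs a comparison value of the matching type.
second_floor = attrs.loc[attrs["floor"] == "2", "computerName"]

print(list(names))
print(list(second_floor))
```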

In [10]:
libraryMeans = libraryData[dateMask].groupby(libraryData[dateMask].index.hour).mean()*100
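
To illustrate what the grouping above produces, the sketch below applies the same `groupby(index.hour).mean() * 100` pattern to a hypothetical two-day, two-computer frame. Each row of the result is an hour of the day and each value is the mean usage for that hour, scaled to percent.

```python
import pandas as pd

# Hypothetical usage fractions for two computers over two days.
idx = pd.to_datetime([
    "2014-01-01 00:00", "2014-01-01 01:00",
    "2014-01-02 00:00", "2014-01-02 01:00",
])
usage = pd.DataFrame(
    {"PC1": [0.0, 0.5, 1.0, 0.5], "PC2": [0.2, 0.0, 0.4, 1.0]},
    index=idx,
)

# Group rows by hour of day, average each computer, and scale to percent.
hourly = usage.groupby(usage.index.hour).mean() * 100
print(hourly)
```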

In [11]:
libraryMeansNameMask = libraryMeans.loc[:,attrsNamesMask.values]

In [12]:
libraryMeansNameMask.head()

|  | BL001 | BL002 | CITI001 | CITI002 | CITI003 | CITI004 | CITI005 | CITI006 | CITI007 | CITI008 | ... | TC701 | TL7001 | TL702 | TC8001 | TL801 | TL802 | TC901 | TL90003 | TL901 | TL902 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.919358 | 6.505413 | 5.917757 | 6.831236 | 5.105177 | 3.904129 | 3.654862 | 8.502025 | 6.317602 | 3.532474 | ... | 16.227139 | 15.863417 | 11.035971 | 9.245541 | 10.583019 | 11.146817 | 1.551217 | 3.073335 | 7.792467 | 16.559259 |
| 1 | 6.605202 | 5.206064 | 4.93657 | 6.107377 | 4.726449 | 3.135932 | 2.630313 | 7.10774 | 5.628477 | 2.575806 | ... | 13.969767 | 14.17216 | 8.988107 | 8.31683 | 9.040848 | 8.691397 | 1.504919 | 3.014281 | 6.582077 | 14.132584 |
| 2 | 6.038368 | 4.303542 | 4.245557 | 5.409049 | 4.49858 | 2.756343 | 2.230482 | 6.218136 | 5.068355 | 2.099745 | ... | 11.922185 | 12.570467 | 7.620449 | 7.478736 | 7.902697 | 7.833987 | 1.581627 | 2.78969 | 5.466827 | 13.14469 |
| 3 | 6.212687 | 3.6674 | 3.857458 | 5.278202 | 4.444149 | 2.503476 | 2.082033 | 5.516838 | 4.785545 | 1.908431 | ... | 11.182654 | 12.317397 | 7.481264 | 7.057692 | 7.4578 | 7.580582 | 1.524965 | 2.78218 | 5.492212 | 13.481826 |
| 4 | 5.909924 | 3.158767 | 3.693379 | 5.227368 | 4.469294 | 2.499625 | 2.159071 | 5.260092 | 4.699053 | 1.780873 | ... | 10.742113 | 11.77043 | 7.120327 | 6.48367 | 7.140074 | 7.42892 | 1.595684 | 2.529056 | 4.931993 | 12.957592 |


In [13]:
libraryMeansNameMask.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Columns: 201 entries, BL001 to TL902
dtypes: float64(201)
memory usage: 37.9 KB


Since the masked result is a 24 (hours) x 201 (computers) matrix, and the plot expects one-dimensional arrays, the data needs to be unstacked into long format, with one row per computer and hour.

In [14]:
meansUnstacked = libraryMeansNameMask.unstack().reset_index()
meansUnstacked.columns = ["comps","hour","means"]
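
As a small illustration of the reshaping, the sketch below unstacks a hypothetical 2 x 2 frame. Calling `unstack()` on a frame with a flat index returns a Series keyed by (column, index), so after `reset_index()` the rows are ordered computer by computer, with the hours running fastest.

```python
import pandas as pd

# Hypothetical hourly means: rows are hours, columns are computers.
hourly = pd.DataFrame({"PC1": [50.0, 40.0], "PC2": [30.0, 20.0]}, index=[0, 1])

# Wide to long: one row per (computer, hour) pair.
long_form = hourly.unstack().reset_index()
long_form.columns = ["comps", "hour", "means"]
print(long_form)
```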

The incrementing index was causing the scatter plot to shift over by one. As an experiment, each computer is given an explicit x position (its row number times 24), which is then merged into the unstacked data.

In [15]:
attrsNamesMask = attrsNamesMask.reset_index().drop('index',axis=1)
attrsNamesMask['x_vals'] = attrsNamesMask.index*24
attrsNamesMask.head()

|  | computerName | x_vals |
|---|---|---|
| 0 | BL001 | 0 |
| 1 | BL002 | 24 |
| 2 | CITI001 | 48 |
| 3 | CITI002 | 72 |
| 4 | CITI003 | 96 |


In [16]:
meansUnstackedMerged = meansUnstacked.merge(attrsNamesMask, left_on='comps',right_on='computerName').drop('computerName',axis=1)
meansUnstackedMerged.head()

|  | comps | hour | means | x_vals |
|---|---|---|---|---|
| 0 | BL001 | 0 | 7.919358 | 0 |
| 1 | BL001 | 1 | 6.605202 | 0 |
| 2 | BL001 | 2 | 6.038368 | 0 |
| 3 | BL001 | 3 | 6.212687 | 0 |
| 4 | BL001 | 4 | 5.909924 | 0 |
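
A possible alternative to the merge, sketched here rather than used in this notebook, is to derive the x positions directly with `pd.factorize`, which numbers each distinct computer in order of first appearance. The frame and the `hours_per_machine` name are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: three computers, two hours each.
long_form = pd.DataFrame({
    "comps": ["PC1", "PC1", "PC2", "PC2", "PC3", "PC3"],
    "hour": [0, 1, 0, 1, 0, 1],
})

hours_per_machine = 24  # spacing between adjacent computers on the x-axis

# codes holds 0 for the first computer seen, 1 for the second, and so on.
codes, uniques = pd.factorize(long_form["comps"])
long_form["x_vals"] = codes * hours_per_machine
print(long_form)
```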


Since the number of machines in the later graph might change, set variables here based on the count of machines in the dataframe above. The two printed values should match:

In [17]:
machineCount = meansUnstackedMerged.comps.unique().size
recordCount = meansUnstackedMerged.index.size
hourCount = 24
print(machineCount * hourCount)
print(recordCount)

4824
4824


The Bokeh libraries necessary for this graph:

In [18]:
from bokeh.layouts import row, column
from bokeh.models import (
    BoxSelectTool,
    LassoSelectTool,
    Spacer,
    FuncTickFormatter,
    FixedTicker,
    HoverTool,
    ColumnDataSource,
    LinearColorMapper,
    ColorBar,
    BasicTicker,
    PrintfTickFormatter,
)
from bokeh.plotting import figure, output_file, output_notebook, show, save
output_notebook()

Bokeh provides a number of tools in the toolbar next to the graph. This cell tests which tools are available and how each is configured. The hover tool, in this case, uses the data from the ColumnDataSource to populate its tooltips.

In [19]:
hover = HoverTool(
    tooltips=[
        ("Computer", "@comps"),
        ("Hour", "@hour{0}:00"),  # read the hour from the data source rather than the cursor position
        ("Pct Use", "@means")
    ]
)
#TOOLS=[hover,"crosshair,pan,wheel_zoom,zoom_in,zoom_out,box_zoom,undo,redo,reset,tap,save,box_select,poly_select,lasso_select"]
TOOLS=[hover,"crosshair,pan,wheel_zoom,box_zoom,reset,tap,save,box_select,poly_select,lasso_select"]

The ColumnDataSource passes the dataframe values to the glyphs plotted later.

In [20]:
# from_df() only returns a plain dict of columns; construct the source object itself
source = ColumnDataSource(meansUnstackedMerged)

In [21]:
colors = ["#1c204e","#232863","#2a3079","#30388e","#3740a4","#7c7c7c","#8a8a8a","#989898","#a6a6a6"][::-1]
mapper = LinearColorMapper(palette=colors, low=meansUnstackedMerged.means.min(), high=meansUnstackedMerged.means.max())
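
Because the palette is reversed, low usage maps to the light grays and high usage to the dark blues. The pure-Python sketch below approximates how a linear mapper picks a color, by splitting the range from `low` to `high` into equally wide bins, one per palette entry. The `linear_bin` helper is hypothetical and ignores Bokeh's extra options; it is meant only to show the idea, not Bokeh's actual implementation.

```python
import math

def linear_bin(value, palette, low, high):
    """Pick a palette entry by splitting [low, high] into equal-width bins."""
    # Values at or beyond the ends are clamped to the first or last color.
    if value <= low:
        return palette[0]
    if value >= high:
        return palette[-1]
    fraction = (value - low) / float(high - low)
    return palette[int(math.floor(fraction * len(palette)))]

# A shortened palette in the same light-to-dark order as the reversed list above.
palette = ["#a6a6a6", "#7c7c7c", "#3740a4", "#1c204e"]

for pct in (10, 30, 60, 99, 100):
    print(pct, linear_bin(pct, palette, 0, 100))
```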

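These are the basic commands to create the figure, named `mainGraph`. Setting `select_every_mousemove = False` on the box and lasso select tools defers the selection computation until the selection region is completed, which should help performance on large datasets.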

In [22]:
mainGraph = figure(tools=TOOLS, plot_width=900, plot_height=600,
                     min_border=10, min_border_left=50,
                     toolbar_location="above",
                     x_axis_location=None, # this is left in, as the x-axis ticks are hard to read zoomed out.
                     #y_axis_location=None, 
                     title="Library Usage: Average Percent Utilization per Hour")
mainGraph.background_fill_color = "#fafafa"
mainGraph.select(BoxSelectTool).select_every_mousemove = False
mainGraph.select(LassoSelectTool).select_every_mousemove = False

Formatting the tick labels requires some finesse. This first example uses a JavaScript snippet to render tick values as 24-hour clock times, and a fixed ticker to constrain the ticks to the integers 0 through 23.

In [23]:
mainGraph.yaxis.formatter = FuncTickFormatter(code="""return Math.floor(tick)+':00'""")
mainGraph.yaxis.ticker = FixedTicker(ticks=list(range(0, 24)))
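
For readers less familiar with JavaScript, the formatter snippet above can be mirrored in Python as follows. The `hour_label` function is a hypothetical equivalent written for illustration; Bokeh itself runs the JavaScript version in the browser.

```python
import math

def hour_label(tick):
    """Python counterpart of: return Math.floor(tick)+':00'"""
    return "%d:00" % math.floor(tick)

# Fractional ticks are floored, so 13.7 is labeled as 13:00.
labels = [hour_label(t) for t in (0, 8.0, 13.7, 23)]
print(labels)
```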

Formatting the x-axis requires aligning the computer names with their positions in the unstacked dataframe. This is omitted for now.

In [24]:
# keys=range(0,recordCount,hourCount)
# values=list(meansUnstacked.comps.unique())
# graphCompIndex = dict(zip(keys,values))
# mainGraph.xaxis.ticker = FixedTicker(ticks = range(0,recordCount,hourCount))
# mainGraph.xaxis.major_label_overrides = graphCompIndex
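
The commented-out lines build a dictionary from tick position to computer name. The sketch below shows what that mapping would look like for three hypothetical machines, assuming the same spacing of 24 used for `x_vals`, so that each label lands on the x position of its computer.

```python
# Hypothetical subset of computer names, in plotting order.
names = ["BL001", "BL002", "CITI001"]
hour_count = 24

# One tick per computer, spaced hour_count apart to match the x positions.
tick_positions = list(range(0, len(names) * hour_count, hour_count))

# Mapping of the kind assigned to major_label_overrides in the cell above.
label_overrides = dict(zip(tick_positions, names))
print(label_overrides)
```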

In [25]:
# mainGraph.scatter("x_vals","hour",radius=5,color="blue",alpha=.4,source=source)
mainGraph.rect(x="x_vals", y="hour", 
               width=24, height=1,
               source=source,
               fill_color={'field': 'means', 'transform': mapper},
              line_color=None)
# output_file("./AvgPercentUtil.html", title='Library Usage: Average Percent Utilization per Hour')
color_bar = ColorBar(color_mapper=mapper, major_label_text_font_size="10pt",
                     ticker=BasicTicker(desired_num_ticks=len(colors)),
                     formatter=PrintfTickFormatter(format="%d%%"),
                     label_standoff=10, border_line_color=None, location=(0, 0))
mainGraph.add_layout(color_bar, 'right')
show(mainGraph)

Below, the output from the graph above is added manually as an image so that it previews correctly on GitHub.

In [26]:
from IPython.display import Image

![title](./avgPercentUtil.png)