# Homework 8 - Visual Analytics
#### Team Members: Anmol Singh Suag and Sanuj Bhatia
#### VAST Challenge Selected - VAST 2015 (MC2)
This notebook contains the code we wrote for Homework 8 to generate the results for our class presentation. It uses the data provided for the second mini-challenge (MC2) of the VAST Challenge 2015.

In [58]:
#Importing commonly used ML Libraries
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row, widgetbox
from bokeh.plotting import figure
from bokeh.models import HoverTool, ColumnDataSource, LabelSet, CustomJS, Slider, Range1d, GraphRenderer, Circle, MultiLine, StaticLayoutProvider
from bokeh.models.widgets import Select, Panel, Tabs, Button
import pandas as pd
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit
import math
import copy
import networkx as nx
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import KMeans, DBSCAN
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
from random import randint
import pickle
from PIL import Image
import pygraphml

In [7]:
output_notebook()

### Problem Statement
All visitors in DinoFun World use the application provided by the park on their devices or on the devices they are lent by the park. They can communicate via this app with their friends using the same app. This data is provided for all three days, detailing the location of message origin, the timestamp, and the sender and receiver's IDs.
MC2 has the following questions to be answered:
1. Identify IDs that stand out for large amount of communication. For them:
    a. Characterise communication patterns
    b. What can these IDs be?
2. Describe communication patterns in the data.
3. Can you hypothesize when the vandalism occurred?

In [8]:
df_com_fri = pd.read_csv('data/comm-data-Fri.csv',infer_datetime_format=True, sep = ',',header=0)
df_com_sat = pd.read_csv('data/comm-data-Sat.csv',infer_datetime_format=True, sep = ',',header=0)
df_com_sun = pd.read_csv('data/comm-data-Sun.csv',infer_datetime_format=True, sep = ',',header=0)

In [9]:
df_com_fri['Timestamp'] = pd.to_datetime(df_com_fri['Timestamp'])
df_com_sat['Timestamp'] = pd.to_datetime(df_com_sat['Timestamp'])
df_com_sun['Timestamp'] = pd.to_datetime(df_com_sun['Timestamp'])

In [10]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df_com_fri, df_com_sat, df_com_sun])
print ("Number of Rows:",len(df))
df.head()

Number of Rows: 4153329


Unnamed: 0,Timestamp,from,to,location
0,2014-06-06 08:03:19,439105,1053224,Kiddie Land
1,2014-06-06 08:03:19,439105,1696241,Kiddie Land
2,2014-06-06 08:03:19,439105,580064,Kiddie Land
3,2014-06-06 08:03:19,439105,1464748,Kiddie Land
4,2014-06-06 08:03:47,1836139,1593258,Entry Corridor


In [11]:
df.groupby(['location']).size()

location
Coaster Alley      434864
Entry Corridor     790792
Kiddie Land        393080
Tundra Land        869342
Wet Land          1665251
dtype: int64

### There are 5 Locations
The message origin is broadly classified as being from 5 different locations in the park, as shown above.

In [13]:
len(df.groupby(['from']))

9429

In [15]:
len(df.groupby(['to']))

9391

#### Senders and Receivers
Across the three days of data there are 9429 distinct sender IDs and 9391 distinct receiver IDs (the receiver count includes the special value `external`).
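
As a minimal sketch, the same distinct counts can be read with `nunique()` instead of taking the length of a `groupby`. The `toy` frame below is a made-up stand-in for `df`, and casting both columns to strings before the set union is an assumption made so that integer and string IDs compare equal.

```python
import pandas as pd

# Hypothetical stand-in for the combined communication log.
toy = pd.DataFrame({
    "from": [1, 1, 2, 3],
    "to": ["2", "external", "1", "1"],
})

n_senders = toy["from"].nunique()    # distinct sender IDs
n_receivers = toy["to"].nunique()    # distinct receiver IDs, including 'external'

# IDs that appear in either role, compared as strings.
all_ids = set(toy["from"].astype(str)) | set(toy["to"].astype(str))

print(n_senders, n_receivers, len(all_ids))
```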

In [17]:
# Finding IDs that have unusual amount of communication
sender_count=df.groupby(['from']).size().sort_values(ascending=False)
print(sender_count.head(10))
receiver_count=df.groupby(['to']).size().sort_values(ascending=False)
receiver_count.head(10)

from
1278894    190360
839736      60812
1045021      3807
1116329      3746
1749109      3708
918738       3707
1250941      3683
970490       3569
128533       3568
1508923      3558
dtype: int64


to
1278894     189894
external     62077
839736       60818
171002        3270
1116329       3153
1388162       3111
48730         3105
856067        3101
992045        3087
1300247       3065
dtype: int64

## Answer MC2.1.
We try and analyse which IDs have unusual amounts of communication:
* Clearly the IDs 1278894 and 839736 stand out with a substantially high number of sent and received messages.
* ID 1278894 sent 190360 messages and received 189894 messages, the highest in both categories.
* ID 839736 sent 60812 and received 60818 messages as the second highest sender and non-external receiver.
* About 62k messages have been sent to external, that is to people outside of the park.
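
The "stands out" judgement above was made by eye from the top-10 tables. One way to make it explicit, sketched here under assumptions, is to flag IDs whose message count exceeds some multiple of the median count. The helper name `flag_heavy_senders` and the factor of 10 are illustrative choices, not part of the original analysis, and the series below holds only the top five sender counts printed above rather than the full distribution.

```python
import pandas as pd

def flag_heavy_senders(counts: pd.Series, factor: float = 10.0) -> pd.Series:
    """Return the entries of `counts` that exceed `factor` times the median."""
    threshold = factor * counts.median()
    return counts[counts > threshold].sort_values(ascending=False)

# Top five sender counts from the output above (not the full distribution).
top_senders = pd.Series({
    1278894: 190360,
    839736: 60812,
    1045021: 3807,
    1116329: 3746,
    1749109: 3708,
})

heavy = flag_heavy_senders(top_senders, factor=10.0)
print(heavy)
```

On the real data the function would be applied to `sender_count` and `receiver_count`; the threshold factor would need tuning against the actual distribution.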

We will now analyse IDs 1278894 and 839736

In [18]:
timestamp_count_sent_1278894_fri = {}
timestamp_count_sent_1278894_sat = {}
timestamp_count_sent_1278894_sun = {}
timestamp_count_rec_1278894_fri = {}
timestamp_count_rec_1278894_sat = {}
timestamp_count_rec_1278894_sun = {}

timestamp_count_sent_839736_fri = {}
timestamp_count_sent_839736_sat = {}
timestamp_count_sent_839736_sun = {}
timestamp_count_rec_839736_fri = {}
timestamp_count_rec_839736_sat = {}
timestamp_count_rec_839736_sun = {}

timestamp_count_rec_ext_fri = {}
timestamp_count_rec_ext_sat = {}
timestamp_count_rec_ext_sun = {}

timestamp_count_sent_all_fri = {}
timestamp_count_sent_all_sat = {}
timestamp_count_sent_all_sun = {}

In [19]:
for i in range(len(df_com_fri)):
    time = df_com_fri.iloc[i,0].hour*60*60 + df_com_fri.iloc[i,0].minute*60
    timestamp_count_sent_all_fri[time] = timestamp_count_sent_all_fri.get(time,0)+1

    if (df_com_fri.iloc[i,1] == 1278894):
        timestamp_count_sent_1278894_fri[time] = timestamp_count_sent_1278894_fri.get(time,0)+1
        
    if (df_com_fri.iloc[i,1] == 839736):
        timestamp_count_sent_839736_fri[time] = timestamp_count_sent_839736_fri.get(time,0)+1
        
        
    if (df_com_fri.iloc[i,2] == '1278894'):
        timestamp_count_rec_1278894_fri[time] = timestamp_count_rec_1278894_fri.get(time,0)+1
        
    if (df_com_fri.iloc[i,2] == '839736'):
        timestamp_count_rec_839736_fri[time] = timestamp_count_rec_839736_fri.get(time,0)+1
        
    if (df_com_fri.iloc[i,2] == 'external'):
        timestamp_count_rec_ext_fri[time] = timestamp_count_rec_ext_fri.get(time,0)+1



In [20]:

for i in range(len(df_com_sat)):
    time =df_com_sat.iloc[i,0].hour*60*60 + df_com_sat.iloc[i,0].minute*60
    timestamp_count_sent_all_sat[time] = timestamp_count_sent_all_sat.get(time,0)+1


    if (df_com_sat.iloc[i,1] == 1278894):
        timestamp_count_sent_1278894_sat[time] = timestamp_count_sent_1278894_sat.get(time,0)+1
        
    if (df_com_sat.iloc[i,1] == 839736):
        timestamp_count_sent_839736_sat[time] = timestamp_count_sent_839736_sat.get(time,0)+1
        
        
    if (df_com_sat.iloc[i,2] == '1278894'):
        timestamp_count_rec_1278894_sat[time] = timestamp_count_rec_1278894_sat.get(time,0)+1
        
    if (df_com_sat.iloc[i,2] == '839736'):
        timestamp_count_rec_839736_sat[time] = timestamp_count_rec_839736_sat.get(time,0)+1
        
    if (df_com_sat.iloc[i,2] == 'external'):
        timestamp_count_rec_ext_sat[time] = timestamp_count_rec_ext_sat.get(time,0)+1


In [21]:
for i in range(len(df_com_sun)):
    time =df_com_sun.iloc[i,0].hour*60*60 + df_com_sun.iloc[i,0].minute*60
    timestamp_count_sent_all_sun[time] = timestamp_count_sent_all_sun.get(time,0)+1

    if (df_com_sun.iloc[i,1] == 1278894):
        timestamp_count_sent_1278894_sun[time] = timestamp_count_sent_1278894_sun.get(time,0)+1
        
    if (df_com_sun.iloc[i,1] == 839736):
        timestamp_count_sent_839736_sun[time] = timestamp_count_sent_839736_sun.get(time,0)+1
        
        
    if (df_com_sun.iloc[i,2] == '1278894'):
        timestamp_count_rec_1278894_sun[time] = timestamp_count_rec_1278894_sun.get(time,0)+1
        
    if (df_com_sun.iloc[i,2] == '839736'):
        timestamp_count_rec_839736_sun[time] = timestamp_count_rec_839736_sun.get(time,0)+1
        
    if (df_com_sun.iloc[i,2] == 'external'):
        timestamp_count_rec_ext_sun[time] = timestamp_count_rec_ext_sun.get(time,0)+1
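
The three loops above visit every row through `iloc`, which tends to be slow on frames of this size. A vectorised sketch that produces the same kind of dictionary (seconds since midnight, truncated to the minute, mapped to a message count) is shown below. The helper name `minute_of_day_counts` is hypothetical, and the string cast on the filter column is an assumption that lets the same call work for the integer `from` column and the string `to` column.

```python
import pandas as pd

def minute_of_day_counts(df, column=None, value=None):
    """Map seconds-since-midnight (minute resolution) to the number of rows.

    If `column` is given, only rows where that column equals `value`
    (compared as strings) are counted.
    """
    if column is not None:
        df = df[df[column].astype(str) == str(value)]
    ts = df["Timestamp"]
    seconds = ts.dt.hour * 3600 + ts.dt.minute * 60
    return seconds.value_counts().sort_index().to_dict()

# Hypothetical rows in the same layout as the communication data.
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2014-06-06 08:03:19", "2014-06-06 08:03:19",
        "2014-06-06 08:03:47", "2014-06-06 08:04:05",
    ]),
    "from": [439105, 439105, 1836139, 439105],
    "to": ["1053224", "external", "1593258", "580064"],
})

all_counts = minute_of_day_counts(toy)
sent_counts = minute_of_day_counts(toy, "from", 439105)
ext_counts = minute_of_day_counts(toy, "to", "external")
print(all_counts, sent_counts, ext_counts)
```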

    

In [22]:
fig1_dict = {}
fig2_dict = {}
fig3_dict = {}

fig1_dict['count_sent_1278894_fri']=list(timestamp_count_sent_1278894_fri.values())
fig1_dict['time_sent_1278894_fri']=list(timestamp_count_sent_1278894_fri.keys())
fig1_dict['timestamp_sent_1278894_fri']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_1278894_fri.keys())]
fig1_dict['count_sent_1278894_sat']=list(timestamp_count_sent_1278894_sat.values())
fig1_dict['time_sent_1278894_sat']=list(timestamp_count_sent_1278894_sat.keys())
fig1_dict['timestamp_sent_1278894_sat']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_1278894_sat.keys())]
fig1_dict['count_sent_1278894_sun']=list(timestamp_count_sent_1278894_sun.values())
fig1_dict['time_sent_1278894_sun']=list(timestamp_count_sent_1278894_sun.keys())
fig1_dict['timestamp_sent_1278894_sun']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_1278894_sun.keys())]
fig1_dict['count_sent_1278894_curr']=list(timestamp_count_sent_1278894_fri.values())
fig1_dict['time_sent_1278894_curr']=list(timestamp_count_sent_1278894_fri.keys())
fig1_dict['timestamp_sent_1278894_curr']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_1278894_fri.keys())]

fig1_dict['count_rec_1278894_fri']=list(timestamp_count_rec_1278894_fri.values())
fig1_dict['time_rec_1278894_fri']=list(timestamp_count_rec_1278894_fri.keys())
fig1_dict['timestamp_rec_1278894_fri']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_1278894_fri.keys())]
fig1_dict['count_rec_1278894_sat']=list(timestamp_count_rec_1278894_sat.values())
fig1_dict['time_rec_1278894_sat']=list(timestamp_count_rec_1278894_sat.keys())
fig1_dict['timestamp_rec_1278894_sat']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_1278894_sat.keys())]
fig1_dict['count_rec_1278894_sun']=list(timestamp_count_rec_1278894_sun.values())
fig1_dict['time_rec_1278894_sun']=list(timestamp_count_rec_1278894_sun.keys())
fig1_dict['timestamp_rec_1278894_sun']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_1278894_sun.keys())]

fig1_dict['count_rec_1278894_curr']=list(timestamp_count_rec_1278894_fri.values())
fig1_dict['time_rec_1278894_curr']=list(timestamp_count_rec_1278894_fri.keys())
fig1_dict['timestamp_rec_1278894_curr']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_1278894_fri.keys())]

fig2_dict['count_sent_839736_fri']=list(timestamp_count_sent_839736_fri.values())
fig2_dict['time_sent_839736_fri']=list(timestamp_count_sent_839736_fri.keys())
fig2_dict['timestamp_sent_839736_fri']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_839736_fri.keys())]
fig2_dict['count_sent_839736_sat']=list(timestamp_count_sent_839736_sat.values())
fig2_dict['time_sent_839736_sat']=list(timestamp_count_sent_839736_sat.keys())
fig2_dict['timestamp_sent_839736_sat']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_839736_sat.keys())]
fig2_dict['count_sent_839736_sun']=list(timestamp_count_sent_839736_sun.values())
fig2_dict['time_sent_839736_sun']=list(timestamp_count_sent_839736_sun.keys())
fig2_dict['timestamp_sent_839736_sun']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_839736_sun.keys())]

fig2_dict['count_sent_839736_curr']=list(timestamp_count_sent_839736_fri.values())
fig2_dict['time_sent_839736_curr']=list(timestamp_count_sent_839736_fri.keys())
fig2_dict['timestamp_sent_839736_curr']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_839736_fri.keys())]

fig2_dict['count_rec_839736_fri']=list(timestamp_count_rec_839736_fri.values())
fig2_dict['time_rec_839736_fri']=list(timestamp_count_rec_839736_fri.keys())
fig2_dict['timestamp_rec_839736_fri']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_839736_fri.keys())]
fig2_dict['count_rec_839736_sat']=list(timestamp_count_rec_839736_sat.values())
fig2_dict['time_rec_839736_sat']=list(timestamp_count_rec_839736_sat.keys())
fig2_dict['timestamp_rec_839736_sat']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_839736_sat.keys())]
fig2_dict['count_rec_839736_sun']=list(timestamp_count_rec_839736_sun.values())
fig2_dict['time_rec_839736_sun']=list(timestamp_count_rec_839736_sun.keys())
fig2_dict['timestamp_rec_839736_sun']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_839736_sun.keys())]

fig2_dict['count_rec_839736_curr']=list(timestamp_count_rec_839736_fri.values())
fig2_dict['time_rec_839736_curr']=list(timestamp_count_rec_839736_fri.keys())
fig2_dict['timestamp_rec_839736_curr']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_839736_fri.keys())]

fig3_dict['count_rec_ext_fri']=list(timestamp_count_rec_ext_fri.values())
fig3_dict['time_rec_ext_fri']=list(timestamp_count_rec_ext_fri.keys())
fig3_dict['timestamp_rec_ext_fri']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_ext_fri.keys())]
fig3_dict['count_rec_ext_sat']=list(timestamp_count_rec_ext_sat.values())
fig3_dict['time_rec_ext_sat']=list(timestamp_count_rec_ext_sat.keys())
fig3_dict['timestamp_rec_ext_sat']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_ext_sat.keys())]
fig3_dict['count_rec_ext_sun']=list(timestamp_count_rec_ext_sun.values())
fig3_dict['time_rec_ext_sun']=list(timestamp_count_rec_ext_sun.keys())
fig3_dict['timestamp_rec_ext_sun']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_ext_sun.keys())]

fig3_dict['count_rec_ext_curr']=list(timestamp_count_rec_ext_fri.values())
fig3_dict['time_rec_ext_curr']=list(timestamp_count_rec_ext_fri.keys())
fig3_dict['timestamp_rec_ext_curr']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_rec_ext_fri.keys())]

fig3_dict['count_sent_all_fri']=list(timestamp_count_sent_all_fri.values())
fig3_dict['time_sent_all_fri']=list(timestamp_count_sent_all_fri.keys())
fig3_dict['timestamp_sent_all_fri']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_all_fri.keys())]
fig3_dict['count_sent_all_sat']=list(timestamp_count_sent_all_sat.values())
fig3_dict['time_sent_all_sat']=list(timestamp_count_sent_all_sat.keys())
fig3_dict['timestamp_sent_all_sat']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_all_sat.keys())]
fig3_dict['count_sent_all_sun']=list(timestamp_count_sent_all_sun.values())
fig3_dict['time_sent_all_sun']=list(timestamp_count_sent_all_sun.keys())
fig3_dict['timestamp_sent_all_sun']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_all_sun.keys())]

fig3_dict['count_sent_all_curr']=list(timestamp_count_sent_all_fri.values())
fig3_dict['time_sent_all_curr']=list(timestamp_count_sent_all_fri.keys())
fig3_dict['timestamp_sent_all_curr']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(timestamp_count_sent_all_fri.keys())]
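
The dictionary above repeats the same three assignments for every series. A small helper could generate them, as in the sketch below. The names `format_minute` and `add_series` are hypothetical. One deliberate difference from the cell above: the label is zero-padded, so 08:03 is shown as `08:03` rather than `8:3`.

```python
def format_minute(seconds):
    """Format seconds since midnight as a zero-padded HH:MM label."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours:02d}:{remainder // 60:02d}"

def add_series(target, name, counts):
    """Add the count_/time_/timestamp_ columns for one series to `target`."""
    target[f"count_{name}"] = list(counts.values())
    target[f"time_{name}"] = list(counts.keys())
    target[f"timestamp_{name}"] = [format_minute(t) for t in counts]
    return target

# Hypothetical per-minute counts keyed by seconds since midnight.
example_counts = {28980: 3, 29040: 1}
example_dict = add_series({}, "sent_all_fri", example_counts)
print(example_dict)
```

Calling `add_series` once per day and once more for the `curr` suffix would rebuild each figure dictionary with the same keys as above.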


In [23]:
fig1_source=ColumnDataSource(data=fig1_dict)
fig2_source=ColumnDataSource(data=fig2_dict)
fig3_source=ColumnDataSource(data=fig3_dict)

In [24]:
#Figure 1: Visualising Number of Messages sent and received by ID 1278894

fig1_menu=[('fri','Friday'),('sat','Saturday'),('sun','Sunday')]
fig1_dd=Select(title="Choose Day",value="fri", options=fig1_menu,width=150)
fig1=figure(title='Messages Sent/Received By ID 1278894', plot_height = 400,plot_width=600)

fig1_c1=fig1.line(source=fig1_source,x='time_sent_1278894_curr',y='count_sent_1278894_curr',color="#E74C3C",legend="Sent")
fig1.add_tools(HoverTool(tooltips=[
    ("Time", "@timestamp_sent_1278894_curr"),
    ("Count", "@count_sent_1278894_curr")
],renderers=[fig1_c1]))


fig1_c2=fig1.vbar(source=fig1_source,x='time_rec_1278894_curr',top='count_rec_1278894_curr',width=60,color='#85C1E9',legend="Received")
fig1.add_tools(HoverTool(tooltips=[
    ("Time", "@timestamp_rec_1278894_curr"),
    ("Count", "@count_rec_1278894_curr")
],renderers=[fig1_c2]))

update_curve = CustomJS(args=dict(source=fig1_source,fig1_dd=fig1_dd), code="""

    day=fig1_dd.value
    source.data['time_sent_1278894_curr']=source.data['time_sent_1278894_'+day]
    source.data['timestamp_sent_1278894_curr']=source.data['timestamp_sent_1278894_'+day]
    source.data['count_sent_1278894_curr']=source.data['count_sent_1278894_'+day]
    source.data['time_rec_1278894_curr']=source.data['time_rec_1278894_'+day]
    source.data['timestamp_rec_1278894_curr']=source.data['timestamp_rec_1278894_'+day]
    source.data['count_rec_1278894_curr']=source.data['count_rec_1278894_'+day]
    source.trigger('change');

""")

fig1_dd.js_on_change('value', update_curve)

fig1.background_fill_alpha = 0.4
fig1.xaxis.axis_label = 'Time'
fig1.yaxis.axis_label = 'Message Count'
fig1.axis.major_label_text_color = "black"
fig1.ygrid.grid_line_alpha = 0.8
fig1.ygrid.grid_line_dash = [5, 3]
fig1.xgrid.grid_line_alpha = 0.8
fig1.xgrid.grid_line_dash = [5, 3]

fig1.legend.location = "top_right"
fig1.legend.click_policy="hide"

show(row(fig1,fig1_dd))

## Answer : MC2.1 - What is ID 1278894?
* It sends messages to all park visitors over the 3 days.
* It receives messages in alternating windows: for about an hour it receives messages, then for about an hour it receives none.
* It is probably broadcasting some advertisement or information to all park visitors, and the number of messages sent at a given time is most probably the number of visitors in the park at that time.
* It receives messages in five one-hour windows per day. This is possibly a survey or a game that opens every other hour and in which park visitors participate.
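
The alternating receive windows were read off the chart. A rough check, sketched here with made-up rows, is to list the hours of the day in which a given ID receives at least one message. The helper name `active_hours` is an assumption, and the `to` column is cast to strings because it also contains `external`.

```python
import pandas as pd

def active_hours(df, receiver_id):
    """Return the sorted hours of day in which `receiver_id` received messages."""
    to = df["to"].astype(str)
    hours = df.loc[to == str(receiver_id), "Timestamp"].dt.hour
    return sorted(hours.unique().tolist())

# Hypothetical rows; only the layout matches the real data.
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2014-06-06 12:05:00", "2014-06-06 12:40:00",
        "2014-06-06 13:10:00", "2014-06-06 14:15:00",
    ]),
    "from": [10, 11, 12, 13],
    "to": ["1278894", "1278894", "555", "1278894"],
})

hours = active_hours(toy, 1278894)
print(hours)
```

If the windows do not start on the hour, grouping by hour would blur them, so a finer bin may be needed on the real data.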

In [25]:
#Figure 2: Visualising Number of Messages sent and received by ID 839736

fig2_menu=[('fri','Friday'),('sat','Saturday'),('sun','Sunday')]
fig2_dd=Select(title="Choose Day",value="fri", options=fig2_menu,width=150)
fig2=figure(title='Messages Sent/Received By ID 839736', plot_height = 400,plot_width=600)

fig2_c1=fig2.line(source=fig2_source,x='time_sent_839736_curr',y='count_sent_839736_curr',color='#85C1E9',legend="Sent")
fig2.add_tools(HoverTool(tooltips=[
    ("Time", "@timestamp_sent_839736_curr"),
    ("Count", "@count_sent_839736_curr")
],renderers=[fig2_c1]))


fig2_c2=fig2.line(source=fig2_source,x='time_rec_839736_curr',y='count_rec_839736_curr',color="#E74C3C",legend="Received")
fig2.add_tools(HoverTool(tooltips=[
    ("Time", "@timestamp_rec_839736_curr"),
    ("Count", "@count_rec_839736_curr")
],renderers=[fig2_c2]))

update_curve2 = CustomJS(args=dict(source=fig2_source,fig2_dd=fig2_dd), code="""

    day=fig2_dd.value
    source.data['time_sent_839736_curr']=source.data['time_sent_839736_'+day]
    source.data['timestamp_sent_839736_curr']=source.data['timestamp_sent_839736_'+day]

    source.data['count_sent_839736_curr']=source.data['count_sent_839736_'+day]
    source.data['time_rec_839736_curr']=source.data['time_rec_839736_'+day]
    source.data['timestamp_rec_839736_curr']=source.data['timestamp_rec_839736_'+day]

    source.data['count_rec_839736_curr']=source.data['count_rec_839736_'+day]
    source.trigger('change');

""")

fig2_dd.js_on_change('value', update_curve2)



fig2.background_fill_alpha = 0.4
fig2.xaxis.axis_label = 'Time'
fig2.yaxis.axis_label = 'Message Count'
fig2.axis.major_label_text_color = "black"
fig2.ygrid.grid_line_alpha = 0.8
fig2.ygrid.grid_line_dash = [5, 3]
fig2.xgrid.grid_line_alpha = 0.8
fig2.xgrid.grid_line_dash = [5, 3]


fig2.legend.location = "top_right"
fig2.legend.click_policy="hide"
show(row(fig2,fig2_dd))

### Answer: MC2.1 - What is ID 839736?
* It sends and receives messages at almost the same times throughout the three days.
* On Sunday there are spikes in the graph at about 12:00 and 14:40.
* As its sending and receiving is not constant, and park visitors tried to communicate with it, it is most likely the Park Management or Support team.
* Messages are probably sent to it by visitors, and it replies with a custom or manual response within a short time.
* The spike on Sunday from 11:45 to 12:00 is most likely park visitors reporting the vandalism to the authorities.
* We think it spikes again around 14:40 because people are asking the authorities why Scott's show was cancelled, as they arrive at the stage to find it closed.
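
The claim that this ID sends and receives at almost the same time could be quantified, for example, by correlating its per-minute sent and received counts. The sketch below uses hypothetical rows and hypothetical helper names; Pearson correlation on minute bins is one possible measure, not the one used for the chart above.

```python
import pandas as pd

def per_minute_counts(df, column, user_id):
    """Count rows per minute where `column` equals `user_id` (compared as strings)."""
    sub = df[df[column].astype(str) == str(user_id)]
    return sub.groupby(sub["Timestamp"].dt.floor("min")).size()

def sent_received_correlation(df, user_id):
    """Pearson correlation between per-minute sent and received counts."""
    sent = per_minute_counts(df, "from", user_id)
    received = per_minute_counts(df, "to", user_id)
    # Minutes present in only one series count as zero in the other.
    both = pd.concat([sent, received], axis=1, keys=["sent", "received"]).fillna(0)
    return both["sent"].corr(both["received"])

# Hypothetical exchange: every message to 839736 is answered in the same minute.
toy = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2014-06-08 09:00:10", "2014-06-08 09:00:40",
        "2014-06-08 09:01:05", "2014-06-08 09:01:10",
        "2014-06-08 09:01:30", "2014-06-08 09:01:50",
    ]),
    "from": [100, 839736, 100, 200, 839736, 839736],
    "to": ["839736", "100", "839736", "839736", "100", "200"],
})

r = sent_received_correlation(toy, 839736)
print(r)
```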

In [26]:
#Figure 3: Visualising Number of Messages sent to external 

fig3_menu=[('fri','Friday'),('sat','Saturday'),('sun','Sunday')]
fig3_dd=Select(title="Choose Day",value="fri", options=fig3_menu,width=150)
fig3=figure(title='Messages Sent to External and Total', plot_height = 400,plot_width=600)

fig3_c1=fig3.vbar(source=fig3_source,x='time_sent_all_curr',top='count_sent_all_curr',width=60,color='#85C1E9',legend="Total Sent")
fig3.add_tools(HoverTool(tooltips=[
    ("Time", "@timestamp_sent_all_curr"),
    ("Count", "@count_sent_all_curr")
],renderers=[fig3_c1]))


fig3_c2=fig3.vbar(source=fig3_source,x='time_rec_ext_curr',top='count_rec_ext_curr',color='#E74C3C',width=60,legend="Sent To External")
fig3.add_tools(HoverTool(tooltips=[
    ("Time", "@timestamp_rec_ext_curr"),
    ("Count", "@count_rec_ext_curr")
],renderers=[fig3_c2]))

update_curve3 = CustomJS(args=dict(source=fig3_source,fig3_dd=fig3_dd), code="""

    day=fig3_dd.value
    source.data['time_sent_all_curr']=source.data['time_sent_all_'+day]
    source.data['timestamp_sent_all_curr']=source.data['timestamp_sent_all_'+day]
    source.data['count_sent_all_curr']=source.data['count_sent_all_'+day]
    source.data['time_rec_ext_curr']=source.data['time_rec_ext_'+day]
    source.data['timestamp_rec_ext_curr']=source.data['timestamp_rec_ext_'+day]
    source.data['count_rec_ext_curr']=source.data['count_rec_ext_'+day]
    source.trigger('change');

""")

fig3_dd.js_on_change('value', update_curve3)



fig3.background_fill_alpha = 0.4
fig3.xaxis.axis_label = 'Time'
fig3.yaxis.axis_label = 'Message Count'
fig3.axis.major_label_text_color = "black"
fig3.ygrid.grid_line_alpha = 0.8
fig3.ygrid.grid_line_dash = [5, 3]
fig3.xgrid.grid_line_alpha = 0.8
fig3.xgrid.grid_line_dash = [5, 3]


fig3.legend.location = "top_right"
fig3.legend.click_policy="hide"


In [27]:
show(row(fig3,fig3_dd))


## Answer: MC2.3 - When was the vandalism discovered?
* The number of messages sent to external recipients stays quite low until Sunday between 11:45 AM and 12:00 PM.
* The spike in external messages is a hard-to-miss sign that visitors started informing external recipients about the vandalism.
* The total number of messages sent also spikes at that time, so we hypothesize that the vandalism was discovered around 11:45 AM on Sunday.
* The 11:00 AM and 4:00 PM spikes in total messages sent coincide with the Scott Jones shows. Both spikes appear on Friday and Saturday, but on Sunday there is no 4:00 PM spike: the show was cancelled because of the vandalism around 11:45 AM to 12:00 PM, shortly after the 11:00 AM show.
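
The spikes discussed above were identified visually. A simple way to flag them automatically, offered only as a sketch, is to mark bins that lie more than `k` median absolute deviations above the median. The function name `find_spikes`, the default `k = 3`, and the counts below are all made up for illustration.

```python
import pandas as pd

def find_spikes(counts: pd.Series, k: float = 3.0) -> pd.Series:
    """Return bins whose count is more than k MADs above the median."""
    median = counts.median()
    mad = (counts - median).abs().median()
    if mad == 0:
        # Degenerate case: fall back to anything above the median.
        return counts[counts > median]
    return counts[counts > median + k * mad]

# Hypothetical 15-minute counts of messages sent to 'external'.
external_counts = pd.Series(
    [20, 22, 19, 21, 180, 23],
    index=["10:45", "11:00", "11:15", "11:30", "11:45", "12:00"],
)

spikes = find_spikes(external_counts)
print(spikes)
```

A median-based threshold is used here because a single large spike inflates the mean and standard deviation, which can hide the spike itself.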

In [28]:
#Figure 4 : Messages sent from different location
# 5 Location
# 3 Days
fri_ca ={}
fri_ec ={}
fri_kl ={}
fri_tl ={}
fri_wl ={}
sat_ca ={}
sat_ec ={}
sat_kl ={}
sat_tl ={}
sat_wl ={}
sun_ca ={}
sun_ec ={}
sun_kl ={}
sun_tl ={}
sun_wl ={}


In [29]:
for i in range(len(df_com_fri)):
    # The 'to' column holds strings (it includes 'external'), so compare it against string IDs
    if(df_com_fri.iloc[i,1]==839736 or df_com_fri.iloc[i,1]==1278894 or df_com_fri.iloc[i,2]=='839736' or df_com_fri.iloc[i,2]=='1278894'):
        continue
    time = df_com_fri.iloc[i,0].hour*60*60 + df_com_fri.iloc[i,0].minute*60
    if(df_com_fri.iloc[i,3] =="Coaster Alley"):
        fri_ca[time] = fri_ca.get(time,0)+1
    elif(df_com_fri.iloc[i,3] =="Entry Corridor"):
        fri_ec[time] = fri_ec.get(time,0)+1
    elif(df_com_fri.iloc[i,3] =="Kiddie Land"):
        fri_kl[time] = fri_kl.get(time,0)+1
    elif(df_com_fri.iloc[i,3] =="Tundra Land"):
        fri_tl[time] = fri_tl.get(time,0)+1
    else:
        fri_wl[time] = fri_wl.get(time,0)+1


In [30]:
for i in range(len(df_com_sat)):
    # The 'to' column holds strings (it includes 'external'), so compare it against string IDs
    if(df_com_sat.iloc[i,1]==839736 or df_com_sat.iloc[i,1]==1278894 or df_com_sat.iloc[i,2]=='839736' or df_com_sat.iloc[i,2]=='1278894'):
        continue
    time = df_com_sat.iloc[i,0].hour*60*60 + df_com_sat.iloc[i,0].minute*60
    if(df_com_sat.iloc[i,3] =="Coaster Alley"):
        sat_ca[time] = sat_ca.get(time,0)+1
    elif(df_com_sat.iloc[i,3] =="Entry Corridor"):
        sat_ec[time] = sat_ec.get(time,0)+1
    elif(df_com_sat.iloc[i,3] =="Kiddie Land"):
        sat_kl[time] = sat_kl.get(time,0)+1
    elif(df_com_sat.iloc[i,3] =="Tundra Land"):
        sat_tl[time] = sat_tl.get(time,0)+1
    else:
        sat_wl[time] = sat_wl.get(time,0)+1

In [31]:
for i in range(len(df_com_sun)):
    # The 'to' column holds strings (it includes 'external'), so compare it against string IDs
    if(df_com_sun.iloc[i,1]==839736 or df_com_sun.iloc[i,1]==1278894 or df_com_sun.iloc[i,2]=='839736' or df_com_sun.iloc[i,2]=='1278894'):
        continue
    time = df_com_sun.iloc[i,0].hour*60*60 + df_com_sun.iloc[i,0].minute*60
    if(df_com_sun.iloc[i,3] =="Coaster Alley"):
        sun_ca[time] = sun_ca.get(time,0)+1
    elif(df_com_sun.iloc[i,3] =="Entry Corridor"):
        sun_ec[time] = sun_ec.get(time,0)+1
    elif(df_com_sun.iloc[i,3] =="Kiddie Land"):
        sun_kl[time] = sun_kl.get(time,0)+1
    elif(df_com_sun.iloc[i,3] =="Tundra Land"):
        sun_tl[time] = sun_tl.get(time,0)+1
    else:
        sun_wl[time] = sun_wl.get(time,0)+1

In [32]:
fig4_dict = {}

fig4_dict['fri_ca_count']=list(fri_ca.values())
fig4_dict['fri_ca_time']=list(fri_ca.keys())
fig4_dict['fri_ca_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(fri_ca.keys())]
fig4_dict['sat_ca_count']=list(sat_ca.values())
fig4_dict['sat_ca_time']=list(sat_ca.keys())
fig4_dict['sat_ca_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sat_ca.keys())]
fig4_dict['sun_ca_count']=list(sun_ca.values())
fig4_dict['sun_ca_time']=list(sun_ca.keys())
fig4_dict['sun_ca_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sun_ca.keys())]

fig4_dict['fri_ec_count']=list(fri_ec.values())
fig4_dict['fri_ec_time']=list(fri_ec.keys())
fig4_dict['fri_ec_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(fri_ec.keys())]
fig4_dict['sat_ec_count']=list(sat_ec.values())
fig4_dict['sat_ec_time']=list(sat_ec.keys())
fig4_dict['sat_ec_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sat_ec.keys())]
fig4_dict['sun_ec_count']=list(sun_ec.values())
fig4_dict['sun_ec_time']=list(sun_ec.keys())
fig4_dict['sun_ec_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sun_ec.keys())]

fig4_dict['fri_kl_count']=list(fri_kl.values())
fig4_dict['fri_kl_time']=list(fri_kl.keys())
fig4_dict['fri_kl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(fri_kl.keys())]
fig4_dict['sat_kl_count']=list(sat_kl.values())
fig4_dict['sat_kl_time']=list(sat_kl.keys())
fig4_dict['sat_kl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sat_kl.keys())]
fig4_dict['sun_kl_count']=list(sun_kl.values())
fig4_dict['sun_kl_time']=list(sun_kl.keys())
fig4_dict['sun_kl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sun_kl.keys())]

fig4_dict['fri_tl_count']=list(fri_tl.values())
fig4_dict['fri_tl_time']=list(fri_tl.keys())
fig4_dict['fri_tl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(fri_tl.keys())]
fig4_dict['sat_tl_count']=list(sat_tl.values())
fig4_dict['sat_tl_time']=list(sat_tl.keys())
fig4_dict['sat_tl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sat_tl.keys())]
fig4_dict['sun_tl_count']=list(sun_tl.values())
fig4_dict['sun_tl_time']=list(sun_tl.keys())
fig4_dict['sun_tl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sun_tl.keys())]

fig4_dict['fri_wl_count']=list(fri_wl.values())
fig4_dict['fri_wl_time']=list(fri_wl.keys())
fig4_dict['fri_wl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(fri_wl.keys())]
fig4_dict['sat_wl_count']=list(sat_wl.values())
fig4_dict['sat_wl_time']=list(sat_wl.keys())
fig4_dict['sat_wl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sat_wl.keys())]
fig4_dict['sun_wl_count']=list(sun_wl.values())
fig4_dict['sun_wl_time']=list(sun_wl.keys())
fig4_dict['sun_wl_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sun_wl.keys())]

fig4_dict['fri_curr_count']=list(fri_ca.values())
fig4_dict['fri_curr_time']=list(fri_ca.keys())
fig4_dict['fri_curr_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(fri_ca.keys())]
fig4_dict['sat_curr_count']=list(sat_ca.values())
fig4_dict['sat_curr_time']=list(sat_ca.keys())
fig4_dict['sat_curr_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sat_ca.keys())]
fig4_dict['sun_curr_count']=list(sun_ca.values())
fig4_dict['sun_curr_time']=list(sun_ca.keys())
fig4_dict['sun_curr_timestamp']=[str(int(x/3600))+":"+str(int((x%3600)/60)) for x in list(sun_ca.keys())]


In [33]:
fig4_source=ColumnDataSource(data=fig4_dict)

In [34]:
fig4_menu=[('ca','Coaster Alley'),('ec','Entry Corridor'),('kl','Kiddie Land'),('tl','Tundra Land'),('wl','Wet Land')]
fig4_dd=Select(title="Choose Location",value="ca", options=fig4_menu,width=150)
fig4=figure(title='Friday', plot_height = 200,plot_width=600)

fig4_c1=fig4.vbar(source=fig4_source,x='fri_curr_time',top='fri_curr_count',width=60,color='#F1948A',legend="Total Sent")
fig4.add_tools(HoverTool(tooltips=[
    ("Time", "@fri_curr_timestamp"),
    ("Count", "@fri_curr_count")
],renderers=[fig4_c1]))


fig5=figure(title='Saturday', plot_height = 200,plot_width=600,x_range=fig4.x_range, y_range=fig4.y_range)
fig5_c1=fig5.vbar(source=fig4_source,x='sat_curr_time',top='sat_curr_count',width=60,color='#F1948A')
fig5.add_tools(HoverTool(tooltips=[
    ("Time", "@sat_curr_timestamp"),
    ("Count", "@sat_curr_count")
],renderers=[fig5_c1]))


fig6=figure(title='Sunday', plot_height = 200,plot_width=600,x_range=fig4.x_range, y_range=fig4.y_range)
fig6_c1=fig6.vbar(source=fig4_source,x='sun_curr_time',top='sun_curr_count',width=60,color='#F1948A')
fig6.add_tools(HoverTool(tooltips=[
    ("Time", "@sun_curr_timestamp"),
    ("Count", "@sun_curr_count")
],renderers=[fig6_c1]))

update_curve4 = CustomJS(args=dict(source=fig4_source,fig4_dd=fig4_dd), code="""

    loc=fig4_dd.value
    source.data['fri_curr_count']=source.data['fri_'+loc+'_count']
    source.data['fri_curr_time']=source.data['fri_'+loc+'_time']
    source.data['fri_curr_timestamp']=source.data['fri_'+loc+'_timestamp']
    source.data['sat_curr_count']=source.data['sat_'+loc+'_count']
    source.data['sat_curr_time']=source.data['sat_'+loc+'_time']
    source.data['sat_curr_timestamp']=source.data['sat_'+loc+'_timestamp']
    source.data['sun_curr_count']=source.data['sun_'+loc+'_count']
    source.data['sun_curr_time']=source.data['sun_'+loc+'_time']
    source.data['sun_curr_timestamp']=source.data['sun_'+loc+'_timestamp']

    source.trigger('change');

""")

fig4_dd.js_on_change('value', update_curve4)



fig4.background_fill_alpha = 0.4
fig4.yaxis.axis_label = 'Message Count'
fig4.ygrid.grid_line_alpha = 0.8
fig4.ygrid.grid_line_dash = [5, 3]
fig4.xgrid.grid_line_alpha = 0.8
fig4.xgrid.grid_line_dash = [5, 3]

fig5.background_fill_alpha = 0.4
fig5.axis.major_label_text_color = "black"
fig5.ygrid.grid_line_alpha = 0.8
fig5.ygrid.grid_line_dash = [5, 3]
fig5.xgrid.grid_line_alpha = 0.8
fig5.xgrid.grid_line_dash = [5, 3]

fig6.background_fill_alpha = 0.4
fig6.xaxis.axis_label = 'Time'
fig6.axis.major_label_text_color = "black"
fig6.ygrid.grid_line_alpha = 0.8
fig6.ygrid.grid_line_dash = [5, 3]
fig6.xgrid.grid_line_alpha = 0.8
fig6.xgrid.grid_line_dash = [5, 3]


fig4.legend.location = "top_right"
fig4.legend.click_policy="hide"

fig5.legend.location = "top_right"
fig5.legend.click_policy="hide"

fig6.legend.location = "top_right"
fig6.legend.click_policy="hide"

fig5.toolbar.logo = None
fig5.toolbar_location = None
fig6.toolbar.logo = None
fig6.toolbar_location = None


In [35]:
show(row(column(fig4,fig5,fig6),fig4_dd))

## MC2.2 - Some communication patterns
* Coaster Alley shows spikes in sent messages at 11:00 AM and 4:00 PM on Friday and Saturday; on Sunday the 4:00 PM spike is missing. Coaster Alley is the area containing the Grinosaurus Stage and, as observed earlier, the 4:00 PM show on Sunday was cancelled, which explains the missing spike.
* Wet Land shows a spike in messages between about 11:30 and 12:00, and another from 12:01 to 12:20. The first spike most likely marks the time when the vandalism was discovered in Wet Land and visitors started informing the authorities or messaging each other. The second spike, just after the incident, could be visitors reacting to the police arriving to investigate the crime.
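
The spikes described above were read off the bar charts by eye. As a rough cross-check, a sketch along the following lines could flag unusually busy time bins for one location. The helper name `find_spikes`, the 5-minute bin width, and the mean-plus-two-standard-deviations threshold are assumptions chosen for illustration, not part of our analysis, and the timestamps are made-up sample data.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean, pstdev


def find_spikes(timestamps, bin_minutes=5, k=2.0):
    """Return (bin_start, count) pairs whose count exceeds mean + k * std.

    Hypothetical helper: the bin width and threshold are illustrative choices.
    Only bins that contain at least one message enter the statistics.
    """
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of its bin.
        floored = ts.replace(minute=ts.minute - ts.minute % bin_minutes,
                             second=0, microsecond=0)
        counts[floored] += 1
    if not counts:
        return []
    values = list(counts.values())
    threshold = mean(values) + k * pstdev(values)
    return sorted((b, c) for b, c in counts.items() if c > threshold)


# Made-up sample on an arbitrary date: one message per 5-minute bin
# from 11:00 to 12:55, plus a burst of 30 messages just after 12:00.
start = datetime(2014, 6, 8, 11, 0)
sample = [start + timedelta(minutes=5 * i) for i in range(24)]
sample += [datetime(2014, 6, 8, 12, 0, s) for s in range(30)]

spikes = find_spikes(sample)
print(spikes)
```

With real data, `timestamps` would be the parsed timestamp column filtered to a single location and day. A fixed threshold like this is crude, so any flagged bin should still be checked against the plots.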

In [36]:
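def generateMap(df):
    # Count messages per hour of day (8-23) and per park area.
    # Column 0 of df holds the timestamp and column 3 the location.
    areas = ['Tundra Land', 'Kiddie Land', 'Wet Land', 'Coaster Alley', 'Entry Corridor']
    m = {h: {a: 0 for a in areas} for h in range(8, 24)}
    # Parse all timestamps in one call instead of once per row.
    hours = pd.to_datetime(df.iloc[:, 0]).dt.hour
    for h, l in zip(hours, df.iloc[:, 3]):
        m[h][l] += 1

    return m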

m_fri = generateMap(df_com_fri)
m_sat = generateMap(df_com_sat)
m_sun = generateMap(df_com_sun)

In [41]:
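def generateSizes(m):
    # Circle centres on the park map, in the same order as x['label'].
    x1 = [360,300,120,420,600]
    y1 = [100,290,550,630,540]
    x = {'label': ['Coaster Alley', 'Wet Land', 'Tundra Land', 'Entry Corridor', 'Kiddie Land']}
    for h in m:
        total = sum(m[h].values())
        # Size = the area's share of that hour's messages, scaled to 200.
        # An hour with no messages gets size 0 instead of dividing by zero.
        x[str(h)] = [200 * m[h][l] / total if total else 0 for l in x['label']]
    x['x'] = x1
    x['y'] = y1
    x['cur_size'] = x['8']
    return x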

sizes_fri = generateSizes(m_fri)
sizes_sat = generateSizes(m_sat)
sizes_sun = generateSizes(m_sun)
park_map = Image.open('park_map.jpg').convert('RGBA')
xdim, ydim = park_map.size
img = np.empty((ydim, xdim), dtype=np.uint32)
view = img.view(dtype=np.uint8).reshape((ydim, xdim, 4))
view[:,:,:] = np.flipud(np.asarray(park_map))

dim = max(xdim, ydim)
fig1 = figure(title="DinoFunWorld", x_range=(0,dim), y_range=(0,dim))
fig1.image_rgba(image=[img], x=0, y=0, dw=xdim, dh=ydim, alpha=1.0)  # alpha is a 0-1 opacity
fig1.background_fill_alpha = 0.3

#CoasterAlley, WetLand, TundraLand, EntryCorridor, KiddieLand
fig1_source_fri = ColumnDataSource(data=dict(sizes_fri))
fig1_source_sat = ColumnDataSource(data=dict(sizes_sat))
fig1_source_sun = ColumnDataSource(data=dict(sizes_sun))
fig1_toPlot = ColumnDataSource(data=dict(sizes_fri))
fig1_c = fig1.circle(x = 'x',y = 'y',size='cur_size', source = fig1_toPlot, line_color='black', fill_color = '#D3D321')

fig1.add_tools(HoverTool(tooltips=[
    ("Location", "@label")
],renderers=[fig1_c]))

fig1_callback_day = CustomJS(args=dict(s_toPlot=fig1_toPlot, s_fri=fig1_source_fri, s_sat=fig1_source_sat,s_sun=fig1_source_sun), 
                            code='''
    var selection = cb_obj.value;
    switch(selection){
        case 'Friday':
            s_toPlot.data = s_fri.data;
            break;
        case 'Saturday':
            s_toPlot.data = s_sat.data;
            break;
        case 'Sunday':
            s_toPlot.data = s_sun.data;
            break;
    }
    s_toPlot.trigger('change')
''')
fig1_select_day = Select(title="Select Day", value='Friday', options=['Friday', 'Saturday', 'Sunday'], callback=fig1_callback_day)

fig1_callback_hour = CustomJS(args=dict(s_toPlot=fig1_toPlot), 
                            code='''
    var selection = cb_obj.value;
    s_toPlot.data['cur_size'] = s_toPlot.data[String(selection)]
    s_toPlot.trigger('change')
''')
fig1_select_hour = Slider(title='Time of Day (hour)', value=8, start = 8, end = 23, step = 1, callback=fig1_callback_hour)

fig1_callback_play = CustomJS(args=dict(s_toPlot = fig1_toPlot, selectW = fig1_select_hour), 
                            code='''
    console.log(selectW);
    function sleep(ms) {
      return new Promise(resolve => setTimeout(resolve, ms));
    }
    async function run(){
        for (var i = 8; i < 24; i++){
            s_toPlot.data['cur_size'] = s_toPlot.data[String(i)]
            selectW.value = i;
            s_toPlot.trigger('change')
            await sleep(700);
        }
    }
    run();
''')
fig1_playButton = Button(label='Play', button_type = 'success', callback = fig1_callback_play)

show(row(fig1, column(fig1_select_day, fig1_select_hour, fig1_playButton)))

### Visualising Communication across the park
The above visualisation can be used to see which area has the most message activity in every hour of the day. Some observations we make from this map:
* Most of the time, communication activity is highest in Wet Land. We think that, since this area has most of the thrill rides, it is the most popular and consequently has the most message traffic.
* Kiddie Land is by far the least popular area in the park, and has very low message activity.
* The entry corridor is crowded in the morning and night, when people queue up to enter and exit from the park.
* Wet Land has peaks in communication at 11 and 4 every day, when the Creighton Pavilion opens to showcase Scott Jones's memorabilia.
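
Each circle size on the map is an area's share of that hour's messages, scaled by 200, as computed in `generateSizes`. The sketch below works through that arithmetic for a single hour with made-up counts. The helper name `hourly_sizes` and its zero-total guard are assumptions added for illustration.

```python
def hourly_sizes(counts, labels, scale=200):
    """Scale each location's share of one hour's messages to a marker size."""
    total = sum(counts.get(label, 0) for label in labels)
    if total == 0:
        # No messages in this hour: draw nothing rather than divide by zero.
        return [0.0 for _ in labels]
    return [scale * counts.get(label, 0) / total for label in labels]


labels = ['Coaster Alley', 'Wet Land', 'Tundra Land', 'Entry Corridor', 'Kiddie Land']
# Made-up counts for one hour (not taken from the challenge data).
counts = {'Coaster Alley': 30, 'Wet Land': 50, 'Tundra Land': 10,
          'Entry Corridor': 8, 'Kiddie Land': 2}

sizes = hourly_sizes(counts, labels)
print(dict(zip(labels, sizes)))
```

Because the sizes are shares, they always sum to the scale factor within an hour. A large circle therefore means an area dominates that hour's traffic relative to the other areas, not that its absolute message count is high.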

In [46]:
# def getGraph(df):    #Generates networkx graph to export to Gephi
#     G = nx.Graph()   #No need to run this code
#     for i in range(len(df)):
#         f = df.iloc[i,1]
#         t = df.iloc[i,2]
#         if not G.has_edge(f,t) and not G.has_edge(t,f):
#             G.add_edge(f,t, weight = 1)
#     return G
        
# g_sun = getGraph(df_com_sun)
# nx.write_gexf(g_sun, "data/g_sun.gexf")

In [48]:
parser = pygraphml.GraphMLParser()
g = parser.parse("data/sun.graphml")

In [49]:
nodes = g.nodes()
fig2_toPlot = {'color': [], 'size': []}
# nodes_coords = {}
for i in range(len(nodes)):
    fig2_toPlot['color'].append('#%02x%02x%02x' % (int(nodes[i]['r']), int(nodes[i]['g']), int(nodes[i]['b'])))
    fig2_toPlot['size'].append(float(nodes[i]['size']) + 5)
#     nodes_coords[str(nodes[i]['label'])] = [float(nodes[i]['x']), float(nodes[i]['y'])]

In [50]:
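with open('data/figure2.pickle', 'rb') as handle:
    nodes_coords = pickle.load(handle)
# NOTE: g_sun is the NetworkX graph built by the commented-out getGraph cell above;
# it must already be defined in the kernel for this cell to run.
edges = list(g_sun.edges())
# Edge endpoints as strings, one start/end entry per edge.
s = [str(u) for u, v in edges]
e = [str(v) for u, v in edges]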

In [59]:
N = len(nodes_coords)
node_indices = list(nodes_coords)

x = [nodes_coords[i][0] for i in node_indices]
y = [nodes_coords[i][1] for i in node_indices]

fig2 = figure(title="Graph Layout Demonstration", x_range=(-3000,3000), y_range=(-3000,3000), width = 700, height = 700)
fig2.background_fill_color = 'black'
fig2.background_fill_alpha = 0.9
fig2.xgrid.grid_line_color = None
fig2.ygrid.grid_line_color = None

graph1 = GraphRenderer()

graph1.node_renderer.data_source.data = dict(index=node_indices, node_size = fig2_toPlot['size'], node_color = fig2_toPlot['color'])
graph1.node_renderer.glyph = Circle(radius = 'node_size', fill_color='node_color', line_color = None)

graph1.edge_renderer.data_source.data = dict(start=s,end=e)
graph1.edge_renderer.glyph = MultiLine(line_color="white", line_alpha=0.15, line_width=.1)

graph_layout = dict(zip(node_indices, zip(x, y)))
graph1.layout_provider = StaticLayoutProvider(graph_layout=graph_layout)
fig2.renderers.append(graph1)
show(fig2)

E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='d94e22bf-127a-4790-ae98-daed8481f9d6', ...)]


### Communication network
We created a NetworkX graph from the data provided and exported it to a GEXF file for Gephi to read. In Gephi we ran the Fruchterman-Reingold algorithm, a force-directed layout algorithm, on the graph. We then exported the resulting layout to a GraphML file and plotted it in Bokeh using GraphRenderer.
A node represents a unique visitor (ID), and a white edge connecting two nodes means there was at least one communication between them (edges are undirected, with weight 1).
We were unable to make the graph perform well with hover enabled, so we removed it from the final version of our code.
There are interesting patterns to note in this graph:
* The centre has the three IDs with the largest amount of communication: 839736, 1278894, and external. These IDs, as discussed earlier, are of special importance.
* We have colored some compact groups within the graph. These are nodes that communicate only among themselves, apart from the three IDs mentioned above. There are many large groups of 30-40 nodes communicating with each other.
* On the edges of the graph, we see smaller groups of 3-5 nodes in which a group leader communicates with the rest of the group, with minimal communication among the other members.
* The spherical cluster on the right consists of people who come to the park individually or in small groups, meet new people while they are there, and communicate with them.
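
The compact groups above were identified visually from the Gephi layout. One way such groups might be recovered programmatically is to drop the hub IDs and take the connected components of what remains. The sketch below does this with a plain breadth-first search on made-up edges. The function name `groups_without_hubs` and all sample IDs other than the three hub IDs are hypothetical.

```python
from collections import defaultdict, deque


def groups_without_hubs(edges, hubs):
    """Return connected components (as sorted lists) after removing hub nodes.

    Nodes whose only edges lead to a hub are dropped entirely.
    """
    hubs = set(hubs)
    adjacency = defaultdict(set)
    for a, b in edges:
        if a in hubs or b in hubs:
            continue  # ignore any edge that touches a hub
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for node in sorted(adjacency):
        if node in seen:
            continue
        # Breadth-first search collects everything reachable from `node`.
        queue = deque([node])
        seen.add(node)
        component = []
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(sorted(component))
    return groups


hubs = ['839736', '1278894', 'external']
# Made-up edges: a group of three, a pair, and some traffic to the hubs.
edges = [
    ('a1', 'a2'), ('a2', 'a3'),
    ('b1', 'b2'),
    ('a1', '839736'), ('b2', '1278894'), ('c1', 'external'),
]

groups = groups_without_hubs(edges, hubs)
print(groups)
```

On the real data, `edges` would be the sender and receiver ID pairs, and the resulting component sizes could be compared with the 30-40 and 3-5 node groups seen in the layout.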