# Project in TNM098 VT 2018, VAST Challenge MC2 2015

This project will explore data containing phone calls between people in an amusement park. The questions that will be answered are:

1. Identify those IDs that stand out for their large volumes of communication. For each of these IDs
  1. Characterize the communication patterns you see.
  2. Based on these patterns, what do you hypothesize about these IDs?

Please limit your response to no more than 4 images and 300 words.

2. Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.

Please limit your response to no more than 10 images and 1000 words.

#### My approach:
To start solving the first problem, I would extract all communication from the IDs with large volumes. From that I would try to visualize the following:
* Who are these people communicating with?
* From which locations do they send?
* Where is the recipient located?
* At what time of day?
* Do the recipients communicate with each other?
* Any other relevant properties of the communication data
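
As a rough illustration of the questions above, the following sketch filters a small, made-up table that has the same columns as the park data (`Timestamp`, `from`, `to`, `location`). The IDs, the `busy_ids` list and the variable names are assumptions for illustration only, not values from the real data set.

```python
import pandas as pd

# Hypothetical toy records shaped like the park data; all values are made up.
toy = pd.DataFrame({
    "Timestamp": ["2014-6-06 08:03:19", "2014-6-06 08:03:19",
                  "2014-6-06 09:10:00", "2014-6-06 09:15:00"],
    "from": [1, 1, 2, 3],
    "to": [2, 3, 3, 2],
    "location": ["Kiddie Land", "Kiddie Land",
                 "Entry Corridor", "Entry Corridor"],
})

busy_ids = [1]  # assumed list of high-volume IDs

# Keep only the messages sent by the high-volume IDs
sent = toy[toy["from"].isin(busy_ids)]

# Who do they communicate with, and from which locations?
recipients = set(sent["to"])
sender_locations = sent["location"].value_counts()

# Do the recipients communicate with each other?
among_recipients = toy[toy["from"].isin(recipients) & toy["to"].isin(recipients)]
```

The same three steps (filter by sender, summarize, cross-check the recipients) could be repeated for each high-volume ID.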


First, start with imports.

In [44]:
import numpy as np
import pandas as pd
import nvd3
import ipywidgets as widgets
from IPython.display import display, HTML

#import matplotlib.pyplot as plt
#import pylab
#import plotly.plotly as py

Read in the data for Friday.
A pandas Series is a one-dimensional labeled array, and a DataFrame is a two-dimensional table whose columns are Series.

In [2]:
data_Fri = pd.read_csv("data/comm-data-Fri.csv")
#type(data_Fri): pandas.core.frame.DataFrame
#data_Fri.head(): gives 5 first elements


#	Timestamp 			from 		to 			location
#0 	2014-6-06 08:03:19 	439105 		1053224 	Kiddie Land
#1 	2014-6-06 08:03:19 	439105 		1696241 	Kiddie Land
#2 	2014-6-06 08:03:19 	439105 		580064 		Kiddie Land
#3 	2014-6-06 08:03:19 	439105 		1464748 	Kiddie Land
#4 	2014-6-06 08:03:47 	1836139 	1593258 	Entry Corridor
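
Since `read_csv` is called without date parsing, the `Timestamp` column arrives as plain strings. To answer the "what time of day" question, the strings have to be converted first. The sketch below is one possible way, using Python's `strptime` because it accepts the non-zero-padded month (`2014-6-06`) seen in the data. The toy frame and the variable names are assumptions.

```python
from datetime import datetime

import pandas as pd

# Toy timestamps in the same format as the CSV preview
toy = pd.DataFrame({
    "Timestamp": ["2014-6-06 08:03:19", "2014-6-06 08:03:47",
                  "2014-6-06 12:00:00"],
})

# Parse each string; %m in strptime also matches a single-digit month
parsed = pd.to_datetime(
    toy["Timestamp"].map(lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))
)

# Hour of day for every message, and the number of messages per hour
hours = parsed.dt.hour
per_hour = hours.value_counts().sort_index()
```

Calling `pd.to_datetime` directly on the column may also work, but how it treats the single-digit month can depend on the pandas version.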


Find out which senders and receivers stand out for their large communication volumes. Pick the five largest, here by applying fixed count thresholds.

In [45]:
#Count occurrences for each sender
from_data = data_Fri['from']
to_data = data_Fri['to']
send_counts = from_data.value_counts()
receive_counts = to_data.value_counts()

#print(send_counts)
#print(receive_counts)

#send_average = send_counts.mean()
#receive_average = receive_counts.mean()

# Keep senders with more than 1650 sent messages
highest_senders = send_counts[send_counts > 1650]

# Keep receivers with more than 1350 received messages,
# and remove 'external' from the receivers
highest_receivers = receive_counts[receive_counts > 1350].drop('external', errors='ignore')

#print(highest_senders)
#print(highest_receivers)
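
An alternative to hand-picked thresholds is to ask pandas for the top entries directly with `Series.nlargest`. The sketch below uses made-up IDs and `n_top = 2` so the result is easy to check by eye; on the real data `n_top = 5` would match the "five largest" goal. All names and values here are assumptions.

```python
import pandas as pd

n_top = 2  # use 5 on the real data

# Hypothetical sender IDs, one entry per message
toy_from = pd.Series([10, 10, 10, 20, 20, 30, 40, 40, 40, 40, 50, 60])
top_senders = toy_from.value_counts().nlargest(n_top)

# Hypothetical receiver labels; 'external' is excluded before ranking
toy_to = pd.Series(["external", "external", "external", "7", "7", "8"])
top_receivers = (
    toy_to.value_counts()
    .drop("external", errors="ignore")
    .nlargest(1)
)
```

This avoids having to re-tune the cut-off values when another day's file is loaded.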

Different ways to access the data in the DataFrame

In [4]:
#data_Fri.loc[0:1, ['to', 'location']]
#data_Fri['from']
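
A short sketch of the access patterns mentioned above, run on a three-row copy of the preview shown earlier. The variable names are only illustrative.

```python
import pandas as pd

# Three rows taken from the data preview
toy = pd.DataFrame({
    "from": [439105, 439105, 1836139],
    "to": [1053224, 1696241, 1593258],
    "location": ["Kiddie Land", "Kiddie Land", "Entry Corridor"],
})

col = toy["from"]                               # one column as a Series
sub = toy.loc[0:1, ["to", "location"]]          # label-based; end label is included
first = toy.iloc[0]                             # position-based; first row as a Series
kiddie = toy[toy["location"] == "Kiddie Land"]  # rows selected by a boolean mask
```

Note that `loc` slices include the end label, so `0:1` returns two rows, unlike ordinary Python slicing.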

In [34]:
widgets.Dropdown(
    options=['1', '2', '3'],
    value='2',
    description='Number:',
    disabled=False,
)
#display(w)

Dropdown(description='Number:', index=1, options=('1', '2', '3'), value='2')

Use nvd3 to construct the plot, and export it to view as html.

In [46]:
nvd3.ipynb.initialize_javascript(use_remote=True)
np.random.seed(100)

chart_type = 'discreteBarChart'
chart = nvd3.discreteBarChart(name=chart_type, height=700, width=1500)

ydata = highest_senders.to_list()
xdata = list(highest_senders.keys())

chart.add_serie(y=ydata, x=xdata)
chart.buildhtml()
chart_html = chart.htmlcontent

# The chart in html code
#chart_html

# Show the plot
display(HTML(chart_html))


Since 1278894 is clearly the biggest sender and receiver, we take a closer look at its communication patterns.

In [54]:
# Get all data where 1278894 is the sender
sender_1278894 = data_Fri[data_Fri['from'] == 1278894]

# Count identical rows (DataFrame.value_counts requires pandas >= 1.1)
counts = sender_1278894.value_counts()

                 Timestamp     from       to        location
192866  2014-6-06 12:00:00  1278894  1231028  Entry Corridor
192867  2014-6-06 12:00:00  1278894   626177  Entry Corridor
192868  2014-6-06 12:00:00  1278894  1281941  Entry Corridor
192869  2014-6-06 12:00:00  1278894    96504  Entry Corridor
192870  2014-6-06 12:00:00  1278894   256620  Entry Corridor
192871  2014-6-06 12:00:00  1278894   656123  Entry Corridor
192872  2014-6-06 12:00:00  1278894  1295204  Entry Corridor
192873  2014-6-06 12:00:00  1278894  1688081  Entry Corridor
192874  2014-6-06 12:00:00  1278894   953336  Entry Corridor
192875  2014-6-06 12:00:00  1278894   142394  Entry Corridor
192876  2014-6-06 12:00:00  1278894  1242773  Entry Corridor
192877  2014-6-06 12:00:00  1278894   477978  Entry Corridor
192878  2014-6-06 12:00:00  1278894   856067  Entry Corridor
192879  2014-6-06 12:00:00  1278894   315002  Entry Corridor
192880  2014-6-06 12:00:00  1278894  1399755  Entry Corridor
192881  2014-6-06 12:00:
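
The listing above shows many rows with the same timestamp and location but different recipients, which may point to broadcast-style messages. One way to check this is to count recipients per timestamp, as in the sketch below. The toy rows reuse a few IDs from the listing, but the second timestamp and the repeated recipient are made up for illustration.

```python
import pandas as pd

# Toy rows modeled on the listing; the 12:05:00 burst is hypothetical
toy = pd.DataFrame({
    "Timestamp": ["2014-6-06 12:00:00"] * 3 + ["2014-6-06 12:05:00"] * 2,
    "from": [1278894] * 5,
    "to": [1231028, 626177, 1281941, 1231028, 96504],
    "location": ["Entry Corridor"] * 5,
})

sent = toy[toy["from"] == 1278894]

# How often each recipient is contacted, and from where the sender writes
per_recipient = sent["to"].value_counts()
per_location = sent["location"].value_counts()

# Number of distinct recipients reached at each timestamp
recipients_per_burst = sent.groupby("Timestamp")["to"].nunique()
```

If most timestamps reach many distinct recipients at once, that would support the broadcast hypothesis; a handful of recipients contacted repeatedly would suggest something else.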

In [50]:



from nvd3 import stackedAreaChart
chart = stackedAreaChart(name='stackedAreaChart', height=700, width=1500)

xdata = [100, 101, 102, 103, 104, 105, 106,]
ydata = [6, 11, 12, 7, 11, 10, 11]
ydata2 = [8, 20, 16, 12, 20, 28, 28]

extra_serie = {"tooltip": {"y_start": "There is ", "y_end": " min"}}
chart.add_serie(name="Serie 1", y=ydata, x=xdata, extra=extra_serie)
chart.add_serie(name="Serie 2", y=ydata2, x=xdata, extra=extra_serie)
chart.buildhtml()

chart_html = chart.htmlcontent

# The chart in html code
#chart_html

# Show the plot
display(HTML(chart_html))