# Project in TNM098 VT 2018, VAST Challenge MC2 2015

This project will explore data containing phone calls between people in an amusement park. The questions that will be answered are:

1. Identify those IDs that stand out for their large volumes of communication. For each of these IDs:
   1. Characterize the communication patterns you see.
   2. Based on these patterns, what do you hypothesize about these IDs?

Please limit your response to no more than 4 images and 300 words.

2. Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.

Please limit your response to no more than 10 images and 1000 words.

#### My approach:
To address the first question, I would extract all communication sent by the IDs with large volumes. From that subset I would try to visualize the following:
* With whom are these persons communicating?
* From what places?
* Where is the recipient located?
* What time of the day?
* Do the recipients communicate with each other?
* Any other properties of the communication data that turn out to be relevant
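
The questions in this list could be approached with a few pandas operations. The following is only a minimal sketch: it assumes a DataFrame with the columns `Timestamp`, `from`, `to` and `location`, and both the sample rows and the `heavy_ids` list are made up for illustration.

```python
import pandas as pd

# Made-up sample rows with the same column names as the challenge data.
comm = pd.DataFrame({
    "Timestamp": ["2014-06-06 08:03:19", "2014-06-06 08:03:19",
                  "2014-06-06 09:15:02", "2014-06-06 13:40:10",
                  "2014-06-06 13:41:00"],
    "from": [1, 1, 1, 2, 3],
    "to": [2, 3, 4, 3, 2],
    "location": ["Kiddie Land", "Kiddie Land", "Entry Corridor",
                 "Wet Land", "Wet Land"],
})
comm["Timestamp"] = pd.to_datetime(comm["Timestamp"])

heavy_ids = [1]  # hypothetical: IDs already identified as high-volume

# Keep only the messages sent by the high-volume IDs.
heavy = comm[comm["from"].isin(heavy_ids)]

# With whom do they communicate, and from what places?
recipients_per_sender = heavy.groupby("from")["to"].nunique()
messages_per_location = heavy["location"].value_counts()

# What time of the day? Count messages per hour.
messages_per_hour = heavy["Timestamp"].dt.hour.value_counts().sort_index()

# Do the recipients communicate with each other?
recipients = set(heavy["to"])
between_recipients = comm[comm["from"].isin(recipients)
                          & comm["to"].isin(recipients)]
```

Each of the resulting Series or DataFrames could then be passed to a chart, one view per question.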


First, start with imports.

In [1]:
import numpy as np
import pandas as pd
import nvd3

#import matplotlib.pyplot as plt
#import pylab
#import plotly.plotly as py

loaded nvd3 IPython extension
run nvd3.ipynb.initialize_javascript() to set up the notebook
help(nvd3.ipynb.initialize_javascript) for options


Read in the data for Friday.
A pandas Series is a one-dimensional labelled array (a vector), and a DataFrame is a two-dimensional table whose columns are Series.
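
As a small illustration of that relationship, the sketch below builds a Series and a DataFrame by hand, using a few of the IDs from the Friday file, and shows that selecting a single column of a DataFrame yields a Series. The variable names are chosen for this example only.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
senders = pd.Series([439105, 439105, 1836139], name="from")

# A DataFrame is a two-dimensional table; each column is a Series.
frame = pd.DataFrame({
    "from": [439105, 439105, 1836139],
    "to": [1053224, 1696241, 1593258],
})

column = frame["from"]          # selecting one column returns a Series
print(type(column).__name__)    # Series
print(frame.shape)              # (3, 2)
```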

In [33]:
data_Fri = pd.read_csv("data/comm-data-Fri.csv")
# type(data_Fri) is pandas.core.frame.DataFrame
# data_Fri.head() returns the first 5 rows:


#	Timestamp 			from 		to 			location
#0 	2014-6-06 08:03:19 	439105 		1053224 	Kiddie Land
#1 	2014-6-06 08:03:19 	439105 		1696241 	Kiddie Land
#2 	2014-6-06 08:03:19 	439105 		580064 		Kiddie Land
#3 	2014-6-06 08:03:19 	439105 		1464748 	Kiddie Land
#4 	2014-6-06 08:03:47 	1836139 	1593258 	Entry Corridor


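Find out which senders stand out for their large communication volumes. As a first step, keep only the senders whose message count is above the average; the five largest of these are the main candidates.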

In [38]:
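# Count how many messages each sender ID has sent
from_data = data_Fri['from']
message_counts = from_data.value_counts()

# Use the average number of sent messages as the cut-off
average_sent_messages = message_counts.mean()
threshold = average_sent_messages

# Drop every sender at or below the threshold;
# `removed` holds the senders that remain (those above the threshold)
to_remove = message_counts[message_counts <= threshold].index
removed = message_counts.drop(to_remove)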
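
The threshold above keeps every sender with an above-average count, which can still be a long list. One possible way to pick a fixed number of top senders is `Series.nlargest`. The sketch below uses a made-up sender column in place of `data_Fri['from']`, and picks three instead of five only because the sample is small.

```python
import pandas as pd

# Made-up sender column standing in for data_Fri['from'].
from_data = pd.Series([10, 10, 10, 10, 20, 20, 20, 30, 30, 40, 50, 60],
                      name="from")

# Messages per sender, sorted from most to fewest.
message_counts = from_data.value_counts()

# Senders strictly above the mean; same result as dropping those at or below it.
above_average = message_counts[message_counts > message_counts.mean()]

# The n senders with the most messages (n = 3 here, 5 in the text above).
top_senders = message_counts.nlargest(3)
```

A boolean mask avoids the intermediate `to_remove` index, and `nlargest` makes the number of selected senders explicit.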

There are different ways to access the data in a DataFrame, for example:

In [6]:
#data_Fri.loc[0:1, ['to', 'location']]
#data_Fri['from']
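
For comparison, here is a runnable sketch of the most common access patterns, using a hand-built DataFrame with the first rows of the Friday file in place of `data_Fri`. The names `frame`, `by_label`, `by_position`, `one_column` and `kiddie_land` exist only in this example.

```python
import pandas as pd

frame = pd.DataFrame({
    "Timestamp": ["2014-6-06 08:03:19", "2014-6-06 08:03:19",
                  "2014-6-06 08:03:47"],
    "from": [439105, 439105, 1836139],
    "to": [1053224, 1696241, 1593258],
    "location": ["Kiddie Land", "Kiddie Land", "Entry Corridor"],
})

# Label-based selection: the end label (1) is included.
by_label = frame.loc[0:1, ["to", "location"]]

# Position-based selection: the end position is excluded.
by_position = frame.iloc[0:1, 2:4]

# A single column, returned as a Series.
one_column = frame["from"]

# Row filtering with a boolean mask.
kiddie_land = frame[frame["location"] == "Kiddie Land"]
```

Note that `loc[0:1]` returns two rows while `iloc[0:1]` returns one, which is an easy source of off-by-one mistakes.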

Use nvd3 to build a bar chart of the remaining senders and export it as HTML so it can be displayed in the notebook.

In [36]:
nvd3.ipynb.initialize_javascript(use_remote=True)
np.random.seed(100)

chart_type = 'discreteBarChart'
chart = nvd3.discreteBarChart(name=chart_type, height=700, width=1500)

# Skip the first two entries, i.e. the two senders with the most messages
ydata = removed.to_list()[2:]
xdata = list(removed.keys())[2:]

chart.add_serie(y=ydata, x=xdata)
chart.buildhtml()
chart_html = chart.htmlcontent

# The chart in html code
#chart_html


In [37]:
from IPython.display import display, HTML


display(HTML(chart_html))