# Graph Analysis
## Data Download
To analyse the data, we first need to read in the Scala output. Unfortunately, the Scala program wrote a folder of CSV part files rather than a single file, so we need to iterate through each file and append its data to one DataFrame. The part files are named 'part-00[200-399,600-799]' followed by the same string of characters. However, for some reason the scheme does not include parts 605, 685 or 708.

In [2]:
import pandas as pd
import glob
import os

In [42]:
# Part numbers run 200-399 and 600-799; parts 605, 685 and 708 do not exist.
missing = {605, 685, 708}
parts = [n for n in list(range(200, 400)) + list(range(600, 800)) if n not in missing]
# Read every part file, then concatenate once rather than growing the DataFrame inside a loop.
frames = [pd.read_csv("../data/sent1.csv/part-00" + str(n) + "-7d6177c2-d913-4829-b6e3-3a1722f32bcd-c000.csv", names=["id", "neg_sentiment"]) for n in parts]
app1 = pd.concat(frames, ignore_index=True)

In [54]:
# Part numbers run 200-399 and 600-799; parts 605, 685 and 708 do not exist.
missing = {605, 685, 708}
parts = [n for n in list(range(200, 400)) + list(range(600, 800)) if n not in missing]
# Read every part file, then concatenate once rather than growing the DataFrame inside a loop.
frames = [pd.read_csv("../data/sent2.csv/part-00" + str(n) + "-639f09f3-46ed-41c8-91c5-b12e3652562e-c000.csv", names=["id", "neg_sentiment"]) for n in parts]
app2 = pd.concat(frames, ignore_index=True)
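
Because the part numbers are not contiguous, an alternative to hard-coding them is to let `glob` discover whichever part files are present. The following is only a minimal sketch: `load_parts` is a hypothetical helper, not part of the original notebook, and it assumes every file in the folder that matches `part-*.csv` belongs to the same output and has no header row.

```python
import glob
import os

import pandas as pd


def load_parts(folder, columns=("id", "neg_sentiment")):
    """Read every part-*.csv in `folder` into one DataFrame (hypothetical helper)."""
    # Sorting keeps the row order reproducible between runs.
    paths = sorted(glob.glob(os.path.join(folder, "part-*.csv")))
    if not paths:
        raise FileNotFoundError("no part files found in " + folder)
    # Read each file without a header row, then concatenate once.
    frames = [pd.read_csv(path, names=list(columns)) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

With this helper, the two loading cells would reduce to calls such as `load_parts("../data/sent1.csv")` and `load_parts("../data/sent2.csv")`, and the missing parts 605, 685 and 708 would need no special handling.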

## Basic Analysis

To simulate negative sentiment spreading through a network, we first assigned a negative sentiment of 1 to everyone who was fired during the period of data collection and 0 to everyone else. Then, for each vertex, we summed the weights of its in-edges (calculated differently for each approach) and added this sum to the vertex's negative sentiment. This new negative sentiment is then used as the prior for the next iteration of the same method. Negative sentiment therefore spreads throughout the network, with a higher value indicating a higher likelihood of being an insider threat.
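
Written as an update rule, this procedure might look as follows. The symbols are introduced here for illustration only: $s_v^{(t)}$ is the negative sentiment of vertex $v$ after $t$ iterations, $E$ is the set of directed edges, and $w_{uv}^{(t)}$ is the approach-specific weight of edge $u \to v$. The rule assumes every vertex is updated from the previous iteration's values.

$$
s_v^{(t+1)} = s_v^{(t)} + \sum_{(u,v) \in E} w_{uv}^{(t)}, \qquad
s_v^{(0)} = \begin{cases} 1 & \text{if } v \text{ was fired} \\ 0 & \text{otherwise} \end{cases}
$$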


### Approach 1
For approach 1, consider an edge a->b. The weight of this edge is a('neg_sentiment') / inDegree(b). This conveys the negative sentiment of the source node, while reflecting our reasoning that a destination node that receives lots of emails is less likely to be affected by the sentiment of any one sender.

Originally we also wanted to divide by outDegree(a), but reasoned that just because a sends out lots of emails does not mean its effect on each recipient is any different.
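
To make the weighting concrete, here is a minimal pure-Python sketch of approach 1 on a toy graph. It is not the Scala implementation: `spread_approach1` and the toy data are hypothetical, and the sketch assumes each (source, destination) pair counts as one edge and that all vertices are updated from the previous iteration's values.

```python
from collections import Counter


def spread_approach1(edges, sentiment, iterations=2):
    """Spread sentiment along directed (source, destination) edges.

    Assumption: every vertex is updated from the previous iteration's values.
    """
    in_degree = Counter(dst for _, dst in edges)
    current = dict(sentiment)
    for _ in range(iterations):
        updated = dict(current)
        for src, dst in edges:
            # Weight of edge src->dst: source sentiment / in-degree of destination.
            updated[dst] += current[src] / in_degree[dst]
        current = updated
    return current


# Toy graph: "a" was fired, so it starts at 1; everyone else starts at 0.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
toy_start = {"a": 1.0, "b": 0.0, "c": 0.0}
result = spread_approach1(toy_edges, toy_start)
print(result)  # {'a': 1.0, 'b': 2.0, 'c': 1.5}
```

In this toy graph, vertex c receives less from a in each iteration than b does (0.5 against 1.0), because c has two in-edges and b has only one.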

In [48]:
app1["neg_sentiment"].describe()

count    4432.000000
mean        0.003776
std         0.042134
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.108481
Name: neg_sentiment, dtype: float64

In [56]:
count1 = (app1["neg_sentiment"] == 0).sum()
count1

3477

### Approach 2
For approach 2 we used a PageRank algorithm to calculate the importance of each edge to the popularity of the graph. We then multiplied this importance by the negative sentiment of the source node to get the weight of the edge. We chose this because it provides a better theoretical weighting of the importance of edges, so we hoped that multiplying it by the negative sentiment would provide a better method for spreading negative sentiment throughout the graph.

In [55]:
app2["neg_sentiment"].describe()

count     4432.000000
mean        45.968997
std        673.541401
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      24594.844048
Name: neg_sentiment, dtype: float64

In [57]:
count2 = (app2["neg_sentiment"] == 0).sum()
count2

3477

Both approaches show that the vast majority of vertices have a neg_sentiment of 0: 3477 out of 4432, or about 78%. It makes sense that the number of zero-sentiment vertices is the same for each approach, because both scores are increasing functions with the property that (the score of approach 1 > 0) iff (the score of approach 2 > 0).
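
Equal zero counts are consistent with this property, but they do not by themselves show that the same vertices are non-zero in both approaches. A minimal sketch of a direct check is given below. `same_nonzero_ids` is a hypothetical helper, and the toy frames only stand in for `app1` and `app2`.

```python
import pandas as pd


def same_nonzero_ids(df_a, df_b, key="id", col="neg_sentiment"):
    """Return True if both frames have exactly the same ids with a score above 0."""
    ids_a = set(df_a.loc[df_a[col] > 0, key])
    ids_b = set(df_b.loc[df_b[col] > 0, key])
    return ids_a == ids_b


# Toy stand-ins: different magnitudes, but the same vertices are non-zero.
toy1 = pd.DataFrame({"id": [1, 2, 3], "neg_sentiment": [0.0, 0.2, 1.1]})
toy2 = pd.DataFrame({"id": [3, 2, 1], "neg_sentiment": [40.0, 7.5, 0.0]})
print(same_nonzero_ids(toy1, toy2))  # True
```

On the real data, the equivalent call would be `same_nonzero_ids(app1, app2)`.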

However, the fact that so many vertices have 0 negative sentiment suggests that we should have run more than just 2 iterations of dissemination. Having so many 0's makes the data harder to analyse for this project. I also do not believe it is realistic that so many people would not be affected at all by the firings.

Now, this number of iterations could be realistic for negative sentiment dissemination via email, because email is such a small part of a workplace's communication infrastructure (alongside instant messaging, meetings and general office chatter). However, I think that increasing the number of iterations could help close the gap to holistic negative sentiment dissemination, even if it decreases the realism of dissemination via email alone.

There is a very big difference between the summary statistics of the two approaches. Approach 1 is much closer to what I was expecting, with a maximum value around 1. On the other hand, approach 2 produced much higher scores than I was expecting. I am confused as to why the values of approach 2 are so high, given that the edge weights produced by the PageRank algorithm were between 0 and 1, so summing them should not have produced such high values. Nevertheless, we shall normalise both approaches and compare their normalised forms.
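
The normalisation method is not fixed at this point, so the following is only one possible sketch. It assumes min-max scaling to the range [0, 1], and `min_max_normalise` is a hypothetical helper.

```python
import pandas as pd


def min_max_normalise(scores):
    """Rescale a Series linearly so its minimum maps to 0 and its maximum to 1."""
    low, high = scores.min(), scores.max()
    if high == low:
        # All values are identical, so there is no spread to rescale.
        return scores * 0.0
    return (scores - low) / (high - low)


toy_scores = pd.Series([0.0, 0.5, 2.0])
print(min_max_normalise(toy_scores).tolist())  # [0.0, 0.25, 1.0]
```

Applied to `app1["neg_sentiment"]` and `app2["neg_sentiment"]`, this would put both approaches on the same scale. With a maximum as extreme as approach 2's 24594.84, though, most non-zero scores would be squeezed close to 0.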

If the values are too low, then we need to emphasise the spread of negative sentiment.