<img src="https://www.cems.uwe.ac.uk/~pa-legg/uwecyber/images/uwe.png" width=300>
<img src="https://www.cems.uwe.ac.uk/~pa-legg/uwecyber/images/uwecyber_acecse_200.jpg" width=300>

# UFCFEL-15-3 Security Data Analytics and Visualisation
# Portfolio Task 3: Insider Threat Detection (2021)
---

The completion of this worksheet is worth **40%** towards your portfolio for the UFCFEL-15-3 Security Data Analytics and Visualisation (SDAV) module.

### Task
---

In this task, you have been asked to investigate a potential security threat within an organisation. Building on your previous worksheet expertise, you will need to apply your skills and knowledge of data analytics and visualisation to examine and explore the datasets methodically to uncover which employee is acting as a threat and why. The company has provided you with activity logs covering various user interactions over the past 6 months, resulting in a large volume of data that they need your expertise to decipher. They want a report that details the investigation you have carried out, identifies the suspected individual, and gives a clear rationale for why this suspect is flagged. You will need to document your investigation, giving clear justification for your process using Markdown annotation within your notebook. Your rationale for suspecting a given individual must be based on the pattern of activity that you identify.

<i>This coursework is specifically designed to challenge your critical thinking and creativity, and is designed as an open problem. Examine the data and try to think how an individual user may appear as an anomaly against the remainder of the data. This could be an anomaly compared to a group of users, or an anomaly as compared over time.</i>
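
As a rough illustration of the "anomaly compared to a group of users" idea, the sketch below scores each user's event count against the median of their role. It is only one possible approach, not a prescribed method: the toy rows, the function name `robust_peer_scores` and the cut-off of 3.5 are assumptions made for this example.

```python
import pandas as pd

# Toy activity log: the column names mirror the worksheet data (user, role),
# but every row here is invented for illustration.
events = pd.DataFrame({
    "user": ["usr-a"] * 5 + ["usr-b"] * 6 + ["usr-c"] * 4 + ["usr-d"] * 30,
    "role": ["Finance"] * 45,
})

def robust_peer_scores(events: pd.DataFrame) -> pd.Series:
    """Score each user's event count against the median of their role.

    The spread is measured with the median absolute deviation (MAD), so one
    extreme user does not inflate the spread they are measured against.
    """
    counts = events.groupby(["role", "user"]).size()
    median = counts.groupby(level="role").transform("median")
    deviation = (counts - median).abs()
    mad = deviation.groupby(level="role").transform("median")
    # Guard against a zero MAD when all peers behave identically.
    return (counts - median) / mad.replace(0, 1)

scores = robust_peer_scores(events)
# The cut-off of 3.5 is an arbitrary choice for this example.
suspects = scores[scores > 3.5].index.get_level_values("user").tolist()
print(suspects)
```

The same scoring could be applied to one user's daily counts instead of one role's users to look for an anomaly over time.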


### Assessment and Marking
---

| Criteria | 0-39 | 40-49 | 50-59 | 60-69 | 70-84 | 85-100 |
| --- | --- | --- | --- | --- | --- | --- |
| **Identification of the suspicious activity (30%)** | No evidence of progress | A limited attempt to address this criterion | A working solution but perhaps not optimal | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |
| **Analytical process and reasoning (30%)** | No evidence of progress | A limited attempt to address this criterion | A working solution but perhaps not optimal | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |
| **Visualisation techniques (20%)** | No evidence of progress | A limited attempt to address this criterion | A working solution but perhaps not optimal | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |
| **Clarity and professional presentation (20%)** | No evidence of progress | A limited attempt to address this criterion | Some evidence of markdown commentary | Good approach to the problem | Very good approach to the problem | Excellent approach to the problem |

To achieve the higher end of the grade scale, you need to demonstrate creativity in how you approach the problem of identifying malicious behaviours, and ensure that you have accounted for multiple anomalies across the set of data available.

You will need to implement your final solution in the Notebook format, with Markdown annotation. You should use this notebook file as a template for your submission. You are also expected to complete the assignment self-assessment.

Your submission should include:
- HTML export of your complete assignment in notebook format.
- Original ipynb source file of your notebook.

### Self-Assessment
---

For each criterion, please reflect on the marking rubric and indicate what grade you would expect to receive for the work that you are submitting. For your own personal development and learning, it is important to reflect on your work and to attempt to assess it carefully. Think carefully about both the positive aspects of your work and any limitations you may have faced.

- **Identification of the suspicious activity (30%)**: You estimate that your grade will be 70.

- **Analytical process and reasoning (30%)**: You estimate that your grade will be 70.

- **Visualisation techniques (20%)**: You estimate that your grade will be 70.

- **Clarity and professional presentation (20%)**: You estimate that your grade will be 70.

Please provide a minimum of two sentences to comment and reflect on your own self-assessment: I learned more about how to use visual charts to analyze data. I have a deeper understanding of data analysis.


### Contact
---

Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You can use the Blackboard Q&A feature to ask questions related to this module and this assignment, as well as the on-site teaching sessions.

---


## Load in the data

Please provide the string below that you have been assigned as given in the spreadsheet available on Blackboard. Please also ensure you have saved your dataset folder in the following directory relative to your notebook: **"./T3_data/"**

In [1]:
# POSSIBLE DATASETS FOR 2021-22 MODULE RUN
dataset_list = ['lockdown-lockups', 'onlinebargains', 'trackntrace', 'zoooom']

# PLEASE ENTER THE NAME OF THE DATASET ASSIGNED TO YOU AS INDICATED ON BLACKBOARD
DATASET = 'onlinebargains'

### Function for loading data - do not change

In [None]:
import random
import string
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

def load_data(DATASET):

    if DATASET in dataset_list:
        email_data = pd.read_csv('./T3_data/' + DATASET + '/email_data.csv', parse_dates=True, index_col=0)
        file_data = pd.read_csv('./T3_data/' + DATASET + '/file_data.csv', parse_dates=True, index_col=0)
        web_data = pd.read_csv('./T3_data/' + DATASET + '/web_data.csv', parse_dates=True, index_col=0)
        login_data = pd.read_csv('./T3_data/' + DATASET + '/login_data.csv', parse_dates=True, index_col=0)
        usb_data = pd.read_csv('./T3_data/' + DATASET + '/usb_data.csv', parse_dates=True, index_col=0)
        employee_data = pd.read_csv('./T3_data/' + DATASET + '/employee_data.csv', index_col=0)
        
        email_data['datetime'] = pd.to_datetime(email_data['datetime'])
        file_data['datetime'] = pd.to_datetime(file_data['datetime'])
        web_data['datetime'] = pd.to_datetime(web_data['datetime'])
        login_data['datetime'] = pd.to_datetime(login_data['datetime'])
        usb_data['datetime'] = pd.to_datetime(usb_data['datetime'])
        print(len(email_data),len(file_data),len(web_data),len(login_data),len(usb_data))
    else:
        print ("DATASET variable not defined - please refer to Blackboard for DATASET name")
        return
    return employee_data, login_data, usb_data, web_data, file_data, email_data

employee_data, login_data, usb_data, web_data, file_data, email_data = load_data(DATASET)
# employee_data

## Start your Investigation

In [None]:
### The following code samples may be useful to aid your investigation

In [None]:
### Create a hierarchy dictionary that lists all users, and all user e-mails, within each role
user_set = {}
user_set_emails = {}
all_roles = employee_data['role'].unique()
for role in all_roles:
    user_set[role] = list(employee_data[employee_data['role'] == role]['user'].values)
    user_set_emails[role] = list(employee_data[employee_data['role'] == role]['email'].values)
# user_set_emails

In [None]:
### Filter the data by all users that are in a given role
file_data[ file_data['user'].isin(user_set[role]) ]

In [None]:
### Get the day of the year for a given datetime column
# email_data['datetime'].dt.dayofyear

In [None]:
### Number of unique users in the dataset - this should be 249
# len(employee_data['user'].unique())

The above code will load in the dataset, and it will reveal details about users based on the role that they are linked to. It will show you how you can filter the data based on those in a particular role, and it will show how you can obtain the day of the year that a data entry refers to.
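
The role filter and the day-of-year accessor can be combined into a per-user, per-day activity table, which is often a convenient starting point for plots. The sketch below uses invented rows; `toy_log` and `toy_roles` are hypothetical stand-ins for one of the loaded logs and for `user_set`.

```python
import pandas as pd

# Invented rows shaped like the worksheet logs (datetime and user columns).
toy_log = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-01 09:00", "2020-01-01 10:30", "2020-01-02 09:15",
        "2020-01-01 11:00", "2020-01-03 23:45", "2020-01-03 23:50",
    ]),
    "user": ["usr-a", "usr-a", "usr-a", "usr-b", "usr-b", "usr-b"],
})
# Hypothetical role dictionary with the same shape as user_set.
toy_roles = {"Finance": ["usr-a"], "Technical": ["usr-b"]}

# Keep only one role's users, then count events per user per day of the year.
subset = toy_log[toy_log["user"].isin(toy_roles["Technical"])]
day = subset["datetime"].dt.dayofyear.rename("day")
daily = subset.groupby(["user", day]).size().unstack(fill_value=0)
print(daily)
```

Each row of `daily` is a user and each column a day of the year, so a heatmap or line plot of this table shows both who is active and when.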

Now it is over to you...

# Analysis of the number of web pages visited by users

In [None]:
import numpy as np
# sns_plot = sns.countplot(x="user",data=web_data)
# sns_plot.figure.set_size_inches(18,10)


In [None]:
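def get_maxRequestByUser():
    # Flag users with more than 20,000 web requests in total
    browser_user = web_data.groupby('user').agg({'website': np.count_nonzero}) > 20000
    # Keep only the users whose flag is True
    user_target = browser_user[browser_user["website"]]
    # Return the flagged user names without shadowing the global user_set dictionary
    return set(user_target.index)
user_set_b = get_maxRequestByUser()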
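
The fixed cut-off of 20,000 requests used above depends on the dataset. A possible alternative, sketched here under the assumption that "heavy" means "above a chosen quantile of all users", derives the threshold from the data itself. The function name `heavy_users` and the toy rows are hypothetical, and the sketch counts rows per user rather than non-zero `website` values.

```python
import pandas as pd

def heavy_users(log: pd.DataFrame, q: float = 0.99) -> set:
    """Return users whose row count exceeds the q-th quantile over all users."""
    counts = log["user"].value_counts()
    return set(counts[counts > counts.quantile(q)].index)

# Invented example: nine users with 10 requests each and one with 200.
toy_web = pd.DataFrame({
    "user": [f"usr-{i}" for i in range(9) for _ in range(10)] + ["usr-x"] * 200
})
print(heavy_users(toy_web, q=0.9))
```

A quantile adapts to each dataset, but it always flags some users, so the flagged set still needs to be checked against other evidence.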


# Website browsing analysis

In [None]:
# sns.lineplot(x="user",y= 'website',data=web_data);

In [None]:
# Reduce each timestamp to a 'YYYY-MM-DD' string so that visits can be counted per day
web_data['datetime'] = web_data['datetime'].dt.strftime('%Y-%m-%d')

In [None]:
# sns_plot = sns.countplot(x="datetime",data=web_data)
# sns_plot.figure.set_size_inches(18,10)

# User file operation frequency analysis

In [None]:
# Reduce each timestamp to a 'YYYY-MM-DD' string so that file operations can be counted per day
file_data['datetime'] = file_data['datetime'].dt.strftime('%Y-%m-%d')

In [None]:
# Flag users with at least 31,225 file operations in total
file_user = file_data.groupby('user').agg({'filename': np.count_nonzero}) >= 31225
# Keep only the users whose flag is True
user_target = file_user[file_user["filename"]]
user_set_c = set(user_target.index)
# user_set_c

In [None]:

user_set_intersect = user_set_b.intersection(user_set_c)

# user_set_intersect

In [None]:
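# Pair each flagged user with their role, using the role -> users dictionary built earlier
user_role_rel = []
for item in user_set:
    for j in user_set_intersect:
        if j in user_set[item]:
            user_role_rel.append((item, j))  # (role, user) tuple keeps the order explicit
user_role_rel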


## Results

According to the analysis list, the flagged employees all belong to the Technical department rather than a security department, yet they often visit the network-attack-related website http://www.wireshark.com. These employees may therefore pose an insider threat.
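
One way to check a claim of this kind is to count visits to a single site per role and user. The sketch below is only an illustration on invented rows: `toy_staff` and `toy_web` are hypothetical stand-ins for `employee_data` and `web_data`, and the role names are made up.

```python
import pandas as pd

# Invented rows; "user", "role" and "website" follow the worksheet column names.
toy_staff = pd.DataFrame({
    "user": ["usr-a", "usr-b", "usr-c"],
    "role": ["Security", "Technical", "Technical"],
})
toy_web = pd.DataFrame({
    "user": ["usr-a", "usr-a", "usr-b", "usr-c", "usr-c"],
    "website": [
        "http://www.wireshark.com", "http://www.bbc.co.uk",
        "http://www.wireshark.com", "http://www.bbc.co.uk", "http://www.bbc.co.uk",
    ],
})

# Attach each visitor's role, then count visits to one site per role and user.
visits = toy_web.merge(toy_staff, on="user", how="left")
site_hits = (
    visits[visits["website"] == "http://www.wireshark.com"]
    .groupby(["role", "user"])
    .size()
)
print(site_hits)
```

Comparing the counts across roles shows whether visiting the site is normal for a user's peers or unusual for their role.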