<img src="https://uwe-cyber.github.io/images/uwe_banner.png">

# UFCFEL-15-3 Security Data Analytics and Visualisation
## Portfolio Assignment: Part 3
### Academic year: 2024-25

## Conduct a security investigation into a suspected insider threat
---

**UWEtech** have called you back once more to help with their security challenges. They suspect that one of their employees is behind their recent security problems and may be acting as an insider threat. They have enlisted your help to examine employee log activity, to see which behaviours deviate from the norm, and to identify which user may be a threat to the organisation.

**Dataset:** You will be issued a **unique dataset** based on your UWE student ID. **Failure to use the dataset that corresponds to your student ID will result in zero marks.** Please access the datasets via Blackboard.

**This exercise carries a weight of 45% towards your overall portfolio submission.**


### Submission Documents
---

For Part 3 of your portfolio, your complete output file should be saved as:

- **STUDENT_ID-PART3.ipynb**

This should then be included in a ZIP file along with your other two portfolio documents.

The deadline for your portfolio submission is **THURSDAY 12th DECEMBER @ 14:00**. 

## DATASET: Load in the data

**Please enter below the dataset name you have been assigned, as listed in the spreadsheet available on Blackboard. The directory containing your dataset should be in the same location as your notebook file.**

In [1]:
# PLEASE ENTER THE NAME OF THE DATASET ASSIGNED TO YOU AS INDICATED ON BLACKBOARD
DATASET = 'uwetech-dataset01'

### Function for loading data - do not change

In [5]:
import random
import string
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import datetime

def load_data(DATASET):
    dataset_list = ['uwetech-dataset01', 'uwetech-dataset02', 'uwetech-dataset03', 'uwetech-dataset04']
    if DATASET in dataset_list:
        email_data = pd.read_csv('./' + DATASET + '/email_data.csv', parse_dates=True, index_col=0)
        file_data = pd.read_csv('./' + DATASET + '/file_data.csv', parse_dates=True, index_col=0)
        web_data = pd.read_csv('./' + DATASET + '/web_data.csv', parse_dates=True, index_col=0)
        login_data = pd.read_csv('./' + DATASET + '/login_data.csv', parse_dates=True, index_col=0)
        usb_data = pd.read_csv('./' + DATASET + '/usb_data.csv', parse_dates=True, index_col=0)
        employee_data = pd.read_csv('./' + DATASET + '/employee_data.csv', index_col=0)
        
        email_data['datetime'] = pd.to_datetime(email_data['datetime'])
        file_data['datetime'] = pd.to_datetime(file_data['datetime'])
        web_data['datetime'] = pd.to_datetime(web_data['datetime'])
        login_data['datetime'] = pd.to_datetime(login_data['datetime'])
        usb_data['datetime'] = pd.to_datetime(usb_data['datetime'])
    else:
        print ("DATASET variable not defined - please refer to Blackboard for DATASET name")
        return
    return employee_data, login_data, usb_data, web_data, file_data, email_data

employee_data, login_data, usb_data, web_data, file_data, email_data = load_data(DATASET)

### The following code samples may be useful to aid your investigation

In [6]:
# This shows the employee_data DataFrame
employee_data

| | user | role | email | pc |
|---|---|---|---|---|
| 0 | usr-uda | Security | usr-uda@uwetech.com | pc0 |
| 1 | usr-hhe | Security | usr-hhe@uwetech.com | pc1 |
| 2 | usr-vxr | Finance | usr-vxr@uwetech.com | pc2 |
| 3 | usr-nba | Finance | usr-nba@uwetech.com | pc3 |
| 4 | usr-hqt | Finance | usr-hqt@uwetech.com | pc4 |
| ... | ... | ... | ... | ... |
| 245 | usr-jwo | Finance | usr-jwo@uwetech.com | pc245 |
| 246 | usr-hiz | Security | usr-hiz@uwetech.com | pc246 |
| 247 | usr-svz | Services | usr-svz@uwetech.com | pc247 |
| 248 | usr-ndr | HR | usr-ndr@uwetech.com | pc248 |


In [9]:
# This shows the login_data DataFrame
login_data

| | datetime | user | action | pc |
|---|---|---|---|---|
| 0 | 2022-01-01 00:00:30 | usr-lfl | login | pc18 |
| 1 | 2022-01-01 00:09:21 | usr-vul | login | pc54 |
| 2 | 2022-01-01 00:14:04 | usr-jmr | login | pc137 |
| 3 | 2022-01-01 00:15:06 | usr-hvd | login | pc110 |
| 4 | 2022-01-01 00:15:57 | usr-ebj | login | pc108 |
| ... | ... | ... | ... | ... |
| 151995 | 2022-10-31 23:40:34 | usr-bsx | logoff | pc79 |
| 151996 | 2022-10-31 23:41:08 | usr-gvw | logoff | pc87 |
| 151997 | 2022-10-31 23:43:11 | usr-hfz | logoff | pc112 |
| 151998 | 2022-10-31 23:46:29 | usr-dmi | logoff | pc17 |


In [10]:
# This shows how to filter the login_data DataFrame by a particular username
login_data[login_data['user']=='usr-nic']

| | datetime | user | action | pc |
|---|---|---|---|---|
| 36 | 2022-01-01 03:05:21 | usr-nic | login | pc181 |
| 458 | 2022-01-01 20:50:55 | usr-nic | logoff | pc181 |
| 510 | 2022-01-02 00:55:34 | usr-nic | login | pc181 |
| 963 | 2022-01-02 21:17:28 | usr-nic | logoff | pc181 |
| 1122 | 2022-01-03 06:46:55 | usr-nic | login | pc181 |
| ... | ... | ... | ... | ... |
| 150987 | 2022-10-29 23:00:00 | usr-nic | logoff | pc181 |
| 151247 | 2022-10-30 10:52:27 | usr-nic | login | pc181 |
| 151473 | 2022-10-30 21:50:44 | usr-nic | logoff | pc181 |
| 151502 | 2022-10-31 00:14:21 | usr-nic | login | pc181 |


In [115]:
# This shows how to filter the login_data DataFrame by a particular set of usernames within a list
login_data[login_data['user'].isin(['usr-nic'])]

| | datetime | user | action | pc |
|---|---|---|---|---|
| 36 | 2022-01-01 03:05:21 | usr-nic | login | pc181 |
| 458 | 2022-01-01 20:50:55 | usr-nic | logoff | pc181 |
| 510 | 2022-01-02 00:55:34 | usr-nic | login | pc181 |
| 963 | 2022-01-02 21:17:28 | usr-nic | logoff | pc181 |
| 1122 | 2022-01-03 06:46:55 | usr-nic | login | pc181 |
| ... | ... | ... | ... | ... |
| 150987 | 2022-10-29 23:00:00 | usr-nic | logoff | pc181 |
| 151247 | 2022-10-30 10:52:27 | usr-nic | login | pc181 |
| 151473 | 2022-10-30 21:50:44 | usr-nic | logoff | pc181 |
| 151502 | 2022-10-31 00:14:21 | usr-nic | login | pc181 |


In [116]:
# all_roles is an array/list of all job roles that are within our DataFrame
all_roles = employee_data['role'].unique()
all_roles

array(['Security', 'Finance', 'Legal', 'HR', 'Services', 'Technical',
       'Director'], dtype=object)

In [117]:
### This sample code helps to create two dictionary objects - user_set and user_set_emails - that group usernames and emails by job role.

user_set = {}
user_set_emails = {}
all_roles = employee_data['role'].unique()
for role in all_roles:
    user_set[role] = list(employee_data[employee_data['role'] == role]['user'].values)
    user_set_emails[role] = list(employee_data[employee_data['role'] == role]['email'].values)

In [118]:
# List all usernames that belong to the job role Finance
user_set['Finance']

['usr-vxr',
 'usr-nba',
 'usr-hqt',
 'usr-gyk',
 'usr-tiz',
 'usr-eqp',
 'usr-avx',
 'usr-zjh',
 'usr-hsh',
 'usr-gro',
 'usr-xkb',
 'usr-qcf',
 'usr-zuq',
 'usr-rjv',
 'usr-wer',
 'usr-sgi',
 'usr-utk',
 'usr-zge',
 'usr-inp',
 'usr-ssv',
 'usr-lhu',
 'usr-uby',
 'usr-nvl',
 'usr-vmk',
 'usr-oza',
 'usr-xgk',
 'usr-uyp',
 'usr-jwo',
 'usr-eie']
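
The role dictionaries combine naturally with the `isin` filter shown earlier. The following is a minimal sketch on a small made-up employee table and login log: the usernames, roles and timestamps are hypothetical and are not taken from any issued dataset. It selects one role's events within a date window by combining two boolean masks.

```python
import pandas as pd

# Hypothetical stand-ins for employee_data and login_data (same column names)
toy_employees = pd.DataFrame({
    "user": ["usr-aaa", "usr-bbb", "usr-ccc"],
    "role": ["Finance", "HR", "Finance"],
})
toy_logins = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-10 08:00:00", "2022-01-10 08:30:00",
        "2022-01-20 09:00:00", "2022-02-02 08:15:00",
    ]),
    "user": ["usr-aaa", "usr-bbb", "usr-ccc", "usr-aaa"],
    "action": ["login", "login", "login", "login"],
})

# Group usernames by role, mirroring the user_set dictionary above
toy_user_set = {
    role: list(group["user"])
    for role, group in toy_employees.groupby("role")
}

# Boolean masks can be combined with & (and), | (or) and ~ (not)
is_finance = toy_logins["user"].isin(toy_user_set["Finance"])
in_window = (
    (toy_logins["datetime"] >= pd.Timestamp("2022-01-01"))
    & (toy_logins["datetime"] < pd.Timestamp("2022-02-01"))
)
finance_january = toy_logins[is_finance & in_window]
print(finance_january)
```

Each mask is wrapped in parentheses because `&` binds more tightly than comparison operators in Python.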

### Question 1: For all Finance staff members during the month of January, show the distribution of logins and logoffs by hour of day using one or more bar charts, and report the most common login hour and logoff hour for this role.

*Hint: Once you have filtered the data to only Finance staff in January, count the number of logins and logoffs that occur in each hour of the day.*

#### (1 mark)
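
The counting step in the hint can be illustrated on toy data. This is only a sketch under stated assumptions: the rows and usernames are invented, and filtering on the month alone is adequate only when the log covers a single year. It counts events per hour of the day for each action and reads off the busiest hour.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for login_data (same column names, invented rows)
toy_logins = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-03 08:05:00", "2022-01-03 08:40:00", "2022-01-03 09:10:00",
        "2022-01-03 17:02:00", "2022-01-03 17:30:00", "2022-01-03 18:15:00",
        "2022-02-01 08:00:00",  # outside January, so it is filtered out
    ]),
    "user": ["usr-aaa", "usr-bbb", "usr-ccc",
             "usr-aaa", "usr-bbb", "usr-ccc", "usr-aaa"],
    "action": ["login", "login", "login",
               "logoff", "logoff", "logoff", "login"],
})
toy_users = ["usr-aaa", "usr-bbb", "usr-ccc"]  # hypothetical role membership

in_january = toy_logins["datetime"].dt.month == 1
in_role = toy_logins["user"].isin(toy_users)
subset = toy_logins[in_january & in_role]

# Rows = hour of day (0-23), columns = action, values = number of events
hourly = (
    subset.groupby([subset["datetime"].dt.hour, "action"])
    .size()
    .unstack(fill_value=0)
    .reindex(range(24), fill_value=0)  # keep hours with no events
    .rename_axis(index="hour")
)

# idxmax returns the index label (the hour) holding the largest count
print("Most common login hour:", hourly["login"].idxmax())
print("Most common logoff hour:", hourly["logoff"].idxmax())

ax = hourly.plot(kind="bar", figsize=(10, 4))
ax.set_xlabel("Hour of day")
ax.set_ylabel("Number of events")
plt.tight_layout()
```

Reindexing to all 24 hours keeps the x-axis complete, so quiet hours appear as empty bars rather than being dropped from the chart.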

In [None]:
######### ADD YOUR CODE HERE ##########

### Question 2: Plot a multi-line chart that shows the login and logoff times during the month of January for the user of pc42.

*Hint: Filter the data as you need, and make two calls to plt.plot().*

#### (1 mark)
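
One way to read the hint is to plot the time of day against the date, with one line per action. The sketch below uses invented rows and a placeholder machine name (`pc-x`); expressing the time of day as a decimal hour is an illustrative choice, not a requirement of the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical login records for a single machine
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-03 08:30:00", "2022-01-03 17:15:00",
        "2022-01-04 09:00:00", "2022-01-04 18:45:00",
        "2022-01-05 08:45:00", "2022-01-05 17:30:00",
    ]),
    "action": ["login", "logoff"] * 3,
    "pc": ["pc-x"] * 6,
})

one_pc = toy[toy["pc"] == "pc-x"].copy()
# Time of day as a decimal hour, e.g. 08:30 becomes 8.5
one_pc["hour_of_day"] = (
    one_pc["datetime"].dt.hour + one_pc["datetime"].dt.minute / 60
)

logins = one_pc[one_pc["action"] == "login"]
logoffs = one_pc[one_pc["action"] == "logoff"]

plt.figure(figsize=(10, 4))
# One call to plt.plot() per line; normalize() strips the time, leaving the date
plt.plot(logins["datetime"].dt.normalize(), logins["hour_of_day"],
         marker="o", label="login")
plt.plot(logoffs["datetime"].dt.normalize(), logoffs["hour_of_day"],
         marker="o", label="logoff")
plt.xlabel("Date")
plt.ylabel("Hour of day")
plt.legend()
plt.tight_layout()
```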



In [None]:
######### ADD YOUR CODE HERE ##########

### Question 3: Use a node-link graph to show all emails sent by Security staff on January 5th 2022. Your node-link graph may show only those users who receive emails from the selected senders.

*Hint: Filter the data and then refer back to Question 4 from Part 1 to format the data correctly.*

#### (1 mark)
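
As a rough illustration of turning filtered rows into a node-link graph, the sketch below builds a directed graph with `networkx`. The column names `sender` and `recipient`, the addresses and the rows are all assumptions made for the example; check the actual columns of `email_data` before adapting it.

```python
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical email log; "sender" and "recipient" are assumed column names
toy_emails = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-05 09:00:00", "2022-01-05 10:00:00",
        "2022-01-05 11:00:00", "2022-01-06 09:00:00",
    ]),
    "sender": ["a@example.com", "a@example.com",
               "d@example.com", "a@example.com"],
    "recipient": ["b@example.com", "c@example.com",
                  "a@example.com", "d@example.com"],
})
senders_of_interest = ["a@example.com"]  # hypothetical role membership

# Keep only the chosen day and the chosen senders
on_day = toy_emails["datetime"].dt.normalize() == pd.Timestamp("2022-01-05")
from_group = toy_emails["sender"].isin(senders_of_interest)
edges = toy_emails[on_day & from_group]

# Each remaining row becomes a directed edge from sender to recipient
G = nx.from_pandas_edgelist(
    edges, source="sender", target="recipient", create_using=nx.DiGraph
)

plt.figure(figsize=(6, 6))
nx.draw(G, with_labels=True, node_color="lightblue", arrows=True)
```

Because the graph is built from an edge list, only users who appear in at least one edge become nodes.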

In [None]:
######### ADD YOUR CODE HERE ##########

### (Advanced) Question 4: Extend the above, now showing a node for every possible user. The edge connections should be as above, for emails sent by Security staff on 5th January 2022. You should use a shell layout for your network plot.

*Hint: Think about how to include all users as nodes. Depending on how you form your edge list, you may even include a dummy node and remove it during processing. See https://networkx.org/documentation/stable/index.html*

#### (3 marks)
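
One possible alternative to the dummy-node approach is to add every node explicitly before adding the edges. The sketch below shows this on invented usernames and edges, using `add_nodes_from` and `shell_layout` from `networkx`; it is an illustration of the idea rather than the expected solution.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical full user list and a small set of email edges
all_users = ["usr-aaa", "usr-bbb", "usr-ccc", "usr-ddd", "usr-eee", "usr-fff"]
edges = [("usr-aaa", "usr-bbb"), ("usr-aaa", "usr-ccc")]

G = nx.DiGraph()
G.add_nodes_from(all_users)  # every user becomes a node, even with no edges
G.add_edges_from(edges)

# shell_layout places the nodes on a circle (a single shell by default)
pos = nx.shell_layout(G)

plt.figure(figsize=(6, 6))
nx.draw(G, pos=pos, with_labels=True, node_color="lightblue", arrows=True)

# Nodes with no incoming or outgoing edges
isolated = list(nx.isolates(G))
print(isolated)
```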

In [None]:
######### ADD YOUR CODE HERE ##########

### Question 5: Show a comparison between the files accessed by HR staff, Services staff, and Security staff, during January. You will need to think of a suitable way to convey this information within a single plot so that comparison of activity can be made easily.

*Hint: Think about which plot enables you to make comparisons between two attributes, and then think about what the attributes would need to be for mapping three job roles against the possible set of files accessed.*

#### (4 marks)
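
A two-attribute comparison of this kind can be sketched as a role-by-file count table drawn as a heatmap. This is one option among several, shown on invented data; the column name `filename` and the file paths are assumptions, so check the real columns of `file_data` first.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

# Hypothetical employee table and file-access log ("filename" is assumed)
toy_employees = pd.DataFrame({
    "user": ["usr-aaa", "usr-bbb", "usr-ccc"],
    "role": ["HR", "Services", "Security"],
})
toy_files = pd.DataFrame({
    "user": ["usr-aaa", "usr-aaa", "usr-aaa",
             "usr-bbb", "usr-ccc", "usr-ccc"],
    "filename": ["/docs/policy", "/docs/policy", "/hr/payroll",
                 "/docs/policy", "/sec/audit", "/sec/audit"],
})

# Attach each user's role to their file accesses
role_of = toy_employees.set_index("user")["role"]
toy_files["role"] = toy_files["user"].map(role_of)

# Rows = role, columns = file, values = number of accesses
counts = pd.crosstab(toy_files["role"], toy_files["filename"])

fig, ax = plt.subplots(figsize=(8, 3))
seaborn.heatmap(counts, annot=True, fmt="d", cmap="Blues", ax=ax)
ax.set_xlabel("File")
ax.set_ylabel("Role")
plt.tight_layout()
```

Files that a role never touches appear as zero cells, which makes role-specific files easy to spot.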

In [None]:
######### ADD YOUR CODE HERE ##########

### Question 6: Carry out your own investigation to find the anomalous activity across all data files provided. Provide clear evidence and justification for your investigative steps.

Marks are awarded for: 
- a clear explanation of the steps you take to complete your investigation (5)
- suitable use of data analysis with clear explanation (6)
- suitable use of visualisation methods with clear annotation (6)
- identifying all of the suspicious events (8)

#### (25 marks)
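
As one possible starting point, not the expected method, a user's daily activity can be compared against that user's own baseline. The sketch below uses an invented event log, and both the `user` column and the z-score threshold of 2.0 are assumptions. It counts events per user per day and flags days that sit well above the user's mean.

```python
import pandas as pd

# Hypothetical event log: usr-aaa has one event a day except for one burst,
# usr-bbb alternates between one and two events a day
rows = []
days = pd.date_range("2022-01-03", periods=10, freq="D")
for i, day in enumerate(days):
    n_aaa = 10 if i == 7 else 1
    for k in range(n_aaa):
        rows.append({"datetime": day + pd.Timedelta(hours=9, minutes=k),
                     "user": "usr-aaa"})
    for k in range(1 + i % 2):
        rows.append({"datetime": day + pd.Timedelta(hours=10, minutes=k),
                     "user": "usr-bbb"})
toy_events = pd.DataFrame(rows)

# Number of events per user per calendar day
daily = (
    toy_events.groupby(["user", toy_events["datetime"].dt.normalize()])
    .size()
    .rename("events")
    .reset_index()
)

# z-score of each day relative to the same user's mean and standard deviation
per_user = daily.groupby("user")["events"]
daily["zscore"] = (
    (daily["events"] - per_user.transform("mean")) / per_user.transform("std")
)

THRESHOLD = 2.0  # assumed cut-off; tune it to the data being examined
flagged = daily[daily["zscore"] > THRESHOLD]
print(flagged)
```

Days with no events at all do not appear in `daily`, so this baseline only describes active days. A user whose count never varies has a standard deviation of zero and therefore an undefined z-score, which is never flagged.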

In [6]:
######### ADD YOUR CODE HERE ##########

### Question 7: Describe what you believe are the key findings of your investigation. You should clearly state the suspect identified, and the sequential order of suspicious events, including the date and time that these occurred. You should then provide your own critical reflection of what has occurred in this scenario, giving justification for any assumptions made. Limit your response to a maximum of 400 words. 

Please make clear which dataset you have used for your investigation.

#### (10 marks)