# <center>Data Mining and Pre-processing.</center>
- ### This project highlights my data pre-processing skills: reducing complex data formats to usable ones for better decision-making and analytics, a step that usually takes about 70% of the time in any data science project. This project doesn't include detailed data cleaning, visualization, EDA, statistics or model development.

### <u>Use case</u>: 
- SOAR (Security Orchestration, Automation and Response) is a platform used by SOC analysts to drive the complete life cycle of a threat, from alert generation to its remediation.
- The first function of a SOAR platform is receiving an alert. The SOC analyst then creates a case from the alert and fills in a list of check boxes, text boxes and drop-downs for it. Among those fields is an important one, 'Threat Severity Level', which decides the priority of the threat.
- The SOC analyst usually populates this field based on their experience and domain understanding. Because the alerts contain false positives, this manual process leads to incorrect prioritization: threats that cause more damage are often attended to later, leading to large losses of money and privacy and leaving the institution vulnerable.

### <u>Solution</u>: 
- This project aims to make a SOC analyst's life easier by automating the assignment of Threat Severity Level, helping them prioritize alerts and attend to high-priority attacks first while ignoring false-positive alerts.

### <u>Dataset</u>:
- Open source data: https://www.circl.lu/doc/misp/feed-osint/ This data is used by testers on the SOAR platform to simulate the life cycle of an alert. Complete information about the data and its fields is provided at https://www.circl.lu/doc/misp/
- This data forms the backend of the framework when alerts are generated.

### Importing libraries

In [1]:
#importing libraries
import os
import pandas as pd
import numpy as np
from statistics import mode

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 1. Functions

### 1.1 Read function

In [2]:
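def read_all_jsons(folder_name):
    #List the .json files in the folder (sorted so the row order is reproducible)
    json_files=sorted(pos_json for pos_json in os.listdir(folder_name) if pos_json.endswith(".json"))
    
    #Read each file and transpose it so the event fields become columns
    #os.path.join works whether or not folder_name ends with a separator
    frames=[pd.read_json(os.path.join(folder_name, json_file)).T for json_file in json_files]
    
    #Concatenate once instead of growing the dataframe inside a loop
    df=pd.concat(frames)
    df.reset_index(drop=True, inplace=True)
    
    return df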

#os.listdir() returns the names of all files and directories in the given directory.
#If no directory is given, it lists the current working directory.
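
As a rough, standard-library-only illustration of the same folder-reading idea, the sketch below writes two tiny JSON files to a temporary folder and reads them back as a list of records. The file names, the single `Event` wrapper key, and the helper `read_event_files` are assumptions made for this example and are not part of the notebook.

```python
import json
import os
import tempfile


def read_event_files(folder_name):
    """Return one dict per .json file, reading files in sorted order."""
    records = []
    for file_name in sorted(os.listdir(folder_name)):
        if not file_name.endswith(".json"):
            continue
        # os.path.join works whether or not folder_name ends with a separator
        with open(os.path.join(folder_name, file_name), encoding="utf-8") as handle:
            payload = json.load(handle)
        # Assumption: each file wraps a single event under one top-level key
        records.extend(payload.values())
    return records


with tempfile.TemporaryDirectory() as folder:
    samples = {
        "a.json": {"Event": {"info": "first event", "threat_level_id": "1"}},
        "b.json": {"Event": {"info": "second event", "threat_level_id": "4"}},
        "notes.txt": None,  # non-JSON file that should be skipped
    }
    for name, content in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            json.dump(content, handle)
    events = read_event_files(folder)

print([event["info"] for event in events])
```

Sorting the file names keeps the row order stable across operating systems, since `os.listdir()` makes no ordering guarantee.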

### Executing read function

In [3]:
alert_data=read_all_jsons('C:/Users/gangu/OneDrive/Desktop/Data Science material other than upGrad/Threat Severity in A-SOAR/Data/')
alert_data.head()

Unnamed: 0,Attribute,Object,Orgc,Tag,analysis,date,extends_uuid,info,publish_timestamp,published,threat_level_id,timestamp,uuid
0,"[{'deleted': False, 'category': 'Persistence m...","[{'deleted': False, 'description': 'Microblog ...",{'uuid': '55f6ea5e-2c60-40e5-964f-47a8950d210f...,"[{'colour': '#004646', 'name': 'type:OSINT'}, ...",2,2021-03-12,,OSINT - DearCry ransomware (abusing Exchange S...,1615541662,True,1,1615541608,0165e5d7-51e6-4c2e-a382-1dd1e706f7bb
1,"[{'deleted': False, 'category': 'External anal...","[{'deleted': False, 'description': 'Object des...",{'uuid': '55f6ea5e-2c60-40e5-964f-47a8950d210f...,"[{'colour': '#305600', 'name': 'malware_classi...",1,2021-06-19,,Netfilter Rootkit Samples - F. Roth - Google s...,1624087659,True,1,1624087411,01e8868d-48ae-41aa-8516-ea5a303758b8
2,"[{'deleted': False, 'category': 'Payload deliv...","[{'deleted': False, 'description': 'An annotat...",{'uuid': '5cf66e53-b5f8-43e7-be9a-49880a3b4631...,"[{'colour': '#f82378', 'name': 'zloader'}, {'c...",0,2020-06-24,,"zloader: VBA, R1C1 References, and Other Tomfo...",1594257453,True,4,1607522938,0733f160-8e52-4548-a4c8-19a1cfb41d0d
3,"[{'deleted': False, 'category': 'Network activ...","[{'deleted': False, 'description': 'File objec...",{'uuid': '55f6ea5e-2c60-40e5-964f-47a8950d210f...,"[{'colour': '#004646', 'name': 'type:OSINT'}, ...",2,2020-11-27,,OSINT - Egregor: The New Ransomware Variant To...,1607324084,True,1,1607324075,0b988513-9535-42f0-9ebc-5d6aec2e1c79
4,"[{'deleted': False, 'category': 'Payload deliv...","[{'deleted': False, 'description': 'File objec...",{'uuid': '55f6ea5e-2c60-40e5-964f-47a8950d210f...,"[{'colour': '#0088cc', 'name': 'misp-galaxy:mi...",2,2022-01-28,,OSINT - North Korea’s Lazarus APT leverages Wi...,1643368423,True,2,1643368411,0e887f03-5aa2-4a7b-b0f7-66208c6c657b


In [4]:
#Printing first row of column 'Attribute'
alert_data['Attribute'][0]

[{'deleted': False,
  'category': 'Persistence mechanism',
  'value': 'Files\\Microsoft\\Exchange',
  'timestamp': '1615538748',
  'disable_correlation': False,
  'to_ids': False,
  'type': 'regkey',
  'comment': '',
  'uuid': '2bc0505c-6566-416f-9f4b-2a689d78edb8'},
 {'deleted': False,
  'category': 'Payload delivery',
  'value': 'Server\\V15\\FrontEnd\\HttpProxy\\owa\\auth\\logout.aspx',
  'timestamp': '1615538748',
  'disable_correlation': False,
  'to_ids': True,
  'type': 'filename',
  'comment': '',
  'uuid': 'eebfaac3-846d-4883-a01e-706600c5aab2'},
 {'deleted': False,
  'category': 'Payload delivery',
  'value': 'Server\\V15\\FrontEnd\\HttpProxy\\owa\\auth\\one.aspx',
  'timestamp': '1615538748',
  'disable_correlation': False,
  'to_ids': True,
  'type': 'filename',
  'comment': '',
  'uuid': 'a6e83ff7-f43c-400a-9f85-6f856e537ff2'},
 {'deleted': False,
  'category': 'Payload delivery',
  'value': 'Server\\V15\\FrontEnd\\HttpProxy\\owa\\auth\\one1.aspx',
  'timestamp': '16155387

#### Several columns of this dataset are populated throughout with long, deeply nested JSON structures like this one.
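
One way to get a feel for such a nested cell before writing any feature functions is to summarise it. The sketch below does this for a shortened, hand-made stand-in for one `Attribute` cell; the variable names are hypothetical and the entries only mimic the structure printed above.

```python
from collections import Counter

# Hypothetical, shortened stand-in for one cell of the 'Attribute' column
attribute_cell = [
    {"category": "Persistence mechanism", "type": "regkey", "to_ids": False},
    {"category": "Payload delivery", "type": "filename", "to_ids": True},
    {"category": "Payload delivery", "type": "filename", "to_ids": True},
]

# Which keys appear in the nested dictionaries
keys_seen = sorted({key for entry in attribute_cell for key in entry})

# How often each (category, type) pair occurs
pair_counts = Counter((entry["category"], entry["type"]) for entry in attribute_cell)

print(len(attribute_cell), keys_seen)
print(pair_counts.most_common(1))
```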

### 1.2 Data Pre-processing functions.

In [5]:
#Function to replace null values in the raw dataframe with 'No <name of the column>'
def replace_initial_null_values(df):
    for i in df.columns:
        df[i]=df[i].fillna('No ' + i)
    return df

#Function to create a new column that gives the count of attribute dictionaries in the event (row)
def attribute_count(col):
    if isinstance(col, list):
        return len(col)
    else:
        return 0

#Function to create a new field which returns the count of malicious category-type combinations.
def category_type(col):
    count=0
    #Null rows were filled with the string 'No Attribute', so only lists are processed
    if isinstance(col, list):
        for element in col:
            cat=element['category']
            typ=element['type']
            #'and' short-circuits, so an unknown category does not raise a KeyError
            if cat in category_type_combination_dict and typ in category_type_combination_dict[cat]:
                count+=1
    return count

#Function to create a new field which returns a dictionary with each category-type combination as key and its count
#in the Attribute column as value
def category_type_combination(col):
    cat_typ_list=[]
    if isinstance(col, list):
        for element in col:
            cat=element['category']
            typ=element['type']
            if cat in category_type_combination_dict and typ in category_type_combination_dict[cat]:
                cat_typ_list.append(cat + ' ' + typ)
    cat_typ_combination_count={x: cat_typ_list.count(x) for x in cat_typ_list}
    return cat_typ_combination_count

#Function to expand the column created above, which holds a dictionary of malicious category-type combinations, into columns.
#Each new column represents one combination and holds its count for the row.
#Remaining NaN values in the dataframe are then filled with 0
def create_columns_from_category_type_combination(df):
    df=pd.concat([df.drop(['Malicious_Category_Type_Combination'], axis=1), df['Malicious_Category_Type_Combination'].apply(pd.Series)], axis=1)
    df=df.fillna(0)
    return df

#Create a function that returns a new field with the distribution id that occurs most often in the Object column.
def distribution(col):
    dist=[]
    #Null rows hold a placeholder string, so only lists are processed
    if isinstance(col, list):
        for element in col:
            if 'distribution' in element:
                dist.append(int(element['distribution']))
    if len(dist)>0:
        return mode(dist)
    else:
        return 0
    
#Create a function that returns a new field with the sharing group id that occurs most often in the Object column.
def sharing_group_id(col):
    shr_grp_id=[]
    if isinstance(col, list):
        for element in col:
            if 'sharing_group_id' in element:
                shr_grp_id.append(int(element['sharing_group_id']))
    if len(shr_grp_id)>0:
        return mode(shr_grp_id)
    else:
        return 0
    
    
#Function to read both columns of the csv and return their unique values as a single list
def create_hash_list(csv_name):
    hash_df=pd.read_csv(csv_name, names=['A', 'B'])
    hash_list1=hash_df['A'].to_list()
    hash_list2=hash_df['B'].to_list()
    hash_list1.extend(hash_list2)
    hash_list1=list(set(hash_list1))
    return hash_list1


#Create a function that returns a new field describing if the uuid existing in the dataset is present in hashes.csv
def malicious_uuid(col):
    if col in hash_df_col1_list:
        return 1
    else:
        return 0
    
#Create a function that returns a new field describing if the extends_uuid existing in the dataset is present in hashes.csv
def malicious_extended_uuid(col):
    #Skip empty values and the 'No extends_uuid' placeholder set by replace_initial_null_values
    if col not in ('', ' ', 'No extends_uuid'):
        if col in hash_df_col1_list:
            return 1
    return 0
    
#Create a function with a new field that counts the uuids in the json of the Attribute column that are
#present in hashes.csv
def count_uuid_attribute(col):
    count_uuid=0
    for i in col:
        if 'uuid' in i:
            if i['uuid'] in hash_df_col1_list:
                count_uuid+=1
    return count_uuid

#Create a function with a new field that counts whether the uuid in the json of the Orgc column is
#present in hashes.csv
def count_uuid_orgc(col):
    count_uuid=0
    if 'uuid' in col:
        if col['uuid'] in hash_df_col1_list:
            count_uuid+=1
    return count_uuid

#Create a function that returns a new Description field built by joining the description fields in Object
def create_description_col(col):
    desc=''
    for i in col:
        if 'description' in i:
            desc = desc + ' ' + i['description']
    return desc
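
When two membership conditions like those in `category_type` are combined, the choice between `and` and `&` matters, because only `and` skips the second test when the first one fails. The sketch below illustrates the difference on a hypothetical two-entry excerpt of the lookup; `allowed`, `is_listed` and `is_listed_bitwise` are names invented for this example.

```python
# Hypothetical two-entry excerpt of the category-type lookup
allowed = {
    "Payload delivery": ["filename", "sha256"],
    "Persistence mechanism": ["regkey"],
}


def is_listed(category, attr_type):
    # 'and' short-circuits: allowed[category] is only read when the key exists
    return category in allowed and attr_type in allowed[category]


def is_listed_bitwise(category, attr_type):
    # '&' evaluates both sides, so an unknown category raises KeyError
    return (category in allowed) & (attr_type in allowed[category])


print(is_listed("Payload delivery", "sha256"))   # True
print(is_listed("Unknown category", "sha256"))   # False

try:
    is_listed_bitwise("Unknown category", "sha256")
    bitwise_failed = False
except KeyError:
    bitwise_failed = True
print(bitwise_failed)                            # True
```

In this dataset every category may well be present in the lookup, in which case both forms give the same counts; the short-circuit form is simply the safer default.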

## 2. Executing the above functions.

### 2.1 Data Pre-processing

In [8]:
#Treat null values
alert_data=replace_initial_null_values(alert_data)

#Create column Attribute count
alert_data['Attribute count']=alert_data['Attribute'].apply(attribute_count)

#Create column Malicious_Category_Type_Count
alert_data['Malicious_Category_Type_Count']=alert_data['Attribute'].apply(category_type)

#Create column Malicious_Category_Type_Combination
alert_data['Malicious_Category_Type_Combination']=alert_data['Attribute'].apply(category_type_combination)

#Create column Distribution
alert_data['Distribution']=alert_data['Object'].apply(distribution)

#Create column Sharing group id
alert_data['Sharing_Group_Id']=alert_data['Object'].apply(sharing_group_id)

#Read both columns of hashes.csv in a list for further computation
hash_df_col1_list=create_hash_list('C:/Users/gangu/OneDrive/Desktop/Data Science material other than upGrad/Threat Severity in A-SOAR/Data/hashes.csv')

#Create column Mal_uuid
alert_data['Mal_uuid']=alert_data['uuid'].apply(malicious_uuid)

#Create column Mal_extends_uuid
alert_data['Mal_extends_uuid']=alert_data['extends_uuid'].apply(malicious_extended_uuid)

#Create column Mal_uuid_count_attribute
alert_data['Mal_uuid_count_attribute']=alert_data['Attribute'].apply(count_uuid_attribute)

#Create column Mal_uuid_count_orgc
alert_data['Mal_uuid_count_orgc']=alert_data['Orgc'].apply(count_uuid_orgc)

#Create column description which is a part of Object
alert_data['Description']=alert_data['Object'].apply(create_description_col)

#Create new columns from the column created above Malicious_Category_Type_Combination and then dropping Malicious_Category_Type_Combination
alert_data=create_columns_from_category_type_combination(alert_data)
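
The last step of this cell turns a column of dictionaries into one count column per key. The notebook does this with `.apply(pd.Series)`; the sketch below shows an alternative under the assumption that building a frame directly from the list of dictionaries is acceptable, using a made-up three-row frame (`frame`, `combo_counts` and the values are hypothetical).

```python
import pandas as pd

# Hypothetical miniature of the intermediate frame
frame = pd.DataFrame({
    "info": ["event A", "event B", "event C"],
    "combo_counts": [
        {"Payload delivery sha256": 3, "Network activity url": 1},
        {"Payload delivery sha256": 1},
        {},  # event with no listed combinations
    ],
})

# Build the count columns in one step from the list of dictionaries;
# keys missing from a row become NaN and are then filled with 0
counts = pd.DataFrame(frame["combo_counts"].tolist(), index=frame.index).fillna(0)
expanded = pd.concat([frame.drop(columns="combo_counts"), counts], axis=1)

print(expanded)
```

Filling only the new count columns, rather than the whole frame, avoids overwriting missing values in unrelated columns.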

In [9]:
#Drop the raw columns that have been processed or are not needed any further
alert_data.drop(['Attribute', 'Object', 'Orgc', 'Tag', 'date', 'extends_uuid', 'publish_timestamp', 'timestamp', 'uuid'], 
               axis=1, inplace=True)
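
The identifier columns dropped here were only needed to build the `Mal_uuid*` flags, which rest on membership tests against the identifiers read from `hashes.csv`. The sketch below illustrates that lookup with a set instead of a list, and with the standard `csv` module instead of pandas; the CSV content, `known_ids` and `flag_known` are hypothetical stand-ins.

```python
import csv
import io

# Hypothetical stand-in for hashes.csv: two columns of identifiers, no header
csv_text = "aaa-111,bbb-222\nccc-333,aaa-111\n"

known_ids = set()
for row in csv.reader(io.StringIO(csv_text)):
    # Collect both columns and skip empty cells
    known_ids.update(value for value in row if value)


def flag_known(identifier):
    """Return 1 when the identifier is in the known set, else 0."""
    return int(identifier in known_ids)


event_ids = ["aaa-111", "zzz-999", "No extends_uuid"]
flags = [flag_known(identifier) for identifier in event_ids]

print(sorted(known_ids), flags)
```

A set gives constant-time membership tests, which may matter once every nested attribute uuid of every event is checked.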

In [10]:
alert_data.head()

Unnamed: 0,analysis,info,published,threat_level_id,Attribute count,Malicious_Category_Type_Count,Distribution,Sharing_Group_Id,Mal_uuid,Mal_extends_uuid,Mal_uuid_count_attribute,Mal_uuid_count_orgc,Description,Persistence mechanism regkey,Payload delivery filename,Payload delivery sha256,External analysis link,Network activity url,Network activity ip-dst,Payload delivery sha1,Payload delivery md5,Network activity domain,Network activity hostname,Payload delivery vulnerability,Payload delivery domain,Antivirus detection text,Payload delivery ip-src,Payload delivery email-src,Artifacts dropped pdb,External analysis vulnerability,Artifacts dropped yara,Artifacts dropped mutex,Artifacts dropped filename,External analysis text,External analysis comment,Network activity ip-src,External analysis md5,External analysis sha256,External analysis sha1,Artifacts dropped md5,Artifacts dropped sha1,Artifacts dropped pattern-in-file,External analysis hostname,External analysis ip-dst,Artifacts dropped sha256,Attribution text,Network activity snort,Payload installation sha256,Payload installation md5,Payload installation sha1,Network activity email-dst,Payload installation filename,Other other,Other comment,External analysis attachment,Network activity user-agent,Artifacts dropped regkey|value,Attribution whois-registrant-email,Artifacts dropped malware-sample,Artifacts dropped filename|sha256,Artifacts dropped filename|sha1,Artifacts dropped filename|md5,Internal reference comment,Payload delivery filename|sha1,Payload delivery email-dst,Payload delivery email-subject,Payload delivery email-attachment,Other text,Network activity comment,Artifacts dropped comment,Artifacts dropped named pipe,Artifacts dropped regkey,Attribution comment,Network activity AS,Payload delivery hostname,External analysis url,External analysis filename,External analysis other,External analysis regkey|value,External analysis regkey,Payload delivery comment,Network activity pattern-in-traffic,Attribution campaign-name,Payload installation filename|sha256,Payload delivery filename|md5,Payload delivery url,Payload delivery ip-dst,Artifacts dropped pattern-in-memory,Network activity x509-fingerprint-sha1,Attribution threat-actor,Payload delivery malware-sample,Payload delivery filename|sha256,Financial fraud btc,Persistence mechanism filename,Payload installation text,Payload installation filename|md5,Targeting data target-user,Targeting data target-machine,Payload delivery text,Targeting data comment,Payload delivery attachment,Payload delivery yara,Payload delivery imphash,Payload delivery pehash,Payload delivery ssdeep,Attribution whois-creation-date,Attribution whois-registrant-phone,Attribution whois-registrant-name,Attribution whois-registrar,Payload installation imphash,Targeting data target-org,Financial fraud prtn,Artifacts dropped windows-scheduled-task,Payload installation malware-sample,Payload installation filename|sha1,Internal reference link,Payload delivery pattern-in-file,Attribution campaign-id,Payload delivery user-agent,Payload delivery x509-fingerprint-sha1,Network activity domain|ip,Targeting data target-location,Network activity text,Attribution x509-fingerprint-sha1,Network activity uri,Payload type text,Persistence mechanism regkey|value,Financial fraud comment,Artifacts dropped x509-fingerprint-sha1,Payload delivery mobile-application-id,Payload delivery link,Payload delivery sha224,Payload delivery sha384,Payload delivery sha512,Support Tool link,Payload installation vulnerability,Social network github-username,Social network github-repository,Artifacts dropped windows-service-name,Payload delivery email-src-display-name,Payload delivery email-reply-to,Payload delivery email-message-id,Payload delivery email-x-mailer,Payload delivery email-mime-boundary,Targeting data target-external,Artifacts dropped text,Internal reference text,Social network jabber-id,Social network email-src,Payload installation mobile-application-id,Payload delivery other,Payload delivery ip-dst|port,Payload installation yara,Artifacts dropped sigma,Artifacts dropped imphash,Artifacts dropped attachment,Artifacts dropped other,Payload delivery email-body,Other datetime,Support Tool text,Artifacts dropped hex,Payload installation hex,Social network whois-registrant-email,Person phone-number,Financial fraud other,Attribution dns-soa-email,Attribution x509-fingerprint-sha256,Attribution x509-fingerprint-md5,Payload delivery whois-registrant-email,Antivirus detection link,Other phone-number,Support Tool attachment,Network activity ip-dst|port,Other cpe,Network activity port,Financial fraud text,Other size-in-bytes,Network activity attachment,Social network other,Network activity other,Payload delivery AS,Financial fraud phone-number,Payload installation comment,Payload delivery ip-src|port
0,2,OSINT - DearCry ransomware (abusing Exchange S...,True,1,37,37,5,0,1,0,0,0,Microblog post like a Twitter tweet or a post...,1.0,33.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Netfilter Rootkit Samples - F. Roth - Google s...,True,1,2,2,5,0,1,0,0,0,Object describing a section of a Portable Exe...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,"zloader: VBA, R1C1 References, and Other Tomfo...",True,4,1,1,5,0,1,0,0,0,An annotation object allowing analysts to add...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2,OSINT - Egregor: The New Ransomware Variant To...,True,1,50,50,5,0,1,0,0,0,File object describing a file with meta-infor...,0.0,0.0,20.0,1.0,3.0,2.0,12.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,OSINT - North Korea’s Lazarus APT leverages Wi...,True,2,10,10,5,0,1,0,0,0,File object describing a file with meta-infor...,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
alert_data.shape

(1431, 184)

#### The dataset is reduced to an interpretable format. Of the 184 columns, the first 13 are kept or derived from the raw data, and the remaining 171 (184 - 13) hold the per-event count of each category-type combination.

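### List of malicious category-type combinations, compiled from the CIRCL documentation.
#### This cell (In [7]) has to be run before the pre-processing cell In [8], whose functions read `category_type_combination_dict`.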

In [7]:
category_type_combination_dict={
    'Antivirus detection' : ['link', 'comment', 'text', 'hex', 'attachment', 'other', 'anonymised'],
    'Artifacts dropped' : ['md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512', 'sha512/224', 'sha512/256', 'sha3-224',
                           'sha3-256', 'sha3-384', 'sha3-512', 'ssdeep', 'imphash', 'telfhash', 'impfuzzy', 'authentihash',
                           'vhash', 'cdhash', 'filename', 'filename|md5', 'filename|sha1', 'filename|sha224', 'filename|sha256',
                           'filename|sha384', 'filename|sha512', 'filename|sha512/224', 'filename|sha512/256',
                           'filename|sha3-224', 'filename|sha3-256', 'filename|sha3-384', 'filename|sha3-512', 
                           'filename|authentihash', 'filename|vhash', 'filename|ssdeep', 'filename|tlsh', 'filename|imphash',
                           'filename|impfuzzy', 'filename|pehash', 'regkey', 'regkey|value', 'pattern-in-file', 
                           'pattern-in-memory', 'filename-pattern', 'pdb', 'stix2-pattern', 'yara', 'sigma', 'attachment', 
                           'malware-sample', 'named pipe', 'mutex', 'process-state', 'windows-scheduled-task', 
                           'windows-service-name', 'windows-service-displayname', 'comment', 'text', 'hex', 
                           'x509-fingerprint-sha1', 'x509-fingerprint-md5', 'x509-fingerprint-sha256', 'other', 'cookie',
                           'gene', 'kusto-query', 'mime-type', 'anonymised', 'pgp-public-key', 'pgp-private-key'],
    'Attribution' : ['threat-actor', 'campaign-name', 'campaign-id', 'whois-registrant-phone', 'whois-registrant-email',
                     'whois-registrant-name', 'whois-registrant-org', 'whois-registrar', 'whois-creation-date', 'comment',
                     'text', 'x509-fingerprint-sha1', 'x509-fingerprint-md5', 'x509-fingerprint-sha256', 'other', 
                     'dns-soa-email', 'anonymised', 'email'],
    
    'External analysis' : ['md5', 'sha1', 'sha256', 'sha3-224', 'sha3-256', 'sha3-384', 'sha3-512', 'filename', 
                           'filename|md5', 'filename|sha1', 'filename|sha256', 'filename|sha3-224', 'filename|sha3-256',
                           'filename|sha3-384', 'filename|sha3-512', 'ip-src', 'ip-dst', 'ip-dst|port', 'ip-src|port',
                           'mac-address', 'mac-eui-64', 'hostname', 'domain', 'domain|ip', 'url', 'user-agent', 'regkey', 
                           'regkey|value', 'AS', 'snort', 'bro', 'zeek', 'pattern-in-file', 'pattern-in-traffic', 
                           'pattern-in-memory', 'filename-pattern', 'vulnerability', 'cpe', 'weakness', 'attachment',
                           'malware-sample', 'link', 'comment', 'text', 'x509-fingerprint-sha1', 'x509-fingerprint-md5',
                           'x509-fingerprint-sha256', 'ja3-fingerprint-md5', 'jarm-fingerprint', 'hassh-md5', 'hasshserver-md5',
                           'github-repository', 'other', 'cortex', 'anonymised', 'community-id'],
    'Financial fraud' : ['btc', 'dash', 'xmr', 'iban', 'bic', 'bank-account-nr', 'aba-rtn', 'bin', 'cc-number', 'prtn',
                         'phone-number', 'comment', 'text', 'other', 'hex', 'anonymised'],
    
    'Internal reference' : ['text', 'link', 'comment', 'other', 'hex', 'anonymised', 'git-commit-id'],
    
    'Network activity' : ['ip-src', 'ip-dst', 'ip-dst|port', 'ip-src|port', 'port', 'hostname', 'domain', 'domain|ip',
                          'mac-address', 'mac-eui-64', 'email', 'email-dst', 'email-src', 'eppn', 'url', 'uri', 'user-agent',
                          'http-method', 'AS', 'snort', 'pattern-in-file', 'filename-pattern', 'stix2-pattern', 
                          'pattern-in-traffic', 'attachment', 'comment', 'text', 'x509-fingerprint-md5', 'x509-fingerprint-sha1',
                          'x509-fingerprint-sha256', 'ja3-fingerprint-md5', 'jarm-fingerprint', 'hassh-md5', 'hasshserver-md5',
                          'other', 'hex', 'cookie', 'hostname|port', 'bro', 'zeek', 'anonymised', 'community-id', 
                          'email-subject', 'favicon-mmh3', 'dkim', 'dkim-signature', 'ssh-fingerprint'],
    'Other' : ['comment', 'text', 'other', 'size-in-bytes', 'counter', 'datetime', 'cpe', 'port', 'float', 'hex', 'phone-number',
               'boolean', 'anonymised', 'pgp-public-key', 'pgp-private-key'],
    'Payload delivery' : ['md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512', 'sha512/224', 'sha512/256', 'sha3-224',
                          'sha3-256', 'sha3-384', 'sha3-512', 'ssdeep', 'imphash', 'telfhash', 'impfuzzy', 'authentihash',
                          'vhash', 'pehash', 'tlsh', 'cdhash', 'filename', 'filename|md5', 'filename|sha1', 'filename|sha224',
                          'filename|sha256', 'filename|sha384', 'filename|sha512', 'filename|sha512/224', 'filename|sha512/256',
                          'filename|sha3-224', 'filename|sha3-256', 'filename|sha3-384', 'filename|sha3-512', 
                          'filename|authentihash', 'filename|vhash', 'filename|ssdeep', 'filename|tlsh', 'filename|imphash',
                          'filename|impfuzzy', 'filename|pehash', 'mac-address', 'mac-eui-64', 'ip-src', 'ip-dst',
                          'ip-dst|port', 'ip-src|port', 'hostname', 'domain', 'email', 'email-src', 'email-dst',
                          'email-subject', 'email-attachment', 'email-body', 'url', 'user-agent', 'AS', 'pattern-in-file',
                          'pattern-in-traffic', 'filename-pattern', 'stix2-pattern', 'yara', 'sigma', 'mime-type',
                          'attachment', 'malware-sample', 'link', 'malware-type', 'comment', 'text', 'hex', 'vulnerability',
                          'cpe', 'weakness', 'x509-fingerprint-sha1', 'x509-fingerprint-md5', 'x509-fingerprint-sha256',
                          'ja3-fingerprint-md5', 'jarm-fingerprint', 'hassh-md5', 'hasshserver-md5', 'other', 'hostname|port',
                          'email-dst-display-name', 'email-src-display-name', 'email-header', 'email-reply-to',
                          'email-x-mailer', 'email-mime-boundary', 'email-thread-index', 'email-message-id',
                          'mobile-application-id', 'chrome-extension-id', 'whois-registrant-email', 'anonymised'],
    'Payload installation' : ['md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512', 'sha512/224', 'sha512/256', 'sha3-224',
                              'sha3-256', 'sha3-384', 'sha3-512', 'ssdeep', 'imphash', 'telfhash', 'impfuzzy', 'authentihash',
                              'vhash', 'pehash', 'tlsh', 'cdhash', 'filename', 'filename|md5', 'filename|sha1',
                              'filename|sha224', 'filename|sha256', 'filename|sha384', 'filename|sha512', 'filename|sha512/224',
                              'filename|sha512/256', 'filename|sha3-224', 'filename|sha3-256', 'filename|sha3-384',
                              'filename|sha3-512', 'filename|authentihash', 'filename|vhash', 'filename|ssdeep',
                              'filename|tlsh', 'filename|imphash', 'filename|impfuzzy', 'filename|pehash', 'pattern-in-file',
                              'pattern-in-traffic', 'pattern-in-memory', 'filename-pattern', 'stix2-pattern', 'yara', 'sigma',
                              'vulnerability', 'cpe', 'weakness', 'attachment', 'malware-sample', 'malware-type', 'comment',
                              'text', 'hex', 'x509-fingerprint-sha1', 'x509-fingerprint-md5', 'x509-fingerprint-sha256',
                              'mobile-application-id', 'chrome-extension-id', 'other', 'mime-type', 'anonymised'],
    'Payload type' : ['comment', 'text', 'other', 'anonymised'],
    
    'Persistence mechanism' : ['filename', 'regkey', 'regkey|value', 'comment', 'text', 'other', 'hex', 'anonymised'],
    
    'Person' : ['first-name', 'middle-name', 'last-name', 'full-name', 'date-of-birth', 'place-of-birth', 'gender', 
                'passport-number', 'passport-country', 'passport-expiration', 'redress-number', 'nationality', 'visa-number',
                'issue-date-of-the-visa', 'primary-residence', 'country-of-residence', 'special-service-request',
                'frequent-flyer-number', 'travel-details', 'payment-details', 'place-port-of-original-embarkation', 
                'place-port-of-clearance', 'place-port-of-onward-foreign-destination', 'passenger-name-record-locator-number',
                'comment', 'text', 'other', 'phone-number', 'identity-card-number', 'anonymised', 'email', 'pgp-public-key',
                'pgp-private-key'],
    'Social network' : ['github-username', 'github-repository', 'github-organisation', 'jabber-id', 'twitter-id', 'email',
                        'email-src', 'email-dst', 'eppn', 'comment', 'text', 'other', 'whois-registrant-email', 'anonymised',
                        'pgp-public-key', 'pgp-private-key'],
    'Support Tool' : ['link', 'text', 'attachment', 'comment', 'other', 'hex', 'anonymised'],
    'Targeting data' : ['target-user', 'target-email', 'target-machine', 'target-org', 'target-location', 'target-external',
                        'comment', 'anonymised']
}
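
A lookup of this size is easy to get subtly wrong, so a quick sanity check can help. The sketch below runs on a two-entry excerpt copied from the dictionary above; converting the lists to sets is a suggestion of this example rather than something the notebook does, and `excerpt`, `as_sets`, `duplicates` and `total_pairs` are hypothetical names.

```python
# Two entries copied from the lookup above; the full dictionary has many more
excerpt = {
    "Payload type": ["comment", "text", "other", "anonymised"],
    "Support Tool": ["link", "text", "attachment", "comment", "other", "hex", "anonymised"],
}

# Sets give constant-time membership tests and expose accidental duplicates
as_sets = {category: frozenset(types) for category, types in excerpt.items()}
duplicates = {
    category: len(types) - len(as_sets[category])
    for category, types in excerpt.items()
}

# Upper bound on the number of count columns these entries could produce
total_pairs = sum(len(types) for types in as_sets.values())

print(total_pairs, duplicates)
```

Applied to the full dictionary, `total_pairs` would give an upper bound on the number of combination columns, which can be compared with the number actually observed in the data.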