# AI based Prioritizaton of IOCs

### Introduction of Problem Landscape
- **Cyber Attack** - Malicious actors attacking organizations or individuals IT infra with the intent of stealing data, trade secrets or disabling the service and cause financial loss.
- **Threat Intelligence** - Analysing an attack and gathering artifacts and evidence from the attack.
- **Threat Intelligence Sharing** - Sharing the artifacts and evidence of an attack as Indicator of Compromise(IOC) to prevent the attack in other IT infra.
- **IOC - Indicator of Compromise** is a piece of artifact/evidence/data based on the analysis of an attack.

### Threat Intelligence and Actions

| Type of Attack | IOC source | IOC/artifact | Action/Counter measure |
| :- | -: | :-: | :-: |
| Brute force ssh,telent | dataplane.org,honeynet.asia,snot.org | IP address | block/filter 
| SMTP attacks | smptgreet, pop3gropers | IP address | block/filter
| Phishing | Phishtank | URL | block/filter
| Maleware | Abuse.ch. virus total | Filename, hash | Add to antivirus, cleanup 
| Malware URL | sabuse.ch | URL |block/filter
| Crypto mining | Zerodot1 | URL | block/filter
| Software vulnerabilities | Metasploit CVE | Software/lib name, method | Patch software
|VNC Remove buffer sessions | dataplane.org | IP address | block/filter
|SIP attacks | dataplane.org | IP address | block/filter
|DNS rd | dataplane.org | IP address | block/filter 
| DDOS | | IP, port | block/filter

### Setup the environment

- Install the required modules

In [1]:
#Install needed packages if not installed.
import sys
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install maxminddb
!{sys.executable} -m pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
#Import packages
from datetime import datetime
import os
import csv
import maxminddb
import re
import requests

### Create the directory structure

- Create following directory structrue under  current date
    - raw
    - cleaned
    - processed

In [3]:
#Get today's Date
today = datetime.now()
currentdate = today.strftime('%Y%m%d')

#Function to create directory structure
def create_date_directory():
    today = datetime.now()
    currentdate = today.strftime('%Y%m%d')
    try:
        os.mkdir("data/" + currentdate)
    except OSError as error:
        pass 
    try:
        os.mkdir("data/" + currentdate+"/"+"raw")
    except OSError as error:
        pass 
    try:
        os.mkdir("data/" + currentdate+"/"+"cleaned")
    except OSError as error:
        pass 
    try:
        os.mkdir("data/" + currentdate+"/"+"processed")
    except OSError as error:
        pass 
    
#Create directory structure
create_date_directory()

### Data Collection

**IOC Sources**
- IOC Sources array contains the following
    - URL of the source
    - Filename to save the data
    - Weightage assigned to the source
    
- Data download
    - Walk through sources array and get the data and save it in raw directory

In [4]:
#IOC sources list for IP addresses
#1. URL of IOC source
#2. Filename to save
#3. Weightage for the source
ioc_sources = [ ["https://rules.emergingthreats.net/blockrules/compromised-ips.txt","emerginthreats_net_cips.txt",0.31],
               ["https://www.dan.me.uk/torlist/?exit","tor_exit_nodes.txt",0.21],
               ["https://www.dan.me.uk/torlist/","all_tor_nodes.txt",0.11],
               ["https://home.nuug.no/~peter/pop3gropers.txt","pop3gropers_raw.txt",0.41],
               ["https://feodotracker.abuse.ch/downloads/ipblocklist.csv","feodo_ip_blocklist.csv",0.32],
               ["https://dataplane.org/sshpwauth.txt","dataplane_org_ssh.txt",0.91],
               ["https://dataplane.org/telnetlogin.txt","dataplane_org_telent.txt",0.81],
               ["https://api.cybercure.ai/feed/get_ips?type=csv","cybercure_badips.txt",0.42],
               ["http://www.ipspamlist.com/public_feeds.csv","ipspamlist.txt",0.33],
               ["https://mirai.security.gives/data/ip_list.txt","mirai_security.txt",0.51],
               ["https://raw.githubusercontent.com/stamparm/ipsum/master/levels/6.txt","ipsum6.txt",0.21],
               ["https://raw.githubusercontent.com/stamparm/ipsum/master/levels/7.txt","ipsum7.txt",0.22],
               ["https://raw.githubusercontent.com/stamparm/ipsum/master/levels/8.txt","ipsum8.txt",0.23]
               #["https://feeds.honeynet.asia/bruteforce/latest-telnetbruteforce-unique.csv","honeynet_telnet.txt"]
              ]

In [5]:
#Function to download from one source
import requests
def get_source_file(file_url, filename):
    response = requests.get(file_url)
    dfilename="data/"+currentdate+"/"+"raw"+"/"+filename
    print(dfilename)
    open("data/"+currentdate+"/"+"raw"+"/"+filename, "wb").write(response.content)
    
#Function to walkthrough all the sources 
def collect_iocs():
    for ioc_source in ioc_sources:
        print(ioc_source[0])
        get_source_file(ioc_source[0],ioc_source[1])    
        
#Collect from all the sources
collect_iocs()

https://rules.emergingthreats.net/blockrules/compromised-ips.txt
data/20230711/raw/emerginthreats_net_cips.txt
https://www.dan.me.uk/torlist/?exit
data/20230711/raw/tor_exit_nodes.txt
https://www.dan.me.uk/torlist/
data/20230711/raw/all_tor_nodes.txt
https://home.nuug.no/~peter/pop3gropers.txt
data/20230711/raw/pop3gropers_raw.txt
https://feodotracker.abuse.ch/downloads/ipblocklist.csv
data/20230711/raw/feodo_ip_blocklist.csv
https://dataplane.org/sshpwauth.txt
data/20230711/raw/dataplane_org_ssh.txt
https://dataplane.org/telnetlogin.txt
data/20230711/raw/dataplane_org_telent.txt
https://api.cybercure.ai/feed/get_ips?type=csv
data/20230711/raw/cybercure_badips.txt
http://www.ipspamlist.com/public_feeds.csv
data/20230711/raw/ipspamlist.txt
https://mirai.security.gives/data/ip_list.txt
data/20230711/raw/mirai_security.txt
https://raw.githubusercontent.com/stamparm/ipsum/master/levels/6.txt
data/20230711/raw/ipsum6.txt
https://raw.githubusercontent.com/stamparm/ipsum/master/levels/7.txt
d

### Data preprocessing

- Cleanup
    - Extract only the IP address using regular expression
    
- Unique IPs
    - Combine all IPs and get the Unique IPs

In [13]:
#Cleanup the data : Extract IP addresses from the files
import re
valid =[]
def cleanup_data():
    for ioc_source in ioc_sources:
        rawfilename = "data"+"/"+currentdate+"/"+"raw"+"/"+ioc_source[1]
        print(rawfilename)
        with open(rawfilename) as fh:
            string = fh.readlines()
            for line in string:
                line = line.rstrip()
                ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', line )
                if (len(ip) > 0):
                    #print(ip[0])
                    valid.append(ip[0])
        cleanedfilename = "data"+"/"+currentdate+"/"+"cleaned"+"/"+ioc_source[1]
        print(cleanedfilename)
        file = open(cleanedfilename,'w')
        for item in valid:
            file.write(item+"\n")
        file.close()

cleanup_data()

data/20230711/raw/emerginthreats_net_cips.txt
data/20230711/cleaned/emerginthreats_net_cips.txt
data/20230711/raw/tor_exit_nodes.txt
data/20230711/cleaned/tor_exit_nodes.txt
data/20230711/raw/all_tor_nodes.txt
data/20230711/cleaned/all_tor_nodes.txt
data/20230711/raw/pop3gropers_raw.txt
data/20230711/cleaned/pop3gropers_raw.txt
data/20230711/raw/feodo_ip_blocklist.csv
data/20230711/cleaned/feodo_ip_blocklist.csv
data/20230711/raw/dataplane_org_ssh.txt
data/20230711/cleaned/dataplane_org_ssh.txt
data/20230711/raw/dataplane_org_telent.txt
data/20230711/cleaned/dataplane_org_telent.txt
data/20230711/raw/cybercure_badips.txt
data/20230711/cleaned/cybercure_badips.txt
data/20230711/raw/ipspamlist.txt
data/20230711/cleaned/ipspamlist.txt
data/20230711/raw/mirai_security.txt
data/20230711/cleaned/mirai_security.txt
data/20230711/raw/ipsum6.txt
data/20230711/cleaned/ipsum6.txt
data/20230711/raw/ipsum7.txt
data/20230711/cleaned/ipsum7.txt
data/20230711/raw/ipsum8.txt
data/20230711/cleaned/ipsum

In [15]:
#Get unique IPs
iplists = []
allips = []
uniqueips = []
def load_ips():
    global allips
    global iplists
    for ioc_source in ioc_sources:
        cleanedfilename = "data"+"/"+currentdate+"/"+"cleaned"+"/"+ioc_source[1]
        print(cleanedfilename)
        with open(cleanedfilename) as fh:
            string = fh.readlines()
            iplists.append(string)
            for line in string:
                line = line.rstrip()
                ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', line )
                if (len(ip) > 0):
                    #print(ip[0])
                    allips.append(ip[0])
                    
load_ips()

data/20230711/cleaned/emerginthreats_net_cips.txt
data/20230711/cleaned/tor_exit_nodes.txt
data/20230711/cleaned/all_tor_nodes.txt
data/20230711/cleaned/pop3gropers_raw.txt
data/20230711/cleaned/feodo_ip_blocklist.csv
data/20230711/cleaned/dataplane_org_ssh.txt
data/20230711/cleaned/dataplane_org_telent.txt
data/20230711/cleaned/cybercure_badips.txt
data/20230711/cleaned/ipspamlist.txt
data/20230711/cleaned/mirai_security.txt
data/20230711/cleaned/ipsum6.txt
data/20230711/cleaned/ipsum7.txt
data/20230711/cleaned/ipsum8.txt


In [17]:
    print("Number of IP Lists ",len(iplists))
    print("Number of IPs ",len(allips))
    print("Number of Unique IPs ",len(uniqueips))

Number of IP Lists  13
Number of IPs  1548706
Number of Unique IPs  0


In [24]:
#Dump unique IPs to a file
def unique_ips():
    load_ips()
    global uniqueips
    tips=set(allips)
    uniqueips = list(tips)
    file = open("data/" + currentdate+"/"+"processed"+"/"+"unique_ips.txt",'w')
    for ip in uniqueips:
        file.write(ip+"\n")
    file.close()
    print("Number of IP Lists ",len(iplists))
    print("Number of IPs ",len(allips))
    print("Number of Unique IPs ",len(uniqueips))

unique_ips()

data/20230711/cleaned/emerginthreats_net_cips.txt
data/20230711/cleaned/tor_exit_nodes.txt
data/20230711/cleaned/all_tor_nodes.txt
data/20230711/cleaned/pop3gropers_raw.txt
data/20230711/cleaned/feodo_ip_blocklist.csv
data/20230711/cleaned/dataplane_org_ssh.txt
data/20230711/cleaned/dataplane_org_telent.txt
data/20230711/cleaned/cybercure_badips.txt
data/20230711/cleaned/ipspamlist.txt
data/20230711/cleaned/mirai_security.txt
data/20230711/cleaned/ipsum6.txt
data/20230711/cleaned/ipsum7.txt
data/20230711/cleaned/ipsum8.txt
Number of IP Lists  39
Number of IPs  4646118
Number of Unique IPs  171586


### Data Enrichment
- Create features for each IP address
    - Add source weight for each IP
    - Add country based weight for each IP
        - Find country of the IP
        - From the database pickup the weight for each country
    - For initial training assign a priority. This can be enchanced further

In [25]:
#Load country level scoring
country_score={}
country_data = {}
def load_country_score():
    with open('country_code.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            country_score[row['Code2']]=[row['Score'],row['Country'],row['No']]
            country_data[row['Country']]=0
        
load_country_score()

In [26]:
#Get gelocation of IP address
import maxminddb

def get_country_score(ip):
    with maxminddb.open_database('GeoLite2-Country.mmdb') as reader:
        country = reader.get(ip)
        if country is None:
            return 0,0,0
        if "country" in country:
            return country_score[country["country"]["iso_code"]][0],country_score[country["country"]["iso_code"]][1],country_score[country["country"]["iso_code"]][2]
        else:
            return 0,0,0

In [27]:
#Priority for first time training. Can be enchanced in the future
def get_priority(ipscore):
    priority = 0
    if (ipscore >= 6.2):
        priority= 1
    elif (ipscore >= 6):
        priority = 2
    elif (ipscore >= 5.8):
        priority = 3
    elif (ipscore >= 5.6):
        priority = 4
    elif (ipscore >= 5.4):
        priority = 5
    elif (ipscore >= 5.2):
        priority = 6
    elif (ipscore >= 5):
        priority = 7
    elif (ipscore >= 4.8):
        priority = 8
    elif (ipscore >= 4.6):
        priority = 9
    else :
        priority= 10
    return priority

In [58]:
#Create the feature set
def populate_features():
    features = [0] * 25
    count=0
    file = open("data/" + currentdate+"/"+"processed"+"/"+"features.csv",'w')
    writer = csv.writer(file)
    for ip in uniqueips:
        for index in range(13):
            #print(iplists[index])
            ip1 = ip+"\n"
            if ip1 in iplists[index]:
                features[index]=ioc_sources[index][2]
                #print("found in"+str(index))

        score, country_name, country_code =get_country_score(ip)
        if country_name == 0:
            continue

        features[18]= score

        country_data[country_name]+=1
    
        count=count+1
        ipscore=0
        for i in range(19):
            ipscore=ipscore+float(features[i])
        
        features[19]= ipscore
        features[20] = get_priority(ipscore)
        features[21]=int(country_code)
        features[22]=str(ip)
        #features[23]=country_name
        writer.writerow(features)
        #print(label)
        #print(features)
        #print("\n")
        if (count % 1000) == 0:
            print(count)
            #Only 1000 IPs will be processed for faster development
            #Comment following line to process all the IPs
            if (count == 1000):
                break
            
    file.close()
    
populate_features()

1000


In [41]:
#Country level IP data for analyes
file = open("data/" + currentdate+"/"+"processed"+"/"+"country_count.csv",'w')
for c in country_data:
    file.write(c+","+str(country_data[c])+"\n")
file.close()

### Machine Learning 

#### Training
- Load the data
    - Load features from the file
    - Get X and Y data
- Model
    - This is a multi classifer problem. Choosing Decision Tree Classifier
    - Train DecisionClassifier Model
    - Save trained model to a file

In [42]:
# Load featrues
from numpy import genfromtxt
def load_features():
    tdata = genfromtxt("data/" + currentdate+"/"+"processed"+"/"+"features.csv", delimiter=',')
    #print(data)
    return tdata

data = load_features()

In [43]:
#Data for machine learning
X=data[:,1:19]
y=data[:,20]
#print(X)
#print(y)

[[0.   0.   0.   ... 0.   0.   0.6 ]
 [0.   0.   0.   ... 0.   0.   0.7 ]
 [0.   0.   0.41 ... 0.   0.   0.4 ]
 ...
 [0.21 0.11 0.41 ... 0.   0.   0.9 ]
 [0.21 0.11 0.41 ... 0.   0.   0.37]
 [0.21 0.11 0.41 ... 0.   0.   0.9 ]]
[10. 10.  9.  9.  6.  6.  8.  8.  6.  7.  6.  7.  8.  7.  5.  7.  5.  5.
  5.  5.  7.  8.  6.  8.  8.  8.  5.  7.  5.  6.  6.  6.  3.  6.  4.  6.
  3.  5.  5.  5.  3.  5.  6.  5.  6.  5.  6.  6.  6.  3.  4.  5.  6.  6.
  6.  6.  5.  3.  6.  5.  3.  6.  6.  6.  6.  4.  4.  6.  4.  3.  3.  4.
  5.  6.  5.  5.  6.  6.  6.  3.  6.  4.  4.  3.  4.  5.  6.  3.  4.  6.
  4.  5.  3.  3.  4.  6.  6.  3.  5.  5.  4.  5.  3.  3.  6.  4.  3.  6.
  5.  3.  6.  6.  6.  4.  3.  6.  4.  3.  3.  6.  3.  6.  6.  5.  6.  3.
  5.  4.  5.  4.  6.  6.  6.  4.  3.  3.  6.  4.  6.  3.  3.  6.  4.  6.
  5.  3.  4.  4.  6.  3.  5.  4.  6.  3.  5.  4.  6.  3.  4.  3.  6.  3.
  5.  3.  4.  4.  6.  5.  6.  4.  3.  4.  6.  3.  5.  3.  6.  4.  6.  4.
  4.  6.  6.  6.  6.  4.  3.  3.  6.  5.  

In [44]:
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [46]:
# Create Decision Tree classifer object
iocp_model = DecisionTreeClassifier()

# Train Decision Tree Classifer
iocp_model = iocp_model.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = iocp_model.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9666666666666667


In [47]:
#Save the trained model
import pickle
pickle.dump(iocp_model, open("iocp_model.pkl", 'wb'))

### AI based prioritization of IOCs
Run the following daily to get the prioritization of IOCs

- Load the saved model from the file
- Collect daily data from the IOC sources
- Do preprocessing 
- Enrich the data
- Using the pretrained model assign the priority for each IP

In [48]:
#Load IOC Priotation Model
iocp_trained_model = pickle.load(open("iocp_model.pkl", 'rb'))

In [59]:
#create_date_directory()
#collect_iocs()
cleanup_data()
unique_ips()
populate_features()
data = load_features()

data/20230711/raw/emerginthreats_net_cips.txt
data/20230711/cleaned/emerginthreats_net_cips.txt
data/20230711/raw/tor_exit_nodes.txt
data/20230711/cleaned/tor_exit_nodes.txt
data/20230711/raw/all_tor_nodes.txt
data/20230711/cleaned/all_tor_nodes.txt
data/20230711/raw/pop3gropers_raw.txt
data/20230711/cleaned/pop3gropers_raw.txt
data/20230711/raw/feodo_ip_blocklist.csv
data/20230711/cleaned/feodo_ip_blocklist.csv
data/20230711/raw/dataplane_org_ssh.txt
data/20230711/cleaned/dataplane_org_ssh.txt
data/20230711/raw/dataplane_org_telent.txt
data/20230711/cleaned/dataplane_org_telent.txt
data/20230711/raw/cybercure_badips.txt
data/20230711/cleaned/cybercure_badips.txt
data/20230711/raw/ipspamlist.txt
data/20230711/cleaned/ipspamlist.txt
data/20230711/raw/mirai_security.txt
data/20230711/cleaned/mirai_security.txt
data/20230711/raw/ipsum6.txt
data/20230711/cleaned/ipsum6.txt
data/20230711/raw/ipsum7.txt
data/20230711/cleaned/ipsum7.txt
data/20230711/raw/ipsum8.txt
data/20230711/cleaned/ipsum

In [60]:
X=data[:,1:19]

In [61]:
y_pred = iocp_model.predict(X)

In [62]:
print(len(X))

1000


In [64]:
print(uniqueips[0])

200.205.131.106


In [79]:
prioirty_count=[0,0,0,0,0,0,0,0,0,0,0]
for i in range(len(X)):
    prioirty_count[int(y_pred[i])]+=1
    
print(prioirty_count)

[0, 0, 0, 294, 180, 172, 352, 0, 0, 0, 2]


In [71]:
print("IP Address         Priority")
for i in range(len(X)):
    print(uniqueips[i].ljust(18, ' '), y_pred[i])

IP Address         Priority
200.205.131.106    10.0
35.224.2.98        10.0
182.253.184.74     5.0
220.127.101.254    6.0
114.231.140.120    6.0
222.65.226.51      6.0
138.97.92.137      4.0
5.2.70.140         6.0
170.238.112.11     4.0
171.235.77.167     5.0
98.184.33.210      4.0
136.185.13.42      5.0
109.245.240.125    6.0
103.54.202.220     5.0
180.121.254.21     3.0
59.178.10.122      5.0
117.80.144.212     3.0
210.13.91.139      3.0
223.113.1.138      3.0
111.181.44.152     3.0
117.203.195.204    5.0
89.36.178.54       6.0
185.100.215.12     4.0
5.235.202.131      6.0
185.47.64.246      6.0
45.128.232.187     6.0
27.202.245.40      3.0
160.116.251.108    6.0
120.86.253.38      3.0
165.154.132.104    6.0
87.251.74.213      6.0
91.92.189.62       6.0
180.117.20.228     3.0
78.92.113.214      6.0
186.193.69.122     4.0
149.106.157.39     6.0
27.29.45.193       3.0
103.63.24.62       5.0
43.245.102.85      5.0
103.36.35.197      5.0
121.226.173.198    3.0
124.121.148.105    5.0
5.74