# WSE 380 - Medium Interaction Honeypot Data Analysis
In this notebook we will go through our medium interaction honeypot data and attempt to extract interesting insights to help us better understand where attackers are coming from and what they are doing.
## Part 0: Install Dependencies
In addition to the dependencies we already have installed, we need to install plotly. This library lets us plot attacker locations on an interactive world map. Install it with the following command:

`pip3 install plotly`

## Part 1: Data Preparation
Let's start by importing all of the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import maxminddb
import json, os, requests
import plotly.graph_objects as go
from io import StringIO

Now let's load our data in. All you need to do is enter the pathname of the folder containing your honeypot data. If you put your data folder in the same directory as this notebook, just enter the name of the folder here.

In [None]:
# Load in the data from one particular data file and return that in its own Dataframe
def loadHoneypotDataFile(filePath):
    maxLines = 1000
    linesLoaded = 0
    honeypotActions = []
    with open(filePath, 'r') as inputFile:
        for line in inputFile:
            if linesLoaded >= maxLines:  # stop once maxLines lines have been read
                break
            linesLoaded+=1
            honeypotActions.append(json.loads(line))

    honeypotData = pd.read_json(StringIO(json.dumps(honeypotActions)))
    print(f" loaded {linesLoaded} lines from {filePath}")
    return honeypotData

############################
honeypotDataFolder = '../cowrie/var/log/cowrie' # Enter the path to the folder containing your honeypot files
############################

# Loop through all of the files in the directory specified, get each file's dataframe and append
# it to the main dataframe
honeypotData = pd.DataFrame()
maxFiles = 1
filesLoaded = 0
for honeypotDataFile in os.listdir(honeypotDataFolder):
    if filesLoaded >= maxFiles:  # stop once maxFiles data files have been loaded
        break
    honeypotDataFile = os.path.join(honeypotDataFolder, honeypotDataFile)

    if('json' in honeypotDataFile):
        filesLoaded+=1
        currentData = loadHoneypotDataFile(honeypotDataFile)
        honeypotData = pd.concat([honeypotData, currentData], ignore_index=True)
        

# Sort the end dataframe by the timestamp field
honeypotData = honeypotData.sort_values('timestamp')
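
Cowrie writes one JSON object per line, so the loading step boils down to "parse up to N lines, one record each". The following is a minimal sketch of that idea, not a replacement for the cell above. The function name `load_json_lines` and the sample log contents are made up for illustration, and skipping unparseable lines is an assumption about how you might want to treat a log file that was cut off mid-write.

```python
import json
from io import StringIO
from itertools import islice

import pandas as pd


def load_json_lines(line_source, max_lines=1000):
    """Parse at most max_lines lines of JSON-lines text into a dataframe."""
    records = []
    for line in islice(line_source, max_lines):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines instead of aborting
    return pd.DataFrame(records)


# Hypothetical sample standing in for a Cowrie log file
sample_log = StringIO(
    '{"eventid": "cowrie.session.connect", "src_ip": "198.51.100.7", "timestamp": "2020-01-01T00:00:00Z"}\n'
    'this line is not valid JSON\n'
    '{"eventid": "cowrie.login.failed", "src_ip": "198.51.100.7", "timestamp": "2020-01-01T00:00:01Z"}\n'
)

sample_frame = load_json_lines(sample_log, max_lines=10)
print(sample_frame)
```

Because `islice` stops reading after `max_lines` lines, the cap applies to lines read rather than to records successfully parsed.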

Now that our data is loaded into memory, let's use the pandas `head()` function to view the first few rows of our newly created dataframe.

In [None]:
print('Data has %d rows and %d columns' % (honeypotData.shape[0], honeypotData.shape[1]))

honeypotData.head()

Unlike the low interaction honeypot data, our medium interaction honeypot records many different kinds of actions, many more than we can see just by looking at the first few elements. Let's see how many distinct actions are recorded by our honeypot:

In [None]:
uniqueActionCounts = honeypotData['eventid'].value_counts()
print('%d unique actions recorded by the honeypot: ' % len(uniqueActionCounts))
uniqueActionCounts

There's a lot more to this data than the low interaction honeypot data! Therefore, we can't break our data up into multiple parts like we did last time. Instead we will filter out the data we're interested in on the fly.

## Part 2: Data Analysis
### Basic Stats
Let's start by simply figuring out how many connections and login attempts we received.

In [None]:
honeypotConnections = honeypotData.loc[honeypotData['eventid'] == 'cowrie.session.connect']
honeypotLoginAttempts = honeypotData.loc[(honeypotData['eventid'] == 'cowrie.login.success')
                                       | (honeypotData['eventid'] == 'cowrie.login.failed')]

print('Number of connections: %d' % honeypotConnections.shape[0])
print('Number of login attempts: %d' % honeypotLoginAttempts.shape[0])
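
When a filter needs to match several event types at once, `isin()` can be easier to read than chaining comparisons with `|`. Here is a small sketch on a made-up event log; the dataframe `events` and its rows are invented for illustration and only mirror the `eventid` and `src_ip` columns of the real data.

```python
import pandas as pd

# Hypothetical miniature event log with the same column names as the Cowrie data
events = pd.DataFrame({
    'eventid': ['cowrie.session.connect', 'cowrie.login.failed',
                'cowrie.login.success', 'cowrie.command.input',
                'cowrie.login.failed'],
    'src_ip': ['198.51.100.7', '198.51.100.7', '203.0.113.9',
               '203.0.113.9', '192.0.2.44'],
})

# Keep every row whose eventid is one of the listed values
login_event_ids = ['cowrie.login.success', 'cowrie.login.failed']
login_events = events.loc[events['eventid'].isin(login_event_ids)]

print('Number of login attempts: %d' % login_events.shape[0])
```

The result should match what the chained `==` comparisons produce; the list form just scales better if you later want to add more event types.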

Unlike the low interaction honeypot, Cowrie distinguishes successful logins from failed logins. To ensure the effectiveness of our honeypot, we want to make sure the users we set up allowed for a sufficient number of successful logins.

In [None]:
numFailedLogins = honeypotData.loc[honeypotData['eventid'] == 'cowrie.login.failed'].shape[0]
numSuccessfulLogins = honeypotData.loc[honeypotData['eventid'] == 'cowrie.login.success'].shape[0]

print('%d successful logins, %d failed logins, %.2f%% of logins succeeded' % (numSuccessfulLogins, numFailedLogins, 
                                                  (numSuccessfulLogins / honeypotLoginAttempts.shape[0] * 100)))
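
One thing to watch for: if your data sample contains no login attempts at all, the percentage calculation divides by zero. A small helper like the one below sidesteps that. The name `percent_successful` is made up for this sketch, and returning 0.0 for an empty sample is just one reasonable convention.

```python
def percent_successful(num_successful, num_failed):
    """Return the percentage of login attempts that succeeded (0.0 if there were none)."""
    total_attempts = num_successful + num_failed
    if total_attempts == 0:
        return 0.0  # no attempts recorded, so there is nothing to divide
    return num_successful / total_attempts * 100


print('%.2f%% of logins succeeded' % percent_successful(1, 3))
print('%.2f%% of logins succeeded' % percent_successful(0, 0))
```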

There is no "right" answer for what percentage of logins should be successful on a honeypot. It depends entirely on what you're trying to accomplish with your honeypot. In our case we want as much data as possible, so the higher the percentage of successful logins, the better.

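To give our login statistics a point of reference, let's see how many unique IP addresses appear in our data. We use the number of unique IP addresses as a rough measure of the number of attackers.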

In [None]:
numAttackerIPs = len(honeypotData.src_ip.unique())
print('Number of unique IP addresses encountered: %d' % numAttackerIPs)

### Top Credentials Used

Now let's look at which credentials attackers used to try to log into our honeypots. We want to know which credentials are the most popular so we know not to use them for our own accounts. How do we define popular, though? We can look at the raw number of times a particular username or password was used across all login attempts, or we can see how many attackers tried a particular username or password. Why not look at both? Each gives us a different insight into what credentials attackers are using. Let's first look at the raw frequencies of credentials.

In [None]:
usernameFreq = honeypotLoginAttempts['username'].value_counts()
passwordFreq = honeypotLoginAttempts['password'].value_counts()

topNToDisplay = 5

display(pd.DataFrame(usernameFreq.head(n=topNToDisplay)))
display(pd.DataFrame(passwordFreq.head(n=topNToDisplay)))

Here, we can see how many times each username and password was tried. This gives us a great view into:
1. Which account names attract attackers the most
2. Which passwords attackers try first/most

Now, let's see which credentials were tried by the greatest number of attackers.

In [None]:
usernameAttackerFreq = pd.DataFrame(honeypotLoginAttempts.groupby(['username'])['src_ip'].nunique().sort_values(ascending=False))
passwordAttackerFreq = pd.DataFrame(honeypotLoginAttempts.groupby(['password'])['src_ip'].nunique().sort_values(ascending=False))

# Here we get the number of unique IP addresses as a basic measure of the number of attackers
numUniqueAttackers = len(honeypotLoginAttempts['src_ip'].unique())

# We can use the apply() function of Pandas to pass each element of a column to a function and set the output to a new column
usernameAttackerFreq['percentage'] = usernameAttackerFreq['src_ip'].apply(lambda x: '%.2f%%' % (x / numUniqueAttackers * 100))
passwordAttackerFreq['percentage'] = passwordAttackerFreq['src_ip'].apply(lambda x: '%.2f%%' % (x / numUniqueAttackers * 100))

topNToDisplay = 5

display(usernameAttackerFreq.head(n=topNToDisplay))
display(passwordAttackerFreq.head(n=topNToDisplay))

Now that we know what percentage of attackers tried each username and password, we can see more clearly which credentials are most popular.
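
To see why the two definitions of "popular" can disagree, consider a tiny made-up example in which one attacker hammers a single username while two other attackers each try a different one once. The dataframe `attempts` and its contents are invented purely for illustration.

```python
import pandas as pd

# Hypothetical login attempts: one IP tries 'admin' four times,
# two other IPs each try 'root' once
attempts = pd.DataFrame({
    'src_ip': ['198.51.100.7'] * 4 + ['203.0.113.9', '192.0.2.44'],
    'username': ['admin'] * 4 + ['root', 'root'],
})

# Raw frequency: how many times each username appears across all attempts
raw_counts = attempts['username'].value_counts()

# Attacker frequency: how many distinct IP addresses tried each username
attacker_counts = attempts.groupby('username')['src_ip'].nunique()

print(raw_counts)
print(attacker_counts)
```

By raw count `admin` comes out ahead, but by number of distinct attackers `root` does. A single noisy attacker can dominate the raw counts, which is why it helps to look at both views.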

### Attacker Login Frequency
Now let's look at the number of usernames and passwords the most active attackers used. An attacker who tried many usernames but few passwords may have been checking whether any one of many possible usernames has a very weak password on the server. An attacker who tried many passwords but few usernames, on the other hand, was targeting a single username or a small number of usernames.

In [None]:
# Here we group our login data by each IP address and count the number of distinct user names and passwords in each group
ipUsernameFreq = honeypotLoginAttempts.groupby(['src_ip'])['username'].nunique().sort_values(ascending=False)
ipPasswordFreq = honeypotLoginAttempts.groupby(['src_ip'])['password'].nunique().sort_values(ascending=False)

topNToDisplay = 5

display(pd.DataFrame(ipUsernameFreq.head(n=topNToDisplay)))
display(pd.DataFrame(ipPasswordFreq.head(n=topNToDisplay)))

The last statistic we are going to generate about this data is the number of login attempts attackers made before quitting. However, instead of making more tables, let's try to plot this data. We are going to make a "Cumulative Distribution Function" or "CDF" plot. While the name may sound complex, what we are looking to find is simply, "what proportion of attackers made at most x login attempts?". In other words, when we look at our figure, we want to be able to say, for example, "20% of attackers tried to log in at most 10,000 times". The x axis of our plot will be the number of login attempts and the y axis will be the proportion of all attackers. Let's see it in action.

In [None]:
# Here we're getting the number of login attempts per IP address using the value_counts() function on the source IP
# address column. The result is a Series indexed by IP address whose values are the attempt counts, so we can
# plot it directly. (Selecting a column by name after reset_index() gives different results across pandas versions.)
numAuthTriesPerIP = honeypotLoginAttempts['src_ip'].value_counts()
#numAuthTriesPerIP = numAuthTriesPerIP.loc[numAuthTriesPerIP < 20]

fig, ax = plt.subplots(figsize=(14, 8))

ax.hist(numAuthTriesPerIP, cumulative=True, density=True, bins=1000, histtype='step')

plt.title('Cumulative Distribution Function of the Total Login Attempts by Each Attacker')
plt.xlabel('Number of Login Attempts')
plt.ylabel('Fraction of Attackers')  # with density=True the y axis runs from 0 to 1
plt.show()
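
If you want to read a specific value off the CDF rather than eyeballing the plot, you can compute it directly: sort the per-attacker counts and ask what fraction of them are at most x. The sketch below uses made-up counts, and the helper name `fraction_at_most` is hypothetical.

```python
import numpy as np

# Hypothetical login-attempt counts, one entry per attacker IP
attempts_per_ip = np.array([1, 1, 2, 3, 3, 3, 10, 250])
sorted_attempts = np.sort(attempts_per_ip)


def fraction_at_most(x):
    """Return the fraction of attackers who made at most x login attempts."""
    # side='right' counts every entry that is <= x, including ties
    num_at_most = np.searchsorted(sorted_attempts, x, side='right')
    return num_at_most / len(sorted_attempts)


print(fraction_at_most(3))
print(fraction_at_most(10))
```

In this sample, 6 of the 8 attackers made at most 3 attempts, so the CDF at x = 3 is 0.75. The single attacker with 250 attempts is exactly the kind of outlier that stretches the x axis of the plot above.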

To view the same information in a different format, let's make a boxplot.

In [None]:
# Here we're getting the number of login attempts per IP address using the value_counts() function on the source IP
# address column. The result is a Series indexed by IP address whose values are the attempt counts.
numAuthTriesPerIP = honeypotLoginAttempts['src_ip'].value_counts()
#numAuthTriesPerIP = numAuthTriesPerIP.loc[numAuthTriesPerIP < 20]

fig, ax = plt.subplots(figsize=(14, 8))

# Change the 'showfliers' argument to False to see what the dataset looks like without outliers
ax.boxplot(numAuthTriesPerIP, showfliers=True)

plt.title('The Number of Login Attempts by Each Attacker')
plt.ylabel('Number of Login Attempts')
plt.show()

### Attacker Locations
The next bit of analysis we will try on our medium interaction honeypot data will look to determine where in the world attackers are coming from. To do this, we are going to use the Maxmind DB library we imported at the beginning. This library allows us to get an estimate of where an IP address is located in the world. It's not perfect but close enough for our analysis. This library gets its information from a GeoIP database file. Along with this notebook is a current database file, but in the future you can download an updated database from the [Maxmind website](https://dev.maxmind.com/geoip/geoip2/downloadable/).

To get started with the Maxmind library, let's see how to get location information for a single IP address. First, we load the database from the local file; then we call the `get()` function with the IP address we're interested in, for example the address of Google's public DNS server. The library returns data as a dictionary, meaning we must reference the values we're interested in by their keys.

In [None]:
reader = maxminddb.open_database('../Session3/GeoLite2-City.mmdb')
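
The record returned by `get()` is a nested dictionary, and it can be `None` or lack a `country` entry when the address is not in the database. The sketch below shows one way to pull out the country code and English name defensively. So that it runs without the database file, it uses `FakeReader`, a hypothetical stand-in that only imitates the reader's `get()` method. The sample record is hand-written for illustration, not a real lookup result, and `describe_location` is a made-up helper name.

```python
def describe_location(reader, ip):
    """Return (iso_code, english_name) for an IP, or None if the country is unknown."""
    record = reader.get(ip)
    if record is None or 'country' not in record:
        return None
    country = record['country']
    # Use .get() so a partially filled record does not raise a KeyError
    return country.get('iso_code'), country.get('names', {}).get('en')


class FakeReader:
    """Hypothetical stand-in for the maxminddb reader, for illustration only."""

    def __init__(self, records):
        self._records = records

    def get(self, ip):
        return self._records.get(ip)


# Hand-written sample record shaped like the dictionaries used in this notebook
fake_reader = FakeReader({
    '203.0.113.9': {'country': {'iso_code': 'US', 'names': {'en': 'United States'}}},
    '198.51.100.7': {'continent': {'code': 'EU'}},  # record without a country
})

print(describe_location(fake_reader, '203.0.113.9'))
print(describe_location(fake_reader, '198.51.100.7'))
print(describe_location(fake_reader, '192.0.2.44'))
```

With the real database you would pass the `reader` object opened above instead of `fake_reader`.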

Let's create a new column in our dataset with the country each attacker came from. To do this, we'll use the same `apply()` function we saw earlier. Since an attacker's location won't change between login attempts, we'll use the connection dataset we defined earlier.

Let's also do the same thing for the country code as well. We need this information for the plotting we will do below.

**Note:** We call `.copy()` on the connections dataframe before adding the new columns. Without it, you may see a `SettingWithCopyWarning`, which tells us we're adding values to a copy of a dataframe rather than the actual dataframe. In our case this is okay.

In [None]:
r = requests.get('http://country.io/iso3.json')
countryCodeMappings = r.json()

def getCountryName(ip):
    locationData = reader.get(ip)
    
    # In the cases where the IP address' location can't be determined, we return an empty string
    if locationData is None or 'country' not in locationData:
        return ''
    
    return locationData['country']['names']['en']

def getCountryCode(ip):
    locationData = reader.get(ip)
    
    # In the cases where the IP address' location can't be determined, we return an empty string
    if locationData is None or 'country' not in locationData:
        return ''
    
    # Map the two-letter ISO code to its three-letter code; return an empty string if no mapping exists
    return countryCodeMappings.get(locationData['country']['iso_code'], '')

honeypotConnections = honeypotConnections.copy()
honeypotConnections['country'] = honeypotConnections['src_ip'].apply(lambda ip: getCountryName(ip))
honeypotConnections['country_code'] = honeypotConnections['src_ip'].apply(lambda ip: getCountryCode(ip))
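
The same IP address usually appears on many rows, so calling the lookup once per row repeats a lot of work. A possible optimization, sketched below, is to look up each distinct address once and then use `map()` to spread the results across all rows. The function `lookup_country` and its tiny fake database are hypothetical stand-ins for `getCountryName()`, and the country names are invented.

```python
import pandas as pd

lookup_log = []  # records every address passed to the lookup function


def lookup_country(ip):
    """Hypothetical stand-in for getCountryName() that logs each call."""
    lookup_log.append(ip)
    fake_database = {'198.51.100.7': 'Exampleland', '203.0.113.9': 'Testonia'}
    return fake_database.get(ip, '')


# Hypothetical connection log in which one address appears three times
connections = pd.DataFrame({
    'src_ip': ['198.51.100.7', '203.0.113.9', '198.51.100.7', '198.51.100.7'],
})

# Look up each distinct address once, then map the results onto every row
country_by_ip = {ip: lookup_country(ip) for ip in connections['src_ip'].unique()}
connections['country'] = connections['src_ip'].map(country_by_ip)

print(connections)
print('%d lookups for %d rows' % (len(lookup_log), len(connections)))
```

On a log with thousands of connections from a few hundred addresses, this cuts the number of database lookups to the number of distinct addresses.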

Now that we have each IP's location, let's get the most popular countries.

In [None]:
topNToDisplay = 10

numConnectionsPerCountry = honeypotConnections['country'].value_counts().reset_index()

display(numConnectionsPerCountry.head(n = topNToDisplay))

Now let's display this same information in a more interesting way. We can use the plotly library to create an interactive world map that allows us to zoom in and hover over each country to get how many attackers originated from there.

In [None]:
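# value_counts() returns a Series indexed by country code whose values are the connection counts.
# Using .index and .values directly avoids relying on the column names reset_index() produces,
# which differ between pandas versions.
numConnectionsPerCountryCode = honeypotConnections['country_code'].value_counts()

fig = go.Figure(data=go.Choropleth(
    locations = numConnectionsPerCountryCode.index,
    z = numConnectionsPerCountryCode.values,
    text = numConnectionsPerCountryCode.index,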
    colorscale = 'Blues',
    autocolorscale=False,  # must be False for the colorscale chosen above to take effect
    reversescale=False,
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'Number of<br> Connections',
))

fig.update_layout(
    title_text='Attacker Locations',
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
)

fig.show()

### Commands Executed
Unlike the low interaction honeypot, medium interaction honeypots allow the attacker to actually interact with the system. This means we have data on which commands attackers ran on our honeypot. Let's take a look at the most popular commands.

In [None]:
honeypotCommandInputs = honeypotData.loc[honeypotData['eventid'] == 'cowrie.command.input']
honeypotCommandInputCounts = honeypotCommandInputs['input'].value_counts().reset_index()

print('%d total commands executed' % honeypotCommandInputs.shape[0])
display(honeypotCommandInputCounts.head(n=20))
honeypotCommandInputCounts.to_csv('honeypotCommandInputCounts.csv')
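
Counting full command lines treats `uname -a` and `uname -s -v` as unrelated entries. If you would rather know which programs attackers reach for, one rough approach is to count only the first word of each command line. The sketch below does this on a few made-up inputs. The helper name `base_command` is hypothetical, and splitting on whitespace is a simplification that ignores pipes, semicolons, and quoting.

```python
from collections import Counter


def base_command(command_line):
    """Return the first whitespace-separated word of a command line ('' if empty)."""
    tokens = command_line.strip().split()
    return tokens[0] if tokens else ''


# Hypothetical attacker inputs standing in for the 'input' column
sample_inputs = [
    'uname -a',
    'uname -s -v',
    'cat /proc/cpuinfo',
    'wget http://example.com/x.sh',
    '   ',
]

# Count base commands, skipping inputs that were empty or whitespace only
base_counts = Counter(
    base_command(command_line)
    for command_line in sample_inputs
    if base_command(command_line)
)

print(base_counts.most_common())
```

To try this on the real data, you could apply `base_command` to `honeypotCommandInputs['input']` and call `value_counts()` on the result.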

Now let's look at the commands that were successfully run by our honeypot. If a command was successfully run, it means Cowrie had some code to support the output of that command.

In [None]:
honeypotSuccessfulCommandInputs = honeypotData.loc[honeypotData['eventid'] == 'cowrie.command.success']
honeypotSuccessfulCommandInputCounts = honeypotSuccessfulCommandInputs['input'].value_counts().reset_index()

print('%d commands successfully executed' % honeypotSuccessfulCommandInputs.shape[0])
display(honeypotSuccessfulCommandInputCounts.head(n=20))

On the other hand, if Cowrie didn't know how to handle a command, it told the attacker the command was not found even if it was a valid command on Linux. Why might this be bad?

In [None]:
honeypotFailedCommandInputs = honeypotData.loc[honeypotData['eventid'] == 'cowrie.command.failed']
honeypotFailedCommandInputCounts = honeypotFailedCommandInputs['input'].value_counts().reset_index()

print('%d commands failed to execute' % honeypotFailedCommandInputs.shape[0])
display(honeypotFailedCommandInputCounts.head(n=20))

### Downloaded and Uploaded Files
Attackers also download files to machines they compromise and upload files back to their own machines. Let's take a look at what these attackers did and try to figure out the reasoning.

Downloaded files:

In [None]:
honeypotDownloadedFiles = honeypotData.loc[honeypotData['eventid'] == 'cowrie.session.file_download']
honeypotUploadedFiles = honeypotData.loc[honeypotData['eventid'] == 'cowrie.session.file_upload']

print('%d files downloaded overall' % honeypotDownloadedFiles.shape[0])
display(honeypotDownloadedFiles['destfile'].value_counts().reset_index())

Uploaded files:

In [None]:
print('%d files uploaded overall' % honeypotUploadedFiles.shape[0])
display(honeypotUploadedFiles['destfile'].value_counts().reset_index())

### SSH Forwarding
The last main action attackers might take against compromised SSH servers is using them as a proxy server to launch further attacks. Why might attackers want to do this?

Let's look at where attackers are connecting through our honeypots.

In [None]:
honeypotForwardingRequests = honeypotData.loc[honeypotData['eventid'] == 'cowrie.direct-tcpip.request']

print('%d forwarding requests made by attackers' % honeypotForwardingRequests.shape[0])
display(honeypotForwardingRequests['dst_ip'].value_counts().reset_index().head(n=10))
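
The destination address alone does not tell you which service attackers were trying to reach. If your forwarding events also include a destination port, grouping by both can hint at the purpose, such as web traffic or mail. The sketch below assumes a `dst_port` column alongside `dst_ip`; check that your data actually contains it before relying on this. The dataframe `forwarding` and its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical forwarding requests; 'dst_port' is an assumed column name
forwarding = pd.DataFrame({
    'dst_ip': ['203.0.113.10', '203.0.113.10', '198.51.100.25', '203.0.113.10'],
    'dst_port': [443, 443, 25, 80],
})

# Count requests per (destination address, destination port) pair
destination_counts = (
    forwarding.groupby(['dst_ip', 'dst_port'])
    .size()
    .sort_values(ascending=False)
)

print(destination_counts.head(n=10))
```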

## What did you find?
You now have some basic statistics and figures describing your dataset. Take a look at what you found and make some notes on anything interesting. We will discuss our findings during our next session. If you can think of any other interesting analyses we did not perform here, try them out yourself below!