# WSE 380 - Low Interaction Honeypot Data Analysis
In this notebook we will go through our low interaction honeypot data and attempt to extract interesting insights to help us better understand where our attackers are coming from and what they are doing.
## Part 0: Install Dependencies
In addition to the dependencies we already have installed, we need the MaxMind DB library, which lets us look up the approximate location of each attacker from their IP address. Install it with the following command:

`pip3 install maxminddb`

## Part 1: Data Preparation
Let's start by importing all of the libraries we need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import maxminddb
import json

Now let's load our data in. All you need to do is enter the pathname of the file containing your honeypot data. If you put your data file in the same directory as this notebook, just enter the name of the file here.

In [None]:
honeypotDataFile = '/home/user/ssh_logs.json' # Enter the path to your data file here

honeypotActions = []
with open(honeypotDataFile, 'r') as inputFile:
    for line in inputFile:
        honeypotActions.append(json.loads(line))

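from io import StringIO  # newer pandas versions expect a file-like object rather than a raw JSON string
honeypotData = pd.read_json(StringIO(json.dumps(honeypotActions)))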

Our honeypot writes each action it records as its own JSON object. If you open your data file in a text editor, you will see that it contains one JSON object per line. That is why we loop through the file line by line, parse each line with the `json.loads()` function, and store the resulting objects in a list. We then use the `json.dumps()` function to turn that list into a single JSON string that Pandas can read into a dataframe.
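Because the file holds one JSON object per line, there are shorter ways to reach the same dataframe. The sketch below shows two possible alternatives rather than the notebook's own method. It writes a tiny made-up sample file so that it can run on its own; the records and the names `sampleLines`, `samplePath`, `fromList`, and `fromFile` are invented for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up records in the same one-object-per-line layout (values are illustrative only)
sampleLines = [
    {"msg": "Connection", "src": "192.0.2.10"},
    {"msg": "Request with password", "src": "192.0.2.10", "duser": "root", "password": "hunter2"},
]

# Write the sample records to a temporary file, one JSON object per line
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as sampleFile:
    for record in sampleLines:
        sampleFile.write(json.dumps(record) + '\n')
    samplePath = sampleFile.name

# Option 1: parse each line ourselves and build the dataframe from the list of objects
parsedActions = []
with open(samplePath, 'r') as inputFile:
    for line in inputFile:
        parsedActions.append(json.loads(line))
fromList = pd.DataFrame(parsedActions)

# Option 2: let pandas parse the line-delimited file directly
fromFile = pd.read_json(samplePath, lines=True)

os.remove(samplePath)
print(fromList.shape, fromFile.shape)
```

Option 1 keeps every value exactly as `json.loads()` parsed it, while `read_json()` applies its own type inference to each column, so it is worth checking the resulting column types before relying on either result.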

Now that we have our data loaded into memory, let's use the `head()` function of Pandas to view the first few rows of our newly created dataframe.

In [None]:
honeypotData.head()

Here, we can get a better view of what our honeypot recorded from each interaction with attackers. There appears to be a lot of interesting data, but also some uninteresting, unchanging columns. For our analysis, we are going to want to look at the following columns:
* client_version : The SSH version the attacker was using
* duser : The username the attacker tried
* password : The password the attacker tried
* src : The attacker's IP address
* time : Timestamp of the login attempt

The other columns are not very interesting for our analysis, either because they never change or because they are effectively random numbers. For example, the `product` column is just the name of our honeypot, "ssh-auth-logger". Likewise, the `spt` column holds the network port on the attacker's computer that sent the current request. This port is chosen randomly, so it doesn't tell us much.
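One way to check which columns never change is to count the distinct values in each column. The following is a minimal sketch on a made-up dataframe; the rows and the names `sample`, `distinctPerColumn`, and `constantColumns` are assumptions for illustration, and on the real data you would run the same two lines against `honeypotData`.

```python
import pandas as pd

# Made-up rows that mimic the shape of the honeypot data (values are illustrative only)
sample = pd.DataFrame({
    'product': ['ssh-auth-logger', 'ssh-auth-logger', 'ssh-auth-logger'],
    'src': ['192.0.2.10', '192.0.2.10', '198.51.100.7'],
    'spt': [50122, 50123, 41877],
    'duser': ['root', 'admin', 'root'],
})

# Count the distinct values in every column
distinctPerColumn = sample.nunique()

# Columns with a single distinct value carry no information for our analysis
constantColumns = list(distinctPerColumn[distinctPerColumn <= 1].index)
print(constantColumns)
```

A column like `spt` will not show up here, because random values are almost all distinct; it is uninteresting for a different reason than `product`.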

Before we move on to our analysis, there appears to be something strange going on with our data. It seems in some of the rows the username and password have the value NaN. Why could this be? If we look closer we can see that in all of these rows, the column `msg` has the value "Connection" and all other rows have the value "Request with password". This must mean the honeypot distinguishes different parts of the SSH authentication process as separate data points. Let's find out how many different kinds of actions are recorded by our honeypot.

To do this, let's use the `unique()` function of pandas to get all unique values of the `msg` column.

In [None]:
uniqueActions = list(honeypotData['msg'].unique())
# Join the values with commas so the output has no trailing separator
print('Unique actions recorded by the honeypot: ' + ', '.join(str(action) for action in uniqueActions))

So it appears we may have up to three actions recorded: connections, authentication attempts with a username and password combination, and authentication attempts with an SSH key. In other words, up to three different kinds of data are mixed together in our dataset. Let's separate the dataset into three so it's easier to work with.
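Before splitting, it can help to see how many rows fall under each action. The sketch below uses a small made-up `msg` column; `sample`, `actionCounts`, and `actionGroups` are hypothetical names, and the `groupby()` dictionary is just one possible way to split the rows, shown next to the boolean filters used in this notebook.

```python
import pandas as pd

# Made-up rows using the three message values discussed above
sample = pd.DataFrame({'msg': [
    'Connection',
    'Request with password',
    'Request with password',
    'Connection',
    'Request with key',
]})

# Count the rows per action; dropna=False would also reveal rows with no msg at all
actionCounts = sample['msg'].value_counts(dropna=False)
print(actionCounts)

# One dataframe per action, keyed by the message text
actionGroups = {action: rows for action, rows in sample.groupby('msg')}
print(len(actionGroups['Connection']))
```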

In [None]:
honeypotConnections = honeypotData.loc[honeypotData['msg'] == 'Connection']
honeypotLogins = honeypotData.loc[honeypotData['msg'] == 'Request with password']
honeypotLoginKeys = honeypotData.loc[honeypotData['msg'] == 'Request with key']

## Part 2: Data Analysis
### Number of Connections and Authentication Attempts
Let's start by simply figuring out how many connections and login attempts we received. This one is up to you. Remember, we already split the data into connections and login attempts. We just want to know how many of each we recorded. 

Hint: We can use the same function we learned for getting the length of a list in Python.

In [None]:
numConnections = 0      # Fill in with a function call that returns the length of this dataframe
numLoginAttempts = 0    # Fill in with a function call that returns the length of this dataframe
numLoginKeyAttempts = 0 # Fill in with a function call that returns the length of this dataframe

print('Number of connections: %d' % numConnections)
print('Number of login attempts: %d' % numLoginAttempts)
print('Number of login attempts with a key: %d' % numLoginKeyAttempts)

has_login_key_attempts = numLoginKeyAttempts > 0

With this basic statistic of just a total count of connections and login attempts we can see some interesting information. First, we can see just how many attackers are constantly trying to break into SSH servers. We only had our honeypots running for about two weeks and we have tens to hundreds of thousands of login attempts. This is why it's critically important to use strong passwords on these servers.

There is also something interesting about the relationship between connections and login attempts. Off the bat, we can see there are many more login attempts than connections. This is to be expected as attackers are not going to connect and try just one password. However, by default, SSH servers will allow 3 incorrect login attempts before the user is disconnected. This honeypot is not any different, but we see way more login attempts than 3\*numConnections. Why is this?

### Top Credentials Used

Now let's look at what credentials attackers used to try to log into our honeypots. We want to know what are the most popular credentials tried so we know not to use them for our own accounts. How do we define popular though? We can look at the raw number of times a particular username or password was used across all login attempts, or we can see how many attackers tried a particular username or password. Why not look at both? They both give us unique insights into what credentials attackers are using. Let's first look at the raw frequencies of credentials.

In [None]:
usernameFreq = honeypotLogins['duser'].value_counts()
passwordFreq = honeypotLogins['password'].value_counts()

topNToDisplay = 5

display(pd.DataFrame(usernameFreq.head(n=topNToDisplay)))
display(pd.DataFrame(passwordFreq.head(n=topNToDisplay)))

if has_login_key_attempts:
    keyFreq = honeypotLoginKeys['fingerprint'].value_counts()
    display(pd.DataFrame(keyFreq.head(n=topNToDisplay)))

Here, we can see how many times each username and password was tried. This gives us a great view into:
1. Which account names attract attackers the most
2. Which passwords attackers try first/most
3. Which SSH keys attackers try first/most

Now, let's see which credentials were tried by the greatest number of attackers.

In [None]:
usernameAttackerFreq = pd.DataFrame(honeypotLogins.groupby(['duser'])['src'].nunique().sort_values(ascending=False))
passwordAttackerFreq = pd.DataFrame(honeypotLogins.groupby(['password'])['src'].nunique().sort_values(ascending=False))

topNToDisplay = 5

display(usernameAttackerFreq.head(n=topNToDisplay))
display(passwordAttackerFreq.head(n=topNToDisplay))

if has_login_key_attempts:
    keyAttackerFreq = pd.DataFrame(honeypotLoginKeys.groupby(['fingerprint'])['src'].nunique().sort_values(ascending=False))
    display(keyAttackerFreq.head(n=topNToDisplay))

These tables tell us a different story. While the first tables may be swayed by one attacker trying to break into one account over and over again, these tables let us know how many attackers try each username, password, and SSH key. However, in this case a count of the number of attackers doesn't paint the entire picture. We'd rather know what **percentage** of attackers tried each set of credentials. To do this, let's create a new column in these tables with that percentage.
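To see how a single persistent attacker can sway the raw counts, here is a small sketch on made-up login rows. The data and the names `sampleLogins`, `rawFreq`, and `attackerFreq` are invented for illustration; the two calculations mirror the ones used above.

```python
import pandas as pd

# One attacker hammers the 'oracle' account four times; two other attackers each try 'root' once
sampleLogins = pd.DataFrame({
    'src': ['192.0.2.10'] * 4 + ['198.51.100.7', '203.0.113.5'],
    'duser': ['oracle', 'oracle', 'oracle', 'oracle', 'root', 'root'],
})

# Raw frequency: how many login attempts used each username
rawFreq = sampleLogins['duser'].value_counts()

# Attacker frequency: how many distinct IP addresses tried each username
attackerFreq = sampleLogins.groupby('duser')['src'].nunique().sort_values(ascending=False)

print(rawFreq)
print(attackerFreq)
```

In this toy data `oracle` leads the raw counts, yet only one attacker ever tried it, while `root` was tried by two.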

In [None]:
# Here we get the number of unique IP addresses as a basic measure of the number of attackers
numUniqueAttackers = len(honeypotLogins['src'].unique())

# We can use the apply() function of Pandas to pass each element of a column to a function and set the output to a new column
usernameAttackerFreq['percentage'] = usernameAttackerFreq['src'].apply(lambda x: '%.2f%%' % (x / numUniqueAttackers * 100))
passwordAttackerFreq['percentage'] = passwordAttackerFreq['src'].apply(lambda x: '%.2f%%' % (x / numUniqueAttackers * 100))

topNToDisplay = 5

display(usernameAttackerFreq.head(n=topNToDisplay))
display(passwordAttackerFreq.head(n=topNToDisplay))

if has_login_key_attempts:
    numUniqueKeyAttackers = len(honeypotLoginKeys['src'].unique())
    keyAttackerFreq['percentage'] = keyAttackerFreq['src'].apply(lambda x: '%.2f%%' % (x / numUniqueKeyAttackers * 100))
    display(keyAttackerFreq.head(n=topNToDisplay))

Now that we know what percentage of attackers tried each username and password, we can see more clearly which credentials are most popular.
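The percentage column above is stored as formatted text, which reads nicely but does not sort numerically. If you want to sort or plot the values, one option is to keep the column numeric. The following is a sketch with made-up counts; `sampleAttackerFreq`, `numSampleAttackers`, and `sortedByShare` are hypothetical names.

```python
import pandas as pd

# Made-up counts of distinct attackers per username
sampleAttackerFreq = pd.DataFrame({'src': [40, 4, 12]}, index=['root', 'admin', 'test'])
sampleAttackerFreq.index.name = 'duser'
numSampleAttackers = 50

# Vectorized arithmetic on the whole column, rounded to two decimals and kept as numbers
sampleAttackerFreq['percentage'] = (sampleAttackerFreq['src'] / numSampleAttackers * 100).round(2)

# Numeric values sort correctly; as text, '8.00%' would be placed ahead of '24.00%'
sortedByShare = sampleAttackerFreq.sort_values('percentage', ascending=False)
print(sortedByShare)
```

Note that these percentages do not need to add up to 100, because a single attacker can try several usernames.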

### Attacker Login Frequency
Now let's see how many distinct usernames and passwords the most active attackers used. An attacker that tried many usernames but few passwords was likely checking whether any of a long list of accounts had a very weak password, while an attacker that tried many passwords but few usernames was likely targeting one account or a small set of accounts.
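To compare the two behaviours side by side, one option is to compute both counts in a single table with one row per attacker. This is a sketch on made-up rows, not the approach used in the cell that follows; `sampleLogins` and `perAttacker` are hypothetical names.

```python
import pandas as pd

# The first attacker sprays one password across many usernames,
# the second tries many passwords against a single username
sampleLogins = pd.DataFrame({
    'src': ['192.0.2.10', '192.0.2.10', '192.0.2.10',
            '198.51.100.7', '198.51.100.7', '198.51.100.7'],
    'duser': ['root', 'admin', 'test', 'root', 'root', 'root'],
    'password': ['123456', '123456', '123456', 'password', 'qwerty', 'letmein'],
})

# Named aggregation: one output column per (input column, function) pair
perAttacker = sampleLogins.groupby('src').agg(
    usernames=('duser', 'nunique'),
    passwords=('password', 'nunique'),
    attempts=('duser', 'size'),
)
print(perAttacker)
```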

In [None]:
# Here we group our login data by each IP address and count the number of distinct user names and passwords in each group
ipUsernameFreq = honeypotLogins.groupby(['src'])['duser'].nunique().sort_values(ascending=False)
ipPasswordFreq = honeypotLogins.groupby(['src'])['password'].nunique().sort_values(ascending=False)

topNToDisplay = 5

display(pd.DataFrame(ipUsernameFreq.head(n=topNToDisplay)))
display(pd.DataFrame(ipPasswordFreq.head(n=topNToDisplay)))

if has_login_key_attempts:
    ipKeyFreq = honeypotLoginKeys.groupby(['src'])['fingerprint'].nunique().sort_values(ascending=False)
    display(pd.DataFrame(ipKeyFreq.head(n=topNToDisplay)))

The last statistic we are going to generate about this data is the number of login attempts attackers made before quitting. However, instead of making more tables, let's plot this data. We are going to make a "Cumulative Distribution Function" or "CDF" plot. While the name may sound complex, the question it answers is simple: "what share of attackers made at most x login attempts?" In other words, when we look at our figure, we want to be able to say, for example, "20% of attackers tried to log in at most 10,000 times". The x axis of our plot will be the number of login attempts, and the y axis will be the share of all attackers who made at most that many attempts, shown as a fraction between 0 and 1. Let's see it in action.
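To make the idea concrete, the CDF can also be computed by hand: sort the per-attacker counts, then for each value ask what fraction of attackers falls at or below it. The sketch below does this with NumPy on made-up counts; `attemptsPerAttacker`, `cumulativeShare`, and `shareAtMost` are hypothetical names, and this is an illustration of the concept rather than the plotting method used in this notebook.

```python
import numpy as np

# Made-up number of login attempts for ten attackers
attemptsPerAttacker = np.array([1, 2, 2, 3, 10, 50, 50, 400, 1000, 12000])

# Sort the counts; the i-th smallest value has i attackers at or below it
sortedAttempts = np.sort(attemptsPerAttacker)
cumulativeShare = np.arange(1, len(sortedAttempts) + 1) / len(sortedAttempts)

def shareAtMost(x):
    """Fraction of attackers that made at most x login attempts."""
    # searchsorted with side='right' counts how many sorted values are <= x
    return np.searchsorted(sortedAttempts, x, side='right') / len(sortedAttempts)

print(shareAtMost(10))
print(shareAtMost(50))
```

Plotting `sortedAttempts` against `cumulativeShare`, for example with `ax.step(sortedAttempts, cumulativeShare, where='post')`, gives the same kind of curve without having to choose a number of bins. Because a few attackers make far more attempts than the rest, `ax.set_xscale('log')` may make the figure easier to read.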

In [None]:
# Here we're getting the number of login attempts per IP address using the value_counts() function on the source IP
# address column
numAuthTriesPerIP = honeypotLogins['src'].value_counts()

fig, ax = plt.subplots(figsize=(8, 4))

ax.hist(numAuthTriesPerIP, cumulative=True, density=True, bins=50, histtype='step')

plt.xlabel('Number of Login Attempts')
plt.ylabel('Fraction of Attackers')  # with density=True the cumulative histogram runs from 0 to 1
plt.show()

### Attacker Locations
The last bit of analysis we will try on our low interaction honeypot data is to determine where in the world attackers are coming from. To do this, we are going to use the MaxMind DB library we installed at the beginning. This library gives us an estimate of where an IP address is located in the world. It's not perfect, but it is close enough for our analysis. The library gets its information from a GeoIP database file. A current database file is included with this notebook, but in the future you can download an updated database from the [Maxmind website](https://dev.maxmind.com/geoip/geoip2/downloadable/).

To get started with the MaxMind library, let's see how we can get location information for a single IP address. First, we load the database from the local file, and then we call the `get()` function with the IP address we're interested in. In this example we look up the location information for Google's public DNS server. The library returns data as a dictionary, meaning we must reference the values we're interested in by their keys.
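Since the result is a dictionary of nested dictionaries, and some keys may be missing for a given IP address, it can be convenient to read values through a small helper. The sketch below uses a hand-written record shaped like a city-database result; the values are illustrative rather than real lookup output, and `getNested` is a hypothetical helper, not part of the `maxminddb` library.

```python
# A hand-written record shaped like a lookup result (values are illustrative, not real output)
sampleRecord = {
    'continent': {'code': 'NA', 'names': {'en': 'North America'}},
    'country': {'iso_code': 'US', 'names': {'en': 'United States'}},
    'location': {'latitude': 37.751, 'longitude': -97.822},
}

def getNested(record, *keys, default=None):
    """Follow a chain of keys through nested dictionaries, returning default if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

print(getNested(sampleRecord, 'country', 'names', 'en'))
print(getNested(sampleRecord, 'city', 'names', 'en', default=''))
```

The second call returns the default because this sample record has no `city` entry, and passing `None` as the record (an address that could not be located) returns the default as well.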

In [None]:
reader = maxminddb.open_database('GeoLite2-City.mmdb')
reader.get('8.8.8.8')

Let's create a new column in our dataset with the country each attacker came from. To do this, we'll use the same `apply()` function we saw earlier. Since an attacker's location won't change between login attempts, we'll use the connection dataset we defined earlier.

**Note:** The code below calls `.copy()` on the connections dataframe before adding the new column. Without that call, you may see a `SettingWithCopyWarning`, because `honeypotConnections` was created by filtering `honeypotData` and Pandas cannot tell whether we meant to modify the original dataframe.
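Because the same IP address usually appears in many rows, one possible optimization is to look up each distinct address only once and then map the results back onto the column. The sketch below demonstrates the idea with a stand-in lookup function so that it runs without a database file; `fakeCountryByIP`, `lookupCountry`, and the country names are invented for illustration, and in the notebook the lookup would be the `getCountryName()` function.

```python
import pandas as pd

# Stand-in for a database lookup (addresses and country names are made up)
fakeCountryByIP = {'192.0.2.10': 'Exampleland', '198.51.100.7': 'Testonia'}
lookupCalls = []

def lookupCountry(ip):
    lookupCalls.append(ip)  # record each call so we can see how many lookups were made
    return fakeCountryByIP.get(ip, '')

sampleConnections = pd.DataFrame({'src': [
    '192.0.2.10', '192.0.2.10', '198.51.100.7', '203.0.113.5', '192.0.2.10',
]})

# Look up every distinct IP address once, then map the results onto all rows
countryByIP = {ip: lookupCountry(ip) for ip in sampleConnections['src'].unique()}
sampleConnections['country'] = sampleConnections['src'].map(countryByIP)

print(len(lookupCalls))
print(sampleConnections)
```

Here five rows need only three lookups. On a dataset with many repeated addresses the saving can be much larger.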

In [None]:
def getCountryName(ip):
    locationData = reader.get(ip)

    # In the cases where the IP address' location can't be determined, we return an empty string
    if locationData is None or 'country' not in locationData:
        return ''

    return locationData['country']['names']['en']

honeypotConnections = honeypotConnections.copy()
honeypotConnections['country'] = honeypotConnections['src'].apply(getCountryName)

Now that we have each IP's location, let's find the countries with the most connections.

In [None]:
topNToDisplay = 5

numConnectionsPerCountry = honeypotConnections['country'].value_counts()

display(numConnectionsPerCountry.head(n=topNToDisplay))

## What did you find?
You now have some basic statistics and figures describing your dataset. Take a look at what you found and make some notes on anything interesting. We will discuss our findings during our next session. If you can think of any other interesting analyses we did not cover here, try them out yourself below!
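As a possible starting point, the `client_version` and `time` columns listed in Part 1 have not been used yet. The sketch below counts the most common client versions and the number of login attempts per day on made-up rows. The timestamp format and version strings are assumptions for illustration, so check how `time` is stored in your own data before applying this to `honeypotLogins`.

```python
import pandas as pd

# Made-up login rows (timestamps and version strings are illustrative only)
sampleLogins = pd.DataFrame({
    'time': ['2021-03-01T08:15:00', '2021-03-01T23:59:10', '2021-03-02T00:00:05',
             '2021-03-02T12:30:00', '2021-03-02T12:30:01'],
    'client_version': ['SSH-2.0-libssh2_1.8.0', 'SSH-2.0-Go', 'SSH-2.0-Go',
                       'SSH-2.0-Go', 'SSH-2.0-libssh2_1.8.0'],
})

# Which SSH client versions show up most often
versionFreq = sampleLogins['client_version'].value_counts()

# Parse the timestamps, keep only the calendar date, and count the attempts per day
attemptsPerDay = pd.to_datetime(sampleLogins['time']).dt.date.value_counts().sort_index()

print(versionFreq)
print(attemptsPerDay)
```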