# WSE 380 - Low Interaction Honeypot Data Analysis
In this notebook we will compare your honeypot data, which was recorded from a public cloud machine, to data from the same honeypot but located on a Stony Brook University computer. 

Before we begin, think about why we might see differences between these two datasets. What would an attacker find interesting about targeting a public cloud computer vs. a university computer, and vice versa?

## Part 1: Data Preparation
Let's start by importing the libraries we need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import maxminddb
import json

Now let's load our data in. All you need to do is enter the pathname of the file containing your honeypot data. If you put your data file in the same directory as this notebook, just enter the name of the file in the specified variable.

In [None]:
#####################
cloudHoneypotDataFile = '/home/user/ssh_logs.json' # Enter the path to your data file here
#####################

sbuHoneypotDataFile = 'sbuHoneypotLogs.json'
NUM_LOGS_PER_FILE = 100_000

# Load data from honeypot log files line by line and return dataframe containing all data
def loadData(inputFileName):
    honeypotActions = []
    with open(inputFileName, 'r') as inputFile:
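        for line in inputFile:
            if len(honeypotActions) >= NUM_LOGS_PER_FILE:
                break
            if not line.strip():
                continue  # Skip blank lines, which json.loads cannot parse
            honeypotActions.append(json.loads(line))

    # Build the dataframe directly from the parsed records (newer pandas versions
    # deprecate passing a literal JSON string to pd.read_json)
    return pd.DataFrame(honeypotActions)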

cloudHoneypotData = loadData(cloudHoneypotDataFile)
sbuHoneypotData = loadData(sbuHoneypotDataFile)
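
To see what the loading step does, here is a minimal sketch on a made-up log held in memory. The three records and the `maxRecords` cap are assumptions for illustration only; your real file will have different contents and possibly more fields.

```python
import io
import json
import pandas as pd

# Hypothetical stand-in for a honeypot log file: one JSON object per line
sampleLog = io.StringIO(
    '{"time": "2020-01-01T00:00:00", "msg": "Connection", "src": "203.0.113.5"}\n'
    '{"time": "2020-01-01T00:00:02", "msg": "Request with password", '
    '"src": "203.0.113.5", "duser": "root", "password": "123456"}\n'
    '{"time": "2020-01-01T00:01:00", "msg": "Connection", "src": "198.51.100.7"}\n'
)

maxRecords = 2  # Plays the role of NUM_LOGS_PER_FILE
records = []
for line in sampleLog:
    if len(records) >= maxRecords:
        break  # Stop reading once the cap is reached
    records.append(json.loads(line))

# Fields missing from a record (e.g. 'password' on a connection) become NaN
sampleData = pd.DataFrame(records)
print(sampleData)
```

Because the cap is 2, the third record is never read. Keep this in mind when comparing totals: a capped file only tells you about its first `NUM_LOGS_PER_FILE` lines.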

Just like when we looked at this data individually, we want to break our dataset into two: connections to our honeypots and login attempts to our honeypots.

In [None]:
cloudHoneypotConnections = cloudHoneypotData.loc[cloudHoneypotData['msg'] == 'Connection']
cloudHoneypotLogins = cloudHoneypotData.loc[cloudHoneypotData['msg'] == 'Request with password']

sbuHoneypotConnections = sbuHoneypotData.loc[sbuHoneypotData['msg'] == 'Connection']
sbuHoneypotLogins = sbuHoneypotData.loc[sbuHoneypotData['msg'] == 'Request with password']
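
The split relies on boolean masks: comparing a column to a value gives one `True`/`False` per row, and `.loc` keeps only the `True` rows. A small sketch on made-up rows (the IP addresses are placeholders):

```python
import pandas as pd

sampleData = pd.DataFrame({
    'msg': ['Connection', 'Request with password', 'Request with password', 'Connection'],
    'src': ['203.0.113.5', '203.0.113.5', '203.0.113.5', '198.51.100.7'],
})

# One boolean per row: True where the row is a connection event
isConnection = sampleData['msg'] == 'Connection'

sampleConnections = sampleData.loc[isConnection]
sampleLogins = sampleData.loc[sampleData['msg'] == 'Request with password']

print(isConnection.tolist())
print(len(sampleConnections), len(sampleLogins))
```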

### Helper Function
Since we now have two different datasets that we're comparing, we want a way to combine multiple output dataframes into one to make it easier to view our data.

In [None]:
# Input a list of dataframes and desired column names and output a dataframe that is the horizontal
# concatenation of the inputs
def combineDataFrames(outputColumnNames, inputDataFrames):
    outputDataFrame = pd.concat(inputDataFrames, axis=1)
    outputDataFrame.columns = outputColumnNames
    return outputDataFrame
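
Here is a short sketch of how the helper behaves, using made-up counts. Rows are lined up by position only, so row 0 shows the top username next to the top password; it does not mean the two were used together. When the inputs have different lengths, the shorter one is padded with `NaN`.

```python
import pandas as pd

def combineDataFrames(outputColumnNames, inputDataFrames):
    outputDataFrame = pd.concat(inputDataFrames, axis=1)
    outputDataFrame.columns = outputColumnNames
    return outputDataFrame

# Hypothetical frequency tables of different lengths
usernames = pd.DataFrame({'name': ['root', 'admin', 'pi'], 'count': [9, 4, 1]})
passwords = pd.DataFrame({'name': ['123456', 'password'], 'count': [7, 3]})

combined = combineDataFrames(
    ['username', 'usernameCount', 'password', 'passwordCount'],
    [usernames, passwords])
print(combined)
```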

## Part 2: Data Analysis
### Timeframe of our collections
Let's start by taking a look at the timeframe our collections occurred over. This information is important because it gives us a point of reference when comparing the raw number of connections and login attempts each dataset contains.

In [None]:
# Convert our timestamp field to the "datetime" datatype
cloudHoneypotData['time'] = pd.to_datetime(cloudHoneypotData['time'])
sbuHoneypotData['time'] = pd.to_datetime(sbuHoneypotData['time'])

# Use the earliest and latest timestamps so the result does not depend on row order
print("The honeypot located on the cloud ran for %d days" %
      (cloudHoneypotData['time'].max() - cloudHoneypotData['time'].min()).days)
print("The honeypot located at SBU ran for %d days" %
      (sbuHoneypotData['time'].max() - sbuHoneypotData['time'].min()).days)
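
Subtracting the earliest timestamp from the latest is one way to measure the collection window, and it does not depend on the log being in chronological order. A minimal sketch with made-up, deliberately unsorted timestamps:

```python
import pandas as pd

# Hypothetical timestamps, not in chronological order
sampleTimes = pd.to_datetime(pd.Series([
    '2020-01-03T12:00:00',
    '2020-01-01T00:00:00',
    '2020-01-10T06:00:00',
]))

# Latest minus earliest gives a Timedelta; .days drops the partial day
duration = sampleTimes.max() - sampleTimes.min()
print(duration.days)  # 9

# Last row minus first row only matches when the rows are sorted by time
lastMinusFirst = sampleTimes.iloc[-1] - sampleTimes.iloc[0]
print(lastMinusFirst.days)  # 6
```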

### Number of Connections and Authentication Attempts
Now, let's see how many connections and login attempts we received in each of the datasets. 

Before you calculate this, which dataset do you think will have more connections/login attempts?

In [None]:
print('Number of connections on cloud honeypot: %d' % len(cloudHoneypotConnections))
print('Number of logins on cloud honeypot: %d' % len(cloudHoneypotLogins))
print()
print('Number of connections on SBU honeypot: %d' % len(sbuHoneypotConnections))
print('Number of logins on SBU honeypot: %d' % len(sbuHoneypotLogins))

Do you see anything interesting in the number of connections/logins between the two datasets? If so, what do you see? Why do you think this is?

### Top Credentials Used

Now let's look at what credentials attackers used to try to log into our honeypots. We want to know which credentials are tried most often so we know not to use them for our own accounts. How do we define popular, though? We can look at the raw number of times a particular username or password was used across all login attempts, or we can see how many attackers tried a particular username or password. Why not look at both? Each gives us a different insight into what credentials attackers are using. Let's first look at the raw frequencies of credentials.

In [None]:
cloudUsernameFreq = cloudHoneypotLogins['duser'].value_counts()
cloudPasswordFreq = cloudHoneypotLogins['password'].value_counts()

sbuUsernameFreq = sbuHoneypotLogins['duser'].value_counts()
sbuPasswordFreq = sbuHoneypotLogins['password'].value_counts()

cloudUsernamePassCombined = combineDataFrames(['username','usernameCount','password','passwordCount'], 
                                              [cloudUsernameFreq.reset_index(), cloudPasswordFreq.reset_index()])
sbuUsernamePassCombined = combineDataFrames(['username','usernameCount','password','passwordCount'], 
                                              [sbuUsernameFreq.reset_index(), sbuPasswordFreq.reset_index()])

topNToDisplay = 5
print("Top credentials by raw count for honeypot on cloud machine:")
display(cloudUsernamePassCombined.head(n=topNToDisplay))
print("\nTop credentials by raw count for honeypot on SBU machine:")
display(sbuUsernamePassCombined.head(n=topNToDisplay))

Here, we see that the number of times each username or password was used is about the same between the two datasets. However, there is a **large** number of login attempts for the username "root" in the SBU dataset.

What does your dataset look like in comparison? If you see any differences, what are they and what might they mean?

Now, let's see which credentials were tried by the greatest number of attackers. This removes the effect of a single attacker trying one username or password many times and skewing the results.

In [None]:
cloudUsernameAttackerFreq = pd.DataFrame(cloudHoneypotLogins.groupby(['duser'])['src'].nunique().sort_values(ascending=False)).reset_index()
cloudPasswordAttackerFreq = pd.DataFrame(cloudHoneypotLogins.groupby(['password'])['src'].nunique().sort_values(ascending=False)).reset_index()
sbuUsernameAttackerFreq = pd.DataFrame(sbuHoneypotLogins.groupby(['duser'])['src'].nunique().sort_values(ascending=False)).reset_index()
sbuPasswordAttackerFreq = pd.DataFrame(sbuHoneypotLogins.groupby(['password'])['src'].nunique().sort_values(ascending=False)).reset_index()

numUniqueCloudAttackers = len(cloudHoneypotLogins['src'].unique())
numUniqueSbuAttackers = len(sbuHoneypotLogins['src'].unique())

cloudUsernameAttackerFreq['userPercentage'] = cloudUsernameAttackerFreq['src'].apply(
                                                lambda x: '%.2f%%' % (x / numUniqueCloudAttackers * 100))
cloudPasswordAttackerFreq['passPercentage'] = cloudPasswordAttackerFreq['src'].apply(
                                                lambda x: '%.2f%%' % (x / numUniqueCloudAttackers * 100))
sbuUsernameAttackerFreq['userPercentage'] = sbuUsernameAttackerFreq['src'].apply(
                                                lambda x: '%.2f%%' % (x / numUniqueSbuAttackers * 100))
sbuPasswordAttackerFreq['passPercentage'] = sbuPasswordAttackerFreq['src'].apply(
                                                lambda x: '%.2f%%' % (x / numUniqueSbuAttackers * 100))

columnNames = ['username','usernameCount','userAttackerPercentage','password','passwordCount','passAttackerPercentage']
cloudUserPassAttackerFreq = combineDataFrames(columnNames, [cloudUsernameAttackerFreq, cloudPasswordAttackerFreq])
sbuUserPassAttackerFreq = combineDataFrames(columnNames, [sbuUsernameAttackerFreq, sbuPasswordAttackerFreq])

topNToDisplay = 5
print("Top credentials by most unique attacker usage on cloud machine:")
display(cloudUserPassAttackerFreq.head(n=topNToDisplay))
print("\nTop credentials by most unique attacker usage on SBU machine:")
display(sbuUserPassAttackerFreq.head(n=topNToDisplay))
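
The difference between the two ways of counting is easiest to see on a tiny made-up dataset, where one source IP tries "root" four times and two other IPs each try "admin" once:

```python
import pandas as pd

# Hypothetical login attempts; IP addresses are placeholders
sampleLogins = pd.DataFrame({
    'src': ['203.0.113.5'] * 4 + ['198.51.100.7', '192.0.2.9'],
    'duser': ['root', 'root', 'root', 'root', 'admin', 'admin'],
})

# Raw frequency: every attempt counts
rawCounts = sampleLogins['duser'].value_counts()

# Attacker frequency: each source IP counts once per username
attackerCounts = (sampleLogins.groupby('duser')['src']
                  .nunique()
                  .sort_values(ascending=False))

numUniqueAttackers = sampleLogins['src'].nunique()
attackerPercent = attackerCounts / numUniqueAttackers * 100

print(rawCounts)        # root is first (4 attempts vs. 2)
print(attackerCounts)   # admin is first (2 attackers vs. 1)
print(attackerPercent.round(2))
```

By raw count "root" looks most popular, but only one of the three attackers used it.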

These tables tell us a different story. While the first tables may be skewed by one attacker trying to break into one account over and over again, these tables let us know how many attackers tried each username and password. Here we can see that there seem to be far more unique attackers connecting to the cloud machine than to the SBU machine. Let's see what those numbers look like:

In [None]:
print("Number of unique attackers that connected to cloud machine: %d" % 
            len(cloudHoneypotConnections['src'].unique()))
print("Number of unique attackers that connected to SBU machine: %d" % 
             len(sbuHoneypotConnections['src'].unique()))

These past two measurements give us a better idea of what's going on. Based on the raw connection and login attempt numbers, it seemed like the SBU honeypot received more attention than the public cloud honeypot. However, the unique-attacker counts show that the public cloud machine is contacted by a larger number of attackers, though those attackers may be less persistent.

Why do you think attackers connecting to the SBU honeypot might make more login attempts than attackers on the public cloud honeypot?
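
One way to put a number on how persistent attackers are is the average number of login attempts per unique source IP. A minimal sketch on made-up data (the helper name `attemptsPerAttacker` and the IP addresses are hypothetical, not part of the notebook):

```python
import pandas as pd

def attemptsPerAttacker(logins):
    # Total login attempts divided by the number of distinct source IPs
    return len(logins) / logins['src'].nunique()

# Many attackers, few attempts each
manyBriefAttackers = pd.DataFrame(
    {'src': ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4', '10.0.0.1']})

# Few attackers, many attempts each
fewPersistentAttackers = pd.DataFrame(
    {'src': ['10.0.1.1'] * 6 + ['10.0.1.2'] * 2})

print(attemptsPerAttacker(manyBriefAttackers))      # 1.25
print(attemptsPerAttacker(fewPersistentAttackers))  # 4.0
```

You could apply the same function to `cloudHoneypotLogins` and `sbuHoneypotLogins` to compare your two datasets.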

### Attacker Login Frequency
Now let's look at how many distinct usernames and passwords the most active attackers used. An attacker that tried many usernames but few passwords may have been checking whether any one of many possible accounts on the server has a very weak password, whereas an attacker that tried many passwords but few usernames was targeting a single username or a small number of them.

In [None]:
# Here we group our login data by each IP address and count the number of distinct usernames and passwords in each group
# Note: the two halves of each table are sorted independently, so the IP in the left half of a row
# may differ from the IP in the right half
cloudIPUsernameFreq = cloudHoneypotLogins.groupby(['src'])['duser'].nunique().sort_values(ascending=False).reset_index()
cloudIPPasswordFreq = cloudHoneypotLogins.groupby(['src'])['password'].nunique().sort_values(ascending=False).reset_index()
cloudIPCredentialFreq = combineDataFrames(['IP (by usernames)','Unique Usernames','IP (by passwords)','Unique Passwords'],
                                              [cloudIPUsernameFreq,cloudIPPasswordFreq])

sbuIPUsernameFreq = sbuHoneypotLogins.groupby(['src'])['duser'].nunique().sort_values(ascending=False).reset_index()
sbuIPPasswordFreq = sbuHoneypotLogins.groupby(['src'])['password'].nunique().sort_values(ascending=False).reset_index()
sbuIPCredentialFreq = combineDataFrames(['IP (by usernames)','Unique Usernames','IP (by passwords)','Unique Passwords'],
                                              [sbuIPUsernameFreq,sbuIPPasswordFreq])


topNToDisplay = 5
display(pd.DataFrame(cloudIPCredentialFreq.head(n=topNToDisplay)))
display(pd.DataFrame(sbuIPCredentialFreq.head(n=topNToDisplay)))
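
If you want both counts for the same attacker on one row, one option is pandas' named aggregation. This is a sketch on made-up data rather than a replacement for the tables above; the column names `uniqueUsernames` and `uniquePasswords` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical logins: the first IP sprays one password across many usernames,
# the second IP tries many passwords against one username
sampleLogins = pd.DataFrame({
    'src': ['203.0.113.5'] * 3 + ['198.51.100.7'] * 3,
    'duser': ['root', 'admin', 'pi', 'root', 'root', 'root'],
    'password': ['123456', '123456', '123456', 'a', 'b', 'c'],
})

# One row per source IP with both distinct counts side by side
perAttacker = sampleLogins.groupby('src').agg(
    uniqueUsernames=('duser', 'nunique'),
    uniquePasswords=('password', 'nunique'),
)
print(perAttacker)
```

Here the first IP has 3 usernames and 1 password, while the second has 1 username and 3 passwords, which matches the two attacker strategies described earlier.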

The last statistic we are going to generate about this data is the number of login attempts each attacker made before quitting. However, instead of making more tables, let's plot this data. We are going to make a "Cumulative Distribution Function" or "CDF" plot. While the name may sound complex, the question we are asking is simply, "what percentage of attackers made at most x login attempts?". In other words, when we look at our figure, we want to be able to say, for example, "20% of attackers tried to log in 10,000 times or fewer". The x axis of our plot will be the number of login attempts and the y axis will be the percentage of all attackers. Let's see it in action.

In [None]:
# Here we're getting the number of login attempts per IP address using the value_counts() function on the source IP
# address column
cloudNumAuthTriesPerIP = cloudHoneypotLogins['src'].value_counts()
sbuNumAuthTriesPerIP = sbuHoneypotLogins['src'].value_counts()

fig, ax = plt.subplots(figsize=(14, 8))

ax.hist(cloudNumAuthTriesPerIP, cumulative=True, density=True, bins=1000, histtype='step', label="Cloud")
ax.hist(sbuNumAuthTriesPerIP, cumulative=True, density=True, bins=1000, histtype='step', label="SBU")

plt.title('CDF of Login Attempts per Attacker IP Address')
plt.xlabel('Number of Login Attempts')
plt.ylabel('Fraction of Attackers (1.0 = 100%)')
plt.legend(loc=4)
plt.show()
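
To make the idea of a CDF concrete, here is a sketch that computes one by hand on made-up data: sort the per-attacker attempt counts, then give the i-th smallest value the cumulative fraction i/n. The helper `fractionAtMost` is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical source IPs of seven login attempts from four attackers
sampleSources = pd.Series(['a', 'a', 'a', 'b', 'c', 'c', 'd'])

# Login attempts per attacker, sorted from smallest to largest
sortedAttempts = np.sort(sampleSources.value_counts().to_numpy())

# The i-th smallest value (1-based) has cumulative fraction i / n
cumulativeFraction = np.arange(1, len(sortedAttempts) + 1) / len(sortedAttempts)

def fractionAtMost(x):
    # Fraction of attackers that made at most x login attempts
    return np.searchsorted(sortedAttempts, x, side='right') / len(sortedAttempts)

print(sortedAttempts)      # [1 1 2 3]
print(cumulativeFraction)  # [0.25 0.5  0.75 1.  ]
print(fractionAtMost(1))   # 0.5
```

Half of these attackers made at most one attempt. The same arrays could be drawn as a step plot, for example with `plt.step(sortedAttempts, cumulativeFraction, where='post')`.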

What does your CDF plot tell you about the difference in how attackers behave on the public cloud vs. on SBU honeypots?

### Attacker Locations
Let's create a new column in our dataset with the country each attacker came from. To do this, we'll use the same `apply()` function we saw earlier. Since an attacker's location won't change between login attempts, we'll use the connection dataset we defined earlier.

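**Note:** Adding a column to a filtered slice of a dataframe can trigger a `SettingWithCopyWarning`, which warns that we may be modifying a copy rather than the original dataframe. The code below avoids this by first making an explicit copy of each connections dataframe with `.copy()`. If you still see the warning, it is safe to ignore here.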

In [None]:
reader = maxminddb.open_database('GeoLite2-City.mmdb')

def getCountryName(ip):
    locationData = reader.get(ip)
    
    # In the cases where the IP address' location can't be determined, we return an empty string
    if locationData is None or 'country' not in locationData:
        return ''
    
    return locationData['country']['names']['en']
    
cloudHoneypotConnections = cloudHoneypotConnections.copy()
sbuHoneypotConnections = sbuHoneypotConnections.copy()
cloudHoneypotConnections['country'] = cloudHoneypotConnections['src'].apply(getCountryName)
sbuHoneypotConnections['country'] = sbuHoneypotConnections['src'].apply(getCountryName)
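
Since many connections share the same source IP, one possible optimization is to look up each distinct IP once and map the results back onto the rows. The sketch below uses a plain dictionary, `fakeGeoDatabase`, as a hypothetical stand-in for `reader.get(ip)`, so it runs without the GeoLite2 file; the country name "Exampleland" is made up.

```python
import pandas as pd

# Hypothetical stand-in for the GeoLite2 reader, using the same record shape
# that getCountryName expects
fakeGeoDatabase = {
    '203.0.113.5': {'country': {'names': {'en': 'Exampleland'}}},
    '198.51.100.7': {'continent': {'names': {'en': 'Europe'}}},  # No country field
}

def getCountryName(ip, lookup=fakeGeoDatabase.get):
    locationData = lookup(ip)
    # Unknown IPs and records without a country both yield an empty string
    if locationData is None or 'country' not in locationData:
        return ''
    return locationData['country']['names']['en']

sampleConnections = pd.DataFrame(
    {'src': ['203.0.113.5', '203.0.113.5', '198.51.100.7', '192.0.2.9']})

# Look up each distinct IP once, then map the results back onto every row
countryByIP = {ip: getCountryName(ip) for ip in sampleConnections['src'].unique()}
sampleConnections['country'] = sampleConnections['src'].map(countryByIP)
print(sampleConnections)
```

With the real database you would pass `reader.get` as the `lookup` argument. Note that rows with an empty country string still show up in `value_counts()`, so you may want to filter them out before ranking countries.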

Now that we have each IP's location, let's get the most popular countries, first by the raw number of connections originating from each country.

In [None]:
numCloudConnectionsPerCountry = cloudHoneypotConnections['country'].value_counts()
numSBUConnectionsPerCountry = sbuHoneypotConnections['country'].value_counts()

topNToDisplay = 5
print('Top %d locations of attackers in the cloud honeypot dataset:' % topNToDisplay)
display(numCloudConnectionsPerCountry.head(n = topNToDisplay))
print('Top %d locations of attackers in the SBU honeypot dataset:' % topNToDisplay)
display(numSBUConnectionsPerCountry.head(n = topNToDisplay))

Now, let's see where the most **unique** attackers come from.

In [None]:
# Remove duplicate rows, keeping only the first connection of each IP address
cloudHoneypotConnectionsUnique = cloudHoneypotConnections.drop_duplicates('src')
sbuHoneypotConnectionsUnique = sbuHoneypotConnections.drop_duplicates('src')

# Count the number of times each country appears
numCloudConnectionsPerCountry = cloudHoneypotConnectionsUnique['country'].value_counts()
numSBUConnectionsPerCountry = sbuHoneypotConnectionsUnique['country'].value_counts()

topNToDisplay = 5
print('Top %d locations of attackers in the cloud honeypot dataset:' % topNToDisplay)
display(numCloudConnectionsPerCountry.head(n = topNToDisplay))
print('Top %d locations of attackers in the SBU honeypot dataset:' % topNToDisplay)
display(numSBUConnectionsPerCountry.head(n = topNToDisplay))

Where did attackers originate in your dataset? Did filtering by unique attackers change your top 5 rankings much?
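
A tiny made-up example shows how the two rankings can disagree. The country labels "A" and "B" and the IP addresses are placeholders:

```python
import pandas as pd

# Hypothetical connections: one very active IP in country A,
# three less active IPs in country B
sampleConnections = pd.DataFrame({
    'src': ['203.0.113.5'] * 5 + ['198.51.100.7', '198.51.100.8', '198.51.100.9'],
    'country': ['A'] * 5 + ['B'] * 3,
})

# Ranking by raw connections: every row counts
rawPerCountry = sampleConnections['country'].value_counts()

# Ranking by unique attackers: keep one row per source IP first
uniquePerCountry = sampleConnections.drop_duplicates('src')['country'].value_counts()

print(rawPerCountry)     # A: 5, B: 3
print(uniquePerCountry)  # B: 3, A: 1
```

Country A leads by raw connections only because of a single busy IP; by unique attackers, country B leads.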

## What did you find?
You now have some basic statistics and figures describing your dataset. Take a look at what you found and make some notes on anything interesting. We will discuss our findings during our next session. If you can think of any other interesting analyses we did not cover here, try them out yourself below!

## Discussion Questions
1. What is the most frequently-occurring country from which attackers originate?
2. True or False: Most attackers originate from your answer to #1.
3. What are some factors that could have influenced the results we are seeing?
4. How could we have expanded on this experiment to have more accurate results?
5. What are some reasonable conclusions you can draw from these results?
6. Conversely, what conclusions are **not reasonable given the data**?