### Exploring Automated SSH Cyber Attacks

We'll take a look at some cyber security data gathered by an SSH honeypot. SSH is a protocol for connecting to remote servers on a network, which in this case is the internet.

The dataset can be found here: https://www.kaggle.com/lako65/ssh-brute-force-ipuserpassword

Let's go ahead and take a look at our dataset.

In [8]:
import pandas as pd
attacks_df = pd.read_json('/home/anferneejervis/brute_force_data.json', orient='records')
attacks_df.head()

Unnamed: 0,foreign_ip,passwords,timestamp,username
0,109.87.224.151,"[������, albert, 123456]",2018-11-05 08:31:18,albert
1,122.226.181.166,"[digi, daddy913, covergirl]",2018-11-05 22:16:56,root
2,42.7.27.166,"[qwerty11, qwerty12, qweqweqwe, qwer`123, qwer...",2018-11-05 07:18:16,root
3,125.65.42.181,"[123456, root, password]",2018-11-03 19:30:58,root
4,61.184.247.12,"[lomtjjz, lolita, jake1996]",2018-11-05 08:53:41,root


In [4]:
attacks_df.count()

foreign_ip    14795
passwords     14795
timestamp     14795
username      14795
dtype: int64

Interesting! So we have the source of each attack (its IP address), the time it occurred, and the username/password combinations that were tried. This dataset can give us some insight into what attackers try most often (and what you should probably avoid!).

Pay attention to the values in the passwords field. You'll notice that each one is a list. Data formatted like this isn't convenient for analytics. Think of it this way: each list is like a table, so you end up with a table within a table, and that can confuse some querying functions. Let's fix this first.


It might take a while...

In [9]:
import copy

attack_rows = attacks_df.to_dict(orient='records')

attacks_flattened_df = list()
for row in attack_rows:
  if isinstance(row['passwords'], list):
    for pw in row['passwords']:
      
      # We're gonna copy each row and modify the copy so we don't screw up the original
      r = copy.copy(row)
      r['password'] = pw
      attacks_flattened_df.append(r)
  else:
    attacks_flattened_df.append(row)


In [10]:
# We don't need the passwords list anymore. Let's remove it.
attacks_flattened_df = pd.DataFrame(attacks_flattened_df)
attacks_flattened_df = attacks_flattened_df.drop(columns="passwords")
attacks_flattened_df.head()

Unnamed: 0,foreign_ip,password,timestamp,username
0,109.87.224.151,������,2018-11-05 08:31:18,albert
1,109.87.224.151,albert,2018-11-05 08:31:18,albert
2,109.87.224.151,123456,2018-11-05 08:31:18,albert
3,122.226.181.166,digi,2018-11-05 22:16:56,root
4,122.226.181.166,daddy913,2018-11-05 22:16:56,root
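
As an aside, pandas can do this kind of flattening in one step with `DataFrame.explode`. Here is a minimal sketch on a small made-up sample (`sample_df` and its rows are hypothetical, shaped like the honeypot records), not the code that was run above. Edge cases differ slightly from the loop: `explode` turns an empty list into a single row with a missing value, while the loop skips that row entirely.

```python
import pandas as pd

# Hypothetical sample shaped like the honeypot data: one list of passwords per row.
sample_df = pd.DataFrame({
    "foreign_ip": ["109.87.224.151", "125.65.42.181"],
    "passwords": [["albert", "123456"], ["123456", "root", "password"]],
    "username": ["albert", "root"],
})

# explode() emits one row per list element and repeats the other columns.
flat_df = (
    sample_df.explode("passwords")
    .rename(columns={"passwords": "password"})
    .reset_index(drop=True)
)
print(flat_df)
```

Because it avoids copying every row by hand in Python, this approach would probably also be faster on the full dataset.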


Let's do some analytics. Looking at our data, we can see that the variables of interest (basically all the variables, except maybe timestamps) are categorical. Summary statistics such as the mean, standard deviation, and percentiles therefore won't tell us much.

We can still try it, just for the fun of it.

In [11]:
attacks_flattened_df.describe()

Unnamed: 0,foreign_ip,password,timestamp,username
count,53756,53756,53756,53756
unique,303,22736,14251,441
top,116.31.116.42,123456,2018-11-06 23:07:44,root
freq,8329,469,12,52124
first,,,2018-11-03 19:30:58,
last,,,2018-11-07 14:17:33,
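
For categorical columns, the summary boils down to counts, the number of unique values, and the most frequent value. If you only want one of those numbers, you can ask for it directly. A small sketch on made-up data (`sample_df` is hypothetical, not the honeypot dataset):

```python
import pandas as pd

# Hypothetical sample with two categorical columns.
sample_df = pd.DataFrame({
    "username": ["root", "root", "admin", "root"],
    "password": ["123456", "password", "123456", "root"],
})

# Number of distinct values per column (the "unique" row).
unique_counts = sample_df.nunique()

# Most frequent username (the "top" row) and how often it appears (the "freq" row).
top_username = sample_df["username"].mode()[0]
top_username_freq = int((sample_df["username"] == top_username).sum())

print(unique_counts)
print(top_username, top_username_freq)
```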


### Popular passwords used

To get an idea of which passwords were most common, we can use pandas' `groupby` method to count how many times each password appears.

In [18]:
attacks_flattened_df.groupby(['password']).count().reset_index().sort_values('foreign_ip', ascending=False)

Unnamed: 0,password,foreign_ip,timestamp,username
1064,123456,469,469,469
15818,password,447,447,447
0,������,423,423,423
17602,root,416,416,416
20568,ubuntu,410,410,410
17636,root123,405,405,405
7780,centos6svm,387,387,387
15861,passwrod,353,353,353
5395,admin,108,108,108
16954,qwerty,68,68,68
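
The grouped count repeats the same number in every column, since each column just counts the rows in the group. If all you want is one count per password, `groupby(...).size()` or `value_counts()` gives a single sorted column. A minimal sketch on a hypothetical sample (`sample_df` is made up for illustration):

```python
import pandas as pd

# Hypothetical flattened sample: one password attempt per row.
sample_df = pd.DataFrame({
    "password": ["123456", "password", "123456", "root", "123456", "password"],
    "username": ["root", "root", "admin", "root", "root", "admin"],
})

# One count per password, most frequent first.
counts_by_group = sample_df.groupby("password").size().sort_values(ascending=False)

# value_counts() is a shortcut for the same idea and sorts descending by default.
counts_by_value = sample_df["password"].value_counts()

print(counts_by_value.head(3))
```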


Cool! Now we know the passwords most commonly used in attacks. You should definitely avoid those. That's not enough though. There are still over 22k unique passwords, and our table doesn't necessarily help. We don't want to have to look through all 22k passwords to know which ones we shouldn't use. Wouldn't it be great if you could tell how secure a password might be?

We can probably do that.

To predict how secure a password is, we need a model that predicts the likelihood of that password being used by attackers. To do this, we need to find features for our passwords, then apply a machine learning algorithm that will train a model on our data.
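
To make "features" a bit more concrete, here is a rough sketch of what turning a password into numbers could look like. The function name `password_features` and the specific features (length, digit count, and so on) are assumptions chosen for illustration, not a settled feature set.

```python
def password_features(password: str) -> dict:
    """Return a few simple, hypothetical features describing a password."""
    return {
        "length": len(password),
        "digits": sum(ch.isdigit() for ch in password),
        "letters": sum(ch.isalpha() for ch in password),
        # Anything that is neither a letter nor a digit counts as a symbol here.
        "symbols": sum(not ch.isalnum() for ch in password),
        "has_upper": any(ch.isupper() for ch in password),
    }

# "daddy913" is one of the passwords seen in the dataset preview.
features = password_features("daddy913")
print(features)
```

Each password would become one row of numbers like this, which is the kind of input a machine learning algorithm can train on.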