# Data Pre-Processing

Now that we have assembled our dataset, we pre-process it for use with our topic models. Pre-processing and cleaning are extremely important for topic modelling because the models operate directly on text: removing unnecessary material speeds up the algorithms and helps them run more smoothly. In this document we use a variety of tools to clean our text.

It is worth noting that, because of the amount and type of data we are processing (SSH logs), some of the "words" left in after this process are not particularly useful or suitable for topic modelling, as later documents of the report will show. Without manual inspection (which has been conducted in this report for some of the more obvious words) these words are difficult to remove, and the later models do not have much to say about them. We do, however, successfully remove the more obvious problematic values, such as those featuring punctuation and numbers.

We start by importing the packages we are going to use in this document as usual.

In [1]:
import pandas as pd
from urllib.request import urlopen
from urllib.error import HTTPError
import pickle
import requests
import datetime as dt
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from nltk.corpus import wordnet
import numpy as np
import gzip

Next we read in the data generated in the previous document. We have decided to work with the 10% sample throughout this report, for several reasons. In our testing we used the 5% sample, but feel that this may have been too small (though quick to load). When we tried larger datasets such as the 50% sample, generating models ran into a computational time problem: the model itself was not too large for most computers to handle, but it took several hours to compute. The 10% sample offers a better balance.

In [2]:
start = dt.datetime.now()
df = pd.read_csv("https://github.com/Galeforse/DST-Assessment-03/raw/master/Data/master_log_10.csv.gz")
print("Data fetched in:", dt.datetime.now() - start)

Data fetched in: 0:00:05.653516


We check the length of our dataset to make sure it is the size that we expect (10% of the original approx. 15,000,000 points) and have a look at the first few values in the data.

In [3]:
print(len(df))
df.head()

1588052


Unnamed: 0,anon_ip,log
0,161.166.1.23,Jan 5 03:23:54 161.166.1.23 sshd[27076]: Fail...
1,161.166.1.23,Jan 5 03:24:25 161.166.1.23 sshd[27087]: Disc...
2,161.166.1.23,Jan 5 03:24:27 161.166.1.23 sshd[27090]: pam_...
3,161.166.1.23,Jan 5 04:08:19 161.166.1.23 sshd[27584]: PAM ...
4,161.166.1.23,Jan 5 04:08:21 161.166.1.23 sshd[27590]: pam_...


We also check an arbitrary section of the data somewhere in the middle. The order has been preserved and, compared with the first rows, the entries come from a different IP, which suggests that the sample covers a mix of the overall data.

In [4]:
df[30000:30008]

Unnamed: 0,anon_ip,log
30000,161.166.1.38,Jan 21 17:55:47 161.166.1.38 sshd[6843]: pam_u...
30001,161.166.1.38,Jan 21 17:55:47 161.166.1.38 sshd[6843]: pam_l...
30002,161.166.1.38,Jan 21 17:55:53 161.166.1.38 sshd[6845]: pam_u...
30003,161.166.1.38,Jan 21 17:56:03 161.166.1.38 sshd[6849]: Inval...
30004,161.166.1.38,Jan 21 17:56:03 161.166.1.38 sshd[6849]: pam_l...
30005,161.166.1.38,Jan 21 17:56:19 161.166.1.38 sshd[6855]: pam_l...
30006,161.166.1.38,Jan 21 17:56:24 161.166.1.38 sshd[6857]: Inval...
30007,161.166.1.38,Jan 21 17:56:31 161.166.1.38 sshd[6859]: Faile...


In [5]:
df["log"].iloc[500000]

'Jan 21 22:25:29 161.166.9.144 sshd[6993]: pam_unix(sshd:session): session closed for user XXXXX'

In [6]:
data_text = df[['log']].copy()  # take a copy so that adding a column does not trigger pandas' SettingWithCopyWarning
data_text["index"] = data_text.index
print(len(data_text))
print(data_text[:5])

1588052
                                                 log  index
0  Jan  5 03:23:54 161.166.1.23 sshd[27076]: Fail...      0
1  Jan  5 03:24:25 161.166.1.23 sshd[27087]: Disc...      1
2  Jan  5 03:24:27 161.166.1.23 sshd[27090]: pam_...      2
3  Jan  5 04:08:19 161.166.1.23 sshd[27584]: PAM ...      3
4  Jan  5 04:08:21 161.166.1.23 sshd[27590]: pam_...      4


## Regex

In this subsection we use Python's built-in regular expression module (re) to filter the text. It lets us search strings for user-defined patterns of letters, numbers, and/or special characters.

Below we define a function called regex that applies a series of regular expression substitutions. Most of them substitute a blank space; in the long run this does not matter, because the extra whitespace is discarded when we tokenise the data further down the document.

These substitutions were chosen after analysing the log files, though as previously mentioned several nonsense words have been left in the data due to their sheer number. Most of these are either technical terms generated by the computer or names of various companies/domains referenced in the logs. We acknowledge that failing to filter out all of these junk words may have had a negative effect on the overall performance of the models later on.

It is here that we filter out the anonymised usernames "XXXXX" that we mentioned in the start of the previous section.

In [7]:
import re
def regex(text):
    text = re.sub(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", " ", text)  # filters out IP addresses
    text = re.sub(r"[0-9]{2}:[0-9]{2}:[0-9]{2}", " ", text)  # filters out times
    text = re.sub(r"\d+", " ", text)  # filters out any remaining numbers
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)  # filters out punctuation
    text = re.sub(r"XXXXX", " ", text)  # filters out anonymised user
    text = re.sub(r"HHHHH", " ", text)  # filters out the HHHHH placeholder
    text = re.sub(r"sshd", " ", text)  # sshd comes up in every log and is not required
    text = re.sub(r"ruser", "user", text)  # map ruser, rhost and euid onto more general words
    text = re.sub(r"rhost", "host", text)
    text = re.sub(r"euid", "user", text)
    text = text.lower()
    return text

To illustrate what the function does, we define a test string (featuring fake snippets similar in form to those in the log files) and inspect the output to see what effect the function has.

In [8]:
text = "Bill said: hello, how are you doing?   112.116.2.45   114.117.45.234  ruser rhost 25.6.7   01:25:53  17:59:23 password failed for XXXXX on port: 73  ssh=tty  uid="
regex(text)

'bill said  hello  how are you doing           user host              password failed for   on port     ssh tty  uid '
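
One caveat of plain substring substitutions such as "ruser" or "sshd" is that they also match inside longer words. The sketch below is a hypothetical variant, not the function used in this report: it anchors the same four substitutions on word boundaries (\b). The names TOKEN_SUBS and substitute_tokens are assumptions introduced for illustration.

```python
import re

# Hypothetical variant of the word substitutions, anchored on word boundaries
# so that only whole words are replaced.
TOKEN_SUBS = [
    (re.compile(r"\bsshd\b"), " "),
    (re.compile(r"\bruser\b"), "user"),
    (re.compile(r"\brhost\b"), "host"),
    (re.compile(r"\beuid\b"), "user"),
]

def substitute_tokens(text):
    """Apply each substitution to whole words only."""
    for pattern, replacement in TOKEN_SUBS:
        text = pattern.sub(replacement, text)
    return text

# "superuser" contains the substring "ruser" but is left untouched here.
print(substitute_tokens("ruser rhost superuser"))  # user host superuser
# The unanchored pattern rewrites the longer word as well.
print(re.sub(r"ruser", "user", "superuser"))       # supeuser
```

Whether this matters in practice depends on how often such longer words occur in the logs, which we have not measured here.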

## Tokenising

Now we define the second of the two main functions needed for text cleaning. The function below tokenises the text so that each data point becomes a list of words for our model, drops tokens shorter than four characters, removes English stop words, and lemmatises the remaining tokens so that words differing only in their endings (e.g. failures and failure) are treated as one. Note that the WordNet lemmatiser treats every word as a noun unless told otherwise, so verb forms such as "failed" are left unchanged, as the outputs further down show.

In [9]:
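# The stop-word list, tokeniser models and WordNet data must first be fetched with nltk.download
en_stop = set(nltk.corpus.stopwords.words('english'))

lemmatizer = WordNetLemmatizer()  # create the lemmatiser once rather than on every call
def get_lemma(word):
    return lemmatizer.lemmatize(word)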

def tokenizer(text):
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if len(token) > 3] #filter out words that are shorter than 4 characters
    tokens = [token for token in tokens if token not in en_stop] #filter out stop words
    tokens = [get_lemma(token) for token in tokens]
    return tokens
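
To make the order of the filtering steps concrete, here is a minimal dependency-free sketch. It is a simplified stand-in rather than the function above: a whitespace split replaces nltk.word_tokenize, STOP_WORDS is a small hypothetical list instead of NLTK's English stop words, and lemmatisation is omitted.

```python
# Small hypothetical stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"from", "this", "that", "with"}

def simple_tokenizer(text, min_length=4):
    """Split on whitespace, then drop short tokens and stop words."""
    tokens = text.split()                                  # stand-in for nltk.word_tokenize
    tokens = [t for t in tokens if len(t) >= min_length]   # drop tokens shorter than 4 characters
    tokens = [t for t in tokens if t not in STOP_WORDS]    # drop stop words
    return tokens

print(simple_tokenizer("failed password for from port ssh"))
# ['failed', 'password', 'port']
```

Here "for" and "ssh" are removed by the length filter, while "from" survives it and is removed as a stop word.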

We define a function that is a combination of both above functions to apply to the text. 

In [10]:
def preprocess(text):
    complete = tokenizer(regex(text))
    return complete

We check the effect this function has on our strings by testing on a few randomly selected inputs from the dataset.

In [11]:
from gensim import parsing
doc_sample = data_text[data_text['index'] == 16].values[0][0]

print("original document: ")
print(doc_sample)
print("\n tokenized data: ")
print(tokenizer(doc_sample))
print("\n fully processed data: ")
print(preprocess(doc_sample))

original document: 
Jan  5 08:16:28 161.166.1.23 sshd[31213]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=216.186.0.161  user=XXXXX

 tokenized data: 
['08:16:28', '161.166.1.23', 'sshd', '31213', 'pam_unix', 'sshd', 'auth', 'authentication', 'failure', 'logname=', 'uid=0', 'euid=0', 'tty=ssh', 'ruser=', 'rhost=216.186.0.161', 'user=XXXXX']

 fully processed data: 
['unix', 'auth', 'authentication', 'failure', 'logname', 'user', 'user', 'host', 'user']


In [12]:
doc_sample = data_text[data_text['index'] == 18].values[0][0]

print("original document: ")
print(doc_sample)
print("\n tokenized data: ")
print(tokenizer(doc_sample))
print("\n fully processed data: ")
print(preprocess(doc_sample))

original document: 
Jan  5 08:16:48 161.166.1.23 sshd[31217]: Failed password for XXXXX from 216.186.0.161 port 2506 ssh2

 tokenized data: 
['08:16:48', '161.166.1.23', 'sshd', '31217', 'Failed', 'password', 'XXXXX', '216.186.0.161', 'port', '2506', 'ssh2']

 fully processed data: 
['failed', 'password', 'port']


In [13]:
doc_sample = data_text[data_text['index'] == 270564].values[0][0]

print("original document: ")
print(doc_sample)
print("\n tokenized data: ")
print(tokenizer(doc_sample))
print("\n fully processed data: ")
print(preprocess(doc_sample))

original document: 
Feb 18 15:04:46 161.166.5.240 sshd[21299]: error: PAM: Authentication failure for XXXXX from 82.122.154.220

 tokenized data: 
['15:04:46', '161.166.5.240', 'sshd', '21299', 'error', 'Authentication', 'failure', 'XXXXX', '82.122.154.220']

 fully processed data: 
['error', 'authentication', 'failure']


In [14]:
testvar = preprocess(doc_sample)
print(type(testvar))
print(type(testvar[1]))

<class 'list'>
<class 'str'>


Here we quickly test and compare what happens if we apply this function to the first 20 values in our dataset, and find the results that we expect to see.

In [15]:
processed_docs = data_text['log'][:20].map(preprocess)

In [16]:
data_text["log"][18]

'Jan  5 08:16:48 161.166.1.23 sshd[31217]: Failed password for XXXXX from 216.186.0.161 port 2506 ssh2'

In [17]:
processed_docs[18]

['failed', 'password', 'port']

Now we apply the pre-processing to the whole dataset, first recording its length so that we can confirm the processed data has the same length.

In [18]:
len(data_text)

1588052

In [19]:
processed = data_text['log'].map(preprocess)

In [20]:
processed

0                                   [failed, password, port]
1             [disconnecting, many, authentication, failure]
2          [unix, auth, authentication, failure, logname,...
3                               [service, ignoring, retries]
4          [unix, auth, authentication, failure, logname,...
                                 ...                        
1588047                  [user, allowed, listed, allowusers]
1588048                                      [invalid, user]
1588049                                      [invalid, user]
1588050                                      [invalid, user]
1588051                                      [invalid, user]
Name: log, Length: 1588052, dtype: object

From the snippets we can see in the above data it seems that our processing has been quite successful, reducing and cleaning the text data considerably.

## Corpus and Dictionary creation

We now generate a dictionary and corpus for use in our topic models. We generate them here and write them to Python pickle files so that we do not have to recompute them repeatedly, and so that every model uses the same corpus, which makes our comparisons more accurate. We start by generating a dictionary from the processed data above.

In [21]:
dictionary = gensim.corpora.Dictionary(processed)

count = 0
for k, v in dictionary.items():  # items() comes from the standard mapping interface
    print(k, v)
    count += 1
    if count > 10:
        break
print(len(dictionary))

0 failed
1 password
2 port
3 authentication
4 disconnecting
5 failure
6 many
7 auth
8 host
9 logname
10 unix
613
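
For readers unfamiliar with the bag-of-words format, the following pure-Python sketch illustrates the idea behind a dictionary and a corpus: each token is given an integer id, and each document becomes a list of (id, count) pairs. It is an illustration only, not gensim's implementation; build_token_ids and to_bow are hypothetical helpers, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_token_ids(docs):
    """Give every distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [
    ["failed", "password", "port"],
    ["invalid", "user"],
    ["failed", "password", "failed"],
]
token2id = build_token_ids(docs)
print(token2id)                   # {'failed': 0, 'password': 1, 'port': 2, 'invalid': 3, 'user': 4}
print(to_bow(docs[2], token2id))  # [(0, 2), (1, 1)]
```

Tokens that are missing from the mapping are skipped, which is why filtering the dictionary before building the corpus shrinks each document's representation.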


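It is important to filter out extremes in the dictionary. Here we remove any term that appears in fewer than 10 documents (no_below=10) or in more than 50% of documents (no_above=0.5), and keep at most the 100,000 most frequent of the remaining terms (keep_n=100000). This cuts down the number of junk terms considerably, reducing the length of the dictionary from 613 to 350.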

In [22]:
import copy
dictionary2 = copy.deepcopy(dictionary)  # filter_extremes works in place, so copy to keep the original intact
dictionary2.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)
len(dictionary2)

350
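
The document-frequency thresholds can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that a term is kept when the number of documents containing it lies between no_below and no_above times the number of documents; it ignores keep_n, and filter_by_document_frequency is a hypothetical helper rather than part of gensim.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within the given bounds."""
    num_docs = len(docs)
    # Count each token once per document, however often it occurs within it.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {
        token
        for token, freq in doc_freq.items()
        if no_below <= freq <= no_above * num_docs
    }

docs = [
    ["invalid", "user"],
    ["invalid", "user"],
    ["failed", "password", "user"],
    ["failed", "password", "port"],
]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['failed', 'invalid', 'password']
```

In this toy example "port" is dropped for being too rare (one document) and "user" for being too common (three of four documents).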

We now write the pickles. The script below first checks whether a corpus pickle already exists in the correct folder of the GitHub repository. If not, it generates the corpus and dumps both the corpus and dictionary pickles to the folder on my computer that corresponds to the GitHub Desktop repository. We then push the generated files so that they can be accessed from anywhere with an internet connection.

In [23]:
# Local clone of the repository; adjust this path for your own machine.
out_dir = 'G:/Users/Gabriel/Documents/Education/UoB/GitHubDesktop/DST-Assessment-03/data/main/'
try:
    print("Attempting to read corpus from pickle")
    with gzip.open(urlopen('https://github.com/Galeforse/DST-Assessment-03/raw/master/Data/main/corpus_10.pkl.gz'), 'rb') as fp:
        corpus = pickle.load(fp)
    print("Corpus read from pickle.")
except HTTPError as err:
    if err.code == 404:
        print("Pickle not found, creating corpus and saving to pickle")
        start = dt.datetime.now()
        corpus = [dictionary2.doc2bow(doc) for doc in processed]
        # Save the corpus both gzip-compressed and uncompressed, plus the dictionary.
        # The with blocks make sure every file is closed after writing.
        with gzip.open(out_dir + 'corpus_10.pkl.gz', 'wb') as fp:
            pickle.dump(corpus, fp)
        with open(out_dir + 'corpus_10.pkl', 'wb') as fp:
            pickle.dump(corpus, fp)
        with open(out_dir + 'dictionary_10.pkl', 'wb') as fp:
            pickle.dump(dictionary2, fp)
        print("Pickle saved. Time taken: " + str(dt.datetime.now() - start))
    else:
        raise

Attempting to read corpus from pickle
Pickle not found, creating corpus and saving to pickle
Pickle saved. Time taken: 0:00:17.875929
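
As a small self-contained illustration of the save and load pattern, the sketch below writes a toy corpus to a gzip-compressed pickle in a temporary directory and reads it back. The helper names save_pickle_gz and load_pickle_gz and the file name corpus_example.pkl.gz are assumptions made for this example only.

```python
import gzip
import os
import pickle
import tempfile

def save_pickle_gz(obj, path):
    """Pickle obj into a gzip-compressed file."""
    with gzip.open(path, "wb") as fp:
        pickle.dump(obj, fp)

def load_pickle_gz(path):
    """Load an object from a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as fp:
        return pickle.load(fp)

# Toy corpus in bag-of-words form: one list of (token_id, count) pairs per document.
corpus_example = [[(0, 1), (1, 1)], [(2, 2)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus_example.pkl.gz")
    save_pickle_gz(corpus_example, path)
    restored = load_pickle_gz(path)

print(restored == corpus_example)  # True
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.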


Now that these pickles have been generated, they can be imported by the models that follow in the rest of the report, without each file and user having to repeat the data processing themselves (which can be computationally expensive).