# Hacker log analysis project

## OVERVIEW

> Analyzing textual log data from an online chat forum related to the 
Anonymous hacktivist group. Applying regular expressions to summarize log data, 
quantify text data, and summarize time trends.

### Objectives

1. Find and list the URLs posted in the chat.

2. Find the most common words.

3. Find and rank (by count) words not in an English dictionary. This is a simple method 
   that can identify some names of malware tools.
   
4. Count the total number of written messages (only those with actual text content) and
   summarize the users that posted the most messages.
   
5. Which hours of the day had the most messages? Which days had the most traffic (or 
   messages)?
   
6. Many users log in and view the chat without commenting. Which users spent the most time 
   in the logs? Which users logged in the most?
   

In [1]:
# Import the libraries needed for EDA (exploratory data analysis).
# Regular expressions are used to search the log text.
import pandas as pd
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot
import numpy as np
import re  # regular expressions

In [5]:
# Read the log in binary mode ('rb'), then decode the bytes to a string.
with open('hackers.log','rb') as fp:
    data = fp.read()
new_data = data.decode(encoding='latin1')  # UTF-8 decoding fails on some files.
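
The choice of `latin1` can be illustrated with a minimal sketch. The sample bytes below are hypothetical and not taken from `hackers.log`; they only show why Latin-1 decoding never raises while strict UTF-8 can, and what a lossy UTF-8 alternative would look like.

```python
# Hypothetical sample: 0xe9 is "é" in Latin-1 but an invalid lone byte in UTF-8.
raw = b"00:25 < ice231> caf\xe9 time\n"

# Latin-1 maps every one of the 256 byte values to a character, so this never raises.
text_latin1 = raw.decode("latin1")

# Strict UTF-8 decoding rejects the same bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Lossy alternative: stay with UTF-8 and substitute U+FFFD for undecodable bytes.
text_utf8 = raw.decode("utf-8", errors="replace")

print(repr(text_latin1), utf8_ok, repr(text_utf8))
```

Latin-1 keeps every byte recoverable, at the cost of showing wrong characters for text that was really UTF-8. For an analysis driven by ASCII patterns such as timestamps, nicknames, and URLs, that trade-off is usually acceptable.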

In [7]:
print(new_data[:1000])
# Preview of the first 1000 characters of the log.

--- Log opened Tue Sep 20 00:01:49 2016
00:01 -!- Guest40341 [AndChat2541@AN-pl0gl1.8e2d.64f9.r226rd.IP] has quit [Quit: Bye]
00:11 -!- peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined #hackers
00:14 -!- Gilgamesh [Gilgamesh@AN-nkf.mv0.se355c.IP] has joined #hackers
00:15 -!- _CyBruh_ [-Cybruh@AN-gm6.oj9.rj1tv4.IP] has quit [Quit: Leaving]
00:20 -!- peejr [peeejr@AN-sru.3ib.ec0efc.IP] has quit [Quit: Leaving]
00:25 < ice231> anyone good with exploiting cisco asa with extrabacon?
00:27 < ice231> we need help with an op but were stuck at this one part
00:27 -!- Bobseviltwin [steven@stupid.hunkey.monkey] has quit [Ping timeout: 121 seconds]
00:30 -!- Gilgamesh [Gilgamesh@AN-nkf.mv0.se355c.IP] has quit [Quit: Leaving]
00:30 -!- peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined #hackers
00:34 -!- peejr [peeejr@AN-sru.3ib.ec0efc.IP] has quit [Quit: Leaving]
00:34 -!- peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined #hackers
00:35 -!- Anonymous5 [Anonymous5@AN-tu4.e85.r2ddjo.IP] has joined #hackers
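
The excerpt shows three kinds of lines: a session header (`--- Log opened ...`), server events (`-!-` join and quit notices), and chat messages (`< nick> text`). As a rough sketch, a line classifier could look like the following. The names `MESSAGE`, `EVENT`, and `classify` are hypothetical helpers, and the patterns are assumptions based only on the lines visible above.

```python
import re

# Sample lines copied from the log excerpt above.
sample = """\
--- Log opened Tue Sep 20 00:01:49 2016
00:11 -!- peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined #hackers
00:15 -!- _CyBruh_ [-Cybruh@AN-gm6.oj9.rj1tv4.IP] has quit [Quit: Leaving]
00:25 < ice231> anyone good with exploiting cisco asa with extrabacon?
"""

# Chat message: "HH:MM <nick> text". The first character inside <> may be a
# space or a mode prefix such as "+" or "@".
MESSAGE = re.compile(r"^(\d{2}:\d{2}) <(.+?)> (.*)$")
# Server event: "HH:MM -!- nick [host] has joined ..." or "... has quit ...".
EVENT = re.compile(r"^(\d{2}:\d{2}) -!- (\S+) \[[^\]]*\] has (joined|quit)\b")


def classify(line):
    """Return (kind, time, nick, payload) for one log line."""
    message = MESSAGE.match(line)
    if message:
        return ("message", message.group(1), message.group(2).strip(), message.group(3))
    event = EVENT.match(line)
    if event:
        return ("event", event.group(1), event.group(2), event.group(3))
    return ("other", None, None, line)


records = [classify(line) for line in sample.splitlines()]
for record in records:
    print(record)
```

Splitting the log into typed records up front keeps the later questions separate: URLs and word counts only need the `message` records, while login counts only need the `event` records.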


In [None]:
# Printing a large sample of the data can exceed Jupyter's output rate limit.
# To raise the limit, run this command in a Jupyter cell:
# !jupyter notebook --NotebookApp.iopub_data_rate_limit=1e+15

`1. Find and list the URLs posted in the chat.`

In [47]:
# First match a chat line with <.+>, then look for a URL, which may start with http, https, www, or ftp.

pattern = re.compile(r"<.+>.+((https|http|www|ftp)[A-Za-z0-9-._~/:?#\[\]@!$&'\(\)*+,;=]{9,})\s")
result = pattern.findall(new_data)  # list of tuples, one per match
result[0][0]  # the URL is the first element of each tuple

'http://i.imgur.com/PoCjYqQ.png'

In [48]:
# Number of URLs found
len(result)

4622
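
One property of the pattern above is worth noting. Because the `.+` before the URL group is greedy, the group binds to the last position in a line where a URL prefix can match. A line therefore appears to yield at most one URL, and an `http://www...` link may be captured from `www` onward. The sketch below shows one possible alternative that collects every URL in a chat line. The sample lines and the simplified `URL` pattern are assumptions, not the notebook's own method, and the sketch would not necessarily reproduce the count above.

```python
import re

# Hypothetical chat lines; the second one carries two URLs.
lines = [
    "00:25 < ice231> see http://pastebin.com/iuE1sEZq for details",
    "00:27 <+nemecy> https://ghostbin.com/paste/r6mte and www.tibia.com both work",
    "00:30 -!- peejr [peeejr@AN-sru.3ib.ec0efc.IP] has joined #hackers",
]

# Only chat lines ("HH:MM <nick> text") are searched for URLs.
CHAT = re.compile(r"^\d{2}:\d{2} <[^>]+> (.*)$")
# Simplified URL pattern: a scheme or "www.", followed by non-space characters.
# No capture groups, so findall() returns the full matches as plain strings.
URL = re.compile(r"(?:https?://|ftp://|www\.)\S+")

urls = []
for line in lines:
    chat = CHAT.match(line)
    if chat:
        urls.extend(URL.findall(chat.group(1)))

print(urls)
```

Matching the chat prefix and the URLs in two separate steps avoids the interaction between the greedy `.+` and the URL group.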

In [10]:
# Each match is a tuple of two groups: the full URL and its leading prefix (http, www, ...).
# Print only the first 21 URLs to keep the output short.
for url, _prefix in result[:21]:
    print(url)

http://i.imgur.com/PoCjYqQ.png
https://vid.pr0gramm.com/2015/08/28/8a9af1793785d29e.mp4
www.ismoman.com
www.ismoman.com:2083
www.ismoman.com/wp-content/themes/ism/
http://pastebin.com/iuE1sEZq
http://monkey.org/~dugsong/dsniff/
http://monkey.org/~dugsong/dsniff/faq.html
https://newblood.anonops.com/vpn.html
https://ghostbin.com/paste/r6mte
https://github.com/Netsukuku/netsukuku
www.nbrri.gov.ng
www.tibia.com
www.sunnieday.be
www.youtube.com/watch?v=IdKKCJk0w2E
www.youtube.com/channel/UCz6mEi8mD55SHnQW08uNKAg
www.facebook.com/jamal.oubram.7?ref=br_rs
https://paper.li/f-1471864321#!headlines
www.youtube.com/watch?v=tWdgAMYjYSs
https://en.wikipedia.org/wiki/Max_Headroom_broadcast_signal_intrusion
https://youtu.be/ThxmcVbV1ZI


In [14]:
# The URLs can also be appended to a log/text file.
with open("urls.log","a") as fp:
    for url, _prefix in result:
        fp.write(f"{url}\n")


`2. Find the most common words.`

In [16]:
pat = r"<.+> ([a-zA-Z0-9 !*&^%$#@()-_=+:;'\"\?\/\\<>.]+)\n"
answer = re.findall(pat,new_data,re.M|re.I)
# answer holds the text of each chat message (lines of the form "<user> text").

In [17]:
# Split each message into words (skipping tokens longer than 20 characters), strip
# surrounding special characters, and collect the results in a list of words.
words = []
for line in answer:
    words += [word.strip("?!@#$%^&></'\"\\*(,).`~=+-_|") for word in line.split() if len(word)<=20]

In [18]:
#making a data frame of words
df = pd.DataFrame(words,columns=["words"])
df

Unnamed: 0,words
0,anyone
1,good
2,with
3,exploiting
4,cisco
...,...
1553353,wreckless
1553354,alpha
1553355,code
1553356,out


In [21]:
df.value_counts()  # The most frequent entry is the empty string, so it needs to be removed.

words      
               41039
a              39548
to             38476
and            35262
the            34382
               ...  
confoos            1
confortable        1
confounded         1
confrences         1
zzzzzzzzz          1
Length: 59290, dtype: int64

In [22]:
new_df = df[df["words"]!=""]  # Keep only the rows whose word is not an empty string.

Most common words:

In [37]:
new_df.value_counts().head(10)

words
a        39548
to       38476
and      35262
the      34382
you      23552
is       20233
of       20010
i        19806
for      19083
it       18589
dtype: int64
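
The top ten are all common function words, which says little about the chat's content. A minimal sketch of one way to look past them is to filter with a stop-word set before counting. The `tokens` list and the `STOP_WORDS` set below are hypothetical stand-ins; a real stop-word list would be much longer.

```python
from collections import Counter

# Hypothetical cleaned tokens standing in for the notebook's `words` list.
tokens = ["the", "exploit", "a", "to", "exploit", "vpn", "the", "", "vpn", "exploit"]

# Small illustrative stop-word set (an assumption, seeded with the top ten above).
STOP_WORDS = {"a", "to", "and", "the", "you", "is", "of", "i", "for", "it"}

# Skip empty strings and stop words, and count case-insensitively.
counts = Counter(
    token.lower()
    for token in tokens
    if token and token.lower() not in STOP_WORDS
)

top = counts.most_common(2)
print(top)
```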

`3. Find and rank (by count) words not in an English dictionary. This is a simple method 
   that can identify some names of malware tools`

In [None]:
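# Remove the links present in the word list. Build a new list instead of calling
# words.remove() inside the loop, which skips elements while iterating.
# ("https" is already covered by the "http" check.)
url_markers = ("http", "www", "ftp")
words = [word for word in words if not any(marker in word for marker in url_markers)]

In [None]:
# new_df was built before the links were removed, so rebuild it from the filtered
# list, then show the 60 least frequent words.
new_df = pd.DataFrame([word for word in words if word], columns=["words"])
new_df.value_counts(ascending=True)[:60]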
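
Listing the least frequent words does not by itself test dictionary membership. The sketch below shows one way the objective could be approached: count only the words that are missing from an English word set and rank them by frequency. The `tokens` list and the tiny `english` set are hypothetical. In practice the set would be loaded from a word list file (for example `/usr/share/dict/words`, which exists on many Unix systems but is an assumption here).

```python
from collections import Counter

# Hypothetical tokens standing in for the cleaned `words` list.
tokens = ["extrabacon", "with", "dsniff", "Help", "extrabacon", "dsniff", "extrabacon"]

# Tiny stand-in for an English dictionary; a real one would hold many thousands of words.
english = {"with", "help", "anyone", "good"}

# Keep purely alphabetic tokens that are absent from the dictionary (case-insensitive).
unknown = Counter(
    token.lower()
    for token in tokens
    if token.isalpha() and token.lower() not in english
)

# Rank by count, most frequent first.
ranked = unknown.most_common()
print(ranked)
```

Ranking by descending count puts recurring unknown terms such as tool names at the top, while one-off typos sink to the bottom.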

`4. Count the total number of written messages (only those with actual text content) and
   summarize the users that posted the most messages.`
   

In [49]:
pat = r"<(.+)> ([a-zA-Z0-9 !*&^%$#@()-_=+:;'\"\?\/\\<>.]+)\n"
result = re.findall(pat,new_data,re.M|re.I)

In [50]:
df = pd.DataFrame(result,columns=["user","message"])
df

Unnamed: 0,user,message
0,ice231,anyone good with exploiting cisco asa with ext...
1,ice231,we need help with an op but were stuck at this...
2,HeavenGuard,hello?
3,+nemecy,hi
4,ice231,hi
...,...,...
227346,hypnotic,I do not understand that... I understand the p...
227347,hypnotic,ah nevermind
227348,+Cogitabundus,Some like chaos.
227349,+Cogitabundus,Oh they left.


In [57]:
# The log contains bot messages (from +evilbot); find their row indices so they can be dropped.
index = df[df["user"]=="+evilbot"].index

In [59]:
df.drop(index,axis=0,inplace=True)

`1. Number of written messages`

In [61]:
df['message'].count()

196711

`2. Users that posted the most messages`

In [66]:
c = df["user"].value_counts().head(10)  # top ten posters

In [79]:
d=pd.DataFrame(c).reset_index()

In [80]:
d.columns=['user','count']

In [81]:
d

Unnamed: 0,user,count
0,@guapo,8075
1,sTrikEforCe,5269
2,@BOFH,4413
3,lazarus,3885
4,DeTH,2975
5,catface,2927
6,dd_,2906
7,+Cogitabundus,2572
8,ubiquitous,2529
9,maxmuster,2236
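
Several names in the table carry an IRC mode prefix (`@` for operators, `+` for voiced users). The `<(.+)>` group also captures the padding space that the sample log lines place before unprefixed nicknames, as in `< ice231>`. If the same person appears both with and without a prefix, which is an assumption not verified against this log, their messages would be split across several rows. A minimal sketch of normalizing nicknames before counting follows. The `rows`, `BOTS`, and `normalize` names are hypothetical.

```python
from collections import Counter

# Hypothetical (user, message) pairs as captured by the "<(.+)>" group.
rows = [
    (" ice231", "hi"),
    ("+nemecy", "hi"),
    ("@guapo", "yo"),
    (" guapo", "back"),
    ("+evilbot", "title: ..."),
]

# Bot nicknames to exclude, written without their mode prefix.
BOTS = {"evilbot"}


def normalize(nick):
    """Strip the padding space and common IRC mode prefixes from a nickname."""
    return nick.lstrip(" @+%~&")


counts = Counter(
    normalize(user) for user, _message in rows if normalize(user) not in BOTS
)

print(counts.most_common(3))
```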
