# Finding Outliers In A Stream of Logs?

## Context
What is the best way to find a needle in a haystack? Probably by using a magnet... But how about identifying suspicious users in gigabytes of logs?

Before even starting the search, step 1 is to define what it means to be suspicious.

    Suspicious usually means standing out from what is normal.
    
Ok, step 2: what is normal?

    Normal behaviour consists of expected usage. Anything outside of this can be considered abnormal.

... This does not get us much further, because we now need to go over each action, decide whether it is normal or not, and assign it a score.

For example, a user who logs in once without any mistake conveys less useful information than someone who fails to log in 10 times and then logs in once.

A simple rule-based system could easily be configured to trigger on this pattern. For small or simple systems, this is likely to be enough, and the number of scenarios should be small enough to write a few rules by hand.
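
As an illustration of such a rule, here is a minimal sketch that flags users who log in successfully right after a run of failed attempts. The function name, the `(user, path, status)` tuple layout and the default threshold of 10 are assumptions made for this example, not part of any existing system.

```python
from collections import defaultdict

def failed_then_success(events, threshold=10):
    """Return the users who log in successfully right after `threshold`
    or more consecutive failed logins.

    `events` is an iterable of (user, path, status) tuples in time order.
    """
    streak = defaultdict(int)  # current run of failed logins, per user
    flagged = set()
    for user, path, status in events:
        if path != "login":
            continue  # this rule only looks at login events
        if status == "fail":
            streak[user] += 1
        else:
            if streak[user] >= threshold:
                flagged.add(user)
            streak[user] = 0  # a successful login ends the run
    return flagged

# Made-up events: mallory fails 10 times then succeeds, alice fails once.
events = [("mallory", "login", "fail")] * 10 + [
    ("mallory", "login", "success"),
    ("alice", "login", "fail"),
    ("alice", "login", "success"),
]
print(failed_then_success(events))  # {'mallory'}
```

Every new scenario needs a new hand-written rule of this kind, which is why the approach stops scaling once the number of behaviours grows.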

The challenge is that we need to identify these behaviours.

One hypothesis would be to use Hidden Markov Models (HMMs), but to use an HMM, series of actions need to be classified first. On top of this, logs are noisy: users will not simply go from A to B and then to C. They might crawl the whole site before reaching C, which complicates building the HMM.

We could filter out the actions that can be considered noise. But which ones are they?

## The Theory, Based on Information Theory

I will use a concept from information theory called [Surprisal Analysis](https://en.wikipedia.org/wiki/Surprisal_analysis). See [Information Content](https://en.wikipedia.org/wiki/Information_content) for more details.


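The surprisal of an event measures how unexpected it is: the lower its probability, the higher its surprisal. Wikipedia describes surprisal analysis as: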

    ... A method of information quantification and compaction, providing an unbiased characterization of systems. Surprisal analysis is particularly useful to characterize and understand dynamics in small systems, where energy fluxes otherwise negligible in large systems, heavily influence system behavior.
    
[Example of Surprisal](http://www.umsl.edu/~fraundorfp/egsurpri.html)

This is what this notebook intends to demonstrate: using surprisal analysis, we will assign a score to each action and, by adding up the scores, identify series of actions that are unlikely to occur.
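
To make the scoring idea concrete, here is a minimal sketch that assumes the usual definition of surprisal, -log2(p), measured in bits. The probabilities in `action_probability` are made up for the example and do not come from any logs.

```python
from math import log2

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -log2(p)

# Hypothetical probabilities for a few actions (illustration only;
# other actions would account for the remaining probability mass).
action_probability = {
    "login:success": 0.5,
    "view_item:success": 0.25,
    "login:fail": 0.125,
}

def sequence_score(actions):
    """Sum of the surprisal of every action in a sequence."""
    return sum(surprisal(action_probability[a]) for a in actions)

print(surprisal(0.5))    # 1.0
print(surprisal(0.125))  # 3.0
# Three failed logins followed by a success: 3 + 3 + 3 + 1 bits.
print(sequence_score(["login:fail"] * 3 + ["login:success"]))  # 10.0
```

Rare actions add many bits and common actions add few, so a long run of unlikely actions quickly produces a high total.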

# But first, we need some logs

Logs are what we want to analyse, but we need to generate them first.

Let's say that we have six user profiles:

Normal users:
* Buyers
* Merchants

Abnormal users:
* Scraper bots
* Spammers
* Fraudsters
* Account attackers

Buyers and merchants represent about 99% of our users (0.98 of a total weight of 0.99). Bots, spammers, fraudsters and attackers together represent the remaining 1% or so.

The following block configures these profiles.

In [1]:
user_distribution = {
    "buyer": 0.49, "merchant": 0.49, "bot": 0.003, "spammer": 0.003, "fraudster": 0.002, "attacker": 0.002 # relative weights (fractions, not percentages)
}

user_velocity = {
    "buyer": 40, "merchant": 50, "bot": 5, "spammer": 10, "fraudster": 15, "attacker": 5 # maximum seconds between two actions
}

user_start_action = {
    "buyer": 'home', "merchant": 'home', "bot": 'home', "spammer": 'home', "fraudster": 'home', "attacker": 'login:fail' # state each profile starts from
}


legal_actions = {
    'home',
    'login:success', 'login:fail', 'password_reset',
    'logout',
    'buy_item:success', 'buy_item:fail',
    'view_item:success', 'view_item:fail',
    'sell_item:success', 'sell_item:fail',
    'view_profile', 
    'update_email:success', 'update_email:fail',
    'update_address:success', 'update_address:fail',
    'payment_modify:success', 'payment_modify:fail',
    'bank_modify:success', 'bank_modify:fail',
    'withdraw_income:success', 'withdraw_income:fail',
    'comment:success', 'comment:fail',
    'end'
}


user_profile = {
    "buyer": {
        "home": { "login:success": 0.919, "login:fail": 0.03, "password_reset": 0.05, "end": 0.001},
        "login:success": { "view_item:success": 0.978, "view_profile": 0.001, "buy_item:success":0.02, "buy_item:fail": 0.001},
        "login:fail": {"login:success": 0.9, "login:fail": 0.08, "password_reset": 0.02},
        "password_reset": {"login:success": 0.9, "login:fail": 0.09, "end": 0.01},
        "logout": {"end": 0.99, "home": 0.01},
        "view_item:success": {"comment:success": 0.05, "view_item:success": 0.65, "buy_item:success": 0.299, "buy_item:fail": 0.001},
        "buy_item:success": {"view_item:success": 0.409, "buy_item:success": 0.2, "buy_item:fail": 0.001, "logout": 0.29, "end": 0.1},
        "buy_item:fail": {"buy_item:fail": 0.01, "view_profile": 0.2, "payment_modify:success": 0.59, "payment_modify:fail": 0.1, "logout":0.05, "end": 0.05},
        "view_profile": { "update_email:success": 0.1, "update_email:fail": 0.05, 
                         "update_address:success": 0.2, "update_address:fail": 0.05,
                         "payment_modify:success": 0.1, "payment_modify:fail": 0.05,
                         "view_profile": 0.05, "view_item:success": 0.4},
        "update_email:success": {"view_profile":1},
        "update_email:fail": {"update_email:success": 0.9, "update_email:fail": 0.01, "view_profile":0.09},
        "update_address:success": {"view_profile":1},
        "update_address:fail": {"update_address:success": 0.9, "update_address:fail": 0.01, "view_profile":0.09},
        "payment_modify:success": {"view_profile":1},
        "payment_modify:fail": {"payment_modify:success": 0.9, "payment_modify:fail": 0.01, "view_profile":0.09},
        "comment:success": {"view_item:success": 0.6, "buy_item:success": 0.399, "buy_item:fail": 0.001},
        "end": {}
    },
    "merchant": {
        "home": { "login:success": 0.92, "login:fail": 0.029, "password_reset": 0.05, "end": 0.001},
        "login:success": { "view_item:success": 0.978, "view_profile": 0.001, "sell_item:success":0.02, "sell_item:fail": 0.001},
        "login:fail": {"login:success": 0.9, "login:fail": 0.08, "password_reset": 0.02},
        "password_reset": {"login:success": 0.9, "login:fail": 0.09, "end": 0.01},
        "logout": {"end": 0.99, "home": 0.01},
        "view_item:success": {"view_item:success": 0.4, "sell_item:success": 0.599, "sell_item:fail": 0.001},
        "sell_item:success": {"view_item:success": 0.4, "sell_item:success": 0.399, "sell_item:fail": 0.001, "logout": 0.1, "end": 0.1},
        "sell_item:fail": {"sell_item:fail": 0.1, "view_profile": 0.2, "bank_modify:success": 0.5, "bank_modify:fail": 0.1, "logout":0.05, "end": 0.05},
        "view_profile": { "update_email:success": 0.1, "update_email:fail": 0.05, 
                         "update_address:success": 0.2, "update_address:fail": 0.05,
                         "bank_modify:success": 0.1, "bank_modify:fail": 0.05,
                         "view_profile": 0.05, "view_item:success": 0.4},
        "update_email:success": {"view_profile":1},
        "update_email:fail": {"update_email:success": 0.9, "update_email:fail": 0.01, "view_profile":0.09},
        "update_address:success": {"view_profile":1},
        "update_address:fail": {"update_address:success": 0.9, "update_address:fail": 0.01, "view_profile":0.09},
        "bank_modify:success": {"view_profile":1},
        "bank_modify:fail": {"bank_modify:success": 0.9, "bank_modify:fail": 0.01, "view_profile":0.09},
        "end": {}
    },
    "bot": {
        "home": {"login:success": 0.9, "login:fail": 0.1},
        "login:success": {"view_item:success": 0.95, "view_item:fail": 0.05},
        "login:fail": {"login:success": 0.29, "login:fail": 0.7, "end": 0.01},
        "logout": {"end": 0.2, "home": 0.8},
        "view_item:success": {"view_item:success": 0.9, "view_item:fail": 0.1},
        "view_item:fail": {"view_item:success": 0.29, "view_item:fail": 0.7, "end": 0.01},
        "end": {}
    },
    "spammer": {
        "home": { "login:success": 0.6, "login:fail": 0.3,"password_reset": 0.05, "end": 0.05},
        "login:success": { "view_item:success": 0.9, "view_profile": 0.1},
        "login:fail": {"login:success": 0.7, "login:fail": 0.2, "password_reset": 0.1},
        "password_reset": {"login:success": 0.8, "login:fail": 0.1, "end": 0.1},
        "logout": {"end": 0.9, "home": 0.1},
        "view_item:success": {"comment:success": 0.5, "view_item:success": 0.4, "view_item:fail": 0.1},
        "view_item:fail": {"view_item:success": 0.19, "view_item:fail": 0.7, "logout": 0.1, "end": 0.01},
        "view_profile": { "update_email:success": 0.3, "update_email:fail": 0.1, 
                         "view_profile": 0.2, "view_item:success": 0.4},
        "update_email:success": {"view_profile":1},
        "update_email:fail": {"update_email:success": 0.5, "update_email:fail": 0.4, "view_profile":0.1},
        "update_address:success": {"view_profile":1},
        "comment:success": {"view_item:success": 0.9, "logout": 0.05, "end": 0.05},
        "end": {}
    },
    "fraudster": {
        "home": { "login:success": 0.5, "login:fail": 0.45, "end": 0.05},
        "login:success": { "view_profile": 0.7, "logout":0.2, "end": 0.1},
        "login:fail": {"login:success": 0.7, "login:fail": 0.2, "end": 0.1},
        "logout": {"end": 0.9, "home": 0.1},
        "view_item:success": {"view_item:success": 0.6, "buy_item:success": 0.35, "buy_item:fail": 0.05},
        "buy_item:success": {"view_item:success": 0.3, "buy_item:success": 0.4, "buy_item:fail": 0.1, "logout": 0.1, "end": 0.1},
        "buy_item:fail": {"buy_item:fail": 0.1, "view_profile": 0.2, "payment_modify:success": 0.3, "payment_modify:fail": 0.3, "logout":0.05, "end": 0.05},
        "view_profile": { "update_email:success": 0.05, "update_email:fail": 0.05, 
                         "update_address:success": 0.2, "update_address:fail": 0.05,
                         "payment_modify:success": 0.2, "payment_modify:fail": 0.2,
                         "view_profile": 0.05, "buy_item:success": 0.2},
        "update_email:success": {"buy_item:success":0.9, "view_profile": 0.1},
        "update_email:fail": {"update_email:success": 0.5, "update_email:fail": 0.4, "view_profile":0.1},
        "update_address:success": {"buy_item:success":0.9, "view_profile": 0.1},
        "update_address:fail": {"update_address:success": 0.6, "update_address:fail": 0.3, "view_profile":0.1},
        "payment_modify:success": {"buy_item:success":0.9, "view_profile": 0.1},
        "payment_modify:fail": {"payment_modify:success": 0.5, "payment_modify:fail": 0.4, "view_profile":0.1},
        "end": {}
    },
    "attacker": {
        "home": {"login:success": 0.05, "login:fail": 0.85, "end": 0.1},
        "login:success": { "logout": 0.95, "end": 0.05 },
        "login:fail": {"login:success": 0.05, "login:fail": 0.85, "end": 0.1},
        "logout": {"end": 0.1, "home": 0.9},
        "end": {}
    }
}

# Sanity check: report any state whose outgoing probabilities do not sum to 1.
for role, transitions in user_profile.items():
    for action, followers in transitions.items():
        total = sum(followers.values())
        if total > 0 and round(total, 4) != 1:
            print(role, action, total, 1 - total)
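
The check above only verifies that probabilities sum to 1; `legal_actions` is defined but never consulted. A complementary check could also confirm that every state and every transition target is a known action. The sketch below assumes a hypothetical helper named `validate_profiles` and runs on toy data so that it is self-contained; it could be called as `validate_profiles(user_profile, legal_actions)` in this notebook.

```python
def validate_profiles(profiles, legal):
    """Return a list of human-readable problems found in the transition tables."""
    problems = []
    for role, transitions in profiles.items():
        for state, followers in transitions.items():
            if state not in legal:
                problems.append(f"{role}: unknown state {state!r}")
            for target in followers:
                if target not in legal:
                    problems.append(
                        f"{role}: {state!r} leads to unknown action {target!r}")
                elif target not in transitions:
                    problems.append(
                        f"{role}: {state!r} leads to {target!r}, "
                        "which has no transitions defined")
    return problems

# Toy data for illustration only.
toy_legal = {"home", "login:success", "end"}
toy_profiles = {
    "buyer": {
        "home": {"login:success": 0.9, "logn:fail": 0.1},  # typo on purpose
        "login:success": {"end": 1.0},
        "end": {},
    }
}
for problem in validate_profiles(toy_profiles, toy_legal):
    print(problem)
```

A typo in an action name would otherwise go unnoticed until the generator reaches that state.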

Using these user profiles, we can generate a simulated stream of logs for day 1.

In [2]:
%%time
import random
import numpy as np
from datetime import datetime, date, time, timedelta

random_seed = 42

start_time = datetime(2019,1,1,0,0)

user_lookup = {}

def generate_userlist(nb_users_for_the_day):
    random.seed(random_seed)
    todays_users = []
    
    for i in range(nb_users_for_the_day):
        todays_users.append(random.choices(list(user_distribution.keys()), list(user_distribution.values()))[0])
        
    return todays_users
    
def generate_logs(todays_users, start_time): 
    random.seed(random_seed)
    state = [0] * len(todays_users) 
    next_actions = [random.randint(0,86400) for x in range(len(todays_users))]
    logs = []

    for i in range(len(todays_users)):
        u = todays_users[i]
        state[i] = user_start_action[u]
        user_lookup[todays_users[i] + str(i)] = todays_users[i]
    
    while min(next_actions) < 86400:
        ind = np.argmin(next_actions)
        if state[ind] != 'end':            
            population = list(user_profile[todays_users[ind]][state[ind]].keys())
            weights = list(user_profile[todays_users[ind]][state[ind]].values())
            next_action = random.choices(population, weights)[0]

            spl = next_action.split(":")
            path = spl[0]
            status = 'success'
            if len(spl) > 1:
                status = spl[1]
                
            entry = [str(start_time + timedelta(seconds=next_actions[ind])), todays_users[ind] + str(ind), path, status, ind, todays_users[ind]]
            logs.append(entry)

            # Schedule this user's next action and move to the new state.
            next_actions[ind] += random.randint(1, user_velocity[todays_users[ind]])
            state[ind] = next_action
            
        else:
            next_actions[ind] = 86400


    return logs


CPU times: user 53.9 ms, sys: 19.3 ms, total: 73.2 ms
Wall time: 88.7 ms


# Day 1: We need to train our model

In [3]:
%%time
import pandas as pd

user_lists = generate_userlist(1000)
todays_logs = generate_logs(user_lists, start_time)
print(len(todays_logs), 'logs event generated for', len(user_lists), 'users')

data = pd.DataFrame(np.array(todays_logs), columns=['time', 'user', 'path', 'status', 'uidx', 'realtype'])
data['prev_path'] = data.groupby(['user'])['path'].shift(1)
data['prev_path'] = data['prev_path'].fillna("")

data['prev_status'] = data.groupby(['user'])['status'].shift(1)
data['prev_status'] = data['prev_status'].fillna("")



print(data.loc[(data['path'] == 'login') & (data['status'] == 'fail')].head())

13586 logs event generated for 1000 users
                     time         user   path status uidx  realtype prev_path  \
322   2019-01-01 00:44:31     buyer685  login   fail  685     buyer             
629   2019-01-01 01:27:53  merchant955  login   fail  955  merchant             
1279  2019-01-01 02:30:38  attacker725  login   fail  725  attacker             
1280  2019-01-01 02:30:42  attacker725  login   fail  725  attacker     login   
1282  2019-01-01 02:30:44  attacker725  login   fail  725  attacker     login   

     prev_status  
322               
629               
1279              
1280        fail  
1282        fail  
CPU times: user 1.31 s, sys: 53.7 ms, total: 1.36 s
Wall time: 1.55 s


We will use these logs to calculate the surprisal value of each action, and create a lookup table for the second day.

In [4]:
from math import log, pow

surprisal = {}

for i in data['path'].unique():
    ds = data.loc[(data['path'] == i) & (data['status'] == 'success')]
    df = data.loc[(data['path'] == i) & (data['status'] == 'fail')]
    
    dsuccess = len(ds) * 1.0
    dfail = len(df) * 1.0

    if dsuccess == 0:
        dsuccess = 1.0 

    denum = dsuccess + dfail

    if dfail == 0:
        dfail = 1.0

    surprisal[i] = {'success': len(ds.index), 
                    'fail': len(df.index), 
                    'ssurprisal': log(1/(dsuccess / denum),2), 
                    'fsurprisal': log(1/(dfail / denum),2),
                    'hsuccess': (dsuccess/denum)*log(dsuccess/denum,2),
                    'hfail': (dfail/denum)*log(dfail/denum,2),
                    'h': -((dsuccess/denum)*log(dsuccess/denum,2) + (dfail/denum)*log(dfail/denum,2))
                   }

def get_surprisal(path):
    if path not in list(surprisal.keys()):
        denum = len(data)
        return {
            'fail': 0,
            'success': 0,
            'ssurprisal': log(1/(1/denum),2),
            'fsurprisal': log(1/(1/denum),2),
            'hsuccess': (1/1)*log(1/denum,2),
            'hfail': (1/denum)*log(1/denum,2),
            'h': -((1/1)*log(1/denum,2) + (1/denum)*log(1/denum,2))
        }
    else:
        return surprisal[path]
    
get_surprisal('login')

{'fail': 40,
 'fsurprisal': 4.701826258412055,
 'h': 0.23502177991774187,
 'hfail': -0.18066575440584268,
 'hsuccess': -0.054356025511899185,
 'ssurprisal': 0.05652809446342354,
 'success': 1001}

In [5]:
transition_surprisal = {}

# data_path_transition_stats = data.loc[data['prev_path'] != ''].groupby(['path','status', 'prev_path']).size().reset_index()
# all_data_length = len(data)
# print(data_path_transition_stats)

for pkey in data['path'].unique():
    data_for_pkey = data.loc[(data['path'] == pkey)]
    denum = len(data_for_pkey)
    
    for ppkey in data_for_pkey['prev_path'].unique():
        ds = data_for_pkey.loc[(data_for_pkey['prev_path'] == ppkey) & (data_for_pkey['status'] == 'success')]
        df = data_for_pkey.loc[(data_for_pkey['prev_path'] == ppkey) & (data_for_pkey['status'] == 'fail')]
        
        dsuccess = len(ds) * 1.0
        dfail = len(df) * 1.0
        
        if dsuccess == 0:
            dsuccess = 1.0 
        
        if dfail == 0:
            dfail = 1.0

        if (pkey not in transition_surprisal.keys()):
            transition_surprisal[pkey] = {}
            
        transition_surprisal[pkey][ppkey] = {
            'success': len(ds), 
            'fail': len(df), 
            'ssurprisal': log(1/(dsuccess / denum),2), 
            'fsurprisal': log(1/(dfail / denum),2),
            'hsuccess': -(dsuccess/denum)*log(dsuccess/denum,2),
            'hfail': -(dfail/denum)*log(dfail/denum,2),
            'h': -((dsuccess/denum)*log(dsuccess/denum,2) + (dfail/denum)*log(dfail/denum,2))
        }
        
def get_transition_surprisal(path, prev_path):
    if path not in list(transition_surprisal.keys()):
        denum = len(data)
        return {
            'fail': 0,
            'success': 0,
            'ssurprisal': log(1/(1/denum),2),
            'fsurprisal': log(1/(1/denum),2),
            'hsuccess': -(1/1)*log(1/denum,2),
            'hfail': -(1/denum)*log(1/denum,2),
            'h': -((1/1)*log(1/denum,2) + (1/denum)*log(1/denum,2))
        }
    else:
        if prev_path not in list(transition_surprisal[path].keys()):
            denum = len(data.loc[(data['path'] == path)])
            return {
                'fail': 0,
                'success': 0,
                'ssurprisal': log(1/(1/denum),2),
                'fsurprisal': log(1/(1/denum),2),
                'hsuccess': -(1/1)*log(1/denum,2),
                'hfail': -(1/denum)*log(1/denum,2),
                'h': -((1/1)*log(1/denum,2) + (1/denum)*log(1/denum,2))
            }
        else:
            return transition_surprisal[path][prev_path]

get_transition_surprisal('buy_item', 'login')

{'fail': 1,
 'fsurprisal': 10.287712379549449,
 'h': 0.045203369322015366,
 'hfail': 0.00823016990363956,
 'hsuccess': 0.036973199418375804,
 'ssurprisal': 7.7027498788282935,
 'success': 6}

If we go through the logs of the day, can we identify outliers?

To do so, the idea is to look at each action being taken by each user, and add the relevant value from the surprisal lookup table.

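That said, I did read a lot about information theory and surprisal analysis, and this is most probably not how it is supposed to be used, and the calculation is most probably wrong... but this mistake is quite useful. In the cell below, `get_user_score` passes a list to `get_surprisal`, so the lookup never matches and every action receives the same fallback value, log2(13586) ≈ 13.73. A user's score is therefore proportional to the number of actions they performed.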

In [6]:
def get_user_score(logs, key, feature, success_val):
    accumulator = {}
    for index,row in logs.iterrows():
        if row[key] not in accumulator.keys():
            accumulator[row[key]] = {k:0 for k in data['path'].unique()}
        if row[feature] is success_val:
            accumulator[row[key]][row[feature]] += get_surprisal([row[feature]])['ssurprisal']
        else:
            accumulator[row[key]][row[feature]] += get_surprisal([row[feature]])['fsurprisal']
            
    return accumulator


user_score = get_user_score(data, 'user', 'path', 'success')

user_score['attacker725']

{'bank_modify': 0,
 'buy_item': 0,
 'comment': 0,
 'end': 13.729833138848363,
 'home': 0,
 'login': 82.37899883309018,
 'logout': 13.729833138848363,
 'password_reset': 0,
 'payment_modify': 0,
 'sell_item': 0,
 'update_address': 0,
 'update_email': 0,
 'view_item': 0,
 'view_profile': 0}
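
For comparison, here is a sketch of a scorer that looks up each event's own path and status instead. It uses the same log2(total / count) formula as the lookup table, but without the zero-count adjustments; the function names, the tuple layout and the toy data are assumptions made for this example, and its results would differ from the outputs shown in this notebook.

```python
from collections import defaultdict
from math import log2

def build_surprisal_table(events):
    """Map (path, status) to log2(total / count) among events on that path."""
    counts = defaultdict(lambda: defaultdict(int))
    for _user, path, status in events:
        counts[path][status] += 1
    table = {}
    for path, by_status in counts.items():
        total = sum(by_status.values())
        for status, n in by_status.items():
            table[(path, status)] = log2(total / n)
    return table

def score_users(events, table, default):
    """Sum the surprisal per user; unseen (path, status) pairs cost `default` bits."""
    scores = defaultdict(float)
    for user, path, status in events:
        scores[user] += table.get((path, status), default)
    return dict(scores)

# Toy training data: 7 successful logins and 1 failed login (made-up numbers).
training = [("u%d" % i, "login", "success") for i in range(7)]
training.append(("u7", "login", "fail"))
table = build_surprisal_table(training)

today = [
    ("alice", "login", "success"),
    ("mallory", "login", "fail"),
    ("mallory", "login", "fail"),
]
scores = score_users(today, table, default=log2(len(training)))
print(scores)  # mallory scores 6.0 bits, alice about 0.19
```

With this variant a failed login costs 3 bits and a successful one about 0.19 bits, so the score reflects what the user did rather than only how much they did.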

In [7]:
cumulative_score = [[user, sum(scores.values())] for user, scores in user_score.items()]

df_cumulative_score = pd.DataFrame(cumulative_score, columns=['user', 'surprisal'])

avg = df_cumulative_score['surprisal'].mean()
std = df_cumulative_score['surprisal'].std()
df_cumulative_score['z'] = (df_cumulative_score['surprisal'] - avg) / std


In [8]:
df_cumulative_score.loc[df_cumulative_score['z'] >= 2].sort_values(by=['surprisal'], ascending=False)

            user    surprisal          z
275       bot776   8704.71421  24.307055
684       bot900  5244.796259  14.434006
721       bot672  3116.672123   8.361297
15   merchant774  1537.741312   3.855739


In [9]:
df_cumulative_score.sort_values(by=['surprisal'], ascending=False).tail()

             user  surprisal         z
319      buyer635  27.459666 -0.453925
600  fraudster855  27.459666 -0.453925
500   attacker520  13.729833 -0.493104
566   attacker198  13.729833 -0.493104
999   merchant238  13.729833 -0.493104


In [10]:
%%time

np.seterr(divide='ignore', invalid='ignore', over='ignore')
np.random.seed(random_seed)
user_types = list(data['realtype'].unique())


flat_status = {
    True: { True: 0, False: 0 },
    False: { True: 0, False: 0 }
}

test_stats = {}
for ut1 in user_types:
    test_stats[ut1] =  {
        True: { True: 0, False: 0 },
        False: { True: 0, False: 0 }
    }

# print(test_stats)

def compare_users(user1, user2):
    u1 = np.array(list(user_score[user1].values()))
    px = [1/np.power(2,x) for x in u1]
    real = user_lookup[user1]

    u2 = np.array(list(user_score[user2].values()))
    test_user_type = user_lookup[user2]
    qx = [1/np.power(2,x) for x in u2]
    p = np.array(qx)/np.array(px)
    dkl = (qx * np.log2(p)).sum()
    t = dkl < 1 and dkl >= -1
    return { 'test': t, 'real': real == test_user_type, 'dkl': dkl }


for ut1 in user_types:
    all_ut1 = list(data.loc[(data['realtype'] == ut1)]['user'].unique())
    np.random.shuffle(all_ut1)
    for ut2 in user_types:
        all_ut2 = list(data.loc[(data['realtype'] == ut2)]['user'].unique())
        for j in all_ut1[:10]:
            np.random.shuffle(all_ut2)
            for i in all_ut2[:10]:
                result = compare_users(j, i)
                test_stats[ut1][result['real']][result['test']] += 1
                flat_status[result['real']][result['test']] += 1
#                 if result['real'] != result['test']:
#                     print(j,i,'real',result['real'],'test',result['test'], test_stats[ut1])

print(test_stats)

{'merchant': {True: {True: 65, False: 35}, False: {True: 0, False: 250}}, 'buyer': {True: {True: 60, False: 40}, False: {True: 0, False: 250}}, 'fraudster': {True: {True: 7, False: 9}, False: {True: 39, False: 85}}, 'attacker': {True: {True: 16, False: 9}, False: {True: 98, False: 52}}, 'bot': {True: {True: 0, False: 9}, False: {True: 0, False: 96}}, 'spammer': {True: {True: 7, False: 2}, False: {True: 9, False: 87}}}
CPU times: user 182 ms, sys: 2.4 ms, total: 185 ms
Wall time: 187 ms


In [11]:
tlookup = {}
for ut1 in user_types:
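    # test_stats is indexed as [real][test]
    tp = test_stats[ut1][True][True]    # same type, test says same
    fn = test_stats[ut1][True][False]   # same type, test says different
        
    fp = test_stats[ut1][False][True]   # different type, test says same
    tn = test_stats[ut1][False][False]  # different type, test says different
        
    pdenum = tp+fp
    ndenum = fn+tn
        
    if pdenum == 0:
        pdenum = 1
    if ndenum == 0:
        ndenum = 1
            
    tlookup[ut1] = {
        True: { True: tp/pdenum, False: fn/ndenum},
        False: { True: fp/pdenum, False: tn/ndenum}
    }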
    print(ut1,tlookup[ut1])



merchant {True: {True: 1.0, False: 0.12280701754385964}, False: {True: 0.0, False: 0.8771929824561403}}
buyer {True: {True: 1.0, False: 0.13793103448275862}, False: {True: 0.0, False: 0.8620689655172413}}
fraudster {True: {True: 0.15217391304347827, False: 0.09574468085106383}, False: {True: 0.8478260869565217, False: 0.9042553191489362}}
attacker {True: {True: 0.14035087719298245, False: 0.14754098360655737}, False: {True: 0.8596491228070176, False: 0.8524590163934426}}
bot {True: {True: 0.0, False: 0.08571428571428572}, False: {True: 0.0, False: 0.9142857142857143}}
spammer {True: {True: 0.4375, False: 0.02247191011235955}, False: {True: 0.5625, False: 0.9775280898876404}}


# Day 2: Let's see if we can find something

We are now on Day 2. The surprisal values were computed from a different day, but if normal behaviour on Day 2 is anything like Day 1, they should still be relevant.

Let's generate a new day of logs.

This is where we use our log stream analyser. It is basically a state machine that keeps a running total for each user.

The theory is that if a user only performs normal actions, the sum of the surprisal values of all their actions should be fairly low (under ~10, although this threshold is arbitrary). Anything over 10 would mean that a high number of unlikely actions were performed. Let's see if we can identify which users stand out.
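
A running-total analyser of this kind could be sketched as follows. The class name `StreamScorer`, the `(path, status)` lookup keys and the surprisal values in `table` are assumptions for illustration (the login values are loosely rounded from the Day 1 figures); the threshold of 10 is the arbitrary one mentioned above.

```python
from collections import defaultdict

class StreamScorer:
    """Keeps a running surprisal total per user and reports threshold crossings."""

    def __init__(self, table, default, threshold=10.0):
        self.table = table        # (path, status) -> surprisal in bits
        self.default = default    # surprisal for unseen (path, status) pairs
        self.threshold = threshold
        self.totals = defaultdict(float)

    def observe(self, user, path, status):
        """Add one event; return True when it pushes the user past the threshold."""
        before = self.totals[user]
        self.totals[user] = before + self.table.get((path, status), self.default)
        return before <= self.threshold < self.totals[user]

# Hypothetical lookup values, in bits.
table = {
    ("login", "success"): 0.06,
    ("login", "fail"): 4.7,
    ("view_item", "success"): 0.1,
}
scorer = StreamScorer(table, default=13.7)

stream = [
    ("alice", "login", "success"),
    ("alice", "view_item", "success"),
    ("mallory", "login", "fail"),
    ("mallory", "login", "fail"),
    ("mallory", "login", "fail"),
]
for user, path, status in stream:
    if scorer.observe(user, path, status):
        print("outlier:", user, round(scorer.totals[user], 2))
```

Because `observe` reports only the event that crosses the threshold, each user is flagged once rather than on every later action.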

## Why are these users identified as outliers?

If we look at the top outlier, we can see each action performed and the surprisal value of each action.

If we look at a normal user, we can see that the surprisal value assigned to each action is really low, so each action contributes only slightly to that user's score.

# Day 3: Automatically Classifying Users

At some point, we still need to classify behaviours. Using Naive Bayes over the actions, we can score each user against the category they most likely belong to.

Could we have done that without looking for the outliers? Yes, but we still need to identify which users are normal and which ones are likely not.

First, we need to calculate the probability distribution of each action for each category.

We can now use this probability distribution over a log stream.
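
The two steps above can be sketched with a small multinomial Naive Bayes written from scratch. This is a sketch under stated assumptions, not the notebook's own implementation: the function names, the add-one smoothing (`alpha`), the user-based prior and the toy training data are all choices made for this example.

```python
from collections import Counter, defaultdict
from math import log

def train(labelled_events, vocabulary, alpha=1.0):
    """Estimate log P(category) and log P(action | category).

    `labelled_events` is a list of (category, user, action) tuples.
    Add-alpha smoothing keeps unseen actions from getting zero probability.
    """
    action_counts = defaultdict(Counter)
    users = defaultdict(set)
    for category, user, action in labelled_events:
        action_counts[category][action] += 1
        users[category].add(user)
    n_users = sum(len(u) for u in users.values())
    log_prior = {c: log(len(u) / n_users) for c, u in users.items()}
    log_likelihood = {}
    for c, counts in action_counts.items():
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {a: log((counts[a] + alpha) / total) for a in vocabulary}
    return log_prior, log_likelihood

def classify(actions, log_prior, log_likelihood):
    """Return the category with the highest log-posterior for a user's actions."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][a] for a in actions)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data (made up): two buyers and one attacker.
vocabulary = ["login:success", "login:fail", "view_item:success"]
labelled = (
    [("buyer", "b1", a) for a in ["login:success", "view_item:success", "view_item:success"]]
    + [("buyer", "b2", a) for a in ["login:fail", "login:success", "view_item:success"]]
    + [("attacker", "a1", a) for a in ["login:fail"] * 5 + ["login:success"]]
)
log_prior, log_likelihood = train(labelled, vocabulary)

print(classify(["login:fail"] * 3, log_prior, log_likelihood))                     # attacker
print(classify(["login:success", "view_item:success"], log_prior, log_likelihood))  # buyer
```

Since the score is a sum of per-action log-probabilities, it can be kept as a running total per user and per category while the log stream is read. Actions missing from `vocabulary` raise a `KeyError` in this sketch and would need explicit handling.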