# Likelihood - Demo

Welcome to the Likelihood Demo! This demo presents a **truncated** version of Likelihood, using its working features to show what Likelihood actually does and how it can be applied to a dataset.

### A Quick Rundown:

Likelihood is a data quality monitoring engine that measures the surprise, or entropy, of members of a given dataset. The two links below cover the basic theory behind it:

https://en.wikipedia.org/wiki/Entropy_(information_theory)

http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

In essence: uncertainty is maximized when the probability structure of a dataset is chaotic, meaning we have little information about it. When we can identify patterns in that probability structure, however, data members that follow the patterns are not particularly chaotic. They are structured, and thus unsurprising. It is when these patterns are defied that the entropy shoots up to irregular heights. Likelihood uses precisely this rule-defying behavior to find outliers within a structured dataset.
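
As a rough illustration of this idea (a sketch, not Likelihood's actual implementation), the surprise of a single observation with probability p can be measured as -log2(p) bits, and the entropy of a distribution is its expected surprise. The function names and example probabilities below are assumptions chosen for illustration.

```python
import math

def surprise(p):
    """Self-information, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy(probabilities):
    """Shannon entropy in bits: the expected surprise over a distribution."""
    return sum(p * surprise(p) for p in probabilities if p > 0)

# A uniform distribution over four values: no pattern, maximal entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
# A structured distribution: one value dominates, so entropy is lower.
structured = [0.97, 0.01, 0.01, 0.01]

print(entropy(uniform))      # 2.0
print(entropy(structured))   # lower than the uniform case
print(surprise(0.97))        # a value that follows the pattern: low surprise
print(surprise(0.01))        # a value that defies the pattern: high surprise
```

The rare value in the structured distribution carries far more bits of surprise than the common one, which is the signal an outlier detector of this kind looks for.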

Likelihood began as a tool focused on numerical estimation, but it currently works quite well with numerical, categorical, and timestamp data. Its functional approaches are outlined below:

1. **Bootstrapping** - This approach uses bootstrap resampling to build a distribution of standard value counts by timestamp, and flags an anomaly when test-set counts deviate by a certain level from the ratios expected from the training set.


2. **Time Series** - Using Facebook Prophet, Time Series evaluation puts surprising events in the context of the time at which they happened. Building a pattern from these approximations, the Time Series tool predicts the future for the test set and raises an issue if counts fall off from the expected future values.


3. **Kernel Density** - Kernel Density smooths a distribution by fitting a curve to the data according to the data's variation, then approximates which values are unlikely by virtue of their magnitude, thus finding the most surprising data.


4. **PCA** - Using dimensionality reduction, PCA attributes the variation in the data to several significant columns, which are used to compute bias row-wise. This approach is combined with the column-based Kernel Density approach to triangulate the precise location of numeric data errors, so PCA's surprise metric is grouped with Kernel Density's.


5. **Relative Entropy Model for Categorical Data** - Much in the spirit of grammar, this relative entropy model derives its own rules (expected formatting and behavior) for data, and obtains surprise based on the strictness of the rule that the data defies (defying a stricter rule is inherently more chaotic).


6. **TimeStamp Intervals** - This Kernel Density approach works similarly to the numerical Kernel Density, but it orders the time intervals in the dataset and then tests whether there is an unusual interval in which no data or too much data was recorded.


7. **In Progress**: Mutual Entropy for Mixed Numeric and Categorical Data
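
To make the Kernel Density idea from the list above more concrete, here is a minimal sketch using a hand-written Gaussian kernel density estimate in NumPy. It is a stand-in for illustration only, not Likelihood's own code: the function names, the bandwidth of 2.0, and the synthetic data are all assumptions. Because a density is not a probability, the resulting surprise values are only meaningful relative to one another.

```python
import numpy as np

def gaussian_kde_density(train, points, bandwidth):
    """Average of Gaussian kernels centred on each training value."""
    train = np.asarray(train, dtype=float)
    points = np.asarray(points, dtype=float)
    # Pairwise standardized distances: one row per query point.
    z = (points[:, None] - train[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

def kde_surprise(train, points, bandwidth=2.0):
    """Surprise in bits: -log2 of the estimated density at each point."""
    return -np.log2(gaussian_kde_density(train, points, bandwidth))

# Hypothetical numeric column: values clustered around 50.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=5.0, size=500)

# One value near the centre of the data and one far out in the tail.
typical, outlier = kde_surprise(train, [50.0, 80.0])
print(typical, outlier)
```

The value far from the bulk of the training data receives a much lower density and therefore a much higher surprise.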

Ultimately, Likelihood should become a tool that can build functional distributions without the need for any context. Currently it functions more as a copilot.

In [7]:
# Imports for project purposes
# Full Project imports
import pandas as pd
import math as mt
import dateutil
from datetime import datetime, timedelta
import requests as rd
import numpy as np
from sklearn import neighbors, decomposition
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import smtplib
import scipy.stats as st
import os
from pandas.api.types import is_numeric_dtype
import copy

In [25]:
# Loading data into project
def load_data(dataset_link, category):
    if category == "html":
        # pd.read_html returns a list of DataFrames; use the first table found
        return pd.read_html(dataset_link)[0]
    elif category == "excel":
        return pd.read_excel(dataset_link)
    else:
        return pd.read_csv(dataset_link)


df = load_data("pd_calls_for_service_2020_datasd.csv", "csv")

The data used throughout this part of the demo comes from the San Diego County Police Calls for Service Dataset. It will be used to show the effect of Likelihood's Time-Series Methods.

In [13]:
df

Unnamed: 0,incident_num,date_time,day_of_week,address_number_primary,address_dir_primary,address_road_primary,address_sfx_primary,address_dir_intersecting,address_road_intersecting,address_sfx_intersecting,call_type,disposition,beat,priority
0,E20010000001,2020-01-01 00:00:09,4,400,,06TH,AVE,,,,11-8,A,523,0
1,E20010000002,2020-01-01 00:00:20,4,5000,,UNIVERSITY,AVE,,,,FD,K,826,2
2,E20010000003,2020-01-01 00:00:21,4,800,,SAWTELLE,AVE,,,,AU1,W,434,1
3,E20010000004,2020-01-01 00:00:32,4,5000,,UNIVERSITY,AVE,,,,FD,K,826,2
4,E20010000005,2020-01-01 00:00:42,4,5200,,CLAIREMONT MESA,BLV,,,,415V,K,111,1
5,E20010000006,2020-01-01 00:01:04,4,0,,CLEVELAND,,,UNIVER,,FD,K,-1,2
6,E20010000007,2020-01-01 00:01:11,4,4300,,MANZANITA,DR,,,,AU1,DUP,835,1
7,E20010000008,2020-01-01 00:01:15,4,3800,,DALBERGIA,CT,,,,AU1,W,443,1
8,E20010000010,2020-01-01 00:01:33,4,4000,,LOGAN,AVE,,,,11-6,K,441,1
9,E20010000009,2020-01-01 00:01:33,4,0,S,37TH,ST,,BETA,,AU1,W,442,1


In [26]:
# Converts timestamp column of DataFrame to a proper datetime type
def convertToDateTime(df, timestamp):
    # format matches values such as '2020-01-01 00:00:09'
    df[timestamp] = pd.to_datetime(df[timestamp], format='%Y-%m-%d %H:%M:%S')
    return df


# Assignments for computational purposes
df['ts'] = df['date_time']
batchHours = 24*7
df = convertToDateTime(df, 'ts')
df

Unnamed: 0,incident_num,date_time,day_of_week,address_number_primary,address_dir_primary,address_road_primary,address_sfx_primary,address_dir_intersecting,address_road_intersecting,address_sfx_intersecting,call_type,disposition,beat,priority,ts
0,E20010000001,2020-01-01 00:00:09,4,400,,06TH,AVE,,,,11-8,A,523,0,2020-01-01 00:00:09
1,E20010000002,2020-01-01 00:00:20,4,5000,,UNIVERSITY,AVE,,,,FD,K,826,2,2020-01-01 00:00:20
2,E20010000003,2020-01-01 00:00:21,4,800,,SAWTELLE,AVE,,,,AU1,W,434,1,2020-01-01 00:00:21
3,E20010000004,2020-01-01 00:00:32,4,5000,,UNIVERSITY,AVE,,,,FD,K,826,2,2020-01-01 00:00:32
4,E20010000005,2020-01-01 00:00:42,4,5200,,CLAIREMONT MESA,BLV,,,,415V,K,111,1,2020-01-01 00:00:42
5,E20010000006,2020-01-01 00:01:04,4,0,,CLEVELAND,,,UNIVER,,FD,K,-1,2,2020-01-01 00:01:04
6,E20010000007,2020-01-01 00:01:11,4,4300,,MANZANITA,DR,,,,AU1,DUP,835,1,2020-01-01 00:01:11
7,E20010000008,2020-01-01 00:01:15,4,3800,,DALBERGIA,CT,,,,AU1,W,443,1,2020-01-01 00:01:15
8,E20010000010,2020-01-01 00:01:33,4,4000,,LOGAN,AVE,,,,11-6,K,441,1,2020-01-01 00:01:33
9,E20010000009,2020-01-01 00:01:33,4,0,S,37TH,ST,,BETA,,AU1,W,442,1,2020-01-01 00:01:33
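
When parsing a timestamp column like the one above, it can help to check how many values fail to match the expected format. The sketch below uses a small made-up series (not the police-call data) and assumes the `'%Y-%m-%d %H:%M:%S'` layout seen in the `date_time` column.

```python
import pandas as pd

# Hypothetical raw values: two well-formed timestamps and one malformed entry.
raw = pd.Series(["2020-01-01 00:00:09", "2020-01-01 00:01:33", "not a timestamp"])

# errors="coerce" turns values that do not match the format into NaT
# instead of raising, so bad rows can be counted or inspected.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.isna().sum())  # 1
print(parsed.iloc[0])
```

Whether to coerce or to fail loudly is a design choice; for a data quality tool, counting the unparseable rows may itself be a useful signal.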


In [28]:
# Splits data into train and test set based on date/time
def split_train_test(df, batchHours):
    maxTs = df['ts'].max()
    batchTs = maxTs - timedelta(hours = batchHours)
    testDf = df[df['ts'] > batchTs]
    # use <= so rows stamped exactly at the cutoff stay in the training set
    trainDf = df[df['ts'] <= batchTs]
    return trainDf, testDf

trainDf, testDf = split_train_test(df, batchHours)
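
The time-based split can be checked on a tiny synthetic frame. The sketch below is an illustration under assumed values (hourly timestamps over four days and a 24-hour batch, rather than the one-week batch used in the demo), and it assigns the row that falls exactly on the cutoff to the training set so that every row lands in exactly one of the two sets.

```python
from datetime import timedelta
import pandas as pd

# Hypothetical hourly timestamps spanning four days (96 rows).
start = pd.Timestamp("2020-01-01")
toy = pd.DataFrame({"ts": start + pd.to_timedelta(list(range(96)), unit="h")})

batch_hours = 24  # assumed batch size for this toy example
cutoff = toy["ts"].max() - timedelta(hours=batch_hours)

# Rows after the cutoff form the test batch; everything else is training data.
test = toy[toy["ts"] > cutoff]
train = toy[toy["ts"] <= cutoff]

print(len(train), len(test))  # 72 24
```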

In [31]:
# Helpers and Math
def pValue(data, threshold):
    p_larger = sum(np.array(data) >= threshold) / len(data)
    p_smaller = sum(np.array(data) <= threshold) / len(data)
    p = min(p_larger, p_smaller)

    # only use gaussian p-value when there is variation, but bootstrap p = 0
    stdev = np.std(data)
    if stdev == 0 or p != 0:
        p_gauss = p
    else:
        # fit a normal distribution to the samples and take the smaller tail
        p_gauss = st.norm(np.mean(data), stdev).cdf(threshold)
        p_gauss = min(p_gauss, 1 - p_gauss)
    return p_gauss

def trimTraining(trainDf, params):

    # trim to most recent rows; note that this reads the global testDf
    # to size the training set relative to the test set
    trainDf = trainDf.sort_values(params['ts'], ascending=False)
    trainDfTrimmed = trainDf[:params['maxTrainingSizeMultiple']*len(testDf)]
    
    return trainDfTrimmed
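
To see how an empirical p-value of this kind turns into a surprise score, consider the small self-contained sketch below. The helper name and the bootstrap counts are made up for illustration; it covers only the empirical part of the calculation, not the Gaussian fallback.

```python
import numpy as np

def empirical_p_value(samples, observed):
    """Smaller of the two tail fractions of samples relative to observed."""
    samples = np.asarray(samples)
    p_larger = np.mean(samples >= observed)
    p_smaller = np.mean(samples <= observed)
    return min(p_larger, p_smaller)

# Hypothetical bootstrap counts for one category (assumed values).
bootstrap_counts = [8, 9, 10, 10, 10, 11, 11, 12, 12, 13]

p_typical = empirical_p_value(bootstrap_counts, 10)
p_extreme = empirical_p_value(bootstrap_counts, 13)

# Surprise in bits is the negative base-2 logarithm of the p-value.
print(p_typical, -np.log2(p_typical))
print(p_extreme, -np.log2(p_extreme))
```

An observed count in the middle of the bootstrap distribution yields a large p-value and little surprise, while a count at the edge yields a small p-value and more surprise.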

In [32]:
def bootstrap(trainDf, testDf):
    # get all of the string columns
    columnNames = []
    for columnName in testDf.keys():
        if (type (testDf[columnName].iloc[0])) == str:
            columnNames.append(columnName)
    print(columnNames)
    bootstrapDf = trimTraining(trainDf, params)

    # set up dict, add counts
    results = {}
    for columnName in columnNames:
        # Series.append was removed in pandas 2.0; pd.concat is the replacement
        categories = pd.concat([bootstrapDf[columnName], testDf[columnName]]).unique()
        if len(categories) > params['maxCategories']:
            continue

        results[columnName] = {}
        testCounts = testDf[columnName].value_counts(dropna = False)
        # drop missing values from the category list
        categories = [category for category in categories if not pd.isna(category)]
        for category in categories:
            results[columnName][category] = {'bootstrap_counts':[],
                                            'count':testCounts.get(category,0)}
    # resample, add bootstrap counts
    for ii in range(params['bootstrapResamples']):
        # Draw random sample from training
        sampleDf = bootstrapDf.sample(len(testDf), replace=True)
        for columnName in results.keys():
            # count by category
            trainCounts = sampleDf[columnName].value_counts(dropna = False)
            # put results in dict
            for category in results[columnName].keys():
                bootstrapCount = trainCounts.get(category,0)
                results[columnName][category]['bootstrap_counts'].append(bootstrapCount)

    # convert to records, add p-values
    bootstrap_results = []
    for columnName in results.keys():
        for category in results[columnName].keys():
            result = results[columnName][category]

            estimatedCount = int(np.round(np.mean(result['bootstrap_counts'])))
            # don't report entries with very low predicted and actual counts
            if estimatedCount < params['minCategoryCount'] and result['count'] < params['minCategoryCount']:
                continue

            p = pValue(result['bootstrap_counts'],result['count'])
            categoryName = category
            if not category:
                categoryName = "NULL"

            bootstrap_results.append({"column":columnName,
                               "category":categoryName,
                               "count":result['count'],
                               "p": p,
                               "estimated_count":estimatedCount,
                               })
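    # build the results table; it stays empty when no category qualified
    resultsDf = pd.DataFrame.from_records(bootstrap_results)
    if len(resultsDf) > 0:
        resultsDf = resultsDf.sort_values('p')
        # surprise in bits; p == 0 maps to infinite surprise
        resultsDf['surprise'] = -np.log2(resultsDf['p'])
    return resultsDf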


['incident_num', 'date_time', 'address_road_primary', 'address_sfx_primary', 'call_type', 'disposition']


NameError: name 'params' is not defined
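
The `NameError` above arises because `bootstrap` and `trimTraining` read a `params` dictionary that this truncated demo never defines. A possible definition is sketched below. The keys are the ones the code looks up, but every value is a placeholder assumption and not Likelihood's actual configuration.

```python
# Hypothetical configuration: the keys are those read by bootstrap() and
# trimTraining(); all values are assumed for illustration only.
params = {
    "ts": "ts",                      # name of the timestamp column
    "maxTrainingSizeMultiple": 10,   # training rows kept per test row
    "maxCategories": 100,            # skip columns with more distinct values
    "bootstrapResamples": 1000,      # number of bootstrap draws
    "minCategoryCount": 10,          # ignore categories rarer than this
}
```

With a dictionary like this defined in the notebook's global scope before `bootstrap(trainDf, testDf)` is called, the lookup of `params` should no longer fail.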