# Detection Theory

In this notebook we explore aspects of detection theory. We evaluate a prediction model under three different thresholds and assess how each one affects prediction performance. As supporting material we use the lectures and the "Detection Theory" chapter of the book.

## The importance given to each error

In our dataset the output is binary: a user either likes an ad (rating it 4 or 5) or does not (rating it below 4). On that basis, we take **false negatives** to be far more costly in our problem. Failing to deliver an ad that a person would like has a much greater impact than delivering ads the person does not like, mainly because users are used to ignoring uninteresting ads.

Given this, we will implement the Bayes risk and use it to evaluate the models trained later. Following the book and the lecture material, the Bayes risk can be reduced to:

$$R = c_{10} \times p_{0} \times p_{FP} + c_{01} \times p_{1} \times p_{FN}$$

In other words, the Bayes risk is the sum, over the two kinds of error, of the error's cost times the prior of the true class times the probability of that error occurring. Here $c_{10}$ is the cost of a false positive and $c_{01}$ is the cost of a false negative. As presented in the book and in the lectures, this is the reduced equation, in which the costs of correct decisions are left out, since they do not contribute to the risk.
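The reduced formula can be estimated directly from labels and predictions. The following is a minimal sketch, not code from the notebook: the function name `bayes_risk` and the toy arrays are illustrative assumptions, and the error probabilities are estimated as simple per-class error rates.

```python
import numpy as np

def bayes_risk(y_true, y_pred, p0, p1, c10, c01):
    """Empirical estimate of R = c10 * p0 * p_FP + c01 * p1 * p_FN."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    negatives = y_true == 0
    positives = y_true == 1
    # p_FP: share of true 0s predicted as 1; p_FN: share of true 1s predicted as 0
    p_fp = np.mean(y_pred[negatives] == 1) if negatives.any() else 0.0
    p_fn = np.mean(y_pred[positives] == 0) if positives.any() else 0.0
    return c10 * p0 * p_fp + c01 * p1 * p_fn

# Toy data (hypothetical): four true 0s, two true 1s
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 0, 1])
risk = bayes_risk(y_true, y_pred, p0=4 / 6, p1=2 / 6, c10=1, c01=2)
print(f'Bayes risk: {risk:.3f}')
```

In the toy data one of four negatives is a false positive and one of two positives is a false negative, so the two terms are $1 \times \tfrac{2}{3} \times \tfrac{1}{4}$ and $2 \times \tfrac{1}{3} \times \tfrac{1}{2}$.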

To compute the threshold, we consider the value $\eta$, defined as:

$$\eta = \dfrac{c_{10} \times p_0}{c_{01} \times p_1}$$

From this, the threshold is computed as:

$$T = \dfrac{\eta}{1 + \eta}$$
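One way to read this formula, offered as an aside rather than something stated in the source, uses the likelihood ratio $\Lambda(x) = p(x \mid 1) / p(x \mid 0)$ (a symbol introduced here for illustration). The test $\Lambda(x) > \eta$ is unchanged by any increasing transformation, and $u \mapsto u / (1 + u)$ is increasing for $u \ge 0$, so:

$$\Lambda(x) > \eta \iff \dfrac{\Lambda(x)}{1 + \Lambda(x)} > \dfrac{\eta}{1 + \eta} = T$$

The left-hand quantity $\Lambda / (1 + \Lambda)$ is the posterior probability of class 1 under equal priors. If a score is instead a posterior that already includes $p_0$ and $p_1$, substituting $\Lambda = \eta$ gives the corresponding cut:

$$P(1 \mid x) = \dfrac{p_1 \Lambda(x)}{p_1 \Lambda(x) + p_0} > \dfrac{p_1 \eta}{p_1 \eta + p_0} = \dfrac{c_{10}}{c_{10} + c_{01}}$$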

As stated earlier, false negatives have the greater impact, so we use the costs $c_{10} = 1$ and $c_{01} = 2$.

## Loading the data

In [6]:
import pandas as pd

df = pd.read_csv('../../data/final_features_df.csv')
df = df.fillna(0)
df

Unnamed: 0.1,Unnamed: 0,Age,Income,faves_pca0,faves_pca1,unfaves_pca0,unfaves_pca1,accessories,alcohol,animamted,...,Drama.2,Entertainment (Variety Shows),Factual,Learning,Music,News,Religion & Ethics,Sport.1,Weather,Rating_bin
0,0,62,1,-0.321485,0.078600,-0.199670,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1,1,62,1,-0.321485,0.078600,-0.199670,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
2,2,62,1,-0.321485,0.078600,-0.199670,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
3,3,62,1,-0.321485,0.078600,-0.199670,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
4,4,62,1,-0.321485,0.078600,-0.199670,-0.200645,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36115,36115,33,2,-0.000741,0.311926,0.206937,0.190376,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
36116,36116,33,2,-0.000741,0.311926,0.206937,0.190376,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
36117,36117,33,2,-0.000741,0.311926,0.206937,0.190376,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
36118,36118,33,2,-0.000741,0.311926,0.206937,0.190376,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0


## Preparing the dataset for training

In [7]:
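from sklearn.model_selection import train_test_split

# Separate the binary target from the features
# (X keeps every remaining column, including the 'Unnamed: 0' index columns from the CSV)
Y = df.pop('Rating_bin')
X = df

# Hold out 30% of the rows for testing; fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)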

## Computing the priors and defining the costs

In [58]:
import numpy as np

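# Class priors estimated from the training labels
p1 = np.mean(y_train)
p0 = 1 - p1

# Error costs: c10 = false positive, c01 = false negative
c10 = 1
c01 = 2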

print(f'Prior for 1: {p1}')
print(f'Prior for 0: {p0}')

Prior for 1: 0.13205980066445183
Prior for 0: 0.8679401993355482


## Preparing the model

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)

# Keep only the predicted probability of class 1 for each test example
lr_probs = model.predict_proba(X_test)[:, 1]

## Defining the prediction function

For this exercise, we define a simple decision function that predicts $0$ or $1$ based on a predefined threshold.

In [75]:
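def predict_with_threshold(probs, threshold=0.5):
    # Predict 1 where the probability reaches the threshold, 0 otherwise
    return (np.asarray(probs) >= threshold).astype(int)

# Attach the helper to the model object so it can be called as model.predict_with_threshold(...)
model.predict_with_threshold = predict_with_threshold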

## Using the 0.5 threshold

In [76]:
y_pred = model.predict_with_threshold(lr_probs)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      1.00      0.93      9334
           1       0.00      0.00      0.00      1502

    accuracy                           0.86     10836
   macro avg       0.43      0.50      0.46     10836
weighted avg       0.74      0.86      0.80     10836





## Using the threshold based on the Bayes risk

In [77]:
eta = (c10*p0)/(c01*p1)
threshold = eta/(1+eta)

print(f'Threshold: {threshold}')

Threshold: 0.7666911225238444
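To see how the cost ratio moves this threshold, the two formulas for $\eta$ and $T$ can be wrapped in a small helper. This is a sketch under assumptions: the name `bayes_threshold` and the list of alternative costs are illustrative, and `p1` is copied from the prior printed earlier in the notebook.

```python
def bayes_threshold(p0, p1, c10, c01):
    """T = eta / (1 + eta), with eta = (c10 * p0) / (c01 * p1)."""
    eta = (c10 * p0) / (c01 * p1)
    return eta / (1 + eta)

# Prior for class 1 as printed by the notebook
p1 = 0.13205980066445183
p0 = 1 - p1

# Hypothetical false-negative costs; c01 = 2 is the value used in the notebook
thresholds_by_cost = {c01: bayes_threshold(p0, p1, c10=1, c01=c01) for c01 in (1, 2, 4, 8)}
for c01, t in thresholds_by_cost.items():
    print(f'c01 = {c01}: threshold = {t:.4f}')
```

Since $T = c_{10} p_0 / (c_{10} p_0 + c_{01} p_1)$, raising $c_{01}$ always lowers the threshold. With priors this unbalanced, $c_{01} = 2$ still leaves $T$ above 0.5.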


In [78]:
y_pred = model.predict_with_threshold(lr_probs, threshold)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      1.00      0.93      9334
           1       0.00      0.00      0.00      1502

    accuracy                           0.86     10836
   macro avg       0.43      0.50      0.46     10836
weighted avg       0.74      0.86      0.80     10836





## Using the threshold that matches the predicted proportion of 1s to the prior

In [93]:
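def get_threshold_prior(threshold):
    # Fraction of test examples predicted as 1 at this threshold
    y_pred = model.predict_with_threshold(lr_probs, threshold)
    return np.mean(y_pred)

def find_nearest_idx(array, value):
    # Index of the element closest to `value`; asarray lets plain lists be passed
    return np.abs(np.asarray(array) - value).argmin()

# Scan thresholds and keep the one whose predicted positive rate is closest to p1
thresholds = np.arange(0, 1, 0.001)
priors = [get_threshold_prior(t) for t in thresholds]
threshold_idx = find_nearest_idx(priors, p1)
threshold = thresholds[threshold_idx]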

print(f'Threshold: {threshold}')

Threshold: 0.187
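The grid search can also be approximated in one step, because the threshold that marks a fraction $p_1$ of the scores as positive is the $(1 - p_1)$ quantile of those scores. This is a sketch of that alternative, not the notebook's method: `scores` is a synthetic stand-in for `lr_probs`, and `p1` is rounded from the printed prior.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(10_000)  # synthetic stand-in for lr_probs (assumption)
p1 = 0.132                   # rounded prior for class 1

# The (1 - p1) quantile leaves roughly a fraction p1 of the scores at or above it
threshold = np.quantile(scores, 1 - p1)
predicted_rate = np.mean(scores >= threshold)

print(f'Threshold: {threshold:.3f}')
print(f'Predicted proportion of 1s: {predicted_rate:.3f}')
```

When many examples share the same score, as can happen with repeated feature rows, the match is only approximate, because tied scores all fall on the same side of the threshold.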


In [94]:
y_pred = model.predict_with_threshold(lr_probs, threshold)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.88      0.88      9334
           1       0.24      0.23      0.23      1502

    accuracy                           0.79     10836
   macro avg       0.56      0.56      0.56     10836
weighted avg       0.79      0.79      0.79     10836
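As a rough illustration of the reduced risk formula, the rounded recalls from the reports can be combined with the priors printed earlier. This is only an estimate under those roundings, taking $p_{FP} \approx 1 - \text{recall}_0$ and $p_{FN} \approx 1 - \text{recall}_1$; it is not a result reported by the notebook.

For the threshold of 0.187:

$$R \approx 1 \times 0.868 \times 0.12 + 2 \times 0.132 \times 0.77 \approx 0.104 + 0.203 = 0.307$$

For the two earlier thresholds, where every example is predicted as 0:

$$R \approx 1 \times 0.868 \times 0 + 2 \times 0.132 \times 1 \approx 0.264$$

Under these figures, matching the predicted proportion of 1s to the prior does not by itself lower the Bayes risk for the chosen costs.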



In [95]:
y_p1 = np.mean(y_pred)
y_p0 = 1 - y_p1

print(f'Original prior for 1: {p1}')
print(f'Predicted prior for 1: {y_p1}')

print('-------')

print(f'Original prior for 0: {p0}')
print(f'Predicted prior for 0: {y_p0}')

Original prior for 1: 0.13205980066445183
Predicted prior for 1: 0.13205980066445183
-------
Original prior for 0: 0.8679401993355482
Predicted prior for 0: 0.8679401993355482
