# Bayes Risk

In this notebook we implement decision-making based on Bayes risk. It builds on the "Decision Theory" chapter of the book and on the lectures that were presented.

## The weight given to each error

In our dataset the output is binary: a user either likes a given ad (rating it 4 or 5) or does not (rating it below 4). Based on this, we consider **false negatives** to have a much greater impact in our problem. Failing to deliver an ad that a person would like costs much more than delivering an ad the person does not like, mainly because users are used to ignoring uninteresting ads.

With that in mind, we implement the Bayes risk as a way to evaluate the models that will be trained later. Following the book and the lecture material, the Bayes risk can be reduced to:

$$R = c_{10} \times p_{0} \times p_{FP} + c_{01} \times p_{1} \times p_{FN}$$

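In other words, the Bayes risk sums, over the two kinds of error, the cost of the error times the prior of the true class times the probability of making that error given that class. Here $c_{10}$ is the cost of a false positive (deciding 1 when the true class is 0), $c_{01}$ is the cost of a false negative (deciding 0 when the true class is 1), and $p_0$, $p_1$ are the priors. As presented in the book and the lectures, this is the reduced form: the costs of correct decisions are taken to be zero, so those terms drop out of the risk.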
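To make the formula concrete, here is a minimal sketch of it as a plain Python function. The name `bayes_risk` and the numbers in the usage lines are illustrative assumptions, not values taken from this notebook.

```python
def bayes_risk(c10, c01, p0, p1, p_fp, p_fn):
    """Reduced Bayes risk: R = c10*p0*p_FP + c01*p1*p_FN.

    c10: cost of deciding 1 when the true class is 0 (false positive)
    c01: cost of deciding 0 when the true class is 1 (false negative)
    p0, p1: class priors
    p_fp: P(decide 1 | class 0)
    p_fn: P(decide 0 | class 1)
    """
    return c10 * p0 * p_fp + c01 * p1 * p_fn


# Hypothetical numbers, chosen only to illustrate the arithmetic.
equal_costs = bayes_risk(c10=0.5, c01=0.5, p0=0.8, p1=0.2, p_fp=0.1, p_fn=0.5)
fn_heavy = bayes_risk(c10=0.2, c01=0.8, p0=0.8, p1=0.2, p_fp=0.1, p_fn=0.5)
print(equal_costs, fn_heavy)
```

With the same error rates, shifting cost toward false negatives raises the risk of a classifier that misses many positives, which is the behaviour the cost weights are meant to capture.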

## Computing the priors

To compute the priors, we first load the data and check how balanced the dataset is.

In [50]:
import pandas as pd

df = pd.read_csv('../../data/final_features_df.csv')
df = df.fillna(0)
df

Unnamed: 0.1,Unnamed: 0,Age,Income,faves_pca0,faves_pca1,unfaves_pca0,unfaves_pca1,accessories,alcohol,animamted,...,Drama.2,Entertainment (Variety Shows),Factual,Learning,Music,News,Religion & Ethics,Sport.1,Weather,Rating_bin
0,0,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1,1,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
2,2,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
3,3,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
4,4,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36115,36115,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
36116,36116,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
36117,36117,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0
36118,36118,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,0


In [51]:
Y_column = 'Rating_bin'

prior_probabilities = df.groupby(Y_column).size().div(len(df))
prior_probabilities

Rating_bin
0    0.865975
1    0.134025
dtype: float64

The data is clearly quite imbalanced: class $0$ makes up about 87% of the rows and class $1$ about 13%. This makes sense, since it is fairly uncommon for someone to find an ad enjoyable.

Continuing with the calculation, we now split the prior into its values for $0$ and $1$.

In [52]:
prior_0 = prior_probabilities.iloc[0]
prior_1 = prior_probabilities.iloc[1]

print(f'Prior 0: {prior_0}')
print(f'Prior 1: {prior_1}')

Prior 0: 0.8659745293466223
Prior 1: 0.13402547065337764
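As an alternative sketch, the same kind of priors can be obtained with `value_counts(normalize=True)` and read by class label rather than by position, which avoids relying on the order of the groups. The toy `labels` series below is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Hypothetical binary labels: 8 zeros and 2 ones.
labels = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="Rating_bin")

# Relative frequency of each class, indexed by the class label itself.
priors = labels.value_counts(normalize=True)

prior_0 = priors.loc[0]
prior_1 = priors.loc[1]
print(f"Prior 0: {prior_0}")
print(f"Prior 1: {prior_1}")
```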


## Naive Bayes

As the decision function for this exercise we use Naive Bayes. It assumes the features are conditionally independent given the class, which may not hold here, but it is sufficient as a decision function for studying how the Bayes risk behaves.

In [53]:
from sklearn.naive_bayes import GaussianNB

# Every column except the target is used as a feature.
features = df.loc[:, df.columns != Y_column]
labels = df[Y_column]

model = GaussianNB()
model.fit(features.values, labels.values)

# Predict all rows in a single vectorized call instead of row by row.
df["prediction"] = model.predict(features.values)
df

Unnamed: 0.1,Unnamed: 0,Age,Income,faves_pca0,faves_pca1,unfaves_pca0,unfaves_pca1,accessories,alcohol,animamted,...,Entertainment (Variety Shows),Factual,Learning,Music,News,Religion & Ethics,Sport.1,Weather,Rating_bin,prediction
0,0,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,1,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,2,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,3,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,4,62,1,0.325917,0.112657,-0.038928,0.473323,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36115,36115,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
36116,36116,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
36117,36117,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
36118,36118,33,2,0.131441,-0.259766,0.348568,0.074988,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0


With the predictions generated, we now compute the false positive and false negative rates.
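For reference, the two rates can be sketched on toy data as follows. The false negative rate is conditioned on the true class being 1 and the false positive rate on the true class being 0, matching $p_{FN}$ and $p_{FP}$ in the risk formula. The labels and the helper name `error_rates` are hypothetical.

```python
def error_rates(y_true, y_pred):
    """Return (false negative rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Each rate divides by the size of the corresponding TRUE class.
    return fn / (fn + tp), fp / (fp + tn)


# Hypothetical labels: 4 true positives-class rows and 6 negatives-class rows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

p_fn, p_fp = error_rates(y_true, y_pred)
print(f"Probability of FN: {p_fn}")
print(f"Probability of FP: {p_fp}")
```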

In [77]:
# Counts indexed by (true label, predicted label).
errors = df.groupby([Y_column, 'prediction']).size()

ntn = errors.loc[(0, 0)]  # true 0, predicted 0
nfp = errors.loc[(0, 1)]  # true 0, predicted 1: false positive
nfn = errors.loc[(1, 0)]  # true 1, predicted 0: false negative
ntp = errors.loc[(1, 1)]  # true 1, predicted 1

# Each rate is conditioned on the true class, as the risk formula requires.
pfn = nfn/(nfn + ntp)
pfp = nfp/(nfp + ntn)

print(f'Probability of FN: {pfn:.4f}')
print(f'Probability of FP: {pfp:.4f}')

Probability of FN: 0.3600
Probability of FP: 0.2955


### Using the default costs

In [79]:
c10 = 1/2
c01 = 1/2

r = c10*prior_0*pfp + c01*prior_1*pfn
print(f'Bayes risk with default cost: {r:.4f}')

Bayes risk with default cost: 0.1521


### Using costs that equalize the proportions of 1

In [84]:
predicted_prior_1 = df.groupby('prediction').size().div(len(df)).iloc[1]
cost_1 = prior_1/predicted_prior_1

print(f'Predicted prior 1: {predicted_prior_1}')
print(f'Original prior 1: {prior_1}')
print(f'Rate: {cost_1}')

Predicted prior 1: 0.3416666666666667
Original prior 1: 0.13402547065337764
Rate: 0.3922696702050077


In [85]:
c10 = 1/2 + cost_1
c01 = 1 - c10

r = c10*prior_0*pfp + c01*prior_1*pfn
print(f'Bayes risk with custom cost: {r:.4f}')

Bayes risk with custom cost: 0.2335
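The costs can also drive the decision itself rather than only the evaluation. A minimal sketch, assuming posterior probabilities $P(1 \mid x)$ are available (for example from `GaussianNB.predict_proba`, which is not called here): deciding 1 costs $c_{10} P(0 \mid x)$ in expectation and deciding 0 costs $c_{01} P(1 \mid x)$, so the minimum-risk rule predicts 1 whenever $P(1 \mid x) > c_{10} / (c_{10} + c_{01})$. The function names and the posterior values below are hypothetical.

```python
import numpy as np


def risk_threshold(c10, c01):
    """Posterior P(1|x) above which deciding 1 has the lower expected cost."""
    return c10 / (c10 + c01)


def decide(posterior_1, c10, c01):
    """Minimum-expected-cost decision for each posterior P(1|x)."""
    posterior_1 = np.asarray(posterior_1, dtype=float)
    return (posterior_1 > risk_threshold(c10, c01)).astype(int)


# Hypothetical posteriors for five samples.
posteriors = np.array([0.05, 0.30, 0.45, 0.60, 0.90])

equal = decide(posteriors, c10=0.5, c01=0.5)     # threshold 0.5
fn_heavy = decide(posteriors, c10=0.2, c01=0.8)  # threshold 0.2
print(equal, fn_heavy)
```

Raising the cost of false negatives lowers the threshold, so more samples are assigned to class 1.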
