# Target design: understanding the useful level

## Useful feature in data

This project focuses on the `useful` metric in the data. In the review dataframe, this metric is the number of times a Yelp review was voted useful.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path

In [None]:
data_path = Path("/Users/alexandresepulvedadedietrich/Code/HelpfulLens/data")

df = pd.read_parquet(data_path / "cleaned" / "reviews_clean.parquet")

In [None]:
df_sample = df.sample(n=100000, random_state=42)
df_sample.head()

In [None]:
df_sample.hist(column="useful", log=True, bins=50, figsize=(8, 6))

In [None]:
df_sample['total_votes'] = df_sample['useful'] + df_sample['funny'] + df_sample['cool']

In [None]:
df_eda = df_sample[["business_id", "user_id", "stars", "useful", "total_votes"]].copy()
# astype returns a new dataframe, so the result has to be assigned back
df_eda = df_eda.astype({'business_id': 'category',
                        'user_id': 'category',
                        'stars': 'int8',
                        'useful': 'int16',
                        'total_votes': 'int16'})

df_eda = df_eda[df_eda["total_votes"] > 0].reset_index(drop=True)
df_eda.head()

In [None]:
corr_matrix = df_eda.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

In [None]:
df_sample.columns

## Review-aware approach - Bayesian smoothing

In [None]:
plt.scatter(df_sample['total_votes'], df_sample['useful'], alpha=0.5)
plt.xlabel('Total Votes')
plt.ylabel('Useful Votes')

To account for review popularity bias, let's make a strong assumption: each user who votes on a review casts exactly one vote, either `useful`, `cool` or `funny`.

In that case, for a review $i$, let $u_i = \#\text{useful}_i$ and $v_i = \#\text{total votes}_i$ which acts as a proxy for the number of views.

Let $p_i$ be the probability that a vote on review $i$ is a `useful` vote. If votes are independent, the number of useful votes then follows a binomial distribution:

$$
u_i \mid v_i,p_i \sim \mathcal{B}(n = v_i, p = p_i)
$$

The objective becomes learning $p_i$.
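
To build intuition for this model, here is a minimal simulation sketch. The value `p_true = 0.6`, the vote counts, and the number of simulated reviews are all hypothetical choices made for illustration, not quantities taken from the data. It draws $u_i$ from the binomial above and reports how much the ratio $u_i / v_i$ fluctuates for different vote counts.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
p_true = 0.6                    # hypothetical "true" useful probability
n_reviews = 10_000              # number of simulated reviews per vote count

spread_by_votes = {}
for v in (2, 10, 100):
    # u ~ Binomial(n=v, p=p_true), one draw per simulated review
    u = rng.binomial(n=v, p=p_true, size=n_reviews)
    ratio = u / v
    spread_by_votes[v] = ratio.std()
    print(f"v={v:>3}: mean(u/v)={ratio.mean():.3f}, std(u/v)={ratio.std():.3f}")
```

Under the binomial model the standard deviation of $u_i / v_i$ is $\sqrt{p_i (1 - p_i) / v_i}$, so the ratio is much noisier for reviews with only a handful of votes.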

A naive estimator could be:
$$\hat{p}_i = \frac{u_i}{v_i}$$

which works when the vote count is large, but this is rarely the case. To be more robust to small vote counts, we use Bayesian smoothing, which is the posterior mean of $p_i$ under a $\mathrm{Beta}(\alpha, \beta)$ prior:

$$\hat{p}_i = \frac{u_i + \alpha}{v_i + \alpha + \beta}, \quad \alpha, \beta > 0$$
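
As a small worked example of the two estimators, the sketch below applies them to four hypothetical reviews. The vote counts and the values `alpha = 1.0`, `beta = 5.0` are assumptions chosen for illustration, not tuned quantities.

```python
import numpy as np

alpha, beta = 1.0, 5.0  # illustrative pseudo-counts (assumption, not tuned)

# Hypothetical reviews: useful votes and total votes
useful = np.array([0, 1, 5, 50])
total_votes = np.array([1, 1, 10, 50])

naive = useful / total_votes
smoothed = (useful + alpha) / (total_votes + alpha + beta)

# Equivalent view: weighted average of the naive rate and alpha / (alpha + beta)
weight = total_votes / (total_votes + alpha + beta)
blended = weight * naive + (1 - weight) * alpha / (alpha + beta)

for u, v, n, s in zip(useful, total_votes, naive, smoothed):
    print(f"u={u:>2}, v={v:>2}: naive={n:.3f}, smoothed={s:.3f}")
```

A review with a single useful vote gets a naive rate of 1.0 but a smoothed rate of only 2/7, while a review with 50 useful votes out of 50 stays close to 1. The weight on the naive rate, $v_i / (v_i + \alpha + \beta)$, grows with the number of votes, so reviews with little evidence are pulled towards $\alpha / (\alpha + \beta)$.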


In [None]:
alpha, beta = 1.0, 5.0

df_eda["useful_rate_smoothed"] = (
    (df_eda["useful"] + alpha) /
    (df_eda["total_votes"] + alpha + beta)
)

df_eda.head()

In [None]:
corr_matrix = df_eda.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

In [None]:
df_eda.useful_rate_smoothed.median()

In [None]:
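# Baseline for comparison: the smoothed rate with zero votes, alpha / (alpha + beta) = 1/6
alpha / (alpha + beta)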