### Validation Algorithm

This notebook builds a validation score for a resource from its votes and its age.

### Part 1: Reddit Confidence Score

We use Reddit's ranking score (the lower bound of the Wilson score interval) to turn a resource's current upvotes and downvotes into a confidence score.

In [1]:
from math import sqrt
import pandas as pd
from scipy.stats import norm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def confidence(ups, downs, confidence_score=0.90):
    # total number of ratings
    n = ups + downs
    # with no ratings there is nothing to estimate, so return 0
    if n == 0:
        return 0
    # z is the (1 - (1 - confidence_score) / 2) quantile of the standard normal distribution
    z = norm.ppf(1 - ((1 - confidence_score) / 2))
    # observed fraction of positive ratings
    p = ups / n
    # Use the Wilson score interval to answer: given the ratings so far, what is the
    # lowest value the "real" fraction of positive ratings plausibly takes at this
    # confidence level (a 90% two-sided interval by default)?
    # Only the lower bound is needed, so only that half of the formula is computed.
    # centre term of the numerator
    left_side = p + 1/(2*n)*z*z
    # margin term of the numerator
    right_side = z*sqrt(p*(1-p)/n + z*z/(4*n*n))
    # denominator
    under = 1+1/n*z*z
    # lower bound of the interval
    return (left_side - right_side) / under

In [2]:
#example voting scenario for a resource
upvotes = 14
downvotes = 2

print(f"Confidence for this scenario: {confidence(upvotes, downvotes)}")

Confidence for this scenario: 0.683786931501618
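
To see why the lower bound is used rather than the raw upvote fraction, it can help to hold the ratio fixed at 87.5% and vary the number of votes. The sketch below re-implements the same formula with the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` (an assumption made only to keep it dependency-free); `wilson_lower_bound` is a hypothetical name, not part of the notebook.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(ups, downs, confidence_level=0.90):
    """Lower bound of the Wilson score interval for the upvote fraction."""
    n = ups + downs
    if n == 0:
        return 0.0
    # two-sided interval: z is the 1 - (1 - confidence_level) / 2 quantile
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    p = ups / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# Same 87.5% upvote ratio, increasing numbers of votes.
for ups, downs in [(7, 1), (14, 2), (140, 20), (1400, 200)]:
    print(ups + downs, round(wilson_lower_bound(ups, downs), 3))
```

The row for 16 votes should agree with the value printed above, and the bound should climb toward 0.875 as votes accumulate: few votes earn less trust than many votes at the same ratio.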


### Part 2: Implementing Decay Rate (Over Time)

In this step we introduce a decay rate $\lambda$ that controls how quickly the score drops as a resource ages. We want the score to lose the same fraction of its value in every equal interval of time, and the exponential function gives exactly that behavior, so time affects the score gradually rather than abruptly.

The exponent must be negative (that is, $\lambda > 0$) so that we model exponential decay rather than exponential growth. Here $t$ is the age of the resource in seconds:
$$score = p^\alpha \cdot e^{-\lambda t}$$

We also raise the Reddit confidence score $p$ to a power $\alpha$, which sets how much weight that value carries in the final score. Consider a scenario where a resource has a confidence score of 0.9 from the Reddit step. With $\lambda = 10^{-7}$ and $t = 86400$ s (1 day), changing $\alpha$ affects the score as follows:

|  | $\alpha = 0.8$ | $\alpha = 1$ | $\alpha = 1.2$ |
| :--:| :---: | :---: | :---: |
| score | 0.91125 | 0.89225 | 0.87365 |
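
The scores in this table can be reproduced by holding the decay rate at $10^{-7}$ per second and varying only the exponent on $p$. The snippet below is a minimal check under that assumption; the variable names are illustrative.

```python
from math import exp

p = 0.9            # confidence score from Part 1 (example value from the text)
t = 86400          # one day, in seconds
decay_rate = 1e-7  # lambda, assumed to match the value used later in the notebook

scores = []
for alpha in (0.8, 1.0, 1.2):
    score = p ** alpha * exp(-decay_rate * t)
    scores.append(score)
    print(f"alpha = {alpha}: score = {score:.5f}")
```

Because $p \le 1$, a smaller exponent pushes $p^\alpha$ up toward 1, so lowering $\alpha$ softens the effect of an imperfect confidence score.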


#### How Decay Rate Affects Scores Over Time

Consider a scenario where a resource has a perfect score: $p = 1$. This is how time will affect that score with different decay rates.

| Time | $\lambda = 10^{-5}$ | $\lambda = 10^{-6}$ | $\lambda = 10^{-7}$ |
| :---------: | :-----------------: | :-----------------: | :-----------------: |
| 1 Day (86400s) | 0.421 | 0.917 | 0.991 |
| 3 Days (259200s) | 0.075 | 0.772 | 0.974 |
| 7 Days (604800s) | 0.0024 | 0.546 | 0.941 |
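
One way to compare decay rates is by half-life: setting $e^{-\lambda t} = 1/2$ gives $t_{1/2} = \ln 2 / \lambda$. The sketch below evaluates the decay factor $e^{-\lambda t}$ for the same rates and ages as the table and reports each half-life in days; `decay_factor` and `half_life_days` are hypothetical helper names.

```python
from math import exp, log

SECONDS_PER_DAY = 86400


def decay_factor(decay_rate, seconds):
    """Multiplier applied to p**alpha after `seconds` have elapsed."""
    return exp(-decay_rate * seconds)


def half_life_days(decay_rate):
    """Days until the decay factor falls to 0.5 (ln 2 / lambda, converted to days)."""
    return log(2) / decay_rate / SECONDS_PER_DAY


for decay_rate in (1e-5, 1e-6, 1e-7):
    factors = [round(decay_factor(decay_rate, d * SECONDS_PER_DAY), 4) for d in (1, 3, 7)]
    print(decay_rate, factors, f"half-life = {half_life_days(decay_rate):.1f} days")
```

Under this formula, $\lambda = 10^{-7}$ corresponds to a half-life of roughly 80 days, while $\lambda = 10^{-5}$ halves the score in under a day.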

In [3]:
from math import exp

#build an updated score version
def score_v_one(upvotes, downvotes, time, alpha=0.8):
    #get the reddit confidence score
    p = confidence(upvotes, downvotes)
    #calculate p^alpha
    confidence_score = p ** alpha
    #calculate the time decay: lambda = 10^-7 per second, time is in seconds
    time_decay = exp(-(10**(-7))*time)
    return confidence_score * time_decay

In [4]:
#number of upvotes for a resource
upvotes = 20
#number of downvotes for a resource
downvotes = 2
#the time the resource has been up
days = 3
#time in seconds for formula
seconds = days * 86400

print(f"Score: {score_v_one(upvotes, downvotes, seconds)}")

Score: 0.7816252169974596


### Proposed Algorithm Formula

$$score = p^{0.8} \cdot e^{-10^{-7} \cdot t}$$

We pair this with a **validation threshold** of 0.65, which still lets a solid number of resources through for training.

Where:
- $p$ is the confidence score from Reddit's ranking formula, calculated from **upvotes** and **downvotes**
- The exponent 0.8 softens the penalty on the confidence score (for example, $0.9^{0.8} \approx 0.919$), so the confidence score is weighted slightly more than time and good resources keep their scores for longer
- $e^{-10^{-7} \cdot t}$, with $t$ in seconds, gives a slow decay that preserves resources for a longer period of time
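
To get a feel for what the threshold means in practice, the formula can be solved for the age at which a score crosses 0.65: $t = \ln(p^{0.8} / 0.65) / 10^{-7}$. The sketch below assumes a resource counts as validated while its score is at least the threshold (the notebook does not spell out the comparison), and `seconds_until_below_threshold` is a hypothetical helper.

```python
from math import exp, log


def score(p, seconds, alpha=0.8, decay_rate=1e-7):
    """Proposed score: p**alpha * exp(-decay_rate * seconds)."""
    return p ** alpha * exp(-decay_rate * seconds)


def seconds_until_below_threshold(p, threshold=0.65, alpha=0.8, decay_rate=1e-7):
    """Age at which the score reaches the threshold; 0.0 if it never exceeds it."""
    initial = p ** alpha
    if initial <= threshold:
        return 0.0
    return log(initial / threshold) / decay_rate


p = 0.759  # approximate confidence score for 20 upvotes and 2 downvotes
days = seconds_until_below_threshold(p) / 86400
print(f"Stays at or above 0.65 for about {days:.1f} days")
```

With these settings, the 20-upvote, 2-downvote example stays above the threshold for roughly 24 days if no further votes arrive, and a resource needs a confidence score of about 0.58 or more to pass at all.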



### Future Iterations of Formula

Future versions of the formula could include additional variables as extra multiplicative factors, each with its own weight.

For example, to factor in the proportion of views to total users, the formula would look like this:

$$score = p^{0.8} \cdot v^{\beta} \cdot e^{-10^{-7} \cdot t}$$

Where:
- $v$ is the proportion of views / total users
- $\beta$ is the weight we would want to assign to the views proportion
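
A minimal sketch of how this extended formula might look in code is shown below. The function name `score_with_views`, the placeholder value `beta=0.2`, and the clamping of $v$ to at most 1 are all assumptions for illustration; none of them come from the notebook.

```python
from math import exp


def score_with_views(p, views, total_users, seconds,
                     alpha=0.8, beta=0.2, decay_rate=1e-7):
    """Hypothetical extension: p**alpha * v**beta * exp(-decay_rate * seconds)."""
    if total_users == 0:
        return 0.0
    # v is the proportion of views to total users, clamped to 1 (assumption)
    v = min(views / total_users, 1.0)
    return p ** alpha * v ** beta * exp(-decay_rate * seconds)


# A resource with confidence 0.9, seen by 40 of 100 users, one day old.
print(score_with_views(0.9, 40, 100, 86400))
```

One consequence of a purely multiplicative form is that a resource with zero views scores zero regardless of its votes, which may or may not be the desired behavior.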