# 2.1 Baselines

In [1]:
import numpy as np
import pandas as pd
from math import sqrt
from statistics import mean
from collections import Counter

Loading training and test data from project #1. We really only need the `rating` column.

In [2]:
train_ratings = pd.read_csv('../local_data/train_data.csv', header=0)['rating']
test_ratings = pd.read_csv('../local_data/test_data.csv', header=0)['rating']

To know whether a model's RMSE is any good, we must compare it to a baseline. Below are four such baselines, each a naive technique that requires no training. If our model's RMSE is not lower than these baselines, the model is of no use.

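Remember the RMSE definition:
$$ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}{\Big(x_i -\hat{x}_i\Big)^2}} $$
where $x_i$ is the true rating, $\hat{x}_i$ the predicted rating, and $N$ the number of test ratings.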
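As a quick sanity check of the definition, here is a minimal sketch on toy numbers. The helper `rmse_pairs` and the toy lists are assumptions made for illustration; they are not part of the project data or code.

```python
from math import sqrt

def rmse_pairs(actual, predicted):
    """RMSE between two equal-length sequences (hypothetical helper)."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy data, not taken from the project: four true ratings, constant guess of 4.
actual = [4, 3, 5, 2]
predicted = [4, 4, 4, 4]

# Squared errors are 0, 1, 1, 4, so their mean is 6/4 = 1.5.
toy_rmse = rmse_pairs(actual, predicted)
print(f"{toy_rmse:.2f}")  # prints 1.22
```

Because the errors are squared before averaging, the single miss of 2 contributes as much as four misses of 1 would.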

In [3]:
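def rmse(predictions):
    """Root-mean-square error of `predictions` against the test ratings."""
    return sqrt(mean([pow(x - float(r), 2.) for x, r in zip(predictions, test_ratings)]))

# Count how often each rating value occurs in the training set.
ratings = Counter(train_ratings).items()
print(ratings)
# Convert the counts to relative frequencies: a list of (rating, probability) pairs.
total = sum(count for _, count in ratings)
ratings = [(value, count / total) for value, count in ratings]
num_of_predictions = len(test_ratings)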

dict_items([(5, 214936), (4, 331600), (3, 248122), (2, 102148), (1, 53392)])


In [4]:
print("RMSE Baselines:\n-----")
print(f"Random sampling:\t{rmse(np.random.choice(range(1,6), size=num_of_predictions)):.2f}")
print(f"Weighted sampling:\t{rmse(np.random.choice([x[0] for x in ratings],p=[x[1] for x in ratings], size=num_of_predictions)):.2f}")
print(f"Majority class:\t\t{rmse([4.]*num_of_predictions):.2f}")
print(f"Mean value:\t\t{rmse([mean(train_ratings)]*num_of_predictions):.2f}")

RMSE Baselines:
-----
Random sampling:	1.90
Weighted sampling:	1.59
Majority class:		1.19
Mean value:		1.12


We see that the lowest RMSE we can reach without any trained model is 1.12, obtained by always predicting the mean training rating.
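The mean beats the majority class for a reason: among all constant predictions, the mean is the one that minimizes the squared error. The sketch below checks this on the training rating counts printed earlier. The names `train_counts` and `constant_rmse` are hypothetical, and the errors are measured against the training distribution, so the values will not exactly match the test-set RMSEs above.

```python
from math import sqrt

# Rating counts copied from the Counter output printed earlier (training set).
train_counts = {5: 214936, 4: 331600, 3: 248122, 2: 102148, 1: 53392}
total = sum(train_counts.values())

def constant_rmse(c):
    """RMSE of always predicting `c`, measured on the training distribution."""
    return sqrt(sum(n * (r - c) ** 2 for r, n in train_counts.items()) / total)

train_mean = sum(r * n for r, n in train_counts.items()) / total

# Scan candidate constants from 1.00 to 5.00 in steps of 0.01.
candidates = [1 + i / 100 for i in range(401)]
best = min(candidates, key=constant_rmse)

print(f"mean rating:           {train_mean:.2f}")
print(f"best constant on grid: {best:.2f}")
print(f"RMSE at mean:          {constant_rmse(train_mean):.2f}")
print(f"RMSE at majority (4):  {constant_rmse(4.0):.2f}")
```

The best grid value should land next to the mean, which suggests that a trained model has to use more than a single global constant to beat this baseline.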