# 1. Baseline Modeling

## 1.1 Simple Average Model

The first model we'll test is about the simplest one possible: average all the ratings in the training set and use that single value as the prediction for every test set example.

In [1]:
import numpy as np


class SimpleAverageModel:
    """A very simple model that just uses the average of the ratings in the
    training set as the prediction for the test set.

    Attributes
    ----------
    mean : float
        Average of the training set ratings
    """

    def __init__(self):
        pass

    def fit(self, X):
        """Given a ratings dataframe X, compute the mean rating
        
        Parameters
        ----------
        X : pandas dataframe, shape = (n_ratings, >=3)
            User, item, rating dataframe. Only the 3rd column is used.
        
        Returns
        -------
        self
        """
        self.mean = X.iloc[:, 2].mean()
        return self

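    def predict(self, X):
        """Return the training set mean as the prediction for every row of X

        Parameters
        ----------
        X : pandas dataframe, shape = (n_ratings, >=3)
            User, item, rating dataframe. Only its length is used.

        Returns
        -------
        y_pred : numpy array, shape = (n_ratings,)
            Array filled with the training set mean
        """
        return np.full(len(X), self.mean)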


## 1.2 Average by ID

In [3]:


class AverageByIdModel():
    """Simple model that predicts based on average ratings for a given Id
    (movieId or userId) from training data
    
    Parameters
    ----------
    id_column : string
        Name of the id column (e.g. 'itemId', 'userId') to average by in
        the dataframe that the model will be fitted to

    Attributes
    ----------
    averages_by_id : pandas Series, shape = [n_ids]
        Pandas series of rating averages by id
    overall_average : float
        Average rating over all training samples
    """
    def __init__(self, id_column):
        self.id_column = id_column

    def fit(self, X):
        """Fit training data.

        Parameters
        ----------
        X : pandas dataframe, shape = (n_ratings, >=3)
            User, item, rating dataframe. Columns beyond 3 are ignored

        Returns
        -------
        self : object
        """
        rating_column = X.columns[2]
        X = X[[self.id_column, rating_column]].copy()
        X.columns = ['id', 'rating']
        self.averages_by_id = (
            X
            .groupby('id')['rating']
            .mean()
            .rename('average_rating')
        )
        self.overall_average = X['rating'].mean()
        return self

    def predict(self, X):
        """Return rating predictions

        Parameters
        ----------
        X : pandas dataframe, shape = (n_ratings, >=1)
            Dataframe containing the id column given by id_column. All
            other columns are ignored.

        Returns
        -------
        y_pred : numpy array, shape = (n_ratings,)
            Array of n_ratings rating predictions
        """
        # Look up each id's training average; ids unseen during training
        # map to NaN and fall back to the overall training average
        averages = X[self.id_column].map(self.averages_by_id)
        return averages.fillna(self.overall_average).values
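
The sketch below walks through the per-id average and its fallback on a toy example. It is an illustration only: the ratings, the column names, and the unseen id `99` are invented, and the lookup is written with `Series.map` as one possible way to express it.

```python
import pandas as pd

# Hypothetical toy training data (invented for illustration)
train = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating': [4.0, 2.0, 5.0, 3.0],
})
# Test rows: movie 10 was seen in training, movie 99 was not
test = pd.DataFrame({'userId': [1, 2], 'movieId': [10, 99]})

# Average rating per movie, plus the overall average used as a fallback
averages_by_movie = train.groupby('movieId')['rating'].mean()
overall_average = train['rating'].mean()

# Seen ids get their own average; unseen ids become NaN and are filled
# with the overall average
y_pred = (
    test['movieId']
    .map(averages_by_movie)
    .fillna(overall_average)
    .to_numpy()
)
print(y_pred)
```

Movie 10 gets the average of its two training ratings, while movie 99, which never appears in training, receives the overall training average.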




## 1.3 Damped Baseline with User + Movie Data

This baseline model takes into account the average ratings of both the user and the movie, along with a damping factor that pulls the baseline prediction toward the overall mean, most strongly for users and movies with few ratings. The damping factor has been shown empirically to improve performance.

This model follows equation 2.1 from a collaborative filtering paper from GroupLens, the same group that published the MovieLens data. This equation defines the baseline rating for user $u$ and item $i$ as
$$b_{u,i} = \mu + b_u + b_i$$

where
$$b_u = \frac{1}{|I_u| + \beta_u}\sum_{i \in I_u} (r_{u,i} - \mu)$$

and
$$b_i = \frac{1}{|U_i| + \beta_i}\sum_{u \in U_i} (r_{u,i} - b_u - \mu).$$

(See equations 2.4 and 2.5 in the paper.) Here, $\beta_u$ and $\beta_i$ are damping factors; the paper reports that 25 is a good value for this dataset. For now we'll just keep the two equal ($\beta=\beta_u=\beta_i$). Here's a summary of what each variable means: