### Colab Activity 19.1: Regression Models for Prediction

**Expected Time = 60 minutes**


This activity uses regression models to score unseen content (albums). With these scores, you can recommend unheard albums to users. As in the lecture, each artist comes with a *lofi* score and a *slick* score.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [3]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression

#### Our Data

This example uses a synthetic dataset of reviews from five individuals and five albums. The dataset is loaded and displayed below. Two additional columns, `slick` and `lofi`, rate the character of the music.


In [4]:
reviews = pd.read_csv('data/sample_reviews.csv', index_col=0)

In [5]:
reviews.head()

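| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi |
|---|---|---|---|---|---|---|---|
| Michael Jackson | 3.0 | NaN | 2.0 | 3.0 | 1.0 | 8 | 2 |
| Clint Black | 4.0 | 9.0 | 5.0 | NaN | 1.0 | 8 | 2 |
| Dropdead | NaN | NaN | 8.0 | 9.0 | NaN | 2 | 9 |
| Anti-Cimex | 4.0 | 3.0 | 9.0 | 4.0 | 9.0 | 2 | 10 |
| Cardi B | 4.0 | 8.0 | NaN | 9.0 | 5.0 | 9 | 3 |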


[Back to top](#-Index)

### Problem 1

#### Considering Alfred

Define `X` as the `slick` and `lofi` columns of the `reviews` dataframe, keeping only the rows where the `Alfred` column is not missing. Define `y` as a series containing the non-missing values of the `Alfred` column in the `reviews` dataframe.

Instantiate a new linear regression model and fit it to `X` and `y`. Assign this model to the variable `alfred_lr`.

Next, create a new dataframe `newx` from the rows of the `reviews` dataframe where the `Alfred` column is missing (NaN), again selecting only the `slick` and `lofi` columns.

Finally, call `predict` on `alfred_lr` with `newx` as its argument to compute the predictions. Assign the result to `alfred_dd_predict`.
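The four steps above can also be sketched directly, without a helper function. This is a minimal illustration rather than the official solution. Since `data/sample_reviews.csv` may not be at hand, it rebuilds the table inline from the `reviews.head()` output shown earlier; that reconstruction, and the helper names `features` and `rated`, are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumption: the table is typed in from the displayed output instead of read from the CSV.
reviews = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "Mandy": [np.nan, 9.0, np.nan, 3.0, 8.0],
        "Lenny": [2.0, 5.0, 8.0, 9.0, np.nan],
        "Joan": [3.0, np.nan, 9.0, 4.0, 9.0],
        "Tino": [1.0, 1.0, np.nan, 9.0, 5.0],
        "slick": [8, 8, 2, 2, 9],
        "lofi": [2, 2, 9, 10, 3],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)

features = ["slick", "lofi"]          # hypothetical helper name
rated = reviews["Alfred"].notna()     # True where Alfred gave a rating

# Step 1: training data from the rows Alfred rated
X = reviews.loc[rated, features]
y = reviews.loc[rated, "Alfred"]

# Step 2: fit the model
alfred_lr = LinearRegression().fit(X, y)

# Step 3: features of the rows Alfred has not rated
newx = reviews.loc[~rated, features]

# Step 4: predicted ratings for those rows (a numpy array)
alfred_dd_predict = alfred_lr.predict(newx)
print(newx.index.tolist(), alfred_dd_predict)
```

Here `alfred_dd_predict` is a one-element array for Dropdead, which should agree with the answer check for this problem up to floating-point rounding.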


In [6]:
def get_predictions_and_coefficients(reviews_df, user, feature_columns):
    """
    Creates a content-based filtering model for a specified user and returns predictions for unrated items.

    Parameters:
    ----------
    reviews_df : pandas.DataFrame
        DataFrame containing user ratings and item features
    user : str
        The name of the user column in the reviews DataFrame
    feature_columns : list
        List of column names representing the item features (e.g., ['slick', 'lofi'])

    Returns:
    -------
    tuple
        A tuple containing three elements:
        - predictions_df: pandas.DataFrame
            DataFrame with unrated items and their predicted ratings
        - user_coefficients: numpy.ndarray
            The learned coefficients (user factors) representing the user's preferences
        - intercept: float
            The intercept term from the linear regression model

    Examples:
    --------
    >>> predictions_df, coef, intercept = get_predictions_and_coefficients(
    ...     reviews, 'Alfred', ['slick', 'lofi'])
    >>> print(f"Alfred's preferences: {dict(zip(['slick', 'lofi'], coef))}")
    >>> print(f"Intercept: {intercept}")
    >>> print(predictions_df.sort_values('predicted_rating', ascending=False).head())
    """
    # Filter for items the user has rated
    df_user_notnull = reviews_df[reviews_df[user].notnull()]

    # Extract features and target for training
    X = df_user_notnull[feature_columns]
    y = df_user_notnull[user]

    # Fit the linear regression model
    user_lr = LinearRegression().fit(X, y)

    # Get items the user hasn't rated
    df_user_nulls = reviews_df[reviews_df[user].isnull()]

    # If there are unrated items, predict ratings for them
    if not df_user_nulls.empty:
        newx = df_user_nulls[feature_columns]
        predictions = user_lr.predict(newx)

        # Create a DataFrame with the predictions
        predictions_df = df_user_nulls.copy()
        predictions_df['predicted_rating'] = predictions
    else:
        predictions_df = pd.DataFrame()  # Empty DataFrame if no predictions

    return predictions_df, user_lr.coef_, user_lr.intercept_
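
# The helper fits the model internally, so `alfred_lr` is not kept as a variable;
# its predictions and coefficients are returned instead.
alfred_dd_predict, alfred_coefficients, _ = get_predictions_and_coefficients(reviews, 'Alfred', ['slick', 'lofi'])

### ANSWER CHECK
alfred_dd_predict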

| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi | predicted_rating |
|---|---|---|---|---|---|---|---|---|
| Dropdead | NaN | NaN | 8.0 | 9.0 | NaN | 2 | 9 | 3.75 |


[Back to top](#-Index)

### Problem 2

#### User Vector for Alfred


Assign the coefficients of the linear regression model `alfred_lr` to `alfred_vector` below.


In [7]:

alfred_vector = alfred_coefficients


### ANSWER CHECK
pd.DataFrame(alfred_vector.reshape(1, 2), columns = ['slick', 'lofi'], index = ['Alfred'])

| | slick | lofi |
|---|---|---|
| Alfred | 0.25 | 0.25 |
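To see what the user vector means, a prediction can be rebuilt by hand as the intercept plus the dot product of the coefficients with an item's features. The sketch below is illustrative only: it refits the model on the four rows Alfred rated, typed in from the reviews table, and the names `dropdead` and `manual` are introduced here and are not part of the activity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: (slick, lofi) features and Alfred's ratings, copied from the
# reviews table for the four rows he rated.
X = np.array([[8, 2], [8, 2], [2, 10], [9, 3]])
y = np.array([3.0, 4.0, 4.0, 4.0])

alfred_lr = LinearRegression().fit(X, y)
alfred_vector = alfred_lr.coef_  # one weight per feature: [w_slick, w_lofi]

# Dropdead's features (slick=2, lofi=9); Alfred has not rated it.
dropdead = np.array([2, 9])

# Prediction by hand: intercept + slick * w_slick + lofi * w_lofi
manual = alfred_lr.intercept_ + dropdead @ alfred_vector

# The same prediction through the fitted model
from_model = alfred_lr.predict(dropdead.reshape(1, -1))[0]
print(alfred_vector, alfred_lr.intercept_, manual, from_model)
```

The two values should agree with each other and with the predicted rating for Dropdead in Problem 1, up to floating-point rounding. Equal weights on `slick` and `lofi` suggest that Alfred's ratings respond to both features to the same degree.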


[Back to top](#-Index)

### Problem 3

#### Considering Tino


Build a regression model `tino_lr` as in Problem 1, this time for the user `Tino`. Assign the prediction to `tino_dd_predict` as a numpy array below.

In [8]:

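tino_dd_predict, tino_coefficients, _ = get_predictions_and_coefficients(reviews, 'Tino', ['slick', 'lofi'])
# The helper returns a DataFrame; tino_dd_predict['predicted_rating'].to_numpy()
# gives the predictions as a numpy array.

### ANSWER CHECK
tino_dd_predict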

| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi | predicted_rating |
|---|---|---|---|---|---|---|---|---|
| Dropdead | NaN | NaN | 8.0 | 9.0 | NaN | 2 | 9 | 6.714286 |


[Back to top](#-Index)

### Problem 4

#### User Vector for Tino


Assign the coefficients of the linear regression model `tino_lr` to `tino_vector` below.

In [9]:

tino_vector = tino_coefficients


### ANSWER CHECK
pd.DataFrame(tino_vector.reshape(1, 2), columns = ['slick', 'lofi'], index = ['Tino'])

| | slick | lofi |
|---|---|---|
| Tino | 1.714286 | 2.285714 |


[Back to top](#-Index)

### Problem 5

#### Completing the Table


Write a `for` loop that iterates over each user column of `reviews` and repeats the prediction process, using the same `slick` and `lofi` columns as inputs.

Create a DataFrame called `reviews_df_full` in which every missing score is filled in with that individual's predicted rating.

In [10]:
reviews.columns

Index(['Alfred', 'Mandy', 'Lenny', 'Joan', 'Tino', 'slick', 'lofi'], dtype='object')

In [15]:
users = ['Alfred', 'Mandy', 'Lenny', 'Joan', 'Tino']
reviews_df_full = reviews.copy()
for user in users:
    predictions_df, _, _ = get_predictions_and_coefficients(reviews, user, ['slick', 'lofi'])
    if not predictions_df.empty:
        # Update existing columns with predicted values
        reviews_df_full.loc[predictions_df.index, user] = predictions_df['predicted_rating']

### ANSWER CHECK
reviews_df_full

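| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi |
|---|---|---|---|---|---|---|---|
| Michael Jackson | 3.0 | 9.0 | 2.0 | 3.0 | 1.0 | 8 | 2 |
| Clint Black | 4.0 | 9.0 | 5.0 | 4.664444 | 1.0 | 8 | 2 |
| Dropdead | 3.75 | 3.857143 | 8.0 | 9.0 | 6.714286 | 2 | 9 |
| Anti-Cimex | 4.0 | 3.0 | 9.0 | 4.0 | 9.0 | 2 | 10 |
| Cardi B | 4.0 | 8.0 | 4.916667 | 9.0 | 5.0 | 9 | 3 |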
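With the table complete, the predicted scores can be turned into recommendations, as the introduction suggests. The sketch below is one possible approach, not part of the graded activity: for each user it looks only at entries that were originally missing and picks the one with the highest predicted score. Both tables are typed in from the outputs shown in this notebook, and the names `unrated`, `predicted_only`, and `recommendations` are hypothetical.

```python
import numpy as np
import pandas as pd

users = ["Alfred", "Mandy", "Lenny", "Joan", "Tino"]
albums = ["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"]

# Assumption: original ratings, copied from the reviews.head() output.
reviews = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "Mandy": [np.nan, 9.0, np.nan, 3.0, 8.0],
        "Lenny": [2.0, 5.0, 8.0, 9.0, np.nan],
        "Joan": [3.0, np.nan, 9.0, 4.0, 9.0],
        "Tino": [1.0, 1.0, np.nan, 9.0, 5.0],
    },
    index=albums,
)

# Assumption: completed scores, copied from the reviews_df_full output.
reviews_df_full = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, 3.75, 4.0, 4.0],
        "Mandy": [9.0, 9.0, 3.857143, 3.0, 8.0],
        "Lenny": [2.0, 5.0, 8.0, 9.0, 4.916667],
        "Joan": [3.0, 4.664444, 9.0, 4.0, 9.0],
        "Tino": [1.0, 1.0, 6.714286, 9.0, 5.0],
    },
    index=albums,
)

# True where the user had no rating in the original data
unrated = reviews[users].isna()

# Keep predicted scores only; entries the user already rated become NaN
predicted_only = reviews_df_full[users].where(unrated)

# For each user, the unrated row with the highest predicted score
recommendations = predicted_only.idxmax()
print(recommendations)
```

Users with a single missing entry trivially get that entry back; the choice is only informative for users such as Mandy, who had two unrated rows. In practice one might also require the predicted score to clear a threshold before recommending anything.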
