### Required Codio Assignment 19.1: Collaborative Filtering

**Expected Time = 90 minutes**

**Total Points = 50**

In this activity, you will use collaborative filtering to predict user ratings.  This iterative process will begin with our simple reviews dataset to fill in the missing values for the users.  Your regression models will be built using Scikit-Learn's `LinearRegression` estimator.

### Index


- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

### The Data

Again, you begin with data indexed by artists.  You will add random values for `F1` and `F2` and use them to fit a regression model for each user.  The coefficients from those models become the user factors, which you then use to fit new artist vectors, and the process repeats.  The goal remains to predict user ratings of unrated albums.
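The alternating process described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: it uses `np.linalg.lstsq` in place of `LinearRegression`, fits every step without an intercept, and draws its starting factors from `np.random.default_rng(42)`, so its numbers will not match the outputs shown later in this notebook. The helper names `fit_columns` and `observed_mse` are hypothetical.

```python
import numpy as np

# Ratings from the reviews table: rows are artists, columns are users, NaN = unrated
R = np.array([
    [3.0, np.nan, 2.0, 3.0, 1.0],         # Michael Jackson
    [4.0, 9.0, 5.0, np.nan, 1.0],         # Clint Black
    [np.nan, np.nan, 8.0, 9.0, np.nan],   # Dropdead
    [4.0, 3.0, 9.0, 4.0, 9.0],            # Anti-Cimex
    [4.0, 8.0, np.nan, 9.0, 5.0],         # Cardi B
])

def fit_columns(targets, factors):
    """For each column of `targets`, regress its observed entries on `factors` (no intercept)."""
    coefs = np.zeros((targets.shape[1], factors.shape[1]))
    for j in range(targets.shape[1]):
        seen = ~np.isnan(targets[:, j])
        coefs[j] = np.linalg.lstsq(factors[seen], targets[seen, j], rcond=None)[0]
    return coefs

def observed_mse(ratings, artist_factors, user_factors):
    """Mean squared error over the rated entries only."""
    seen = ~np.isnan(ratings)
    predicted = artist_factors @ user_factors.T
    return float(np.mean((ratings[seen] - predicted[seen]) ** 2))

rng = np.random.default_rng(42)                      # assumption: illustrative seed
artist_factors = rng.normal(size=(R.shape[0], 2))    # random starting F1, F2

history = []
for _ in range(5):
    user_factors = fit_columns(R, artist_factors)     # one regression per user
    artist_factors = fit_columns(R.T, user_factors)   # one regression per artist
    history.append(observed_mse(R, artist_factors, user_factors))

# Keep the known ratings and fill the gaps with the model's predictions
predicted = artist_factors @ user_factors.T
filled = np.where(np.isnan(R), predicted, R)
print(np.round(history, 4))
```

Because each step solves a least-squares problem with the other set of factors held fixed, the error on the rated entries should not increase from one round to the next in this no-intercept variant.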

In [3]:
reviews = pd.read_csv('data/user_rated.csv', index_col = 0).iloc[:, :-2].T

In [4]:
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino
Michael Jackson,3.0,,2.0,3.0,1.0
Clint Black,4.0,9.0,5.0,,1.0
Dropdead,,,8.0,9.0,
Anti-Cimex,4.0,3.0,9.0,4.0,9.0
Cardi B,4.0,8.0,,9.0,5.0


[Back to top](#-Index)

### Problem 1

### Creating F1 and F2

**5 Points**

To begin, create two randomly instantiated vectors `F1` and `F2` as columns in your DataFrame.  To do so, you will draw numbers from a random normal distribution using `np.random.normal(size = 5)`.  Set the seed first by calling `np.random.seed(42)`.
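Note that `np.random.seed` is a function, so the seed takes effect only when it is called with the value. A minimal sketch showing that reseeding reproduces the same draws (the names `first` and `second` are only for illustration):

```python
import numpy as np

np.random.seed(42)                  # call the function; do not assign to it
first = np.random.normal(size=5)

np.random.seed(42)                  # reseeding restarts the same sequence
second = np.random.normal(size=5)

print(np.array_equal(first, second))  # True
```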

In [5]:
### GRADED
np.random.seed(42)
reviews['F1'] = np.random.normal(size = 5)
reviews['F2'] = np.random.normal(size = 5)

### ANSWER CHECK
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.496714,-0.234137
Clint Black,4.0,9.0,5.0,,1.0,-0.138264,1.579213
Dropdead,,,8.0,9.0,,0.647689,0.767435
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.52303,-0.469474
Cardi B,4.0,8.0,,9.0,5.0,-0.234153,0.54256


[Back to top](#-Index)

### Problem 2

#### Regression models for all users

**10 Points**

Complete the starter code given below to iterate over the first five columns of the `reviews` dataframe. To define `X`, drop the rows where the column `c` is NaN and select the `F1` and `F2` columns. The target variable `y` is the column `c` after dropping NaNs.

Next, use `X` and `y` to fit a linear regression model without an intercept to predict values of column `c` based on `F1` and `F2`. Assign this model to the variable `lr`.

Store the coefficients of the linear regression model in the list `uf` and convert this list to a NumPy array.
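As a rough illustration of one pass through that loop, the sketch below fits Alfred's model by hand. It assumes the `F1` and `F2` values printed in the Problem 1 answer check (rounded to six decimals) and uses `np.linalg.lstsq` as a stand-in for `LinearRegression(fit_intercept=False)`; both solve the same least-squares problem when there is no intercept.

```python
import numpy as np

# Feature values copied from the Problem 1 output (rounded), one row per artist
features = np.array([
    [0.496714, -0.234137],   # Michael Jackson
    [-0.138264, 1.579213],   # Clint Black
    [0.647689, 0.767435],    # Dropdead
    [1.523030, -0.469474],   # Anti-Cimex
    [-0.234153, 0.542560],   # Cardi B
])
alfred = np.array([3.0, 4.0, np.nan, 4.0, 4.0])  # Alfred has not rated Dropdead

rated = ~np.isnan(alfred)            # keep only the artists Alfred rated
X, y = features[rated], alfred[rated]

# Least squares without an intercept: minimize ||X @ coef - y||^2
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(coef.round(3))
```

Up to rounding, the two coefficients match Alfred's `F1` and `F2` entries in the `ui_df` table shown in Problem 3.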


In [6]:
### GRADED

training_features = ['F1', 'F2']

def get_coefficients_for_training_features(df, training_features, intercept=True):
    """Fit one regression per rating column and return the stacked coefficients."""
    uf = []
    # Every column except the trailing feature columns holds ratings
    for c in df.columns[:-len(training_features)]:
        notnull_df = df[df[c].notnull()]  # keep only the rows rated in column c
        X = notnull_df[training_features]
        y = notnull_df[c]
        lr = LinearRegression(fit_intercept=intercept).fit(X, y)
        uf.append(lr.coef_)
    return np.array(uf)

# Problem 2 asks for a model without an intercept
uf = get_coefficients_for_training_features(reviews, training_features, False)

### ANSWER CHECK
uf.shape #should be (5, 2)

(5, 2)

[Back to top](#-Index)

### Problem 3

#### New Model for artists

**10 Points**

Below, a dataframe `ui_df` is created using the coefficients from the previous problem.  Now, you are to use this data with `F1` and `F2` to build a new model and track each *artist's* coefficients.  Assign this as a NumPy array to `ifs` below.

HINT: The steps for this problem are similar to the ones in Problem 2.


In [7]:
ui_df = reviews.iloc[:, :-2].T
ui_df['F1'] = uf[:, 0]
ui_df['F2'] = uf[:, 1]
ui_df

Unnamed: 0,Michael Jackson,Clint Black,Dropdead,Anti-Cimex,Cardi B,F1,F2
Alfred,3.0,4.0,,4.0,4.0,3.820956,3.395762
Mandy,,9.0,,3.0,8.0,3.710347,7.006197
Lenny,2.0,5.0,8.0,9.0,,7.113263,3.952502
Joan,3.0,,9.0,4.0,9.0,5.240167,10.035759
Tino,1.0,1.0,,9.0,5.0,5.86328,2.197482


In [8]:
### GRADED
ifs = get_coefficients_for_training_features(ui_df, training_features)

### ANSWER CHECK
ifs.shape

(5, 2)

[Back to top](#-Index)

### Problem 4

#### New model for users

**10 Points**

Below, a dataframe `if_df` is created using the coefficients from our linear model on artists.  You are to use this data to create a new array of coefficients for the users.  Assign this array of coefficients as `uf2`.


HINT: The steps for this problem are similar to the ones in Problem 2.

In [9]:
if_df = reviews.copy().iloc[:, :-2]
if_df.loc[:, 'F1'] = ifs[:, 0]
if_df.loc[:, 'F2'] = ifs[:, 1]
if_df

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,-0.374013,0.158669
Clint Black,4.0,9.0,5.0,,1.0,0.07349,1.616899
Dropdead,,,8.0,9.0,,-0.046233,0.15015
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.579452,-0.426502
Cardi B,4.0,8.0,,9.0,5.0,0.281836,0.632869


In [10]:
### GRADED
uf2 = get_coefficients_for_training_features(if_df, training_features, False)

### ANSWER CHECK
uf2

array([[ 2.88218695,  2.80111945],
       [ 3.87098702,  6.19145213],
       [ 5.87426269,  3.30979744],
       [ 5.64168907, 14.25906352],
       [ 5.81583562,  1.12847153]])

[Back to top](#-Index)

### Problem 5

#### One more iteration

**5 Points**

Below, a dataframe `ui_df2` is created using the results of `uf2`.  Use the features `F1` and `F2` to create regression models for each artist and track the coefficients in `ifs2`.

HINT: The steps for this problem are similar to the ones in Problem 2.

In [11]:
ui_df2 = reviews.copy().iloc[:, :-2].T
ui_df2['F1'] = uf2[:, 0]
ui_df2['F2'] = uf2[:, 1]
ui_df2

Unnamed: 0,Michael Jackson,Clint Black,Dropdead,Anti-Cimex,Cardi B,F1,F2
Alfred,3.0,4.0,,4.0,4.0,2.882187,2.801119
Mandy,,9.0,,3.0,8.0,3.870987,6.191452
Lenny,2.0,5.0,8.0,9.0,,5.874263,3.309797
Joan,3.0,,9.0,4.0,9.0,5.641689,14.259064
Tino,1.0,1.0,,9.0,5.0,5.815836,1.128472


In [12]:
### GRADED
ifs2 = get_coefficients_for_training_features(ui_df2, training_features, False)

### ANSWER CHECK
ifs2

array([[ 0.25979239,  0.130603  ],
       [-0.07440515,  1.52244692],
       [ 1.29491622,  0.11883567],
       [ 1.67123624, -0.40000091],
       [ 0.92255808,  0.3384743 ]])

[Back to top](#-Index)

### Problem 6

#### Comparing Models

**10 Points**

Using the item factors from the first iteration (`if_df`) and from the last iteration (`if_df2`) as inputs, fit a `LinearRegression` model for Alfred with each set and compute the `mean_squared_error` of each model.  Which item factors did a better job as inputs to the model: `if_df` or `if_df2`?  Assign your answer as a string to `ans6` below.
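Because Alfred has not rated every artist, the error can only be measured on the rows where a rating exists. A minimal sketch of that masking step, using a hypothetical helper `mse_on_rated` and made-up predictions purely for illustration:

```python
import numpy as np

def mse_on_rated(actual, predicted):
    """Mean squared error over the entries of `actual` that are not NaN."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rated = ~np.isnan(actual)
    return float(np.mean((actual[rated] - predicted[rated]) ** 2))

# Alfred's ratings from the reviews table; NaN marks the unrated artist
alfred = [3.0, 4.0, np.nan, 4.0, 4.0]
# Illustrative predictions only, not output from the notebook's models
predicted = [3.5, 4.0, 6.0, 4.0, 3.5]

print(mse_on_rated(alfred, predicted))  # (0.25 + 0 + 0 + 0.25) / 4 = 0.125
```

The prediction for the unrated artist has no effect on the score, which is why the same mask should be applied to both sets of item factors when comparing them.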

In [13]:
if_df2 = reviews.copy().iloc[:, :-2]
if_df2.loc[:, 'F1'] = ifs2[:, 0]
if_df2.loc[:, 'F2'] = ifs2[:, 1]
if_df2

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.259792,0.130603
Clint Black,4.0,9.0,5.0,,1.0,-0.074405,1.522447
Dropdead,,,8.0,9.0,,1.294916,0.118836
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.671236,-0.400001
Cardi B,4.0,8.0,,9.0,5.0,0.922558,0.338474


In [14]:
if_df.to_csv('data/Q.csv')
ui_df.to_csv('data/P.csv')

In [15]:
(ui_df[['F1', 'F2']]@if_df[['F1', 'F2']].T).T

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino
Michael Jackson,-0.890284,-0.276051,-2.033312,-0.367525,-1.84427
Clint Black,5.771405,11.600983,6.913547,16.611904,3.983997
Dropdead,0.333221,0.880442,0.264603,1.264603,0.058877
Anti-Cimex,4.586717,2.872159,9.549307,3.996322,8.323539
Cardi B,3.225954,5.479713,4.506187,7.828187,3.043199


In [16]:
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.496714,-0.234137
Clint Black,4.0,9.0,5.0,,1.0,-0.138264,1.579213
Dropdead,,,8.0,9.0,,0.647689,0.767435
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.52303,-0.469474
Cardi B,4.0,8.0,,9.0,5.0,-0.234153,0.54256


In [17]:
(ui_df2[['F1', 'F2']]@if_df2[['F1', 'F2']].T).T

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino
Michael Jackson,1.114605,1.814275,1.958358,3.327944,1.658292
Clint Black,4.050106,9.138136,4.601916,21.288897,1.28531
Dropdead,4.065064,5.748369,8.0,9.0,7.665123
Anti-Cimex,3.696365,3.992747,8.493359,3.724957,9.268246
Cardi B,3.607092,5.666858,6.53963,10.031112,5.747405


In [18]:
if_df[['F1', 'F2']]

Unnamed: 0,F1,F2
Michael Jackson,-0.374013,0.158669
Clint Black,0.07349,1.616899
Dropdead,-0.046233,0.15015
Anti-Cimex,1.579452,-0.426502
Cardi B,0.281836,0.632869


In [19]:
from sklearn.metrics import mean_squared_error

In [20]:
# mean_squared_error?

In [21]:
### GRADED
# First approach
X1 = if_df[['F1', 'F2']]
y = if_df['Alfred']
# Get indices of non-null values
non_null_indices = y.notna()
# Use only non-null values for both X and y
lr1 = LinearRegression().fit(X1[non_null_indices], y[non_null_indices])
# Predict all values
y1_pred = lr1.predict(X1)
# Calculate MSE only on the non-null indices
mse1 = mean_squared_error(y[non_null_indices], y1_pred[non_null_indices])

print(mse1)
# Second approach
X2 = if_df2[['F1', 'F2']]
# Use the same non-null indices for consistency
lr2 = LinearRegression().fit(X2[non_null_indices], y[non_null_indices])
y2_pred = lr2.predict(X2)
mse2 = mean_squared_error(y[non_null_indices], y2_pred[non_null_indices])
print(mse2)
# Compare the MSE values and assign the answer
ans6 = 'if_df' if mse1 < mse2 else 'if_df2'

### ANSWER CHECK
ans6

0.02135242394309102
0.001221378492181394


'if_df2'

## Additional Exploration - Effort to Abstract Process and Run it on 20 iterations

In [22]:
def create_feature_df(original_df, coefficients, feature_names):
    """
    Creates a new dataframe with updated feature values from coefficients
    
    Parameters:
    original_df (DataFrame): The original dataframe to use as base
    coefficients (ndarray): Array of coefficients (one row per column in original_df)
    feature_names (list): List of feature column names
    
    Returns:
    DataFrame: New dataframe with updated feature columns
    """
    # Create a copy of the original dataframe without the feature columns
    if feature_names[0] in original_df.columns:
        new_df = original_df.drop(columns=feature_names).copy()
    else:
        new_df = original_df.copy()
    
    # Add the feature columns with values from coefficients
    for i, feature in enumerate(feature_names):
        new_df[feature] = coefficients[:, i]
        
    return new_df

In [23]:
def create_transposed_feature_df(original_df, coefficients, feature_names):
    """
    Creates a new transposed dataframe with updated feature values from coefficients
    
    Parameters:
    original_df (DataFrame): The original dataframe to use as base
    coefficients (ndarray): Array of coefficients (one row per column in transposed original_df)
    feature_names (list): List of feature column names
    
    Returns:
    DataFrame: New transposed dataframe with updated feature columns
    """
    # Create a copy of the original dataframe without the feature columns, then transpose
    if feature_names[0] in original_df.columns:
        new_df = original_df.drop(columns=feature_names).copy().T
    else:
        # No feature columns are present, so there is nothing to strip before transposing
        new_df = original_df.copy().T
    
    # Add the feature columns with values from coefficients
    for i, feature in enumerate(feature_names):
        new_df[feature] = coefficients[:, i]
        
    return new_df

In [24]:
def train_recommendation_model(initial_df, training_features, iterations=2):
    """
    Train a collaborative filtering recommendation model iteratively
    
    Parameters:
    initial_df (DataFrame): The initial dataframe with user-item ratings
    training_features (list): List of feature column names to use
    iterations (int): Number of iterations to perform
    
    Returns:
    dict: Dictionary containing all model components at each iteration
    """
    # Create a dictionary to store all model components
    model_components = {
        'user_factors': [],
        'item_factors': [],
        'user_dfs': [],
        'item_dfs': []
    }
    
    # Make a copy of the initial dataframe and add random features for the first iteration
    current_df = initial_df.copy()
    
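    # Add random starting values for any missing feature. Seed once, before the
    # loop, so that each feature receives different values.
    missing = [f for f in training_features if f not in current_df.columns]
    if missing:
        np.random.seed(42)
    for feature in missing:
        current_df[feature] = np.random.normal(size=len(current_df))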
    
    # Run the specified number of iterations
    for i in range(iterations):
        print(f"Running iteration {i+1}/{iterations}...")
        
        # Step 1: Get user factors
        uf = get_coefficients_for_training_features(current_df, training_features, False)
        model_components['user_factors'].append(uf)
        
        # Step 2: Create transposed dataframe for item modeling
        ui_df = create_transposed_feature_df(initial_df, uf, training_features)
        model_components['user_dfs'].append(ui_df)
        
        # Step 3: Get item factors
        ifs = get_coefficients_for_training_features(ui_df, training_features, False)
        model_components['item_factors'].append(ifs)
        
        # Step 4: Create dataframe with item factors
        base_df = initial_df.copy()
        # Remove feature columns if they exist
        for feature in training_features:
            if feature in base_df.columns:
                base_df = base_df.drop(columns=feature)
                
        if_df = create_feature_df(base_df, ifs, training_features)
        model_components['item_dfs'].append(if_df)
        
        # Update current_df for the next iteration
        current_df = if_df
    
    return model_components

# Function to evaluate model performance for a specific user
def evaluate_user_model_performance(user, item_df, training_features):
    """
    Evaluate model performance for a specific user
    
    Parameters:
    user (str): Name of the user column to evaluate
    item_df (DataFrame): Dataframe with item factors
    training_features (list): List of feature column names
    
    Returns:
    float: Mean squared error for the user
    """
    X = item_df[training_features]
    y = item_df[user]  # item_df carries the rating columns, so no global is needed
    
    # Get indices of non-null values
    non_null_indices = y.notna()
    
    # Fit model on non-null values
    lr = LinearRegression(fit_intercept=False).fit(
        X[non_null_indices], 
        y[non_null_indices]
    )
    
    # Predict all values
    y_pred = lr.predict(X)
    
    # Calculate MSE only on the non-null indices
    mse = mean_squared_error(y[non_null_indices], y_pred[non_null_indices])
    
    return mse

# Example usage:
training_features = ['F1', 'F2']
model = train_recommendation_model(reviews, training_features, iterations=20)

# Evaluate model performance for 'Alfred' after each iteration
for i in range(len(model['item_dfs'])):
    mse = evaluate_user_model_performance('Alfred', model['item_dfs'][i], training_features)
    print(f"Iteration {i+1} MSE for Alfred: {mse:.6f}")

# Find which iteration had the best performance
mse_values = [evaluate_user_model_performance('Alfred', df, training_features) 
              for df in model['item_dfs']]
best_iteration = np.argmin(mse_values) + 1
print(f"Best model was from iteration {best_iteration} with MSE: {min(mse_values):.6f}")

Running iteration 1/20...
Running iteration 2/20...
Running iteration 3/20...
Running iteration 4/20...
Running iteration 5/20...
Running iteration 6/20...
Running iteration 7/20...
Running iteration 8/20...
Running iteration 9/20...
Running iteration 10/20...
Running iteration 11/20...
Running iteration 12/20...
Running iteration 13/20...
Running iteration 14/20...
Running iteration 15/20...
Running iteration 16/20...
Running iteration 17/20...
Running iteration 18/20...
Running iteration 19/20...
Running iteration 20/20...
Iteration 1 MSE for Alfred: 0.646826
Iteration 2 MSE for Alfred: 0.632911
Iteration 3 MSE for Alfred: 0.617430
Iteration 4 MSE for Alfred: 0.608499
Iteration 5 MSE for Alfred: 0.603499
Iteration 6 MSE for Alfred: 0.600685
Iteration 7 MSE for Alfred: 0.599076
Iteration 8 MSE for Alfred: 0.598139
Iteration 9 MSE for Alfred: 0.597583
Iteration 10 MSE for Alfred: 0.597250
Iteration 11 MSE for Alfred: 0.597047
Iteration 12 MSE for Alfred: 0.596923
Iteration 13 MSE for A