## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner. 
1. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. Cells in which "# YOUR CODE HERE" is found are the cells where your graded code should be written.
2. In order to test out or debug your code you may also create notebook cells or edit existing notebook cells other than "# YOUR CODE HERE". We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
3. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE".
4. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
5. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
6. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
7. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
8. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
9. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, there may be some code that you ran and deleted that is still being used and that was why your asserts were passing.

# Unsupervised Learning - Matrix Completion

In [None]:
!pip install tensorflow
!pip install fancyimpute

import tensorflow as tf  # fancyimpute uses tensorflow, we'll explicitly load it so that's clear
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn as sk
from fancyimpute import SimpleFill, KNN, MatrixFactorization

%matplotlib inline
plt.style.use("ggplot")

tf.random.set_seed(0)
np.random.seed(0)

## Synthetic Data
First, let us create some synthetic data that we are going to use to test out a variety of matrix completion algorithms. Note that we force the data in $X$ to have a low rank. 

In [None]:
# Create the data matrix X
# If you think of this as a recommendation engine
# Select n as the number of users
n = 100 
# Select m as the number of movies or products
m = 50
# Select the inner_rank as the number of real genres or categories that movies or products belong to
inner_rank = 10
user_matrix = np.random.randn(n, inner_rank)
item_matrix = np.random.randn(inner_rank, m)
# X = np.dot(np.random.randn(n, inner_rank), np.random.randn(inner_rank, m))
# X = np.dot(user_matrix, item_matrix)
X = user_matrix @ item_matrix
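
As a quick sanity check on the low-rank claim, the sketch below rebuilds a matrix the same way and inspects its rank. It uses its own local random generator (`rng`) and its own name (`X_demo`), both chosen here for illustration, so it does not disturb the notebook's seeded global state or overwrite `X`.

```python
import numpy as np

# Local generator: leaves np.random's global seed (used by the graded cells) untouched
rng = np.random.default_rng(0)
n, m, inner_rank = 100, 50, 10

user_matrix_demo = rng.standard_normal((n, inner_rank))
item_matrix_demo = rng.standard_normal((inner_rank, m))
X_demo = user_matrix_demo @ item_matrix_demo

# A product of an (n x r) and an (r x m) matrix has rank at most r
rank = np.linalg.matrix_rank(X_demo)
singular_values = np.linalg.svd(X_demo, compute_uv=False)
n_significant = int((singular_values > 1e-8).sum())

print("shape:", X_demo.shape)
print("rank:", rank)
print("singular values above 1e-8:", n_significant)
```

Only `inner_rank` of the 50 singular values are meaningfully non-zero, which is what makes the missing entries recoverable in principle.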

In [None]:
# Let's visualize the matrix X
plt.figure(figsize=(4, 8))
plt.imshow(X)
plt.grid(False)
plt.title("Ground Truth")
plt.show()

Here we've essentially created a matrix that represents the ground truth of how much each person really likes each product. Now in reality, we don't actually have access to that matrix. We have access to a very tiny view of this matrix because our users don't usually review products to let us know what they like.

## Removing Data From $X$
Since we are studying matrix completion algorithms we need to have some missing values! To do this we will randomly remove some data from $X$. 

Remember: since $X$ has rank at most `inner_rank` (every column is a linear combination of the same `inner_rank` vectors), it should be relatively easy for a matrix completion algorithm to reconstruct the missing data.

In [None]:
# What percent of user-movie combinations do we have access to?
# For now we'll select an easier case but you can come back and change this value to see how the model performs
visible_percentage = 0.6
missing_mask = np.random.rand(*X.shape) < (1 - visible_percentage)
X_incomplete = X.copy()
# missing entries indicated with NaN
X_incomplete[missing_mask] = np.nan
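
To see what the masking step produces, the following sketch repeats it on a throwaway matrix and checks two things: the fraction of hidden entries and that `NaN` appears exactly where the mask is `True`. The names `X_demo`, `missing_mask_demo`, and `rng` are illustrative and separate from the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, independent of the notebook's seed
X_demo = rng.standard_normal((100, 50))
visible_percentage = 0.6

# True marks an entry that will be hidden
missing_mask_demo = rng.random(X_demo.shape) < (1 - visible_percentage)
X_demo_incomplete = X_demo.copy()
X_demo_incomplete[missing_mask_demo] = np.nan

# The mask is random, so the hidden fraction is only approximately 1 - visible_percentage
missing_fraction = missing_mask_demo.mean()
nan_matches_mask = np.array_equal(np.isnan(X_demo_incomplete), missing_mask_demo)

print(f"fraction missing: {missing_fraction:.3f}")
print("NaN positions match the mask:", nan_matches_mask)
```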

In [None]:
# Let's visualize the data matrix X_incomplete
plt.figure(figsize=(4, 8))
plt.imshow(X_incomplete)
plt.grid(False)
plt.show()

## Matrix Completion
Now, given the incomplete matrix `X_incomplete`, we want to try to fill in the missing values. For this we are going to use the Python package `fancyimpute` (https://github.com/iskandr/fancyimpute).

As a first example, here is a reference implementation that fills in all the missing values using `SimpleFill`:

In [None]:
meanFill = SimpleFill("mean")
X_filled_mean = meanFill.fit_transform(X_incomplete)
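
For intuition, mean filling can be reproduced by hand with NumPy. The sketch below assumes that the fill value is the mean of the observed entries in each column; it is an illustration of the idea on a tiny matrix (`X_small` is a made-up example), not `fancyimpute`'s own code.

```python
import numpy as np

X_small = np.array([[1.0, np.nan, 3.0],
                    [4.0, 5.0, np.nan],
                    [7.0, 9.0, 6.0]])

# Mean of the observed (non-NaN) entries in each column: [4.0, 7.0, 4.5]
col_means = np.nanmean(X_small, axis=0)

# Keep observed values, substitute the column mean where a value is missing
X_small_filled = np.where(np.isnan(X_small), col_means, X_small)
print(X_small_filled)
```

Every missing entry in a column gets the same value, so this baseline ignores any relationship between users or between items.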

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,6))

ax1.imshow(X)
ax1.set_title("Original Matrix")
ax1.grid(False)

ax2.imshow(X_filled_mean)
ax2.set_title("Mean Fill Completed Matrix")
ax2.grid(False)
plt.show()

In [None]:
# To test the performance of our matrix completion algorithm we want to compare
# the "filled-in" values to the original:
def mat_completion_mse(X_filled, X_truth, missing_mask):
    """Calculates the mean squared error of the filled in values vs. the truth
    
    Args:
        X_filled (np.ndarray): The "filled-in" matrix from a matrix completion algorithm
        X_truth (np.ndarray): The true filled in matrix
        missing_mask (np.ndarray): Boolean array of missing values
    
    Returns:
        float: Mean squared error of the filled values
    """
    return ((X_filled[missing_mask] - X_truth[missing_mask]) ** 2).mean()
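
A small worked example shows why the mask matters. The function is restated under a different name (`masked_mse`, chosen for this sketch) so the block runs on its own; the 2x2 matrices are made up.

```python
import numpy as np

def masked_mse(X_filled, X_truth, missing_mask):
    """Mean squared error over the masked (originally missing) entries only."""
    return ((X_filled[missing_mask] - X_truth[missing_mask]) ** 2).mean()

X_truth_demo = np.array([[1.0, 2.0],
                         [3.0, 4.0]])
X_filled_demo = np.array([[1.0, 0.0],
                          [9.0, 6.0]])
mask_demo = np.array([[False, True],
                      [False, True]])

# Masked entries: (0 - 2)^2 = 4 and (6 - 4)^2 = 4, so the mean is 4.0
mse_masked = masked_mse(X_filled_demo, X_truth_demo, mask_demo)

# Averaging over all four entries would also count the error at [1, 0]: (0 + 4 + 36 + 4) / 4 = 11.0
mse_all = ((X_filled_demo - X_truth_demo) ** 2).mean()

print(mse_masked, mse_all)
```

Only the entries the algorithm had to guess contribute to the score; entries that were visible are excluded by the boolean indexing.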

In [None]:
meanFill_mse = mat_completion_mse(X_filled_mean, X, missing_mask)
print("meanFill MSE: %f" % meanFill_mse)

### KNN Completion
Next you will use the K-Nearest Neighbors algorithm to fill in the missing values. First, we will need to find the best number of neighbors to use for the KNN algorithm:

In [None]:
# Find the best value for k
def find_best_k(k_neighbors, complete_mat, incomplete_mat, missing_mask):
    """Determines the best k to use for matrix completion with KNN
    
    Args:
        k_neighbors (iterable): The list of k's to try
        complete_mat (np.ndarray): The original matrix with complete values
        incomplete_mat (np.ndarray): The matrix with missing values
        missing_mask (np.ndarray): Boolean array of missing values
    
    Returns:
        int: the best value of k to use for that particular matrix
    """
    best_k = -1
    best_k_mse = np.inf
    
    for neighbors in k_neighbors:
        # YOUR CODE HERE
        raise NotImplementedError()
    return best_k
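
For intuition about what a KNN imputer does with a given `k`, here is a simplified sketch in plain NumPy. It is an assumption-laden illustration, not `fancyimpute`'s implementation: the function name `knn_impute_sketch`, the unweighted average, and the distance (mean squared difference over columns both rows observe) are all choices made for this example.

```python
import numpy as np

def knn_impute_sketch(X_nan, k):
    """Fill each NaN with the mean of the k nearest rows that observe that column."""
    X_nan = np.asarray(X_nan, dtype=float)
    observed = ~np.isnan(X_nan)
    X_out = X_nan.copy()
    n_rows = X_nan.shape[0]

    for i in range(n_rows):
        missing_cols = np.where(~observed[i])[0]
        if missing_cols.size == 0:
            continue

        # Distance from row i to every other row, using only co-observed columns
        dists = np.full(n_rows, np.inf)
        for j in range(n_rows):
            if j == i:
                continue
            shared = observed[i] & observed[j]
            if shared.any():
                dists[j] = np.mean((X_nan[i, shared] - X_nan[j, shared]) ** 2)

        order = np.argsort(dists)
        for c in missing_cols:
            # Closest rows that are comparable to row i and have a value in column c
            donors = [j for j in order if np.isfinite(dists[j]) and observed[j, c]][:k]
            if donors:
                X_out[i, c] = X_nan[donors, c].mean()
    return X_out

X_toy = np.array([[1.0, 1.0, np.nan],
                  [1.0, 1.0, 10.0],
                  [1.1, 0.9, 12.0],
                  [5.0, 5.0, 100.0]])

X_toy_filled = knn_impute_sketch(X_toy, k=2)
print(X_toy_filled[0, 2])  # rows 1 and 2 are nearest, so (10 + 12) / 2 = 11.0
```

With `k=2` the distant fourth row is ignored; a larger `k` would pull its value of 100 into the average, which is why the choice of `k` affects the error.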

In [None]:
k_neighbors = [2, 3, 4, 5, 10, 20]
best_k = find_best_k(k_neighbors, X, X_incomplete, missing_mask)

In [None]:
best_k

In [None]:
assert best_k == 5

Now that we have found the `best_k` to use let's see how well it performed:

In [None]:
# Run KNN with the best_k and store the result in X_filled_knn

# YOUR CODE HERE
raise NotImplementedError()

knnFill_mse = mat_completion_mse(X_filled_knn, X, missing_mask)
print("knnFill MSE: %f" % knnFill_mse)

In [None]:
assert knnFill_mse < meanFill_mse

## Visual Comparison of Matrix Completion Algorithms
To get a good idea of how these matrix completion algorithms compare we want to create a method that visualizes how well these algorithms actually perform.

#### Creating a Collection of Models
Now we create a handful of matrix completion algorithms that we want to visualize:
 - Mean Fill
 - K-Nearest Neighbors
 - MatrixFactorization (an implementation using gradient descent)
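
The idea behind the last item can be sketched in a few lines of NumPy: approximate the matrix as a product `U @ V` and run gradient descent on the squared error over the observed entries only. This is a minimal full-batch sketch with no regularization; the sizes, step size, step count, and all names are assumptions made for illustration and do not reflect how `fancyimpute` implements `MatrixFactorization`.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, independent of the notebook's seed

# Small rank-2 ground truth with roughly 70% of entries observed
X_true = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
observed = rng.random(X_true.shape) < 0.7

rank, lr, n_steps = 2, 0.01, 2000
U = 0.1 * rng.standard_normal((20, rank))
V = 0.1 * rng.standard_normal((rank, 15))

def observed_loss(U, V):
    """Half the sum of squared errors on observed entries."""
    residual = (U @ V - X_true) * observed
    return 0.5 * np.sum(residual ** 2)

initial_loss = observed_loss(U, V)
for _ in range(n_steps):
    residual = (U @ V - X_true) * observed  # hidden entries contribute zero
    grad_U = residual @ V.T
    grad_V = U.T @ residual
    U -= lr * grad_U
    V -= lr * grad_V
final_loss = observed_loss(U, V)

# The product U @ V also supplies values for the hidden entries
X_hat = U @ V
hidden_mse = np.mean((X_hat[~observed] - X_true[~observed]) ** 2)
print(f"observed loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(f"MSE on hidden entries: {hidden_mse:.4f}")
```

The hidden entries never enter the loss, so any accuracy on them comes from the low-rank structure shared with the observed entries.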

In [None]:
simpleFill = SimpleFill("mean")
knnFill = KNN(k=best_k)
mfFill = MatrixFactorization(learning_rate=0.01, rank=20)
methods = [simpleFill, knnFill, mfFill]
names = ["SimpleFill", "KNN", "MatFactor"]

In [None]:
def mat_completion_comparison(methods, incomplete_mat, complete_mat, missing_mask):
    """Using a list of provided matrix completion methods, calculate 
    the completed matrices and determine the associated 
    mean-squared-error results.
    
    Args:
        methods (iterable): A list of matrix completion algorithms
        incomplete_mat (np.ndarray): The incomplete matrix
        complete_mat (np.ndarray): The full matrix
        missing_mask (np.ndarray): Boolean array of missing values
    
    Returns:
        filled_mats (iterable): the "filled-in" matrices
        mses (iterable): the mean square error results
    """
    X_filled_mats = []
    mses = []
    for method in methods:
        # YOUR CODE HERE
        raise NotImplementedError()

    return X_filled_mats, mses

The autograder test below may take a minute or so to run.

You can ignore any warnings about `lr` or `learning_rate`.

In [None]:
X_filled_mats, mses = mat_completion_comparison(methods, X_incomplete, X, missing_mask)
assert len(X_filled_mats) == len(methods)
assert len(mses) == len(methods)

In [None]:
plt.figure(figsize=(12, 8)) # Change the figure size to your liking

for i in range(0, len(methods)):
    X_filled = X_filled_mats[i]
    mse = mses[i]
    ax = plt.subplot(131 + i)
    ax.imshow(X_filled)
    ax.title.set_text(f'{names[i]} MSE: {mse:0.2f}')
    ax.grid(False)
    
plt.show()

## Feedback

In [None]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    raise NotImplementedError()