# Matrix Completion, part 1

This lab is a precursor to the "big" lab on matrix completion that we'll do next week.  You will build much of the infrastructure for that assignment this week.  We will use our very small foods dataset to make sure all our code runs properly.  In some ways, this dataset is perfect for development: it is small, so our code will run fast, and the matrix is quite dense (that is, there aren't many missing data points).

Begin by running the cells below to create the `foodMatrix` matrix you'll be completing.  Note that you'll have both `foodFrame`, which is the pandas dataframe, and `foodMatrix`, which is the numpy array.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the survey data and keep only the food-rating columns.
df = pd.read_csv('datasets/foodsAndMovies.csv')
foodFrame = df[['Broccoli', 'Mushrooms', 'Beef Tacos', 'Salads', 'Black Licorice', 'Steak', 'Grilled Chicken', 'Mayonnaise', 'Candy Corn', 'Pulled Pork', 'Spicy Mustard', 'Raw Oysters', 'Bananas', 'Avocado', 'Eggs', 'Olives', 'Tofu', 'Cottage Cheese']]
# NumPy array of the ratings; unrated entries are NaN.
foodMatrix = foodFrame.to_numpy()
print(foodMatrix)

We need to know which indices have NaNs and which do not.  Write a function which takes in a matrix like `foodMatrix` and returns a list of `(row, column)` tuples, one for each entry that has a rating.  That is, if your function is called on `foodMatrix`, the returned list *should* contain `(0,0)` but should *not* contain `(2,17)`. This is because entry 0,0 is rated (with 4 stars) and entry 2,17 is not (it's a NaN).

You may use generative AI for this portion, and all portions until stated otherwise.
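One possible way to collect the rated positions is sketched below. The function name `rated_indices` and the small `demo` matrix are made up for illustration, and the sketch assumes the matrix has a float dtype so that `np.isnan` applies.

```python
import numpy as np

def rated_indices(matrix):
    """Return a list of (row, col) tuples, one per non-NaN entry."""
    # np.where on a 2-D boolean mask gives matching row and column arrays.
    rows, cols = np.where(~np.isnan(matrix))
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Hypothetical stand-in for foodMatrix: one missing rating at (0, 1).
demo = np.array([[4.0, np.nan],
                 [2.0, 5.0]])
print(rated_indices(demo))  # [(0, 0), (1, 0), (1, 1)]
```

Returning plain Python tuples keeps the result easy to pass to other tools that expect an ordinary list.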

Run your function on `foodMatrix` and use sklearn's function `train_test_split()` to split the list into a training set of indices and a testing set.
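As a rough sketch of the split, `train_test_split` can be called directly on a list of index tuples. The `demo` matrix, the 25% test fraction, and the fixed `random_state` are illustrative assumptions, not values required by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for foodMatrix with two missing ratings.
demo = np.array([[4.0, np.nan, 3.0],
                 [2.0, 5.0, np.nan],
                 [1.0, 1.0, 2.0]])

# All (row, col) positions that hold a rating.
indices = [(int(r), int(c)) for r, c in zip(*np.where(~np.isnan(demo)))]

# Hold out a quarter of the rated positions for testing (assumed fraction).
train_idx, test_idx = train_test_split(indices, test_size=0.25, random_state=0)
print(len(train_idx), len(test_idx))
```

Every rated position ends up in exactly one of the two lists, so the testing entries are never seen during training.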

We will need an error function so we can tell if our completion is working. We'll use root mean squared error (RMSE), which is defined as: 

$$\sqrt{\frac{\sum_{(i,j)\in T}(\text{original}[i,j]-\text{reproduction}[i,j])^2}{|T|}}$$

where $T$ is a set of indices, `original` is the original matrix, and `reproduction` is the completed matrix.

Your function should take three arguments: the original matrix, the reproduction, and a set of indices (like the training or testing sets you created above).  Test your function with some quick sanity checks.  If you call it with the same matrix for original and reproduction, do you get back 0?  If you very slightly change a value in the reproduction at a position that appears in the index set, do you get back a small but non-zero value?
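A minimal sketch of such an error function, together with the two sanity checks, might look like the following. The name `rmse` and the `demo` data are assumptions for illustration; only positions listed in `indices` are read, so NaNs elsewhere in the matrix do not affect the result.

```python
import numpy as np

def rmse(original, reproduction, indices):
    """Root mean squared error over the (row, col) pairs in `indices`."""
    idx = np.asarray(list(indices))      # shape (|T|, 2)
    rows, cols = idx[:, 0], idx[:, 1]
    diff = original[rows, cols] - reproduction[rows, cols]
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical data: one NaN that is deliberately left out of the index set.
demo = np.array([[4.0, np.nan],
                 [2.0, 5.0]])
demo_idx = [(0, 0), (1, 0), (1, 1)]

same = rmse(demo, demo.copy(), demo_idx)      # identical matrices

nudged = demo.copy()
nudged[1, 1] += 0.01                          # tiny change inside the index set
small = rmse(demo, nudged, demo_idx)          # should be small but non-zero
print(same, small)
```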

We need to make our initial guesses of $P$ and $Q$.  Write a function which takes in a matrix's size and assumed rank, and creates appropriate $P$ and $Q$.  A danger here is that large initial values can make `P@Q` overflow, so start with small, random values close to zero.
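One way the initialization could be sketched is shown below. The name `init_factors`, the `scale` and `seed` parameters, and the uniform distribution are all illustrative assumptions. The shapes are chosen so that `P@Q` has the same shape as the original matrix.

```python
import numpy as np

def init_factors(n_rows, n_cols, rank, scale=0.1, seed=None):
    """Create small random factors P (n_rows x rank) and Q (rank x n_cols)."""
    rng = np.random.default_rng(seed)
    # Uniform values in [0, scale): small, random, and close to zero.
    P = scale * rng.random((n_rows, rank))
    Q = scale * rng.random((rank, n_cols))
    return P, Q

# Hypothetical sizes, just to check that the shapes line up.
P, Q = init_factors(n_rows=5, n_cols=4, rank=2, seed=0)
print(P.shape, Q.shape, (P @ Q).shape)  # (5, 2) (2, 4) (5, 4)
```

Passing a `seed` makes runs repeatable, which helps when comparing different stepsizes or ranks later.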

It's business time.  Make a function called `epoch`, which runs one epoch of training.  It should accept your original matrix, your current values of $P$ and $Q$, a stepsize, a training set of indices, and a testing set of indices.  It should train on all elements of the training set, then calculate the training and testing errors.  It should return the updated matrices $P$ and $Q$, along with the scalar training and testing errors.

You may **NOT** use generative AI on this portion.  You may not use functions from other libraries that purport to perform matrix completion.

Now you have everything you need.  Choose some assumed rank, stepsize, and number of epochs, and run matrix completion.  Keep lists of the training and testing errors after each epoch.  You may **NOT** use generative AI on this portion.

Plot your training and testing errors as a function of the epoch. You **may** use generative AI on this portion.
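A plotting helper along these lines is one option, assuming matplotlib is available. The name `plot_errors` is made up, and the numbers in the usage lines are placeholders invented only to exercise the plot; they are not results from any run.

```python
import matplotlib.pyplot as plt

def plot_errors(train_errors, test_errors):
    """Plot per-epoch training and testing RMSE on one set of axes."""
    epochs = range(1, len(train_errors) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, train_errors, label="training RMSE")
    ax.plot(epochs, test_errors, label="testing RMSE")
    ax.set_xlabel("epoch")
    ax.set_ylabel("RMSE")
    ax.legend()
    return ax

# Placeholder numbers only, made up to exercise the plot; not real results.
placeholder_train = [2.0, 1.5, 1.2, 1.0]
placeholder_test = [2.1, 1.7, 1.5, 1.4]
ax = plot_errors(placeholder_train, placeholder_test)
```

Putting both curves on the same axes makes it easy to see where the testing error stops tracking the training error.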

Play around with the stepsize, rank, and number of epochs.  Below, answer a few questions by running experiments and showing me error plots that support your argument.

- What happened when your stepsize was too small? Too large?
- Relate the value of rank to bias and variance.  What choice of rank would cause high bias and low variance?  What about low bias and high variance?
- Are more epochs always better?

You may **NOT** use generative AI on this portion.