## Matrix sparsity

A common challenge with real-world ratings data is that most users will not have rated most items, and most items will only have been rated by a small number of users. This results in a mostly empty, or sparse, DataFrame.
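
As a minimal sketch of what sparsity looks like, the snippet below builds a small made-up ratings table and measures the share of empty cells. The `toy_ratings` name and its values are assumptions for illustration, not the movie_lens data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 users x 5 movies, NaN marks "not rated"
toy_ratings = pd.DataFrame(
    {
        "Movie_A": [5.0, np.nan, np.nan, 4.0],
        "Movie_B": [np.nan, 3.0, np.nan, np.nan],
        "Movie_C": [np.nan, np.nan, np.nan, 2.0],
        "Movie_D": [4.0, np.nan, 1.0, np.nan],
        "Movie_E": [np.nan, np.nan, np.nan, np.nan],
    },
    index=["User_1", "User_2", "User_3", "User_4"],
)

# Number of empty cells and total number of cells
empty_cells = toy_ratings.isnull().values.sum()
total_cells = toy_ratings.size

# Sparsity is the share of cells with no rating
toy_sparsity = empty_cells / total_cells
print(toy_sparsity)
```

Here 14 of the 20 cells are empty, so even this tiny table is 70% sparse.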

In this exercise, you will calculate how sparse the movie_lens ratings data is by counting the number of empty cells and comparing it to the size of the full DataFrame. The DataFrame user_ratings_df that you have used in previous exercises, containing a row per user and a column per movie, has been loaded for you.

### Instructions
    - Count the number of empty cells in user_ratings_df and store the result as sparsity_count.
    - Count the total number of cells in the user_ratings_df DataFrame and store it as full_count.
    - Calculate the sparsity of the DataFrame by dividing the number of empty cells by the total number of cells and print the result.

In [None]:
# Count the empty (missing) cells
sparsity_count = user_ratings_df.isnull().values.sum()

# Count all cells
full_count = user_ratings_df.size

# Find the sparsity of the DataFrame
sparsity = sparsity_count / full_count
print(sparsity)

## Limited data in your rows

This data sparsity can cause problems for techniques like K-nearest neighbors (KNN), as discussed in the last chapter. KNN needs to find the k most similar users that have rated an item, but if k or fewer users have rated that item, all of their ratings will count as the "most similar", regardless of how similar those users actually are.
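
To make the problem concrete, here is a small sketch on made-up data that flags the items rated by k or fewer users. The `toy_ratings` table and the choice of `k = 2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are movies
toy_ratings = pd.DataFrame(
    {
        "Movie_A": [5.0, 4.0, 3.0, 4.0],
        "Movie_B": [np.nan, 3.0, np.nan, np.nan],
        "Movie_C": [2.0, np.nan, np.nan, 1.0],
        "Movie_D": [4.0, 5.0, 1.0, np.nan],
    },
    index=["User_1", "User_2", "User_3", "User_4"],
)

k = 2  # assumed number of neighbors

# Number of users who rated each movie
ratings_per_movie = toy_ratings.notnull().sum()

# Movies where every available rating would be picked as a "neighbor"
too_few = ratings_per_movie[ratings_per_movie <= k]
print(too_few)
```

For the flagged movies, KNN has no real choice to make: every user who rated the movie ends up in the neighborhood.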

In this exercise, you will count how often each movie in the user_ratings_df DataFrame has been given a rating, and then see how many have only one or two ratings.

### Instructions 1/3
    - Count the number of non-empty cells in each column of user_ratings_df and store it as occupied_count.

In [None]:
# Count the occupied cells per column
occupied_count = user_ratings_df.notnull().sum()
print(occupied_count)

### Instructions 2/3
    - Sort occupied_count from low to high. Looking at the resulting sorted Series, note how many movies have only one rating.

In [None]:
# Count the occupied cells per column
occupied_count = user_ratings_df.notnull().sum()

# Sort the resulting series from low to high
sorted_occupied_count = occupied_count.sort_values()
print(sorted_occupied_count)

### Instructions 3/3
    - Create a histogram of the sorted_occupied_count Series you just created. matplotlib.pyplot has been loaded as plt.

In [None]:
# Count the occupied cells per column
occupied_count = user_ratings_df.notnull().sum()

# Sort the resulting series from low to high
sorted_occupied_count = occupied_count.sort_values()

# Plot a histogram of the values in sorted_occupied_count
sorted_occupied_count.hist()
plt.show()

## Information loss in factorization

You may wonder how factors with far fewer columns can summarize a larger DataFrame without loss. In fact, they don't: the factors we create are generally a close approximation of the data, and some information is inevitably lost. This means that predicted values might not be exact, but they should be close enough to be useful.
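
The loss can be seen in a minimal sketch, assuming a small made-up matrix and a truncated SVD computed with `np.linalg.svd` (one possible way to obtain two thin factors, not necessarily how the factors in this exercise were produced). All names and values below are hypothetical.

```python
import numpy as np

# Hypothetical 4x4 ratings matrix with no missing values
original = np.array(
    [
        [5.0, 4.0, 1.0, 1.0],
        [4.0, 5.0, 2.0, 1.0],
        [1.0, 2.0, 5.0, 4.0],
        [2.0, 1.0, 4.0, 5.0],
    ]
)

# Full SVD, then keep only the 2 largest singular values
U, s, Vt = np.linalg.svd(original, full_matrices=False)
n_factors = 2
user_factors = U[:, :n_factors] * s[:n_factors]  # shape (4, 2)
item_factors = Vt[:n_factors, :]                 # shape (2, 4)

# The product of the two thin factors approximates the original
approximation = user_factors @ item_factors

# Largest absolute difference between original and approximation
max_error = np.abs(original - approximation).max()
print(np.round(approximation, 2))
print(max_error)
```

The approximation has the same shape as the original, but its values differ slightly because the two smallest singular values were discarded.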

In this exercise, you will inspect the same original pre-factorization DataFrame from the last exercise loaded as original_df, and compare it to the product of its two factors, user_matrix and item_matrix.

### Instructions
    - Find the dot product of user_matrix and item_matrix and store it as predictions_df.

In [None]:
import numpy as np

# Multiply the user and item matrices
predictions_df = np.dot(user_matrix, item_matrix)
# Inspect the recreated DataFrame
print(predictions_df)

# Inspect the original DataFrame and compare
print(original_df)

## Normalize your data

Before you can find the factors of the ratings matrix using singular value decomposition, you will need to "de-mean", or center it, by subtracting each row's mean from each value in that row.
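
A minimal sketch of row-wise centering on made-up data may help here. The `toy_ratings` table is an assumption for illustration; the point to notice is that `axis=0` in `.sub()` lines the row means up with the row index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are movies
toy_ratings = pd.DataFrame(
    {
        "Movie_A": [5.0, 2.0, np.nan],
        "Movie_B": [3.0, np.nan, 4.0],
        "Movie_C": [np.nan, 4.0, 2.0],
    },
    index=["User_1", "User_2", "User_3"],
)

# Mean rating per user (NaN values are skipped by default)
row_means = toy_ratings.mean(axis=1)

# axis=0 aligns row_means with the row index, so each user's
# mean is subtracted from that user's own ratings
centered = toy_ratings.sub(row_means, axis=0)

# Unrated movies become 0, i.e. "no deviation from the user's mean"
centered = centered.fillna(0)
print(centered)
```

After centering and filling, every row averages to zero, which is a quick way to check that the means were subtracted along the intended axis.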

In this exercise, you will begin preparing the movie rating DataFrame you have been working with so that you can perform singular value decomposition on it.

user_ratings_df contains a row per user and a column for each movie and has been loaded for you.

### Instructions
    - Find the average rating each user has given across all the movies they have seen and store these values as avg_ratings.
    - Subtract the row averages from their respective rows and store the result as user_ratings_centered.
    - Finally, fill in all missing values in user_ratings_centered with zeros.
    - Print the average of each row in user_ratings_centered to show they have been de-meaned.

In [None]:
# Get the average rating for each user
avg_ratings = user_ratings_df.mean(axis=1)

# Center each user's ratings around 0 (axis=0 aligns avg_ratings with the row index)
user_ratings_centered = user_ratings_df.sub(avg_ratings, axis=0)

# Fill in all missing values with 0s
user_ratings_centered.fillna(0, inplace=True)

# Print the mean of each row (should be close to 0)
print(user_ratings_centered.mean(axis=1))

## Decomposing your matrix

Now that you have prepped your data by centering it and filling in the remaining empty values with 0, you can get around to finding your data's factors. In this exercise, you will break the user_ratings_centered data you generated in the last exercise into 3 factors: U, sigma, and Vt.

    - U is a matrix with a row for each user
    - Vt has a column for each movie
    - sigma is an array of weights that you will need to convert to a diagonal matrix
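
The shapes described above can be checked with a short sketch on random data. The matrix size, the seed, and the choice of `k=3` latent factors are assumptions for illustration, not values taken from the exercise.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical centered ratings: 8 users x 10 movies
rng = np.random.default_rng(0)
toy_centered = rng.normal(size=(8, 10))

# k is the number of latent factors to keep (must be < min(shape))
n_factors = 3
U, sigma, Vt = svds(toy_centered, k=n_factors)

print(U.shape)      # one row per user
print(sigma.shape)  # one weight per factor, as a 1-D array
print(Vt.shape)     # one column per movie

# Turn the 1-D weights into a square diagonal matrix
sigma_diag = np.diag(sigma)
print(sigma_diag.shape)
```

Once `sigma` is diagonal, the three shapes chain together for matrix multiplication: (users x k), (k x k), and (k x movies).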

The user_ratings_centered that you created in the last lesson has been loaded for you.

### Instructions 1/2
    - Import svds from scipy.sparse.linalg.
    - Decompose user_ratings_centered into its factor matrices: U, sigma, and Vt.

In [None]:
# Import the required libraries 
from scipy.sparse.linalg import svds
import numpy as np

# Decompose the matrix (pass the underlying NumPy array; newer SciPy versions may reject a DataFrame)
U, sigma, Vt = svds(user_ratings_centered.to_numpy())

### Instructions 2/2
    - Convert the sigma array into a diagonal matrix.

In [None]:
# Import the required libraries 
from scipy.sparse.linalg import svds
import numpy as np

# Decompose the matrix (pass the underlying NumPy array; newer SciPy versions may reject a DataFrame)
U, sigma, Vt = svds(user_ratings_centered.to_numpy())

# Convert sigma into a diagonal matrix
sigma = np.diag(sigma)
print(sigma)

## Recalculating the matrix

Now that you have your three factor matrices, you can multiply them back together to get complete ratings data without missing values. In this exercise, you will use NumPy's dot product function to multiply U and sigma first, then the result by Vt. You will then be able to add the average ratings for each row to find your final ratings.
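
The role of the reshape when adding the averages back is easy to miss, so here is a minimal sketch with small hand-made factors. The matrices and means below are hypothetical values chosen so the arithmetic is easy to follow; they are not the output of a real decomposition.

```python
import numpy as np

# Hypothetical factors for 3 users, 2 latent factors, 4 movies
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sigma = np.diag([2.0, 0.5])
Vt = np.array([[0.5, -0.5, 0.5, -0.5], [1.0, 1.0, -1.0, -1.0]])

# Hypothetical per-user mean ratings removed during centering
row_means = np.array([3.0, 4.0, 2.5])

# Rebuild the centered predictions: (3x2) . (2x2) . (2x4) -> (3x4)
centered_pred = np.dot(np.dot(U, sigma), Vt)

# reshape(-1, 1) makes a (3, 1) column so NumPy broadcasts one
# mean across every movie in the matching user's row
predictions = centered_pred + row_means.reshape(-1, 1)
print(predictions)
```

Without the reshape, NumPy would try to line the three means up with the four movie columns and raise a shape error.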

U, sigma, Vt, avg_ratings, and user_ratings_df from the previous exercise have been loaded for you. Also, numpy has been loaded as np.

### Instructions 1/4
    - Find the dot product of the matrix U and sigma.

In [None]:
# Dot product of U and sigma
U_sigma = np.dot(U, sigma)

### Instructions 2/4
    - Find the dot product of U_sigma and Vt and print the result.

In [None]:
# Dot product of U and sigma
U_sigma = np.dot(U, sigma)

# Dot product of result and Vt
U_sigma_Vt = np.dot(U_sigma, Vt)

# Print the result
print(U_sigma_Vt)

### Instructions 3/4
    - Reshape the values of avg_ratings and add them back onto U_sigma_Vt.

In [None]:
# Dot product of U and sigma
U_sigma = np.dot(U, sigma)

# Dot product of result and Vt
U_sigma_Vt = np.dot(U_sigma, Vt)

# Add the row means contained in avg_ratings back on
uncentered_ratings = U_sigma_Vt + avg_ratings.values.reshape(-1, 1)

### Instructions 4/4
    - Create a DataFrame of the results using the original index and column names from user_ratings_df.

In [None]:
import pandas as pd

# Dot product of U and sigma
U_sigma = np.dot(U, sigma)

# Dot product of result and Vt
U_sigma_Vt = np.dot(U_sigma, Vt)

# Add the row means contained in avg_ratings back on
uncentered_ratings = U_sigma_Vt + avg_ratings.values.reshape(-1, 1)

# Create DataFrame of the results
calc_pred_ratings_df = pd.DataFrame(uncentered_ratings, 
                                    index=user_ratings_df.index,
                                    columns=user_ratings_df.columns
                                   )
# Print both the recalculated matrix and the original 
print(calc_pred_ratings_df)
print(original_df)

## Making recommendations with SVD

Now that you have the recalculated matrix with all of its gaps filled in, the next step is to use it to generate predictions and recommendations.

Using calc_pred_ratings_df that you generated in the last exercise, with all rows and columns filled, find the movies that User_5 is most likely to enjoy.
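
One refinement worth considering, which goes beyond what this exercise asks for, is to drop movies the user has already rated before ranking. The sketch below assumes two small made-up DataFrames, `toy_actual` and `toy_predicted`, which stand in for the observed and recalculated ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical observed ratings (NaN = not yet rated)
toy_actual = pd.DataFrame(
    {"Movie_A": [5.0, np.nan], "Movie_B": [np.nan, 2.0],
     "Movie_C": [np.nan, np.nan], "Movie_D": [1.0, 4.0]},
    index=["User_1", "User_2"],
)

# Hypothetical predictions with every cell filled in
toy_predicted = pd.DataFrame(
    {"Movie_A": [4.8, 3.1], "Movie_B": [3.9, 2.2],
     "Movie_C": [4.4, 4.6], "Movie_D": [1.3, 3.8]},
    index=["User_1", "User_2"],
)

user = "User_1"

# Boolean Series: True for movies this user has not rated yet
unseen = toy_actual.loc[user].isnull()

# Keep only the predictions for unseen movies
candidates = toy_predicted.loc[user][unseen]

# Highest predicted ratings first
recommendations = candidates.sort_values(ascending=False)
print(recommendations)
```

Filtering first keeps the top of the list from being filled with movies the user has already seen.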

### Instructions
    - Find the highest-ranked movies for User_5 by sorting all the predicted ratings for User_5 from high to low.

In [None]:
# Sort the ratings of User 5 from high to low
user_5_ratings = calc_pred_ratings_df.loc['User_5',:].sort_values(ascending=False)

print(user_5_ratings)

## Comparing recommendation methods

In this course, you have predicted how you believe a user would rate movies they have not seen using multiple different methods (basic average ratings, KNN, matrix factorization). In this final exercise, you'll work through a comparison of the averaged ratings and matrix factorization using the mean_squared_error() as the measure of how well they are performing. The predictions based on averages have been loaded as avg_pred_ratings_df while the calculated predictions have been loaded as calc_pred_ratings_df. The ground truth values have been loaded as act_ratings_df.

Finally, the mean_squared_error() function has been imported for your use from sklearn.metrics.
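
The masking idea can be sketched with plain NumPy on made-up arrays. The `rmse` helper and all values below are hypothetical, and the error is computed by hand here rather than with mean_squared_error() so that the example has no dependency beyond NumPy.

```python
import numpy as np

# Hypothetical ground truth with missing ratings
actual = np.array([[5.0, np.nan, 3.0], [np.nan, 2.0, 4.0]])

# Hypothetical predictions from two different methods
pred_a = np.array([[4.0, 3.0, 3.0], [3.0, 2.0, 2.0]])
pred_b = np.array([[5.0, 1.0, 2.0], [5.0, 3.0, 4.0]])

# True wherever the ground truth has a rating
mask = ~np.isnan(actual)


def rmse(y_true, y_pred):
    """Root mean squared error of two equally shaped arrays."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Boolean indexing flattens the arrays to just the rated cells
print(rmse(actual[mask], pred_a[mask]))
print(rmse(actual[mask], pred_b[mask]))
```

Only the four rated cells contribute to each score, so predictions for unrated cells cannot help or hurt either method.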

### Instructions
    - Extract the first 20 rows and first 100 columns (the area that you want to compare) of the act_ratings_df, avg_pred_ratings_df, and calc_pred_ratings_df DataFrames.
    - Create a mask of the actual_values array that targets only non-empty cells.
    - Find the root mean squared error (RMSE) between each of the two sets of predictions and the ground truth values.

In [None]:
# Extract the ground truth to compare your predictions against
actual_values = act_ratings_df.iloc[:20, :100].values
avg_values = avg_pred_ratings_df.iloc[:20, :100].values
predicted_values = calc_pred_ratings_df.iloc[:20, :100].values

# Create a mask of actual_values to only look at the non-missing values in the ground truth
mask = ~np.isnan(actual_values)

# Print the RMSE of both sets of predictions and compare
# (np.sqrt of the MSE avoids the squared=False argument, which newer scikit-learn versions no longer accept)
print(np.sqrt(mean_squared_error(actual_values[mask], avg_values[mask])))
print(np.sqrt(mean_squared_error(actual_values[mask], predicted_values[mask])))