## Matrix sparsity

A common challenge with real-world ratings data is that most users will not have rated most items, and most items will only have been rated by a small number of users. This results in a very empty or sparse DataFrame.

In this exercise, you will calculate how sparse the movie_lens ratings data is by counting the number of empty cells and comparing it to the size of the full DataFrame. The DataFrame user_ratings_df that you have used in previous exercises, containing a row per user and a column per movie, has been loaded for you.

### Instructions
    - Count the number of empty cells in user_ratings_df and store the result as sparsity_count.
    - Count the total number of cells in the user_ratings_df DataFrame and store it as full_count.
    - Calculate the sparsity of the DataFrame by dividing the number of empty cells by the total number of cells and print the result.

In [None]:
# Count the empty cells
sparsity_count = user_ratings_df.isnull().values.sum()

# Count all cells
full_count = user_ratings_df.size

# Find the sparsity of the DataFrame
sparsity = sparsity_count / full_count
print(sparsity)
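
To make the calculation concrete, here is a minimal sketch on a made-up 3x4 ratings table. The name `toy_ratings` and all of its values are invented for illustration and are not part of the movie_lens data. It computes the share of empty cells alongside the share of occupied cells, which together always add up to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 users x 4 movies, NaN marks "not rated"
toy_ratings = pd.DataFrame(
    [[5.0, np.nan, np.nan, 1.0],
     [np.nan, 4.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 2.0]],
    index=["user_a", "user_b", "user_c"],
    columns=["movie_1", "movie_2", "movie_3", "movie_4"],
)

# Count empty, occupied, and total cells
empty_cells = toy_ratings.isnull().values.sum()
occupied_cells = toy_ratings.notnull().values.sum()
total_cells = toy_ratings.size

# Share of empty cells and share of occupied cells
toy_sparsity = empty_cells / total_cells
toy_density = occupied_cells / total_cells
print(toy_sparsity, toy_density)
```

Here 8 of the 12 cells are empty, so two thirds of the table carries no information about user preferences.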

## Limited data in your rows

This sparsity can cause problems for techniques like K-nearest neighbors (KNN), discussed in the last chapter. KNN needs to find the k most similar users who have rated an item, but if k or fewer users have rated it, every one of those users automatically counts among the "most similar", however dissimilar they actually are.
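
The problem can be illustrated with a small sketch. The similarity scores, ratings, and user names below are all hypothetical, and the neighbor selection is a simplified stand-in for a real KNN model. With k = 3 and only two users who rated the movie, both are selected, including one whose similarity to the target user is negative.

```python
import pandas as pd

# Hypothetical similarity of each user to the target user
similarity_to_target = pd.Series(
    {"user_a": 0.9, "user_b": -0.4, "user_c": 0.1, "user_d": 0.7}
)

# Hypothetical ratings for one rarely rated movie (only two raters)
movie_ratings = pd.Series({"user_b": 1.0, "user_c": 2.0})

k = 3

# Keep only users who rated the movie, then take the k most similar
candidates = similarity_to_target.loc[movie_ratings.index]
neighbors = candidates.nlargest(k).index

# The "prediction" is just the average of every available rating
prediction = movie_ratings.loc[neighbors].mean()
print(list(neighbors), prediction)
```

Because fewer than k users rated the movie, the similarity scores play no role in who gets selected, and the prediction collapses to the plain average of the two ratings.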

In this exercise, you will count how often each movie in the user_ratings_df DataFrame has been given a rating, and then see how many have only one or two ratings.

### Instructions 1/3
    - Count the number of non-empty cells in each column of user_ratings_df and store it as occupied_count.

In [None]:
# Count the occupied cells per column
occupied_count = user_ratings_df.notnull().sum()
print(occupied_count)

### Instructions 2/3
    - Sort occupied_count from low to high. Looking at the resulting sorted Series, note the number of movies with only one rating.

In [None]:
# Count the occupied cells per column
occupied_count = user_ratings_df.notnull().sum()

# Sort the resulting series from low to high
sorted_occupied_count = occupied_count.sort_values()
print(sorted_occupied_count)

### Instructions 3/3
    - Create a histogram of the sorted_occupied_count Series you just created. matplotlib.pyplot has been loaded as plt.

In [None]:
# Count the occupied cells per column
occupied_count = user_ratings_df.notnull().sum()

# Sort the resulting series from low to high
sorted_occupied_count = occupied_count.sort_values()

# Plot a histogram of the values in sorted_occupied_count
sorted_occupied_count.hist()
plt.show()
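
Rather than reading the counts off the sorted Series or the histogram by eye, the number of rarely rated movies can also be computed directly. The sketch below uses an invented `toy_ratings` table (not the exercise data) and boolean comparisons on the per-column counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 users (rows) x 4 movies (columns)
toy_ratings = pd.DataFrame(
    {
        "movie_1": [5.0, np.nan, np.nan, np.nan],
        "movie_2": [4.0, 3.0, np.nan, np.nan],
        "movie_3": [2.0, 5.0, 4.0, 1.0],
        "movie_4": [np.nan, np.nan, 3.0, np.nan],
    }
)

# Number of ratings each movie has received
ratings_per_movie = toy_ratings.notnull().sum()

# Count movies with exactly one rating, and with at most two
one_rating = (ratings_per_movie == 1).sum()
at_most_two = (ratings_per_movie <= 2).sum()
print(one_rating, at_most_two)
```

The same two comparisons could be applied to occupied_count from the exercise.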

## Information loss in factorization

You may wonder how factors with far fewer columns can summarize a larger DataFrame without loss. In fact, they can't: the factors are generally only a close approximation of the data, because some information is inevitably lost. This means that predicted values might not be exact, but they should be close enough to be useful.

In this exercise, you will inspect the same original pre-factorization DataFrame from the last exercise loaded as original_df, and compare it to the product of its two factors, user_matrix and item_matrix.

### Instructions
    - Find the dot product of user_matrix and item_matrix and store it as predictions_df.

In [None]:
import numpy as np

# Multiply the user and item matrices
predictions_df = np.dot(user_matrix, item_matrix)
# Inspect the recreated DataFrame
print(predictions_df)

# Inspect the original DataFrame and compare
print(original_df)
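
The size of the loss can be quantified. The following sketch assumes the factors come from a truncated singular value decomposition, which is only one possible way to obtain them; the exercise's user_matrix and item_matrix may have been produced differently. The matrix `toy_matrix` and the choice of two factors are invented for illustration.

```python
import numpy as np

# Hypothetical dense 4x3 ratings matrix (no missing values)
toy_matrix = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Full (thin) SVD: toy_matrix == U @ diag(sigma) @ Vt
U, sigma, Vt = np.linalg.svd(toy_matrix, full_matrices=False)

# Keep only the two strongest factors
n_factors = 2
user_factors = U[:, :n_factors] * sigma[:n_factors]  # shape (4, 2)
item_factors = Vt[:n_factors, :]                     # shape (2, 3)

# Rebuild the matrix from the reduced factors and measure the error
approximation = np.dot(user_factors, item_factors)
rmse = np.sqrt(np.mean((toy_matrix - approximation) ** 2))
print(np.round(approximation, 2))
print(rmse)
```

Using all three factors would reproduce `toy_matrix` exactly; dropping one leaves a small but non-zero error, which is the information lost in exchange for a more compact representation.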

## Normalize your data

Before you can find the factors of the ratings matrix using singular value decomposition, you will need to "de-mean", or center it, by subtracting each row's mean from each value in that row.

In this exercise, you will begin preparing the movie rating DataFrame you have been working with so that you can perform singular value decomposition on it.

user_ratings_df contains a row per user and a column for each movie and has been loaded for you.

### Instructions
    - Find the average rating each user has given across all the movies they have seen and store these values as avg_ratings.
    - Subtract the row averages from their respective rows and store the result as user_ratings_centered.
    - Finally, fill in all missing values in user_ratings_centered with zeros.
    - Print the average of each row in user_ratings_centered to show they have been de-meaned.

In [None]:
# Get the average rating for each user 
avg_ratings = user_ratings_df.mean(axis=1)

# Center each user's ratings around 0 (axis=0 aligns avg_ratings with the row index)
user_ratings_centered = user_ratings_df.sub(avg_ratings, axis=0)

# Fill in all missing values with 0s
user_ratings_centered.fillna(0, inplace=True)

# Print the mean of each row (each should be approximately 0)
print(user_ratings_centered.mean(axis=1))
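
The choice of axis in `.sub()` is easy to get wrong, so here is a minimal sketch on an invented two-user table (`toy_ratings` is hypothetical). Passing `axis=0` matches the Series of row means against the row labels, whereas `axis=1` would try to match it against the column labels and, with no labels in common, produce only missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 users x 3 movies, NaN marks "not rated"
toy_ratings = pd.DataFrame(
    [[5.0, 3.0, np.nan],
     [np.nan, 2.0, 4.0]],
    index=["user_a", "user_b"],
    columns=["movie_1", "movie_2", "movie_3"],
)

# Mean rating per user; NaN values are skipped by default
row_means = toy_ratings.mean(axis=1)

# Subtract each user's mean from that user's row, then fill gaps with 0
centered = toy_ratings.sub(row_means, axis=0).fillna(0)
print(centered)
print(centered.mean(axis=1))

# For contrast: axis=1 aligns user labels with movie labels, which never match
misaligned = toy_ratings.sub(row_means, axis=1)
print(misaligned.isnull().values.all())
```

Filling the missing cells with 0 after centering leaves each row's mean at 0, because an unrated movie is then treated as exactly average for that user.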