#### Dimensionality Reduction

One way to increase density is to predict the missing user ratings with matrix factorization. In the following section, we will use Singular Value Decomposition (SVD) to fill in empty ratings by predicting how users would rate books, based on the ratings of similar users. This will entail the following steps:

- train/test splits
- hyper-parameter tuning for k
- SVD
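
To make the idea concrete, here is a minimal sketch on a toy matrix. The values and the names `ratings_toy`, `R_toy`, and `R_hat` are made up for illustration and are not the notebook's data. It assumes a truncated SVD via `scipy.sparse.linalg.svds`, which treats unrated cells as zeros.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Toy user-by-book ratings; 0 means "not rated" (illustrative values only).
ratings_toy = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 5, 1, 0],
    [0, 5, 4, 0, 1],
    [1, 0, 1, 5, 4],
    [0, 1, 0, 4, 5],
], dtype=float)
R_toy = sp.csr_matrix(ratings_toy)

# Keep only the k largest singular values and vectors (k must be < min(R_toy.shape)).
k = 2
U, s, Vt = svds(R_toy, k=k)

# Multiplying the factors back together gives a dense rank-k approximation,
# so every (user, book) cell now holds a score, including the unrated ones.
R_hat = U @ np.diag(s) @ Vt

print(np.round(R_hat, 2))
```

Because the missing cells enter the factorization as zeros, the filled-in values are best read as relative scores rather than calibrated ratings.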

The `split_sparse_matrix_by_ratings` function defined below splits the data into training and testing sets at random, using a 90/10 split applied to each user's ratings. Initially, I had considered a leave-one-user-out approach, but since we are not using content-based filtering to recommend books to users, I am not particularly worried about the cold-start problem. Instead, I just want the most accurate prediction of what each existing user would think of each book. For this purpose, holding out a random subset of ratings keeps every user in the training data, so the error on the held-out ratings directly measures how well the model predicts for existing users.
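
For comparison, a plain random holdout over all stored ratings (ignoring which user each rating belongs to) could be sketched as follows. `random_holdout_split` and `R_demo` are hypothetical names used only for this illustration; the notebook's own split function samples within each user instead.

```python
import numpy as np
import scipy.sparse as sp

def random_holdout_split(R_sparse, test_size=0.1, random_state=37):
    """Hypothetical helper: hold out a random fraction of all stored ratings."""
    rng = np.random.default_rng(random_state)
    R_coo = R_sparse.tocoo()
    n_ratings = R_coo.nnz

    # Pick which stored ratings go to the test set.
    n_test = int(n_ratings * test_size)
    test_mask = np.zeros(n_ratings, dtype=bool)
    test_mask[rng.choice(n_ratings, size=n_test, replace=False)] = True

    def build(mask):
        # Rebuild a matrix of the original shape from the selected entries.
        return sp.csr_matrix(
            (R_coo.data[mask], (R_coo.row[mask], R_coo.col[mask])),
            shape=R_sparse.shape,
        )

    return build(~test_mask), build(test_mask)

# Toy check on a synthetic 50 x 40 matrix with 20% of cells filled.
R_demo = sp.random(50, 40, density=0.2, format="csr", random_state=0)
R_demo.data = np.floor(R_demo.data * 5) + 1  # synthetic ratings in 1..5
R_tr, R_te = random_holdout_split(R_demo)
print(R_tr.nnz, R_te.nnz)
```

With this variant, a user with only a few ratings may end up entirely in the test set, which is one reason to prefer a per-user split when every user should remain in the training data.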

In [None]:
def matrix_density(mat, name):
    '''
    Prints the matrix density of the matrix.

    Parameters:
        mat: a matrix
        name: a name for the matrix
    '''
    n_total = mat.shape[0]*mat.shape[1]
    n_ratings = mat.nnz
    density = n_ratings/n_total
    print(f"The {name} has a density of: {round(density * 100, 2)}%")

In [None]:
names = ['All Ratings Matrix', 'All Normalized Ratings Matrix', 'Filtered Ratings Matrix', 'Filtered Normalized Ratings Matrix']

matrices = dict(zip(names, [X, Xn, Xf, Xfn]))

for name, matrix in matrices.items():
    matrix_density(matrix, name)


In [None]:
from sklearn.model_selection import train_test_split
import numpy as np
import scipy.sparse as sp

def split_sparse_matrix_by_ratings(R_sparse, test_size=0.1, random_state=37):
    """
    Split a sparse matrix into train and test sets, keeping a percentage of ratings from each user in the test set.
    
    Parameters:
        R_sparse (csr_matrix): Sparse user-item rating matrix (CSR format).
        test_size (float): Fraction of each user's ratings to be used as test set.
        random_state (int): Random seed for reproducibility.
    
    Returns:
        R_train (csr_matrix): Train set sparse matrix.
        R_test (csr_matrix): Test set sparse matrix.
    """
    np.random.seed(random_state)
    
    # Get user indices and their respective nonzero ratings
    R_coo = R_sparse.tocoo()
    rows, cols, data = R_coo.row, R_coo.col, R_coo.data
    
    train_rows, train_cols, train_data = [], [], []
    test_rows, test_cols, test_data = [], [], []
    
    unique_users = np.unique(rows)
    
    for user in unique_users:
        # Get all indices where this user has a rating
        user_indices = np.where(rows == user)[0]
        
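        # Randomly pick test_size of this user's ratings for the test set.
        # int() truncates, so users with fewer than 1/test_size ratings contribute no test ratings.
        test_indices = np.random.choice(user_indices, size=int(len(user_indices) * test_size), replace=False)
        train_indices = np.setdiff1d(user_indices, test_indices)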
        
        # Assign ratings to train and test
        train_rows.extend(rows[train_indices])
        train_cols.extend(cols[train_indices])
        train_data.extend(data[train_indices])
        
        test_rows.extend(rows[test_indices])
        test_cols.extend(cols[test_indices])
        test_data.extend(data[test_indices])
    
    # Create sparse train and test matrices
    R_train = sp.csr_matrix((train_data, (train_rows, train_cols)), shape=R_sparse.shape)
    R_test = sp.csr_matrix((test_data, (test_rows, test_cols)), shape=R_sparse.shape)
    
    return R_train, R_test


In [None]:
# Tune k: cross-validated RMSE for each candidate number of latent factors
ks = [40, 50, 75, 100, 120, 130, 150]
n_splits = 5

scores = []
for k in ks:
    print(f'Testing k={k}')
    rmse = svd_with_cv(X, n_splits, k)  # expected to return one RMSE per fold
    scores.append({k: rmse})
    print(f'Mean RMSE: {round(sum(rmse) / n_splits, 4)}')
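
The `svd_with_cv` helper called in this cell is not shown in this section. As a rough idea of what such a routine could look like, here is a minimal sketch under stated assumptions: `svd_cv_rmse_sketch` is a hypothetical stand-in, not the notebook's implementation, and it assumes folds are formed over the stored ratings and that one RMSE per fold is returned.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def svd_cv_rmse_sketch(R_sparse, n_splits=5, k=10, random_state=37):
    """Hypothetical k-fold RMSE routine for a truncated SVD (illustration only)."""
    R_coo = R_sparse.tocoo()
    rows, cols = R_coo.row, R_coo.col
    data = R_coo.data.astype(float)

    # Shuffle the stored ratings once and cut them into n_splits folds.
    rng = np.random.default_rng(random_state)
    folds = np.array_split(rng.permutation(len(data)), n_splits)

    fold_rmse = []
    for test_idx in folds:
        # Train only on the ratings outside the held-out fold.
        train_mask = np.ones(len(data), dtype=bool)
        train_mask[test_idx] = False
        R_train = sp.csr_matrix(
            (data[train_mask], (rows[train_mask], cols[train_mask])),
            shape=R_sparse.shape,
        )
        U, s, Vt = svds(R_train, k=k)

        # Predict just the held-out cells instead of building the full dense matrix.
        preds = np.sum(U[rows[test_idx]] * s * Vt[:, cols[test_idx]].T, axis=1)
        fold_rmse.append(float(np.sqrt(np.mean((preds - data[test_idx]) ** 2))))
    return fold_rmse

# Synthetic 60 x 50 matrix with 30% of cells filled, ratings in 1..5.
R_demo = sp.random(60, 50, density=0.3, format="csr", random_state=1)
R_demo.data = np.floor(R_demo.data * 5) + 1
for k in [2, 5, 10]:
    rmse = svd_cv_rmse_sketch(R_demo, n_splits=5, k=k)
    print(k, round(sum(rmse) / len(rmse), 4))
```

Predicting only the held-out cells keeps memory use proportional to the number of test ratings, which matters when the full user-by-book matrix is too large to hold in dense form.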

In [None]:
# Keep only popular books (book_popularity_rated > 120) rated by heavy raters (user_books_rated > 500)
most_pop = ratings[ratings['book_popularity_rated'] > 120]
most_pop = most_pop[most_pop['user_books_rated'] > 500]
X_popular, user_mapper_pop, book_mapper_pop, user_inv_mapper_pop, book_inv_mapper_pop = create_X(most_pop, user_id='user_id', book_id='book_tag', rating='rating')

print(X_popular.shape)

matrix_density(X_popular, 'Most Popular Books Matrix')

In [None]:
# Coarse grid, superseded by the finer grid below
# ks = [1, 2, 4, 10, 15, 20, 30, 40, 50, 55, 60]
ks = [9, 10, 11, 12, 13, 14, 15, 16]
n_splits = 5

scores = []
for k in ks:
    print(f'Testing k={k}')
    rmse = svd_with_cv(X_popular, n_splits, k)
    scores.append({k: rmse})
    print(f'Mean RMSE: {round(sum(rmse) / n_splits, 6)}')

In [None]:
from scipy.sparse.linalg import svds

# Popular-books matrix, not normalized
U, sigma, VT = svds(X_popular, k=12)
sigma = np.diag(sigma)

# Reconstruct the dense matrix of predicted ratings
X_pop_pred = np.dot(np.dot(U, sigma), VT)
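
Once a dense prediction matrix such as `X_pop_pred` exists, one possible next step is to rank, for a given user, the books they have not rated yet. The sketch below shows this on toy data; `top_n_unrated`, `R_toy`, and `pred_toy` are hypothetical names, and the returned column indices would still need to be mapped back to book identifiers (for example through an inverse mapper such as `book_inv_mapper_pop`, whose exact structure is not shown here).

```python
import numpy as np
import scipy.sparse as sp

def top_n_unrated(pred, R_sparse, user_idx, n=3):
    """Hypothetical helper: column indices of the n highest-scored, not-yet-rated items."""
    scores = np.asarray(pred[user_idx], dtype=float).copy()
    rated = R_sparse[user_idx].indices   # columns this user has already rated (CSR row)
    scores[rated] = -np.inf              # never recommend something already rated
    n = min(n, int(np.isfinite(scores).sum()))
    order = np.argsort(-scores)          # highest predicted score first
    return order[:n]

# Toy observed ratings (0 = not rated) and toy predicted scores.
R_toy = sp.csr_matrix(np.array([[5, 0, 0, 1], [0, 4, 0, 0]], dtype=float))
pred_toy = np.array([[4.8, 3.1, 4.2, 1.2], [2.0, 3.9, 1.5, 2.7]])
print(top_n_unrated(pred_toy, R_toy, user_idx=0, n=2))
```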