# Movie Recommender System Documentation

## Introduction
This document explains the Movie Recommender System, an application that recommends movies based on user preferences. The system utilizes `numpy` and `pandas`.

## HTML File Overview
The HTML file contains two key components for recommending movies: System I (based on genres) and System II (based on Item-Based Collaborative Filtering, IBCF).

### System I: Recommendation Based on Genres
This system recommends movies based on a user's favorite movie genre. The recommendation scheme involves defining "most popular" or "highly-rated" within a genre. For instance:
- **Top Five Most Popular Movies**: Defined as those with the highest number of ratings in a genre.
- **Top Five Highly-Rated Movies**: Defined as those with the highest average rating, ensuring a minimum number of ratings to qualify.

### System II: Recommendation Based on IBCF
This system follows these steps:
1. **Normalize the Rating Matrix (R)**: Center each row by subtracting row means from each row, computed based on non-NA entries.
2. **Compute Cosine Similarity**: Calculate similarity among movies, considering users who rated both movies. Ignore similarities based on less than three user ratings.
3. **Create the Similarity Matrix (S)**: Transform the cosine similarity to ensure measures are between 0 and 1. Set similarities based on fewer than three ratings to NA. Sort and keep the top 30 non-NA similarity measures per row, setting the rest to NA.
4. **Display Similarity Values**: For specified movies (e.g., "m1", "m10"), rounded to 7 decimal places.
5. **Function `myIBCF`**: Takes a new user's ratings vector, downloads the similarity matrix, and computes predictions for unrated movies. Recommends the top 10 movies based on these predictions.

### `myIBCF` Function Details
- **Input**: A vector of the new user's ratings for 3,706 movies.
- **Prediction Calculation**: Uses the formula involving the similarity matrix and user's ratings.
- **Output**: Recommends top 10 movies. If fewer than 10 predictions are non-NA, additional movies not rated by the user are suggested.

## Implementation Details

### Setup and Configuration
- Import libraries: `numpy`, `pandas`.
- Configure logging for monitoring and debugging.
- Ignore warnings for clarity.

### Data Source URLs
Constants for file paths or URLs are defined for movies, ratings, users, and rating matrix data.

### Core Functions
#### `read_data()`
- Reads data from URLs using Pandas.
- Handles exceptions and logs errors.

#### `process_movie_ratings(user_ratings, movies_info)`
- Processes movie ratings for averages and counts.
- Merges data with movie information.
- Handles exceptions and logs errors.

#### `explode_movie_genres(movies_with_ratings)`
- Expands movie genres into separate rows.
- Handles exceptions and logs errors.

#### `list_all_genres(genre_ratings)`
- Lists all unique genres in the dataset.
- Handles exceptions and logs errors.

#### `find_top_movies_by_genre(genre_ratings, genre, top_n=10)`
- Finds the top N movies in a genre based on weighted ratings.
- Handles exceptions and logs errors.

#### `sample_random_movies(movies_info, sample_size=10)`
- Samples random movies from the dataset.
- Handles exceptions and logs errors.

#### `calculate_similarity_matrix(ratings_matrix)`
- Calculates a cosine similarity matrix from the ratings matrix.
- Handles normalization and calculation processes.
- Handles exceptions and logs errors.

#### `myIBCF(similarity_mat, user_ratings, top_n=10)`
- Implements IBCF for movie recommendations.
- Handles exceptions and logs errors.

#### `test_myIBCF(rating_matrix, similarity_matrix)`
- Tests the `myIBCF` function for functionality.
- Handles exceptions and logs errors.


In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Constants for file paths or URLs
URL_MOVIES = 'https://liangfgithub.github.io/MovieData/movies.dat?raw=true'
URL_RATINGS = 'https://liangfgithub.github.io/MovieData/ratings.dat?raw=true'
URL_USERS = 'https://liangfgithub.github.io/MovieData/users.dat?raw=true'
URL_RATING_MATRIX = 'https://project4-movie-recommender.s3.amazonaws.com/project_4_Rmat.csv'

def read_data():
    """Reading data from URLs"""
    try:
        print("Reading data from URLs")
        users = pd.read_csv(URL_USERS, sep='::', engine='python', header=None, names=['UserID', 'Gender', 'Age', 'Occupation', 'Zipcode'], dtype={'UserID': int})
        movies = pd.read_csv(URL_MOVIES, sep='::', engine='python', encoding="ISO-8859-1", header=None, names=['MovieID', 'Title', 'Genres'], dtype={'MovieID': int})
        ratings = pd.read_csv(URL_RATINGS, sep='::', engine='python', header=None, names=['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype={'MovieID': int, 'UserID': int, 'Rating': int})
        ratings_matrix = pd.read_csv(URL_RATING_MATRIX, sep=',')
        print("Data successfully read")
        return ratings, movies, users, ratings_matrix
    except Exception as e:
        print(f"Error reading data: {e}")
        return None, None, None, None

def process_movie_ratings(user_ratings, movies_info):
    """Process movie ratings to compute average and count, and combine with movie info."""
    try:
        print("Processing movie ratings to compute average and count, and combine with movie info.")
        ratings_agg = user_ratings.groupby("MovieID")['Rating'].agg(AvgRating='mean', RatingCount='count').reset_index()
        avg_count = ratings_agg['RatingCount'].mean()
        min_rating = ratings_agg['AvgRating'].min()
        ratings_agg['WeightedRating'] = (ratings_agg['AvgRating'] * ratings_agg['RatingCount'] + min_rating * avg_count) / (ratings_agg['RatingCount'] + avg_count)
        return movies_info.merge(ratings_agg, on="MovieID", how='left')
    except Exception as e:
        error_message = f'Error processing movie ratings due to: {e}'
        print(error_message)
        return None


def explode_movie_genres(movies_with_ratings):
    """Expand movie genres into separate rows."""
    try:
        print("Expanding movie genres into separate rows")
        return movies_with_ratings.assign(Genres=movies_with_ratings['Genres'].str.split('|')).explode('Genres')
    except Exception as e:
        error_message = f'Error expanding all unique genres due to: {e}'
        print(error_message)
        return None
    
def list_all_genres(genre_ratings):
    """List all unique genres."""
    try:
        print("Listing all unique genres...")
        return genre_ratings['Genres'].unique()
    except Exception as e:
        error_message = f'Error listing all unique genres due to: {e}'
        print(error_message)
        return None

def find_top_movies_by_genre(genre_ratings, genre, top_n=10):
    """Find top N movies for a given genre based on weighted ratings."""
    return genre_ratings[genre_ratings['Genres'] == genre].sort_values(by='WeightedRating', ascending=False).head(top_n)
   
def sample_random_movies(movies_info, sample_size=10):
    """Sample a random set of movies."""
    try:
        print("Sampling random set of movies...")
        return movies_info.sample(n=sample_size)
    except Exception as e:
        error_message = f"Error sampling random set of movies due to: {e}"
        print(error_message)
        return None

def calculate_similarity_matrix(ratings_matrix):
    """Calculate cosine similarity matrix from ratings matrix."""
    try:
        # Start the calculation process
        print("Starting to calculate the similarity matrix.")

        # Normalize the ratings matrix by subtracting the mean
        print("Normalizing the ratings matrix.")
        normalized_matrix = ratings_matrix.subtract(ratings_matrix.mean(axis=1), axis='rows').T.fillna(0)

        # Calculate the numerator of the cosine similarity formula
        print("Calculating the numerator for cosine similarity.")
        numerator = normalized_matrix @ normalized_matrix.T

        # Calculate the denominator of the cosine similarity formula
        print("Calculating the denominator for cosine similarity.")
        squared_normalized = (normalized_matrix ** 2).dot((normalized_matrix != 0).T)
        denominator = np.sqrt(squared_normalized) * np.sqrt(squared_normalized.T)

        # Compute cosine similarity
        print("Computing the cosine similarity.")
        cosine_similarity = numerator / denominator

        # Convert cosine similarity to a similarity measure that ranges from 0 to 1
        similarity_matrix = (1 + cosine_similarity) / 2

        # Set diagonal elements to NaN and filter out low-cardinality pairs
        print("Adjusting diagonal and filtering low-cardinality pairs.")
        np.fill_diagonal(similarity_matrix.values, np.nan)
        similarity_matrix[similarity_matrix.count() < 3] = None

        print("Similarity matrix calculation completed successfully.")
        return similarity_matrix
    except Exception as e:
        error_message = f"Error calculating similarity matrix: {e}"
        print(error_message)
        return None

# Note: myIBCF may not need caching as it might be user-specific and dynamic
def myIBCF(similarity_mat, user_ratings, top_n=10):
    """Implement Item-Based Collaborative Filtering."""
    try:
        print("Starting Item-Based Collaborative Filtering")

        # Replace NaN values with zero in similarity matrix and user ratings
        similarity_mat = similarity_mat.fillna(0)
        user_ratings = user_ratings.fillna(0)
        print("NaN values replaced with zero in matrices")

        # Creating a binary identity matrix to identify rated movies
        identity = (~user_ratings.isna()).astype(int)

        # Compute the recommended movies based on similarity matrix and user ratings
        recommended_movies = (user_ratings @ similarity_mat) / identity.dot(similarity_mat)
        recommended_movies = recommended_movies.sort_values(ascending=False).head(top_n).dropna()
        print(f"Computed top {top_n} recommended movies")

        # Check if enough recommendations are available, backfill if necessary
        if recommended_movies.size < top_n:
            backfill_count = top_n - recommended_movies.size
            print(f"Not enough recommendations, backfilling {backfill_count} movies")
            random_genre = np.random.choice(list_all_genres(genre_ratings))
            backfill_movies = find_top_movies_by_genre(genre_ratings, random_genre, backfill_count)
            backfill_series = pd.Series(data=backfill_movies["WeightedRating"].values, index="m" + backfill_movies["MovieID"].astype(str))
            recommended_movies = pd.concat([recommended_movies, backfill_series], axis=0)
            print("Backfilling completed")

        return recommended_movies
    except Exception as e:
        print(f"Error implementing Item-Based Collaborative Filtering: {e}")
        return None

def test_myIBCF(rating_matrix, similarity_matrix):
    """Test to verify that the custom myIBCF function works."""
    try:
        print("Starting the test to verify custom myIBCF function.")
        user_rating_1 = rating_matrix.loc["u1181"].copy()
        print(myIBCF(similarity_matrix, user_rating_1))
        user_rating_2 = rating_matrix.loc["u1351"].copy()
        print(myIBCF(similarity_matrix, user_rating_2))

        row = similarity_matrix.iloc[0, :]
        user_rating_new = row.copy()
        user_rating_new[:] = np.nan
        user_rating_new["m1613"] = 5
        user_rating_new["m1755"] = 4

        print(myIBCF(similarity_matrix, user_rating_new))

        row = similarity_matrix.iloc[0, :]
        user_rating_nan = row.copy()
        user_rating_nan[:] = np.nan
        print(myIBCF(similarity_matrix, user_rating_nan))
    except Exception as e:
            print(f"Error implementing Item-Based Collaborative Filtering: {e}")


In [3]:
(ratings, movies, users, rating_matrix) = read_data()
similarity_matrix = calculate_similarity_matrix(rating_matrix)
test_myIBCF(rating_matrix, similarity_matrix)

Reading data from URLs
Data successfully read
Starting to calculate the similarity matrix.
Normalizing the ratings matrix.
Calculating the numerator for cosine similarity.
Calculating the denominator for cosine similarity.
Computing the cosine similarity.
Adjusting diagonal and filtering low-cardinality pairs.
Similarity matrix calculation completed successfully.
Starting the test to verify custom myIBCF function.
Starting Item-Based Collaborative Filtering
NaN values replaced with zero in matrices
Computed top 10 recommended movies
m3172    3.478261
m3647    3.090909
m1832    3.000000
m3530    2.868852
m3233    2.768212
m989     2.736842
m3779    2.727273
m1830    2.652174
m853     2.619307
m2258    2.609576
Name: u1181, dtype: float64
Starting Item-Based Collaborative Filtering
NaN values replaced with zero in matrices
Computed top 10 recommended movies
m1832    0.500000
m3647    0.500000
m1915    0.400000
m3607    0.348315
m2909    0.288889
m127     0.264706
m3842    0.263158
m2258 