<h1 style= 'text-align:center;'> 10.2 Exercise: Recommender System </h1>

<p style= 'text-align: center;'> Bernard Owusu Sefah</p>


<p style= 'text-align: center;'> DSC 630</p>



## Step 1: Import Necessary Libraries

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

## Step 2: Load the Dataset

We load the four MovieLens CSV files and look at the first rows of each to understand their structure.

In [2]:
# Load all provided datasets
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
tags_df = pd.read_csv('tags.csv')
links_df = pd.read_csv('links.csv')

# Displaying the first few rows of each dataset to understand their structure
movies_df.head(), ratings_df.head(), tags_df.head(), links_df.head()


(   movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  
 0  Adventure|Animation|Children|Comedy|Fantasy  
 1                   Adventure|Children|Fantasy  
 2                               Comedy|Romance  
 3                         Comedy|Drama|Romance  
 4                                       Comedy  ,
    userId  movieId  rating  timestamp
 0       1        1     4.0  964982703
 1       1        3     4.0  964981247
 2       1        6     4.0  964982224
 3       1       47     5.0  964983815
 4       1       50     5.0  964982931,
    userId  movieId              tag   timestamp
 0       2    60756            funny  1445714994
 1       2    60756  Highly quotable  1445714996
 2       

The datasets include the following information:

1. movies.csv: Contains movieId, title, and genres.

2. ratings.csv: Includes userId, movieId, and rating, representing user feedback.

3. tags.csv: Contains user-generated tags associated with movies, which can help enhance recommendations with content-based filtering.

4. links.csv: Provides imdbId and tmdbId for movies, which we won’t use directly in the recommender system but may be helpful for external linking or future enhancements.

## Step 3: Data Preprocessing

1. Merge the datasets: we merge movies_df, ratings_df, and tags_df on movieId.

2. Handle missing values: we check the merged data for missing values and decide how to treat them.

In [3]:
# Merge movies and ratings data on 'movieId'
merged_df = pd.merge(ratings_df, movies_df, on='movieId', how='inner')

# Merging tags into the main dataset (some movies may not have tags, so we use left join here)
merged_df = pd.merge(merged_df, tags_df[['movieId', 'tag']], on='movieId', how='left')

# Check for any missing values in the merged dataset
missing_values = merged_df.isnull().sum()

# Display merged data and missing values
merged_df.head(), missing_values

(   userId  movieId  rating  timestamp             title  \
 0       1        1     4.0  964982703  Toy Story (1995)   
 1       1        1     4.0  964982703  Toy Story (1995)   
 2       1        1     4.0  964982703  Toy Story (1995)   
 3       5        1     4.0  847434962  Toy Story (1995)   
 4       5        1     4.0  847434962  Toy Story (1995)   
 
                                         genres    tag  
 0  Adventure|Animation|Children|Comedy|Fantasy  pixar  
 1  Adventure|Animation|Children|Comedy|Fantasy  pixar  
 2  Adventure|Animation|Children|Comedy|Fantasy    fun  
 3  Adventure|Animation|Children|Comedy|Fantasy  pixar  
 4  Adventure|Animation|Children|Comedy|Fantasy  pixar  ,
 userId           0
 movieId          0
 rating           0
 timestamp        0
 title            0
 genres           0
 tag          52549
 dtype: int64)

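The merged dataset combines ratings, movie titles, genres, and tags. Because tags are joined on movieId alone, each rating row is repeated once per tag attached to that movie, which is why the first rows above show the same rating several times. The tag column has 52,549 missing values, which come from ratings of movies that have no tags. Tags are supplementary, so we proceed without additional handling. The repeated rows do not change the ratings matrix built in Step 4, because pivot_table averages identical ratings.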
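To make the effect of joining tags on movieId concrete, here is a minimal sketch on toy data. The frames `ratings_toy` and `tags_toy` and their values are made up for illustration and are not part of the MovieLens files.

```python
import pandas as pd

# Toy stand-ins for ratings_df and tags_df (illustrative values only)
ratings_toy = pd.DataFrame({
    "userId": [1, 5],
    "movieId": [1, 1],
    "rating": [4.0, 4.0],
})
tags_toy = pd.DataFrame({
    "movieId": [1, 1, 1],
    "tag": ["pixar", "pixar", "fun"],
})

# Left join on movieId: every rating row is paired with every tag row of that movie
merged_toy = pd.merge(ratings_toy, tags_toy, on="movieId", how="left")
print(len(ratings_toy), len(merged_toy))

# Averaging per (movieId, userId) recovers the original ratings
ratings_back = merged_toy.groupby(["movieId", "userId"])["rating"].mean()
print(ratings_back)
```

Two ratings and three tags for the same movie produce six merged rows, but the per-user mean rating is unchanged.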

## Step 4: Build the User-Movie Matrix

Using userId, movieId, and rating, we create a matrix that holds each user's rating of each movie. Rows are movies and columns are users, so each row is one movie's rating vector. This matrix forms the basis for item-based collaborative filtering.

In [4]:
# Creating the user-movie matrix with ratings as values
user_movie_matrix = merged_df.pivot_table(index='movieId', columns='userId', values='rating').fillna(0)

# Converting the user-movie matrix to a sparse matrix for efficient similarity computation
user_movie_sparse_matrix = csr_matrix(user_movie_matrix.values)

# Display the shape of the user-movie matrix
user_movie_matrix.shape

(9724, 610)

The user-movie matrix has 9,724 movies (rows) and 610 users (columns), with unrated entries filled with 0. It is large enough to support collaborative filtering yet small enough to process in memory.
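As a small illustration of the pivot and sparse conversion, the sketch below builds the same kind of matrix from a toy ratings table and measures how many cells hold a real rating. The toy values and the `density` calculation are assumptions added for illustration. Note that filling with 0 makes "not rated" indistinguishable from a rating of 0, which is a simplification.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings table (illustrative values only)
ratings_toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0],
})

# Rows are movies, columns are users, missing ratings become 0
matrix_toy = ratings_toy.pivot_table(
    index="movieId", columns="userId", values="rating"
).fillna(0)

# The sparse format stores only the non-zero entries
sparse_toy = csr_matrix(matrix_toy.values)

# Fraction of cells that hold an actual rating
density = sparse_toy.nnz / (sparse_toy.shape[0] * sparse_toy.shape[1])
print(matrix_toy.shape, sparse_toy.nnz, density)
```

Running the same density calculation on `user_movie_sparse_matrix` would show how sparse the real matrix is.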

## Step 5: Calculate Movie Similarity

We compute two kinds of movie similarity:

1. Collaborative similarity: based on the user-movie ratings matrix.

2. Content-based similarity: based on genres and tags, to identify movies with similar themes.

We start with collaborative filtering by calculating cosine similarity between the rows of the user-movie matrix.
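For reference, cosine similarity between two rating vectors a and b is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy on made-up vectors. The helper name `cosine` and the choice to return 0 for an all-zero vector are assumptions for this illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Assumption: treat a movie with no ratings as similar to nothing
    return 0.0 if denom == 0 else float(a @ b / denom)

# Each list holds one movie's ratings from three users (0 = not rated)
movie_a = [4.0, 0.0, 5.0]
movie_b = [4.0, 0.0, 5.0]   # rated identically to movie_a
movie_c = [0.0, 3.0, 0.0]   # rated only by a user who rated neither a nor b

print(cosine(movie_a, movie_b))
print(cosine(movie_a, movie_c))
```

Movies rated identically by the same users score close to 1, while movies with no raters in common score 0.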

In [5]:
# Calculate cosine similarity between movies based on user ratings (collaborative filtering)
movie_similarity_collab = cosine_similarity(user_movie_sparse_matrix)

# Create a DataFrame for collaborative similarity with movieIds as index and columns
movie_similarity_df_collab = pd.DataFrame(movie_similarity_collab, index=user_movie_matrix.index, columns=user_movie_matrix.index)

# Display the first few rows of the collaborative similarity matrix
movie_similarity_df_collab.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.0,0.410562,0.296917,0.035573,0.308762,0.376316,0.277491,0.131629,0.232586,0.395573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.410562,1.0,0.282438,0.106415,0.287795,0.297009,0.228576,0.172498,0.044835,0.417693,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.296917,0.282438,1.0,0.092406,0.417802,0.284257,0.402831,0.313434,0.30484,0.242954,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.035573,0.106415,0.092406,1.0,0.188376,0.089685,0.275035,0.158022,0.0,0.095598,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.308762,0.287795,0.417802,0.188376,1.0,0.298969,0.474002,0.283523,0.335058,0.218061,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The collaborative similarity matrix is ready. Next, we proceed with content-based filtering using genres and tags, calculating similarity from feature vectors built from these columns.

## Step 6: Content-Based Filtering with Genres and Tags

We use two text sources:

1. Genres: tokenize each movie's genre string.

2. Tags: concatenate the tags applied to each movie.

The two are combined into a single text feature per movie, from which one TF-IDF similarity matrix is computed.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

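# Concatenate tags by movie, dropping missing tags (a movie without tags gets an empty string)
# Note: merged_df repeats each tag once per rating of the movie, so tag counts here scale with the number of ratings
tags_by_movie = merged_df.groupby('movieId')['tag'].apply(lambda x: ' '.join(x.dropna())).reset_index()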

# Merge genres and tags into one feature for content-based filtering
content_features = pd.merge(movies_df[['movieId', 'genres']], tags_by_movie, on='movieId', how='left').fillna('')

# Combine genres and tags into a single feature column
content_features['combined_features'] = content_features['genres'] + ' ' + content_features['tag']

# Initialize a TF-IDF Vectorizer and transform the combined features
tfidf = TfidfVectorizer(stop_words='english')
# Transform the combined features (genres and tags) into a TF-IDF matrix
# Each movie's combined features (genres and tags) are represented as a vector
# The TF-IDF matrix will contain the weight of each word in the context of each movie
content_matrix = tfidf.fit_transform(content_features['combined_features'])

# Calculate cosine similarity for content-based filtering
content_similarity = cosine_similarity(content_matrix)

# Create a DataFrame for the content-based similarity matrix
content_similarity_df = pd.DataFrame(content_similarity, index=content_features['movieId'], columns=content_features['movieId'])

# Display the content-based similarity matrix
content_similarity_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,1.0,0.000123,1e-06,0.000241,2e-06,0.0,2e-06,0.001166,0.0,0.000468,...,0.00031,0.000831,0.00035,0.000921,0.0,0.001214,0.001349,0.0,0.000751,0.000477
2,0.000123,1.0,0.0,0.0,0.0,0.0,0.0,0.001709,0.0,0.000686,...,0.0,0.0,0.0,0.0,0.0,0.080623,0.089579,0.0,0.0,0.0
3,1e-06,0.0,1.0,0.004331,1e-05,0.0,4e-05,0.0,0.0,0.0,...,0.000634,0.0,0.002053,0.0,0.0,0.00089,0.000989,0.0,0.0,0.002796
4,0.000241,0.0,0.004331,1.0,0.001783,0.0,0.007271,0.0,0.0,0.0,...,0.114433,0.201463,0.687501,0.0,0.0,0.160752,0.178608,0.46662,0.0,0.5049
5,2e-06,0.0,1e-05,0.001783,1.0,0.0,0.682792,0.0,0.0,0.0,...,0.0008,0.0,0.002594,0.0,0.0,0.001124,0.001249,0.0,0.0,0.003532


The content-based similarity matrix is now ready, allowing us to proceed with building the hybrid recommendation function.
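One detail worth noting: grouping tags from merged_df counts each tag once per rating of the movie, so heavily rated movies have their tag terms weighted far above their genre terms. A possible alternative, sketched here on toy data, is to group the tags table directly so each tag counts once per time it was applied. The frames `movies_toy` and `tags_toy` are made up for illustration, and this is one option under that assumption, not the method used above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for movies_df and tags_df (illustrative values only)
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})
tags_toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tag": ["pixar", "fun", "fantasy"],
})

# One row per movie; each tag is counted once per time it was applied
tags_by_movie_toy = (
    tags_toy.groupby("movieId")["tag"]
    .apply(lambda s: " ".join(s.dropna().astype(str)))
    .reset_index()
)

# Movies without tags keep their genres and get an empty tag string
features_toy = pd.merge(movies_toy, tags_by_movie_toy, on="movieId", how="left").fillna("")
features_toy["combined_features"] = features_toy["genres"] + " " + features_toy["tag"]

tfidf_toy = TfidfVectorizer(stop_words="english")
similarity_toy = cosine_similarity(tfidf_toy.fit_transform(features_toy["combined_features"]))
print(similarity_toy.round(3))
```

With this grouping, the two movies that share the Adventure and Children genres get a clearly non-zero similarity, while the comedy shares no terms with either.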

## Step 7: Create the Recommendation System

1. Function setup: the function accepts a movie title, identifies similar movies using both collaborative and content-based similarities, and returns the top ten recommended movies.

2. Combine similarities: the collaborative and content-based similarities are weighted equally by default (the weights can be adjusted) to form a final score.

In [18]:
def hybrid_movie_recommendations(movie_title, movies_df, collab_df, content_df, top_n=10, weight_collab=0.5, weight_content=0.5):
    # Find the movieId for the given movie title
    try:
        # Attempt to locate the movieId associated with the given movie title
        movie_id = movies_df[movies_df['title'] == movie_title].iloc[0]['movieId']
    except IndexError:
        # Return an error message if the movie title is not found
        return f"Movie '{movie_title}' not found in the dataset."

    # Get similarity scores from collaborative and content-based approaches
    collab_scores = collab_df[movie_id] if movie_id in collab_df.index else None
    # Get similarity scores from content-based filtering if movieId exists in content DataFrame
    content_scores = content_df[movie_id] if movie_id in content_df.index else None

    # Check if either collaborative or content-based scores are unavailable
    if collab_scores is None or content_scores is None:
        # Return a message indicating insufficient data for recommendations
        return "Not enough data to make recommendations for this movie."
    
    # Combine the scores using the specified weights
    combined_scores = (weight_collab * collab_scores) + (weight_content * content_scores)
    
    # Sort by similarity scores in descending order
    similar_movies = combined_scores.sort_values(ascending=False)
    
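    # Get the top N movie IDs, skipping the first entry (assumed to be the input movie itself, with similarity 1.0)
    top_similar_ids = similar_movies.iloc[1:top_n + 1].index
    
    # Retrieve movie titles for the top similar movie IDs
    # Note: isin() returns titles in movies_df order (by movieId), not in order of similarity
    recommended_movies = movies_df[movies_df['movieId'].isin(top_similar_ids)]['title'].values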

    return recommended_movies

# Example: recommending movies similar to "Toy Story (1995)"
#recommendations = hybrid_movie_recommendations("Toy Story (1995)", movies_df, movie_similarity_df_collab, content_similarity_df)
#recommendations
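The title lookup in hybrid_movie_recommendations filters movies_df with isin, which keeps the row order of movies_df, so the returned titles are not in ranked order. If ranked output is wanted, one option is to look titles up by movieId in score order. The sketch below shows this on toy data. The helper `ranked_titles` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for movies_df (illustrative values only)
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A (1990)", "B (1991)", "C (1992)", "D (1993)"],
})

# Hypothetical combined similarity scores for movieId 1, indexed by movieId
scores_toy = pd.Series({1: 1.0, 2: 0.2, 3: 0.5, 4: 0.9})

def ranked_titles(scores, movies, movie_id, top_n=10):
    """Return titles ordered by descending score, skipping the query movie."""
    # Drop the query movie by label instead of assuming it sorts first
    ranked_ids = (
        scores.drop(index=movie_id)
        .sort_values(ascending=False)
        .head(top_n)
        .index
    )
    titles_by_id = movies.set_index("movieId")["title"]
    # reindex keeps the order of ranked_ids
    return titles_by_id.reindex(ranked_ids).tolist()

print(ranked_titles(scores_toy, movies_toy, movie_id=1, top_n=2))

# For comparison, the isin lookup returns the same titles in movieId order
print(movies_toy[movies_toy["movieId"].isin([4, 3])]["title"].tolist())
```

Dropping the query movie by label also avoids relying on it being the first entry after sorting.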


In [20]:
# Enhancing the function to prompt the user for a movie and display a nicely formatted output
def get_movie_recommendations():
    # Prompt the user for a movie title
    movie_title = input("Enter a movie you like with year in () eg. Toy Story (1995): ")
    
    # Fetch recommendations using the hybrid_movie_recommendations function
    recommendations = hybrid_movie_recommendations(movie_title, movies_df, movie_similarity_df_collab, content_similarity_df)
    
    # Display recommendations
    if isinstance(recommendations, str):  # If there's an error message, display it
        print(recommendations)
    else:
        print(f"\nTop 10 movie recommendations similar to '{movie_title}':\n")
        for i, movie in enumerate(recommendations, 1):
            print(f"{i}. {movie}")

# Test the function
get_movie_recommendations()

Enter a movie you like with year in () eg. Toy Story (1995):  Toy Story (1995)



Top 10 movie recommendations similar to 'Toy Story (1995)':

1. Star Wars: Episode IV - A New Hope (1977)
2. Forrest Gump (1994)
3. Lion King, The (1994)
4. Jurassic Park (1993)
5. Independence Day (a.k.a. ID4) (1996)
6. Star Wars: Episode VI - Return of the Jedi (1983)
7. Bug's Life, A (1998)
8. Toy Story 2 (1999)
9. Up (2009)
10. Guardians of the Galaxy 2 (2017)


## Recommender System for Movie Recommendations: Write-Up

#### Datasets Used:

* movies.csv - Contains movieId, title, and genres for each movie.

* ratings.csv - Includes userId, movieId, and rating, representing user feedback.

* tags.csv - Holds userId, movieId, and tag, with user-provided tags describing movies.

* links.csv - Provides imdbId and tmdbId for external references, though not used directly in the recommendation system.

#### Data Preprocessing
1. Merging Datasets: We merged ratings.csv with movies.csv to add movie titles and genres to each rating. Then, tags.csv was merged to incorporate user-provided tags.

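2. Handling Missing Data: Not all movies had tags, so the merge left missing values in the tag column. These were dropped when tags were concatenated per movie, and movies without any tags received an empty string. Missing tags don’t prevent recommendations, as tags are supplementary information.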

#### Building the User-Movie Matrix
1. Pivot Table Creation: The data was reshaped to form a user-movie matrix, with movieId as rows, userId as columns, and rating as values. Movies a user had not rated were filled with 0.

2. Sparse Matrix Conversion: To handle the large size of the matrix efficiently, it was converted into a sparse format, which optimizes memory usage for matrices with many zero values.

#### Collaborative Similarity
Cosine Similarity Calculation: Cosine similarity was applied to the rows of the user-movie matrix to measure the similarity between movies. Movies with similar rating patterns across users are considered more alike.

#### Content-Based Filtering with Genres and Tags
Combining Genres and Tags: The genres and user-provided tags were combined into a single text-based feature for each movie. This allows the system to capture thematic information.

TF-IDF Vectorization: A TfidfVectorizer was used to transform the combined genres and tags into a TF-IDF matrix. This matrix represents the importance of each word (e.g., "comedy," "adventure") within the context of each movie, helping to identify similar themes.
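To see which words the vectorizer actually extracts from a genres-plus-tags string, here is a minimal sketch using the default TfidfVectorizer tokenization with English stop words removed. The two input strings are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up combined feature strings (genres followed by tags)
docs = ["Adventure|Children|Fantasy funny", "Action|Sci-Fi"]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

# The learned vocabulary: lowercased tokens of two or more word characters
print(sorted(vectorizer.vocabulary_))
print(matrix.shape)
```

The pipe separators act as token boundaries, so each genre becomes its own term. The hyphen is also a boundary, so "Sci-Fi" is split into the two terms `sci` and `fi`.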

Cosine Similarity Calculation: Cosine similarity was then applied to the TF-IDF matrix, creating a content-based similarity matrix where each entry represents the thematic similarity between two movies.

#### Hybrid Recommendation System
Hybrid Similarity Calculation: For each input movie, collaborative and content-based similarities were weighted equally (0.5 each by default) to generate a combined similarity score. This can be adjusted to prioritize one approach over the other.
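The weighted combination can be sketched on two small pandas Series indexed by movieId. The values below are made up. The sketch also shows what happens if the two similarity tables do not cover exactly the same movies (for example, a movie with no ratings): plain addition yields NaN for that movie, while `add` with `fill_value=0` treats the missing similarity as 0. Using `fill_value=0` is an assumption about the desired behavior, not what the function above does.

```python
import pandas as pd

# Made-up similarity scores to one query movie, indexed by movieId
collab_toy = pd.Series({1: 1.0, 2: 0.4, 3: 0.3})            # movie 4 has no ratings
content_toy = pd.Series({1: 1.0, 2: 0.1, 3: 0.6, 4: 0.8})

weight_collab, weight_content = 0.5, 0.5

# Plain addition aligns on movieId and leaves NaN where one side is missing
plain = weight_collab * collab_toy + weight_content * content_toy

# Treat a missing similarity as 0 so content-only movies can still be ranked
filled = (weight_collab * collab_toy).add(weight_content * content_toy, fill_value=0)

print(plain)
print(filled)
```

Raising `weight_content` relative to `weight_collab` shifts the ranking toward movies with similar genres and tags.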

Selecting Top Recommendations: The combined similarity scores were sorted in descending order, and the top ten movies (excluding the input movie itself) were selected as recommendations.

Formatting Output: The top ten movies were presented as a neatly formatted list of titles, making it user-friendly and easy to interpret.

#### Example Usage and Results
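When a user enters a movie they like, such as “Toy Story (1995),” the recommender system outputs the following ten titles. Because the titles are looked up with isin, they appear in movieId order rather than in order of similarity: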

1. Star Wars: Episode IV - A New Hope (1977)
2. Forrest Gump (1994)
3. Lion King, The (1994)
4. Jurassic Park (1993)
5. Independence Day (a.k.a. ID4) (1996)
6. Star Wars: Episode VI - Return of the Jedi (1983)
7. Bug's Life, A (1998)
8. Toy Story 2 (1999)
9. Up (2009)
10. Guardians of the Galaxy 2 (2017)

These recommendations draw on both user rating patterns and content similarity (genres and tags), so the list includes closely related titles such as Toy Story 2 alongside films rated similarly by the same users.

## References 

The dataset was downloaded from https://grouplens.org/datasets/movielens/. We used the small version recommended for education and development: https://files.grouplens.org/datasets/movielens/ml-latest-small.zip.