
# User-Based Collaborative Filtering


### Reviewers Demographic Dataset

This dataset contains movie reviews together with characteristics of each reviewer, such as gender, age, occupation, and area.

- **reviewer_id:** Unique ID of the reviewer
- **reviewer_gender:** Gender of the reviewer
  - 1: "F"
  - 2: "M"
- **reviewer_age:** Age of the reviewer in categorical age ranges
  - 1: "Under 18"  
  - 2: "18-24"
  - 3: "25-34"
  - 4: "35-44"
  - 5: "45-49"
  - 6: "50-55"
  - 7: "56+"
- **reviewer_occupation:** Occupation of the reviewer encoded by an integer
  - 0:  "other" or not specified
  - 1:  "academic/educator"
  - 2:  "artist"
  - 3:  "clerical/admin"
  - 4:  "college/grad student"
  - 5:  "customer service"
  - 6:  "doctor/health care"
  - 7:  "executive/managerial"
  - 8:  "farmer"
  - 9:  "homemaker"
  - 10:  "K-12 student"
  - 11:  "lawyer"
  - 12:  "programmer"
  - 13:  "retired"
  - 14:  "sales/marketing"
  - 15:  "scientist"
  - 16:  "self-employed"
  - 17:  "technician/engineer"
  - 18:  "tradesman/craftsman"
  - 19:  "unemployed"
  - 20:  "writer"
- **reviewer_area:** Location of the reviewer, grouped by the first digit of the reviewer's ZIP code
- **reviewer_rating:** Rating the reviewer gave the movie, on a scale of 1 to 5
- **movie_id:** Unique ID of movie reviewed
- **movie_title:** Title of movie reviewed
- **movie_genre:** Genre(s) of the movie reviewed; a movie can carry more than one of:
  - Action
  - Adventure
  - Animation
  - Children's
  - Comedy
  - Crime
  - Documentary
  - Drama
  - Fantasy
  - Film-Noir
  - Horror
  - Musical
  - Mystery
  - Romance
  - Sci-Fi
  - Thriller
  - War
  - Western
- **movie_year_of_release:** Year of release of movie reviewed
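
The integer codes listed above can be turned back into readable labels with plain dictionaries and `Series.map`. The following is a minimal sketch, not part of the original notebook: the names `GENDER_LABELS`, `AGE_LABELS`, `OCCUPATION_LABELS`, and the two-row `sample` frame are illustrative assumptions.

```python
import pandas as pd

# Code-to-label lookups copied from the data dictionary above
GENDER_LABELS = {1: "F", 2: "M"}
AGE_LABELS = {
    1: "Under 18", 2: "18-24", 3: "25-34", 4: "35-44",
    5: "45-49", 6: "50-55", 7: "56+",
}
OCCUPATION_LABELS = dict(enumerate([
    "other", "academic/educator", "artist", "clerical/admin",
    "college/grad student", "customer service", "doctor/health care",
    "executive/managerial", "farmer", "homemaker", "K-12 student",
    "lawyer", "programmer", "retired", "sales/marketing", "scientist",
    "self-employed", "technician/engineer", "tradesman/craftsman",
    "unemployed", "writer",
]))

# Hypothetical two-row sample using the integer encodings
sample = pd.DataFrame({
    "reviewer_gender": [2, 1],
    "reviewer_age": [3, 5],
    "reviewer_occupation": [12, 2],
})

# Map each coded column to its label; unknown codes would become NaN
decoded = sample.assign(
    reviewer_gender=sample["reviewer_gender"].map(GENDER_LABELS),
    reviewer_age=sample["reviewer_age"].map(AGE_LABELS),
    reviewer_occupation=sample["reviewer_occupation"].map(OCCUPATION_LABELS),
)
print(decoded)
```

Keeping the lookups in one place makes it easier to check printed recommendations against the data dictionary.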

# Import Packages Needed

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize, OneHotEncoder
from scipy.sparse import hstack, csr_matrix


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:

movie_reviews_with_demographic = pd.read_csv('/content/drive/My Drive/temp/movie_reviews_with_demographic_clean.csv',engine='python', encoding='ISO-8859-1')

# Randomly sample 150,000 rows as 1 million rows is too large to run successfully
movie_reviews_with_demographic = movie_reviews_with_demographic.sample(n=150000, random_state=42)


# Conduct Data Cleaning

In [None]:
movie_reviews_with_demographic = movie_reviews_with_demographic.drop_duplicates(subset=['reviewer_id','movie_id'], keep='first')
movie_reviews_with_demographic

Unnamed: 0,reviewer_id,reviewer_gender,reviewer_age,reviewer_occupation,reviewer_area,reviewer_rating,movie_id,movie_title,movie_genre,movie_year_of_release
895536,5412,M,3,12,9,2,2683,austin powers: the spy who shagged me (1999),['Comedy'],1999
899739,5440,F,5,2,3,5,904,rear window (1954),"['Mystery', 'Thriller']",1954
55687,368,M,3,0,9,4,3717,gone in 60 seconds (2000),"['Action', 'Crime']",2000
63727,425,M,3,12,5,4,1721,titanic (1997),"['Drama', 'Romance']",1997
822011,4942,M,5,12,4,1,3697,predator 2 (1990),"['Action', 'Sci-Fi', 'Thriller']",1990
...,...,...,...,...,...,...,...,...,...,...
965654,5824,M,2,12,1,3,433,clean slate (1994),['Comedy'],1994
491351,3022,M,3,17,7,3,1127,"abyss, the (1989)","['Action', 'Adventure', 'Sci-Fi', 'Thriller']",1989
694512,4161,M,5,0,9,2,3646,big momma's house (2000),['Comedy'],2000
198242,1218,M,7,15,1,2,1772,blues brothers 2000 (1998),"['Action', 'Comedy', 'Musical']",1998


In [None]:
# Define a mapping for reviewer_gender
gender_encoding = {'F': 1, 'M': 2}

# Apply the encoding to the reviewer_gender column
movie_reviews_with_demographic['reviewer_gender'] = movie_reviews_with_demographic['reviewer_gender'].map(gender_encoding)

# Display the updated DataFrame to verify
movie_reviews_with_demographic.head()

Unnamed: 0,reviewer_id,reviewer_gender,reviewer_age,reviewer_occupation,reviewer_area,reviewer_rating,movie_id,movie_title,movie_genre,movie_year_of_release
895536,5412,2,3,12,9,2,2683,austin powers: the spy who shagged me (1999),['Comedy'],1999
899739,5440,1,5,2,3,5,904,rear window (1954),"['Mystery', 'Thriller']",1954
55687,368,2,3,0,9,4,3717,gone in 60 seconds (2000),"['Action', 'Crime']",2000
63727,425,2,3,12,5,4,1721,titanic (1997),"['Drama', 'Romance']",1997
822011,4942,2,5,12,4,1,3697,predator 2 (1990),"['Action', 'Sci-Fi', 'Thriller']",1990


# User-Based Collaborative Filtering

In [None]:
# Define individual weights for each categorical feature
weight_age = 0.3
weight_occupation = 0.2
weight_area = 0.2
weight_gender = 0.3

# Encode each categorical feature separately and apply its weight
encoder = OneHotEncoder(sparse_output=True)

# Encode and weight each feature
encoded_age = encoder.fit_transform(movie_reviews_with_demographic[['reviewer_age']]) * weight_age
encoded_occupation = encoder.fit_transform(movie_reviews_with_demographic[['reviewer_occupation']]) * weight_occupation
encoded_area = encoder.fit_transform(movie_reviews_with_demographic[['reviewer_area']]) * weight_area
encoded_gender = encoder.fit_transform(movie_reviews_with_demographic[['reviewer_gender']]) * weight_gender

# Combine all weighted encoded features into a single sparse matrix
combined_features = hstack([encoded_age, encoded_occupation, encoded_area, encoded_gender])

# Calculate cosine similarity on the combined sparse matrix
feature_similarity = cosine_similarity(combined_features, dense_output=False)
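
To see how the feature weights shape the similarity scores, it may help to work through a tiny case. Because every row has exactly one active category per feature, the cosine similarity between two rows reduces to the sum of squared weights of the features they share, divided by the sum of all squared weights. The sketch below uses three made-up reviewers and three of the four features; `toy`, `weights`, and `toy_similarity` are illustrative names, not variables from this notebook.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder

# Three hypothetical reviewers: B shares age and gender with A,
# C shares only occupation with A
toy = pd.DataFrame({
    "reviewer_age": [3, 3, 5],
    "reviewer_occupation": [12, 0, 12],
    "reviewer_gender": [2, 2, 1],
}, index=["A", "B", "C"])

weights = {"reviewer_age": 0.3, "reviewer_occupation": 0.2, "reviewer_gender": 0.3}

# One-hot encode each column separately and scale it by its weight
blocks = [
    OneHotEncoder().fit_transform(toy[[column]]) * weight
    for column, weight in weights.items()
]
toy_similarity = cosine_similarity(hstack(blocks).tocsr(), dense_output=False).toarray()

# Closed form: shared squared weights over total squared weights
total = sum(w ** 2 for w in weights.values())            # 0.09 + 0.04 + 0.09
expected_ab = (0.3 ** 2 + 0.3 ** 2) / total              # age + gender match
expected_ac = (0.2 ** 2) / total                         # occupation match only

print(toy_similarity.round(4))
print(round(expected_ab, 4), round(expected_ac, 4))
```

One consequence is that the weights act in squared form, so a feature weighted 0.3 contributes 2.25 times as much to the score as one weighted 0.2, not 1.5 times.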

In [None]:
# Outputting the similarity matrix to understand its structure
feature_similarity[:5, :5]

<5x5 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [None]:
# Create a DataFrame to track movies watched by each user and their ratings
user_movie_ratings = movie_reviews_with_demographic[['reviewer_id', 'movie_id', 'reviewer_rating']]

# Define a function to find similar users
def get_similar_users(user_id, feature_similarity, top_n=5):
    # Extract similarity scores for the target user.
    # Note: user_id is used here as a row position in feature_similarity.
    user_similarity_scores = feature_similarity[user_id].toarray().flatten()

    # Rank all entries by descending similarity, then drop the target explicitly:
    # with tied scores the target is not guaranteed to be ranked first
    ranked = np.argsort(-user_similarity_scores)
    similar_users = ranked[ranked != user_id][:top_n]

    print(f"Top {top_n} Users similar to user {user_id} and their similarity scores:")
    for user in similar_users:
        print(f"User ID: {user}, Similarity Score: {user_similarity_scores[user]}")
    print("-----")

    return similar_users


def get_past_movie_history(user_id, user_movie_ratings):

    # Get the unique movies rated by user_id (any rating, not only liked movies)
    user_past_movie_history = user_movie_ratings[user_movie_ratings['reviewer_id'] == user_id]['movie_id'].unique()

    print(f"User {user_id}'s past movie history:")
    print(user_past_movie_history)
    print("-----")

    return user_past_movie_history




# Define a function to recommend movies based on similar users' preferences
def recommend_movies(user_id, feature_similarity, user_movie_ratings, top_n_users=10, top_n_movies=5):

    # Get past movie watched history of the user_id
    user_past_movie_history = get_past_movie_history(user_id, user_movie_ratings)

    # Get the top few users that are most similar to the user_id
    similar_users = get_similar_users(user_id, feature_similarity, top_n=(3*top_n_users))


    # Get movies liked by similar users, excluding those rated by the target user
    similar_user_movies = user_movie_ratings[
        (user_movie_ratings['reviewer_id'].isin(similar_users)) &
        (user_movie_ratings['reviewer_id'] != user_id) &             # Exclude the target user's own ratings
        (user_movie_ratings['reviewer_rating'] >= 4)                 # Consider high ratings as liked movies
    ].drop_duplicates().sort_values(by='reviewer_rating', ascending=False)

    print("Similar users' recommended movies:")
    print(similar_user_movies)
    print("-----")

    # Filter out movies the target user has already seen
    recommended_movies = similar_user_movies[~similar_user_movies['movie_id'].isin(user_past_movie_history)]

    # Rank movies by how many similar users rated them 4 or higher and keep the top N
    top_recommended_movie_ids = recommended_movies['movie_id'].value_counts().head(top_n_movies).index.tolist()

    # Fetch movie titles from movie_reviews_with_demographic based on movie_id
    top_recommended_movies = movie_reviews_with_demographic[movie_reviews_with_demographic['movie_id'].isin(top_recommended_movie_ids)][['movie_id', 'movie_title']].drop_duplicates()

    # Iterate through recommended movies and print in the desired format
    print(f"RECOMMENDATIONS FOR USER {user_id}:")
    print("-----")
    for _, row in top_recommended_movies.iterrows():
        print(f"Movie ID: {row['movie_id']}")
        print(f"Movie Title: {row['movie_title']}")
        print("-----")

    return top_recommended_movies
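
The similarity matrix in this notebook has one row per review, and `get_similar_users` looks entries up by row position. One possible variation is to build one feature row per reviewer and keep an explicit mapping between `reviewer_id` values and matrix positions. The sketch below illustrates that idea on toy data; `build_user_similarity`, `similar_reviewer_ids`, `FEATURE_WEIGHTS`, and `toy_reviews` are hypothetical names introduced here, and it assumes each reviewer's demographic columns are the same on every one of their reviews.

```python
import numpy as np
import pandas as pd
from scipy.sparse import hstack
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder

# Same weights as the notebook cell above
FEATURE_WEIGHTS = {
    "reviewer_age": 0.3,
    "reviewer_occupation": 0.2,
    "reviewer_area": 0.2,
    "reviewer_gender": 0.3,
}

def build_user_similarity(reviews):
    """Return (reviewer_ids, similarity) with one matrix row per reviewer."""
    # Keep a single demographic row per reviewer (assumed constant per reviewer)
    users = (reviews.drop_duplicates(subset="reviewer_id")
                    .sort_values("reviewer_id")
                    .reset_index(drop=True))
    blocks = [
        OneHotEncoder().fit_transform(users[[column]]) * weight
        for column, weight in FEATURE_WEIGHTS.items()
    ]
    similarity = cosine_similarity(hstack(blocks).tocsr(), dense_output=False)
    return users["reviewer_id"].to_numpy(), similarity

def similar_reviewer_ids(reviewer_id, reviewer_ids, similarity, top_n=5):
    """Return the reviewer_id values of the top_n most similar reviewers."""
    # Translate the reviewer_id into its row position in the matrix
    position = int(np.flatnonzero(reviewer_ids == reviewer_id)[0])
    scores = similarity[position].toarray().ravel()
    order = np.argsort(-scores, kind="stable")
    # Remove the target explicitly, then translate positions back to IDs
    order = order[order != position][:top_n]
    return reviewer_ids[order]

# Toy reviews: reviewer 10 appears twice, 20 is a demographic twin of 10
toy_reviews = pd.DataFrame({
    "reviewer_id": [10, 10, 20, 30, 40],
    "reviewer_age": [3, 3, 3, 5, 3],
    "reviewer_occupation": [12, 12, 12, 2, 0],
    "reviewer_area": [9, 9, 9, 3, 9],
    "reviewer_gender": [2, 2, 2, 1, 2],
    "movie_id": [1, 2, 3, 4, 5],
    "reviewer_rating": [5, 4, 5, 3, 4],
})

reviewer_ids, user_similarity = build_user_similarity(toy_reviews)
neighbours = similar_reviewer_ids(10, reviewer_ids, user_similarity, top_n=2)
print(neighbours)
```

With this layout the returned values are real `reviewer_id` values, so they can be passed directly to a filter such as `user_movie_ratings['reviewer_id'].isin(...)`. The matrix also shrinks from one row per review to one row per reviewer.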



In [None]:
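# Pick a random reviewer for ad-hoc testing (not used in the call below)
test_user_id = np.random.choice(movie_reviews_with_demographic['reviewer_id'].unique())

# The recorded run uses a fixed ID, 1324
recommended_movies = recommend_movies(1324, feature_similarity, user_movie_ratings)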


Top 10 Users similar to user 1324 and their similarity scores:
User ID: 62634, Similarity Score: 0.9999999999999998
User ID: 62591, Similarity Score: 0.9999999999999998
User ID: 1401, Similarity Score: 0.9999999999999998
User ID: 62779, Similarity Score: 0.9999999999999998
User ID: 74397, Similarity Score: 0.9999999999999998
User ID: 148339, Similarity Score: 0.9999999999999998
User ID: 1324, Similarity Score: 0.9999999999999998
User ID: 76077, Similarity Score: 0.9999999999999998
User ID: 61542, Similarity Score: 0.9999999999999998
User ID: 75754, Similarity Score: 0.9999999999999998
-----
User 1324's past movie history: 
[1358 1231  527 1193  968   24 1267  904 2019 1394 2395 1203  608 3683
 1387 3619 1136 1213  903 1256 2863 2018]
-----
Similar users' recommended movies: 
        reviewer_id  movie_id  reviewer_rating
218454         1324      1231                5
218399         1324      1358                5
218397         1324      1193                5
218436         1324       