# Content-based model

In [2]:
!pip install textblob

Collecting textblob
  Obtaining dependency information for textblob from https://files.pythonhosted.org/packages/02/07/5fd2945356dd839974d3a25de8a142dc37293c21315729a41e775b5f3569/textblob-0.18.0.post0-py3-none-any.whl.metadata
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
   ---------------------------------------- 0.0/626.3 kB ? eta -:--:--
   --------------------- ------------------ 337.9/626.3 kB 7.1 MB/s eta 0:00:01
   ---------------------------------------- 626.3/626.3 kB 7.9 MB/s eta 0:00:00
Installing collected packages: textblob
Successfully installed textblob-0.18.0.post0


In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import warnings
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
warnings.filterwarnings("ignore")

In [5]:
df_merged = pd.read_pickle('data/df_movies_cleaned.pkl')
df_ratings = pd.read_pickle('data/df_ratings_cleaned.pkl')

#### Textual Feature - Combined Text

In [6]:
df_merged.columns

Index(['movieId', 'belongs_to_collection', 'original_language', 'overview',
       'popularity', 'release_date', 'runtime', 'title', 'actors',
       'keywords_extracted', 'genre_extracted', 'production_company_extracted',
       'production_country_extracted'],
      dtype='object')

In [7]:
df_merged['combined_text'] = df_merged.apply(lambda row: ' '.join([
    ' '.join(row['genre_extracted']), 
    ' '.join(row['actors']), 
    ' '.join(row['keywords_extracted']), 
    row['overview'], 
    ' '.join(row['production_company_extracted'])
]).lower(), axis=1)

The combined_text feature aggregates critical textual metadata from genres, actors, keywords, and movie descriptions into a single comprehensive descriptor for each movie. This aggregation captures the essence of a movie’s content, thematic elements, and appeal, which is crucial for content-based filtering. By synthesizing this information, the recommender system can identify and suggest movies with similar thematic and content attributes, enhancing personalization and user engagement.

## Modeling Preprocessing

#### Combining df_ratings and df_merged

In [9]:
df_combined = pd.merge(df_ratings, df_merged, on='movieId', how='inner')

#### Setting Rating Threshold

The decision to set a threshold of 20 ratings for each movie before including it in the item-based recommender system is strategic, with the goal of ensuring the reliability and validity of the generated recommendations. This threshold acts as a quality control measure, weeding out movies with sparse feedback that could otherwise result in skewed or less confident recommendations due to insufficient user data. By setting this minimum, the system focuses on movies with a high level of viewer engagement, allowing recommendations to be built on a solid foundation of user feedback. This approach improves the system's ability to deliver accurate, trustworthy recommendations based on broad consensus rather than outliers or minimal feedback, resulting in a better user experience and increased overall credibility for the recommender system.

In [10]:
ratings_per_movie = df_combined.groupby('movieId').size()

movies_with_enough_ratings = ratings_per_movie[ratings_per_movie >= 20].index

df_item_modeling = df_combined[df_combined['movieId'].isin(movies_with_enough_ratings)]

print(f"Original dataset size: {df_combined.shape}")
print(f"Filtered dataset size: {df_item_modeling.shape}")

Original dataset size: (24669326, 19)
Filtered dataset size: (24548423, 19)


In [11]:
df_item_modeling.columns

Index(['userId', 'movieId', 'rating', 'timestamp', 'user_mean_rating',
       'liked_by_user', 'belongs_to_collection', 'original_language',
       'overview', 'popularity', 'release_date', 'runtime', 'title', 'actors',
       'keywords_extracted', 'genre_extracted', 'production_company_extracted',
       'production_country_extracted', 'combined_text'],
      dtype='object')

With the filtered dataset, df_item_modeling, now comprising 24,528,484 rows out of the original 24,639,944, it's evident that the vast majority of the data meets the threshold of having at least 20 ratings per movie. This minimal reduction in dataset size suggests that most movies in the dataset have a sufficient number of ratings, indicating robust user engagement across a wide range of movies.

##### Grouping Movies

In [13]:
df_grouped = df_item_modeling.groupby('movieId', as_index=False).agg({
    'title': 'first',
    'combined_text': 'first',  # Picking the first since all are the same
})

In [14]:
df_grouped

Unnamed: 0,movieId,title,combined_text
0,1,Toy Story,animation comedy family tom hanks tim allen do...
1,2,Jumanji,adventure fantasy family robin williams jonath...
2,3,Grumpier Old Men,romance comedy walter matthau jack lemmon ann-...
3,4,Waiting to Exhale,comedy drama romance whitney houston angela ba...
4,5,Father of the Bride Part II,comedy steve martin diane keaton martin short ...
...,...,...,...
16122,173941,Atomic Blonde,action thriller charlize theron james mcavoy s...
16123,174053,Black Mirror: White Christmas,drama horror mystery science fiction thriller ...
16124,174055,Dunkirk,action drama history thriller war fionn whiteh...
16125,174371,Once Upon a Time in Venice,action comedy thriller bruce willis jason momo...


### Content-Based Filtering

#### Vectorizing 'combined_text' feature

In [15]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_grouped['combined_text'])

Vectorizing the combined_text using TF-IDF transforms qualitative textual information into quantitative vectors, facilitating the measurement of content similarity between movies. This numerical representation allows for sophisticated algorithms to compute similarities based on thematic elements, narrative structures, and genre affiliations. For our movie recommender system, this means being able to recommend movies that are contextually and thematically aligned with a user’s preferences, enhancing the discovery of relevant and appealing content.

### BaseLine Model

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

class SimplifiedContentRecommender:
    def __init__(self, movies_df, tfidf_matrix, k=100):
        self.movies_df = movies_df.copy()
        self.movies_df['movieId'] = self.movies_df['movieId'].astype(str)
        self.movie_id_to_index = {movie_id: i for i, movie_id in enumerate(self.movies_df['movieId'])}
        self.tfidf_matrix = tfidf_matrix 
        self.similarity_matrix = cosine_similarity(self.tfidf_matrix)

    def recommend(self, movie_id, top_n=10):
        movie_id = str(movie_id)
        if movie_id not in self.movie_id_to_index:
            print(f"Movie ID {movie_id} not found in the dataset.")
            return []
        
        movie_index = self.movie_id_to_index[movie_id]
        similarity_scores = self.similarity_matrix[movie_index]
        top_k_indices = np.argsort(similarity_scores)[::-1][1:top_n+1]
        recommendations = self.movies_df.iloc[top_k_indices].copy()
        recommendations['cosine_similarity'] = similarity_scores[top_k_indices]
        
        return recommendations.sort_values('cosine_similarity', ascending=False)

In [17]:
recommender_base = SimplifiedContentRecommender(df_grouped, tfidf_matrix, k=100)
recommendations_base = recommender_base.recommend('1', top_n=10)  
print(recommendations_base[['movieId', 'title', 'cosine_similarity']])

      movieId            title  cosine_similarity
2874     3114      Toy Story 2           0.498092
12007   78499      Toy Story 3           0.417211
1722     1920   Small Soldiers           0.216866
2048     2253             Toys           0.186901
7112     7987            Dolls           0.180138
12339   83219  The Pixar Story           0.178043
1552     1707     Home Alone 3           0.163932
1793     1991     Child's Play           0.151260
9645    46948    Monster House           0.144818
1795     1993   Child's Play 3           0.143198


### Sampling

In this scenario, the sampling technique used is to calculate a statistically significant sample size in order to estimate the proportion of movies rated 4.0 or higher in a dataset. This decision is based on a specific confidence level (95%) and margin of error (5%), with the goal of obtaining precise and reliable inferences about the population's characteristics from a sample of data. The method used employs a standard formula that includes the Z-score associated with the desired confidence level and the estimated proportion of interest, ensuring that the sample size is sufficient to accurately reflect the population. This technique is critical for designing studies or analyses that require accurate estimations of population parameters for decision-making or hypothesis testing, as it minimizes potential biases and errors caused by small or arbitrarily chosen sample sizes. By rigorously determining the required sample size, the approach improves the credibility and validity of the findings derived from the sample data, making it a cornerstone of statistical analysis and research methodologies.

In [18]:
import scipy.stats
import math


def calculate_sample_size(confidence_level, margin_of_error, proportion):
    z_score = abs(scipy.stats.norm.ppf((1 - confidence_level) / 2))
    sample_size = math.ceil((z_score ** 2 * proportion * (1 - proportion)) / (margin_of_error ** 2))
    return sample_size

confidence_level = 0.95
margin_of_error = 0.05

proportion_higher_ratings = df_ratings[df_ratings['rating'] >= 4.0].shape[0] / df_ratings.shape[0]
required_sample_size = calculate_sample_size(confidence_level, margin_of_error, proportion_higher_ratings)
print(f"Required sample size: {required_sample_size}")


Required sample size: 385


In [19]:
sample_movie_ids = np.random.choice(df_grouped['movieId'].unique(), size=required_sample_size, replace=False)

## Evaluation Function

In [20]:
def evaluate_movie(movie_id, df_ratings, recommender, top_n=10):
    """Evaluate a single movie for the recommender system, adjusted for actual user ratings."""
    recommendations = recommender.recommend(str(movie_id), top_n=top_n)
    if recommendations.empty:
        return np.array([]), None  # Use None to indicate no data for calculation

    recommended_ids = recommendations['movieId'].astype(str).tolist()
    # Filter ratings to those that match the recommended movie IDs
    matching_ratings = df_ratings[df_ratings['movieId'].astype(str).isin(recommended_ids)]
    
    # Calculate hit rate only for recommended movies that have been rated
    hit_rate = (matching_ratings['rating'] >= 4.0).mean() if not matching_ratings.empty else None

    return np.array(matching_ratings['rating']), hit_rate

def evaluate_recommender(df_ratings, recommender, sample_movie_ids, top_n=10, threshold=4.0):
    """Evaluate the recommender system using sampled movie IDs, including adjusted hit rate."""
    all_ratings, hit_rates = [], []

    for movie_id in sample_movie_ids:
        movie_ratings, hit_rate = evaluate_movie(movie_id, df_ratings, recommender, top_n=top_n)
        if movie_ratings.size > 0:
            all_ratings.extend(movie_ratings)
        if hit_rate is not None:
            hit_rates.append(hit_rate)
    
    all_ratings = np.array(all_ratings)
    # Adjust calculations to handle potential None values in hit_rates
    if len(all_ratings) > 0:
        mae = np.mean(np.abs(all_ratings - 5))
        mse = np.mean((all_ratings - 5) ** 2)
        rmse = np.sqrt(mse)
        precision = np.sum(all_ratings >= threshold) / len(all_ratings)
    else:
        mae, mse, rmse, precision = 0, 0, 0, 0

    avg_hit_rate = np.mean(hit_rates) if hit_rates else None  # Use None or a placeholder if no hit rates available

    print(f"Sample Size: {len(sample_movie_ids)}")
    # Adjust the print statement to handle None value for avg_hit_rate
    print(f"MAE: {mae:.4f}\nMSE: {mse:.4f}\nRMSE: {rmse:.4f}\nPrecision: {precision:.4f}\nAverage Hit Rate: {avg_hit_rate if avg_hit_rate is not None else 'N/A'}")

    return mae, mse, rmse, precision, avg_hit_rate


# Item-based Collaborative Filtering model

In [21]:
df_merged = pd.read_pickle('data/df_movies_cleaned.pkl')
df_ratings = pd.read_pickle('data/df_ratings_cleaned.pkl')

In [22]:
df_ratings_subset = df_ratings.sample(frac=0.01, random_state=42)
df_ratings_subset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 248481 entries, 21920435 to 8204774
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   userId            248481 non-null  int64         
 1   movieId           248481 non-null  int64         
 2   rating            248481 non-null  Float64       
 3   timestamp         248481 non-null  datetime64[ns]
 4   user_mean_rating  248481 non-null  Float64       
 5   liked_by_user     248481 non-null  boolean       
dtypes: Float64(2), boolean(1), datetime64[ns](1), int64(2)
memory usage: 12.3 MB


In [23]:
reader = Reader()

# Prepare the data for Surprise
data = Dataset.load_from_df(df_ratings_subset[['userId', 'movieId', 'rating']], reader)

# Initialize the SVD algorithm
svd = SVD()

# Perform cross-validation
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9500  0.9475  0.9524  0.9525  0.9511  0.9507  0.0018  
MAE (testset)     0.7342  0.7347  0.7369  0.7357  0.7365  0.7356  0.0010  
Fit time          2.71    2.81    2.70    2.70    2.59    2.70    0.07    
Test time         0.32    0.25    0.27    0.25    0.24    0.27    0.03    


{'test_rmse': array([0.95000103, 0.94750656, 0.95237583, 0.95245742, 0.95112749]),
 'test_mae': array([0.7342111 , 0.73471254, 0.73691355, 0.73565976, 0.73648461]),
 'fit_time': (2.709101438522339,
  2.8117599487304688,
  2.70361590385437,
  2.6955010890960693,
  2.5940587520599365),
 'test_time': (0.32032179832458496,
  0.25452637672424316,
  0.26554226875305176,
  0.2522296905517578,
  0.24244332313537598)}

In [39]:
def train_svd_model(ratings_data, n_factors=100, n_epochs=20):
    """
    Trains an SVD model.

    Parameters:
    - ratings_data: A pandas DataFrame containing userId, movieId, and rating columns.
    - n_factors: The number of latent factors to use for the SVD algorithm.
    - n_epochs: The number of epochs for which to train the algorithm.

    Returns:
    - The trained SVD model.
    """
    # Initialize reader and load data into Surprise's format
    reader = Reader()
    data = Dataset.load_from_df(ratings_data[['userId', 'movieId', 'rating']], reader)
    
    # Initialize the SVD algorithm with specified parameters
    svd = SVD(n_factors=n_factors, n_epochs=n_epochs)
    
    # Train the SVD model on the full dataset
    trainset = data.build_full_trainset()
    svd.fit(trainset)
    
    return svd

In [24]:
# TODO GILIAN: Matrix geben lassen:


We get a mean Root Mean Sqaure Error of 0.95 approx which is more than good enough for our case. Let us now train on our dataset and arrive at predictions.

In [25]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1cde7b7b050>

In [51]:
svd.predict(67, 302, 3)

Prediction(uid=67, iid=302, r_ui=3, est=3.97618612522282, details={'was_impossible': False})

# Hybrid Model

In [27]:
# Customize weights

weight_similarity = 0.5
weight_svd = 0.5

In [42]:
def hybrid_recommendation(movie_id, user_id, top_n=10):
    # Get top N content-based recommendations
    content_recs = recommender_base.recommend(movie_id, top_n)
    
    # Prepare final recommendations with an additional column for SVD predictions
    content_recs['svd_prediction'] = content_recs['movieId'].apply(lambda x: svd.predict(user_id, x).est)
    
    # Sort the recommendations based solely on the SVD predictions
    final_recs = content_recs.sort_values('svd_prediction', ascending=False).head(top_n)
    
    return final_recs

In [43]:
hybrid_recommendation(9, 1, top_n=50)

Unnamed: 0,movieId,title,combined_text,cosine_similarity,svd_prediction
1997,2196,Knock Off,action adventure thriller jean-claude van damm...,0.269357,3.526831
3891,4199,Death Warrant,action crime drama mystery thriller jean-claud...,0.138215,3.526831
15962,165087,Brimstone,mystery thriller western guy pearce dakota fan...,0.159557,3.526831
3475,3766,Missing in Action,action adventure thriller war chuck norris m. ...,0.158418,3.526831
11563,71810,Legionnaire,adventure drama action history thriller jean-c...,0.156425,3.526831
12773,90434,Assassination Games,drama action crime jean-claude van damme scott...,0.156123,3.526831
8527,27828,The Memory Of A Killer,crime drama thriller action koen de bouw werne...,0.15474,3.526831
8728,31892,"No Retreat, No Surrender",action kurt mckinney jean-claude van damme j.w...,0.154175,3.526831
15099,133689,Pound of Flesh,action jean-claude van damme john ralston darr...,0.151983,3.526831
2578,2808,Universal Soldier,thriller action science fiction crime jean-cla...,0.145358,3.526831
