# 🎬 Movie Recommendation System

In this notebook, we implement:
- Collaborative Filtering (recommendations based on similar users)
- Content-Based Filtering (recommendations based on movie content)
- Hybrid Model (a combination of the two techniques above)

Dataset: MovieLens 100k


In [1]:
# Load the data from the CSV files and print the number of movies and ratings
import pandas as pd

movies = pd.read_csv('../data/movies.csv')
ratings = pd.read_csv('../data/ratings.csv')

print("Number of movies:", len(movies))
print("Number of ratings:", len(ratings))


Number of movies: 1682
Number of ratings: 100000
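
The two counts printed above give a rough idea of how sparse the user-movie matrix will be. The sketch below is only a back-of-the-envelope calculation: the user count of 943 is an assumption taken from the MovieLens 100k documentation and is not printed anywhere in this notebook.

```python
# Counts printed by the cell above
n_movies = 1682
n_ratings = 100_000

# Assumption: 943 users, per the MovieLens 100k documentation (not shown in this notebook)
n_users = 943

# Fraction of user-movie cells that actually hold a rating
density = n_ratings / (n_users * n_movies)
print(f"Filled cells: {density:.1%}")
```

Under that assumption only a small share of cells is filled, so most entries of any dense user-movie matrix are placeholders rather than real ratings.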


In [2]:
# ## 👥 Collaborative Filtering

# Recommend movies based on similar users.
# Build a user-movie matrix and measure user similarity with cosine similarity.

from sklearn.metrics.pairwise import cosine_similarity

# Build the user-movie matrix (movies a user has not rated are filled with 0)
user_movie_matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)

# Compute pairwise similarity between users
user_sim = cosine_similarity(user_movie_matrix)
user_sim_df = pd.DataFrame(user_sim, index=user_movie_matrix.index, columns=user_movie_matrix.index)

# Recommendation function
def recommend_user_cf(user_id, top_n=5):
    # Take the top_n most similar users; position 0 is skipped because it is
    # normally the user themself (similarity 1)
    similar_users = user_sim_df[user_id].sort_values(ascending=False)[1:top_n+1].index
    # Average those users' ratings per movie and keep the top_n highest means
    top_movies = ratings[ratings['userId'].isin(similar_users)] \
                    .groupby('movieId')['rating'].mean() \
                    .sort_values(ascending=False).head(top_n)
    # Rows come back in the order of the movies DataFrame, not in ranked order
    return movies[movies['movieId'].isin(top_movies.index)][['movieId', 'title']]

# Example
recommend_user_cf(user_id=1)


Unnamed: 0,movieId,title
168,169,"Wrong Trousers, The (1993)"
173,174,Raiders of the Lost Ark (1981)
301,302,L.A. Confidential (1997)
330,331,"Edge, The (1997)"
342,343,Alien: Resurrection (1997)
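
To make the similarity step concrete, here is a minimal sketch of cosine similarity computed by hand on a toy rating matrix. The user names and ratings are invented for illustration, and the `cosine` helper is a hypothetical stand-in for the library call used in the notebook, not a copy of it.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A user with no ratings has a zero vector; treat similarity as 0 in that case
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy user-movie rows (hypothetical data); 0 means "not rated", as after fillna(0)
toy_ratings = {
    "alice": [5, 4, 0, 0],
    "bob":   [4, 5, 0, 1],
    "carol": [0, 0, 5, 4],
}

sim_alice_bob = cosine(toy_ratings["alice"], toy_ratings["bob"])
sim_alice_carol = cosine(toy_ratings["alice"], toy_ratings["carol"])
print(round(sim_alice_bob, 3), sim_alice_carol)
```

Alice and Bob rated the same movies similarly, so their score is close to 1, while Alice and Carol share no rated movies and score 0. Because missing ratings are stored as 0, two users only look similar when they rated the same movies.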


In [3]:
# ## 🧠 Content-Based Filtering

# Recommend movies based on title and genres using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies['combined'] = movies['title'] + ' ' + movies['genres'].fillna('')
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['combined'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Recommend similar movies
def recommend_content(movie_id, top_n=5):
    # Row position of the movie (assumes movies keeps its default 0..n-1 index)
    idx = movies[movies['movieId'] == movie_id].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first; skip position 0, which is normally the movie itself
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
    movie_indices = [i[0] for i in sim_scores]
    return movies.iloc[movie_indices][['movieId', 'title']]

# Example
recommend_content(movie_id=1)


Unnamed: 0,movieId,title
1071,1072,"Pyromaniac's Love Story, A (1995)"
1065,1066,Balto (1995)
1218,1219,"Goofy Movie, A (1995)"
547,548,"NeverEnding Story III, The (1994)"
541,542,Pocahontas (1995)
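
The TF-IDF step can be illustrated with a small hand-rolled version. This is a sketch under stated assumptions: it uses a smoothed IDF with L2 normalization, which to my understanding is the default behavior of scikit-learn's `TfidfVectorizer`, but it tokenizes by whitespace and removes no stop words, so its numbers will not match the notebook's. The three documents are invented.

```python
import math
from collections import Counter

# Hypothetical "title + genres" strings
docs = [
    "toy story animation children comedy",
    "balto animation children",
    "heat action crime thriller",
]

def tfidf_vectors(texts):
    """Return one L2-normalized TF-IDF vector per text, over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many texts each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors

def dot(u, v):
    # For unit-length vectors the dot product equals the cosine similarity
    return sum(a * b for a, b in zip(u, v))

vecs = tfidf_vectors(docs)
sim_toy_balto = dot(vecs[0], vecs[1])
sim_toy_heat = dot(vecs[0], vecs[2])
print(round(sim_toy_balto, 3), sim_toy_heat)
```

The two animation titles share tokens and get a positive score, while the unrelated title scores 0. In the notebook, the `combined` text includes the release year from the title, so a shared year token may also raise the similarity between two movies.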


In [6]:
# ## ⚖️ Hybrid Recommendation

# Combine Collaborative Filtering and Content-Based Filtering.
# Take the CF results, look up similar movies with the content model, and aggregate the scores.

def hybrid_recommend(user_id, top_n=10):
    # Seed movies: the collaborative-filtering recommendations for this user
    user_based = recommend_user_cf(user_id, top_n=5)['movieId'].tolist()
    hybrid_scores = {}

    # Each seed movie casts one vote for each of its content-similar movies
    for movie_id in user_based:
        similar_movies = recommend_content(movie_id, top_n=5)
        for _, row in similar_movies.iterrows():
            hybrid_scores[row['movieId']] = hybrid_scores.get(row['movieId'], 0) + 1

    # Keep the top_n movies by vote count
    sorted_hybrid = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)
    movie_ids = [mid for mid, _ in sorted_hybrid[:top_n]]
    # Rows come back in the order of the movies DataFrame, not by vote count
    return movies[movies['movieId'].isin(movie_ids)][['movieId', 'title']]

# Example
hybrid_recommend(user_id=1)


Unnamed: 0,movieId,title
94,95,Aladdin (1992)
113,114,Wallace & Gromit: The Best of Aardman Animatio...
188,189,"Grand Day Out, A (1992)"
251,252,"Lost World: Jurassic Park, The (1997)"
357,358,Spawn (1997)
678,679,Conan the Barbarian (1981)
915,916,Lost in Space (1998)
918,919,"City of Lost Children, The (1995)"
1053,1054,Mr. Wrong (1996)
1366,1367,Faust (1994)
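
The vote-counting idea behind the hybrid step can be sketched without any DataFrames. The function name `hybrid_votes`, the `similar_lookup` dictionary, and all movie IDs below are hypothetical. Unlike the notebook's final `isin` filter, this variant returns the IDs in ranked order, breaking ties by the smaller ID.

```python
from collections import Counter

def hybrid_votes(seed_ids, similar_lookup, top_n=3):
    """Rank candidate movies by how many seed movies list them as similar."""
    votes = Counter()
    for seed in seed_ids:
        for candidate in similar_lookup.get(seed, []):
            votes[candidate] += 1
    # Highest vote count first; ties broken by ascending movie ID
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [movie_id for movie_id, _ in ranked[:top_n]]

# Toy data: two seed movies and their content-based neighbours
toy_similar = {
    10: [1, 2, 3],
    20: [2, 3, 4],
}

top_ids = hybrid_votes([10, 20], toy_similar, top_n=3)
print(top_ids)
```

Movies 2 and 3 are suggested by both seeds, so they rank ahead of movie 1, which has a single vote.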


In [7]:
# ## 📏 Evaluation Metrics

# Evaluate recommendation quality with RMSE and Precision@k

from sklearn.metrics import mean_squared_error
import numpy as np

# RMSE example: assumes predicted ratings are available
def evaluate_rmse(actual_ratings, predicted_ratings):
    return np.sqrt(mean_squared_error(actual_ratings, predicted_ratings))

# Precision@k: fraction of the top-k predicted items that appear in actual
def precision_at_k(actual, predicted, k=5):
    hits = sum(1 for i in predicted[:k] if i in actual)
    return hits / k if k > 0 else 0
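
The cell above defines the metrics but does not call them. As a minimal worked example, the sketch below re-implements both with the standard library only and applies them to invented values. The helper names `rmse` and `precision_at_k_demo` and all numbers are hypothetical and are not results from this notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def precision_at_k_demo(relevant, recommended, k=5):
    """Fraction of the first k recommended items that are in the relevant set."""
    if k <= 0:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

# Errors are 1, 0 and 1, so RMSE = sqrt(2 / 3)
toy_rmse = rmse([4, 3, 5], [3, 3, 4])

# Two of the five recommended IDs (174 and 302) are relevant
toy_precision = precision_at_k_demo({169, 174, 302}, [174, 95, 302, 1, 2], k=5)

print(round(toy_rmse, 4), toy_precision)
```

Note that precision divides by `k` even when fewer than `k` items are recommended, which matches the definition in the cell above.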
