# Exercise 2.2 - Content-Based Recommender - User Profile

To make more user-specific suggestions, we now build a profile for a single user from the movies they have already rated and use it to generate personalized recommendations.

For this we use the same movie features as in Exercise 2.1.

In [1]:
import ast
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.simplefilter('ignore')

Loading the movie metadata

In [2]:
list_columns = ['genres', 'keywords', 'production_companies', 'production_countries', 'spoken_languages', 'cast', 'director', 'producer', 'writer', 'music']

movies = pd.read_csv('data/movies.csv', keep_default_na=False, converters={col: ast.literal_eval for col in list_columns})

Loading the ratings

In [3]:
ratings = pd.read_csv('data/ratings.csv')

In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title
0,0,2852,2.5,1260759144,Dangerous Minds
1,0,1360,3.0,1260759182,Sleepers
2,0,1374,2.0,1260759185,Escape from New York
3,0,1545,3.0,1260759117,Blazing Saddles
4,0,2816,1.0,1260759200,Time Bandits


Build the meta-feature matrix as in Exercise 2.1

In [5]:
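def one_hot_encoder(col, take_n=3, min_occurence=2):
    # Keep the entries selected by [:take_n] from each list and one-hot encode them
    features = col.str[:take_n].str.join('|').str.get_dummies()
    # Drop features that occur in fewer than min_occurence movies
    features = features.loc[:, features.sum() >= min_occurence]
    print(col.name, features.shape)
    return features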
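
To see what the encoder does, the following minimal sketch applies the same three steps to a tiny column. The `cast` Series and its names are hypothetical toy data, not taken from `movies.csv`.

```python
import pandas as pd

# Hypothetical toy column: each entry is a list of names, like movies.cast
cast = pd.Series(
    [["Ann", "Bob", "Cy", "Dee"], ["Bob", "Eve"], ["Ann", "Bob"]],
    name="cast",
)

take_n, min_occurence = 3, 2

# Keep the first take_n names per row, join them with '|', one-hot encode
features = cast.str[:take_n].str.join("|").str.get_dummies()

# Drop names that occur in fewer than min_occurence rows
features = features.loc[:, features.sum() >= min_occurence]

print(features)
```

"Dee" is cut off by `take_n=3`, and "Cy" and "Eve" are dropped because each appears only once, so only the columns "Ann" and "Bob" remain.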

In [6]:
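meta_features = pd.concat([
    one_hot_encoder(movies.cast),
    one_hot_encoder(movies.director),
    # Note: take_n=-1 slices with [:-1], which drops each movie's last genre;
    # take_n=None would keep all genres.
    one_hot_encoder(movies.genres, -1),
], axis=1)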

meta_features = meta_features.set_index(movies.title)

cast (9025, 3545)
director (9025, 1656)
genres (9025, 20)
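
The `genres` column is encoded with `take_n=-1`, so the slice used inside `one_hot_encoder` is `[:-1]`. The sketch below shows, on hypothetical genre lists, how that differs from an open-ended slice. Whether dropping the last genre is intended here is not stated in the exercise.

```python
import pandas as pd

# Hypothetical toy genre lists, in the same list-per-row layout as movies.genres
genres = pd.Series([["Drama", "Action", "Crime", "Thriller"], ["Action", "Comedy"]])

# [:-1] keeps everything except the last element of each list
all_but_last = genres.str[:-1].tolist()

# [:None] is an open-ended slice and keeps every element
all_items = genres.str[:None].tolist()

print(all_but_last)
print(all_items)
```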


In [7]:
meta_features.head()

title,Larry Mullen Jr.,A. Michael Baldwin,Aaliyah,Aamir Khan,Aaron Abrams,Aaron Eckhart,Aaron Taylor-Johnson,Abbie Cornish,Abigail Breslin,Adam Baldwin,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
Inception,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0
The Dark Knight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Avatar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Avengers (2012),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Deadpool,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# User Profile

Compute the user profile for a single user from the movies they have rated.
Since we only use one-hot encoded features here, we do not use the weighted mean of the features; instead, as the cells below show, we weight each movie's features by the user's mean-centered rating and average the result.
Afterwards, compute the user's scores for all movies.
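
In formula form, the computation in the cells below can be summarized as follows. The notation is introduced here for illustration only: $I_u$ is the set of movies rated by user $u$, $r_{ui}$ the rating for movie $i$, $x_i$ the one-hot feature vector of movie $i$, and $r_{\max} = 5$ the maximum rating.

$$
\bar r_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{ui},
\qquad
p_u = \frac{1}{|I_u|} \sum_{i \in I_u} \left(r_{ui} - \bar r_u\right) x_i
$$

$$
s_{uj} = \frac{x_j \cdot p_u}{\lVert x_j \rVert \, \lVert p_u \rVert},
\qquad
\hat r_{uj} = \bar r_u + \frac{r_{\max}}{2} \, s_{uj}
$$

Here $p_u$ is the user profile, $s_{uj}$ the cosine-similarity score of movie $j$, and $\hat r_{uj}$ the predicted rating.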

In [8]:
# We want to generate recommendations for this user
user_id = 664

In [9]:
# Select all ratings of the user with id == user_id (copy, so columns can be added later)
user_ratings = ratings[ratings.userId == user_id].copy()
user_ratings.shape

(434, 5)

In [10]:
user_ratings.sort_values('rating', ascending=False).head(10)

Unnamed: 0,userId,movieId,rating,timestamp,title
99364,664,153,5.0,1010197453,Shrek
99255,664,1229,5.0,992837175,Remember the Titans
99159,664,1884,5.0,993424251,Frequency
99205,664,295,5.0,995233298,Good Will Hunting
99336,664,532,5.0,992920864,Pretty Woman
99375,664,1032,5.0,992838190,Miss Congeniality
99271,664,769,5.0,993179335,Ferris Bueller's Day Off
99439,664,700,5.0,992920551,Casablanca
99263,664,996,5.0,993179525,Big
99150,664,2348,5.0,993534911,Fallen


In [11]:
user_ratings.sort_values('rating').head(10)

Unnamed: 0,userId,movieId,rating,timestamp,title
99189,664,1751,1.0,1046966660,The 13th Warrior
99405,664,4708,1.0,993347173,The Brady Bunch Movie
99207,664,4416,1.0,1046967201,Urban Legends: Final Cut
99257,664,730,1.0,995232660,Romeo + Juliet
99355,664,4767,1.0,993346847,The Bachelor
99356,664,504,1.0,993347429,Dumb and Dumber
99299,664,2989,1.0,992838359,Coneheads
99428,664,1572,1.0,993347829,The Cable Guy
99392,664,695,1.0,1010198112,Scary Movie
99410,664,5689,1.0,993346524,Pecker


In [12]:
# Select from the feature matrix the features of the movies the user has rated
user_movie_features = meta_features.loc[user_ratings.title]
user_movie_features.shape

(434, 5221)

In [13]:
user_movie_features

title,Larry Mullen Jr.,A. Michael Baldwin,Aaliyah,Aamir Khan,Aaron Abrams,Aaron Eckhart,Aaron Taylor-Johnson,Abbie Cornish,Abigail Breslin,Adam Baldwin,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
Stuart Little,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The General's Daughter,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
Michael,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Score,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Schindler's List,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Simply Irresistible,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Dumbo,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Matrix,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
As Good as It Gets,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Create a new column with the normalized user ratings (subtract the user's mean rating)
user_ratings_mean = user_ratings.rating.mean()
user_ratings['rating_norm'] = (user_ratings.rating - user_ratings_mean)
user_ratings_mean

3.2857142857142856

In [15]:
# Weight the movie features by multiplying them with the normalized ratings
user_profile = (user_movie_features.values.T * user_ratings['rating_norm'].values)

In [16]:
user_profile.shape

(5221, 434)

In [17]:
# The mean of the weighted movie features is the user profile
user_profile = user_profile.mean(axis=1)
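
The three steps above (center the ratings, weight the features, average) can be traced by hand on a small example. The feature matrix and ratings below are hypothetical toy values. The matrix product at the end is an equivalent formulation, shown here as an alternative rather than as the notebook's approach.

```python
import numpy as np

# Hypothetical toy data: 3 rated movies x 4 one-hot features
features = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
])
ratings = np.array([5.0, 1.0, 3.0])

# Center the ratings around the user's mean: [2, -2, 0]
rating_norm = ratings - ratings.mean()

# Broadcast: each movie's feature column is scaled by that movie's centered rating
weighted = features.T * rating_norm      # shape (4, 3)

# Averaging over the movies gives one weight per feature
profile = weighted.mean(axis=1)          # shape (4,)

# Equivalent formulation as a single matrix-vector product
profile_alt = features.T @ rating_norm / len(ratings)

print(profile)
```

Feature 0 appears only in movies rated at or above the mean and gets a positive weight, feature 1 appears only in the low-rated movie and gets a negative weight, and feature 2 appears in both a high- and a low-rated movie, so its contributions cancel to zero.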

In [18]:
pd.Series(user_profile, index=meta_features.columns).sort_values(ascending=False).head(20)

Animation            0.049704
Adventure            0.041475
Family               0.035221
Tom Hanks            0.013167
Hamilton Luske       0.012179
Fantasy              0.010533
Tom Cruise           0.009546
Don Bluth            0.009546
Jerry Zucker         0.009546
Ian McKellen         0.009546
Cameron Diaz         0.009217
Wilfred Jackson      0.008887
Denzel Washington    0.008887
Brad Pitt            0.008558
Action               0.008229
Michael J. Fox       0.008229
Meg Ryan             0.007900
Gregory Hoblit       0.007900
Peter Jackson        0.007900
Eddie Murphy         0.007571
dtype: float64

In [19]:
pd.Series(user_profile, index=meta_features.columns).sort_values(ascending=False).tail(20)

Catherine O'Hara    -0.005925
Pierce Brosnan      -0.005925
Mel Brooks          -0.005925
Christine Taylor    -0.005925
Pat Morita          -0.005925
Denis Leary         -0.005925
David Arquette      -0.005925
Jeff Daniels        -0.006583
Claire Danes        -0.006583
Ralph Macchio       -0.006583
Danny DeVito        -0.006583
Gwyneth Paltrow     -0.007242
Renée Zellweger     -0.007900
Drama               -0.008229
Meryl Streep        -0.008887
Frank Oz            -0.009546
John McTiernan      -0.010204
Horror              -0.013825
Leonardo DiCaprio   -0.014154
Comedy              -0.033575
dtype: float64

In [20]:
# Compute the scores for all movies as the cosine similarity between the movie features and the user profile
user_scores = cosine_similarity(meta_features.values, [user_profile])
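
For intuition, the score can also be computed by hand with NumPy. This is a minimal sketch on hypothetical toy data; `cosine_scores` is a helper written for this illustration and is not part of scikit-learn. Rows with zero norm are assigned a score of 0 here to avoid a division by zero.

```python
import numpy as np

def cosine_scores(item_features, profile):
    """Cosine similarity between each row of item_features and profile."""
    item_features = np.asarray(item_features, dtype=float)
    profile = np.asarray(profile, dtype=float)
    dots = item_features @ profile
    norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
    # Where the norm product is zero, keep the score at 0 instead of NaN
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Hypothetical toy data: 4 movies x 4 features and a user profile
items = np.array([
    [1, 0, 0, 0],   # only the liked feature
    [0, 1, 0, 0],   # only the disliked feature
    [1, 1, 0, 0],   # both: contributions cancel
    [0, 0, 0, 0],   # no known features
])
profile = np.array([2 / 3, -2 / 3, 0.0, 0.0])

scores = cosine_scores(items, profile)
print(scores)
```

A movie with no features that survive the `min_occurence` filter has an all-zero row and therefore a score of 0, which later maps to exactly the user's mean rating.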

Hint: [pandas.DataFrame.merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)

In [21]:
# Now create a new DataFrame with all movie titles and the user's existing ratings
user_recommendations = movies[['title']].merge(user_ratings[['title', 'rating']], on='title', how='left')
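
The left merge keeps every movie in its original order and fills `rating` with `NaN` where the user has no rating. The sketch below shows this on hypothetical toy frames. `validate="one_to_one"` is an optional safeguard added for this illustration: it raises an error if a title occurs more than once on either side, a case that would otherwise add rows and break the row-by-row alignment with the scores.

```python
import pandas as pd

# Hypothetical toy frames standing in for movies and user_ratings
movies = pd.DataFrame({"title": ["A", "B", "C"]})
user_ratings = pd.DataFrame({"title": ["C", "A"], "rating": [4.0, 2.5]})

merged = movies[["title"]].merge(
    user_ratings[["title", "rating"]],
    on="title",
    how="left",
    validate="one_to_one",  # assumption: titles are unique on both sides
)

print(merged)
```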

In [22]:
# Add the scores of all movies as a new column (flattened from shape (n, 1) to (n,))
user_recommendations['score'] = user_scores.ravel()

In [23]:
# Compute predicted_rating by multiplying the score by MAX_RATING / 2 (= 2.5) and adding the user's mean rating
user_recommendations['predicted_rating'] = (user_recommendations['score'] * 2.5) + user_ratings_mean
user_recommendations['error'] = (user_recommendations.rating - user_recommendations.predicted_rating)
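
Because the cosine score lies in [-1, 1], this mapping produces values between the user's mean minus 2.5 and the user's mean plus 2.5, which can exceed the maximum rating. The sketch below reproduces the mapping for a few scores. The clipping step and the lower bound of 0.5 are assumptions added for illustration, not part of the exercise.

```python
import numpy as np

MAX_RATING = 5.0
MIN_RATING = 0.5  # assumption: half-star scale starting at 0.5
user_ratings_mean = 3.2857142857142856

# Two scores from the result tables plus a hypothetical perfect score of 1.0
scores = np.array([0.515853, -0.268634, 1.0])

# Same mapping as in the cell above
predicted = scores * (MAX_RATING / 2) + user_ratings_mean

# Optional: keep predictions inside the valid rating range
clipped = np.clip(predicted, MIN_RATING, MAX_RATING)

print(predicted)
print(clipped)
```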

In [24]:
# Show the top 10 recommendations
user_recommendations[pd.isnull(user_recommendations.rating)].sort_values('score', ascending=False).head(10)

Unnamed: 0,title,rating,score,predicted_rating,error
5922,Mind Game,,0.515853,4.575346,
3863,Asterix the Gaul,,0.505664,4.549875,
7219,Doug's 1st Movie,,0.480469,4.486888,
7508,Peter & the Wolf,,0.480469,4.486888,
1347,Paperman,,0.480469,4.486888,
666,The Polar Express,,0.474159,4.471112,
2981,Kirikou and the Sorceress,,0.46985,4.460338,
2551,Return to Never Land,,0.46985,4.460338,
5017,Fullmetal Alchemist the Movie: Conqueror of Sh...,,0.459206,4.433729,
746,Penguins of Madagascar,,0.45228,4.416414,


In [25]:
# Show the bottom 10 recommendations
user_recommendations[pd.isnull(user_recommendations.rating)].sort_values('score', ascending=True).head(10)

Unnamed: 0,title,rating,score,predicted_rating,error
8306,Bana Masal Anlatma,,-0.268634,2.614129,
8488,Funny Felix,,-0.268634,2.614129,
4880,The First Beautiful Thing,,-0.268634,2.614129,
8431,Dorian Blues,,-0.268634,2.614129,
8405,Fashion Victims,,-0.268634,2.614129,
7426,Not Quite Hollywood,,-0.268634,2.614129,
5154,Street Trash,,-0.268634,2.614129,
7249,12:08 East of Bucharest,,-0.268634,2.614129,
6841,Flesh Gordon,,-0.268634,2.614129,
2030,The Kings of Summer,,-0.268634,2.614129,


In [26]:
user_recommendations[pd.notnull(user_recommendations.error)].sort_values('error')

Unnamed: 0,title,rating,score,predicted_rating,error
1751,The 13th Warrior,1.0,0.146169,3.651136,-2.651136
2503,"The Karate Kid, Part III",1.0,0.050646,3.412329,-2.412329
809,The Beach,1.0,0.025881,3.350417,-2.350417
1988,The Score,1.0,-0.025805,3.221203,-2.221203
695,Scary Movie,1.0,-0.072986,3.103248,-2.103248
...,...,...,...,...,...
996,Big,5.0,-0.013036,3.253124,1.746876
1817,The Santa Clause,5.0,-0.048776,3.163774,1.836226
1710,The Whole Nine Yards,5.0,-0.064780,3.123765,1.876235
1741,The Family Man,5.0,-0.065699,3.121468,1.878532
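
To condense the `error` column into a single number, one option is the mean absolute error (MAE) or the root mean squared error (RMSE). The sketch below uses only the error values visible in the truncated table above, which are the extremes of the distribution and therefore not representative of all 434 rated movies. In the notebook, the same quantities could be obtained from `user_recommendations.error`, where pandas skips the `NaN` rows by default.

```python
import numpy as np

# Only the error values visible in the truncated table above (the extremes)
errors = np.array([
    -2.651136, -2.412329, -2.350417, -2.221203, -2.103248,
    1.746876, 1.836226, 1.876235, 1.878532,
])

mae = np.abs(errors).mean()            # mean absolute error
rmse = np.sqrt((errors ** 2).mean())   # root mean squared error

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```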
