# Exercise 2.1 - Content-Based Recommender - Similar Movies

The recommendations we made in Exercise 1 are not yet very good: every person receives the same recommendations, regardless of their taste. To make better recommendations, we now bring in the movies' metadata. For example, after a person has watched a movie, we can suggest similar movies, such as movies with a similar cast or a similar description.

In [1]:
import ast
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

Load the movie metadata

In [2]:
list_columns = ['genres', 'keywords', 'production_companies', 'production_countries', 'spoken_languages', 'cast', 'director', 'producer', 'writer', 'music']

movies = pd.read_csv('data/movies.csv', keep_default_na=False, converters={col: ast.literal_eval for col in list_columns})
movies = movies.set_index('title', drop=False)

### Movie Description Based Recommender
We will now implement a recommender that takes the movie descriptions into account.\
Hint: [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [3]:
# Transform the descriptions (tagline + overview) into a TF-IDF matrix with the TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf_matrix = tf.fit_transform(movies['tagline'] + ' ' + movies['overview'])
tfidf_matrix.shape

(9025, 30483)

Hint: [sklearn.feature_extraction.text.TfidfVectorizer.get_feature_names_out](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.get_feature_names_out)

In [4]:
# To understand the tfidf_matrix better, we can convert it back into a DataFrame,
# with the movie titles as rows (index) and the feature names as columns
text_features = pd.DataFrame(tfidf_matrix.todense(), index=movies.title, columns=tf.get_feature_names_out())
text_features

title,00,000,007,01,05,05pm,06,08,09,10,...,élan,émigré,état,étienne,évocateur,ôtomo,østergaard,žižek,ˈfil,ˌrän
Inception,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Avatar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Avengers (2012),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Deadpool,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
The Fern Flower,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Wonderland,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
To Have (Or Not),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Swedish Auto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
text_features.loc['Toy Story'].sort_values(ascending=False).head(20)

buzz             0.497922
woody            0.444590
andy             0.428771
lightyear        0.189024
toys             0.161764
aside            0.153902
separate         0.143716
plots            0.142924
differences      0.141424
afraid           0.139359
circumstances    0.137479
happily          0.135753
duo              0.133654
birthday         0.131294
room             0.122936
scene            0.121770
losing           0.118374
led              0.112888
brings           0.111280
owner            0.109787
Name: Toy Story, dtype: float64

In [6]:
text_features.mean().sort_values(ascending=False).head(30)

life       0.015326
man        0.013160
new        0.012499
love       0.012346
young      0.011603
world      0.011529
story      0.010687
family     0.010360
woman      0.009218
film       0.008990
time       0.008812
old        0.007567
father     0.007189
war        0.006921
years      0.006795
finds      0.006718
wife       0.006615
friends    0.006614
town       0.006556
way        0.006552
lives      0.006548
school     0.006497
year       0.006483
girl       0.006418
just       0.006164
home       0.006120
help       0.005897
son        0.005821
city       0.005723
mother     0.005659
dtype: float64

#### Cosine Similarity

We will now compute a similarity matrix for all movies. Use the cosine similarity, which is defined as follows:

$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Hint: [sklearn.metrics.pairwise.cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)
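
As a small worked example of the formula, the sketch below computes the cosine similarity of two made-up vectors by hand and compares it with the scikit-learn function. The vectors are arbitrary illustration values.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two arbitrary row vectors
x = np.array([[1.0, 2.0, 0.0]])
y = np.array([[2.0, 1.0, 1.0]])

# Dot product divided by the product of the Euclidean norms
manual = (x @ y.T) / (np.linalg.norm(x) * np.linalg.norm(y))
library = cosine_similarity(x, y)

print(manual[0, 0], library[0, 0])  # both are 4 / sqrt(30)
```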

In [7]:
# Compute the similarity matrix
cosine_sim = cosine_similarity(text_features.values, text_features.values)
cosine_sim.shape

(9025, 9025)
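
`cosine_similarity` also accepts sparse input, so the sparse `tfidf_matrix` could in principle be passed directly instead of the dense `text_features.values`. The sketch below shows on a small made-up matrix that both inputs give the same result.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Arbitrary toy matrix, once dense and once sparse
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [1.0, 0.0, 1.0]])
sparse = csr_matrix(dense)

sim_dense = cosine_similarity(dense)
sim_sparse = cosine_similarity(sparse)  # returns a dense array by default

print(np.allclose(sim_dense, sim_sparse))
```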

In [8]:
# We can also convert the similarity matrix back into a DataFrame,
# now with the movie titles in the index as well as in the columns
text_similarities = pd.DataFrame(cosine_sim, index=movies.title, columns=movies.title)
text_similarities.head()

title,Inception,The Dark Knight,Avatar,The Avengers (2012),Deadpool,Interstellar,Django Unchained,Guardians of the Galaxy,Fight Club,The Hunger Games,...,Life After Tomorrow,Life Is Sacred,"This World, Then the Fireworks",Hav Plenty,Dear Jesse,The Fern Flower,Wonderland,To Have (Or Not),Swedish Auto,Men with Guns
Inception,1.0,0.013137,0.0,0.0,0.004119,0.0,0.007773,0.031783,0.0,0.019787,...,0.0,0.0,0.0,0.0,0.0,0.0,0.004991,0.006115,0.006811,0.0
The Dark Knight,0.013137,1.0,0.0,0.016492,0.0,0.0,0.036738,0.0,0.0,0.016152,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Avatar,0.0,0.0,1.0,0.009178,0.0,0.0,0.0,0.0,0.0,0.00521,...,0.0,0.011221,0.0,0.0,0.0,0.0,0.0,0.0,0.010981,0.0
The Avengers (2012),0.0,0.016492,0.009178,1.0,0.0,0.0,0.0,0.013542,0.0,0.004654,...,0.0,0.010024,0.0,0.0,0.0,0.0,0.0,0.066166,0.009809,0.0
Deadpool,0.004119,0.0,0.0,0.0,1.0,0.0,0.006275,0.0,0.004492,0.029084,...,0.0,0.0,0.0,0.007663,0.0,0.0,0.008716,0.029266,0.013205,0.0


Now that we have the pairwise similarities of all movies, the next step is to find the 20 most similar movies for a given movie.

In [9]:
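# Implement the following function, which returns the 20 most similar movies for a given movie title
def get_recommendations(similarities, title):
    # Unknown title: return an empty Series so that callers can still chain .head()
    if title not in similarities.index:
        return pd.Series(dtype=float)
    # Take the movie's row, drop the movie itself, and keep the 20 highest similarities
    return similarities.loc[title].drop(title).sort_values(ascending=False).head(20)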
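
The lookup pattern used in the function can be tried out on a small, made-up similarity matrix first. The titles and values below are hypothetical; the chain mirrors the function body, only with the top 2 instead of the top 20.

```python
import pandas as pd

# Hypothetical symmetric similarity matrix for four movies
titles = ["A", "B", "C", "D"]
toy_similarities = pd.DataFrame(
    [[1.0, 0.2, 0.7, 0.0],
     [0.2, 1.0, 0.1, 0.5],
     [0.7, 0.1, 1.0, 0.3],
     [0.0, 0.5, 0.3, 1.0]],
    index=titles, columns=titles,
)

# Take the row of movie "A", drop "A" itself, sort descending, keep the top 2
top2 = toy_similarities.loc["A"].drop("A").sort_values(ascending=False).head(2)
print(top2)
```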


Now we can determine the most similar movies for a few movies

In [10]:
get_recommendations(text_similarities, 'The Dark Knight').head(30)

title
The Dark Knight Rises                      0.305602
Batman Returns                             0.240747
Batman: The Dark Knight Returns, Part 2    0.234297
Batman Forever                             0.218458
Batman: Under the Red Hood                 0.211061
Batman: Mask of the Phantasm               0.184428
Batman: Year One                           0.183901
Batman: The Dark Knight Returns, Part 1    0.174043
Batman                                     0.152652
Batman Begins                              0.152145
JFK                                        0.120650
Batman v Superman: Dawn of Justice         0.117945
Criminal Law                               0.117328
Q & A                                      0.109159
To End All Wars                            0.101575
Batman & Robin                             0.099902
Law Abiding Citizen                        0.099254
The Wrong Man                              0.090724
The Rookie                                 0.087983
The Fi

In [11]:
movies.loc['The Dark Knight'].overview

'Batman raises the stakes in his war on crime. With the help of Lt. Jim Gordon and District Attorney Harvey Dent, Batman sets out to dismantle the remaining criminal organizations that plague the streets. The partnership proves to be effective, but they soon find themselves prey to a reign of chaos unleashed by a rising criminal mastermind known to the terrified citizens of Gotham as the Joker.'

In [12]:
movies.loc['The Dark Knight Rises'].overview

"Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's crimes to protect the late attorney's reputation and is subsequently hunted by the Gotham City Police Department. Eight years later, Batman encounters the mysterious Selina Kyle and the villainous Bane, a new terrorist leader who overwhelms Gotham's finest. The Dark Knight resurfaces to protect a city that has branded him an enemy."

In [13]:
text_features.loc['The Dark Knight'].sort_values(ascending=False).head(15)

batman           0.347590
criminal         0.243522
organizations    0.209640
dismantle        0.209640
dent             0.197753
joker            0.193511
effective        0.181624
gotham           0.172189
unleashed        0.170688
terrified        0.170688
partnership      0.169277
raises           0.169277
harvey           0.165495
prey             0.164360
mastermind       0.164360
Name: The Dark Knight, dtype: float64

In [14]:
text_features.loc['The Dark Knight Rises'].sort_values(ascending=False).head(15)

dent              0.342053
batman            0.300612
gotham            0.297835
attorney          0.259377
protect           0.212176
bane              0.181307
selina            0.181307
overwhelms        0.181307
branded           0.171027
city              0.161662
resurfaces        0.161568
kyle              0.159198
subsequently      0.157077
assumes           0.150306
responsibility    0.145248
Name: The Dark Knight Rises, dtype: float64
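
The two lists help explain the similarity score. `TfidfVectorizer` normalizes each row to unit length by default (`norm='l2'`), so the cosine similarity of two movies is simply the sum of the products of their shared term weights. The sketch below adds up only the three shared terms that are visible in both top-15 lists above; the remaining shared terms are not shown in the outputs and are therefore left out.

```python
# Weights copied from the two outputs above (terms visible in both top-15 lists)
dark_knight = {"batman": 0.347590, "dent": 0.197753, "gotham": 0.172189}
dark_knight_rises = {"batman": 0.300612, "dent": 0.342053, "gotham": 0.297835}

# Contribution of each shared term to the dot product
contributions = {
    term: dark_knight[term] * dark_knight_rises[term] for term in dark_knight
}
partial = sum(contributions.values())

reported = 0.305602  # similarity reported by get_recommendations above
print(contributions)
print(partial, partial / reported)
```

These three terms alone account for most of the reported similarity of 0.305602; the rest has to come from other words that both texts share.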

### Metadata Based Recommender

The previous recommender is based only on the movie descriptions. However, the actors and the director are also a large part of what defines a movie.
We will therefore build a recommender that takes these aspects into account as well.

First we create a function that builds a one-hot encoding from a column *col* containing lists.
To do this, we take the first *take_n* elements of each entry and create a one-hot encoding of the elements that occur at least *min_occurence* times.

In [15]:
# We shrink the dataset a little to speed up the computations
movies_small = movies[movies.vote_count > 1000]
movies_small.shape

(1019, 19)

In [16]:
# Let's look at the column with the actors
col = movies_small.cast
take_n = 3
min_occurence = 2
col

title
Inception              [Leonardo DiCaprio, Joseph Gordon-Levitt, Elle...
The Dark Knight        [Christian Bale, Michael Caine, Heath Ledger, ...
Avatar                 [Sam Worthington, Zoe Saldana, Sigourney Weave...
The Avengers (2012)    [Robert Downey Jr., Chris Evans, Mark Ruffalo,...
Deadpool               [Ryan Reynolds, Morena Baccarin, Ed Skrein, T....
                                             ...                        
Secret Window          [Johnny Depp, John Turturro, Maria Bello, Timo...
Ouija                  [Olivia Cooke, Ana Coto, Daren Kagasoff, Bianc...
One Day                [Anne Hathaway, Jim Sturgess, Patricia Clarkso...
Goldfinger             [Sean Connery, Honor Blackman, Gert Fröbe, Shi...
The Ugly Truth         [Katherine Heigl, Gerard Butler, Eric Winter, ...
Name: cast, Length: 1019, dtype: object

Hint: the string functions can also be applied to a pandas.Series that consists of lists\
Select the first n elements: [pandas.Series.str.slice](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.slice.html)\
Join the elements of a list: [pandas.Series.str.join](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.join.html)\
One-hot encoding of a string: [pandas.Series.str.get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get_dummies.html)

In [17]:
# Select the first take_n elements
col_n = col.str[:take_n]
col_n

title
Inception              [Leonardo DiCaprio, Joseph Gordon-Levitt, Elle...
The Dark Knight            [Christian Bale, Michael Caine, Heath Ledger]
Avatar                  [Sam Worthington, Zoe Saldana, Sigourney Weaver]
The Avengers (2012)       [Robert Downey Jr., Chris Evans, Mark Ruffalo]
Deadpool                     [Ryan Reynolds, Morena Baccarin, Ed Skrein]
                                             ...                        
Secret Window                  [Johnny Depp, John Turturro, Maria Bello]
Ouija                           [Olivia Cooke, Ana Coto, Daren Kagasoff]
One Day                 [Anne Hathaway, Jim Sturgess, Patricia Clarkson]
Goldfinger                    [Sean Connery, Honor Blackman, Gert Fröbe]
The Ugly Truth             [Katherine Heigl, Gerard Butler, Eric Winter]
Name: cast, Length: 1019, dtype: object

In [18]:
# Join the elements of each list, separated by '|' (the default separator of get_dummies)
col_n_str = col_n.str.join('|')
col_n_str

title
Inception              Leonardo DiCaprio|Joseph Gordon-Levitt|Ellen Page
The Dark Knight                Christian Bale|Michael Caine|Heath Ledger
Avatar                      Sam Worthington|Zoe Saldana|Sigourney Weaver
The Avengers (2012)           Robert Downey Jr.|Chris Evans|Mark Ruffalo
Deadpool                         Ryan Reynolds|Morena Baccarin|Ed Skrein
                                             ...                        
Secret Window                      Johnny Depp|John Turturro|Maria Bello
Ouija                               Olivia Cooke|Ana Coto|Daren Kagasoff
One Day                     Anne Hathaway|Jim Sturgess|Patricia Clarkson
Goldfinger                        Sean Connery|Honor Blackman|Gert Fröbe
The Ugly Truth                 Katherine Heigl|Gerard Butler|Eric Winter
Name: cast, Length: 1019, dtype: object

In [19]:
# One-hot encoding with get_dummies()
col_ohe = col_n_str.str.get_dummies()
col_ohe

title,A.J. Cook,Aaron Eckhart,Aaron Paul,Aaron Taylor-Johnson,Aasif Mandvi,Abbie Cornish,Abigail Breslin,Abigail Hargrove,Adam Baldwin,Adam Driver,...,Zach Galligan,Zach Gilford,Zachary Levi,Zachary Quinto,Zhang Ziyi,Zoe Saldana,Zooey Deschanel,Zoë Bell,Óscar Jaenada,Моррис Честнат
Inception,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Dark Knight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Avatar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
The Avengers (2012),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Deadpool,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Secret Window,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ouija,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
One Day,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Goldfinger,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Hint: [pandas.DataFrame.sum](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)

In [20]:
# Drop the columns of elements that occur fewer than min_occurence times
col_ohe = col_ohe.loc[:, col_ohe.sum() >= min_occurence]
col_ohe

title,Aaron Eckhart,Aaron Taylor-Johnson,Abbie Cornish,Adam Sandler,Adrien Brody,Al Pacino,Alan Rickman,Alan Tudyk,Albert Brooks,Alden Ehrenreich,...,William H. Macy,William Moseley,Winona Ryder,Woody Allen,Woody Harrelson,Zac Efron,Zach Galifianakis,Zachary Quinto,Zoe Saldana,Zooey Deschanel
Inception,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Dark Knight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Avatar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
The Avengers (2012),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Deadpool,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Secret Window,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ouija,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
One Day,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Goldfinger,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now combine the individual steps from above into the following function, so that a one-hot encoding can be created for any column

In [21]:
# One Hot Encoder
def one_hot_encoder(col, take_n=3, min_occurence=2):
    features = col.str[:take_n].str.join('|').str.get_dummies()
    features = features.loc[:, features.sum() >= min_occurence]
    print(col.name, features.shape)
    return features
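
The individual steps can be followed on a tiny, made-up column of lists. The names and movies below are hypothetical; the chain is the same as in the cells above, with the first two entries kept per list and a minimum of two occurrences.

```python
import pandas as pd

# Hypothetical cast lists, only for illustration
toy_cast = pd.Series(
    [["Ann", "Bob", "Cid", "Dee"], ["Bob", "Eve"], ["Ann", "Fay", "Eve"]],
    index=["Movie 1", "Movie 2", "Movie 3"],
    name="cast",
)

first_two = toy_cast.str[:2]        # keep the first two entries of each list
joined = first_two.str.join("|")    # 'Ann|Bob', 'Bob|Eve', 'Ann|Fay'
dummies = joined.str.get_dummies()  # one 0/1 column per name

# Keep only names that occur in at least two movies
frequent = dummies.loc[:, dummies.sum() >= 2]
print(frequent)
```

"Eve" and "Fay" each occur only once among the first two entries, so only the columns "Ann" and "Bob" remain.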

Now use one_hot_encoder to encode several columns and combine them into one large features DataFrame\
Hint: [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)

In [22]:
# Build the meta_features DataFrame
meta_features = pd.concat([
    one_hot_encoder(movies_small.cast),
    one_hot_encoder(movies_small.director),
    #one_hot_encoder(movies_small.producer),
    #one_hot_encoder(movies_small.writer),
    #one_hot_encoder(movies_small.music),
    one_hot_encoder(movies_small.genres, -1),
    #one_hot_encoder(movies_small.keywords, -1),
    one_hot_encoder(movies_small.production_companies),
    #one_hot_encoder(movies_small.production_countries),
    one_hot_encoder(movies_small.spoken_languages),
], axis=1)

meta_features = meta_features.set_index(movies_small.title)

cast (1019, 525)
director (1019, 245)
genres (1019, 17)
production_companies (1019, 315)
spoken_languages (1019, 36)
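
One detail of the cell above is worth checking: with `take_n=-1`, the slice `col.str[:take_n]` becomes `[:-1]`, which drops the last entry of every list rather than keeping all of them. The source does not say whether this is intended. If the goal is to keep every entry, a stop value of `None` would do that; the sketch below shows the difference on made-up genre lists.

```python
import pandas as pd

# Hypothetical genre lists, only for illustration
toy_genres = pd.Series([["Action", "Crime", "Drama"], ["Comedy"]], name="genres")

all_but_last = toy_genres.str[:-1]  # what take_n=-1 produces
everything = toy_genres.str[:None]  # a stop of None keeps every entry

print(all_but_last.tolist())
print(everything.tolist())
```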


In [23]:
meta_features

title,Aaron Eckhart,Aaron Taylor-Johnson,Abbie Cornish,Adam Sandler,Adrien Brody,Al Pacino,Alan Rickman,Alan Tudyk,Albert Brooks,Alden Ehrenreich,...,עִבְרִית,اردو,العربية,فارسی,हिन्दी,ภาษาไทย,广州话 / 廣州話,日本語,普通话,한국어/조선말
Inception,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Dark Knight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
Avatar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
The Avengers (2012),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Deadpool,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Secret Window,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ouija,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
One Day,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Goldfinger,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [24]:
# Most frequent features
meta_features.sum().sort_values(ascending=False).head(30)

English                                   999
Action                                    338
Adventure                                 268
Drama                                     254
Comedy                                    206
Thriller                                  138
Fantasy                                   124
Universal Pictures                        118
Animation                                 107
Science Fiction                           104
Français                                  103
Crime                                      99
Twentieth Century Fox Film Corporation     89
Warner Bros.                               87
Paramount Pictures                         85
Español                                    80
Family                                     78
Columbia Pictures                          73
Deutsch                                    58
Horror                                     57
Walt Disney Pictures                       54
Mystery                           

In [25]:
# Features of a single movie
meta_features.loc['The Dark Knight'].sort_values(ascending=False).head(30)

Heath Ledger          1
Christopher Nolan     1
Legendary Pictures    1
Warner Bros.          1
DC Comics             1
Michael Caine         1
Action                1
English               1
Christian Bale        1
Crime                 1
Drama                 1
普通话                   1
Timur Bekmambetov     0
Todd Phillips         0
Wes Anderson          0
Tim Story             0
Tim Johnson           0
Tom Hooper            0
Tim Burton            0
Tom McGrath           0
Tom Shadyac           0
Tom Tykwer            0
Tony Scott            0
Vicky Jenson          0
Aaron Eckhart         0
Wes Ball              0
Wes Craven            0
Wilfred Jackson       0
Will Gluck            0
Terry Jones           0
Name: The Dark Knight, dtype: int64

Now we can compute a new similarity matrix from our meta features, for example with the Jaccard similarity\
Hint: [sklearn.metrics.pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)

In [26]:
# Compute the new similarity matrix (Jaccard)
jaccard_sim = 1-pairwise_distances(meta_features.astype(bool).values, metric="jaccard", n_jobs=-1)
meta_similarities_jaccard = pd.DataFrame(jaccard_sim, index=movies_small.title, columns=movies_small.title)
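
The Jaccard similarity of two binary feature vectors is the number of features both have, divided by the number of features at least one of them has. The sketch below checks this on a small made-up boolean matrix and compares the manual value with `1 - pairwise_distances(..., metric="jaccard")`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical boolean feature matrix: 3 movies, 5 features
features = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
], dtype=bool)

# pairwise_distances returns the Jaccard distance, so similarity = 1 - distance
toy_jaccard_sim = 1 - pairwise_distances(features, metric="jaccard")

# Manual check for the first two movies: 2 shared features, 4 in the union
a, b = features[0], features[1]
manual = (a & b).sum() / (a | b).sum()

print(toy_jaccard_sim[0, 1], manual)
```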

As before, we can now generate recommendations with the new similarity matrix

In [27]:
get_recommendations(meta_similarities_jaccard, 'The Dark Knight').head(10)

title
Batman Begins                            0.714286
The Dark Knight Rises                    0.642857
The Prestige                             0.375000
Interstellar                             0.294118
Gangster Squad                           0.277778
Pacific Rim                              0.266667
The Fast and the Furious: Tokyo Drift    0.266667
Inception                                0.263158
Heat                                     0.263158
Superman Returns                         0.263158
Name: The Dark Knight, dtype: float64

In [28]:
# Compare which features occur in both movies
compare = ['The Dark Knight', 'The Prestige']
meta_features.loc[compare, (meta_features.loc[compare[0]] > 0) & (meta_features.loc[compare[1]] > 0)]

title,Christian Bale,Michael Caine,Christopher Nolan,Drama,Warner Bros.,English
The Dark Knight,1,1,1,1,1,1
The Prestige,1,1,1,1,1,1
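
The numbers above fit together. The two movies share 6 features, and the earlier feature listing for 'The Dark Knight' shows 12 active features. The short calculation below derives what the reported Jaccard similarity of 0.375 implies for the union and for 'The Prestige'. The last value is an inference from these outputs, not something the notebook prints.

```python
# Values read from the outputs above
shared = 6                # features present in both movies
dark_knight_active = 12   # active features of 'The Dark Knight'
jaccard = 0.375           # reported similarity to 'The Prestige'

# jaccard = shared / union, so union = shared / jaccard
union = shared / jaccard

# union = active_a + active_b - shared, solved for 'The Prestige'
prestige_active = union - dark_knight_active + shared

print(union, prestige_active)
```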
