In [92]:
import pandas as pd
import numpy as np
import copy


First let's load the data into dataframes:

In [None]:
tags_df = pd.read_csv('../data/tags.csv')
ratings_df = pd.read_csv('../data/ratings.csv')
genome_tags_df = pd.read_csv('../data/genome-tags.csv')
genome_scores_df = pd.read_csv('../data/genome-scores.csv')
movies_df = pd.read_csv('../data/movies.csv')


For exploratory data analysis, it makes a lot of sense to me to combine each movie with its relevance score for every tag if we are trying to perform clustering. The relevance values for each tag-movie pair live in the 'genome-scores' dataset. Let's do that:

First, we'll create columns for each of the unique tags using the genome_tags dataframe

In [None]:
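# Copy the slice so the new columns are added to an independent frame, not a view of movies_df
movies_and_tags = movies_df[['movieId', 'title']].copy()

genome_tags_list = genome_tags_df.values.tolist()
genome_scores_list = genome_scores_df.values.tolist()
# Each item is [tagId, tag]; create one zero-filled column per tag id
for item in genome_tags_list:
    movies_and_tags[item[0]] = 0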



Then, we'll create a nested dictionary (movie id, then tag id) and store all of the relevance scores in it. (There's gotta be an easier way to do this with pandas, but I couldn't figure out how.)

In [None]:
scores = {}
for item in genome_scores_list:
    if item[0] not in scores:
        scores[item[0]] = {}
    scores[item[0]][item[1]] = item[2]
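
One pandas route to the same wide table might be `pivot` followed by `merge`. This is only a sketch on made-up toy data: `genome_scores_toy`, `movies_toy`, `wide`, and `movies_and_tags_toy` are hypothetical names, and it has not been timed against the full dataset here.

```python
import pandas as pd

# Toy stand-in for genome-scores.csv (same column names as the real file).
genome_scores_toy = pd.DataFrame({
    "movieId":   [1, 1, 2, 2],
    "tagId":     [1, 2, 1, 2],
    "relevance": [0.029, 0.024, 0.036, 0.036],
})
# Toy stand-in for movies.csv; movie 3 has no genome scores.
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# One row per movie, one column per tag, relevance as the cell value.
wide = genome_scores_toy.pivot(index="movieId", columns="tagId", values="relevance")
wide.columns = wide.columns.astype(str)

# The inner join keeps only the movies that actually have genome scores.
movies_and_tags_toy = movies_toy.merge(
    wide, left_on="movieId", right_index=True, how="inner"
)
print(movies_and_tags_toy)
```

Because the join is an inner join, the untagged film drops out on its own, so no separate filtering step is needed in this version.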
    

Finally, we'll read in all of the tag scores (there are a LOT of them) and place them with their appropriate movie in the dataframe. Note: this takes about an hour on my machine, but the results are spectacular!

In [None]:
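# A set, so each untagged film is recorded once rather than once per tag column
un_tagged_films = set()

def fill_in_tag_scores(index, movie_id):
    # index is the tag id; return the stored relevance, or 0 if the film has no genome scores
    if movie_id in scores:
        return scores[movie_id][index]
    else:
        un_tagged_films.add(movie_id)
        return 0

# Fill one tag column at a time (printing the tag id as a progress indicator)
for item in genome_tags_list:
    print(item[0])
    movies_and_tags[item[0]] = movies_and_tags.apply(lambda row: fill_in_tag_scores(item[0], row['movieId']), axis=1)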

We'll also go ahead and convert the column labels to strings (in order to make accessing them easier). Then, we can remove all of the films that don't have tags: these are the rows where the tag '1' column is still 0.

In [157]:
movies_and_tags.columns = movies_and_tags.columns.astype(str)
movies_and_tags = movies_and_tags[movies_and_tags['1'] != 0]

And then we'll have a combined dataset!

In [34]:
movies_and_tags


Unnamed: 0,movieId,title,1,2,3,4,5,6,7,8,...,1119,1120,1121,1122,1123,1124,1125,1126,1127,1128
0,1,Toy Story (1995),0.02900,0.02375,0.05425,0.06875,0.16000,0.19525,0.07600,0.25200,...,0.03775,0.02250,0.04075,0.03175,0.12950,0.04550,0.02000,0.03850,0.09125,0.02225
1,2,Jumanji (1995),0.03625,0.03625,0.08275,0.08175,0.10200,0.06900,0.05775,0.10100,...,0.04775,0.02050,0.01650,0.02450,0.13050,0.02700,0.01825,0.01225,0.09925,0.01850
2,3,Grumpier Old Men (1995),0.04150,0.04950,0.03000,0.09525,0.04525,0.05925,0.04000,0.14150,...,0.05800,0.02375,0.03550,0.02125,0.12775,0.03250,0.01625,0.02125,0.09525,0.01750
3,4,Waiting to Exhale (1995),0.03350,0.03675,0.04275,0.02625,0.05250,0.03025,0.02425,0.07475,...,0.04900,0.03275,0.02125,0.03675,0.15925,0.05225,0.01500,0.01600,0.09175,0.01500
4,5,Father of the Bride Part II (1995),0.04050,0.05175,0.03600,0.04625,0.05500,0.08000,0.02150,0.07375,...,0.05375,0.02625,0.02050,0.02125,0.17725,0.02050,0.01500,0.01550,0.08875,0.01575
5,6,Heat (1995),0.02925,0.02575,0.02700,0.03450,0.06825,0.04675,0.04675,0.23375,...,0.05050,0.02475,0.02400,0.06500,0.17975,0.09250,0.03225,0.02150,0.09900,0.02300
6,7,Sabrina (1995),0.04775,0.05075,0.13400,0.08825,0.09550,0.19500,0.05400,0.08975,...,0.04675,0.03050,0.01275,0.02775,0.13300,0.04475,0.01800,0.01000,0.08550,0.01575
7,8,Tom and Huck (1995),0.03125,0.03750,0.08050,0.03150,0.05100,0.01750,0.01475,0.07650,...,0.05225,0.02475,0.02750,0.02575,0.31750,0.06400,0.01525,0.01450,0.08875,0.01675
8,9,Sudden Death (1995),0.03225,0.03550,0.02150,0.01650,0.02350,0.01975,0.01450,0.04600,...,0.02925,0.01350,0.01100,0.01275,0.15475,0.01525,0.01150,0.00700,0.09025,0.01600
9,10,GoldenEye (1995),0.99950,1.00000,0.03050,0.05150,0.07850,0.08550,0.12125,0.12200,...,0.61825,0.03450,0.01950,0.03225,0.14875,0.06225,0.02000,0.01400,0.09425,0.01850


Now let's think about clustering. I chose to use affinity propagation because it fits our use case nicely: we assume there is a large but unknown number of clusters, and the odds are high that the films will be distributed unevenly between them.

In order to prepare the data, let's remove everything that isn't a tag relevance score. (The rows keep their original order, so we can still match each row back to its movie id by position.)

In [158]:
from sklearn.cluster import AffinityPropagation

filtered_data = movies_and_tags[movies_and_tags.columns.difference(['title', 'movieId'])]
filtered_data


Unnamed: 0,1,10,100,1000,1001,1002,1003,1004,1005,1006,...,990,991,992,993,994,995,996,997,998,999
0,0.02900,0.02400,0.10550,0.30425,0.08700,0.00200,0.66100,0.14700,0.04425,0.02000,...,0.37350,0.09275,0.49400,0.01375,0.03375,0.11250,0.04725,0.23025,0.12875,0.52275
1,0.03625,0.05250,0.20275,0.14950,0.34075,0.00175,0.10825,0.14550,0.06950,0.03700,...,0.28200,0.58000,0.19775,0.01425,0.03425,0.12475,0.07125,0.08775,0.27925,0.41050
2,0.04150,0.03200,0.23975,0.07850,0.07600,0.00275,0.29025,0.10375,0.02500,0.01600,...,0.09625,0.07550,0.26650,0.01950,0.04425,0.13875,0.07200,0.10625,0.11625,0.26700
3,0.03350,0.02400,0.30825,0.07650,0.06500,0.00150,0.44625,0.10050,0.01625,0.01475,...,0.07950,0.07800,0.19275,0.00725,0.02950,0.15400,0.08250,0.07825,0.11825,0.18525
4,0.04050,0.02375,0.24475,0.05550,0.08650,0.00175,0.45425,0.32125,0.02050,0.01500,...,0.10050,0.07950,0.21975,0.01000,0.02450,0.12925,0.06550,0.10375,0.08950,0.23425
5,0.02925,0.02000,0.10650,0.52275,0.03525,0.00425,0.04300,0.08375,0.07550,0.03900,...,0.13200,0.04675,0.58250,0.00975,0.11875,0.07625,0.04000,0.53525,0.10625,0.82700
6,0.04775,0.11650,0.18575,0.05425,0.34200,0.00125,0.56050,0.24025,0.04750,0.04675,...,0.11050,0.04975,0.24350,0.01000,0.05325,0.10475,0.06025,0.14000,0.11050,0.23900
7,0.03125,0.03450,0.27050,0.06050,0.12300,0.00300,0.04950,0.19525,0.01825,0.01550,...,0.12700,0.13225,0.21175,0.00625,0.01900,0.13925,0.07925,0.05250,0.09400,0.19275
8,0.03225,0.01375,0.28800,0.13700,0.09175,0.00200,0.04075,0.02950,0.03125,0.02250,...,0.09250,0.04575,0.19000,0.00775,0.01775,0.18025,0.07475,0.45075,0.15225,0.34525
9,0.99950,0.01850,0.17200,0.17050,0.11350,0.00325,0.07075,0.17250,0.02575,0.01875,...,0.20600,0.06275,0.24575,0.01825,0.02350,0.13250,0.07250,0.29175,0.24975,0.43625
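
A side note on the column order in the output above: `Index.difference` returns the remaining labels sorted, and since the labels are strings here the sort is lexicographic ('1', '10', '100', ...). That does not matter for clustering, but if the original tag order is wanted, `drop(columns=...)` is one alternative. A small sketch on a made-up frame (`df_toy` and the other names are hypothetical):

```python
import pandas as pd

df_toy = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["A", "B"],
    "1": [0.1, 0.2],
    "2": [0.3, 0.4],
    "10": [0.5, 0.6],
})

# difference() sorts the remaining labels (as strings here).
sorted_cols = df_toy[df_toy.columns.difference(["title", "movieId"])]
# drop() keeps the remaining columns in their original order.
original_order = df_toy.drop(columns=["title", "movieId"])

print(list(sorted_cols.columns))     # ['1', '10', '2']
print(list(original_order.columns))  # ['1', '2', '10']
```

Either way the rows are untouched, so positional matching back to `movieId` still holds.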


Now we'll load the data into a numpy array and fit affinity propagation to it. As you can see, we get back one cluster label for each row in the dataset.

In [43]:
X = filtered_data.to_numpy()
clusters = AffinityPropagation().fit(X)
clusters.labels_

array([  0, 123,  31, ..., 539, 506, 541], dtype=int64)
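
To make the fitted object a bit less of a black box, here is a minimal sketch on toy data. The blobs, the `damping` value, and the `random_state` are illustrative assumptions, not the settings used above (the notebook uses the defaults), and the number of clusters found will depend on them.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Two well-separated toy blobs standing in for rows of tag relevance scores.
X_toy = np.vstack([
    rng.normal(loc=0.2, scale=0.02, size=(10, 5)),
    rng.normal(loc=0.8, scale=0.02, size=(10, 5)),
])

# damping must lie in [0.5, 1); the values here are illustrative, not tuned.
ap = AffinityPropagation(damping=0.7, random_state=0).fit(X_toy)

labels_toy = ap.labels_                     # one cluster label per row
exemplar_rows = ap.cluster_centers_indices_  # row index of each cluster's exemplar
print(len(set(labels_toy)), "clusters; exemplar rows:", exemplar_rows)
```

The exemplars are actual rows of the input, so on the real data each cluster has a representative film that could be looked up by position.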

With a little bit of hacking, we can set up dictionaries to map clusters to lists of film ids, and film ids to their corresponding cluster to make accessing this information easy.

In [159]:
cluster_list = clusters.labels_.tolist()
movie_ids_list = movies_and_tags['movieId'].tolist()
cluster_movie_dict = {}
movie_cluster_dict = {}
for x in range(0, len(cluster_list)):
    movie_cluster_dict[movie_ids_list[x]] = cluster_list[x]
    if cluster_list[x] not in cluster_movie_dict:
        cluster_movie_dict[cluster_list[x]] = []
    cluster_movie_dict[cluster_list[x]].append(movie_ids_list[x])
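
The same two lookups could also be built with `zip` and `collections.defaultdict`. A sketch on made-up labels and ids (`labels_toy`, `movie_ids_toy`, and the dictionary names are hypothetical stand-ins for the lists above):

```python
from collections import defaultdict

labels_toy = [0, 1, 0, 2]         # cluster label per row
movie_ids_toy = [1, 2, 3114, 6]   # movie id per row, in the same order

# movie id -> cluster label
movie_to_cluster = dict(zip(movie_ids_toy, labels_toy))

# cluster label -> list of movie ids
cluster_to_movies = defaultdict(list)
for movie_id, label in zip(movie_ids_toy, labels_toy):
    cluster_to_movies[label].append(movie_id)

print(movie_to_cluster)
print(dict(cluster_to_movies))
```

This relies on the labels and the movie ids being in the same row order, which is the same assumption the loop above makes.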
    

We can also define a few helper functions to make retrieval easier.

In [145]:
def similar_movies(movie_id):
    movies_in_cluster = list(cluster_movie_dict[movie_cluster_dict[movie_id]])
    movies_in_cluster.remove(movie_id)
    return movies_in_cluster

def get_film_name(movie_id):
    return movies_df.loc[movies_df['movieId'] == movie_id, 'title'].values[0]
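
One caveat: films that were filtered out for having no tags are not in the dictionaries, so `similar_movies` would raise a `KeyError` for them. A hedged sketch of a more forgiving variant, using toy dictionaries (`similar_movies_safe` and the `toy_*` names are hypothetical, not part of the notebook):

```python
# Toy lookups with the same shape as cluster_movie_dict / movie_cluster_dict.
toy_cluster_movie = {0: [1, 3114, 4886], 1: [2]}
toy_movie_cluster = {1: 0, 3114: 0, 4886: 0, 2: 1}

def similar_movies_safe(movie_id,
                        movie_cluster=toy_movie_cluster,
                        cluster_movie=toy_cluster_movie):
    """Return the other films in movie_id's cluster, or [] if it was never clustered."""
    cluster = movie_cluster.get(movie_id)
    if cluster is None:
        return []
    return [m for m in cluster_movie[cluster] if m != movie_id]

print(similar_movies_safe(1))    # [3114, 4886]
print(similar_movies_safe(999))  # []
```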

Ok! Let's test and see how well this has worked out. Let's see what the first film was clustered with:

In [160]:
get_film_name(1)

'Toy Story (1995)'

In [161]:
similar_to_toy_story = similar_movies(1)
[get_film_name(x) for x in similar_to_toy_story]

['Toy Story 2 (1999)',
 'Monsters, Inc. (2001)',
 'Finding Nemo (2003)',
 'Incredibles, The (2004)',
 'Ratatouille (2007)',
 'WALL·E (2008)',
 'Up (2009)',
 'Toy Story 3 (2010)',
 'The Lego Movie (2014)']

Seems legit. Let's try a few more personal favorites:

In [151]:
get_film_name(1197)


'Princess Bride, The (1987)'

In [152]:
[get_film_name(x) for x in similar_movies(1197)]

["It's a Wonderful Life (1946)",
 'Quiet Man, The (1952)',
 'Stand by Me (1986)',
 'Iron Giant, The (1999)',
 'Millions (2004)',
 'Slumdog Millionaire (2008)',
 'Dean Spanley (2008)',
 'Curious Case of Benjamin Button, The (2008)',
 'Butterfly, The (Papillon, Le) (2002)',
 'Pixar Story, The (2007)',
 'Rush: Beyond the Lighted Stage (2010)',
 'Steve Jobs: The Lost Interview (2012)',
 'Paperman (2012)',
 'Mystery of the Third Planet, The (Tayna tretey planety) (1981)',
 'Electric Boogaloo: The Wild, Untold Story of Cannon Films (2014)',
 'Patton Oswalt: Werewolves and Lollipops (2007)',
 'Legend No. 17 (2013)',
 'Inside Out (2015)',
 'Queen (2014)',
 'The Boy and the Beast (2015)',
 "Can't Change the Meeting Place (1979)",
 'Over the Garden Wall (2013)',
 'Tower (2016)',
 'A Monster Calls (2016)',
 'Mudbound (2017)',
 '13 reasons why',
 'Death Note: Desu nôto (2006–2007)',
 'The Adventures of Sherlock Holmes and Doctor Watson: The Hunt for the Tiger (1980)',
 'Coco (2017)',
 'Paddington 

In [153]:
get_film_name(152970)

'Hunt for the Wilderpeople (2016)'

In [154]:
[get_film_name(x) for x in similar_movies(152970)]

["Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",
 'Moonrise Kingdom (2012)',
 'Grand Budapest Hotel, The (2014)']

In [155]:
get_film_name(148626)

'Big Short, The (2015)'

In [156]:
[get_film_name(x) for x in similar_movies(148626)]

['Wall Street (1987)',
 'China Syndrome, The (1979)',
 'Manufacturing Consent: Noam Chomsky and the Media (1992)',
 'Trinity and Beyond (1995)',
 'Corporation, The (2003)',
 'Enron: The Smartest Guys in the Room (2005)',
 'Constant Gardener, The (2005)',
 'Lord of War (2005)',
 'Syriana (2005)',
 'Thank You for Smoking (2006)',
 'When the Levees Broke: A Requiem in Four Acts (2006)',
 'Jonestown: The Life and Death of Peoples Temple (2006)',
 'Bigger, Stronger, Faster* (2008)',
 'Cove, The (2009)',
 'Casino Jack and the United States of Money (2010)',
 'Inside Job (2010)',
 'Art of the Steal, The (2009)',
 'Client 9: The Rise and Fall of Eliot Spitzer (2010)',
 'Too Big to Fail (2011)',
 'Margin Call (2011)',
 'No (2012)',
 'Inequality for All (2013)',
 'Particle Fever (2013)',
 'Citizenfour (2014)',
 'Going Clear: Scientology and the Prison of Belief (2015)',
 'Best of Enemies (2015)',
 'Requiem for the American Dream (2015)',
 'Weiner (2016)',
 'HyperNormalisation (2016)']

Wow! I'm honestly blown away by how well this has worked. I was pretty worried that the sheer number of different tags would mean too many features, but affinity propagation seems to have done an incredible job of grouping similar films with the ones above.