# Determine Movie Genre by Neighboring Movies

Using the k-nearest neighbors method, predict a target movie's genre from the top-k movies most similar to it.

Rank all other movies by the Jaccard similarity between their actor sets and the target movie's actor set, then select the top-k most similar movies.
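
As a minimal sketch of the similarity measure named above: the Jaccard similarity of two actor sets is the size of their intersection divided by the size of their union. The helper name `jaccard_similarity` and the actor IDs below are illustrative and not part of this notebook; returning 0 for two empty casts is an assumption made here to avoid a division by zero.

```python
def jaccard_similarity(a, b):
    """Return |a & b| / |a | b| for two sets, or 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two movies with no listed actors count as dissimilar
        return 0.0
    return len(a & b) / len(union)


# Made-up actor IDs, for illustration only
cast_x = {"a1", "a2", "a3", "a4"}
cast_y = {"a1", "a5"}

# One shared actor out of five distinct actors
print(jaccard_similarity(cast_x, cast_y))
```

A score of 1 means the two casts are identical, and a score of 0 means they share no actors.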

In [1]:
import json

import pandas as pd
import numpy as np

from scipy.sparse import lil_matrix

import matplotlib.pyplot as plt


In [2]:
actor_name_map = {}
movie_actor_map = {}
actor_genre_map = {}


with open("imdb_movies_2000to2022.prolific.json", "r") as in_file:
    for line in in_file:
        
        # Read the movie on this line and parse its json
        this_movie = json.loads(line)
        
        # Skip movies with no ratings
        if len(this_movie["rating"]) == 0:
            continue
                    
        # Add all actors to the id->name map
        for actor_id,actor_name in this_movie['actors']:
            actor_name_map[actor_id] = actor_name
            
        # For each actor, add this movie's genres to that actor's list
        for actor_id,actor_name in this_movie['actors']:
            this_actors_genres = actor_genre_map.get(actor_id, {})
            
            # Increment the count of genres for this actor
            for g in this_movie["genres"]:
                this_actors_genres[g] = this_actors_genres.get(g, 0) + 1
                
            # Update the map
            actor_genre_map[actor_id] = this_actors_genres
            
        # Finished with this film
        movie_actor_map[this_movie["imdb_id"]] = ({
            "movie": this_movie["title"],
            "actors": set([item[0] for item in this_movie['actors']]),
            "genres": this_movie["genres"],
            "rating": this_movie["rating"]["avg"]
        })

In [3]:
print("Known Actors:", len(actor_name_map))
print("Known Movies:", len(movie_actor_map))

Known Actors: 29499
Known Movies: 18841


In [4]:
actor_id_to_index = {actor_id:i for i,actor_id in enumerate(actor_name_map.keys())}


In [118]:
# target_movie_id = "tt0317705" # The Incredibles
# target_movie_id = "tt0816692" # Interstellar
target_movie_id = "tt0332280" # The Notebook

In [119]:
target_movie_object = movie_actor_map[target_movie_id]

In [120]:
target_movie_object

{'movie': 'The Notebook',
 'actors': {'nm0001258', 'nm0001687', 'nm0331516', 'nm1046097'},
 'genres': ['Drama', 'Romance'],
 'rating': 7.8}

## Find the Most Similar Movies by Jaccard Similarity of Actors

In [121]:
movie_similarities = []

for this_movie_id,this_movie_obj in movie_actor_map.items():
    # Skip the target movie
    if this_movie_id == target_movie_id:
        continue
    target_actors = set(target_movie_object['actors'])
    this_actors = set(this_movie_obj['actors'])
        
    intersection = len(target_actors & this_actors)
    union = len(target_actors | this_actors)

    # Guard against division by zero when neither movie lists any actors
    jaccard = intersection / union if union else 0.0
    
    # Add this movie and its Jaccard similarity to the list, so we can rank at the end
    movie_similarities.append({
        "movie": this_movie_id,
        "jaccard": jaccard,
    })

In [122]:
similarity_df = pd.DataFrame(movie_similarities, columns=["movie", "jaccard"])

## Use the top-k similar movies to infer genre

In [123]:
k_nn = 5

In [124]:
similarity_df.sort_values(by="jaccard", ascending=False).head(k_nn)

| | movie | jaccard |
|---|---|---|
| 9244 | tt1580346 | 0.25 |
| 10165 | tt1798684 | 0.142857 |
| 10704 | tt1972571 | 0.142857 |
| 4236 | tt0772157 | 0.142857 |
| 9367 | tt1606389 | 0.142857 |


In [126]:
# TODO: count the genres of the top-k movies and print the most common
# genres as the predicted genres for the target movie

In [127]:
top_k = similarity_df.sort_values(by="jaccard", ascending=False).head(k_nn)

# Count how often each genre appears among the top-k most similar movies
genre_counts = {}
for movie_id in top_k["movie"]:
    for genre in movie_actor_map[movie_id]["genres"]:
        genre_counts[genre] = genre_counts.get(genre, 0) + 1

# Rank genres by count, most common first
sorted_genres = sorted(genre_counts.items(), key=lambda x: x[1], reverse=True)

# Print up to three of the most common genres as the prediction
print("Predicted genres:", end=' ')
for genre, _ in sorted_genres[:3]:
    print(genre, end=' ')
print()

Predicted genres: Drama Action Crime 


Predicted genres for each target movie at different values of k:

For k = 1:

- The Incredibles: Action, Adventure, Animation
- Interstellar: Drama, Mystery, Thriller
- The Notebook: Action, Biography, Documentary

For k = 3:

- The Incredibles: Adventure, Documentary, Action
- Interstellar: Drama, Documentary, Mystery
- The Notebook: Action, Crime, Drama

For k = 5:

- The Incredibles: Documentary, Adventure, History
- Interstellar: Documentary, Drama, Mystery
- The Notebook: Drama, Action, Crime

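As k increases, the predictions draw on more neighbors and pick up additional genres, but the top-ranked genres are not stable across values of k. In the runs above, The Incredibles drifts from Action, Adventure, Animation at k = 1 toward Documentary and History at k = 5. The Notebook (actual genres: Drama, Romance) moves the other way: Drama is missing at k = 1, appears third at k = 3, and ranks first at k = 5, while Romance is never predicted. Four of the five nearest neighbors of The Notebook are also tied at a Jaccard similarity of 0.142857, so which tied movies make the cut depends on sort order. Overall, a higher k appears to suit movies whose genres are diverse, and a lower k appears to suit movies with a focused set of genres.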
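
The comparison across k values can be sketched as a single function that is called once per k. This is a rough illustration on made-up data, not the code that produced the results above: `toy_movies`, `predict_genres`, and the `top_n` parameter are hypothetical names, and ties in similarity are broken by insertion order.

```python
from collections import Counter

# Toy data, not taken from the IMDb file: movie id -> cast and genres
toy_movies = {
    "m0": {"actors": {"a", "b", "c"}, "genres": ["Drama", "Romance"]},
    "m1": {"actors": {"a", "b"}, "genres": ["Action"]},
    "m2": {"actors": {"a", "d"}, "genres": ["Drama"]},
    "m3": {"actors": {"b", "e"}, "genres": ["Drama", "Crime"]},
    "m4": {"actors": {"f"}, "genres": ["Comedy"]},
}


def predict_genres(movies, target_id, k, top_n=3):
    """Rank other movies by Jaccard similarity of casts, then vote on genres."""
    target = movies[target_id]["actors"]
    scored = []
    for movie_id, movie in movies.items():
        if movie_id == target_id:
            continue
        union = target | movie["actors"]
        score = len(target & movie["actors"]) / len(union) if union else 0.0
        scored.append((score, movie_id))

    # Sort by similarity, highest first; tied movies keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Each of the k nearest movies casts one vote per genre it has
    votes = Counter()
    for _, movie_id in scored[:k]:
        votes.update(movies[movie_id]["genres"])
    return [genre for genre, _ in votes.most_common(top_n)]


for k in (1, 3, 5):
    print(k, predict_genres(toy_movies, "m0", k))
```

In this toy data the single nearest neighbor has a different genre from the target, so k = 1 misses Drama while larger values of k recover it, which is the same kind of sensitivity to k seen in the results above.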

## Use the top-k similar movies to infer rating

In [17]:
# TODO: Take the average rating from these top-k movies
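
One possible way to complete this step, sketched on made-up data: sort the similarity records, keep the top k, and average their ratings. The names `toy_similarities`, `toy_movie_map`, and `predict_rating` are hypothetical stand-ins that mirror the shape of `movie_similarities` and `movie_actor_map` from earlier cells, and an unweighted mean is an assumption (a similarity-weighted mean would be another option).

```python
# Hypothetical stand-ins for `movie_similarities` and `movie_actor_map`
toy_similarities = [
    {"movie": "m1", "jaccard": 0.50},
    {"movie": "m2", "jaccard": 0.25},
    {"movie": "m3", "jaccard": 0.10},
    {"movie": "m4", "jaccard": 0.00},
]
toy_movie_map = {
    "m1": {"rating": 7.0},
    "m2": {"rating": 6.0},
    "m3": {"rating": 8.0},
    "m4": {"rating": 2.0},
}


def predict_rating(similarities, movie_map, k):
    """Average the ratings of the k movies with the highest Jaccard score."""
    ranked = sorted(similarities, key=lambda row: row["jaccard"], reverse=True)
    ratings = [movie_map[row["movie"]]["rating"] for row in ranked[:k]]
    if not ratings:
        raise ValueError("need at least one neighbor to average")
    return sum(ratings) / len(ratings)


predicted = predict_rating(toy_similarities, toy_movie_map, k=3)
print(f"Predicted rating: {predicted:.2f}")
```

With the notebook's own variables, the equivalent call would pass `movie_similarities`, `movie_actor_map`, and `k_nn`.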