# Predict Movie Genre and Rating in Test/Train Data

Using the k-nearest neighbors (k-NN) method, predict a target movie's genre and rating from the k movies most similar to it.

Use Jaccard similarity based on actors in each movie to rank movies and select the top-k most similar movies.
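For two actor sets A and B, Jaccard similarity is the size of their intersection divided by the size of their union, so it ranges from 0 (no shared actors) to 1 (identical casts). A minimal sketch follows, assuming a hypothetical helper named `jaccard_similarity` and made-up actor IDs; neither comes from the dataset or the cells below.

```python
def jaccard_similarity(actors_a, actors_b):
    """Return |A ∩ B| / |A ∪ B| for two sets; 0.0 when both sets are empty."""
    union_size = len(actors_a | actors_b)
    if union_size == 0:
        return 0.0
    return len(actors_a & actors_b) / union_size


# Made-up actor IDs, for illustration only
cast_a = {"nm001", "nm002", "nm003"}
cast_b = {"nm002", "nm003", "nm004"}

# Two shared actors out of four distinct actors
similarity = jaccard_similarity(cast_a, cast_b)
print(similarity)
```

Returning 0.0 for two empty casts is a design choice made here to avoid dividing by zero; other conventions are possible.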

In [None]:
import json

import pandas as pd
import numpy as np

from scipy.sparse import lil_matrix

import matplotlib.pyplot as plt


In [None]:
actor_name_map = {}
movie_actor_map = {}
actor_genre_map = {}


with open("../data/imdb_movies_2000to2022.prolific.json", "r") as in_file:
    for line in in_file:
        
        # Read the movie on this line and parse its json
        this_movie = json.loads(line)
        
        # Skip movies with no ratings
        if len(this_movie["rating"]) == 0:
            continue
                    
        # Record each actor's name and count this movie's genres for that actor
        for actor_id,actor_name in this_movie['actors']:
            actor_name_map[actor_id] = actor_name

            # Increment the per-genre counts for this actor
            this_actors_genres = actor_genre_map.setdefault(actor_id, {})
            for g in this_movie["genres"]:
                this_actors_genres[g] = this_actors_genres.get(g, 0) + 1

        # Finished with this film
        movie_actor_map[this_movie["imdb_id"]] = {
            "movie": this_movie["title"],
            "actors": {item[0] for item in this_movie['actors']},
            "genres": this_movie["genres"],
            "rating": this_movie["rating"]["avg"],
        }

In [None]:
print("Known Actors:", len(actor_name_map))
print("Known Movies:", len(movie_actor_map))

In [None]:
actor_id_to_index = {actor_id:i for i,actor_id in enumerate(actor_name_map.keys())}


## Split Data into Training/Testing Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
all_movie_ids = np.array(list(movie_actor_map.keys()))
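# Hold out 10% of movies for testing; random_state makes the split reproducible (42 is an arbitrary seed)
training_ids, testing_ids = train_test_split(all_movie_ids, test_size=0.1, random_state=42)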

In [None]:
training_ids.shape, testing_ids.shape

## For Each Movie in Test Set, Use Jaccard Similarity by Actor to Predict Genre and Rating

For each movie in the test set, we find the k most similar movies **in the training set** and use them to predict the test movie's genre and rating.

We then compare the predictions with the actual values in `movie_actor_map` to measure how accurate they are.
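The procedure can be sketched on a tiny, made-up training set. This is an illustration under stated assumptions rather than the notebook's solution: `toy_training` and `predict_from_neighbors` are hypothetical names, the movie IDs, actors, and ratings are invented, and ties in similarity are broken arbitrarily by movie ID.

```python
from collections import Counter

# Hypothetical toy training data: movie id -> actors, genres, rating
toy_training = {
    "tt1": {"actors": {"a", "b"}, "genres": ["Drama"], "rating": 7.0},
    "tt2": {"actors": {"a", "c"}, "genres": ["Drama", "Crime"], "rating": 8.0},
    "tt3": {"actors": {"x", "y"}, "genres": ["Comedy"], "rating": 5.0},
}


def predict_from_neighbors(target_actors, training, k):
    """Predict (genre, rating) from the k training movies with the highest Jaccard similarity."""
    scored = []
    for movie_id, movie in training.items():
        union_size = len(target_actors | movie["actors"])
        score = len(target_actors & movie["actors"]) / union_size if union_size else 0.0
        scored.append((score, movie_id))

    # Keep the ids of the k highest-scoring movies
    neighbors = [movie_id for _, movie_id in sorted(scored, reverse=True)[:k]]

    # Most frequent genre among the neighbors (assumes at least one neighbor lists a genre)
    genre_counts = Counter(g for m in neighbors for g in training[m]["genres"])
    predicted_genre = genre_counts.most_common(1)[0][0]

    # Unweighted mean of the neighbors' ratings
    predicted_rating = sum(training[m]["rating"] for m in neighbors) / len(neighbors)
    return predicted_genre, predicted_rating


toy_genre, toy_rating = predict_from_neighbors({"a", "b", "c"}, toy_training, k=2)
print(toy_genre, toy_rating)
```

Here the two nearest neighbors are `tt1` and `tt2`, each sharing two of three actors with the target, so the sketch predicts "Drama" with a rating of 7.5.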

In [None]:
k_nn = 10

In [None]:
prediction_rows = []

for test_movie_id in testing_ids:

    # This test movie
    target_movie_object = movie_actor_map[test_movie_id]
    
    # List of similar movies
    test_movie_similarities = []

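    # Iterate through every movie in the training set
    target_actors = target_movie_object["actors"]
    for this_movie_id in training_ids:
        # One possible completion: Jaccard similarity |A & B| / |A | B| over actor sets
        this_actors = movie_actor_map[this_movie_id]["actors"]
        union_size = len(target_actors | this_actors)
        jaccard = len(target_actors & this_actors) / union_size if union_size else 0.0
        test_movie_similarities.append((this_movie_id, jaccard))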

    
    similarity_df = pd.DataFrame(test_movie_similarities, columns=["movie", "jaccard"])
    
    # Get the top k movies
    top_movies = similarity_df.sort_values(by="jaccard", ascending=False).head(k_nn)

    # iterate through the top k movies and predict genre and rating
    relevant_ratings = []
    relevant_genre_count = {}
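    for this_movie_id in top_movies["movie"]:
        neighbor = movie_actor_map[this_movie_id]

        # Add this neighbor's rating
        relevant_ratings.append(neighbor["rating"])

        # Track genres: count each genre listed for this neighbor
        for g in neighbor["genres"]:
            relevant_genre_count[g] = relevant_genre_count.get(g, 0) + 1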
            
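    # Get the most common genre from the neighbors (None if no neighbor lists a genre)
    predicted_genre = max(relevant_genre_count, key=relevant_genre_count.get, default=None)
    
    # Take the average rating of k neighbors for prediction
    predicted_rating = float(np.mean(relevant_ratings))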
    
    # Track predictions
    prediction_rows.append({
        "imdb_id": test_movie_id,
        "genre": predicted_genre,
        "rating": predicted_rating,
    })


In [None]:
predicted_df = pd.DataFrame(prediction_rows).set_index("imdb_id")
predicted_df

## Measure Accuracy of Predictions
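Two measures are used here: genre accuracy, the fraction of test movies whose predicted genre appears among the movie's actual genres, and mean squared error (MSE) for ratings, the average of the squared differences between predicted and actual ratings. A small sketch with made-up predictions (the values and variable names are illustrative only):

```python
# Made-up (predicted genre, predicted rating) pairs and ground truth, for illustration only
predicted = [("Drama", 7.0), ("Comedy", 6.0)]
actual = [(["Drama", "Crime"], 8.0), (["Horror"], 6.0)]

# A genre prediction counts as correct when it appears in the actual genre list
hits = [1 if p_genre in a_genres else 0
        for (p_genre, _), (a_genres, _) in zip(predicted, actual)]
genre_accuracy = sum(hits) / len(hits)

# Mean squared error: average of the squared rating differences
squared_errors = [(p_rating - a_rating) ** 2
                  for (_, p_rating), (_, a_rating) in zip(predicted, actual)]
rating_mse = sum(squared_errors) / len(squared_errors)

print(genre_accuracy, rating_mse)
```

In this toy case one of two genres is correct and the squared errors are 1.0 and 0.0, so both measures come out to 0.5.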

In [None]:
correct_genres = []

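for movie_id,row in predicted_df.iterrows():

    # One possible completion: score 1 when the predicted genre
    # is in the set of actual genres, otherwise 0
    actual_genres = movie_actor_map[movie_id]["genres"]
    correct_genres.append(1 if row["genre"] in actual_genres else 0)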
    
accuracy = sum(correct_genres) / len(correct_genres)
print("Genre Accuracy:", accuracy)

In [None]:
rating_mses = []

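for movie_id,row in predicted_df.iterrows():

    # One possible completion: squared error between the predicted
    # and actual rating; the mean is taken after the loop
    actual_rating = movie_actor_map[movie_id]["rating"]
    rating_mses.append((row["rating"] - actual_rating) ** 2)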
    
mse = sum(rating_mses) / len(rating_mses)
print("Rating Mean Squared Error:", mse)