# Probabilities and k-Means Clustering

Using the IMDB data, we construct a feature matrix and apply k-Means to it to extract clusters of movies.

We then examine probabilities associated with the resulting clusters: the prior probability of each cluster, and the probability of a cluster given a genre or an actor.
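
As a rough illustration of the "feature matrix, then k-Means" step, here is a minimal sketch in plain Python. It assumes one 0/1 column per genre as the features and uses a basic Lloyd's-algorithm loop; the actual features and k-Means implementation behind `movie_to_cluster.csv` are not shown in this notebook, so `toy_movies`, `build_feature_matrix`, and `k_means` are hypothetical names used only for this example.

```python
import random

# Hypothetical toy input: movie id -> genre list (not the IMDB file itself).
toy_movies = {
    "m1": ["Action", "Sci-Fi"],
    "m2": ["Action", "Sci-Fi"],
    "m3": ["Drama", "Romance"],
    "m4": ["Drama", "Romance"],
}

def build_feature_matrix(movies):
    """Return (movie_ids, genre_names, rows) with one 0/1 column per genre."""
    genre_names = sorted({g for genres in movies.values() for g in genres})
    column = {g: i for i, g in enumerate(genre_names)}
    movie_ids = list(movies)
    rows = []
    for movie_id in movie_ids:
        row = [0.0] * len(genre_names)
        for g in movies[movie_id]:
            row[column[g]] = 1.0
        rows.append(row)
    return movie_ids, genre_names, rows

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(rows, k, seed=0, max_iter=100):
    """Basic Lloyd's algorithm; returns one cluster label per row."""
    rng = random.Random(seed)
    # Assumption: initialize centroids from k randomly chosen rows.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

movie_ids, genre_names, X = build_feature_matrix(toy_movies)
labels = k_means(X, k=2)
print(dict(zip(movie_ids, labels)))
```

Movies with identical feature rows always receive the same label here, but which rows end up as initial centroids depends on the seed, so the cluster ids themselves carry no meaning.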

In [8]:
import json

import pandas as pd
import numpy as np

In [9]:
actor_name_map = {}
movie_actor_map = {}
actor_genre_map = {}


with open("../data/imdb_movies_2000to2022.prolific.json", "r") as in_file:
    for line in in_file:

        # Read the movie on this line and parse its json
        this_movie = json.loads(line)

        for actor_id, actor_name in this_movie['actors']:
            # Add this actor to the id->name map
            actor_name_map[actor_id] = actor_name

            # Increment this actor's count for each of the movie's genres
            this_actors_genres = actor_genre_map.setdefault(actor_id, {})
            for g in this_movie["genres"]:
                this_actors_genres[g] = this_actors_genres.get(g, 0) + 1

        # Finished with this film
        movie_actor_map[this_movie["imdb_id"]] = {
            "movie": this_movie["title"],
            "actors": {actor_id for actor_id, _ in this_movie['actors']},
            "genres": this_movie["genres"]
        }

In [10]:
print("Known Actors:", len(actor_name_map))
print("Known Movies:", len(movie_actor_map))

Known Actors: 33609
Known Movies: 20620


## Read CSV of Movies to Cluster IDs

Using the provided movie-to-cluster mapping CSV file, we assess the distributions of movies per cluster and ask questions about genres and actors in each cluster.

In [11]:
cluster_df = pd.read_csv("movie_to_cluster.csv")
# cluster_df = pd.read_csv("actor_movie_clusters.csv", index_col="movie")

In [12]:
cluster_df

|  | movie_id | cluster |
|---|---|---|
| 0 | tt0035423 | 8 |
| 1 | tt0088751 | 12 |
| 2 | tt0096056 | 6 |
| 3 | tt0113092 | 3 |
| 4 | tt0116391 | 3 |
| ... | ... | ... |
| 20615 | tt9906278 | 10 |
| 20616 | tt9906644 | 13 |
| 20617 | tt9906844 | 10 |
| 20618 | tt9907032 | 10 |


In [13]:
cluster_df["cluster"].value_counts()

cluster
6     3177
0     3097
15    1754
13    1705
2     1503
12    1466
1     1376
3     1240
14     893
8      774
10     761
4      655
11     640
7      635
5      560
9      384
Name: count, dtype: int64

In [14]:
# Prior probability of each cluster: its share of all movies
cluster_prs = cluster_df["cluster"].value_counts() / cluster_df.shape[0]
cluster_pr_map = cluster_prs.to_dict()
cluster_prs

cluster
6     0.154074
0     0.150194
15    0.085063
13    0.082687
2     0.072890
12    0.071096
1     0.066731
3     0.060136
14    0.043307
8     0.037536
10    0.036906
4     0.031765
11    0.031038
7     0.030795
5     0.027158
9     0.018623
Name: count, dtype: float64

## Assess Genre-Specific Cluster Probabilities

For a new movie with a known genre, we want to determine the cluster to which it is most likely to be assigned.
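
Writing $g$ for the target genre and $c$ for a cluster (notation assumed here for illustration), the quantity we want follows from Bayes' rule:

$$
\Pr[c \mid g] = \frac{\Pr[g \mid c]\,\Pr[c]}{\Pr[g]},
\qquad
\Pr[g] = \sum_{c'} \Pr[g \mid c']\,\Pr[c'].
$$

The numerator is the joint probability $\Pr[g, c]$, and the most likely cluster is the $c$ that maximizes $\Pr[c \mid g]$.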

In [19]:
# For each genre, count the number of movies
genre_counts = {}

# For each movie, get its genres and update the genre count
for movie_id in movie_actor_map.keys():
    for genre in movie_actor_map[movie_id]["genres"]:
        genre_counts[genre] = genre_counts.get(genre, 0) + 1
        
# Convert counts to probabilities (each genre's share of all genre labels)
total_genre_labels = sum(genre_counts.values())
genre_prs = [(genre, g_count / total_genre_labels) for genre, g_count in genre_counts.items()]

genre_prs_df = pd.DataFrame(genre_prs, columns=["genre", "probability"])
genre_pr_map = dict(genre_prs)

genre_prs_df.sort_values(by="probability", ascending=False)

|  | genre | probability |
|---|---|---|
| 5 | Drama | 0.237964 |
| 0 | Comedy | 0.140969 |
| 10 | Thriller | 0.093808 |
| 6 | Action | 0.087693 |
| 3 | Horror | 0.07209 |
| 8 | Crime | 0.064967 |
| 2 | Romance | 0.059063 |
| 7 | Adventure | 0.039055 |
| 9 | Mystery | 0.035963 |
| 4 | Sci-Fi | 0.024717 |


In [23]:
target_genre = "Sci-Fi"

per_cluster_prs = []
for cluster_id,group in cluster_df.groupby("cluster"):

    # Number of movies in this cluster tagged with the target genre
    this_cluster_genre_count = sum(
        target_genre in movie_actor_map[m]["genres"]
        for m in group["movie_id"]
    )
    # Total number of genre labels across all movies in this cluster
    this_cluster_total_genre_count = sum(
        len(movie_actor_map[m]["genres"]) for m in group["movie_id"]
    )

    # Calculate conditional probability (the target genre's share of this cluster's genre labels)
    pr_genre_given_cluster = this_cluster_genre_count / this_cluster_total_genre_count
    print("Pr[%s| Cluster %02d]:" % (target_genre, cluster_id), "\t", pr_genre_given_cluster)
    
    # Calculate joint probability: Pr[genre | cluster] * Pr[cluster]
    joint_pr_genre_cluster = pr_genre_given_cluster * group.shape[0] / cluster_df.shape[0]
    print("Pr[%s, Cluster %02d]:" % (target_genre, cluster_id), "\t", joint_pr_genre_cluster)
    per_cluster_prs.append(joint_pr_genre_cluster)

Pr[Sci-Fi| Cluster 00]: 	 0.0386402027027027
Pr[Sci-Fi, Cluster 00]: 	 0.005803526080032506
Pr[Sci-Fi| Cluster 01]: 	 0.01156224791610648
Pr[Sci-Fi, Cluster 01]: 	 0.0007715641674375614
Pr[Sci-Fi| Cluster 02]: 	 0.057951482479784364
Pr[Sci-Fi, Cluster 02]: 	 0.004224106603642866
Pr[Sci-Fi| Cluster 03]: 	 0.007735792918774174
Pr[Sci-Fi, Cluster 03]: 	 0.00046519802227351967
Pr[Sci-Fi| Cluster 04]: 	 0.03105934553521908
Pr[Sci-Fi, Cluster 04]: 	 0.0009866086966813044
Pr[Sci-Fi| Cluster 05]: 	 0.0
Pr[Sci-Fi, Cluster 05]: 	 0.0
Pr[Sci-Fi| Cluster 06]: 	 0.015251055842327546
Pr[Sci-Fi, Cluster 06]: 	 0.0023497868288590985
Pr[Sci-Fi| Cluster 07]: 	 0.0
Pr[Sci-Fi, Cluster 07]: 	 0.0
Pr[Sci-Fi| Cluster 08]: 	 0.005548705302096177
Pr[Sci-Fi, Cluster 08]: 	 0.00020827826885656844
Pr[Sci-Fi| Cluster 09]: 	 0.021062271062271064
Pr[Sci-Fi, Cluster 09]: 	 0.00039223627972415553
Pr[Sci-Fi| Cluster 10]: 	 0.0008539709649871904
Pr[Sci-Fi, Cluster 10]: 	 3.151658120054568e-05
Pr[Sci-Fi| Cluster 11]: 	 0

In [24]:
pr_target_genre = sum(per_cluster_prs)
print("Probability of Target Genre:", pr_target_genre)

Probability of Target Genre: 0.026453157873263466


In [25]:
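for cluster_id,cluster_genre_pr in enumerate(per_cluster_prs):

    # Bayes' rule: Pr[cluster | genre] = Pr[genre, cluster] / Pr[genre].
    # genre_pr_map holds the genre's share of all genre labels (about 0.0247 for Sci-Fi),
    # which differs slightly from pr_target_genre computed above (about 0.0265),
    # so these values sum to slightly more than 1; dividing by pr_target_genre
    # instead would make them sum to exactly 1.
    pr_cluster_given_genre = cluster_genre_pr / genre_pr_map[target_genre]

    print("Pr[Cluster %02d | %s]:" % (cluster_id, target_genre), "\t", pr_cluster_given_genre)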
    

Pr[Cluster 00 | Sci-Fi]: 	 0.23479801296116345
Pr[Cluster 01 | Sci-Fi]: 	 0.031215804131504675
Pr[Cluster 02 | Sci-Fi]: 	 0.17089814423060515
Pr[Cluster 03 | Sci-Fi]: 	 0.018820897805403452
Pr[Cluster 04 | Sci-Fi]: 	 0.0399160369672494
Pr[Cluster 05 | Sci-Fi]: 	 0.0
Pr[Cluster 06 | Sci-Fi]: 	 0.0950672523376236
Pr[Cluster 07 | Sci-Fi]: 	 0.0
Pr[Cluster 08 | Sci-Fi]: 	 0.008426484691568635
Pr[Cluster 09 | Sci-Fi]: 	 0.01586902476537074
Pr[Cluster 10 | Sci-Fi]: 	 0.0012750921662397074
Pr[Cluster 11 | Sci-Fi]: 	 0.04225177756594118
Pr[Cluster 12 | Sci-Fi]: 	 0.23840134570964586
Pr[Cluster 13 | Sci-Fi]: 	 0.011066762144978273
Pr[Cluster 14 | Sci-Fi]: 	 0.02567217231064506
Pr[Cluster 15 | Sci-Fi]: 	 0.1365582893727287
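
The full chain (conditional, joint, marginal, posterior) can be sketched on a tiny hand-made dataset. This is only an illustration under stated assumptions: `toy_movies` and `cluster_posterior` are hypothetical, and the conditional here is computed per movie (movies in the cluster with the genre, divided by movies in the cluster) rather than per genre label as in the cell above. Normalizing by the marginal obtained from the joint probabilities makes the result sum to 1.

```python
from collections import Counter

# Hypothetical toy data: movie id -> (cluster id, genres). Not from the IMDB file.
toy_movies = {
    "m1": (0, ["Sci-Fi", "Action"]),
    "m2": (0, ["Drama"]),
    "m3": (1, ["Sci-Fi"]),
    "m4": (1, ["Sci-Fi", "Thriller"]),
    "m5": (2, ["Comedy"]),
}

def cluster_posterior(movies, target_genre):
    """Return {cluster: Pr[cluster | genre]} using Bayes' rule."""
    n_movies = len(movies)
    cluster_sizes = Counter(c for c, _ in movies.values())
    hits = Counter(c for c, genres in movies.values() if target_genre in genres)

    joint = {}
    for c, size in cluster_sizes.items():
        pr_genre_given_cluster = hits[c] / size  # movie-level conditional (assumption)
        pr_cluster = size / n_movies             # prior
        joint[c] = pr_genre_given_cluster * pr_cluster

    marginal = sum(joint.values())               # Pr[genre]
    if marginal == 0:
        raise ValueError("target genre does not occur in any cluster")
    return {c: j / marginal for c, j in joint.items()}

posterior = cluster_posterior(toy_movies, "Sci-Fi")
for c in sorted(posterior):
    print("Pr[Cluster %02d | Sci-Fi]:" % c, posterior[c])
```

In this toy data, two of the three Sci-Fi movies sit in cluster 1, so that cluster receives two thirds of the posterior mass.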


### Sample Titles in Each Cluster

We can use the above conditional probabilities to determine the most likely cluster given a movie genre.

Here, we sample movies from one of the most likely clusters (cluster 0 here) to get a sense of what kinds of movies it contains.
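
Picking the most likely cluster is an arg-max over the conditional probabilities. The sketch below copies the Sci-Fi values printed above, rounded to four decimals, into a dictionary; the name `pr_cluster_given_scifi` is introduced only for this example.

```python
# Values copied (rounded) from the Pr[Cluster | Sci-Fi] output above.
pr_cluster_given_scifi = {
    0: 0.2348, 1: 0.0312, 2: 0.1709, 3: 0.0188,
    4: 0.0399, 5: 0.0, 6: 0.0951, 7: 0.0,
    8: 0.0084, 9: 0.0159, 10: 0.0013, 11: 0.0423,
    12: 0.2384, 13: 0.0111, 14: 0.0257, 15: 0.1366,
}

# Rank clusters from most to least likely for the target genre.
ranked = sorted(pr_cluster_given_scifi.items(), key=lambda kv: kv[1], reverse=True)
best_cluster = ranked[0][0]
top3 = ranked[:3]

print("Most likely cluster:", best_cluster)
print("Top 3:", top3)
```

Clusters 12 and 0 are nearly tied at the top, so sampling titles from either one gives a reasonable picture of where Sci-Fi movies tend to land.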

In [21]:
target_cluster = 0

In [22]:
for movie_id in cluster_df[cluster_df["cluster"] == target_cluster].sample(n=10, replace=False)["movie_id"]:
    this_movie = movie_actor_map[movie_id]
    print(movie_id, this_movie["movie"], this_movie["genres"])

tt13063384 Burial ['Thriller', 'War']
tt0242264 American Saint ['']
tt4151198 Give Till It Hurts ['Comedy', 'Drama']
tt0454841 The Hills Have Eyes ['Horror', 'Thriller']
tt3236976 Submerged ['Action', 'Thriller']
tt1220628 I Hope They Serve Beer in Hell ['Comedy']
tt7743512 Shakespeare Monologues ['Comedy']
tt0199725 Love & Basketball ['Drama', 'Romance', 'Sport']
tt0282543 Happy Hour ['Comedy', 'Drama']
tt0159784 Takedown ['Biography', 'Crime', 'Drama']


## Assess Actor-Specific Cluster Probabilities

Above, we determined the most likely cluster given a movie genre. Here, we ask the same question for a given actor.

In [21]:
# Set the target actor whose cluster probabilities we will compute
# target_actor_id = 'nm1165110' # Chris Hemsworth
# target_actor_id = 'nm0413168' # Hugh Jackman
# target_actor_id = 'nm0005351' # Ryan Reynolds
# target_actor_id = "nm0000206" # Keanu Reeves
target_actor_id = 'nm0000115' # Nicolas Cage

In [22]:
per_cluster_prs = []
for cluster_id,group in cluster_df.groupby("cluster"):

    # Number of movies in this cluster that feature the target actor
    this_cluster_actor_count = sum(
        target_actor_id in movie_actor_map[m]["actors"]
        for m in group["movie_id"]
    )
    
    # Calculate conditional probability
    pr_actor_given_cluster = this_cluster_actor_count / group.shape[0]
    print("Pr[%s| Cluster %02d]:" % (target_actor_id, cluster_id), "\t", pr_actor_given_cluster)
    
    # Calculate joint probability: Pr[actor | cluster] * Pr[cluster]
    joint_pr_actor_cluster = pr_actor_given_cluster * group.shape[0] / cluster_df.shape[0]
    print("Pr[%s, Cluster %02d]:" % (target_actor_id, cluster_id), "\t", joint_pr_actor_cluster)
    per_cluster_prs.append(joint_pr_actor_cluster)

Pr[nm0000115| Cluster 00]: 	 0.0
Pr[nm0000115, Cluster 00]: 	 0.0
Pr[nm0000115| Cluster 01]: 	 0.0
Pr[nm0000115, Cluster 01]: 	 0.0
Pr[nm0000115| Cluster 02]: 	 0.0
Pr[nm0000115, Cluster 02]: 	 0.0
Pr[nm0000115| Cluster 03]: 	 0.0
Pr[nm0000115, Cluster 03]: 	 0.0
Pr[nm0000115| Cluster 04]: 	 0.0
Pr[nm0000115, Cluster 04]: 	 0.0
Pr[nm0000115| Cluster 05]: 	 0.0
Pr[nm0000115, Cluster 05]: 	 0.0
Pr[nm0000115| Cluster 06]: 	 0.0
Pr[nm0000115, Cluster 06]: 	 0.0
Pr[nm0000115| Cluster 07]: 	 0.0
Pr[nm0000115, Cluster 07]: 	 0.0
Pr[nm0000115| Cluster 08]: 	 0.0
Pr[nm0000115, Cluster 08]: 	 0.0
Pr[nm0000115| Cluster 09]: 	 0.0
Pr[nm0000115, Cluster 09]: 	 0.0
Pr[nm0000115| Cluster 10]: 	 0.0
Pr[nm0000115, Cluster 10]: 	 0.0
Pr[nm0000115| Cluster 11]: 	 0.0
Pr[nm0000115, Cluster 11]: 	 0.0
Pr[nm0000115| Cluster 12]: 	 0.2033898305084746
Pr[nm0000115, Cluster 12]: 	 0.002909796314258002
Pr[nm0000115| Cluster 13]: 	 0.0
Pr[nm0000115, Cluster 13]: 	 0.0
Pr[nm0000115| Cluster 14]: 	 0.0013333333333

In [23]:
pr_target_actor = sum(per_cluster_prs)
print("Probability of Target Actor:", pr_target_actor)

Probability of Target Actor: 0.0029582929194956354


In [24]:
for cluster_id,cluster_actor_pr in enumerate(per_cluster_prs):

    pr_cluster_given_actor = cluster_actor_pr / pr_target_actor

    print("Pr[Cluster %02d | %s]:" % (cluster_id, target_actor_id), "\t", pr_cluster_given_actor)
    

Pr[Cluster 00 | nm0000115]: 	 0.0
Pr[Cluster 01 | nm0000115]: 	 0.0
Pr[Cluster 02 | nm0000115]: 	 0.0
Pr[Cluster 03 | nm0000115]: 	 0.0
Pr[Cluster 04 | nm0000115]: 	 0.0
Pr[Cluster 05 | nm0000115]: 	 0.0
Pr[Cluster 06 | nm0000115]: 	 0.0
Pr[Cluster 07 | nm0000115]: 	 0.0
Pr[Cluster 08 | nm0000115]: 	 0.0
Pr[Cluster 09 | nm0000115]: 	 0.0
Pr[Cluster 10 | nm0000115]: 	 0.0
Pr[Cluster 11 | nm0000115]: 	 0.0
Pr[Cluster 12 | nm0000115]: 	 0.9836065573770492
Pr[Cluster 13 | nm0000115]: 	 0.0
Pr[Cluster 14 | nm0000115]: 	 0.01639344262295082
Pr[Cluster 15 | nm0000115]: 	 0.0
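
Because Pr[actor | cluster] * Pr[cluster] reduces to (the actor's movies in the cluster) / (all movies), the total movie count cancels in the final division, and Pr[cluster | actor] is simply the share of the actor's movies that fall in each cluster. The printed values agree with this: 0.9836... equals 60/61 and 0.0163... equals 1/61. A minimal sketch of that shortcut, using hypothetical toy data and a hypothetical helper `cluster_given_actor`:

```python
from collections import Counter

# Hypothetical toy data (not from the IMDB file).
toy_clusters = {"m1": 12, "m2": 12, "m3": 14, "m4": 3}  # movie id -> cluster id
toy_casts = {                                           # movie id -> set of actor ids
    "m1": {"a1", "a2"},
    "m2": {"a1"},
    "m3": {"a1"},
    "m4": {"a2"},
}

def cluster_given_actor(clusters, casts, actor_id):
    """Return {cluster: Pr[cluster | actor]} by counting the actor's movies per cluster."""
    counts = Counter(clusters[m] for m, cast in casts.items() if actor_id in cast)
    total = sum(counts.values())
    if total == 0:
        return {}  # the actor appears in no clustered movie
    return {c: n / total for c, n in counts.items()}

posterior = cluster_given_actor(toy_clusters, toy_casts, "a1")
for c in sorted(posterior):
    print("Pr[Cluster %02d | a1]:" % c, posterior[c])
```

Clusters in which the actor never appears are simply absent from the result, which corresponds to the 0.0 rows in the output above.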


In [28]:
target_cluster = 12

In [30]:
for movie_id in cluster_df[cluster_df["cluster"] == target_cluster].sample(n=10, replace=False)["movie_id"]:
    this_movie = movie_actor_map[movie_id]
    print(movie_id, this_movie["movie"], this_movie["genres"])

tt3460252 The Hateful Eight ['Crime', 'Drama', 'Mystery']
tt5462326 Mom and Dad ['Comedy', 'Horror', 'Thriller']
tt1252507 The Way Home ['Drama', 'Family']
tt10328018 A Child of the King ['Biography', 'Drama']
tt3481634 Inconceivable ['Drama', 'Thriller']
tt1843866 Captain America: The Winter Soldier ['Action', 'Adventure', 'Sci-Fi']
tt6143850 Distorted ['Crime', 'Mystery', 'Thriller']
tt1227182 Subject: I Love You ['Drama', 'Romance', 'Thriller']
tt1219289 Limitless ['Sci-Fi', 'Thriller']
tt1860353 Turbo ['Adventure', 'Animation', 'Comedy']
