# Probabilities and k-Means Clustering

Using the IMDB data, construct a feature matrix, and apply `k-Means` to the data to extract clusters. 

We then inspect various aspects of probability associated with these clusterings.

In [1]:
import json

import pandas as pd
import numpy as np

In [2]:
actor_name_map = {}
movie_actor_map = {}
actor_genre_map = {}


with open("../data/imdb_movies_2000to2022.prolific.json", "r") as in_file:
    for line in in_file:
        
        # Read the movie on this line and parse its json
        this_movie = json.loads(line)
                    
        # Add all actors to the id->name map
        for actor_id,actor_name in this_movie['actors']:
            actor_name_map[actor_id] = actor_name
            
        # For each actor, add this movie's genres to that actor's list
        for actor_id,actor_name in this_movie['actors']:
            this_actors_genres = actor_genre_map.get(actor_id, {})
            
            # Increment the count of genres for this actor
            for g in this_movie["genres"]:
                this_actors_genres[g] = this_actors_genres.get(g, 0) + 1
                
            # Update the map
            actor_genre_map[actor_id] = this_actors_genres
            
        movie_rating = this_movie["rating"]
        if len(movie_rating) == 0:
            movie_rating = {"avg": 0, "votes": 0}
        
        # Finished with this film
        movie_actor_map[this_movie["imdb_id"]] = ({
            "movie": this_movie["title"],
            "actors": set([item[0] for item in this_movie['actors']]),
            "genres": this_movie["genres"],
            "rating": movie_rating["avg"],
            "raters": movie_rating["votes"],
        })

In [3]:
print("Known Actors:", len(actor_name_map))
print("Known Movies:", len(movie_actor_map))

Known Actors: 33609
Known Movies: 20620


## Read CSV of Movies to Cluster IDs

Using the provided movie-to-cluster mapping CSV file, we assess the distributions of movies per cluster and ask questions about genres and actors in each cluster.

In [4]:
cluster_df = pd.read_csv("movie_to_cluster_rating.csv")

In [5]:
cluster_df

Unnamed: 0,movie_id,cluster,rating,raters
0,tt0035423,8,6.4,85923
1,tt0088751,12,5.3,328
2,tt0096056,6,5.6,830
3,tt0113092,3,3.4,829
4,tt0116391,3,6.2,257
...,...,...,...,...
20615,tt9906278,10,0.0,0
20616,tt9906644,13,6.8,835
20617,tt9906844,10,0.0,0
20618,tt9907032,10,0.0,0


In [6]:
cluster_df.groupby("cluster").mean()

Unnamed: 0_level_0,rating,raters
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
0,4.552502,24081.608008
1,5.502326,26345.518169
2,4.896075,83537.664005
3,5.515968,50720.176613
4,5.425038,39424.616794
5,6.464643,54953.271429
6,5.44073,11503.330186
7,6.048661,29263.581102
8,5.481137,24112.288114
9,4.876823,35051.484375


In [7]:
cluster_df["cluster"].value_counts()

6     3177
0     3097
15    1754
13    1705
2     1503
12    1466
1     1376
3     1240
14     893
8      774
10     761
4      655
11     640
7      635
5      560
9      384
Name: cluster, dtype: int64

In [8]:
cluster_pr_map = {cluster_id:cluster_pr for cluster_id,cluster_pr in (cluster_df["cluster"].value_counts() / cluster_df.shape[0]).items()}
cluster_df["cluster"].value_counts() / cluster_df.shape[0]

6     0.154074
0     0.150194
15    0.085063
13    0.082687
2     0.072890
12    0.071096
1     0.066731
3     0.060136
14    0.043307
8     0.037536
10    0.036906
4     0.031765
11    0.031038
7     0.030795
5     0.027158
9     0.018623
Name: cluster, dtype: float64

## Assess Genre-Specific Cluster Probabilities

We want to determine, for a new movie with a known genre, to which cluster is it most likely to be assigned?

In [9]:
# For each genre, count the number of movies
genre_counts = {}

# For each movie, get its genres and update the genre count
for movie_id in movie_actor_map.keys():
    for genre in movie_actor_map[movie_id]["genres"]:
        genre_counts[genre] = genre_counts.get(genre, 0) + 1
        
genre_prs = []
for genre,g_count in genre_counts.items():
    genre_prs.append((genre, g_count/len(movie_actor_map)))
    
genre_prs_df = pd.DataFrame(genre_prs, columns=["genre", "probability"])
genre_pr_map = {row["genre"]:row["probability"] for idx,row in genre_prs_df.iterrows()}

genre_prs_df.sort_values(by="probability", ascending=False)

Unnamed: 0,genre,probability
5,Drama,0.49258
0,Comedy,0.291804
10,Thriller,0.19418
6,Action,0.181523
3,Horror,0.149224
8,Crime,0.134481
2,Romance,0.12226
7,Adventure,0.080844
9,Mystery,0.074442
4,Sci-Fi,0.051164


In [28]:
target_genre = "Sci-Fi"

per_cluster_prs = []
for cluster_id,group in cluster_df.groupby("cluster"):

    this_cluster_genre_count = sum([
        1 if target_genre in movie_actor_map[m]["genres"] else 0 
        for m in group["movie_id"]
    ])
    
    # Calculate conditional probability
    pr_genre_given_cluster = this_cluster_genre_count / group.shape[0]
    print("Pr[%s| Cluster %02d]:" % (target_genre, cluster_id), "\t", pr_genre_given_cluster)
    
    # Calculate joint probability
    joint_pr_genre_cluster = pr_genre_given_cluster * group.shape[0] / cluster_df.shape[0]
    print("Pr[%s, Cluster %02d]:" % (target_genre, cluster_id), "\t", joint_pr_genre_cluster)
    per_cluster_prs.append(joint_pr_genre_cluster)

Pr[Sci-Fi| Cluster 00]: 	 0.05908944139489829
Pr[Sci-Fi, Cluster 00]: 	 0.008874878758486906
Pr[Sci-Fi| Cluster 01]: 	 0.03125
Pr[Sci-Fi, Cluster 01]: 	 0.002085354025218235
Pr[Sci-Fi| Cluster 02]: 	 0.1430472388556221
Pr[Sci-Fi, Cluster 02]: 	 0.010426770126091174
Pr[Sci-Fi| Cluster 03]: 	 0.020967741935483872
Pr[Sci-Fi, Cluster 03]: 	 0.0012609117361784678
Pr[Sci-Fi| Cluster 04]: 	 0.08549618320610687
Pr[Sci-Fi, Cluster 04]: 	 0.0027158098933074684
Pr[Sci-Fi| Cluster 05]: 	 0.0
Pr[Sci-Fi, Cluster 05]: 	 0.0
Pr[Sci-Fi| Cluster 06]: 	 0.02045955303745672
Pr[Sci-Fi, Cluster 06]: 	 0.0031522793404461687
Pr[Sci-Fi| Cluster 07]: 	 0.0
Pr[Sci-Fi, Cluster 07]: 	 0.0
Pr[Sci-Fi| Cluster 08]: 	 0.011627906976744186
Pr[Sci-Fi, Cluster 08]: 	 0.0004364694471387003
Pr[Sci-Fi| Cluster 09]: 	 0.059895833333333336
Pr[Sci-Fi, Cluster 09]: 	 0.0011154219204655674
Pr[Sci-Fi| Cluster 10]: 	 0.001314060446780552
Pr[Sci-Fi, Cluster 10]: 	 4.8496605237633365e-05
Pr[Sci-Fi| Cluster 11]: 	 0.078125
Pr[Sci-Fi,

In [29]:
pr_target_genre = sum(per_cluster_prs)
print("Probability of Target Genre:", pr_target_genre)

Probability of Target Genre: 0.051163918525703206


In [33]:
cluster_posterior_prs = []
for cluster_id,cluster_genre_pr in enumerate(per_cluster_prs):

    pr_cluster_given_genre = cluster_genre_pr / genre_pr_map[target_genre]
    cluster_posterior_prs.append(pr_cluster_given_genre)

    print("Pr[Cluster %02d | %s]:" % (cluster_id, target_genre), "\t", pr_cluster_given_genre)
    

Pr[Cluster 00 | Sci-Fi]: 	 0.17345971563981044
Pr[Cluster 01 | Sci-Fi]: 	 0.040758293838862564
Pr[Cluster 02 | Sci-Fi]: 	 0.20379146919431282
Pr[Cluster 03 | Sci-Fi]: 	 0.02464454976303318
Pr[Cluster 04 | Sci-Fi]: 	 0.05308056872037915
Pr[Cluster 05 | Sci-Fi]: 	 0.0
Pr[Cluster 06 | Sci-Fi]: 	 0.061611374407582936
Pr[Cluster 07 | Sci-Fi]: 	 0.0
Pr[Cluster 08 | Sci-Fi]: 	 0.008530805687203791
Pr[Cluster 09 | Sci-Fi]: 	 0.021800947867298578
Pr[Cluster 10 | Sci-Fi]: 	 0.0009478672985781991
Pr[Cluster 11 | Sci-Fi]: 	 0.047393364928909956
Pr[Cluster 12 | Sci-Fi]: 	 0.18862559241706164
Pr[Cluster 13 | Sci-Fi]: 	 0.013270142180094787
Pr[Cluster 14 | Sci-Fi]: 	 0.03033175355450237
Pr[Cluster 15 | Sci-Fi]: 	 0.13175355450236967


In [35]:
pr_cluster_given_genre

0.13175355450236967

In [39]:
poster_cluster_prs_df = pd.DataFrame(cluster_posterior_prs, columns=['posterior_cluster_pr'])
poster_cluster_prs_df["cluster"] = poster_cluster_prs_df.index

poster_cluster_prs_df

Unnamed: 0,posterior_cluster_pr,cluster
0,0.17346,0
1,0.040758,1
2,0.203791,2
3,0.024645,3
4,0.053081,4
5,0.0,5
6,0.061611,6
7,0.0,7
8,0.008531,8
9,0.021801,9


In [51]:
cluster_df.groupby("cluster").mean()

Unnamed: 0_level_0,rating,raters
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1
0,4.552502,24081.608008
1,5.502326,26345.518169
2,4.896075,83537.664005
3,5.515968,50720.176613
4,5.425038,39424.616794
5,6.464643,54953.271429
6,5.44073,11503.330186
7,6.048661,29263.581102
8,5.481137,24112.288114
9,4.876823,35051.484375


In [43]:
joined_df = poster_cluster_prs_df.set_index("cluster").join(cluster_df.groupby("cluster").mean())
joined_df

Unnamed: 0_level_0,posterior_cluster_pr,rating,raters
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.17346,4.552502,24081.608008
1,0.040758,5.502326,26345.518169
2,0.203791,4.896075,83537.664005
3,0.024645,5.515968,50720.176613
4,0.053081,5.425038,39424.616794
5,0.0,6.464643,54953.271429
6,0.061611,5.44073,11503.330186
7,0.0,6.048661,29263.581102
8,0.008531,5.481137,24112.288114
9,0.021801,4.876823,35051.484375


In [52]:
joined_df["posterior_cluster_pr"] * joined_df["rating"]

cluster
0     0.789676
1     0.224265
2     0.997778
3     0.135939
4     0.287964
5     0.000000
6     0.335211
7     0.000000
8     0.046759
9     0.106319
10    0.005305
11    0.207257
12    0.771113
13    0.077831
14    0.183077
15    0.655846
dtype: float64

In [45]:
sum(joined_df["posterior_cluster_pr"] * joined_df["rating"])

4.824340296971503

In [49]:
np.mean([m["rating"] for m in movie_actor_map.values() if target_genre in m["genres"]])

4.778862559241706

## Most Probable Cluster and Joint Probability

In [53]:
target_genre = "Drama"
target_actor = "nm0000120" # Jim Carrey

per_cluster_prs = []
for cluster_id,group in cluster_df.groupby("cluster"):

    this_cluster_genre_count = sum([
        1 if (target_genre in movie_actor_map[m]["genres"] and target_actor in movie_actor_map[m]["actors"]) else 0 
        for m in group["movie_id"]
    ])
    
    # Calculate conditional probability
    pr_genre_actor_given_cluster = this_cluster_genre_count / group.shape[0]
    print("Pr[%s, %s| Cluster %02d]:" % (target_genre, target_actor, cluster_id), "\t", pr_genre_actor_given_cluster)
    
    # Calculate joint probability
    joint_pr_genre_actor_cluster = pr_genre_actor_given_cluster * group.shape[0] / cluster_df.shape[0]
    print("Pr[%s, %s, Cluster %02d]:" % (target_genre, target_actor, cluster_id), "\t", joint_pr_genre_actor_cluster)
    per_cluster_prs.append(joint_pr_genre_actor_cluster)

Pr[Drama, nm0000120| Cluster 00]: 	 0.0
Pr[Drama, nm0000120, Cluster 00]: 	 0.0
Pr[Drama, nm0000120| Cluster 01]: 	 0.0
Pr[Drama, nm0000120, Cluster 01]: 	 0.0
Pr[Drama, nm0000120| Cluster 02]: 	 0.0
Pr[Drama, nm0000120, Cluster 02]: 	 0.0
Pr[Drama, nm0000120| Cluster 03]: 	 0.0
Pr[Drama, nm0000120, Cluster 03]: 	 0.0
Pr[Drama, nm0000120| Cluster 04]: 	 0.0
Pr[Drama, nm0000120, Cluster 04]: 	 0.0
Pr[Drama, nm0000120| Cluster 05]: 	 0.0
Pr[Drama, nm0000120, Cluster 05]: 	 0.0
Pr[Drama, nm0000120| Cluster 06]: 	 0.0
Pr[Drama, nm0000120, Cluster 06]: 	 0.0
Pr[Drama, nm0000120| Cluster 07]: 	 0.0
Pr[Drama, nm0000120, Cluster 07]: 	 0.0
Pr[Drama, nm0000120| Cluster 08]: 	 0.0
Pr[Drama, nm0000120, Cluster 08]: 	 0.0
Pr[Drama, nm0000120| Cluster 09]: 	 0.0
Pr[Drama, nm0000120, Cluster 09]: 	 0.0
Pr[Drama, nm0000120| Cluster 10]: 	 0.0
Pr[Drama, nm0000120, Cluster 10]: 	 0.0
Pr[Drama, nm0000120| Cluster 11]: 	 0.0
Pr[Drama, nm0000120, Cluster 11]: 	 0.0
Pr[Drama, nm0000120| Cluster 12]: 	 0.0


In [25]:
for cluster_id,cluster_genre_pr in enumerate(per_cluster_prs):

    pr_cluster_given_genre = cluster_genre_pr / genre_pr_map[target_genre]

    print("Pr[Cluster %02d | %s, %s]:" % (cluster_id, target_genre, target_actor), "\t", pr_cluster_given_genre)
    

Pr[Cluster 00 | Comedy, nm0000120]: 	 0.0016619577862722287
Pr[Cluster 01 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 02 | Comedy, nm0000120]: 	 0.0004985873358816686
Pr[Cluster 03 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 04 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 05 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 06 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 07 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 08 | Comedy, nm0000120]: 	 0.00016619577862722287
Pr[Cluster 09 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 10 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 11 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 12 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 13 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 14 | Comedy, nm0000120]: 	 0.0
Pr[Cluster 15 | Comedy, nm0000120]: 	 0.0


### Sample Titles in Each Cluster

We can use the above conditional probabilities to determine the most likely cluster given a movie genre.

Here, we sample movies in the most likely cluster to get a sense of what movies are in that cluster.

In [26]:
target_cluster = 0

In [27]:
for movie_id in cluster_df[cluster_df["cluster"] == target_cluster].sample(n=10, replace=False)["movie_id"]:
    this_movie = movie_actor_map[movie_id]
    print(movie_id, this_movie["movie"], this_movie["genres"])

tt8246114 The Fighters Prayer ['Sport']
tt0795368 Death at a Funeral ['Comedy']
tt0493407 Cook Off! ['Comedy']
tt0765476 Meet Dave ['Adventure', 'Comedy', 'Family']
tt8870946 A Dim Valley ['Comedy']
tt2022490 Pennin Manathai Thottu ['Musical']
tt0216584 Bob the Butler ['Comedy', 'Family']
tt0425395 Relative Strangers ['Comedy']
tt1241332 Here's the Kicker ['Comedy']
tt1964806 Invasion of the Reptoids ['Sci-Fi']
