# Exploration Amazon Notebook
Amazon Prime Video is one of the most popular media and video streaming platforms. It has close to 10,000 movies and TV shows available, and as of mid-2021 it had over 200M subscribers globally.

This tabular dataset consists of listings of all the movies and tv shows available on Amazon Prime, along with details such as - cast, directors, ratings, release year, duration, etc.

As a data scientist working for Amazon Prime Video, your task is to improve the platform's recommendation system to enhance user engagement and satisfaction. With the wealth of data available, you aim to develop a personalized recommendation algorithm that suggests movies and TV shows tailored to each user's preferences.

In [None]:
#For data handling
import pandas as pd
import numpy as np

#For visualization
import matplotlib.pyplot as plt
import seaborn as sns

#For handling categorical and missing data
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder, MultiLabelBinarizer

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.metrics.pairwise import cosine_similarity

#Creating the train test split
from sklearn.model_selection import train_test_split
import re

# Importing library
from scipy.stats import skew, kurtosis

sns.set_style('darkgrid')

In [None]:
!pip install -U sentence-transformers umap-learn hdbscan bertopic

In [None]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Pull in data
Look at the initial records of the data and perform elementary EDA.

In [None]:
# read in the data via read_csv
df = pd.read_csv('sample_data/amazon_prime_titles.csv')

In [None]:
# view the top 5 rows of the data
df.head()

In [None]:
# view the dimensions of the data using df.shape
df.shape

In [None]:
# look at data types as well as missing values in the dataset
df.info()

Notice that there are a lot of missing values in various columns. Let's go ahead and impute them (i.e., deal with them by filling). First, what percentage of each feature is null?

In [None]:
# get the percentage of missing values in each column,
# rounded to two decimal places
np.round(df.isna().sum()/len(df), 4) * 100

Given that the `director`, `cast`, `date_added`, and `rating` columns contain nulls and there's no real way of recovering them from the given data, let's just impute them with `Unknown`.

This command is often forgotten, so let's ask Google Colab's AI assistant to help with this. Ask it something like: `fill in null values of director, cast, date_added, and rating in the data frame`
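
As a minimal sketch of what such a prompt might return, the snippet below fills nulls with `Unknown` on a small made-up frame. The `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset; the same `fillna` pattern should carry over to `df`.

```python
import numpy as np
import pandas as pd

# toy frame standing in for df; the values are made up for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': ['Jane Doe', np.nan, np.nan],
    'cast': [np.nan, 'John Roe', np.nan],
    'date_added': ['March 30, 2021', np.nan, np.nan],
    'rating': ['13+', np.nan, 'R'],
})

cols_to_fill = ['director', 'cast', 'date_added', 'rating']
# fillna returns a new frame, so assign the result back to the same columns
toy[cols_to_fill] = toy[cols_to_fill].fillna('Unknown')

# total number of nulls left in the toy frame
print(toy.isna().sum().sum())
```

Filling a list of columns in one call keeps the imputation in a single place, which makes it easier to extend if more columns need the same treatment.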

In [None]:
# fill in director, date_added, cast, rating column with 'Unknown' value for nulls
# prompt: how to fill null values of a column in a pandas dataframe


Looking at the `country` column, most of the values are null, and the titles that do have a country are predominantly made in the US (just from observation), so we are going to fill the nulls in with the most frequent value.

Let's ask Google's assistant again. Ask it something like this: `impute missing values of a country column using sklearn SimpleImputer`
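
For reference, here is a hedged sketch of most-frequent imputation with `SimpleImputer` on a made-up column. The `toy` frame is an assumption for illustration only; on the real data the same call would be applied to `df[['country']]`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame standing in for df; the values are made up for illustration
toy = pd.DataFrame({
    'country': ['United States', np.nan, 'India', 'United States', np.nan]
})

# 'most_frequent' works on string columns; missing_values defaults to np.nan
imputer = SimpleImputer(strategy='most_frequent')

# fit_transform expects 2-D input and returns a 2-D array,
# hence the double brackets and the ravel() back to one dimension
toy['country'] = imputer.fit_transform(toy[['country']]).ravel()

print(toy['country'].tolist())
```

The fitted `imputer` remembers the most frequent value it saw, so it can later be reused with `transform` on new data, which is the main advantage over a one-off `fillna`.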

In [None]:
# use an imputer to fill in all the missing values
# there are easier ways to do this, but the imputer is good
# practice, especially for productionization.


Check to make sure that there are no more null features.

In [None]:
# use the same code as before
np.round(df.isna().sum()/len(df), 4) * 100

In [None]:
# look at a few different examples.
df.sample(7)

We notice that the dataset contains TV shows and movies, so let's see the distribution of each kind. Also, is there a relationship between the type and the directors?

In [None]:
# see that there are movies and tv shows
movie_tv = df.type.value_counts(normalize=False)\
                  .rename('number')\
                  .to_frame()\
                  .reset_index()\
                  .rename(columns={'index':'type'})

plt.figure(figsize=(6, 6))
ax = sns.barplot(movie_tv, x='type', y='number')
plt.title('Number of TV Shows vs Movies')
ax.bar_label(ax.containers[0])
plt.show()

In [None]:
# get the top 15 directors by count - use the value counts method

In [None]:
# there are a good amount of unknown directors so
# find out if the unknown directors
# correspond to movies or tv shows more
# what do you notice?

#### df[df.director == 'Unknown'].type.value_counts(normalize=True)

In [None]:
# what do you notice about the
# directors for all tv shows in the
# data frame

#### df[df.type == 'TV Show'].director.value_counts()

# Ratings Exploration
Let's look at the distribution of ratings in the dataset and see if there are any correlations between ratings and type.

In [None]:
# plot a distribution of ratings among all
# movies and tv shows
plt.figure(figsize=(10, 6))
sns.countplot(x='rating', data=df, order=df['rating'].value_counts().index)
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Distribution of ratings and correlation to type
plt.figure(figsize=(10, 8))
ax = sns.countplot(y='rating',
                   data=df,
                   order=df['rating'].value_counts().index,
                   hue='type')
plt.title('Distribution of Ratings by Type')
# the ratings are on the y axis here, so the counts go on the x axis
plt.xlabel('Number of Titles')
plt.ylabel('Rating')
ax.bar_label(ax.containers[0])
ax.bar_label(ax.containers[1])
plt.show()

# Listings Exploration
Let's take a deeper dive into the `listed_in` feature and see if anything stands out. This feature holds what are commonly known as the genres of the TV shows and movies.

In [None]:
# get the number of unique values inside
# of the listed_in feature

#### df['listed_in'].nunique()

In [None]:
# plot the top genres
top_genres = df['listed_in'].value_counts().head(15).index.tolist()

# get all records that correspond to the top genres
top_genre_df = df[df.listed_in.isin(top_genres)]
# Plot Top Genres
plt.figure(figsize=(12, 8))
ax = sns.countplot(data = top_genre_df, y='listed_in', hue='type')
plt.title('Top 15 Genres')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
ax.bar_label(ax.containers[0])
ax.bar_label(ax.containers[1])
plt.show()

Notice that there is some overlap in the genres. For example, `comedy` appears in `comedy` as well as in `comedy, drama`. Let's see if we can separate the genres from each other and just do a plain count of how often the individual genres occur.

In [None]:
# 'Arts, Entertainment, and Culture' is a single genre whose name contains commas,
# so let's make it one genre by removing the commas.
df['listed_in'] = df['listed_in'].str.replace('Arts, Entertainment, and Culture',
                                              'Arts Entertainment and Culture',
                                              regex=False)

In [None]:
# split the listed_in features and get a list
# inside a new feature called 'genres'
df['genres'] = df['listed_in'].apply(lambda x : [z.strip(' ') for z in x.split(',')])

df.head()

If we want to find out how often a certain genre appears across all movies (for example, how often does just `comedy` appear), we need to one hot encode the `listed_in` feature. That can be tricky to do when each row holds a list of genres. This is where MultiLabelBinarizer comes into the picture and does the OHE for us! For more information on MultiLabelBinarizer see [this](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html).
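
To make the idea concrete, here is a small sketch on made-up genre lists. The `toy_genres` series and the `toy_mlb` name are assumptions for illustration; the real notebook applies the same idea to the `genres` column.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# each row holds a list of genres, as the split listed_in column would
toy_genres = pd.Series([['Comedy'], ['Comedy', 'Drama'], ['Drama', 'Kids']])

toy_mlb = MultiLabelBinarizer()
# one column per distinct genre, 1 where the row's list contains that genre
encoded = pd.DataFrame(toy_mlb.fit_transform(toy_genres),
                       columns=toy_mlb.classes_)

# summing each column counts how often each individual genre occurs
genre_counts = encoded.sum(axis=0)
print(genre_counts)
```

Because every genre gets its own 0/1 column, `Comedy` is counted once for the pure comedy and once for the comedy-drama, which is exactly the plain count that the combined labels hide.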

In [None]:
# use the MultiLabelBinarizer to one hot encode features
# this will give you an array of genres

# create a MultiLabelBinarizer object
mlb = MultiLabelBinarizer()

# fit_transform the MultiLabelBinarizer on the
# newly created genres column.
mlb.fit_transform(df['genres'])

In [None]:
# look at the classes that the multilabel binarizer found
# these are the genres that have been listed inside the dataset.
mlb.classes_

In [None]:
# create a data frame with one row per title
# and one column per individual genre.
df_mlb = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)

df_genre = pd.concat([df[['title', 'type', 'listed_in', 'genres']], df_mlb], axis=1)

df_genre.head()

In [None]:
# see how often each genre occurred across the TV shows and Movies.

# TV Shows
df_genre[df_genre.type == 'TV Show'].iloc[:, 4:]\
                                    .sum(axis=0)\
                                    .sort_values(ascending=False)\
                                    .head(15)

In [None]:
# Movies
df_genre[df_genre.type == 'Movie'].iloc[:, 4:]\
                                  .sum(axis=0)\
                                  .sort_values(ascending=False)\
                                  .head(15)

# Duration Exploration
Do some exploration on the duration. The duration of the movies and TV shows can vary depending on type of movie, genre etc. Let's do some exploration there to see if there is anything to discover.

In [None]:
# get all of the unique values
# inside of the duration feature
# what do you notice

#### df.duration.unique()

In [None]:
# create a new feature called 'duration_season'
# where we indicate if the duration
# feature has 'season' in the title or not.
# maybe there is some correlation there between that and TV shows
# use the regex library from python
df['duration_season'] = df.duration.apply(lambda x : True if re.search('season', x.lower()) else False)

In [None]:
# plot the movies vs tv shows and see how many of them have
# seasons in the duration title.
# what do you notice?
plt.figure(figsize=(6, 6))
ax=sns.countplot(df, x='type', hue='duration_season')
plt.title('Seasons in TV shows vs Movies')
ax.bar_label(ax.containers[0])
ax.bar_label(ax.containers[1])
plt.show()

So all TV shows are measured in seasons rather than minutes, which makes sense. Let's clean this data up and see if we can find any correlations between duration and genre.
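
As a rough sketch of that cleanup, the helper below splits a duration string into a number and a unit. The function name `parse_duration` and the exact string formats (`'113 min'`, `'1 Season'`, `'14 Seasons'`) are assumptions for illustration; check them against the unique values of `duration` before relying on the pattern.

```python
import re

def parse_duration(text):
    """Split a duration string into (number, unit); unit is 'min' or 'season'."""
    match = re.match(r'^\s*(\d+)\s+(min|seasons?)\s*$', text, flags=re.IGNORECASE)
    if match is None:
        # fail loudly on formats this sketch does not know about
        raise ValueError(f'unrecognized duration: {text!r}')
    number = int(match.group(1))
    unit = 'min' if match.group(2).lower() == 'min' else 'season'
    return number, unit

for raw in ['113 min', '1 Season', '14 Seasons']:
    print(raw, '->', parse_duration(raw))
```

Raising on unknown formats, rather than silently returning a default, makes unexpected values in the column visible early.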

## Movie Duration Exploration
Let's do movies first and see if we can find some interesting insights within the movies in terms of duration time.

In [None]:
# get all of the movies based off of our new feature (we could also use the type feature)
df_movies = df[df.duration_season==False].copy()

# create a new feature which strips the 'min' from the feature value
# and gets a numerical value for the length of the movie. Call this 'duration_time'.
df_movies['duration_time'] = df_movies.duration.apply(lambda x : x.split('min')[0]).astype(int)
df_movies.sample(3)

In [None]:
# plot a distribution of duration time for the movies, what do you notice?
plt.figure(figsize=(10, 8))
sns.histplot(data=df_movies, x ='duration_time', bins=50)
plt.show()

In [None]:
# give the descriptive statistics of the duration_time feature
# what do you notice?

df_movies.duration_time.describe()

In [None]:
# there's a really long tail in the distribution of times
# let's look into the tail and see what kind of movies there are.
# pick a time in minutes, say 400 and look at a sample
# of movies that have duration past that length

### df_movies[df_movies.duration_time>400].sample(9)

In [None]:
# most of these really long movies are
# coming from what director?

#### df_movies[df_movies.duration_time>400].director.value_counts()

Looking at the descriptions of these really long movies, it seems like they are all ambient noise movies. These movies provide white noise to aid in sleeping and relaxation.

In [None]:
# removing these outliers you can see a much more symmetric curve
# of movie times. What do you notice about the movie duration times? Does it look
# similar to a type of distribution that you have seen before?
plt.figure(figsize=(10, 8))
sns.histplot(data=df_movies[df_movies.duration_time<=400],
             x ='duration_time',
             bins=50,
             kde=True)
plt.show()

In [None]:
# looks like there are also movies that are 10 minutes or shorter? Why?
df_movies[df_movies.duration_time <= 10].listed_in.value_counts().head(15)

In [None]:
# what kind of movies are these?
df_movies[df_movies.duration_time <= 10].sample(3)

In [None]:
# removing these outliers you can see a much more normal distribution
# of movie times. What do you notice about the movie duration times?
# Does it look similar to a type of distribution that you have seen before?
plt.figure(figsize=(10, 8))
sns.histplot(data=df_movies[(df_movies.duration_time<=400) &
                            (df_movies.duration_time>=10)],
                            x ='duration_time', bins=50, kde=True)
plt.show()

In [None]:
# calculate some stats on the distribution
# in particular skew and the kurtosis of it
duration_time_non_tail = df_movies[(df_movies.duration_time<=400)
                                  & (df_movies.duration_time>=10)].duration_time.values

# print(f"The skew of the duration_time is: {skew(duration_time_non_tail)}")

# print(f"The kurtosis of the duration_time is: {kurtosis(duration_time_non_tail)}")

In [None]:
# calculate some general descriptive statistics of the movie lengths
df_movies[(df_movies.duration_time<=400) & (df_movies.duration_time>=10)].duration_time.describe()

## Movie Duration Exploration vs Genre
Are there any interesting insights between the genre and the duration time e.g., avg movie lengths by genre, are comedies usually shorter than dramas etc.?
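
One way to get per-genre averages without treating `Comedy, Drama` as its own category is to explode a list-valued genre column. This is a different technique from the MultiLabelBinarizer-based approach used in this notebook, shown here as a sketch on made-up data; the `toy_movies` frame and its values are assumptions.

```python
import pandas as pd

# toy frame: one row per movie, with a list of genres and a length in minutes
toy_movies = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Comedy'], ['Comedy', 'Drama'], ['Drama']],
    'duration_time': [90, 100, 120],
})

# explode repeats each movie once per genre it belongs to,
# so a comedy-drama contributes to both the Comedy and the Drama average
per_genre = (toy_movies.explode('genres')
                       .groupby('genres')['duration_time']
                       .agg(['size', 'mean'])
                       .sort_values('mean', ascending=False))
print(per_genre)
```

Note that a movie with several genres is counted in each of them, so the `size` column sums to more than the number of movies.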

In [None]:
# group by listed_in feature and look at the average
# duration time across multiple genres
df_movies.groupby('listed_in').duration_time.mean().sort_values(ascending=False)

In [None]:
## As was mentioned in the genre section, because of the overlapping labels
## you can't really get a full picture of how the genres relate to
## movie lengths because 'Drama, Comedy' is considered different from 'Drama' or 'Comedy'.
## let's use the genres data frame to calculate more descriptive
## statistics by individual genre.

# get the movies and merge them onto the genres
# df_movies merge onto df_genres, on title and listed in
# do a left join just to be safe
df_movie_genre = df_movies.merge(df_genre, on=['title', 'listed_in'], how='left')

df_movie_genre.head()

In [None]:
# use the MultiLabelBinarizer to get the genre names
# in the form of a list. Make sure to remove 'TV Shows'
# since those aren't measured in minutes
genres = [genre for genre in mlb.classes_.tolist() if genre != 'TV Shows']

In [None]:
# initial group by that we'll be calculating
df_movie_genre.groupby('Action').duration_time.agg(['size','mean']).reset_index()

In [None]:
# create an empty dictionary that
# will keep track of genres and their avg minutes
genre_duration_time_dict = {}

# go through each genre, find the number of movies and the average duration time
# and place those values in the dictionary with the genre as a key
for genre in genres:
  # keep only the movies tagged with this genre (one hot value of 1)
  genre_rows = df_movie_genre[df_movie_genre[genre] == 1]
  # skip genres that no movie is tagged with
  if genre_rows.empty:
    continue
  # get the total number of movies for the genre
  size_duration = len(genre_rows)
  # get the avg movie length per genre
  avg_duration = genre_rows.duration_time.mean()
  # assign the values in a dictionary
  genre_duration_time_dict[genre] = [size_duration, avg_duration]

In [None]:
# genre_duration_time_dict

In [None]:
# use the pandas from_dict method to turn the previous
# dictionary into a pandas data frame
gen_duration_time_df = pd.DataFrame.from_dict(genre_duration_time_dict,
                                              orient='index',
                                              columns=['num_movies_with_genre',
                                                       'avg_duration_time'])\
                                              .reset_index(names='genre')
# sort the values in descending order, what do you notice?
gen_duration_time_df.sort_values('avg_duration_time', ascending=False)

In [None]:
# give a bar plot representation for this
plt.figure(figsize=(12, 6))
sns.barplot(data=gen_duration_time_df.sort_values('avg_duration_time', ascending=False),
            y='genre',
            x='avg_duration_time')
plt.title('Average Duration Time by Genre')
plt.ylabel('Genre')
plt.xlabel('Average Duration Time')
plt.savefig('avg_duration.png',  bbox_inches='tight')
plt.gcf()
plt.show()

## TV Shows Duration Exploration
Explore some general behavior of TV show durations and a breakdown by genre.

In [None]:
# Do it again for TV Shows
## get all the tv shows
df_shows = df[df.duration_season==True].copy()

plt.figure(figsize=(6, 6))
ax = sns.countplot(data = df_shows,
                   y='duration',
                   order = df_shows['duration'].value_counts().index)
plt.title('Duration in TV Shows')
ax.bar_label(ax.containers[0])
plt.show()

In [None]:
# are there duplicate shows, for example 1 show can have 1 season and 2 seasons
# looks like no duplicate titles
df_shows.title.duplicated().sum()

In [None]:
## see how many tv shows have 1 season 2 seasons etc.
df_shows['duration'].value_counts()

In [None]:
# 12 percent of tv shows on prime only have two seasons!
df_shows['duration'].value_counts()/len(df_shows) * 100

#### How do TV show seasons relate to release year?

In [None]:
# Distribution of release years for movies and tv shows
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='release_year', bins=30, kde=True, hue='type')
plt.title('Distribution of Movies and TV Shows by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()

In [None]:
# Distribution of TV shows by release year
# at this point we can notice that tons of tv shows
# only have 1 season
plt.figure(figsize=(12, 6))
sns.histplot(data=df_shows, x='release_year', bins=30, hue='duration')
plt.title('Distribution of TV Shows by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Number of TV Shows')
plt.show()

We can probably infer that the majority of TV shows that have 1 season were created fairly recently. Let's see if we can validate that.
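
One quick way to check a claim like this is a normalized cross-tabulation of season count against release era. The sketch below uses made-up rows; the `toy_shows` frame and the `era` column are assumptions for illustration, with the year 2000 used as the cutoff.

```python
import pandas as pd

# toy frame standing in for the TV shows; the values are made up
toy_shows = pd.DataFrame({
    'duration': ['1 Season', '1 Season', '2 Seasons', '1 Season', '3 Seasons'],
    'release_year': [2019, 2020, 2020, 1995, 1998],
})

# label each show by whether it was released in 2000 or later
toy_shows['era'] = toy_shows['release_year'].ge(2000).map(
    {True: '2000 or later', False: 'before 2000'})

# normalize='index' turns each row into shares that sum to 1,
# i.e. the fraction of shows with that duration in each era
share = pd.crosstab(toy_shows['duration'], toy_shows['era'], normalize='index')
print(share)
```

Reading across the `1 Season` row then gives the share of single-season shows that are recent, which is the quantity the inference above is about.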

In [None]:
#because there are not many shows before 1990, let's remove those and
# see what the distribution of 1 season shows looks like
df_shows_new = df_shows[df_shows.release_year>=1990]
df_shows_unstack = df_shows_new.groupby(['duration', 'release_year'])\
                               .duration.count()\
                               .to_frame().unstack().fillna(0).iloc[:, -22:]
df_shows_unstack

In [None]:
# the durations are strings, so they are ordered alphabetically
# (e.g. '10 Seasons' comes before '2 Seasons');
# reorder them by their leading season number instead
unstacked_indexes = df_shows_unstack.index.tolist()
ordered_seasons = sorted(unstacked_indexes, key=lambda x: int(x.split(' ')[0]))
ordered_seasons

In [None]:
# group by the duration and release year, count the
# number of times that these occur together
# pivot using unstack and place it into
# a nice pandas dataframe from the year 2000 onwards
df_shows_gpby = df_shows_new.groupby(['duration', 'release_year'])\
                            .duration.count().rename('release_year')\
                            .to_frame().unstack()\
                            .fillna(0).iloc[:, -22:]\
                            .reindex(ordered_seasons)
df_shows_gpby

In [None]:
# plot a heatmap equivalent of this dataframe
plt.figure(figsize=(26, 8))
sns.heatmap(df_shows_gpby, cmap="Blues", annot=True)
plt.show()

In [None]:
# sum across the columns to show how many
# tv shows have 1 season, 2 season, 3 season etc.
df_shows_gpby.sum(axis=1)

Interestingly, 1280 of the 1342 shows that have 1 season were released in 2000 or later. What do we think this means?

In [None]:
df_shows[df_shows.duration == '14 Seasons'].title.values.tolist()

#### TV Shows Season Duration vs Genres
Can we see a breakdown of TV show season counts vs the genres of those TV shows? Do certain genres last longer than others, at least on Amazon Prime?

In [None]:
# using the python split method
# get the season length of each tv show
df_shows['season_length'] = df_shows.duration.apply(lambda x : int(x.split(' ')[0]))
df_shows.head()

In [None]:
# similar to the movies portion above, take the shows data frame
# and merge it on the genres data frame
df_shows_genre = df_shows.merge(df_genre, on=['title', 'listed_in'], how='left')
df_shows_genre.head()

In [None]:
# create an empty dictionary
genre_duration_time_dict_shows = {}

# go through each genre, find the number of shows and the average number of seasons
# and place those values in the dictionary with the genre as a key
for genre in genres:
  # keep only the shows tagged with this genre (one hot value of 1)
  genre_rows = df_shows_genre[df_shows_genre[genre] == 1]
  # skip genres that no show is tagged with
  if genre_rows.empty:
    continue
  # get the total number of shows for the genre
  size_duration = len(genre_rows)
  # get the avg number of seasons per genre
  avg_duration = genre_rows.season_length.mean()
  # assign the values in a dictionary
  genre_duration_time_dict_shows[genre] = [size_duration, avg_duration]

In [None]:
# genre_duration_time_dict_shows

In [None]:
# use the pandas from_dict method to turn the previous
# dictionary into a pandas data frame
gen_duration_time_df_shows = pd.DataFrame.from_dict(genre_duration_time_dict_shows,
                                              orient='index',
                                              columns=['num_shows_with_genre', 'avg_duration_time'])\
                                              .reset_index(names='genre')
# sort the values in descending order, what do you notice?
gen_duration_time_df_shows.sort_values('avg_duration_time', ascending=False)

In [None]:
# give a bar plot representation for this
plt.figure(figsize=(12, 6))
sns.barplot(data=gen_duration_time_df_shows.sort_values('avg_duration_time', ascending=False),
            y='genre',
            x='avg_duration_time')
plt.title('Average Number of TV Show Seasons by Genre')
plt.ylabel('Genre')
plt.xlabel('Average Number of Seasons')
# plt.savefig('avg_duration.png',  bbox_inches='tight')
# plt.gcf()
plt.show()

# Country Breakdown
Are there any insights to be had broken down by country? Let's plot some descriptive plots to visualize things like duration.

In [None]:
# plot the top 15 countries by how they appear in the
# data set
top_countries = df['country'].value_counts().head(15).index

# Filter the data for the top countries
country_data = df[df['country'].isin(top_countries)]

# Visualize the total movies and tv shows for the top countries
plt.figure(figsize=(22, 11))
sns.countplot(x='country',
              data=country_data,
              hue='type',
              order=top_countries,
              palette='viridis')
plt.title('Total Movies and Shows by Top Countries')
plt.xlabel('Country')
plt.ylabel('Number of Titles')
plt.legend(title='Type', loc='upper right')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Visualize the distribution of movie durations for the top countries
# Top N countries
top_countries = df['country'].value_counts().head(5).index

df_movies_top_countries = df_movies[df_movies.country.isin(top_countries)]

plt.figure(figsize=(22, 11))
sns.boxplot(x='country',
            data=df_movies_top_countries,
            y='duration_time',
            hue='country',
            palette=['r','g','b','m', 'y'])
plt.title('Movie Duration by Top Countries')
plt.xlabel('Country')
plt.ylabel('Duration (minutes)')
plt.legend(title='Country', loc='upper right')
plt.xticks(rotation=45, ha='right')
plt.show();

In [None]:
# Visualize the distribution of movie durations for the top countries as a violin plot
# Top N countries
top_countries = df['country'].value_counts().head(5).index

df_movies_top_countries = df_movies[df_movies.country.isin(top_countries)]

plt.figure(figsize=(22, 8))
sns.violinplot(x='country',
               data=df_movies_top_countries,
               y='duration_time',
               hue='country',
               palette=['r','g','b','m', 'y'])
plt.title('Movie Duration by Top Countries')
plt.xlabel('Country')
plt.ylabel('Duration (minutes)')
plt.legend(title='Country', loc='upper right')
plt.xticks(rotation=45, ha='right')
plt.show();

# Descriptions Recommendations Exploration
Recommendation systems are usually built on user interests and behavior, e.g., which movies a user has watched, shows liked, minutes watched, and genres watched. Since we don't have that kind of user data, let's develop a basic recommendation system that returns the top 5 movies for a user query describing what they would like to watch.

In [None]:
# let's look at only movies and their descriptions
df[df.type == 'Movie'].description.values[:7]

In [None]:
# get the lengths of each of the descriptions and see what those look like
df['description_length'] = df.description.apply(lambda x : len(x.split(' ')))
df.sample(7)


In [None]:
# plot a violin plot of the description lengths
# this will give us an idea of how long each description is
# upon using embeddings
plt.figure(figsize=(8,8))
sns.violinplot(data=df, x='type', y='description_length')
plt.show();

To find general recommendations, we first need to embed the descriptions that we do have into a numerical embedding. Ideally, we'd like for the numerical embeddings to have all the contextual information that describes the film e.g., a drama about sports, comedies about friends etc.

This requires us to first embed text into its numerical representation using a transformer-based language model. A powerful technique for this is the Sentence Transformer. Instead of transforming individual words into numerical embeddings, we transform whole sentences or paragraphs, so that the embedding captures the contextual information we need. More information on sentence transformers can be found [here](https://www.sbert.net/).
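
Once text is embedded, ranking by similarity is plain vector arithmetic. The sketch below ranks made-up 3-dimensional vectors by cosine similarity to a query vector. The titles, vectors, and the `cosine_scores` helper are assumptions for illustration; real sentence embeddings have hundreds of dimensions and come from the model rather than being typed in by hand.

```python
import numpy as np

# made-up 3-dimensional "embeddings", one row per title
toy_titles = ['Sports Drama', 'Buddy Comedy', 'Nature Documentary']
toy_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])
# made-up embedding of a user query
toy_query = np.array([0.8, 0.2, 0.1])

def cosine_scores(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    return (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))

scores = cosine_scores(toy_query, toy_embeddings)
# negate the scores so argsort orders from most to least similar
ranking = np.argsort(-scores)
for idx in ranking[:2]:
    print(toy_titles[idx], round(float(scores[idx]), 3))
```

Cosine similarity compares the direction of two vectors rather than their length, so a short and a long description about the same subject can still score as close matches.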

In [None]:
# import the general use sentence transformer all-mpnet-base-v2
model = SentenceTransformer('all-mpnet-base-v2')

In [None]:
# create a dictionary that will keep
# track of movie titles and descriptions
movie_dict = {
    'movie': df[df.type == 'Movie'].title.values,
    'descriptions' : df[df.type == 'Movie'].description.values
}

In [None]:
# get the movie descriptions into a variable
movie_descriptions = movie_dict['descriptions']

In [None]:
# embed the descriptions into their
# numerical embeddings
embeddings = model.encode(movie_dict['descriptions'] , show_progress_bar=True)

In [None]:
# get the embeddings shape
embeddings.shape

In [None]:
# given a user query, let's embed
# the query into its numerical representation
user_query = 'a drama about sports'
query_embedding = model.encode(f'{user_query}')

In [None]:
# another example
user_query = 'comedies for adults'
query_embedding = model.encode(f'{user_query}')

In [None]:
# get the similarity between the query embedding and
# the movies
similarity_array = cosine_similarity(query_embedding.reshape(1, -1), embeddings)

In [None]:
# get the top 5 movie descriptions that
# are most similar to the query
top_five_sim_indices = np.argsort(-similarity_array)[0][:5].tolist()

In [None]:
# print out the top 5 most
# similar movies to the user query
for idx in top_five_sim_indices:
  print(f"Title: {movie_dict['movie'][idx]}")
  print()
  print(f"Description: {movie_dict['descriptions'][idx]}")
  print('---------------------')

# Topic Model the Descriptions
Given a large corpus of movie descriptions, can we find out what the movies are about without having to read each one? In other words, can we get a general sense of what types of movies are in the data set just from examining the text?

This is known as topic modeling, and there are a variety of ways to do it. One of the more powerful methods is BERTopic. For more information on BERTopic, please see [here](https://maartengr.github.io/BERTopic/index.html).

In [None]:
# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=18, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model                # Step 5 - Extract topic words
)


In [None]:
# fit transform the model - this can take some time
# make sure to run it on a GPU if possible
topics, probs = topic_model.fit_transform(movie_descriptions)

In [None]:
# get a general sense of the topics
# -1 corresponds to outliers that weren't assigned a topic
model_topic_info = topic_model.get_topic_info()
model_topic_info

In [None]:
# if you want to get a general sense of what words
# are in each topic
cluster = 3
topic_model.get_topic(cluster)

In [None]:
# see what the outliers look like
model_topic_info[model_topic_info.Topic == -1].Representative_Docs.values[0]

In [None]:
# look at the topic assigned to a random sample of documents
topic_model.get_document_info(movie_descriptions).sample(7)

In [None]:
# if you would like to visualize the topics you can look at the following three
# plots
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart()

In [None]:
topic_model.visualize_heatmap()