# Streaming TV Show Recommender System
## TV Shows Hosted on Four Streaming Platforms, with IMDb and Rotten Tomatoes Ratings
## By: Bryan Kolano, November 22nd, 2022
#### Original Data from [Kaggle Dataset by User: Ruchi Bhatia](https://www.kaggle.com/datasets/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney) 
#### Themes and descriptions in the enhanced dataset come from [IMDb](https://www.imdb.com)

***

#### Read in packages and data

In [89]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

In [90]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\x7824\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The original data is called 'tv_shows.csv'.  However, after I enhanced it with the TV show themes and descriptions from IMDb, the new dataset used throughout this analysis is called 'tv_shows_themes_descriptions.csv'.

In [91]:
df = pd.read_csv('tv_shows_themes_descriptions.csv')
df.head()

Unnamed: 0,title,year,age,imdb,rotten_tomatoes,netflix,hulu,prime_video,disney+,imdb_raw,rotten_raw,themes,description
0,Breaking Bad,2008,18+,9.4/10,100/100,1,0,0,0,9.4,100.0,Crime Drama Thriller,A chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine with a former student in order to sec...
1,Stranger Things,2016,16+,8.7/10,96/100,1,0,0,0,8.7,96.0,Drama Fantasy Horror,"When a young boy disappears, his mother, a police chief and his friends must confront terrifying supernatural forces in order to get him back."
2,Attack on Titan,2013,18+,9.0/10,95/100,1,1,0,0,9.0,95.0,Animation Action Adventure,"After his hometown is destroyed and his mother is killed, young Eren Jaeger vows to cleanse the earth of the giant humanoid Titans that have broug..."
3,Better Call Saul,2015,18+,8.8/10,94/100,1,0,0,0,8.8,94.0,Crime Drama,The trials and tribulations of criminal lawyer Jimmy McGill in the years leading up to his fateful run-in with Walter White and Jesse Pinkman.
4,Dark,2017,16+,8.8/10,93/100,1,0,0,0,8.8,93.0,Crime Drama Mystery,"A family saga with a supernatural twist, set in a German town where the disappearance of two young children exposes the relationships among four f..."


#### Data Manipulation
The original data was not "tidy."  There were four streaming-platform columns, and each one holds a '1' if a particular show is hosted on that platform.  This isn't the best way to store tabular data because you have to inspect multiple columns to get the platform of a show, as opposed to just one column called 'platform.' <br>

We can use pandas.melt to condense the four platform columns into one variable called 'platform', and then keep only the rows where the show is actually on that platform.  Note that a show hosted on several platforms ends up with one row per platform.  I did the same thing in the EDA notebook.  It makes for easier visualizations and analysis.

In [92]:
df_longer = pd.melt(df, id_vars=['title','year','age','imdb','imdb_raw','rotten_tomatoes', 'description'], 
                    value_vars= ['netflix','hulu','prime_video','disney+'],
                   var_name= 'platform',
                    value_name = 'on_platform'
                    )
df_longer = df_longer.query('on_platform == 1')
df_longer.head()

Unnamed: 0,title,year,age,imdb,imdb_raw,rotten_tomatoes,description,platform,on_platform
0,Breaking Bad,2008,18+,9.4/10,9.4,100/100,A chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine with a former student in order to sec...,netflix,1
1,Stranger Things,2016,16+,8.7/10,8.7,96/100,"When a young boy disappears, his mother, a police chief and his friends must confront terrifying supernatural forces in order to get him back.",netflix,1
2,Attack on Titan,2013,18+,9.0/10,9.0,95/100,"After his hometown is destroyed and his mother is killed, young Eren Jaeger vows to cleanse the earth of the giant humanoid Titans that have broug...",netflix,1
3,Better Call Saul,2015,18+,8.8/10,8.8,94/100,The trials and tribulations of criminal lawyer Jimmy McGill in the years leading up to his fateful run-in with Walter White and Jesse Pinkman.,netflix,1
4,Dark,2017,16+,8.8/10,8.8,93/100,"A family saga with a supernatural twist, set in a German town where the disappearance of two young children exposes the relationships among four f...",netflix,1


#### Function Definitions

##### TF-IDF model
Below is a function to find similar shows given a particular show of interest.  The inputs for this function are:
1. The data
2. The show name you are interested in finding matches for
3. How many matches you want
4. The minimum IMDb rating of the matches
5. (Optional): the platform you want the matches to come from

All of the fields are mandatory except platform. If platform is left blank, matches will be returned from all streaming platforms. <br>

This function uses TF-IDF (term frequency-inverse document frequency) to build its embedding matrix.  TF-IDF weights each word by how often it appears in a given description, scaled down by how many descriptions in the corpus contain it.  In other words, we want to find words that are important in a particular description but uncommon across the other documents in the corpus.  <br>
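
For reference, the weighting can be written out as follows. This assumes scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, with each row then scaled to unit length); $n$ is the number of descriptions, $\mathrm{tf}(t, d)$ is the count of term $t$ in description $d$, and $\mathrm{df}(t)$ is the number of descriptions containing $t$.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)
$$

A term that appears in almost every description gets a weight close to its raw count, while a term that appears in only a few descriptions is boosted.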

We then use cosine similarity to find which titles have close vector representations in this multi-dimensional space.
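
To make the two steps above concrete, here is a minimal from-scratch sketch using only the Python standard library. It is not the notebook's function and not scikit-learn's exact pipeline: the tiny `STOP_WORDS` set and the `tokenize`, `tfidf_vector`, and `cosine` helpers are hypothetical names introduced for illustration, and only three descriptions from the dataset are used.

```python
import math
import re
from collections import Counter

# Three descriptions copied from the dataset. STOP_WORDS is a tiny,
# hypothetical stand-in for the NLTK English stop-word list.
docs = {
    "the mandalorian": "The travels of a lone bounty hunter in the outer reaches "
                       "of the galaxy, far from the authority of the New Republic.",
    "cowboy bebop": "The futuristic misadventures and tragedies of an easygoing "
                    "bounty hunter and his partners.",
    "rick steves' europe": "American travel authority Rick Steves guides viewers "
                           "through his favorite European cities.",
}
STOP_WORDS = {"the", "of", "a", "an", "in", "from", "and", "his", "through"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


tokens = {title: tokenize(text) for title, text in docs.items()}
n_docs = len(tokens)

# Document frequency: in how many descriptions does each term appear?
doc_freq = Counter(term for toks in tokens.values() for term in set(toks))


def tfidf_vector(toks):
    """Map each term to its count times a smoothed inverse document frequency."""
    counts = Counter(toks)
    return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


vectors = {title: tfidf_vector(toks) for title, toks in tokens.items()}
query = "the mandalorian"
scores = {title: cosine(vectors[query], vec) for title, vec in vectors.items()}

# Print titles from most to least similar to the query show.
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title:22s} {score:.3f}")
```

In this toy corpus, "cowboy bebop" ranks above "rick steves' europe" because it shares two terms with the query ("bounty", "hunter") rather than one ("authority"). The two non-query shows share no terms at all, so their similarity to each other is zero.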



In [93]:
def find_similar_shows_tf(orig_data, show_name: str, num_matches: int, min_rating: float, platform = None):
    """
    orig_data = dataset to use.  In this case, it is the tv shows on streaming platforms dataset \n
    show_name = Title of the show you want to look up, not case sensitive.  If it is not in the dataset, a message will be returned. String \n
    num_matches = Number of similarity matches to be returned. Integer. \n
    min_rating = For returned matches, the lowest acceptable IMDb rating.  For example, if a match has an IMDb rating of 5.2, but your min_rating is 6.5, that match will be ignored.  float \n
    platform = (Optional) only return matches hosted on this platform, e.g. 'netflix'.  If left as None, all platforms are used. String \n
    """
    orig_data = orig_data.reset_index()
    data_lower = orig_data.assign(title = lambda x: x['title'].str.lower())
    
    #find every row whose title exactly matches the show we are looking for
    title_matches = data_lower.index[data_lower['title'] == show_name.lower()]
    
    #create TF-IDF model and remove all english stopwords
    tfidf = TfidfVectorizer(stop_words= stopwords.words('english'))
    #create matrix based on our description column
    tf_idf_matrix = tfidf.fit_transform(data_lower['description'].values.astype('U'))
    
    #If the show name given to the function is one of the titles in the dataset, continue
    if len(title_matches) > 0:
        #index of the show we are looking for (first row if it is hosted on several platforms)
        show_idx = title_matches[0]
        
        #create cosine similarity matrix of our TF-IDF matrix and turn into dataframe
        sim = cosine_similarity(tf_idf_matrix,tf_idf_matrix)
        sim_df = pd.DataFrame(sim)
        

      
        #rank all cosine similarity values for the particular show_name and sort highest to lowest, 
        #get the index
        top_matches = sim_df.iloc[:,show_idx].sort_values(ascending = False).index

        #Create blank dataframe
        match_df = pd.DataFrame(columns=['title', 'year','imdb_rating','imdb_raw','platform','description'])

        #fill in dataframe for all shows so that we can filter out platform and number of matches later

        for match in top_matches:
            match_df.loc[len(match_df)] = ([data_lower.iloc[match]['title'], 
                                           data_lower.iloc[match]['year'], 
                                           data_lower.iloc[match]['imdb'],
                                           data_lower.iloc[match]['imdb_raw'],
                                           data_lower.iloc[match]['platform'], 
                                           data_lower.iloc[match]['description']])
        pd.options.display.max_colwidth =150

        #if a platform was given, keep only matches on that platform; otherwise keep all platforms
        if platform is not None:
            match_df = match_df.query('platform == @platform')

        #filter out low ratings
        filter_ratings = match_df.query('imdb_raw >= @min_rating')
        #Return the number of matches desired

        return(filter_ratings[['title','year','imdb_rating','platform','description']].head(n = num_matches))
    
    #if show_name is not found in the titles, return that it cannot be found
    else:
        return f'The show "{show_name}" does not exist in this dataset.'
    

##### Transformers model

TF-IDF is one way to check for similarity.  It assigns weights to words that are important in particular documents, and we then look for documents that are close in that multi-dimensional space.  However, TF-IDF builds its embeddings from word frequency alone, so two descriptions only look similar when they share the same words.  Transformer models instead encode the context of whole sentences, which should give "smarter" embeddings than TF-IDF's word "popularity." <br>

Therefore, I wanted to create one function that uses cosine similarity on a TF-IDF matrix and one that uses transformer embeddings, and then compare the two on a few shows.  Since these recommender systems are unsupervised (we have no "right answers" or tagged data showing the "correct" match), I will visually inspect the recommendations to determine which recommender provides the better recommendations.  <br>

The model used in this function comes from [HuggingFace](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).  I chose a pretrained model that is meant for sentence similarity, called "all-MiniLM-L6-v2".


In [94]:
def find_similar_shows_transformers(orig_data, show_name: str, num_matches: int, min_rating: float, platform = None):
    """
    orig_data = dataset to use.  In this case, it is the tv shows on streaming platforms dataset \n
    show_name = Title of the show you want to look up, not case sensitive.  If it is not in the dataset, a message will be returned. String \n
    num_matches = Number of similarity matches to be returned. Integer. \n
    min_rating = For returned matches, the lowest acceptable IMDb rating.  For example, if a match has an IMDb rating of 5.2, but your min_rating is 6.5, that match will be ignored.  float \n
    platform = (Optional) only return matches hosted on this platform, e.g. 'netflix'.  If left as None, all platforms are used. String \n
    """

    #reset the index
    orig_data= orig_data.reset_index()
    #turn all titles into lower case
    data_lower = orig_data.assign(title = lambda x: x['title'].str.lower())
    #find every row whose title exactly matches our show of interest
    title_matches = data_lower.index[data_lower['title'] == show_name.lower()]
    
    #If our show_name is in the titles of the dataset, continue
    if len(title_matches) > 0:
        #index of our show of interest (first row if it is hosted on several platforms)
        show_idx = title_matches[0]
        
        #create the sentence transformers model
        model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        #create embeddings from the description column
        embeddings = model.encode(data_lower['description'])
        #create cosine similarity matrix of our embeddings
        sim = cos_sim(embeddings, embeddings)
        #Turn our similarity matrix into a dataframe
        trans_matrix = pd.DataFrame(sim)

        #rank all shows against our show of interest by cosine similarity, highest first.
        top_matches = trans_matrix.iloc[:,show_idx].sort_values(ascending = False).index

        #create blank dataframe to fill
        match_df = pd.DataFrame(columns=['title', 'year','imdb_rating','imdb_raw','platform','description'])

        #fill all matches into match DF based on cosine similarity values, highest first
        for match in top_matches:
            match_df.loc[len(match_df)] = ([data_lower.iloc[match]['title'], 
                                           data_lower.iloc[match]['year'], 
                                           data_lower.iloc[match]['imdb'],
                                           data_lower.iloc[match]['imdb_raw'],
                                           data_lower.iloc[match]['platform'], 
                                           data_lower.iloc[match]['description']])
        pd.options.display.max_colwidth =150

        #if a platform was given, keep only matches on that platform; otherwise keep all platforms
        if platform is not None:
            match_df = match_df.query('platform == @platform')

        #filter out low ratings
        filter_ratings = match_df.query('imdb_raw >= @min_rating')
        #Return the number of matches desired

        return(filter_ratings[['title','year','imdb_rating','platform','description']].head(n = num_matches))

    #if show_name is not found in the titles, return that it cannot be found
    else:
        return f'The show "{show_name}" does not exist in this dataset.'

In [95]:
find_similar_shows_transformers(df_longer, 'The Mandalorian', 5, 8.4)

Unnamed: 0,title,year,imdb_rating,platform,description
0,the mandalorian,2019,8.8/10,disney+,"The travels of a lone bounty hunter in the outer reaches of the galaxy, far from the authority of the New Republic."
1,cowboy bebop,1998,8.9/10,hulu,The futuristic misadventures and tragedies of an easygoing bounty hunter and his partners.
12,firefly,2002,9.0/10,hulu,"Five hundred years in the future, a renegade crew aboard a small spacecraft tries to survive as they travel the unknown parts of the galaxy and ev..."
31,vikings,2013,8.5/10,hulu,"Vikings transports us to the brutal and mysterious world of Ragnar Lothbrok, a Viking warrior and farmer who yearns to explore--and raid--the dist..."
32,vikings,2013,8.5/10,prime_video,"Vikings transports us to the brutal and mysterious world of Ragnar Lothbrok, a Viking warrior and farmer who yearns to explore--and raid--the dist..."


#### Comparison of Results

##### The Mandalorian

Finding similar shows for The Mandalorian returned broadly similar results from the two models.  The TF-IDF model (TF for short) returned some shows that seem similar to The Mandalorian, given its description.  They deal with futuristic adventures, some related to space.  The funny one is Rick Steves' Europe.  Its description contains "authority" and "travel", which overlap with "authority" and "travels" in The Mandalorian's description, and that is what made it match even though these shows are not similar at all.  (Strictly speaking, the vectorizer does no stemming, so "travel" and "travels" are different tokens and the exact overlap is "authority".)

Compare this to what the transformer model (trans for short) returned.  It returns some of the same shows, and the others it returns also seem similar to The Mandalorian.  Unlike the TF model, the trans model doesn't return any obviously mismatched shows, though Vikings appears twice because it is hosted on two platforms.

In [96]:
find_similar_shows_tf(df_longer, 'The Mandalorian', 5, 8.5)

Unnamed: 0,title,year,imdb_rating,platform,description
0,the mandalorian,2019,8.8/10,disney+,"The travels of a lone bounty hunter in the outer reaches of the galaxy, far from the authority of the New Republic."
2,cowboy bebop,1998,8.9/10,hulu,The futuristic misadventures and tragedies of an easygoing bounty hunter and his partners.
7,firefly,2002,9.0/10,hulu,"Five hundred years in the future, a renegade crew aboard a small spacecraft tries to survive as they travel the unknown parts of the galaxy and ev..."
9,rick steves' europe,2000,8.5/10,prime_video,American travel authority Rick Steves guides viewers through his favorite European cities.
48,hunter x hunter,2011,9.0/10,netflix,"Gon Freecss aspires to become a Hunter, an exceptional being capable of greatness. With his friends and his potential, he seeks out his father, wh..."


In [97]:
find_similar_shows_transformers(df_longer, 'The Mandalorian', 5, 8.5)

Unnamed: 0,title,year,imdb_rating,platform,description
0,the mandalorian,2019,8.8/10,disney+,"The travels of a lone bounty hunter in the outer reaches of the galaxy, far from the authority of the New Republic."
1,cowboy bebop,1998,8.9/10,hulu,The futuristic misadventures and tragedies of an easygoing bounty hunter and his partners.
12,firefly,2002,9.0/10,hulu,"Five hundred years in the future, a renegade crew aboard a small spacecraft tries to survive as they travel the unknown parts of the galaxy and ev..."
31,vikings,2013,8.5/10,hulu,"Vikings transports us to the brutal and mysterious world of Ragnar Lothbrok, a Viking warrior and farmer who yearns to explore--and raid--the dist..."
32,vikings,2013,8.5/10,prime_video,"Vikings transports us to the brutal and mysterious world of Ragnar Lothbrok, a Viking warrior and farmer who yearns to explore--and raid--the dist..."


##### Narcos

Comparing the two models on the show "Narcos", both return what appear to be good matches.  Most shows deal with drugs, Latin American countries, and/or crime.  Though the recommendations are different, it's rather difficult to say which model returned better results.

In [98]:
find_similar_shows_tf(df_longer, 'Narcos', 5, 8)

Unnamed: 0,title,year,imdb_rating,platform,description
0,narcos,2015,8.8/10,netflix,"A chronicled look at the criminal exploits of Colombian drug lord Pablo Escobar, as well as the many other drug kingpins who plagued the country t..."
1,"pablo escobar, the drug lord",2012,8.5/10,netflix,"The exploits of the notorious drug lord, Pablo Escobar."
3,queen of the south,2016,8.0/10,netflix,Teresa flees Mexico after her drug-runner boyfriend is murdered. Settling in Dallas she looks to become the country's reigning drug smuggler and t...
9,la reina del sur,2011,8.0/10,netflix,Teresa Mendoza returns to Mexico after 8 years to fight with Mexican drug dealers.
25,rick and morty,2013,9.2/10,hulu,An animated series that follows the exploits of a super scientist and his not-so-bright grandson.


In [99]:
find_similar_shows_transformers(df_longer, 'Narcos', 5, 8)

Unnamed: 0,title,year,imdb_rating,platform,description
0,narcos,2015,8.8/10,netflix,"A chronicled look at the criminal exploits of Colombian drug lord Pablo Escobar, as well as the many other drug kingpins who plagued the country t..."
1,"pablo escobar, the drug lord",2012,8.5/10,netflix,"The exploits of the notorious drug lord, Pablo Escobar."
9,narcos: mexico,2018,8.4/10,netflix,The rise of the Guadalajara Cartel as an American DEA agent learns the danger of targeting narcos in Mexico.
18,snowfall,2017,8.2/10,hulu,A look at the early days of the crack cocaine epidemic in Los Angeles during the beginning of the 1980s.
22,gangland undercover,2015,8.2/10,hulu,A gritty fact based drama series that tells the astonishing story of a drug dealer who was arrested then given a desperate ultimatum: serve over 2...


##### Gilmore Girls

In general, for the show "Gilmore Girls", both models provide results that are pretty good, with some funny exceptions.  First, the TF model recommended the show "The Exorcist", which is hysterical.  However, we can see why it returned that: both descriptions contain language about a mother and her daughter.  Also, the show "Love" does not exactly seem like a good recommendation for "Gilmore Girls." <br>

The trans model performed much better than the TF model in general, though it did recommend the show "Goosebumps", which is nothing like "Gilmore Girls."
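
The word overlap behind the "The Exorcist" recommendation can be checked directly. The sketch below is only an illustration: `content_words` is a hypothetical helper, and `STOP_WORDS` is a small hand-written stand-in for the NLTK English stop-word list, so it is not the exact preprocessing the TF model applies.

```python
import re

# Descriptions copied from the recommendation tables. STOP_WORDS is a small,
# hypothetical stand-in for the NLTK English stop-word list used in the notebook.
gilmore_girls = ("A dramedy centering around the relationship between a "
                 "thirtysomething single mother and her teen daughter living "
                 "in Stars Hollow, Connecticut.")
the_exorcist = ("When a teenage girl is possessed by a mysterious entity, her "
                "mother seeks the help of two priests to save her daughter.")
STOP_WORDS = {"a", "and", "around", "between", "by", "her", "in", "is",
              "of", "the", "to", "two", "when"}


def content_words(text):
    """Return the set of lowercase words in the text, minus stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


# Words that both descriptions share once stop words are removed.
shared = sorted(content_words(gilmore_girls) & content_words(the_exorcist))
print(shared)  # ['daughter', 'mother']
```

Only "daughter" and "mother" survive, which is all the TF model has to go on here; "teen" and "teenage" do not count as a match because the tokens differ.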

In [100]:
find_similar_shows_tf(df_longer, 'gilmore girls',6, 7.5)

Unnamed: 0,title,year,imdb_rating,platform,description
0,gilmore girls,2000,8.1/10,netflix,"A dramedy centering around the relationship between a thirtysomething single mother and her teen daughter living in Stars Hollow, Connecticut."
2,workin' moms,2017,7.6/10,netflix,"Four very different thirtysomething working-mother friends try to balance their jobs, family lives, and love lives in modern-day Toronto, Canada."
6,californication,2007,8.3/10,prime_video,"A writer tries to juggle his career, his relationship with his daughter and his ex-girlfriend, as well as his appetite for beautiful women."
7,love,2016,7.7/10,netflix,Murphy is an American living in Paris who enters a highly sexually and emotionally charged relationship with Electra. Unaware of the effect it wil...
8,the exorcist,2016,8.0/10,hulu,"When a teenage girl is possessed by a mysterious entity, her mother seeks the help of two priests to save her daughter."
10,raising hope,2010,8.0/10,hulu,"Jimmy becomes a single parent to an infant after her mother, with whom he had a one-night stand, ends up on Death Row."


In [101]:
find_similar_shows_transformers(df_longer, 'gilmore girls',6, 7.5)

Unnamed: 0,title,year,imdb_rating,platform,description
0,gilmore girls,2000,8.1/10,netflix,"A dramedy centering around the relationship between a thirtysomething single mother and her teen daughter living in Stars Hollow, Connecticut."
3,kath & kim,2002,8.3/10,netflix,"A mother, her daughter, and their respective ""hunks of spunk"" negotiate life in suburban Australia."
5,better things,2016,7.8/10,hulu,An actress raises her three daughters while juggling the pressures of working in Hollywood and being a single parent.
8,goosebumps,1995,7.7/10,netflix,A teenager teams up with the daughter of young adult horror author R. L. Stine after the writer's imaginary demons are set free on the town of Mad...
13,a love so beautiful,2017,8.1/10,netflix,"Love is as tough as it is sweet for a lovestruck teenager, whose relationship with her next-door neighbor transforms as they grow into adulthood."
14,this is us,2016,8.7/10,hulu,"A heartwarming and emotional story about a unique set of triplets, their struggles and their wonderful parents."


##### Other Similarities

In [102]:
find_similar_shows_tf(df_longer, 'breaking bad',5, 8, 'netflix')

Unnamed: 0,title,year,imdb_rating,platform,description
0,breaking bad,2008,9.4/10,netflix,A chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine with a former student in order to sec...
11,el reemplazante,2012,8.5/10,netflix,"A former financial executive finds redemption and romance when he becomes a math teacher, after he's granted parole following a prison sentence."
29,the umbrella academy,2019,8.0/10,netflix,"A family of former child heroes, now grown apart, must reunite to continue to protect the world."
42,put your head on my shoulder,2019,8.0/10,netflix,"Si Tu Mo's graduation is close and she's unsure of her future plans, she then meets genius Physics student Gu Wei Yi and the two accidentally end ..."
56,shtisel,2013,8.6/10,netflix,"The life of the Shtisel family, a Haredi family in Jerusalem."


In [103]:
find_similar_shows_transformers(df_longer, 'breaking bad',5, 8, 'netflix')

Unnamed: 0,title,year,imdb_rating,platform,description
0,breaking bad,2008,9.4/10,netflix,A chemistry teacher diagnosed with inoperable lung cancer turns to manufacturing and selling methamphetamine with a former student in order to sec...
4,top boy,2011,8.4/10,netflix,Two London drug dealers ply their lucrative trade at a public housing estate in East London.
12,la niña,2016,8.6/10,netflix,"A young woman who was part of the guerrilla force now seeks a better life in the city. She wants to study medicine, to heal instead of hurt, but s..."
17,how to get away with murder,2014,8.1/10,netflix,A group of ambitious law students and their brilliant criminal defense professor become involved in a twisted murder plot that promises to change ...
20,行医,2011,8.2/10,netflix,"Two doctors with opposite approaches to treating patients clash on the job and must learn to trust each other's expertise, allowing love to bloom."


#### Conclusions
There are numerous other comparisons we could make between the two models to see the different results returned by each.

These two recommendation models generate good recommendations overall.  However, the TF-IDF model struggles at times to understand the context of a show's description.  For example, the show "Gilmore Girls" is about a girl and her mom.  The TV show "The Exorcist" also deals with a mom and her daughter.  The TF-IDF model homed in on "daughter" and "mother" and thus returned "The Exorcist" as a recommendation.  Clearly, this is not a good recommendation.  The transformers model did not return "The Exorcist"; it seems to understand the context of the searched show's description much better and gives better recommendations.