<a href="https://colab.research.google.com/github/ale-camer/Data-Science/blob/Finance/Movie_Recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this script we build a movie recommender by measuring the similarity between movie descriptions. We use a standard dataset that can be found [here](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?resource=download). It contains data on about 5,000 movies and can also be used for several other toy projects.
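
To make "measuring the similarity of descriptions" concrete, here is a minimal sketch in plain Python of TF-IDF weighting followed by cosine similarity. It is an illustration, not the notebook's implementation: the notebook relies on scikit-learn's `TfidfVectorizer` with unigrams and bigrams and `min_df=2`, while this sketch uses unigrams only and the smoothed IDF formula that scikit-learn documents as its default. The helper names `tfidf_vectors` and `cosine` and the three toy descriptions are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # raw term count times smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

# toy descriptions (hypothetical, not taken from the dataset)
descriptions = [
    "marine sent to alien moon on a mission",
    "soldier sent to alien planet on a mission",
    "pirates sail to the edge of the world",
]
vectors = tfidf_vectors(descriptions)
sim_space = cosine(vectors[0], vectors[1])    # two descriptions sharing many words
sim_pirates = cosine(vectors[0], vectors[2])  # descriptions sharing almost nothing
print(round(sim_space, 3), round(sim_pirates, 3))
```

The first pair shares most of its words and therefore scores higher than the second pair, which is the property the recommender exploits.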




## Packages, functions and data

In [None]:
!pip install unidecode --quiet

import pandas as pd
from tqdm import tqdm
from prettytable import PrettyTable

def text_normalizer(data, language:str='english'):

    """
    Normalize text data. It takes as inputs:
        
        - an iterable of strings (for example, a pandas Series of descriptions) in the input 'data',
        - and the language in which the texts were written, in the input 'language', which defaults to English.

    It returns a list with one normalized string per input string.
    """
        
    import re
    import nltk
    import unidecode

    assert type(language) == str, "The 'language' must be a string"
        
    nltk.download('stopwords', quiet=True) # downloading stopwords
    stopword_list = nltk.corpus.stopwords.words(language) 
    urlRegex = re.compile(r'http\S+') # raw string, so '\S' is not treated as an invalid string escape

    def check_unique_char(rep, sent_text): # regex to keep only words and numbers
        convert = re.sub(r'[^a-zA-Z0-9\s]',rep,sent_text)
        return convert
    
    lista = []
    for dat in data:
                
        dat = ' '.join([word for word in dat.lower().split() if word not in stopword_list]) # lowercase and remove stopwords
        dat = unidecode.unidecode(dat) # replace diacritical marks before non-ASCII characters are stripped
        dat = check_unique_char('', dat) # remove special characters
        dat = ' '.join([word for word in dat.split() if not re.match(urlRegex, word)]) # remove URLs
        lista.append(dat)

    return lista

def similar_themes(text, title, titles):

    """
    Recommend similar titles from a list by measuring the similarity of their texts. This function takes three inputs:

    - the list of texts (one per title, in the same order as 'titles') in the 'text' input,
    - the title for which we want similar titles in the 'title' input,
    - and the array of titles from which the most similar ones are selected in the 'titles' input.

    It returns the ten most similar titles and the similarity of every title to the requested one.
    """
    
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2) # vectorizer instantiation
    tfidf = tfidf.fit_transform(text) # vectorizing text
    similarity_matrix = pd.DataFrame(cosine_similarity(tfidf)) # measure the similarity of the vectorized text

    indexes = np.where(titles == title)[0][0] # selecting index
    similarities = similarity_matrix.iloc[indexes].values # selecting similarity values 
    top_10 = np.argsort(-similarities)[:10] # taking the top 10 similarity values
    similar_themes = titles[top_10] # taking the 10 most similar titles
    
    return similar_themes, similarities

route = "/content/tmdb_5000_movies.csv"
df = pd.read_csv(route)
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


## Preprocessing

In [None]:
df.fillna("", inplace=True)
df['description'] = df['tagline'].map(str) + ' ' + df['overview'] # putting the tagline and overview of each movie in one column called 'description'
for row in tqdm(df.index): # cleaning the genres data
    
    a = df.loc[row,'genres'].replace("[","").replace("]","").replace("{","").replace("}","").replace('",','";"').replace('"',"").split(";")
    if len(a[0]) == 0:
        pass
    else:
        a = pd.DataFrame(a)
        a = a[0].str.split(',',expand=True)
        a = ', '.join(a[1].apply(lambda a: a.split("name: ")[1]))
        df.loc[row,'genres ii'] = a
        
norm_text = text_normalizer(df['description']) # normalizing data
norm_text

100%|██████████| 4803/4803 [00:12<00:00, 391.20it/s]


['enter world pandora 22nd century paraplegic marine dispatched moon pandora unique mission becomes torn following orders protecting alien civilization',
 'end world adventure begins captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems',
 'plan one escapes cryptic message bonds past sends trail uncover sinister organization battles political forces keep secret service alive bond peels back layers deceit reveal terrible truth behind spectre',
 'legend ends following death district attorney harvey dent batman assumes responsibility dents crimes protect late attorneys reputation subsequently hunted gotham city police department eight years later batman encounters mysterious selina kyle villainous bane new terrorist leader overwhelms gothams finest dark knight resurfaces protect city branded enemy',
 'lost world found another john carter warweary former military captain whos inexplicably transported mysterious exotic planet barsoom
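
The `genres` column appears to hold JSON text (the doubled quotes in the table above look like CSV escaping), so the chain of `replace` calls could probably be swapped for the standard `json` module. The following is a minimal sketch under that assumption; `genre_names` is a hypothetical helper that is not part of the notebook.

```python
import json

def genre_names(raw):
    """Turn the JSON text of the 'genres' column into 'Action, Adventure'."""
    if not raw:  # empty string left by fillna("")
        return ""
    # each item is expected to look like {"id": 28, "name": "Action"}
    return ", ".join(item["name"] for item in json.loads(raw))

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(genre_names(sample))      # prints: Action, Adventure
print(repr(genre_names("[]")))  # prints: ''
```

If the assumption holds, `df['genres ii'] = df['genres'].apply(genre_names)` would fill the column in one step. One difference from the loop above: movies without genres would get an empty string instead of being skipped.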

## Movie Recommender

In [None]:
titles = df['title'].values
most_popular = df.sort_values('popularity',ascending=False)['title'][:5]
print(f"Titles: \n {titles} \n\n Most popular movies: \n {most_popular}")

Titles: 
 ['Avatar' "Pirates of the Caribbean: At World's End" 'Spectre' ...
 'Signed, Sealed, Delivered' 'Shanghai Calling' 'My Date with Drew'] 

 Most popular movies: 
 546                    Minions
95                Interstellar
788                   Deadpool
94     Guardians of the Galaxy
127         Mad Max: Fury Road
Name: title, dtype: object


In [None]:
for movie in most_popular:
    themes, values = similar_themes(text = norm_text, title = movie, titles = titles) # movie recommender
    dfTemp = df.copy()
    dfTemp['value'] = values
    recommended = dfTemp[dfTemp['title'].isin(themes)].head(len(themes)) # rows of the recommended titles
    movie_genres = dfTemp[dfTemp['title'] == movie]['genres ii'].values[0].replace(" ","").split(",") # genres of the reference movie
    table = PrettyTable() # table instantiation
    table.title = f"Movie: {movie}" # movie
    table.field_names = ['Title','Similarity Value','Description Length','Genres','Genres Similarity'] # headers
    for _, rec in recommended.iterrows():
        rec_genres = rec['genres ii'].replace(" ","").split(",") # genres of the recommended movie
        table.add_row([
            rec['title'], # movie title
            round(rec['value']*100,2), # similarity value
            len(rec['description']), # length of the description
            rec['genres ii'], # genres
            [j for j in movie_genres if j in rec_genres] # similar genres
            ])
    table.sortby = 'Similarity Value'
    table.reversesort = True
    print(table)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                          Movie: Minions                                                                                         |
+--------------------------------------------------+------------------+--------------------+-----------------------------------------------------+------------------------------------------------+
|                      Title                       | Similarity Value | Description Length |                        Genres                       |               Genres Similarity                |
+--------------------------------------------------+------------------+--------------------+-----------------------------------------------------+------------------------------------------------+
|                   

The tables above show the recommendations for the five most popular movies in the dataset. Each table lists the ten movies whose descriptions are most similar to that of the popular movie.

The first column shows the title, the second the similarity value (cosine similarity multiplied by 100), the third the length in characters of the description (tagline plus overview) on which the similarity is based, the fourth the genres, and the fifth the genres that each recommended movie shares with the most popular movie.
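
The fifth column can be reproduced in isolation. The following is a minimal sketch that assumes genres are stored as comma-separated strings, as in the `genres ii` column; `shared_genres` is a hypothetical helper, not part of the notebook, and it trims surrounding spaces instead of removing every space as the loop above does.

```python
def shared_genres(reference, candidate):
    """Genres of `reference` that also appear in `candidate`, in reference order."""
    ref = [g.strip() for g in reference.split(",") if g.strip()]
    cand = {g.strip() for g in candidate.split(",")}
    return [g for g in ref if g in cand]

print(shared_genres("Action, Adventure, Fantasy", "Adventure, Action, Comedy"))
# prints ['Action', 'Adventure']
```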

We can conclude, first, that there isn't much text to process for each movie and, second, that in most cases the recommended movies have high similarity values (up to ten) and genres that are almost the same as those of the most popular movie.
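
One detail worth keeping in mind when reading the tables: `similar_themes` ranks every title, including the queried one, and since a description is maximally similar to itself the queried movie usually takes one of the ten slots. The following is a minimal sketch of how it could be skipped; `top_k_excluding_self` is a hypothetical helper, not part of the notebook, and the scores are made up for the example.

```python
def top_k_excluding_self(similarities, query_index, k=10):
    """Indices of the k highest scores, skipping the query's own position."""
    candidates = [i for i in range(len(similarities)) if i != query_index]
    # sort by descending similarity; ties keep their original order
    candidates.sort(key=lambda i: -similarities[i])
    return candidates[:k]

# made-up scores: position 0 is the queried movie compared with itself
scores = [1.0, 0.12, 0.40, 0.05, 0.33]
print(top_k_excluding_self(scores, query_index=0, k=3))
# prints [2, 4, 1]
```

The returned indices could then be used to index `titles` in the same way `top_10` is used inside `similar_themes`.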