# Building a movie recommendation system
In this assignment you will build a movie recommendation system that recommends movies similar to the one you search for. This is known as the content-based filtering approach to building recommendation systems. There are other methods, but in this assignment we practice only content-based filtering, and you will use what you have learned, such as k-nn and k-means, to implement the system. We will use the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata), which includes data about 5000 movies, for our movie recommendation system.

## The dataset
You will not use the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata) directly; instead you will use the smaller version that we have prepared for you. The smaller version has fewer columns than the original, has been pre-processed, and consists of a single file. Let's read the data with pandas using the cell below.

In [None]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/FUNIX Progress/MLP304x_0.1-A_EN/data/tmdb5000.csv")

df.head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview
0,Avatar,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",James Cameron,"['culture clash', 'future', 'space war', 'spac...","['Action', 'Adventure', 'Fantasy', 'Science Fi...","In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"['Johnny Depp', 'Orlando Bloom', 'Keira Knight...",Gore Verbinski,"['ocean', 'drug abuse', 'exotic island', 'east...","['Adventure', 'Fantasy', 'Action']","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"['Daniel Craig', 'Christoph Waltz', 'Léa Seydo...",Sam Mendes,"['spy', 'based on novel', 'secret agent', 'seq...","['Action', 'Adventure', 'Crime']",A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,"['Christian Bale', 'Michael Caine', 'Gary Oldm...",Christopher Nolan,"['dc comics', 'crime fighter', 'terrorist', 's...","['Action', 'Crime', 'Drama', 'Thriller']",Following the death of District Attorney Harve...
4,John Carter,"['Taylor Kitsch', 'Lynn Collins', 'Samantha Mo...",Andrew Stanton,"['based on novel', 'mars', 'medallion', 'space...","['Action', 'Adventure', 'Science Fiction']","John Carter is a war-weary, former military ca..."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     4803 non-null   object
 1   cast      4803 non-null   object
 2   director  4773 non-null   object
 3   keywords  4803 non-null   object
 4   genres    4803 non-null   object
 5   overview  4800 non-null   object
dtypes: object(6)
memory usage: 225.3+ KB


In [None]:
df.isna().sum()

title        0
cast         0
director    30
keywords     0
genres       0
overview     3
dtype: int64

### Filling na values
As you can see, our dataset has six columns. First, let's check whether any columns have na values and fill those values.

In [None]:
df.isna()

In [None]:
print(df.columns.tolist())
print(df.dtypes)
print("has na columns: ", df.columns[df.isna().any()].tolist())

['title', 'cast', 'director', 'keywords', 'genres', 'overview']
title       object
cast        object
director    object
keywords    object
genres      object
overview    object
dtype: object
has na columns:  ['director', 'overview']


Fill the na values in the **director** and **overview** columns with empty strings.

Note: object columns are usually the ones that contain na values.

In [None]:
df["director"] = df["director"].fillna("")
df["overview"] = df["overview"].fillna("")

df[df["director"] == ""].head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview
3661,Flying By,"['Billy Ray Cyrus', 'Heather Locklear', 'Ahnai...",,[],['Drama'],A real estate developer goes to his 25th high ...
3670,Running Forever,[],,[],['Family'],After being estranged since her mother's death...
3729,Paa,"['Amitabh Bachchan', 'Abhishek Bachchan', 'Vid...",,[],"['Drama', 'Family', 'Foreign']",He suffers from a progeria like syndrome. Ment...
3977,Boynton Beach Club,"['Brenda Vaccaro', 'Dyan Cannon', 'Joseph Bolo...",,['independent film'],"['Comedy', 'Drama', 'Romance']",A handful of men and women of a certain age pi...
4068,Sharkskin,[],,[],[],The Post War II story of Manhattan born Mike E...


In [None]:
df.isna().sum()

title       0
cast        0
director    0
keywords    0
genres      0
overview    0
dtype: int64

### Convert string array to array
The **cast**, **keywords**, and **genres** columns are currently stored as strings. We need to convert them into actual Python lists.

In [None]:
print(type(df["cast"][0]))

# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'keywords', 'genres']

for feature in features:
    df[feature] = df[feature].apply(literal_eval)
    
print(type(df["cast"][0]))

<class 'str'>
<class 'list'>
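
To make the conversion concrete, here is a minimal sketch of what `literal_eval` does to a single cell. The sample string is made up for illustration and only mimics the shape of the **cast** column.

```python
from ast import literal_eval

# A made-up cell value shaped like an entry of the cast column
raw = "['Sam Worthington', 'Zoe Saldana']"

# literal_eval only parses Python literals (lists, strings, numbers, ...),
# so it is safer than eval() for data read from a file
parsed = literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0])
```

After parsing, each cell can be indexed and iterated like any other list, which the later preprocessing steps rely on.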


## Recommend movies using overview
Let's use the **overview** column from the dataset to recommend movies. It is reasonable to recommend movies by their overview because people might like movies with similar plots.

### Represent movie overviews as vectors
We use **tf-idf** to transform our movie overviews into vectors, which we can then use to recommend similar movies. Let's use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from sklearn to get a matrix with one vector for each movie overview in our dataset. Complete the code below.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(df['overview'])

# You should get (4803, 20978)
tfidf_matrix.shape

(4803, 20978)
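
To see what the vectorizer produces, the following sketch runs it on three made-up overviews (they are not taken from the dataset). It relies on two defaults of `TfidfVectorizer`: rows are scaled to unit length (`norm='l2'`), and a word that occurs in fewer documents gets a larger weight.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews, used only for illustration
toy_overviews = [
    "a hero saves the city",
    "a hero saves the world",
    "a detective solves the murder",
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# One row per overview, one column per remaining word
print(toy_matrix.shape)

# Each row has unit length under the default norm='l2'
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)

# In the first overview, 'city' occurs in one document and 'hero' in two,
# so 'city' receives the larger weight
row0 = toy_matrix[0].toarray().ravel()
city_w = row0[toy_tfidf.vocabulary_['city']]
hero_w = row0[toy_tfidf.vocabulary_['hero']]
print(city_w > hero_w)
```

The same logic applies to the real matrix: each of the 4803 rows is one overview, and each of the 20978 columns is one word of the vocabulary.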

Now we compute the similarity between each pair of movies and save the results for the recommendation step later. There are several ways to compute similarity, such as cosine similarity and Euclidean distance. We do not know in advance which method is right for this task, so let's try both. We use [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) and [euclidean_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html) from sklearn to compute them. Complete the code below to compute the cosine similarity and Euclidean distance for each pair of movies. Because a larger Euclidean distance means less similar, the opposite of cosine similarity, we multiply the Euclidean distances by -1 for convenience to turn them into similarities.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

cosine_sims = cosine_similarity(tfidf_matrix, tfidf_matrix) 
euclid_distances = euclidean_distances(tfidf_matrix, tfidf_matrix) 

euclid_sims = -euclid_distances

# you should get (4803, 4803)
print(cosine_sims.shape)
print(euclid_sims.shape)

(4803, 4803)
(4803, 4803)
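
The remark that Euclidean distance is roughly the opposite of cosine similarity can be made precise for unit-length vectors, which is what `TfidfVectorizer` produces by default for non-empty text. For two unit vectors the squared distance equals 2 - 2 * cosine, so both measures rank neighbours in the same order. The sketch below checks this with random vectors and plain NumPy; the data is synthetic.

```python
import numpy as np

# Synthetic vectors, scaled to unit length like TF-IDF rows
rng = np.random.default_rng(0)
vectors = rng.random((5, 8))
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just the dot product
cos = unit @ unit.T

# Pairwise Euclidean distances via broadcasting
diff = unit[:, None, :] - unit[None, :, :]
dist = np.linalg.norm(diff, axis=2)

# Identity for unit vectors: dist^2 = 2 - 2 * cos
print(np.allclose(dist ** 2, 2 - 2 * cos))

# Ranking by highest cosine equals ranking by smallest distance
print(np.array_equal(np.argsort(-cos[0]), np.argsort(dist[0])))
```

The identity only holds when every vector has length 1; an all-zero vector breaks it.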


Let's create a lookup table from movie titles to movie indexes to use later.

In [None]:
title2index = pd.Series(df.index, index=df['title']).drop_duplicates()
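
One detail of this lookup is worth checking on a toy frame. `drop_duplicates()` on a Series compares the values, which here are row numbers and already unique, so a repeated title would still map to several rows. The sketch below uses made-up titles; whether the real dataset contains repeated titles is not checked here, and `first_only` is a hypothetical name.

```python
import pandas as pd

# Made-up titles with one repeat
toy = pd.DataFrame({"title": ["Batman", "Heat", "Batman"]})

lookup = pd.Series(toy.index, index=toy["title"])

# The values (row numbers) are unique, so nothing is dropped
print(len(lookup.drop_duplicates()))

# A repeated title returns a Series of row numbers, not a single number
print(type(lookup["Batman"]).__name__)

# Keeping the first row per title gives exactly one index per title
first_only = lookup[~lookup.index.duplicated(keep="first")]
print(int(first_only["Batman"]))
```

Deduplicating on the index, as in the last step, is one possible way to guarantee that a title lookup returns a single integer.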

Complete the **get_recommendations** function below

In [None]:
def get_recommendations(title, sims, title2index, df):
    """
        Get recommended movies for a movie title
        
        :param title: The query movie title
        :param sims: The similarity matrix for each pair of movies
        :param title2index: Title-to-index lookup table
        :param df: The dataset dataframe
        
        :return: The top 10 recommended movies
    """
    
    # get the movie index by the title
    idx = title2index[title]

    # get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(sims[idx]))

    # sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda tup: tup[1], reverse=True)
    
    # get the scores of the 10 most similar movies
    # we ignore the first one because it is the same movie
    sim_scores = sim_scores[1:11]

    # get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # return the top 10 most similar movies
    return df.iloc[movie_indices]
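
The function above skips the query movie by position, assuming it always sorts first. As an alternative sketch, the query can be removed by index, and NumPy can do the sorting. The name `top_k_similar` and the toy matrix are hypothetical and not part of the assignment.

```python
import numpy as np

def top_k_similar(sims, idx, k=10):
    """Return the indices of the k most similar rows to row idx."""
    scores = np.asarray(sims[idx], dtype=float)
    # Sort from the highest score to the lowest; stable keeps ties in index order
    order = np.argsort(-scores, kind="stable")
    # Remove the query by index instead of assuming it is in first position
    order = order[order != idx]
    return order[:k].tolist()

# Made-up symmetric similarity matrix for four movies
toy_sims = np.array([
    [1.0, 0.2, 0.9, 0.2],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.2, 0.3, 0.4, 1.0],
])

print(top_k_similar(toy_sims, idx=0, k=2))
```

The returned list can be passed to `df.iloc[...]` in the same way as `movie_indices` in the function above.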

In [None]:
get_recommendations('The Dark Knight', cosine_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman, A...",Christopher Nolan,"[dc comics, crime fighter, terrorist, secret i...","[Action, Crime, Drama, Thriller]",Following the death of District Attorney Harve...
428,Batman Returns,"[Michael Keaton, Danny DeVito, Michelle Pfeiff...",Tim Burton,"[holiday, corruption, double life, dc comics]","[Action, Fantasy]","Having defeated the Joker, Batman now faces th..."
3854,"Batman: The Dark Knight Returns, Part 2","[Peter Weller, Ariel Winter, David Selby, Mich...",Jay Oliva,"[dc comics, future, joker, robin]","[Action, Animation]",Batman has stopped the reign of terror that Th...
299,Batman Forever,"[Val Kilmer, Tommy Lee Jones, Jim Carrey, Nico...",Joel Schumacher,"[riddle, dc comics, rose, gotham city]","[Action, Crime, Fantasy]",The Dark Knight of Gotham City confronts a das...
1359,Batman,"[Jack Nicholson, Michael Keaton, Kim Basinger,...",Tim Burton,"[double life, dc comics, dual identity, chemical]","[Fantasy, Action]",The Dark Knight of Gotham City begins his war ...
119,Batman Begins,"[Christian Bale, Michael Caine, Liam Neeson, K...",Christopher Nolan,"[himalaya, martial arts, dc comics, crime figh...","[Action, Crime, Drama]","Driven by tragedy, billionaire Bruce Wayne ded..."
1181,JFK,"[Kevin Costner, Tommy Lee Jones, Gary Oldman, ...",Oliver Stone,"[assassination, cia, homophobia, new orleans]","[Drama, Thriller, History]",New Orleans District Attorney Jim Garrison dis...
9,Batman v Superman: Dawn of Justice,"[Ben Affleck, Henry Cavill, Gal Gadot, Amy Adams]",Zack Snyder,"[dc comics, vigilante, superhero, based on com...","[Action, Adventure, Fantasy]",Fearing the actions of a god-like Super Hero l...
2507,Slow Burn,"[Ray Liotta, LL Cool J, Mekhi Phifer, Jolene B...",Wayne Beach,[],"[Mystery, Crime, Drama, Thriller]",A district attorney (Ray Liotta) is involved i...
210,Batman & Robin,"[George Clooney, Chris O'Donnell, Arnold Schwa...",Joel Schumacher,"[double life, dc comics, dual identity, crime ...","[Action, Crime, Fantasy]",Along with crime-fighting partner Robin and ne...


In [None]:
get_recommendations('The Avengers', cosine_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
7,Avengers: Age of Ultron,"[Robert Downey Jr., Chris Hemsworth, Mark Ruff...",Joss Whedon,"[marvel comic, sequel, superhero, based on com...","[Action, Adventure, Science Fiction]",When Tony Stark tries to jumpstart a dormant p...
3144,Plastic,"[Ed Speleers, Will Poulter, Alfie Allen, Sebas...",Julian Gilbey,[],"[Drama, Action, Comedy, Crime]",Sam &amp; Fordy run a credit card fraud scheme...
1715,Timecop,"[Jean-Claude Van Damme, Mia Sara, Ron Silver, ...",Peter Hyams,"[martial arts, time travel, science fiction, a...","[Thriller, Science Fiction, Action, Crime]",An officer for a security agency that regulate...
4124,This Thing of Ours,"[James Caan, James Caan, Frank Vincent, Vincen...",Danny Provenzano,[heist mafia internet],"[Drama, Action, Thriller]","Using the Internet and global satellites, a gr..."
3311,Thank You for Smoking,"[Aaron Eckhart, Maria Bello, Cameron Bright, A...",Jason Reitman,"[father son relationship, capitalism, based on...","[Comedy, Drama]",The chief spokesperson and lobbyist Nick Naylo...
3033,The Corruptor,"[Mark Wahlberg, Chow Yun-fat, Byron Mann, Kim ...",James Foley,"[new york, life and death, gang war, police]","[Action, Crime, Mystery, Thriller]","Danny is a young cop partnered with Nick, a se..."
588,Wall Street: Money Never Sleeps,"[Michael Douglas, Shia LaBeouf, Josh Brolin, C...",Oliver Stone,[duringcreditsstinger],"[Drama, Crime]",As the global economy teeters on the brink of ...
2136,Team America: World Police,"[Trey Parker, Matt Stone, Kristen Miller, Masa...",Trey Parker,"[paris, france, cairo, capitalism]","[Music, Adventure, Animation, Action]",Team America World Police follows an internati...
1468,The Fountain,"[Hugh Jackman, Rachel Weisz, Ellen Burstyn, Ma...",Darren Aronofsky,"[brain tumor, operation, queen, love of one's ...","[Drama, Adventure, Science Fiction, Romance]","Spanning over one thousand years, and three pa..."
1286,Snowpiercer,"[Chris Evans, Song Kang-ho, Ed Harris, John Hurt]",Bong Joon-ho,"[father son relationship, child labour, brothe...","[Action, Science Fiction, Drama]",In a future where a failed global-warming expe...


In [None]:
get_recommendations('The Dark Knight', euclid_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
2656,Chiamatemi Francesco - Il Papa della gente,"[Rodrigo de la Serna, Sergio Hernández, Àlex B...",Daniele Luchetti,"[pope, biography]",[Drama],
4140,"To Be Frank, Sinatra at 100",[Tony Oppedisano],Simon Napier-Bell,"[music, actors, legendary perfomer, classic ho...",[Documentary],
4401,The Helix... Loaded,[],,[],"[Action, Comedy, Science Fiction]",
4431,Food Chains,[],Sanjay Rawal,[],[Documentary],
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman, A...",Christopher Nolan,"[dc comics, crime fighter, terrorist, secret i...","[Action, Crime, Drama, Thriller]",Following the death of District Attorney Harve...
428,Batman Returns,"[Michael Keaton, Danny DeVito, Michelle Pfeiff...",Tim Burton,"[holiday, corruption, double life, dc comics]","[Action, Fantasy]","Having defeated the Joker, Batman now faces th..."
3854,"Batman: The Dark Knight Returns, Part 2","[Peter Weller, Ariel Winter, David Selby, Mich...",Jay Oliva,"[dc comics, future, joker, robin]","[Action, Animation]",Batman has stopped the reign of terror that Th...
299,Batman Forever,"[Val Kilmer, Tommy Lee Jones, Jim Carrey, Nico...",Joel Schumacher,"[riddle, dc comics, rose, gotham city]","[Action, Crime, Fantasy]",The Dark Knight of Gotham City confronts a das...
1359,Batman,"[Jack Nicholson, Michael Keaton, Kim Basinger,...",Tim Burton,"[double life, dc comics, dual identity, chemical]","[Fantasy, Action]",The Dark Knight of Gotham City begins his war ...
119,Batman Begins,"[Christian Bale, Michael Caine, Liam Neeson, K...",Christopher Nolan,"[himalaya, martial arts, dc comics, crime figh...","[Action, Crime, Drama]","Driven by tragedy, billionaire Bruce Wayne ded..."


In [None]:
get_recommendations('The Avengers', euclid_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
2656,Chiamatemi Francesco - Il Papa della gente,"[Rodrigo de la Serna, Sergio Hernández, Àlex B...",Daniele Luchetti,"[pope, biography]",[Drama],
4140,"To Be Frank, Sinatra at 100",[Tony Oppedisano],Simon Napier-Bell,"[music, actors, legendary perfomer, classic ho...",[Documentary],
4401,The Helix... Loaded,[],,[],"[Action, Comedy, Science Fiction]",
4431,Food Chains,[],Sanjay Rawal,[],[Documentary],
7,Avengers: Age of Ultron,"[Robert Downey Jr., Chris Hemsworth, Mark Ruff...",Joss Whedon,"[marvel comic, sequel, superhero, based on com...","[Action, Adventure, Science Fiction]",When Tony Stark tries to jumpstart a dormant p...
3144,Plastic,"[Ed Speleers, Will Poulter, Alfie Allen, Sebas...",Julian Gilbey,[],"[Drama, Action, Comedy, Crime]",Sam &amp; Fordy run a credit card fraud scheme...
1715,Timecop,"[Jean-Claude Van Damme, Mia Sara, Ron Silver, ...",Peter Hyams,"[martial arts, time travel, science fiction, a...","[Thriller, Science Fiction, Action, Crime]",An officer for a security agency that regulate...
4124,This Thing of Ours,"[James Caan, James Caan, Frank Vincent, Vincen...",Danny Provenzano,[heist mafia internet],"[Drama, Action, Thriller]","Using the Internet and global satellites, a gr..."
3311,Thank You for Smoking,"[Aaron Eckhart, Maria Bello, Cameron Bright, A...",Jason Reitman,"[father son relationship, capitalism, based on...","[Comedy, Drama]",The chief spokesperson and lobbyist Nick Naylo...
3033,The Corruptor,"[Mark Wahlberg, Chow Yun-fat, Byron Mann, Kim ...",James Foley,"[new york, life and death, gang war, police]","[Action, Crime, Mystery, Thriller]","Danny is a young cop partnered with Nick, a se..."


If you implemented the function correctly, for the cosine similarity case you should get **The Dark Knight Rises** and **Batman Returns** as the top 2 for **The Dark Knight**, and **Avengers: Age of Ultron** and **Plastic** for **The Avengers**.
<br>
But in the Euclidean case, the top 4 recommended movies are nothing like the query movie. The problem is that the **overview** of each of these 4 movies is empty, so its tf-idf vector is all zeros. We can ignore the top 4 movies and start from the 5th, from which point the results are much more similar to the query movie.
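
Why do empty overviews rank so high under Euclidean distance? A non-empty tf-idf row has length 1, and an empty overview becomes an all-zero row, so the distance between them is exactly 1. Two unit-length rows with cosine similarity c are sqrt(2 - 2c) apart, which is larger than 1 whenever c is below 0.5. The sketch below illustrates this with hand-picked vectors; the numbers are made up and only the geometry matters.

```python
import numpy as np

query = np.array([0.6, 0.8, 0.0])    # unit length, like a TF-IDF row
strong = np.array([0.0, 0.8, 0.6])   # cosine with query = 0.64
weak = np.array([0.0, 0.6, 0.8])     # cosine with query = 0.48
empty = np.zeros(3)                  # what an empty overview turns into

def euclid(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

print("empty :", euclid(query, empty))
print("strong:", euclid(query, strong))
print("weak  :", euclid(query, weak))
```

In this sketch the weakly related vector ends up farther away than the empty one, while the strongly related vector stays closer. Cosine similarity does not have this problem because the similarity to an all-zero row is 0, the lowest value that non-negative tf-idf rows can produce.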

## Recommend movies using multiple features
Previously, we used only the overview of each movie to recommend similar movies. Many people might also like other movies because they share the same genres or come from the same director, actors, or actresses.

We will use the **cast**, **director**, **keywords**, and **genres** features from our dataset to recommend movies. First, let's preprocess the data. Complete the code below to apply the **clean_data** function to the features. Here, we lowercase every value; in the list columns we replace the space character with an underscore, and in the director column we remove the spaces, so that each person's name is preserved as a single token.

In [None]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "_")) for i in x]
    else:
        # check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
        
# apply clean_data function to your features.
features = ['cast', 'director', 'keywords', 'genres']

for feature in features:
    df[feature] = df[feature].apply(clean_data) 

In [None]:
df.head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview
0,Avatar,"[sam_worthington, zoe_saldana, sigourney_weave...",jamescameron,"[culture_clash, future, space_war, space_colony]","[action, adventure, fantasy, science_fiction]","In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"[johnny_depp, orlando_bloom, keira_knightley, ...",goreverbinski,"[ocean, drug_abuse, exotic_island, east_india_...","[adventure, fantasy, action]","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"[daniel_craig, christoph_waltz, léa_seydoux, r...",sammendes,"[spy, based_on_novel, secret_agent, sequel]","[action, adventure, crime]",A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...
4,John Carter,"[taylor_kitsch, lynn_collins, samantha_morton,...",andrewstanton,"[based_on_novel, mars, medallion, space_travel]","[action, adventure, science_fiction]","John Carter is a war-weary, former military ca..."


Let's combine the four features into one feature. Complete the code below.

In [None]:
def combine(x):
    return ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['keywords']) + ' ' + ' '.join(x['genres'])
df['combined'] = df.apply(combine, axis=1)

In [None]:
df.head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview,combined
0,Avatar,"[sam_worthington, zoe_saldana, sigourney_weave...",jamescameron,"[culture_clash, future, space_war, space_colony]","[action, adventure, fantasy, science_fiction]","In the 22nd century, a paraplegic Marine is di...",sam_worthington zoe_saldana sigourney_weaver s...
1,Pirates of the Caribbean: At World's End,"[johnny_depp, orlando_bloom, keira_knightley, ...",goreverbinski,"[ocean, drug_abuse, exotic_island, east_india_...","[adventure, fantasy, action]","Captain Barbossa, long believed to be dead, ha...",johnny_depp orlando_bloom keira_knightley stel...
2,Spectre,"[daniel_craig, christoph_waltz, léa_seydoux, r...",sammendes,"[spy, based_on_novel, secret_agent, sequel]","[action, adventure, crime]",A cryptic message from Bond’s past sends him o...,daniel_craig christoph_waltz léa_seydoux ralph...
3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...,christian_bale michael_caine gary_oldman anne_...
4,John Carter,"[taylor_kitsch, lynn_collins, samantha_morton,...",andrewstanton,"[based_on_novel, mars, medallion, space_travel]","[action, adventure, science_fiction]","John Carter is a war-weary, former military ca...",taylor_kitsch lynn_collins samantha_morton wil...


### Represent the combined feature as vectors
We do not use **tf-idf** for this feature because tf-idf down-weights terms that appear in many documents, and a director who appears in many movies is not less important for that reason. We use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) instead.
<br>
Complete the code below to get the **count_matrix**.

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['combined'])
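
The sketch below shows, on two made-up combined strings, why the names were joined with underscores or stripped of spaces: the default tokenizer of `CountVectorizer` treats the underscore as a word character, so each person stays one token, and every token is counted with the same weight no matter how many rows it appears in.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up combined strings in the same style as the 'combined' column
toy_combined = [
    "christian_bale michael_caine christophernolan action crime",
    "christian_bale hugh_jackman christophernolan drama",
]

toy_count = CountVectorizer(stop_words='english')
toy_counts = toy_count.fit_transform(toy_combined)

# Each underscore-joined name is a single vocabulary entry
print(sorted(toy_count.vocabulary_))

# Raw counts: the shared director counts 1 in both rows
director_col = toy_count.vocabulary_['christophernolan']
print(toy_counts[:, director_col].toarray().ravel())
```

Had the names kept their spaces, the first and last names would become separate tokens, and two different people sharing a first name would look related.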

Complete the code below to calculate the cosine similarities, euclidean distances using the **count_matrix**

In [None]:
cosine_sims2 = cosine_similarity(count_matrix, count_matrix)
euclid_distances2 = euclidean_distances(count_matrix, count_matrix)

euclid_sims2 = -euclid_distances2

# you should get (4803, 4803)
print(cosine_sims2.shape)
print(euclid_sims2.shape)

(4803, 4803)
(4803, 4803)


In [None]:
df = df.reset_index()
indices = pd.Series(df.index, index=df['title'])

Let's check out the recommendations

In [None]:
get_recommendations('The Dark Knight', cosine_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
3,3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...,christian_bale michael_caine gary_oldman anne_...
119,119,Batman Begins,"[christian_bale, michael_caine, liam_neeson, k...",christophernolan,"[himalaya, martial_arts, dc_comics, crime_figh...","[action, crime, drama]","Driven by tragedy, billionaire Bruce Wayne ded...",christian_bale michael_caine liam_neeson katie...
4638,4638,Amidst the Devil's Wings,[],,[],"[drama, action, crime]","Prequel to ""5th of a Degree.""",drama action crime
1196,1196,The Prestige,"[hugh_jackman, christian_bale, michael_caine, ...",christophernolan,"[competition, secret, obsession, magic]","[drama, mystery, thriller]",A mysterious story of two magicians whose inte...,hugh_jackman christian_bale michael_caine scar...
3332,3332,Harry Brown,"[michael_caine, emily_mortimer, iain_glen, lee...",danielbarber,"[self-defense, widower]","[thriller, crime, drama, action]",An elderly ex-serviceman and widower looks to ...,michael_caine emily_mortimer iain_glen lee_oak...
4099,4099,Harsh Times,"[christian_bale, freddy_rodríguez, eva_longori...",davidayer,"[watching_a_movie, playing_pool, vinegar]","[crime, drama, thriller, action]",Jim Davis is an ex-Army Ranger who finds himse...,christian_bale freddy_rodríguez eva_longoria c...
2398,2398,Hitman,"[timothy_olyphant, dougray_scott, olga_kurylen...",xaviergens,"[assassin, secret_identity, intelligence, sovi...","[action, crime, drama, thriller]","The best-selling videogame, Hitman, roars to l...",timothy_olyphant dougray_scott olga_kurylenko ...
3359,3359,In Too Deep,"[omar_epps, ll_cool_j, nia_long, stanley_tucci]",michaelrymer,[],"[drama, action, thriller, crime]",A fearless cop is taking on a ruthless crimelo...,omar_epps ll_cool_j nia_long stanley_tucci mic...
1503,1503,Takers,"[chris_brown, hayden_christensen, matt_dillon,...",johnluessenhop,[heist],"[action, crime, drama, thriller]","A seasoned team of bank robbers, including Gor...",chris_brown hayden_christensen matt_dillon mic...
1986,1986,Faster,"[dwayne_johnson, billy_bob_thornton, maggie_gr...","georgetillman,jr.",[],"[crime, drama, action, thriller]",Driver (Dwayne Johnson) has spent the last 10 ...,dwayne_johnson billy_bob_thornton maggie_grace...


In [None]:
get_recommendations('The Avengers', cosine_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
7,7,Avengers: Age of Ultron,"[robert_downey_jr., chris_hemsworth, mark_ruff...",josswhedon,"[marvel_comic, sequel, superhero, based_on_com...","[action, adventure, science_fiction]",When Tony Stark tries to jumpstart a dormant p...,robert_downey_jr. chris_hemsworth mark_ruffalo...
26,26,Captain America: Civil War,"[chris_evans, robert_downey_jr., scarlett_joha...",anthonyrusso,"[civil_war, war, marvel_comic, sequel]","[adventure, action, science_fiction]","Following the events of Age of Ultron, the col...",chris_evans robert_downey_jr. scarlett_johanss...
79,79,Iron Man 2,"[robert_downey_jr., gwyneth_paltrow, don_chead...",jonfavreau,"[malibu, marvel_comic, superhero, based_on_com...","[adventure, action, science_fiction]",With the world now aware of his dual life as t...,robert_downey_jr. gwyneth_paltrow don_cheadle ...
174,174,The Incredible Hulk,"[edward_norton, liv_tyler, tim_roth, william_h...",louisleterrier,"[new_york, rio_de_janeiro, marvel_comic, super...","[science_fiction, action, adventure]",Scientist Bruce Banner scours the planet for a...,edward_norton liv_tyler tim_roth william_hurt ...
85,85,Captain America: The Winter Soldier,"[chris_evans, samuel_l._jackson, scarlett_joha...",anthonyrusso,"[washington_d.c., future, shield, marvel_comic]","[action, adventure, science_fiction]",After the cataclysmic events in New York with ...,chris_evans samuel_l._jackson scarlett_johanss...
68,68,Iron Man,"[robert_downey_jr., terrence_howard, jeff_brid...",jonfavreau,"[middle_east, arms_dealer, malibu, marvel_comic]","[action, science_fiction, adventure]","After being held captive in an Afghan cave, bi...",robert_downey_jr. terrence_howard jeff_bridges...
126,126,Thor: The Dark World,"[chris_hemsworth, natalie_portman, tom_hiddles...",alantaylor,"[marvel_comic, superhero, based_on_comic_book,...","[action, adventure, fantasy]",Thor fights to restore order across the cosmos...,chris_hemsworth natalie_portman tom_hiddleston...
129,129,Thor,"[chris_hemsworth, natalie_portman, tom_hiddles...",kennethbranagh,"[new_mexico, banishment, shield, marvel_comic]","[adventure, fantasy, action]","Against his father Odin's will, The Mighty Tho...",chris_hemsworth natalie_portman tom_hiddleston...
169,169,Captain America: The First Avenger,"[chris_evans, hugo_weaving, tommy_lee_jones, h...",joejohnston,"[new_york, usa, world_war_ii, nazis]","[action, adventure, science_fiction]","Predominantly set during World War II, Steve R...",chris_evans hugo_weaving tommy_lee_jones hayle...
182,182,Ant-Man,"[paul_rudd, michael_douglas, evangeline_lilly,...",peytonreed,"[marvel_comic, superhero, based_on_comic_book,...","[science_fiction, action, adventure]",Armed with the astonishing ability to shrink i...,paul_rudd michael_douglas evangeline_lilly cor...


In [None]:
get_recommendations('The Dark Knight', euclid_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
3,3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...,christian_bale michael_caine gary_oldman anne_...
119,119,Batman Begins,"[christian_bale, michael_caine, liam_neeson, k...",christophernolan,"[himalaya, martial_arts, dc_comics, crime_figh...","[action, crime, drama]","Driven by tragedy, billionaire Bruce Wayne ded...",christian_bale michael_caine liam_neeson katie...
4638,4638,Amidst the Devil's Wings,[],,[],"[drama, action, crime]","Prequel to ""5th of a Degree.""",drama action crime
4068,4068,Sharkskin,[],,[],[],The Post War II story of Manhattan born Mike E...,
4118,4118,Hum To Mohabbat Karega,[],,[],[],"Raju, a waiter, is in love with the famous TV ...",
4314,4314,Crowsnest,[],,[],[],"In late summer of 2011, five young friends on ...",
4458,4458,Harrison Montgomery,[],,[],[],Film from Daniel Davila,
4504,4504,Light from the Darkroom,[],,[],[],Light in the Darkroom is the story of two best...,
4553,4553,America Is Still the Place,[],,[],[],1971 post civil rights San Francisco seemed li...,
4562,4562,The Little Ponderosa Zoo,[],,[],[],The Little Ponderosa Zoo is preparing for thei...,


In [None]:
get_recommendations('The Avengers', euclid_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
7,7,Avengers: Age of Ultron,"[robert_downey_jr., chris_hemsworth, mark_ruff...",josswhedon,"[marvel_comic, sequel, superhero, based_on_com...","[action, adventure, science_fiction]",When Tony Stark tries to jumpstart a dormant p...,robert_downey_jr. chris_hemsworth mark_ruffalo...
4401,4401,The Helix... Loaded,[],,[],"[action, comedy, science_fiction]",,action comedy science_fiction
26,26,Captain America: Civil War,"[chris_evans, robert_downey_jr., scarlett_joha...",anthonyrusso,"[civil_war, war, marvel_comic, sequel]","[adventure, action, science_fiction]","Following the events of Age of Ultron, the col...",chris_evans robert_downey_jr. scarlett_johanss...
79,79,Iron Man 2,"[robert_downey_jr., gwyneth_paltrow, don_chead...",jonfavreau,"[malibu, marvel_comic, superhero, based_on_com...","[adventure, action, science_fiction]",With the world now aware of his dual life as t...,robert_downey_jr. gwyneth_paltrow don_cheadle ...
174,174,The Incredible Hulk,"[edward_norton, liv_tyler, tim_roth, william_h...",louisleterrier,"[new_york, rio_de_janeiro, marvel_comic, super...","[science_fiction, action, adventure]",Scientist Bruce Banner scours the planet for a...,edward_norton liv_tyler tim_roth william_hurt ...
4068,4068,Sharkskin,[],,[],[],The Post War II story of Manhattan born Mike E...,
4118,4118,Hum To Mohabbat Karega,[],,[],[],"Raju, a waiter, is in love with the famous TV ...",
4314,4314,Crowsnest,[],,[],[],"In late summer of 2011, five young friends on ...",
4458,4458,Harrison Montgomery,[],,[],[],Film from Daniel Davila,
4504,4504,Light from the Darkroom,[],,[],[],Light in the Darkroom is the story of two best...,


If you implemented everything correctly, the cosine similarity results are pretty reasonable. We get **The Dark Knight Rises**, **Batman Begins**, and **The Prestige** for **The Dark Knight**; they are all directed by **Christopher Nolan**. We get **Avengers: Age of Ultron**, **Captain America: Civil War**, and **Iron Man 2** for **The Avengers**; they all have **Robert Downey Jr.**, a very popular actor, in the cast.

## Clustering movies using k-means

We used k-nn to find similar movies with cosine similarity and Euclidean distance in the previous sections. In this section, let's use the k-means algorithm to cluster the movies. We will use the [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) class from sklearn.

Complete the **kmeans_tfidf** and **kmeans_count** variables using the **KMeans** class with the following parameters:
<br>
**n_clusters**: **500**
<br>
**random_state**: **96**
<br>
**max_iter**: **1000**
<br>
Remember to use the **fit** function on the **tfidf_matrix** and **count_matrix** variables.

In [None]:
from sklearn.cluster import KMeans

kmeans_tfidf = KMeans(n_clusters=500, random_state=96, max_iter=1000).fit(tfidf_matrix)
kmeans_count = KMeans(n_clusters=500, random_state=96, max_iter=1000).fit(count_matrix)
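To make the role of **fit** and the resulting **labels_** attribute more concrete, here is a minimal sketch on a toy corpus. The four `toy_docs` strings and the names `toy_matrix` and `toy_kmeans` are assumptions made up for illustration and are not part of the assignment data. `n_init=10` is set explicitly only so the sketch behaves the same across sklearn versions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, written in the same style as the "combined" column
toy_docs = [
    "christian_bale christophernolan dc_comics action crime",
    "christian_bale christophernolan dc_comics action drama",
    "robert_downey_jr marvel_comic superhero action adventure",
    "chris_evans marvel_comic superhero action adventure",
]

# Sparse TF-IDF matrix with one row per document
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# fit() accepts the sparse matrix directly
toy_kmeans = KMeans(n_clusters=2, random_state=96, n_init=10, max_iter=1000).fit(toy_matrix)

# labels_ holds one cluster label per row, in the same order as the input rows
print(toy_kmeans.labels_)
```

The position of each entry in **labels_** is the row index of the corresponding document, which is what makes it possible to map a movie index to a cluster later on.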

Read the **get_cluster_movies** and **get_recommendations_kmean** functions below and complete them.

In [None]:
def get_cluster_movies(labels):
    """
        Create a movie look-up table keyed by cluster label
        :param labels: Array of cluster labels, indexed by movie index
        :return: A dictionary whose keys are cluster labels and whose values are lists of movie indexes
    """
    res_dict = {}
    for i, l in enumerate(labels):
        if l not in res_dict:
            res_dict[l] = []
        res_dict[l].append(i)
    return res_dict

def get_recommendations_kmean(title, title2index, cluster_dict, labels):
    """
        Get the movies in the same cluster given a movie title
        
        :param title: The title of the query movie
        :param title2index: Title to index look up table
        :param cluster_dict: A dictionary for looking up movies in the same cluster
        :param labels: Array of cluster labels, indexed by movie index
        :return: A DataFrame with at most 10 movies from the query movie's cluster
    """
    
    idx = title2index[title]
    
    # cluster_dict is keyed by cluster label, so map the movie index to its label first
    cluster_label = labels[idx]

    # Get at most 10 movie indices from that cluster (this may include the query movie itself)
    movie_indices = cluster_dict[cluster_label][:10]

    # Return the movies that share a cluster with the query movie
    return df.iloc[movie_indices]
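The lookup these two functions are meant to perform can be sketched on a toy label array: a movie index is first mapped to its cluster label, and that label is then used as the dictionary key. The names `group_indices_by_label`, `toy_labels`, and `query_idx` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

def group_indices_by_label(labels):
    """Group item indices by their cluster label (hypothetical helper)."""
    clusters = defaultdict(list)
    for movie_idx, label in enumerate(labels):
        clusters[label].append(movie_idx)
    return dict(clusters)

# Toy labels: position = movie index, value = cluster label
toy_labels = [2, 0, 2, 1, 0, 2]
toy_clusters = group_indices_by_label(toy_labels)

# Look up the cluster of movie 2, then fetch every movie in that cluster
query_idx = 2
same_cluster = toy_clusters[toy_labels[query_idx]]

print(toy_clusters)   # {2: [0, 2, 5], 0: [1, 4], 1: [3]}
print(same_cluster)   # [0, 2, 5]
```

Note that the dictionary is keyed by label, not by movie index, so indexing it directly with `query_idx` would return an unrelated cluster.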

Let's check the results.

In [None]:
tfidf_clusters = get_cluster_movies(kmeans_tfidf.labels_)
count_clusters = get_cluster_movies(kmeans_count.labels_)

In [None]:
get_recommendations_kmean('The Dark Knight', title2index, tfidf_clusters, kmeans_tfidf.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
1596,1596,Sicario,"[emily_blunt, benicio_del_toro, josh_brolin, v...",denisvilleneuve,"[mexico, cia, smoking, texas]","[action, crime, drama, mystery]",A young female FBI agent joins a secret CIA op...,emily_blunt benicio_del_toro josh_brolin victo...
1701,1701,Once Upon a Time in Mexico,"[antonio_banderas, salma_hayek, johnny_depp, e...",robertrodriguez,"[corruption, cia]",[action],"Hitman ""El Mariachi"" becomes involved in inter...",antonio_banderas salma_hayek johnny_depp eva_m...
1973,1973,Double Take,"[orlando_jones, eddie_griffin, garcelle_beauva...",georgegallo,"[mexico, cia, fbi, train]","[adventure, drama, action, comedy]",The governor of a Mexican state is assassinate...,orlando_jones eddie_griffin garcelle_beauvais ...
2893,2893,Trade,"[kevin_kline, cesar_ramos, paulina_gaitán, al...",marcokreuzpaintner,"[usa, sex, brother_sister_relationship, mexico...","[drama, thriller]","A Texas cop (Kevin Kline), whose own daughter ...",kevin_kline cesar_ramos paulina_gaitán alicja...
3477,3477,The Guard,"[brendan_gleeson, don_cheadle, liam_cunningham...",johnmichaelmcdonagh,"[prostitute, blackmail, drug_smuggle, rural_ir...","[action, comedy, thriller, crime]",Two policemen must join forces to take on an i...,brendan_gleeson don_cheadle liam_cunningham ma...
4068,4068,Sharkskin,[],,[],[],The Post War II story of Manhattan born Mike E...,


In [None]:
get_recommendations_kmean('The Avengers', title2index, tfidf_clusters, kmeans_tfidf.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
349,349,The Secret Life of Walter Mitty,"[ben_stiller, kristen_wiig, patton_oswalt, shi...",benstiller,"[himalaya, photographer, magazine, iceland]","[adventure, comedy, drama, fantasy]",A timid magazine photo manager who lives life ...,ben_stiller kristen_wiig patton_oswalt shirley...
942,942,The Book of Life,"[diego_luna, channing_tatum, zoe_saldana, chri...",jorger.gutierrez,"[love_triangle, afterlife, day_of_the_dead, bu...","[romance, animation, adventure, comedy]","The journey of Manolo, a young man who is torn...",diego_luna channing_tatum zoe_saldana christin...
1736,1736,Our Brand Is Crisis,"[sandra_bullock, anthony_mackie, billy_bob_tho...",davidgordongreen,"[bolivia, woman, political_campaign, south_ame...","[comedy, drama]","A feature film based on the documentary ""Our B...",sandra_bullock anthony_mackie billy_bob_thornt...
2268,2268,Three Kingdoms: Resurrection of the Dragon,"[sammo_hung, vanness_wu, maggie_q, andy_lau]",daniellee,"[warrior_woman, number_in_title]","[action, history, drama]",The aging Zhao embarks on his final and greate...,sammo_hung vanness_wu maggie_q andy_lau daniel...
2979,2979,Bring It On,"[kirsten_dunst, jesse_bradford, eliza_dushku, ...",peytonreed,"[cheerleader, sport, high_school, teenage_girl]",[comedy],The Toro cheerleading squad from Rancho Carne ...,kirsten_dunst jesse_bradford eliza_dushku gabr...
4292,4292,Super Troopers,"[jay_chandrasekhar, steve_lemme, kevin_heffern...",jaychandrasekhar,"[alcohol, radio, police_chief, highway]","[comedy, crime, mystery]","Five bored, occasionally high and always ineff...",jay_chandrasekhar steve_lemme kevin_heffernan ...
4544,4544,The Living Wake,"[jesse_eisenberg, mike_o'connell, jim_gaffigan...",soltryon,[independent_film],[comedy],“The Living Wake” is a dark comedy set in a ti...,jesse_eisenberg mike_o'connell jim_gaffigan an...


In [None]:
get_recommendations_kmean('The Dark Knight', title2index, count_clusters, kmeans_count.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
72,72,Suicide Squad,"[will_smith, margot_robbie, joel_kinnaman, vio...",davidayer,"[dc_comics, shared_universe, anti_hero, secret...","[action, adventure, crime, fantasy]","From DC Comics comes the Suicide Squad, an ant...",will_smith margot_robbie joel_kinnaman viola_d...


In [None]:
get_recommendations_kmean('The Avengers', title2index, count_clusters, kmeans_count.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
751,751,Duplicity,"[clive_owen, julia_roberts, paul_giamatti, tom...",tonygilroy,[spy],"[romance, comedy, crime]",Two romantically-engaged corporate spies team ...,clive_owen julia_roberts paul_giamatti tom_wil...


As you can see, the results are fairly reasonable. We encourage you to fine-tune the k-means parameters to see different results.
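One possible way to compare parameter settings is to fit **KMeans** for several values of **n_clusters** and compare a quality measure such as the silhouette score. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_blobs` as a stand-in for **tfidf_matrix** or **count_matrix**, and the candidate values in `candidate_ks` are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for tfidf_matrix / count_matrix (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

candidate_ks = [2, 3, 4, 5, 6]  # arbitrary candidate values for n_clusters
scores = {}
for k in candidate_ks:
    model = KMeans(n_clusters=k, random_state=96, n_init=10, max_iter=1000).fit(X)
    # Silhouette score lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, model.labels_)
    print(f"k={k}: inertia={model.inertia_:.1f}, silhouette={scores[k]:.3f}")

# Pick the candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

On the real movie matrices the same loop applies, but with much larger candidate values, and each fit will take noticeably longer.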