### Import the necessary libraries (pandas, numpy, CountVectorizer and cosine_similarity)

- The `pandas` library is used to read the CSV file and manipulate the data in the movie dataset

- The `numpy` library is used for numerical computations

- The `CountVectorizer` class from the `sklearn.feature_extraction.text` module is used to create a count matrix from the "combined_features" column

- The `cosine_similarity` function from `sklearn.metrics.pairwise` is used to compute the cosine similarity between the movies based on the count matrix

In [9]:
import pandas as pd 
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Define helper functions to get the movie title from its index and movie index from its title

- `get_title_from_index(index)` is a helper function that retrieves the title of a movie given its row index in the dataset. It returns `df[df.index == index]["title"].values[0]`.

- `get_index_from_title(title)` is a helper function that retrieves the index of a movie given its title. It returns `df[df.title == title]["index"].values[0]`. Both helpers read the global dataframe `df`, so they can only be called after the dataset has been loaded.

In [10]:
def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]

def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]
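
The two lookups can be tried on a tiny dataframe. The sketch below is a variant, not the notebook's code: `toy_df` and its three rows are hypothetical, the dataframe is passed in as an argument instead of being read from a global, and both helpers return `None` when nothing matches rather than raising an `IndexError`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the movie dataset.
toy_df = pd.DataFrame({
    "index": [0, 1, 2],
    "title": ["Movie A", "Movie B", "Movie C"],
})

def get_index_from_title(df, title):
    # Value of the "index" column for the first row with this title, or None.
    matches = df.loc[df["title"] == title, "index"]
    return int(matches.iloc[0]) if not matches.empty else None

def get_title_from_index(df, index):
    # Title of the first row whose "index" column equals `index`, or None.
    matches = df.loc[df["index"] == index, "title"]
    return matches.iloc[0] if not matches.empty else None

print(get_index_from_title(toy_df, "Movie B"))
print(get_title_from_index(toy_df, 2))
```

Returning `None` for an unknown title is one possible design choice; raising a descriptive error would work equally well.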

### Read movie dataset from a CSV file using pandas

The movie dataset is loaded into the dataframe `df` with `pd.read_csv("movie_dataset.csv")`, and `df.columns` then lists the columns that are available.

In [11]:
df = pd.read_csv("movie_dataset.csv")
df.columns

Index(['index', 'budget', 'genres', 'homepage', 'id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'vote_average', 'vote_count', 'cast', 'crew', 'director'],
      dtype='object')
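
Only a handful of these columns are used later, so it may be worth loading just those. The following is a minimal sketch, assuming an in-memory CSV (`csv_text`, with made-up rows) in place of `movie_dataset.csv`; it relies on the `usecols` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up CSV content standing in for movie_dataset.csv.
csv_text = (
    "index,title,genres,budget\n"
    "0,Movie A,Action,100\n"
    "1,Movie B,Drama,200\n"
)

# Columns this sketch cares about; everything else is skipped at load time.
needed = ["index", "title", "genres"]
toy_df = pd.read_csv(io.StringIO(csv_text), usecols=needed)

print(toy_df.columns.tolist())
```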

### Select features (keywords, cast, genres, director) from the dataset

- The features variable is defined as a list of strings, containing the features to be selected from the movie dataset, specifically 'keywords', 'cast', 'genres', and 'director'

In [12]:
features = ['keywords', 'genres', 'cast', 'director']

for feature in features :
    df[feature] = df[feature].fillna(" ")
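
Filling missing values matters because a missing entry is stored as `NaN`, a float, and adding a float to a string raises a `TypeError`. A small sketch on a hypothetical two-row dataframe (`toy_df`) shows the effect of the loop:

```python
import numpy as np
import pandas as pd

# Hypothetical rows, each with one missing feature.
toy_df = pd.DataFrame({
    "keywords": ["space war", np.nan],
    "director": [np.nan, "Some Director"],
})

for feature in ["keywords", "director"]:
    # Replace NaN with a single space so every cell is a string.
    toy_df[feature] = toy_df[feature].fillna(" ")

# Count the missing values that remain after filling.
remaining_missing = int(toy_df.isna().sum().sum())
print(remaining_missing)
```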

### Create a new column in the dataset that combines all selected features

- The `for` loop in the previous cell iterates through the list of selected features; for each one, `df[feature] = df[feature].fillna(" ")` replaces any missing value with a single space, so that the string concatenation below does not fail on `NaN`.

- Then the user-defined function `combine_features(row)` is used to combine the values of all selected features for each movie into a single string.

- This new combined feature string is added to the dataframe as a new column called "combined_features" using `df["combined_features"] = df.apply(combine_features,axis=1)`

In [13]:
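def combine_features(row):
    # Join the four selected text features into one space-separated string.
    try:
        return row["keywords"] + " " + row["cast"] + " " + row["genres"] + " " + row["director"]
    except TypeError:
        # A non-string value slipped through: report the row and fall back to
        # an empty string so the column never contains None.
        print("Error:", row)
        return ""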

df["combined_features"] = df.apply(combine_features,axis=1)
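
As an alternative to a row-wise helper, the same column can be built by joining the selected columns directly. This is a sketch on hypothetical data (`toy_df`), assuming every selected cell is already a string (that is, missing values have been filled):

```python
import pandas as pd

# Hypothetical two-movie dataset with the four selected features.
toy_df = pd.DataFrame({
    "keywords": ["space war", "prison escape"],
    "cast": ["Actor One", "Actor Two"],
    "genres": ["Action", "Thriller"],
    "director": ["Director One", "Director Two"],
})

features = ["keywords", "cast", "genres", "director"]

# Join the selected columns of each row with single spaces.
toy_df["combined_features"] = toy_df[features].agg(" ".join, axis=1)

print(toy_df["combined_features"].tolist())
```

Note that the join order here follows the `features` list, which differs from a hand-written concatenation only in word order; a bag-of-words count is not affected by that.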


### Create a count matrix from the new column using CountVectorizer

- The `CountVectorizer()` class is used to create a count matrix from the "combined_features" column by calling its `fit_transform()` method and passing in `df["combined_features"]` as an argument

In [14]:
cv = CountVectorizer()

count_matrix = cv.fit_transform(df['combined_features'])

array = count_matrix.toarray()

features_names = list(cv.vocabulary_.keys())

features_names[0:15]

['culture',
 'clash',
 'future',
 'space',
 'war',
 'colony',
 'society',
 'sam',
 'worthington',
 'zoe',
 'saldana',
 'sigourney',
 'weaver',
 'stephen',
 'lang']

### Compute the cosine similarity between the movies based on the count matrix

- The cosine similarity is computed by calling the `cosine_similarity()` function from `sklearn.metrics.pairwise`, passing in the count matrix as an argument and storing the resulting similarity matrix in the variable `cos_sim`

In [15]:
cos_sim = cosine_similarity(count_matrix)
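
The similarity between two count vectors a and b is their dot product divided by the product of their lengths, a·b / (‖a‖ ‖b‖). The sketch below checks that formula by hand against `cosine_similarity` on a small, made-up count matrix (`counts`):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up count vectors for two movies over a three-word vocabulary.
counts = np.array([
    [0, 2, 1],
    [1, 0, 1],
])

a, b = counts
# Dot product divided by the product of the Euclidean norms.
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(counts)   # 2 x 2 matrix of pairwise similarities

print(manual, sim[0, 1])
```

The result is a square, symmetric matrix with ones on the diagonal, since every non-empty vector points in exactly the same direction as itself.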

### Retrieve the index of the movie that the user likes

- `movie_user_like = "Avatar"` defines the movie that the user likes

- The helper function `get_index_from_title(movie_user_like)` returns the index of this movie, which is stored in `movie_index`.

In [16]:
movie_user_like = "Avatar"

movie_index = get_index_from_title(movie_user_like)

### Find the list of similar movies and sort them in descending order of similarity score:

- Using the movie index, it builds a list of `(index, score)` pairs for every movie with `similar_movies = list(enumerate(cos_sim[movie_index]))`

- Then it sorts this list in descending order of similarity score using `sorted()` function

In [17]:
similar_movies = list(enumerate(cos_sim[movie_index]))

sorted_similar_movies = sorted(similar_movies, key = lambda x:x[1], reverse = True )
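
The enumerate-then-sort pattern can be followed on a short, made-up row of scores. In this sketch `scores` plays the role of one row of the similarity matrix and `liked` is the position of the movie the user likes; both are hypothetical. The last step, dropping the liked movie from the ranking, is an optional addition that the notebook does not perform.

```python
# Made-up similarity scores of one movie against four movies (itself first).
scores = [1.0, 0.2, 0.7, 0.0]
liked = 0

# Pair every score with its position so the position survives sorting.
pairs = list(enumerate(scores))

# Highest score first.
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Optional: leave out the liked movie itself.
ranked_others = [pair for pair in ranked if pair[0] != liked]

print(ranked)
print(ranked_others)
```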

### Print the title of the top 10 most similar movies

- A `for` loop over `range(10)` reads the first 10 entries of `sorted_similar_movies`.

On each iteration, it does the following:

- Assigns the first element of the current tuple (the index of the movie) to the variable `index` using `sorted_similar_movies[i][0]`

- Calls the helper function `get_title_from_index(index)` and assigns the result to the variable `title`

- Prints the value of the `title` variable

Because every movie has the highest possible similarity to itself, the first title printed is the liked movie ("Avatar"), so the output lists that movie followed by the nine most similar other movies.

In [18]:
for i in range(10):
    index = sorted_similar_movies[i][0]
    title = get_title_from_index(index)
    print(title)

Avatar
Guardians of the Galaxy
Aliens
Star Wars: Clone Wars: Volume 1
Star Trek Into Darkness
Star Trek Beyond
Alien
Lockout
Jason X
The Helix... Loaded
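
The steps above can be gathered into a single function. The following is a sketch under stated assumptions, not the notebook's own code: the name `recommend`, its parameters, and the four-row `toy_df` are all hypothetical, the dataframe is assumed to already have a `combined_features` column, and the liked movie is excluded from its own recommendations.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(df, title, n=10):
    """Return up to n titles most similar to `title`, excluding the movie itself."""
    positions = [i for i, t in enumerate(df["title"]) if t == title]
    if not positions:
        raise ValueError(f"Unknown title: {title!r}")
    pos = positions[0]

    counts = CountVectorizer().fit_transform(df["combined_features"])
    # Similarity of the chosen movie against every movie (a slice keeps the row 2-D).
    scores = cosine_similarity(counts[pos:pos + 1], counts).ravel()

    order = scores.argsort()[::-1]                  # best score first
    picks = [int(i) for i in order if i != pos][:n]
    return df["title"].iloc[picks].tolist()

# Hypothetical four-movie dataset.
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "combined_features": [
        "space war alien",
        "space war",
        "romance paris",
        "space romance",
    ],
})

print(recommend(toy_df, "Movie A", n=2))
```

Computing only one row of similarities, rather than the full matrix, keeps memory use proportional to the number of movies; for repeated queries, caching the count matrix would avoid refitting the vectorizer each time.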
