In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

count_matrix, which we build below, is a sparse matrix. To make it human-readable, we call its toarray() method. Before printing count_matrix, it helps to look at the feature list (or word list) that our CountVectorizer() object works with.

In [21]:
text = ["London Paris London", "Paris Paris London"]

We need to find a way to represent these texts as vectors. The CountVectorizer() class from the sklearn.feature_extraction.text module can do this for us. We need to import it before we can create a new CountVectorizer() object.

In [23]:
cv = CountVectorizer()

In [25]:
count_matrix = cv.fit_transform(text)

In [27]:
print(count_matrix)

  (0, 0)	2
  (0, 1)	1
  (1, 0)	1
  (1, 1)	2


In [29]:
print(count_matrix.toarray())

[[2 1]
 [1 2]]


This indicates that the word ‘london’ occurs 2 times in Text A and 1 time in Text B. Similarly, the word ‘paris’ occurs 1 time in Text A and 2 times in Text B. Makes sense, right?
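To see where these counts come from, here is a minimal pure-Python sketch of the same idea. It is only an approximation of what CountVectorizer() does (it lowercases and splits on whitespace, whereas the real class uses a regular-expression tokenizer), and the names tokenized, vocabulary and counts are chosen here for illustration.

```python
from collections import Counter

text = ["London Paris London", "Paris Paris London"]

# Lowercase each text and split on whitespace (a simplification of the real tokenizer).
tokenized = [doc.lower().split() for doc in text]

# Sort the vocabulary alphabetically, so column 0 is 'london' and column 1 is 'paris'.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Build one row per text and one column per vocabulary word.
counts = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)  # ['london', 'paris']
print(counts)      # [[2, 1], [1, 2]]
```

In recent scikit-learn versions, the fitted vectorizer should expose the same word list through cv.get_feature_names_out().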

Now, we need to find the cosine (or “cos”) similarity between these vectors to find out how similar they are to each other. We can calculate this using the cosine_similarity() function from the sklearn.metrics.pairwise module.

In [31]:
similarity_scores = cosine_similarity(count_matrix)

In [33]:
print(similarity_scores)

[[1.  0.8]
 [0.8 1. ]]
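The value 0.8 can be checked by hand: cosine similarity is the dot product of the two vectors divided by the product of their lengths, so for [2, 1] and [1, 2] it is (2*1 + 1*2) / (sqrt(5) * sqrt(5)) = 4/5. A small sketch of that arithmetic, with a helper named cosine that is defined here only for illustration:

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

text_a = [2, 1]  # counts of 'london' and 'paris' in Text A
text_b = [1, 2]  # counts of 'london' and 'paris' in Text B

print(round(cosine(text_a, text_b), 4))  # 0.8
print(round(cosine(text_a, text_a), 4))  # 1.0
```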


Each row of the similarity matrix corresponds to one sentence of our input. So, row 0 = Text A and row 1 = Text B.

The same applies to the columns: column 0 = Text A and column 1 = Text B.

Interpreting this, Text A is 100% similar to Text A (itself, position [0,0]) and Text A is 80% similar to Text B (position [0,1]). Looking at this output, we can also say that the result is always going to be a symmetric matrix, because if Text A is 80% similar to Text B, then Text B is also 80% similar to Text A.

Now we know how to find the similarity between contents. So, let’s apply this knowledge to build a content-based movie recommendation engine.

# Building the recommendation engine:

# Import all the required libraries

In [43]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [69]:
df = pd.read_csv(r"C:\Users\ASUS\Downloads\movie_dataset.csv")

In [73]:
print(df.columns)

Index(['index', 'budget', 'genres', 'homepage', 'id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'vote_average', 'vote_count', 'cast', 'crew', 'director'],
      dtype='object')


In [75]:
features = ['keywords','cast','genres','director']

Our next task is to create a function for combining the values of these columns into a single string.

In [77]:
def combine_features(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']

In [79]:
for feature in features:
    df[feature] = df[feature].fillna('') 

df["combined_features"] = df.apply(combine_features,axis=1)
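The fillna('') step matters because a missing value is a float NaN, and adding a string to it raises a TypeError. As an alternative sketch, the combining function itself can skip missing values. The name combine_features_safe and the sample row below are hypothetical, and the row is a plain dict so the example runs without the dataset.

```python
import math

features = ["keywords", "cast", "genres", "director"]

def combine_features_safe(row):
    # Join the selected fields with spaces, skipping None and NaN entries.
    parts = []
    for name in features:
        value = row.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        parts.append(str(value))
    return " ".join(parts)

# Hypothetical row with a missing 'cast' value.
row = {
    "keywords": "space war",
    "cast": float("nan"),
    "genres": "Action",
    "director": "James Cameron",
}
print(combine_features_safe(row))  # space war Action James Cameron
```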

In [89]:
print("Combined Features:", df["combined_features"].head())

Combined Features: 0    culture clash future space war space colony so...
1    ocean drug abuse exotic island east india trad...
2    spy based on novel secret agent sequel mi6 Dan...
3    dc comics crime fighter terrorist secret ident...
4    based on novel mars medallion space travel pri...
Name: combined_features, dtype: object


In [81]:
df.iloc[0].combined_features

'culture clash future space war space colony society Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang Michelle Rodriguez Action Adventure Fantasy Science Fiction James Cameron'

Now that we have the combined strings, we can feed them to a CountVectorizer() object to get the count matrix.

In [91]:
cv = CountVectorizer() 
count_matrix = cv.fit_transform(df["combined_features"])

In [93]:
print(count_matrix.shape)

(4803, 14845)


In [59]:
cosine_sim = cosine_similarity(count_matrix)

In [61]:
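def get_title_from_index(index):
    # Look up the title of the row whose DataFrame index label equals `index`.
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    # Return the 'index' column value of the first movie with this title.
    return df[df.title == title]["index"].values[0]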

In [63]:
movie_user_likes = "Avatar"
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))

In [65]:
# Sort by similarity score, highest first, and drop the first entry (the movie itself).
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]
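Dropping the first element assumes the chosen movie always sorts to the top. That can fail if another movie has an identical feature string and appears earlier in the list. A hedged alternative is to exclude the query by its index and then take the k best scores; the function top_similar and the toy scores below are illustrative only.

```python
import heapq

def top_similar(scores, query_index, k=5):
    # Pair each score with its position, leaving out the query movie itself.
    candidates = [(i, s) for i, s in enumerate(scores) if i != query_index]
    # Return the k pairs with the highest similarity, best first.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Toy similarity row for the movie at index 3; index 0 is an exact duplicate of it.
scores = [1.0, 0.2, 0.9, 1.0, 0.5]
print(top_similar(scores, query_index=3, k=3))  # [(0, 1.0), (2, 0.9), (4, 0.5)]
```

With the real data, the call would look like top_similar(cosine_sim[movie_index], movie_index).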

In [67]:
print("Top 5 similar movies to " + movie_user_likes + " are:\n")
# Print only the first five entries of the sorted list.
for element in sorted_similar_movies[:5]:
    print(get_title_from_index(element[0]))

Top 5 similar movies to Avatar are:

Guardians of the Galaxy
Aliens
Star Wars: Clone Wars: Volume 1
Star Trek Into Darkness
Star Trek Beyond
