# 03-Movie-Query

![](https://images.unsplash.com/photo-1521967906867-14ec9d64bee8?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [JESHOOTS.COM](https://unsplash.com/photos/PpYOQgsZDM4)

In this exercise, you will build a prototype of a query-based movie recommender.

The idea is to type a word or a short text and find the movies closest to that query. To do so, we will use both the movie title and the movie overview.

Begin by retrieving the data and importing the usual libraries:

In [1]:
# Run this cell to retrieve challenge data
! mkdir ../data 
! curl https://storage.googleapis.com/schoolofdata-datasets/NLP.Text-Similarity/movies_overviews.csv -o ../data/movies_overviews.csv
! tree ..

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  305k  100  305k    0     0  1236k      0 --:--:-- --:--:-- --:--:-- 1236k
..
├── README.md
├── data
│   └── movies_overviews.csv
└── src
    ├── 03-Movie-Query.ipynb
    └── movies_overviews.csv

3 directories, 4 files


In [1]:
# TODO: import needed libraries
import pandas as pd
import numpy as np

Import the data from the file *movies_overviews.csv*.

In [2]:
# TODO: Load the dataset movies_overviews.csv
movie_data = pd.read_csv('../data/movies_overviews.csv')

In [3]:
movie_data

| | original_title | overview |
|---|---|---|
| 0 | Minions | Minions Stuart, Kevin and Bob are recruited by... |
| 1 | Wonder Woman | An Amazon princess comes to the world of Man t... |
| 2 | Beauty and the Beast | A live-action adaptation of Disney's version o... |
| 3 | Baby Driver | After being coerced into working for a crime b... |
| 4 | Big Hero 6 | The special bond that develops between plus-si... |
| ... | ... | ... |
| 1008 | LOL | In a world connected by YouTube, iTunes, and F... |
| 1009 | God Bless America | Fed up with the cruelty and stupidity of Ameri... |
| 1010 | The Dead Lands | Hongi, a Maori chieftain’s teenage son, must a... |
| 1011 | Scream 4 | Sidney Prescott, now the author of a self-help... |


In [4]:
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013 entries, 0 to 1012
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   original_title  1013 non-null   object
 1   overview        1013 non-null   object
dtypes: object(2)
memory usage: 16.0+ KB


Check what this file contains and what the data looks like.

In [5]:
# TODO: Check the dataset
movie_data.isna().sum()

original_title    0
overview          0
dtype: int64

In [9]:
f = lambda a,b : a+b

In [10]:
f(2,3)

5

In [11]:
def f(a,b):
    return a+b

In [12]:
f(2,3)

5

In [8]:
movie_data.overview.apply(lambda txt: len(txt.split())).describe()

count    1013.000000
mean       49.333662
std        28.289498
min         9.000000
25%        27.000000
50%        41.000000
75%        66.000000
max       169.000000
Name: overview, dtype: float64

Concatenate the title and overview into another column.

In [13]:
movie_data.columns

Index(['original_title', 'overview'], dtype='object')

In [14]:
# TODO: Concatenate title and overview
movie_data["all_info"] = movie_data["original_title"] + ' ' + movie_data["overview"]

In [15]:
movie_data

| | original_title | overview | all_info |
|---|---|---|---|
| 0 | Minions | Minions Stuart, Kevin and Bob are recruited by... | Minions Minions Stuart, Kevin and Bob are recr... |
| 1 | Wonder Woman | An Amazon princess comes to the world of Man t... | Wonder Woman An Amazon princess comes to the w... |
| 2 | Beauty and the Beast | A live-action adaptation of Disney's version o... | Beauty and the Beast A live-action adaptation ... |
| 3 | Baby Driver | After being coerced into working for a crime b... | Baby Driver After being coerced into working f... |
| 4 | Big Hero 6 | The special bond that develops between plus-si... | Big Hero 6 The special bond that develops betw... |
| ... | ... | ... | ... |
| 1008 | LOL | In a world connected by YouTube, iTunes, and F... | LOL In a world connected by YouTube, iTunes, a... |
| 1009 | God Bless America | Fed up with the cruelty and stupidity of Ameri... | God Bless America Fed up with the cruelty and ... |
| 1010 | The Dead Lands | Hongi, a Maori chieftain’s teenage son, must a... | The Dead Lands Hongi, a Maori chieftain’s teen... |
| 1011 | Scream 4 | Sidney Prescott, now the author of a self-help... | Scream 4 Sidney Prescott, now the author of a ... |


Compute both the BOW and the TF-IDF representations of this new column using scikit-learn.

In [16]:
# TODO: compute the BOW and TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

count_vectorizer = CountVectorizer()
bow_repr = count_vectorizer.fit_transform(movie_data.all_info)

tfid_vectorizer = TfidfVectorizer()
tfid_repr = tfid_vectorizer.fit_transform(movie_data.all_info)


What are the dimensions of the TF-IDF and the BOW? Print them out and explain them.

In [17]:
# TODO: print the shapes of TF-IDF and BOW
bow_repr.shape , tfid_repr.shape

((1013, 8909), (1013, 8909))
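
Both matrices have one row per movie (1013) and one column per distinct token in the fitted vocabulary (8909). The shapes match because, with default settings, `CountVectorizer` and `TfidfVectorizer` tokenize the text the same way and therefore learn the same vocabulary; only the cell values differ. The counting step can be sketched in pure Python on a made-up three-document corpus, assuming scikit-learn's default behaviour of lowercasing and keeping tokens of two or more word characters:

```python
import re

# Hypothetical toy corpus, not taken from movies_overviews.csv
docs = [
    "A wizard walks in the forest",
    "The forest hides a cabin",
    "Robots in the city",
]

# Roughly the default tokenization: lowercase, tokens of 2+ word characters
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: col for col, tok in enumerate(all_tokens)}

# Count matrix: one row per document, one column per vocabulary entry
bow = [[0] * len(vocabulary) for _ in docs]
for row, doc in enumerate(docs):
    for tok in tokenize(doc):
        bow[row][vocabulary[tok]] += 1

shape = (len(bow), len(vocabulary))
print(shape)
```

Here the shape is (number of documents, number of distinct tokens), which is the same reading as (1013, 8909) above. The single-character token "a" is dropped, as it would be with the default token pattern.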

In [20]:
bow_repr.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Now we want to define a function that takes as input a text query and prints the 10 most similar movies according to the BOW.

To do that, the function should perform the following steps:
* Compute the BOW of the input text query
* Compute cosine similarity between the query and the movies
* Print the 10 most similar movies

The function will have the following signature:

`get_BOW_similar(query, movies_df, BOW, BOW_vectorizer)`
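
To make the cosine-similarity step concrete: it measures the angle between two vectors, that is, their dot product divided by the product of their lengths. Below is a small standard-library sketch on hand-written count vectors; the helper name `cosine` and the vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Columns: ["forest", "in", "the", "walking"]
query = [1, 1, 1, 1]    # "walking in the forest"
movie_a = [1, 0, 2, 0]  # mentions "forest" once and "the" twice
movie_b = [0, 2, 4, 0]  # never mentions "forest", only "in" and "the"

print(round(cosine(query, movie_a), 3))
print(round(cosine(query, movie_b), 3))
```

Both calls give about 0.671: with raw counts, a text that only shares frequent words such as "in" and "the" with the query can score as high as one that shares a content word. This is worth keeping in mind when reading the BOW results.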

In [44]:
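# TODO: implement the function get_BOW_similar
from sklearn.metrics.pairwise import cosine_similarity

def get_BOW_similar(query, movies_df, BOW, BOW_vectorizer):
    # Vectorize the query with the vocabulary learned on the movies
    query_bow = BOW_vectorizer.transform([query])
    # Shape (1, n_movies): similarity of the query to every movie
    similarities = cosine_similarity(query_bow, BOW)
    # Positions of the 10 highest similarities, best first
    top_10_indexes = (-similarities[0]).argsort()[:10]
    print(similarities[0, top_10_indexes])
    # argsort returns positions, so select rows with iloc rather than loc
    return movies_df['original_title'].iloc[top_10_indexes]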

In [47]:
get_BOW_similar('walking in the forest', movie_data, bow_repr, count_vectorizer)

[0.6148789  0.58343849 0.58333333 0.56888012 0.55603844 0.54433105
 0.53066863 0.51965584 0.51832106 0.51639778]


559                             The Normal Heart
203                  Snow White and the Huntsman
29                                The Dark Tower
359                       The Cabin in the Woods
102    The Twilight Saga: Breaking Dawn - Part 2
256                     Smurfs: The Lost Village
755                                Hail, Caesar!
43                 Transformers: The Last Knight
915                           Summer in February
901                                  Any Day Now
Name: original_title, dtype: object

In [48]:
movie_data.iloc[551, 1]

"In the futuristic action thriller Looper, time travel will be invented but it will be illegal and only available on the black market. When the mob wants to get rid of someone, they will send their target 30 years into the past where a looper, a hired gun, like Joe is waiting to mop up. Joe is getting rich and life is good until the day the mob decides to close the loop, sending back Joe's future self for assassination."

Now try to use that function with several queries, check how it works:

In [47]:
# TODO: test your get_BOW_similar


Now let's write the same function with TF-IDF to see whether it works better:

`get_TFIDF_similar(query, movies_df, TFIDF, TFIDF_vectorizer)`

Then test it on some queries.
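
As a reminder of what changes with TF-IDF: each count is rescaled by how rare the term is across documents, so words like "the" weigh less. The sketch below uses only the standard library and assumes scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`), i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The counts are invented.

```python
import math

# Hypothetical count matrix: rows are documents, columns are terms
terms = ["forest", "the"]
counts = [
    [1, 2],
    [0, 3],
    [0, 1],
]

n_docs = len(counts)
# Document frequency: number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]
print(dict(zip(terms, idf)))
print(tfidf[0])
```

"the" appears in every document and gets the minimum idf of 1, while "forest" appears in one document and gets a larger weight, so it counts for more in the first row than its raw count alone would suggest.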

In [56]:
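# TODO: implement the function get_TFIDF_similar (named get_tfidf_similar here)
def get_tfidf_similar(query, movies_df, TFIDF, tfidf_vectorizer):
    # Vectorize the query with the vocabulary and idf weights learned on the movies
    query_tfidf = tfidf_vectorizer.transform([query])
    # Shape (1, n_movies): similarity of the query to every movie
    similarities = cosine_similarity(query_tfidf, TFIDF)
    # Positions of the 10 highest similarities, best first
    top_10_indexes = (-similarities[0]).argsort()[:10]
    print(similarities[0, top_10_indexes])
    # argsort returns positions, so select rows with iloc rather than loc
    return movies_df['original_title'].iloc[top_10_indexes]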

In [64]:
get_tfidf_similar('data science', movie_data, tfid_repr, tfid_vectorizer)

[0.17856384 0.1491261  0.0977551  0.08978029 0.07447396 0.
 0.         0.         0.         0.        ]


433                              Kindergarten Cop 2
466    Science Fiction Volume One: The Osiris Child
438                                             Her
576                       The Sorcerer's Apprentice
706                                      Red Lights
0                                           Minions
677                            Hotel Transylvania 2
676                             I Am Not Your Negro
675                                I Give It a Year
674                       To Write Love on Her Arms
Name: original_title, dtype: object
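
In the output above, only the first five scores are non-zero. The remaining five movies share no vocabulary term with the query, so their presence in the top 10 is just arbitrary tie-breaking among zeros. One option is to drop zero-score results. Here is a sketch with a hypothetical helper `top_k_nonzero` that works on plain lists (the two non-zero scores are copied from the output above; the selection of titles is illustrative):

```python
def top_k_nonzero(titles, scores, k=10):
    """Return up to k (title, score) pairs with score > 0, best first."""
    ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
    return [(title, score) for title, score in ranked[:k] if score > 0]

titles = ["Minions", "Kindergarten Cop 2", "Her", "Hotel Transylvania 2"]
scores = [0.0, 0.17856384, 0.0977551, 0.0]

for title, score in top_k_nonzero(titles, scores, k=3):
    print(f"{score:.3f}  {title}")
```

With `k=3` this prints only two lines, because the third-ranked movie has a score of zero.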

In [66]:
movie_data.iloc[576, 1]

"Balthazar Blake is a master sorcerer in modern-day Manhattan trying to defend the city from his arch-nemesis, Maxim Horvath. Balthazar can't do it alone, so he recruits Dave Stutler, a seemingly average guy who demonstrates hidden potential, as his reluctant protégé. The sorcerer gives his unwilling accomplice a crash course in the art and science of magic, and together, these unlikely partners work to stop the forces of darkness."

In [67]:
from nltk.tokenize import word_tokenize

In [68]:
movie_data.iloc[576, 1].split()

['Balthazar',
 'Blake',
 'is',
 'a',
 'master',
 'sorcerer',
 'in',
 'modern-day',
 'Manhattan',
 'trying',
 'to',
 'defend',
 'the',
 'city',
 'from',
 'his',
 'arch-nemesis,',
 'Maxim',
 'Horvath.',
 'Balthazar',
 "can't",
 'do',
 'it',
 'alone,',
 'so',
 'he',
 'recruits',
 'Dave',
 'Stutler,',
 'a',
 'seemingly',
 'average',
 'guy',
 'who',
 'demonstrates',
 'hidden',
 'potential,',
 'as',
 'his',
 'reluctant',
 'protégé.',
 'The',
 'sorcerer',
 'gives',
 'his',
 'unwilling',
 'accomplice',
 'a',
 'crash',
 'course',
 'in',
 'the',
 'art',
 'and',
 'science',
 'of',
 'magic,',
 'and',
 'together,',
 'these',
 'unlikely',
 'partners',
 'work',
 'to',
 'stop',
 'the',
 'forces',
 'of',
 'darkness.']

In [73]:
'darkness.'.isalpha()

False

In [69]:
word_tokenize(movie_data.iloc[576, 1])

['Balthazar',
 'Blake',
 'is',
 'a',
 'master',
 'sorcerer',
 'in',
 'modern-day',
 'Manhattan',
 'trying',
 'to',
 'defend',
 'the',
 'city',
 'from',
 'his',
 'arch-nemesis',
 ',',
 'Maxim',
 'Horvath',
 '.',
 'Balthazar',
 'ca',
 "n't",
 'do',
 'it',
 'alone',
 ',',
 'so',
 'he',
 'recruits',
 'Dave',
 'Stutler',
 ',',
 'a',
 'seemingly',
 'average',
 'guy',
 'who',
 'demonstrates',
 'hidden',
 'potential',
 ',',
 'as',
 'his',
 'reluctant',
 'protégé',
 '.',
 'The',
 'sorcerer',
 'gives',
 'his',
 'unwilling',
 'accomplice',
 'a',
 'crash',
 'course',
 'in',
 'the',
 'art',
 'and',
 'science',
 'of',
 'magic',
 ',',
 'and',
 'together',
 ',',
 'these',
 'unlikely',
 'partners',
 'work',
 'to',
 'stop',
 'the',
 'forces',
 'of',
 'darkness',
 '.']

What can you do now to improve your function?

If you have time, add preprocessing to both the query and the movie texts before computing the BOW/TF-IDF; this would probably improve the results.
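
One possible preprocessing step, sketched with the standard library only: lowercase the text, keep alphabetic tokens so that `'darkness.'` becomes `'darkness'`, and drop a few very common words. The stop-word set below is a tiny hypothetical sample, not a standard list; NLTK's `word_tokenize` (imported earlier) or scikit-learn's `stop_words='english'` option could play a similar role.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(preprocess("Walking in the forest"))
print(preprocess("the art and science of magic, and together,"))
```

The same function would have to be applied to the `all_info` column before fitting the vectorizers and to the query before calling `transform`, so that both sides are tokenized consistently.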