# NL Query Preprocessing
Here, we combine the datasets used so far into a single table of the movies we are working with, along with the features we need. These features will serve as filters and as the basis for natural-language (NL) queries in our application.

## Importing the Dependencies

In [13]:
import pandas as pd
import time
import json
 
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from bs4 import BeautifulSoup 

In [2]:
data = pd.read_csv("/Users/dhruv/Desktop/Machine_Learning/Projects/Movie_Recommender/Working/Working_Data/Movies_Final.csv") # This is all the data we used for our model

In [3]:
data.head()

Unnamed: 0,id,title,tags,IMDB_ID
0,615656,Meg 2: The Trench,action sci-fi horror jason statham wu jing shu...,tt9224104
1,758323,The Pope's Exorcist,horror mystery thriller russell crowe daniel z...,tt13375076
2,667538,Transformers: Rise of the Beasts,action adventure sci-fi anthony ramos dominiqu...,tt5090568
3,640146,Ant-Man and the Wasp: Quantumania,action adventure sci-fi paul rudd evangeline l...,tt10954600
4,677179,Creed III,drama action michael b. jordan tessa thompson ...,tt11145118


From here, let us apply the same preprocessing steps we used earlier to make our database more precise.

In [4]:
data.columns

Index(['id', 'title', 'tags', 'IMDB_ID'], dtype='object')

In [5]:
data.shape

(26241, 4)

In [6]:
data2 = pd.read_csv("/Users/dhruv/Desktop/Machine_Learning/Projects/Movie_Recommender/Working/Data/Working_Data/Movie_Data.csv")
data2.head()

Unnamed: 0,id,title,genres,original_language,release_year,runtime,keys,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director
0,615656,Meg 2: The Trench,Action Sci-Fi Horror,en,2023,116.0,"['based on novel or book', 'sequel', 'kaiju']",Jason Statham,Wu Jing,Shuya Sophia Cai,Sergio Peris,Mencheta,Ben Wheatley
1,758323,The Pope's Exorcist,Horror Mystery Thriller,en,2023,103.0,"['spain', 'rome italy', 'vatican', 'pope', 'pi...",Russell Crowe,Daniel Zovatto,Alex Essoe,Franco Nero,Peter DeSouza,Julius Avery
2,667538,Transformers: Rise of the Beasts,Action Adventure Sci-Fi,en,2023,127.0,"['peru', 'alien', 'end of the world', 'based o...",Anthony Ramos,Dominique Fishback,Luna Lauren Velez,Dean Scott Vazquez,Tobe Nwigwe,Steven Caple Jr.
3,640146,Ant-Man and the Wasp: Quantumania,Action Adventure Sci-Fi,en,2023,125.0,"['hero', 'ant', 'sequel', 'superhero', 'based ...",Paul Rudd,Evangeline Lilly,Jonathan Majors,Kathryn Newton,Michelle Pfeiffer,Peyton Reed
4,677179,Creed III,Drama Action,en,2023,116.0,"['philadelphia pennsylvania', 'husband wife re...",Michael B. Jordan,Tessa Thompson,Jonathan Majors,Wood Harris,Phylicia Rashād,Michael B. Jordan


Next, we fetch the release year from the second dataset and merge it into the first, so that we can filter the movies by release year.

In [7]:
data2 = data2[['id', 'release_year']]

In [8]:
data = pd.merge(data, data2, left_on='id', right_on='id', how='left')
data

Unnamed: 0,id,title,tags,IMDB_ID,release_year
0,615656,Meg 2: The Trench,action sci-fi horror jason statham wu jing shu...,tt9224104,2023
1,758323,The Pope's Exorcist,horror mystery thriller russell crowe daniel z...,tt13375076,2023
2,667538,Transformers: Rise of the Beasts,action adventure sci-fi anthony ramos dominiqu...,tt5090568,2023
3,640146,Ant-Man and the Wasp: Quantumania,action adventure sci-fi paul rudd evangeline l...,tt10954600,2023
4,677179,Creed III,drama action michael b. jordan tessa thompson ...,tt11145118,2023
...,...,...,...,...,...
30530,516266,Love And Politics,drama alex usifo pete edochie ngozi ezeonu unk...,,2003
30531,514859,Remarkable,drama daniel k. daniel nita sheu byack tope te...,,2017
30532,527597,In Sickness And In Health,drama oc ukeje beverly naya meg otanwa rotimi ...,,2018
30533,527320,While You Slept,drama ini edo joseph benjamin venita akpofure ...,,2015
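
The merged frame has 30,534 rows although `data` had 26,241 before the merge. A left merge can only grow when the right-hand frame repeats a key, so `data2` must contain repeated `id` values. The sketch below reproduces the effect on toy frames (`left`, `right` and their values are made up for illustration) and shows one way to keep the row count stable.

```python
import pandas as pd

# Toy stand-ins for `data` and `data2`; the values are invented.
left = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"id": [1, 2, 2, 3], "release_year": [2017, 2018, 2018, 2019]})

# A left merge keeps every row of `left`, but repeats a row once per matching key on the right.
naive = pd.merge(left, right, on="id", how="left")

# Dropping repeated ids on the right first keeps one output row per input row.
right_unique = right.drop_duplicates(subset="id")
# validate="many_to_one" makes pandas raise if the right-hand keys are not unique.
clean = pd.merge(left, right_unique, on="id", how="left", validate="many_to_one")

print(len(left), len(naive), len(clean))
```

In this notebook the repeated rows are removed later with `drop_duplicates(subset='id')`, so either order works; deduplicating before the merge simply avoids carrying the extra rows around.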


## Fetching only the first 1000 movies released between 2017 and 2019
Here, we keep only the movies released from 2017 to 2019 (inclusive) and take the first 1000 of them. This gives us a manageable number of movies on which to train the future models.

In [9]:
data = data.loc[(data['release_year'] <=2019) & (data['release_year'] >= 2017)]
data = data[0:1000]
data

Unnamed: 0,id,title,tags,IMDB_ID,release_year
38,283995,Guardians of the Galaxy Vol. 2,adventure action sci-fi chris pratt zoe saldañ...,tt3896198,2017
60,480530,Creed II,drama michael b. jordan sylvester stallone tes...,tt6343314,2018
102,299536,Avengers: Infinity War,adventure action sci-fi robert downey jr. chri...,tt4154756,2018
115,299534,Avengers: Endgame,adventure sci-fi action robert downey jr. chri...,tt4154796,2019
121,337167,Fifty Shades Freed,drama romance dakota johnson jamie dornan eric...,tt4477536,2018
...,...,...,...,...,...
5283,661289,The Summers of IT: Chapter Two,horror drama fantasy andy bean jessica chastai...,,2019
5284,661289,The Summers of IT: Chapter Two,horror drama fantasy andy bean jessica chastai...,,2019
5286,431093,Everybody Loves Somebody,romance comedy karla souza josé maría yázpik b...,tt5537228,2017
5302,602198,Saving Zoë,drama mystery thriller laura marano vanessa ma...,tt7319822,2019
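
The filter above combines two comparisons and then slices the first 1000 rows in whatever order the frame already has. A minimal equivalent on a toy frame (the `movies` frame and its values are made up) uses `Series.between`, which is inclusive on both ends by default:

```python
import pandas as pd

# Invented data: one movie per year from 2016 to 2020.
movies = pd.DataFrame({
    "id": [10, 11, 12, 13, 14],
    "release_year": [2016, 2017, 2018, 2019, 2020],
})

# Equivalent to (release_year >= 2017) & (release_year <= 2019).
in_range = movies[movies["release_year"].between(2017, 2019)]

# head(n) takes the first n rows in the existing order; .copy() gives an
# independent frame that can be modified later without affecting `movies`.
subset = in_range.head(2).copy()

print(subset["id"].tolist())
```

Because the slice depends on row order, sorting the frame first (for example by a popularity column, if one is available) would change which 1000 movies are kept.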


In [10]:
data.isna().sum()

id              0
title           0
tags            0
IMDB_ID         4
release_year    0
dtype: int64

Since some movies have a null IMDB_ID, we remove them from our dataset.

In [11]:
data = data.dropna()
data.shape

(996, 5)
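
`dropna()` without arguments removes a row if any column is null. Here only `IMDB_ID` has nulls, so the result is the same, but naming the column makes the intent explicit and stays correct if nullable columns are added later. A sketch on a toy frame (all values, including the `tt...` identifiers, are placeholders):

```python
import pandas as pd

frame = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "IMDB_ID": ["tt001", None, None, "tt003"],
    "note": [None, "x", "x", "y"],
})

# Only rows lacking an IMDB_ID are dropped; a missing `note` is tolerated.
with_ids = frame.dropna(subset=["IMDB_ID"])

# Keep the first row for each id.
unique = with_ids.drop_duplicates(subset="id")

print(unique["id"].tolist())
```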

In [56]:
data = data.drop_duplicates(subset='id')

In [12]:
data 

Unnamed: 0,id,title,tags,IMDB_ID,release_year
38,283995,Guardians of the Galaxy Vol. 2,adventure action sci-fi chris pratt zoe saldañ...,tt3896198,2017
60,480530,Creed II,drama michael b. jordan sylvester stallone tes...,tt6343314,2018
102,299536,Avengers: Infinity War,adventure action sci-fi robert downey jr. chri...,tt4154756,2018
115,299534,Avengers: Endgame,adventure sci-fi action robert downey jr. chri...,tt4154796,2019
121,337167,Fifty Shades Freed,drama romance dakota johnson jamie dornan eric...,tt4477536,2018
...,...,...,...,...,...
5269,574638,Rolling Thunder Revue: A Bob Dylan Story by Ma...,documentary music bob dylan allen ginsberg pat...,tt9577852,2019
5281,299782,The Other Side of the Wind,drama john huston oja kodar peter bogdanovich ...,tt0069049,2018
5286,431093,Everybody Loves Somebody,romance comedy karla souza josé maría yázpik b...,tt5537228,2017
5302,602198,Saving Zoë,drama mystery thriller laura marano vanessa ma...,tt7319822,2019


# Fetching Keywords from the IMDB Website
Here, our aim is to fetch the most-used keywords for each movie from the IMDB website. These keywords will support the recommendation system and give us the vocabulary users are likely to use when describing the type of movie they want to watch.

In [18]:
def load_keywords(imdb_id):
    """Scrape up to 25 keywords for one movie from its IMDB keywords page."""
    keyword_list = []
    driver = webdriver.Safari()
    try:
        url = 'https://www.imdb.com/title/{}/keywords/?ref_=tt_stry_kw'.format(imdb_id)
        driver.get(url)

        wait = WebDriverWait(driver, 10)

        # Scroll down twice, pausing so the page has time to render
        for _ in range(2):
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'body')))
            driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)
            time.sleep(1)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # This class string was copied from the page; it must match the attribute exactly
        keywords = soup.find('ul', class_="ipc-metadata-list ipc-metadata-list--dividers-after sc-49ddc26b-0 gXSKic ipc-metadata-list--base")
        if keywords is None:
            return []

        for keyword in keywords.find_all('li')[:25]:
            keyword_text = keyword.find('div', class_='ipc-metadata-list-summary-item__tc')
            if keyword_text:
                keyword_list.append(keyword_text.text.strip())  # Extract and strip text

        return keyword_list

    except (TimeoutException, NoSuchElementException) as e:
        print(f"Error scraping keywords: {e}")
        return []

    finally:
        driver.quit()  # Always close the browser, even if scraping failed
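
Passing a multi-class string to `class_` makes Beautiful Soup compare against the exact attribute value, so the lookup stops matching if a generated class such as `gXSKic` changes. The sketch below parses a hand-written HTML snippet (a stand-in, since real IMDB markup may differ) with a CSS selector that relies only on the two class names that look stable. The helper name `parse_keywords` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a keywords page; real IMDB markup may differ.
SAMPLE_HTML = """
<ul class="ipc-metadata-list ipc-metadata-list--base sc-49ddc26b-0 gXSKic">
  <li><div class="ipc-metadata-list-summary-item__tc"> superhero </div></li>
  <li><div class="ipc-metadata-list-summary-item__tc">time travel</div></li>
  <li><div class="other">ignored</div></li>
</ul>
"""


def parse_keywords(html, limit=25):
    """Return up to `limit` keyword strings found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # A CSS selector matches individual classes, so extra generated classes do not matter.
    cells = soup.select("ul.ipc-metadata-list li div.ipc-metadata-list-summary-item__tc")
    return [cell.get_text(strip=True) for cell in cells[:limit]]


print(parse_keywords(SAMPLE_HTML))
```

Keeping the parsing in its own function also lets it be checked against saved HTML without opening a browser.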


In [21]:
imdb_ids = data['IMDB_ID'].tolist()

keywords_list = []
for imdb_id in imdb_ids:
    keywords = load_keywords(imdb_id)
    keywords_list.append(keywords)
    
data = data.assign(keywords=keywords_list)  # assign() returns a new frame, which avoids SettingWithCopyWarning
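
Scraping roughly a thousand pages one browser session at a time takes a while, and a crash partway through loses everything collected so far. One possible safeguard, sketched here with a stand-in fetcher, is to write the results to a JSON checkpoint after every movie so that a rerun resumes where it stopped. The names `scrape_all`, `fake_fetch` and the checkpoint file are hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile


def scrape_all(ids, fetch, checkpoint_path):
    """Call `fetch(id)` for each id, saving after every item so a rerun can resume."""
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            results = json.load(fh)
    for item_id in ids:
        if item_id in results:
            continue  # already scraped in an earlier run
        results[item_id] = fetch(item_id)
        with open(checkpoint_path, "w", encoding="utf-8") as fh:
            json.dump(results, fh)
    # Return values in the same order as `ids`, ready to assign as a column.
    return [results[item_id] for item_id in ids]


# Stand-in fetcher that records its calls; in the notebook this would be load_keywords.
calls = []


def fake_fetch(imdb_id):
    calls.append(imdb_id)
    return [f"keyword for {imdb_id}"]


path = os.path.join(tempfile.mkdtemp(), "keywords_checkpoint.json")
first = scrape_all(["tt1", "tt2"], fake_fetch, path)
second = scrape_all(["tt1", "tt2", "tt3"], fake_fetch, path)
print(calls)  # each id is fetched only once across both runs
```

The IDs are used as JSON object keys, which works here because IMDB IDs are already strings.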


In [5]:
data

Unnamed: 0,id,title,tags,IMDB_ID,release_year,keywords
0,283995,Guardians of the Galaxy Vol. 2,adventure action sci-fi chris pratt zoe saldañ...,tt3896198,2017,"['demi god', 'alien creature', 'sarcasm', 'cra..."
1,480530,Creed II,drama michael b. jordan sylvester stallone tes...,tt6343314,2018,"['baby', 'training montage', 'sequel', 'boxing..."
2,299536,Avengers: Infinity War,adventure action sci-fi robert downey jr. chri...,tt4154756,2018,"['superhero', 'ensemble cast', 'marvel cinemat..."
3,299534,Avengers: Endgame,adventure sci-fi action robert downey jr. chri...,tt4154796,2019,"['time travel', 'superhero', 'super villain', ..."
4,337167,Fifty Shades Freed,drama romance dakota johnson jamie dornan eric...,tt4477536,2018,"['sex scene', 'wedding ceremony', 'bondage', '..."
...,...,...,...,...,...,...
991,574638,Rolling Thunder Revue: A Bob Dylan Story by Ma...,documentary music bob dylan allen ginsberg pat...,tt9577852,2019,[]
992,299782,The Other Side of the Wind,drama john huston oja kodar peter bogdanovich ...,tt0069049,2018,"['film business', 'nudity', 'female nudity', '..."
993,431093,Everybody Loves Somebody,romance comedy karla souza josé maría yázpik b...,tt5537228,2017,[]
994,602198,Saving Zoë,drama mystery thriller laura marano vanessa ma...,tt7319822,2019,[]


In [28]:
import ast

data["keywords"] = data["keywords"].apply(ast.literal_eval)  # parse the stringified lists without executing arbitrary code
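
The quoted lists in the preview above suggest that the frame was saved and reloaded, which turns each keyword list into a string such as `"['demi god', ...]"`. A minimal sketch of turning such a string back into a list with `ast.literal_eval`, which accepts only Python literals (the sample string is shortened from the preview):

```python
import ast

stored = "['demi god', 'alien creature', 'sarcasm']"

# literal_eval only accepts literals (lists, strings, numbers, ...),
# so it cannot run arbitrary code the way eval can.
keywords = ast.literal_eval(stored)
print(keywords)

# Anything that is not a plain literal is rejected instead of executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print("rejected:", rejected)
```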

# Fetching Reviews from the IMDB Website
Here, our aim is to fetch up to 25 reviews from the IMDB website for each movie. We can then use these reviews to train our model for sentiment analysis.

In [17]:
def load_reviews(imdb_id):
    """Scrape up to 25 review texts for one movie from its IMDB reviews page."""
    review_list = []
    driver = webdriver.Safari()
    try:
        url = 'https://www.imdb.com/title/{}/reviews?ref_=tt_ov_rt'.format(imdb_id)
        driver.get(url)

        wait = WebDriverWait(driver, 10)

        # Scroll down twice, pausing so the page has time to render
        for _ in range(2):
            wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'body')))
            driver.find_element(By.CSS_SELECTOR, 'body').send_keys(Keys.PAGE_DOWN)
            time.sleep(1)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # find_all returns an empty list (never None) when nothing matches
        reviews = soup.find_all('div', class_='review-container')
        if not reviews:
            return []

        for review in reviews[:25]:
            review_text = review.find('div', class_='text')
            if review_text:
                review_list.append(review_text.get_text(strip=True))

        return review_list

    except (TimeoutException, NoSuchElementException) as e:
        print(f"Error scraping reviews: {e}")
        return []

    finally:
        driver.quit()  # Always close the browser, even if scraping failed

In [19]:
imdb_ids = data['IMDB_ID'].tolist()

reviews_list = []
for imdb_id in imdb_ids:
    reviews = load_reviews(imdb_id)
    reviews_list.append(reviews)
    
data['reviews'] = reviews_list

In [22]:
data

Unnamed: 0,id,title,tags,IMDB_ID,release_year,keywords,reviews
0,283995,Guardians of the Galaxy Vol. 2,adventure action sci-fi chris pratt zoe saldañ...,tt3896198,2017,"['demi god', 'alien creature', 'sarcasm', 'cra...",[Despite being a huge comic book nerd I was no...
1,480530,Creed II,drama michael b. jordan sylvester stallone tes...,tt6343314,2018,"['baby', 'training montage', 'sequel', 'boxing...",[This movie is not as good as the first Creed....
2,299536,Avengers: Infinity War,adventure action sci-fi robert downey jr. chri...,tt4154756,2018,"['superhero', 'ensemble cast', 'marvel cinemat...",[Avengers infinity war is an emotional roller ...
3,299534,Avengers: Endgame,adventure sci-fi action robert downey jr. chri...,tt4154796,2019,"['time travel', 'superhero', 'super villain', ...",[But its a pretty good film. A bit of a mess i...
4,337167,Fifty Shades Freed,drama romance dakota johnson jamie dornan eric...,tt4477536,2018,"['sex scene', 'wedding ceremony', 'bondage', '...",[The first of the three that is actually emoti...
...,...,...,...,...,...,...,...
991,574638,Rolling Thunder Revue: A Bob Dylan Story by Ma...,documentary music bob dylan allen ginsberg pat...,tt9577852,2019,[],"[My ex and I saw Bob Dylan perform in 1984, a ..."
992,299782,The Other Side of the Wind,drama john huston oja kodar peter bogdanovich ...,tt0069049,2018,"['film business', 'nudity', 'female nudity', '...",[Years ago I saw a documentary that included a...
993,431093,Everybody Loves Somebody,romance comedy karla souza josé maría yázpik b...,tt5537228,2017,[],[This is definitely a light comedy worth recom...
994,602198,Saving Zoë,drama mystery thriller laura marano vanessa ma...,tt7319822,2019,[],[This movie takes you on a road to discovery t...


# Integrating Keywords into Tags
Since we have fetched new keywords from the IMDB website, it is advisable to extract them and integrate them into the tags of each movie. The enriched tags can then be used for the keyword-based recommendation system.

In [45]:
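# Append each movie's keywords to its tags; the result must be assigned back, since `+` builds a new string
data["tags"] = [
    (tags + " " + " ".join(keywords)).strip()
    for tags, keywords in zip(data["tags"], data["keywords"])
]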
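
String concatenation returns a new string rather than changing the original, so the combined value only reaches the `tags` column if it is assigned back. The sketch below isolates the per-movie step as a plain function; the name `merge_keywords_into_tags` and the choice to lowercase the keywords (to match the lowercase tags) are assumptions, not something the notebook specifies.

```python
def merge_keywords_into_tags(tags, keywords):
    """Return `tags` with each keyword appended, lowercased and space-separated."""
    extra = " ".join(keyword.lower() for keyword in keywords)
    return f"{tags} {extra}".strip()


tags = "adventure action sci-fi"
merged = merge_keywords_into_tags(tags, ["Demi God", "alien creature"])

print(merged)
print(tags)  # the original string is unchanged
```

Applied to the frame, the function would be called once per row, for example in a list comprehension over `zip(data["tags"], data["keywords"])`, with the resulting list assigned to `data["tags"]`.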


In [60]:
data.head()

Unnamed: 0,id,title,tags,IMDB_ID,release_year,keywords,reviews
0,283995,Guardians of the Galaxy Vol. 2,adventure action sci-fi chris pratt zoe saldañ...,tt3896198,2017,"[demi god, alien creature, sarcasm, crash land...",[Despite being a huge comic book nerd I was no...
1,480530,Creed II,drama michael b. jordan sylvester stallone tes...,tt6343314,2018,"[baby, training montage, sequel, boxing match,...",[This movie is not as good as the first Creed....
2,299536,Avengers: Infinity War,adventure action sci-fi robert downey jr. chri...,tt4154756,2018,"[superhero, ensemble cast, marvel cinematic un...",[Avengers infinity war is an emotional roller ...
3,299534,Avengers: Endgame,adventure sci-fi action robert downey jr. chri...,tt4154796,2019,"[time travel, superhero, super villain, cosmic...",[But its a pretty good film. A bit of a mess i...
4,337167,Fifty Shades Freed,drama romance dakota johnson jamie dornan eric...,tt4477536,2018,"[sex scene, wedding ceremony, bondage, car cha...",[The first of the three that is actually emoti...


# Merging other columns with the main dataset
Here, we merge the remaining columns into the main dataset so that they can be used as filters in our application. They also form the database that is consulted when the user asks a data-based query.

In [62]:
data2 = pd.read_csv("/Users/dhruv/Desktop/Machine_Learning/Projects/Movie_Recommender/Working/Data/Working_Data/Movie_Data.csv")
data2.head()

Unnamed: 0,id,title,genres,original_language,release_year,runtime,keys,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director
0,615656,Meg 2: The Trench,Action Sci-Fi Horror,en,2023,116.0,"['based on novel or book', 'sequel', 'kaiju']",Jason Statham,Wu Jing,Shuya Sophia Cai,Sergio Peris,Mencheta,Ben Wheatley
1,758323,The Pope's Exorcist,Horror Mystery Thriller,en,2023,103.0,"['spain', 'rome italy', 'vatican', 'pope', 'pi...",Russell Crowe,Daniel Zovatto,Alex Essoe,Franco Nero,Peter DeSouza,Julius Avery
2,667538,Transformers: Rise of the Beasts,Action Adventure Sci-Fi,en,2023,127.0,"['peru', 'alien', 'end of the world', 'based o...",Anthony Ramos,Dominique Fishback,Luna Lauren Velez,Dean Scott Vazquez,Tobe Nwigwe,Steven Caple Jr.
3,640146,Ant-Man and the Wasp: Quantumania,Action Adventure Sci-Fi,en,2023,125.0,"['hero', 'ant', 'sequel', 'superhero', 'based ...",Paul Rudd,Evangeline Lilly,Jonathan Majors,Kathryn Newton,Michelle Pfeiffer,Peyton Reed
4,677179,Creed III,Drama Action,en,2023,116.0,"['philadelphia pennsylvania', 'husband wife re...",Michael B. Jordan,Tessa Thompson,Jonathan Majors,Wood Harris,Phylicia Rashād,Michael B. Jordan


In [64]:
data2 = data2[['id', 'genres', 'Actor_1', 'Actor_2', 'Actor_3','Actor_4','Actor_5', 'Director']]
data2.head()

Unnamed: 0,id,genres,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director
0,615656,Action Sci-Fi Horror,Jason Statham,Wu Jing,Shuya Sophia Cai,Sergio Peris,Mencheta,Ben Wheatley
1,758323,Horror Mystery Thriller,Russell Crowe,Daniel Zovatto,Alex Essoe,Franco Nero,Peter DeSouza,Julius Avery
2,667538,Action Adventure Sci-Fi,Anthony Ramos,Dominique Fishback,Luna Lauren Velez,Dean Scott Vazquez,Tobe Nwigwe,Steven Caple Jr.
3,640146,Action Adventure Sci-Fi,Paul Rudd,Evangeline Lilly,Jonathan Majors,Kathryn Newton,Michelle Pfeiffer,Peyton Reed
4,677179,Drama Action,Michael B. Jordan,Tessa Thompson,Jonathan Majors,Wood Harris,Phylicia Rashād,Michael B. Jordan


In [65]:
data = pd.merge(data, data2, left_on='id', right_on='id', how='left')
data.head()

Unnamed: 0,id,title,tags,IMDB_ID,release_year,keywords,reviews,genres,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director
0,283995,Guardians of the Galaxy Vol. 2,adventure action sci-fi chris pratt zoe saldañ...,tt3896198,2017,"[demi god, alien creature, sarcasm, crash land...",[Despite being a huge comic book nerd I was no...,Adventure Action Sci-Fi,Chris Pratt,Zoe Saldaña,Dave Bautista,Vin Diesel,Bradley Cooper,James Gunn
1,480530,Creed II,drama michael b. jordan sylvester stallone tes...,tt6343314,2018,"[baby, training montage, sequel, boxing match,...",[This movie is not as good as the first Creed....,Drama,Michael B. Jordan,Sylvester Stallone,Tessa Thompson,Wood Harris,Russell Hornsby,Steven Caple Jr.
2,299536,Avengers: Infinity War,adventure action sci-fi robert downey jr. chri...,tt4154756,2018,"[superhero, ensemble cast, marvel cinematic un...",[Avengers infinity war is an emotional roller ...,Adventure Action Sci-Fi,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,Chris Evans,Scarlett Johansson,Anthony RussoJoe Russo
3,299534,Avengers: Endgame,adventure sci-fi action robert downey jr. chri...,tt4154796,2019,"[time travel, superhero, super villain, cosmic...",[But its a pretty good film. A bit of a mess i...,Adventure Sci-Fi Action,Robert Downey Jr.,Chris Evans,Mark Ruffalo,Chris Hemsworth,Scarlett Johansson,Anthony RussoJoe Russo
4,337167,Fifty Shades Freed,drama romance dakota johnson jamie dornan eric...,tt4477536,2018,"[sex scene, wedding ceremony, bondage, car cha...",[The first of the three that is actually emoti...,Drama Romance,Dakota Johnson,Jamie Dornan,Eric Johnson,Luke Grimes,Rita Ora,James Foley


We can now reorder the columns in our dataset.

In [66]:
data.columns

Index(['id', 'title', 'tags', 'IMDB_ID', 'release_year', 'keywords', 'reviews',
       'genres', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5',
       'Director'],
      dtype='object')

In [70]:
data = data.reindex(columns=['id','IMDB_ID', 'title','release_year' ,'genres', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5', 'Director', 'keywords', 'reviews','tags'])
data.head()

Unnamed: 0,id,IMDB_ID,title,release_year,genres,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director,keywords,reviews,tags
0,283995,tt3896198,Guardians of the Galaxy Vol. 2,2017,Adventure Action Sci-Fi,Chris Pratt,Zoe Saldaña,Dave Bautista,Vin Diesel,Bradley Cooper,James Gunn,"[demi god, alien creature, sarcasm, crash land...",[Despite being a huge comic book nerd I was no...,adventure action sci-fi chris pratt zoe saldañ...
1,480530,tt6343314,Creed II,2018,Drama,Michael B. Jordan,Sylvester Stallone,Tessa Thompson,Wood Harris,Russell Hornsby,Steven Caple Jr.,"[baby, training montage, sequel, boxing match,...",[This movie is not as good as the first Creed....,drama michael b. jordan sylvester stallone tes...
2,299536,tt4154756,Avengers: Infinity War,2018,Adventure Action Sci-Fi,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,Chris Evans,Scarlett Johansson,Anthony RussoJoe Russo,"[superhero, ensemble cast, marvel cinematic un...",[Avengers infinity war is an emotional roller ...,adventure action sci-fi robert downey jr. chri...
3,299534,tt4154796,Avengers: Endgame,2019,Adventure Sci-Fi Action,Robert Downey Jr.,Chris Evans,Mark Ruffalo,Chris Hemsworth,Scarlett Johansson,Anthony RussoJoe Russo,"[time travel, superhero, super villain, cosmic...",[But its a pretty good film. A bit of a mess i...,adventure sci-fi action robert downey jr. chri...
4,337167,tt4477536,Fifty Shades Freed,2018,Drama Romance,Dakota Johnson,Jamie Dornan,Eric Johnson,Luke Grimes,Rita Ora,James Foley,"[sex scene, wedding ceremony, bondage, car cha...",[The first of the three that is actually emoti...,drama romance dakota johnson jamie dornan eric...


Now, let us load the last dataset, which supplies the remaining columns for our database.

In [8]:
data3 = pd.read_csv("./Working/Data/TMDB_Movies_Dataset/movies.csv")
data3.head()

Unnamed: 0,id,title,genres,original_language,overview,popularity,production_companies,release_date,budget,revenue,runtime,status,tagline,vote_average,vote_count,credits,keywords,poster_path,backdrop_path,recommendations
0,615656,Meg 2: The Trench,Action-Science Fiction-Horror,en,An exploratory dive into the deepest depths of...,8763.998,Apelles Entertainment-Warner Bros. Pictures-di...,2023-08-02,129000000.0,352056482.0,116.0,Released,Back for seconds.,7.079,1365.0,Jason Statham-Wu Jing-Shuya Sophia Cai-Sergio ...,based on novel or book-sequel-kaiju,/4m1Au3YkjqsxF8iwQy0fPYSxE0h.jpg,/qlxy8yo5bcgUw2KAmmojUKp4rHd.jpg,1006462-298618-569094-1061181-346698-1076487-6...
1,758323,The Pope's Exorcist,Horror-Mystery-Thriller,en,Father Gabriele Amorth Chief Exorcist of the V...,5953.227,Screen Gems-2.0 Entertainment-Jesus & Mary-Wor...,2023-04-05,18000000.0,65675816.0,103.0,Released,Inspired by the actual files of Father Gabriel...,7.433,545.0,Russell Crowe-Daniel Zovatto-Alex Essoe-Franco...,spain-rome italy-vatican-pope-pig-possession-c...,/9JBEPLTPSm0d1mbEcLxULjJq9Eh.jpg,/hiHGRbyTcbZoLsYYkO4QiCLYe34.jpg,713704-296271-502356-1076605-1084225-1008005-9...
2,667538,Transformers: Rise of the Beasts,Action-Adventure-Science Fiction,en,When a new threat capable of destroying the en...,5409.104,Skydance-Paramount-di Bonaventura Pictures-Bay...,2023-06-06,200000000.0,407045464.0,127.0,Released,Unite or fall.,7.34,1007.0,Anthony Ramos-Dominique Fishback-Luna Lauren V...,peru-alien-end of the world-based on cartoon-b...,/gPbM0MK8CP8A174rmUwGsADNYKD.jpg,/woJbg7ZqidhpvqFGGMRhWQNoxwa.jpg,496450-569094-298618-385687-877100-598331-4628...
3,640146,Ant-Man and the Wasp: Quantumania,Action-Adventure-Science Fiction,en,Super-Hero partners Scott Lang and Hope van Dy...,4425.387,Marvel Studios-Kevin Feige Productions,2023-02-15,200000000.0,475766228.0,125.0,Released,Witness the beginning of a new dynasty.,6.507,2811.0,Paul Rudd-Evangeline Lilly-Jonathan Majors-Kat...,hero-ant-sequel-superhero-based on comic-famil...,/qnqGbB22YJ7dSs4o6M7exTpNxPz.jpg,/m8JTwHFwX7I7JY5fPe4SjqejWag.jpg,823999-676841-868759-734048-267805-965839-1033...
4,677179,Creed III,Drama-Action,en,After dominating the boxing world Adonis Creed...,3994.342,Metro-Goldwyn-Mayer-Proximity Media-Balboa Pro...,2023-03-01,75000000.0,269000000.0,116.0,Released,You can't run from your past.,7.262,1129.0,Michael B. Jordan-Tessa Thompson-Jonathan Majo...,philadelphia pennsylvania-husband wife relatio...,/cvsXj3I9Q2iyyIo95AecSd1tad7.jpg,/5i6SjyDbDWqyun8klUuCxrlFbyw.jpg,965839-267805-943822-842942-1035806-823999-107...


From this dataset, we only need a few columns for presenting a movie: the poster path, the backdrop path and the vote average.

In [9]:
data3 = data3[['id', 'vote_average','poster_path','backdrop_path']]
data3.head()

Unnamed: 0,id,vote_average,poster_path,backdrop_path
0,615656,7.079,/4m1Au3YkjqsxF8iwQy0fPYSxE0h.jpg,/qlxy8yo5bcgUw2KAmmojUKp4rHd.jpg
1,758323,7.433,/9JBEPLTPSm0d1mbEcLxULjJq9Eh.jpg,/hiHGRbyTcbZoLsYYkO4QiCLYe34.jpg
2,667538,7.34,/gPbM0MK8CP8A174rmUwGsADNYKD.jpg,/woJbg7ZqidhpvqFGGMRhWQNoxwa.jpg
3,640146,6.507,/qnqGbB22YJ7dSs4o6M7exTpNxPz.jpg,/m8JTwHFwX7I7JY5fPe4SjqejWag.jpg
4,677179,7.262,/cvsXj3I9Q2iyyIo95AecSd1tad7.jpg,/5i6SjyDbDWqyun8klUuCxrlFbyw.jpg


In [10]:
data = pd.merge(data, data3, left_on='id', right_on='id', how='left')
data.head()

Unnamed: 0,id,IMDB_ID,title,release_year,genres,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director,keywords,reviews,tags,vote_average,poster_path,backdrop_path
0,283995,tt3896198,Guardians of the Galaxy Vol. 2,2017,Adventure Action Sci-Fi,Chris Pratt,Zoe Saldaña,Dave Bautista,Vin Diesel,Bradley Cooper,James Gunn,"['demi god', 'alien creature', 'sarcasm', 'cra...","[""Despite being a huge comic book nerd I was n...",adventure action sci-fi chris pratt zoe saldañ...,7.623,/y4MBh0EjBlMuOzv9axM4qJlmhzz.jpg,/aJn9XeesqsrSLKcHfHP4u5985hn.jpg
1,480530,tt6343314,Creed II,2018,Drama,Michael B. Jordan,Sylvester Stallone,Tessa Thompson,Wood Harris,Russell Hornsby,Steven Caple Jr.,"['baby', 'training montage', 'sequel', 'boxing...","[""This movie is not as good as the first Creed...",drama michael b. jordan sylvester stallone tes...,6.99,/v3QyboWRoA4O9RbcsqH8tJMe8EB.jpg,/xTYGN1b3XkOtODryXTKgdXLtPMz.jpg
2,299536,tt4154756,Avengers: Infinity War,2018,Adventure Action Sci-Fi,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,Chris Evans,Scarlett Johansson,Anthony RussoJoe Russo,"['superhero', 'ensemble cast', 'marvel cinemat...","[""Avengers infinity war is an emotional roller...",adventure action sci-fi robert downey jr. chri...,8.26,/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg,/mDfJG3LC3Dqb67AZ52x3Z0jU0uB.jpg
3,299534,tt4154796,Avengers: Endgame,2019,Adventure Sci-Fi Action,Robert Downey Jr.,Chris Evans,Mark Ruffalo,Chris Hemsworth,Scarlett Johansson,Anthony RussoJoe Russo,"['time travel', 'superhero', 'super villain', ...","[""But its a pretty good film. A bit of a mess ...",adventure sci-fi action robert downey jr. chri...,8.268,/or06FN3Dka5tukK1e9sl16pB3iy.jpg,/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg
4,337167,tt4477536,Fifty Shades Freed,2018,Drama Romance,Dakota Johnson,Jamie Dornan,Eric Johnson,Luke Grimes,Rita Ora,James Foley,"['sex scene', 'wedding ceremony', 'bondage', '...","[""The first of the three that is actually emot...",drama romance dakota johnson jamie dornan eric...,6.699,/9ZedQHPQVveaIYmDSTazhT3y273.jpg,/9ywA15OAiwjSTvg3cBs9B7kOCBF.jpg


In [11]:
data = data.reindex(columns=['id','IMDB_ID', 'title','release_year' ,'genres','vote_average', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4', 'Actor_5', 'Director', 'keywords', 'reviews','poster_path','backdrop_path','tags'])
data.head()

Unnamed: 0,id,IMDB_ID,title,release_year,genres,vote_average,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director,keywords,reviews,poster_path,backdrop_path,tags
0,283995,tt3896198,Guardians of the Galaxy Vol. 2,2017,Adventure Action Sci-Fi,7.623,Chris Pratt,Zoe Saldaña,Dave Bautista,Vin Diesel,Bradley Cooper,James Gunn,"['demi god', 'alien creature', 'sarcasm', 'cra...","[""Despite being a huge comic book nerd I was n...",/y4MBh0EjBlMuOzv9axM4qJlmhzz.jpg,/aJn9XeesqsrSLKcHfHP4u5985hn.jpg,adventure action sci-fi chris pratt zoe saldañ...
1,480530,tt6343314,Creed II,2018,Drama,6.99,Michael B. Jordan,Sylvester Stallone,Tessa Thompson,Wood Harris,Russell Hornsby,Steven Caple Jr.,"['baby', 'training montage', 'sequel', 'boxing...","[""This movie is not as good as the first Creed...",/v3QyboWRoA4O9RbcsqH8tJMe8EB.jpg,/xTYGN1b3XkOtODryXTKgdXLtPMz.jpg,drama michael b. jordan sylvester stallone tes...
2,299536,tt4154756,Avengers: Infinity War,2018,Adventure Action Sci-Fi,8.26,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,Chris Evans,Scarlett Johansson,Anthony RussoJoe Russo,"['superhero', 'ensemble cast', 'marvel cinemat...","[""Avengers infinity war is an emotional roller...",/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg,/mDfJG3LC3Dqb67AZ52x3Z0jU0uB.jpg,adventure action sci-fi robert downey jr. chri...
3,299534,tt4154796,Avengers: Endgame,2019,Adventure Sci-Fi Action,8.268,Robert Downey Jr.,Chris Evans,Mark Ruffalo,Chris Hemsworth,Scarlett Johansson,Anthony RussoJoe Russo,"['time travel', 'superhero', 'super villain', ...","[""But its a pretty good film. A bit of a mess ...",/or06FN3Dka5tukK1e9sl16pB3iy.jpg,/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg,adventure sci-fi action robert downey jr. chri...
4,337167,tt4477536,Fifty Shades Freed,2018,Drama Romance,6.699,Dakota Johnson,Jamie Dornan,Eric Johnson,Luke Grimes,Rita Ora,James Foley,"['sex scene', 'wedding ceremony', 'bondage', '...","[""The first of the three that is actually emot...",/9ZedQHPQVveaIYmDSTazhT3y273.jpg,/9ywA15OAiwjSTvg3cBs9B7kOCBF.jpg,drama romance dakota johnson jamie dornan eric...


With this, our final dataset is ready to be used in our application. We now split it into two parts: one for the user's key-based search queries, which we will store in ChromaDB, and one for data-based queries, which we will store in a regular relational database.

In [12]:
data_key_based = data[['id','title','tags']]
data_query_based = data.drop(columns=['tags'])

In [13]:
data_key_based.head()

Unnamed: 0,id,title,tags
0,283995,Guardians of the Galaxy Vol. 2,adventure action sci-fi chris pratt zoe saldañ...
1,480530,Creed II,drama michael b. jordan sylvester stallone tes...
2,299536,Avengers: Infinity War,adventure action sci-fi robert downey jr. chri...
3,299534,Avengers: Endgame,adventure sci-fi action robert downey jr. chri...
4,337167,Fifty Shades Freed,drama romance dakota johnson jamie dornan eric...


In [14]:
data_query_based.head()

Unnamed: 0,id,IMDB_ID,title,release_year,genres,vote_average,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director,keywords,reviews,poster_path,backdrop_path
0,283995,tt3896198,Guardians of the Galaxy Vol. 2,2017,Adventure Action Sci-Fi,7.623,Chris Pratt,Zoe Saldaña,Dave Bautista,Vin Diesel,Bradley Cooper,James Gunn,"['demi god', 'alien creature', 'sarcasm', 'cra...","[""Despite being a huge comic book nerd I was n...",/y4MBh0EjBlMuOzv9axM4qJlmhzz.jpg,/aJn9XeesqsrSLKcHfHP4u5985hn.jpg
1,480530,tt6343314,Creed II,2018,Drama,6.99,Michael B. Jordan,Sylvester Stallone,Tessa Thompson,Wood Harris,Russell Hornsby,Steven Caple Jr.,"['baby', 'training montage', 'sequel', 'boxing...","[""This movie is not as good as the first Creed...",/v3QyboWRoA4O9RbcsqH8tJMe8EB.jpg,/xTYGN1b3XkOtODryXTKgdXLtPMz.jpg
2,299536,tt4154756,Avengers: Infinity War,2018,Adventure Action Sci-Fi,8.26,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,Chris Evans,Scarlett Johansson,Anthony RussoJoe Russo,"['superhero', 'ensemble cast', 'marvel cinemat...","[""Avengers infinity war is an emotional roller...",/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg,/mDfJG3LC3Dqb67AZ52x3Z0jU0uB.jpg
3,299534,tt4154796,Avengers: Endgame,2019,Adventure Sci-Fi Action,8.268,Robert Downey Jr.,Chris Evans,Mark Ruffalo,Chris Hemsworth,Scarlett Johansson,Anthony RussoJoe Russo,"['time travel', 'superhero', 'super villain', ...","[""But its a pretty good film. A bit of a mess ...",/or06FN3Dka5tukK1e9sl16pB3iy.jpg,/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg
4,337167,tt4477536,Fifty Shades Freed,2018,Drama Romance,6.699,Dakota Johnson,Jamie Dornan,Eric Johnson,Luke Grimes,Rita Ora,James Foley,"['sex scene', 'wedding ceremony', 'bondage', '...","[""The first of the three that is actually emot...",/9ZedQHPQVveaIYmDSTazhT3y273.jpg,/9ywA15OAiwjSTvg3cBs9B7kOCBF.jpg


In [22]:
data_key_based.drop_duplicates(subset='id', inplace=True)
data_query_based.drop_duplicates(subset='id', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_key_based.drop_duplicates(subset='id', inplace=True)
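
pandas usually raises this warning when the frame being modified in place was itself obtained by slicing another DataFrame. A minimal sketch of one way to avoid it, assuming that is the cause here: take an explicit copy and reassign the de-duplicated result instead of using `inplace=True`. The small `movies` frame and the name `key_based` are made up for illustration and are not the notebook's real tables.

```python
import pandas as pd

# Hypothetical stand-in for the real table; note the repeated id 299536.
movies = pd.DataFrame({
    "id": [283995, 299536, 299536, 337167],
    "title": ["Guardians of the Galaxy Vol. 2", "Avengers: Infinity War",
              "Avengers: Infinity War", "Fifty Shades Freed"],
    "tags": ["adventure action", "adventure action",
             "adventure action", "drama romance"],
})

# .copy() makes the column selection an independent DataFrame.
key_based = movies[["id", "title"]].copy()

# Reassign the result rather than mutating in place; the first occurrence is kept.
key_based = key_based.drop_duplicates(subset="id").reset_index(drop=True)

print(key_based)
```

Reassigning also makes it explicit which object holds the de-duplicated rows before they are written to CSV.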


In [23]:
data_key_based.to_csv("../Data/Movies_Key_Based.csv", index=False)

In [24]:
data_query_based.to_csv("../Data/Movies_Query_Based.csv", index=False)

In [12]:
data_query_based.head()

Unnamed: 0,id,IMDB_ID,title,release_year,genres,vote_average,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director,keywords,reviews,poster_path,backdrop_path
0,283995,tt3896198,Guardians of the Galaxy Vol. 2,2017,Adventure Action Sci-Fi,7.623,Chris Pratt,Zoe Saldaña,Dave Bautista,Vin Diesel,Bradley Cooper,James Gunn,"['demi god', 'alien creature', 'sarcasm', 'cra...","[""Despite being a huge comic book nerd I was n...",/y4MBh0EjBlMuOzv9axM4qJlmhzz.jpg,/aJn9XeesqsrSLKcHfHP4u5985hn.jpg
1,480530,tt6343314,Creed II,2018,Drama,6.99,Michael B. Jordan,Sylvester Stallone,Tessa Thompson,Wood Harris,Russell Hornsby,Steven Caple Jr.,"['baby', 'training montage', 'sequel', 'boxing...","[""This movie is not as good as the first Creed...",/v3QyboWRoA4O9RbcsqH8tJMe8EB.jpg,/xTYGN1b3XkOtODryXTKgdXLtPMz.jpg
2,299536,tt4154756,Avengers: Infinity War,2018,Adventure Action Sci-Fi,8.26,Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo,Chris Evans,Scarlett Johansson,Anthony RussoJoe Russo,"['superhero', 'ensemble cast', 'marvel cinemat...","[""Avengers infinity war is an emotional roller...",/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg,/mDfJG3LC3Dqb67AZ52x3Z0jU0uB.jpg
3,299534,tt4154796,Avengers: Endgame,2019,Adventure Sci-Fi Action,8.268,Robert Downey Jr.,Chris Evans,Mark Ruffalo,Chris Hemsworth,Scarlett Johansson,Anthony RussoJoe Russo,"['time travel', 'superhero', 'super villain', ...","[""But its a pretty good film. A bit of a mess ...",/or06FN3Dka5tukK1e9sl16pB3iy.jpg,/7RyHsO4yDXtBv1zUU3mTpHeQ0d5.jpg
4,337167,tt4477536,Fifty Shades Freed,2018,Drama Romance,6.699,Dakota Johnson,Jamie Dornan,Eric Johnson,Luke Grimes,Rita Ora,James Foley,"['sex scene', 'wedding ceremony', 'bondage', '...","[""The first of the three that is actually emot...",/9ZedQHPQVveaIYmDSTazhT3y273.jpg,/9ywA15OAiwjSTvg3cBs9B7kOCBF.jpg


Next, we will show how to integrate text-to-SQL, and then how to use it in our application, along with the different types of data-based queries a user can submit.

# Query Processor
We aim to understand the types of queries a user might enter on the website and train our model accordingly.

### 1. Understand Query Types:

Identify and understand the types of queries you want your system to support. For instance:
- "Movies directed by Christopher Nolan"
- "Horror films released in 2010"
- "Romantic movies starring Julia Roberts"

### 2. Design a Query Processor:

Given the types of queries you wish to support, design a function that processes these queries. This involves:
- Tokenizing the query
- Identifying keywords or named entities like director names, actor names, genres, years, etc.
- Mapping the identified keywords to relevant columns in your dataset
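
The three steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the lookup tables `GENRE_TERMS`, `DIRECTORS`, and `ACTORS` are hypothetical and would in practice be built from the dataset's columns, and the helper names `tokenize`, `find_names`, and `extract_filters` are invented for this example.

```python
import re

# Hypothetical lookup tables; in practice they would be built from the dataset.
GENRE_TERMS = {"action": "Action", "horror": "Horror", "romance": "Romance",
               "romantic": "Romance", "sci-fi": "Sci-Fi"}
DIRECTORS = ["christopher nolan", "james gunn"]
ACTORS = ["julia roberts", "chris pratt"]

def tokenize(query):
    """Lowercase the query and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", query.lower())

def find_names(text, names):
    """Return the known names that occur as whole phrases in the token text."""
    padded = f" {text} "
    return [name for name in names if f" {name} " in padded]

def extract_filters(query):
    """Map recognised tokens and phrases to dataset column names."""
    tokens = tokenize(query)
    text = " ".join(tokens)
    filters = {}

    # Four-digit years map to the release_year column.
    years = [int(t) for t in tokens if re.fullmatch(r"(?:19|20)\d{2}", t)]
    if years:
        filters["release_year"] = years

    # Genre words (including a few synonyms) map to the genres column.
    genres = [GENRE_TERMS[t] for t in tokens if t in GENRE_TERMS]
    if genres:
        filters["genres"] = genres

    directors = find_names(text, DIRECTORS)
    if directors:
        filters["Director"] = directors

    # "actors" stands for a search across Actor_1 to Actor_5.
    actors = find_names(text, ACTORS)
    if actors:
        filters["actors"] = actors
    return filters

for q in ["Movies directed by Christopher Nolan",
          "Horror films released in 2010",
          "Romantic movies starring Julia Roberts"]:
    print(q, "->", extract_filters(q))
```

A lookup-based mapping like this only recognises names it already knows, which is the limitation that NER or fuzzy matching is meant to relax.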

For a college project:

- Start with a Rule-Based Approach: This gives you a working baseline quickly. Implement basic patterns that you anticipate users will employ.
- Integrate NER: This will allow you to flexibly identify movie names, director names, etc. without hardcoding anything.
- Experiment with Pre-trained Models (if feasible): If you're interested in diving deeper and have the computational resources, consider experimenting with fine-tuning a model for query classification or entity recognition.

Lastly, remember to build iteratively. Get a basic version working, test it, and then incrementally add complexity.

In [17]:
data.head(2)

Unnamed: 0,id,title,genres,original_language,release_year,runtime,keys,Actor_1,Actor_2,Actor_3,Actor_4,Actor_5,Director
0,615656,"[meg, 2, :, the, trench]","[action, sci-fi, horror]",[en],2023,116.0,"[based, on, novel, or, book, sequel, kaiju]","[jason, statham]","[wu, jing]","[shuya, sophia, cai]","[sergio, peris]",[mencheta],"[ben, wheatley]"
1,758323,"[the, pope, 's, exorcist]","[horror, mystery, thriller]",[en],2023,103.0,"[spain, rome, italy, vatican, pope, pig, posse...","[russell, crowe]","[daniel, zovatto]","[alex, essoe]","[franco, nero]","[peter, desouza]","[julius, avery]"


#### Type of Possible Queries
For the rule-based approach, we will support queries based on the following filters.

* Director Based Queries
    * Movies directed by Christopher Nolan
    * What films has Christopher Nolan directed?
* Actor Based Movies
    * Films with Leonardo DiCaprio
    * Movies starring Michael B. Jordan
* Genre Based Movies
    * Action Movies from the 2010s
    * Horror Movies
* Year Based Movies
    * Movies from 2015 to 2017
    * Films released in 2023
* Keyword Based Movies
    * Movies about space and science
    * Superhero films
    * Romantic feel good movies

Apart from this, we can also have a combination of these types of queries. Let us now use a tokenizer to understand and filter out these queries.
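
As a rough illustration of the rule-based idea, the director and actor query types listed above can be recognised with a few hand-written regular expressions. The names `STOP`, `RULES`, and `match_rules` are hypothetical, and the patterns cover only the phrasings shown in the list, so treat this as a sketch rather than a complete grammar.

```python
import re

# Words that usually begin the next clause; a captured name stops before them.
STOP = r"(?=\s+(?:from|in|released|starring|with|directed)\b|\s*[?.!]*$)"

RULES = {
    "Director": [
        re.compile(r"\bdirected by\s+(.+?)" + STOP),   # "... directed by X"
        re.compile(r"\bhas\s+(.+?)\s+directed\b"),      # "what films has X directed?"
    ],
    "Actor": [
        re.compile(r"\b(?:starring|with|featuring)\s+(.+?)" + STOP),
    ],
}

def match_rules(query):
    """Return {filter_type: captured_text} for every rule that fires."""
    q = query.lower().strip()
    found = {}
    for filter_type, patterns in RULES.items():
        for pattern in patterns:
            m = pattern.search(q)
            if m:
                found[filter_type] = m.group(1).strip()
                break  # first matching pattern wins for this filter type
    return found

for q in ["Movies directed by Christopher Nolan",
          "What films has Christopher Nolan directed?",
          "Films with Leonardo DiCaprio",
          "Movies directed by Christopher Nolan starring Christian Bale"]:
    print(q, "->", match_rules(q))
```

Because each filter type is matched independently, a combined query returns several entries in the same dictionary.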

### 3. Query Execution:

Using the processed query, search your dataset to return relevant results. This can be done using:
- Exact matches: For queries like director names, years, etc.
- Fuzzy matching: If you want to handle slight variations or typos in the query
- Vector representations (like TF-IDF, Word2Vec, etc.): For more complex queries
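
To make the difference between exact and fuzzy matching concrete, here is a small sketch using `difflib.get_close_matches` from the Python standard library. The `DIRECTORS` list and the `resolve_name` helper are assumptions made for this example; in the notebook the candidates would come from the `Director` column, and the `cutoff` value would need tuning.

```python
from difflib import get_close_matches

# Hypothetical candidate list; in practice taken from the Director column.
DIRECTORS = ["Christopher Nolan", "James Gunn", "Steven Caple Jr.", "James Foley"]

def resolve_name(user_text, candidates, cutoff=0.8):
    """Return the candidate closest to user_text, or None if nothing is similar enough."""
    lowered = {c.lower(): c for c in candidates}
    key = user_text.lower().strip()
    if key in lowered:                       # exact (case-insensitive) match first
        return lowered[key]
    matches = get_close_matches(key, list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

print(resolve_name("christopher nolan", DIRECTORS))   # exact match
print(resolve_name("Cristopher Nolan", DIRECTORS))    # typo resolved by fuzzy matching
print(resolve_name("Quentin Tarantino", DIRECTORS))   # not in the list
```

Trying the exact match first keeps the common case cheap and avoids a fuzzy match overriding a correct spelling.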

### 4. Testing:

Write test cases for your query system to ensure it's working as expected. Start with basic queries and then move on to more complex ones. Iterate and improve your query processor based on the results.
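
A table of queries and expected results is one simple way to start. The sketch below tests a hypothetical `parse_years` helper (not part of the notebook) against the year-based example queries listed earlier; the same pattern extends to the other query types.

```python
import re

def parse_years(query):
    """Return an inclusive (start, end) year range found in the query, or None."""
    q = query.lower()
    m = re.search(r"\b((?:19|20)\d{2})\s*(?:to|-)\s*((?:19|20)\d{2})\b", q)
    if m:                                        # "2015 to 2017"
        return int(m.group(1)), int(m.group(2))
    m = re.search(r"\b((?:19|20)\d0)s\b", q)     # decades such as "2010s"
    if m:
        start = int(m.group(1))
        return start, start + 9
    m = re.search(r"\b((?:19|20)\d{2})\b", q)    # a single year
    if m:
        year = int(m.group(1))
        return year, year
    return None

# Basic cases first, then cases with no year at all.
TEST_CASES = [
    ("Movies from 2015 to 2017", (2015, 2017)),
    ("Action Movies from the 2010s", (2010, 2019)),
    ("Films released in 2023", (2023, 2023)),
    ("Horror Movies", None),
]

def run_tests():
    """Return a list of (query, expected, actual) for every failing case."""
    failures = []
    for query, expected in TEST_CASES:
        actual = parse_years(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures

failures = run_tests()
print("all passed" if not failures else failures)
```

Two-digit decades such as "the 90s" are deliberately left out of this sketch; adding them as a failing test case first is a natural next iteration.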

### 5. User Experience:

Consider giving feedback to the user in case their query does not return results or is ambiguous. You might want to suggest refined queries or correct potential typos.
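
One possible shape for this feedback, sketched with the standard library: when a search term returns nothing, suggest close spellings from a known vocabulary. The `feedback_message` function, its wording, and the sample vocabulary are assumptions for illustration only.

```python
from difflib import get_close_matches

def feedback_message(term, results, vocabulary):
    """Build a short user-facing message for a search term and its results."""
    if results:
        return f"Found {len(results)} result(s) for '{term}'."
    # No results: look for similarly spelled terms the system does know.
    suggestions = get_close_matches(term.lower(), vocabulary, n=3, cutoff=0.6)
    if suggestions:
        return f"No results for '{term}'. Did you mean: {', '.join(suggestions)}?"
    return f"No results for '{term}'. Try a director, actor, genre, or year."

genres = ["horror", "drama", "romance"]   # hypothetical vocabulary
print(feedback_message("horror", ["The Pope's Exorcist"], genres))
print(feedback_message("horor", [], genres))
print(feedback_message("xyz", [], genres))
```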

---

Here's a basic example using exact matching for a director-based query:

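```python
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" tokenizer data

def execute_query(query):
    # Tokenize the lowercased query
    tokens = word_tokenize(query.lower())

    # Director-based query: everything after "by" is treated as the director's name
    if 'directed' in tokens and 'by' in tokens:
        director_name = ' '.join(tokens[tokens.index('by') + 1:])
        # The tokens are lowercase, so lowercase the column before comparing;
        # df1 is assumed to be a DataFrame whose 'Director' column holds plain strings
        return df1[df1['Director'].str.lower() == director_name]

    # Add other query types
    # ...

    return None

result = execute_query("Movies directed by Christopher Nolan")
print(result)
```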

This is a very rudimentary example, but it gives you a starting point. The complexity of the system can grow as you incorporate more query types, handle synonyms (e.g., "films" vs "movies"), or deal with compound queries (e.g., "Romantic movies from the 90s starring Julia Roberts").
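
A compound query can be handled by combining one boolean mask per filter. The sketch below assumes the column names of the query-based table (`release_year`, `genres`, `Actor_1`, ...); the four sample rows, the `GENRE_SYNONYMS` map, and the `filter_movies` function are made up for illustration.

```python
import pandas as pd

# Illustrative rows only; the real table has many more columns and rows.
df = pd.DataFrame({
    "title": ["Pretty Woman", "Notting Hill", "Erin Brockovich", "Titanic"],
    "release_year": [1990, 1999, 2000, 1997],
    "genres": ["Comedy Romance", "Romance Comedy Drama", "Drama", "Drama Romance"],
    "Actor_1": ["Richard Gere", "Julia Roberts", "Julia Roberts", "Leonardo DiCaprio"],
    "Actor_2": ["Julia Roberts", "Hugh Grant", "Albert Finney", "Kate Winslet"],
})

GENRE_SYNONYMS = {"romantic": "Romance", "scary": "Horror"}  # hypothetical synonym map

def filter_movies(df, genre=None, actor=None, years=None):
    """Combine whichever filters are present with logical AND."""
    mask = pd.Series(True, index=df.index)
    if genre:
        genre = GENRE_SYNONYMS.get(genre.lower(), genre)
        mask &= df["genres"].str.contains(genre, case=False, regex=False, na=False)
    if actor:
        # An actor may appear in any of the Actor_* columns.
        actor_cols = [c for c in df.columns if c.startswith("Actor_")]
        in_cast = df[actor_cols].apply(lambda col: col.str.lower() == actor.lower())
        mask &= in_cast.any(axis=1)
    if years:
        start, end = years
        mask &= df["release_year"].between(start, end)  # inclusive on both ends
    return df[mask]

# "Romantic movies from the 90s starring Julia Roberts"
result = filter_movies(df, genre="romantic", actor="Julia Roberts", years=(1990, 1999))
print(result["title"].tolist())
```

Keeping each filter optional means the same function serves single-filter queries and any combination of them.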

The real power comes when you integrate machine learning or deep learning models, but that requires more computational resources and a more extensive setup. For a college project, starting with a rule-based system and progressively adding complexity is a good approach.