# Basic Search Engine on the Netflix Dataset

# Obtain

In [13]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/dwihdyn/ds-exploration/main/p3/data/netflix-titles.txt')
df.sample(3)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5302,s5303,Movie,Runaway Bride,Garry Marshall,"Julia Roberts, Richard Gere, Joan Cusack, Hect...",United States,"December 1, 2020",1999,PG,116 min,"Comedies, Romantic Movies",Sparks fly when a newspaper columnist writes a...
6261,s6262,Movie,The D Train,"Jarrad Paul, Andrew Mogel","Jack Black, James Marsden, Kathryn Hahn, Jeffr...","United Kingdom, United States","April 2, 2017",2015,R,101 min,Comedies,A loser in charge of planning his high school ...
5543,s5544,Movie,Shark Night,David R. Ellis,"Sara Paxton, Dustin Milligan, Chris Carmack, K...",United States,"November 2, 2019",2011,PG-13,91 min,Horror Movies,A weekend of beach-house debauchery turns into...


# Scrub

- Skipped: the data is assumed to be clean.

# Explore

- Skipped: the focus here is on building the search engine.

# Model

- How this search engine works:
    - We want to match the input (query) to the closest movie descriptions.
    - Every movie description, and later the query typed into the search engine, is turned into a TF-IDF vector. The IDF (inverse document frequency) of a term is log(N/df):
        - N : number of documents (here, movie descriptions)
        - df : document frequency, i.e. the number of documents that contain the term
        - the log dampens the ratio: with millions of documents, N/df for a rare term becomes huge, and the log keeps such terms from dominating the weights
    - TF (term frequency) counts how often the term occurs in one document; the weight of a term in a document is TF × IDF.
    - Finally, check which description vector is closest to the query vector.

- We compute these weights by fitting `TfidfVectorizer` on the movie descriptions. Once it is fitted, the same vectorizer transforms the query.

- The search returns the movie descriptions whose vectors are closest to the query vector.

- vector : a list of numbers, one per vocabulary term, that gives the position of a text in a high-dimensional space (a point in the 2D plane is the two-term special case)
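To make the log(N/df) weighting concrete, here is a minimal pure-Python sketch on a hypothetical three-document corpus (the documents and the helper names `idf`, `tfidf_vector`, and `cosine` are illustrative, not from the notebook). It uses the plain formula as written above. `TfidfVectorizer` by default applies a smoothed IDF, ln((1+N)/(1+df)) + 1, and L2-normalises each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the Netflix data)
docs = [
    "action hero saves city",
    "romance in the city",
    "action packed action thriller",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df: number of documents that contain each term
df_counts = Counter(term for tokens in tokenized for term in set(tokens))
vocab = sorted(df_counts)


def idf(term):
    """Plain IDF = log(N / df)."""
    return math.log(N / df_counts[term])


def tfidf_vector(tokens):
    """One weight per vocabulary term: raw term count times IDF."""
    tf = Counter(tokens)
    return [tf[t] * idf(t) for t in vocab]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_vectors = [tfidf_vector(tokens) for tokens in tokenized]
query_vector = tfidf_vector("action".split())
scores = [cosine(query_vector, v) for v in doc_vectors]

for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second document never mentions "action", so its score is zero; the third scores highest because "action" carries a larger share of its vector.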

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# drop English stopwords when building the vocabulary
tfidf = TfidfVectorizer(stop_words='english')

# fit TfidfVectorizer on the movie descriptions and build the document-term matrix
feature = tfidf.fit_transform(df['description'])

In [15]:
# the keyword we want to search for
query = "action"

# transform the query into a vector in the same space as the movie descriptions
query_feature = tfidf.transform([query])

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

# compare the query to every movie description using cosine similarity
cosims = cosine_similarity(query_feature, feature).flatten()

# argsort is ascending, so [-6 : -1] picks ranks 6 to 2 and leaves out the single best match;
# use [-5:][::-1] to get the true top 5, best first
results = cosims.argsort()[-6 : -1]
print(results)

[ 167 5673 1685 3110 6590]
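A small sketch of how the `argsort` slice behaves, using hypothetical similarity scores for seven documents (the numbers are made up for illustration). It contrasts the `[-6 : -1]` slice used above with a slice that keeps the best match.

```python
import numpy as np

# Hypothetical cosine similarities for seven documents (illustrative values)
cosims = np.array([0.10, 0.90, 0.30, 0.70, 0.00, 0.50, 0.20])

# argsort returns indices in ascending order of score: worst first, best last
order = cosims.argsort()

# The slice used above: drops order[-1], the best match, and keeps ranks 6 to 2
ranks_6_to_2 = order[-6:-1]

# Five best matches, best first
top5 = order[-5:][::-1]

print(ranks_6_to_2)
print(top5)
```

Document 1 has the highest score (0.90) here, yet it only appears in `top5`.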


# iNterpret

In [17]:
# show the search results: titles and descriptions of the movies closest to the query

for i in results:
    print(df.iloc[i]['title'])
    print(df.iloc[i]['description'])
    print('------')


A Fairly Odd Summer
In this live-action adventure, the gang heads to Hawaii, where Timmy learns the source of all fairy magic is in dangerous hands.
------
Small Soldiers
When the Commando Elite, a group of toy action figures, are released before they've been tested, they attack the children playing with them.
------
Defiance
In this action-packed drama based on an extraordinary true story, four brothers protect more than 1,000 Jewish refugees during World War II.
------
Jagga Jasoos
An eccentric, self-proclaimed detective goes on an action-packed hunt for his missing father, with an awkward but adventurous journalist by his side.
------
The Liar
At the CARE detective agency, investigators see a lot of action but must always keep their cases secret, even when it affects their personal lives.
------


In [18]:
def search_engine(query, df):
    '''Print the titles and descriptions of the 5 movies returned for a search query, ranked by TF-IDF cosine similarity to the movie descriptions.'''

    # drop English stopwords when building the vocabulary
    tfidf = TfidfVectorizer(stop_words='english')

    # fit TfidfVectorizer on the movie descriptions
    feature = tfidf.fit_transform(df['description'])

    # transform the query into a vector in the same space as the descriptions
    query_feature = tfidf.transform([query])

    # compare the query to every movie description using cosine similarity
    cosims = cosine_similarity(query_feature, feature).flatten()

    # argsort is ascending, so [-6 : -1] picks ranks 6 to 2 and leaves out the single best match;
    # use [-5:][::-1] to get the true top 5, best first
    results = cosims.argsort()[-6 : -1]

    # show search result
    for i in results:
        print(df.iloc[i]['title'])
        print(df.iloc[i]['description'])
        print('------')
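The function above refits the vectorizer on every call and prints its results. One possible variation, sketched below, fits once and returns the matches as a DataFrame, best match first. The names `build_index` and `search`, the `k` parameter, and the four-row toy DataFrame are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_index(df, column="description"):
    """Fit the vectorizer once and return it with the document-term matrix."""
    tfidf = TfidfVectorizer(stop_words="english")
    features = tfidf.fit_transform(df[column])
    return tfidf, features


def search(query, df, tfidf, features, k=5):
    """Return the k rows most similar to the query, best match first."""
    query_feature = tfidf.transform([query])
    cosims = cosine_similarity(query_feature, features).flatten()
    top = cosims.argsort()[-k:][::-1]  # last k of the ascending order, reversed
    return df.iloc[top].assign(score=cosims[top])[["title", "score"]]


# Hypothetical stand-in for the Netflix DataFrame
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": [
        "An action-packed chase across the city.",
        "A quiet romance in a small town.",
        "Toy action figures.",
        "A documentary about cooking.",
    ],
})

tfidf, features = build_index(toy)
hits = search("action", toy, tfidf, features, k=2)
print(hits)
```

Returning a DataFrame keeps the scores available for inspection, and the fitted `tfidf` and `features` can be reused across queries.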

In [20]:
search_engine("romance", df)

Last Ferry
Seeking romance and friendship, a young gay lawyer travels to Fire Island in the off-season and is soon on the run after witnessing a murder.
------
Maya Memsaab
A beautiful, wealthy woman’s insatiable appetite for romance leads to tragedy and a police investigation.
------
World Famous Lover
When his frustrated girlfriend decides to leave him, a struggling writer gets down to work on stories of romance he hopes will win her back.
------
Single Ladies Senior
Four best friends and spirited career women navigate the treacherous world of romance – even as it stands in the way of work and friendship.
------
A Mission in an Old Movie
A young man struggles with his overbearing mother while looking for romance and a way to kick-start his show business career.
------


In [22]:
search_engine("horror", df)

Goedam
When night falls on the city, shadows and spirits come alive in this horror anthology series centered on urban legends.
------
My Honor Was Loyalty
Amid the chaos and horror of World War II, a committed German soldier fights a private battle with his own conscience.
------
Fear Files... Har Mod Pe Darr
Possessed lovers, witches, haunted houses and more bring tales of horror to the screen in this anthology series.
------
Haunters: The Art of the Scare
This documentary takes us into the world of those who create horror simulations for willing audiences, and examines the culture they have spawned.
------
Darna Mana Hai
Stranded in a jungle when their car breaks down, six friends pass their time exchanging horror stories, unaware that they may be part of one themselves.
------


In [23]:
search_engine("comedy", df)

Internet Famous
Five viral Internet celebrities travel to a competition that will award one of them their own television series in this ensemble comedy.
------
Santo Cachón
A man whose wife is repeatedly cheating on him turns to his friends for support in this wacky comedy.
------
Middleditch & Schwartz
Comedy duo Thomas Middleditch and Ben Schwartz turn small ideas into epically funny stories in this series of completely improvised comedy specials.
------
Ray Romano: Right Here, Around the Corner
Ray Romano cut his stand-up teeth at the Comedy Cellar in New York. Now, in his first comedy special in 23 years, he returns to where it all began.
------
Monty Python: Before the Flying Circus
Discover how six seemingly ordinary but supremely talented men became Monty Python, sketch comedy's revolutionary comedy troupe.
------


In [25]:
search_engine("gruesome", df)

Hamid
Wanting his missing father to come home, a Kashmiri boy repeatedly attempts to call God for help – until one day, a hardened army officer picks up.
------
Hamburger Hill
The Vietnam War's horrors come brutally to life through the eyes of American soldiers trying to take a heavily fortified hill under Vietcong control.
------
Bloodride
The doomed passengers aboard a spectral bus head toward a gruesome, unknown destination in this deliciously macabre horror anthology series.
------
Don't F**k with Cats: Hunting an Internet Killer
A twisted criminal's gruesome videos drive a group of amateur online sleuths to launch a risky manhunt that pulls them into a dark underworld.
------
Fallet
A Swedish detective and her timid British colleague's attempt to solve a gruesome murder case nets mixed results and miscommunication.
------


1) Action : 3/5 

2) Romance : 5/5 (best)

3) Horror : 4/5

4) Comedy : 2/5 (worst)

5) Gruesome : 3/5
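Read as the number of relevant results among the 5 shown for each query, these scores are precision@5 values and can be averaged. The sketch below does that arithmetic. It assumes the scores are manual relevance judgments and that the fifth score belongs to the "gruesome" query.

```python
# Relevant results out of the 5 shown per query, copied from the scores above
relevant_out_of_5 = {
    "action": 3,
    "romance": 5,
    "horror": 4,
    "comedy": 2,
    "gruesome": 3,  # assumed to be the query behind the fifth score
}
k = 5

# precision@k = relevant results in the top k, divided by k
precision_at_k = {query: hits / k for query, hits in relevant_out_of_5.items()}
mean_precision = sum(precision_at_k.values()) / len(precision_at_k)

for query, p in precision_at_k.items():
    print(f"{query:10s} {p:.2f}")
print(f"mean precision@{k}: {mean_precision:.2f}")
```

Across the five queries, 17 of the 25 results were judged relevant, which gives a mean precision@5 of 0.68.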