In [2]:
#Not included in Quiz/Solutions
# execute to import notebook styling for tables and width etc.
from IPython.core.display import HTML

# computational imports
import numpy as np
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize    
nltk.download('averaged_perceptron_tagger')
from sklearn.feature_extraction import text
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn
import string

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


<font size=18>Week 13 Homework - Recommender Systems 1</font>

# General Multiple Choice Questions
## **Question 1** <font color="magenta">2 points</font>
When would you use dot-product similarity function?

* To calculate the similarity matrix for a Tfidf Vector Matrix 
* To calculate the similarity matrix for a Count Vector Matrix 
* To standardize text to root words 
* To combine columns of text before vectorization

## **Question 2** <font color="magenta">2 points</font>
What is lemmatization?

* Shortening words by removing suffixes and prefixes 
* Standardizing text to their root words 
* Generating a matrix of word counts 
* Chunking text into multi-word phrases 





# Build a Knowledge-Based Recommender

You will be using the data set **tmdb-simplified.csv** to build a simple knowledge-based recommender system. This data set can be found in the data folder in the same folder as this notebook. 

**You will need to use the option encoding = "ISO-8859-1" in the read_csv function in order to open this file.** 

* Read in the file to a variable called "movies" and review the data.
* Apply literal_eval to the genres, keywords, and production_companies columns. (They are already lists, not dictionaries.)
* Filter out movies that have have nothing or zero in the budget.
* Determine how many rows are in this dataframe. 

**Note: This code is ungraded.**

## **Question 3** - How many rows of data are there? <font color="magenta">1 point</font>



In [3]:
movies = pd.read_csv('data/tmdb-simplified.csv', encoding="ISO-8859-1")
for c in ['production_companies', 'genres', 'keywords']:
	movies[c] = movies[c].apply(literal_eval)
movies = movies.loc[movies['budget']!=0]
movies

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average
0,Avatar,237000000,"[Ingenious Film Partners, Twentieth Century Fo...","[action, adventure, fantasy, science fiction]","[culture clash, future, space war]","In the 22nd century, a paraplegic Marine is di...",11800,7.2
1,Pirates of the Caribbean: At World's End,300000000,"[Walt Disney Pictures, Jerry Bruckheimer Films...","[adventure, fantasy, action]","[ocean, drug abuse, exotic island]","Captain Barbossa, long believed to be dead, ha...",4500,6.9
2,Spectre,245000000,"[Columbia Pictures, Danjaq, B24]","[action, adventure, crime]","[spy, based on novel, secret agent]",A cryptic message from Bonds past sends him on...,4466,6.3
3,The Dark Knight Rises,250000000,"[Legendary Pictures, Warner Bros., DC Entertai...","[action, crime, drama, thriller]","[dc comics, crime fighter, terrorist]",Following the death of District Attorney Harve...,9106,7.6
4,John Carter,260000000,[Walt Disney Pictures],"[action, adventure, science fiction]","[based on novel, mars, medallion]","John Carter is a war-weary, former military ca...",2124,6.1
...,...,...,...,...,...,...,...,...
4791,Tin Can Man,13,"[Park Films, Camera Stylo Films]",[horror],[home invasion],Recently dumped by his girlfirend for another ...,1,2.0
4792,Cure,20000,[Daiei Studios],"[crime, horror, mystery, thriller]","[japan, prostitute, hotel]",A wave of gruesome murders is sweeping Tokyo. ...,63,7.4
4796,Primer,7000,[Thinkfilm],"[science fiction, drama, thriller]","[distrust, garage, identity crisis]",Friends/fledgling entrepreneurs invent a devic...,658,6.9
4798,El Mariachi,220000,[Columbia Pictures],"[action, crime, thriller]","[united statesâmexico barrier, legs, arms]",El Mariachi just wants to play his guitar and ...,238,6.6


## **Question 4** - Prep Work & Building a Filter Function (manually graded) <font color="magenta">5 points</font>

Before we build the recommender function that allows for user input, we're going to write a filter function that takes in manual (coded) input and filters our dataframe. Your function should take in parameters for the dataframe, two genres, a production company, and max budget. The filter should identify movies that meet the following criteria:

* Have either genre
* Are NOT made by the production company (the production company is not in the list of production companies)
* Have a budget that is less than or equal to the max budget.

The function should return the filtered dataframe. 

We've given you the function definition. Fill in the code.

Use the examples given in the lesson and Banik's book as a guide. (Do not explode. Use the lesson approach.) 


In [4]:
def filterMovies(df, genre1, genre2, company, budgetmax):
	'''
	Parameters:
	df: The pandas dataframe to filter
	genre1: A possible genre
	genre2: Another possible genre
	company: A production company that can not be in the production company column
	budgetmax: The maximum budget allowed

	Returns: a filtered dataframe
	'''
	filtered = df[((df['genres'].apply(lambda x: genre1 in x)) | 
				   (df['genres'].apply(lambda x: genre2 in x))) &
				 	~(df['production_companies'].apply(lambda x: company in x)) &
				 	(df['budget'] <= budgetmax)]
	
	return(filtered)

<font color='blue'>Hint: If you call your function with the following parameters, you should be left with 27 movies:</font>

    * genres of 'action' and 'adventure' 
    * production company: 'Beijing New Picture Film Co. Ltd.'
    * max budget: 1000000


## **Question 5** Calling Your Filter Function <font color="magenta">2 points</font>
Call your function using the following parameters:

* genres of 'action' and 'crime'
* the production company 'Columbia Pictures'
* max budget of 2 million (2000000). 

Report how many movies are left.






In [5]:
action_crime = filterMovies(movies, 'action', 'crime', 'Colombia Pictures', 2000000)
len(action_crime)

79

## **Question 6** - Fetch the List of Unique Genres (multiple choice) <font color="magenta">2 points</font>
Using the examples from the lesson, generate a *string* of unique genres. *Sort* the genres alphabetically.

What is the 3rd word in the sorted string list?

* fantasy
* **animation**
* comedy
* adventure
* crime




In [6]:
genres = np.sort(movies.apply(lambda x:pd.Series(x['genres'], dtype='object'),axis=1).stack().unique())
genres

array(['action', 'adventure', 'animation', 'comedy', 'crime',
       'documentary', 'drama', 'family', 'fantasy', 'foreign', 'history',
       'horror', 'music', 'mystery', 'romance', 'science fiction',
       'thriller', 'tv movie', 'war', 'western'], dtype=object)

## **Question 7** - Count the Number of Unique Production Companies <font color="magenta">2 points</font>
Using the examples from the lesson, generate a numpy array of production companies and determine the length of that array. How many unique production companies are there?



In [7]:
companies = movies.apply(lambda x:pd.Series(x['production_companies'], dtype='object'),axis=1).stack().unique()
len(companies)

2702

## **Question 8** - Creating the User Input Function (Manually graded) <font color="magenta">5 points</font>
We finally have all the pieces to create a function that returns the top N movies based on the IMDB score and the filter you wrote. We're going to to modify/expand on the build_chart function from the lesson. Once again, we'll give you the function definition in the cell below. We are also giving you the weighted_rating function. Be sure to run that cell.

Your build_chart function should take in:
* the dataframe to filter
* the filter function (you've already written this)
* the rater function (provided below)
* a parameter called "filter_location" which should be either the string 'before' or the string 'after' (filter before or after computing m and C and scoring) 
* a number of movies to return. 

The function should return the top 'n' rows of a dataframe sorted in descending order of the score column. It will return whatever columns you pass in.

In [8]:
#not included in quiz/solutions
#################
# Provided code. Run this cell 
#################

# Function to compute the IMDB weighted rating for each movie
def weighted_rating(x, m, C):
    v = x['vote_count']
    R = x['vote_average']
    # Compute the weighted score
    return (v/(v+m) * R) + (m/(m+v) * C)

In [9]:

def build_chart(df, filter_func, rater, filter_location='after', n=10, *args):
	'''
	Parameters
	df: the dataframe to that will be filtered, scored, sorted (not necessarily in that order)
	filter: the function that's being used to filter (in this case, it would be the filterMovies function)
	rater: the function used to rate or score each movie (in this case, it would be the weighted_rating function)
	filter_location: either the string 'before' or the string 'after.' If 'before' is passed, the filter will be applied before scoring. If after, it will be applied after.
	n: the number of rows to return. Defaults to 10

	Returns
	The top n rows of the sorted dataframe
	'''
	# create initial copy of df
	movies = df.copy()
	
	# calculate m and C now
	C = movies['vote_average'].mean()
	m = movies['vote_count'].quantile(.8)
	
	movies['score'] = movies.apply(rater, args=(m,C), axis=1)
	
	# apply filter if user says so
	if filter_location == 'before':
		if len(args) == 4:
			genre1, genre2, company, maxbudget = args
			movies = filter_func(movies, genre1, genre2, company, maxbudget)
			
		else:
			#Ask for preferred genre 1
			print("Input first preferred genre")
			genre1 = input()

			#Ask for preferred genre 2
			print("Input second preferred genre")
			genre2 = input()

			#Ask for production company to ignore
			print("Input production company to avoid")
			company = input()

			#Ask for max budget
			print("Input maximum budget to consider")
			maxbudget = int(input())

			#Filter based on the condition
			movies = filter_func(movies, genre1, genre2, company, maxbudget)
		
		# re-calculate m and C now, then re-score
		C = movies['vote_average'].mean()
		m = movies['vote_count'].quantile(.8)
		
		movies['score'] = movies.apply(rater, args=(m,C), axis=1)
	
	else:
		if len(args) == 4:
			genre1, genre2, company, maxbudget = args
			movies = filter_func(movies, genre1, genre2, company, maxbudget)
		
		else:
			#Ask for preferred genre 1
			print("Input first preferred genre")
			genre1 = input()

			#Ask for preferred genre 2
			print("Input second preferred genre")
			genre2 = input()

			#Ask for production company to ignore
			print("Input production company to avoid")
			company = input()

			#Ask for max budget
			print("Input maximum budget to consider")
			maxbudget = int(input())

			#Filter based on the condition
			movies = filter_func(movies, genre1, genre2, company, maxbudget)
		
	return(movies.sort_values(by='score', ascending=False).iloc[:n, :])

<font color="blue">Hint: if you run the cell below, the first movie returned should be Monty Python and the Holy Grail</font>

In [56]:
#not included in quiz/solutions
#Use the following: 'action', 'adventure', 'Beijing New Picture Film Co. Ltd.', 1000000, filtered after
build_chart(movies, filterMovies, weighted_rating, 'after', 5)

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average,score
4579,Monty Python and the Holy Grail,400000,"[Python (Monty) Pictures Limited, Michael Whit...","[adventure, comedy, fantasy]","[holy grail, monk, scotland yard]","King Arthur, accompanied by his squire, recrui...",1708,7.8,7.139996
4339,Dr. No,950000,"[United Artists, Eon Productions]","[adventure, action, thriller]","[london england, england, assassination]","In the film that launched the James Bond saga,...",940,6.9,6.517695
4664,Bronson,230000,"[EM Media, Aramid Entertainment Fund, Vertigo ...","[drama, action, crime]","[prison, isolation]",A young man who was sentenced to 7 years in pr...,733,6.9,6.477462
4670,Mad Max,400000,"[Kennedy Miller Productions, Mad Max Films, Cr...","[adventure, action, thriller, science fiction]","[chain, baby, bridge]","In a dystopian future Australia, a vicious bik...",1213,6.6,6.411634
4798,El Mariachi,220000,[Columbia Pictures],"[action, crime, thriller]","[united statesâmexico barrier, legs, arms]",El Mariachi just wants to play his guitar and ...,238,6.6,6.286867


## **Question 9** - Testing Your Function <font color="magenta">2 points</font>

Feel free to modify your build_chart function to allow the inputs to be passed to the function as we did in the lesson.  It makes for easier testing.

Run your build_chart with the following parameters:

* genre 1 = horror
* genre 2 = mystery
* production company = Paramount Pictures
* max budget = 1500000
* n = 7
* filter *before* scoring

What is the final movie in your chart? 
* The Evil Dead
* Night of the Living Dead
* Saw
* Eraserhead
* **Rebecca**


In [10]:
build_chart(movies, filterMovies, weighted_rating, 'before', 7, 'horror', 'mystery', 'Paramount Pictures', 1500000)

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average,score
4291,Saw,1200000,"[Lions Gate Films, Twisted Pictures, Evolution...","[horror, mystery, crime]","[shotgun, based on short film, sadist]",Obsessed with teaching his victims the value o...,2184,7.2,6.961649
2409,Halloween,300000,"[Compass International Pictures, Falcon Intern...","[horror, thriller]","[female nudity, nudity, mask]","In John Carpenter's horror classic, a psychoti...",1035,7.4,6.917077
4595,The Evil Dead,350000,[Renaissance Pictures],[horror],"[falsely accused, beheading, audio tape]",When a group of college students finds a myste...,894,7.3,6.801206
3737,Night of the Living Dead,114000,"[Laurel Group, Off Color Films, Image Ten]",[horror],"[brother sister relationship, cemetery, gun]",A group of people try to survive an attack of ...,580,7.5,6.762342
4724,Eraserhead,10000,"[American Film Institute (AFI), Libra Films]","[drama, fantasy, horror, science fiction]","[baby, mutant, claustrophobia]",Henry Spencer tries to survive his industrial ...,485,7.5,6.688591
4066,[REC],1500000,[Filmax],"[horror, mystery]","[terror, obsession, camcorder]",A television reporter and cameraman follow eme...,934,7.1,6.681961
4281,Rebecca,1288000,[Selznick International Pictures],"[drama, mystery]","[monte carlo, based on novel, age difference]",A self-conscious bride is tormented by the mem...,336,7.7,6.621567


## **Question 10** - Filter After <font color="magenta">2 points</font>

Now use the same parameters, but perform the filter after you apply the scores. 

What is the final movie in your chart?
* The Evil Dead
* Night of the Living Dead
* **Insidious**
* Eraserhead
* Rebecca

In [11]:
build_chart(movies, filterMovies, weighted_rating, 'after', 7, 'horror', 'mystery', 'Paramount Pictures', 1500000)

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average,score
4291,Saw,1200000,"[Lions Gate Films, Twisted Pictures, Evolution...","[horror, mystery, crime]","[shotgun, based on short film, sadist]",Obsessed with teaching his victims the value o...,2184,7.2,6.848528
2409,Halloween,300000,"[Compass International Pictures, Falcon Intern...","[horror, thriller]","[female nudity, nudity, mask]","In John Carpenter's horror classic, a psychoti...",1035,7.4,6.761775
4595,The Evil Dead,350000,[Renaissance Pictures],[horror],"[falsely accused, beheading, audio tape]",When a group of college students finds a myste...,894,7.3,6.677476
3737,Night of the Living Dead,114000,"[Laurel Group, Off Color Films, Image Ten]",[horror],"[brother sister relationship, cemetery, gun]",A group of people try to survive an attack of ...,580,7.5,6.633665
4066,[REC],1500000,[Filmax],"[horror, mystery]","[terror, obsession, camcorder]",A television reporter and cameraman follow eme...,934,7.1,6.602799
4724,Eraserhead,10000,"[American Film Institute (AFI), Libra Films]","[drama, fantasy, horror, science fiction]","[baby, mutant, claustrophobia]",Henry Spencer tries to survive his industrial ...,485,7.5,6.585787
4224,Insidious,1500000,"[Alliance Films, IM Global, Stage 6 Films]","[horror, thriller]","[medium, evil spirit, house warming]",A family discovers that dark spirits have inva...,1737,6.8,6.561787


# Preparing to Build a Content-Based Recommender
In this section of the homework, you will prepare to build a content-based recommender that can flexibly use either CountVectorizer or TfidfVectorizer. We're including  our lemmatization setup code for you. Run the cell below then proceed to part a.

In [12]:
#not included in quiz/solutions
#################################
# This cell does all the set up work - you only need to run this once per notebook
#################################

#Create a helper function to get part of speech
def get_wordnet_pos(word, pretagged = False):
    """Map POS tag to first character lemmatize() accepts"""
    if pretagged:
        tag = word[1].upper() 
    else:
        tag = nltk.pos_tag([word])[0][1][0].upper()
    
    tag_dict = {"J": wn.ADJ,
                "N": wn.NOUN,
                "V": wn.VERB,
                "R": wn.ADV}

    return tag_dict.get(tag, wn.NOUN)

#create a tokenizer that uses lemmatization (word shortening)
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        
        #get the sentences
        sents = sent_tokenize(articles)
        #get the parts of speech for sentence tokens
        sent_pos = [nltk.pos_tag(word_tokenize(s)) for s in sents]
        #flatten the list
        pos = [item for sublist in sent_pos for item in sublist]
        #lemmatize based on POS (otherwise, all words are nouns)
        lems = [self.wnl.lemmatize(t[0], get_wordnet_pos(t, True)) for t in pos if t[0] not in string.punctuation]
        #clean up in-word punctuation
        lems_clean = [''.join(c for c in s if c not in string.punctuation) for s in lems]
        return lems_clean 


    
#lemmatize the stop words
lemmatizer = WordNetLemmatizer()
lemmatized_stop_words = [lemmatizer.lemmatize(w) for w in text.ENGLISH_STOP_WORDS]
#extend the stop words with any other words you want to add, these are bits of contractions
lemmatized_stop_words.extend(['ve','nt','ca','wo','ll'])

## **Question 11** - Create the fetchSimilarityMatrix Function (Manually Graded) <font color="magenta">5 points</font>
We know that we have two kinds of vectorization we can do, and each requires a slightly different similarity matrix. Let's create a wrapper function that has the following parameters:

* df: the dataframe holding our data
* soupCol: the string name of the column holding our soup (this should already be ready to go - you shouldn't be creating your soup inside this function)
* vectorizer: an initialized vectorizer. This will either be a TfidfVectorizer or a CountVectorizer
* vectorType: a string representing either Tfidf or Count to indicate which type of vectorizer we are using

Inside your function, you'll:
* make sure your soup has no NaN (fill with empty strings)
* fit_transform your soup into a number matrix
* if the vector type is 'Tfidf', use the linear_kernel() function to generate a similarity matrix
* if the vector type is 'Count', use the cosine_similarity() function to generate a similarity matrix
* return the sparse similarity matrix



In [15]:

def fetchSimilarityMatrix(df, soupCol, vectorizer, vectorType='Tfidf'):
	'''
	Parameters
	df: the dataframe containing a soup column to tranform
	soupCol: The string title of the soup column
	vectorizer: an initialized vectorizer, with all pre-processing you desire
	vectorType: 'Tfidf' or 'Count' - representing the type of vectorizer you used.

	Returns
	Sparse Similarity Matrix
	'''
	
	df[soupCol] = df[soupCol].fillna('')
	
	vec_matrix = vectorizer.fit_transform(df[soupCol])
	
	if vectorType == 'Tfidf':
		sim_matrix = linear_kernel(vec_matrix, vec_matrix)

	elif vectorType == 'Count':
		sim_matrix = cosine_similarity(vec_matrix, vec_matrix)
	
	else:
		print('vectorType must be either "Count" or "Tfidf"')
		return
	
	return(sim_matrix)


<font color="blue">Hint: Running the code below should return 0.2</font>

In [22]:
#hint code: not included in quiz/solutions
# Read in some ted talk data
ted = pd.read_csv('data/ted-simplified.csv')

#Define a TF-IDF Vectorizer Object. Use the LemmaTokenizer defined above, convert to lowercase, and remove stopwords, and only use the top 100 features.
tfidf = TfidfVectorizer(tokenizer=LemmaTokenizer(), lowercase=True, stop_words=lemmatized_stop_words, max_features = 100)

sim = fetchSimilarityMatrix(ted, 'description', tfidf, 'Tfidf')
round(sim[1,0], 2)

0.2

## **Question 12** - Test Your fetchSimilarityMatrix Function <font color="magenta">2 points</font>
Using the ted data we read in for you above, initialize a CountVectorizer that uses 'english' stop words, lowercase, and all the features. Call the fetchSimilarityMatrix function, using the column 'topics' for your soup.

What is the value [0,2] position in your matrix (rounded to 2 digits)?


In [72]:
count = CountVectorizer(lowercase=True, stop_words='english')
sim = fetchSimilarityMatrix(ted, 'topics', count, 'Count')
round(sim[0,2], 2)

0.2

## **Question 13** - Preparing the Movies Metadata Soup (Manually Graded) <font color="magenta">5 points</font>

For this problem we'll be using the same data set **tmdb-simplified.csv** to build a meta-data based recommender by creating a "soup" based on: 

- all genres
- all keywords
- all production companies

You will need to sanitize the production companies and the keywords. Review the self-assessment solution for code to sanitize.

Make sure that you concatenate the columns in the order listed (genres, then keywords, then production companies).

Do not reload the data, just use the datframe you created and filtered in Question 3.


In [23]:
def sanitize(x):
	if isinstance(x, list):
		#Strip spaces and convert to lowercase
		return [str.lower(i.replace(" ", "")) for i in x]
	else:
		#Check if director exists. If not, return empty string
		if isinstance(x, str):
			return str.lower(x.replace(" ", ""))
		else:
			return ''

def create_soup(x):
    return ' '.join(x['genres']) + ' ' + ' '.join(x['keywords']) + ' ' + ' '.join(x['production_companies'])

movies['keywords'] = movies['keywords'].apply(sanitize)
movies['production_companies'] = movies['production_companies'].apply(sanitize)

movies['soup'] = movies.apply(create_soup, axis=1)

## **Question 14** What is the soup for Spider-Man 3? <font color="magenta">2 points</font>

There are lots of different ways to extract text from a Pandas dataframe. You can use whatever way you choose, just make sure that you're able to see the complete text. Spider-Man 3 should be the 6th row in your dataframe (so with zero-based indexes, that would be [5]. We recommend that you confirm that you're reviewing the correct row. Once you're sure you're looking at the correct row, select which of the following is the correct soup for Spider-Man 3.

* **fantasy action adventure dualidentity amnesia sandstorm columbiapictures lauraziskinproductions marvelenterprises**
* fantasy action adventure dualidentityamnesiasandstorm columbiapictures lauraziskinproductions marvelenterprises 
* fantasy action adventure dual identity amnesia sandstorm Columbia Pictures Laura Ziskin Productions Marvel Enterprises 
* fantasy action adventure d u a l i d e n t i t y a m n e s i a s a n d s t o r m columbiapictures lauraziskinproductions marvelenterprises 

In [51]:
print(movies[movies['title'] == 'Spider-Man 3'].soup.values)

['fantasy action adventure dualidentity amnesia sandstorm columbiapictures lauraziskinproductions marvelenterprises']


## **Question 15** Create Your Movie Similarity Matrix (Manually Graded) <font color="magenta">2 points</font>
Instantiate a CountVectorizer instance, converting to lowercase and removing 'english' stop words and a maximum of 1000 features. Using this instance and your fetchSimilarityMatrix function, fetch the appropriate similarity matrix for the movie df's "soup" column.



In [52]:
count = CountVectorizer(lowercase=True, stop_words='english', max_features=1000)
sim = fetchSimilarityMatrix(movies, 'soup', count, 'Count')

## **Question 16** Determine Similarity between two movies <font color="magenta">1 points</font>
There are many ways to use the matrix to determine the similarity between any two movies. In the cell below, we determine the similarity between 'Spider-Man 3' and 'The Dark Knight Rises' rounded to 2 digits.

<font color="blue">Hint: it should be 0.11</font>

Based on this sample code, determine the similarity between 'Primer' and 'Avatar', rounded to 2 digits.



In [53]:
#hint code
simdf = pd.DataFrame(sim, index=movies['title'], columns=movies['title'])
round(simdf['Spider-Man 3']['The Dark Knight Rises'], 2)

0.11

In [54]:
round(simdf['Primer']['Avatar'], 2)

0.28

## **Question 17** Generating Recommendations from the MetaData Soup <font color="magenta">2 points</font>
Finally! We have all our pieces and we can run our meta-data based content recommender. Use the pieces that you've done so far and the content_recommender function from the lesson (copied for you below) to determine the top 5 movies related to the "title" (that's your seed column) of "Spider-Man 3" - based on the similarity matrix you've already generated above.

What is the top movie?

* The Amazing Spider Man
* The Moneky King 2
* **Spider-Man 2**
* The Broadway Melody
* Krull


In [63]:
#not included in quiz/solutions
def content_recommender(df, seed, seedCol, sim_matrix,  topN=5): 
	#get the indices based off the seedCol
	indices = pd.Series(df.index, index=df[seedCol]).drop_duplicates()

	# Obtain the index of the item that matches our seed
	idx = indices[seed]

	# Get the pairwsie similarity scores of all items and convert to tuples
	sim_scores = list(enumerate(sim_matrix[idx]))

	#delete the item that was passed in
	del sim_scores[idx]

	# Sort the items based on the similarity scores
	sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

	# Get the scores of the top-n most similar items.
	sim_scores = sim_scores[:topN]

	# Get the item indices
	movie_indices = [i[0] for i in sim_scores]

	# Return the topN most similar items
	return df.iloc[movie_indices]

In [64]:
content_recommender(movies, 'Spider-Man 3', 'title', sim)

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average,soup
30,Spider-Man 2,200000000,"[columbiapictures, lauraziskinproductions, mar...","[action, adventure, fantasy]","[dualidentity, loveofone'slife, pizzaboy]",Peter Parker is going through a major identity...,4321,6.7,action adventure fantasy dualidentity loveofon...
20,The Amazing Spider-Man,215000000,"[columbiapictures, lauraziskinproductions, mar...","[action, adventure, fantasy]","[lossoffather, vigilante, serum]",Peter Parker is an outcast high schooler aband...,6586,6.5,action adventure fantasy lossoffather vigilant...
38,The Amazing Spider-Man 2,200000000,"[columbiapictures, marvelenterprises, aviaradp...","[action, adventure, fantasy]","[obsession, marvelcomic, sequel]","For Peter Parker, life is busy. Between taking...",4179,6.5,action adventure fantasy obsession marvelcomic...
786,The Monkey King 2,68490000,[filmkopictures],"[action, adventure, fantasy]",[monkeyking],Taking place 500 years after the Havoc in Heav...,24,6.0,action adventure fantasy monkeyking filmkopict...
159,Spider-Man,139000000,"[columbiapictures, marvelenterprises]","[fantasy, action]","[lossoflover, spider, thanksgiving]",After being bitten by a genetically altered sp...,5265,6.8,fantasy action lossoflover spider thanksgiving...


## **Question 18** - Using Just the Overview <font color="magenta">2 points</font>
Instead of using the soup, generate a similarity matrix using the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features.  Generate the top 5 recommendations for 'Spider-Man 3' again.

<font color="blue">Hint: You should only need a few lines of code here...</font>

What is the top movie?

* The Amazing Spider Man
* The Monkey King 2
* Spider-Man 2
* **The Broadway Melody**
* Krull


In [65]:
tfidf = TfidfVectorizer(tokenizer=LemmaTokenizer(), lowercase=True, stop_words=lemmatized_stop_words, max_features = 1000)
sim = fetchSimilarityMatrix(movies, 'overview', tfidf, 'Tfidf')
content_recommender(movies, 'Spider-Man 3', 'title', sim)

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average,soup
4594,The Broadway Melody,379000,[metro-goldwyn-mayer(mgm)],"[drama, music, romance]","[musical, singer, pre-code]","Harriet and Queenie Mahoney, a vaudeville act,...",19,5.0,drama music romance musical singer pre-code me...
38,The Amazing Spider-Man 2,200000000,"[columbiapictures, marvelenterprises, aviaradp...","[action, adventure, fantasy]","[obsession, marvelcomic, sequel]","For Peter Parker, life is busy. Between taking...",4179,6.5,action adventure fantasy obsession marvelcomic...
1198,Escape from Planet Earth,40000000,"[rainmakerentertainment, mainframeentertainmen...","[animation, comedy, adventure, family, science...","[spaceship, alien, rescue]",Astronaut Scorch Supernova finds himself caugh...,332,5.7,animation comedy adventure family science fict...
2985,The Color of Money,13800000,"[silverscreenpartners, touchstonepictures]",[drama],"[bar, billard, talent]","Former pool hustler ""Fast Eddie"" Felson decide...",291,6.7,drama bar billard talent silverscreenpartners ...
1155,Spy Kids 3-D: Game Over,38000000,[dimensionfilms],"[action, adventure, comedy, family, science fi...","[videogame, intelligence, liberation]",Carmen's caught in a virtual reality game desi...,511,4.7,action adventure comedy family science fiction...


## **Question 19** - Using N-Grams of the Overview <font color="magenta">2 points</font>
Generate a similarity matrix using just 3 word phrases (n-grams) of the 'overview' column. Since this is freeform text, use the Tfidf vectorizer. Pre-process the text by lemmatizing the words, using the lemmatized_stop_words. Again, limit to 1000 features.  Generate the top 5 recommendations for 'Spider-Man 3' again.

<font color="blue">Hint: You should only need a few lines of code here...</font>

What is the top movie?

* The Amazing Spider Man
* Pirates of the Caribbean: At World's End
* John Carter
* Spider-Man
* **Avatar**

In [66]:
tfidf = TfidfVectorizer(tokenizer=LemmaTokenizer(), lowercase=True, stop_words=lemmatized_stop_words, ngram_range=(3,3), max_features = 1000)
sim = fetchSimilarityMatrix(movies, 'overview', tfidf, 'Tfidf')
content_recommender(movies, 'Spider-Man 3', 'title', sim)

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average,soup
0,Avatar,237000000,"[ingeniousfilmpartners, twentiethcenturyfoxfil...","[action, adventure, fantasy, science fiction]","[cultureclash, future, spacewar]","In the 22nd century, a paraplegic Marine is di...",11800,7.2,action adventure fantasy science fiction cultu...
1,Pirates of the Caribbean: At World's End,300000000,"[waltdisneypictures, jerrybruckheimerfilms, se...","[adventure, fantasy, action]","[ocean, drugabuse, exoticisland]","Captain Barbossa, long believed to be dead, ha...",4500,6.9,adventure fantasy action ocean drugabuse exoti...
2,Spectre,245000000,"[columbiapictures, danjaq, b24]","[action, adventure, crime]","[spy, basedonnovel, secretagent]",A cryptic message from Bonds past sends him on...,4466,6.3,action adventure crime spy basedonnovel secret...
3,The Dark Knight Rises,250000000,"[legendarypictures, warnerbros., dcentertainment]","[action, crime, drama, thriller]","[dccomics, crimefighter, terrorist]",Following the death of District Attorney Harve...,9106,7.6,action crime drama thriller dccomics crimefigh...
4,John Carter,260000000,[waltdisneypictures],"[action, adventure, science fiction]","[basedonnovel, mars, medallion]","John Carter is a war-weary, former military ca...",2124,6.1,action adventure science fiction basedonnovel ...


## **Question 20** Soup + Overview <font color="magenta">2 points</font>

Now add the overview to your soup. Since we do not want the genres and keywords down-weighted for describing multiple movies, use a CountVectorizer with lemmatization and the lemmatized_stop_words. Once again, limit your features to 1000. (We're limiting features here just to speed up processing time.) Again find recommendations for 'Spider-Man 3.'

What is the top movie?

* **Spider-Man**
* The Amazing Spider-Man 2
* Avatar
* Spider-Man
* Krull

In [85]:
movies['soup+overview'] = movies['soup'] + ' ' + movies['overview']
count = CountVectorizer(tokenizer=LemmaTokenizer(), lowercase=True, stop_words=lemmatized_stop_words, max_features=1000)
sim = fetchSimilarityMatrix(movies, 'soup+overview', count, 'Count')
content_recommender(movies, 'Spider-Man 3', 'title', sim)

Unnamed: 0,title,budget,production_companies,genres,keywords,overview,vote_count,vote_average,soup,soup+overview
159,Spider-Man,139000000,"[columbiapictures, marvelenterprises]","[fantasy, action]","[lossoflover, spider, thanksgiving]",After being bitten by a genetically altered sp...,5265,6.8,fantasy action lossoflover spider thanksgiving...,fantasy action lossoflover spider thanksgiving...
38,The Amazing Spider-Man 2,200000000,"[columbiapictures, marvelenterprises, aviaradp...","[action, adventure, fantasy]","[obsession, marvelcomic, sequel]","For Peter Parker, life is busy. Between taking...",4179,6.5,action adventure fantasy obsession marvelcomic...,action adventure fantasy obsession marvelcomic...
1438,Krull,27000000,"[columbiapicturescorporation, barclaysmercanti...","[fantasy, action, adventure]","[kingdom, lightsaber, cultfavorite]",A prince and a fellowship of companions set ou...,129,5.8,fantasy action adventure kingdom lightsaber cu...,fantasy action adventure kingdom lightsaber cu...
1721,30 Minutes or Less,28000000,[columbiapictures],"[action, adventure, comedy]","[pizzadelivery, adventure, pizzaboy]",Two fledgling criminals kidnap a pizza deliver...,531,5.6,action adventure comedy pizzadelivery adventur...,action adventure comedy pizzadelivery adventur...
1198,Escape from Planet Earth,40000000,"[rainmakerentertainment, mainframeentertainmen...","[animation, comedy, adventure, family, science...","[spaceship, alien, rescue]",Astronaut Scorch Supernova finds himself caugh...,332,5.7,animation comedy adventure family science fict...,animation comedy adventure family science fict...
