In [1]:
# execute to import notebook styling for tables and width etc.
from IPython.core.display import HTML
import urllib.request
response = urllib.request.urlopen('https://raw.githubusercontent.com/DataScienceUWL/DS775v2/master/ds755.css')
HTML(response.read().decode("utf-8"));

<font size=18>Lesson 10 Homework</font>

# **HW10.1** - Build a Simple Recommender for Low-Budget Films

You will be using the data set **tmdb_5000_movies.csv** to build three simple recommender systems, computing your score three different ways. The point to this problem is to demonstrate that when you apply filters and when you calculate m and C matters, though there is no one right way to do it. 

For each piece of this problem, you'll start with all of the movies.

For each piece of this problem, you will apply the following filters, but at different times. In your prep work, we recommend that you create a function that takes in a dataframe, applies the following filters, and returns a dataframe. This is not required, but it will make your code easier.
- movies with runtime of 200 minutes or less
- movies with a budget between (and exclusive of) \\$0 million and \\$3 million
- movies by any production company except Universal Pictures or Warner Bros.

Note that most movies are by more than one production company.

This data set can be found in the data folder in the same folder as this notebook. 

**You will need to use the option encoding = "ISO-8859-1" in the read_csv function in order to open this file.** Use the examples given in the lesson and Banik's book as a guide. (Do not explode. Use the lesson approach.)

For each of the resulting dataframes, output the top ten movies in order, based on the weighted rating. Display only the movie title, vote_count, vote_average, and weighted rating.



## Prep Work

* Do your imports.
* Create your weighted_rating function.
* Optionally create an apply_filters function.
* Read in the tmdb_5000_movies.csv. The only columns you will need are title, runtime, budget, production_companies, vote_count, and vote_average. (<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" target="_new">Pandas has a handy parameter called "usecols" that you can use to read in just the needed columns.</a>)
* Prep the production_companies column for filtering (similar to how we handed genres in the lesson - reminder: do not explode).
* Print the total number of movies in the dataframe.
* <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html" target="_new">Make 3 copies of this dataframe</a> - movies1, movies2, movies3.

<font color = "blue"> *** 4 points -  answer in cell below *** (don't delete this cell) </font>


In [2]:
# Use this cell to prep
# packages
import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# returns IMDB's weighted rating
def weighted_rating(x, m, C):
    v = x['vote_count']
    R = x['vote_average']
    # Compute the weighted score
    return (v/(v+m) * R) + (m/(m+v) * C)

# applies filters
def apply_filters(df):
    df = df[(df['runtime'] <= 200) & (df['budget'] >= 0) & (df['budget'] <= 3000000)]
    df = df[df['production_companies'].apply(lambda x: "Universal Pictures" not in x)]
    df = df[df['production_companies'].apply(lambda x: "Warner Bros." not in x)]
    return df

# import data
df = pd.read_csv('data/tmdb_5000_movies.csv', usecols=['title', 'runtime', 'budget', 'production_companies', 'vote_count', 'vote_average'])

# format column
df['production_companies'] = df['production_companies'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

# Save three copies of the dataframe
movies1, movies2, movies3 = df, df, df

# Print the number of movies in the dataframe
print(f"There are {df.shape[0]} movies in the dataframe.\n")


There are 4803 movies in the dataframe.



## **HW10.1a** - Score, Then Filter
In this approach, you will score the entire movies1 dataset, then you will filter the results.

### Step-by-Step
Using movies1...

- Compute m as the movies that have a vote_count >= the 80th percentile. 
- Compute C as the mean of the vote average of all movies.
- Calculate the weighted_rating score for each movie, adding a weighted_rating column to the dataframe.
* Apply the following **filters**:
    - movies with runtime of 200 minutes or less
    - movies with a budget between (and exclusive of) \\$0 million and \\$3 million
    - movies by any production company except Universal Pictures or Warner Bros.

- Print m and C.
- Print the total number of movies left at the end of this problem.
- Display your top 10 results.

<font color = "blue"> *** 3 points -  answer in cell below *** (don't delete this cell) </font>

In [3]:
#fetch m from the whole dataset
m = movies1['vote_count'].quantile(.8)

#fetch C from the whole dataset
C = movies1['vote_average'].mean()

#compute weighted score
movies1['score1'] = movies1.apply(weighted_rating, args=(m,C), axis=1)

#apply filters
movies1 = apply_filters(movies1)

print(f"m = {m}\nC = {C}\n")
print(f"There are {movies1.shape[0]} movies in the dataframe after applying the criteria.\n")

# sort and display
movies1 = movies1.sort_values('score1', ascending=False)
movies1 = movies1[['title','vote_count','vote_average','score1']]
print(movies1.head(10))

m = 957.6000000000004
C = 6.092171559442011

There are 1449 movies in the dataframe after applying the criteria.

                                                  title  vote_count  \
4300                                     Reservoir Dogs        3697   
4602                                       12 Angry Men        2078   
4302                     The Good, the Bad and the Ugly        2311   
4337                                        Taxi Driver        2535   
3940                                             Oldboy        1945   
2862                                         About Time        2067   
4173  Dr. Strangelove or: How I Learned to Stop Worr...        1442   
4579                    Monty Python and the Holy Grail        1708   
4238                                       Modern Times         856   
4333                                              Rocky        1791   

      vote_average    score1  
4300           8.0  7.607499  
4602           8.2  7.535072  
4302       

## **HW10.1b** - Filter, Then Score
In this approach, we'll first filter, and then score the movies.

### Step-By-Step
Starting with movies2...

- Apply the following **filters**:
    - movies with runtime of 200 minutes or less
    - movies with a budget between (and exclusive of) \\$0 million and \\$3 million
    - movies by any production company except Universal Pictures or Warner Bros.
- Compute m as the movies that have a vote_count >= the 80th percentile, out of the remaining movies. 
- Compute C as the mean of the vote_average of the remaining movies.

- Calculate the weighted rating score.
- Print m and C.
- Print the total number of movies left.
- Display your top 10 results.

<font color = "blue"> *** 3 points -  answer in cell below *** (don't delete this cell) </font>

In [4]:
#apply filters
movies2 = apply_filters(movies2)

#fetch m 
m = movies2['vote_count'].quantile(.8)

#fetch C 
C = movies2['vote_average'].mean()

#compute weighted score
movies2['score2'] = movies2.apply(weighted_rating, args=(m,C), axis=1)

print(f"m = {m}\nC = {C}\n")

print(f"There are {movies2.shape[0]} movies in the dataframe after applying the criteria.\n")

# sort and display
movies2 = movies2.sort_values('score2', ascending=False)
movies2 = movies2[['title','vote_count','vote_average','score2']]
print(movies2.head(10))


m = 167.4000000000001
C = 5.7636991028295395

There are 1449 movies in the dataframe after applying the criteria.

                                                  title  vote_count  \
4602                                       12 Angry Men        2078   
4302                     The Good, the Bad and the Ugly        2311   
4300                                     Reservoir Dogs        3697   
4337                                        Taxi Driver        2535   
3940                                             Oldboy        1945   
4173  Dr. Strangelove or: How I Learned to Stop Worr...        1442   
4238                                       Modern Times         856   
2862                                         About Time        2067   
4579                    Monty Python and the Holy Grail        1708   
3984                                   Some Like It Hot         808   

      vote_average    score2  
4602           8.2  8.018368  
4302           8.1  7.942198  
4300      

## **HW10.1c** - Popular Movies Only
In this approach, you're going to first filter to the top 20% of all movies, then filter to our critera, then compute m and C.

### Step-by-Step
Starting with movies3...

* Filter movies3 to the top 20% of vote_count.
* Apply the following **filters**:
    - movies with runtime of 200 minutes or less
    - movies with a budget between (and exclusive of) \\$0 million and \\$3 million
    - movies by any production company except Universal Pictures or Warner Bros.
* Compute m as the movies that have a vote_count >= the 80th percentile, out of the remaining movies.
* Compute C as the mean of the vote_average out of the remaining movies.
* Calculate the weighted rating score.
* Print m and C.
* Print the total number of remaining movies.
* Display your top 10 results.

<font color = "blue"> *** 3 points -  answer in cell below *** (don't delete this cell) </font>


In [5]:
m = movies3['vote_count'].quantile(.8)
movies3 = movies3[movies3['vote_count'] >= m]

#apply filters
movies3 = apply_filters(movies3)

m = movies3['vote_count'].quantile(.8)
C = movies3['vote_average'].mean()

#compute weighted score
movies3['score3'] = movies3.apply(weighted_rating, args=(m,C), axis=1)

print(f"m = {m}\nC = {C}\n")

print(f"There are {movies3.shape[0]} movies in the dataframe after applying the criteria.\n")

# sort and display
movies3 = movies3.sort_values('score3', ascending=False)
movies3 = movies3[['title','vote_count','vote_average','score3']]
print(movies3.head(10))


m = 1914.0
C = 7.169444444444447

There are 36 movies in the dataframe after applying the criteria.

                                                  title  vote_count  \
4300                                     Reservoir Dogs        3697   
4602                                       12 Angry Men        2078   
4302                     The Good, the Bad and the Ugly        2311   
4337                                        Taxi Driver        2535   
3940                                             Oldboy        1945   
4173  Dr. Strangelove or: How I Learned to Stop Worr...        1442   
2862                                         About Time        2067   
4579                    Monty Python and the Holy Grail        1708   
4081                                The Lives of Others         958   
4017                                     Before Sunrise         959   

      vote_average    score3  
4300           8.0  7.716684  
4602           8.2  7.705891  
4302           8.1  7.67

## **HW10.1d** - Which version of the score would you use? Why?

Talk briefly about which approach you would take if you were recommending low-budget movies. 

<font color = "blue"> *** 2 point -  answer in cell below *** (don't delete this cell) </font>

<font color = "green">
I feel like the second approach makes the most sense for this use case. The first approach scores the movies against the entire population and then filters down to the target.  The third approach filters to the movies with the most votes first and then scores them. I assume that low budget films will tend to have fewer viewers on the whole. The second approach narrows the population down to the appropriate criteria before applying the ranking, and seems to be appropriate for this case.
</font>

# **HW10.2** - Build a Knowledge-Based Recommender

Use the data set **tmdb_5000_movies.csv** to build a knowledge-based recommender system that solicits the following information listed below and then ranks the movies according to the IMDB weighted rating formula. Use all available movies to begin with (*i.e.* don't restrict it to just the top 20%, for example) Print the top 5 highest rated movies for this recommendation. 

Ask the user to enter answers to the following questions:

- Enter a preferred genre. (Print a list of genres for the user to choose from before asking for their inputs.)
- Enter another preferred genre.
- Enter a minimum runtime (in minutes).
- Enter a maximum runtime (in minutes).
    
Create the recommender to select movies with either genre entered. Be inclusive of the minimum (>=) and maximum (<=) runtimes.

Calculate m (80th percentile) and C (mean of the vote_aveage) from your filtered dataset.

Test your recommender by calling the function and having it give recommendations for genres "crime" and "drama" between of 50 and 120 minutes long and display only the movie title and the requested characteristics as well as the vote count, vote average, and IMDB weighted rating with your recommendations.

Use the examples given in the lesson as a guide. (Do not explode.)

Provide your code and a demonstration of the recommender below.  We may run your code as well.

<font color = "blue"> *** 15 points -  answer in cell below *** (don't delete this cell) </font>

In [6]:
def build_chart(gen_df, percentile=0.8):
    
    #Ask for preferred genres
    print("Input preferred genre")
    genre = input()
    
    #Ask for preferred genres
    print("Input another preferred genre")
    genre2 = input()
    
    #Ask for upper limit of duration
    print("Input longest duration")
    high_time = int(input())
    
    #Ask for lower limit of duration
    print("Input lowest duration")
    low_time = int(input())
    
    #Define a new movies variable to store the preferred movies. Copy the contents of gen_df to movies
    movies = gen_df.copy()
    
    #Filter based on the condition
    movies = movies[(movies['genres'].apply(lambda x: genre in x) | movies['genres'].apply(lambda x: genre2 in x)) & #updated filtering based on a list.
                    (movies['runtime'] >= low_time) & 
                    (movies['runtime'] <= high_time)]
                        
    #Compute the values of C and m for the movies
    C = movies['vote_average'].mean()
    m = movies['vote_count'].quantile(percentile)
                    
    #Calculate score using the IMDB formula
    movies['score'] = movies.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) 
                                       + (m/(m+x['vote_count']) * C)
                                       ,axis=1)
                    
    # return only requested columns
    movies = movies[['title', 'genres', 'runtime', 'vote_count', 'vote_average', 'score']]
                    
    #Sort movies in descending order of their scores
    movies = movies.sort_values('score', ascending=False)
                    
    return movies
                    

df = pd.read_csv('data/tmdb_5000_movies.csv', usecols=['title', 'runtime', 'genres', 'vote_count', 'vote_average'])
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

out_movies = build_chart(df, .8)
print(out_movies.head(5))

Input preferred genre


 Science Fiction


Input another preferred genre


 Fantasy


Input longest duration


 120


Input lowest duration


 50


                                      title  \
2285                     Back to the Future   
3158                                  Alien   
2152  Eternal Sunshine of the Spotless Mind   
1725                           Blade Runner   
1987                   Howl's Moving Castle   

                                            genres  runtime  vote_count  \
2285  [Adventure, Comedy, Science Fiction, Family]    116.0        6079   
3158   [Horror, Action, Thriller, Science Fiction]    117.0        4470   
2152             [Science Fiction, Drama, Romance]    108.0        3652   
1725            [Science Fiction, Drama, Thriller]    117.0        3509   
1987               [Fantasy, Animation, Adventure]    119.0        1991   

      vote_average     score  
2285           8.0  7.570388  
3158           7.9  7.381918  
2152           7.9  7.300754  
1725           7.9  7.283880  
1987           8.2  7.188955  


# **HW10.3** - Build a Content-Based Recommender

Use the data set **tmdb_5000_movies.csv** to build a meta-data based recommender by creating a "soup" based on the 

- all genres
- top three keywords (Hint: Review page 55 of the book.)
- top three production companies (Hint: Review page 55 of the book.)
- overview

Generate a similarity matrix using cosine similarity with the **CountVectorizer**. 

Be sure to delete the passed-in movie from the row of similarity scores before sorting the score tuples. (See the lesson, not the book.)

**HW10.3a** - Put your code in the cell below, then answer the follow up questions.  Use the examples given in the lesson and Banik's book as a guide.

<font color = "blue"> *** 10 points -  answer in cell below *** (don't delete this cell) </font>

In [7]:
df = pd.read_csv('data/tmdb_5000_movies.csv')
df['overview'] = df['overview'].fillna('')
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df['keywords'] = df['keywords'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df['production_companies'] = df['production_companies'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

def content_recommender(df, title, cosine_sim, indices, topN=2): 
    # Obtain the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwsie similarity scores of all movies with that movie and convert to tuples
    sim_scores = list(enumerate(cosine_sim[idx]))
    #delete the movie that was passed in
    del sim_scores[idx]
    
    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the scores of the top-n most similar movies.
    sim_scores = sim_scores[:topN]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    df = df[['title', 'genres']]
    
    # Return the top 10 most similar movies
    return df.iloc[movie_indices]

# Returns the list top 3 elements or entire list; whichever is more.
def generate_list(x):
    if isinstance(x, list):
        names = [ele for ele in x]
        
        if len(names) > 3:
            names = names[:3]
        return names
    
    return []

df['keywords'] = df['keywords'].apply(generate_list)
df['production_companies'] = df['production_companies'].apply(generate_list)

#Function that creates a soup out of the desired metadata
def create_soup(x):
    return ' '.join(x['genres']) + " " + ' '.join(x['keywords']) + " " + ' '.join(x['production_companies']) + " " + x['overview']

df['soup'] = df.apply(create_soup, axis=1)

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])

#create the reverse mapping
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

#Compute the cosine similarity score 
cosine_sim = cosine_similarity(count_matrix, count_matrix)



**HW10.3b** -  After you have constructed it, print the "soup" for the first entry (the 0 entry).

<font color = "blue"> *** 4 points -  answer in cell below *** (don't delete this cell) </font>

In [8]:
print(df['soup'][0])

Action Adventure Fantasy Science Fiction culture clash future space war Ingenious Film Partners Twentieth Century Fox Film Corporation Dune Entertainment In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.


**HW103.** List the top 10 recommended movies that go with the movie Halloween. Display only the movie titles and genres of your recommendations.

<font color = "blue"> *** 6 points -  answer in cell below *** (don't delete this cell) </font>

In [9]:
content_recommender(df, 'Halloween', cosine_sim, indices, 10)

Unnamed: 0,title,genres
3746,The Boy Next Door,[Thriller]
4415,Snow White: A Deadly Summer,"[Horror, Thriller]"
4134,Lesbian Vampire Killers,"[Horror, Comedy]"
4190,May,"[Drama, Horror, Thriller, Romance]"
3175,Black Christmas,"[Horror, Thriller]"
3050,Body Double,"[Crime, Mystery, Horror, Thriller]"
2551,Halloween II,[Horror]
3920,Phantasm II,"[Action, Horror, Science Fiction, Thriller]"
3383,Losin' It,[Comedy]
3586,"As Above, So Below","[Horror, Thriller]"
