<h2>Assignment 2 - Content Based Recommender</h2>
<p>David Flanagan INFO-T780<br>May 9th, 2019</p>
<p>In this assignment I created a content-based recommender system using a data set I downloaded from IMDB.  The data set is available for download <a href="https://datasets.imdbws.com/">here</a>.</p>
<p>The data set contains 1000 movies.  It has many features, which I narrowed down to the following:
    <ol>
        <li>Genre</li>
        <li>IMDB Rating</li>
        <li>Year Released</li>
        <li>Runtime</li>
        <li>Number of people who rated the movie</li>
        <li>Revenue in millions of dollars</li>
        <li>Metascore from across multiple ratings sites</li>
    </ol>
</p>

In [1]:
# Imports
import clean_data as cd
import movie_recommender as mr
import numpy as np
import ipywidgets as widgets
EXTRACTED_DATA_PATH = 'IMDB-Movie-Data.csv'
ZIPPED_DATA_PATH = 'data/imdb_data.zip'

<h3>Data Processing</h3>
<p>This cell unzips the data file and processes it.<br>
The genres are turned into a vector of boolean values, where a value is set to 1 (True) if the movie is listed under that genre and 0 (False) otherwise.  All of the other features are processed as ordinal data: they are converted to discrete values based on range.</p>
<p>
The available genres are as follows: Action, Western, Music, History, Thriller, Drama, Horror, Musical, Animation, War, Fantasy, Comedy, Adventure, Mystery, Sci-Fi, Sport, Family, Biography, Crime, and Romance.</p>
<p>The features are stored in the data matrix X, with each row representing a movie.  The shape of the data is 1000x26: 20 genre columns plus the six ordinal features. The corresponding labels are stored in the column vector Y.</p>
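<p>A minimal sketch of the two encodings described above, not the actual <code>clean_data</code> implementation. The shortened genre list, the comma-separated genre string, the bin edges, and the helper names <code>encode_genres</code> and <code>bin_ordinal</code> are all assumptions made for illustration.</p>

```python
import numpy as np

# Hypothetical, shortened genre list; the real data set uses 20 genres.
GENRES = ["Action", "Comedy", "Drama", "Sci-Fi"]

# Assumed bin edges for an IMDB-style rating (illustrative only).
RATING_EDGES = [5.0, 7.0, 8.0, 9.0]


def encode_genres(genre_string, genres=GENRES):
    """Turn a comma-separated genre string into a 0/1 vector."""
    listed = {g.strip() for g in genre_string.split(",")}
    return np.array([1 if g in listed else 0 for g in genres])


def bin_ordinal(value, edges):
    """Map a continuous value to a discrete level using ascending bin edges."""
    return int(np.digitize(value, edges))


genre_vec = encode_genres("Action, Sci-Fi")    # flags for Action and Sci-Fi
rating_level = bin_ordinal(8.3, RATING_EDGES)  # falls in the 8.0-9.0 bin
print(genre_vec, rating_level)
```

<p>Each movie row is then the genre flags followed by one discrete level per ordinal feature.</p>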

 

In [2]:
#Clean the data
cd.unzipData(ZIPPED_DATA_PATH)
X, Y = cd.loadData(EXTRACTED_DATA_PATH)

<h3>Creation of Recommender System</h3>
<p>This cell creates the recommender system and sets up the weight vector.  The weights were set via "knowledge elicitation" from a self-declared expert
in what people want to watch.</p>
<p>All of the genres are given a weight of 0.7, since the expert considered genre an important feature when selecting a movie.  Likewise, the IMDB rating and the metascore were given weights of 0.9 and 0.8 respectively, since these features are considered important.  All other features were given a weight of 0.1 or 0.2, as they are considered less important but should still have an impact.</p>
<p>This section also sets up the local similarity functions.  Two different local similarity functions are used: one to compare the genres and the other to compare the ordinal data.  The genre comparison function returns 1 if and only if the input vector has that genre selected and the movie is of that genre.  In other words, it is essentially a logical AND.  A genre adds to the similarity when both the input vector and the movie list it, but nothing is added when neither of them lists it.</p>
<p>The local similarity function for the ordinal data is given as follows:</p>

$$\text{ordinal similarity} = 1-\frac{\lvert X^{(i)}-x_i\rvert}{\operatorname{range}(X^{(i)})}$$

<p>where $X^{(i)}$ is the column of feature $i$ and $x_i$ is the $i$th feature of the input vector.</p>
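<p>The two local similarity functions described above could look roughly like the following. This is a sketch under assumptions, not the code in <code>movie_recommender</code>: the names <code>genre_similarity</code> and <code>ordinal_similarity</code>, the column-wise signatures, and the handling of a zero range are all illustrative choices.</p>

```python
import numpy as np


def genre_similarity(movie_col, query_value):
    """1 where both the movie and the query have the genre, else 0 (logical AND)."""
    return np.logical_and(movie_col == 1, query_value == 1).astype(float)


def ordinal_similarity(movie_col, query_value):
    """1 - |X_i - x_i| / range(X_i), computed for every movie in the column."""
    col_range = movie_col.max() - movie_col.min()
    if col_range == 0:
        # Assumption: if every movie has the same value, treat all as fully similar.
        return np.ones(len(movie_col))
    return 1.0 - np.abs(movie_col - query_value) / col_range


action_col = np.array([1, 0, 1, 0])  # genre flag for four toy movies
rating_col = np.array([0, 2, 4, 1])  # binned ratings for the same movies

genre_hit = genre_similarity(action_col, 1)      # query asks for the genre
genre_miss = genre_similarity(action_col, 0)     # query does not ask for it
rating_sim = ordinal_similarity(rating_col, 4)   # query asks for level 4
print(genre_hit, genre_miss, rating_sim)
```

<p>With the genre selected, only the two movies flagged with it score 1. With the genre not selected, every movie scores 0, including those that also lack the genre. For the rating column the range is 4, so the four similarities are 0, 0.5, 1 and 0.25.</p>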

In [3]:
# Set up the difference functions, weights and recommender system.
N_GENRES = 20  # the first 20 columns of X are the genre flags

# Genre weights first, then the six ordinal features
# (0.9 is the IMDB rating, 0.8 is the metascore).
weights = np.array([0.7] * N_GENRES + [0.1, 0.2, 0.9, 0.1, 0.2, 0.8])

# Genre columns use the AND-style comparison; ordinal columns use the
# range-scaled absolute difference.
difference_fxs = [mr.genreCompare if i < N_GENRES else mr.absDifference
                  for i in range(X.shape[1])]

recommender = mr.MovieRecommenderSystem(X, Y, weights, difference_fxs)
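
<p>How the weights and local similarity functions might combine into a ranking can be sketched as below. This is an assumption about the general technique (a weighted average of local similarities followed by a top-k selection), not a description of what <code>MovieRecommenderSystem</code> does internally. The function <code>top_k_similar</code>, the helpers <code>both_have</code> and <code>scaled_diff</code>, and the toy data are hypothetical.</p>

```python
import numpy as np


def both_have(col, q):
    """Genre comparison: 1 only when movie and query both have the genre."""
    return ((col == 1) & (q == 1)).astype(float)


def scaled_diff(col, q):
    """Ordinal comparison: 1 - |col - q| / range(col)."""
    return 1.0 - np.abs(col - q) / (col.max() - col.min())


def top_k_similar(X, labels, query, weights, local_fxs, k=10):
    """Rank rows of X by a weighted average of per-feature local similarities."""
    # One column of local similarities per feature: shape (n_movies, n_features).
    local = np.column_stack([fx(X[:, i], query[i])
                             for i, fx in enumerate(local_fxs)])
    scores = local @ weights / weights.sum()
    best = np.argsort(-scores, kind="stable")[:k]
    return [labels[i] for i in best], scores[best]


# Toy data: two genre columns followed by one ordinal column.
X_toy = np.array([[1, 0, 3],
                  [0, 1, 1],
                  [1, 1, 4]])
titles = ["A", "B", "C"]

names, scores = top_k_similar(X_toy, titles,
                              query=np.array([1, 0, 4]),
                              weights=np.array([0.7, 0.7, 0.9]),
                              local_fxs=[both_have, both_have, scaled_diff],
                              k=2)
print(names, scores)
```

<p>Movie C matches the query on both the selected genre and the ordinal level, so it ranks first, followed by A. Dividing by the sum of the weights keeps every score between 0 and 1 as long as each local similarity is.</p>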

<h3>Input Vector</h3>
<p>Use these widgets to build the input vector that the kNN search is run against.  Most of the features use radio buttons.  The genre, however, is a multiple-select box, so that as many genres as desired can be included in the recommendation query.</p>
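<p>A rough sketch of how widget selections could be turned into an input vector. The lookup tables and the function <code>selection_to_vector</code> are hypothetical and much smaller than the real ones; the structure of the dictionaries in <code>clean_data</code> is assumed, not taken from its source.</p>

```python
import numpy as np

# Hypothetical lookup tables; the real ones live in clean_data and their
# exact structure is an assumption here.
GENRE_INDEX = {"Action": 0, "Comedy": 1, "Drama": 2}
RATING_LEVEL = {"Low": 0, "Average": 1, "High": 2}


def selection_to_vector(selected_genres, rating_label):
    """Build [genre flags..., rating level] from widget-style selections."""
    vec = np.zeros(len(GENRE_INDEX) + 1)
    for genre in selected_genres:
        vec[GENRE_INDEX[genre]] = 1
    vec[-1] = RATING_LEVEL[rating_label]
    return vec


# SelectMultiple.value is a tuple of the selected options;
# RadioButtons.value is the single selected option.
query = selection_to_vector(("Action", "Drama"), "High")
print(query)
```

<p>Selecting no genres leaves every genre flag at 0, so only the ordinal features contribute to the query.</p>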

In [4]:
ratingWidget = widgets.RadioButtons(description='IMDB Rating:', options=cd.ratingDict.keys())
genreWidget = widgets.SelectMultiple(description='Genres:', options=cd.genreDict.keys())
lengthWidget = widgets.RadioButtons(description='Movie Length:', options=cd.lengthDict.keys())
releaseDateWidget = widgets.RadioButtons(description="Release Date:", options=cd.releaseDateDict.keys())
ratingCountWidget = widgets.RadioButtons(description='Number of Ratings:', options=cd.ratingCountDict.keys())
revenueWidget = widgets.RadioButtons(description='Revenue:', options=cd.revenueDict.keys())
metascoreWidget = widgets.RadioButtons(description='MetaScore:', options=cd.metaScoreDict.keys())
display(widgets.HBox([genreWidget, releaseDateWidget, lengthWidget]),
        widgets.HBox([ratingWidget, ratingCountWidget, revenueWidget]), metascoreWidget)


HBox(children=(SelectMultiple(description='Genres:', options=('Musical', 'Biography', 'Mystery', 'Action', 'Fa…

HBox(children=(RadioButtons(description='IMDB Rating:', options=('Very High Rating (>9)', 'High Rating (8-9)',…

RadioButtons(description='MetaScore:', options=('Very High (>90)', 'High (70-90)', 'Average (50-70)', 'Low (30…

<h3>Example Output</h3>

In [13]:
input_vec = cd.selectionToInputVector(genreWidget.value, 
                                      releaseDateWidget.value, 
                                      lengthWidget.value, 
                                      ratingWidget.value, 
                                      ratingCountWidget.value, 
                                      revenueWidget.value, 
                                      metascoreWidget.value)

recommendations = recommender.getRecommendations(input_vec)
print(recommendations)

['The Book of Life', 'Home', 'Rio', 'Trolls', 'Brave', 'Sing', 'Storks', 'Cars 2', 'Beowulf', 'Monsters University']
None


<h3>Cleanup</h3>
<p>This section removes the intermediate extracted data file.</p>

In [6]:
#%% Cleanup files
cd.cleanupFile(EXTRACTED_DATA_PATH)