# Movie Genre Predictor
In this Jupyter Notebook, I will build a $k$-nearest neighbors classifier that determines the genre of a movie based on its synopsis. The [Wikipedia Movie Plots data set](https://www.kaggle.com/jrobischon/wikipedia-movie-plots) is used to generate the training set for the classifier, and I will build my own test set from [IMDB](https://imdb.com). The classifier will work by looking at the frequencies of words in the synopses of movies from each genre.

1. [The Question](#question)
2. [Data Preprocessing](#preprocessing)
3. [Grouping Plots & Word Recurrence](#grouping)

In [1]:
import numpy as np
import pandas as pd
import string

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# read table
movies = pd.read_csv('movie_plots.csv')

movies.loc[0:4]

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


<div id="question"></div>
## The Question 
The first part of developing a data-driven project is to decide what question you want to answer. The question needs to be specific, and it needs to be something you can develop a step-by-step approach for. With this notebook, I am going to use the `movies` DataFrame to answer the following question:
> Can we predict the genre of a movie based on its plot?

It will take a few steps to answer this question. The over-arching workflow will look something like this:
1. Data preprocessing
2. Group movies by genre and look for recurring words in plots
3. Write a $k$-nearest neighbor classifier
4. Test the classifier and determine its accuracy

<div id="preprocessing"></div>
## Data Preprocessing
Currently, the `movies` DataFrame contains details of movies (year, title, country of origin, director, cast, genre, Wiki link, and plot). In order to get meaningful results, the data need to be "cleaned"; that is, we need to remove values that we can't work with. The data set, which contains 34,885 entries, has many rows that contain `NaN` values ("not a number"), which `pandas` uses to mark missing data, so we can't work with those rows. The `pandas` library allows you to remove the rows that contain `NaN` values:

In [2]:
movies = movies.dropna().reset_index(drop=True)
movies.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1903,Alice in Wonderland,American,Cecil Hepworth,May Clark,unknown,https://en.wikipedia.org/wiki/Alice_in_Wonderl...,"Alice follows a large white rabbit down a ""Rab..."
1,1907,Daniel Boone,American,Wallace McCutcheon and Ediwin S. Porter,"William Craven, Florence Lawrence",biographical,https://en.wikipedia.org/wiki/Daniel_Boone_(19...,Boone's daughter befriends an Indian maiden as...
2,1907,How Brown Saw the Baseball Game,American,Unknown,Unknown,comedy,https://en.wikipedia.org/wiki/How_Brown_Saw_th...,Before heading out to a baseball game at a nea...
3,1907,Laughing Gas,American,Edwin Stanton Porter,"Bertha Regustus, Edward Boulden",comedy,https://en.wikipedia.org/wiki/Laughing_Gas_(fi...,The plot is that of a black woman going to the...
4,1908,The Adventures of Dollie,American,D. W. Griffith,"Arthur V. Johnson, Linda Arvidson",drama,https://en.wikipedia.org/wiki/The_Adventures_o...,On a beautiful summer day a father and mother ...
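
To see which columns are responsible for the dropped rows, one option is to count the missing values per column before calling `dropna`. The following is a minimal sketch on a toy table; the name `toy` and its values are made up for illustration and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies table (hypothetical values)
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Cast': ['x', np.nan, 'y', np.nan],
    'Genre': ['drama', 'comedy', np.nan, 'horror'],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one NaN
rows_before = len(toy)
rows_after = len(toy.dropna())
rows_lost = rows_before - rows_after

print(missing_per_column)
print(rows_lost)
```

On the real table, the same two calls would show whether the lost rows come mostly from one column (for example `Cast`) or are spread across several.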


In the `dropna` cell above, we also reset the row indices to make slicing the table easier. We now have 33,464 rows, so we lost about 1,400 entries. The next thing to look at is what kinds of values we have in the DataFrame. Since we're focusing on the genre and plot, we can remove columns with irrelevant information:

In [3]:
movies = movies[['Genre', 'Plot']]
movies.head()

Unnamed: 0,Genre,Plot
0,unknown,"Alice follows a large white rabbit down a ""Rab..."
1,biographical,Boone's daughter befriends an Indian maiden as...
2,comedy,Before heading out to a baseball game at a nea...
3,comedy,The plot is that of a black woman going to the...
4,drama,On a beautiful summer day a father and mother ...


#### Data Exploration and Acceptable Genres
Now we need to see what kinds of data we have in the set. We begin by showing what the values in the `Genre` column of the DataFrame are:

In [11]:
movies['Genre'].unique()

array(['unknown', 'biographical', 'comedy', ...,
       'animation, produced by glukoza production',
       'adventure, romance, fantasy film', 'ero'], dtype=object)

It looks like there are a lot of different values in the column (2,210 to be exact). But for this question, we're really only interested in the basic genres of movies: action, adventure, comedy, drama, fantasy, historical, horror, romance, science fiction, and thriller.
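
A quick way to get a feel for a column like this is to count the distinct values and look at the most frequent ones. The sketch below uses a small made-up Series (`toy_genres` is a hypothetical name, not part of the notebook's data) to show the two `pandas` calls involved.

```python
import pandas as pd

# Hypothetical genre labels standing in for movies['Genre']
toy_genres = pd.Series(
    ['drama', 'comedy', 'drama', 'romantic comedy', 'unknown', 'drama']
)

# Number of distinct labels in the column
n_unique = toy_genres.nunique()

# Labels sorted by how often they occur, most frequent first
top = toy_genres.value_counts().head(3)

print(n_unique)
print(top)
```

Applied to `movies['Genre']`, `value_counts()` would show how much of the table the basic genres already cover before any relabeling.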

To this end, we will define a function that reads the `Genre` entry in each row and categorizes it as one of the above. This process has a few steps:
1. Define a function that determines whether any of the above genres appears in the entry
2. Filter `movies` for such rows
3. For each remaining row, if the `Genre` entry is not exactly one of the above, change it to one of the above

In [5]:
# Step 1:
def contains_acceptable_genre(entry):
    acceptable_genres = ['action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller']
    for genre in acceptable_genres:
        if genre in entry['Genre']:
            return True
    return False

# Step 2:
filtered_movies = movies[movies.apply(contains_acceptable_genre, axis=1)].reset_index(drop=True)
filtered_movies.head()

Unnamed: 0,Genre,Plot
0,comedy,Before heading out to a baseball game at a nea...
1,comedy,The plot is that of a black woman going to the...
2,drama,On a beautiful summer day a father and mother ...
3,drama,A thug accosts a girl as she leaves her workpl...
4,comedy,A young couple decides to elope after being ca...
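
The same substring filter can also be expressed without a row-wise `apply`, using `Series.str.contains` with a pattern that joins the genre names. This is only a sketch of an alternative, shown on a made-up table; `toy`, `pattern`, and `mask` are names chosen for illustration.

```python
import re
import pandas as pd

acceptable_genres = ['action', 'adventure', 'comedy', 'drama', 'fantasy',
                     'historical', 'horror', 'romance', 'science fiction',
                     'thriller']

# Hypothetical rows standing in for the movies DataFrame
toy = pd.DataFrame({
    'Genre': ['comedy', 'unknown', 'romantic comedy', 'western',
              'sci-fi thriller'],
    'Plot': ['p1', 'p2', 'p3', 'p4', 'p5'],
})

# One regular expression that matches any acceptable genre as a substring
pattern = '|'.join(re.escape(g) for g in acceptable_genres)

# Boolean mask: True where the Genre entry contains an acceptable genre
mask = toy['Genre'].str.contains(pattern, regex=True)
filtered = toy[mask].reset_index(drop=True)

print(filtered)
```

Like the `in` test in `contains_acceptable_genre`, this matches substrings, so `'romantic comedy'` is kept because it contains `'comedy'`.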


In order to accomplish Step 3, we need to develop a few helper functions. Mainly, we need to comb through each `Genre` entry in `filtered_movies` and do the following:
1. Check if it is an exact match for something in `acceptable_genres`
2. If it is not, then use a heuristic to determine which genre it is a part of
3. Apply this function to each row of the DataFrame

For Step 3.1, we define the function `superfluous_text`, which returns a Boolean value corresponding to whether or not there are extra characters beyond an entry in `acceptable_genres`.

In [6]:
# Step 3.1:
def superfluous_text(entry):
    """
    Returns True if the string entry is not an exact match for any genre in
    acceptable_genres (i.e. it contains extra text), and False otherwise
    """
    acceptable_genres = ['action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller']
    for genre in acceptable_genres:
        if genre == entry:
            return False
    return True

For Step 3.2, we take entries for which `superfluous_text` returns `True` and use a heuristic to determine which is the most accurate genre they fall under. The heuristic that we use is that the last word is likely the most general genre (i.e. the noun without any modifiers). In this way, `'historical drama'` would become `'drama'` or `'action comedy'` would become `'comedy'`.

The function `determine_genre` looks for words starting from the last word and going to the first until it finds an entry in `acceptable_genres`. If no such word exists, then an empty string, `''`, is inserted instead.

In [7]:
# Step 3.2:
def determine_genre(entry):
    """
    Takes a string entry and returns the word closest to the end that is in acceptable_genres
    """
    acceptable_genres = ['action', 'adventure', 'comedy', 'drama', 'fantasy', 'historical', 'horror', 
                        'romance', 'science fiction', 'thriller']

    # Exact matches (including the two-word 'science fiction') are returned unchanged
    if not superfluous_text(entry):
        return entry

    # Replace every character that is not a lowercase letter with a space, so that
    # separators such as ',', '/' and '-' split words instead of fusing them
    cleaned = ''.join(c if c in string.ascii_lowercase else ' ' for c in entry)
    words = cleaned.split()

    # Search from the last word to the first for an acceptable genre
    for word in reversed(words):
        if word in acceptable_genres:
            return word

    # No word in the entry is an acceptable genre
    return ''
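
One limitation of a word-by-word search is that the two-word genre `'science fiction'` can never equal a single word, so an entry such as `'science fiction film'` would not be recognized. A possible variant, sketched below under that assumption, also tries the two-word phrase ending at each position. The function `last_genre_phrase` is hypothetical and is not used elsewhere in this notebook.

```python
import re

ACCEPTABLE_GENRES = ['action', 'adventure', 'comedy', 'drama', 'fantasy',
                     'historical', 'horror', 'romance', 'science fiction',
                     'thriller']

def last_genre_phrase(entry):
    """
    Hypothetical variant of the last-word heuristic: returns the acceptable
    genre (one or two words long) that ends closest to the end of entry,
    or '' if there is none
    """
    # Runs of lowercase letters; everything else acts as a separator
    words = re.findall(r'[a-z]+', entry)

    # Move the end position from the last word toward the first
    for end in range(len(words), 0, -1):
        # Try the two-word phrase ending here before the single word
        for size in (2, 1):
            if end - size < 0:
                continue
            phrase = ' '.join(words[end - size:end])
            if phrase in ACCEPTABLE_GENRES:
                return phrase
    return ''

examples = ['historical drama', 'action comedy', 'science fiction film',
            'comedy/drama', 'western']
results = {e: last_genre_phrase(e) for e in examples}
print(results)
```

Whether this extra case is worth handling depends on how many entries in the data set actually contain `'science fiction'` followed by other words.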

Finally, for Step 3.3, the function `change_genres` takes a DataFrame as its argument and goes through the `Genre` column to replace the original `Genre` with the result of `determine_genre`. Then a new DataFrame is generated by copying the one passed as the argument and dropping the original `Genre` column and replacing it with a new `Genre` column. Then it filters out rows where the empty string `''` was inserted instead of a genre.

In [8]:
# Step 3.3:
def change_genres(df):
    new_genres = np.array([])
    for entry in df['Genre']:
        new_genre = determine_genre(entry)
        new_genres = np.append(new_genres, new_genre)
    
    new_df = df.drop(columns=['Genre']).assign(Genre=new_genres)
    new_df = new_df[new_df['Genre'] != ''].reset_index(drop=True)
    
    return new_df

filtered_genres = change_genres(filtered_movies)
filtered_genres.head()

Unnamed: 0,Plot,Genre
0,Before heading out to a baseball game at a nea...,comedy
1,The plot is that of a black woman going to the...,comedy
2,On a beautiful summer day a father and mother ...,drama
3,A thug accosts a girl as she leaves her workpl...,drama
4,A young couple decides to elope after being ca...,comedy
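
Growing a NumPy array with `np.append` inside a loop copies the array on every iteration. An alternative that avoids this is `Series.map`, sketched below on a made-up table. Here `toy_determine_genre` is a simplified, hypothetical stand-in for `determine_genre`, and `change_genres_mapped` is a name chosen for illustration.

```python
import pandas as pd

def toy_determine_genre(entry):
    # Hypothetical stand-in: keep the last word if it is a known genre, else ''
    known = {'comedy', 'drama'}
    last = entry.split(' ')[-1]
    return last if last in known else ''

def change_genres_mapped(df, genre_func):
    # Apply genre_func to every Genre entry and overwrite the column
    new_df = df.assign(Genre=df['Genre'].map(genre_func))
    # Drop the rows for which no genre could be determined
    return new_df[new_df['Genre'] != ''].reset_index(drop=True)

toy = pd.DataFrame({
    'Genre': ['comedy', 'historical drama', 'western'],
    'Plot': ['p1', 'p2', 'p3'],
})
result = change_genres_mapped(toy, toy_determine_genre)
print(result)
```

Because `assign` overwrites the existing `Genre` column in place, this version also keeps the original column order.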


#### Cleaning Plot Strings
Now that we have sorted the genres of each movie in the DataFrame, we format the `Plot` column so that it can be analyzed more easily. In the cell below, the function `clean_string` is defined, which removes all characters that are not letters or spaces from a string and makes all letters lowercase. Then the function `clean_plots` is defined, which takes a DataFrame as its parameter, cleans each `Plot` entry, and returns a new DataFrame with the cleaned plot strings.

In [9]:
def clean_string(s):
    for c in s:
        if c not in string.ascii_letters + ' ':
            s = s.replace(c, '')
        elif c in string.ascii_uppercase:
            i = string.ascii_uppercase.index(c)
            s = s.replace(c, string.ascii_lowercase[i])
    return s

def clean_plots(df):
    plots = list(df['Plot'])
    
    cleaned_plots = []
    for plot in plots:
        cleaned_plot = clean_string(plot)
        cleaned_plots += [cleaned_plot]
        
    new_df = df.drop(['Plot'], axis=1).assign(Plot=cleaned_plots)
    return new_df

movies_with_cleaned_plots = clean_plots(filtered_genres)
movies_with_cleaned_plots.head()

Unnamed: 0,Genre,Plot
0,comedy,before heading out to a baseball game at a nea...
1,comedy,the plot is that of a black woman going to the...
2,drama,on a beautiful summer day a father and mother ...
3,drama,a thug accosts a girl as she leaves her workpl...
4,comedy,a young couple decides to elope after being ca...
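
The same cleaning rule (keep only ASCII letters and spaces, then lowercase) can be written more compactly with a regular expression. The sketch below is an alternative, not the function used above; `clean_string_regex` and `toy` are hypothetical names.

```python
import re
import pandas as pd

def clean_string_regex(s):
    # Remove everything except ASCII letters and spaces, then lowercase
    return re.sub(r'[^A-Za-z ]', '', s).lower()

example = clean_string_regex("A Thug accosts a girl; she's saved!")
print(example)

# The same rule applied to a whole column with pandas string methods
toy = pd.DataFrame({'Plot': ["On a Summer day, 1908.", "Boone's daughter"]})
cleaned = toy['Plot'].str.replace(r'[^A-Za-z ]', '', regex=True).str.lower()
print(cleaned.tolist())
```

Note that removing punctuation outright fuses some words together, for example `"she's"` becomes `"shes"`.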


<div id="grouping"></div>
## Grouping Plots & Word Recurrence

In [13]:
plot_lists = movies_with_cleaned_plots.groupby('Genre').agg(list)
plot_lists.head()

Genre,Plot
action,[in world war i american pilots mal andrews ch...
adventure,[a white girl florence lawrence rejects a prop...
comedy,[before heading out to a baseball game at a ne...
drama,[on a beautiful summer day a father and mother...
fantasy,[the daughter of king neptune takes on human f...


In [41]:
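# Map each genre to a dictionary of word counts over all of its plots
words_by_genre = {}
for genre in plot_lists.index.values:
    # List of all cleaned plot strings for this genre
    plots = plot_lists.loc[genre, 'Plot']
    
    word_counts = {}
    for plot in plots:
        # split() with no argument skips runs of spaces, so '' is never counted as a word
        for word in plot.split():
            word_counts[word] = word_counts.get(word, 0) + 1
    words_by_genre[genre] = word_counts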
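
The standard library's `collections.Counter` offers a shorter way to build the same kind of per-genre word counts, and its `most_common` method makes it easy to look at the recurring words. The sketch below uses made-up plots; `toy_plots` and `toy_words_by_genre` are hypothetical names.

```python
from collections import Counter

# Hypothetical cleaned plots grouped by genre
toy_plots = {
    'comedy': ['a man slips on a banana', 'a dog steals a pie'],
    'drama': ['a man loses his job'],
}

toy_words_by_genre = {}
for genre, plots in toy_plots.items():
    counts = Counter()
    for plot in plots:
        # update() adds one count per word in the list
        counts.update(plot.split())
    toy_words_by_genre[genre] = counts

# Most frequent word in the comedy plots, with its count
top_comedy = toy_words_by_genre['comedy'].most_common(1)
print(top_comedy)
```

As the toy output suggests, very common words such as `'a'` dominate raw counts in every genre, which is something to keep in mind when comparing genres by word frequency.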