# Movie Genre Classification Using NLP

## by Andrew Alarcon

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import statistics as stats
import imdb

### Reading in the Data

In [43]:
data = pd.read_csv('data/movies.csv')

In [44]:
data.head()

Unnamed: 0,1,Oscar et la dame rose (2009),drama,"Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue."
0,2,Cupid (1997),thriller,A brother and sister with a past incestuous r...
1,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
2,4,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
3,5,The Unrecovered (2007),drama,The film's title refers not only to the un-re...
4,6,Quality Control (2011),documentary,Quality Control consists of a series of 16mm ...


### Data Cleaning

Looking at the first 5 rows, we see that the first record of the file is being treated as the column names. Let's fix this by moving it back into the data as the first row and then naming the columns correctly.

In [45]:
#data.columns currently holds the first record, so stack it on top of
#the remaining rows to make it the first row of a new dataframe
data = pd.DataFrame(np.vstack([data.columns, data]))
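
An alternative worth considering is to tell pandas at read time that the file has no header row. The sketch below assumes a four-column file laid out like the output above; the two sample rows have made-up descriptions, and the column name `Id` is our own label, not something from the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same four-column layout as the output above.
sample_csv = io.StringIO(
    '1,Oscar et la dame rose (2009),drama,A boy learns he is ill.\n'
    '2,Cupid (1997),thriller,A brother and sister reunite.\n'
)

# header=None stops pandas from promoting the first record to column names;
# names= labels the columns in the same step ('Id' is a hypothetical label).
sample = pd.read_csv(
    sample_csv,
    header=None,
    names=['Id', 'Title', 'Genre', 'Description'],
)
print(sample.shape)
```

Read this way, the first record stays in the data and no stacking or renaming step is needed afterwards.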

In [46]:
data.head()

Unnamed: 0,0,1,2,3
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous r...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-re...


Let's appropriately name the columns.

In [47]:
data.rename(columns = {1:'Title', 2:'Genre', 3: 'Description'}, inplace = True)


In [48]:
data.head()

Unnamed: 0,0,Title,Genre,Description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous r...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-re...


Now we will create a Year column based on the years in parentheses in the Title column. Let's drop the index column first.

In [49]:
#columns= already targets the column axis, so axis=1 is not needed
data = data.drop(columns=[0])
data.head()

Unnamed: 0,Title,Genre,Description
0,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...
1,Cupid (1997),thriller,A brother and sister with a past incestuous r...
2,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
3,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
4,The Unrecovered (2007),drama,The film's title refers not only to the un-re...


In [50]:
#Using regex to correctly gather the year information in Title
#source: https://stackoverflow.com/questions/13807207/regex-find-a-number-between-parentheses

#myre = re.compile(".*\(([0-9]+)\).*")
#myre = re.compile("/^(19|20)[\d]{2,2}$/")
#raw string so the backslashes reach the regex engine unchanged
myre = re.compile(r"\([^\d]*(\d+)[^\d]*\)")
data = data.assign(Year=data['Title'].str.extract(myre))
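
To see what the pattern does, here is a small sketch on three titles taken from this notebook's output. It uses the same pattern as above; `expand=False` is our choice so that the result is a Series, not a one-column DataFrame.

```python
import pandas as pd

titles = pd.Series([
    'Oscar et la dame rose (2009)',
    'Cupid (1997)',
    'The Sandman (????/I)',
])

# Same pattern as above, written as a raw string.
year_pattern = r'\([^\d]*(\d+)[^\d]*\)'

# expand=False returns a Series when the pattern has a single capture group.
years = titles.str.extract(year_pattern, expand=False)
print(years.tolist())
```

Titles whose parentheses contain no digits, such as `(????/I)`, produce a missing value. The extracted years are strings, so a numeric conversion (for example `pd.to_numeric` with `errors='coerce'`) would be needed before doing arithmetic on them.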


In [51]:
data.head()

Unnamed: 0,Title,Genre,Description,Year
0,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...,2009
1,Cupid (1997),thriller,A brother and sister with a past incestuous r...,1997
2,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...,1980
3,The Secret Sin (1915),drama,To help their unemployed father make ends mee...,1915
4,The Unrecovered (2007),drama,The film's title refers not only to the un-re...,2007


In [135]:
data.shape

(54214, 4)

### Checking the share of NaN values in each column

In [52]:
data.isna().sum() / len(data)

Title          0.000000
Genre          0.000000
Description    0.000000
Year           0.049674
dtype: float64
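
As a side note, the same share can be computed with `isna().mean()`, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: two of the four rows have no year.
toy = pd.DataFrame({
    'Title': ['A (2001)', 'B (????)', 'C (1999)', 'D (????)'],
    'Year': ['2001', np.nan, '1999', np.nan],
})

share_by_sum = toy.isna().sum() / len(toy)   # the form used above
share_by_mean = toy.isna().mean()            # equivalent shortcut
print(share_by_mean)
```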

In [53]:
data['Year'].isna().sum()

2693

The Year column has 2,693 missing values, which is about 5% of the dataset. Let's try to find the years for these movies via the IMDB API.

In [None]:
data_null = data[data['Year'].isnull()].reset_index(drop=True)

In [86]:
data_null.isna().sum()

Title             0
Genre             0
Description       0
Year           2693
dtype: int64

In [50]:
moviesDB = imdb.IMDb()

#### Here we are finding the movies that are still in development

In [78]:
# Searching for a movie title
#moviesDB = imdb.IMDb()

for i in range(len(data_null)):
#for i in range(0, 3):
    movie_title = data_null['Title'][i]
    imdb_movies = moviesDB.search_movie(movie_title)

    print(f'Searching for {movie_title}:\n')

    development = '(in development)'

    #searching through IMDb's results for the current movie
    for movie in imdb_movies:
        title = movie['title']
        #if the title contains '(in development)', mark the row and stop searching
        if development in title:
            print(f'{title} is still in development')
            #.loc writes to data_null directly; chained indexing may not
            data_null.loc[i, 'Year'] = 'In Development'
            break
        #year = movie['year']
        #print(f'{title}')
        #print(type(title))

Searching for  The Sandman (????/I) :

The Sandman  (in development) is still in development
Searching for  Stealing Stradivarius (????) :

Stealing Stradivarius (in development) is still in development
Searching for  The Wish Kin (????) :

The Wish Kin (in development) is still in development
Searching for  Killing Grace (????) :

Killing Grace (in development) is still in development
Searching for  Crooked Tree (????) :

Crooked Teeth (in development) (TV Mini Series) is still in development
Searching for  Two Women (????) :

Searching for  Operation Bannana Split (????) :

Operation Bannana Split (in development) is still in development
Searching for  The Cellar Door 2: Preymates (????) :

The Cellar Door 2: Preymates (in development) is still in development
Searching for  The Gang (????/II) :

The Gang  (in development) is still in development
Searching for  Ember: The Sapphire Empire (????) :

Ember: The Sapphire Empire (in development) is still in development
Searching for  The M
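
The marking logic of the loop above can be tried without network access by swapping the IMDb lookup for a stub. In this sketch `fake_search` and its tiny catalogue are hypothetical stand-ins for `moviesDB.search_movie`, and plain dicts stand in for the search results. The printed output above suggests the titles carry leading and trailing spaces, so the sketch strips them before searching; that is an assumption about the data, not something we verified.

```python
import numpy as np
import pandas as pd

def fake_search(title):
    """Hypothetical stand-in for moviesDB.search_movie; returns dict-like hits."""
    catalogue = {
        'Killing Grace (????)': [{'title': 'Killing Grace (in development)'}],
        'Two Women (????)': [{'title': 'Two Women'}],
    }
    return catalogue.get(title, [])

toy_null = pd.DataFrame({
    'Title': [' Killing Grace (????) ', ' Two Women (????) '],
    # object dtype so a string label can be stored next to missing values
    'Year': pd.Series([np.nan, np.nan], dtype=object),
})

for i, raw_title in toy_null['Title'].items():
    for hit in fake_search(raw_title.strip()):
        if '(in development)' in hit['title']:
            # .loc assigns into toy_null itself, not into a temporary copy
            toy_null.loc[i, 'Year'] = 'In Development'
            break

print(toy_null)
```

Only the first row is marked; the second stays missing because none of its hits carry the `(in development)` tag.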

In [82]:
data_null.Year.isna().sum()

1108

In [52]:
data_null.Year.isna().sum()/len(data_null)

0.41143705904196065

The number of null values dropped by more than half, from 2,693 to 1,108, so about 41% of the originally missing years remain. That leaves around 2% of the full dataset without a year.
We now have two options:

1) Since many movies share the same name but were released in different years, matching them reliably is difficult, so we could simply ignore these values in the data visualization and analysis part of this notebook. The year won't impact our machine learning model later on, so it is safe to keep these rows because they still contain valuable information in the Description and Genre columns.

2) We could experiment further with the IMDB API to try to find the years for these movies. However, there is an issue: some movies in the IMDB database have no year entry at all, and an error is thrown when trying to look up the year value.
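
If we went with the second option, the missing-year error could be handled with a small helper. This is only a sketch: it assumes the search results behave like dictionaries and that the error raised for an absent year is a `KeyError`. Plain dicts stand in for IMDb results here, and `year_or_none` is a hypothetical helper name.

```python
def year_or_none(movie):
    """Return the 'year' entry of a dict-like search hit, or None if absent."""
    try:
        return movie['year']
    except KeyError:
        # assumption: a hit without a year raises KeyError on lookup
        return None

# Plain dicts standing in for search results; the second has no year.
hits = [
    {'title': 'Cupid', 'year': 1997},
    {'title': 'Two Women'},
]
years = [year_or_none(hit) for hit in hits]
print(years)
```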

In [3]:
data_null = pd.read_csv('../PartialClean.csv')

In [6]:
#Dropping all the nan rows.
data_null = data_null.dropna()

In [7]:
data_null.isna().sum()

Unnamed: 0     0
Title          0
Genre          0
Description    0
Year           0
dtype: int64
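
One way to reconcile this with the first option is to drop rows only where Year is missing, and only for the year-based plots, while keeping the full table for modeling. A minimal sketch on a toy frame (titles and genres come from the rows above, descriptions are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Title': ['Cupid (1997)', 'Two Women (????)'],
    'Genre': ['thriller', 'drama'],
    'Description': ['made-up text one', 'made-up text two'],
    'Year': ['1997', np.nan],
})

# subset= restricts the check to Year, so gaps in other columns would not
# remove rows; toy itself is left untouched for later modeling.
with_year = toy.dropna(subset=['Year'])
print(len(toy), len(with_year))
```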

## Data Visualization

Check for multicollinearity. 

Plot years and genres.
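
Before plotting years against genres, it may help to tabulate the counts that such a plot would show. The sketch below is one possible approach, not a settled plan: it groups years into decades (our own choice) and counts genres per decade with `pd.crosstab`, using four title/genre pairs from the rows shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    'Genre': ['drama', 'thriller', 'drama', 'documentary'],
    'Year': ['2009', '1997', '1915', '2011'],
})

# Years were extracted as strings, so convert them before doing arithmetic.
year_num = pd.to_numeric(toy['Year'], errors='coerce')

# Floor each year to its decade, e.g. 1997 -> 1990.
toy['Decade'] = year_num // 10 * 10

# Rows are decades, columns are genres, cells are movie counts.
counts = pd.crosstab(toy['Decade'], toy['Genre'])
print(counts)
```

A table like this could be passed to a heatmap or a stacked bar chart once the real data is in place.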