# Film Genre Classification by Script Dialog Analysis

The Cornell Movie-Dialogs Corpus, compiled by Cristian Danescu-Niculescu-Mizil, assistant professor in the Department of Information Science at Cornell University, provides structured data covering the dialog of 617 movies. It also includes metadata about each movie (title, genres, IMDB votes and rating) and about every character appearing in it: name, gender, and the character's importance according to their position in the credits. The dataset is available at http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

By processing the dialogs in the corpus, we can build a language model and look for patterns that let us, for example, predict the details of an unknown movie character from his or her lines in the script. Another possible application is classifying a new movie by genre from its conversations. These are two possibilities for learning from the available corpus. If time allows, we could also augment the existing data structure to infer further insights, e.g. the relationships between characters.

### Import third-party libraries

In [29]:
%matplotlib inline
import re

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

stop_words = set(stopwords.words('english'))


### Transform original txt files into CSV

The dataset is distributed as four text files: `movie_titles_metadata.txt`, `movie_lines.txt`, `movie_conversations.txt` and `movie_characters_metadata.txt`. Fields within each line are separated by the string ` +++$+++ `. To make the data easier to work with, we convert these files into CSV.

In [79]:
def csv_convert(filepath, header):
    """Convert a ' +++$+++ '-separated corpus file into a CSV file next to it."""
    with open(filepath, 'r', errors='ignore') as src, \
         open(filepath.replace(".txt", ".csv"), "w+") as dst:
        dst.write("%s\n" % ",".join(header))

        for line in src:
            fields = line.split(" +++$+++ ")

            # Quote free-text fields so embedded commas do not split columns.
            # When the quoted field is the last one in the row, its trailing
            # newline is stripped and re-added after the closing quote.
            if "LineIDs" in header:
                fields[3] = '"' + fields[3].replace("\n", "") + '"\n'

            if "Line" in header:
                # Double quotes inside a line are replaced by single quotes.
                fields[4] = '"' + fields[4].replace('"', "'").replace("\n", "") + '"\n'

            if "Movie Title" in header:
                fields[3] = '"' + fields[3] + '"'

            if "Title" in header:
                fields[1] = '"' + fields[1] + '"'

            if "Genres" in header:
                fields[5] = '"' + fields[5].replace("\n", "") + '"\n'

            dst.write(",".join(fields))

header = ["ID", "Name", "Movie ID", "Movie Title", "Gender", "Relevance"]
csv_convert("data/train/movie_characters_metadata.txt", header)

header = ["Char1 ID", "Char2 ID", "Movie ID", "LineIDs"]
csv_convert("data/train/movie_conversations.txt", header)

header = ["ID", "Characted ID", "Movie ID", "Character Name", "Line"]
csv_convert("data/train/movie_lines.txt", header)

header = ["ID", "Title", "Release Year", "Rating", "Votes", "Genres"]
csv_convert("data/train/movie_titles_metadata.txt", header)
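
As a possible alternative to quoting fields by hand, the standard-library `csv` module can do the escaping. The sketch below is only an illustration under assumptions: it works on two in-memory sample lines in the `movie_lines.txt` layout (the second line is made up to include a comma and double quotes), and the helper name `rows_from_raw` is hypothetical.

```python
import csv
import io

SEPARATOR = " +++$+++ "

def rows_from_raw(lines):
    """Split each raw corpus line on the field separator, dropping the newline."""
    for line in lines:
        yield line.rstrip("\n").split(SEPARATOR)

# Sample input: the first line comes from the corpus, the second is invented
# to show a field containing both a comma and double quotes.
raw = io.StringIO(
    'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
    'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ He said "no", twice.\n'
)

# csv.writer quotes and escapes fields only where needed.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["ID", "Character ID", "Movie ID", "Character Name", "Line"])
writer.writerows(rows_from_raw(raw))

# Reading the result back recovers the original field values unchanged.
parsed = list(csv.reader(io.StringIO(out.getvalue())))
print(parsed[2][4])
```

With this approach the double quotes in a line survive the round trip instead of being replaced by single quotes.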


### Data Analysis

#### Characters Metadata

In [71]:
characters_meta = pd.read_csv('data/train/movie_characters_metadata.csv', encoding='latin-1')
characters_meta.shape

(8339, 6)

In [72]:
characters_meta.head()

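|   | ID | Name | Movie ID | Movie Title | Gender | Relevance |
|---|----|------|----------|-------------|--------|-----------|
| 0 | u0 | BIANCA | m0 | 10 things i hate about you | f | 4 |
| 1 | u1 | BRUCE | m0 | 10 things i hate about you | ? | ? |
| 2 | u2 | CAMERON | m0 | 10 things i hate about you | m | 3 |
| 3 | u3 | CHASTITY | m0 | 10 things i hate about you | ? | ? |
| 4 | u4 | JOEY | m0 | 10 things i hate about you | m | 6 |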


#### Movie Conversations

In [73]:
movie_conversations = pd.read_csv('data/train/movie_conversations.csv', encoding='latin-1')
movie_conversations.shape

In [74]:
movie_conversations.head()


#### Movie Lines

In [80]:
movie_lines = pd.read_csv('data/train/movie_lines.csv', encoding='latin-1')
movie_lines.shape

(281764, 5)

In [81]:
movie_lines.head()

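|   | ID | Characted ID | Movie ID | Character Name | Line |
|---|----|--------------|----------|----------------|------|
| 0 | L1045 | u0 | m0 | BIANCA | They do not! |
| 1 | L1044 | u2 | m0 | CAMERON | They do to! |
| 2 | L985 | u0 | m0 | BIANCA | I hope so. |
| 3 | L984 | u2 | m0 | CAMERON | She okay? |
| 4 | L925 | u0 | m0 | BIANCA | Let's go. |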


#### Movie Metadata

In [85]:
movie_meta = pd.read_csv('data/train/movie_titles_metadata.csv', encoding='latin-1')
movie_meta.shape

(566, 6)

In [86]:
movie_meta.head()

|   | ID | Title | Release Year | Rating | Votes | Genres |
|---|----|-------|--------------|--------|-------|--------|
| 0 | m0 | 10 things i hate about you | 1999 | 6.9 | 62847 | ['comedy', 'romance'] |
| 1 | m1 | 1492: conquest of paradise | 1992 | 6.2 | 10421 | ['adventure', 'biography', 'drama', 'history'] |
| 2 | m2 | 15 minutes | 2001 | 6.1 | 25854 | ['action', 'crime', 'drama', 'thriller'] |
| 3 | m3 | 2001: a space odyssey | 1968 | 8.4 | 163227 | ['adventure', 'mystery', 'sci-fi'] |
| 4 | m4 | 48 hrs. | 1982 | 6.9 | 22289 | ['action', 'comedy', 'crime', 'drama', 'thrill... |


In [107]:
import ast
from collections import Counter

# Count how many movies carry each genre label.
movie_genres = Counter()
for _, row in movie_meta.iterrows():
    # Genres are stored as a stringified list, e.g. "['comedy', 'romance']";
    # ast.literal_eval parses it without executing arbitrary code.
    movie_genres.update(ast.literal_eval(row["Genres"]))

print(movie_genres)

# Bar chart of genre frequencies, most common first.
genre_names, genre_counts = zip(*movie_genres.most_common())
plt.figure(figsize=(8, 5))
plt.bar(genre_names, genre_counts)
plt.ylabel('# of movies', fontsize=12)
plt.xlabel('Genre', fontsize=12)
plt.xticks(rotation=90)


### Define existing movie genres:

In [4]:
movie_genres = ["comedy", "romance", "adventure", "biography", "drama", "history",
"action", "crime", "thriller", "mystery", "sci-fi", "fantasy", "horror", "music", "western",
"war", "adult", "musical", "animation", "sport", "family", "short", "film-noir", "documentary"]

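Since a movie can belong to several genres at once, one way to use this list is as the fixed column order of a multi-label target. The sketch below is an assumption about how the list might be used, not something the notebook does yet; the names `GENRE_INDEX` and `encode_genres` are hypothetical. scikit-learn's `MultiLabelBinarizer` provides an equivalent encoding.

```python
movie_genres = ["comedy", "romance", "adventure", "biography", "drama", "history",
                "action", "crime", "thriller", "mystery", "sci-fi", "fantasy",
                "horror", "music", "western", "war", "adult", "musical",
                "animation", "sport", "family", "short", "film-noir", "documentary"]

# Map each genre to its column position in the indicator vector.
GENRE_INDEX = {genre: i for i, genre in enumerate(movie_genres)}

def encode_genres(genres):
    """Return a 0/1 vector with a 1 at the position of each genre present."""
    vector = [0] * len(movie_genres)
    for genre in genres:
        vector[GENRE_INDEX[genre]] = 1
    return vector

# "comedy" and "romance" occupy the first two positions of the list.
example = encode_genres(["comedy", "romance"])
print(example[:4])  # [1, 1, 0, 0]
```
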
In [None]:
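# Count genre occurrences. This assumes `movies` maps each movie ID to a
# record with a "genre" list; that dict is not built earlier in the notebook.
genres = {}
for i in movies:
    current_genres = movies[i]["genre"]

    for genre in current_genres:
        if genre in genres:
            genres[genre] += 1
        else:
            genres[genre] = 1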
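
The imports at the top of the notebook (`TfidfVectorizer`, `OneVsRestClassifier`, `LogisticRegression`, `Pipeline`) suggest a multi-label text classifier for the genre prediction task. The following is a minimal sketch of how such a pipeline might look, not the notebook's final model: the documents and labels are toy data made up for illustration, and in practice each document would be the concatenated dialog of one movie.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for per-movie dialog and genre labels (invented examples).
docs = [
    "i love you kiss me darling",
    "shoot the gun run explosion",
    "love kiss wedding darling laugh",
    "gun chase explosion shoot police",
    "funny joke laugh love",
    "police gun crime shoot",
]
labels = [
    ["romance"],
    ["action"],
    ["romance", "comedy"],
    ["action", "crime"],
    ["comedy", "romance"],
    ["action", "crime"],
]

# Turn the label lists into a binary indicator matrix, one column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# TF-IDF features feed one independent binary classifier per genre.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
model.fit(docs, Y)

# Predict genres for unseen dialog and map the 0/1 row back to genre names.
pred = model.predict(["kiss me darling at the wedding"])
print(mlb.inverse_transform(pred))
```

One-vs-rest fits here because each genre becomes its own yes/no decision, so a movie can receive any number of labels. With real data, the classifier inside `OneVsRestClassifier` could be swapped for `LinearSVC` or `MultinomialNB`, both of which are already imported.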