Many interesting analyses have been done with data that is available online. A few of them are listed below to help you come up with ideas for your final project.

1. [University of Washington News: New tool quantifies power imbalance between female and male characters in Hollywood movie scripts](https://www.washington.edu/news/2017/11/13/new-tool-quantifies-power-imbalance-between-female-and-male-characters-in-hollywood-movie-scripts/)
2. [Cornell University Paper: Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs](https://www.cs.cornell.edu/~cristian/Chameleons_in_imagined_conversations.html), and the corresponding [slide presentation](https://www.cs.cornell.edu/~cristian/Chameleons_in_imagined_conversations_files/chameleons_talk.pdf)
3. [Nature News: How movies mirror our mimicry](https://www.nature.com/news/2011/110624/full/news.2011.378.html)

In [2]:
import pandas as pd
import numpy as np
import pickle

In [45]:
df_dialog = pd.read_table(
    './movie-dialog-corpus/movie_lines.txt',
    delimiter='\t',
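    # skip malformed rows silently (pandas >= 1.3; replaces the removed
    # error_bad_lines=False / warn_bad_lines=False arguments)
    on_bad_lines='skip',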
    header=None,
    names=[
        'lineID',
        'characterID',
        'movieID',
        'character_name',
        'dialog'
    ]
)
# filter out invalid data
df_dialog = df_dialog[(df_dialog.character_name.notnull()) & (df_dialog.dialog.notnull())]
#print(df_dialog.head(5))
df_dialog_movie1 = df_dialog[df_dialog['characterID'] == 'u8599']
df_dialog_movie1

Unnamed: 0,lineID,characterID,movieID,character_name,dialog
289498,L624053,u8599,m583,KLAA,


In [5]:
# Let's check the overall statistics of the data
df_dialog.describe()

Unnamed: 0,lineID,characterID,movieID,character_name,dialog
count,304352,304352,304352,304352,304352
unique,304352,9031,617,5356,265579
top,L490131,u4525,m289,JACK,What?
freq,1,537,1530,3032,1680


There are a total of `304352` lines of dialog spoken by `9031` unique characters across `617` different movies. The character name `JACK` appears most often (`3032` lines), but without further analysis it is hard to say whether that is because a `JACK` in a few movies speaks a lot, or because `JACK` is simply a popular name shared across many different movies.
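
One way to approach the `JACK` question is to count how many lines the name gets in each movie. The sketch below runs on a small made-up frame that only borrows the column names of `df_dialog`; the helper `name_spread` and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_dialog (made-up rows, same column names as above).
toy_dialog = pd.DataFrame({
    'lineID': ['L1', 'L2', 'L3', 'L4', 'L5', 'L6'],
    'characterID': ['u1', 'u1', 'u2', 'u3', 'u3', 'u4'],
    'movieID': ['m1', 'm1', 'm1', 'm2', 'm2', 'm3'],
    'character_name': ['JACK', 'JACK', 'ROSE', 'JACK', 'JACK', 'MAN'],
    'dialog': ['What?', 'Hold on.', 'What?', 'No.', 'What?', 'Hey.'],
})

def name_spread(df, name):
    """Hypothetical helper: lines per movie spoken by characters called `name`."""
    lines = df[df['character_name'] == name]
    return lines.groupby('movieID').size().sort_values(ascending=False)

jack_lines_per_movie = name_spread(toy_dialog, 'JACK')
# Number of distinct movies that have a character with this name.
n_movies_with_jack = jack_lines_per_movie.size
```

On the real data, `name_spread(df_dialog, 'JACK')` should show whether the `3032` lines are concentrated in a handful of movies or spread thinly across many.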

`What?` is the most frequent line of dialog, appearing `1680` times. If your own reaction to that was "what?", that reaction probably gives some intuition for why: it is a short, generic response that fits almost any scene.

In [53]:
df_character = pd.read_table(
    './movie-dialog-corpus/movie_characters_metadata.txt',
    delimiter='\t',
    header=None,
    names=[
        'characterID',
        'character_name',
        'movieID',
        'movie_title',
        'gender',
        'position_in_credits'
    ]
)
k = df_character[df_character['characterID'] == 'u9032'].iloc[0,2]
#print(k)
df_movie = df_dialog[df_dialog['movieID'] == 'm589']
df_character_script = df_movie[df_movie['characterID'] == 'u0']
df_movie

Unnamed: 0,lineID,characterID,movieID,character_name,dialog
292291,L634094,u8682,m589,BLIND MAN,You know I won't be seeing you.
292292,L634093,u8683,m589,BOBBY,You <u>are</u> crazy you know. Be seeing you o...
292293,L634092,u8682,m589,BLIND MAN,Things ain't always the way they seem. You got...
292294,L634091,u8683,m589,BOBBY,Time's up. Any last words of wisdom?
292295,L634087,u8683,m589,BOBBY,You got a lotta philosophy in you old timer bu...
292296,L634086,u8682,m589,BLIND MAN,Well don't say I didn't warn you when things g...
292297,L634085,u8683,m589,BOBBY,No and I don't plan on sticking around either.
292298,L634084,u8682,m589,BLIND MAN,Nothing makes the Great Spirit laugh harder th...
292299,L634083,u8683,m589,BOBBY,Not this twig friend. I got plans.
292300,L634082,u8682,m589,BLIND MAN,More or less.


In [14]:
# Let's first see the overall stats
df_character.describe()

Unnamed: 0,characterID,character_name,movieID,movie_title,gender,position_in_credits
count,9035,9033,9035,9035,9035,9035
unique,9035,5354,617,617,3,58
top,u7306,MAN,m289,casino,?,?
freq,1,44,44,44,6020,6339


In [13]:
# Let's see what the gender stats look like
df_character.gender.value_counts()

?    6020
M    2049
F     966
Name: gender, dtype: int64

Unfortunately, gender is unknown for nearly $2/3$ of the characters (`6020` out of `9035`). The remaining $1/3$ still covers about `3000` characters (`2049` male and `966` female), which should be enough for analyses that take the perspective of gender.
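
The fractions quoted above can be checked with a few lines of pandas. This is a minimal sketch that re-enters the counts from the `value_counts()` output by hand; the variable names are illustrative assumptions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
gender_counts = pd.Series({'?': 6020, 'M': 2049, 'F': 966}, name='gender')

# Share of each label among all characters ('?' means unknown).
gender_share = gender_counts / gender_counts.sum()

# Share of F vs. M among characters with a known gender only.
known = gender_counts.drop('?')
known_share = known / known.sum()
```

On the real frame, `df_character.gender.value_counts(normalize=True)` gives the first set of shares directly. Note that among the characters with a known gender, male characters outnumber female characters by roughly two to one, which is worth keeping in mind for any comparison between the two groups.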

In [55]:
with open('./movie-dialog-corpus/movie_title.pkl', 'rb') as f:
    df_title = pickle.load(f)

df_title.drop(['genres_x', 'movie_title'], axis='columns', inplace=True)
df_title.rename({'genres_y' : 'genres', 'title' : 'movie_title'}, axis='columns', inplace=True)

# add release month; release_date is a string such as '1989-07-28'
def parse_month(x):
    if isinstance(x, str):
        return int(x[5:7])
    return np.nan  # missing release_date
df_title['release_month'] = df_title.release_date.apply(parse_month)
#sdf = df_title.sort_values('budget')
#sdf.tail()
df_title[df_title['movieID'] == 'm589']

Unnamed: 0,movieID,movie_year,IMDB_rating,IMDB_votes,budget,genres,imdb_id,original_language,original_title,overview,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,movie_title,production_companies_name,release_month
589,m589,1997,6.7,25388,0.0,"[Action, Comedy, Thriller, Crime, Family]",tt0098536,en,Turner & Hooch,Scott Turner has 3 days left in the local poli...,...,[US],1989-07-28,71079915.0,100.0,[en],Released,The Oddest Couple Ever Unleashed!,Turner & Hooch,"[Silver Screen Partners III, Touchstone Pictures]",7.0


In [25]:
# Overall stats
df_title.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 617 entries, 0 to 616
Data columns (total 21 columns):
movieID                      617 non-null object
movie_year                   617 non-null int64
IMDB_rating                  617 non-null float64
IMDB_votes                   617 non-null int64
budget                       616 non-null float64
genres                       616 non-null object
imdb_id                      616 non-null object
original_language            616 non-null object
original_title               616 non-null object
overview                     616 non-null object
popularity                   616 non-null float64
production_countries         616 non-null object
release_date                 616 non-null object
revenue                      616 non-null float64
runtime                      614 non-null float64
spoken_languages             616 non-null object
status                       616 non-null object
tagline                      616 non-null object
movie_titl

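There are `617` movies in total. Of the columns listed above, four (`movieID`, `movie_year`, `IMDB_rating`, `IMDB_votes`) are complete, most of the others are missing a single value, and `runtime` is missing three, so nearly complete data is available for `616` movies. There are many different features describing a movie that you can analyze.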
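
To see exactly which columns and rows have gaps, the nulls can be counted per column. The sketch below uses a tiny made-up frame with a few of the column names from `df_title`; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_title (made-up values, a subset of its column names).
toy_title = pd.DataFrame({
    'movieID': ['m0', 'm1', 'm2'],
    'budget': [1.0e6, np.nan, 0.0],
    'runtime': [100.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy_title.isnull().sum()

# Rows that have at least one missing value.
rows_with_gaps = toy_title[toy_title.isnull().any(axis=1)]
```

The same two expressions should work on `df_title` itself. Keep in mind that `isnull()` only catches real nulls: a `budget` of `0.0`, as in the `m589` row shown earlier, is not counted as missing.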