# Predicting Age and Gender from Movie Dialogue

### Max Harlynking and Arnav Luthra

### Domain

The project domain consists of movie scripts and the characters/actors in them, each labeled with a gender and an age. Since the main goal of the project is to predict gender and age, the task (T) is predicting the gender and age of a character based on their dialogue. The experience (E) is a set of movie scripts along with the genders and ages of each character. The performance measure (P) is the accuracy of gender and age prediction after learning a rule from our dataset.

### Datasets

We based a lot of our project on an existing project done on movie dialogue analysis posted here: http://polygraph.cool/films/

The code discussed below is hosted here, forked from the original analysis: https://github.com/harlynkingm/scripts

Although the creator had posted some of his code online, a majority of the code had been omitted due to copyright restrictions on sharing scripts. What was included was JavaScript code (GetMovieObjects.js) that, when run with Node.js, scraped movie scripts from a variety of webpages. Also included was a character list csv that listed the characters from each movie along with the age and gender of most of these characters.

GetMovieObjects.js works by going through a list of movie objects and scraping the script for each object. Our first step was to get the subset of movie objects that were represented in the character list csv. We used a Python script (shown below) to get the script IDs of the first 300 movies in the character list csv, then converted this list of IDs into a list of movie objects, which we used to scrape the movie scripts.

The following code sample shows how we retrieved the script IDs for the films we used:

In [14]:
import csv

ids = set()
with open("./scripts/character_list5.csv") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        ids.add(row["script_id"])
        if len(ids) >= 10:  # we used 300 for our project
            break  # stop reading once enough distinct script ids are collected
print(list(ids))


['642', '625', '623', '280', '650', '633', '630', '640', '647', '648']


Once the script IDs were retrieved, the original download code had to be modified to navigate to a script's URL, then download that script in a given format using the cheerio library for Node.js.

Once the scripts were downloaded we ran into our first major roadblock: parsing them. We wanted to isolate each character and their respective lines of dialogue, but this proved to be incredibly difficult because there was no consistency in the script formats. We tried to look for patterns in the line breaks and character names, but all we could find was that character names were generally in all caps. Unfortunately, so were other aspects of the script. After a lot of frustration and failed attempts we realized that in the original HTML files for the movie scripts, character names were always bolded (meaning that they were surrounded by a &#60;b&#62; tag). With that realization we changed our scraping code to download the scripts with the original HTML tags intact and then used Python string parsing to isolate the characters and dialogue.

The following sample is a piece of our parsing script run on an example film (Fight Club).

In [13]:
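import re

# Read the script, which still contains its original HTML tags
with open('./scripts/test-script.txt', 'r') as f:
    s = f.read()

s = s.replace('</b>', '')           # closing tags are not needed for splitting
s = re.sub(r'&\w+;', '', s)         # strip HTML entities
s = re.sub(r'\((\w|\s)+\)', '', s)  # strip parentheticals made of words and spaces
blocks = s.split('<b>')             # every bolded span starts a new block

for block in blocks[1:]:
    # Skip blocks that are entirely upper case (they carry no dialogue)
    if not block.strip('\n').strip(' ').isupper():
        lines = block.split('\n')
        # The first line is the character name; the next line must be non-empty
        if lines[0].isupper() and len(lines) > 1 and lines[1] != '':
            character = lines[0].strip()
            dialogue = ''
            for speech in lines[1:]:
                if speech != '' and dialogue != '':
                    dialogue += ' ' + speech.strip().replace(',', '')
                elif speech == '':
                    break  # a blank line ends this character's speech
                else:
                    dialogue += speech.strip().replace(',', '')
            print(character, dialogue)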

JACK (V.O.) People were always asking me did I know Tyler Durden.
TYLER One minute.  This is the beginning.  We're at ground zero.  Maybe you should say a few words to mark the occasion.
JACK ... i... ann....iinn.. ff....nnyin...
JACK (V.O.) With a gun barrel between your teeth you only speak in vowels.
JACK I can't think of anything.
JACK (V.O.) With my tongue I can feel the rifling in the barrel.  For a second I totally forgot about Tyler's whole controlled demolition thing and I wondered how clean this gun is.
TYLER It's getting exciting now.
JACK (V.O.) That old saying how you always hurt the one you love well it works both way.
JACK (V.O.) We have front row seats for this Theater of Mass Destruction.  The Demolitions Committee of Project Mayhem wrapped the foundation columns of ten buildings with blasting gelatin.  In two minutes primary charges will blow base charges and those buildings will be reduced to smoldering rubble.  I know this because Tyler knows this.
TYLER Look what w

From there we cross-referenced the extracted characters with the characters in the character list csv and metadata csv. This code has not been included in this notebook but can be seen in the scan_for_lines2.py file. We then generated a new csv file containing each line of dialogue, the character associated with that dialogue, and the character's age and gender. We used this complete csv for our analysis.
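The cross-referencing itself lives in scan_for_lines2.py and is not reproduced here. As a rough sketch of the idea only, the join could look something like the code below. It assumes that parsed names are normalized by lowercasing and dropping parentheticals such as (V.O.), and that the character list can be keyed by script ID and character name. The function names and the sample rows are hypothetical; the field names are borrowed from the header of the completed csv.

```python
import re

def normalize_name(raw_name):
    """Lowercase a parsed character name and drop parentheticals like (V.O.)."""
    without_parens = re.sub(r'\([^)]*\)', '', raw_name)
    return ' '.join(without_parens.lower().split())

def build_lookup(character_rows):
    """Index character rows by (script_id, normalized name)."""
    return {(row['script_id'], normalize_name(row['name'])): row
            for row in character_rows}

def attach_metadata(script_id, parsed_lines, lookup):
    """Yield (name, age, gender, dialogue) for lines whose speaker is known."""
    for raw_name, dialogue in parsed_lines:
        name = normalize_name(raw_name)
        match = lookup.get((script_id, name))
        if match is not None:  # unmatched speakers are skipped in this sketch
            yield name, match['age'], match['gender'], dialogue

# Hypothetical rows shaped like the character list csv
characters = [{'script_id': '1001', 'name': 'joe moore', 'age': '71', 'gender': 'm'}]
lookup = build_lookup(characters)

# Hypothetical parser output: (character, dialogue) pairs for one script
parsed = [('JOE MOORE (V.O.)', 'Gold.'), ('UNKNOWN MAN', 'Hello.')]
rows = list(attach_metadata('1001', parsed, lookup))
print(rows)
```

Skipping unmatched speakers is a design choice of this sketch; the actual script may handle them differently.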

The following is a sample from this completed csv:

In [1]:
import csv
count = 0
with open("./scripts/allDialogue.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if count < 10: 
            print(row)
            count += 1
        else:
            break

['script_id', 'title', 'year', 'gross', 'name', 'age', 'gender', 'dialogue']
['1001', 'Heist', '2001', '', 'bella', 'NULL', 'm', '...you tell me.']
['1001', 'Heist', '2001', '', 'joe moore', '71', 'm', 'Gold.']
['1001', 'Heist', '2001', '', 'bella', 'NULL', 'm', 'Some people say Love.']
['1001', 'Heist', '2001', '', 'bella', 'NULL', 'm', '...easy to get the gold hard to get it home.']
['1001', 'Heist', '2001', '', 'joe moore', '71', 'm', '...waal so it takes a little bit of thought...']
['1001', 'Heist', '2001', '', 'florid man', 'NULL', 'm', '...Five Cappuccino...']
['1001', 'Heist', '2001', '', 'florid man', 'NULL', 'm', '...you have a nice day.']
['1001', 'Heist', '2001', '', 'bella', 'NULL', 'm', 'Showtime Circus Time.']
['1001', 'Heist', '2001', '', 'bella', 'NULL', 'm', '...I see five.']


### Methods

We took our dialogue csv and fed it into LightSide to analyze the data. Due to the unwieldy size of our dataset, we were limited in the methods we could use. We used basic linear correlation to determine which words were more correlated with one gender than the other. We then used Naive Bayes to create a model for predicting gender and evaluated it using 10-fold cross-validation. For age prediction we grouped ages into 18-year buckets so that the classification would be more accurate, rather than treating every year as its own class.
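The 18-year bucketing can be illustrated with a small helper. This is a sketch under assumptions, not the exact preprocessing we ran: it treats each bucket as including its lower bound and excluding its upper bound, groups everything from 72 upward into one bucket, and returns `None` for missing ages (stored as `NULL` in the csv). The function name and parameters are hypothetical.

```python
def age_bucket(age, width=18, top=72):
    """Map an age string from the csv to a bucket label, or None if unknown."""
    if age in ('', 'NULL'):
        return None
    years = int(age)
    if years >= top:
        return f'{top}+'
    low = (years // width) * width  # lower bound of the bucket containing years
    return f'{low}-{low + width}'

for sample in ['7', '18', '71', '80', 'NULL']:
    print(sample, age_bucket(sample))
```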

The Naive Bayes models trained on this dataset (with over 160,000 rows) took over half an hour to complete using cross validation. We gave up on attempting to classify data using Decision Trees or Support Vector Machines after an hour and a half of processing with no results returned. We believe a more powerful computer would be necessary for this analysis.
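For readers without LightSide, the general technique can be sketched in plain Python. The block below is a minimal multinomial Naive Bayes classifier with add-one smoothing, plus a simple k-fold cross-validation loop. It is an illustration of the method, not LightSide's implementation: the whitespace tokenizer, the fold assignment (every k-th example), and the toy data are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_nb(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(examples, k=10):
    """Mean accuracy over k folds; fold i holds out every k-th example.

    Assumes len(examples) >= k so that no fold is empty.
    """
    accuracies = []
    for i in range(k):
        test = examples[i::k]
        train = [ex for j, ex in enumerate(examples) if j % k != i]
        model = train_nb(train)
        correct = sum(predict_nb(model, text) == label for text, label in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Hypothetical toy data, not taken from the dataset
toy = [('i love you honey', 'f'), ('please love me', 'f'),
       ('fight the war', 'm'), ('the war is hell', 'm')]
model = train_nb(toy)
print(predict_nb(model, 'love honey'), cross_validate(toy, k=2))
```

On the real data, each fold retrains the model on roughly 90% of the 160,000+ rows, which is part of why cross-validation is slow.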

### Analysis and Results

#### Gender

The first thing that stood out to us as we analyzed the data was that men had significantly more dialogue than women. Men had 121,262 lines in total, while women had only 42,151. After running feature extraction on the data, we found that the words with the highest correlation to men included stop words, masculine words, words about fighting, religious words, and all of the curse words. The words with the highest correlation to women included words about the self (me, I), relationship words (husband, honey), loving words, and the word 'please.' We thought it was interesting to note the potential sexism in these words and their correlation with typically male/female concepts in the dataset.
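One plausible reading of "correlation" here is the Pearson correlation between two binary indicators: whether a line contains a given word, and whether the line belongs to a given gender. The sketch below computes that value in plain Python. It is an assumption about the metric, not a description of LightSide's internals, and the function name and toy lines are hypothetical.

```python
import math

def word_label_correlation(lines, word, positive_label):
    """Pearson correlation between 'line contains word' and 'line has positive_label'.

    lines: list of (text, label). Returns 0.0 if either indicator is constant.
    """
    xs = [1 if word in text.lower().split() else 0 for text, _ in lines]
    ys = [1 if label == positive_label else 0 for _, label in lines]
    n = len(lines)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy lines, not taken from the dataset
toy_lines = [('please come home', 'f'), ('i said please', 'f'),
             ('hold the line', 'm'), ('fire at will', 'm')]
print(word_label_correlation(toy_lines, 'please', 'f'))
print(word_label_correlation(toy_lines, 'the', 'f'))
```

A positive value means the word appears more often in lines with the chosen label, and a negative value means it appears more often in the other lines.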

We then used Naive Bayes to fit a predictive model. This model had 74% accuracy; however, we feel this may have been skewed by the difference in volume between male and female dialogue. In fact, the model was most accurate when predicting male dialogue. When we provided 10 famous speeches as test data (5 given by men and 5 by women), the model predicted that all the speeches were male, getting only 50% correct. We were disappointed in these results but understood how they could arise from a dataset so heavily skewed towards men.
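A useful reference point for the accuracy figure is the majority-class baseline: the accuracy of a model that always predicts "male." The short calculation below uses the line counts quoted above and assumes the model was evaluated on roughly that same mix of lines.

```python
male_lines = 121262
female_lines = 42151

# Accuracy of always guessing the majority class
baseline = male_lines / (male_lines + female_lines)
print(f'{baseline:.1%}')  # prints 74.2%
```

Under that assumption the baseline is about the same as the model's reported accuracy, which is consistent with the concern that the class imbalance is driving the result.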

#### Age

When looking at the number of lines belonging to each 18-year age group, we found that a majority of the lines were spoken by characters of ages 18-36 and characters of ages 36-54. Once again, we used correlation to find the most prominent features and found that among the 0-18 year old characters, family titles like <b>daddy, dad, mom, mommy, auntie,</b> and <b>uncle</b> were the most highly correlated. In the 18-36 bracket we found that slang terms like <b>dude, yo, cool,</b> and <b>shit</b> were highly correlated, along with <b>huh, really,</b> and an assortment of first names. The 36-54 and 54-72 age brackets didn't have many explanatory features. It was interesting to note that, for some reason, the 72+ bracket had a lot of Star Trek terms (such as <b>vulcan, starfleet, warp, enterprise,</b> and <b>klingon</b>). These correlations were significantly weaker than those found when analyzing gender, with the maximum correlation equalling about 0.05.

Unfortunately, Naive Bayes did not work well for predicting age. The fitted model had only 42% accuracy and was most confused by stop words. The model predicted the 18-36 group best, but this was likely due to the sheer volume of dialogue in that group. It also tended to confuse the 36-54 and 54-72 groups with each other.

### Conclusions

What stood out the most as we were analyzing our data was how gender stereotypes were very much present in the movie scripts we analyzed. While men were correlated with speech about violence and general hatefulness, women were often correlated with love and relationship words. Even though women may discuss war or violence in a film, there are more cases of women talking about love and relationships, which is a stereotype that separates the genders in an interesting way.

It was also hard to ignore the massive difference in volume between male and female dialogue, and we wondered how much this disparity affected our results. In an ideal dataset split evenly between male and female dialogue, we wondered whether the correlations we noticed would appear as often. As stated previously, the model labeled every speech in our small test set as male, although half of the speeches were given by women. It is possible that the model was 'playing it safe' by choosing the result that would most likely occur given the data, or that the speeches happened to all use a majority of masculine-leaning words. It would be interesting to supply a large database of speech data as a test set to check this.

Age tended to be much harder to predict, as there only seemed to be significant differences across large gaps in age. While the model distinguished very young from very old characters accurately, the ages in the middle showed little to no difference (particularly in the 36-72 year range). It is possible that scripts simply do not write characters within these ages differently, as the words with the highest correlation in these categories had little to no meaning and very low correlation values (less than 0.05). Young and old characters did have differences, which makes sense given that these characters often have slightly similar interactions on screen.

While we have explored this data, there is a lot that could be done to dive deeper into it. It would be very interesting to test other models on the data using a better processor, or to use feature selection in an attempt to draw out more interesting correlations from the data (without stop words, for instance). We believe that we have only scratched the surface of the value of this dataset and would love to pass it along to future classes or projects that have an interest in it.