## "A Bossy Sort of Voice"
#### A study of sexism in the Harry Potter series using Natural Language Processing

My eldest son is almost 6 and loves the Harry Potter series of books.  But, as I read them for the first time with him as a bedtime story, I noticed something I hadn't expected.

The sexism.

The sexism critique of the Harry Potter novels is not a new one - many people have [written](https://www.bustle.com/articles/136244-the-5-least-feminist-moments-in-harry-potter) excellent articles about Ron's treatment of Hermione and the portrayal of other female characters as cold, incompetent, or promiscuous.  But there were two types of analysis I didn't see: a quantitative or linguistics-based one, or something that looked at how the author herself portrays female characters in a biased light in how they speak.

My hope in doing this is that people will use some of these tools to look more critically at the language of literature we love and the popular press for signs of gender, racial, and other bias.

#### Getting started: setting up hypotheses and requirements

In this project, I will test three hypotheses:

1. Female characters are referred to by the narrator with sexist words throughout the series, while males are not described with that same language.

2. The narrator will use more sexist words when describing the female characters than other characters will use when talking about them.

3. Sexism as defined above will decline in both dialog and narration as the series progresses.


Taking this approach requires tools to do the following, which I'll tackle in this workbook:

1. Process text from all 7 'Harry Potter' books.

2. Look at the text at the word and sentence level.

3. Distinguish parts of speech, like nouns, verbs, and adjectives.

4. Separate dialog from narration.

5. Be able to tell female characters from male characters.

6. Have examples of sexist words and/or phrases to create a classifier. 

7. Reduce sexist words in the text to their root forms (e.g. 'shriller' should be equivalent to 'shrill').

8. Summarize the data to test our hypotheses.


#### Step 1: Process text from all 7 'Harry Potter' books.

By "process", I mean getting the text from the series into a format that can be read into a computer program for analysis.  For this project, I'll be using the [Python programming language](https://www.python.org/), and a few libraries (basically groupings of code that complete specialized common processes), most notably the [Natural Language Toolkit (NLTK)](http://www.nltk.org/).

To do this, I am going to read 7 files in the .txt format, each containing text of one of the books, using Python's built-in `open` function and `read` method.  You can get the files for these and other books from [this site](https://archive.org/stream/pdfy-ZhGUmtnn6LEtA7jL/Harry%20Potter%20and%20the%20Philosopher%27s%20Stone%2C%20by%20J.K.%20Rowling_djvu.txt) -- note that you can use existing txt format files, or copy text, paste it in Notepad and save as a .txt file.

In this code, the files are named 'hp' then the book number - i.e. `hp1.txt` - and stored in a folder called `corpus`. To run the code as it is, you'll need to recreate this schema or alter the code to fit the path you create.

Let's write a function to read in our text.  

In [1]:
def read_file(num):
    # build the path from the book number, e.g. corpus/hp6.txt
    with open('corpus/hp' + str(num) + '.txt', 'rt') as file_in:
        # read() returns the whole file as one long string
        text = file_in.read()
    return text

book_content = read_file(6)
print(book_content[3000:4002])

For a brief moment he allowed himself the impossible hope that nobody would answer him. However, a voice responded at once, a crisp, decisive voice that sounded as though it were reading a prepared statement. It was coming -- as the Prime Minister had known at the first cough -- from the froglike little man wearing a long silver wig who was depicted in a small, dirty oil painting in the far corner of the room.
"To the Prime Minister of Muggles. Urgent we meet. Kindly respond immediately. Sincerely, Fudge."
The man in the painting looked inquiringly at the Prime Minister.
"Er," said the Prime Minister, "listen... It's not a very good time for me... I'm waiting for a telephone call, you see... from the President of--"
"That can be rearranged," said the portrait at once. The Prime Minister's heart sank. He had been afraid of that.
"But I really was rather hoping to speak--"
"We shall arrange for the President to forget to call. He will telephone tomorrow night instead," said the little ma

Running this would return a single long string of text, and it's now usable in our program.  Notice that we pass in the number of the book we want to open with the `num` variable.
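
To load the whole series rather than one book, one option is to loop over the book numbers. The sketch below is only an illustration: `read_all_books` and its `folder` and `count` parameters are hypothetical names that do not appear elsewhere in this workbook, and it assumes the `corpus/hp<number>.txt` naming scheme described above.

```python
from pathlib import Path

def read_all_books(folder='corpus', count=7):
    """Return a dict mapping book number to that book's full text.

    Hypothetical helper: assumes files named hp1.txt ... hp7.txt in `folder`.
    """
    books = {}
    for num in range(1, count + 1):
        path = Path(folder) / f'hp{num}.txt'
        if not path.exists():
            # skip missing files so a partial corpus still loads
            continue
        books[num] = path.read_text()
    return books

books = read_all_books()
print(sorted(books))  # the book numbers that were found
```

Keeping the texts in a dictionary keyed by book number should make it easier to compare books later, which the third hypothesis depends on.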

#### Step 2: Separate dialog from narrative
In the text above, the opening and closing quotes look the same: `"`.  However, when we tokenize the text, NLTK makes opening and closing quotes look different, so we can see where dialog begins and ends.  Let's give it a try!

In [2]:
from nltk import word_tokenize

def tokenize_text(book_content):
    tokenized = word_tokenize(book_content)
    return tokenized

tokenized = tokenize_text(book_content)
print(tokenized[604:822])

['For', 'a', 'brief', 'moment', 'he', 'allowed', 'himself', 'the', 'impossible', 'hope', 'that', 'nobody', 'would', 'answer', 'him', '.', 'However', ',', 'a', 'voice', 'responded', 'at', 'once', ',', 'a', 'crisp', ',', 'decisive', 'voice', 'that', 'sounded', 'as', 'though', 'it', 'were', 'reading', 'a', 'prepared', 'statement', '.', 'It', 'was', 'coming', '--', 'as', 'the', 'Prime', 'Minister', 'had', 'known', 'at', 'the', 'first', 'cough', '--', 'from', 'the', 'froglike', 'little', 'man', 'wearing', 'a', 'long', 'silver', 'wig', 'who', 'was', 'depicted', 'in', 'a', 'small', ',', 'dirty', 'oil', 'painting', 'in', 'the', 'far', 'corner', 'of', 'the', 'room', '.', '``', 'To', 'the', 'Prime', 'Minister', 'of', 'Muggles', '.', 'Urgent', 'we', 'meet', '.', 'Kindly', 'respond', 'immediately', '.', 'Sincerely', ',', 'Fudge', '.', "''", 'The', 'man', 'in', 'the', 'painting', 'looked', 'inquiringly', 'at', 'the', 'Prime', 'Minister', '.', '``', 'Er', ',', "''", 'said', 'the', 'Prime', 'Minister

As you can see in the text above, our opening and closing quotes look like this: `'``' and "''"`.  This is a helpful tool in judging where dialog begins and ends across sentences.  Next, we will set up some rules around this and label parts of text.

#### Step 3: Label dialog and narration
In this step, we will label these two types of text while keeping them in order in case we need context.  To do this, we'll keep the list format to preserve order and create a tuple for each piece of dialog or narration.  So, for example:

`'``', 'That', 'can', 'be', 'rearranged', ',', "''", 'said', 'the', 'portrait', 'at', 'once', '.'`

Would become:

`('d', ['``', 'That', 'can', 'be', 'rearranged', ',', "''"]), ('n', ['said', 'the', 'portrait', 'at', 'once', '.'])`

What we'll do to achieve this is:
1. Create a new list called `parsed`.
2. Loop through the text in the `tokenized` variable (printed above).  When we hit an open quote character, stop, grab everything up to that point and put it in a tuple where the first value is `n` for "narration", and the second value is a list containing all of those words (e.g. `('n', ['For', 'a', 'brief', 'moment', 'he', 'allowed', 'himself', 'the', 'impossible', 'hope', 'that', 'nobody', 'would', 'answer', 'him', '.', 'However', ',', 'a', 'voice', 'responded', 'at', 'once', ',', 'a', 'crisp', ',', 'decisive', 'voice', 'that', 'sounded', 'as', 'though', 'it', 'were', 'reading', 'a', 'prepared', 'statement', '.', 'It', 'was', 'coming', '--', 'as', 'the', 'Prime', 'Minister', 'had', 'known', 'at', 'the', 'first', 'cough', '--', 'from', 'the', 'froglike', 'little', 'man', 'wearing', 'a', 'long', 'silver', 'wig', 'who', 'was', 'depicted', 'in', 'a', 'small', ',', 'dirty', 'oil', 'painting', 'in', 'the', 'far', 'corner', 'of', 'the', 'room', '.'])`).
3. Append this tuple to `parsed`.
4. Use the point where we found the open quote as a placeholder, then look ahead until we find a close quote.
5. Take the whole slice, from open quote to close quote.
6. Drop the slice into a tuple where the first value is `d` for "dialog" and the second value is a list containing all of the words and the quotes (e.g. `('d', ['``', 'To', 'the', 'Prime', 'Minister', 'of', 'Muggles', '.', 'Urgent', 'we', 'meet', '.', 'Kindly', 'respond', 'immediately', '.', 'Sincerely', ',', 'Fudge', '.', "''"])`), and append it to `parsed` as well.

The function to do this will be called `parse_text`. 

In [22]:
def parse_text(t):
    open_q = '``'
    close_q = "''"
    # current will hold words until an open quote is found
    current = []
    # parsed is the list we'll eventually return, and where the ('n', ['sentence']) or ('d', ['sentence']) tuples
    # will be appended
    parsed = []
    length = len(t)
    i = 0

    while i < length:
        word = t[i]

        if word != open_q:
            current.append(word)

        else:
            # an open quote ends the current run of narration; skip empty runs,
            # which happen when one piece of dialog directly follows another
            if current:
                parsed.append(('n', current))

            # start a new dialog run with the open quote itself
            current = [word]

            # look ahead, collecting words until we reach the close quote
            while i < length - 1:
                i += 1
                current.append(t[i])
                if t[i] == close_q:
                    break

            # if the text ends before a close quote, keep what we collected as dialog
            parsed.append(('d', current))
            current = []

        i += 1

    # keep any narration left over after the last piece of dialog
    if current:
        parsed.append(('n', current))

    return parsed
        

In [24]:
parsed = parse_text(tokenized)
print(parsed[:300])
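
Because the printed list is long, a quick summary can be easier to read. The following is a minimal sketch, not part of the original workbook: `summarize_parsed` is a hypothetical helper, and `sample_parsed` is a small hand-written list in the same `('n', [...])` / `('d', [...])` shape that `parse_text` returns.

```python
from collections import Counter

def summarize_parsed(parsed):
    """Count how many tokens fall under narration ('n') and dialog ('d')."""
    totals = Counter()
    for label, tokens in parsed:
        totals[label] += len(tokens)
    return dict(totals)

# hand-written sample in the shape returned by parse_text
sample_parsed = [
    ('d', ['``', 'That', 'can', 'be', 'rearranged', ',', "''"]),
    ('n', ['said', 'the', 'portrait', 'at', 'once', '.',
           'The', 'Prime', 'Minister', "'s", 'heart', 'sank', '.']),
]

print(summarize_parsed(sample_parsed))  # {'d': 7, 'n': 13}
```

Note that the dialog count includes the quote tokens themselves, since they are kept inside each `'d'` tuple.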



#### Step 2: Look at the text at the word and sentence level

While this function brings the text into the program, we need to be able to look at it at the *sentence* level - this will help us identify dialog vs. narration - and at the *word* level to flag when sexist terms are used.

This next function will break our big long string of text from `read_file` into a list of sentences, then break each sentence into a list of words and punctuation, using the NLTK functions `sent_tokenize` and `word_tokenize`.

First, we will import the parts of NLTK we need, then write the function, called `split_text`.

In [None]:
from nltk import sent_tokenize, word_tokenize

def split_text(textfile):
    # using the sent_tokenize function will break our text into a list of strings,
    # one per sentence; it relies on NLTK's pre-trained sentence tokenizer
    # rather than simply splitting on punctuation
    s_tokens = sent_tokenize(textfile)
    
    # turn each sentence into a list of word tokens with a list comprehension
    tokenized = [word_tokenize(s) for s in s_tokens]
    return tokenized

The `split_text` function returns a list that includes the entire text of a book, which has a list inside it for every sentence, made up of each individual word or punctuation mark.  For the first few sentences of *The Philosopher's Stone* the output looks like this: 
```
[['CHAPTER', 'ONE', 'THE', 'BOY', 'WHO', 'LIVED', 'Mr.', 'and', 'Mrs.', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.'], ['They', 'were', 'the', 'last', 'people', 'you', "'d", 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'did', "n't", 'hold', 'with', 'such', 'nonsense', '.'], ['Mr.', 'Dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'Grunnings', ',', 'which', 'made', 'drills', '.'], ['He', 'was', 'a', 'big', ',', 'beefy', 'man', 'with', 'hardly', 'any', 'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'mustache', '.'], ['Mrs.', 'Dursley', 'was', 'thin', 'and', 'blonde', 'and', 'had', 'nearly', 'twice', 'the', 'usual', 'amount', 'of', 'neck', ',', 'which', 'came', 'in', 'very', 'useful', 'as', 'she', 'spent', 'so', 'much', 'of', 'her', 'time', 'craning', 'over', 'garden', 'fences', ',', 'spying', 'on', 'the', 'neighbors', '.']]
```
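
Keeping the sentences as separate lists makes it straightforward to find every sentence that contains a given word. Here is a minimal sketch of that idea; `sentences_with_word` is a hypothetical helper, and `sample` reuses two of the sentences from the output above.

```python
def sentences_with_word(tokenized, target):
    """Return the indices of sentences that contain `target` (case-insensitive)."""
    target = target.lower()
    return [
        index
        for index, sentence in enumerate(tokenized)
        if any(token.lower() == target for token in sentence)
    ]

sample = [
    ['Mr.', 'Dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called',
     'Grunnings', ',', 'which', 'made', 'drills', '.'],
    ['He', 'was', 'a', 'big', ',', 'beefy', 'man', 'with', 'hardly', 'any',
     'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large',
     'mustache', '.'],
]

print(sentences_with_word(sample, 'neck'))  # [1]
```

Returning indices rather than the sentences themselves keeps it possible to look at the neighboring sentences for context.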

#### Step 3: Distinguish parts of speech, like nouns, verbs, and adjectives
Now that the text is broken into words, we can use NLTK's classification function to categorize words by the part of speech they represent.

The `pos_tag` function of NLTK that we'll use to do this takes in a list of word tokens (here, one sentence at a time) and returns a list of tuples, each pairing a word with a code for how it was classified.  So "Harry" in the sentence "Harry is Petunia's nephew." would come back as `('Harry', 'NNP')`.  

We will import the `pos_tag` function from NLTK and tag the words in a new function called `tagged_text`.

In [None]:
from nltk import pos_tag

def tagged_text(tokenized):
    # pos_tag expects a list of tokens, so we tag one sentence at a time
    tagged = [pos_tag(sentence) for sentence in tokenized]
    return tagged

If we call `tagged_text` on that same passage from *The Philosopher's Stone*, this is the output for the first few sentences - the same passage that we saw in Step 2.
~~~~ 
[[('CHAPTER', 'NN'), ('ONE', 'CD'), ('THE', 'NNP'), ('BOY', 'NNP'), ('WHO', 'NNP'), ('LIVED', 'NNP'), ('Mr.', 'NNP'), ('and', 'CC'), ('Mrs.', 'NNP'), ('Dursley', 'NNP'), (',', ','), ('of', 'IN'), ('number', 'NN'), ('four', 'CD'), (',', ','), ('Privet', 'NNP'), ('Drive', 'NNP'), (',', ','), ('were', 'VBD'), ('proud', 'JJ'), ('to', 'TO'), ('say', 'VB'), ('that', 'IN'), ('they', 'PRP'), ('were', 'VBD'), ('perfectly', 'RB'), ('normal', 'JJ'), (',', ','), ('thank', 'NN'), ('you', 'PRP'), ('very', 'RB'), ('much', 'RB'), ('.', '.')], [('They', 'PRP'), ('were', 'VBD'), ('the', 'DT'), ('last', 'JJ'), ('people', 'NNS'), ('you', 'PRP'), ("'d", 'MD'), ('expect', 'VB'), ('to', 'TO'), ('be', 'VB'), ('involved', 'VBN'), ('in', 'IN'), ('anything', 'NN'), ('strange', 'JJ'), ('or', 'CC'), ('mysterious', 'JJ'), (',', ','), ('because', 'IN'), ('they', 'PRP'), ('just', 'RB'), ('did', 'VBD'), ("n't", 'RB'), ('hold', 'VB'), ('with', 'IN'), ('such', 'JJ'), ('nonsense', 'NN'), ('.', '.')], [('Mr.', 'NNP'), ('Dursley', 'NNP'), ('was', 'VBD'), ('the', 'DT'), ('director', 'NN'), ('of', 'IN'), ('a', 'DT'), ('firm', 'NN'), ('called', 'VBN'), ('Grunnings', 'NNP'), (',', ','), ('which', 'WDT'), ('made', 'VBD'), ('drills', 'NNS'), ('.', '.')], [('He', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('big', 'JJ'), (',', ','), ('beefy', 'JJ'), ('man', 'NN'), ('with', 'IN'), ('hardly', 'RB'), ('any', 'DT'), ('neck', 'NN'), (',', ','), ('although', 'IN'), ('he', 'PRP'), ('did', 'VBD'), ('have', 'VB'), ('a', 'DT'), ('very', 'RB'), ('large', 'JJ'), ('mustache', 'NN'), ('.', '.')], [('Mrs.', 'NNP'), ('Dursley', 'NNP'), ('was', 'VBD'), ('thin', 'JJ'), ('and', 'CC'), ('blonde', 'NN'), ('and', 'CC'), ('had', 'VBD'), ('nearly', 'RB'), ('twice', 'RB'), ('the', 'DT'), ('usual', 'JJ'), ('amount', 'NN'), ('of', 'IN'), ('neck', 'NN'), (',', ','), ('which', 'WDT'), ('came', 'VBD'), ('in', 'IN'), ('very', 'RB'), ('useful', 'JJ'), ('as', 'IN'), ('she', 'PRP'), ('spent', 'VBD'), ('so', 'RB'), ('much', 'JJ'), ('of', 'IN'), 
('her', 'PRP'), ('time', 'NN'), ('craning', 'NN'), ('over', 'IN'), ('garden', 'NN'), ('fences', 'NNS'), (',', ','), ('spying', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('neighbors', 'NNS'), ('.', '.')]]
~~~~ 

Each of the tags - the second value in each tuple - represents a part of speech.  For example, 'NNP' means the word has been tagged by NLTK as a proper noun, and 'VB' means the word has been tagged as a verb in its base form.  You can find a full list [here](https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/), but in the next steps we'll be focused primarily on adjectives, verbs and nouns.
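
Since adjectives are one of the main targets, it helps to be able to pull out every word with a given kind of tag. The sketch below is illustrative only: `words_with_tag_prefix` is a hypothetical helper, and it relies on the fact that the adjective tags in this tag set all start with `JJ` (`JJ`, `JJR`, `JJS`). The sample is one tagged sentence copied from the output above.

```python
def words_with_tag_prefix(tagged_sentences, prefix):
    """Collect every word whose part-of-speech tag starts with `prefix`."""
    return [
        word
        for sentence in tagged_sentences
        for word, tag in sentence
        if tag.startswith(prefix)
    ]

sample_tagged = [[
    ('He', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('big', 'JJ'), (',', ','),
    ('beefy', 'JJ'), ('man', 'NN'), ('with', 'IN'), ('hardly', 'RB'),
    ('any', 'DT'), ('neck', 'NN'), (',', ','), ('although', 'IN'),
    ('he', 'PRP'), ('did', 'VBD'), ('have', 'VB'), ('a', 'DT'),
    ('very', 'RB'), ('large', 'JJ'), ('mustache', 'NN'), ('.', '.'),
]]

print(words_with_tag_prefix(sample_tagged, 'JJ'))  # ['big', 'beefy', 'large']
```

Matching on a prefix rather than an exact tag means comparative and superlative adjectives are picked up too, and the same helper works for verbs (`VB`) and nouns (`NN`).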

#### Step 4: Separate dialog from narration.

Recall that we want to look at how the narrator describes women separately from the dialog - how can we do this?

Fortunately, in these books, dialog is enclosed in quotation marks.  After running the functions we've created on the text, dialog generally looks something like these snippets from the end of *The Philosopher's Stone*.

~~~~
[('``', '``'), ('Thanks', 'NNS'), (',', ','), ("''", "''"), ('said', 'VBD'), ('Harry', 'NNP'), (',', ','), ('``', '``'), ('I', 'PRP'), ("'ll", 'MD'), ('need', 'VB'), ('something', 'NN'), ('to', 'TO'), ('look', 'VB'), ('forward', 'RB'), ('to', 'TO'), ('.', '.'), ("''", "''")]

[('``', '``'), ('There', 'EX'), ('he', 'PRP'), ('is', 'VBZ'), (',', ','), ('Mom', 'NNP'), (',', ','), ('there', 'EX'), ('he', 'PRP'), ('is', 'VBZ'), (',', ','), ('look', 'NN'), ('!', '.'), ("''", "''")]
~~~~

Also, notice that in my text, the beginning quotes look slanted, like this `('``', '``')` and ending quotes look straight, like this `("''", "''")`.  We can use this information to distinguish the beginning and end of a piece of dialog.

Next, we'll create a function called `seperate_narration_dialog` that will take in our big list of lists of tagged words and pull all of the dialog out, so we are left with a list of narration that we can analyze.  

In [None]:
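textfile = read_file(6)
tokenized = split_text(textfile)
# tag each sentence so the result can be passed to seperate_narration_dialog
tagged = tagged_text(tokenized)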


In [None]:
print(tokenized[:25])

Parsing out the dialog from narration is tricky.  There are five cases to consider.

###### Case 1: The entire sentence is a quote
Example: `"But I really was rather hoping to speak--"`
Solution: Check that the first character is an open quote and the last character is a closed quote.  If this is true, the entire sentence is counted as dialog.
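
The Case 1 check can be sketched as a small function. This is a hedged illustration rather than the workbook's final code: `is_entire_quote` is a hypothetical name, and the extra requirement that the sentence holds exactly one open and one close quote is my own assumption, added so that a sentence with narration sandwiched between two quotes is not counted as pure dialog. Both samples are the tagged snippets shown earlier in this step.

```python
OPEN_Q = ('``', '``')
CLOSE_Q = ("''", "''")

def is_entire_quote(sent):
    """Return True if a tagged sentence is one quote from start to finish."""
    if len(sent) < 2:
        return False
    starts_and_ends = sent[0] == OPEN_Q and sent[-1] == CLOSE_Q
    # assumption: exactly one pair of quotes, so no narration sits in the middle
    single_pair = sent.count(OPEN_Q) == 1 and sent.count(CLOSE_Q) == 1
    return starts_and_ends and single_pair

whole_quote = [
    ('``', '``'), ('There', 'EX'), ('he', 'PRP'), ('is', 'VBZ'), (',', ','),
    ('Mom', 'NNP'), (',', ','), ('there', 'EX'), ('he', 'PRP'), ('is', 'VBZ'),
    (',', ','), ('look', 'NN'), ('!', '.'), ("''", "''"),
]
mixed = [
    ('``', '``'), ('Thanks', 'NNS'), (',', ','), ("''", "''"), ('said', 'VBD'),
    ('Harry', 'NNP'), (',', ','), ('``', '``'), ('I', 'PRP'), ("'ll", 'MD'),
    ('need', 'VB'), ('something', 'NN'), ('to', 'TO'), ('look', 'VB'),
    ('forward', 'RB'), ('to', 'TO'), ('.', '.'), ("''", "''"),
]

print(is_entire_quote(whole_quote))  # True
print(is_entire_quote(mixed))        # False
```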

###### Case 2: The sentence contains a quote and some narration
Example: `"Hello?" he said, trying to sound braver than he felt.`
Solution: If there are equal numbers of open and close qu

In [None]:
def seperate_narration_dialog(tagged):
    narration_only = []
    dialog_only = []
    open_quotes = ('``', '``')
    close_quotes = ("''", "''")
    i = 0
    # iterate through the list of tagged sentences
    while i < len(tagged):
        sent = tagged[i]

        # make lists of where we see open and close quotes in the sentence
        open_quote_indices = [o for o, x in enumerate(sent) if x == open_quotes]
        close_quote_indices = [c for c, y in enumerate(sent) if y == close_quotes]

        # case 1: there is dialog that covers more than one sentence
        if len(open_quote_indices) > len(close_quote_indices):
            # note: any narration before the open quote stays with the dialog here
            quote = list(sent)
            j = 1
            # keep adding sentences until we reach one containing a close quote
            while i + j < len(tagged):
                quote.extend(tagged[i + j])
                if close_quotes in tagged[i + j]:
                    break
                j += 1
            dialog_only.append(quote)
            i = i + j + 1

        # case 2: every quote that opens in this sentence also closes in it
        elif open_quote_indices and len(open_quote_indices) == len(close_quote_indices):
            # pair each open quote with the close quote that follows it
            pairs = list(zip(open_quote_indices, close_quote_indices))
            dialog_only.extend(sent[o:c + 1] for o, c in pairs)
            # whatever falls outside the quoted spans is narration
            quoted = {k for o, c in pairs for k in range(o, c + 1)}
            narration = [w for k, w in enumerate(sent) if k not in quoted]
            if narration:
                narration_only.append(narration)
            i += 1

        # case 3: there is no dialog in the sentence
        else:
            narration_only.append(sent)
            i += 1

    return (narration_only, dialog_only)

Inside `seperate_narration_dialog`, `open_quote_indices` is a list of the positions of the opening quotes in a sentence.  

If there's just one value, like `[0]`, there is just one opening quote, like in this sentence:
~~~~
``Ah...Prime Minister," said Cornelius Fudge, striding forward with his hand outstretched.
~~~~

However, if we get two values, like `[0, 22]`, it means there are two opening quotes in the sentence, like this one:
~~~~
``But," said the Prime Minister breathlessly, watching his teacup chewing on the corner of his next speech, ``but why -- why has nobody told me --?"
~~~~

And so on.

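Then we want to match these open quotes to closed quotes, seeing if the numbers are equal.  In the two tagged sentences below, the first has an opening quote at position 10 but no closing quote (`[10] []`), because the dialog carries on into the second sentence: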

[('Fudge', 'NNP'), ('took', 'VBD'), ('a', 'DT'), ('great', 'JJ'), (',', ','), ('deep', 'JJ'), ('breath', 'NN'), ('and', 'CC'), ('said', 'VBD'), (',', ','), ('``', '``'), ('Prime', 'NNP'), ('Minister', 'NNP'), (',', ','), ('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('sorry', 'JJ'), ('to', 'TO'), ('have', 'VB'), ('to', 'TO'), ('tell', 'VB'), ('you', 'PRP'), ('that', 'IN'), ('he', 'PRP'), ("'s", 'VBZ'), ('back', 'RB'), ('.', '.')] [('He-Who-Must-Not-Be-Named', 'NNP'), ('is', 'VBZ'), ('back', 'RB'), ('.', '.'), ("''", "''")] [10] []
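
The index lists above can be reproduced with a small standalone function. This is a minimal sketch: `quote_indices` is a hypothetical helper that mirrors the two list comprehensions used in `seperate_narration_dialog`, and the data is the pair of tagged sentences shown above.

```python
OPEN_Q = ('``', '``')
CLOSE_Q = ("''", "''")

def quote_indices(sent):
    """Return (open_positions, close_positions) for one tagged sentence."""
    opens = [i for i, pair in enumerate(sent) if pair == OPEN_Q]
    closes = [i for i, pair in enumerate(sent) if pair == CLOSE_Q]
    return opens, closes

first = [
    ('Fudge', 'NNP'), ('took', 'VBD'), ('a', 'DT'), ('great', 'JJ'), (',', ','),
    ('deep', 'JJ'), ('breath', 'NN'), ('and', 'CC'), ('said', 'VBD'), (',', ','),
    ('``', '``'), ('Prime', 'NNP'), ('Minister', 'NNP'), (',', ','), ('I', 'PRP'),
    ('am', 'VBP'), ('very', 'RB'), ('sorry', 'JJ'), ('to', 'TO'), ('have', 'VB'),
    ('to', 'TO'), ('tell', 'VB'), ('you', 'PRP'), ('that', 'IN'), ('he', 'PRP'),
    ("'s", 'VBZ'), ('back', 'RB'), ('.', '.'),
]
second = [
    ('He-Who-Must-Not-Be-Named', 'NNP'), ('is', 'VBZ'), ('back', 'RB'),
    ('.', '.'), ("''", "''"),
]

print(quote_indices(first))   # ([10], [])
print(quote_indices(second))  # ([], [4])

# more opens than closes means the dialog continues into a later sentence
opens, closes = quote_indices(first)
print(len(opens) > len(closes))  # True
```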

In [None]:
narration_only, dialog_only = seperate_narration_dialog(tagged)
print("dialog", dialog_only)

#### Sexist language
Since sexism in language is sometimes very subtle, a challenge for this project is identifying what words or combinations of words and their context should be considered biased.

After some searching, I found a couple of excellent lists.  Because the Harry Potter series is written in English by a British writer, I focused on sources from the UK and countries in the Commonwealth.  Using this blog post by a [New Zealand blogger](http://sacraparental.com/2016/05/14/everyday-misogyny-122-subtly-sexist-words-women/) I had a first set of words and some excellent categories to begin with.  I found a number of [other](http://time.com/4268325/history-calling-women-shrill/) excellent articles about sexism in language, which I used to add to the `sexist_words` Python dictionary below.

In [None]:
sexist_words ={ 
    'assertiveness': ['bossy', 'abrasive', 'ball-buster', 'aggressive', 'shrill', 'bolshy', 'intense', 'stroppy', 'forward', 'mannish', 'strident', 'know-it-all'],
    'behavior' : ['cackle', 'shriek', 'caterwaul', 'yowl', 'screech','gossip', 'dramatic', 'catty', 'bitch', 'nag', 'cold', 'icy', 'shrew', 'humorless', 'man-hater', 'banshee', 'fishwife', 'lippy', 'ditzy', 'diva', 'prima donna', 'feisty', 'ladylike', 'bubbly', 'vivacious', 'flirt', 'sass', 'chatty', 'demure', 'modest', 'emotional', 'hysterical', 'hormonal', 'menstrual', 'flaky', 'moody', 'over-sensitive'],
    'sexuality': ['slut', 'trollop', 'frigid', 'easy', 'tease', 'loose', 'man-eater', 'prude', 'curvy', 'cheap', 'frumpy', 'faded', 'mousey', 'clotheshorse', 'cow', 'hag'],
    'relationship': ['spinster', 'barren', 'housewife', 'houseproud', 'soccer mom', 'mistress', 'kept woman'],
    'praise': ['caring', 'compassionate', 'hard-working', 'conscientious', 'dependable', 'diligent', 'dedicated', 'tactful', 'interpersonal', 'warm', 'helpful'],
}
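
One way a dictionary like this could be used is to turn it into a word-to-category lookup and check each token against it. The sketch below is only a starting point under stated assumptions: `build_lookup` and `flag_words` are hypothetical helpers, `lexicon_subset` is a small hand-picked slice of `sexist_words` so the example stands alone, and the sample tokens are hand-tokenized rather than produced by NLTK.

```python
lexicon_subset = {
    'assertiveness': ['bossy', 'shrill', 'strident'],
    'behavior': ['cackle', 'shriek', 'nag'],
}

def build_lookup(lexicon):
    """Invert {category: [words]} into {word: category}."""
    lookup = {}
    for category, words in lexicon.items():
        for word in words:
            # if a word appears in two categories, the first one listed wins
            lookup.setdefault(word, category)
    return lookup

def flag_words(tokens, lookup):
    """Return (token, category) for every token found in the lookup."""
    flagged = []
    for token in tokens:
        category = lookup.get(token.lower())
        if category is not None:
            flagged.append((token, category))
    return flagged

lookup = build_lookup(lexicon_subset)
tokens = ['She', 'had', 'a', 'bossy', 'sort', 'of', 'voice', '.']
print(flag_words(tokens, lookup))  # [('bossy', 'assertiveness')]
```

This only matches exact single tokens, so inflected forms such as 'shrieked' and two-word entries such as 'prima donna' would be missed; that is why reducing words to their root forms is on the requirements list.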

#### Step 3: Parsing the text
