## "A Bossy Sort of Voice"
#### A study of sexism in the Harry Potter series using Natural Language Processing

My eldest son is almost 6 and loves the Harry Potter series of books.  But, as I read them for the first time with him as a bedtime story, I noticed something I hadn't expected.  

The sexism.

The sexism critique of the Harry Potter novels is not a new one - many people have [written](https://www.bustle.com/articles/136244-the-5-least-feminist-moments-in-harry-potter) excellent articles about Ron's treatment of Hermione and the portrayal of other female characters as cold, incompetent, or promiscuous.  But there were two types of analysis I didn't see: a quantitative or linguistics-based one, and one that looked at how the author herself portrays female characters in a biased light through the way they speak.

My hope in doing this is that people will use some of these tools to look more critically at the language of literature we love and the popular press for signs of gender, racial, and other bias.

#### Getting started: setting up hypotheses and requirements

In this project, I will test three hypotheses:

1. Female characters are referred to by the narrator with sexist words throughout the series, while male characters are not described with that same language.

2. The narrator will use more sexist words when describing the female characters than other characters will use when talking about them.

3. Sexism as defined above will decline in both dialog and narration as the series progresses.


Taking this approach requires tools to do the following, which I'll tackle in this workbook:

1. Process text from all 7 'Harry Potter' books.

2. Separate dialog from narration.

3. Label dialog and narration.

4. Look at the text at the word and sentence level.

5. Distinguish parts of speech, like nouns, verbs, and adjectives.

6. Be able to tell female characters from male characters.

7. Have examples of sexist words and/or phrases to create a classifier.

8. Reduce sexist words in the text to their root forms (e.g. 'shriller' should be equivalent to 'shrill').

9. Summarize the data to test our hypotheses.


#### Step 1: Process text from all 7 'Harry Potter' books.

What I mean by "process" is getting the text from the series into a format that can be read into a computer program for analysis.  For this project, I'll be using the [Python programming language](https://www.python.org/), and a few libraries (basically groupings of code that complete specialized common tasks), most notably the [Natural Language Toolkit (NLTK)](http://www.nltk.org/).

To do this, I am going to read 7 files in the .txt format, each containing text of one of the books, using Python's built-in `open` function and `read` method.  You can get the files for these and other books from [this site](https://archive.org/stream/pdfy-ZhGUmtnn6LEtA7jL/Harry%20Potter%20and%20the%20Philosopher%27s%20Stone%2C%20by%20J.K.%20Rowling_djvu.txt) -- note that you can use existing txt format files, or copy text, paste it in Notepad and save as a .txt file.

In this code, the files are named 'hp' then the book number - i.e. `hp1.txt` - and stored in a folder called `corpus`. To run the code as it is, you'll need to recreate this schema or alter the code to fit the path you create.

Let's write a function to read in our text.  

In [68]:
def read_file(num):
    text = ''
    with open('corpus/hp'+ str(num) + '.txt', 'rt') as file_in:
        for line in file_in:
            text = text + line
    return text

book_content = read_file(1)
print(book_content[3000:4002])

s mind. As he drove toward town he thought of nothing except a large order of drills he was hoping to get that day.

But on the edge of town, drills were driven out of his mind by something else. As he sat in the usual morning traffic jam, he couldn't help noticing that there seemed to be a lot of strangely dressed people about. People in cloaks. Mr. Dursley couldn't bear people who dressed in funny clothes -- the getups you saw on young people! He supposed this was some stupid new fashion. He drummed his fingers on the steering wheel and his eyes fell on a huddle of these weirdos standing quite close by. They were whispering excitedly together. Mr. Dursley was enraged to see that a couple of them weren't young at all; why, that man had to be older than he was, and wearing an emerald-green cloak! The nerve of him! But then it struck Mr. Dursley that this was probably some silly stunt -- these people were obviously collecting for something...

yes, that would be it. The traffic moved on

Running this would return a single long string of text, and it's now usable in our program.  Notice that we pass in the number of the book we want to open with the `num` variable.
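
To load the whole series rather than one book at a time, the same idea can be wrapped in a loop.  This is only a minimal sketch, assuming the `corpus/hp<N>.txt` layout described above; the name `read_corpus` and its parameters are my own illustration, and the `utf-8` encoding is a guess you may need to change for your files.

```python
from pathlib import Path


def read_corpus(folder="corpus", count=7, encoding="utf-8"):
    """Return a dict mapping book number -> full text of that book.

    Assumes files named hp1.txt ... hp<count>.txt inside `folder`.
    The encoding is an assumption; adjust it to match your files.
    """
    books = {}
    for num in range(1, count + 1):
        path = Path(folder) / f"hp{num}.txt"
        books[num] = path.read_text(encoding=encoding)
    return books
```

With the files in place, `books = read_corpus()` would give you `books[1]` through `books[7]`, each a single long string like `book_content`.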

#### Step 2: Separate dialog from narration
In the text above, the opening and closing quotes look the same: `"`.  However, when we tokenize the text, NLTK makes opening and closing quotes look different, so we can see where dialog begins and ends.  Let's give it a try!

In [69]:
from nltk import word_tokenize

def tokenize_text(book_content):
    tokenized = word_tokenize(book_content)
    return tokenized

tokenized = tokenize_text(book_content)
print (tokenized[604:822])

['read', 'maps', 'or', 'signs', '.', 'Mr.', 'Dursley', 'gave', 'himself', 'a', 'little', 'shake', 'and', 'put', 'the', 'cat', 'out', 'of', 'his', 'mind', '.', 'As', 'he', 'drove', 'toward', 'town', 'he', 'thought', 'of', 'nothing', 'except', 'a', 'large', 'order', 'of', 'drills', 'he', 'was', 'hoping', 'to', 'get', 'that', 'day', '.', 'But', 'on', 'the', 'edge', 'of', 'town', ',', 'drills', 'were', 'driven', 'out', 'of', 'his', 'mind', 'by', 'something', 'else', '.', 'As', 'he', 'sat', 'in', 'the', 'usual', 'morning', 'traffic', 'jam', ',', 'he', 'could', "n't", 'help', 'noticing', 'that', 'there', 'seemed', 'to', 'be', 'a', 'lot', 'of', 'strangely', 'dressed', 'people', 'about', '.', 'People', 'in', 'cloaks', '.', 'Mr.', 'Dursley', 'could', "n't", 'bear', 'people', 'who', 'dressed', 'in', 'funny', 'clothes', '--', 'the', 'getups', 'you', 'saw', 'on', 'young', 'people', '!', 'He', 'supposed', 'this', 'was', 'some', 'stupid', 'new', 'fashion', '.', 'He', 'drummed', 'his', 'fingers', 'on

In the tokenized text, our opening and closing quotes look like this: `'``'` and `"''"`.  This is a helpful tool in judging where dialog begins and ends across sentences.  Next, we will set up some rules around this and label parts of text.
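
Before leaning on these tokens, it can be worth checking that a book has roughly as many opening quotes as closing ones, since a stray quote in the source file would throw the labeling off.  A small sketch of that check follows; `quote_counts` is a hypothetical helper of mine, and the sample list is hand-written rather than taken from the tokenizer.

```python
from collections import Counter

OPEN_Q = "``"   # how word_tokenize renders an opening double quote
CLOSE_Q = "''"  # how it renders a closing double quote


def quote_counts(tokens):
    """Count opening and closing quote tokens in a list of tokens."""
    counts = Counter(tok for tok in tokens if tok in (OPEN_Q, CLOSE_Q))
    return counts[OPEN_Q], counts[CLOSE_Q]


# hand-written sample, not real tokenizer output
sample = ["``", "That", "can", "be", "rearranged", ",", "''",
          "said", "the", "portrait", "at", "once", "."]
opened, closed = quote_counts(sample)
print(opened, closed)  # 1 1
```

If the two numbers are far apart for a real book, the text file probably needs a closer look before the next step.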

#### Step 3: Label dialog and narration
In this step, we will label these two types of text while keeping them in order in case we need context.  To do this, we'll keep the list format to preserve order and create a tuple for each piece of dialog or narration.  So, for example:

`'``', 'That', 'can', 'be', 'rearranged', ',', "''", 'said', 'the', 'portrait', 'at', 'once', '.'`

Would become:

`('d', ['``', 'That', 'can', 'be', 'rearranged', ',', "''"]), ('n', ['said', 'the', 'portrait', 'at', 'once', '.'])`

What we'll do to achieve this is:
1. Create a new list called `parsed`.
2. Loop through the tokens in the `tokenized` variable (printed above).  When we hit an open quote character, stop, grab everything up to that point and make it a list in a tuple where the first value is `n` for "narration", and the second value is a list containing all of those words (e.g. `('n', ['For', 'a', 'brief', 'moment', 'he', 'allowed', 'himself', 'the', 'impossible', 'hope', 'that', 'nobody', 'would', 'answer', 'him', '.', 'However', ',', 'a', 'voice', 'responded', 'at', 'once', ',', 'a', 'crisp', ',', 'decisive', 'voice', 'that', 'sounded', 'as', 'though', 'it', 'were', 'reading', 'a', 'prepared', 'statement', '.', 'It', 'was', 'coming', '--', 'as', 'the', 'Prime', 'Minister', 'had', 'known', 'at', 'the', 'first', 'cough', '--', 'from', 'the', 'froglike', 'little', 'man', 'wearing', 'a', 'long', 'silver', 'wig', 'who', 'was', 'depicted', 'in', 'a', 'small', ',', 'dirty', 'oil', 'painting', 'in', 'the', 'far', 'corner', 'of', 'the', 'room', '.'])`).  
3. Append this tuple to `parsed`.
4. Use the point where we found the open quote as a placeholder, then look ahead until we find a close quote.
5. Take the whole slice, from open quote to close quote.
6. Drop the slice into a tuple where the first value is `d` for "dialog" and the second value is a list containing all of the words and the quotes (e.g. `('d', ['``', 'To', 'the', 'Prime', 'Minister', 'of', 'Muggles', '.', 'Urgent', 'we', 'meet', '.', 'Kindly', 'respond', 'immediately', '.', 'Sincerely', ',', 'Fudge', '.', "''"])`), and append it to `parsed`.

The function to do this will be called `parse_text`. 

In [70]:
def parse_text(t):
    open_q = '``'
    close_q = "''"
    # current will hold words until a quote token is found
    current = []
    # parsed is the list we'll eventually return, and where the
    # ('n', ['sentence']) or ('d', ['sentence']) tuples will be appended
    parsed = []
    length = len(t)
    i = 0

    while i < length:
        word = t[i]

        if word != open_q and word != close_q:
            current.append(word)
        else:
            # a quote token ends the narration collected so far
            # (this can be an empty list when two quotes are back to back)
            parsed.append(('n', current))

            # start a dialog passage with the quote token itself
            current = [word]

            # look ahead until the closing quote, or the end of the text
            while i < length - 1:
                i += 1
                current.append(t[i])
                if t[i] == close_q:
                    break

            parsed.append(('d', current))
            current = []

        i += 1

    # keep any narration that follows the last piece of dialog
    if current:
        parsed.append(('n', current))

    return parsed
        

In [71]:
parsed = parse_text(tokenized)

# the text below is the same as above, now categorized with an 'n' (narration) or 'd' (dialog)
print(parsed[:13])

[('n', ['CHAPTER', 'ONE', 'THE', 'BOY', 'WHO', 'LIVED', 'Mr.', 'and', 'Mrs.', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.', 'They', 'were', 'the', 'last', 'people', 'you', "'d", 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', ',', 'because', 'they', 'just', 'did', "n't", 'hold', 'with', 'such', 'nonsense', '.', 'Mr.', 'Dursley', 'was', 'the', 'director', 'of', 'a', 'firm', 'called', 'Grunnings', ',', 'which', 'made', 'drills', '.', 'He', 'was', 'a', 'big', ',', 'beefy', 'man', 'with', 'hardly', 'any', 'neck', ',', 'although', 'he', 'did', 'have', 'a', 'very', 'large', 'mustache', '.', 'Mrs.', 'Dursley', 'was', 'thin', 'and', 'blonde', 'and', 'had', 'nearly', 'twice', 'the', 'usual', 'amount', 'of', 'neck', ',', 'which', 'came', 'in', 'very', 'useful', 'as', 'she', 'spent', 'so', 'much', 'of', 'her', 'time', 'cra

To recap: now we have a big list of tuples that represents an entire book, with the text still in its original order. 

Each item in the list is a tuple.  

The first value in the tuple is an `'n'` for narration or `'d'` for dialog.  

The second item in the tuple is a list of the words and punctuation marks in that passage.  We'll tag each of these with the part of speech it represents in a later step.
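
Since every passage carries its label, it's easy to get a rough sense of how much of a book is narration versus dialog, which will matter later when we compare the two.  Here's a minimal sketch; `token_counts_by_label` is a name I made up for illustration, and the example list is hand-made in the same shape as `parsed`.

```python
from collections import Counter


def token_counts_by_label(parsed):
    """Total number of tokens under each label ('n' or 'd')."""
    totals = Counter()
    for label, tokens in parsed:
        totals[label] += len(tokens)
    return totals


# a tiny hand-made example in the same shape as `parsed`
example = [
    ("d", ["``", "That", "can", "be", "rearranged", ",", "''"]),
    ("n", ["said", "the", "portrait", "at", "once", "."]),
]
print(token_counts_by_label(example))
```

Note that the dialog count includes the quote tokens themselves, so it slightly overstates the number of spoken words.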


Now that we have our text organized the way we want it, we can run some analysis on the narration to see how Rowling portrays female characters.   There are a couple of challenges with this.

First, we need to isolate narrative passages that refer to female characters, either by name or pronoun.

The next challenge is deciding how we know that something is sexist.  There are a couple of ways to approach this.  The first is to decide what rules we will use to judge whether something is sexist, then apply those rules to the passages we isolate.  The other would be to create an algorithm, feed it a labeled training set of sexist phrases and train it to recognize others.

For the purposes of this project, I'm going to use the first approach: writing a set of rules and testing each narrative passage against it.  This is more transparent and in my opinion less subject to bias than finding and labeling passages of text, and possibly less error prone.


#### Step 4: Isolating the right narrative passages
Let's create a list of all of the narrative that describes a female character.  The most basic step is creating a list of all of the narrative passages.  

To do this, we're going to loop through our full text variable, `parsed`, and pull out only the parts marked with `n`, storing them in the `isolate_narrative` list below.

In [72]:
isolate_narrative = [p for p in parsed if p[0] == 'n']
#sample
print(isolate_narrative[2:3])

[('n', ['Mr.', 'Dursley', 'stopped', 'dead', '.', 'Fear', 'flooded', 'him', '.', 'He', 'looked', 'back', 'at', 'the', 'whisperers', 'as', 'if', 'he', 'wanted', 'to', 'say', 'something', 'to', 'them', ',', 'but', 'thought', 'better', 'of', 'it', '.', 'He', 'dashed', 'back', 'across', 'the', 'road', ',', 'hurried', 'up', 'to', 'his', 'office', ',', 'snapped', 'at', 'his', 'secretary', 'not', 'to', 'disturb', 'him', ',', 'seized', 'his', 'telephone', ',', 'and', 'had', 'almost', 'finished', 'dialing', 'his', 'home', 'number', 'when', 'he', 'changed', 'his', 'mind', '.', 'He', 'put', 'the', 'receiver', 'back', 'down', 'and', 'stroked', 'his', 'mustache', ',', 'thinking', '...', 'no', ',', 'he', 'was', 'being', 'stupid', '.', 'Potter', 'was', "n't", 'such', 'an', 'unusual', 'name', '.', 'He', 'was', 'sure', 'there', 'were', 'lots', 'of', 'people', 'called', 'Potter', 'who', 'had', 'a', 'son', 'called', 'Harry', '.', 'Come', 'to', 'think', 'of', 'it', ',', 'he', 'was', "n't", 'even', 'sure',
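
To narrow this down to narration that refers to a female character by name or pronoun, one simple option is to check each passage against a set of words.  The sketch below assumes the `('n', [tokens])` shape shown above; the pronoun and name sets are hypothetical starting points, not a complete list of characters.

```python
# Hypothetical starting lists; extend them to cover the characters you care about.
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
FEMALE_NAMES = {"hermione", "ginny", "mcgonagall", "petunia"}


def mentions_female(tokens):
    """True if any token is a female pronoun or one of the listed names."""
    lowered = {tok.lower() for tok in tokens}
    return bool(lowered & (FEMALE_PRONOUNS | FEMALE_NAMES))


def female_narrative(narrative):
    """Keep only the ('n', tokens) passages that mention a female character."""
    return [p for p in narrative if mentions_female(p[1])]
```

A passage that passes this filter may still be mostly about a male character, so this is a coarse first cut rather than a precise one.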

In [73]:
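from itertools import groupby as gb

def split_sent(t):
    # split each passage into sentences at the '.' token and print them
    for passage in t:
        groups = (list(g) for is_period, g in gb(passage, lambda item: item == '.'))
        for sentence in groups:
            # groups of length 1 are the periods themselves (or one-word fragments)
            if len(sentence) > 1:
                print(sentence)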

def protagonist(n):
    p_list = ['harry', 'potter', 'ron', 'ronald', 'hermione', 'granger']
    protagonist_narrative = {
        'harry': [],
        'ron':  [],
        'hermione':  [],
    }
    for i in n:
        for w in i[1]:
            word = w.lower()
            
            if word in p_list:
                if word == p_list[0] or word == p_list[1]:
                    protagonist_narrative['harry'].append(i[1])
                if word == p_list[2] or word == p_list[3]:
                    protagonist_narrative['ron'].append(i[1])
                if word == p_list[4] or word == p_list[5]:
                    protagonist_narrative['hermione'].append(i[1])
                    
    return protagonist_narrative
        
protagonist_dict = protagonist(isolate_narrative)

In [99]:
print(protagonist_dict['hermione'])



#### Step 5: Distinguish parts of speech, like nouns, verbs, and adjectives
Now that we've categorized text as narration or dialog, we can use NLTK's part-of-speech tagger to categorize words by the part of speech they represent.

The `pos_tag` function of NLTK that we'll use to do this takes in a list of word tokens and returns a list of tuples, each holding a word and a code for how it was classified.  For example, 'NNP' means the word has been tagged by NLTK as a proper noun, and 'VB' means the word has been tagged as a verb in its base form.  So "Harry" in the sentence "Harry is Petunia's nephew." would come back as `('Harry', 'NNP')`.  

You can find a full list [here](https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/), but in the next steps we'll be focused primarily on adjectives, verbs and nouns.
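
Because related tags share a prefix ('JJ', 'JJR', 'JJS' for adjectives; 'VB', 'VBD', 'VBG' and so on for verbs), pulling out one part of speech can be done by checking how each tag starts.  Here is a small sketch with a hand-written sample in the `(word, tag)` format; `words_with_tag` is my own hypothetical helper, not an NLTK function.

```python
def words_with_tag(tagged_tokens, prefix):
    """Return the words whose tag starts with `prefix` (e.g. 'JJ', 'VB', 'NN')."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# hand-written sample in the (word, tag) format that pos_tag returns
sample = [("Hermione", "NNP"), ("had", "VBD"), ("sunk", "VBN"),
          ("to", "TO"), ("the", "DT"), ("floor", "NN"),
          ("in", "IN"), ("fright", "NN")]

print(words_with_tag(sample, "VB"))  # ['had', 'sunk']
print(words_with_tag(sample, "NN"))  # ['Hermione', 'floor', 'fright']
```

Note that the `'NN'` prefix also picks up proper nouns ('NNP'), so names would need to be filtered out separately if you only want common nouns.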

We will import the `pos_tag` function from NLTK and tag the words in a new function called `tag_text`.

In [180]:
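import string
from nltk import pos_tag, sent_tokenize
from itertools import groupby as gb
punct = string.punctuation

def split_per(t):
    # split each passage at the '.' token and print every piece longer than one token
    for passage in t:
        groups = (list(g) for is_period, g in gb(passage, lambda item: item == '.'))
        for sentence in groups:
            if len(sentence) > 1:
                print(sentence)

# split_per prints as it goes and returns nothing, so call it directly
split_per(protagonist_dict['hermione'])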

['said', 'Hermione']
['said', 'Hermione']
['And', 'he', 'was', 'off', ',', 'explaining', 'all', 'about', 'the', 'four', 'balls', 'and', 'the', 'positions', 'of', 'the', 'seven', 'players', ',', 'describing', 'famous', 'games', 'he', "'d", 'been', 'to', 'with', 'his', 'brothers', 'and', 'the', 'broomstick', 'he', "'d", 'like', 'to', 'get', 'if', 'he', 'had', 'the', 'money']
['He', 'was', 'just', 'taking', 'Harry', 'through', 'the', 'finer', 'points', 'of', 'the', 'game', 'when', 'the', 'compartment', 'door', 'slid', 'open', 'yet', 'again', ',', 'but', 'it', 'was', "n't", 'Neville', 'the', 'toadless', 'boy', ',', 'or', 'Hermione', 'Granger', 'this', 'time']
['Three', 'boys', 'entered', ',', 'and', 'Harry', 'recognized', 'the', 'middle', 'one', 'at', 'once', ':', 'it', 'was', 'the', 'pale', 'boy', 'from', 'Madam', 'Malkin', "'s", 'robe', 'shop']
['He', 'was', 'looking', 'at', 'Harry', 'with', 'a', 'lot', 'more', 'interest', 'than', 'he', "'d", 'shown', 'back', 'in', 'Diagon', 'Alley']
['A

['It', 'was', 'really', 'lucky', 'that', 'Harry', 'now', 'had', 'Hermlone', 'as', 'a', 'friend']
['He', 'did', "n't", 'know', 'how', 'he', "'d", 'have', 'gotten', 'through', 'all', 'his', 'homework', 'without', 'her', ',', 'what', 'with', 'all', 'the', 'last-minute', 'Quidditch', 'practice', 'Wood', 'was', 'making', 'them', 'do']
['She', 'had', 'also', 'tent', 'him', 'Quidditch', 'Through', 'the', 'Ages', ',', 'which', 'turned', 'out', 'to', 'be', 'a', 'very', 'interesting', 'read']
['Harry', 'learned', 'that', 'there', 'were', 'seven', 'hundred', 'ways', 'of', 'committing', 'a', 'Quidditch', 'foul', 'and', 'that', 'all', 'of', 'them', 'had', 'happened', 'during', 'a', 'World', 'Cup', 'match', 'in', '1473', ';', 'that', 'Seekers', 'were', 'usually', 'the', 'smallest', 'and', 'fastest', 'players', ',', 'and', 'that', 'most', 'serious', 'Quidditch', 'accidents', 'seemed', 'to', 'happen', 'to', 'them', ';', 'that', 'although', 'people', 'rarely', 'died', 'playing', 'Quidditch', ',', 'refe
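
The `groupby` approach prints each sentence and drops the period that ended it.  If you'd rather get the sentences back as a list and keep the end punctuation, a plain loop is one alternative.  This is a sketch under my own assumptions: `split_sentences` is a hypothetical name, and treating `!` and `?` as sentence endings is a choice I've added.

```python
SENTENCE_END = {".", "!", "?"}


def split_sentences(tokens):
    """Split a flat token list into sentences, keeping the end punctuation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing words with no closing punctuation
        sentences.append(current)
    return sentences
```

Returning a list rather than printing makes it easier to pass the sentences on to the tagging step.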

In [159]:
print(tt)



In [101]:
# we'll pass the function the narrative passages that mention Hermione
# and then print the same text as above with the tagging

tagged = tag_text(protagonist_dict['hermione'])
print(tagged)
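
The definition of `tag_text` isn't included in the cells above.  As a stand-in, here is a minimal sketch of what it could look like, assuming each passage is a list of tokens and that we want one list of `(word, tag)` tuples back per passage.  The `tagger` parameter is something I've added so the function can be tried out without NLTK's tagger model installed.

```python
def tag_text(passages, tagger=None):
    """Tag each passage (a list of tokens) with parts of speech.

    `tagger` is any function that maps a list of tokens to a list of
    (word, tag) tuples. By default NLTK's pos_tag is used, which needs
    its tagger model to have been downloaded with nltk.download().
    """
    if tagger is None:
        # imported lazily so a custom tagger works without NLTK
        from nltk import pos_tag
        tagger = pos_tag
    return [tagger(tokens) for tokens in passages]
```

Called as `tag_text(protagonist_dict['hermione'])`, this would return a list with one tagged passage per narrative passage.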



In [148]:

for i in tt:
    print(i)
    for t in range(len(i)):
        if i[t] == ('Hermione', 'NNP'):
            print(i)
#             if i[t-1][1] == 'VBN':
#                 print (i[t-1])
#             if i[t+1][1] == 'VBD':
#                 print (i[t+1])
#         elif i[t][1] == 'JJ':
#             print (i)


('said', 'VBD')
('Hermione', 'NNP')
('.said', 'NNP')
('Hermione', 'NNP')
('.And', 'VBD')
('he', 'PRP')
('was', 'VBD')
('off', 'RB')
(',', ',')
('explaining', 'VBG')
('all', 'DT')
('about', 'IN')
('the', 'DT')
('four', 'CD')
('balls', 'NNS')
('and', 'CC')
('the', 'DT')
('positions', 'NNS')
('of', 'IN')
('the', 'DT')
('seven', 'CD')
('players', 'NNS')
(',', ',')
('describing', 'VBG')
('famous', 'JJ')
('games', 'NNS')
('he', 'PRP')
("'d", 'MD')
('been', 'VBN')
('to', 'TO')
('with', 'IN')
('his', 'PRP$')
('brothers', 'NNS')
('and', 'CC')
('the', 'DT')
('broomstick', 'NN')
('he', 'PRP')
("'d", 'MD')
('like', 'VB')
('to', 'TO')
('get', 'VB')
('if', 'IN')
('he', 'PRP')
('had', 'VBD')
('the', 'DT')
('money', 'NN')
('.', '.')
('He', 'PRP')
('was', 'VBD')
('just', 'RB')
('taking', 'VBG')
('Harry', 'NNP')
('through', 'IN')
('the', 'DT')
('finer', 'NN')
('points', 'NNS')
('of', 'IN')
('the', 'DT')
('game', 'NN')
('when', 'WRB')
('the', 'DT')
('compartment', 'NN')
('door', 'NN')
('slid', 'VBD')
('o

('to', 'TO')
('end', 'VB')
('with', 'IN')
('him', 'PRP')
('narrowly', 'RB')
('escaping', 'VBG')
('Muggles', 'NNP')
('in', 'IN')
('helicopters', 'NNS')
('.', '.')
('He', 'PRP')
('was', 'VBD')
("n't", 'RB')
('the', 'DT')
('only', 'JJ')
('one', 'NN')
(',', ',')
('though', 'IN')
(':', ':')
('the', 'DT')
('way', 'NN')
('Seamus', 'NNP')
('Finnigan', 'NNP')
('told', 'VBD')
('it', 'PRP')
(',', ',')
('he', 'PRP')
("'d", 'MD')
('spent', 'VB')
('most', 'JJS')
('of', 'IN')
('his', 'PRP$')
('childhood', 'NN')
('zooming', 'VBG')
('around', 'IN')
('the', 'DT')
('countryside', 'NN')
('on', 'IN')
('his', 'PRP$')
('broomstick', 'NN')
('.', '.')
('Even', 'RB')
('Ron', 'NNP')
('would', 'MD')
('tell', 'VB')
('anyone', 'NN')
('who', 'WP')
("'d", 'VBD')
('listen', 'VBN')
('about', 'IN')
('the', 'DT')
('time', 'NN')
('he', 'PRP')
("'d", 'MD')
('almost', 'RB')
('hit', 'VB')
('a', 'DT')
('hang', 'NN')
('glider', 'NN')
('on', 'IN')
('Charlie', 'NNP')
("'s", 'POS')
('old', 'JJ')
('broom', 'NN')
('.', '.')
('Every

('It', 'PRP')
('contains', 'VBZ')
('your', 'PRP$')
('new', 'JJ')
('Nimbus', 'NNP')
('Two', 'CD')
('Thousand', 'NNP')
(',', ',')
('but', 'CC')
('I', 'PRP')
('do', 'VBP')
("n't", 'RB')
('want', 'VB')
('everybody', 'NN')
('knowing', 'VBG')
('you', 'PRP')
("'ve", 'VBP')
('got', 'VBN')
('a', 'DT')
('broomstick', 'NN')
('or', 'CC')
('they', 'PRP')
("'ll", 'MD')
('all', 'DT')
('want', 'VB')
('one', 'CD')
('.', '.')
('Oliver', 'CC')
('Wood', 'NNP')
('will', 'MD')
('meet', 'VB')
('you', 'PRP')
('tonight', 'VBN')
('on', 'IN')
('the', 'DT')
('Quidditch', 'NNP')
('field', 'NN')
('at', 'IN')
('seven', 'CD')
("o'clock", 'NNS')
('for', 'IN')
('your', 'PRP$')
('first', 'JJ')
('training', 'NN')
('session', 'NN')
('.', '.')
('Professor', 'NNP')
('McGonagall', 'NNP')
('Harry', 'NNP')
('had', 'VBD')
('difficulty', 'NN')
('hiding', 'VBG')
('his', 'PRP$')
('glee', 'NN')
('as', 'IN')
('he', 'PRP')
('handed', 'VBD')
('the', 'DT')
('note', 'NN')
('to', 'TO')
('Ron', 'NNP')
('to', 'TO')
('read', 'VB')
('.came',

('rip', 'VB')
('him', 'PRP')
('off', 'RP')
('or', 'CC')
('catch', 'VB')
('him', 'PRP')
('a', 'DT')
('terrible', 'JJ')
('blow', 'NN')
('with', 'IN')
('the', 'DT')
('club', 'NN')
('.', '.')
('Hermione', 'NNP')
('had', 'VBD')
('sunk', 'VBN')
('to', 'TO')
('the', 'DT')
('floor', 'NN')
('in', 'IN')
('fright', 'NN')
(';', ':')
('Ron', 'NNP')
('pulled', 'VBD')
('out', 'RP')
('his', 'PRP$')
('own', 'JJ')
('wand', 'NN')
('--', ':')
('not', 'RB')
('knowing', 'VBG')
('what', 'WP')
('he', 'PRP')
('was', 'VBD')
('going', 'VBG')
('to', 'TO')
('do', 'VB')
('he', 'PRP')
('heard', 'VB')
('himself', 'PRP')
('cry', 'VB')
('the', 'DT')
('first', 'JJ')
('spell', 'NN')
('that', 'WDT')
('came', 'VBD')
('into', 'IN')
('his', 'PRP$')
('head', 'NN')
(':The', 'NN')
('club', 'NN')
('flew', 'VBD')
('suddenly', 'RB')
('out', 'IN')
('of', 'IN')
('the', 'DT')
('troll', 'NN')
("'s", 'POS')
('hand', 'NN')
(',', ',')
('rose', 'VBD')
('high', 'JJ')
(',', ',')
('high', 'JJ')
('up', 'RB')
('into', 'IN')
('the', 'DT')
('air

('a', 'DT')
('Quidditch', 'NNP')
('foul', 'NN')
('and', 'CC')
('that', 'IN')
('all', 'DT')
('of', 'IN')
('them', 'PRP')
('had', 'VBD')
('happened', 'VBN')
('during', 'IN')
('a', 'DT')
('World', 'NNP')
('Cup', 'NNP')
('match', 'NN')
('in', 'IN')
('1473', 'CD')
(';', ':')
('that', 'IN')
('Seekers', 'NNPS')
('were', 'VBD')
('usually', 'RB')
('the', 'DT')
('smallest', 'JJS')
('and', 'CC')
('fastest', 'JJS')
('players', 'NNS')
(',', ',')
('and', 'CC')
('that', 'IN')
('most', 'JJS')
('serious', 'JJ')
('Quidditch', 'NN')
('accidents', 'NNS')
('seemed', 'VBD')
('to', 'TO')
('happen', 'VB')
('to', 'TO')
('them', 'PRP')
(';', ':')
('that', 'IN')
('although', 'IN')
('people', 'NNS')
('rarely', 'RB')
('died', 'VBD')
('playing', 'VBG')
('Quidditch', 'NNP')
(',', ',')
('referees', 'NNS')
('had', 'VBD')
('been', 'VBN')
('known', 'VBN')
('to', 'TO')
('vanish', 'VB')
('and', 'CC')
('turn', 'VB')
('up', 'RP')
('months', 'NNS')
('later', 'RB')
('in', 'IN')
('the', 'DT')
('Sahara', 'NNP')
('Desert', 'NNP'

('rows', 'NNS')
('.', '.')
('Hermione', 'NNP')
('took', 'VBD')
('out', 'RP')
('a', 'DT')
('list', 'NN')
('of', 'IN')
('subjects', 'NNS')
('and', 'CC')
('titles', 'NNS')
('she', 'PRP')
('had', 'VBD')
('decided', 'VBN')
('to', 'TO')
('search', 'VB')
('while', 'IN')
('Ron', 'NNP')
('strode', 'VBD')
('off', 'RP')
('down', 'RP')
('a', 'DT')
('row', 'NN')
('of', 'IN')
('books', 'NNS')
('and', 'CC')
('started', 'VBD')
('pulling', 'VBG')
('them', 'PRP')
('off', 'RP')
('the', 'DT')
('shelves', 'NNS')
('at', 'IN')
('random', 'NN')
('.', '.')
('Harry', 'NNP')
('wandered', 'VBD')
('over', 'IN')
('to', 'TO')
('the', 'DT')
('Restricted', 'NNP')
('Section', 'NN')
('.', '.')
('He', 'PRP')
('had', 'VBD')
('been', 'VBN')
('wondering', 'VBG')
('for', 'IN')
('a', 'DT')
('while', 'NN')
('if', 'IN')
('Flamel', 'NNP')
('was', 'VBD')
("n't", 'RB')
('somewhere', 'RB')
('in', 'IN')
('there', 'RB')
('.', '.')
('Unfortunately', 'RB')
(',', ',')
('you', 'PRP')
('needed', 'VBD')
('a', 'DT')
('specially', 'RB')
('si

('Harry', 'NNP')
('.said', 'NNP')
('Hermione', 'NNP')
('.', '.')
('As', 'IN')
('the', 'DT')
('match', 'NN')
('drew', 'VBD')
('nearer', 'RB')
(',', ',')
('however', 'RB')
(',', ',')
('Harry', 'NNP')
('became', 'VBD')
('more', 'RBR')
('and', 'CC')
('more', 'RBR')
('nervous', 'JJ')
(',', ',')
('whatever', 'WDT')
('he', 'PRP')
('told', 'VBD')
('Ron', 'NNP')
('and', 'CC')
('Hermione', 'NNP')
('.', '.')
('The', 'DT')
('rest', 'NN')
('of', 'IN')
('the', 'DT')
('team', 'NN')
('was', 'VBD')
("n't", 'RB')
('too', 'RB')
('calm', 'JJ')
(',', ',')
('either', 'RB')
('.', '.')
('The', 'DT')
('idea', 'NN')
('of', 'IN')
('overtaking', 'VBG')
('Slytherin', 'NNP')
('in', 'IN')
('the', 'DT')
('house', 'NN')
('championship', 'NN')
('was', 'VBD')
('wonderful', 'JJ')
(',', ',')
('no', 'DT')
('one', 'NN')
('had', 'VBD')
('done', 'VBN')
('it', 'PRP')
('for', 'IN')
('seven', 'CD')
('years', 'NNS')
(',', ',')
('but', 'CC')
('would', 'MD')
('they', 'PRP')
('be', 'VB')
('allowed', 'VBN')
('to', 'TO')
(',', ',')
('

('were', 'VBD')
("n't", 'RB')
('nearly', 'RB')
('as', 'RB')
('much', 'JJ')
('fun', 'NN')
('as', 'IN')
('the', 'DT')
('Christmas', 'NNP')
('ones', 'NNS')
('.', '.')
('It', 'PRP')
('was', 'VBD')
('hard', 'JJ')
('to', 'TO')
('relax', 'VB')
('with', 'IN')
('Hermione', 'NNP')
('next', 'JJ')
('to', 'TO')
('you', 'PRP')
('reciting', 'VBG')
('the', 'DT')
('twelve', 'NN')
('uses', 'VBZ')
('of', 'IN')
('dragon', 'NN')
("'s", 'POS')
('blood', 'NN')
('or', 'CC')
('practicing', 'VBG')
('wand', 'NN')
('movements', 'NNS')
('.', '.')
('Moaning', 'VBG')
('and', 'CC')
('yawning', 'NN')
(',', ',')
('Harry', 'NNP')
('and', 'CC')
('Ron', 'NNP')
('spent', 'VBD')
('most', 'JJS')
('of', 'IN')
('their', 'PRP$')
('free', 'JJ')
('time', 'NN')
('in', 'IN')
('the', 'DT')
('library', 'NN')
('with', 'IN')
('her', 'PRP')
(',', ',')
('trying', 'VBG')
('to', 'TO')
('get', 'VB')
('through', 'IN')
('all', 'DT')
('their', 'PRP$')
('extra', 'JJ')
('work', 'NN')
('.said', 'NNP')
('Hermione', 'NNP')
('thoughtfully', 'RB')
('

('were', 'VBD')
('a', 'DT')
('bit', 'NN')
('late', 'JJ')
('arriving', 'NN')
('at', 'IN')
('Hagrid', 'NNP')
("'s", 'POS')
('hut', 'NN')
('because', 'IN')
('they', 'PRP')
("'d", 'MD')
('had', 'VBN')
('to', 'TO')
('wait', 'VB')
('for', 'IN')
('Peeves', 'NNS')
('to', 'TO')
('get', 'VB')
('out', 'IN')
('of', 'IN')
('their', 'PRP$')
('way', 'NN')
('in', 'IN')
('the', 'DT')
('entrance', 'NN')
('hall', 'NN')
(',', ',')
('where', 'WRB')
('he', 'PRP')
("'d", 'MD')
('been', 'VBN')
('playing', 'VBG')
('tennis', 'NN')
('against', 'IN')
('the', 'DT')
('wall', 'NN')
('.', '.')
('Hagrid', 'NNP')
('had', 'VBD')
('Norbert', 'NNP')
('packed', 'VBN')
('and', 'CC')
('ready', 'JJ')
('in', 'IN')
('a', 'DT')
('large', 'JJ')
('crate', 'NN')
('.Hagrid', 'JJ')
('sobbed', 'NN')
(',', ',')
('as', 'IN')
('Harry', 'NNP')
('and', 'CC')
('Hermione', 'NNP')
('covered', 'VBD')
('the', 'DT')
('crate', 'NN')
('with', 'IN')
('the', 'DT')
('invisibility', 'NN')
('cloak', 'NN')
('and', 'CC')
('stepped', 'VBD')
('underneath',

('rebellions', 'NNS')
('...', ':')
('.', '.')
('Then', 'RB')
(',', ',')
('about', 'IN')
('a', 'DT')
('week', 'NN')
('before', 'IN')
('the', 'DT')
('exams', 'NNS')
('were', 'VBD')
('due', 'JJ')
('to', 'TO')
('start', 'VB')
(',', ',')
('Harry', 'NNP')
("'s", 'POS')
('new', 'JJ')
('resolution', 'NN')
('not', 'RB')
('to', 'TO')
('interfere', 'VB')
('in', 'IN')
('anything', 'NN')
('that', 'WDT')
('did', 'VBD')
("n't", 'RB')
('concern', 'NN')
('him', 'PRP')
('was', 'VBD')
('put', 'VBN')
('to', 'TO')
('an', 'DT')
('unexpected', 'JJ')
('test', 'NN')
('.', '.')
('Walking', 'VBG')
('back', 'RB')
('from', 'IN')
('the', 'DT')
('library', 'NN')
('on', 'IN')
('his', 'PRP$')
('own', 'JJ')
('one', 'CD')
('afternoon', 'NN')
(',', ',')
('he', 'PRP')
('heard', 'VBD')
('somebody', 'NN')
('whimpering', 'VBG')
('from', 'IN')
('a', 'DT')
('classroom', 'NN')
('up', 'RP')
('ahead', 'RB')
('.', '.')
('As', 'IN')
('he', 'PRP')
('drew', 'VBD')
('closer', 'RB')
(',', ',')
('he', 'PRP')
('heard', 'VBD')
('Quirrell'

('followed', 'VBD')
('him', 'PRP')
('out', 'IN')
('of', 'IN')
('the', 'DT')
('clearing', 'NN')
(',', ',')
('staring', 'VBG')
('over', 'RP')
('their', 'PRP$')
('shoulders', 'NNS')
('at', 'IN')
('Ronan', 'NNP')
('and', 'CC')
('Bane', 'NNP')
('until', 'IN')
('the', 'DT')
('trees', 'NNS')
('blocked', 'VBD')
('their', 'PRP$')
('view', 'NN')
('.asked', 'VBD')
('Hermione', 'NNP')
('.They', 'NNP')
('walked', 'VBD')
('on', 'IN')
('through', 'IN')
('the', 'DT')
('dense', 'NN')
(',', ',')
('dark', 'JJ')
('trees', 'NNS')
('.', '.')
('Harry', 'NNP')
('kept', 'VBD')
('looking', 'VBG')
('nervously', 'RB')
('over', 'IN')
('his', 'PRP$')
('shoulder', 'NN')
('.', '.')
('He', 'PRP')
('had', 'VBD')
('the', 'DT')
('nasty', 'JJ')
('feeling', 'NN')
('they', 'PRP')
('were', 'VBD')
('being', 'VBG')
('watched', 'VBN')
('.', '.')
('He', 'PRP')
('was', 'VBD')
('very', 'RB')
('glad', 'JJ')
('they', 'PRP')
('had', 'VBD')
('Hagrid', 'NNP')
('and', 'CC')
('his', 'PRP$')
('crossbow', 'NN')
('with', 'IN')
('them', 'PRP

('glittering', 'VBG')
('--', ':')
('glittering', 'VBG')
('?Ron', 'NN')
('dived', 'VBD')
(',', ',')
('Hermione', 'NNP')
('rocketed', 'VBD')
('upward', 'RB')
(',', ',')
('the', 'DT')
('key', 'NN')
('dodged', 'VBD')
('them', 'PRP')
('both', 'DT')
(',', ',')
('and', 'CC')
('Harry', 'NNP')
('streaked', 'VBD')
('after', 'IN')
('it', 'PRP')
(';', ':')
('it', 'PRP')
('sped', 'VBD')
('toward', 'IN')
('the', 'DT')
('wall', 'NN')
(',', ',')
('Harry', 'NNP')
('leaned', 'VBD')
('forward', 'RB')
('and', 'CC')
('with', 'IN')
('a', 'DT')
('nasty', 'JJ')
(',', ',')
('crunching', 'VBG')
('noise', 'NN')
(',', ',')
('pinned', 'VBD')
('it', 'PRP')
('against', 'IN')
('the', 'DT')
('stone', 'NN')
('with', 'IN')
('one', 'CD')
('hand', 'NN')
('.', '.')
('Ron', 'NNP')
('and', 'CC')
('Hermione', 'NNP')
("'s", 'POS')
('cheers', 'NNS')
('echoed', 'VBP')
('around', 'IN')
('the', 'DT')
('high', 'JJ')
('chamber', 'NN')
('.', '.')
('They', 'PRP')
('landed', 'VBD')
('quickly', 'RB')
(',', ',')
('and', 'CC')
('Harry', '

('out', 'RP')
('to', 'TO')
('all', 'DT')
('students', 'NNS')
(',', ',')
('them', 'PRP')
('not', 'RB')
('to', 'TO')
('use', 'VB')
('magic', 'NN')
('over', 'IN')
('the', 'DT')
('holidays', 'NNS')
('(said', 'VBD')
('Harry', 'NNP')
('.', '.')
('He', 'PRP')
(',', ',')
('Ron', 'NNP')
(',', ',')
('and', 'CC')
('Hermione', 'NNP')
('passed', 'VBD')
('through', 'IN')
('the', 'DT')
('gateway', 'NN')
('together', 'RB')
('.He', 'NNP')
('walked', 'VBD')
('away', 'RB')
('.', '.')
('Harry', 'NNP')
('hung', 'VBD')
('back', 'RP')
('for', 'IN')
('a', 'DT')
('last', 'JJ')
('word', 'NN')
('with', 'IN')
('Ron', 'NNP')
('and', 'CC')
('Hermione', 'NNP')
('.said', 'NNP')
('Hermione', 'NNP')
(',', ',')
('looking', 'VBG')
('uncertainly', 'RB')
('after', 'IN')
('Uncle', 'NNP')
('Vernon', 'NNP')
(',', ',')
('shocked', 'VBD')
('that', 'IN')
('anyone', 'NN')
('could', 'MD')
('be', 'VB')
('so', 'RB')
('unpleasant', 'JJ')
('.', '.')
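
With tagged tokens like the ones printed above, one thing we can do is look at the word that comes right after a character's name, which is often the verb describing what she does.  The sketch below assumes a flat list of `(word, tag)` tuples; `verbs_after` is a hypothetical helper and the sample is hand-written from pairs that appear in the output.

```python
def verbs_after(tagged_tokens, name):
    """Return the verbs that come right after `name` in a list of (word, tag) tuples."""
    verbs = []
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    for (word, tag), (next_word, next_tag) in pairs:
        if word == name and tag == "NNP" and next_tag.startswith("VB"):
            verbs.append(next_word)
    return verbs


sample = [("Hermione", "NNP"), ("took", "VBD"), ("out", "RP"), ("a", "DT"),
          ("list", "NN"), (".", "."), ("Hermione", "NNP"), ("rocketed", "VBD"),
          ("upward", "RB")]
print(verbs_after(sample, "Hermione"))  # ['took', 'rocketed']
```

This only catches a verb that directly follows the name, so "Hermione thoughtfully said" would be missed.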


#### Step 6: Isolating sexist language
This part is tricky - sexism can be subtle or detectable only with context.  To keep this analysis straightforward, I assembled a list of sexist words to search for in the narrative.

Because the Harry Potter series is written in English by a British writer, I focused on sources from the UK and countries in the Commonwealth.  Using this blog post by a [New Zealand blogger](http://sacraparental.com/2016/05/14/everyday-misogyny-122-subtly-sexist-words-women/) I had a first set of words and some excellent categories to begin with.  I found a number of [other](http://time.com/4268325/history-calling-women-shrill/) excellent articles about sexism in language, which I used to add to the `sexist_words` Python dictionary below, grouping them by type.

In [None]:
sexist_words = { 
    'assertiveness': ['bossy', 'abrasive', 'ball-bust', 'aggressive', 'shrill', 'bolshy', 'intense', 'stroppy', 'forward', 'mannish', 'strident', 'know-it-all'],
    'behavior' : ['cackle', 'shriek', 'caterwaul', 'yowl', 'screech','gossiping', 'dramatic', 'catty', 'bitchy', 'nagging', 'coldly', 'icy', 'shrew', 'humorless', 'man-hater', 'banshee', 'fishwife', 'lippy', 'ditzy', 'diva', 'prima donna', 'feisty', 'ladylike', 'bubbly', 'vivacious', 'flirt', 'sass', 'chatty', 'demure', 'modest', 'emotional', 'hysterical', 'hormonal', 'menstrual', 'flaky', 'moody', 'over-sensitive'],
    'sexuality': ['slut', 'trollop', 'frigid', 'easy', 'tease', 'loose', 'man-eater', 'prude', 'curvy', 'cheap', 'frump', 'fad', 'mouse', 'mousy', 'clotheshorse', 'cow', 'hag'],
    'relationship': ['spinster', 'barren', 'housewife', 'houseproud', 'soccer mom', 'mistress', 'kept woman'],
#     'praise': ['care', 'compassion', 'hard-working', 'conscientious', 'dependable', 'diligent', 'dedicated', 'tactful', 'interpersonal', 'warm', 'helpful'],
}

# making this into a list for easier analysis
sexist_words_list = [word for words in sexist_words.values() for word in words]
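
Flattening the dictionary loses the category each word came from.  If we want to summarize results by category later, one option is to also build a lookup from word to category.  This sketch uses a small sample dictionary of my own in the same shape as the one above, so the names `sample_words` and `category_of` are only illustrative.

```python
# a small sample in the same shape as the dictionary above
sample_words = {
    "assertiveness": ["bossy", "shrill"],
    "behavior": ["cackle", "shriek"],
}

# invert it: word -> category
category_of = {
    word: category
    for category, words in sample_words.items()
    for word in words
}

print(category_of["shrill"])  # assertiveness
```

If a word were listed under two categories, the later category would win, so this assumes each word appears only once.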

Now that we know what words we are looking for, we want to make sure we get all versions of the words; if we only compare them to this list, we'd pick up on `cackle` but not `cackled`.   

To do this, we'll use a stemming algorithm that's part of NLTK.  Let's look at some examples of how this works.  What we want to happen is to have all forms of a word have the same stem.

In [None]:
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()

# this works really well for some words
print("Bossiness stem:", st.stem('bossiness'))
print("Bossily stem:", st.stem('bossily'))
print("Bossy stem:", st.stem('bossy'))

print("Cackled stem:", st.stem('cackled'))
print("Cackling stem:", st.stem('cackling'))
print("Cackle stem:", st.stem('cackle'))

# but not for others
print("Shrieking stem:", st.stem('shrieking'))
print("Shrieked stem:", st.stem('shrieked'))
print("Shriek stem:", st.stem('shriek'))


As we can see above, this doesn't *always* work, so for extra insurance, we'll add a rule: if a root word from our list appears anywhere inside a word, it counts. 

So, even where the stemmer doesn't link `shrieking` to `shriek`, checking whether the letters `shriek` appear inside `shrieking` would.
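
The "root word inside the word" rule is easy to try on its own, and doing so shows both what it catches and where it overreaches.  The sketch below is a stand-alone illustration; `matches_root` is a hypothetical helper, and the short `roots` list is just a sample drawn from the word list above.

```python
def matches_root(word, roots):
    """Return the first root that appears inside `word`, or None."""
    lowered = word.lower()
    for root in roots:
        if root in lowered:
            return root
    return None


roots = ["shriek", "cackle", "easy", "cow"]
print(matches_root("shrieking", roots))  # shriek
print(matches_root("cackled", roots))    # cackle
print(matches_root("uneasy", roots))     # easy  (a false positive)
print(matches_root("scowled", roots))    # cow   (a false positive)
```

The rule also misses `cackling`, which doesn't contain the full string `cackle`, so it works best as a backup to stemming rather than a replacement, and the results are worth checking by hand.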

In [None]:
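def find_sexist_words(t):
    # t is a list of ('n', [(word, tag), ...]) tuples
    sexist_words_isolated = []
    for label, tagged_words in t:
        for word, tag in tagged_words:
            # skip proper nouns and particles
            if tag in ('NNP', 'RP'):
                continue
            lowered = word.lower()
            if st.stem(lowered) in sexist_words_list:
                sexist_words_isolated.append(word)
            # fall back to a substring match for forms the stemmer misses
            elif lowered != 'headmistress' and any(x in lowered for x in sexist_words_list):
                sexist_words_isolated.append(word)

    return sexist_words_isolated

# tag each narrative passage with parts of speech, then search it
tagged_narrative = [(label, pos_tag(words)) for label, words in isolate_narrative]
print(find_sexist_words(tagged_narrative))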
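
Once we have a list of flagged words, a simple way to compare books of different lengths is to count how often each word appears and express the total as a rate per 1,000 tokens.  This is a minimal sketch with made-up numbers, only to show the shape of the calculation; `rate_per_thousand` is a hypothetical helper and none of the figures come from the books.

```python
from collections import Counter


def rate_per_thousand(found_words, total_tokens):
    """Flagged words per 1,000 tokens; 0.0 when there are no tokens."""
    if total_tokens == 0:
        return 0.0
    return 1000 * len(found_words) / total_tokens


# made-up numbers, only to show the shape of the calculation
found = ["shrieked", "shrill", "shrieked"]
print(Counter(found).most_common(1))   # [('shrieked', 2)]
print(rate_per_thousand(found, 1500))  # 2.0
```

Running the same calculation on the narration of each of the seven books would give one number per book, which is the kind of series we'd need to test the third hypothesis.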