This notebook explores dependency parsing by identifying the actions and objects that are characteristically associated with male and female characters.

In [1]:
import spacy, math
from collections import Counter
import operator

In [2]:
nlp = spacy.load('en_core_web_sm')

We'll run seven works by Jane Austen, six novels plus the novella *Lady Susan*, through spaCy; this will take a few minutes.

In [3]:
filenames=["../data/fiction/emma.txt", "../data/fiction/lady_susan.txt", "../data/fiction/mansfield_park.txt", "../data/fiction/northanger_abbey.txt", "../data/fiction/persuasion.txt", "../data/fiction/pride.txt", "../data/fiction/sense_and_sensibility.txt"]
all_tokens=[]
for filename in filenames:
    print(filename)
    with open(filename, encoding="utf-8") as f:
        data=f.read()
    tokens=nlp(data)
    all_tokens.extend(tokens)

../data/fiction/emma.txt
../data/fiction/lady_susan.txt
../data/fiction/mansfield_park.txt
../data/fiction/northanger_abbey.txt
../data/fiction/persuasion.txt
../data/fiction/pride.txt
../data/fiction/sense_and_sensibility.txt


In [4]:
print(len(all_tokens))

972838


In [5]:
def test(maleCounter, femaleCounter, display=25):
    """ Takes two Counter objects and prints two ranked lists: the terms most characteristic
    of the first counter relative to the second, and the terms most characteristic of the
    second relative to the first.  The metric is the z-scored log-odds ratio with an
    uninformative Dirichlet prior (Monroe et al. 2008, "Fightin' Words", eqn. 22).
    """
    
    vocab=dict(maleCounter) 
    vocab.update(dict(femaleCounter))
    maleSum=sum(maleCounter.values())
    femaleSum=sum(femaleCounter.values())

    ranks={}
    alpha=0.01
    alphaV=len(vocab)*alpha
        
    for word in vocab:
        
        log_odds_ratio=math.log( (maleCounter[word] + alpha) / (maleSum+alphaV-maleCounter[word]-alpha) ) - math.log( (femaleCounter[word] + alpha) / (femaleSum+alphaV-femaleCounter[word]-alpha) )
        variance=1./(maleCounter[word] + alpha) + 1./(femaleCounter[word] + alpha)
        
        ranks[word]=log_odds_ratio/math.sqrt(variance)

    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)
    
    print("Most male:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))
    
    print("\nMost female:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))
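
To make the metric easier to inspect, here is a minimal sketch of the same computation that returns the scores instead of printing them. The function name `log_odds_z` and the toy counts are assumptions made for illustration; they are not part of the notebook's data.

```python
import math
from collections import Counter

def log_odds_z(first, second, alpha=0.01):
    """Return {term: z-score}; positive scores lean toward `first`."""
    vocab = set(first) | set(second)
    first_total = sum(first.values())
    second_total = sum(second.values())
    alpha_v = alpha * len(vocab)  # total prior mass across the vocabulary

    scores = {}
    for term in vocab:
        a = first[term] + alpha   # smoothed count in the first corpus
        b = second[term] + alpha  # smoothed count in the second corpus
        log_odds = (math.log(a / (first_total + alpha_v - a))
                    - math.log(b / (second_total + alpha_v - b)))
        variance = 1.0 / a + 1.0 / b
        scores[term] = log_odds / math.sqrt(variance)
    return scores

# Made-up counts, for illustration only.
toy_male = Counter({"said": 10, "rode": 5, "felt": 1})
toy_female = Counter({"said": 10, "rode": 1, "felt": 5})

scores = log_odds_z(toy_male, toy_female)
for term, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("%.3f\t%s" % (z, term))
```

Because the toy counts mirror each other, "said" scores zero while "rode" and "felt" receive scores of equal size and opposite sign, which is a quick sanity check on the formula.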

spaCy uses the [ClearNLP dependency labels](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md), which are very close to the Stanford typed dependencies.  See the [Stanford dependencies manual](http://people.ischool.berkeley.edu/~dbamman/DependencyManual.pdf) for more information about each label.  Parse information is stored on the spaCy token object; the cell below shows which attributes encode the token text (`text`), the character offset of the token within the document (`idx`), the fine-grained part-of-speech tag (`tag_`), and the dependency relation (`dep_`).  The syntactic head of a token is another token, available as `token.head`, which exposes all of the same attributes.

In [6]:
testDoc=nlp("He started his car.")
for token in testDoc:
    print("%s\t%s\t%s\t%s\t%s\t%s\t%s" % (token.text, token.idx, token.tag_, token.dep_, token.head.text, token.head.idx, token.head.tag_))


He	0	PRP	nsubj	started	3	VBD
started	3	VBD	ROOT	started	3	VBD
his	11	PRP$	poss	car	15	NN
car	15	NN	dobj	started	3	VBD
.	18	.	punct	started	3	VBD
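
The second column in the output above is a character offset into the text, not a token index (in spaCy the token index is `token.i`). The sketch below reproduces those offsets with a regular expression. The regex is a stand-in tokenizer assumed for illustration, not spaCy's tokenizer, and it only happens to agree with spaCy on this simple sentence.

```python
import re

text = "He started his car."

# Stand-in tokenizer (an assumption): runs of word characters, or single
# punctuation marks. match.start() is the character offset of each token.
offsets = [
    (position, match.start(), match.group())
    for position, match in enumerate(re.finditer(r"\w+|[^\w\s]", text))
]

for position, char_offset, token_text in offsets:
    print("%s\t%s\t%s" % (position, char_offset, token_text))
```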


Q1: Find the verbs that men are more characteristically the *subject* of than women.  Feel free to consider only subjects that are the pronouns "he" and "she".  This function should return two Counter objects (`maleCounter` and `femaleCounter`) that count the number of times a given verb has "he" (`maleCounter`) or "she" (`femaleCounter`) as its syntactic subject.

In [34]:
def count_subjects():
    maleCounter=Counter()
    femaleCounter=Counter()
    verb_list = ['VB', 'VBG', 'VBD', 'VBN', 'VBZ', 'VBP']
    for token in all_tokens:
        if token.head.tag_ in verb_list and token.dep_ == 'nsubj':
            if token.text.lower() == 'he':
                maleCounter[token.head.text.lower()] += 1
            elif token.text.lower() == 'she':
                femaleCounter[token.head.text.lower()] += 1
    return maleCounter, femaleCounter

In [35]:
male, female=count_subjects()
test(male, female, display=10)

Most male:
6.774	is
5.730	replied
5.098	come
4.767	told
4.553	came
4.371	said
4.106	seemed
3.658	left
3.561	comes
3.451	done

Most female:
-7.816	felt
-5.423	saw
-4.342	knew
-4.253	heard
-3.742	found
-3.266	was
-3.075	had
-3.003	thought
-2.878	cried
-2.738	believed


Q2: Find the verbs that men are more characteristically the *object* of than women.  Feel free to consider only objects that are the pronouns "him" and "her".  This function should return two Counter objects (`maleCounter` and `femaleCounter`) that count the number of times a given verb has "him" (`maleCounter`) or "her" (`femaleCounter`) as its syntactic direct object.

In [36]:
def count_objects():
    maleCounter=Counter()
    femaleCounter=Counter()
    verb_list = ['VB', 'VBG', 'VBD', 'VBN', 'VBZ', 'VBP']
    for token in all_tokens:
        if token.head.tag_ in verb_list and token.dep_ == 'dobj':
            if token.text.lower() == 'him':
                maleCounter[token.head.text.lower()] += 1
            elif token.text.lower() == 'her':
                femaleCounter[token.head.text.lower()] += 1
    return maleCounter, femaleCounter

In [37]:
male, female=count_objects()
test(male, female, display=10)

Most male:
3.890	seen
3.546	like
3.532	seeing
3.092	meet
2.740	see
2.462	liked
2.408	know
2.313	call
2.249	sent
1.994	get

Most female:
-3.818	left
-2.591	struck
-2.402	convinced
-2.340	attended
-2.070	made
-1.958	prevented
-1.826	obliged
-1.803	joined
-1.803	amuse
-1.803	fetch


Q3: Find the objects that are *possessed* more frequently by men than women.  Feel free to consider only possessors that are the pronouns "his" and "her".  This function should return two Counter objects (`maleCounter` and `femaleCounter`) that count the number of times a given term is possessed by "his" (`maleCounter`) or "her" (`femaleCounter`).

In [38]:
testDoc2=nlp("He left his keys at home.")
for token in testDoc2:
    print("%s\t%s\t%s\t%s\t%s\t%s\t%s" % (token.text, token.idx, token.tag_, token.dep_, token.head.text, token.head.idx, token.head.tag_))


He	0	PRP	nsubj	left	3	VBD
left	3	VBD	ROOT	left	3	VBD
his	8	PRP$	poss	keys	12	NNS
keys	12	NNS	dobj	left	3	VBD
at	17	IN	prep	left	3	VBD
home	20	NN	pobj	at	17	IN
.	24	.	punct	left	3	VBD


In [42]:
def count_possessions():
    maleCounter=Counter()
    femaleCounter=Counter()
    for token in all_tokens:
        if token.dep_ == 'poss' and token.head.tag_ in ['NN', 'NNS']:
            if token.text.lower() == 'his':
                maleCounter[token.head.text.lower()] += 1
            elif token.text.lower() == 'her':
                femaleCounter[token.head.text.lower()] += 1
    return maleCounter, femaleCounter
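
The subject, object, and possession counters differ only in the dependency label, the pronoun forms, and the allowed head tags, so they could be folded into one helper. The sketch below is one way to do that. `count_by_relation` and the `tok` stand-in are hypothetical names introduced here, and the toy tokens are hand-built with labels copied from the parse output above so the example runs without loading a model.

```python
from collections import Counter
from types import SimpleNamespace

def count_by_relation(tokens, dep, male_form, female_form, head_tags):
    """Count head words of tokens with relation `dep`, split by pronoun form."""
    male_counter, female_counter = Counter(), Counter()
    for token in tokens:
        if token.dep_ != dep or token.head.tag_ not in head_tags:
            continue
        form = token.text.lower()
        if form == male_form:
            male_counter[token.head.text.lower()] += 1
        elif form == female_form:
            female_counter[token.head.text.lower()] += 1
    return male_counter, female_counter

def tok(text, tag, dep, head=None):
    """Hypothetical stand-in exposing only the token attributes used above."""
    t = SimpleNamespace(text=text, tag_=tag, dep_=dep)
    t.head = head if head is not None else t  # a root is its own head
    return t

VERB_TAGS = {"VB", "VBG", "VBD", "VBN", "VBZ", "VBP"}

# Hand-built parse of "He left his keys."
left = tok("left", "VBD", "ROOT")
keys = tok("keys", "NNS", "dobj", left)
toy_tokens = [tok("He", "PRP", "nsubj", left), left,
              tok("his", "PRP$", "poss", keys), keys]

subjects = count_by_relation(toy_tokens, "nsubj", "he", "she", VERB_TAGS)
possessions = count_by_relation(toy_tokens, "poss", "his", "her", {"NN", "NNS"})
print(subjects)
print(possessions)
```

Since the helper reads only `text`, `tag_`, `dep_`, and `head`, it should also accept the spaCy tokens in `all_tokens`, though that has not been run here.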

In [43]:
male, female=count_possessions()
test(male, female, display=10)

Most male:
4.689	sisters
4.453	attentions
4.277	house
4.235	name
4.223	return
3.815	son
3.713	attachment
3.510	horse
3.427	character
3.427	behaviour

Most female:
-7.209	mother
-6.370	sister
-4.645	eyes
-4.423	aunt
-3.987	uncle
-3.642	spirits
-3.542	heart
-3.369	room
-3.217	mind
-3.096	feelings


Q4: Find the actions that men do *to women* more frequently than women do *to men*.  Feel free to consider only subjects and objects that are the pronouns "she"/"he"/"her"/"him".  This function should return two Counter objects (`maleCounter` and `femaleCounter`) that count the number of times a given verb has "he" as the subject and "her" as the object (`maleCounter`) or "she" as the subject and "him" as the object (`femaleCounter`).

In [49]:
testDoc2=nlp("She sent them home.")
for token in testDoc2:
    print("%s\t%s\t%s\t%s\t%s\t%s\t%s" % (token.text, token.idx, token.tag_, token.dep_, token.head.text, token.head.idx, token.head.tag_))


She	0	PRP	nsubj	sent	4	VBD
sent	4	VBD	ROOT	sent	4	VBD
them	9	PRP	dobj	sent	4	VBD
home	14	RB	advmod	sent	4	VBD
.	18	.	punct	sent	4	VBD


In [52]:
def count_SVO_tuples():
    maleCounter=Counter()
    femaleCounter=Counter()
    verb_list = ['VB', 'VBG', 'VBD', 'VBN', 'VBZ', 'VBP']
    # subject pronoun -> (object pronoun to look for, counter to update)
    patterns = {'he': ('her', maleCounter), 'she': ('him', femaleCounter)}
    for i, token in enumerate(all_tokens):
        subject = token.text.lower()
        if subject not in patterns or token.dep_ != 'nsubj':
            continue
        target, counter = patterns[subject]
        # Heuristic linear scan: starting at the subject, walk forward for as long as
        # each token's head is a verb, counting every matching direct object on the way.
        # The scan is not bounded by punctuation, so it can cross into a later clause.
        j = i
        while j + 1 < len(all_tokens) and all_tokens[j].head.tag_ in verb_list:
            j += 1
            if all_tokens[j].text.lower() == target and all_tokens[j].dep_ == 'dobj':
                counter[all_tokens[j].head.text.lower()] += 1
    return maleCounter, femaleCounter

In [53]:
male, female=count_SVO_tuples()
test(male, female, display=10)

Most male:
1.960	loved
1.822	told
1.605	left
1.368	joined
1.070	hear
1.070	asked
0.846	assured
0.674	sought
0.613	handed
0.613	assure

Most female:
-2.403	seen
-1.998	have
-1.458	saw
-0.846	wished
-0.846	found
-0.659	liked
-0.611	accept
-0.611	accepted
-0.611	like
-0.611	refused
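
`count_SVO_tuples` pairs a subject with any later "her"/"him" object reached by its forward scan, so the two pronouns need not belong to the same verb. An alternative is to read both arguments off each verb's own children in the parse tree. The sketch below illustrates that idea on a hand-built parse. The `Tok` class and `count_svo_by_verb` are hypothetical names, the analysis of the toy sentence is assumed rather than produced by the parser, and the scores reported above were not computed this way.

```python
from collections import Counter

VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

class Tok:
    """Hypothetical stand-in for a token: text, tag_, dep_, head, children."""
    def __init__(self, text, tag, dep, head=None):
        self.text, self.tag_, self.dep_ = text, tag, dep
        self.children = []
        self.head = head if head is not None else self  # root heads itself
        if head is not None:
            head.children.append(self)

def count_svo_by_verb(tokens):
    """Count verbs whose own subject and object are he+her or she+him."""
    male_counter, female_counter = Counter(), Counter()
    for verb in tokens:
        if verb.tag_ not in VERB_TAGS:
            continue
        subjects = {c.text.lower() for c in verb.children if c.dep_ == "nsubj"}
        objects = {c.text.lower() for c in verb.children if c.dep_ == "dobj"}
        if "he" in subjects and "her" in objects:
            male_counter[verb.text.lower()] += 1
        if "she" in subjects and "him" in objects:
            female_counter[verb.text.lower()] += 1
    return male_counter, female_counter

# Hand-built parse of "He said she saw him." (assumed analysis)
said = Tok("said", "VBD", "ROOT")
he = Tok("He", "PRP", "nsubj", said)
saw = Tok("saw", "VBD", "ccomp", said)
she = Tok("she", "PRP", "nsubj", saw)
him = Tok("him", "PRP", "dobj", saw)
toy_tokens = [he, said, she, saw, him]

male_svo, female_svo = count_svo_by_verb(toy_tokens)
print(male_svo, female_svo)
```

Here only "saw" is counted, and only for the she/him pattern, because "He" is the subject of "said", which has no direct object. spaCy tokens also expose `children`, so the same function should apply to `all_tokens`, with the caveat that its counts would differ from those shown above.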


Comment: across the four questions we see differences that suggest gendering; for example, "house" is characteristically possessed by men and "room" by women.