This notebook explores dependency parsing by identifying the actions and objects that are characteristically associated with male and female characters.

In [1]:
import spacy, math
from collections import Counter
import operator

In [2]:
nlp = spacy.load('en')

"""
workaround if you are getting an error loading the sapcy 'en' module:
"""
# nlp = spacy.load('en_core_web_sm')

We'll run seven novels by Jane Austen through spacy (this will take a few minutes).

In [3]:
filenames=["../data/fiction/emma.txt", "../data/fiction/lady_susan.txt", "../data/fiction/mansfield_park.txt", "../data/fiction/northanger_abbey.txt", "../data/fiction/persuasion.txt", "../data/fiction/pride.txt", "../data/fiction/sense_and_sensibility.txt"]
all_tokens=[]
for filename in filenames:
    print(filename)
    data=open(filename, encoding="utf-8").read()
    tokens=nlp(data)
    all_tokens.extend(tokens)

../data/fiction/emma.txt
../data/fiction/lady_susan.txt
../data/fiction/mansfield_park.txt
../data/fiction/northanger_abbey.txt
../data/fiction/persuasion.txt
../data/fiction/pride.txt
../data/fiction/sense_and_sensibility.txt


In [4]:
print (len(all_tokens))

972534


In [5]:
def test(maleCounter, femaleCounter, display=25):
    
    """ Function that takes two Counter objects as inputs and prints out a ranked list of terms
    more characteristic of the first counter than the second.  Here we'll use log-odds
    with an uninformative prior (from Monroe et al 2008, "Fightin Words", eqn. 22) as our metric.
    
    """
    
    vocab=dict(maleCounter) 
    vocab.update(dict(femaleCounter))
    maleSum=sum(maleCounter.values())
    femaleSum=sum(maleCounter.values())

    ranks={}
    alpha=0.01
    alphaV=len(vocab)*alpha
        
    for word in vocab:
        
        log_odds_ratio=math.log( (maleCounter[word] + alpha) / (maleSum+alphaV-maleCounter[word]-alpha) ) - math.log( (femaleCounter[word] + alpha) / (femaleSum+alphaV-femaleCounter[word]-alpha) )
        variance=1./(maleCounter[word] + alpha) + 1./(femaleCounter[word] + alpha)
        
        ranks[word]=log_odds_ratio/math.sqrt(variance)

    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)
    
    print("Most male:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))
    
    print("\nMost female:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))

Spacy uses the [ClearNLP dependency labels](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md), which are very close to the Stanford typed dependencies.  See the [Stanford dependencies manual](https://nlp.stanford.edu/software/dependencies_manual.pdf) for more information about each tag.  Parse information is contained in the spacy token object; see the following for which attributes encode the token text, idx (position in sentence), part of speech, and dependency relation.  The syntactic head for a token is another token given in `token.head` (where all of those same token attributes are accessible).

In [6]:
testDoc=nlp("He started his car.")
for token in testDoc:
    print("%s\t%s\t%s\t%s\t%s\t%s\t%s" % (token.text, token.idx, token.tag_, token.dep_, token.head.text, token.head.idx, token.head.tag_))


He	0	PRP	nsubj	started	3	VBD
started	3	VBD	ROOT	started	3	VBD
his	11	PRP$	poss	car	15	NN
car	15	NN	dobj	started	3	VBD
.	18	.	punct	started	3	VBD


Q1: Find the verbs that men are more characteristically the *subject* of than women.  Feel free to only consider subjects that are "he" and "she" pronouns.  This function should return two Counter objects (`maleCounter` and `femaleCounter`) which counts the number of times a given verb has "he" (`maleCounter`) and "she" (`femaleCounter`) as its syntactic subject.

In [7]:
def count_subjects():
    maleCounter=Counter()
    femaleCounter=Counter()

    for token in all_tokens:
        if token.dep_ == "nsubj":
            if token.text.lower() == "he":
                maleCounter[token.head.lemma_]+=1
            elif token.text.lower() == "she":
                femaleCounter[token.head.lemma_]+=1
    
    return maleCounter, femaleCounter

In [8]:
male, female=count_subjects()
test(male, female, display=10)

Most male:
6.039	come
3.662	reply
2.616	seem
2.454	agree
2.194	leave
2.194	tell
2.172	talk
2.046	marry
1.962	afford
1.754	ask

Most female:
-10.281	feel
-8.489	be
-8.477	see
-7.001	have
-6.344	hear
-5.539	think
-4.909	find
-4.386	could
-4.377	know
-4.311	cry


Q2: Find the verbs that men are more characteristically the *object* of than women.  Feel free to only consider objects that are "him" and "here" pronouns.  This function should return two Counter objects (`maleCounter` and `femaleCounter`) which counts the number of times a given verb has "he" (`maleCounter`) and "she" (`femaleCounter`) as its syntactic direct object. 

In [9]:
def count_objects():
    maleCounter=Counter()
    femaleCounter=Counter()

    for token in all_tokens:
        if token.dep_ == "dobj":
            if token.text.lower() == "him":
                maleCounter[token.head.lemma_]+=1
            elif token.text.lower() == "her":
                femaleCounter[token.head.lemma_]+=1
    
    return maleCounter, femaleCounter

In [10]:
male, female=count_objects()
test(male, female, display=10)

Most male:
3.880	like
3.057	_
2.888	see
2.090	suspect
1.985	send
1.835	introduce
1.656	wish
1.652	dislike
1.553	recommend
1.525	know

Most female:
-4.110	leave
-3.310	tell
-3.249	attend
-2.890	strike
-2.782	please
-2.671	prevent
-2.636	ask
-2.603	oblige
-2.407	address
-2.353	engage


Q3: Find the objects that are *possessed* more frequently by men than women.  Feel free to only consider possessors that are "his" and "her" pronouns.   This function should return two Counter objects (`maleCounter` and `femaleCounter`) which counts the number of times a given term is possessed by "he" (`maleCounter`) and "she" (`femaleCounter`).

In [11]:
def count_possessions():
    maleCounter=Counter()
    femaleCounter=Counter()

    for token in all_tokens:
        if token.dep_ == "poss":
            if token.text.lower() == "his":
                maleCounter[token.head.lemma_]+=1
            elif token.text.lower() == "her":
                femaleCounter[token.head.lemma_]+=1
    
    return maleCounter, femaleCounter

In [12]:
male, female=count_possessions()
test(male, female, display=10)

Most male:
3.718	be
3.457	horse
3.253	return
3.110	profession
3.079	come
2.937	address
2.931	being
2.925	house
2.669	name
2.664	pride

Most female:
-10.292	mother
-8.257	sister
-6.776	aunt
-6.370	eye
-5.939	heart
-5.907	friend
-5.740	uncle
-5.708	brother
-5.646	spirit
-5.459	father


Q4: Find the actions that are men do *to women* more frequently than women do *to men*.  Feel free to only consider subjects and objects that are "she"/"he"/"her"/"him" pronouns.   This function should return two Counter objects (`maleCounter` and `femaleCounter`) which counts the number of times a given verb has "he" as the subject and "her" as the object (`maleCounter`) and "she" as the subject and "him" as the object (`femaleCounter`).

In [13]:
def count_SVO_tuples():
    maleCounter=Counter()
    femaleCounter=Counter()

    dobj_verbs={}
    nsubj_verbs={}

    for token in all_tokens:
        if token.dep_ == "dobj":
            if token.text.lower() == "him" or token.text.lower() == "her":
                head=token.head
                if head not in dobj_verbs:
                    dobj_verbs[head]=[]
                dobj_verbs[head].append(token)
                
                
        if token.dep_ == "nsubj":
            if token.text.lower() == "he" or token.text.lower() == "she":
                head=token.head
                if head not in nsubj_verbs:
                    nsubj_verbs[head]=[]
                nsubj_verbs[head].append(token)

    for head_verb in dobj_verbs:
        if head_verb in nsubj_verbs:
            
            for subjectToken in nsubj_verbs[head_verb]:
                for objectToken in dobj_verbs[head_verb]:
                    if subjectToken.text.lower() == "he" and objectToken.text.lower() == "her":
                        maleCounter[head_verb.lemma_]+=1
                    elif subjectToken.text.lower() == "she" and objectToken.text.lower() == "him":
                        femaleCounter[head_verb.lemma_]+=1

                
    return maleCounter, femaleCounter

In [14]:
male, female=count_SVO_tuples()
test(male, female, display=10)

Most male:
2.595	tell
1.996	leave
1.487	give
1.393	assure
1.370	ask
1.252	join
0.959	forget
0.959	address
0.619	love
0.601	hear

Most female:
-3.164	see
-1.683	have
-1.252	entreat
-1.002	thank
-0.959	understand
-0.715	like
-0.713	know
-0.672	accept
-0.624	refuse
-0.571	receive
