[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/11.nlp/HW10_SyntacticRelations.ipynb)

# HW10: Exploring gender in books

This notebook explores dependency parsing by identifying the actions and objects that are characteristically associated with characters as a function of their referential gender ("he"/"she").

In [None]:
import math
import operator

from collections import Counter

import spacy
from tqdm import tqdm

In [None]:
nlp = spacy.load('en_core_web_sm')

## Load data

We'll run seven novels by Jane Austen through spaCy (this will take a few minutes).

In [None]:
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/emma.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/lady_susan.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/mansfield_park.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/northanger_abbey.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/persuasion.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/pride.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/refs/heads/main/data/fiction/sense_and_sensibility.txt

In [None]:
files = ["emma.txt", "lady_susan.txt", "mansfield_park.txt", "northanger_abbey.txt", "persuasion.txt", "pride.txt", "sense_and_sensibility.txt"]

def read_all_files(filenames):
    all_tokens = []

    for filename in tqdm(filenames):
        # Use a context manager so each file handle is closed after reading.
        with open(filename, encoding="utf-8") as f:
            data = f.read()
        tokens = nlp(data)
        all_tokens.extend(tokens)
    return all_tokens

all_tokens = read_all_files(files)

In [None]:
print(len(all_tokens))

## Setting up log odds

In [None]:
def logodds(counter1, counter2, display=25):
    """
    Function that takes two Counter objects as inputs and prints out a ranked list of terms
    more characteristic of the first counter than the second.  Here we'll use log-odds
    with an uninformative prior (from Monroe et al 2008, "Fightin Words", eqn. 22) as our metric.

    "Category 1" corresponds to the category of the counter1 object (the first argument)
    "Category 2" corresponds to the category of the counter2 object (the second argument)
    """
    vocab=dict(counter1)
    vocab.update(dict(counter2))
    count1_sum=sum(counter1.values())
    count2_sum=sum(counter2.values())

    ranks={}
    alpha=0.01
    alphaV=len(vocab)*alpha

    for word in vocab:

        log_odds_ratio=math.log( (counter1[word] + alpha) / (count1_sum+alphaV-counter1[word]-alpha) ) - math.log( (counter2[word] + alpha) / (count2_sum+alphaV-counter2[word]-alpha) )
        variance=1./(counter1[word] + alpha) + 1./(counter2[word] + alpha)

        ranks[word]=log_odds_ratio/math.sqrt(variance)

    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)

    print("Most category 1:")
    for k,v in sorted_x[:display]:
        print("%.3f\t%s" % (v,k))

    print("\nMost category 2:")
    for k,v in reversed(sorted_x[-display:]):
        print("%.3f\t%s" % (v,k))
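
To make the ranking in `logodds` concrete: for each word $w$, with count $y_w^{(i)}$ in category $i$, total count $n^{(i)}$, smoothing constant $\alpha$, and vocabulary size $V$ (this notation is ours, chosen to mirror the code above), the function computes

$$\delta_w = \log\frac{y_w^{(1)}+\alpha}{n^{(1)}+\alpha V-y_w^{(1)}-\alpha} - \log\frac{y_w^{(2)}+\alpha}{n^{(2)}+\alpha V-y_w^{(2)}-\alpha}, \qquad z_w = \frac{\delta_w}{\sqrt{\dfrac{1}{y_w^{(1)}+\alpha}+\dfrac{1}{y_w^{(2)}+\alpha}}}$$

The sketch below applies the same arithmetic to two toy counters. The helper name `log_odds_z` and the toy counts are hypothetical, introduced only for illustration; it returns the scores instead of printing them.

```python
import math
from collections import Counter


def log_odds_z(counter1, counter2, alpha=0.01):
    """Return {word: z-score}, using the same formula as logodds above.

    Hypothetical helper for illustration; positive scores lean toward
    counter1 and negative scores lean toward counter2.
    """
    vocab = set(counter1) | set(counter2)
    n1 = sum(counter1.values())
    n2 = sum(counter2.values())
    alpha_v = alpha * len(vocab)

    scores = {}
    for word in vocab:
        y1 = counter1[word]  # a Counter returns 0 for unseen words
        y2 = counter2[word]
        delta = (math.log((y1 + alpha) / (n1 + alpha_v - y1 - alpha))
                 - math.log((y2 + alpha) / (n2 + alpha_v - y2 - alpha)))
        variance = 1.0 / (y1 + alpha) + 1.0 / (y2 + alpha)
        scores[word] = delta / math.sqrt(variance)
    return scores


# Toy counts (made up): "walk" leans toward category 1, "smile" toward
# category 2, and "said" is equally common in both.
toy1 = Counter({"walk": 8, "said": 10, "smile": 2})
toy2 = Counter({"walk": 2, "said": 10, "smile": 8})

scores = log_odds_z(toy1, toy2)
for word, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("%.3f\t%s" % (z, word))
```

Dividing by the estimated standard deviation is what keeps very rare words from dominating the ranking: a word seen only once or twice has a large variance term, so its score shrinks toward zero.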

## Dependency parsing with spaCy

spaCy uses the [ClearNLP dependency labels](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md), which are very close to the Stanford typed dependencies.  See the [Stanford dependencies manual](http://people.ischool.berkeley.edu/~dbamman/DependencyManual.pdf) for more information about each tag.  Parse information is stored on the spaCy `Token` object; see the following for the attributes that encode the token text (`token.text`), its character offset within the document (`token.idx`), its fine-grained part-of-speech tag (`token.tag_`), and its dependency relation (`token.dep_`).  Note that `token.idx` is a character offset, not a word position; the token's index within the document is `token.i`.  The syntactic head of a token is another token, available as `token.head`, which exposes all of the same attributes.

In [None]:
test_doc = nlp("He started his car.")
for token in test_doc:
    print("\t".join(str(x) for x in [token.text, token.idx, token.tag_, token.dep_, token.head.text, token.head.idx, token.head.tag_]))
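
The difference between `token.i` (token index) and `token.idx` (character offset), and the link between `token.head` and `token.children`, can be seen on a small hand-built example. This is a sketch assuming spaCy v3: it builds a `Doc` from `spacy.blank("en")` with heads and dependency labels written by hand for illustration, so nothing here is predicted by the parser and the labels on real text may differ. The names `blank_nlp` and `toy_doc` are hypothetical.

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline supplies a vocabulary without loading a trained model.
blank_nlp = spacy.blank("en")

words = ["He", "started", "his", "car", "."]
spaces = [True, True, True, False, False]   # whether a space follows each word
heads = [1, 1, 3, 1, 1]                     # index of each token's head (the root points to itself)
deps = ["nsubj", "ROOT", "poss", "dobj", "punct"]  # hand-assigned labels

toy_doc = Doc(blank_nlp.vocab, words=words, spaces=spaces, heads=heads, deps=deps)

for token in toy_doc:
    # token.i counts tokens; token.idx counts characters from the start of the text.
    children = [child.text for child in token.children]
    print("\t".join(str(x) for x in [
        token.i, token.idx, token.text, token.dep_, token.head.text, children
    ]))
```

`token.children` is the inverse of `token.head`: it yields every token whose head is the current token, which is often the more convenient direction when starting from a verb.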


**Q1**. Find the verbs that men are more characteristically the *subject* of than women.  Feel free to only consider subjects that are "he" and "she" pronouns.  This function should return two Counter objects (`he_counter` and `she_counter`) that count the number of times a given verb has "he" (`he_counter`) or "she" (`she_counter`) as its syntactic subject.

In [None]:
def count_subjects(tokens):
    he_counter = Counter()
    she_counter = Counter()

    # your code here

    return he_counter, she_counter

In [None]:
he_counts, she_counts = count_subjects(all_tokens)
logodds(he_counts, she_counts, display=10)

**Q2**. Find the verbs that men are more characteristically the *object* of than women.  Feel free to only consider objects that are "him" and "her" pronouns.  This function should return two Counter objects (`he_counter` and `she_counter`) that count the number of times a given verb has "him" (`he_counter`) or "her" (`she_counter`) as its syntactic direct object.

In [None]:
def count_objects(tokens):
    he_counter=Counter()
    she_counter=Counter()

    # your code here
            
    return he_counter, she_counter

In [None]:
he_counts, she_counts = count_objects(all_tokens)
logodds(he_counts, she_counts, display=10)

**Q3**. Find the objects that are *possessed* more frequently by men than women.  Feel free to only consider possessors that are "his" and "her" pronouns.  This function should return two Counter objects (`he_counter` and `she_counter`) that count the number of times a given term is possessed by "his" (`he_counter`) or "her" (`she_counter`).

In [None]:
def count_possessions(tokens):
    he_counter=Counter()
    she_counter=Counter()

    # your code here


    return he_counter, she_counter

In [None]:
he_counts, she_counts = count_possessions(all_tokens)
logodds(he_counts, she_counts, display=10)

**Q4**. Find the actions that men do *to women* more frequently than women do *to men*.  Feel free to only consider subjects and objects that are "she"/"he"/"her"/"him" pronouns.  This function should return two Counter objects (`he_counter` and `she_counter`) that count the number of times a given verb has "he" as the subject and "her" as the object (`he_counter`), or "she" as the subject and "him" as the object (`she_counter`).

In [None]:
def count_SVO_tuples(tokens):
    he_counter=Counter()
    she_counter=Counter()

    # your code here

    return he_counter, she_counter

In [None]:
he_counts, she_counts = count_SVO_tuples(all_tokens)
logodds(he_counts, she_counts, display=10)

**Q5**. **In a few sentences,** reflect on the analysis you did above. What claims can you make about the data? What are some limitations?