This notebook explores dependency parsing by identifying the actions and objects that are characteristically associated with male and female characters.

### Cell 1: Importing Libraries

In [None]:
# Import the spacy library for Natural Language Processing.
import spacy
# Import the math module for the logarithm and square-root functions.
import math
# Import the Counter class from the collections module for counting hashable objects.
from collections import Counter
# Import the operator module to use its functions for sorting.
import operator

### Cell 2: Loading the SpaCy Model

This cell loads spaCy's pre-trained English model, which supplies the statistical components needed for tokenization, part-of-speech tagging, and dependency parsing. The resulting `nlp` object is a callable pipeline: passing it a string of English text returns a parsed `Doc`.

In [None]:
# Load the default English language model from spacy.
# Note: the 'en' shortcut only works in spaCy 2.x; it was removed in spaCy 3.
nlp = spacy.load('en')

# Workaround: if loading 'en' fails (for example on spaCy 3 or later),
# load the small English model by its full package name instead.
# nlp = spacy.load('en_core_web_sm')

### Cell 3: Defining and Processing Text Files

We'll run seven novels by Jane Austen through spaCy (this will take a few minutes).

The code below defines a list of the text files to be analyzed. It then loops through each file, reads its content, and processes the text with our `nlp` object. All the processed tokens from all novels are collected into a single list called `all_tokens` for a comprehensive analysis.

In [None]:
# Create a list of file paths for the seven Jane Austen novels.
filenames=["../data/fiction/emma.txt", "../data/fiction/lady_susan.txt", "../data/fiction/mansfield_park.txt", "../data/fiction/northanger_abbey.txt", "../data/fiction/persuasion.txt", "../data/fiction/pride.txt", "../data/fiction/sense_and_sensibility.txt"]
# Initialize an empty list to store all the tokens from all the novels.
all_tokens=[]
# Start a loop to iterate through each file path in the 'filenames' list.
for filename in filenames:
    # Print the name of the file currently being processed.
    print(filename)
    # Open the file with UTF-8 encoding and read its entire content into 'data';
    # the 'with' block closes the file once it has been read.
    with open(filename, encoding="utf-8") as file:
        data=file.read()
    # Process the text of the novel using the loaded spacy model, creating a Doc object.
    tokens=nlp(data)
    # Append all the tokens from the current novel to the master list 'all_tokens'.
    all_tokens.extend(tokens)

### Cell 4: Verifying the Corpus Size

To get a sense of the scale of our data, we print the total number of tokens processed from all novels.

In [None]:
# Print the total number of tokens collected from all the novels.
print(len(all_tokens))

### Cell 5: Defining the Statistical Test Function

This cell defines the helper function `test`. It compares two frequency counts (e.g., verbs used with male vs. female subjects) and identifies which terms are most characteristic of each group. The metric is the log-odds ratio with an uninformative Dirichlet prior, divided by its estimated standard deviation to give a z-score, so that differences resting on only a handful of occurrences are down-weighted. The function prints ranked lists of the terms most associated with each of the two input counters.

*Note: An earlier version of this code computed `femaleSum` from `maleCounter.values()`; the version below correctly uses `femaleCounter.values()`.*
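
As a sketch, the score that `test` computes for each word $w$ can be written out as follows. The notation is an assumption introduced here for illustration and does not appear in the code: $y_w^{(m)}$ and $y_w^{(f)}$ are the word's counts in `maleCounter` and `femaleCounter`, $n^{(m)}$ and $n^{(f)}$ are the totals (`maleSum` and `femaleSum`), $\alpha = 0.01$ is `alpha`, and $\alpha_0 = |V|\,\alpha$ is `alphaV`.

```latex
\hat{\delta}_w
  = \log\frac{y_w^{(m)} + \alpha}{n^{(m)} + \alpha_0 - y_w^{(m)} - \alpha}
  - \log\frac{y_w^{(f)} + \alpha}{n^{(f)} + \alpha_0 - y_w^{(f)} - \alpha}

\sigma^2\!\left(\hat{\delta}_w\right)
  \approx \frac{1}{y_w^{(m)} + \alpha} + \frac{1}{y_w^{(f)} + \alpha},
\qquad
z_w = \frac{\hat{\delta}_w}{\sqrt{\sigma^2\!\left(\hat{\delta}_w\right)}}
```

A positive $z_w$ means the word leans toward the first (male) counter, and a negative $z_w$ means it leans toward the second (female) counter.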

In [None]:
# Define a function that takes two Counter objects and an optional integer for the number of results to display.
def test(maleCounter, femaleCounter, display=25):
    
    """ Function that takes two Counter objects as inputs and prints out a ranked list of terms
    more characteristic of the first counter than the second.  Here we'll use log-odds
    with an uninformative prior (from Monroe et al 2008, "Fightin Words", eqn. 22) as our metric.
    
    """
    
    # Create a dictionary from the maleCounter to start building a combined vocabulary.
    vocab=dict(maleCounter) 
    # Update the vocabulary dictionary with items from the femaleCounter.
    vocab.update(dict(femaleCounter))
    # Calculate the total count of all items in maleCounter.
    maleSum=sum(maleCounter.values())
    # Calculate the total count of all items in femaleCounter.
    # Original code had a bug here: sum(maleCounter.values()). It is now corrected.
    femaleSum=sum(femaleCounter.values())

    # Initialize an empty dictionary to store the calculated scores for each word.
    ranks={}
    # Set a small smoothing constant (the prior pseudo-count per word) so that a word
    # missing from one counter does not cause log(0) or a division by zero.
    alpha=0.01
    # Calculate the total prior count across the entire vocabulary.
    alphaV=len(vocab)*alpha
        
    # Loop through every unique word found in either counter.
    for word in vocab:
        
        # Calculate the log-odds ratio, a measure of how much more likely a word is in the "male" context vs. the "female" context.
        log_odds_ratio=math.log( (maleCounter[word] + alpha) / (maleSum+alphaV-maleCounter[word]-alpha) ) - math.log( (femaleCounter[word] + alpha) / (femaleSum+alphaV-femaleCounter[word]-alpha) )
        # Calculate the variance of the log-odds ratio.
        variance=1./(maleCounter[word] + alpha) + 1./(femaleCounter[word] + alpha)
        
        # Calculate the z-score (standardized log-odds) and store it. This normalizes the score, making it more robust.
        ranks[word]=log_odds_ratio/math.sqrt(variance)

    # Sort the ranks dictionary by its values (the z-scores) in descending order.
    sorted_x = sorted(ranks.items(), key=operator.itemgetter(1), reverse=True)
    
    # Print a header for the male-associated terms.
    print("Most male:")
    # Loop through the top 'display' items (most characteristic of males).
    for k,v in sorted_x[:display]:
        # Print the score (formatted to 3 decimal places) and the word.
        print("%.3f\t%s" % (v,k))
    
    # Print a header for the female-associated terms.
    print("\nMost female:")
    # Loop through the bottom 'display' items (most characteristic of females) in reverse to show them from most to least female-associated.
    for k,v in reversed(sorted_x[-display:]):
        # Print the score and the word.
        print("%.3f\t%s" % (v,k))
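
To see how the scores behave, the following minimal sketch applies the same arithmetic to a tiny set of made-up counts. The function name `log_odds_z` and the toy counters are hypothetical and exist only for this illustration; unlike `test`, the sketch returns the scores instead of printing two ranked lists.

```python
import math
from collections import Counter


def log_odds_z(first, second, alpha=0.01):
    """Return {word: z-score}; positive scores lean toward `first`.

    Hypothetical helper that mirrors the arithmetic of `test`.
    """
    vocab = set(first) | set(second)
    first_sum = sum(first.values())
    second_sum = sum(second.values())
    alpha_v = len(vocab) * alpha

    scores = {}
    for word in vocab:
        a = first[word] + alpha    # smoothed count in the first counter
        b = second[word] + alpha   # smoothed count in the second counter
        log_odds = (math.log(a / (first_sum + alpha_v - a))
                    - math.log(b / (second_sum + alpha_v - b)))
        variance = 1.0 / a + 1.0 / b
        scores[word] = log_odds / math.sqrt(variance)
    return scores


# Made-up counts, purely for illustration (not taken from the novels).
toy_male = Counter({"ride": 8, "say": 10, "smile": 2})
toy_female = Counter({"ride": 1, "say": 10, "smile": 9})

scores = log_odds_z(toy_male, toy_female)
for word, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("%.3f\t%s" % (z, word))
```

With these counts, "ride" receives a positive score, "smile" a negative one, and "say", which is equally frequent in both counters, scores zero.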

### Cell 6: Understanding Dependency Parsing

spaCy uses the [ClearNLP dependency labels](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md), which are very close to the Stanford typed dependencies; see the [Stanford dependencies manual](https://nlp.stanford.edu/software/dependencies_manual.pdf) for more information about each tag. Parse information is stored on the spaCy token object: `token.text` is the token text, `token.idx` is the character offset at which the token starts in the document, `token.tag_` is the fine-grained part-of-speech tag, and `token.dep_` is the dependency relation. The syntactic head of a token is another token, available as `token.head`, which exposes all of the same attributes.

### Cell 7: Dependency Parsing Example

To illustrate how dependency parsing works in spaCy, we process a sample sentence: "He started his car." The code below iterates through each token and prints its text, character offset, part-of-speech tag, and dependency label (`dep_`), followed by the text, offset, and tag of its syntactic head (the word that governs it). This demonstrates how we can navigate the syntactic tree of a sentence to find relationships.

In [None]:
# Process a simple sentence with the spacy nlp object.
testDoc=nlp("He started his car.")
# Iterate through each token in the processed sentence.
for token in testDoc:
    # Print several attributes for each token to show the dependency parse information.
    # token.text: The word itself.
    # token.idx: The character index where the token begins.
    # token.tag_: The fine-grained part-of-speech tag.
    # token.dep_: The syntactic dependency label.
    # token.head.text: The text of the syntactic head (the word it's attached to).
    # token.head.idx: The starting character index of the head token.
    # token.head.tag_: The fine-grained POS tag of the head token.
    print("%s\t%s\t%s\t%s\t%s\t%s\t%s" % (token.text, token.idx, token.tag_, token.dep_, token.head.text, token.head.idx, token.head.tag_))
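
For readers without a spaCy installation at hand, the sketch below builds the same sentence by hand. The `Tok` class is a hypothetical stand-in that carries only `text`, `dep_`, and `head`, and the labels are an assumed, typical analysis of this sentence written out manually; they are not actual spaCy output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in for a spaCy token (text, dep_, head only)."""
    text: str
    dep_: str
    head: Optional["Tok"] = None


# Hand-written parse of "He started his car." (assumed labels).
started = Tok("started", "ROOT")
started.head = started            # in spaCy, the root token is its own head
he = Tok("He", "nsubj", started)
car = Tok("car", "dobj", started)
his = Tok("his", "poss", car)
period = Tok(".", "punct", started)
sentence = [he, started, his, car, period]

# Same traversal as in the cell above, restricted to three columns.
for tok in sentence:
    print("%s\t%s\t%s" % (tok.text, tok.dep_, tok.head.text))

# Following head pointers answers questions such as "who started?"
# and "what is possessed?".
subjects = [t.text for t in sentence if t.dep_ == "nsubj" and t.head is started]
possessed = [t.head.text for t in sentence if t.dep_ == "poss"]
print(subjects, possessed)
```

The questions that follow all rely on this pattern: filter tokens by `dep_`, then read the answer off `token.head`.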


### Cell 8: Question 1 - Male vs. Female Subjects

**Q1:** Find the verbs that men are more characteristically the *subject* of than women. Feel free to only consider subjects that are "he" and "she" pronouns. This function should return two `Counter` objects (`maleCounter` and `femaleCounter`) which count the number of times a given verb has "he" (`maleCounter`) and "she" (`femaleCounter`) as its syntactic subject.

### Cell 9: Implementation for Q1

Here, we implement the `count_subjects` function. It iterates through all tokens in our corpus. For each token, it checks if it is a nominal subject (`nsubj`). If it is, and the token is 'he' or 'she', it increments the count for the verb's base form (lemma) in the corresponding male or female counter. The verb is identified as the syntactic `head` of the subject token.

In [None]:
# Define the function to count verbs for male and female subjects.
def count_subjects():
    # Initialize a counter for verbs with "he" as the subject.
    maleCounter=Counter()
    # Initialize a counter for verbs with "she" as the subject.
    femaleCounter=Counter()

    # Loop through every single token in the Jane Austen corpus.
    for token in all_tokens:
        # Check if the token's dependency label is 'nsubj' (nominal subject).
        if token.dep_ == "nsubj":
            # If the subject's text (lowercased) is "he"...
            if token.text.lower() == "he":
                # ...find its syntactic head (the verb) and increment the count for that verb's base form (lemma).
                maleCounter[token.head.lemma_]+=1
            # Else if the subject's text (lowercased) is "she"...
            elif token.text.lower() == "she":
                # ...increment the count for that verb's base form in the female counter.
                femaleCounter[token.head.lemma_]+=1
    
    # Return the two populated counters.
    return maleCounter, femaleCounter
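
The counting logic can be tried out on a few hand-built clauses without running the parser. In this sketch the `Tok` tuple and the `make_clause` helper are hypothetical stand-ins, and the lemmas are assigned by hand rather than produced by spaCy.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes used here.
Tok = namedtuple("Tok", ["text", "dep_", "lemma_", "head"])


def make_clause(subject, verb, lemma):
    """Build a two-token 'subject verb' clause with the verb as root."""
    verb_tok = Tok(verb, "ROOT", lemma, None)
    return [Tok(subject, "nsubj", subject.lower(), verb_tok), verb_tok]


toy_tokens = (make_clause("He", "rode", "ride")
              + make_clause("She", "smiled", "smile")
              + make_clause("he", "rides", "ride")
              + make_clause("Emma", "smiled", "smile"))

male_counts, female_counts = Counter(), Counter()
for token in toy_tokens:
    if token.dep_ == "nsubj":
        pronoun = token.text.lower()
        if pronoun == "he":
            male_counts[token.head.lemma_] += 1
        elif pronoun == "she":
            female_counts[token.head.lemma_] += 1

print(male_counts)
print(female_counts)
```

Two points are visible here: counting lemmas merges "rode" and "rides" into a single entry, and the clause with "Emma" as subject is ignored because only the pronouns are matched.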

### Cell 10: Running the Analysis for Q1

Now we execute our `count_subjects` function and pass the resulting counters to our `test` function. This will analyze the frequencies and print the verbs that are most characteristically associated with male subjects ('he') and female subjects ('she') in Austen's novels.

In [None]:
# Call the function to get the subject-verb counts and store them in 'male' and 'female' variables.
male, female=count_subjects()
# Use the 'test' function to analyze these counts and print the 10 most characteristic verbs for each gender.
test(male, female, display=10)

### Cell 11: Question 2 - Male vs. Female Objects

**Q2:** Find the verbs that men are more characteristically the *object* of than women. Feel free to only consider objects that are "him" and "her" pronouns. This function should return two `Counter` objects (`maleCounter` and `femaleCounter`) which count the number of times a given verb has "him" (`maleCounter`) and "her" (`femaleCounter`) as its syntactic direct object.

### Cell 12: Implementation for Q2

For Q2, we create the `count_objects` function. This is very similar to the previous function, but instead of looking for subjects (`nsubj`), it looks for direct objects (`dobj`). When it finds a 'him' or 'her' token that is a direct object, it increments the count for the verb (the token's `head`) in the appropriate counter.

In [None]:
# Define the function to count verbs for male and female objects.
def count_objects():
    # Initialize a counter for verbs with "him" as the object.
    maleCounter=Counter()
    # Initialize a counter for verbs with "her" as the object.
    femaleCounter=Counter()

    # Loop through all tokens in the corpus.
    for token in all_tokens:
        # Check if the token's dependency label is 'dobj' (direct object).
        if token.dep_ == "dobj":
            # If the object's text (lowercased) is "him"...
            if token.text.lower() == "him":
                # ...find its head (the verb) and increment its lemma in the male counter.
                maleCounter[token.head.lemma_]+=1
            # Else if the object's text (lowercased) is "her"...
            elif token.text.lower() == "her":
                # ...increment its lemma in the female counter.
                femaleCounter[token.head.lemma_]+=1
    
    # Return the populated counters.
    return maleCounter, femaleCounter
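
Because `count_subjects` and `count_objects` differ only in the dependency label and the pronoun pair, the shared logic could be factored into one parameterized helper. The sketch below is one possible refactoring, not part of the original notebook: `count_heads` and the `Tok` stand-in are hypothetical names, and the helper takes the token list as an argument instead of reading the global `all_tokens`.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes used here.
Tok = namedtuple("Tok", ["text", "dep_", "lemma_", "head"])


def count_heads(tokens, dep, male_form, female_form):
    """Count head lemmas of pronouns that carry the dependency label `dep`."""
    male_counter, female_counter = Counter(), Counter()
    for token in tokens:
        if token.dep_ != dep:
            continue
        form = token.text.lower()
        if form == male_form:
            male_counter[token.head.lemma_] += 1
        elif form == female_form:
            female_counter[token.head.lemma_] += 1
    return male_counter, female_counter


# Hand-built tokens for "She saw him" (assumed labels, not spaCy output).
saw = Tok("saw", "ROOT", "see", None)
toy = [Tok("She", "nsubj", "she", saw), saw, Tok("him", "dobj", "him", saw)]

subj_male, subj_female = count_heads(toy, "nsubj", "he", "she")
obj_male, obj_female = count_heads(toy, "dobj", "him", "her")
print(subj_female, obj_male)
```

With real spaCy tokens, the two calls would correspond to `count_heads(all_tokens, "nsubj", "he", "she")` and `count_heads(all_tokens, "dobj", "him", "her")`.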

### Cell 13: Running the Analysis for Q2

We run the `count_objects` function and feed its output into the `test` function to see which verbs are most characteristically used with male and female direct objects.

In [None]:
# Call the function to get the verb-object counts.
male, female=count_objects()
# Analyze and print the 10 most characteristic verbs for male and female objects.
test(male, female, display=10)

### Cell 14: Question 3 - Male vs. Female Possessions

**Q3:** Find the objects that are *possessed* more frequently by men than women. Feel free to only consider possessors that are "his" and "her" pronouns. This function should return two `Counter` objects (`maleCounter` and `femaleCounter`) which count the number of times a given term is possessed by "his" (`maleCounter`) and "her" (`femaleCounter`).

### Cell 15: Implementation for Q3

To answer Q3, we write the `count_possessions` function. It searches for tokens with the dependency label `poss` (possession modifier). If the token is 'his' or 'her', it identifies the noun being possessed (the `head` of the possessive token) and adds its base form (lemma) to the respective counter.

In [None]:
# Define the function to count possessions for males and females.
def count_possessions():
    # Initialize a counter for nouns possessed by "his".
    maleCounter=Counter()
    # Initialize a counter for nouns possessed by "her".
    femaleCounter=Counter()

    # Loop through all tokens in the corpus.
    for token in all_tokens:
        # Check if the token's dependency label is 'poss' (possessive).
        if token.dep_ == "poss":
            # If the possessive pronoun is "his"...
            if token.text.lower() == "his":
                # ...find its head (the noun being possessed) and increment its lemma in the male counter.
                maleCounter[token.head.lemma_]+=1
            # Else if the possessive pronoun is "her"...
            elif token.text.lower() == "her":
                # ...increment the possessed noun's lemma in the female counter.
                femaleCounter[token.head.lemma_]+=1
    
    # Return the populated counters.
    return maleCounter, femaleCounter
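
The word "her" appears in both `count_objects` and `count_possessions`, so it is the dependency label that keeps the two uses apart. The sketch below illustrates this with two hand-labelled sentences; the `Tok` tuple is a hypothetical stand-in and the labels are assumed, not actual parser output.

```python
from collections import namedtuple

# Hypothetical stand-in: token text, dependency label, and text of the head.
Tok = namedtuple("Tok", ["text", "dep_", "head_text"])

# Hand-labelled "He admired her." and "He admired her courage."
toy = [
    Tok("He", "nsubj", "admired"),
    Tok("admired", "ROOT", "admired"),
    Tok("her", "dobj", "admired"),       # object pronoun
    Tok(".", "punct", "admired"),
    Tok("He", "nsubj", "admired"),
    Tok("admired", "ROOT", "admired"),
    Tok("her", "poss", "courage"),       # possessive pronoun
    Tok("courage", "dobj", "admired"),
    Tok(".", "punct", "admired"),
]

as_object = [t.head_text for t in toy
             if t.text.lower() == "her" and t.dep_ == "dobj"]
as_possessor = [t.head_text for t in toy
                if t.text.lower() == "her" and t.dep_ == "poss"]
print(as_object, as_possessor)
```

On real text, the counts are therefore only as reliable as the parser's choice between `dobj` and `poss` for each occurrence of "her".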

### Cell 16: Running the Analysis for Q3

Executing `count_possessions` and `test` reveals the nouns that are most distinctively possessed by men ('his') and women ('her') in the corpus.

In [None]:
# Call the function to get the possession counts.
male, female=count_possessions()
# Analyze and print the 10 most characteristic possessed nouns for each gender.
test(male, female, display=10)

### Cell 17: Question 4 - Subject-Verb-Object Tuples

**Q4:** Find the actions that men do *to women* more frequently than women do *to men*. Feel free to only consider subjects and objects that are "she"/"he"/"her"/"him" pronouns. This function should return two `Counter` objects (`maleCounter` and `femaleCounter`) which count the number of times a given verb has "he" as the subject and "her" as the object (`maleCounter`) and "she" as the subject and "him" as the object (`femaleCounter`).

### Cell 18: Implementation for Q4

This final question is the most complex. The `count_SVO_tuples` function tackles it in two stages. First, it iterates through all tokens to create two dictionaries keyed by verb token (a specific occurrence of a verb, not its lemma): one mapping each verb to its 'he'/'she' subjects, and another mapping each verb to its 'him'/'her' objects. Then, it iterates through the verbs that have recorded objects, keeps those that also have a recorded subject, checks the subject-object pairings for that specific verb instance, and increments the appropriate counter (`maleCounter` for 'he-verb-her', `femaleCounter` for 'she-verb-him').

In [None]:
# Define the function to count specific Subject-Verb-Object patterns.
def count_SVO_tuples():
    # Initialize a counter for 'he-verb-her' patterns.
    maleCounter=Counter()
    # Initialize a counter for 'she-verb-him' patterns.
    femaleCounter=Counter()

    # Initialize dictionaries to store verbs and their associated objects or subjects.
    dobj_verbs={}
    nsubj_verbs={}

    # First pass: iterate through all tokens to collect subjects and objects for each verb instance.
    for token in all_tokens:
        # Check if the token is a direct object.
        if token.dep_ == "dobj":
            # Filter for "him" or "her".
            if token.text.lower() in ("him", "her"):
                # Get the verb (the head of the object).
                head=token.head
                # If this is the first time we see this verb instance, initialize a list for it.
                if head not in dobj_verbs:
                    dobj_verbs[head]=[]
                # Append the object token ('him' or 'her') to the list for that verb.
                dobj_verbs[head].append(token)
                
        # Check if the token is a nominal subject.
        if token.dep_ == "nsubj":
            # Filter for "he" or "she".
            if token.text.lower() in ("he", "she"):
                # Get the verb (the head of the subject).
                head=token.head
                # If this is the first time we see this verb instance, initialize a list for it.
                if head not in nsubj_verbs:
                    nsubj_verbs[head]=[]
                # Append the subject token ('he' or 'she') to the list for that verb.
                nsubj_verbs[head].append(token)

    # Second pass: iterate through verbs that had objects to find matching subjects.
    for head_verb in dobj_verbs:
        # Check if the same verb instance also had a subject we recorded.
        if head_verb in nsubj_verbs:
            
            # Loop through all subjects found for this verb instance.
            for subjectToken in nsubj_verbs[head_verb]:
                # Loop through all objects found for this verb instance.
                for objectToken in dobj_verbs[head_verb]:
                    # Check for the "he -> verb -> her" pattern.
                    if subjectToken.text.lower() == "he" and objectToken.text.lower() == "her":
                        # If found, increment the verb's lemma in maleCounter.
                        maleCounter[head_verb.lemma_]+=1
                    # Check for the "she -> verb -> him" pattern.
                    elif subjectToken.text.lower() == "she" and objectToken.text.lower() == "him":
                        # If found, increment the verb's lemma in femaleCounter.
                        femaleCounter[head_verb.lemma_]+=1

    # Return the final counts.                
    return maleCounter, femaleCounter
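
The reason for keying the dictionaries by verb token and not by lemma is that a subject and an object should only be paired when they attach to the same occurrence of a verb. The sketch below shows this on hand-built clauses, using a single dictionary in place of two. The `Tok` class and `clause` helper are hypothetical stand-ins, and the labels and lemmas are assigned by hand.

```python
from collections import Counter, defaultdict


class Tok:
    """Hypothetical stand-in for a spaCy token (hashable by identity)."""

    def __init__(self, text, dep_, lemma_, head=None):
        self.text, self.dep_, self.lemma_ = text, dep_, lemma_
        self.head = head if head is not None else self


def clause(subject, verb, lemma, obj):
    """Build a 'subject verb object' clause with the verb as root."""
    v = Tok(verb, "ROOT", lemma)
    return [Tok(subject, "nsubj", subject.lower(), v), v,
            Tok(obj, "dobj", obj.lower(), v)]


toy_tokens = (clause("He", "told", "tell", "her")
              + clause("She", "told", "tell", "him")
              + clause("He", "told", "tell", "her")
              + clause("He", "thanked", "thank", "Emma")
              + clause("Emma", "thanked", "thank", "her"))

# Group pronoun arguments by verb *instance*; each Tok object is its own key.
args = defaultdict(lambda: {"nsubj": [], "dobj": []})
for tok in toy_tokens:
    form = tok.text.lower()
    if ((tok.dep_ == "nsubj" and form in ("he", "she"))
            or (tok.dep_ == "dobj" and form in ("him", "her"))):
        args[tok.head][tok.dep_].append(form)

male_counts, female_counts = Counter(), Counter()
for verb, roles in args.items():
    for subj in roles["nsubj"]:
        for obj in roles["dobj"]:
            if (subj, obj) == ("he", "her"):
                male_counts[verb.lemma_] += 1
            elif (subj, obj) == ("she", "him"):
                female_counts[verb.lemma_] += 1

print(male_counts, female_counts)
```

"He thanked Emma" and "Emma thanked her" share the lemma "thank", but because the two clauses involve different verb tokens, no "he-thank-her" tuple is counted.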

### Cell 19: Running the Analysis for Q4

Finally, we call `count_SVO_tuples` and `test` to see the results. This reveals the actions men are more likely to perform on women and the actions women are more likely to perform on men, according to the pronouns used in Jane Austen's novels.

In [None]:
# Call the function to get the SVO pattern counts.
male, female=count_SVO_tuples()
# Analyze and print the 10 most characteristic verbs for each pattern.
test(male, female, display=10)