This notebook explores the task of extracting social networks from text: for a given set of people mentioned in a text, can we extract a social network that connects them?  In this social network, people are the nodes, and the edges between them are weighted by the strength of their connection.  How you define what "connection" means here is up to you (within reason).

This notebook requires networkx; install with:

```sh
pip install networkx==2.2
```

In [None]:
# Import the spacy library for Natural Language Processing (NLP)
import spacy
# Import defaultdict from collections, a dictionary that provides a default value for a non-existent key
from collections import defaultdict
# Import networkx for creating and manipulating complex networks (graphs)
import networkx as nx
# Import matplotlib.pyplot for creating static, animated, and interactive visualizations
import matplotlib.pyplot as plt

### Aim of the Code
This cell imports all the necessary Python libraries for the task.
* **spaCy**: Used for advanced NLP tasks, specifically for Named Entity Recognition (NER) to identify characters in the text.
* **defaultdict**: A convenient dictionary subclass to handle accumulating items (like character mentions) without needing to check if a key exists first.
* **networkx**: The core library for creating the social network graph, adding nodes (characters), and edges (relationships).
* **matplotlib**: Used to visualize and plot the final network graph.

In [None]:
# Load the pre-trained English language model from spaCy
# We disable the 'parser' component because it's not needed for our task (Named Entity Recognition)
# and disabling it makes the process faster.
# Note: the 'en' shortcut works in spaCy 2.x; spaCy 3+ requires a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en', disable=['parser'])

### Aim of the Code
This line initializes the spaCy NLP pipeline. It loads the standard English model (`en`), which is trained to perform various linguistic tasks. By disabling the `parser`, we optimize the pipeline to run faster since we only need the Named Entity Recognition (NER) component for this project.

In [None]:
# Define a function named 'process' that takes a filename as input
def process(filename):
    # Open the file with the specified filename in read mode, using utf-8 encoding for broad character support
    with open(filename, encoding="utf-8") as file:
        # Read the entire content of the file into the 'data' variable
        data=file.read()
        # Process the text data with the spaCy nlp object and return the resulting doc object
        return nlp(data)

### Aim of the Code
This cell defines a helper function `process` that simplifies file handling and NLP processing. It takes a file path, reads the text content, and then runs it through the spaCy pipeline created in the previous cell. The output is a spaCy `Doc` object, which is a rich container with all the linguistic annotations.

Q1. Pick an English-language book you know from [Project Gutenberg](https://www.gutenberg.org) and save it in the `data/` directory.  Read it in here.  What two characters have a strong relationship in this book?

In [None]:
# By default, we're using Austen's Pride and Prejudice from the specified file path
# Call the 'process' function to read the text and create a spaCy Doc object
doc=process("../data/pride.txt")

### Aim of the Code
This code executes the `process` function on the selected book, "Pride and Prejudice." The text is loaded and processed, and the resulting `Doc` object, which contains the book's text and all of its linguistic features, is stored in the `doc` variable. This `doc` will be the primary source of data for all subsequent analysis.

Your main task here will be to create a social network of people mentioned in text.  You will need to implement the following two functions: `get_nodes`, which returns a list of characters in a text along with a weight for them (e.g., their frequency of mention in the text) and `get_edges`, which returns a list of positive weights between those character nodes (if two characters do not have a tie between them, their weight is 0 and you don't have to include them in the edge list).

The interesting question here is how you measure whether a social tie exists between two characters in a text, and how you go about placing a weight on that edge that measures the strength of the tie. Two characters that have a strong tie should have a high weight.  Consider the different ways that we might measure social interaction in text -- the frequency with which two characters are mentioned together, how often they mention each other in dialogue, how "friendly" their interaction seems, etc.

For two previous approaches to this, see Elson et al. (2010), "[Extracting Social Networks from Literary Texts](http://www1.cs.columbia.edu/~delson/pubs/ACL2010-ElsonDamesMcKeown.pdf)" and Stiller et al. (2003), "[The Small World of Shakespeare's Plays](http://www.staff.ncl.ac.uk/daniel.nettle/shakespeare.pdf)".

Q2: Before implementing the two functions, explain how you are defining a social tie, and how you are measuring it in text.

In [None]:
# Define a function to extract mentions of people from a spaCy Doc
def get_people_mentions(doc, min_count=10):
    """ Extract all of the PERSON mentions in a spacy-processed document.
    Returns a dict mapping each unique person name to a list of spacy entity mentions
    Each spacy entity has the following attributes:
    
    * text
    * start_char (start position in document, in characters; .start is the token index)
    * end_char (end position in document, in characters; .end is the token index)
    * label_ (NER category)
    
    https://spacy.io/usage/linguistic-features#named-entities
    """
    # Create a defaultdict where each key will default to an empty list
    people=defaultdict(list)
    # Iterate through all the named entities found by spaCy in the document
    for entity in doc.ents:
        # Check if the entity's label is 'PERSON'
        if entity.label_ == "PERSON":
            # If it is a person, append the entity object to the list associated with their name
            # .strip() removes any leading/trailing whitespace from the name
            people[entity.text.strip()].append(entity)
    
    # Return the dictionary of people and their mentions
    return people

### Aim of the Code
This function, `get_people_mentions`, is responsible for the first step of data extraction: identifying all characters. It iterates through the entities recognized by spaCy in the text. If an entity is labeled as a `PERSON`, it's stored in a dictionary. The dictionary keys are the character names, and the values are lists of every instance (mention) where that character's name appeared.
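
The grouping step can be illustrated without loading spaCy. The sketch below is a minimal stand-in, assuming a hypothetical `Mention` tuple in place of spaCy's entity spans; it models only the `text`, `label_`, and `start_char` fields, and the toy entities are invented for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy entity span (not part of spaCy itself).
Mention = namedtuple("Mention", ["text", "label_", "start_char"])

def group_person_mentions(entities):
    """Group PERSON mentions by their exact (whitespace-stripped) surface form."""
    people = defaultdict(list)
    for entity in entities:
        if entity.label_ == "PERSON":
            people[entity.text.strip()].append(entity)
    return people

# Invented toy data: three PERSON mentions and one place name.
toy_entities = [
    Mention("Elizabeth", "PERSON", 0),
    Mention("Longbourn", "GPE", 40),
    Mention("Mr. Darcy", "PERSON", 75),
    Mention(" Elizabeth", "PERSON", 120),
]
toy_people = group_person_mentions(toy_entities)
mention_counts = {name: len(mentions) for name, mentions in toy_people.items()}
print(mention_counts)
```

Non-PERSON entities are dropped, and the stray leading space on the second "Elizabeth" is stripped so both mentions land under the same key.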

Q3: Implement `get_nodes` below.

In [None]:
# Define a function to create network nodes from the extracted people mentions
def get_nodes(people, min_count=10):
    """ Creates network nodes from a dict of people mentions
    Input: a dict of people mentions returned from get_people_mentions()
    Output: a dict mapping the entity name to a positive numerical value of their importance 
    (the size of the node in a network graph)
    
    e.g., {"Tom": 5, "Huck": 1}
    
    """
    # Create a defaultdict where each key will default to 0.0
    nodes=defaultdict(float)
    # Initialize a variable to keep track of the total number of mentions
    total=0.
    # Iterate through each person (key) in the people dictionary
    for person in people:
        # Check if the person was mentioned at least 'min_count' times to filter out minor characters
        if len(people[person]) >= min_count:
            # The node's initial weight is the number of times the person is mentioned
            nodes[person]=len(people[person])
            # Add this count to the total
            total+=len(people[person])
    
    # Iterate through the filtered list of people (nodes)
    for person in nodes:
        # Normalize the node's weight by dividing by the total mentions
        # This gives a relative importance score for each character
        nodes[person]/=total
    
    # Return the dictionary of nodes and their normalized weights
    return nodes

### Aim of the Code
The `get_nodes` function processes the dictionary of character mentions to determine each character's importance in the story. A character's importance (which will determine the size of their node in the graph) is defined by how frequently they are mentioned. The function first counts the mentions for each character, filtering out those who don't meet a `min_count` threshold. It then **normalizes** these counts by dividing by the total number of mentions, so each node's size represents its share of the overall "character attention" in the book.
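
As a small worked example of the filter-then-normalize step, the sketch below applies the same arithmetic to invented mention counts. The helper name `normalize_counts` and the toy numbers are assumptions made for illustration, not part of the notebook.

```python
def normalize_counts(counts, min_count=10):
    """Keep names with at least min_count mentions and rescale the kept counts to sum to 1."""
    kept = {name: n for name, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    # If nothing passes the threshold, the comprehension is empty and no division happens.
    return {name: n / total for name, n in kept.items()}

# Invented toy counts: the third character falls below the threshold.
toy_counts = {"Elizabeth": 30, "Darcy": 20, "Housekeeper": 2}
toy_node_weights = normalize_counts(toy_counts)
print(toy_node_weights)
```

Because the total is computed only over characters that pass the threshold, the remaining weights always sum to 1, regardless of how many minor characters were dropped.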

Q4. Implement `get_edges` below.

In [None]:
# Define a function to create the connections (edges) between the nodes (people)
def get_edges(people, doc):
    """ Creates network edges from a dict of people mentions and the full spacy-processed document
    Input: a dict of people mentions returned from get_people_mentions() and document returned from process()
    Output: a dict mapping a person to all of the other people they are connected to, along with the weight of
    that connection.
    
    e.g., {"Tom": {"Huck": 2, "Sally": 1}, "Huck": {"Sally": 1}}
    
    Keep in mind that doc gives you access to *all* of the tokens in the book.
    
    """
    
    # Here we'll define the strength of a tie to be one of proximity of mention (two characters who
    # are frequently mentioned together will have a high edge weight).
    # Define an inner helper function to count co-occurrences of two people
    def get_counts(person1, person2):

        # Define the proximity window: two mentions are "together" if they appear within 500 characters of each other
        window=500 # characters
        # Initialize a counter for co-occurrences
        count=0
        # Get the list of all mentions for the first person
        per1_mentions=people[person1]
        # Get the list of all mentions for the second person
        per2_mentions=people[person2]

        # Loop through each mention of the first person
        for p1 in per1_mentions:
            # For each mention of the first person, loop through each mention of the second person
            for p2 in per2_mentions:
                # Check if the character offsets of the two mentions are within the defined window
                # (.start_char is a character offset; .start would be a token index)
                if abs(p1.start_char-p2.start_char) < window:
                    # If they are, increment the co-occurrence counter
                    count+=1

        # Return the total count of co-occurrences
        return count

    # Initialize a dictionary to store the edges and their weights
    edges={}
    # Initialize a counter for the total number of edges, used for normalization
    total_edges=0.

    # Iterate through every person in the mentions dict (this includes people below min_count;
    # create_graph later drops any edge whose endpoints are not both nodes)
    for person1 in people:
        # If this person isn't already a key in our edges dictionary, add them with an empty dictionary as the value
        if person1 not in edges:
            edges[person1]={}
        # Iterate through every person again to create pairs
        for person2 in people:
            # Ensure we are not comparing a person to themselves
            if person1 != person2:
                # Calculate the number of times these two people are mentioned in close proximity
                count=get_counts(person1, person2)
                # To create an edge, the co-occurrence count must be at least 5
                # and the reverse edge must not have been added already (to avoid duplicates like Tom-Huck and Huck-Tom)
                if count >= 5 and person1 not in edges.get(person2, {}):
                    # Add the connection and its weight (the count) to the edges dictionary
                    edges[person1][person2]=count
                    # Increment the total number of edges found
                    total_edges+=1

    # Loop through the first person in each established edge
    for person1 in edges:
        # Loop through the second person connected to the first
        for person2 in edges[person1]:
            # Normalize the edge weight by dividing by the total number of edges
            edges[person1][person2]/=total_edges
    
    # Return the final dictionary of connections and their normalized strengths
    return edges

### Aim of the Code
This is the most complex function. Its goal is to define and quantify the relationships between characters. It operates on the principle of **co-occurrence**: if two characters are mentioned close to each other frequently, their relationship is considered strong.

1.  **Inner `get_counts` Function**: This helper function takes two character names and counts how many times any mention of the first character appears within a 500-character "window" of any mention of the second.
2.  **Main Logic**:
    * It iterates through every possible pair of characters.
    * For each pair, it calls `get_counts` to measure their proximity.
    * If the co-occurrence count is above a threshold (5), it creates an "edge" (a connection) between them, with the count as the initial weight. It avoids creating duplicate connections (e.g., if it adds Darcy-Elizabeth, it won't later add Elizabeth-Darcy).
    * Finally, similar to `get_nodes`, it **normalizes** all the edge weights by dividing by the total number of connections. This results in a relative strength for each relationship.
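
The pairwise window count can be sketched on plain integer offsets. The nested-loop version below mirrors the counting logic described above; the second version is an alternative, not the notebook's method, that uses binary search over sorted offsets to get the same count with less work. The function names and toy offsets are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

def count_pairs_bruteforce(positions1, positions2, window=500):
    """Count mention pairs whose offsets differ by less than window (nested loops)."""
    return sum(1 for p1 in positions1 for p2 in positions2 if abs(p1 - p2) < window)

def count_pairs_sorted(positions1, positions2, window=500):
    """Same count, using binary search over the sorted second list."""
    sorted2 = sorted(positions2)
    count = 0
    for p1 in positions1:
        # Count offsets strictly inside the open interval (p1 - window, p1 + window).
        lo = bisect_right(sorted2, p1 - window)
        hi = bisect_left(sorted2, p1 + window)
        count += hi - lo
    return count

# Invented toy offsets for two characters.
toy_elizabeth = [100, 900, 2500]
toy_darcy = [350, 1399, 1400, 5000]
print(count_pairs_bruteforce(toy_elizabeth, toy_darcy))
print(count_pairs_sorted(toy_elizabeth, toy_darcy))
```

Note that the count is over pairs of mentions, so a passage where both names recur several times contributes many pairs rather than one.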

First, let's map *mentions* of characters to the unique individuals they refer to.  We'll talk about better ways of doing this when we get to coreference resolution in a couple of weeks, but for now let's make a simplification and just say that every mention with exactly the same form refers to the same individual (so all mentions of "Elizabeth" refer to the character ELIZABETH, all mentions of "Mr. Darcy" refer to MR. DARCY, etc.)
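
To make the cost of this simplification concrete, the sketch below counts invented mention strings by exact form and then again after merging through a hand-written alias table. The alias table is purely hypothetical; the notebook itself uses only exact-form matching.

```python
from collections import Counter

# Invented toy mention strings.
toy_mentions = ["Elizabeth", "Lizzy", "Mr. Darcy", "Elizabeth", "Darcy"]

# Exact-form matching: each distinct string becomes its own individual.
exact_counts = Counter(toy_mentions)

# Hypothetical hand-written alias table, shown only to illustrate what exact matching misses.
toy_aliases = {"Lizzy": "Elizabeth", "Darcy": "Mr. Darcy"}
merged_counts = Counter(toy_aliases.get(name, name) for name in toy_mentions)

print(dict(exact_counts))
print(dict(merged_counts))
```

Under exact-form matching the five mentions are split across four "individuals", while the merged view has two.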

In [None]:
# Call the get_people_mentions function with the processed document to extract all character mentions
people=get_people_mentions(doc)

### Aim of the Code
This line executes the `get_people_mentions` function. It scans the entire processed text of "Pride and Prejudice" (`doc`) and produces the `people` dictionary, which maps character names to a list of all their mentions in the book. This is the raw data from which the network will be built.

In [None]:
# Call the get_nodes function with the 'people' dictionary to calculate the importance of each character
nodes=get_nodes(people)

### Aim of the Code
This line takes the `people` dictionary and passes it to the `get_nodes` function. It calculates the normalized frequency for each major character, creating the `nodes` dictionary which maps character names to their importance score. This data will be used to set the size of the circles in the final graph.

In [None]:
# Call the get_edges function to calculate the relationship strengths between all characters
edges=get_edges(people, doc)

### Aim of the Code
This line executes the `get_edges` function. It uses the `people` dictionary to analyze the proximity of character mentions (the `doc` argument is accepted but not used by this implementation). The result is the `edges` dictionary, a nested structure that holds a normalized weight for each pair of characters whose mentions co-occur at least five times.
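
One way to inspect a nested edges dictionary of this shape is to flatten it and sort by weight. The sketch below assumes the `{person: {other: weight}}` layout described in the `get_edges` docstring; the helper name `strongest_ties` and the toy weights are invented for illustration.

```python
def strongest_ties(edges, top_n=3):
    """Flatten a nested {person: {other: weight}} dict and return the heaviest ties first."""
    flat = [(p1, p2, w) for p1, neighbors in edges.items() for p2, w in neighbors.items()]
    return sorted(flat, key=lambda tie: tie[2], reverse=True)[:top_n]

# Invented toy edges; a person with no recorded ties maps to an empty dict.
toy_edges = {
    "Elizabeth": {"Darcy": 0.5, "Jane": 0.25},
    "Jane": {"Bingley": 0.25},
    "Bingley": {},
}
for person1, person2, weight in strongest_ties(toy_edges, top_n=2):
    print(person1, person2, weight)
```

Applied to the real `edges` output, the same call would give a quick sanity check on whether the heaviest ties match the relationships you expect from the book.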

In [None]:
# Define a function to draw the network graph using the nodes and edges data
def create_graph(nodes, edges):

    """ Plot a set of weighted nodes and weighted edges on a network graph """
    
    # A parameter to control the spacing of the graph layout; larger numbers spread nodes out more
    force_directed_expansion=2
    
    # Dimensions for the output plot, making it a large, readable image
    figure_height=20
    figure_width=20
    
    # Create an empty graph object from the networkx library
    G = nx.Graph()
    # Iterate through the nodes dictionary
    for person in nodes:
        # Add each person as a node, with their importance score stored as a 'nodesize' attribute
        G.add_node(person, nodesize=nodes[person])
    # Iterate through the first person in each edge connection
    for person1 in edges:
        # Iterate through the second person connected to the first
        for person2 in edges[person1]:
            # Make sure both people in the edge exist as nodes in our graph
            if person1 in nodes and person2 in nodes:
                # Add a weighted edge between the two people
                G.add_weighted_edges_from([(person1, person2, edges[person1][person2]) ])

    # Define a dictionary of options for drawing the graph
    options = {
    'edgecolors':"black",  # Color of the border of the nodes
    'linewidths':1,         # Width of the border of the nodes
    'with_labels': True,    # Draw the names of the characters on the nodes
    'font_weight': 'regular', # Font style for the labels
    }
    
    # Get a list of all edges in the graph
    g_edges = G.edges()

    # Create a list of node sizes for plotting. The 'nodesize' attribute is scaled up for better visibility.
    sizes = [G.nodes[node]['nodesize']*100000 for node in G]
    # Create a list of edge widths for plotting. The 'weight' attribute is scaled up for better visibility.
    weights = [G[u][v]['weight']*10 for u,v in g_edges]

    # Set up the plot figure and axes with the specified dimensions
    fig, ax = plt.subplots(1, 1, figsize=(figure_width, figure_height))

    # Draw the network graph using networkx's drawing function
    # pos=nx.spring_layout(...): A force-directed (Fruchterman-Reingold) layout in which edges act like springs
    # pulling connected nodes together while all nodes repel one another.
    # node_size=sizes: Sets the size of each node from our calculated list.
    # width=weights: Sets the width of each edge from our calculated list.
    # **options: Applies the other styling options defined earlier.
    nx.draw_networkx(G, pos=nx.spring_layout(G, k=force_directed_expansion, iterations=100), node_size=sizes, width=weights, **options)

### Aim of the Code
This function, `create_graph`, brings everything together for visualization. It uses the `networkx` library to:
1.  Create a graph object.
2.  Add each character as a **node**, using the `nodes` dictionary to set the size of the node.
3.  Add the relationships as **edges**, using the `edges` dictionary to set the thickness (weight) of the connecting lines.
4.  Use `matplotlib` to draw the graph. It employs a `spring_layout` algorithm, which is a physics simulation that arranges the nodes in a visually intuitive way, with strongly connected nodes pulling closer together. The final output is a clear visual representation of the book's social network.
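
The edge-filtering step that happens before drawing can be sketched without networkx. The helper below is a hypothetical illustration, not the notebook's code: it keeps only edges whose endpoints are both nodes, emits each unordered pair once, and returns tuples in the `(person1, person2, weight)` shape that `G.add_weighted_edges_from` accepts.

```python
def undirected_edge_list(nodes, edges):
    """Return (person1, person2, weight) tuples, one per unordered pair, restricted to known nodes."""
    seen = set()
    result = []
    for person1, neighbors in edges.items():
        for person2, weight in neighbors.items():
            pair = frozenset((person1, person2))
            if person1 in nodes and person2 in nodes and pair not in seen:
                seen.add(pair)
                result.append((person1, person2, weight))
    return result

# Invented toy data: "Housekeeper" is not a node, and Darcy-Elizabeth repeats in reverse.
toy_graph_nodes = {"Elizabeth": 0.6, "Darcy": 0.4}
toy_graph_edges = {
    "Elizabeth": {"Darcy": 0.5, "Housekeeper": 0.1},
    "Darcy": {"Elizabeth": 0.5},
}
edge_list = undirected_edge_list(toy_graph_nodes, toy_graph_edges)
print(edge_list)
```

Since `nx.Graph` is undirected, adding the same pair in both directions would simply overwrite the weight; deduplicating first makes it explicit which weight is kept.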

In [None]:
# Call the create_graph function with the final nodes and edges data to generate and display the plot
create_graph(nodes, edges)

### Aim of the Code
This is the final step. It calls the `create_graph` function, passing in the calculated `nodes` (character importance) and `edges` (relationship strengths). This executes the drawing process and displays the social network graph of "Pride and Prejudice."