# BookNLP Character Network: Make a Character Network for Any Book

By [Melanie Walsh](https://melaniewalsh.org/), based on a notebook by [David Bamman](https://people.ischool.berkeley.edu/~dbamman/)

[BookNLP](https://github.com/booknlp/booknlp) is a natural language processing tool developed by David Bamman. It can computationally identify characters, people, places, quotations, events, and a lot more in book-length documents in English (support for other languages coming soon!). Most NLP tools do not work well on book-length documents, which makes BookNLP extremely useful.

This Colab notebook, which is [based on a notebook](https://github.com/booknlp/booknlp/blob/main/examples/Read%20character%20file.ipynb) created by David Bamman, demonstrates how to create character network data for a book with BookNLP. The default book for this notebook is Virginia Woolf's *Mrs. Dalloway* (1925). However, you can substitute another URL in the "Pick Your Book" section to try BookNLP out on another book. BookNLP will take a few minutes to process a text, depending on how long it is.

# 🚨 Before You Begin 🚨

First, you need to sign into a Google account to use this notebook.

Second, BookNLP will work best if you switch to using a GPU, or Graphics Processing Unit, for this notebook. To use a GPU in Google Colab, go to the menu at the top of the screen and select:

`Runtime > Change runtime type > Hardware accelerator > GPU (Then click "Save")`

To run all the code in this notebook, you can select:

`Runtime > Run all`

If you want to save your own changes to this notebook, you'll need to save a copy.

# Pick Your Book

The default book for this notebook is Virginia Woolf's *Mrs. Dalloway*: https://gutenberg.net.au/ebooks02/0200991.txt. But you can find a .txt URL from [Project Gutenberg](https://www.gutenberg.org/ebooks/search/?sort_order=downloads) or GitHub or anywhere else and plug it in below:

In [39]:
!wget "https://gutenberg.net.au/ebooks02/0200991.txt" -O my_book.txt

--2021-12-10 04:10:53--  https://gutenberg.net.au/ebooks02/0200991.txt
Resolving gutenberg.net.au (gutenberg.net.au)... 43.229.63.241
Connecting to gutenberg.net.au (gutenberg.net.au)|43.229.63.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 370280 (362K) [text/plain]
Saving to: ‘my_book.txt’


2021-12-10 04:10:55 (442 KB/s) - ‘my_book.txt’ saved [370280/370280]
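
If you are running this notebook somewhere `wget` is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download_book` is hypothetical and not part of BookNLP.

```python
import urllib.request
from pathlib import Path


def download_book(url, destination="my_book.txt"):
    """Fetch a plain-text book from `url` and save it to `destination`.

    Hypothetical helper for illustration; it mirrors `wget URL -O my_book.txt`.
    """
    dest_path = Path(destination)
    # urlretrieve copies the response body to the given file on disk
    urllib.request.urlretrieve(url, dest_path)
    return dest_path
```

Calling `download_book("https://gutenberg.net.au/ebooks02/0200991.txt")` would save the default book as `my_book.txt`, just like the `wget` cell above.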



Then click `Runtime > Run All`

# Install and Import Packages

In [40]:
!pip install booknlp
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 9.2 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [41]:
from booknlp.booknlp import BookNLP
import json
from collections import Counter
from pathlib import Path
import pandas as pd
pd.options.display.max_rows = 200
pd.options.display.max_colwidth = 100

# Set Up BookNLP

In [42]:
model_params = {
		"pipeline":"entity,quote,supersense,event,coref", 
		"model":"big", 
	}

booknlp= BookNLP("en", model_params)

{'pipeline': 'entity,quote,supersense,event,coref', 'model': 'big'}
--- startup: 19.740 seconds ---


# Run BookNLP

Before we apply BookNLP, let's check to make sure our text file has the right character encoding.

If your text file has a character encoding that is not UTF-8 or ISO-8859-1, then you will need to uncomment the last two lines in the code cell below and manually enter the character encoding of the file before transforming it into a UTF-8 file.

In [43]:
# Path to the text file that BookNLP will process (must match inputFile below)
input_path = "renamed_Endgame_Script.txt"

# Check to see if text file opens with UTF-8 encoding
try:
    open(input_path, encoding='utf-8').read()
except UnicodeDecodeError:
    try:
        # Check to see if file opens with ISO-8859-1 encoding and, if so, rewrite the file as UTF-8
        text = open(input_path, encoding='ISO-8859-1').read()
        open(input_path, mode='w', encoding='utf-8').write(text)
    except UnicodeDecodeError:
        print("Character encoding error: You need to uncomment the lines below and specify the character encoding for this text file")

# Open the file with the current encoding
#text = open(input_path, encoding='Your Character Encoding Here').read()
# Rewrite the file as a UTF-8 file
#open(input_path, mode='w', encoding='utf-8').write(text)
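
The encoding check above can also be written as a small reusable function that tries a list of encodings in order. This is only a sketch: the function name `convert_to_utf8` and its parameters are hypothetical, and the candidate list is an assumption you would adjust for your own file. Note that ISO-8859-1 can decode any sequence of bytes, so it should come last in the list.

```python
from pathlib import Path


def convert_to_utf8(path, candidate_encodings=("utf-8", "ISO-8859-1")):
    """Rewrite the file at `path` as UTF-8 and return the encoding that decoded it.

    Hypothetical helper: encodings are tried in the order given.
    """
    raw_bytes = Path(path).read_bytes()
    for encoding in candidate_encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next candidate
        # Write bytes directly so line endings are left exactly as they were
        Path(path).write_bytes(text.encode("utf-8"))
        return encoding
    raise ValueError(f"None of {candidate_encodings} could decode {path}")
```

For a file in some other encoding, you could pass it explicitly, for example `convert_to_utf8("my_book.txt", candidate_encodings=("utf-8", "cp1252"))`.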

Apply BookNLP

In [44]:
inputFile = "renamed_Endgame_Script.txt"
outputDir = "AvengersFilms_dir/"
idd="my_film"

booknlp.process(inputFile, outputDir, idd)

--- spacy: 9.644 seconds ---
--- entities: 150.221 seconds ---
--- quotes: 0.080 seconds ---
--- attribution: 14.234 seconds ---
--- name coref: 0.412 seconds ---
--- coref: 200.872 seconds ---
--- TOTAL (excl. startup): 376.079 seconds ---, 33116 words


# Get Character Data

Load character data

In [45]:
character_data = json.load(open("AvengersFilms_dir/my_film.book"))

Make a counter

In [46]:
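def get_counter_from_dependency_list(dependency_list):
    """Count how often each word appears in a list of BookNLP dependency tokens."""
    counter = Counter()

    for token in dependency_list:
        # Each token is a dict holding the word ("w") and its token index in the text ("i")
        term = token["w"]
        counter[term] += 1
    return counter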
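
To see what this counter produces, here is a small self-contained sketch. It uses a few agent tokens copied from the character output shown in this notebook; the compact `count_words` function is a hypothetical stand-in for the function above.

```python
from collections import Counter

# A few agent tokens in BookNLP's format: "i" is the token index, "w" is the word
sample_agent_list = [
    {"i": 934, "w": "know"},
    {"i": 948, "w": "know"},
    {"i": 962, "w": "hope"},
    {"i": 1017, "w": "say"},
]


def count_words(dependency_list):
    """Count occurrences of each word, ignoring where it appears in the text."""
    return Counter(token["w"] for token in dependency_list)


top_words = count_words(sample_agent_list).most_common(2)
print(top_words)  # [('know', 2), ('hope', 1)]
```

`most_common(n)` returns `(word, count)` pairs sorted from most to least frequent, which is the format of the list-valued columns in the character table.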

Loop through character data and pull out information, then transform it into a DataFrame

In [47]:
character_data["characters"]

[{'agent': [{'i': 382, 'w': 'do'},
   {'i': 691, 'w': 'like'},
   {'i': 934, 'w': 'know'},
   {'i': 948, 'w': 'know'},
   {'i': 962, 'w': 'hope'},
   {'i': 1017, 'w': 'say'},
   {'i': 1020, 'w': 'feeling'},
   {'i': 1153, 'w': 'know'},
   {'i': 1155, 'w': 'said'},
   {'i': 1164, 'w': 'hoping'},
   {'i': 1164, 'w': 'hoping'},
   {'i': 1193, 'w': 'mean'},
   {'i': 1215, 'w': 'lie'},
   {'i': 1224, 'w': 'drift'},
   {'i': 1229, 'w': 'think'},
   {'i': 1522, 'w': 'lost'},
   {'i': 1813, 'w': 'thought'},
   {'i': 1874, 'w': 'fight'},
   {'i': 1939, 'w': 'saw'},
   {'i': 1948, 'w': 'had'},
   {'i': 1962, 'w': 'dreaming'},
   {'i': 1971, 'w': 'gon'},
   {'i': 1988, 'w': 'needed'},
   {'i': 2014, 'w': 'need'},
   {'i': 2037, 'w': 'need'},
   {'i': 2043, 'w': 'believe'},
   {'i': 2045, 'w': 'remember'},
   {'i': 2132, 'w': 'said'},
   {'i': 2312, 'w': 'got'},
   {'i': 2320, 'w': 'got'},
   {'i': 2320, 'w': 'got'},
   {'i': 2504, 'w': 'bring'},
   {'i': 2511, 'w': 'come'},
   {'i': 2610, 'w': 'k

In [49]:
df_list = []
for character in character_data["characters"]:
    
    agentList = character["agent"]
    patientList = character["patient"]
    possList = character["poss"]
    modList = character["mod"]
    character_id = character["id"]
    count = character["count"]
    # Default to "unknown" so a character without a prediction does not reuse the previous character's gender
    referential_gender_distribution = referential_gender = "unknown"

    if character["g"] is not None and character["g"] != "unknown":
        referential_gender_distribution=character["g"]["inference"]
        referential_gender=character["g"]["argmax"]

    mentions=character["mentions"]
    proper_mentions=mentions["proper"]
    max_proper_mention=""

    # only keep information about named characters
    if len(mentions["proper"]) > 0:
        max_proper_mention=mentions["proper"][0]["n"]
        
        df_list.append( {'Name':max_proper_mention , 'Character ID': character_id,
                         'Mentions': count,
                       'Gender': referential_gender,
                       'Possessives': get_counter_from_dependency_list(possList).most_common(10),
                       'Agent': get_counter_from_dependency_list(agentList).most_common(10),
                       'Patient': get_counter_from_dependency_list(patientList).most_common(10),
                       'Modifiers': get_counter_from_dependency_list(modList).most_common(10)}
        )
df = pd.DataFrame(df_list)
df['Character ID'] = df['Character ID'].astype(str)
df

| | Name | Character ID | Mentions | Gender | Possessives | Agent | Patient | Modifiers |
|---|---|---|---|---|---|---|---|---|
| 0 | TONY STARK | 85 | 519 | he/him/his | [(hand, 5), (reactor, 4), (Reactor, 3), (side, 3), (body, 3), (helmet, 2), (face, 2), (eyes, 2),... | [(know, 7), (did, 6), (sitting, 5), (falls, 4), (got, 4), (repeating, 4), (leans, 3), (looking, ... | [(repeating, 4), (Thank, 2), (hugs, 2), (puts, 1), (taps, 1), (carries, 1), (lies, 1), (forced, ... | [(able, 2), (sick, 1), (blood, 1), (small, 1), (fine, 1), (being, 1), (unresponsive, 1)] |
| 1 | STEVE ROGERS | 79 | 354 | he/him/his | [(shield, 7), (arm, 5), (hand, 4), (feet, 4), (communicator, 2), (beard, 1), (face, 1), (territo... | [(looks, 9), (know, 6), (share, 4), (walks, 4), (holding, 3), (look, 3), (throws, 3), (sighs, 2)... | [(see, 3), (sees, 2), (facing, 1), (joins, 1), (stop, 1), (lost, 1), (find, 1), (tell, 1), (hit,... | [(ROGERS, 1), (about, 1), (scared, 1), (wrong, 1), (ass, 1), (options, 1), (sure, 1), (new, 1)] |
| 2 | BRUCE | 119 | 283 | he/him/his | [(body, 3), (hands, 2), (future, 2), (hand, 2), (face, 2), (form, 2), (Mom, 1), (Ship, 1), (chan... | [(looks, 4), (say, 3), (walks, 3), (know, 3), (pushes, 2), (lost, 2), (presses, 2), (think, 2), ... | [(see, 4), (Thank, 2), (pushes, 1), (blamed, 1), (treating, 1), (shakes, 1), (liked, 1), (ignore... | [(right, 2), (okay, 2), (grateful, 1), (kind, 1)] |
| 3 | THOR | 96 | 282 | he/him/his | [(hand, 4), (head, 3), (face, 2), (headphones, 2), (mother, 2), (beard, 2), (Love, 2), (steps, 1... | [(walks, 5), (know, 5), (do, 4), (grabs, 3), (uses, 3), (gon, 3), (think, 3), (tries, 3), (holds... | [(walks, 3), (Thank, 2), (throwing, 2), (socking, 2), (takes, 2), (stop, 1), (brings, 1), (scare... | [(asleep, 2), (awake, 1), (dead, 1), (drunk, 1)] |
| 4 | SCOTT LANG | 108 | 260 | he/him/his | [(hometown, 2), (daughter, 2), (lunch, 2), (body, 1), (suit, 1), (gauntlets, 1), (screen, 1), (c... | [(sees, 4), (pushes, 3), (looks, 3), (pulls, 3), (drops, 3), (walks, 2), (arrives, 2), (studied,... | [(know, 2), (bring, 2), (flinging, 1), (shows, 1), (seen, 1), (showing, 1), (face, 1), (Thank, 1... | [(okay, 1), (open, 1), (surprised, 1), (person, 1), (right, 1)] |
| 5 | THANOS | 73 | 218 | he/him/his | [(sword, 6), (army, 6), (fingers, 3), (arm, 3), (head, 3), (helmet, 3), (plan, 2), (armour, 2), ... | [(looks, 3), (knows, 3), (does, 3), (puts, 3), (sees, 3), (wiped, 2), (walks, 2), (gets, 2), (si... | [(knocking, 3), (got, 2), (punches, 2), (crush, 2), (hunting, 1), (fought, 1), (fight, 1), (kill... | [(unbeatable, 1), (death, 1)] |
| 6 | CLINT BARTON | 74 | 195 | he/him/his | [(daughter, 3), (hand, 3), (wife, 2), (family, 2), (head, 2), (fingers, 1), (brothers, 1), (elbo... | [(looks, 5), (starts, 3), (sees, 3), (runs, 3), (finds, 3), (gives, 3), (look, 2), (picks, 2), (... | [(sees, 2), (corrects, 1), (done, 1), (doing, 1), (find, 1), (crying, 1), (look, 1), (leads, 1),... | [(alone, 1)] |
| 7 | Nebula | 84 | 169 | she/her | [(hand, 4), (head, 4), (self, 3), (hands, 2), (gun, 2), (help, 1), (throat, 1), (arm, 1), (Knife... | [(walks, 7), (shakes, 2), (walk, 2), (looks, 2), (fighting, 2), (throws, 2), (takes, 2), (points... | [(sees, 2), (love, 1), (thank, 1), (treated, 1), (facing, 1), (Helping, 1), (Bring, 1), (left, 1... | [(welcome, 1), (useless, 1)] |
| 8 | NATASHA ROMANOFF | 95 | 115 | he/him/his | [(laundry, 1), (Daddy, 1), (hand, 1)] | [(know, 4), (turns, 2), (paying, 1), (tell, 1), (seek, 1), (fear, 1), (lose, 1), (love, 1), (sta... | [(Panicking, 1), (crying, 1), (leads, 1), (knocks, 1), (Damn, 1)] | [(asleep, 1), (pain, 1)] |
| 9 | RHODEY | 97 | 90 | he/him/his | [(fingers, 1), (suit, 1), (colleague, 1), (friend, 1), (head, 1)] | [(flies, 2), (gets, 2), (comes, 1), (know, 1), (raise, 1), (aims, 1), (walks, 1), (makes, 1), (h... | [(counting, 1)] | [] |
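
The Gender column comes from the `"g"` field of each character record, which may be missing or set to `"unknown"`. One way to make that lookup explicit is a small helper like the sketch below. The function name `get_referential_gender` is hypothetical, and the probabilities in the sample record are made up purely for illustration; only the `"g"`, `"inference"`, and `"argmax"` keys come from the code above.

```python
def get_referential_gender(character):
    """Return the most likely referential gender label, or "unknown" if none was predicted."""
    gender_info = character.get("g")
    if gender_info is None or gender_info == "unknown":
        return "unknown"
    # "argmax" holds the label with the highest inferred probability
    return gender_info["argmax"]


# Illustrative records only: these inference values are invented, not BookNLP output
with_prediction = {"g": {"inference": {"she/her": 0.9, "he/him/his": 0.1}, "argmax": "she/her"}}
without_prediction = {"g": None}

print(get_referential_gender(with_prediction))
print(get_referential_gender(without_prediction))
```

Because the helper always returns a value, a character without a prediction can never pick up a label left over from an earlier character in the loop.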


# View Character Data

Let's view the most frequently mentioned characters as well as their referential gender, the actions for which they are the agent and patient, the objects they possess, and their modifiers.

In [50]:
df



# Get Named Entities

Read in named entities and view all named entities

In [51]:
# Read in named entities
entity_df = pd.read_csv("AvengersFilms_dir/my_film.entities", delimiter='\t')
# Merge character data and entity data on Character ID
entity_df['COREF'] = entity_df['COREF'].astype(str)
entity_df = pd.merge(df[['Character ID', 'Name']], entity_df, left_on = 'Character ID', right_on= 'COREF')
entity_df[:100]

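| | Character ID | Name | COREF | start_token | end_token | prop | cat | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 85 | TONY STARK | 85 | 599 | 599 | PROP | PER | Tony |
| 1 | 85 | TONY STARK | 85 | 626 | 626 | PROP | PER | Tony |
| 2 | 85 | TONY STARK | 85 | 629 | 630 | PROP | PER | TONY STARK |
| 3 | 85 | TONY STARK | 85 | 632 | 632 | PRON | PER | You |
| 4 | 85 | TONY STARK | 85 | 643 | 643 | PRON | PER | you |
| 5 | 85 | TONY STARK | 85 | 656 | 656 | PROP | PER | Tony |
| 6 | 85 | TONY STARK | 85 | 674 | 674 | PROP | PER | Tony |
| 7 | 85 | TONY STARK | 85 | 703 | 703 | PROP | PER | Tony |
| 8 | 85 | TONY STARK | 85 | 705 | 706 | PROP | PER | TONY STARK |
| 9 | 85 | TONY STARK | 85 | 722 | 722 | PROP | PER | Tony |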


# Create Character Proximity Network Data

We want to make a network where characters have a connection, or edge, if they appear near each other in the text (within a fixed number of tokens of one another, set by `threshold_distance` in the code below). To make this measurement, we can take the `start_token` of each character mention and subtract one from the other.

First, we will make all the names and starting tokens into lists. Then we will zip them together.

In [52]:
names = entity_df['Name'].tolist()
start_tokens = entity_df['start_token'].tolist()

Then we will use `itertools` to make all pairwise combinations of the character mentions and their starting tokens.

In [53]:
import itertools

edge_list = []
threshold_distance = 15

# Make all possible combinations of characters and start tokens
for person, another_person in itertools.combinations(zip(names, start_tokens), 2):
    
    # Measure the distance between tokens
    distance = abs(person[1] - another_person[1])
    # If distance is smaller than the threshold
    if distance < threshold_distance:
        # and it's not the same person
        if person[0] != another_person[0]:
            # add the edge to the list
            edge_list.append((person[0], another_person[0]))

character_df = pd.DataFrame(Counter(edge_list).most_common(), columns=['character_pair', 'edge_weight'])
character_df['character1']=character_df['character_pair'].str[0]
character_df['character2']=character_df['character_pair'].str[1]
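
Comparing every mention with every other mention grows quadratically with the number of mentions, which can be slow for long books. As an alternative, here is a sketch that sorts the mentions by start token and only compares each mention with the ones inside its window. The function name `proximity_edges` is hypothetical, and unlike the cell above it stores each pair in alphabetical order, so `("A", "B")` and `("B", "A")` are counted as the same edge.

```python
from collections import Counter


def proximity_edges(names, start_tokens, threshold_distance=15):
    """Count pairs of different characters whose mentions start fewer than
    `threshold_distance` tokens apart (sliding-window sketch)."""
    mentions = sorted(zip(start_tokens, names))
    edge_counts = Counter()
    window_start = 0

    for current, (current_token, current_name) in enumerate(mentions):
        # Drop mentions from the window once they are too far behind
        while (window_start < current
               and current_token - mentions[window_start][0] >= threshold_distance):
            window_start += 1
        # Every remaining mention in the window is within the threshold
        for earlier in range(window_start, current):
            earlier_name = mentions[earlier][1]
            if earlier_name != current_name:
                pair = tuple(sorted((earlier_name, current_name)))
                edge_counts[pair] += 1
    return edge_counts
```

With `names` and `start_tokens` defined as above, `proximity_edges(names, start_tokens).most_common()` gives `(pair, weight)` tuples that can be passed to `pd.DataFrame` in the same way as `Counter(edge_list).most_common()`.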


In [54]:
character_df[:50]

| | character_pair | edge_weight | character1 | character2 |
|---|---|---|---|---|
| 0 | (TONY STARK, STEVE ROGERS) | 209 | TONY STARK | STEVE ROGERS |
| 1 | (TONY STARK, SCOTT LANG) | 109 | TONY STARK | SCOTT LANG |
| 2 | (CLINT BARTON, NATASHA ROMANOFF) | 93 | CLINT BARTON | NATASHA ROMANOFF |
| 3 | (BRUCE, SCOTT LANG) | 87 | BRUCE | SCOTT LANG |
| 4 | (TONY STARK, Pepper) | 77 | TONY STARK | Pepper |
| 5 | (STEVE ROGERS, SCOTT LANG) | 75 | STEVE ROGERS | SCOTT LANG |
| 6 | (STEVE ROGERS, BRUCE) | 74 | STEVE ROGERS | BRUCE |
| 7 | (STEVE ROGERS, THANOS) | 66 | STEVE ROGERS | THANOS |
| 8 | (TONY STARK, THANOS) | 63 | TONY STARK | THANOS |
| 9 | (BRUCE, THOR) | 63 | BRUCE | THOR |


Write to CSV (where you can then download this data)

In [55]:
character_df.to_csv('EG-Character-Edge-List.csv', index=False)
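
Once the edge list is saved, a quick sanity check is to add up each character's edge weights and see who is most connected. The sketch below reads the CSV back with Python's standard `csv` module; the function name `weighted_degree_from_csv` is hypothetical, and it assumes the `character1`, `character2`, and `edge_weight` columns written by the cell above.

```python
import csv
from collections import Counter


def weighted_degree_from_csv(csv_path):
    """Sum the edge weights attached to each character in an edge-list CSV."""
    totals = Counter()
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        for row in csv.DictReader(csv_file):
            weight = int(row["edge_weight"])
            # An edge counts toward both of the characters it connects
            totals[row["character1"]] += weight
            totals[row["character2"]] += weight
    return totals
```

For example, `weighted_degree_from_csv('EG-Character-Edge-List.csv').most_common(5)` would list the five characters with the highest total edge weight.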