# How to create semantic speech networks

This notebook demos the process of creating semantic speech networks in an interactive notebook using netts.

### Contents:  
1. [Installing Netts](#1.-Installing-Netts)  
2. [Constructing a semantic speech network](#2.-Constructing-a-semantic-speech-network)  
   1. [Command Line Interface](#2.1-Command-Line-Interface)
   2. [Python Interface](#2.2-Python-Interface)
3. [Constructing several networks](#3.-Constructing-several-networks)
4. [How does netts work?](#4.-How-does-netts-work?)

Caroline Nettekoven & Ryan Daniels\
February 2023, University of Cambridge

Find more information about netts on [my website](https://www.caroline-nettekoven.com/post/netts/), in the [netts documentation](https://alan-turing-institute.github.io/netts/) or in our [preprint](https://doi.org/10.1101/2022.02.25.22271517) (in press at [Schizophrenia Bulletin](https://academic.oup.com/schizophreniabulletin)). You can also read more about netts on the [Accelerate Science Cambridge Blog](https://acceleratescience.github.io/blog).

## 1. Installing Netts

We will begin by quickly walking you through installing netts. Ideally, create a Python virtual environment in your project folder and install netts into it ([this guide](https://docs.python.org/3/library/venv.html) shows you how). If you prefer not to work with virtual environments, you can also install netts without one.

Set up a virtual environment of your choice. Here we use `venv` and Python 3.9:

```bash
python3.9 -m venv .venv
source .venv/bin/activate
```


To install the latest official release of netts from PyPI, open up a [terminal](https://www.computerhope.com/jargon/t/terminal.htm) and from the command line prompt run:

```bash
pip install netts
```


### Install Additional Dependencies

Netts may require the Java Runtime Environment. Instructions for downloading and installing for your operating system can be found [here](https://docs.oracle.com/goldengate/1212/gg-winux/GDRAD/java.htm#BGBFHBEA).
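
Before installing the remaining dependencies, it can help to check whether a Java runtime is already visible from Python. The snippet below is a minimal sketch that only looks for a `java` executable on your `PATH`; the helper name `java_available` is our own and not part of netts.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    print("Java found.")
else:
    print("Java not found; install a Java Runtime Environment first.")
```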

Netts requires additional dependencies including [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) and [OpenIE](https://github.com/dair-iitd/OpenIE-standalone). You can install them either directly from the netts CLI or in Python.

To install using the CLI:

```bash
netts install
```

To install in a notebook:

In [1]:
import netts

settings = netts.get_settings()
print(f"Installing dependencies to {settings.netts_dir}")
netts.install_dependencies()


Installing dependencies to /Users/callithrix/netts


## 2. Constructing a semantic speech network

Netts takes speech transcripts and converts them into a semantic graph. Imagine we have the following short transcript in a file called `transcript_1.txt` in the folder `transcripts/`:

> I see a man and he is wearing a jacket. He is standing in the dark against a light post. On the picture there seems to be like a park and... Or trees but in those trees there are little balls of light reflections as well. I cannot see the... Anything else because it’s very dark. But the man on the picture seems to wear a hat and he seems to have a hoodie on as well. The picture is very mysterious, which I like about it, but for me I would like to understand more about the picture.

We can create a semantic graph from the transcript using either the command line interface (CLI) or the Python package.

Let's first print the content of `transcript_1.txt`:



In [3]:
%%bash
cat transcripts/transcript_1.txt


I see a man and he is wearing a jacket. He is standing in the dark against a light post. On the picture there seems to be like a park and... Or trees but in those trees there are little balls of light reflections as well. I cannot see the... Anything else because it’s very dark. But the man on the picture seems to wear a hat and he seems to have a hoodie on as well. The picture is very mysterious, which I like about it, but for me I would like to understand more about the picture.

### 2.1 Command Line Interface
We can process a single transcript with the Command Line Interface (CLI) like this:


```bash
netts run transcript.txt outputs
```

#### Inputs

We can break this down into the following components:

| CLI Command | transcript.txt     | outputs                  |
| ----------- | ------------------ | ------------------------ |
| netts run   | Path to transcript | path of output directory |

1. `transcript.txt` can be replaced with the full path to any `.txt` file.
2. `outputs` can be replaced with the path to any directory. If the directory does not exist yet netts will create it.

We're going to run this inside a Jupyter notebook cell now:

In [7]:
%%bash
netts run transcripts/transcript_1.txt output_folder

[03/22/23 19:47:41] INFO     For logging information, please check              
                             /Users/callithrix/Documents/Projects/Cambridge_Nett
                             s/code/netts_demo/netts_log.log                    
Starting CoreNLP Server...
Processing Transcript(s)...


Netts will let you know if it has found output files (png or pickle files) for the transcripts you are trying to process. In that case, netts will give a warning and skip any transcripts that have already been processed. If you would like to generate these files again with netts, move your old files out of the folder or rename them. Then re-run netts on your input transcript.

#### Outputs

Once netts processes the transcript the output directory will contain two files:

```text
outputs/
    transcript.pickle
    transcript.png
```

The file prefix is taken from the input file (in this case <code><ins style="color: green; text-decoration-color: green;">transcript</ins>.txt</code>)
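
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here (input file stem plus `.pickle` and `.png`); the helper `expected_outputs` is hypothetical and not a netts function.

```python
from pathlib import Path


def expected_outputs(transcript_path, output_dir):
    """Return the pickle and png paths we expect netts to write for a transcript."""
    stem = Path(transcript_path).stem  # "transcript_1.txt" -> "transcript_1"
    out = Path(output_dir)
    return out / f"{stem}.pickle", out / f"{stem}.png"


pickle_path, png_path = expected_outputs("transcripts/transcript_1.txt", "output_folder")
print(pickle_path.name, png_path.name)  # transcript_1.pickle transcript_1.png
```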

Netts will also produce a log file. Any serious issues will be printed out to the console, but minor pieces of information will end up in a log file located in your output directory.

### 2.2 Python Interface

If you don't want to use the netts command line interface (CLI), or want more control over netts, you can use the netts Python package directly. Here we'll run through the example transcript in Python.

The transcript `transcript_1.txt` is in the directory `transcripts/`. We will open it with Python and process it with netts.

In [11]:
import matplotlib.pyplot as plt
import netts

with netts.OpenIEClient() as openie_client, netts.CoreNLPClient(
    properties={"annotators": "tokenize,ssplit,pos,lemma,parse,depparse,coref,openie", "be_quiet": "true"},
) as corenlp_client:

    with open("transcripts/transcript_1.txt", encoding="utf-8") as f:
        transcript = f.read()

    network = netts.SpeechGraph(transcript)

    network.process(
        openie_client=openie_client,
        corenlp_client=corenlp_client,
        preprocess_config=settings.netts_config.preprocess,
    )

INFO:stanza:Writing properties to tmp file: corenlp_server-4aeaa1168751465d.props


INFO:stanza:Starting server with command: java -Xmx5G -cp /Users/callithrix/netts/stanza_corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8089 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet False -serverProperties corenlp_server-4aeaa1168751465d.props -preload -outputFormat serialized
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
			(Note: unspecified annotator properties are English defaults)
			annotators = tokenize,ssplit,pos,lemma,parse,depparse,coref,openie
			be_quiet = true
			inputFormat = text
			outputFormat = serialized
			prettyPrint = false
			threads = 5
[main] INFO CoreNLP - Threads: 5
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words-dists

I see a man and he is wearing a jacket . He is standing in the dark against a light post . On the picture there seems to be like a park and ... Or trees but in those trees there are little balls of light reflections as well . I can not see the ... Anything else because it is very dark . But the man on the picture seems to wear a hat and he seems to have a hoodie on as well . The picture is very mysterious , which I like about it , but for me I would like to understand more about the picture .


[pool-1-thread-3] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [0.8 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.4 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.4 sec].
[pool-1-thread-3] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
[pool-1-thread-3] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sut

INFO:netts:  he | is wearing | a jacket


INFO:netts:  i | see | a man


INFO:netts:  he | is standing | in the dark


INFO:netts:  there | seems | on the picture


INFO:netts:  it | is | very dark


INFO:netts:  i | can not see | the ... anything else


INFO:netts:  he | to have | a hoodie on


INFO:netts:  the man on the picture | to wear | a hat


INFO:netts:  i | would like to understand | more about the picture


INFO:netts:  i | like | about it


INFO:netts:  the picture | is | very mysterious


INFO:netts:++++ Created 11 edges (ollie) ++++


INFO:netts:  i | see | man


INFO:netts:  he | is wearing | jacket


INFO:netts:  he | is standing in | dark


INFO:netts:  he | is standing against | light post


INFO:netts:  he | is standing against | post


INFO:netts:  man | wear | hat


INFO:netts:  picture | is | very mysterious


INFO:netts:  picture | is | mysterious


INFO:netts:++++ Created 8 edges (stanza).  ++++


INFO:netts:++++ Ollie detected 11 edges, so no stanza edges were added.  ++++


INFO:netts:++++ Obtained word types ++++


INFO:netts:++++ Created 0 adj edges ++++


INFO:netts:  balls | of | reflections


INFO:netts:  man | on | picture


INFO:netts:  more | about | picture


INFO:netts:++++ Created 3 preposition edges ++++


INFO:netts: he | in | dark


INFO:netts: he | against | post


INFO:netts: i | about | it


INFO:netts: i | for | me


INFO:netts:++++ Created 4 oblique edges ++++


INFO:netts:++++ Added 4 oblique edges. Total edges: 11 ++++


INFO:netts:{'the man on the picture': [(4, 'he'), (4, 'the man on the picture')], 'a man': [(1, 'he'), (0, 'a man'), (0, 'he')], 'the picture': [(5, 'it'), (2, 'the picture')], 'i': [(0, 'i'), (3, 'i'), (5, 'me'), (5, 'i'), (5, 'i')]}


INFO:netts:++++ Obtained 4 node synonyms ++++


INFO:netts:++++ Split node name synonyms on prepositions. ++++


INFO:netts:the man 		|	 on the picture  -- Adding ('man', 'picture', {'relation': 'on', 'extractor': 'preposition', 'sentence': 4})


INFO:netts:++++ Split nodes on prepositions. ++++


INFO:netts:Replaced 'he' with 'a man' in ('he', 'a jacket')


INFO:netts:Replaced 'i' with 'i' in ('i', 'a man')


INFO:netts:Replaced 'i' with 'i' in ('i', 'a man')


INFO:netts:Replaced 'a man' with 'a man' in ('i', 'a man')


INFO:netts:Replaced 'he' with 'a man' in ('he', 'in the dark')


INFO:netts:Replaced 'on the picture' with 'the picture' in ('there', 'on the picture')


INFO:netts:Replaced 'i' with 'i' in ('i', 'the ... anything else')


INFO:netts:Replaced 'i' with 'i' in ('i', 'the ... anything else')


INFO:netts:Replaced 'he' with 'the man' in ('he', 'a hoodie on')


INFO:netts:Replaced 'the man' with 'the man' in ('the man', 'a hat')


INFO:netts:Replaced 'i' with 'i' in ('i', 'more about the picture')


INFO:netts:Replaced 'i' with 'i' in ('i', 'more about the picture')


INFO:netts:Replaced 'i' with 'i' in ('i', 'more about the picture')


INFO:netts:Replaced 'i' with 'i' in ('i', 'about it')


INFO:netts:Replaced 'i' with 'i' in ('i', 'about it')


INFO:netts:Replaced 'i' with 'i' in ('i', 'about it')


INFO:netts:Replaced 'about it' with 'the picture' in ('i', 'about it')


INFO:netts:Replaced 'man' with 'the man' in ('man', 'picture')


INFO:netts:++++ Merged nodes that are referenced several times. ++++


INFO:netts:Replaced 'man' with 'the man' in ('man', 'picture')


INFO:netts:++++ Merged nodes that are referenced several times. ++++


INFO:netts:++++ Merged nodes that are referenced several times. ++++


INFO:netts:Replaced 'he' with 'a man' in ('he', 'dark')


INFO:netts:Replaced 'he' with 'a man' in ('he', 'post')


INFO:netts:Replaced 'i' with 'i' in ('i', 'it')


INFO:netts:Replaced 'i' with 'i' in ('i', 'it')


INFO:netts:Replaced 'i' with 'i' in ('i', 'it')


INFO:netts:Replaced 'it' with 'the picture' in ('i', 'it')


INFO:netts:Replaced 'i' with 'i' in ('i', 'me')


INFO:netts:Replaced 'i' with 'i' in ('i', 'me')


INFO:netts:Replaced 'i' with 'i' in ('i', 'me')


INFO:netts:Replaced 'me' with 'i' in ('i', 'me')


INFO:netts:++++ Merged nodes that are referenced several times. ++++


INFO:netts:++++ Added adjective edges: True ++++


INFO:netts:++++ Added all preposition edges: True ++++


INFO:netts:++++ Obtained unconnected nodes ++++


INFO:netts:++++ Cleaned nodes. ++++


INFO:netts:@@@@{0}


INFO:netts:++++ Cleaned parallel edges from duplicates. ++++
[Thread-0] INFO CoreNLP - CoreNLP Server is shutting down.


As you can see, netts produces a lot of output. At the command line, you will be able to choose if you want all the output printed, or if you would like netts to run quietly. By default, netts runs quietly and only prints the full output to the log file. If you would like to see the full netts output at the command line, for example to check that every step runs correctly, you can use the command line option `--verbose`. To do that, you run: `netts --verbose transcript_1.txt output_folder`.

We can save the constructed semantic speech network as a [pickle](https://docs.python.org/3/library/pickle.html) file for later analysis. Use the netts function `pickle_graph` for this.

In [12]:
# Save the graph object as a pickle file
with open("output_folder/transcript.pickle", "wb") as output_f:
    netts.pickle_graph(network, output_f)


Pickle files are a way to save Python objects. You can later load the saved pickle file back into Python using the built-in `pickle` module:

In [13]:
import pickle
with open("output_folder/transcript.pickle", "rb") as graph_file:
    network = pickle.load(graph_file)

## 3. Constructing several networks


If you have a folder of transcripts, you can process the entire folder with the CLI. For example, if you have a folder called `input_folder/` inside `transcripts/`:

```text
transcripts/
    input_folder/
        transcript_2.txt
        transcript_3.txt
        transcript_4.txt
        transcript_5.txt
```

You can process all of them by submitting the folder to netts with the Command Line:


In [3]:
%%bash
netts run transcripts/input_folder output_folder

[03/22/23 19:19:08] INFO     For logging information, please check              
                             /Users/callithrix/Documents/Projects/Cambridge_Nett
                             s/code/netts_demo/netts_log.log                    
Starting CoreNLP Server...
Processing Transcript(s)...
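
Since netts skips transcripts that already have output files, it can be handy to see in advance which transcripts in a folder are still unprocessed. The sketch below assumes the output naming described earlier (file stem plus `.pickle` or `.png`); `pending_transcripts` is a hypothetical helper, and the demo runs on a throwaway temporary folder with illustrative file names.

```python
import tempfile
from pathlib import Path


def pending_transcripts(input_dir, output_dir):
    """List .txt transcripts in input_dir that have no pickle or png output yet."""
    pending = []
    for txt in sorted(Path(input_dir).glob("*.txt")):
        outputs = [Path(output_dir) / f"{txt.stem}{ext}" for ext in (".pickle", ".png")]
        if not any(path.exists() for path in outputs):
            pending.append(txt.name)
    return pending


# Demo on a temporary folder layout that mimics the example above.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp, "input_folder")
    output_dir = Path(tmp, "output_folder")
    input_dir.mkdir()
    output_dir.mkdir()
    for n in (2, 3, 4):
        (input_dir / f"transcript_{n}.txt").write_text("I see a man.", encoding="utf-8")
    # Pretend transcript_3 was processed in an earlier run.
    (output_dir / "transcript_3.pickle").write_bytes(b"")
    result = pending_transcripts(input_dir, output_dir)

print(result)  # ['transcript_2.txt', 'transcript_4.txt']
```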


## 4. How does netts work?
Let's quickly walk through how netts works. This is more extensively covered in our [preprint](https://doi.org/10.1101/2022.02.25.22271517) and paper (in press at [Schizophrenia Bulletin](https://academic.oup.com/schizophreniabulletin)), but briefly reviewing the processing steps will help us understand the output. If you would like to directly move on to the analysis, feel free to skip this section and continue with the Jupyter notebook on [analysing netts networks](#3.-Plotting-a-semantic-speech-network).

![Netts pipeline.](img/Pipeline_figure_reduced.png)


## Preprocessing

Netts first expands the most common English contractions (e.g. expanding <em>I'm</em> to <em>I am</em>).
It then removes interjections (<em>Mh</em>, <em>Uhm</em>).
Netts also removes any transcription notes (e.g. timestamps, <em>[inaudible]</em>) that were inserted by the transcriber.
The user can pass a file of transcription notes that should be removed from the transcripts before processing.
See [Configuration](https://alan-turing-institute.github.io/netts/configuration/) for a step-by-step guide on passing custom transcription notes to netts for removal.
To stay as close to the original speech as possible, netts does not remove stop words or punctuation.
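
To make these preprocessing steps concrete, here is a rough sketch using regular expressions. It is not the netts implementation: the contraction and interjection lists are tiny illustrative subsets, and the bracket pattern for transcription notes is an assumption.

```python
import re

# Illustrative subsets only; netts uses its own, configurable lists.
CONTRACTIONS = {"i'm": "I am", "it's": "it is", "don't": "do not"}
INTERJECTIONS = ("mh", "uhm", "um")


def preprocess(text):
    # Normalise typographic apostrophes so that "it’s" matches "it's".
    text = text.replace("’", "'")
    # Remove bracketed transcription notes such as [inaudible].
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Expand contractions on word boundaries, ignoring case.
    for short, full in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    # Remove interjections, together with a directly following comma.
    pattern = r"\b(?:" + "|".join(INTERJECTIONS) + r")\b,?"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Uhm, I'm not sure [inaudible] because it’s very dark."))
# I am not sure because it is very dark.
```

Note that stop words and punctuation are left untouched, in line with the description above.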

Netts then uses [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) to perform sentence splitting, tokenization, part of speech tagging, lemmatization, dependency parsing and co-referencing on the transcript.
Netts uses the default language model implemented in CoreNLP.

We describe these Natural Language Processing steps briefly in the following.
The transcript is first split into sentences (sentence splitting).
It is then further split into meaningful entities, usually words (tokenization).
Each word is assigned a part of speech label.
The part of speech label indicates whether the word is a verb, noun, or another part of speech (part of speech tagging).
Each word is also assigned its dictionary form or lemma (lemmatization).
Next, the grammatical relationship between words is identified (dependency parsing).
Finally, any occurrences where two or more expressions in the transcript refer to the same entity are identified (co-referencing), for example where the noun <em>man</em> and the pronoun <em>he</em> refer to the same person.

## Finding nodes and edges

Netts submits each sentence to [OpenIE5](https://github.com/dair-iitd/OpenIE-standalone) for relation extraction.
OpenIE5 extracts semantic relationships between entities from the sentence.
For example, performing relation extraction on the sentence <em>I see a man</em> identifies the relation <em>see</em> between the entities <em>I</em> and <em>a man</em>.
From these extracted relations, netts creates an initial list of the edges that will be present in the semantic speech network.
In the edge list, the entities are the nodes and the relations are the edge labels.
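
As an illustration of how extracted relations map onto an edge list, the sketch below takes three subject, relation, object triples copied from the log output above and turns them into labelled, directed edges. The dictionary layout is our own choice for readability and is not the internal netts data structure.

```python
# Triples (subject, relation, object) copied from the log output above.
triples = [
    ("i", "see", "a man"),
    ("he", "is wearing", "a jacket"),
    ("the picture", "is", "very mysterious"),
]

# Each triple becomes one directed edge: the entities are the nodes,
# the relation is the edge label.
edge_list = [
    {"source": subj, "target": obj, "relation": rel}
    for subj, rel, obj in triples
]

nodes = sorted({e["source"] for e in edge_list} | {e["target"] for e in edge_list})
print(nodes)
for edge in edge_list:
    print(f'{edge["source"]} --[{edge["relation"]}]--> {edge["target"]}')
```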

Next, netts uses the part of speech tags and dependency structure to extract edges defined by adjectives or prepositions.
For instance, <em>a man on the picture</em> contains a preposition edge where the entities <em>a man</em> and <em>the picture</em> are linked by an edge labelled <em>on</em>.
An example of an adjective edge would be <em>dark background</em>.
Here, <em>dark</em> and <em>background</em> are linked by an implicit <em>is</em>.
These adjective edges and preposition edges are added to the edge list.
During the next processing steps this edge list is further refined.
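
The idea behind adjective and preposition edges can be sketched with hand-tagged tokens. This is a deliberate simplification: netts relies on the CoreNLP dependency structure, whereas the toy function below only looks at neighbouring part of speech tags. The tags are written by hand, and the edge direction for adjectives (noun, <em>is</em>, adjective) is an assumption.

```python
def simple_edges(tokens):
    """Find adjective and preposition edges from adjacent tags only (a simplification)."""
    # Ignore determiners so that "on the picture" reads as "on picture".
    content = [(word, tag) for word, tag in tokens if tag != "DT"]
    edges = []
    for i, (word, tag) in enumerate(content):
        prev_tag = content[i - 1][1] if i > 0 else None
        next_word, next_tag = content[i + 1] if i + 1 < len(content) else (None, None)
        # Adjective directly before a noun: noun --is--> adjective.
        if tag == "JJ" and next_tag == "NN":
            edges.append((next_word, "is", word, "adjective"))
        # Noun, preposition, noun: noun --preposition--> noun.
        if tag == "IN" and prev_tag == "NN" and next_tag == "NN":
            edges.append((content[i - 1][0], word, next_word, "preposition"))
    return edges


# Hand-tagged phrases (Penn Treebank tags) standing in for CoreNLP output.
phrase_1 = [("a", "DT"), ("man", "NN"), ("on", "IN"), ("the", "DT"), ("picture", "NN")]
phrase_2 = [("dark", "JJ"), ("background", "NN")]

print(simple_edges(phrase_1))  # [('man', 'on', 'picture', 'preposition')]
print(simple_edges(phrase_2))  # [('background', 'is', 'dark', 'adjective')]
```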

## Refining nodes and edges

After creating the edge list, netts uses the co-referencing information to merge nodes that refer to the same entity.
This takes into account cases where different words refer to the same entity, for example where the pronoun <em>he</em> or the synonym <em>the guy</em> is used to refer to <em>a man</em>.
Every entity mentioned in the text should be represented by a unique node in the semantic speech network.
Therefore, nodes referring to the same entity are merged by replacing the node label in the edge list with the most representative node label (first mention of the entity that is a noun).
In the example above, <em>he</em> and <em>the guy</em> would be replaced by <em>a man</em>.
Node labels are then cleaned of superfluous words such as determiners.
For example, <em>a man</em> would turn into <em>man</em>.
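
The merging and cleaning steps can be sketched as follows. The coreference cluster is written by hand here (netts obtains it from CoreNLP), the third edge is made up to include <em>the guy</em>, and the determiner list is a small illustrative subset.

```python
# Hand-written coreference cluster: representative label -> other mentions.
clusters = {"a man": ["he", "the guy"]}
DETERMINERS = {"a", "an", "the"}

edges = [
    ("i", "see", "a man"),
    ("he", "is wearing", "a jacket"),
    ("the guy", "wear", "a hat"),  # made-up edge to show the synonym case
]

# Invert the clusters into a mention -> representative lookup.
representative = {
    mention: rep for rep, mentions in clusters.items() for mention in mentions
}


def clean(label):
    """Drop determiners from a node label, e.g. 'a man' -> 'man'."""
    return " ".join(word for word in label.split() if word not in DETERMINERS)


def refine(edge):
    """Replace mentions by their representative label, then clean both nodes."""
    subj, rel, obj = edge
    subj = representative.get(subj, subj)
    obj = representative.get(obj, obj)
    return clean(subj), rel, clean(obj)


refined = [refine(edge) for edge in edges]
print(refined)
# [('i', 'see', 'man'), ('man', 'is wearing', 'jacket'), ('man', 'wear', 'hat')]
```

After refinement, all three edges attach to the single node <em>man</em>.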

## Constructing network

In the final step, netts constructs a semantic speech network from the edge list using [networkx](https://networkx.org/).
Netts then plots the network and saves the output.
The output consists of the networkx object, the network image and the log messages from netts.
The resulting network (a [MultiDiGraph](https://networkx.org/documentation/stable/reference/classes/multidigraph.html)) is directed and unweighted, and can have parallel edges and self-loops.
Parallel edges are two or more edges that link the same two nodes in the same direction.
A self-loop is an edge that links a node with itself.
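
To make parallel edges and self-loops concrete, the sketch below counts them in a small edge list using only the standard library. The edges are loosely based on the log output above (after merging <em>he</em> into <em>man</em> and <em>me</em> into <em>i</em>) and are not the exact final network; in practice you would query the networkx graph that netts returns.

```python
from collections import Counter

# (source, target, relation) edges, loosely based on the log output above.
edges = [
    ("man", "jacket", "is wearing"),
    ("man", "post", "is standing against"),
    ("man", "post", "against"),  # parallel edge: same source, same target
    ("i", "i", "for"),           # self-loop: source and target are the same node
]

# Parallel edges: more than one edge between the same ordered pair of nodes.
pair_counts = Counter((src, dst) for src, dst, _ in edges)
parallel = {pair: n for pair, n in pair_counts.items() if n > 1}

# Self-loops: edges whose source equals their target.
self_loops = [(src, rel) for src, dst, rel in edges if src == dst]

print(parallel)    # {('man', 'post'): 2}
print(self_loops)  # [('i', 'for')]
```

Because the pairs are ordered, an edge from <em>post</em> to <em>man</em> would not count as parallel to the two edges from <em>man</em> to <em>post</em>.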


Now that we understand how netts creates semantic speech networks, let's look at how we can analyze them. Head on over to the Jupyter notebook on [analysing netts networks](#3.-Plotting-a-semantic-speech-network).