# `relatio` for analysis of narratives in the survey data
**Runtime $\sim$ 10min**

----------------------------

This code is based on a short demo of the package `relatio` publicly available from the authors of the relatio method. I use the the main wrapper functions to quickly obtain narrative statements from the corpus of my survey responses. For further details, please refer to the paper: ["Text Semantics Capture Political and Economic Narratives"](https://arxiv.org/abs/2108.01720) 

Notes on inputs and outputs:
It takes as input a text corpus and outputs a list of narrative statements. The pipeline is unsupervised: the user does not need to specify narratives beforehand. Narrative statements are defined as tuples of semantic roles with a (agent, verb, patient, attribute) structure. 

----------------------------

In [27]:
#import data
import pandas as pd

survey_data_filtered_file = '../datasets/spur_survey_response_filtered_df1.txt'
survey_df = pd.read_csv(survey_data_filtered_file, sep='\t')

subset_col = 'Q12.6_corrected'
df = survey_df[['ResponseId', subset_col]]
df = df.rename(columns={'ResponseId': 'id', subset_col : 'doc'})
df.head()

Unnamed: 0,id,doc
0,R_10PI5FKTTlId8Ec,I'm not yo I'm not talking to you
1,R_1gduid1fizdQ4d8,It is the way it is
2,R_23WCgmyuPAy3b1G,Because it would be more fair to people that d...
3,R_1pLEMXp5iJbankG,I choose proposal 2 because you offered me bet...
4,R_2OYEXKWutkzQOmS,Because I felt like it wouldn't be good enough...


In [2]:
#import stop words from ntlk (less extensive list):
import nltk
from nltk.corpus import stopwords
nltk_stops = stopwords.words('english')
nltk_stops.extend(['u'])
print(nltk_stops)
nltk_stops = set(nltk_stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

## Step 1: Split into sentences

----------------------------

For any new corpus, the first thing you will want to do is to split the corpus into sentences.

We do this on the first 100 tweets. 

The output is two lists: one with an index for the document and one with the resulting split sentences.

----------------------------


In [3]:
from relatio.utils import split_into_sentences

split_sentences = split_into_sentences(
    df, progress_bar=True
)

for i in range(5):
    print('Document id: %s' %split_sentences[0][i])
    print('Sentence: %s \n' %split_sentences[1][i])

Splitting into sentences...


100%|██████████████████████████████████████| 4886/4886 [00:09<00:00, 497.26it/s]

Document id: R_10PI5FKTTlId8Ec
Sentence: I'm not yo 

Document id: R_10PI5FKTTlId8Ec
Sentence: I'm not talking to you 

Document id: R_1gduid1fizdQ4d8
Sentence: It is the way it is 

Document id: R_23WCgmyuPAy3b1G
Sentence: Because it would be more fair to people that don't have much 

Document id: R_1pLEMXp5iJbankG
Sentence: I choose proposal 2 because you offered me better things than proposal 1. 






## Step 2: Annotate semantic roles

----------------------------

Once the corpus is split into sentences. You can feed it to the semantic role labeler.

The output is a list of json objects which contain the semantic role annotations for each sentence in the corpus.

----------------------------


In [4]:
# Note that SRL is time-consuming, in particular on CPUs.
# To speed up the annotation, you can also use GPUs via the "cuda_device" argument of the "run_srl()" function. 

from relatio.wrappers import run_srl

srl_res = run_srl(
    path="https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz", # pre-trained model
    sentences=split_sentences[1],
    progress_bar=True,
)

2022-02-18 18:18:34,489 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2022-02-18 18:18:34,791 - INFO - cached_path - cache of https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz is up-to-date
2022-02-18 18:18:34,792 - INFO - allennlp.models.archival - loading archive file https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz from cache at /Users/emilyrobitschek/.allennlp/cache/60314a853eb0aaa774d176d878c62469d49872feb4f2bfd071a75c77f6d76707.1b91cc27e347f2df04ce771a304bee2b70a2c487626b67e277d44c593b868c25
2022-02-18 18:18:34,793 - INFO - allennlp.models.archival - extracting archive file /Users/emilyrobitschek/.allennlp/cache/60314a853eb0aaa774d176d878c62469d49872feb4f2bfd071a75c77f6d76707.1b91cc27e347f2df04ce771a304bee2b70a2c487626b67e277d44c593b868c25 to temp dir /var/folders/lf/h87x9m414hq53x2kh59fl0xw0000gp/T/tmpidwvcimz
2022-02-18 18:18:35,315 - INFO - allennlp.common.params - dataset_reader.type 

2022-02-18 18:18:35,763 - INFO - allennlp.nn.initializers -    encoder._module.layer_2.cell.state_linearity.weight
2022-02-18 18:18:35,764 - INFO - allennlp.nn.initializers -    encoder._module.layer_3.cell.input_linearity.bias
2022-02-18 18:18:35,765 - INFO - allennlp.nn.initializers -    encoder._module.layer_3.cell.input_linearity.weight
2022-02-18 18:18:35,766 - INFO - allennlp.nn.initializers -    encoder._module.layer_3.cell.state_linearity.bias
2022-02-18 18:18:35,766 - INFO - allennlp.nn.initializers -    encoder._module.layer_3.cell.state_linearity.weight
2022-02-18 18:18:35,767 - INFO - allennlp.nn.initializers -    encoder._module.layer_4.cell.input_linearity.bias
2022-02-18 18:18:35,768 - INFO - allennlp.nn.initializers -    encoder._module.layer_4.cell.input_linearity.weight
2022-02-18 18:18:35,769 - INFO - allennlp.nn.initializers -    encoder._module.layer_4.cell.state_linearity.bias
2022-02-18 18:18:35,769 - INFO - allennlp.nn.initializers -    encoder._module.layer_4.c

Running SRL...


100%|███████████████████████████████████████████| 18/18 [01:42<00:00,  5.71s/it]


In [5]:
# An example of SRL output
srl_res[6]

{'verbs': [{'verb': 'prefer',
   'description': '[ARG0: I] [V: prefer] [ARG1: the lower density idea , but also mixed use with restaurants , cafes , etc] .',
   'tags': ['B-ARG0',
    'B-V',
    'B-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'I-ARG1',
    'O']}],
 'words': ['I',
  'prefer',
  'the',
  'lower',
  'density',
  'idea',
  ',',
  'but',
  'also',
  'mixed',
  'use',
  'with',
  'restaurants',
  ',',
  'cafes',
  ',',
  'etc',
  '.']}

## Step 3: Build the narrative model

----------------------------

We are now ready to build a narrative model.

The function `build_narrative_model` takes as input the split sentences the SRL annotations for the corpus. It builds a model of low-dimensional narrative statements which may then be used to obtain narrative statements  "out-of-sample".

The function has sensible defaults for most arguments, but the user should at least specify:
- the number of latent unnamed entities to recover (`n_clusters`)
- the embeddings to be used (see here for further details) 

We specify 100 unnamed entities to uncover. The embeddings used are pre-trained glove embeddings.

To speed up the model's training, we also focus on the top 100 most frequent named entities (the default is to mine all named entities).

To improve interpretability, we remove common uninformative words in the corpus (the "stopwords"), as well as one-letter words.

----------------------------

In [6]:
# NB: This step usually takes several minutes to run. You might want to grab a coffee.
 
from relatio.wrappers import build_narrative_model

narrative_model = build_narrative_model(
    srl_res=srl_res,
    sentences=split_sentences[1],
    embeddings_type="gensim_keyed_vectors",
    embeddings_path="glove-wiki-gigaword-300",
    n_clusters=[[50]], 
    stop_words = nltk_stops,
    progress_bar=True,
)

Processing SRL...


100%|████████████████████████████████████| 5896/5896 [00:00<00:00, 34536.50it/s]


Cleaning SRL...


100%|███████████████████████████████████| 13569/13569 [00:08<00:00, 1516.82it/s]


Computing role frequencies...


100%|████████████████████████████████| 13569/13569 [00:00<00:00, 1228840.33it/s]


Mining named entities...


100%|██████████████████████████████████████| 5896/5896 [00:27<00:00, 216.81it/s]


Mapping named entities...


100%|███████████████████████████████████| 13569/13569 [00:01<00:00, 9259.03it/s]


Loading embeddings model...


2022-02-18 18:21:10,412 - INFO - gensim.models.utils_any2vec - loading projection weights from /Users/emilyrobitschek/gensim-data/glove-wiki-gigaword-300/glove-wiki-gigaword-300.gz
2022-02-18 18:22:04,154 - INFO - gensim.models.utils_any2vec - loaded (400000, 300) matrix from /Users/emilyrobitschek/gensim-data/glove-wiki-gigaword-300/glove-wiki-gigaword-300.gz
2022-02-18 18:22:15,979 - INFO - gensim.models.keyedvectors - precomputing L2-norms of word weight vectors


In [7]:
# The narrative model is simply a dictionary containing the narrative model's specifics.
print(narrative_model.keys())
print(nltk_stops)

dict_keys(['roles_considered', 'roles_with_entities', 'roles_with_embeddings', 'dimension_reduce_verbs', 'clean_text_options', 'verb_counts', 'entities', 'top_n_entities', 'embeddings_model', 'cluster_model', 'cluster_labels_most_similar', 'cluster_labels_most_freq'])
{'aren', 'or', "she's", 'them', 'because', 'while', 'couldn', 'ma', 'didn', 'below', 'during', "you'd", 'u', 'you', 'll', 'he', 'it', 'hadn', 'from', 'own', 'an', 'until', 'mightn', 'did', 'before', "isn't", 'yourselves', 'with', 'we', 'have', 'out', 'i', 'which', "won't", 'y', 'doesn', 'doing', 'haven', 'my', 'are', 'both', 'after', 'more', 'to', 'only', 'through', 'hasn', 'yourself', 'so', 'they', 'if', 'then', 'how', 'most', 'myself', 'hers', 'your', 'here', 'such', 'whom', 'again', 'when', "wasn't", "wouldn't", 'o', 's', "aren't", "that'll", 'some', "don't", 'further', 'into', 'shouldn', 'ourselves', 'over', 'been', 'does', 'itself', 'too', 'wasn', 'up', 'under', "haven't", 'what', 'each', 'why', 'once', 'needn', 'its

In [8]:
# Most common named entities
narrative_model['entities'].most_common()[:20]

[('london', 41),
 ('chicago', 34),
 ('los angeles', 18),
 ('la', 14),
 ('nyc', 12),
 ('new york', 10),
 ('proposal', 7),
 ('uk', 5),
 ('manhattan', 5),
 ('brooklyn', 5),
 ('usa', 4),
 ('america', 4),
 ('california', 4),
 ('covid', 4),
 ('new york city', 4),
 ('democrat', 3),
 ('city chicago', 3),
 ('lewisham', 3),
 ('rent control', 3),
 ('goodmayes', 3)]

In [9]:
# The unnamed entities uncovered in the corpus 
# (automatically labeled by the most frequent phrase in the cluster)
narrative_model['cluster_labels_most_freq']

[[{5: 'renter',
   20: 'well',
   8: 'people',
   21: 'need',
   11: 'bar',
   33: 'decision',
   13: 'one',
   6: 'best',
   16: 'survey',
   48: 'double population',
   23: 'density',
   47: 'densification',
   26: 'important',
   1: 'factor',
   9: 'money',
   42: 'good',
   0: 'government',
   25: 'sense',
   46: 'apartment',
   28: 'mixed use',
   41: 'increase',
   32: 'problem',
   31: 'business',
   45: 'traffic',
   7: 'opinion',
   19: 'unfair',
   35: 'landlord',
   24: 'answer',
   12: 'project',
   10: 'affordable housing',
   43: 'interested',
   22: 'neighbourhood',
   15: 'high',
   17: 'small business',
   2: 'crowd',
   29: 'affect',
   30: 'carbon neutral',
   14: 'fair',
   39: 'together',
   38: 'poor',
   40: 'service',
   49: 'rental',
   44: 'area',
   4: 'benefit',
   3: 'gentrification',
   37: 'environment',
   36: 'plan',
   34: 'government interference project',
   18: 'good economy',
   27: 'favorable'}]]

----------------------------

In practice, `build_narrative_model` is a flexible wrapper function which gives a lot of control to the user. 

Let's break the options down into four categories:

1. Basic utilities
    - the semantic roles you're interested in (`roles_considered`)
    - where you would like to save the model (`output_path`)
    - whether you cant to track the function's progress (`progress_bar`)
    

2. Text preprocessing
    - basic text preprocessing steps 
    (`remove_punctuation`, `remove_digits`, remove `stop_words`, `stem` or `lemmatize` words, etc.)
    - would you like to replace verbs by their most common synonyms/antonyms? (`dimension_reduce_verbs`)


3. Named entities  
    - which semantic roles have named entities? (`roles_with_entities`)
    - how many named entities would you like to keep? (`top_n_entities`)

Technical details: under the hood, we work with SpaCy named entity recognizer to identify named entities. 
We consider tags related to places, organizations, people and events.


4. Unnamed entities (e.g., tax, government, dog, cat, etc.)
    - which semantic roles have unnamed entities? (`roles_with_embeddings`)
    - how many latent unnamed entities are there in the corpus? (`n_clusters`)

Technical details: under the hood, we embed semantic phrases without named entities and cluster them with 
K-Means. 

----------------------------

## Step 4: Get narrative statements based on the narrative model.

----------------------------

Once the narrative model is built, we can use it to extract narrative statements from any corpus (provided that the 
corpus is split into sentences and annotated for semantic roles). 

We call the function `get_narratives` for this purpose.

----------------------------

In [10]:
from relatio.wrappers import get_narratives

final_statements = get_narratives(
    srl_res=srl_res,
    doc_index=split_sentences[0],  # doc names
    narrative_model=narrative_model,
    n_clusters=[0],  
    progress_bar=True,
)

Processing SRL...


100%|████████████████████████████████████| 5896/5896 [00:00<00:00, 34043.14it/s]


Cleaning SRL...


100%|███████████████████████████████████| 13569/13569 [00:08<00:00, 1543.12it/s]


Processing raw arguments...


100%|█████████████████████████████████| 13569/13569 [00:00<00:00, 713484.41it/s]


Cleaning verbs...


100%|██████████████████████████████████| 13569/13569 [00:00<00:00, 20711.78it/s]


Mapping named entities...


100%|███████████████████████████████████| 13569/13569 [00:01<00:00, 9279.86it/s]


Assigning clusters to roles...


100%|███████████████████████████████████| 13569/13569 [00:02<00:00, 5531.22it/s]


In [11]:
# The resulting pandas dataframe

print(final_statements.columns)

final_statements.head()

Index(['doc', 'sentence', 'statement', 'ARG0_highdim', 'ARG0_lowdim',
       'B-V_highdim', 'B-V_lowdim', 'B-ARGM-NEG_highdim', 'B-ARGM-NEG_lowdim',
       'B-ARGM-MOD_highdim', 'ARG1_highdim', 'ARG1_lowdim', 'ARG2_highdim',
       'ARG2_lowdim'],
      dtype='object')


Unnamed: 0,doc,sentence,statement,ARG0_highdim,ARG0_lowdim,B-V_highdim,B-V_lowdim,B-ARGM-NEG_highdim,B-ARGM-NEG_lowdim,B-ARGM-MOD_highdim,ARG1_highdim,ARG1_lowdim,ARG2_highdim,ARG2_lowdim
0,R_10PI5FKTTlId8Ec,0,0,,,,,True,True,,,,yo,renter
1,R_10PI5FKTTlId8Ec,0,1,,,yo,yo,True,True,,,,,
2,R_10PI5FKTTlId8Ec,1,2,,,,,,,,,,,
3,R_10PI5FKTTlId8Ec,1,3,,,talk,talk,True,True,,,,,
4,R_1gduid1fizdQ4d8,2,4,,,,,,,,,,way,well


### Trying out different clustering scenarios

The choice of the number of clusters is corpus and application specific. Specifying a small number of latent entities leads to a large dimension reduction and may decrease entity coherence, whilst specifying a large number of latent entities may produce cluster redundancy. 

To help users try different clustering scenarios, the wrapper functions `build_narrative_model` and `get_narratives` allow users to experiment various clustering scenarios, which we detail below.

----------------------------

In `build_narrative_model`, the arguments `roles_with_embeddings` and `n_clusters` are specified as lists of lists arguments. This implies that you can cluster semantic roles separately (or together) and with different numbers of clusters. 

For example, a user could specify:
- `roles_with_embeddings = [["ARG0"],["ARG1"]]`
- `n_clusters = [[10,20],[10]]`
    
He/she would then cluster "ARG0" and "ARG1" separately and not consider "ARG2" for dimension reduction. For "ARG0", the clustering scenarios are a model with 10 and a model with 20 clusters. For "ARG1", the clustering scenario is a model with 10 clusters.


----------------------------

To extract narrative statements based on a clustering scenario, the `get_narratives` function also has the 
`n_clusters` argument. 

`n_clusters` is a list for the clustering scenarios. For instance, in our previous example, 
semantic roles "ARG0" and "ARG1" are clustered separately, so `n_clusters` expects two indices: one for the 
clustering scenario to pick for "ARG0" and one for the clustering scenario to pick for "ARG1".

For example, a user could specify:
- `n_clusters = [0,0]`

He/she would then extract narrative statements based on 10 clusters for "ARG0" and 10 clusters for "ARG1".

----------------------------

## Step 5: Model validation and basic analysis

----------------------------

The resulting `final_statments` object is a pandas dataframe which lists narrative statements found in documents and 
sentences of the corpus.

It is straight-forward to manually inspect the quality of the resulting entities and narrative statements.

----------------------------

In [12]:
# Entity coherence
# Print most frequent phrases per entity

# Pool ARG0, ARG1 and ARG2 together

df1 = final_statements[['ARG0_lowdim', 'ARG0_highdim']]
df1.rename(columns={'ARG0_lowdim': 'ARG', 'ARG0_highdim': 'ARG-RAW'}, inplace=True)

df2 = final_statements[['ARG1_lowdim', 'ARG1_highdim']]
df2.rename(columns={'ARG1_lowdim': 'ARG', 'ARG1_highdim': 'ARG-RAW'}, inplace=True)

df3 = final_statements[['ARG2_lowdim', 'ARG2_highdim']]
df3.rename(columns={'ARG2_lowdim': 'ARG', 'ARG2_highdim': 'ARG-RAW'}, inplace=True)

df = df1.append(df2).reset_index(drop = True)
df = df.append(df3).reset_index(drop = True)

# Count semantic phrases

df = df.groupby(['ARG', 'ARG-RAW']).size().reset_index()
df.columns = ['ARG', 'ARG-RAW', 'count']

# Drop empty semantic phrases

df = df[df['ARG'] != ''] 

# Rearrange the data

df = df.groupby(['ARG']).apply(lambda x: x.sort_values(["count"], ascending = False))
df = df.reset_index(drop= True)
df = df.groupby(['ARG']).head(10)

df['ARG-RAW'] = df['ARG-RAW'] + ' - ' + df['count'].astype(str)
df['cluster_elements'] = df.groupby(['ARG'])['ARG-RAW'].transform(lambda x: ' | '.join(x))

df = df.drop_duplicates(subset=['ARG'])

df['cluster_elements'] = [', '.join(set(i.split(','))) for i in list(df['cluster_elements'])]

print('Entities to inspect:', len(df))

df = df[['ARG', 'cluster_elements']]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(


Entities to inspect: 144


In [13]:
for l in df.values.tolist():
    print('entity: \n %s \n' % [0])
    print('most frequent phrases: \n %s \n' % l[1])

entity: 
 [0] 

most frequent phrases: 
 affect - 3 | impact - 3 | effect - 3 | result - 2 | change occur - 2 | substantial country - 2 | part decision effect live - 2 | long term effect - 1 | long last negative effect neighbourhood - 1 | longer affordable may know change occur much go - 1 

entity: 
 [0] 

most frequent phrases: 
 affordable housing - 84 | affordable - 33 | option - 31 | expensive - 14 | suitable - 10 | safe - 7 | easy - 7 | comfortable - 6 | convenient - 4 | favor affordable housing - 4 

entity: 
 [0] 

most frequent phrases: 
 always certain amount affordable house new development - 2 | always - 1 | future climate change always mind - 1 | high enough always increase - 1 | international investor would almost always vote - 1 | late always overcrowd - 1 | like project attract new type people always good meet new people nt rich - 1 | project attract new type people always good meet new people nt rich - 1 | public input anything always developer - 1 | wider issue always

In [14]:
# Low-dimensional vs. high-dimensional narrative statements

# Replace negated verbs by "not-verb"

import numpy as np

final_statements['B-V_lowdim_with_neg'] = np.where(final_statements['B-ARGM-NEG_lowdim'] == True, 
                                          'not-' + final_statements['B-V_lowdim'], 
                                          final_statements['B-V_lowdim'])

final_statements['B-V_highdim_with_neg'] = np.where(final_statements['B-ARGM-NEG_highdim'] == True, 
                                           'not-' + final_statements['B-V_lowdim'], 
                                           final_statements['B-V_highdim'])

# Concatenate high-dimensional narratives (with text preprocessing but no clustering)

final_statements['narrative_highdim'] = (final_statements['ARG0_highdim'] + ' ' + 
                                         final_statements['B-V_highdim_with_neg'] + ' ' +  
                                         final_statements['ARG1_highdim'])

# Concatenate low-dimensional narratives (with clustering)

final_statements['narrative_lowdim'] = (final_statements['ARG0_lowdim'] + ' ' + 
                                        final_statements['B-V_highdim_with_neg'] + ' ' + 
                                        final_statements['ARG1_lowdim'])

# Focus on narratives with a ARG0-VERB-ARG1 structure (i.e. "complete narratives")

indexNames = final_statements[(final_statements['ARG0_lowdim'] == '')|
                             (final_statements['ARG1_lowdim'] == '')|
                             (final_statements['B-V_lowdim_with_neg'] == '')].index

complete_narratives = final_statements.drop(indexNames)

complete_narratives

Unnamed: 0,doc,sentence,statement,ARG0_highdim,ARG0_lowdim,B-V_highdim,B-V_lowdim,B-ARGM-NEG_highdim,B-ARGM-NEG_lowdim,B-ARGM-MOD_highdim,ARG1_highdim,ARG1_lowdim,ARG2_highdim,ARG2_lowdim,B-V_lowdim_with_neg,B-V_highdim_with_neg,narrative_highdim,narrative_lowdim
9,R_23WCgmyuPAy3b1G,3,9,people,people,,,True,True,,much,well,,,not-,not-,people not- much,people not- well
68,R_1dBsP0wiodbtJHq,29,68,increase rent building place,rent,bring,get,,,would,crime,problem,,,get,bring,increase rent building place bring crime,rent bring problem
81,R_2sceuzLAPUduXvN,34,81,community,community,,,True,True,,enough job security,important,,,not-,not-,community not- enough job security,community not- important
86,R_1LzKpSE1Jg4u1s0,37,86,anything,good,improves,improve,,,,city,city,,,improve,improves,anything improves city,good improves city
87,R_1LzKpSE1Jg4u1s0,37,87,anything,good,help,help,,,,people,people,,,help,help,anything help people,good help people
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13462,R_1mJFHLSfH66ViuI,5845,13462,poorer people,poor,need,need,,,,help,need,,,need,need,poorer people need help,poor need need
13478,R_3nJjrdnMCAMyBZq,5857,13478,proposal,proposal,make,make,,,,sense others,well,,,make,make,proposal make sense others,proposal make well
13500,R_3D7tv6LuiTl0KU3,5866,13500,large investor,small business,care,care,True,True,,affordable rent apartment,rent,,,not-care,not-care,large investor not-care affordable rent apartment,small business not-care rent
13530,R_2v8EIlISE7HEoiF,5879,13530,community resident,community,involve,need,,,,decision community,community,,,need,involve,community resident involve decision community,community involve community


In [15]:
test = complete_narratives.merge(survey_df, how='left', left_on='doc', right_on='ResponseId')
test[test['Q7.5'].isin(['Always reject', 'reject'])]['narrative_lowdim'].value_counts()[:20]

apartment not-pay interested       1
rent help people                   1
factor sustain neighborhood        1
community need service             1
good rent unfair                   1
plan offer survey                  1
project not-ignore neighborhood    1
people mean traffic                1
neighborhood not-decrease rent     1
affordable housing lower high      1
environment focus government       1
rent lower high                    1
people must need                   1
rent need double population        1
government regard apartment        1
unfair pay high                    1
rent not-control important         1
fair help proposal                 1
poor keep best                     1
project handle gentrification      1
Name: narrative_lowdim, dtype: int64

In [16]:
# Top high-dimensional complete narrative statements

complete_narratives['narrative_highdim'].value_counts().head(10)

proposal benefit community                                                   3
people afford housing                                                        3
landlord increase rent                                                       2
would never take one without rent control get increase low income housing    2
proposal show rent get increase well resident value opinion count            2
project increase population density                                          2
proposal help community                                                      2
landlord raise rent                                                          2
people lose job                                                              2
community need input project                                                 1
Name: narrative_highdim, dtype: int64

In [17]:
# Top low-dimensional complete narrative statements

complete_narratives['narrative_highdim'].value_counts().head(30)

proposal benefit community                                                                                    3
people afford housing                                                                                         3
landlord increase rent                                                                                        2
would never take one without rent control get increase low income housing                                     2
proposal show rent get increase well resident value opinion count                                             2
project increase population density                                                                           2
proposal help community                                                                                       2
landlord raise rent                                                                                           2
people lose job                                                                                         

In [18]:
# Print ten random complete narratives with and without dimension reduction
#
# Specifying a small number of clusters leads to a large dimension reduction 
# and may decrease cluster coherence, whilst specifying a large number of clusters
# may produce cluster redundancy. 
#
# The choice of the number of clusters is corpus and application specific 
# (and was chosen at random in this notebook).

sample = complete_narratives.sample(10, random_state = 123).to_dict('records')

for d in sample:
    print('Original raw response: \n %s \n' %split_sentences[1][d['sentence']])
    print('High-dimensional narrative: \n %s \n' %d['narrative_highdim'])
    print('Low-dimensional narrative: \n %s \n' %d['narrative_lowdim'])
    print('--------------------------------------------------- \n')

Original raw response: 
 I would prefer a place with RENT CONTROL, WHERE WE CAN VOTE ON AND THAT DOESNT INCREASE THE DENSITY BY MORE THEN 20% 

High-dimensional narrative: 
 vote increase density 

Low-dimensional narrative: 
 opinion increase density 

--------------------------------------------------- 

Original raw response: 
 Increasing the population 100% would create more crime. 

High-dimensional narrative: 
 increase population create crime 

Low-dimensional narrative: 
 density create problem 

--------------------------------------------------- 

Original raw response: 
 The proposals I picked are actually good projects that will alleviate poverty and also help the community at large 

High-dimensional narrative: 
 good project alleviate poverty 

Low-dimensional narrative: 
 project alleviate problem 

--------------------------------------------------- 

Original raw response: 
 Any change needs to be fully consulted with the involvement of local people. 

High-dimensional

In [19]:
print(list(final_statements['ARG1_highdim']))

['', '', '', '', '', '', 'fair people nt much', '', '', 'much', 'proposal', 'well thing proposal', 'like would nt good enough', 'nt good enough', '', 'low density idea also mixed use restaurant cafe etc', 'would draw people surround community often oppose craft apartment', 'draw people surround community often oppose craft apartment', 'people surround community oppose craft apartment', 'community', '', '', 'would', 'one', 'position', '', '', 'density double', 'density', '', '', '', '', '', 'subsidize housing issue', 'housing issue', '', 'go decision make base feel healthy community', '', 'decision', 'decision', 'healthy community', '', 'good others good proposal', '', 'percent high yearly increase', 'percent', 'good', '', 'affordable housing senior', 'affordable housing senior', 'affordable housing senior', '', 'state unregulated price increase rent mere fraud', 'unregulated price increase rent mere fraud', 'unregulated price increase rent', 'room', 'well choice', '', '', '', 'national

## Step 6: Visualization // Plotting narrative graphs
----------------------------

A collection of narrative statements has an intuitive network structure, in which the edges are verbs and the nodes are entities.

----------------------------

In [20]:
temp = complete_narratives[["ARG0_lowdim", "ARG1_lowdim", "B-V_lowdim"]]
temp.columns = ["ARG0", "ARG1", "B-V"]
temp = temp[(temp["ARG0"] != "") & (temp["ARG1"] != "") & (temp["B-V"] != "")]
temp = temp.groupby(["ARG0", "ARG1", "B-V"]).size().reset_index(name="weight")
temp = temp.sort_values(by="weight", ascending=False)

In [21]:
# Plot low-dimensional complete narrative statements in a directed multi-graph

from relatio.graphs import build_graph, draw_graph

temp = complete_narratives[["ARG0_lowdim", "ARG1_lowdim", "B-V_lowdim"]]
temp.columns = ["ARG0", "ARG1", "B-V"]
temp = temp[(temp["ARG0"] != "") & (temp["ARG1"] != "") & (temp["B-V"] != "")]
temp = temp.groupby(["ARG0", "ARG1", "B-V"]).size().reset_index(name="weight")
temp = temp.sort_values(by="weight", ascending=False).iloc[
    0:100
]  # pick top 100 most frequent narratives
temp = temp.to_dict(orient="records")

for l in temp:
    l["color"] = None

G = build_graph(
    dict_edges=temp, dict_args={}, edge_size=None, node_size=10, prune_network=False #rue
)

draw_graph(G, notebook=True, output_filename="all_data_relatio_result_nltk_stops_no_1_letter_filter_k50_all_named_ents_prune_network_false_top100narr.html")

In [22]:
print(final_statements['ARG0_lowdim'].value_counts())
print(final_statements['ARG1_lowdim'].value_counts())

help(get_narratives)
#final_statements['ARG1_highdim']

                                      12063
people                                  204
proposal                                177
project                                  94
rent                                     84
                                      ...  
citizen chicago                           1
democrat                                  1
government investor                       1
donõt                                     1
base decision rent control id love        1
Name: ARG0_lowdim, Length: 88, dtype: int64
                            5349
rent                        1018
proposal                     567
good                         397
density                      326
                            ... 
greenery                       1
british                        1
thames clipper extension       1
perivale                       1
bayridge                       1
Name: ARG1_lowdim, Length: 130, dtype: int64
Help on function get_narratives in module relatio.wrappers:

get

In [23]:
complete_narratives

Unnamed: 0,doc,sentence,statement,ARG0_highdim,ARG0_lowdim,B-V_highdim,B-V_lowdim,B-ARGM-NEG_highdim,B-ARGM-NEG_lowdim,B-ARGM-MOD_highdim,ARG1_highdim,ARG1_lowdim,ARG2_highdim,ARG2_lowdim,B-V_lowdim_with_neg,B-V_highdim_with_neg,narrative_highdim,narrative_lowdim
9,R_23WCgmyuPAy3b1G,3,9,people,people,,,True,True,,much,well,,,not-,not-,people not- much,people not- well
68,R_1dBsP0wiodbtJHq,29,68,increase rent building place,rent,bring,get,,,would,crime,problem,,,get,bring,increase rent building place bring crime,rent bring problem
81,R_2sceuzLAPUduXvN,34,81,community,community,,,True,True,,enough job security,important,,,not-,not-,community not- enough job security,community not- important
86,R_1LzKpSE1Jg4u1s0,37,86,anything,good,improves,improve,,,,city,city,,,improve,improves,anything improves city,good improves city
87,R_1LzKpSE1Jg4u1s0,37,87,anything,good,help,help,,,,people,people,,,help,help,anything help people,good help people
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13462,R_1mJFHLSfH66ViuI,5845,13462,poorer people,poor,need,need,,,,help,need,,,need,need,poorer people need help,poor need need
13478,R_3nJjrdnMCAMyBZq,5857,13478,proposal,proposal,make,make,,,,sense others,well,,,make,make,proposal make sense others,proposal make well
13500,R_3D7tv6LuiTl0KU3,5866,13500,large investor,small business,care,care,True,True,,affordable rent apartment,rent,,,not-care,not-care,large investor not-care affordable rent apartment,small business not-care rent
13530,R_2v8EIlISE7HEoiF,5879,13530,community resident,community,involve,need,,,,decision community,community,,,need,involve,community resident involve decision community,community involve community


In [24]:
# Plot high-dimensional complete narrative statements in a directed multi-graph

from relatio.graphs import build_graph, draw_graph

temp = complete_narratives[["ARG0_highdim", "ARG1_highdim", "B-V_highdim"]]
temp.columns = ["ARG0", "ARG1", "B-V"]
temp = temp[(temp["ARG0"] != "") & (temp["ARG1"] != "") & (temp["B-V"] != "")]
temp = temp.groupby(["ARG0", "ARG1", "B-V"]).size().reset_index(name="weight")
temp = temp.sort_values(by="weight", ascending=False).iloc[
    0:200
]  # pick top 100 most frequent narratives
temp = temp.to_dict(orient="records")

for l in temp:
    l["color"] = None

G = build_graph(
    dict_edges=temp, dict_args={}, edge_size=None, node_size=10, prune_network=False #True
)

draw_graph(G, notebook=True, output_filename="all_data_relatio_result_nltk_stops_no_1_letter_filter_k50_all_named_ents_high_dim_prune_network_false_top200narr.html")

In [25]:
# Plot low-dimensional complete narrative statements in a directed multi-graph

from relatio.graphs import build_graph, draw_graph

temp = final_statements[["ARG0_lowdim", "ARG1_lowdim", "B-V_lowdim"]]
temp.columns = ["ARG0", "ARG1", "B-V"]
temp = temp[(temp["ARG0"] != "") & (temp["ARG1"] != "") & (temp["B-V"] != "")]
temp = temp.groupby(["ARG0", "ARG1", "B-V"]).size().reset_index(name="weight")
temp = temp.sort_values(by="weight", ascending=False).iloc[
    0:100
]  # pick top 100 most frequent narratives
temp = temp.to_dict(orient="records")

for l in temp:
    l["color"] = None

G = build_graph(
    dict_edges=temp, dict_args={}, edge_size=None, node_size=10, prune_network=False #True
)

draw_graph(G, notebook=True, output_filename="final_statements_initial_relatio_result_nltk_stops_no_1_letter_filter_k50_low_dim_prune_false.html")

In [26]:
# As a final comment, note that you can look up the specifics of any function with the help command.

help(build_narrative_model)

Help on function build_narrative_model in module relatio.wrappers:

build_narrative_model(srl_res: List[dict], sentences: List[str], roles_considered: List[str] = ['ARG0', 'B-V', 'B-ARGM-NEG', 'B-ARGM-MOD', 'ARG1', 'ARG2'], output_path: Union[str, NoneType] = None, max_length: Union[int, NoneType] = None, remove_punctuation: bool = True, remove_digits: bool = True, remove_chars: str = '', stop_words: Union[List[str], NoneType] = None, lowercase: bool = True, strip: bool = True, remove_whitespaces: bool = True, lemmatize: bool = True, stem: bool = False, tags_to_keep: Union[List[str], NoneType] = None, remove_n_letter_words: Union[int, NoneType] = None, roles_with_embeddings: List[List[str]] = [['ARG0', 'ARG1', 'ARG2']], embeddings_type: Union[str, NoneType] = None, embeddings_path: Union[str, NoneType] = None, n_clusters: List[List[int]] = [[1]], verbose: int = 0, random_state: int = 0, roles_with_entities: List[str] = ['ARG0', 'ARG1', 'ARG2'], ent_labels: List[str] = ['PERSON', 'NORP'