# Visualizing results

Here we explore various options for visualizing the results of our analysis and model fitting.

We will explore the Scattertext library, which is especially useful for visualizing the differences between two corpora.

Other libraries of interest:

* Plotly, a powerful interactive plotting library.
* Matplotlib, for simple plots.
* Seaborn, which can be used to enhance Matplotlib plots.
* pyLDAvis, a library for visualizing the results of topic modeling.

## Scattertext library

Scattertext is a library for visualizing text data, in particular the differences in word use between two corpora. Here we use it to compare two groups of trial documents: trials that ended in a punishment and trials that did not.

Scattertext home page: [Scattertext on GitHub](https://github.com/JasonKessler/scattertext)

In [None]:
# We follow the tutorial at
# https://github.com/JasonKessler/scattertext#visualizing-phrase-associations-with-scattertext-and-spacy

# first we need to install scattertext

# !pip install scattertext


In [3]:
# we will need pandas and spacy as well
from tqdm import tqdm
import tqdm as notebook_tqdm
import pandas as pd
import spacy
import scattertext as st

# example from the Scattertext tutorial; use it as a template for building your own corpus and visualization

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)

corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))

html = st.produce_scattertext_explorer(
    corpus,
    category='democrat', category_name='Democratic', not_category_name='Republican',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank
)
#open('./demo_compact.html', 'w', encoding="utf-8").write(html)


In [4]:
df.shape

(189, 4)

In [5]:
df.columns

Index(['party', 'text', 'speaker', 'parse'], dtype='object')

In [6]:
df.head()

Unnamed: 0,party,text,speaker,parse
0,democrat,Thank you. Thank you. Thank you. Thank you so ...,BARACK OBAMA,"(thank, you, ., thank, you, ., thank, you, ., ..."
1,democrat,"Thank you so much. Tonight, I am so thrilled a...",MICHELLE OBAMA,"(thank, you, so, much, .)"
2,democrat,Thank you. It is a singular honor to be here t...,RICHARD DURBIN,"(thank, you, ., it, is, a, singular, honor, to..."
3,democrat,"Hey, Delaware. \nAnd my favorite Democrat, Jil...",JOSEPH BIDEN,"(hey, ,, delaware, ., and, my, favorite, democ..."
4,democrat,"Hello. \nThank you, Angie. I'm so proud of how...",JILL BIDEN,"(hello, ., thank, you, ,, angie, ., i, ', m, s..."


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   party    189 non-null    object
 1   text     189 non-null    object
 2   speaker  189 non-null    object
 3   parse    189 non-null    object
dtypes: object(4)
memory usage: 6.0+ KB


In [8]:
first_row_parse_col = df['parse'][0]
print(type(first_row_parse_col))
# the parse column holds Scattertext Doc objects, so we can build it for our own data with the apply method

<class 'scattertext.WhitespaceNLP.Doc'>


In [9]:
url = "https://github.com/ValRCS/BSSDH_2024_workshop/raw/main/data/old_bailey_sample_1720_1913.csv"
d = pd.read_csv(url)  # sample of Old Bailey trials, 1720-1913


In [10]:
# number of unique punishment values (unique() also counts NaN as a value)
len(d['punishment'].unique())

275

In [11]:
# how many punishments involve death
d['punishment'].str.lower().str.contains('death').sum()

85

In [12]:
# how many punishments involve whipping
d['punishment'].str.lower().str.contains('whip').sum()

29

In [13]:
# how many punishments involve prison
d['punishment'].str.lower().str.contains('prison').sum()

57

In [14]:
# let's create a function that returns a dictionary of punishment counts
# arguments: the dataframe, the column name, and the list of words to search for
def get_punishment_counts(df, column_name="punishment", list_of_words=()):
    # create empty dictionary
    punishment_dict = {}
    # first count how many rows have no punishment recorded
    punishment_dict['empty'] = df[column_name].isnull().sum()
    # loop through list of words
    for word in list_of_words:
        # count the rows whose punishment contains the word
        # (case-insensitive substring match; rows with a missing punishment are not counted)
        punishment_dict[word] = df[column_name].str.lower().str.contains(word).sum()
    # return dictionary
    return punishment_dict

# punishment words; 'discharg' is a stem that matches both "discharge" and "discharged"
punishment_words = ['death', 'whip', 'prison', 'transport', 'fine', 'discharg']
# call function
get_punishment_counts(d, column_name="punishment", list_of_words=punishment_words)

{'empty': 475,
 'death': 85,
 'whip': 29,
 'prison': 57,
 'transport': 309,
 'fine': 352,
 'discharg': 33}
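
A caveat about the counts above: `str.contains` does plain substring matching, so `'fine'` also matches inside "Confined Three Months", which may inflate the `fine` count. A minimal sketch of one way to tighten this, assuming we only want matches at the start of a word. The helper name `count_word_starts` and the sample strings are made up for illustration and are not part of the original notebook.

```python
import re
import pandas as pd

def count_word_starts(series, words):
    """Count rows in which at least one word starts with each given stem."""
    counts = {}
    for word in words:
        # \b anchors the stem at a word boundary, so 'fine' matches
        # "Fined" but not "Confined"; re.escape guards against regex metacharacters
        pattern = r"\b" + re.escape(word)
        matches = series.str.contains(pattern, case=False, regex=True, na=False)
        counts[word] = int(matches.sum())
    return counts

# small made-up sample (hypothetical values, not taken from the Old Bailey data)
sample = pd.Series(["Confined Three Months", "Fined 1 s.", "Death", None, "Discharged"])
stems = ["fine", "discharg"]

# the plain substring approach, for comparison
substring_counts = {w: int(sample.str.lower().str.contains(w, na=False).sum()) for w in stems}
boundary_counts = count_word_starts(sample, stems)

print(substring_counts)
print(boundary_counts)
```

Whether word-start matching is the right rule for this dataset is a judgment call; it is worth inspecting the rows that change category before adopting it.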

In [15]:
# punishment value count
d['punishment'].value_counts()

punishment
[Transportation. See summary.]                     106
Transported for Seven Years                         89
Death                                               53
Confined Three Months                               52
Confined Six Months                                 52
                                                  ... 
Seven Year's Penal Servitude.                        1
Five Year's Penal Servitude.                         1
Seven Years' Penal Servitude each                    1
Two Years Imprisonment.                              1
Eighteen months' imprisonment, second division;      1
Name: count, Length: 274, dtype: int64

## Punishment types function

In [16]:
# convert punishment column to string
d['punishment'] = d['punishment'].astype(str)  # caution: this turns NaN into the literal string "nan"
# TODO: avoid this in the future
# without this conversion, get_punishment_category would have to check for NaN itself

In [17]:
# let's create a function that returns the punishment category for a punishment description
# arguments: the punishment text and the list of words to search for
def get_punishment_category(punishment_text, list_of_words=()):
    # if punishment text is empty or the string "nan" (left over from astype(str))
    if not punishment_text.strip() or punishment_text.strip() == "nan":  # ugly hack
        # return empty
        return "empty"
    # loop through list of words
    for word in list_of_words:
        # if word is found in punishment text
        if word in punishment_text.lower():
            # return the first matching word, so the order of the words matters
            return word
    # if no word is found return unknown
    return "unknown"
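
The `"nan"` string check works but is fragile. A minimal sketch of an alternative that tests for real missing values with `pd.isna`, so the column would not need to be converted with `astype(str)` first. The function name `get_punishment_category_safe` and the sample values are hypothetical and only for illustration.

```python
import pandas as pd

def get_punishment_category_safe(punishment_text, list_of_words=()):
    # treat real missing values (NaN/None) and blank strings as "empty"
    if pd.isna(punishment_text) or not str(punishment_text).strip():
        return "empty"
    text = str(punishment_text).lower()
    for word in list_of_words:
        # return the first matching word, so the order of the words still matters
        if word in text:
            return word
    return "unknown"

# small made-up sample with a real NaN and a blank string
sample = pd.Series(["Death", float("nan"), "  ", "Six months' hard labour", "Whipped and discharged"])
categories = sample.apply(get_punishment_category_safe, list_of_words=["death", "whip", "discharg"])
print(categories.tolist())
```

The last sample value contains both "whip" and "discharg"; it is labelled `whip` only because that word comes first in the list.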

In [18]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1637 entries, 0 to 1636
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   file_name     1637 non-null   object
 1   year          1637 non-null   int64 
 2   trial_number  1637 non-null   int64 
 3   punishment    1637 non-null   object
 4   text          1637 non-null   object
dtypes: int64(2), object(3)
memory usage: 64.1+ KB


In [19]:
# let's create a punishment category column
d['punish_type'] = d['punishment'].apply(get_punishment_category, list_of_words=punishment_words)
# head
d.head()

Unnamed: 0,file_name,year,trial_number,punishment,text,punish_type
0,OBC2-17200427.xml,1720,1,Transportation,",of St. Leonard Eastcheap , was indicted ...",transport
1,OBC2-17200427.xml,1720,2,Transportation,"Alice Jones , of St. Michael's Cornhi...",transport
2,OBC2-17200427.xml,1720,3,Transportation,"James Wilson , of St Katharine Colema...",transport
3,OBC2-17200427.xml,1720,4,Transportation,"James Mercy , alias Masse , of St....",transport
4,OBC2-17200427.xml,1720,5,Transportation,"Benjamin Cook , alias Richard Smith ...",transport


## Creating parsed text column using spacy and scattertext

In [20]:
# Scattertext requires parsed documents

# we use assign to create a new column
# we pass in the new column name and a function that computes it from the dataframe
# the function applies st.whitespace_nlp_with_sentences to each text
# (a simple whitespace-based parser, so no spaCy model is needed here)

d = d.assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)
d.head()

Unnamed: 0,file_name,year,trial_number,punishment,text,punish_type,parse
0,OBC2-17200427.xml,1720,1,Transportation,",of St. Leonard Eastcheap , was indicted ...",transport,"(,, of, st, ., leonard, eastcheap, ,, was, ind..."
1,OBC2-17200427.xml,1720,2,Transportation,"Alice Jones , of St. Michael's Cornhi...",transport,"(alice, jones, ,, of, st, ., michael, ', s, co..."
2,OBC2-17200427.xml,1720,3,Transportation,"James Wilson , of St Katharine Colema...",transport,"(james, wilson, ,, of, st, katharine, coleman,..."
3,OBC2-17200427.xml,1720,4,Transportation,"James Mercy , alias Masse , of St....",transport,"(james, mercy, ,, alias, masse, ,, of, st, ., ..."
4,OBC2-17200427.xml,1720,5,Transportation,"Benjamin Cook , alias Richard Smith ...",transport,"(benjamin, cook, ,, alias, richard, smith, ,, ..."


In [21]:
# unique punish types
d['punish_type'].unique()

array(['transport', 'empty', 'death', 'unknown', 'whip', 'fine', 'prison',
       'discharg'], dtype=object)

In [22]:
# value counts for punish type
d['punish_type'].value_counts()

punish_type
empty        475
fine         350
unknown      319
transport    309
death         85
prison        57
whip          29
discharg      13
Name: count, dtype: int64

In [23]:
# let's create a freedom column: 1 if no punishment was recorded or the accused was discharged, else 0
d['freedom'] = d['punish_type'].isin(['empty', 'discharg']).astype(int)
d.head()

Unnamed: 0,file_name,year,trial_number,punishment,text,punish_type,parse,freedom
0,OBC2-17200427.xml,1720,1,Transportation,",of St. Leonard Eastcheap , was indicted ...",transport,"(,, of, st, ., leonard, eastcheap, ,, was, ind...",0
1,OBC2-17200427.xml,1720,2,Transportation,"Alice Jones , of St. Michael's Cornhi...",transport,"(alice, jones, ,, of, st, ., michael, ', s, co...",0
2,OBC2-17200427.xml,1720,3,Transportation,"James Wilson , of St Katharine Colema...",transport,"(james, wilson, ,, of, st, katharine, coleman,...",0
3,OBC2-17200427.xml,1720,4,Transportation,"James Mercy , alias Masse , of St....",transport,"(james, mercy, ,, alias, masse, ,, of, st, ., ...",0
4,OBC2-17200427.xml,1720,5,Transportation,"Benjamin Cook , alias Richard Smith ...",transport,"(benjamin, cook, ,, alias, richard, smith, ,, ...",0


In [24]:
# convert 0 to punishment and 1 to freedom
d['freedom'] = d['freedom'].replace({0: 'punishment', 1: 'freedom'})
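
The two steps above (an integer flag, then `replace`) can also be written as a single step. A minimal sketch using `numpy.where` on a small made-up dataframe; the `demo` dataframe is hypothetical, and the rule that `empty` and `discharg` count as "freedom" is taken from the notebook.

```python
import numpy as np
import pandas as pd

# small made-up dataframe standing in for d
demo = pd.DataFrame({"punish_type": ["transport", "empty", "death", "discharg", "fine"]})

# rows with no recorded punishment or a discharge are labelled "freedom",
# everything else is labelled "punishment"
is_free = demo["punish_type"].isin(["empty", "discharg"])
demo["freedom"] = np.where(is_free, "freedom", "punishment")

print(demo["freedom"].value_counts())
```

The same pattern would collapse any multi-valued column into the two labels that the corpus is built from.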

## Creating Corpus for visualization

In [25]:
# we need to build a corpus
# TODO: can we use a category column with more than two categories?

corpus = st.CorpusFromParsedDocuments(
    d, category_col='freedom', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))

In [26]:
cdf = corpus.get_df()
cdf

Unnamed: 0,index,file_name,year,trial_number,punishment,text,punish_type,parse,freedom
0,0,OBC2-17200427.xml,1720,1,Transportation,",of St. Leonard Eastcheap , was indicted ...",transport,"(,, of, st, ., leonard, eastcheap, ,, was, ind...",punishment
1,1,OBC2-17200427.xml,1720,2,Transportation,"Alice Jones , of St. Michael's Cornhi...",transport,"(alice, jones, ,, of, st, ., michael, ', s, co...",punishment
2,2,OBC2-17200427.xml,1720,3,Transportation,"James Wilson , of St Katharine Colema...",transport,"(james, wilson, ,, of, st, katharine, coleman,...",punishment
3,3,OBC2-17200427.xml,1720,4,Transportation,"James Mercy , alias Masse , of St....",transport,"(james, mercy, ,, alias, masse, ,, of, st, ., ...",punishment
4,4,OBC2-17200427.xml,1720,5,Transportation,"Benjamin Cook , alias Richard Smith ...",transport,"(benjamin, cook, ,, alias, richard, smith, ,, ...",punishment
...,...,...,...,...,...,...,...,...,...
1632,1632,OBC2-19130304.xml,1913,65,Six months' hard labour,"SUDDABY , John (30, labourer) , and S...",unknown,"(suddaby, ,, john, (, 30, ,, labourer, ), ,, a...",punishment
1633,1633,OBC2-19130304.xml,1913,66,"Nine months' imprisonment, second division.","STEVENSON , Ella, otherwise Ethel Slade (53...",prison,"(stevenson, ,, ella, ,, otherwise, ethel, slad...",punishment
1634,1634,OBC2-19130304.xml,1913,67,Four months' hard labour.,"WELLAND , John (35, labourer) , pleaded gui...",unknown,"(welland, ,, john, (, 35, ,, labourer, ), ,, p...",punishment
1635,1635,OBC2-19130304.xml,1913,68,"Eighteen months' imprisonment, second division;","WHARRY , Olive, otherwise known as Joyce Loc...",prison,"(wharry, ,, olive, ,, otherwise, known, as, jo...",punishment


## Creating html visualization

In [27]:
# finally we are ready to feed our data to scattertext from our dataframe d
# we have our parsed column (parse) and our category column (freedom),
# and we have a corpus built from them

html = st.produce_scattertext_explorer(
    corpus,
    category='freedom', category_name='freedom', not_category_name='punishment',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=corpus.get_df()['year'],
    transform=st.Scalers.dense_rank
)
open('../data/fine_vs_freedom_bailey.html', 'w', encoding="utf-8").write(html)

6597976
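
The number printed above is the return value of `write`, that is, the number of characters written to the file. A minimal sketch of a slightly safer way to save the page, assuming we want the file closed explicitly and the target folder created when it is missing. The helper name `save_html` and the demo path are hypothetical.

```python
from pathlib import Path
import tempfile

def save_html(html, path):
    """Write html to path as UTF-8 and return the number of characters written."""
    path = Path(path)
    # create the parent folder (for example ../data) if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    # the with block closes the file even if writing fails
    with path.open("w", encoding="utf-8") as f:
        return f.write(html)

# demo with a temporary folder so nothing is left behind
demo_html = "<html><body>demo</body></html>"
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "demo.html"
    written = save_html(demo_html, target)
    saved_text = target.read_text(encoding="utf-8")

print(written)
```

In the notebook this would be called with the `html` string returned by `st.produce_scattertext_explorer` and the output path used above.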

## Conclusions from working with scattertext and spacy

* creating the parse for Scattertext was easy for English-language text
* the tricky part was creating a suitable category column
* currently it looks like only two categories are supported - TODO: read the docs/code
* the resulting plot is interactive and lets us explore the data and search for terms

Overall, Scattertext is a very promising text visualization library for exploring word associations and differences between corpora.