## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship, or, perhaps, about key differences between novels set in urban vs. rural environments.

## Set up

In [1]:
!pip install scattertext

Collecting scattertext
  Downloading scattertext-0.1.19-py3-none-any.whl (8.2 MB)
Collecting statsmodels (from scattertext)
  Downloading statsmodels-0.14.0-cp311-cp311-macosx_11_0_arm64.whl (9.4 MB)
Collecting flashtext (from scattertext)
  Downloading flashtext-2.7.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting patsy>=0.5.2 (from statsmodels->scattertext)
  Downloading patsy-0.5.4-py2.py3-none-any.whl (233 kB)
Building wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... done
  Created wheel for flashtext: filename=flas

Installing collected packages: flashtext, patsy, statsmodels, scattertext
Successfully installed flashtext-2.7 patsy-0.5.4 scattertext-0.1.19 statsmodels-0.14.0


In [2]:
import pandas as pd
import scattertext as st
from IPython.display import HTML  # public import path for HTML

In [3]:
# load data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [5]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


| | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 9411 | dickens | copperfield | ‘Here! Peggotty!’ cried Miss Betsey, opening t... | door tea mistress | parlour little unwell | cry open dawdle |
| 22756 | dickens | pickwick | 'It's not so much that, Mr. Weller,' replied M... | wine | much bad afraid | reply dissipate |
| 16832 | dickens | bleak | “But I know she is very beautiful this morning.” | morning | beautiful | know |
| 23866 | dickens | pickwick | 'Well, I think he was; I think I may say he wa... | man story uncle gentleman | same | think think say answer eye tell surprise |
| 13225 | dickens | copperfield | He looked upon her, while she made this suppli... | supplication manner | wild distracted silent | look make raise |


In [6]:
# load data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [7]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


| | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 13382 | eliot | romola | “There you go, supposing you’ll get people to ... | people leg sack pair hosen beast man battle fi... | wild unarmed | go suppose get put call say say rush ’ run fac... |
| 8443 | eliot | deronda | It happened that the very vividness of his imp... | vividness impression friend indefiniteness sen... | enigmatic apparent many sided persistent memor... | happen make contribute waken develop threaten ... |
| 9816 | eliot | deronda | “If he were here again, what should I do? I ca... | face coward contempt fiend will face | dead beggar dead | wish bear bear go go wander stay feel turn bea... |
| 3348 | eliot | middlemarch | “People will not make a boast of being methodi... | People boast while | methodistical good | make come |
| 1989 | eliot | middlemarch | “People don’t like his religious tone,” said L... | People tone | religious | like say break |


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [8]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [9]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [10]:
# drop all columns except 'author' and 'nouns'
nouns_df = df[['author', 'nouns']]

In [12]:
# sanity check
print(nouns_df.shape)
nouns_df.sample(10)

(20000, 2)


| | author | nouns |
|---|---|---|
| 6524 | dickens | doctor day sake sex mood fellow compassion fel... |
| 10029 | dickens | voice judge question |
| 4621 | eliot | sob minute house attic floor head worm shelf s... |
| 14533 | eliot | pity falsehood sort shrug time side affair |
| 6033 | dickens | candle table book night |
| 8048 | eliot | layer thing |
| 18732 | dickens | step room trifling indication room darkness wa... |
| 7686 | eliot | credit sake kind idiot brace passport life uni... |
| 705 | eliot | haunt pleasure footprint echo treasure |
| 12934 | dickens | turn room love wife self inclination head door |
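
The sampling step above draws different rows on every run, and the combined `DataFrame` keeps the original row labels from both sources. The sketch below shows one way to make the pre-processing repeatable, using small made-up DataFrames in place of the real data. The names `dickens_toy`, `eliot_toy`, and `SEED` are hypothetical. The `dropna` step rests on an assumption: that some rows in the real data may have no nouns and are read by pandas as missing values.

```python
import pandas as pd

# Toy stand-ins for the real DataFrames (made-up data, for illustration only).
dickens_toy = pd.DataFrame({
    "author": ["dickens"] * 4,
    "nouns": ["door tea mistress", "wine", None, "morning"],
})
eliot_toy = pd.DataFrame({
    "author": ["eliot"] * 4,
    "nouns": ["people tone", "layer thing", "pity falsehood", None],
})

SEED = 42  # any fixed integer makes the sample repeatable

# Sample with a fixed seed so repeated runs pick the same rows.
dickens_toy_sample = dickens_toy.sample(3, random_state=SEED)
eliot_toy_sample = eliot_toy.sample(3, random_state=SEED)

# ignore_index=True gives the combined frame a fresh 0..n-1 index,
# avoiding duplicate labels carried over from the two sources.
combined = pd.concat([dickens_toy_sample, eliot_toy_sample], ignore_index=True)

# Keep the two columns of interest and drop rows with no nouns (assumption:
# empty cells in the CSV are read as missing values).
toy_nouns_df = combined[["author", "nouns"]].dropna(subset=["nouns"])

print(toy_nouns_df)
```

With the real data, the same pattern would apply to `dickens_df.sample(10_000, random_state=SEED)` and `eliot_df.sample(10_000, random_state=SEED)`.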


## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext). First, we create a `Scattertext` corpus. Next, we transform that corpus into an HTML-based visualization. Finally, we display that visualization within our notebook. Note: you can also save the visualization as an HTML file.

In [13]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [14]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [15]:
# display visualization in notebook
HTML(html)

In [16]:
# Note: You can save this visualization as an html file
file_name = 'inclass_corpus.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)
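
To make the two axes and the `minimum_term_frequency` setting more concrete, here is a deliberately simplified sketch of the underlying idea: count each term per author, discard rare terms, and treat the two counts as coordinates. This is not `Scattertext`'s implementation; the library scales positions and scores terms in more sophisticated ways. The mini-corpus `rows` and the threshold value are made up for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus of (author, space-separated nouns) pairs.
rows = [
    ("dickens", "door tea door room"),
    ("dickens", "room door wine"),
    ("eliot", "people tone people room"),
    ("eliot", "people pity"),
]

MIN_TERM_FREQUENCY = 3  # plays the role of minimum_term_frequency, scaled down

# Count how often each term appears for each author.
counts = {"dickens": Counter(), "eliot": Counter()}
for author, text in rows:
    counts[author].update(text.split())

# Keep only terms whose total count across both authors meets the threshold.
totals = counts["dickens"] + counts["eliot"]
kept_terms = sorted(t for t, n in totals.items() if n >= MIN_TERM_FREQUENCY)

# Each kept term gets an (x, y) pair: x = count in Dickens, y = count in Eliot.
points = {t: (counts["dickens"][t], counts["eliot"][t]) for t in kept_terms}
print(points)
# → {'door': (3, 0), 'people': (0, 3), 'room': (2, 1)}
```

In this toy example, "door" is used only by the Dickens rows and "people" only by the Eliot rows, so they would sit at opposite ends of a plot, while "room" appears for both authors and would fall in between.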

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, use `Scattertext` to compare the adjectives used by these authors.

In [17]:
# create samples
dickens_sample_df2 = dickens_df.sample(10_000)
eliot_sample_df2 = eliot_df.sample(10_000)

In [18]:
# combine DataFrames
df2 = pd.concat([dickens_sample_df2, eliot_sample_df2])

In [19]:
# drop all columns except 'author' and 'adjectives'
adj_df = df2[['author', 'adjectives']]

In [21]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(adj_df, category_col='author', text_col='adjectives').build()

In [22]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [23]:
# display visualization in notebook
HTML(html)

In [24]:
# Note: You can save this visualization as an html file
file_name = 'HW_corpus.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html)
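
The noun and adjective comparisons repeat the same sample, combine, and select steps, so they could be wrapped in a small helper. The sketch below is one possible way to do that, shown on made-up data. The function `build_pos_df` and its parameters are hypothetical and are not part of pandas or `Scattertext`. It also assumes that rows with no value in the chosen column should be dropped.

```python
import pandas as pd


def build_pos_df(frames, column, n=None, seed=0):
    """Sample each frame, combine them, and keep 'author' plus one column.

    Hypothetical helper for illustration: `frames` is a list of DataFrames
    that all have an 'author' column and the requested `column`.
    """
    samples = [
        # Sample at most n rows per frame; use the whole frame when n is None.
        frame if n is None else frame.sample(min(n, len(frame)), random_state=seed)
        for frame in frames
    ]
    combined = pd.concat(samples, ignore_index=True)
    # Drop rows where the chosen column is missing (an assumption about the data).
    return combined[["author", column]].dropna(subset=[column])


# Toy stand-ins for dickens_df and eliot_df (made-up rows).
dickens_toy = pd.DataFrame({
    "author": ["dickens"] * 3,
    "nouns": ["door tea", "wine", "morning"],
    "adjectives": ["little unwell", "bad afraid", None],
})
eliot_toy = pd.DataFrame({
    "author": ["eliot"] * 3,
    "nouns": ["people tone", "layer thing", "pity"],
    "adjectives": ["religious", "wild unarmed", "good"],
})

toy_adj_df = build_pos_df([dickens_toy, eliot_toy], "adjectives", n=2, seed=1)
print(toy_adj_df)
```

With the real data, a call such as `build_pos_df([dickens_df, eliot_df], 'adjectives', n=10_000)` would produce a frame that can be passed to `st.CorpusFromPandas` as in the cells above; switching to `'verbs'` would only require changing the column name.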