## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship or, perhaps, about key differences between novels set in urban and rural environments.

## Set up

In [1]:
%%capture
!pip3 install scattertext

In [2]:
import pandas as pd
import scattertext as st
from IPython.display import HTML



In [3]:
# load the Dickens data
dickens_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
dickens_df = pd.read_csv(dickens_url)

In [4]:
# sanity check
print(dickens_df.shape)
dickens_df.sample(5)

(24707, 6)


|  | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 18900 | dickens | bleak | His formal array of words might have at any ot... | array word time time earnestness fidelity gall... | formal other ludicrous serious noble generous ... | affect see see bear aspire rise shine |
| 17075 | dickens | bleak | Here some shrill spectre cries out in a mockin... | spectre mocking manner visitor chin toss deris... | shrill good gracious sportive silent startled ... | cry be find look receive become |
| 15860 | dickens | bleak | “I had confident expectations that things woul... | expectation thing vagueness expression meaning... | confident square disappointed dirty new profes... | come say come make deal make borrow give menti... |
| 6959 | dickens | twist | With these words, he suddenly wheeled the tabl... | word table iron ring boarding trap door foot g... | large several great | wheel pull throw open cause retire |
| 9057 | dickens | times | Mr. Bounderby’s first procedure was to shake M... | procedure stage floor recourse administration ... | first various potent such fast other dead alive | shake leave progress suffer screw smite water ... |


In [5]:
# load the Eliot data
eliot_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
eliot_df = pd.read_csv(eliot_url)

In [6]:
# sanity check
print(eliot_df.shape)
eliot_df.sample(5)

(18139, 6)


|  | author | title | text | nouns | adjectives | verbs |
|---|---|---|---|---|---|---|
| 17779 | eliot | scenes | Nevertheless, Evangelicalism had brought into ... | existence operation society idea duty recognit... | palpable mere moral great central high mere po... | bring live begin mould rise introduce prune cu... |
| 9778 | eliot | deronda | \nWhen Deronda met Gwendolen and Grandcourt on... | staircase mind interview mother | second | meet preoccupy summon |
| 13865 | eliot | romola | It was a graceful way of putting a necessary s... | way statement word reply _ spokesman extremity... | graceful necessary dignified great loud round ... | put pass turn follow begin swing give cease di... |
| 6083 | eliot | mill | “And there’s your mother—you’ll try and make h... | mother you’ll amend luck wench | bad little | ’ try make ’ |
| 11208 | eliot | bede | Hereupon a glorious shouting, a rapping, a jin... | shouting rapping jingling clattering shouting ... | glorious plentiful sublimest first feeble clos... | receive feel nullify feel praise deserve say l... |
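
Beyond the shape and a random sample, it can help to check how the rows are distributed across novels and whether any of the term columns are empty, since empty CSV cells load as `NaN`. The sketch below is one way to do that on a toy frame that mirrors the columns above; the helper name `summarize` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize(df, term_cols=('nouns', 'adjectives', 'verbs')):
    """Count rows per title and missing values per term column."""
    rows_per_title = df.groupby('title').size()
    missing_per_col = df[list(term_cols)].isna().sum()
    return rows_per_title, missing_per_col


# toy rows standing in for dickens_df or eliot_df (made up for illustration)
toy_df = pd.DataFrame({
    'author': ['dickens', 'dickens', 'dickens'],
    'title': ['bleak', 'bleak', 'twist'],
    'text': ['The fog rolled in.', 'The clerk wrote.', 'The door opened.'],
    'nouns': ['fog street', 'clerk office', 'table door'],
    'adjectives': ['thick', None, 'large'],  # None mimics an empty CSV cell
    'verbs': ['roll', 'write', 'open'],
})

rows_per_title, missing_per_col = summarize(toy_df)
print(rows_per_title)
print(missing_per_col)
```

On the real data, `summarize(dickens_df)` and `summarize(eliot_df)` would show whether one novel dominates a sample and whether any rows lack nouns, adjectives, or verbs.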


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [7]:
# create samples
dickens_sample_df = dickens_df.sample(10_000)
eliot_sample_df = eliot_df.sample(10_000)

In [8]:
# combine DataFrames
df = pd.concat([dickens_sample_df, eliot_sample_df])

In [9]:
# drop all columns except 'author' and 'nouns'
nouns_df = df[['author', 'nouns']]
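
The three steps above (sample, combine, select columns) can also be wrapped in one small helper so that they are easy to repeat for another part of speech. This is a minimal sketch, not part of the original notebook: the function name `build_pos_frame`, the `seed` argument (passed to `random_state` so the sample is reproducible), and the choice to drop rows whose term column is empty are all assumptions made for illustration.

```python
import pandas as pd


def build_pos_frame(frames, column, n=10_000, seed=42):
    """Sample each frame, stack the samples, and keep 'author' plus one term column."""
    samples = []
    for frame in frames:
        # drop rows with no terms in the chosen column (empty CSV cells load as NaN)
        usable = frame.dropna(subset=[column])
        # never ask for more rows than are available
        size = min(n, len(usable))
        samples.append(usable.sample(size, random_state=seed))
    combined = pd.concat(samples, ignore_index=True)
    return combined[['author', column]]


# toy frames standing in for dickens_df and eliot_df (made up for illustration)
toy_dickens = pd.DataFrame({
    'author': ['dickens'] * 3,
    'nouns': ['fog street', None, 'clerk office'],
})
toy_eliot = pd.DataFrame({
    'author': ['eliot'] * 2,
    'nouns': ['field mill', 'river farm'],
})

toy_nouns_df = build_pos_frame([toy_dickens, toy_eliot], 'nouns', n=2)
print(toy_nouns_df)
```

With the real data, `build_pos_frame([dickens_df, eliot_df], 'nouns')` would produce a frame comparable to `nouns_df`, and the same helper could be called with `column='adjectives'` or `column='verbs'`.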

## Build corpus and visualize

Now that our data is in the shape we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows the `Scattertext` [documentation](https://github.com/JasonKessler/scattertext). First we create a `Scattertext` corpus, then we transform that corpus into an HTML-based visualization, and finally we display that visualization within our notebook. Note: you can also save the visualization as an HTML file.

In [10]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(nouns_df, category_col='author', text_col='nouns').build()

In [11]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [12]:
# display visualization in notebook
HTML(html)

In [13]:
# Note: you can save this visualization as an HTML file
file_name = 'example.html'
with open(file_name, mode='w', encoding='utf8') as f:
    f.write(html)
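
To get a feel for what the plot is built from, the sketch below counts how often each term occurs per author using plain pandas and `collections.Counter`. It illustrates the underlying idea only and is not Scattertext's own computation; the function name `term_counts_by_category`, the `min_total` parameter (loosely analogous to `minimum_term_frequency`), and the toy rows are assumptions made for this example.

```python
from collections import Counter

import pandas as pd


def term_counts_by_category(df, category_col, text_col, min_total=1):
    """Build a term-by-category count table from whitespace-separated terms."""
    counts = {}
    for category, texts in df.groupby(category_col)[text_col]:
        counter = Counter()
        for text in texts.dropna():
            counter.update(text.split())
        counts[category] = dict(counter)
    # terms missing from one category get a count of 0
    table = pd.DataFrame(counts).fillna(0).astype(int)
    # keep only terms that occur at least min_total times across all categories
    return table[table.sum(axis=1) >= min_total]


# toy rows standing in for nouns_df (made up for illustration)
toy_df = pd.DataFrame({
    'author': ['dickens', 'dickens', 'eliot'],
    'nouns': ['fog street fog', 'clerk fog', 'field fog mill'],
})

table = term_counts_by_category(toy_df, 'author', 'nouns', min_total=2)
print(table)
```

A term used heavily by one author and rarely by the other, like `fog` in the toy rows, is the kind of term that ends up far from the diagonal in the scatter plot.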

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `dickens_df` and `eliot_df`, make a comparison on the adjectives used by these authors using `Scattertext`.

In [14]:
# create samples
dickens_sample = dickens_df.sample(10_000)
eliot_sample = eliot_df.sample(10_000)

In [15]:
# combine DataFrames
combine_df = pd.concat([dickens_sample, eliot_sample])

In [16]:
# drop all columns except 'author' and 'adjectives'
# (use combine_df from the previous cell, not the earlier df)
adj_df = combine_df[['author', 'adjectives']]

In [17]:
# create a scattertext corpus
scatter_corpus = st.CorpusFromPandas(adj_df, category_col='author', text_col='adjectives').build()

In [18]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(scatter_corpus,
                                       category='eliot',  # this sets the y-axis
                                       category_name='Eliot', # label y-axis
                                       not_category_name='Dickens',  # label x-axis
                                       minimum_term_frequency=20,
                                       width_in_pixels=900)

In [19]:
# display visualization in notebook
HTML(html)