## Scatter plot with `Scattertext`
`scattertext` is "a tool for finding distinguishing terms in small-to-medium-sized corpora, and presenting them in an interesting, interactive scatter plot with non-overlapping term labels." (See the [documentation](https://github.com/JasonKessler/scattertext).)

In this notebook, we are going to compare the works of two 19th-century novelists: [Charles Dickens](https://en.wikipedia.org/wiki/Charles_Dickens) and [George Eliot](https://en.wikipedia.org/wiki/George_Eliot) (the pen name of Mary Ann Evans). Such a comparison could be used to address questions about gender and authorship or, perhaps, about key differences between novels set in urban and rural environments.

## Set up

In [1]:
%%capture
!pip install scattertext

In [51]:
import pandas as pd
import scattertext as st
from IPython.display import HTML, IFrame  # public IPython display API

In [9]:
d_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/dickens.csv'
d_df = pd.read_csv(d_url)

In [12]:
print(d_df.shape)
d_df.head()

(24707, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
0,dickens,carol,*** START OF THE PROJECT GUTENBERG EBOOK A CHR...,start,*,EBOOK
1,dickens,carol,\nThe combined qualities of the realist and th...,quality realist idealist degree attitude life ...,combined remarkable jovial general happy littl...,possess seem give allow
2,dickens,carol,Dickens gave his first formal expression to hi...,expression thought series book chrysolite succ...,first formal small first famous perfect immedi...,give write listen regard seem read
3,dickens,carol,This volume was put forth in a very attractive...,volume manner illustration artist character dr...,attractive first varied,put make live spirit
4,dickens,carol,"There followed upon this four others: ""The Chi...",other illustration appearance other day series...,first next familiar,follow know know love


In [14]:
# load data
e_url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/eliot.csv'
e_df = pd.read_csv(e_url)

In [15]:
# sanity check
print(e_df.shape)
e_df.head()

(18139, 6)


Unnamed: 0,author,title,text,nouns,adjectives,verbs
0,eliot,middlemarch,*** START OF THE PROJECT GUTENBERG EBOOK MIDDL...,start,*,EBOOK
1,eliot,middlemarch,"To my dear Husband, George Henry Lewes,\nin th...",year union,dear nineteenth,bless
2,eliot,middlemarch,\nWho that cares much to know the history of m...,history man mixture behave experiment life gen...,mysterious least little small rugged wide eyed...,care know vary smile walk go seek toddle beat ...
3,eliot,middlemarch,That Spanish woman who lived three hundred yea...,woman year kind theresa life unfolding action ...,spanish last many epic constant resonant certa...,live bear find match find sink tangle try shap...
4,eliot,middlemarch,Some have felt that these blundering lives are...,life indefiniteness nature woman level incompe...,due inconvenient feminine strict more social s...,feel blunder fashion count treat remain imagin...


## Pre-process data

There are a few changes we need to make to our data to get it ready for processing by `Scattertext`.

**First**, we are going to get a smaller sample of the data so that we can process things more quickly for our in-class demonstration. If you were to do this as a research project, you might consider using your entire dataset.

**Second**, we are going to combine both datasets into one `DataFrame`.

**Third**, we are going to drop all the columns from that `DataFrame` except for `author` and `nouns`.

In [16]:
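# create samples: 10,000 randomly chosen rows per author
d_samp_df = d_df.sample(10_000)
e_samp_df = e_df.sample(10_000)

In [17]:
# combine DataFrames
df = pd.concat([d_samp_df, e_samp_df])

In [22]:
# drop all columns except 'author' and 'nouns'
noun_df = df[['author', 'nouns']]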
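
The three steps above can also be wrapped in one small helper, which makes it easy to repeat them for another part-of-speech column. The sketch below is only an illustration: `make_pos_frame`, its parameters, and the toy frames are hypothetical names, not part of the notebook's data or of `Scattertext`. It also assumes two things the notebook does not do: it fixes a `random_state` so that the sample is reproducible, and it drops rows whose term column is missing (empty CSV cells are read as `NaN` by `pd.read_csv`).

```python
import pandas as pd


def make_pos_frame(frames, column, n, seed=42):
    """Sample up to n rows from each frame, stack them, and keep
    only the 'author' column plus one part-of-speech column."""
    # A fixed random_state makes the sample repeatable across runs.
    samples = [f.sample(n=min(n, len(f)), random_state=seed) for f in frames]
    combined = pd.concat(samples, ignore_index=True)
    # Assumption: rows with no terms for this column are not useful downstream.
    return combined[["author", column]].dropna(subset=[column])


# Toy stand-ins for d_df and e_df (not the real data).
toy_d = pd.DataFrame({
    "author": ["dickens"] * 3,
    "nouns": ["street door", None, "ghost fire"],
})
toy_e = pd.DataFrame({
    "author": ["eliot"] * 3,
    "nouns": ["field town", "marriage", "history life"],
})

toy_noun_df = make_pos_frame([toy_d, toy_e], "nouns", n=2, seed=0)
print(toy_noun_df)
```

With the real data, a call such as `make_pos_frame([d_df, e_df], 'nouns', n=10_000)` would play the role of the three cells above.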

## Build corpus and visualize

Now that we have our data in the shape that we need, we can hand it over to `Scattertext` to do the heavy lifting. The code below follows `Scattertext`'s [documentation](https://github.com/JasonKessler/scattertext) and proceeds in three steps: first we create a `Scattertext` corpus, then we transform that corpus into an HTML-based visualization, and finally we display that visualization within our notebook. Note: you can also save the visualization as a standalone HTML file.

In [23]:
corpus = st.CorpusFromPandas(noun_df, category_col='author', text_col='nouns').build()

In [43]:
html_noun = st.produce_scattertext_explorer(
    corpus,
    category='eliot',             # this sets the y-axis
    category_name='Eliot',        # label y-axis
    not_category_name='Dickens',  # label x-axis
    minimum_term_frequency=20,    # leave out terms that occur fewer than 20 times
    width_in_pixels=900,
)
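
To make the parameters in this call more concrete, here is a rough, library-free sketch of the kind of counting that sits behind such a plot: each term gets one frequency per author, and rare terms are left out. This is not `Scattertext`'s actual implementation, which applies its own scoring and scaling on top of the counts. The toy documents and the names `term_counts`, `MIN_FREQ`, and `coords` are hypothetical.

```python
from collections import Counter

# Toy stand-ins for the space-separated terms in the `nouns` column.
docs = [
    ("dickens", "street door fire street"),
    ("dickens", "door ghost street"),
    ("eliot", "field town door"),
    ("eliot", "field marriage field"),
]

MIN_FREQ = 2  # hypothetical threshold, playing the role of minimum_term_frequency

# Count how often each term occurs for each author.
term_counts = {"dickens": Counter(), "eliot": Counter()}
for author, text in docs:
    term_counts[author].update(text.split())

# Keep only terms whose total count across both authors reaches the threshold.
totals = term_counts["dickens"] + term_counts["eliot"]
kept = sorted(term for term, n in totals.items() if n >= MIN_FREQ)

# Each kept term gets one coordinate per author: (Dickens count, Eliot count).
coords = {t: (term_counts["dickens"][t], term_counts["eliot"][t]) for t in kept}
for term, (x, y) in coords.items():
    print(f"{term:10s} dickens={x} eliot={y}")
```

In this toy example, "street" is used only by the Dickens documents and "field" only by the Eliot documents, so they would land near opposite corners of the plot, while "door" is shared and would sit between them.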

In [45]:
# save the visualization as a standalone HTML file
file_name = 'scattertext_noun.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html_noun)

In [54]:
IFrame(src='scattertext_noun.html', width=900, height=600)

## Compare using adjectives

We have compared Dickens and Eliot on the basis of the nouns they used. It might also be informative to compare them on the basis of the adjectives they used.

Starting with our initial datasets, `d_df` and `e_df`, use `Scattertext` to compare the adjectives used by these authors.

In [28]:
# create samples
d_samp_df = d_df.sample(5_000)
e_samp_df = e_df.sample(5_000)

In [29]:
# combine DataFrames
df = pd.concat([d_samp_df, e_samp_df])

In [30]:
# drop all columns except 'author' and 'adjectives'
adj_df = df[['author', 'adjectives']]

In [31]:
# create a scattertext corpus
corpus = st.CorpusFromPandas(adj_df, category_col='author', text_col='adjectives').build()

In [40]:
# transform corpus into html-based visualization with scattertext
html_adj = st.produce_scattertext_explorer(
    corpus,
    category='eliot',             # this sets the y-axis
    category_name='Eliot',        # label y-axis
    not_category_name='Dickens',  # label x-axis
    minimum_term_frequency=20,    # leave out terms that occur fewer than 20 times
    width_in_pixels=900,
)

In [36]:
# save the visualization as a standalone HTML file
file_name = 'scattertext_adj.html'
with open(file_name, encoding='utf8', mode='w') as f:
    f.write(html_adj)

In [55]:
IFrame(src='scattertext_adj.html', width=900, height=600)