## Scattertext based on term frequencies

A small example showing how to generate a scattertext plot from term frequencies. We use the scattertext package, available at https://github.com/JasonKessler/scattertext.

The first step is to load example data.

In [1]:
import pandas as pd

# Read the data from the XLS file, downloaded from:
# https://www.wordfrequency.info/files/genres_sample.xls
# The 'lemma' column holds the words; scattertext needs it as the row index.
df = (pd.read_excel('data/genres_sample.xls')
      .dropna()
      .set_index('lemma')[['SPOKEN', 'FICTION']]
      .iloc[:1000])

df.head()

           SPOKEN    FICTION
lemma                       
the     3859682.0  4092394.0
I       1346545.0  1382716.0
they     609735.0   352405.0
she      212920.0   798208.0
would    233766.0   229865.0


We now generate the scattertext and save the result to an HTML file. The output file is accessible via [this URL](logs/spoken_fiction.html).

In [2]:
import scattertext as st

term_cat_freq = st.TermCategoryFrequencies(df)

html = st.produce_scattertext_explorer(
	term_cat_freq,
	category='SPOKEN',
	category_name='Spoken',
	not_category_name='Fiction',
)

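# Save the output to a file. write() returns the number of bytes written,
# which the notebook echoes as the cell output.
open("logs/spoken_fiction.html", 'wb').write(html.encode('utf-8'))

# Optional: display the result inline instead of opening the saved file.
# from IPython.display import IFrame
# IFrame('logs/spoken_fiction.html', width=700, height=600)
# from IPython.display import display, HTML
# display(HTML(html))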


662453

Next, load the token frequencies from a CSV file. This file has to be prepared in advance.
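The notebook does not show how the frequency file was produced. One possible way to prepare a file in the same layout (a `Word` column followed by one count column per text) is sketched below using only the standard library. The helper names `count_tokens` and `write_frequencies` are hypothetical, and the tokenization (lowercasing, then keeping runs of ASCII letters) is an assumption, not necessarily what was used for `logs/frequencies.csv`.

```python
import csv
import re
from collections import Counter


def count_tokens(text):
    """Lowercase the text and count runs of ASCII letters as tokens (assumed tokenization)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def write_frequencies(path, texts):
    """Write one row per word and one count column per text.

    `texts` maps a column name (e.g. a book title) to the raw text of that document.
    Returns the sorted vocabulary that was written.
    """
    counts = {name: count_tokens(text) for name, text in texts.items()}
    vocabulary = sorted(set().union(*counts.values()))
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Word", *counts])  # header: Word, then one column per text
        for word in vocabulary:
            # A Counter returns 0 for words that do not occur in a text.
            writer.writerow([word, *(counts[name][word] for name in counts)])
    return vocabulary
```

Calling `write_frequencies` with a mapping such as `{"Brothers_Karamazov": text_a, "Murder_of_Roger_Ackroyd": text_b}`, where `text_a` and `text_b` hold the raw book texts, would yield a file with the three columns used in this notebook.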

In [3]:
# load the data from the CSV file
df = pd.read_csv('logs/frequencies.csv')

# strip leading/trailing spaces and tabs from the column names (otherwise column lookups by name fail)
df.columns = df.columns.str.strip()
print(df.columns)

print(df.head())

# the words have to be the row index
df = df.set_index('Word')
print(df.head())

term_cat_freq = st.TermCategoryFrequencies(df)

html = st.produce_scattertext_explorer(
	term_cat_freq,
	category='Brothers_Karamazov',
	category_name='Brothers_Karamazov',
	not_category_name='Murder_of_Roger_Ackroyd',
)

open("logs/two_books_scattertext.html", 'wb').write(html.encode('utf-8'))

Index(['Word', 'Brothers_Karamazov', 'Murder_of_Roger_Ackroyd'], dtype='object')
        Word  Brothers_Karamazov  Murder_of_Roger_Ackroyd
0    project                  89                        0
1  gutenberg                  41                        0
2      ebook                  15                        0
3   brothers                  55                        0
4  karamazov                 179                        0
           Brothers_Karamazov  Murder_of_Roger_Ackroyd
Word                                                  
project                    89                        0
gutenberg                  41                        0
ebook                      15                        0
brothers                   55                        0
karamazov                 179                        0


1336483

The output file with the final scattertext is accessible via [this URL](logs/two_books_scattertext.html).
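The column-name clean-up in cell [3] is easy to overlook, so here is a small standard-library illustration of the problem it guards against. The CSV content below is made up for the demonstration: a space after each comma in the header row becomes part of the column name, and lookups by the intended name then fail. pandas keeps such whitespace by default as well, which is why the cell calls `df.columns.str.strip()`.

```python
import csv
import io

# Hypothetical CSV content with a space after each comma in the header row.
raw = "Word, Brothers_Karamazov, Murder_of_Roger_Ackroyd\nproject,89,0\n"

rows = list(csv.DictReader(io.StringIO(raw)))
print(list(rows[0]))                      # the keys keep their leading space
print("Brothers_Karamazov" in rows[0])    # False: the real key is ' Brothers_Karamazov'

# Stripping the keys restores lookups by the intended column name.
cleaned = [{key.strip(): value for key, value in row.items()} for row in rows]
print("Brothers_Karamazov" in cleaned[0])  # True
```

An alternative in pandas is to pass `skipinitialspace=True` to `pd.read_csv`, which drops spaces that follow a delimiter. Stripping the column names afterwards, as the notebook does, also handles trailing whitespace and tabs.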