# Word Counting

## Skills

1. **Tokenize data using the tidytext module.**
2. **Cleanse text data.**
3. **Analyze a document using its most frequent words.**
4. **Visualize a text using word clouds.**
5. Compare paired documents using relative frequencies.
6. Use TF-IDF to examine word frequencies in groups of documents.
7. Use TF-IDF to define context-dependent stop words.

## Vocabulary List

**data pipeline.** A series of steps that takes data from the point they are recorded to their final, processed stage, which may include visualizations or other summary statistics. Data pipelines are common not only in NLP but in data science more generally.

**lemma.** The "basic" form of a word, without conjugation, pluralization, &c. What this means is language-dependent, and can be context-dependent as well.

**logarithm.** A way to transform data that allows you to compare numbers across very different scales easily. The number 1 has a logarithm of 0, which is extremely useful for "zeroing out" common words in TF-IDF analysis. Additionally, $\log(1/2) = -\log(2)$, which is useful for looking at relative frequencies.

**stop words.** Extremely common words that give little insight into a document in a word-level analysis.

**TF-IDF.** Term frequency-inverse document frequency. A way of picking out the words that are common in one text but rare in the other texts it is compared with.
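To make the logarithm and TF-IDF entries a little more concrete, here is a minimal sketch. The two toy documents, the `tf_idf` helper, and the formula `tf * log(N / df)` are illustrative assumptions; this is one common way to define TF-IDF, not necessarily the exact variant used later.

```python
import math

# log(1) is 0, so a ratio of exactly 1 gets "zeroed out"
print(math.log(1))

# Halving and doubling are mirror images on a log scale
print(math.isclose(math.log(1 / 2), -math.log(2)))

# Two made-up documents, already split into words
docs = {
    "doc_a": "we rise we climb".split(),
    "doc_b": "we hope".split(),
}
n_docs = len(docs)

def tf_idf(word, doc_name):
    """Hypothetical helper: term frequency times log(N / document frequency)."""
    words = docs[doc_name]
    tf = words.count(word) / len(words)            # share of this document
    df = sum(word in ws for ws in docs.values())   # number of documents containing the word
    return tf * math.log(n_docs / df)

print(tf_idf("we", "doc_a"))    # "we" appears in every document, so log(2 / 2) = 0
print(tf_idf("rise", "doc_a"))  # "rise" appears only in doc_a, so the score is positive
```

A word that shows up in every document gets a score of zero no matter how often it is used, which is the "zeroing out" behavior described in the logarithm entry.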

## Loading Libraries

In [None]:
# Standard packages that we'll always be using
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The tidytext package isn't in Google Colaboratory's default list of packages, so we install it first
!pip install tidytext
import tidytext
# This library is used by tidytext for tokenization
import nltk
nltk.download('punkt')

# For making word clouds (unsurprisingly)
from wordcloud import WordCloud, STOPWORDS

Collecting tidytext
  Downloading tidytext-0.0.1.tar.gz (4.3 kB)
  Preparing metadata (setup.py) ... done
Collecting siuba (from tidytext)
  Downloading siuba-0.4.4-py3-none-any.whl (208 kB)
     208.6/208.6 kB 3.5 MB/s eta 0:00:00
Building wheels for collected packages: tidytext
  Building wheel for tidytext (setup.py) ... done
  Created wheel for tidytext: filename=tidytext-0.0.1-py3-none-any.whl size=3870 sha256=5e71f5744dac3c72f4ecdc58569231517f655b8b72d4cba306d7089b4664ee43
  Stored in directory: /root/.cache/pip/wheels/88/40/40/04f8d22d7729547afa13c2cbffb494737351dd4465f2f26288
Successfully built tidytext
Installing collected packages: siuba, tidytext
Successfully installed siuba-0.4.4 tidytext-0.0.1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Processing and Word Counting

<div style="text-align:center;"><img src="https://images-na.ssl-images-amazon.com/images/I/91QclGg4BjL.jpg" height="250" width="155">&nbsp;&nbsp;&nbsp;&nbsp;<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/210120-D-WD757-2714_%2850861221216%29.jpg/320px-210120-D-WD757-2714_%2850861221216%29.jpg"></div>

Below is the poem "The Hill We Climb" by Amanda Gorman, who served as the U.S. National Youth Poet Laureate from 2017 to 2018.

In [None]:
text = """When day comes, we ask ourselves, where can we find light in this never-ending shade?
The loss we carry. A sea we must wade.
We braved the belly of the beast.
We’ve learned that quiet isn’t always peace, and the norms and notions of what “just” is isn’t always justice.
And yet the dawn is ours before we knew it.
Somehow we do it.
Somehow we weathered and witnessed a nation that isn’t broken, but simply unfinished.
We, the successors of a country and a time where a skinny Black girl descended from slaves and raised by a single mother can dream of becoming president, only to find herself reciting for one.
And, yes, we are far from polished, far from pristine, but that doesn’t mean we are striving to form a union that is perfect.
We are striving to forge our union with purpose.
To compose a country committed to all cultures, colors, characters and conditions of man.
And so we lift our gaze, not to what stands between us, but what stands before us.
We close the divide because we know to put our future first, we must first put our differences aside.
We lay down our arms so we can reach out our arms to one another.
We seek harm to none and harmony for all.
Let the globe, if nothing else, say this is true.
That even as we grieved, we grew.
That even as we hurt, we hoped.
That even as we tired, we tried.
That we’ll forever be tied together, victorious.
Not because we will never again know defeat, but because we will never again sow division.
Scripture tells us to envision that everyone shall sit under their own vine and fig tree, and no one shall make them afraid.
If we’re to live up to our own time, then victory won’t lie in the blade, but in all the bridges we’ve made.
That is the promise to glade, the hill we climb, if only we dare.
It’s because being American is more than a pride we inherit.
It’s the past we step into and how we repair it.
We’ve seen a force that would shatter our nation, rather than share it.
Would destroy our country if it meant delaying democracy.
And this effort very nearly succeeded.
But while democracy can be periodically delayed, it can never be permanently defeated.
In this truth, in this faith we trust, for while we have our eyes on the future, history has its eyes on us.
This is the era of just redemption.
We feared at its inception.
We did not feel prepared to be the heirs of such a terrifying hour.
But within it we found the power to author a new chapter, to offer hope and laughter to ourselves.
So, while once we asked, how could we possibly prevail over catastrophe, now we assert, how could catastrophe possibly prevail over us?
We will not march back to what was, but move to what shall be: a country that is bruised but whole, benevolent but bold, fierce and free.
We will not be turned around or interrupted by intimidation because we know our inaction and inertia will be the inheritance of the next generation, become the future.
Our blunders become their burdens.
But one thing is certain.
If we merge mercy with might, and might with right, then love becomes our legacy and change our children’s birthright.
So let us leave behind a country better than the one we were left.
Every breath from my bronze-pounded chest, we will raise this wounded world into a wondrous one.
We will rise from the golden hills of the West.
We will rise from the windswept Northeast where our forefathers first realized revolution.
We will rise from the lake-rimmed cities of the Midwestern states.
We will rise from the sun-baked South.
We will rebuild, reconcile, and recover.
And every known nook of our nation and every corner called our country, our people diverse and beautiful, will emerge battered and beautiful.
When day comes, we step out of the shade of flame and unafraid.
The new dawn balloons as we free it.
For there is always light, if only we’re brave enough to see it.
If only we’re brave enough to be it."""

We can take this text and put it into a DataFrame.

This is a somewhat awkward way to do it, since there's just a single entry. When you create a DataFrame directly like this, you give it a list (or a list of lists, one inner list per row) and a list of the column names.

Soon enough we'll be working with more text in CSV files again, and won't have to worry about this so much.

In [None]:
df = pd.DataFrame([text], columns=["text"])
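For comparison, here is a small sketch of the "list of lists" form with more than one document. The titles, the shortened texts, and the name `docs_df` are made up for illustration.

```python
import pandas as pd

# Hypothetical two-document example: each inner list becomes one row
rows = [
    ["poem_1", "When day comes, we ask ourselves"],
    ["poem_2", "Somehow we do it"],
]

# The second argument names the columns, in the same order as the inner lists
docs_df = pd.DataFrame(rows, columns=["title", "text"])
print(docs_df.shape)  # (2, 2)
```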

We can use the tidytext package's `unnest_tokens()` function to tokenize the text in the DataFrame, producing a new DataFrame with one word per row. The three arguments to the function are:
* The DataFrame with the text to tokenize
* The name of the column that words will go in
* The name of the column that contains the text to tokenize

In [None]:
word_df = tidytext.unnest_tokens(df, "word", "text")
word_df.head()

Unnamed: 0,word
0,when
0,day
0,comes
0,we
0,ask
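To see where the one-word-per-row shape (and the repeated index of 0) comes from, here is a rough pandas-only approximation. It is a sketch for intuition, not how tidytext works internally, and the simple `[a-z]+` pattern is an assumption that will treat punctuation differently from the tokenizer used above.

```python
import pandas as pd

sample_df = pd.DataFrame(["When day comes, we ask ourselves"], columns=["text"])

tokens = (
    sample_df["text"]
    .str.lower()                # lowercase everything
    .str.findall(r"[a-z]+")     # a list of words for each row
    .explode()                  # one word per row; the original row index repeats
    .rename("word")
    .to_frame()
)
print(tokens["word"].tolist())
# ['when', 'day', 'comes', 'we', 'ask', 'ourselves']
```

Every word keeps the index of the row it came from, which is why all the rows above are labeled 0: the whole poem sits in a single row.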


From here, we can count the words and visualize the most common ones. Let's do that together.

In [None]:
word_df["word"].value_counts()

we          60
the         29
and         25
to          21
our         18
            ..
tells        1
envision     1
everyone     1
sit          1
see          1
Name: word, Length: 340, dtype: int64
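One possible way to visualize the counts is a horizontal bar chart. This is a sketch on a tiny stand-in DataFrame (`sample_words` is made up so the example runs on its own); with the real data you would start from `word_df` instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small stand-in for word_df
sample_words = pd.DataFrame(
    {"word": "we will rise we will rebuild we climb".split()}
)

# Count the words and turn the result into a two-column DataFrame
top_words = (
    sample_words["word"]
    .value_counts()
    .head(2)
    .rename_axis("word")
    .reset_index(name="n")
)

plt.figure()
plt.barh(top_words["word"], top_words["n"])
plt.gca().invert_yaxis()   # put the most frequent word at the top
plt.xlabel("count")
plt.show()
```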

Notice anything unusual? How can we fix this?

In [None]:
# Keep only the rows whose word is not in wordcloud's stop word list
word_df = word_df.loc[~word_df["word"].isin(STOPWORDS)]
# Or with query
# word_df = word_df.query("word not in @STOPWORDS")

In [None]:
word_df["word"].value_counts().head(10)

’          15
will       12
one         6
country     6
us          6
t           5
rise        4
isn         3
even        3
always      3
Name: word, dtype: int64
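The counts above still include fragments such as `’`, `t`, and `isn`, because the poem uses curly apostrophes and the contractions get split apart. One possible cleaning step, sketched here with a pandas-only tokenizer, is to swap the curly apostrophe for a straight one and keep apostrophes inside words. The regular expression and the tiny `stop` set are illustrative assumptions. The same `replace` could be applied to the text before calling `unnest_tokens()`, though how that tokenizer then handles contractions is not verified here.

```python
import pandas as pd

raw = "We’ve learned that quiet isn’t always peace"

# Swap the curly apostrophe (U+2019) for a straight one
cleaned = raw.replace("\u2019", "'")

# Allow one apostrophe inside a word so contractions stay in one piece
words = (
    pd.Series([cleaned])
    .str.lower()
    .str.findall(r"[a-z]+(?:'[a-z]+)?")
    .explode()
)

# Tiny illustrative stop list; the notebook itself uses wordcloud's STOPWORDS
stop = {"we've", "that", "isn't"}
print(words[~words.isin(stop)].tolist())
# ['learned', 'quiet', 'always', 'peace']
```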

## Word Clouds

Word clouds are a fun way to visualize the most common words in a dataset. The [WordCloud package](https://amueller.github.io/word_cloud/) provides a number of functions for making pretty images, with lots of customization.

Here's a basic word cloud. Note that the package takes a string as input, not a DataFrame, so we first join our cleaned and tokenized DataFrame back into a single string.

In [None]:
newtext = " ".join(word_df["word"])

# I copied all of this from the WordCloud documentation.
# See the link above for options about shape, color, number of words, and so on.
cloud = WordCloud().generate(newtext)
plt.figure()
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()