## Gentle introduction to text mining

In this notebook you will do the simplest of text mining against a file. The notebook employs only functionality that comes "out of the box" with Python; no third-party libraries are needed, and the only import is the standard library's pathlib module. In the end, you will be able to do the most rudimentary comparisons between a number of different texts.

The first step is to define a few constants which, taken together, name the file you want to "read".

In [1]:
# pre-configure
LIBRARY = '/Users/eric/Documents/reader-library'
CARREL  = 'homer'


In [None]:
# configure
ETC       = 'etc'
TEXT      = 'reader.txt'
STOPWORDS = 'stopwords.txt'


In [None]:
# require
from pathlib import Path


The second step is to open the file, get all the data, and output the result. You can see that the file is a whole lot of text.

In [None]:
# initialize
library = Path( LIBRARY )
text    = library/CARREL/ETC/TEXT
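
The cell above builds the file's location by joining path segments with the "/" operator, a feature of pathlib's Path objects. The following is a minimal sketch of that idea; the directory name is hypothetical and nothing is read from disk.

```python
from pathlib import Path

# Hypothetical location; substitute the values of your own constants.
library = Path("reader-library")

# The "/" operator joins path segments using the separator
# appropriate for the operating system.
path = library / "homer" / "etc" / "reader.txt"

print(path.name)         # the final segment, i.e. the file's name
print(path.parent.name)  # the directory containing the file
print(path.exists())     # whether the file is actually on disk
```

Checking exists() before opening a file is one way to catch a mistyped constant early.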


In [None]:
# get the text
with open( text ) as handle : text = handle.read()
print( text )


The third step is to parse the data into tokens ("words"), and in this case, a token is defined as anything delimited by whitespace (spaces, tabs, or line breaks) -- the default behavior of the "split" method. The result of this process will be the creation of a list ("array"), and the list is then output.

In [None]:
tokens = text.split()

print()
print( 'Your text consists of the following tokens ("words"):' )
print()
print( tokens )
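
To see what "delimited by whitespace" means in practice, consider the following sketch. The sample string is made up for illustration, and it compares split() called with no argument against split() called with a single space.

```python
# A made-up sample containing a tab, a newline, and a run of spaces.
sample = "Sing, O goddess,\tthe anger\nof   Achilles"

# With no argument, split() breaks on any run of whitespace
# and never returns empty strings.
by_whitespace = sample.split()

# With an explicit space, every single space is a delimiter, so runs of
# spaces yield empty strings, and tabs and newlines stay inside tokens.
by_space = sample.split(" ")

print(by_whitespace)
print(by_space)
```

Notice that punctuation stays attached to the tokens in both cases; "Sing," is not the same token as "Sing".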

Arrays have sizes -- the number of elements in the array. Here we calculate the size of the array and output it.

In [None]:
size = len( tokens )

print()
print( 'The given file is %s tokens ("words") in size.' % size )

The next step is to count & tabulate the items in the array. This will give us a set of name-value pairs (a Python data structure called a "dictionary") where the name of each pair is a token, and the value of each pair is the token's frequency. Once the name-value pairs have been created, the dictionary is sorted by value, and finally we output the N most frequently occurring tokens along with their counts.

The sorting process is very "Pythonic", meaning, the process uses an idiom specific to the Python programming language.

Increase or decrease the value of N to output a greater or lesser number of frequencies.

In [None]:
frequencies = {}
N           = 50

for token in tokens :
    
    if token in frequencies : frequencies[ token ] += 1
    else                    : frequencies[ token ] =  1

frequencies = { key:value for key, value in sorted( frequencies.items(), key=lambda item:item[ 1 ], reverse=True ) }

print()
print( 'The %i most frequently occurring tokens and their counts are:' % N )
print()

for index, token in enumerate( frequencies ) :

    # stop once N tokens have been output
    if ( index >= N ) : break

    count = frequencies[ token ]
    print( '  * %s (%i)' % ( token, count ) )
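
For comparison, the standard library's collections module includes a Counter class which does the counting and the sorting for you. The following is a minimal sketch using a made-up list of tokens; it does require an import, which the cell above deliberately avoids.

```python
from collections import Counter

# Hypothetical tokens standing in for the list created by text.split().
sample_tokens = ["the", "wrath", "of", "the", "son", "of", "the", "king"]

# Counter is a dictionary subclass; it tallies each distinct token.
sample_frequencies = Counter(sample_tokens)

# most_common(n) returns the n most frequent (token, count) pairs,
# ordered from most to least frequent.
top = sample_frequencies.most_common(2)

for token, count in top:
    print("  * %s (%i)" % (token, count))
```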

If the size of N (above) is large enough, you will see tokens containing punctuation marks, such as the tokens at the end of sentences. If the text were a more typical text (one that has not been preprocessed by the Distant Reader), then you would also see words in a variety of cases. Accurate text mining requires the tokens to be normalized ("cleaned"), and the following cell illustrates how to retain word-only tokens and lower-case them in a single go. This process is also very Pythonic.

In [None]:
tokens = [ token.lower() for token in tokens if token.isalpha() ]
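
Be aware that isalpha() discards a token entirely if it contains anything other than letters, so a word followed by a comma is dropped, not trimmed. The following sketch, using made-up tokens, illustrates the difference; stripping punctuation first is a hypothetical alternative, not what this notebook does.

```python
import string

# Made-up tokens, some with trailing punctuation, one purely numeric.
sample_tokens = ["Sing,", "O", "goddess,", "the", "anger", "of", "Achilles", "1851"]

# The approach used above: keep purely alphabetic tokens, lower-cased.
# "Sing," and "goddess," are dropped because of the trailing comma.
alpha_only = [token.lower() for token in sample_tokens if token.isalpha()]

# A hypothetical alternative: strip leading and trailing punctuation
# first, then apply the same test, so "Sing," survives as "sing".
stripped = [token.strip(string.punctuation) for token in sample_tokens]
stripped_alpha = [token.lower() for token in stripped if token.isalpha()]

print(alpha_only)
print(stripped_alpha)
```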

Almost without a doubt, the list of most frequent words included words of little interest, such as "the", "a", or "an". These types of words are often called "stop words". The following cell initializes a list of stop words and then removes them from the list of tokens. Yet again, the process is Pythonic.

In [None]:
stopwords = library/CARREL/ETC/STOPWORDS
with open( stopwords ) as handle : stopwords = handle.read().split( '\n')
tokens    = [ token for token in tokens if token not in stopwords ]
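
If your text is large, you may want to store the stop words in a set instead of a list, because testing membership in a set is much faster. The following is a minimal sketch; the file contents are made up, and splitlines() is used so a trailing line break does not produce an empty stop word.

```python
# Hypothetical contents of a stop word file, one word per line.
raw = "the\na\nan\nof\n"

# split('\n') leaves an empty string after the final line break...
as_list = raw.split("\n")

# ...whereas splitlines() does not. A set makes each lookup fast.
stop_set = set(raw.splitlines())

sample_tokens = ["the", "anger", "of", "achilles"]
kept = [token for token in sample_tokens if token not in stop_set]

print(kept)
```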

The list of tokens has now been altered, and consequently you will want to re-create a list of frequencies, sort the list, and output the N most frequent tokens. The result ought to be more meaningful than the previous list of frequencies.

Increase or decrease the value of N to output a greater or lesser number of frequencies.

In [None]:
frequencies = {}
N           = 50

for token in tokens :
    
    if token in frequencies : frequencies[ token ] += 1
    else                    : frequencies[ token ] =  1

frequencies = { key:value for key, value in sorted( frequencies.items(), key=lambda item:item[ 1 ], reverse=True ) }

print()
print( 'The %i most frequently occurring tokens and their counts are:' % N )
print()

for index, token in enumerate( frequencies ) :

    # stop once N tokens have been output
    if ( index >= N ) : break

    count = frequencies[ token ]
    print( '  * %s (%i)' % ( token, count ) )

The following cell creates the simplest of visualizations illustrating the frequency of the most frequent N tokens. Season the value of N to taste.

In [None]:
N       = 50
width   = 50
maximum = round( frequencies[ list( frequencies.keys() )[ 0 ] ] * 1.1 )

for index, token in enumerate( frequencies ) :

    # stop once N tokens have been output
    if ( index >= N ) : break

    # scale each count to the given width
    count = frequencies[ token ]
    size  = round( count * width / maximum )
    print( token, "\t", '#' * size )
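
Tabs do not always line up when tokens differ greatly in length. The following sketch wraps the same idea in a function and pads each token to a common width. The function name and its parameters are hypothetical, and unlike the cell above it scales the bars against the largest count itself, without adding ten percent.

```python
def bar_chart(frequencies, n=5, width=50):
    """Return one text bar per token for the n most frequent tokens."""
    # Sort from most to least frequent and keep only the first n pairs.
    top = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)[:n]
    if not top:
        return []
    # Scale against the largest count; pad tokens so the bars line up.
    maximum = top[0][1]
    padding = max(len(token) for token, _ in top)
    return [
        "%s  %s" % (token.ljust(padding), "#" * round(count * width / maximum))
        for token, count in top
    ]

# Made-up frequencies for illustration.
sample_frequencies = {"ship": 10, "sea": 5, "whale": 20}
lines = bar_chart(sample_frequencies, n=2, width=20)

for line in lines:
    print(line)
```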

It is quite probable you will want to know the significance of a given idea in your text(s). For example, you might be interested in "love" or "war" or "truth" or "beauty". One way to measure significance is to divide the frequency of the given idea by the size of the entire text. The greater the resulting ratio, the more significant the given word.

Change the value of idea to observe differences between words, but be careful. Your idea may not appear in the text, and if it doesn't then a "KeyError" will occur.

In [None]:
idea         = 'love'
count        = frequencies[ idea ]
size         = len( tokens )
significance = count / size

print()
print( 'The significance of the idea "%s" is %f.' % ( idea, significance ) )
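
One way to avoid the KeyError is to use the dictionary's get() method, which returns a default value when the key is absent. The following is a minimal sketch; the function name and the sample data are made up for illustration.

```python
def significance(idea, frequencies, size):
    """Return the idea's share of all tokens, or 0.0 if it is absent."""
    # get() returns the default (0) instead of raising a KeyError.
    count = frequencies.get(idea, 0)
    # Guard against an empty list of tokens to avoid dividing by zero.
    return count / size if size else 0.0

# Made-up tokens and their frequencies.
sample_tokens = ["love", "war", "love", "truth"]
sample_frequencies = {"love": 2, "war": 1, "truth": 1}

present = significance("love", sample_frequencies, len(sample_tokens))
absent = significance("beauty", sample_frequencies, len(sample_tokens))

print('The significance of the idea "%s" is %f.' % ("love", present))
print('The significance of the idea "%s" is %f.' % ("beauty", absent))
```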


## Summary

Text mining is largely about reading files, normalizing the text, counting & tabulating tokens ("words"), comparing those tokens and their characteristics to other tokens, and visualizing the results. This notebook -- a gentle introduction to text mining -- demonstrated some of these techniques sans any special functionality provided by a third-party Python library.

But that is doing it the hard way. There exists a whole world of text mining & natural language processing toolkits for analyzing ("reading") texts. The balance of notebooks in this repository exploit those toolkits.

## Next steps

Consider some next steps.

This repository came with a number of different Distant Reader study carrels. By default, this notebook processed content from the carrel named "homer". Other carrels in this repository ought to include:

   * augustine-confessions-397
   * melville-moby-1851
   * shakespeare-sonnets
 
Change "homer" in the value of CARREL at the beginning of this notebook to the name of a different study carrel, and then run the whole notebook again. Once you have done so, ask yourself the following questions:

   * Which study carrel is largest? Smallest? How do they all compare?
   * Depending on the carrel, how might you alter the list of stop words?
   * How does the significance of a given idea compare across carrels?
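
The last question can be approached by computing the same ratio for each text and listing the results side by side. The following is a minimal sketch; the function name is hypothetical, and two short made-up strings stand in for the contents of different carrels.

```python
def relative_frequency(text, idea):
    """Tokenize and normalize the text, then return the idea's ratio."""
    tokens = [token.lower() for token in text.split() if token.isalpha()]
    if not tokens:
        return 0.0
    return tokens.count(idea) / len(tokens)

# Made-up texts; in practice each would be read from a carrel's file.
texts = {
    "first": "love and war and love",
    "second": "truth and beauty and love",
}

comparison = {name: relative_frequency(text, "love") for name, text in texts.items()}

# Output the texts from the highest ratio to the lowest.
for name, ratio in sorted(comparison.items(), key=lambda item: item[1], reverse=True):
    print("%s\t%f" % (name, ratio))
```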

Finally, you are not limited to files from Distant Reader study carrels. Change the values of LIBRARY, CARREL, ETC, and TEXT so that together they point to any plain text (.txt) file on your computer, and do a bit of additional "reading".

---
<p style='text-align: right'>Eric Lease Morgan &lt;emorgan@nd.edu&gt;<br />
April 21, 2021</p>