## Gentle introduction to text mining, revisited

In this notebook you will define a file to open, and then the script will count & tabulate the words it contains. By defining different files to open, you will be able to quickly & easily compare & contrast texts. --Eric Lease Morgan

In [1]:
# configure; define some constants
CARREL   = 'CodeOfConduct4Lib-2021-03-10'
TEMPLATE = './carrels/%s/etc/reader.txt'
LANGUAGE = 'english'


In [2]:
# require; what modules are needed?
from nltk import FreqDist, corpus, word_tokenize


In [3]:
# read; build the file name from the carrel, and slurp up its contents
file = TEMPLATE % CARREL
with open( file ) as handle :
    data = handle.read()


In [4]:
# create a list of all the tokens (words, punctuation, etc) in the data
tokens = word_tokenize( data )


In [5]:
# count & tabulate each token, and output the N most frequent tokens
N          = 5
frequencies = FreqDist( tokens )
frequencies.most_common( N )


[(',', 82), ('the', 78), ('.', 36), ('to', 34), ('or', 34)]
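
NLTK's `FreqDist` is a subclass of Python's `collections.Counter`, so the counting step above can be sketched without NLTK. The token list below is made up for illustration and is not taken from the carrel.

```python
from collections import Counter

# hypothetical, pre-tokenized sample; not from the carrel
sample = ['the', 'code', ',', 'the', 'community', ',', 'and', 'the', 'event', '.']

# count & tabulate each token, much as FreqDist does
counts = Counter(sample)

# output the two most frequent tokens
top = counts.most_common(2)
print(top)
```

Notice how punctuation and common words rise to the top, just as they do in the real output above.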

In [6]:
# loop through each token and retain only lower-cased "words"; very pythonic
tokens = [ token.lower() for token in tokens if token.isalpha() ] 


In [7]:
# count & tabulate again
N          = 5
frequencies = FreqDist( tokens )
frequencies.most_common( N )


[('the', 78), ('to', 34), ('or', 34), ('a', 26), ('you', 23)]
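
The filter in cell 6 relies on `str.isalpha()`, which is true only when every character in a token is a letter. Punctuation, numbers, and mixed tokens are therefore all dropped. A minimal sketch, using made-up tokens chosen to show each case:

```python
# hypothetical tokens, chosen to show what str.isalpha() keeps and drops
sample = ['The', ',', 'Code4Lib', '2021', 'community', "n't", 'Event', '.']

# retain only purely alphabetic tokens, and lower-case them
words = [token.lower() for token in sample if token.isalpha()]
print(words)
```

Lower-casing matters because 'The' and 'the' would otherwise be counted as two different words.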

In [8]:
# create a list ("set") of stop words from the given language
stopwords = set( corpus.stopwords.words( LANGUAGE ) ) 


In [9]:
# loop through the tokens (again), and retain only non-stopwords
tokens = [ token for token in tokens if token not in stopwords ] 


In [10]:
# count & tabulate again
N          = 5
frequencies = FreqDist( tokens )
frequencies.most_common( N )

[('community', 20), ('lib', 17), ('offender', 17), ('code', 16), ('event', 11)]
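
Removing stop words is what lets the meaningful words surface. The sketch below shows the same filter on a tiny scale; the stop list here is a hand-made assumption and is far shorter than the English list that comes with NLTK.

```python
from collections import Counter

# hypothetical stop list; NLTK's English list is much longer
stop = {'the', 'to', 'or', 'a', 'you'}

# hypothetical, already lower-cased words
words = ['the', 'community', 'or', 'the', 'event', 'to', 'community', 'you']

# retain only non-stopwords; testing membership in a set is fast
kept = [word for word in words if word not in stop]

# count & tabulate what remains
top = Counter(kept).most_common(2)
print(top)
```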

In [11]:
# count & tabulate yet again, but this time output many more tokens
N           = 50
frequencies = FreqDist( tokens )
frequencies.most_common( N )


[('community', 20),
 ('lib', 17),
 ('offender', 17),
 ('code', 16),
 ('event', 11),
 ('support', 11),
 ('may', 10),
 ('including', 8),
 ('conference', 7),
 ('incident', 7),
 ('volunteers', 7),
 ('list', 6),
 ('volunteer', 6),
 ('people', 6),
 ('phone', 6),
 ('number', 6),
 ('sexual', 5),
 ('channel', 5),
 ('behavior', 5),
 ('events', 5),
 ('participants', 5),
 ('step', 5),
 ('either', 5),
 ('help', 5),
 ('helpers', 5),
 ('name', 5),
 ('harassment', 4),
 ('public', 4),
 ('online', 4),
 ('harassing', 4),
 ('action', 4),
 ('decision', 4),
 ('banning', 4),
 ('coc', 4),
 ('safe', 3),
 ('talks', 3),
 ('gender', 3),
 ('physical', 3),
 ('contact', 3),
 ('feel', 3),
 ('comfortable', 3),
 ('speaking', 3),
 ('time', 3),
 ('organizer', 3),
 ('website', 3),
 ('c', 3),
 ('know', 3),
 ('slack', 3),
 ('role', 3),
 ('see', 3)]
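
To compare & contrast texts, the steps above can be folded into a single function and applied to each text in turn. This is only a sketch under stated assumptions: `top_words` is a hypothetical helper, a crude regular expression stands in for `word_tokenize`, and a short hand-made stop list stands in for NLTK's, so the results will differ somewhat from the notebook's.

```python
import re
from collections import Counter

# hypothetical stop list, standing in for NLTK's English stop words
STOPWORDS = {'the', 'to', 'or', 'a', 'you', 'and', 'of', 'is'}

def top_words(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stop words in the given text."""
    # crude tokenizer: runs of letters only; a stand-in for word_tokenize
    words = re.findall(r'[a-z]+', text.lower())
    # retain only non-stopwords
    words = [word for word in words if word not in stopwords]
    # count & tabulate
    return Counter(words).most_common(n)

# two made-up texts to compare & contrast
first = 'The community supports the event, and the community is safe.'
second = 'A volunteer helps the event; a volunteer is a helper.'

print(top_words(first, n=2))
print(top_words(second, n=2))
```

In the notebook itself, the same effect is had by changing the value of CARREL and running the cells again.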