## Computing with Language: Statistics

In [None]:
import nltk

# make sure that NLTK language resources have been downloaded 
# (see "NLTK Introduction" notebook)

from nltk.book import *

### [Word] Frequency Distributions

FreqDist is used to encode "frequency distributions", which count the number of times that each outcome of an experiment occurs.
* In the case of a text, the frequency distribution contains the counts of all tokens that appear in that text.
* Technically: `FreqDist()` creates a Python object that holds information about a frequency distribution.
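
As a rough illustration of what such an object holds, the sketch below counts tokens with Python's standard `collections.Counter`, the class that NLTK's `FreqDist` subclasses. The token list and the `toy_` names are invented for this example.

```python
from collections import Counter

# a tiny, made-up token list (stands in for a tokenized text)
toy_tokens = ["the", "whale", "saw", "the", "ship", ",", "the", "ship", "sank"]

# count how many times each token occurs
toy_counts = Counter(toy_tokens)

print(toy_counts["the"])         # count for a single token
print(len(toy_counts))           # number of distinct tokens
print(sum(toy_counts.values()))  # total number of tokens
```

Punctuation is counted like any other token here, which is also what happens when `FreqDist` is applied to a raw token list.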

In [None]:
# frequency distribution of text1
fdist1 = FreqDist(text1)

print(fdist1)

**FreqDist** methods:

* freq(sample) - returns the relative frequency of "sample": its count divided by the total number of tokens counted (use `fdist[sample]` to get the raw count)
* hapaxes() - a list of samples that appear only once
* max() - the sample with the maximum number of occurrences
* plot() - plot a FreqDist chart
* pprint() - "pretty print" the first items of FreqDist

NLTK book: http://www.nltk.org/book/ch01.html#computing-with-language-simple-statistics

Full list of methods: http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist
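
To make the difference between a raw count and a relative frequency concrete, here is a plain-Python sketch of what `freq()`, `hapaxes()` and `max()` compute. It is built on `collections.Counter` and an invented token list; the helper names (`relative_freq`, `once_only`, `most_frequent`) are hypothetical and are not part of NLTK.

```python
from collections import Counter

# invented sample data
sample_tokens = ["a", "b", "a", "c", "a", "b", "d"]
sample_counts = Counter(sample_tokens)
sample_total = sum(sample_counts.values())

def relative_freq(sample):
    """Count of `sample` divided by the total number of tokens."""
    return sample_counts[sample] / sample_total

def once_only():
    """Samples that occur exactly once (the 'hapaxes')."""
    return [s for s, c in sample_counts.items() if c == 1]

def most_frequent():
    """The sample with the highest count."""
    return sample_counts.most_common(1)[0][0]

print(sample_counts["a"])   # raw count
print(relative_freq("a"))   # share of all tokens
print(once_only())
print(most_frequent())
```

A relative frequency always lies between 0 and 1, so it can be compared across texts of different lengths, whereas a raw count cannot.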

In [None]:
# print frequency distribution (top results)

fdist1.pprint()

In [None]:
# max()

fdist1.max()

In [None]:
# freq()

print("','  :", fdist1.freq(","))
print("whale:", fdist1.freq("whale"))

---

Information about Python dictionaries: 
* ["Dictionaries and Structuring Data"](https://automatetheboringstuff.com/chapter5/)


In [None]:
# output of fdist1.pprint() looks like a Python "dictionary"

# can we look up its values by a given "key"?
fdist1["whale"]

In [None]:
# top 10 results (mostly punctuation and function words, so not very informative)

fdist1.most_common(10)

In [None]:
%matplotlib inline

# plot the distribution
fdist1.plot(30)

In [None]:
# most_common() returns a list -> we can "slice" it

my_list = fdist1.most_common(100)

# items at indices 50 through 59 (the 51st to 60th most common tokens)
my_list[50:60]

In [None]:
# least common results (first 10 examples)

fdist1.hapaxes()[:10]

### Words can appear both in lowercase and Capitalized

Let's fix our FreqDist:

In [None]:
# need to "lowercase" the text before passing it to FreqDist
#   - see example in https://www.nltk.org/api/nltk.html#nltk.probability.FreqDist

fdist2 = FreqDist(word.lower() for word in text1)

# we're going through the list of tokens in text,
#  - returning (generating) lowercase versions of these tokens
#  - and passing the result to FreqDist

In [None]:
# initial:
print(fdist1.freq("whale"))
print(fdist1.freq("Whale"))
print()

# fixed:
print(fdist2.freq("whale"))

### Cleaning data: removing stopwords

NLTK contains a corpus of *stopwords* - high-frequency words like "the", "to" and "also" - that we may want to filter out of a document before further processing.

Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

https://www.nltk.org/book/ch02#wordlist-corpora

In [None]:
from nltk.corpus import stopwords

# English stopwords
stop_words = stopwords.words("english")

stop_words[:8]

In [None]:
# let's start with text1 in lowercase

# return a list  
#   containing "word.lower()"
#     for every item (stored in variable "word")
#       in resource "text1"

text = [word.lower() for word in text1]

text[:7]

**NLTK book: [4.2   Operating on Every Element](https://www.nltk.org/book/ch01#operating-on-every-element)**

This *pattern* (doing something with every item in a sequence, e.g. modifying it, and returning a list of results) is called a Python *list comprehension*:
    
`result_list = [item.do_something() for item in list]`

List comprehensions may also contain conditions (only items matching the condition will be included in the resulting list):

`result_list = [item.do_something() for item in list `**`if`**` condition]`

It is very useful for filtering and modifying lists.
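
A small sketch of both forms on an invented token list, together with the equivalent explicit loop. The variable names are chosen for this example only.

```python
demo_words = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

# modify every item: lowercase each token
lowered = [w.lower() for w in demo_words]

# filter and modify: keep only tokens longer than 3 characters
long_words = [w.lower() for w in demo_words if len(w) > 3]

# the same filter written as an explicit loop
long_words_loop = []
for w in demo_words:
    if len(w) > 3:
        long_words_loop.append(w.lower())

print(lowered)
print(long_words)
print(long_words == long_words_loop)
```

The comprehension and the loop build the same list; the comprehension simply states it in one expression.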

In [None]:
# we can filter either (a) text before calling FreqDist or (b) results of FreqDist.
# let's filter before calling FreqDist.

# create a set of stopwords (membership tests on sets are faster than on lists)
stop_set = set(stop_words)

# filter out stopwords (return only words not in the stoplist)
without_stopwords = [word for word in text if word not in stop_set]

# show the first tokens of the filtered result
without_stopwords[:7]
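
Option (b) from the comment above, filtering the results after counting, could look roughly like the sketch below. It uses `collections.Counter` (which `FreqDist` subclasses) with an invented token list and a two-word stand-in for the stopword list.

```python
from collections import Counter

# invented data; `mini_stop_set` is a stand-in for the NLTK stopword list
mini_tokens = ["the", "whale", "and", "the", "sea", "and", "the", "whale"]
mini_stop_set = {"the", "and"}

# count first ...
all_counts = Counter(mini_tokens)

# ... then drop the stopword entries from the counts
kept_counts = Counter(
    {word: count for word, count in all_counts.items() if word not in mini_stop_set}
)

print(kept_counts.most_common(2))
```

Filtering after counting checks each distinct token once rather than each occurrence, but relative frequencies computed from the filtered counts are then shares of the non-stopword tokens only.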

In [None]:
# let's also filter out tokens that contain anything other than letters or digits (e.g. punctuation)

# Python has a built-in method .isalnum() that determines 
# if a string only consists of letters or digits:

# https://docs.python.org/3/library/stdtypes.html#str.isalnum

filtered = [word for word in without_stopwords if word.isalnum()]

filtered[:7]

In [None]:
# word frequency

freq = FreqDist(filtered)

freq.most_common(15)

### Exploring data: finding interesting words

NLTK also includes a list of common English words. We can use it to find unusual or mis-spelt words in a text corpus.

See also: https://www.nltk.org/book/ch02#code-unusual

In [None]:
word_list = nltk.corpus.words.words()

# convert word list to a set (+ convert words to lowercase)
word_set = set(word.lower() for word in word_list)

# filter out common words
uncommon = [word for word in filtered if word not in word_set]

uncommon[:7]

In [None]:
# word frequency

freq = FreqDist(uncommon)

freq.most_common(15)

Note: in order to find really uncommon words we may need to clean data further (convert nouns to singular, etc.) or get a larger list of common words.
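
As one possible direction for that extra cleaning, the sketch below treats a word as common if either the word itself or its form without a final "s" appears in the word list. This is a deliberately naive heuristic, not NLTK functionality: the helper `looks_common`, the miniature vocabulary and the candidate list are all invented, and irregular or "-es"/"-ies" plurals are not handled.

```python
# hypothetical miniature "common words" list and token list
mini_vocabulary = {"whale", "ship", "sea", "harpoon"}
candidates = ["whales", "ships", "sea", "pequod", "harpoons", "queequeg"]

def looks_common(word, vocabulary):
    """True if the word, or the word minus one trailing "s", is in the vocabulary."""
    if word in vocabulary:
        return True
    # naive singularization: strip a single trailing "s"
    return word.endswith("s") and word[:-1] in vocabulary

still_unusual = [w for w in candidates if not looks_common(w, mini_vocabulary)]
print(still_unusual)
```

With this check, regular plurals of listed words no longer show up as "uncommon", which leaves mostly names and rare words.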

---

### Further information

[**Introduction to stylometry with Python**](https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python) by François Dominic Laramée
* uses FreqDist

Stylometry is the quantitative study of literary style through computational distant reading methods. It is based on the observation that authors tend to write in relatively consistent, recognizable and unique ways. 