# Week 4: Style I

## Screencasts

https://drive.google.com/drive/u/1/folders/0B4OAOue0b3VMOU1yYW1JcUlNcWM

## Readings

- E.A. Smith. [Automated Readability Index](https://github.com/denten-courses/computing-context/blob/master/readings/ari.pdf), 1967.
- Stubbs, Michael. “[Conrad in the Computer: Examples of Quantitative Stylistic Methods](http://lal.sagepub.com/content/14/1/5.full.pdf+html).” Language and Literature 14, no. 1 (February 1, 2005): 5–24.
- Fish, Stanley E. “[What Is Stylistics and Why Are They Saying Such Terrible Things about It?-Part II](http://www.jstor.org/stable/303144).” Boundary 2 8, no. 1 (1979): 129–46.

## Home Experiment

- [Writers Battle](https://github.com/denten-courses/computing-context/blob/master/experiments/5-experiment/battle.md)

## Lecture Notes

In [1]:
from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

In [2]:
blob = TextBlob(text)

In [3]:
blob?

In [4]:
type(blob)

textblob.blob.TextBlob

In [5]:
# automated readability index (ARI) 
# (4.71 * characters/words) + (0.5 * words/sentences) - 21.43

words = blob.words
chars = blob.string

In [6]:
words

WordList(['The', 'titular', 'threat', 'of', 'The', 'Blob', 'has', 'always', 'struck', 'me', 'as', 'the', 'ultimate', 'movie', 'monster', 'an', 'insatiably', 'hungry', 'amoeba-like', 'mass', 'able', 'to', 'penetrate', 'virtually', 'any', 'safeguard', 'capable', 'of', 'as', 'a', 'doomed', 'doctor', 'chillingly', 'describes', 'it', 'assimilating', 'flesh', 'on', 'contact', 'Snide', 'comparisons', 'to', 'gelatin', 'be', 'damned', 'it', "'s", 'a', 'concept', 'with', 'the', 'most', 'devastating', 'of', 'potential', 'consequences', 'not', 'unlike', 'the', 'grey', 'goo', 'scenario', 'proposed', 'by', 'technological', 'theorists', 'fearful', 'of', 'artificial', 'intelligence', 'run', 'rampant'])

In [7]:
chars

'\nThe titular threat of The Blob has always struck me as the ultimate movie\nmonster: an insatiably hungry, amoeba-like mass able to penetrate\nvirtually any safeguard, capable of--as a doomed doctor chillingly\ndescribes it--"assimilating flesh on contact.\nSnide comparisons to gelatin be damned, it\'s a concept with the most\ndevastating of potential consequences, not unlike the grey goo scenario\nproposed by technological theorists fearful of\nartificial intelligence run rampant.\n'

In [8]:
# what type of object is words
type(words)

textblob.blob.WordList

In [9]:
# what type of object is chars
type(chars)

str

In [10]:
# get length of words
len(words)

72

In [11]:
# hmm, do we trust the len? let's check. looks good!
test = TextBlob("two words")
len(test.words)

2

In [12]:
# good ol' blob has everything we need
sents = blob.sentences

In [13]:
sents

[Sentence("
 The titular threat of The Blob has always struck me as the ultimate movie
 monster: an insatiably hungry, amoeba-like mass able to penetrate
 virtually any safeguard, capable of--as a doomed doctor chillingly
 describes it--"assimilating flesh on contact."),
 Sentence("Snide comparisons to gelatin be damned, it's a concept with the most
 devastating of potential consequences, not unlike the grey goo scenario
 proposed by technological theorists fearful of
 artificial intelligence run rampant.
 ")]

In [14]:
type(sents)

list

In [15]:
len(sents)

2

In [16]:
# let's clean things up
# automated readability index (ARI)
# (4.71 * characters/words) + (0.5 * words/sentences) - 21.43

# note: len(blob.string) also counts spaces, newlines, and punctuation,
# while ARI is usually computed over letters and digits only
num_chars = len(blob.string)
num_words = len(blob.words)
num_sents = len(blob.sentences)

ari = 4.71 * num_chars/num_words + 0.5 * num_words/num_sents - 21.43

In [17]:
ari

27.904583333333335
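
The cell above counts every character in `blob.string`, including spaces, newlines, and punctuation, while ARI is usually computed over letters and digits only. A minimal standard-library sketch of that stricter count follows, assuming Python 3. The regex rules and the helper names `ari_counts` and `ari` are illustrative assumptions, not TextBlob's tokenizer, so word and sentence counts may differ from the ones above.

```python
import re

def ari_counts(text):
    """Return (characters, words, sentences) using naive regex rules."""
    # a word is a run of letters/digits, optionally joined by ' or -
    words = re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)
    # count only the letters and digits inside those words
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    # crude sentence split on runs of terminal punctuation
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return chars, len(words), len(sents)

def ari(text):
    """Automated Readability Index from the naive counts above."""
    chars, words, sents = ari_counts(text)
    return 4.71 * chars / words + 0.5 * words / sents - 21.43

sample = "The cat sat. The dog ran."
print(ari_counts(sample))  # (18, 6, 2)
print(round(ari(sample), 2))
```

ARI is designed to approximate a US school grade level, which is one reason to be suspicious of a score near 28 for this passage.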

## Interlude

We have a number! What does it mean? How do we verify models? This is the problem of ground truth: garbage in, garbage out. What is stylistics, and why are people saying such terrible things about it? Four ways to get at the ground truth:

1. Expert opinion
2. Laypeople opinion
3. A tagged corpus
4. Artificial corpus
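
The fourth option can be sketched in a few lines: build a text whose counts are known by construction, then check that the counting code recovers them. This is a minimal standard-library illustration assuming Python 3. `make_text`, `count`, and `ari` are hypothetical helpers, and the synthetic "words" are just repeated letters.

```python
import re

def make_text(word_len, words_per_sent, n_sents):
    """Build a synthetic text whose counts are known in advance."""
    word = "a" * word_len
    sentence = " ".join([word] * words_per_sent) + "."
    return " ".join([sentence] * n_sents)

def count(text):
    """Naive counter under test: (letters, words, sentences)."""
    words = re.findall(r"[A-Za-z]+", text)
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(w) for w in words), len(words), len(sents)

def ari(chars, words, sents):
    return 4.71 * chars / words + 0.5 * words / sents - 21.43

easy = make_text(word_len=3, words_per_sent=5, n_sents=4)
hard = make_text(word_len=9, words_per_sent=20, n_sents=4)

# ground truth by construction: 3 letters * 5 words * 4 sentences
print(count(easy) == (3 * 5 * 4, 5 * 4, 4))
# longer words and longer sentences should score as harder
print(ari(*count(hard)) > ari(*count(easy)))
```

An artificial corpus only tells us that the measurement behaves as designed. Whether the number tracks what readers experience still needs one of the other three sources.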


In [2]:
# we pick up here next week!
# words, tokens, lemmas, n-grams
blob.words

In [19]:
blob.tokens

WordList(['The', 'titular', 'threat', 'of', 'The', 'Blob', 'has', 'always', 'struck', 'me', 'as', 'the', 'ultimate', 'movie', 'monster', ':', 'an', 'insatiably', 'hungry', ',', 'amoeba-like', 'mass', 'able', 'to', 'penetrate', 'virtually', 'any', 'safeguard', ',', 'capable', 'of', '--', 'as', 'a', 'doomed', 'doctor', 'chillingly', 'describes', 'it', '--', "''", 'assimilating', 'flesh', 'on', 'contact', '.', 'Snide', 'comparisons', 'to', 'gelatin', 'be', 'damned', ',', 'it', "'s", 'a', 'concept', 'with', 'the', 'most', 'devastating', 'of', 'potential', 'consequences', ',', 'not', 'unlike', 'the', 'grey', 'goo', 'scenario', 'proposed', 'by', 'technological', 'theorists', 'fearful', 'of', 'artificial', 'intelligence', 'run', 'rampant', '.'])
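
Comparing the two outputs, `blob.tokens` keeps punctuation marks as separate items while `blob.words` drops them. The difference can be sketched with the standard library alone, assuming Python 3. The regex here is a rough stand-in chosen for illustration and is not TextBlob's tokenizer (for instance, it keeps "it's" whole, where TextBlob splits it into "it" and "'s").

```python
import re

sentence = "Snide comparisons to gelatin be damned, it's a concept."

# tokens: word-like runs OR runs of punctuation
tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]+", sentence)

# words: the tokens that contain at least one word character
words = [t for t in tokens if re.search(r"\w", t)]

print(len(tokens), len(words))  # 11 9
print([t for t in tokens if t not in words])  # [',', '.']
```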

In [20]:
# ngrams "the moving window"
# why ngrams?
blob.ngrams(n=2)

[WordList(['The', 'titular']),
 WordList(['titular', 'threat']),
 WordList(['threat', 'of']),
 WordList(['of', 'The']),
 WordList(['The', 'Blob']),
 WordList(['Blob', 'has']),
 WordList(['has', 'always']),
 WordList(['always', 'struck']),
 WordList(['struck', 'me']),
 WordList(['me', 'as']),
 WordList(['as', 'the']),
 WordList(['the', 'ultimate']),
 WordList(['ultimate', 'movie']),
 WordList(['movie', 'monster']),
 WordList(['monster', 'an']),
 WordList(['an', 'insatiably']),
 WordList(['insatiably', 'hungry']),
 WordList(['hungry', 'amoeba-like']),
 WordList(['amoeba-like', 'mass']),
 WordList(['mass', 'able']),
 WordList(['able', 'to']),
 WordList(['to', 'penetrate']),
 WordList(['penetrate', 'virtually']),
 WordList(['virtually', 'any']),
 WordList(['any', 'safeguard']),
 WordList(['safeguard', 'capable']),
 WordList(['capable', 'of']),
 WordList(['of', 'as']),
 WordList(['as', 'a']),
 WordList(['a', 'doomed']),
 WordList(['doomed', 'doctor']),
 WordList(['doctor', 'chillingly']),
 

In [21]:
blob.ngrams(n=3)

[WordList(['The', 'titular', 'threat']),
 WordList(['titular', 'threat', 'of']),
 WordList(['threat', 'of', 'The']),
 WordList(['of', 'The', 'Blob']),
 WordList(['The', 'Blob', 'has']),
 WordList(['Blob', 'has', 'always']),
 WordList(['has', 'always', 'struck']),
 WordList(['always', 'struck', 'me']),
 WordList(['struck', 'me', 'as']),
 WordList(['me', 'as', 'the']),
 WordList(['as', 'the', 'ultimate']),
 WordList(['the', 'ultimate', 'movie']),
 WordList(['ultimate', 'movie', 'monster']),
 WordList(['movie', 'monster', 'an']),
 WordList(['monster', 'an', 'insatiably']),
 WordList(['an', 'insatiably', 'hungry']),
 WordList(['insatiably', 'hungry', 'amoeba-like']),
 WordList(['hungry', 'amoeba-like', 'mass']),
 WordList(['amoeba-like', 'mass', 'able']),
 WordList(['mass', 'able', 'to']),
 WordList(['able', 'to', 'penetrate']),
 WordList(['to', 'penetrate', 'virtually']),
 WordList(['penetrate', 'virtually', 'any']),
 WordList(['virtually', 'any', 'safeguard']),
 WordList(['any', 'safegua
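
The "moving window" can be written out by hand, which also hints at why n-grams are useful: counting them surfaces word sequences that recur. A minimal sketch assuming Python 3; the `ngrams` helper and the toy word list are illustrative, not TextBlob's implementation.

```python
from collections import Counter

def ngrams(words, n):
    """Slide a window of size n across the list, one step at a time."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# toy input, made up for illustration
words = "the grey goo and the grey blob".split()

bigrams = ngrams(words, 2)
print(len(bigrams))  # 6, i.e. len(words) - n + 1
print(Counter(bigrams).most_common(1))  # [(('the', 'grey'), 2)]
```

A list shorter than `n` simply yields no n-grams, because the range is empty.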

In [22]:
# interesting
blob.noun_phrases

WordList([u'titular threat', 'blob', u'ultimate movie monster', u'amoeba-like mass', 'snide', u'potential consequences', u'grey goo scenario', u'technological theorists fearful', u'artificial intelligence run rampant'])

In [23]:
# lemma
new_blob = TextBlob("going went is was be")
for word in new_blob.words:
    print(word.lemmatize())

going
went
is
wa
be


In [24]:
# uh-oh, problem! lemmatize() treats these words as nouns unless told otherwise
for word in new_blob.words:
    print(word.lemmatize("v"))

go
go
be
be
be


In [25]:
word.lemmatize?

In [29]:
# lemmas need part of speech!
import nltk

# then run nltk.download() to fetch the corpora (tagger, wordnet)

In [36]:
# tagging is hard. this will take a while.
blob.tags

[('The', u'DT'),
 ('titular', u'JJ'),
 ('threat', u'NN'),
 ('of', u'IN'),
 ('The', u'DT'),
 ('Blob', u'NNP'),
 ('has', u'VBZ'),
 ('always', u'RB'),
 ('struck', u'VBN'),
 ('me', u'PRP'),
 ('as', u'IN'),
 ('the', u'DT'),
 ('ultimate', u'JJ'),
 ('movie', u'NN'),
 ('monster', u'NN'),
 ('an', u'DT'),
 ('insatiably', u'RB'),
 ('hungry', u'JJ'),
 ('amoeba-like', u'JJ'),
 ('mass', u'NN'),
 ('able', u'JJ'),
 ('to', u'TO'),
 ('penetrate', u'VB'),
 ('virtually', u'RB'),
 ('any', u'DT'),
 ('safeguard', u'NN'),
 ('capable', u'JJ'),
 ('of', u'IN'),
 ('as', u'IN'),
 ('a', u'DT'),
 ('doomed', u'JJ'),
 ('doctor', u'NN'),
 ('chillingly', u'RB'),
 ('describes', u'VBZ'),
 ('it', u'PRP'),
 ('assimilating', u'VBG'),
 ('flesh', u'NN'),
 ('on', u'IN'),
 ('contact', u'NN'),
 ('Snide', u'JJ'),
 ('comparisons', u'NNS'),
 ('to', u'TO'),
 ('gelatin', u'VB'),
 ('be', u'VB'),
 ('damned', u'VBN'),
 ('it', u'PRP'),
 ("'s", u'VBZ'),
 ('a', u'DT'),
 ('concept', u'NN'),
 ('with', u'IN'),
 ('the', u'DT'),
 ('most', u'RBS'
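
Because `blob.tags` is a list of `(word, tag)` tuples, a tag profile of a text is one `Counter` away. A small sketch assuming Python 3, with the first few pairs copied by hand from the output above instead of calling TextBlob:

```python
from collections import Counter

# first eight (word, tag) pairs from the tagger output above
tagged = [
    ("The", "DT"), ("titular", "JJ"), ("threat", "NN"), ("of", "IN"),
    ("The", "DT"), ("Blob", "NNP"), ("has", "VBZ"), ("always", "RB"),
]

# count how often each part-of-speech tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# relative frequencies make texts of different lengths comparable
tag_share = {tag: n / len(tagged) for tag, n in tag_counts.items()}

print(tag_counts["DT"])  # 2
print(tag_share["DT"])   # 0.25
```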

In [30]:
tags = new_blob.tags

In [31]:
# list of tuples
type(tags)

list

In [40]:
# hmmm, why do you think this is not working?
# look up wordnet vs. treebank parts of speech
for list_item in new_blob.tags:
    list_item[0].lemmatize('list_item[1]')

'list_item[1]'


MissingCorpusError: 
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.


In [48]:
from nltk.corpus import wordnet

# booooring
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
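    else:
        # fall back to noun, the lemmatizer's usual default;
        # an empty string is not a valid wordnet part of speech
        return wordnet.NOUN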
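
The same mapping can be written as a lookup table. This is a standard-library sketch, assuming Python 3 and assuming the one-letter codes `'a'`, `'v'`, `'n'`, and `'r'`, which are the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`. The names `TREEBANK_TO_WORDNET` and `to_wordnet_pos` are hypothetical.

```python
# first letter of a Penn Treebank tag -> WordNet part-of-speech code
TREEBANK_TO_WORDNET = {
    "J": "a",  # adjectives: JJ, JJR, JJS
    "V": "v",  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    "N": "n",  # nouns: NN, NNS, NNP, NNPS
    "R": "r",  # adverbs: RB, RBR, RBS
}

def to_wordnet_pos(treebank_tag, default="n"):
    """Map a treebank tag to a WordNet code, defaulting to noun."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], default)

for tag in ["VBG", "JJ", "RBS", "NNP", "DT"]:
    print(tag, to_wordnet_pos(tag))
```

Tags with no WordNet counterpart, such as `DT` or `IN`, fall through to the default instead of producing an invalid code.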

In [49]:
# ok, this seems to work: each treebank tag maps to a wordnet tag
for list_item in new_blob.tags:
    word = list_item[0]
    tpos = list_item[1]
    wpos = get_wordnet_pos(tpos)
    print("%s %s" % (tpos, wpos))

VBG v
VBD v
VBZ v
VBD v
VB v


In [57]:
lemmas = []
# let's make it better
for list_item in new_blob.tags:
    word = list_item[0]
    tpos = list_item[1]
    wpos = get_wordnet_pos(tpos)
    lemmas.append(word.lemmatize(wpos))

print(lemmas)

[u'go', u'go', u'be', u'be', 'be']


In [39]:
# ported from Allison Parrish's excellent
# http://rwet.decontextualize.com/book/textblob/

# let's make a simple text summary
from textblob import TextBlob, Word
import random

# From Tender Buttons by Gertrude Stein
text = '''
A cushion has that cover. Supposing you do not like to change, supposing it is very clean 
that there is no change in appearance, supposing that there is regularity and a costume is 
that any the worse than an oyster and an exchange. Come to season that is there any extreme 
use in feather and cotton. Is there not much more joy in a table and more chairs and very 
likely roundness and a place to put them. A circle of fine card board and a chance to see a tassel. 
'''

blob = TextBlob(text)

nouns = []
for word, tag in blob.tags:
    if tag == 'NN':
        nouns.append(word.lemmatize())

print("This text is about:")
for item in random.sample(nouns, 5):
    word = Word(item)
    print(word.pluralize())

This text is about:
tables
cottons
places
exchanges
covers
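
Two details of the sampling step are worth guarding against on other texts: `random.sample` raises `ValueError` when asked for more items than the list holds, and a noun that occurs twice in the list can be picked twice. A standard-library sketch assuming Python 3; `sample_topics` and its `seed` parameter are hypothetical additions, not part of the original summary code.

```python
import random

def sample_topics(nouns, k=5, seed=None):
    """Pick up to k distinct nouns; never ask for more than exist."""
    unique = sorted(set(nouns))   # drop repeats, fix the order
    rng = random.Random(seed)     # a seed makes the draw repeatable
    return rng.sample(unique, min(k, len(unique)))

# toy input: fewer than five distinct nouns
picked = sample_topics(["table", "table", "chair"], k=5, seed=0)
print(len(picked))     # 2
print(sorted(picked))  # ['chair', 'table']
```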
