# Vicky Pollard and English Word Counts

Victoria 'Vicky' Pollard is a popular character from the TV and radio series *Little Britain*. She is a moody, obnoxious teenage girl seemingly incapable of doing much but gossip in a strong Bristol accent. She is a representation of teenagers and chavs.

To get started on this assignment, read the BBC News article: UK's Vicky Pollard's 
[left behind.](http://news.bbc.co.uk/1/hi/education/6173441.stm)
The article gives the following statistic about teen language: 
"The top 20 words used, including *yeah, no, but* and *like*, account 
for around a third of all words."  The idea is that this shows
that teenagers speak an impoverished variety of English.  Much
of their speaking time is used up uttering one of a very small
set of words.

But is that really a **significant** fact about teen speech?
Let's look at a (relatively) random sample of English,
count up the 20 most frequent words, and answer the following question:

### Question 1:  What proportion of all word tokens are covered by those top 20 words?

Question 1 is the question you must answer to complete this
exercise.

To answer this question, you will use a **balanced corpus**
of English texts, a corpus collected with the purpose of representing
a balanced variety of English text types: fiction, poetry, speech,
non fiction, and so on.  One relatively well-established, free,
and easy-to-get example of such a corpus is the **Brown Corpus.**
Brown is about 1.2 M words. 

To get the Brown corpus, do the following:

In [4]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/gawron/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [1]:
from nltk.corpus import brown

The following expression returns a list of all 1.2 M word tokens in Brown:

In [2]:
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Loop through this list and make a frequency distribution dictionary
(the value associated with word `w` is the count of how many times 
`w` has occurred in Brown). Then do some more computing to answer Question 1.
What proportion of the total words is taken up by the top
20 words?
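
The counting loop described above can be sketched with a plain Python dictionary. This is only a minimal sketch, not the required solution: the `tokens` list is a made-up stand-in for `brown.words()`, and the function name `count_words` is hypothetical.

```python
def count_words(tokens):
    """Return a dict mapping each word type to its number of occurrences."""
    counts = {}
    for w in tokens:
        # Start at 0 the first time a word is seen, then add one.
        counts[w] = counts.get(w, 0) + 1
    return counts

# Toy stand-in for brown.words(); any list of strings works the same way.
tokens = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'bird']
counts = count_words(tokens)
print(counts['the'])   # count for one word type
print(len(tokens))     # total number of tokens
```

The same function should work unchanged on the full Brown word list, since it only assumes an iterable of strings.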

Here's Question 2:

### Question 2:  What do you conclude about the statistic cited in the Vicky Pollard article? Does the statistic cited demonstrate that teenagers speak with an impoverished vocabulary?  Why or why not?

Read more about this on 
[LanguageLog.](http://itre.cis.upenn.edu/~myl/languagelog/archives/003993.html)

## Helpful hints and code

In [60]:
bw0 = brown.words()

The name `bw0` has now been set to the list of all the word tokens in Brown.

In [55]:
bw0

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

To get a data structure with all the word counts, do:

In [56]:
fd0 = nltk.FreqDist(bw0)

Having done all that, you can now find the top n most frequent words and their counts.
For example:

In [57]:
fd0.most_common(3)

[('the', 62713), (',', 58334), ('.', 49346)]

`fd0` is an NLTK `FreqDist` (frequency distribution): essentially a Python dictionary containing the count of every word type in Brown, plus some features customized for word frequencies, such as the `most_common` method.  To find the number of word tokens in Brown, take the length of `bw0`.

In [43]:
len(bw0)

1161192

So, about 1.16 million word tokens.  You can also use another one of the customized methods included with a `FreqDist`:

In [58]:
fd0.N()

1161192

To find out the number of **word types** in Brown, just get the length of the dictionary:

In [45]:
len(fd0)

56057
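
The token/type distinction is easy to check on a tiny example. The sketch below uses `collections.Counter` from the standard library in place of `nltk.FreqDist`, and the six-word list is invented for illustration.

```python
from collections import Counter

# Six tokens, but only four distinct word types.
tokens = ['to', 'be', 'or', 'not', 'to', 'be']
fd_toy = Counter(tokens)

n_tokens = len(tokens)   # every occurrence counts
n_types = len(fd_toy)    # each distinct word counts once
print(n_tokens, n_types)
```

Summing the counts in the dictionary gives back the number of tokens, which is why `len(bw0)` and `fd0.N()` agree above.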

I said `fd0` was a dictionary of word counts, and it is.  Here's how to use it as a Python dictionary:
To get the count of the word type ‘computer’ do:

In [46]:
fd0['computer']

13

That means there are 13 *tokens* of the word type 'computer' in the Brown corpus.  Note that case matters:

In [47]:
fd0['Computer']

0

In [48]:
fd0['the'],fd0['The']

(62713, 7258)

It is handy to ignore case for purposes of this exercise. To do that, just lowercase all the words in Brown before making the frequency distribution. This changes the number of word types, but not the number of word tokens:

In [61]:
# This uses what Pythonistas call a list comprehension.
# It's just a compact way of writing a loop and collecting results in a list.
bw1 = [w.lower() for w in bw0]
len(bw1),len(bw0)

(1161192, 1161192)

The length of the **corpus** (the body of data) hasn't changed. 
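
For comparison, here is the explicit loop that the list comprehension for `bw1` abbreviates. It is a small sketch on a five-word toy list (the names `toy`, `lowered_loop`, and `lowered_comp` are made up for this example), not on the full corpus.

```python
toy = ['The', 'Fulton', 'County', 'Grand', 'Jury']

# Explicit loop version: build the list one item at a time.
lowered_loop = []
for w in toy:
    lowered_loop.append(w.lower())

# List comprehension version, written the same way as bw1.
lowered_comp = [w.lower() for w in toy]

print(lowered_loop == lowered_comp)
```

Both versions produce one output item per input item, which is why lowercasing cannot change the number of tokens.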

In [62]:
fd1 = nltk.FreqDist(bw1)
len(fd0),len(fd1)
#49815

(56057, 49815)

The dictionary has shrunk because pairs of word types that differed only in case, such as 'The' and 'the', have now been collapsed into one.

In [63]:
fd1['the']

69971

In [64]:
fd1['The']

0

You may find it useful to take a look at [Chapter One of the NLTK book.](http://www.nltk.org/book/ch01.html)

## To hand in

1. Answers to Questions 1 and 2.

2. The code you used to answer Question 1.  The simplest way to do that is to hand in an edited version of this Python notebook.  When you want to type in an answer to a question, create a notebook cell by hovering your cursor over the bottom center of a cell and selecting "+ Text".  Then type your answer in the cell that appears.

## Solution

## Question 1

In [65]:
top_20 = sum(ct for (wd, ct) in fd1.most_common(20))

All word tokens

In [66]:
fd1.N()

1161192

In [67]:
top_20/fd1.N()

0.3598250763009046
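
The calculation above can be wrapped in a small reusable function. This is a sketch under stated assumptions: `top_n_coverage` is a hypothetical helper name, and it is demonstrated on a toy `collections.Counter` rather than on Brown. As far as I know, `nltk.FreqDist` is a `Counter` subclass, so passing `fd1` should also work, but treat that as an assumption to check.

```python
from collections import Counter

def top_n_coverage(counts, n=20):
    """Fraction of all tokens accounted for by the n most frequent types."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty distribution
    top = sum(ct for _, ct in counts.most_common(n))
    return top / total

# Toy distribution: 10 tokens in total, the top 2 types cover 8 of them.
toy_counts = Counter({'the': 5, 'of': 3, 'cat': 1, 'dog': 1})
coverage = top_n_coverage(toy_counts, n=2)
print(coverage)
```

Varying `n` makes it easy to see how quickly a handful of very frequent word types comes to dominate the token count.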

In [68]:
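# Drop punctuation: `in` on a string is a substring test, so only tokens contained in that string are removed.
bw2 = [w.lower() for w in brown.words() if w not in ',!?.;()``\"']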

In [69]:
len(bw2)

1027919

In [70]:
fd2 = nltk.FreqDist(bw2)

In [71]:
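new_top_20 = sum(ct for wd, ct in fd2.most_common(20))
# The denominator is the full token count (punctuation included); dividing by fd2.N() would give a higher figure.
print(new_top_20/fd1.N())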

0.274169990837002
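
The substring test used to build `bw2` only removes tokens that happen to be contained in the punctuation string, so multi-character tokens such as `''` or `--` survive it. One possible alternative, shown here on an invented token list, is to keep only tokens that contain at least one letter or digit. The helper name `has_alnum` is hypothetical, and this filter has not been run on Brown here, so it would give different numbers from the outputs shown above.

```python
def has_alnum(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Invented sample mixing words with punctuation tokens.
toy = ['``', 'Yeah', ',', 'but', 'no', "''", '--', 'like', '.', "don't", '1961']

# Substring test, as in the bw2 cell: "''" and '--' are not removed.
kept_by_substring_test = [w for w in toy if w not in ',!?.;()``"']

# Alphanumeric test: every pure-punctuation token is removed.
words_only = [w.lower() for w in toy if has_alnum(w)]

print(kept_by_substring_test)
print(words_only)
```

Whether numerals like `1961` should count as words is a separate judgment call; this sketch keeps them.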


## Question 2

"The top 20 words used, including yeah, no, but and like, account for around a third of all words."

Under the first way of counting, which treats punctuation marks as words, the top 20
words account for about 36% of all tokens in Brown, a little over one-third.  So the Pollard teenagers aren't all
that different from a balanced English corpus, and **this statistic**, at least, does not
support the claim that they have an impoverished vocabulary.

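Under the second way of counting, which filters out punctuation such as commas and periods,
the top 20 words in Brown come to about 27%.  The top 20 words for the Pollard teenagers therefore take up a
somewhat greater proportion of the words than the top 20 words
in Brown, so the claim has a little more plausibility, but not much.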
