# Counting things

By [Allison Parrish](http://www.decontextualize.com/)

Counting how many times something occurs is a very common task in programming, so Python includes a special kind of object—a `Counter`—to make the task easier.

To use the `Counter` object, you need to import it from the `collections` module.

In [1]:
from collections import Counter

To demonstrate the `Counter` object, consider the following list of letters:

In [2]:
letters = ["a", "b", "c", "a", "a", "a", "b", "d", "e", "a", "b", "d", "d"]

Let's say that you want to know *how many times* each letter occurs. To do this in Python, just pass the list to `Counter()` like so:

In [6]:
count = Counter(letters)
count

Counter({'a': 5, 'b': 3, 'c': 1, 'd': 3, 'e': 1})

Here we've assigned a `Counter` object to the variable `count` (although you can use whatever variable name you want, of course). Just evaluating the `count` object shows the counts for each of the items in the list, telling us (for example) that the letter `a` occurs five times, the letter `d` occurs three times, and so forth.

You can get the count for a particular item in the `Counter` by using indexing syntax as though it were a dictionary:

In [9]:
count["a"]

5
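One difference from a plain dictionary is worth seeing: asking a `Counter` for something it never counted gives you zero instead of an error. Here's a small sketch using the same list of letters (the variable name `missing` is just for illustration):

```python
from collections import Counter

letters = ["a", "b", "c", "a", "a", "a", "b", "d", "e", "a", "b", "d", "d"]
count = Counter(letters)

# "z" never appears in the list, so its count is zero (no KeyError is raised)
missing = count["z"]
print(missing)

# Looking up a missing item does not add it to the counter
print("z" in count)
```

This means you can ask about any item without first checking whether it's in the counter.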

The `.most_common()` method returns a list of `(item, count)` tuples for the items in the counter, in descending order by count:

In [10]:
count.most_common()

[('a', 5), ('b', 3), ('d', 3), ('c', 1), ('e', 1)]
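Notice that `b` and `d` both occur three times. In Python 3.7 and later, items with equal counts come back in the order they were first encountered. The sketch below shows this by counting the same letters in reverse order (the names `ranked` and `ranked_reversed` are just for illustration):

```python
from collections import Counter

letters = ["a", "b", "c", "a", "a", "a", "b", "d", "e", "a", "b", "d", "d"]

# In the original list, "b" shows up before "d", so "b" wins the tie
ranked = Counter(letters).most_common()
print(ranked)

# Reversing the list means "d" is encountered first, so the tie flips
ranked_reversed = Counter(reversed(letters)).most_common()
print(ranked_reversed)
```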

## Counting words

What are we going to count? Let's read in the contents of the first chapter of Genesis from the King James Version of the Bible. ([Download it from here](http://rwet.decontextualize.com/texts/genesis.txt) and make sure it's in the same directory as this notebook.)

In [11]:
text = open("genesis.txt").read() # read the entire file in as a string
words = text.split() # split it up into words

Now, create a `Counter` object by calling `Counter()` with the list of things you want to count:

In [12]:
count = Counter(words)

That's all you need to do! The `Counter` object has counted up all of the items in the list and stored them with their frequencies. We can get the most common twenty-five words by calling `.most_common()` with a parameter of `25`:

In [13]:
count.most_common(25)

[('the', 108),
 ('and', 63),
 ('And', 33),
 ('God', 32),
 ('of', 20),
 ('was', 17),
 ('it', 15),
 ('that', 14),
 ('in', 13),
 ('every', 12),
 ('after', 11),
 ('to', 11),
 ('upon', 10),
 ('over', 10),
 ('said,', 9),
 ('his', 9),
 ('earth', 8),
 ('Let', 8),
 ('were', 8),
 ('waters', 8),
 ('be', 7),
 ('saw', 7),
 ('light', 7),
 ('firmament', 7),
 ('he', 6)]

As mentioned before, you can use the object as a dictionary to get the count for a particular value:

In [14]:
count['earth']

8
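Because a `Counter` behaves like a dictionary, the usual dictionary tools work on it too. As a rough sketch, here's how you might find the number of distinct words, the total number of words, and the share of the total taken up by one word. To keep the example self-contained, it uses a single sentence (Genesis 1:1) as a stand-in for the contents of `genesis.txt`; the names `distinct`, `total` and `share` are made up for this example.

```python
from collections import Counter

# A one-sentence stand-in for the contents of genesis.txt
text = "In the beginning God created the heaven and the earth."
count = Counter(text.split())

distinct = len(count)           # number of different words
total = sum(count.values())     # number of words overall
share = count["the"] / total    # fraction of all words that are "the"

print(distinct, total, share)
```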

You can iterate over the list that `.most_common()` returns using a `for` loop, for example to print out just the words:

In [16]:
for word, number in count.most_common(10):
    print(word)

the
and
And
God
of
was
it
that
in
every
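The loop gives you both halves of each tuple, so you can just as easily print the word alongside its count. Here's a minimal sketch, again using one sentence as a stand-in for the contents of `genesis.txt` (the `lines` list is only there to collect the formatted output):

```python
from collections import Counter

# A one-sentence stand-in for the contents of genesis.txt
text = "In the beginning God created the heaven and the earth."
count = Counter(text.split())

lines = []
for word, number in count.most_common(3):
    # each item is a (word, number) tuple, unpacked by the for loop
    lines.append(f"{word}: {number}")

print("\n".join(lines))
```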


## Improving the counts

The word counts returned from this procedure are a little bit weird, in that (a) they count instances of the same word with different cases separately (e.g., "And" and "and"); and (b) words with punctuation at the end are counted separately from words with no punctuation (e.g., "day" and "day,").

To fix this problem, we want to "clean" our list of words. In this case, "cleaning" will consist of:

* Converting all the words to lower case
* Removing punctuation from the ends of each word

On their own, these are easy operations. The `.lower()` method of a string returns a copy of the string with all letters in lower case:

In [17]:
"Hello there, Bob".lower()

'hello there, bob'

And the `.strip()` method can be used to remove punctuation from a string. It removes any of the characters you give it from both the beginning and the end of the string, like so:

In [18]:
"Okay,".strip(",.;:")

'Okay'

Combine them into a single expression like so:

In [19]:
"Okay,".lower().strip(",.;:")

'okay'
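It helps to know exactly what `.strip()` will and won't touch. The sketch below shows that it removes the listed characters from both ends of the string but leaves the middle alone. It also uses `string.punctuation` from the standard library, a ready-made string of ASCII punctuation characters, as one possible alternative to typing out `",.;:"` yourself. (The variable names are just for illustration.)

```python
import string

# Characters are stripped from both ends of the string...
both_ends = ",;Okay,.".strip(",.;:")
print(both_ends)

# ...but punctuation in the middle is left where it is
middle = "o,kay".strip(",.;:")
print(middle)

# string.punctuation covers quotation marks, exclamation points and more
quoted = '"Okay!"'.lower().strip(string.punctuation)
print(quoted)
```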

What we need to do is take the list of words and create a new list of words with these transformations applied. You can write this very succinctly with a list comprehension, but let's do it "long hand" so it's easier to understand:

In [20]:
clean_words = []
for item in words:
    cleaned = item.lower().strip(",.;:")
    clean_words.append(cleaned)
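For comparison, here's roughly what the list comprehension version looks like. To keep the example self-contained, `words` here is a short hand-typed list (Genesis 1:3) rather than the full list read from the file:

```python
# A short stand-in for the full list of words from genesis.txt
words = ["And", "God", "said,", "Let", "there", "be", "light:",
         "and", "there", "was", "light."]

# Same transformation as the loop: lower-case each word, then strip punctuation
clean_words = [item.lower().strip(",.;:") for item in words]

print(clean_words)
```

Both versions build the same list, so use whichever one you find easier to read.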

Pass these to a new `Counter` object and here's the result:

In [21]:
count = Counter(clean_words)

In [22]:
count.most_common(25)

[('the', 108),
 ('and', 97),
 ('god', 32),
 ('earth', 21),
 ('of', 20),
 ('was', 17),
 ('it', 16),
 ('in', 14),
 ('let', 14),
 ('that', 14),
 ('every', 12),
 ('waters', 11),
 ('after', 11),
 ('to', 11),
 ('upon', 10),
 ('said', 10),
 ('light', 10),
 ('day', 10),
 ('kind', 10),
 ('over', 10),
 ('be', 9),
 ('firmament', 9),
 ('his', 9),
 ('were', 8),
 ('them', 8)]

## Removing stopwords

The ten most common words are, according to the `Counter` object:

In [23]:
count.most_common(10)

[('the', 108),
 ('and', 97),
 ('god', 32),
 ('earth', 21),
 ('of', 20),
 ('was', 17),
 ('it', 16),
 ('in', 14),
 ('let', 14),
 ('that', 14)]

Intuitively, it seems strange to count words like "the" and "and" among the "most common," because words like these are presumably common across *all* texts, not just this text in particular. To solve this problem, we can use "stopwords": a list of commonly occurring English words that shouldn't be counted for the purpose of word frequency. No one exactly agrees on what this list should be, but here's one attempt (from [here](https://gist.github.com/sebleier/554280)):

In [24]:
stopwords = [
    "i",
    "me",
    "my",
    "myself",
    "we",
    "our",
    "ours",
    "ourselves",
    "you",
    "your",
    "yours",
    "yourself",
    "yourselves",
    "he",
    "him",
    "his",
    "himself",
    "she",
    "her",
    "hers",
    "herself",
    "it",
    "its",
    "itself",
    "they",
    "them",
    "their",
    "theirs",
    "themselves",
    "what",
    "which",
    "who",
    "whom",
    "this",
    "that",
    "these",
    "those",
    "am",
    "is",
    "are",
    "was",
    "were",
    "be",
    "been",
    "being",
    "have",
    "has",
    "had",
    "having",
    "do",
    "does",
    "did",
    "doing",
    "a",
    "an",
    "the",
    "and",
    "but",
    "if",
    "or",
    "because",
    "as",
    "until",
    "while",
    "of",
    "at",
    "by",
    "for",
    "with",
    "about",
    "against",
    "between",
    "into",
    "through",
    "during",
    "before",
    "after",
    "above",
    "below",
    "to",
    "from",
    "up",
    "down",
    "in",
    "out",
    "on",
    "off",
    "over",
    "under",
    "again",
    "further",
    "then",
    "once",
    "here",
    "there",
    "when",
    "where",
    "why",
    "how",
    "all",
    "any",
    "both",
    "each",
    "few",
    "more",
    "most",
    "other",
    "some",
    "such",
    "no",
    "nor",
    "not",
    "only",
    "own",
    "same",
    "so",
    "than",
    "too",
    "very",
    "s",
    "t",
    "can",
    "will",
    "just",
    "don",
    "should",
    "now"
]

To make use of this list, we'll revise our loop from before. In addition to converting the strings to lower case and removing punctuation, we'll only add a word to the list if it isn't present in the `stopwords` list. Like so:

In [25]:
clean_words = []
for item in words:
    cleaned = item.lower().strip(",.;:")
    if cleaned not in stopwords:
        clean_words.append(cleaned)
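One optional tweak: checking `not in` against a list means Python has to scan the list each time, whereas a set can answer the same question much faster. That hardly matters for one chapter, but it can add up with longer texts. The sketch below assumes a heavily abbreviated set called `stopword_set` (a made-up name; with the full list you could write `set(stopwords)`) and a short stand-in for `words`:

```python
from collections import Counter

# A short stand-in for the full list of words from genesis.txt
words = ["And", "God", "said,", "Let", "there", "be", "light:",
         "and", "there", "was", "light."]

# Abbreviated for the example; the real list is much longer
stopword_set = {"and", "there", "be", "was", "the", "of"}

clean_words = []
for item in words:
    cleaned = item.lower().strip(",.;:")
    if cleaned not in stopword_set:  # membership test against a set
        clean_words.append(cleaned)

count = Counter(clean_words)
print(count.most_common(1))
```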

Now let's see what the most common words are:

In [26]:
count = Counter(clean_words)
count.most_common(10)

[('god', 32),
 ('earth', 21),
 ('let', 14),
 ('every', 12),
 ('waters', 11),
 ('upon', 10),
 ('said', 10),
 ('light', 10),
 ('day', 10),
 ('kind', 10)]

Much better!

## More applications of Counter

But you can do more than just count words with the `Counter` object! You can pass any list-like thing (that is, anything you can loop over), including a string, to the `Counter` object. Looping over a string gives you its characters one at a time. In the following example, we simply read in the entire text of `genesis.txt` as a string and pass it to the `Counter` object:

In [28]:
count = Counter(open("genesis.txt").read())

This `Counter` object tells us how often each *character* occurs in the text:

In [30]:
count.most_common(25)

[(' ', 797),
 ('e', 450),
 ('t', 339),
 ('a', 267),
 ('h', 255),
 ('n', 233),
 ('d', 232),
 ('i', 195),
 ('r', 186),
 ('o', 167),
 ('s', 128),
 ('f', 87),
 ('l', 86),
 ('g', 82),
 (',', 68),
 ('m', 65),
 ('w', 62),
 ('v', 59),
 ('u', 44),
 ('y', 42),
 ('.', 33),
 ('A', 33),
 ('G', 32),
 ('c', 31),
 ('\n', 31)]
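The counts above include spaces, punctuation and line breaks, and they treat `A` and `a` as different characters. If you only care about letters, one possible approach is to filter the characters before counting them. This sketch uses the string method `.isalpha()` and a one-sentence stand-in for the file's contents; `letter_count` is a made-up name.

```python
from collections import Counter

# A one-sentence stand-in for the contents of genesis.txt
text = "In the beginning God created the heaven and the earth."

# Keep only alphabetic characters, lower-cased so "I" and "i" count together
letter_count = Counter(ch.lower() for ch in text if ch.isalpha())

print(letter_count.most_common(3))
```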

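Counting characters tends to be more revealing for Chinese text, where a single character often carries meaning on its own. (This example assumes a file named `xu_zhimo.txt` in the same directory as this notebook.)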

In [32]:
count = Counter(open("xu_zhimo.txt").read())

In [34]:
count.most_common(25)

[('\n', 585),
 ('，', 332),
 ('的', 302),
 ('我', 156),
 ('一', 105),
 ('你', 90),
 ('在', 89),
 ('是', 78),
 ('不', 64),
 (' ', 62),
 ('！', 59),
 ('。', 56),
 ('—', 54),
 ('了', 53),
 ('这', 51),
 ('里', 43),
 ('着', 42),
 ('上', 41),
 ('；', 36),
 ('那', 33),
 ('－', 33),
 ('有', 29),
 ('天', 27),
 ('她', 25),
 ('个', 25)]

`Counter` objects don't have to count strings. You can also use them to build up a histogram from numerical data. For example, take the following list of numbers:

In [38]:
readings = [23, 14, 22, 24, 15, 21, 18, 16, 10, 14, 20, 26, 17, 14, 17, 30, 20, 20, 16, 14]

You can pass this list to a `Counter` object the same way you would with any list:

In [39]:
count = Counter(readings)

In [41]:
count.most_common()

[(14, 4),
 (20, 3),
 (16, 2),
 (17, 2),
 (23, 1),
 (22, 1),
 (24, 1),
 (15, 1),
 (21, 1),
 (18, 1),
 (10, 1),
 (26, 1),
 (30, 1)]

It's maybe a little weird, but you can find the number of times that a particular number occurs using the same dictionary-like syntax as before:

In [43]:
count[17]

2

In this case the number `17` is behaving like a dictionary key, not like a numerical index.
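Since `.most_common()` orders the results by count, the numbers themselves come out jumbled. For a histogram you'd probably rather see them in numerical order. One way to sketch that is to sort the counter's `.items()` and print a row of `#` characters for each value (the `rows` list is only there to collect the output):

```python
from collections import Counter

readings = [23, 14, 22, 24, 15, 21, 18, 16, 10, 14,
            20, 26, 17, 14, 17, 30, 20, 20, 16, 14]
count = Counter(readings)

rows = []
# sorted() orders the (value, number) pairs by value, smallest first
for value, number in sorted(count.items()):
    rows.append(f"{value:2d} {'#' * number}")

print("\n".join(rows))
```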

> EXERCISE: Use a `Counter` object to find out how many words beginning with the letter `a` occur in `genesis.txt`.