# Word Counts

When you are working with a pile of Twitter data, it can be useful to look at the kinds of words used in the dataset. To do this we need a function that reads through the tweets, splits the text of each tweet on spaces, normalizes (lowercases) each word, and keeps track of the number of times each word was seen using a Python dictionary.

In [14]:
import json
import gzip

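def word_counts(filename):
    """Count lowercased words in a gzipped file of line-delimited tweet JSON."""
    counts = {}
    with gzip.open(filename, 'rt') as f:
        for line in f:
            tweet = json.loads(line)
            # Skip records without text instead of stopping at the first one
            if 'text' not in tweet:
                continue
            # split(' ') breaks on single spaces, so repeated spaces yield '' entries
            for word in tweet['text'].split(' '):
                word = word.lower()
                counts[word] = counts.get(word, 0) + 1
    return counts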

Go ahead and change this filename if you want to look at another dataset:

In [17]:
counts = word_counts('data/filters/fakePREMIS/author-mentions.json.gz')

We can try printing out the results, but the dictionary isn't sorted by count, so it's kind of messy.

In [18]:
print(counts)

{'': 1087, 'guilty': 2, "she's": 3, 'mean': 3, 'form': 2, '😀': 3, 'idol,please': 1, 'vc': 4, "riches'": 1, "(he's": 2, 'worthwhile': 1, 'posible:': 1, 'truly': 1, 'http://t.co/1yk5ouuwbh': 1, 'goes!': 1, 'happened': 2, 'aim': 1, 'am': 6, 'choose': 1, '😂❤️': 1, '@nytimes': 1, 'longer': 1, 'answers,': 1, "'starved',": 1, 'weasley': 1, 'looking': 1, 'head': 5, 'nós': 1, 'instantly': 1, '@jk_rowling,': 1, '*bangs': 2, 'dedicated': 1, 'tiffany': 2, '@lolgorges': 1, '😍👸\U0001f3fe👸\U0001f3fc': 1, 'trek:': 16, "i'll": 11, 'creativity!': 1, 'letter?': 1, 'fred': 1, 'begrudgers"': 1, "albus's": 5, 'army': 1, 'hermione,': 1, 'die': 2, 'and': 312, 'kidding.': 1, 'attention': 2, 'ginny': 2, 'moon.': 1, '#8': 1, 'disgrace!': 1, 'beautiful': 5, 'loved': 4, 'tonight:': 3, 'belongs,': 2, 'her': 16, 'jiggery-poking': 1, "weasley's": 1, 'beautiful!': 2, 'https://t.co/ynnglpnjeo': 1, 'wrong.': 5, 'para': 1, 'henry.': 1, 'technically,': 1, 'chrome': 1, 'real?': 1, 'negative': 2, '-or': 1, 'opening': 4, 'en
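The `''` key at the start of this output is worth a note: `split(' ')` breaks only on single space characters, so repeated, leading, or trailing spaces produce empty strings, and newlines are not treated as separators at all. The following is a minimal sketch comparing it with `split()` called with no argument, which splits on any run of whitespace. The helper name `split_compare` and the sample text are made up for illustration.

```python
def split_compare(text):
    """Return the tokens from split(' ') and from split() for the same text."""
    lowered = text.lower()
    on_space = lowered.split(' ')    # splits on every single space character
    on_whitespace = lowered.split()  # splits on runs of any whitespace
    return on_space, on_whitespace

# Hypothetical tweet text with a double space, a newline, and a trailing space
sample = "RT  @example: hello\nworld "
on_space, on_whitespace = split_compare(sample)
print(on_space)
print(on_whitespace)
```

If you switch `word_counts` to `split()`, the empty-string entry should disappear from the counts, and words separated by newlines will be counted separately.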

We can create a list of the words sorted in descending order by the number of times they appear.

In [19]:
words = sorted(counts, key=counts.get, reverse=True)
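As an alternative to sorting the keys yourself, the standard library's `collections.Counter` can return the most frequent entries directly. This is only a sketch of that option, not what the notebook uses; `top_words` and the small `example_counts` dictionary are hypothetical names introduced here.

```python
from collections import Counter

def top_words(counts, n=25):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return Counter(counts).most_common(n)

# Hypothetical counts dictionary, shaped like the one word_counts returns
example_counts = {'rt': 5, 'the': 3, 'you': 4}
print(top_words(example_counts, 2))
```

The result is a list of `(word, count)` tuples, so you get the frequencies without a second dictionary lookup.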

Now we can see how many unique words there are:

In [20]:
print(len(words))

3210


And we can print out the top 25 words with their frequency:

In [22]:
for word in words[0:25]:
    print(word, counts[word])

rt 2150
@jk_rowling: 1530
 1087
you 1087
she 1001
don't 994
than 953
go 946
i'm 944
many 935
because 924
people 915
super-talented 898
#teamserena 897
meet 897
is. 896
nicer 894
today, 893
http://t.c… 893
@serenawilliams! 893
the 663
a 644
of 427
is 403
in 378
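Two things stand out in this list: the blank entry with 1087 occurrences is the empty string, and tokens such as `is.` and `today,` are counted separately from their unpunctuated forms. One possible cleanup, sketched below under the assumption that stripping a handful of leading and trailing punctuation characters is acceptable for your analysis, is to normalize each key and merge the counts. The function names and the choice of characters to strip are illustrative; `@` and `#` are deliberately left alone so mentions and hashtags stay recognizable.

```python
def normalize(word, strip_chars='.,!?:;"\'()'):
    """Lowercase a word and strip the given punctuation from both ends."""
    return word.lower().strip(strip_chars)

def merge_counts(counts):
    """Merge counts for words that are identical after normalization."""
    merged = {}
    for word, n in counts.items():
        key = normalize(word)
        if key:  # drop entries that are empty after stripping
            merged[key] = merged.get(key, 0) + n
    return merged

# A few entries taken from the output above
example = {'is': 403, 'is.': 896, '': 1087, 'today,': 893}
print(merge_counts(example))
```

Running `merge_counts(counts)` before sorting would fold `is.` into `is` and drop the empty string, at the cost of losing any distinction that the punctuation carried.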
