# Text I-O and JSON

## *'Anything goes in/ Anything goes out/ Fish, bananas, old pyjamas/ Mutton, beef and Trout!'*
*– User manual for first PC*

# Input/Output (IO)

**Reading** from input and **writing** to output allows us to retrieve or store information.

## Input

### Reading from files

Python makes it very easy to read from files. Calling `open()` on a file name will give you an iterator over the lines in a file.

In [1]:
num_lines = 0
for line in open('../data/tweets.txt', 'r'): # this is BAD coding, don't repeat
    num_lines += 1
print("The file has {} lines".format(num_lines))

The file has 100064 lines


However, this does not close the file, and it is better to make sure we do. We use the `with` construct. The outer level simply opens the file, gives it a name, and closes it when done. The inner loop goes over the lines of the file.

In [2]:
num_lines = 0
with open('../data/tweets.txt', encoding='utf8') as input_file:
    for line in input_file:
        num_lines += 1
print("The file has {} lines".format(num_lines))

The file has 100064 lines


## Activity
* count the number of words in `tweets.txt`

In [9]:
# your code here
num_words = 0
with open('../data/tweets.txt', encoding='utf-8') as input_file:
    for line in input_file:
        res = len(line.strip().split())
        num_words = res + num_words
print("The file has {} words".format(num_words))

num_words = 0
with open('../data/tweets.txt', encoding='utf-8') as input_file:
    for line in input_file:
        spaces = line.count(" ")
        num_words += spaces
print("The file has {} words".format(num_words))

s = "Two     words"
s.split()

The file has 1134318 words
The file has 1038963 words


['Two', 'words']

### Reading user input
Sometimes, we want to give our user the chance to input something (hint: *very* useful to label data). We can simply do this via `input()`. If we give it an argument, we can write a message to the prompt:

In [12]:
user_says = input('what ')
print(user_says, type(user_says))

what 2
2 <class 'str'>


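NB: the return value of `input()` is always a `str`. If you want something else, you need to either cast it (e.g., with `int()`), or use `eval()`, which interprets the input as Python. Be careful: `eval()` executes whatever code the user types, so only use it on input you trust.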

In [17]:
user_says = input('Try typing an int or list: ')
result = eval(user_says)
print(result, type(result))

Try typing an int or list: 42
42 <class 'int'>
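
If the input should only ever be a literal (a number, string, list, dict, and so on), the standard library's `ast.literal_eval()` is a safer alternative to `eval()`, because it refuses to run arbitrary code. The following is a minimal sketch; the helper name `parse_literal` is made up for illustration and is not part of the course code.

```python
import ast

def parse_literal(text):
    """Parse a string holding a Python literal; return None if it is not one."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # raised for empty input, malformed input, or anything that is not a plain literal
        return None

number = parse_literal('42')
items = parse_literal('[1, 2]')
rejected = parse_literal('__import__("os")')  # a function call, not a literal
print(number, type(number))
print(items, type(items))
print(rejected)
```

In an interactive session, `parse_literal(input('Try typing an int or list: '))` would take the place of the `eval()` call.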


To prevent empty input errors, or to check the user makes a valid choice, use a `while` loop:

In [20]:
must_be_int = None
while must_be_int not in {1, 2}:
    answer = input('Type 1 or 2: ')
    try:
        must_be_int = int(answer)
    except ValueError:  # empty or non-numeric input: ask again
        must_be_int = None
print(must_be_int, type(must_be_int))

Type 1 or 2: 3
Type 1 or 2: 1
1 <class 'int'>


## Output

### User Output

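Output to the user is our good old friend `print()`. It has three important keyword arguments:
* `end`: the string printed after the text (default: a newline `\n`)
* `file`: the stream to write to (default: `sys.stdout`)
* `flush`: whether to write the output immediately instead of buffering it (default: `False`)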
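
To see what each of these does without depending on a terminal, here is a small sketch that prints into an in-memory stream. The `io.StringIO` buffer is only a stand-in for `sys.stdout` or `sys.stderr`; the variable names are chosen for illustration.

```python
import io

buffer = io.StringIO()  # in-memory text stream standing in for a file or sys.stderr

# end replaces the default trailing newline
print('loading', end='...', file=buffer)
# file redirects the text away from sys.stdout to any writable text stream
print('done', file=buffer)
# flush=True asks Python to push the text to the stream right away
print('finished', file=buffer, flush=True)

captured = buffer.getvalue()
print(repr(captured))
```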

In [50]:
import sys
print('This is a string!', end='...', flush=True)
print("Let's continue", file=sys.stderr, flush=True)
with open('../data/tweets.txt', encoding='utf-8') as input_file:
    for i, line in enumerate(input_file):
        if i > 0:
            if i % 1000 == 0:
                print(i, file=sys.stderr, flush=True)
            elif i % 100 == 0:
                print('.', end='', file=sys.stderr, flush=True)

This is a string!...

Let's continue
.........1000
.........2000
.........3000
.........4000
.........5000
.........6000
.........7000
.........8000
.........9000
.........10000
.........11000
.........12000
.........13000
.........14000
.........15000
.........16000
.........17000
.........18000
.........19000
.........20000
.........21000
.........22000
.........23000
.........24000
.........25000
.........26000
.........27000
.........28000
.........29000
.........30000
.........31000
.........32000
.........33000
.........34000
.........35000
.........36000
.........37000
.........38000
.........39000
.........40000
.........41000
.........42000
.........43000
.........44000
.........45000
.........46000
.........47000
.........48000
.........49000
.........50000
.........51000
.........52000
.........53000
.........54000
.........55000
.........56000
.........57000
.........58000
.........59000
.........60000
.........61000
.........62000
.........63000
.........64000
.........65000
.........66000
....

### File Output

File output allows us to use the same Python objects in different programs/sessions/computers. It works almost like file input, with three differences:
1. we need to specify write mode by giving `open()` the string argument `'w'` (which overwrites an existing file) or `'a'` (which appends to it, as in the example below)
2. we use the `write()` method to write to the file
3. we need to end every output line with a newline character `\n`
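
The difference between the two modes can be sketched as follows. This example writes to a temporary directory so that no course data is touched; the file name `demo.txt` is hypothetical.

```python
import os
import tempfile

# hypothetical file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# 'w' creates the file, or overwrites it if it already exists
with open(path, 'w', encoding='utf8') as output_file:
    output_file.write('This is the first line\n')

# 'a' keeps the existing content and adds to the end
with open(path, 'a', encoding='utf8') as output_file:
    output_file.write('This is the second line\n')

# read the file back to check what was written
with open(path, encoding='utf8') as input_file:
    lines = [line.rstrip('\n') for line in input_file]
print(lines)
```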

In [33]:
with open('../data/silly_test_file.txt', 'a', encoding='utf8') as output_file:
    output_file.write('This is the second line\n')

In [34]:
! cat ../data/silly_test_file.txt

This is the first line
This is the second line


In [28]:
! ls -alFh ../data/

total 305584
drwxr-xr-x@ 10 dirkhovy  staff   320B Mar 10 14:07 ./
drwxr-xr-x@ 25 dirkhovy  staff   800B Mar 10 14:06 ../
-rw-r--r--@  1 dirkhovy  staff    46M Feb 27  2019 example.db
-rw-rw-r--@  1 dirkhovy  staff   1.2M Jan 29  2019 moby_dick.txt
-rw-r--r--@  1 dirkhovy  staff    40M Feb 14  2019 reviews.full.tsv
-rw-rw-r--@  1 dirkhovy  staff   752K Apr 22  2018 sa_test.csv
-rw-rw-r--@  1 dirkhovy  staff    11M Jun  6  2019 sa_train.csv
-rw-rw-r--@  1 dirkhovy  staff    23B Mar 13 11:05 silly_test_file.txt
-rw-rw-r--@  1 dirkhovy  staff   6.5M Feb 21  2019 tweets.txt
-rw-rw-r--@  1 dirkhovy  staff    44M Jun  6  2019 wine.csv


## Activity

* Open the file `silly_test_file.txt` and print all the lines in it

In [38]:
# your code here
with open('../data/silly_test_file.txt', encoding='utf-8') as input_file:
    for line in input_file:
        print(line.strip())

This is the first line
This is the second line


# JSON

JSON is a text format that lets us write basic Python objects (dictionaries, lists, strings, numbers) to a file and read them back, rather than handling raw strings ourselves. This is a great way to save your progress or to store a model.

However, note that dictionary keys become strings, and that it cannot store "special" data types (`defaultdict`, `DataFrame`, etc.).

We need to import the `json` library first

In [51]:
import json

# JSON output

In order to save a Python object to a file, we only need the function `dump()` from `json`. It takes two arguments
1. the Python object to write to file
2. a **file handle**, i.e., an `open(<FILENAME>, 'w')` command

You can call JSON files whatever you want, but it is common to use the ending `.json`

## Activity

* create a dictionary `line_length`
* open the file `tweets.txt`
* use `line_length` to map each line in the file from its line number to its length in characters
* save `line_length` to a file "`lineinfo.json`"

In [53]:
# your code here
line_length = {}
with open('../data/tweets.txt', encoding='utf8') as tweets:
    for i, line in enumerate(tweets):
        line_length[i] = len(line.strip())

with open('../data/lineinfo.json', 'w', encoding='utf8') as out_file:
    json.dump(line_length, out_file)

In [57]:
! head ../data/lineinfo.json


{"0": 83, "1": 51, "2": 26, "3": 19, "4": 87, "5": 132, "6": 55, "7": 38, "8": 37, "9": 46, "10": 9, ... (output truncated)

# JSON input

To retrieve a Python object from a file, we use the function `load()` from `json`. It takes only a **file handle**, i.e., the result of an `open(<FILENAME>)` call.

In [63]:
with open('../data/lineinfo.json', encoding='utf8') as json_in:
    info_from_file = json.load(json_in)
print(info_from_file['99'])

101
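
The caveat about dictionary keys and "special" data types can be checked with a quick round trip. This sketch uses `json.dumps()` and `json.loads()`, the string-based counterparts of `dump()` and `load()`, so it needs no file; the example data is made up.

```python
import json
from collections import defaultdict

# integer keys come back as strings
line_lengths = {0: 83, 1: 51}
restored = json.loads(json.dumps(line_lengths))
print(restored)

# a defaultdict is written like a plain dict and comes back as one,
# so its default-value behaviour is lost
counts = defaultdict(int)
counts['whale'] += 1
restored_counts = json.loads(json.dumps(counts))
print(type(restored_counts))
```

This is why the lookup above uses the string `'99'` rather than the integer `99`.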


# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

## Discrete Sparse Representations (Bag of words)

The easiest way to represent documents is as word counts. It takes three steps:
1. determine the vocabulary
2. collect the counts for each word
3. transform the individual counts into one big matrix

![Bag of words procedure](bow.png)

The result is a matrix $X$ with one row for each instance, and one column for each word in the vocabulary.


![](matrix.pdf)

First, let's get some review data:

In [64]:
import pandas as pd
df = pd.read_csv('../data/reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.tolist()
print(documents[:2])

["Prices change daily and if you want to really research the price continually at many different sites , I have found cheaper cars elsewhere . However , if you don ' t have a lot of time to research the price , this site has always been among the top three ( e . g ., cheapest ) of the ten sites I use to reserve a car .", 'and the fact that they will match other companies is awesome !!']


Let's implement the steps ourselves:

In [81]:
import numpy as np # to deal with linear algebra

num_docs = 3

# collect all word types (= vocabulary)
vocabulary = set()
for document in documents[:num_docs]:
    tokens = document.lower().split()
    vocabulary = vocabulary.union(set(tokens))
vocabulary = sorted(vocabulary)

# create a data matrix with #docs-by-#features dimensions
X = np.zeros((num_docs, len(vocabulary)))

# fill that matrix with sweet counts
for d, document in enumerate(documents[:num_docs]):
    tokens = document.lower().split()
    for i, feature in enumerate(vocabulary):
        X[d, i] = tokens.count(feature)

vocabulary_ = {word: position for position, word in enumerate(vocabulary)}

# show the result as a DataFrame
pd.DataFrame(data=X, columns=vocabulary, dtype=int)

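| | !! | ' | ( | ) | , | . | ., | 0 | a | always | ... | three | time | to | top | use | used | want | will | years | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 3 | 3 | 1 | 0 | 2 | 1 | ... | 1 | 1 | 3 | 1 | 1 | 0 | 1 | 0 | 0 | 2 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |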


In [71]:
vocabulary_

{"'": 0,
 '(': 1,
 ')': 2,
 ',': 3,
 '.': 4,
 '.,': 5,
 'a': 6,
 'always': 7,
 'among': 8,
 'and': 9,
 'at': 10,
 'been': 11,
 'car': 12,
 'cars': 13,
 'change': 14,
 'cheaper': 15,
 'cheapest': 16,
 'continually': 17,
 'daily': 18,
 'different': 19,
 'don': 20,
 'e': 21,
 'elsewhere': 22,
 'found': 23,
 'g': 24,
 'has': 25,
 'have': 26,
 'however': 27,
 'i': 28,
 'if': 29,
 'lot': 30,
 'many': 31,
 'of': 32,
 'price': 33,
 'prices': 34,
 'really': 35,
 'research': 36,
 'reserve': 37,
 'site': 38,
 'sites': 39,
 't': 40,
 'ten': 41,
 'the': 42,
 'this': 43,
 'three': 44,
 'time': 45,
 'to': 46,
 'top': 47,
 'use': 48,
 'want': 49,
 'you': 50}

In `sklearn`, we can use the `CountVectorizer` object, which does all of that (and then some)

In [65]:
from sklearn.feature_extraction.text import CountVectorizer
small_vectorizer = CountVectorizer()

sentences_2 = documents[:1]

X1 = small_vectorizer.fit_transform(sentences_2)

In [67]:
X1.todense()

matrix([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1,
         1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 4, 1, 1, 1, 3, 1, 1, 1, 2]],
       dtype=int64)

The result is a *sparse count matrix*:

In [73]:
# indexed representation
import numpy as np
print(X1)

# dense representation
print(X1.todense())

  (0, 5)	1
  (0, 27)	1
  (0, 37)	1
  (0, 30)	1
  (0, 9)	1
  (0, 33)	1
  (0, 36)	1
  (0, 1)	1
  (0, 4)	1
  (0, 0)	1
  (0, 16)	1
  (0, 28)	1
  (0, 32)	1
  (0, 34)	1
  (0, 22)	2
  (0, 20)	1
  (0, 13)	1
  (0, 18)	1
  (0, 14)	1
  (0, 6)	1
  (0, 8)	1
  (0, 15)	1
  (0, 17)	2
  (0, 29)	2
  (0, 12)	1
  (0, 21)	1
  (0, 3)	1
  (0, 10)	1
  (0, 23)	2
  (0, 31)	4
  (0, 26)	2
  (0, 25)	1
  (0, 35)	3
  (0, 38)	1
  (0, 39)	2
  (0, 19)	2
  (0, 2)	1
  (0, 11)	1
  (0, 7)	1
  (0, 24)	1
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 2 1 1 2 1 4 1 1 1 3
  1 1 1 2]]


We can access the mapping from vector position to feature names via `get_feature_names_out()` (called `get_feature_names()` in `sklearn` versions before 1.0):

In [74]:
print(small_vectorizer.get_feature_names_out().tolist())

['always', 'among', 'and', 'at', 'been', 'car', 'cars', 'change', 'cheaper', 'cheapest', 'continually', 'daily', 'different', 'don', 'elsewhere', 'found', 'has', 'have', 'however', 'if', 'lot', 'many', 'of', 'price', 'prices', 'really', 'research', 'reserve', 'site', 'sites', 'ten', 'the', 'this', 'three', 'time', 'to', 'top', 'use', 'want', 'you']


The inverse (the mapping from feature names to vector positions) is stored as a dictionary in `vocabulary_`:

In [75]:
print(small_vectorizer.vocabulary_)

{'prices': 24, 'change': 7, 'daily': 11, 'and': 2, 'if': 19, 'you': 39, 'want': 38, 'to': 35, 'really': 25, 'research': 26, 'the': 31, 'price': 23, 'continually': 10, 'at': 3, 'many': 21, 'different': 12, 'sites': 29, 'have': 17, 'found': 15, 'cheaper': 8, 'cars': 6, 'elsewhere': 14, 'however': 18, 'don': 13, 'lot': 20, 'of': 22, 'time': 34, 'this': 32, 'site': 28, 'has': 16, 'always': 0, 'been': 4, 'among': 1, 'top': 36, 'three': 33, 'cheapest': 9, 'ten': 30, 'use': 37, 'reserve': 27, 'car': 5}


Let's redo this for a larger part of the corpus, the first 10,000 documents:

In [76]:
vectorizer = CountVectorizer(analyzer='word', 
                             ngram_range=(1, 2),
                             min_df=0.001,
                             max_df=0.75,
                             stop_words='english')

X = vectorizer.fit_transform(documents[:10000])

print(X.shape)

(10000, 3869)


In [80]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)

frozenset({'thereby', 'thin', 'sometimes', 'fifteen', 'had', 'why', 'put', 'sixty', 'onto', 'too', 'here', 'else', 'throughout', 'enough', 'must', 'whence', 'both', 'not', 'of', 'move', 'serious', 'via', 'yourself', 'yet', 'sometime', 'everything', 'below', 'or', 'who', 'side', 'thru', 'empty', 'give', 'three', 'nevertheless', 'itself', 'myself', 'amoungst', 'his', 'last', 'mine', 'co', 'find', 'while', 'your', 'often', 'might', 'ie', 'others', 'into', 'alone', 'beside', 'get', 'been', 'please', 'but', 'is', 'eg', 'being', 'front', 'has', 'hasnt', 'during', 'eight', 'once', 'seems', 'herein', 'take', 'four', 'besides', 'beyond', 'whereas', 'except', 'up', 'although', 'then', 'very', 'more', 'own', 'back', 'hereby', 'cry', 'formerly', 'himself', 'i', 'across', 'fill', 'its', 'between', 're', 'several', 'latterly', 'are', 'until', 'none', 'am', 'whole', 'became', 'though', 'found', 'he', 'therein', 'it', 'should', 'otherwise', 'ltd', 'see', 'a', 'and', 'have', 'such', 'behind', 'if', 'ou

Calling `transform()` on a new document will apply the vocabulary we collected previously to this new data point. Any words we have not seen before are ignored.

In [None]:
vectorizer.transform([documents[-1]])

In [None]:
documents[-1]

## Exercise

Use vector operations to find out 
- what the 5 most frequent words are in `X`
- in how many different documents the word `delivery` occurs
- what percentage of the overall corpus that number corresponds to

In [None]:
# your code here


## Character $n$-grams

Instead of words, we can also use characters to analyze text:

In [None]:
char_vectorizer = CountVectorizer(analyzer='char', 
                                  ngram_range=(2, 6), 
                                  min_df=0.001, 
                                  max_df=0.75)

C = char_vectorizer.fit_transform(documents[:10])
C

In [None]:
print(char_vectorizer.vocabulary_)
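
To make concrete what a character $n$-gram is, here is a plain-Python sketch that lists all substrings of a given length range. The function `char_ngrams` is a hypothetical helper for illustration; `CountVectorizer` additionally lowercases the text and normalizes whitespace before extracting its features.

```python
def char_ngrams(text, n_min, n_max):
    """Return all character n-grams of text with n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)      # every n-gram length
            for i in range(len(text) - n + 1)]    # every start position

grams = char_ngrams('whale', 2, 3)
print(grams)
# → ['wh', 'ha', 'al', 'le', 'wha', 'hal', 'ale']
```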

## TF-IDF

Let's extract the most important words from Moby Dick

In [82]:
with open('../data/moby_dick.txt', encoding='utf8') as moby_file:
    documents = [line.strip() for line in moby_file]
print(documents[1])

Call me Ishmael .


In [83]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   min_df=0.001, 
                                   max_df=0.75, 
                                   stop_words='english', 
                                   sublinear_tf=True)

X = tfidf_vectorizer.fit_transform(documents)
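
As a rough sketch of the weighting behind these numbers, assuming `sklearn`'s default smoothed IDF and L2 normalization together with the `sublinear_tf=True` setting used above: the term frequency is dampened to $1 + \ln(\mathit{tf})$, the inverse document frequency is $\ln\frac{1 + n}{1 + \mathit{df}} + 1$ for $n$ documents, and each document vector is scaled to unit length. The helper names and the toy documents below are made up for illustration.

```python
import math
from collections import Counter

def smooth_idf(num_docs, doc_freq):
    """Inverse document frequency with add-one smoothing."""
    return math.log((1 + num_docs) / (1 + doc_freq)) + 1

def tfidf_row(tokens, doc_freqs, num_docs):
    """TF-IDF weights for one tokenized document, scaled to unit length."""
    weights = {word: (1 + math.log(count)) * smooth_idf(num_docs, doc_freqs[word])
               for word, count in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

toy_docs = [['call', 'me', 'ishmael'],
            ['the', 'whale', 'the', 'sea'],
            ['the', 'whale']]
# in how many documents does each word occur?
doc_freqs = Counter(word for doc in toy_docs for word in set(doc))

row = tfidf_row(toy_docs[1], doc_freqs, len(toy_docs))
print(row)
```

In the second toy document, "sea" gets a higher weight than "whale" even though both occur once, because "sea" appears in fewer documents.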

Now, let's get the same information as raw counts:

In [87]:
vectorizer = CountVectorizer(analyzer='word', min_df=0.001, max_df=0.75, stop_words='english')
X2 = vectorizer.fit_transform(documents)
assert X.shape == X2.shape, 'Shapes do not match for vectorizers' # make sure we have the same shape for TFIDF and counts

We can now display that information

In [88]:
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(),  # get_feature_names() before sklearn 1.0
                        'tf': X2.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': X.sum(axis=0).A1
                       })

In [89]:
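# sort_values() sorts in ascending order by default, so the first rows have the lowest scores;
# pass ascending=False to list the highest-scoring words first
df = df.sort_values(['tfidf', 'tf', 'idf'])
df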

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 1071 | nations | 10 | 7.789074 | 2.818093 |
| 1602 | surprise | 10 | 7.789074 | 2.934600 |
| 1735 | valiant | 10 | 7.789074 | 3.017954 |
| 1423 | shortly | 10 | 7.789074 | 3.032615 |
| 554 | fleet | 11 | 7.702063 | 3.049731 |
| 407 | downward | 10 | 7.789074 | 3.111894 |
| 1192 | pitched | 10 | 7.789074 | 3.124130 |
| 283 | concluding | 11 | 7.702063 | 3.142589 |
| 1318 | retained | 10 | 7.789074 | 3.149837 |
| 1654 | thither | 10 | 7.789074 | 3.212621 |


## Collocations

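Some words occur together so frequently that they should be treated as one unit, like "New York". Instead of listing those word pairs by hand, we can measure their pointwise mutual information (PMI), which compares how often two words occur together with how often we would expect them to co-occur if they were independent.
Extracting PMI from text is relatively straightforward, and `nltk` offers some functions to do so flexibly. The code below uses `mi_like`, a PMI-like score that `nltk` computes without taking the logarithm.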
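
For intuition, textbook PMI for a bigram is $\log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$. The sketch below estimates it from raw counts in plain Python. It illustrates the definition and is not how `nltk` implements its scores; the function name and the toy sentence are made up.

```python
import math
from collections import Counter

def pmi_scores(words):
    """Pointwise mutual information for every adjacent word pair in a word list."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_unigrams = len(words)
    n_bigrams = len(words) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bigrams                     # P(w1, w2)
        p_w1 = unigrams[w1] / n_unigrams                # P(w1)
        p_w2 = unigrams[w2] / n_unigrams                # P(w2)
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return scores

toy_words = 'new york is big and new york is old and big is not old'.split()
scores = pmi_scores(toy_words)
print(scores[('new', 'york')], scores[('is', 'big')])
```

Here "new york" scores higher than "is big", because "is" also combines with many other words. Note that PMI tends to favour rare words, which is one reason to filter out very infrequent pairs in practice.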

In [91]:
import nltk
# If you have not yet downloaded the data, you can do this by uncommenting the next line:
# nltk.download('all')

In [94]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from collections import Counter

stopwords_ = set(stopwords.words('english'))

# transform data into long list of words (no stopwords)
words = [word.lower() for document in documents for word in document.split() 
         if len(word) > 2 
         and word not in stopwords_]

# search for bigram collocations
finder = BigramCollocationFinder.from_words(words)
bgm = BigramAssocMeasures()
score = bgm.mi_like

# get collocations
collocations = {'_'.join(bigram): pmi for bigram, pmi in finder.score_ngrams(score)}


In [95]:
Counter(collocations).most_common(20)

[('moby_dick', 83.0),
 ('sperm_whale', 20.002847184002935),
 ('mrs_hussey', 10.5625),
 ('mast_heads', 4.391152941176471),
 ('sag_harbor', 4.0),
 ('vinegar_cruet', 4.0),
 ('try_works', 3.7944046844502277),
 ('dough_boy', 3.7067873303167422),
 ('white_whale', 3.698807453416149),
 ('caw_caw', 3.4722222222222223),
 ('samuel_enderby', 3.4285714285714284),
 ('cape_horn', 3.4133333333333336),
 ('new_bedford', 3.3402061855670104),
 ('quarter_deck', 3.2339339991315676),
 ('deacon_deuteronomy', 3.2),
 ('father_mapple', 3.0),
 ('gamy_jesty', 3.0),
 ('hoky_poky', 3.0),
 ('jesty_joky', 3.0),
 ('joky_hoky', 3.0)]

## Exercise

Extract the top 10 collocations for the Twitter data. You need to preprocess the data first!

In [None]:
# your code here