This Jupyter notebook shows how to sentence-tokenize a file from COCA once and store the result in an HDF5 file that can be loaded again later.  This addresses Graeme's concern that every time he searched for a word, he would have to tokenize the whole COCA corpus anew.

In [2]:
import sys
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import nltk
from nltk.tokenize import *

In [18]:
from pandas import HDFStore

In [3]:
from nltk.tokenize import sent_tokenize

In [6]:
# Read the whole file; the with-block closes the file handle afterwards
with open('COCA/text_newspaper_lsp/w_news_1990.txt') as f:
    text = f.read()

In [11]:
print(text[0:100])


##3000001 <p> He is trying to make the best of it , but the days have seemed to pass like months f


In [7]:
sent_tokenize_list = sent_tokenize(text)
len(sent_tokenize_list)

207190

In [12]:
print(sent_tokenize_list[0:5])

['\r\n##3000001 <p> He is trying to make the best of it , but the days have seemed to pass like months for Lou Piniella .', 'Some days , he visits relatives .', 'Others , he spends the afternoon fishing .', 'Mostly , he waits .', 'It has been like this since the first week of February , when Piniella and his family arrived at their beach home in Reddington Shores , Fla. , and waited for baseball to begin .']


In [13]:
sents = pd.Series(sent_tokenize_list)

In [14]:
sents[0:2]

0    \r\n##3000001 <p> He is trying to make the bes...
1                    Some days , he visits relatives .
dtype: object

Next, open a "store" so that the Series can be saved in the HDF5 format.

In [34]:
store = HDFStore('COCA_store.h5')

In [35]:
store['w_news_1990'] = sents

Note that one HDF5 container can hold many objects, each saved under its own key, so we can store the next chunk of COCA there too.

In [30]:
f2 = open('COCA/text_newspaper_lsp/w_news_1991.txt')
text2 = f2.read()
sent_tokenize_list2 = sent_tokenize(text2)
sents2 = pd.Series(sent_tokenize_list2)
store['w_news_1991'] = sents2
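
The two cells above repeat the same read, tokenize, store steps, so for more COCA files the pattern could be wrapped in a loop. The following is only a minimal sketch and not part of the original notebook: `key_for` and `store_sentences` are hypothetical helper names, and the tokenizer is passed in as an argument (here it would be `sent_tokenize`). The `store` argument can be the open HDFStore from above; anything that supports `store[key] = value` will do.

```python
import os
import pandas as pd


def key_for(path):
    """Derive a store key from a file path (hypothetical helper).

    For example '.../w_news_1990.txt' becomes 'w_news_1990'.
    """
    return os.path.splitext(os.path.basename(path))[0]


def store_sentences(store, paths, tokenize):
    """Tokenize each file in `paths` and save its sentences under its own key."""
    for path in paths:
        with open(path) as f:  # the file is closed even if tokenizing fails
            text = f.read()
        store[key_for(path)] = pd.Series(tokenize(text))
```

With the store still open, a call such as `store_sentences(store, ['COCA/text_newspaper_lsp/w_news_1990.txt', 'COCA/text_newspaper_lsp/w_news_1991.txt'], sent_tokenize)` would save both years in one go. Deriving the key from the file name keeps the key and the data from drifting apart.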

To see what is in the store, just evaluate 'store'.  When done, it is good practice to close the store.  The file is written to the current working directory; here it is called "COCA_store.h5".

In [36]:
store

<class 'pandas.io.pytables.HDFStore'>
File path: COCA_store.h5
/w_news_1990            series       (shape->[1])

In [37]:
store.close()

To load the stored sentences back in later, use the following:

In [25]:
news1990_sents = pd.read_hdf('COCA_store.h5', 'w_news_1990')

In [26]:
news1990_sents[0:5]

0    \r\n##3000001 <p> He is trying to make the bes...
1                    Some days , he visits relatives .
2           Others , he spends the afternoon fishing .
3                                  Mostly , he waits .
4    It has been like this since the first week of ...
dtype: object

So now every time you need to access the tokenized sentences from COCA, you just have to load them from the HDF5 file, which is much quicker than tokenizing the text again.
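
Since the original motivation was searching for a word, here is a hedged sketch of what a lookup on the loaded Series might look like. It is an illustration rather than part of the original notebook: `find_word` is a hypothetical helper, and the small `demo` Series stands in for `news1990_sents`. The search is case-insensitive and matches whole words only, which is an assumption about what is wanted.

```python
import re
import pandas as pd


def find_word(sents, word):
    """Return the sentences in `sents` that contain `word` as a whole word."""
    # \b marks a word boundary, so 'day' will not match 'days'
    pattern = r'\b' + re.escape(word) + r'\b'
    return sents[sents.str.contains(pattern, case=False, regex=True)]


# Small stand-in for news1990_sents, built from sentences shown earlier
demo = pd.Series(['Some days , he visits relatives .',
                  'Others , he spends the afternoon fishing .',
                  'Mostly , he waits .'])
hits = find_word(demo, 'fishing')
```

On the real data the call would be `find_word(news1990_sents, 'fishing')`; the result keeps the original index, so each hit can be traced back to its position in the file.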