Here is some NLP on the comment corpus collected from regulations.gov for the proposal EPA–R08–OAR–2015–0463.

I followed the gensim tutorial "Corpora_and_Vector_Spaces": https://radimrehurek.com/gensim/tut1.html

In this example we remove common words like "of" and "the", and also remove words that appear only once across all comments.  The remaining words make up the dictionary.  Each comment is then represented as a sparse vector: a list of ordered pairs (a, b), where "a" is the ID of a word and "b" is the number of times that word appears in the comment.
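
As a rough illustration of that representation, here is a standard-library sketch on a made-up three-comment corpus. It is not gensim's implementation: the toy comments and the helpers `build_token2id` and `to_bow` are hypothetical names introduced only for this example, and gensim assigns its own IDs.

```python
from collections import Counter

def build_token2id(texts):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (word ID, count) pairs; tokens missing from the mapping are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Made-up, already tokenized comments
toy_texts = [
    ["utah", "haze", "rule"],
    ["haze", "rule", "haze"],
    ["utah", "visibility"],
]
token2id = build_token2id(toy_texts)
toy_vectors = [to_bow(text, token2id) for text in toy_texts]
print(token2id)
print(toy_vectors)
```

In the second toy comment "haze" occurs twice, so its pair carries a count of 2, while words that never occur in a comment simply have no pair in that comment's vector.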

In [1]:
from gensim import corpora
import pickle
import glob
from pprint import pprint  # pretty-printer
from collections import defaultdict
from nltk.tokenize import RegexpTokenizer



We use pickle to retrieve the file.  There are two files; the one labeled "clean" was created in the "Corpus_Cleaning" notebook, which removed non-word codes like "\n".

Cleaning seems to make the corpus smaller, but we will keep an eye out in case keeping the "\n"s etc. turns out to be useful.

In [2]:
!ls *.pkl

EPA–R08–OAR–2015–0463.pkl
EPA–R08–OAR–2015–0463.pkl_clean.pkl


In [3]:
!pwd

/Users/elliegrano/codes/NLP/blog/data-collection


In [4]:
fname = glob.glob('/Users/elliegrano/codes/NLP/blog/data-collection/*_clean.pkl')[0]

In [5]:
# Load the corpus (a nested dictionary) back from the pickle file.
# The with-block closes the file handle once loading is done.

with open(fname, "rb") as f:
    corpus = pickle.load(f)

In [6]:
# docID is the ID of the EPA proposal
docID=corpus.keys()[0]
docID

'EPA\xe2\x80\x93R08\xe2\x80\x93OAR\xe2\x80\x932015\xe2\x80\x930463'

In [7]:
# These are the comments loaded into the corpus
# docID is the EPA proposal ID number
# corpus[docID] is a dictionary of the form {comment ID: comment text}

comments = corpus[docID]
# comments.keys()
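
For readers on Python 3, where `dict.keys()` returns a view that cannot be indexed, the same nested layout can be explored as in the sketch below. All IDs and comment texts here are made up for illustration, and the round trip goes through an in-memory pickle rather than the file loaded above.

```python
import pickle

# Hypothetical stand-in for the pickled corpus:
# {proposal ID: {comment ID: comment text}}
toy_corpus = {
    "toy-proposal": {
        "comment-0001": "Utah's plan deserves deference.",
        "comment-0002": "Please require stronger NOx controls.",
    }
}

# Round trip through pickle, kept in memory for the example
restored = pickle.loads(pickle.dumps(toy_corpus))

# next(iter(d)) fetches one key without indexing, which works on Python 2 and 3
doc_id = next(iter(restored))
toy_comments = restored[doc_id]
first_comment_id = next(iter(toy_comments))
print(doc_id)
print(len(toy_comments))
print(toy_comments[first_comment_id])
```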

In [8]:
# example of a comment
one_comment = comments[comments.keys()[1]]
print(one_comment)

Utah's DAQ & AQ B deserve deference.   Utah has complied with the 70 recommendations of the Grand Canyon Visibility Transport Commission, including submitting a 309 SIP in 2003, 5 years before agencies complying with 308 provisions.  Utah has submitted all the required sulfur dioxide Milestone and Backstop Trading Program reports, has revised the program frequently to accommodate litigation and various members leaving the program.  All reports & updates have been submitted in a timely fashion, following final action by Utah's AQB after notice and public comment.  This was at very considerable investment in staff time, study by AQB members and meaningful contributions by the public.  Under the provisions of Utah's 309 SIP, additional pollution controls were applied to several units at Hunter & Huntington which reduced PM, SO2 and NOx.    We reached our 2018 SO2 reduction target years early, in 2010, and SO2 has continued to decline.   Analysis of monitoring values after instillation of 

In [9]:
# remove common words and make all words lowercase

stoplist = set('s by at & i we if has is was were for a of the and to in'.split())

tokenizer = RegexpTokenizer(r'\w+') # removes punctuation

# Q: should this step move into corpus cleaning, or could it replace the work done there?

raw_texts = [[word for word in tokenizer.tokenize(value.lower()) if word not in stoplist]
         for key, value in comments.iteritems()]

print(raw_texts[1][110:130]) # all words not in stoplist in comment 1
# (printing only some from the middle)


# remove words that appear only once

frequency = defaultdict(int)
for text in raw_texts:
    for token in text:
        frequency[token] += 1
        
texts = [[token for token in text if frequency[token] > 1] for text in raw_texts]

print(texts[1][110:130]) # all words not in stoplist in comment 1
# that ALSO appear more than once across the whole corpus
# (printing only some from the middle)

[u'values', u'after', u'instillation', u'these', u'rh', u'sip', u'required', u'controls', u'indicate', u'that', u'model', u'had', u'substantially', u'over', u'predicted', u'visibility', u'improvements', u'from', u'nox', u'reductions']
[u'values', u'after', u'these', u'rh', u'sip', u'required', u'controls', u'indicate', u'that', u'model', u'had', u'substantially', u'over', u'predicted', u'visibility', u'improvements', u'from', u'nox', u'reductions', u'rather']
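
The two filtering steps in this cell can be traced by hand on a small made-up corpus. This is only a sketch: the three toy comments and the short stoplist are invented, and `re.findall(r"\w+", ...)` is used as a standard-library stand-in for `RegexpTokenizer(r'\w+')`.

```python
import re
from collections import defaultdict

# Made-up comments and a shortened stoplist, for illustration only
toy_comments = [
    "The haze rule is overdue.",
    "Utah met the haze target early.",
    "The rule is palatable to Utah.",
]
stoplist = set("the is to".split())

# Step 1: lowercase, keep runs of word characters, drop stop words
raw_texts = [
    [word for word in re.findall(r"\w+", comment.lower()) if word not in stoplist]
    for comment in toy_comments
]

# Step 2: count every token across the whole corpus
frequency = defaultdict(int)
for text in raw_texts:
    for token in text:
        frequency[token] += 1

# Step 3: keep only tokens whose corpus-wide count exceeds one
texts = [[token for token in text if frequency[token] > 1] for text in raw_texts]
print(raw_texts)
print(texts)
```

Because the count is corpus-wide, a word used twice inside a single comment would also survive the filter, even if no other comment contains it.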


Next, we show which words were removed from comment 1 when building "texts".

In [10]:
print(str(len(raw_texts[1])) + ": number of words in comment 1 not in stoplist   ")
print( str(len(texts[1]))+": same, removing words that do not appear in other comments in the corpus (will not be in the dictionary)")

200: number of words in comment 1 not in stoplist   
194: same, removing words that do not appear in other comments in the corpus (will not be in the dictionary)


In [11]:
# words in comment 1 that appear only once in the whole corpus
for i, w in enumerate(raw_texts[1]):
    if (w not in texts[1]):
        print(str(i) + '  ' + w)

112  instillation
157  polarization
160  divert
172  politically
173  palatable
176  attainable


In [12]:
# number of times a word appears in the collection of comments
print("utah: " + str(frequency['utah']))
print("instillation: " + str(frequency['instillation']))

utah: 5594
instillation: 1


In [13]:
# comments where the word "existing" occurs
word = 'existing'
print(frequency[word])
for i, c in enumerate(texts):
    if word in c:
        print(i)

632
0
2
4
8
10
11
14
19
21


Here we build the dictionary using corpora from gensim and save a copy to disk.  The dictionary is held in RAM, which we will change later on.

In [14]:
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/dstore.dict')  # store the dictionary, for future reference
print(dictionary)

Dictionary(22734 unique tokens: [u'fawn', u'11546', u'11545', u'11548', u'11549']...)


In [15]:
# print(dictionary.token2id)

from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

take(15, dictionary.token2id.iteritems())

[(u'fawn', 22167),
 (u'11546', 5366),
 (u'11545', 5367),
 (u'11548', 5368),
 (u'11549', 5369),
 (u'localized', 5372),
 (u'3483', 5373),
 (u'sprague', 5374),
 (u'19395', 5375),
 (u'refunding', 5378),
 (u'stipulate', 5388),
 (u'appropriation', 5389),
 (u'rawhide', 5566),
 (u'bringing', 5392),
 (u'wooded', 341)]

The fact that we are storing a lot of numbers (see the output above) may be a concern.  On the other hand, the proposal itself uses numbers to refer to sections or other proposals, so we may want to keep at least some of them.
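
One possible compromise, sketched below purely as an assumption, is to keep short numeric tokens (years such as 2018, section numbers such as 308 and 309) and drop longer ones. The helper `drop_long_numbers` and its `max_digits` threshold are hypothetical and were not part of the analysis above; the right cutoff would need checking against the proposal text.

```python
def drop_long_numbers(tokens, max_digits=4):
    """Drop tokens made only of digits that are longer than max_digits."""
    return [t for t in tokens if not (t.isdigit() and len(t) > max_digits)]

# Sample tokens taken from outputs shown earlier in this notebook
sample = ["309", "sip", "2018", "11546", "19395", "so2", "utah"]
filtered = drop_long_numbers(sample)
print(filtered)
```

Mixed tokens such as "so2" are untouched because they are not purely numeric.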

In [16]:
new_doc = tokenizer.tokenize(("Years and years of truely serious litigation.").lower())
new_vec = dictionary.doc2bow(new_doc)
print(new_vec) 

[(1173, 1), (1472, 2), (4933, 1)]


In [17]:
print(dictionary[1173])
print(dictionary[1472])
print(dictionary[4933])

serious
years
litigation


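The word "years" appears twice, and "litigation" and "serious" once each.  The result ignores punctuation, case, and stop words: the tokenizer and lower() take care of the first two, and doc2bow skips any word that is not in the dictionary.  Apparently "truely" doesn't appear more than once in the collection of comments, if at all.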

In [18]:
new_corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/dstore.mm', new_corpus)  # store to disk, for later use
# for c in new_corpus:
#     print(c)
# print(new_corpus[0:1])
print(str(len(new_corpus[1]))+": unique words in comment 1 that are in the dictionary")

155: unique words in comment 1 that are in the dictionary


In [19]:
# count=0
# for vector in new_corpus:  # load one vector into memory at a time
#     if (0<count<3):
# #         print(vector)
#         print(len(vector))
#         count+=1

for i in range(0,3):
    print(len(new_corpus[i]))

4933
155
1516


I want to organize the comments by distance, perhaps by determining which ones share words.

For example, new_corpus[1] and new_corpus[2] share the word 'under' (ID 26): it appears 1 time in comment 1 and 7 times in comment 2.

word2vec should do this for me, though.
I'll keep going through the gensim tutorial.
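
The shared-word check described above can be sketched in a few lines. The two vectors are made up (ID 26 with counts 1 and 7 mirrors the 'under' example), `shared_words` and `jaccard` are hypothetical helpers, and Jaccard overlap is just one simple notion of closeness, not the one the gensim tutorial goes on to use.

```python
def shared_words(bow_a, bow_b):
    """Map each word ID present in both vectors to (count in a, count in b)."""
    counts_a = dict(bow_a)
    counts_b = dict(bow_b)
    return {wid: (counts_a[wid], counts_b[wid]) for wid in counts_a if wid in counts_b}

def jaccard(bow_a, bow_b):
    """Fraction of distinct word IDs the two vectors have in common."""
    ids_a = set(wid for wid, _ in bow_a)
    ids_b = set(wid for wid, _ in bow_b)
    union = ids_a | ids_b
    return len(ids_a & ids_b) / float(len(union)) if union else 0.0

# Made-up bag-of-words vectors: lists of (word ID, count) pairs
vec_a = [(3, 2), (26, 1), (40, 1)]
vec_b = [(26, 7), (40, 2), (51, 1), (60, 4)]
overlap = shared_words(vec_a, vec_b)
similarity = jaccard(vec_a, vec_b)
print(overlap)
print(similarity)
```

Jaccard ignores the counts entirely, so a word used once and a word used seven times weigh the same; a count-aware measure would be needed to capture that difference.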

Next we make it so the entire corpus is not stored in RAM, but read in one comment at a time.  I'm not sure that this is necessary since we already store the original corpus.  Perhaps this is useful during corpus collection?
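
One case where streaming would matter is a corpus that sits on disk and is too large to load at once. The sketch below assumes a layout of one comment per line in a text file, which is hypothetical and not how the comments are stored here; `LineCorpus`, the temporary file, and the three-word dictionary are invented for the example.

```python
import os
import re
import tempfile

class LineCorpus(object):
    """Yield one bag-of-words vector per line of a text file, reading lazily."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:  # only one line is held in memory at a time
                counts = {}
                for token in re.findall(r"\w+", line.lower()):
                    if token in self.token2id:
                        wid = self.token2id[token]
                        counts[wid] = counts.get(wid, 0) + 1
                yield sorted(counts.items())

# Write a tiny made-up corpus to a temporary file, one comment per line
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("Haze rule, haze target.\nUtah met the rule.\n")
tmp.close()

stream = LineCorpus(tmp.name, {"haze": 0, "rule": 1, "utah": 2})
first_pass = list(stream)
second_pass = list(stream)  # __iter__ reopens the file, so the corpus can be re-read
os.remove(tmp.name)
print(first_pass)
```

Defining `__iter__` on a class, rather than using a bare generator, is what allows the corpus to be traversed more than once.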

In [20]:
class MyCorpus(object):
    def __iter__(self):
        for key, comment in comments.iteritems():
            yield dictionary.doc2bow(tokenizer.tokenize(comment.lower()))

            
#  raw_texts = [[word for word in tokenizer.tokenize(value.lower()) if word not in stoplist]
#          for key, value in comments.iteritems()]

# texts = [[word for word in document.lower().split() if word not in stoplist]
# >>>          for document in documents]
# >>>

In [21]:
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)

<__main__.MyCorpus object at 0x10cda34d0>


In [22]:
# check that the streamed corpus yields vectors of the same lengths as new_corpus

len_1 = []
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    len_1.append(len(vector))
print(len_1)

len_2 = []
for i in range(0,len(new_corpus)):
    len_2.append(len(new_corpus[i]))
print(len_1==len_2)

[4933, 155, 1516, 9, 4945, 214, 414, 419, 531, 231, 20904, 2110, 357, 506, 2836, 375, 652, 253, 233, 5001, 336, 586, 425]
True


Next we will determine which comments are near each other using bag of words, continuing the tutorial with "Topics and Transformations".