In [140]:
import attr
import os
import spacy
import re
import torch

import numpy as np

from annoy import AnnoyIndex
from cached_property import cached_property
from glob import glob
from tqdm import tqdm_notebook
from itertools import islice
from boltons.iterutils import chunked_iter

from sent_order.models import kt_regression as kt_reg

In [141]:
sent_encoder = torch.load(
    '../../plot-ordering/data/models/new/kt-reg/sent_encoder.68.bin',
    map_location={'cuda:0': 'cpu'}
)



In [142]:
nlp = spacy.load('en')
nlp.add_pipe(nlp.create_pipe('sentencizer'))

In [143]:
def clean_text(text):
    # Raw string so `\s` reaches the regex engine instead of being parsed as a string escape.
    return re.sub(r'\s{2,}|\n', ' ', text.strip())
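
A quick sanity check of `clean_text`, as a minimal sketch. The function is re-declared so the snippet runs on its own, and the sample string is made up rather than taken from the corpus. It shows that runs of whitespace and newlines collapse to one space, while a lone tab survives because the pattern only matches two or more whitespace characters or a newline.

```python
import re


def clean_text(text):
    # Same pattern as the cell above, written here as a raw string.
    return re.sub(r'\s{2,}|\n', ' ', text.strip())


# Made-up input: leading/trailing spaces, a run of spaces, newlines, one tab.
sample = "  Hello   world\nfoo \n bar\tbaz  "
cleaned = clean_text(sample)
print(repr(cleaned))
```

If single tabs should be normalized as well, a pattern such as `r'\s+'` would do it, at the cost of changing the behaviour of the original cell.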

In [144]:
@attr.s
class Sentence:
    
    text = attr.ib()
        
    @cached_property
    def doc(self):
        return nlp(self.text, disable=['parser', 'tagger', 'ner'])
    
    def tokens(self):
        return [t.text for t in self.doc]
    
    def sent_order_variable(self):
        return kt_reg.Sentence(self.tokens()).variable()
    
    def sent_order_vector(self):
        x = self.sent_order_variable()
        return sent_encoder([x])[0].data.tolist()

In [145]:
@attr.s
class Segment:
    
    path = attr.ib()
    
    def text(self):
        with open(self.path) as fh:
            return clean_text(fh.read())
        
    @cached_property
    def doc(self):
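        # 'tokenizer' is not a pipeline component, so listing it under `disable` had no effect.
        # With the parser disabled, the sentencizer supplies `doc.sents`.
        return nlp(self.text(), disable=['parser', 'tagger', 'ner'])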
    
    def sentences(self):
        for sent in self.doc.sents:
            yield Sentence(sent.text)

In [146]:
@attr.s
class NewspaperCorpus:
    
    root = attr.ib()
    
    def paths(self):
        return glob(os.path.join(self.root, '**/*.txt'), recursive=True)
    
    def segments(self):
        for path in tqdm_notebook(self.paths()):
            yield Segment(path)
            
    def sentences(self):
        for segment in self.segments():
            yield from segment.sentences()

In [147]:
c = NewspaperCorpus('../data/kathy2012/newspapers2012/')

In [148]:
sents = {
    i: sent
    for i, sent in enumerate(islice(c.sentences(), 500000))
}

In [149]:
len(sents)

500000

In [150]:
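# 1000 must equal the length of each vector returned by sent_encoder.
# 'angular' is Annoy's default metric, made explicit here.
vidx = AnnoyIndex(1000, 'angular')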

for chunk in chunked_iter(tqdm_notebook(sents.items()), 1000):
    
    idxs, sents_ = zip(*chunk)
    
    x = [s.sent_order_variable() for s in sents_]
    x = sent_encoder(x)
    
    for i, v in zip(idxs, x):
        vidx.add_item(i, v.data.tolist())
    
vidx.build(10)










True
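
Annoy returns approximate neighbours, so it can help to compare a few queries against an exhaustive search. The sketch below is a pure-Python brute-force ranking under the angular distance, sqrt(2 * (1 - cos(u, v))), which is what Annoy's `'angular'` metric is defined as. The names `angular_distance` and `exact_nns` and the toy vectors are hypothetical and not part of the notebook. It assumes no zero-length vectors.

```python
import math


def angular_distance(u, v):
    """Return sqrt(2 * (1 - cos(u, v))) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def exact_nns(vectors, query_vec, n=10):
    """Exhaustively rank the ids in `vectors` (dict: id -> list of floats)."""
    ranked = sorted(vectors, key=lambda i: angular_distance(vectors[i], query_vec))
    return ranked[:n]


# Toy 2-d vectors standing in for sentence embeddings (illustrative only).
toy = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [-1.0, 0.0]}
top = exact_nns(toy, [1.0, 0.1], n=3)
print(top)
```

Running `exact_nns` over a small random sample of the indexed vectors and comparing its ids with those from `vidx.get_nns_by_vector` gives a rough sense of how much recall the 10-tree index gives up. Doing this over all 500,000 vectors in pure Python would be slow.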

In [153]:
def query(text, n=10):
    sent = Sentence(text)
    for ri in vidx.get_nns_by_vector(sent.sent_order_vector(), n):
        print(sents[ri].text, '\n')
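
Several result lists further down repeat the same sentence, sometimes differing only in quote characters (syndicated stories printed in more than one paper, presumably). A minimal sketch of one way to collapse such repeats before printing follows. `dedupe_key` and `dedupe` are hypothetical helpers, not part of the notebook, and the key is deliberately crude: it lowercases and discards everything except ASCII letters and digits, so accented characters are dropped too.

```python
import re


def dedupe_key(text):
    # Lowercase, then squash every run of non-alphanumerics to a single space.
    return re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()


def dedupe(texts):
    """Keep the first occurrence of each text, comparing by dedupe_key."""
    seen = set()
    unique = []
    for text in texts:
        key = dedupe_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Three result strings copied from the outputs below.
hits = [
    "They’re going to make up their own numbers.”",
    "They’re going to make up their own numbers.’’",
    "They should focus on the people in need,” he said. “",
]
unique_hits = dedupe(hits)
for text in unique_hits:
    print(text, '\n')
```

Inside `query`, one option would be to request more than `n` neighbours from the index, pass the texts through `dedupe`, and print the first `n` that remain. How many extra neighbours to request is a judgment call.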

In [156]:
query("I mean our kids are really well educated, but we need to take politics out of the education system.")

I think we are witnessing another example of the abuses that have been going on for a long time behind the closed doors of our public school system! 

I think we need to simplify our tax system. 

I think our friends are OK with it, too,” Larry said. “ 

I think our guys do a good job of it.” 

I know the board has been working on it with consultants, but I believe we should also get input from every- body, including the teachers and taxpayers.” 

I regret having to pass a portion of these cuts on to students and their families because I know they also are struggling,” Muir said. “ 

I don’t think we should be forced to do it,” Adams said when he heard about the proposed electronic database. “ 

I owe my life to them, and now I must deny them the right to vote! 

I personally will go there less often at the new location because I do not drive that way. 

I know the president has said he will do those things. 



In [154]:
query("my major concern is the fact that we're just going to pass this debt on to the kids, and I think that's embarrassing.")

I want to assure you that this was not simply a campaign gim- mick, it is the honest truth and it is very much who I am. 

My biggest concern though, frankly, isn’t the optics of it but whether it’s fair or not to the young men and young women,” he said in Feb- ruary. “ 

I think the reason you don’t see any such criti- cism is because the “looney right” is such an impor- tant part of the Republican base that they are afraid to anger Rush Limbaugh, Sean Hannity, and such. 

My gut instinct is that no, it shouldn’t be used in hamburgers. 

I want him to enjoy what he can now, as there will come a time when he can’t. — 

I want them to know there are real options out there to relieve the pain, and I want them to use those options. 

My sense is, why not use it? 

My sense is, why not use it? 

I call that avoiding your responsibilities, something my family doesn’t do. 

I know the president has said he will do those things. 



In [155]:
query("I wish both sides would work together and accomplish something instead of being on the campaign trail all the time.")

I think it’s a great chance for the kids to get out in the community,” he said. “ 

I wish we could thank each of you personally for your dona- tions and for your support of these fine students. 

I think they’ve to some degree employed a “gin up the base” strategy, which might ultimately keep Walker in the governor’s mansion but isn’t what a lot of voters signed up for. 

I think it reflects the hard work of a very dedicated team, and also a very good understanding and positive working rela- tionship that we’ve had with the city.” 

I wish I could do that in my business but I can’t because I don’t have the govern- ment giving me free money! 

I do think it is a prime example of the wolf in sheep clothing putting his own private agenda before all those people in the know. 

I wish all the money they spent on those ads was going to the schools, going to the homeless and going to social services in- stead of making these irritating political ads.” 

I hope all votes are based on facts an

In [157]:
query("The government was not meant to run health insurance.")

The business office was broken into and not any of Broadway Limousine’s vehi- cles, Pederson said. 

The river was high enough, but not too high,” Benike said. “ 

The medical board was apparently never con- tacted by Scouting offi- cials or law enforcement in the 1980s. 

The fire apparently started in a lower level of the duplex, where sever- al electronic devices were plugged in. 

The law was in effect for the February elections de- spite numerous lawsuits challenging it. 

The general was ada- mant there was no politi- cization of the process, no White House interference or political agenda,” said Rep. Adam Schiff, D-Calif. “ 

The election was particularly important because Franken’s vic- tory gave Senate Democrats a 60th vote in favor of President Barack Obama’s national health care pro- posal — the deciding vote to over- come a Republican filibuster. 

The Mexican government was not notified the pro- gram existed. 

The law was in effect for the February election despite numero

In [158]:
query("There has to be respect for your job and the rules and your employers.")

There's a possibility that, if and when he gets up here, he might remember things," Wegner said last week. " 

There’s too much at stake to drop it now. 

There’s no saying for sure what’s causing the diseases in what is still a relatively small percentage of the fish. 

There’s going to have to be changes made to get Dale’s vote,’’ Welhouse said. ‘‘ 

There’s going to have to be changes made to get Dale’s vote,” Welhouse said. “ 

There’s going to have to be changes made to get Dale’s vote,” Welhouse said. “ 

There will be some ‘maybes’ and some disagree- ments,” he said. “ 

There’s nothing wrong with asking our employees to shoulder a larger portion of the costs of their benefits, he said. “ 

There should be no fireworks anywhere, for any reason. 

There is a con- cern that overseas security doesn’t match ours. 



In [159]:
query("This is wonderful.")

This is excellent. 

This makes us feel wonderful. 

This is just so wonderful.” 

This is just so wonderful.” 

This is perfect. 

This is good, good education policy reform.” 

This is great for children, grandchildren, best friends and families. 

This is amazing,” he said. “ 

This is very exciting,” Miller said. “ 

This is a great opportu- nity for us today,” Walker said. “ 



In [172]:
query("This is terrible.")

This is horrible. 

This is unacceptable. 

This is horrible …” Williamson said. “ 

This is troubling. 

This is amazing,” he said. “ 

This is a particularly bad time to pass a tax that hits downtown busi- ness,” he said. “ 

This obviously is one category in which being below average feels good. 

This is wish- ful thinking at best. 

This is just so wonderful.” 

This is just so wonderful.” 



In [160]:
query("My God, I've been here almost forty years.")

My father had diabetes, and I watched its progress for 21 years. 

That was 40 years ago. 

That’s 12 years ago. 

It’s a far cry from just a few years ago. 

Her participation in the Fulbright Program makes her one of the 111,000 Americans who have been part of the program since its start more than 60 years ago. 

It’s been struggling over the last few years. 

That happened 85 years ago. 

My father was right about one thing: The military robots aren’t carrying the colors of the Venus Interplanetary Expedition forces, but those of the U.S. Army. 

The massive tree succumbed to Dutch elm disease and was cut down two years ago. 

So I say ﬁ ve years ago. 



In [161]:
query("The money going into this campaign is so frustrating to see.")

The need for something like this is apparent, and it’s so inspiring to see it recognized by so many in the community.” 

A mandate on individuals recog- nizes this implicit contract.” 

The federal government has no business doing this,” he said. 

The way this is right now it’s going to be the way I feel on Election Day, unless something comes out of the box to sway me,” the 50-year-old said. “ 

After going over this do you believe his plan is a smart buy? 

The difficulty with laws like this is the understanding that medicine is, for all that it is, not an exact science. 

The cost of the storm is incalculable at this point.” 

It’s not going to happen this election cycle, and that’s a huge challenge for our democracy. 

One may wonder how all of this volunteer time is spent. 

It is not year clear, at this stage of the process, how the sanctions could affect gasoline prices. 



In [162]:
query("Well, I don't see the economy recovering real quickly.")

Well, we do care. 

Well, that’s my opinion and I’m sticking to it. 

Actually, I’m trying to lose several. 

Well, folks, I’m sorry to say it’s not coming back. 

I’m going to try and get America healthier again because 138 million Ameri- cans have one or more chronic illnesses. 

I really understand I’ll never be up here again. 

I’m going to keep doing it. 

Well, that’s the same thing the fed- eral government is doing right now. 

We reconnected via the Internet and have become close again. 

I’m going to do everything in my power to get it done and ﬁ gure out a way to have a balance. 



In [163]:
query("So it seems like government in general is kind of unresponsive to your concerns.")

So if Romney and Ryan want to make laws and not just speeches, they will have to compromise. 

So it was just a logi- cal, no-brainer to take that down. 

So we obviously have a dif- ferent opinion and a differ- ent way of looking at spending and how to serve the taxpayers.” 

And on a weekday when many people couldn’t attend unless they took off work. 

So where are we Americans, post-conven- tion, in this quadrennial election process? 

So the presi- dent, and I think all of us here, don’t like the fact that people have to die. 

And so the media narrative boils it all down into easy- to-understand parables, sometimes about race. 

So I think it’s important for his teammates to maybe speak for him at times. 

And we’re careful not to have too many irons in the fire and shortchange any- one because we still want to have fun in retirement.” 

So right now it’s just shadowy allegations or aides who allegedly did bad things. 



In [164]:
query("They should mind their own business.")

They should attempt to resolve them by communicating with each other — preferably with the help of a licensed marriage counselor. 

They will be in their “punishment” phase when they do. 

They see instead their own supersti- tions and suppositions, paranoia and guilt, night terrors and vulnerabili- ties. 

They’re going to make up their own numbers.” 

They’re going to make up their own numbers.” 

They’re going to make up their own numbers.’’ 

They should be centered on the massive problems which face this country. 

They just need somebody who knows what they’re going through. “ 

They need to be that way for sur- vival in today’s society. 

They should focus on the people in need,” he said. “ 



In [173]:
query("I drove 35 miles one way to go and teach.")

I took to drumming like a fish to water,” Armato said. “ 

I moved out of state when I turned 18, but Tara still lives there. 

I fell in love with the staff – they’re the reason the momentum is there and the reason the library won Library of the Year in 2011, and we can work as a team to really strive to new heights.” 

I just walked over and had a peep and took some pictures,’’ he said. 

I kept my maiden name and hyphenated it, but was proud that my husband, children and I ALL created the “Smith” family. — 

I started out kind of rough,” Kirichkow said. “ 

I was actually looking forward to coming back,” she said, adding she liked the idea of participating in the event that helped people. 

I started with the Argentine tango, which might be a bit too much for some folks, but I have learned to absolutely love it. 

I’ve sat alongside the road crying after I hit a cat, wondering why people let them outside. 

I spent the next 20 years often partnered with Evan and a number of other pa

In [177]:
query("It's millions of dollars in taxpayers' money")

It’s not appropriate to have a business enterprise in a residential area,” he said. “ 

It sends signals through nerve fibers that are part of the sympa- thetic nervous system. 

It’s hard to have a family fight in front of strangers. 

It’s a dark place with few exits and lots of people. 

It’s an awful, slow-motion tragedy touching tens of millions of Americans, especially when you add all the family members and dependents who also are af- fected. 

It all comes down to trust and faith in the human being’s ability to navigate one’s own life successfully. 

It's always because of a love of God she has in her heart. 

It’s on: GOP, Democrats fight over women voters WASHINGTON (AP) – Is the 2012 election shaping up to be all about women? 

It’s nation’s transportation sys- tem predicted in 2009 that the U.S. will face nightmarish congestion unless it spends more. 

It is the only system that taxes wealth. 

