In [1]:
import attr
import os
import spacy
import re
import sent2vec

import numpy as np

from annoy import AnnoyIndex
from cached_property import cached_property
from glob import glob
from tqdm import tqdm_notebook
from itertools import islice
from boltons.iterutils import chunked_iter

In [2]:
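nlp = spacy.load('en')
# Rule-based sentence splitter, so sentence boundaries are still set
# when the parser is disabled at call time.
nlp.add_pipe(nlp.create_pipe('sentencizer'))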

In [4]:
model = sent2vec.Sent2vecModel()
model.load_model('../data/wiki_unigrams.bin')

In [5]:
def clean_text(text):
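    # Collapse newlines and runs of whitespace into single spaces.
    # Raw string avoids the invalid-escape warning for '\s'.
    return re.sub(r'\s{2,}|\n', ' ', text.strip())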
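
The query results further down show words split by end-of-line hyphenation in the source text (`los- ing`, `govern- ment`), which `clean_text` leaves in place. A minimal sketch of one way to rejoin them is shown here. `dehyphenate` is a hypothetical helper, not part of this notebook, and its regex is a heuristic assumption rather than a tested cleaning rule.

```python
import re

def clean_text(text):
    # Same behaviour as the notebook's helper: collapse whitespace runs and newlines.
    return re.sub(r'\s{2,}|\n', ' ', text.strip())

def dehyphenate(text):
    # Hypothetical helper (assumption): rejoin "los- ing" -> "losing".
    # Only lowercase continuations are joined, so "in- Edwards" is left alone.
    # Genuine compounds broken at a line end ("self- defeating") lose their hyphen.
    return re.sub(r'(\w)- ([a-z])', r'\1\2', text)

print(dehyphenate(clean_text('we are los-\ning focus  of what\nmatters')))
```

Applying it before sentence splitting would change the text that gets embedded, so any such step should be checked against a sample of the corpus first.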

In [13]:
@attr.s
class Sentence:
    
    text = attr.ib()
        
    @cached_property
    def doc(self):
        return nlp(self.text, disable=['parser', 'tagger', 'ner'])
    
    def embedding(self):
        return model.embed_sentence(self.text)

In [14]:
@attr.s
class Segment:
    
    path = attr.ib()
    
    def text(self):
        with open(self.path) as fh:
            return clean_text(fh.read())
        
    @cached_property
    def doc(self):
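        # Only the sentencizer needs to run here. The tokenizer is not a
        # pipeline component, so it cannot be switched off via `disable`.
        return nlp(self.text(), disable=['parser', 'tagger', 'ner'])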
    
    def sentences(self):
        for sent in self.doc.sents:
            yield Sentence(sent.text)

In [15]:
@attr.s
class NewspaperCorpus:
    
    root = attr.ib()
    
    def paths(self):
        return glob(os.path.join(self.root, '**/*.txt'), recursive=True)
    
    def segments(self):
        for path in tqdm_notebook(self.paths()):
            yield Segment(path)
            
    def sentences(self):
        for segment in self.segments():
            yield from segment.sentences()

In [16]:
c = NewspaperCorpus('../data/kathy2012/newspapers2012/')

In [25]:
# Key the first 500,000 sentences by integer id, as Annoy items need integer ids.
sents = dict(enumerate(islice(c.sentences(), 500000)))

In [26]:
len(sents)

500000

In [27]:
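# 600 is the dimensionality of the wiki_unigrams sent2vec vectors.
# 'angular' is Annoy's default metric, stated explicitly here.
vidx = AnnoyIndex(600, metric='angular')

for i, sent in tqdm_notebook(sents.items()):
    vidx.add_item(i, sent.embedding())

# Build a forest of 10 trees; more trees trade build time and size for recall.
vidx.build(10)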




Exception in thread Thread-9:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/dclure/Projects/infuzzy/env/lib/python3.6/site-packages/tqdm/_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/usr/local/bin/../Cellar/python3/3.6.2/bin/../Frameworks/Python.framework/Versions/3.6/lib/python3.6/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration






True
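
Annoy's angular metric is the Euclidean distance between normalized vectors, sqrt(2 * (1 - cos(u, v))), and its results are approximate. A minimal sketch of an exact, brute-force search under the same metric follows; it could be used to spot-check the index on a small sample. The function names are hypothetical, and the pure-Python loops are an assumption made for clarity and would be far too slow for all 500,000 vectors.

```python
import math

def cosine_similarity(u, v):
    # Plain cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    # Distance used by Annoy's 'angular' metric: sqrt(2 * (1 - cos)).
    # max() guards against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(u, v))))

def brute_force_nns(vectors, query_vec, n=10):
    # vectors: dict mapping item id -> vector (hypothetical layout).
    # Returns the ids of the n exact nearest neighbours, closest first.
    ranked = sorted(vectors, key=lambda i: angular_distance(vectors[i], query_vec))
    return ranked[:n]

sample = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
print(brute_force_nns(sample, [1.0, 0.1], n=2))
```

Comparing these ids with `vidx.get_nns_by_vector` on the same sample gives a rough sense of how much recall the 10-tree index gives up.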

In [28]:
def query(text, n=10):
    sent = Sentence(text)
    for ri in vidx.get_nns_by_vector(sent.embedding(), n):
        print(sents[ri].text, '\n')
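
Several result lists below repeat the same sentence verbatim (syndicated quotes, commodity ticker lines), which uses up slots in the top 10. A minimal sketch of an order-preserving filter is given here. `dedupe_results` is a hypothetical helper, not part of this notebook, and normalizing by whitespace and case is an assumption about what should count as a duplicate.

```python
def dedupe_results(texts, n=10):
    # Keep the first occurrence of each text, preserving rank order,
    # and stop once n unique results have been collected.
    seen = set()
    unique = []
    for text in texts:
        # Normalize whitespace and case so trivially different copies match.
        key = ' '.join(text.split()).lower()
        if key in seen:
            continue
        seen.add(key)
        unique.append(text)
        if len(unique) == n:
            break
    return unique

print(dedupe_results(['LIGHT SWEET CRUDE', 'LIGHT  SWEET CRUDE', 'Exclusions apply'], n=2))
```

To use it with `query`, one option would be to request more than `n` neighbours from `vidx` (say `3 * n`, an arbitrary choice) and pass the retrieved texts through the filter before printing.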

In [29]:
query("I mean our kids are really well educated, but we need to take politics out of the education system.")

If we want our schools to thrive we need to get more work out of the great teachers while dumping the bad ones. 

As we obsess over who is right or wrong, we are los- ing focus of what really matters: our future, our children. 

We hope people come out to the event because it keeps the local area music going and helps out our youth athletics, which are so important for developing the area and keeping kids out of trouble,” Cazel said. 

Our job in govern- ment is to make the soil rich, educate our children in quality schools we are proud of, make our com- munities safe, train our workers in the latest tech- nologies, keep our envi- ronment clean and healthy, modernize our transpor- tation and communication networks, and support a quality of life we enjoy. “ 

Often, they do a far better job shaping the minds and character of our youth than do our colleges. 

However, we concluded that turning the page on this legacy legal issue through the positive steps we are taking is in the best int

In [30]:
query("my major concern is the fact that we're just going to pass this debt on to the kids, and I think that's embarrassing.")

I think this is going to set down the predicate for the fall election and I think it’s going to show the country that Wisconsin is going to be very close for the presiden- tial election and I think that (Mitt) Romney’s going to take another look at Wiscon- sin and he’s going to be here a lot more,” Thompson said. “ 

I don’t think you’re going to see that tomorrow.” 

Referring to Romney, he said, ‘‘I think he’s going to find this a long campaign.’’ 

Referring to Romney, he said, ‘‘I think he’s going to find this a long campaign.’’ 

I don’t think I’m ever going to know the truth,” he said. 

I just don’t think that’s fair,” Rep. James McGovern, D-Mass., 

I just think it’s a terrible situation and I think that if they can’t come to a decision over the next few weeks, then the Super Bowl will not be anything to watch, either.” 

But while standing near Moser and Autumn on Sunday, she added: “I think he’s going to get to know one real well.” “ 

I think that it’s really cool to have so

In [31]:
query("I wish both sides would work together and accomplish something instead of being on the campaign trail all the time.")

A manufactur- ing renaissance is being preached both from the White House, on the GOP campaign trail and in Super Bowl commercials. 

Regardless of lofty notions and pas- sionately held ideals, nothing on the candidates’ wish lists will be accom- plished if legislators do not learn to work together. 

The only thing I can say is that the fund had been set up under the guidance of the GAB (Government Account- ability Board),’’ campaign spokeswoman Ciara Matthews said. ‘‘ 

My name is Rick Santo- rum, and I am the only authentic, passionate con- servative who can unite the GOP,’’ Santorum wrote in a fundraising missive sent as Iowa caucus votes were being tallied in a race he barely ‘‘I need an URGENT contribution of at least $35 today to unite con- servative voters and win the lost. 

Because Dave has been deployed overseas, and Rachel frequently travels on business, they say one of the most rewarding parts of the competition was the chance to spend so much time together. 

Such a group

In [32]:
query("The government was not meant to run health insurance.")

I was not allowed. 

Brown was not in- jured. 

He was not bullied.’’ 

While Rodriguez’s run was automatic, Pettersen’s was anything but. 

Nets were not used. 

The third deputy was not hurt. 

The driver was not injured. 

The baby was not hurt. 

Santorum was not on the ballot. “ 

It was not an overstatement. 



In [33]:
query("There has to be respect for your job and the rules and your employers.")

Thank you for your time and your vote. – 

Pizza Participants The Alley Little Caesars Pizza Pub Sixth Street Market- Jada’s Racheli’s Deli Frankie’s Pizza- Best Pepperoni White River Saloon- Best Specialty Rafﬂe and Contest Sponsors DaLous Bistro Ashland Lake Superior Lodge Eldorado’s Country Buds Flower Shoppe AmericInn/Splashland HSI Business Center Omer Nelson Electric Thank you for your support! 

Special thank you to students and sta(cid:283) of the Eleva-Strum schools for all you’ve done, Pastor Solem for all your kindness and prayers, and the EMT’s for your care and job you did. 

Thank you for your example and your true grit. — 

Thank you for your time and consideration. 

Tank you for trusting me with your business, I have loved making your jewelry! 

Eau Claire DAVID HANVELT To President Obama: How do you and your family practice your faith? 

The President and his underlings,” writes one fiery critic, “are your accuser, your judge, your jury and your executioner all wrappe

In [34]:
query("This is wonderful.")

Everyone is welcome! 

What is … An editorial? 

Bayfield Mayor Larry MacDonald is proud. “ 

Today’s Birthdays: Com- poser-conductor John Wil- liams is 80. 

Actress Mary McCormack is 43. 

Actor Peter Riegert is 64. 

Retired MLB All-Star Pete Rose is 71. 

Author Joyce Carol Oates is 74. 

It is self-defeating.” 

Actor Jerry Houser is 60. 



In [35]:
query("This is terrible.")

Everyone is welcome! 

What is … An editorial? 

Bayfield Mayor Larry MacDonald is proud. “ 

Today’s Birthdays: Com- poser-conductor John Wil- liams is 80. 

Actress Mary McCormack is 43. 

Actor Peter Riegert is 64. 

Retired MLB All-Star Pete Rose is 71. 

Author Joyce Carol Oates is 74. 

It is self-defeating.” 

Actor Jerry Houser is 60. 



In [36]:
query("My God, I've been here almost forty years.")

It's been almost 20 years. 

In 2011, 149 saw-whets were caught, while 158 had been processed as of Oct. 29 this year, includ- ing 30 in one night. 

Dear Abby: I am 20 and have been with my boyfriend, “Griffin,” for five years. 

For the last 17 years, Colgrove’s partner has been Greg Rigoni of Hurley. 

CMT Music ’ Animal Cops Miami Å My Wife Top 20 Countdown (N) ’ Å Hates Chris My Wife Parking 700 Club The Sandlot (1993) ››‡ Tom Guiry, Mike Vitar. 

News (N) Top 20 Countdown ’ Å Zumba Fit Austin Finding Bigfoot ’ Å Tabatha Takes Over Paid Prog. 

Top 20 Countdown (N) ’ Å My Wife Still Stnd The Parkers Fast Money Halftime CNN Newsroom (N) Comedy RENO 911! 

Å Top 20 Countdown ’ Å Paid Prog. 

Murder, She Wrote Å Necessary Roughness Top 20 Video Countdown ’ Paid Prog. 

Å Housewives/NJ Housewives/NJ Flipping Out Å CMT Music ’ TRIA Cindy C Fareed Zakaria GPS (N) S. Harvey Housewives/NJ Top 20 Countdown ’ Paid Prog. 



In [37]:
query("The money going into this campaign is so frustrating to see.")

Curing the problem will require identifying both the action and the reaction: where the money is coming from, and also where it’s going — not to mention why so much of it is going there. — 

The primary is the time for us to vote for the best candidate, so that’s what I’m going to do,” Rudrud said. 

I feel it is time to realize his rhetoric is just political manipulation (telling people what they want to hear) instead of honest campaign promises and vote for someone else in the November, 2012 elections. 

Town of Waukesha rezoning I just wanted to say how frustrating it is reading articles like this about Walgreens and Aldi needing a place to build. 

The fact is that there are more reason- able ways for them to save money in this time of disparity but they don’t want to sac- rifice anything and it’s easier to raise the prices. 

Paul wants to talk about the wisdom of borrowing money from China to dis- perse it to other countries. 

Is it really going to close down at any min- ute, or

In [38]:
query("Well, I don't see the economy recovering real quickly.")

Please see FALK P.2A COMING TUESDAY: DOWNTOWN BUILDING PLANS EXPLORED Join Our Email List! 

Please see HEROIN P. 2A COMING SATURDAY: BELOIT COMIC STORE CELEBRATES TEN YEARS Join Our Email List! 

Please see PAGEANT P. 2A COMING WEDNESDAY: LOCAL RECALL ELECTION RESULTS LISTED Join Our Email List! 

Exclusions apply, see pass. 

For more information, see www.washburnherita- geassociation.org. 

Please see Tuesday’s Freeman for the complete notice. 

We’ll see about that,” Vrakas said. “ 

Voters instinctively see the court (or prefer to see it) as an independent en- tity, immune to partisan cross fire. 

We’ll see where the evidence takes us.” 

I don’t see what Barrett has done for Milwaukee.” 



In [39]:
query("So it seems like government in general is kind of unresponsive to your concerns.")

Personally, I kind of feel like his time has passed,” Westrate said. “ 

There was a time I would have thought it a capital offense to even consid- they really want is their kind of big government. 

If he keeps talking like that, whole new audiences might do just that. 

Hunters should sit back and listen too, because it sounds like Kroll wants to give them a bigger role in managing the herd. 

The governor has ad- dressed this with his son, just like any father would do,” the campaign said 689194 10-16-12 Western Wisconsin Locations Eau Claire Black River Falls Chippewa Falls Tomah Hudson Rice Lake Durand Siren Menomonie Onalaska 888-849-0404 www.miracle-ear-eauclaire.com 689287 • 10-16-12 680482 7-24-12 www. 

Kim Vogt, a La Crosse sociology pro- fessor, says there is STATE VOICES Drownings Adam Bradley is like a whole bunch of other young volunteers in La Crosse. 

A vital attitude lost in the non-individ- ualized tests is emphasized in a letter to the editor of the New York Times 

In [40]:
query("They should mind their own business.")

You’re allowed to own a gun. 

■ Obama positioned him- self as a voice of civility and – to a nation of voting par- ents – maturity in the debate over whether health insur- ance plans should cover con- traception. 

This full color cook- ThThThisisis ff ululll l cococoloolor cococookokok-- book will feature book will feature your recipes and your photos! 

Another easy way for each of us to help out would be if everyone would get a couple of credit cards and treat your friends and favorite sons to a spend- ing spree. 

If the Republican Party is to survive, it may need to update its philosophy for the 21st century. 

All questions should be directed to the Pub- lic Works Department at (715) 682-7061. 

That means anyone with some cash will be able to own part of a Silicon Valley icon that quickly transformed from dorm-room startup to cultural touchstone. 

The whole sordid ep- isode should prompt Congress to pass legisla- tion requiring candidates to report large gifts, in- Edwards clu

In [41]:
query("I drove 35 miles one way to go and teach.")

I just want to go somewhere else.” – 

It’s not going to bring my son back, and this is the worst thing any mother could go through.” 

I knew I was never going to be rich living and working way up-north. 

What’s going on?” 

She knows when it goes well because she gets to go to Victory Lane. 

We’re not only going to get Republicans and Indepen- dents, we’re also going to get discerning Democrats.” 

8,000 miles on engine. 

I don’t see any way for it not to go up,” Krokowski said of food prices. 

I’m not saying, ‘I’m not going to push it, but you guys go ahead,’” Walker said of lawmakers. 

But that kind of thinking must stop or some- one is going to get hurt or killed. 



In [42]:
query("It's millions of dollars in taxpayers' money")

States don’t want to cede control of their massive utilities, which rake in billions of dollars in annual revenue. 

LIGHT SWEET CRUDE 1,000 bbl.- dollars per bbl. 

LIGHT SWEET CRUDE 1,000 bbl.- dollars per bbl. 

LIGHT SWEET CRUDE 1,000 bbl.- dollars per bbl. 

LIGHT SWEET CRUDE 1,000 bbl.- dollars per bbl. 

LIGHT SWEET CRUDE 1,000 bbl.- dollars per bbl. 

New Zealand police raided homes and businesses linked to the founder, Kim Dotcom, on Friday and seized guns, millions of dollars and nearly $5 million in luxury cars, officials there said. 

They accumulate wealth through stock speculation, tax arbitrage, currency manipulation, “outsourcing” and “offshoring” jobs, basically manip- ulating money with no regard for the companies or communities affected. 

But they also worry Greece will be denied €130 billion in bail- out money if it can’t cut its deficit. 

The Commis- sion issued its final report this month, and identified nearly half a billion dollars in annual savings from state