<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Information-Retrieval" data-toc-modified-id="Information-Retrieval-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Information Retrieval</a></span><ul class="toc-item"><li><span><a href="#BM25" data-toc-modified-id="BM25-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>BM25</a></span></li><li><span><a href="#Word2Vec" data-toc-modified-id="Word2Vec-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Word2Vec</a></span></li></ul></li></ul></div>

In [1]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import time
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# change default style figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,matplotlib,nltk,gensim

Ethen 2018-11-19 11:26:26 

CPython 3.6.4
IPython 6.4.0

numpy 1.14.1
pandas 0.23.0
sklearn 0.19.1
matplotlib 2.2.2
nltk 3.2.5
gensim 3.6.0


# Information Retrieval

In [2]:
# https://www.kaggle.com/snapcrack/all-the-news
# https://www.kaggle.com/girianantharaman/dual-embeddings-space-model-demo

DATA_DIR = os.path.join('data', 'all-the-news')
data_path = os.path.join(DATA_DIR, 'articles1.csv')

df = pd.read_csv(data_path, sep=',')
print('dimension: ', df.shape)
df.head()

dimension:  (50000, 10)


Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [3]:
content = df.loc[0, 'content']
content

'WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been 

In [4]:
import re
from nltk.corpus import stopwords


stop_words = set(stopwords.words('english'))


def normalize_text(text):

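    # keep only ASCII letters and whitespace; the flags must be passed by
    # keyword because the fourth positional argument of re.sub is `count`
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)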

    # lower case & tokenize text
    tokens = re.split(r'\s+', text.lower().strip())

    # filter stopwords out of text &
    # re-create text from filtered tokens
    text = ' '.join(token for token in tokens if token not in stop_words)
    return text
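
A detail of `re.sub` matters for the normalization step: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a fourth positional argument is read as `count`, not as flags. The sketch below uses a made-up toy string (not data from this notebook) to show the difference between the two calls.

```python
import re

PATTERN = r'[^a-zA-Z\s]'
text = 'a1' * 300  # toy input: 300 letters interleaved with 300 digits

# A fourth positional argument is interpreted as `count`.
# re.I | re.A has the integer value 258, so at most 258 substitutions happen.
as_count = re.sub(PATTERN, '', text, count=int(re.I | re.A))

# Passing the flags by keyword removes every non-letter character.
as_flags = re.sub(PATTERN, '', text, flags=re.I | re.A)

leftover_digits = sum(ch.isdigit() for ch in as_count)
print(leftover_digits, as_flags == 'a' * 300)
```

On long articles, a capped substitution count leaves punctuation attached to later tokens, which is one way tokens such as `'senate,'` can end up in a vocabulary.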

In [5]:
cleaned_content = normalize_text(content)
cleaned_content

'washington congressional republicans new fear comes health care lawsuit obama administration might win incoming trump administration could choose longer defend executive branch suit challenges administrations authority spend billions dollars health insurance subsidies americans handing house republicans big victory issues sudden loss disputed subsidies could conceivably cause health care program implode leaving millions people without access health insurance republicans prepared replacement could lead chaos insurance market spur political backlash republicans gain full control government stave outcome republicans could find awkward position appropriating huge sums temporarily prop obama health care law angering conservative voters demanding end law years another twist donald j trumps administration worried preserving executive branch prerogatives could choose fight republican allies house central questions dispute eager avoid ugly political pileup republicans capitol hill trump transi

In [6]:
CENTROIDS_PATH = './inputs/centroids/'
BM25_PATH = './inputs/bm25/'

# create the output directories if they do not exist yet
os.makedirs(CENTROIDS_PATH, exist_ok=True)
os.makedirs(BM25_PATH, exist_ok=True)

In [7]:
from utils import CsvTextPreprocessor


CONTENT_INDEX = 9
CONTENT_DIR = './inputs/contents'  # original text
TOKENS_PATH = './inputs/tokens'  # normalized text
text_preprocessor = CsvTextPreprocessor(CONTENT_DIR, TOKENS_PATH)

start = time.time()        
text_preprocessor.preprocess(DATA_DIR, CONTENT_INDEX)
elapsed = time.time() - start
print('elapsed: ', elapsed)

50000it [00:50, 981.52it/s] 
42571it [00:50, 846.87it/s] 
49999it [00:50, 982.36it/s] 

elapsed:  152.11539769172668





## BM25

In [8]:
import glob

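class BM25Sentences:
    """Stream (file_path, tokens) pairs from every .txt file in a directory."""

    def __init__(self, input_dir):
        self.input_dir = input_dir

    def __iter__(self):
        text_files = os.path.join(self.input_dir, '*.txt')
        for file_path in glob.iglob(text_files):
            with open(file_path) as f:
                for line in f:
                    # each line holds space-separated, already normalized tokens
                    yield file_path, line.split(' ')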

In [10]:
from utils import normalize_text

query = 'political stability and economic health'
query_words = normalize_text(query).split(' ')
query_words

['political', 'stability', 'economic', 'health']

In [11]:
from bm25 import BM25

start = time.time()
sentences = BM25Sentences(TOKENS_PATH)
bm25 = BM25().fit(sentences)
elapsed = time.time() - start
print('elapsed: ', elapsed)

bm25

142462it [02:39, 893.45it/s] 


elapsed:  160.083251953125


<bm25.BM25 at 0x11b96add8>

In [12]:
scores = bm25.search(query_words)
scores = np.array(scores)
print(len(scores))
print(len(bm25.doc_path_))
scores[:5]

142462
142462


array([0.        , 0.        , 0.        , 0.        , 3.53871261])
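
The `BM25` class used here comes from a local `bm25` module whose internals are not shown in this notebook. As a rough illustration of what such a scorer computes, here is a minimal sketch of the standard Okapi BM25 formula. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the non-negative idf variant, and the toy documents are all assumptions for illustration, not the module's actual implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against `query` (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # smoothed idf, kept non-negative by the "1 +" inside the log
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # term frequency saturates with k1 and is length-normalized with b
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# hypothetical toy corpus, already tokenized
toy_docs = [
    ['economic', 'health', 'index'],
    ['senate', 'vote', 'health'],
    ['football', 'season', 'opener'],
]
toy_scores = bm25_scores(['economic', 'health'], toy_docs)
print(toy_scores)
```

Documents sharing no term with the query score exactly zero, which matches the many `0.` entries in the score array above.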

In [13]:
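count = 5
# argpartition finds the `count` largest scores without sorting the full array;
# only those `count` candidates are then sorted by descending score
ids = np.argpartition(scores, -count)[-count:]
best = sorted(zip(ids, scores[ids]), key=lambda x: -x[1])
best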

[(68539, 15.81140841689766),
 (14662, 14.613784082568912),
 (95402, 14.064168286153397),
 (70024, 13.972153952455397),
 (102903, 13.965064742436493)]
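
The top-k selection can be wrapped in a small helper. This is a sketch on a made-up score array; `top_k` is a hypothetical name and not part of the notebook's modules.

```python
import numpy as np


def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    scores = np.asarray(scores)
    k = min(k, len(scores))
    # partial selection: the last k positions hold the k largest, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates by descending score
    return candidates[np.argsort(-scores[candidates])]


toy = np.array([0.0, 3.5, 1.2, 9.1, 0.0, 4.4])
top3 = top_k(toy, 3)
print(top3)
```

For a score array with over a hundred thousand entries and a small `k`, this avoids a full sort of the array.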

In [14]:
# use fresh loop names so the `ids` and `scores` arrays are not overwritten
for doc_id, score in best:
    file_path = bm25.doc_path_[doc_id]
    with open(file_path) as f:
        content = f.readlines()
        print(content)

['u trails switzerland singapore economic competitiveness new global index finds americas infrastructure health system primary education lagging world economic forums index also notes three u strengths large market financial sophistication labor efficiency economies worldwide u rank top basic requirements pillars institutions infrastructure macroeconomic environment health primary education years global competitiveness index says authors add u high ranking supported innovation business sophistication market size financial market development labor market efficiency higher education third straight year u hasnt ranked global economic competitiveness since past decade u fallen top five twice sixth seventh notable findings index index heart nearly report takes worlds economic temperature ranks economies according wefs managing board member richard samans years edition identifies large challenge build prosperous inclusive world economic zeitgeist samans says one rising income inequality moun

## Word2Vec

In [15]:
import glob
import logging
from gensim.models import Word2Vec


class MySentences:
    def __init__(self, input_dir, sent_tokenize=False, yield_file_path=False):
        self.input_dir = input_dir
        self.sent_tokenize = sent_tokenize
        self.yield_file_path = yield_file_path

        self._file_count = None

    def __iter__(self):
        file_count = 0
        text_files = os.path.join(self.input_dir, '*.txt')
        for file_path in glob.iglob(text_files):
            with open(file_path) as f:
                for line in f:
                    if self.sent_tokenize:
                        for sentence in nltk.sent_tokenize(line):
                            splitted = sentence.split(' ')
                            if self.yield_file_path:
                                yield file_path, splitted
                            else:
                                yield splitted
                    else:
                        splitted = line.split(' ')
                        if self.yield_file_path:
                            yield file_path, splitted
                        else:
                            yield splitted
      
            file_count += 1

        self._file_count = file_count

    def __len__(self):
        if self._file_count is not None:
            return self._file_count
        else:
            file_count = 0
            text_files = os.path.join(self.input_dir, '*.txt')
            for _ in glob.iglob(text_files):
                file_count += 1

            self._file_count = file_count
            return file_count

In [16]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
size = 100

MODEL_DIR = './models'
os.makedirs(MODEL_DIR, exist_ok=True)

WORD2VEC_CHECKPOINT = os.path.join(MODEL_DIR, 'word2vec')
if os.path.exists(WORD2VEC_CHECKPOINT):
    model = Word2Vec.load(WORD2VEC_CHECKPOINT)
else:
    sentences = MySentences(TOKENS_PATH, sent_tokenize=True)
    start = time.time()
    model = Word2Vec(sentences, size=size, min_count=1, workers=8)
    elapsed = time.time() - start
    print('training word2vec, elapsed: ', elapsed)
    model.save(WORD2VEC_CHECKPOINT)

model

2018-11-19 11:31:44,514 : INFO : loading Word2Vec object from ./models/word2vec
2018-11-19 11:31:46,007 : INFO : loading wv recursively from ./models/word2vec.wv.* with mmap=None
2018-11-19 11:31:46,008 : INFO : loading vectors from ./models/word2vec.wv.vectors.npy with mmap=None
2018-11-19 11:31:46,484 : INFO : setting ignored attribute vectors_norm to None
2018-11-19 11:31:46,486 : INFO : loading vocabulary recursively from ./models/word2vec.vocabulary.* with mmap=None
2018-11-19 11:31:46,487 : INFO : loading trainables recursively from ./models/word2vec.trainables.* with mmap=None
2018-11-19 11:31:46,487 : INFO : loading syn1neg from ./models/word2vec.trainables.syn1neg.npy with mmap=None
2018-11-19 11:31:46,831 : INFO : setting ignored attribute cum_table to None
2018-11-19 11:31:46,832 : INFO : loaded ./models/word2vec


<gensim.models.word2vec.Word2Vec at 0x1af569f28>

In [17]:
model.wv.most_similar(positive=['texas', 'senate'], negative=['alabama'])

2018-11-19 11:31:47,745 : INFO : precomputing L2-norms of word weight vectors


[('congress', 0.5660451650619507),
 ('mcconnell', 0.5639157295227051),
 ('senates', 0.5626640915870667),
 ('congressional', 0.5412375926971436),
 ('senators', 0.5374283194541931),
 ('compromise', 0.521332323551178),
 ('bipartisan', 0.4992099404335022),
 ('lawmakers', 0.4846542775630951),
 ('legislation', 0.4817521572113037),
 ('cornyn', 0.4757631719112396)]

In [18]:
model.trainables.syn1neg

array([[-1.15044944e-01, -1.23589449e-01, -3.17708403e-01, ...,
        -5.46056449e-01, -4.26160991e-02,  3.38331640e-01],
       [-1.16323628e-01, -1.14829942e-04, -1.83716163e-01, ...,
        -4.59382236e-01,  7.80692771e-02,  2.33324707e-01],
       [ 1.74659625e-01, -2.44787801e-02, -1.74992323e-01, ...,
        -3.57166618e-01, -8.44258815e-02,  3.32570553e-01],
       ...,
       [-6.19058535e-02, -3.33593898e-02, -1.30370945e-01, ...,
        -2.35101551e-01,  3.42206731e-02,  2.22871929e-01],
       [-1.12095596e-02, -5.11139221e-02, -7.98190460e-02, ...,
        -2.68821448e-01, -5.53974323e-03,  1.96674615e-01],
       [-7.09969923e-02, -2.63407007e-02, -1.14711694e-01, ...,
        -2.63222098e-01, -3.43595035e-02,  2.01708406e-01]], dtype=float32)

In [19]:
model.wv.vectors

array([[ 1.6682332e+00,  2.1501987e+00, -3.4519074e+00, ...,
        -9.8638493e-01,  3.0872801e-01, -8.4184754e-01],
       [ 4.0478511e+00, -1.0189250e+00,  2.3866029e+00, ...,
        -2.0242831e-01,  2.4158988e+00, -6.1949694e-01],
       [ 1.8393631e+00, -2.4332662e+00,  3.9833120e-01, ...,
        -3.1544128e-01,  4.7272205e-01,  2.5445123e+00],
       ...,
       [ 1.0556578e-02, -1.7846090e-03,  3.8046867e-03, ...,
         9.8366261e-05, -2.0594260e-02, -2.6757052e-02],
       [ 3.7142928e-03,  8.9221317e-03,  2.8417742e-02, ...,
         2.9896203e-02, -4.5573595e-03, -2.3104765e-02],
       [ 1.1918565e-03, -8.8695716e-03,  3.6429144e-02, ...,
         2.7563004e-02, -1.3972600e-03, -1.9709809e-02]], dtype=float32)

In [20]:
model.wv.index2word[:5]

['said', 'trump', 'would', 'one', 'people']

In [53]:
from keyedvectors import most_similar

# compare word similarity in the IN (input) and OUT (output) embedding spaces

# IN similarity: uses the input embeddings, model.wv.vectors
most_similar(model.wv.vectors, model.wv.vocab, model.wv.index2word,
             positive=['texas', 'senate'], negative=['alabama'], topn=10)

[('congress', 0.56604517),
 ('mcconnell', 0.5639157),
 ('senates', 0.56266415),
 ('congressional', 0.5412376),
 ('senators', 0.5374283),
 ('compromise', 0.52133226),
 ('bipartisan', 0.49920994),
 ('lawmakers', 0.48465428),
 ('legislation', 0.4817521),
 ('cornyn', 0.47576317)]

In [54]:
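# TODO: try nltk.word_tokenize and stemming (tokens such as 'senate,' below
# still carry punctuation)
# OUT similarity: uses the output embeddings, model.trainables.syn1neg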
most_similar(model.trainables.syn1neg, model.wv.vocab, model.wv.index2word,
             positive=['texas', 'senate'], negative=['alabama'], topn=10)

[('senators', 0.8506255),
 ('senates', 0.842402),
 ('cornyn', 0.8409874),
 ('senate,', 0.8402041),
 ('nunes,', 0.8382237),
 ('whitehouse,', 0.8379873),
 ('grillings', 0.83711666),
 ('systemthe', 0.8369872),
 ('senate.', 0.8369804),
 ('cosponsoring', 0.83695364)]
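
The `most_similar` helper above is imported from a local `keyedvectors` module that is not shown here. The following sketch illustrates the kind of computation such a helper typically performs on either embedding matrix: combine unit-normalized positive and negative word vectors, then rank all words by cosine similarity. The name `cosine_top_n` and the toy matrix are assumptions for illustration only.

```python
import numpy as np


def cosine_top_n(matrix, index2word, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positive - sum of negative)."""
    word2index = {word: i for i, word in enumerate(index2word)}
    # unit-normalize rows so that dot products are cosine similarities
    normed = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    query = np.zeros(matrix.shape[1])
    for word in positive:
        query += normed[word2index[word]]
    for word in negative:
        query -= normed[word2index[word]]
    query /= np.linalg.norm(query)

    sims = normed @ query
    exclude = set(positive) | set(negative)
    ranked = [(index2word[i], float(sims[i])) for i in np.argsort(-sims)
              if index2word[i] not in exclude]
    return ranked[:topn]


# hypothetical 2-dimensional embeddings for three words
toy_words = ['senate', 'congress', 'football']
toy_matrix = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
result = cosine_top_n(toy_matrix, toy_words, positive=['senate'], topn=2)
print(result)
```

Passing the IN matrix or the OUT matrix as `matrix` is the only difference between the two similarity checks.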

In [21]:
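def get_embedding(word, out=False):
    """Return the IN (input) embedding of a word, or its OUT (output) embedding
    when out=True; out-of-vocabulary words map to a zero vector."""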
    if word in model.wv.vocab:
        if out:
            return model.trainables.syn1neg[model.wv.vocab[word].index]
        else:
            return model.wv.word_vec(word)
    else:
        return np.zeros(size)
    
    
get_embedding('texas')

array([-0.8053016 , -0.20635933,  0.26323155, -1.3916781 , -1.6074103 ,
       -0.6458612 ,  3.6251957 ,  1.924907  ,  1.7238611 , -3.0091443 ,
       -3.644323  , -2.2989883 ,  0.4058073 , -1.8312562 ,  1.2772194 ,
        0.99042547,  1.1072534 ,  1.4322779 ,  1.4491457 , -0.0270129 ,
       -0.32702562,  2.3497567 ,  0.7401319 , -1.6613798 , -1.8708793 ,
       -1.6488732 ,  0.13329391, -2.4221842 ,  0.81547606,  0.26860705,
       -1.8837483 , -2.0518198 ,  0.18068574, -3.856494  , -1.8311516 ,
       -0.65552324, -0.8590028 ,  1.8917407 , -1.0743737 , -2.530846  ,
       -1.7569964 ,  0.01815351,  0.43278348, -1.7510601 ,  0.83483505,
        0.91933   ,  1.4509525 , -4.130169  , -1.6613231 , -2.3731368 ,
        4.643101  , -4.2266436 , -1.0283322 , -0.580952  ,  2.476184  ,
        0.10395367, -2.7749097 ,  4.268075  ,  0.926306  ,  2.1863344 ,
        2.3218188 , -1.3676759 , -1.640588  ,  2.7934365 ,  1.7173072 ,
       -0.02814626,  1.0236717 , -0.61047035, -0.8806903 , -0.30

In [22]:
query_in = np.array([get_embedding(word) for word in query_words]).mean(axis=0)
print(query_in.shape)
query_in

(100,)


array([ 1.80335259e+00, -1.25592148e+00,  1.07672417e+00,  1.48030376e+00,
       -8.97015214e-01, -1.88996458e+00, -1.36558509e+00, -9.24432278e-02,
       -1.36952734e+00, -8.14530969e-01, -1.71751094e+00, -7.22950578e-01,
       -2.20678568e-01, -2.23531055e+00,  1.62976861e-01,  4.88344252e-01,
        3.09180975e+00,  1.17521095e+00, -2.75864124e-01,  2.44087338e+00,
        8.79394174e-01,  1.46664238e+00, -1.29474211e+00, -9.56055820e-01,
        1.84187025e-01,  1.10502258e-01, -4.38267924e-02,  5.14714003e-01,
        1.25214052e+00, -1.80455238e-01, -7.30888844e-02,  1.20776606e+00,
       -6.28649056e-01,  1.19478035e+00,  4.53989059e-01, -1.55133533e+00,
       -2.66639829e+00, -9.99739766e-01,  1.26152921e+00,  9.12181854e-01,
       -3.48918986e+00,  3.42816949e+00, -6.37068570e-01, -1.85933501e-01,
       -5.15830040e-01, -3.06389034e-01, -1.47171807e+00,  9.96897876e-01,
       -5.54930687e-01,  9.95828286e-02, -4.64269459e-01, -1.17028028e-01,
       -5.70893943e-01,  

In [23]:
from tqdm import tqdm

start = time.time()

sentences = MySentences(TOKENS_PATH, sent_tokenize=False, yield_file_path=True)

doc_path = []
centroid_in = np.zeros((len(sentences), size), dtype=np.float32)
centroid_out = np.zeros_like(centroid_in, dtype=np.float32)

for idx, (file_path, sentence) in tqdm(enumerate(sentences)):
    # compute the central embedding representation for each document
    centroid_in[idx] = np.mean([get_embedding(word) for word in sentence], axis=0)
    centroid_out[idx] = np.mean([get_embedding(word, out=True) for word in sentence], axis=0)
    doc_path.append(file_path)

elapsed = time.time() - start
print('elapsed: ', elapsed)

142462it [06:00, 401.97it/s]

elapsed:  360.60191106796265
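
The per-word Python loop dominates the runtime of the centroid computation. One possible alternative, sketched below, looks up all in-vocabulary rows at once with NumPy indexing. `doc_centroid`, `toy_index` and `toy_vectors` are hypothetical names. Unlike `get_embedding`, this version skips out-of-vocabulary words instead of averaging in zero vectors, which changes the centroid's length but not its direction.

```python
import numpy as np


def doc_centroid(tokens, word2index, vectors):
    """Mean vector of the in-vocabulary tokens; zeros when none are known."""
    rows = [word2index[token] for token in tokens if token in word2index]
    if not rows:
        return np.zeros(vectors.shape[1], dtype=vectors.dtype)
    # fancy indexing gathers all rows in one call, then averages them
    return vectors[rows].mean(axis=0)


# hypothetical 2-dimensional embeddings for a three-word vocabulary
toy_index = {'economy': 0, 'health': 1, 'senate': 2}
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)

centroid = doc_centroid(['economy', 'health', 'unknownword'], toy_index, toy_vectors)
empty = doc_centroid(['unknownword'], toy_index, toy_vectors)
print(centroid, empty)
```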





In [24]:
from sklearn.preprocessing import normalize
from sklearn.neighbors import NearestNeighbors

topn = 5
n_jobs = -1
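# with L2-normalized rows, euclidean distance ranks neighbors
# in the same order as cosine similarity
normed_factors = normalize(centroid_in)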
knn = NearestNeighbors(
    n_neighbors=topn + 1, metric='euclidean', n_jobs=n_jobs)
knn.fit(normed_factors)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='euclidean',
         metric_params=None, n_jobs=-1, n_neighbors=6, p=2, radius=1.0)

In [25]:
distances, indices = knn.kneighbors(query_in.reshape(1, -1))
indices

array([[ 51876,  95005, 141087,  29223,  76819, 110946]])
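
The index is built on L2-normalized centroids while the query vector is passed in as is. For ranking purposes this still behaves like cosine similarity: for a unit-length document vector d and any query q, ||d - q||^2 = 1 + ||q||^2 - 2 ||q|| cos(d, q), and ||q|| is the same for every document. Only the reported distances depend on the query's length. The sketch below checks this on random data, which is an assumption for illustration and not the notebook's embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 8))
query = rng.normal(size=8) * 5.0  # deliberately not unit length

# unit-normalize the documents only, as in the nearest-neighbor index above
docs_normed = docs / np.linalg.norm(docs, axis=1, keepdims=True)

euclidean = np.linalg.norm(docs_normed - query, axis=1)
cosine = docs_normed @ query / np.linalg.norm(query)

top5_euclidean = np.argsort(euclidean)[:5]  # smallest distance first
top5_cosine = np.argsort(-cosine)[:5]       # largest similarity first
print(top5_euclidean, top5_cosine)
```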

In [26]:
for index in indices.ravel():
    file_path = doc_path[index]
    with open(file_path) as f:
        content = f.readlines()
        print(content)

['robert e rubin council foreign relations u treasury secretary progress economy made since financial crisis real sense country adrift faith institutions eroding income inequality job insecurity sluggish wage growth even improved performance fraying social fabric technological development lesser extent globalization contribute productivity growth also put pressure wages jobs poverty rate unconscionably high many feel american promise hard work leading better life reach united states however still holds worlds best hand question play cards need policy regime effectively promotes growth widespread income gains greater economic security context economy undergoing transformation three objectives interdependent mutually reinforcing taken together constitute one overarching goal inclusive growth could turn restore sense common purpose confidence future social cohesion inclusive growth agenda would address three broad categories challenges first public investment second structural reform inno