# Distributed Representations of Sentences and Documents

https://arxiv.org/abs/1405.4053

http://research.google.com/pubs/pub44894.html

http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip

In [1]:
# Python 3.6.1
from typing import Dict, List, Deque, Tuple, Iterable, Callable

import collections
import functools
import gc
import math
import os
import random
import re
import requests
import shutil
import zipfile

import tensorflow as tf
import numpy as np

tf.logging.set_verbosity(tf.logging.ERROR)

tf.VERSION

'1.2.1'

## Data Preparation

**Download**

In [2]:
HOME_DIR = 'rotten_tomatoes'
DATA_DIR = os.path.join(HOME_DIR, 'data')

if not os.path.isdir(DATA_DIR):
    os.makedirs(DATA_DIR)

DATASET_URL = 'http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip'
DATASET_FILENAME = DATASET_URL.split('/')[-1]
DATASET_PACKAGE = os.path.join(DATA_DIR, DATASET_FILENAME)

package_missing = not os.path.isfile(DATASET_PACKAGE)

if package_missing:
    print('Downloading {}...'.format(DATASET_FILENAME))
    r = requests.get(DATASET_URL, stream=True)
    r.raise_for_status()  # fail early instead of saving an error page as the zip
    with open(DATASET_PACKAGE, 'wb') as f:
        for chunk in r.iter_content(chunk_size=32768):
            if chunk:
                f.write(chunk)
    print('Done!')

print('Unpacking Stanford Sentiment Treebank dataset...')

FILE_PATTERN = re.compile(r'^stanfordSentimentTreebank/.+\.txt$')

def extract(zip_file, filename, dst_path):
    print('Extracting', filename)
    dst_file = os.path.join(dst_path, os.path.basename(filename))
    # Open both handles in one `with` so the zip member is closed as well.
    with zip_file.open(filename) as fin, open(dst_file, 'wb') as fout:
        shutil.copyfileobj(fin, fout)

with zipfile.ZipFile(DATASET_PACKAGE) as f:
    files = list(name for name in f.namelist() if FILE_PATTERN.match(name))
    for filename in files:
        extract(f, filename, DATA_DIR)

Unpacking Stanford Sentiment Treebank dataset...
Extracting stanfordSentimentTreebank/datasetSentences.txt
Extracting stanfordSentimentTreebank/datasetSplit.txt
Extracting stanfordSentimentTreebank/dictionary.txt
Extracting stanfordSentimentTreebank/original_rt_snippets.txt
Extracting stanfordSentimentTreebank/README.txt
Extracting stanfordSentimentTreebank/sentiment_labels.txt
Extracting stanfordSentimentTreebank/SOStr.txt
Extracting stanfordSentimentTreebank/STree.txt


**Exploration**

In [3]:
def show(file, lines=10):
    with open(file) as f:
        for _ in range(lines):
            print(next(f).strip())

In [4]:
SENTENCES_FILE = os.path.join(DATA_DIR, 'datasetSentences.txt')
show(SENTENCES_FILE)

sentence_index	sentence
1	The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
2	The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .
3	Effective but too-tepid biopic
4	If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .
5	Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .
6	The film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .
7	Offers that rare combination of entertainment and education .
8	Perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .
9	Ste

In [5]:
PHRASES_FILE = os.path.join(DATA_DIR, 'dictionary.txt')
show(PHRASES_FILE)

!|0
! '|22935
! ''|18235
! Alas|179257
! Brilliant|22936
! Brilliant !|40532
! Brilliant ! '|22937
! C'mon|60624
! Gollum 's ` performance ' is incredible|13402
! Oh , look at that clever angle ! Wow , a jump cut !|179258


In [6]:
SENTIMENT_FILE = os.path.join(DATA_DIR, 'sentiment_labels.txt')
show(SENTIMENT_FILE)

phrase ids|sentiment values
0|0.5
1|0.5
2|0.44444
3|0.5
4|0.42708
5|0.375
6|0.41667
7|0.54167
8|0.33333


In [7]:
SPLIT_FILE = os.path.join(DATA_DIR, 'datasetSplit.txt')
show(SPLIT_FILE)

sentence_index,splitset_label
1,1
2,1
3,2
4,2
5,2
6,2
7,2
8,2
9,2


In [8]:
phrases: Dict[str, str] = dict()

with open(PHRASES_FILE) as f:
    for line in f:
        phrase_text, phrase_id = line.rstrip().split('|')
        phrases[phrase_text] = phrase_id

print('Phrases: {:,d}'.format(len(phrases)))

Phrases: 239,232


In [9]:
sentiments: Dict[str, float] = dict()

with open(SENTIMENT_FILE) as f:
    next(f) # skip header
    for line in f:
        phrase_id, sentiment_score = line.rstrip().split('|')
        sentiments[phrase_id] = float(sentiment_score)

print('Sentiments: {:,d}'.format(len(sentiments)))

Sentiments: 239,232


In [10]:
sentences: Dict[str, str] = dict()

with open(SENTENCES_FILE) as f:
    next(f) # skip header
    for line in f:
        sentence_id, sentence_text = line.rstrip().split('\t')
        sentences[sentence_id] = sentence_text

print('Sentences: {:,d}'.format(len(sentences)))

Sentences: 11,855


In [11]:
n = 0
for sentence_text in sentences.values():
    if sentence_text not in phrases:
        if n < 10:
            print('(missing sentence)\n\n{}\n'.format(sentence_text))
        n += 1

print('Total missing: {:,d}'.format(n))

(missing sentence)

But in Imax 3-D , the clichÃ©s disappear into the vertiginous perspectives opened up by the photography .

(missing sentence)

-LRB- But it 's -RRB- worth recommending because of two marvelous performances by Michael Caine and Brendan Fraser .

(missing sentence)

JirÃ­ Hubac 's script is a gem .

(missing sentence)

You would n't call The Good Girl a date movie -LRB- an anti-date movie is more like it -RRB- , but when it 's good , it 's good and horrid .

(missing sentence)

An incendiary , deeply thought-provoking look at one of the most peculiar -LRB- and peculiarly venomous -RRB- bigotries in our increasingly frightening theocracy

(missing sentence)

MÃ¼nch 's genuine insight makes the film 's occasional overindulgence forgivable .

(missing sentence)

I enjoyed the ride -LRB- bumps and all -RRB- , creamy depth , and ultimate theme .

(missing sentence)

As a randy film about sexy people in gorgeous places being pushed and pulled -LRB- literally and figurativel

In [12]:
sentence_replace = {
    '-LRB-': '(',
    '-RRB-': ')',
    'Ã¡': 'á',
    'Ã ': 'à',
    'Ã¢': 'â',
    'Ã£': 'ã',
    'Ã©': 'é',
    'Ã¨': 'è',
    'Ã­': 'í',
    'Ã¯': 'ï',
    'Ã³': 'ó',
    'Ã´': 'ô',
    'Ã¶': 'ö',
    'Ã»': 'û',
    'Ã¼': 'ü',
    'Ã¦': 'æ',
    'Ã§': 'ç',
    'Ã±': 'ñ',
    '2Â': '2',
    '8Â': '8',    
}

def text_fix(txt):
    for k, v in sentence_replace.items():
        if k in txt:
            txt = txt.replace(k, v)
    return txt

sentences: Dict[str, str] = dict()

with open(SENTENCES_FILE) as f:
    next(f) # skip header
    for line in f:
        sentence_id, sentence_text = line.rstrip().split('\t')
        sentences[sentence_id] = text_fix(sentence_text)

print('Sentences: {:,d}'.format(len(sentences)))

Sentences: 11,855


In [13]:
n = 0
for sentence_text in sentences.values():
    if sentence_text not in phrases:
        if n < 10:
            print('(missing sentence)\n\n{}\n'.format(sentence_text))
        n += 1

print('Total missing: {:,d}'.format(n))

Total missing: 0
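
The replacement table in `sentence_replace` lists the damaged sequences one by one. As a hedged alternative, damage of this kind (UTF-8 bytes that were decoded as Latin-1) can often be reversed generically by re-encoding and decoding again. This is only a sketch and not what the notebook does: `demojibake` is a hypothetical helper, it assumes Latin-1 was the codec involved, and it leaves the `-LRB-`/`-RRB-` bracket tokens untouched.

```python
def demojibake(text: str) -> str:
    """Reverse UTF-8-read-as-Latin-1 damage; return the input unchanged on failure."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # assume the text was not damaged in this particular way.
        return text


broken = 'clich\u00c3\u00a9s'  # the damaged form of 'clich\u00e9s' seen in the file
assert demojibake(broken) == 'clich\u00e9s'
assert demojibake('plain ascii') == 'plain ascii'
```

An explicit table, as used in the notebook, is more predictable when a file mixes damaged and intact text; the round trip above is mainly useful for checking that the table is complete.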


In [14]:
for i, s in enumerate(sentences.items()):
    if i == 10:
        break
    print('{}\n\n{}\n'.format(*s))

1

The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .

2

The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .

3

Effective but too-tepid biopic

4

If you sometimes like to go to the movies to have fun , Wasabi is a good place to start .

5

Emerges as something rare , an issue movie that 's so honest and keenly observed that it does n't feel like one .

6

The film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .

7

Offers that rare combination of entertainment and education .

8

Perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .

9

Steers tur

In [15]:
sentence_to_phrase: Dict[str, str] = dict(
    (sentence_id, phrases[sentence_text])
    for sentence_id, sentence_text in sentences.items())

print('Sentences to phrase: {:,d}'.format(len(sentence_to_phrase)))

Sentences to phrase: 11,855


In [16]:
train_sentiment: List[Tuple[str, float]] = list()
valid_sentiment: List[Tuple[str, float]] = list()
test_sentiment: List[Tuple[str, float]] = list()

splits = {
    '1': train_sentiment,
    '2': test_sentiment,
    '3': valid_sentiment,
}

with open(SPLIT_FILE) as f:
    next(f) # skip header
    for line in f:
        sentence_id, split = line.rstrip().split(',')
        phrase_id = sentence_to_phrase[sentence_id]
        sentiment_score = sentiments[phrase_id]
        splits[split].append((phrase_id, sentiment_score))

print('Train sentences: {:,d}'.format(len(train_sentiment)))
print('Valid sentences: {:,d}'.format(len(valid_sentiment)))
print('Test sentences: {:,d}'.format(len(test_sentiment)))

Train sentences: 8,544
Valid sentences: 1,101
Test sentences: 2,210


In [17]:
train_sentiment[:10]

[('226166', 0.69444),
 ('226300', 0.83333),
 ('225801', 0.625),
 ('14646', 0.5),
 ('14644', 0.72222),
 ('227114', 0.83333),
 ('224508', 0.875),
 ('225402', 0.72222),
 ('228134', 0.83333),
 ('227472', 0.73611)]

**Vocabulary**

In [18]:
all_phrases = True
min_word_freq = 5
# all_phrases = False
# min_word_freq = 2

In [19]:
sentence_phrases = set(sentence_to_phrase.values())
raw_text = list((phrase_ref, phrase_text)
               for phrase_text, phrase_ref in phrases.items()
               if all_phrases or phrase_ref in sentence_phrases)

raw_text[:10]

[('0', '!'),
 ('22935', "! '"),
 ('18235', "! ''"),
 ('179257', '! Alas'),
 ('22936', '! Brilliant'),
 ('40532', '! Brilliant !'),
 ('22937', "! Brilliant ! '"),
 ('60624', "! C'mon"),
 ('13402', "! Gollum 's ` performance ' is incredible"),
 ('179258', '! Oh , look at that clever angle ! Wow , a jump cut !')]

In [20]:
def transform_tokenize(data: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    return list((ref, text.lower().split()) for ref, text in data)

text_tokens = transform_tokenize(raw_text)

print('Text (tokenized): {:,d}'.format(len(text_tokens)))

Text (tokenized): 239,232


In [21]:
text_length = list(len(tokens) for _, tokens in text_tokens)
text_length = sorted(text_length)

print('First 10 lengths\n\n{}\n'.format(text_length[:10]))
print('Last 10 lengths\n\n{}\n'.format(text_length[-10:]))

l_min = np.amin(text_length)
l_max = np.amax(text_length)
l_mean = np.mean(text_length)
l_stdev = np.std(text_length)
l_50 = np.median(text_length)
l_25 = np.percentile(text_length, 25)
l_75 = np.percentile(text_length, 75)

print('Statistics\n')
print('Min: {:,d}'.format(l_min))
print('Max: {:,d}'.format(l_max))
print('Mean: {:,.1f}'.format(l_mean))
print('Stdev: {:,.1f}'.format(l_stdev))
print('25%: {:,.1f}'.format(l_25))
print('50%: {:,.1f}'.format(l_50))
print('75%: {:,.1f}'.format(l_75), '\n')

l_hist = collections.Counter(text_length).most_common()

print('Most common:\n')
for length, freq in l_hist[:10]:
    print('{}: {:,d}'.format(length, freq))
print('\nLeast common:\n')
for length, freq in l_hist[-10:]:
    print('{}: {:,d}'.format(length, freq))
print('\nLength <= 10:\n')
for length, freq in (c for c in l_hist if c[0] <= 10):
    print('{}: {:,d}'.format(length, freq))
print()

First 10 lengths

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Last 10 lengths

[51, 51, 51, 51, 51, 52, 52, 52, 55, 56]

Statistics

Min: 1
Max: 56
Mean: 7.8
Stdev: 7.5
25%: 2.0
50%: 5.0
75%: 10.0 

Most common:

2: 37,489
3: 30,949
1: 22,346
4: 21,403
5: 16,711
6: 13,739
7: 11,518
8: 9,791
9: 8,314
10: 7,291

Least common:

45: 68
46: 42
47: 30
48: 26
49: 13
50: 8
51: 7
52: 3
55: 1
56: 1

Length <= 10:

2: 37,489
3: 30,949
1: 22,346
4: 21,403
5: 16,711
6: 13,739
7: 11,518
8: 9,791
9: 8,314
10: 7,291



In [22]:
def transform_flat(data: List[Tuple[str, List[str]]]) -> List[str]:
    return list(token for _, tokens in data for token in tokens)

tokens = transform_flat(text_tokens)

print('Tokens (total): {:,d}'.format(len(tokens)))

Tokens (total): 1,855,983


In [23]:
def transform_freq(data: List[str]) -> List[Tuple[str, int]]:
    return collections.Counter(data).most_common()

tokens_freq = transform_freq(tokens)

print('Tokens (unique): {:,d}'.format(len(tokens_freq)))

Tokens (unique): 19,795


In [24]:
print('Most common:\n')
for token, freq in tokens_freq[:20]:
    print('{} ({:,d})'.format(token, freq))
print('\nLeast common:\n')
for token, freq in tokens_freq[-20:]:
    print('{} ({:,d})'.format(token, freq))

Most common:

the (83,351)
, (70,577)
a (58,742)
and (51,804)
of (51,771)
. (38,004)
to (36,937)
's (28,200)
is (23,073)
in (22,602)
it (20,930)
that (20,057)
as (14,224)
with (12,573)
for (12,080)
its (11,473)
film (10,518)
an (10,262)
this (10,120)
movie (9,688)

Least common:

ryosuke (2)
schnieder (2)
sensitively (2)
snoots (2)
spectators (2)
spiderman (2)
symbolically (2)
theirs (2)
topkapi (2)
touché (2)
two-bit (2)
ub (2)
unflinchingly (2)
unintelligible (2)
unspools (2)
unsurprisingly (2)
vereté (2)
ou (2)
overburdened (2)
unk (1)


In [25]:
tokens_vocab = list(token for token, freq in tokens_freq if freq >= min_word_freq)

print('Tokens (unique): {:,d}'.format(len(tokens_freq)))
print('Tokens {}+: {:,d}'.format(min_word_freq, len(tokens_vocab)))

Tokens (unique): 19,795
Tokens 5+: 19,212


In [26]:
vocabulary_size = len(tokens_vocab) + 2

NULL_ID = 0
UNK_ID = 1
token_to_id: Dict[str, int] = dict((token, token_id)
                                   for token_id, token in enumerate(tokens_vocab, 2))
token_to_id['NULL'] = NULL_ID
token_to_id['UNK'] = UNK_ID

token_from_id: Dict[int, str] = dict((token_id, token)
                                     for token, token_id in token_to_id.items())

print('Vocabulary size: {:,d}'.format(vocabulary_size))
print('Tokens (to id): {:,d}'.format(len(token_to_id)))
print('Tokens (from id): {:,d}'.format(len(token_from_id)))

Vocabulary size: 19,214
Tokens (to id): 19,214
Tokens (from id): 19,214
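
The vocabulary logic can be tried on a toy corpus. The sketch below mirrors the steps of the cells above (count tokens, drop rare ones, reserve ids 0 and 1 for `NULL` and `UNK`); `build_vocab` and `encode` are hypothetical helper names, not part of the notebook.

```python
import collections
from typing import Dict, List

NULL_ID, UNK_ID = 0, 1  # reserved ids, same convention as the notebook


def build_vocab(tokens: List[str], min_freq: int) -> Dict[str, int]:
    """Map every token seen at least min_freq times to an id starting at 2."""
    counts = collections.Counter(tokens)
    kept = [tok for tok, freq in counts.most_common() if freq >= min_freq]
    return {tok: tok_id for tok_id, tok in enumerate(kept, 2)}


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace tokens by ids; anything outside the vocabulary becomes UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]


toy = 'the film is good . the film is long . the end'.split()
vocab = build_vocab(toy, min_freq=2)          # keeps 'the', 'film', 'is', '.'
ids = encode('the film is bad .'.split(), vocab)
assert ids[3] == UNK_ID                       # 'bad' was never seen
```

Rare words such as `good` and `long` fall below `min_freq` and share the single `UNK` id, which is why `vocabulary_size` is the number of kept tokens plus 2.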


In [27]:
collection_size = len(raw_text)

document_to_id: Dict[str, int] = dict((doc_ref, doc_id) for doc_id, (doc_ref, _) in enumerate(raw_text))
document_from_id: Dict[int, str] = dict((doc_id, doc_ref) for doc_ref, doc_id in document_to_id.items())

print('Collection size: {:,d}'.format(collection_size))
print('Documents (to id): {:,d}'.format(len(document_to_id)))
print('Documents (from id): {:,d}'.format(len(document_from_id)))

Collection size: 239,232
Documents (to id): 239,232
Documents (from id): 239,232


In [28]:
VOCABULARY_FILE = os.path.join(HOME_DIR, 'vocabulary.txt')

with open(VOCABULARY_FILE, 'w') as f:
    for token_id in range(len(token_from_id)):
        f.write(token_from_id[token_id] + '\n')

print('Vocabulary file size: {:,d} bytes'.format(os.stat(VOCABULARY_FILE).st_size))

Vocabulary file size: 170,052 bytes


In [29]:
DOCUMENTS_FILE = os.path.join(HOME_DIR, 'documents.txt')

with open(DOCUMENTS_FILE, 'w') as f:
    for doc_id in range(len(document_from_id)):
        f.write(document_from_id[doc_id] + '\n')

print('Documents file size: {:,d} bytes'.format(os.stat(DOCUMENTS_FILE).st_size))

Documents file size: 1,563,514 bytes


**Transformation**

In [30]:
def transform_text(data: List[Tuple[str, List[str]]],
                   key_to_id: Dict[str, int],
                   value_to_id: Dict[str, int],
                   unk_id: int) \
    -> List[Tuple[int, int]]:
    return list((key_to_id[key], value_to_id.get(value_, unk_id))
                for key, value in data
                for value_ in value)

data = transform_text(text_tokens, document_to_id, token_to_id, UNK_ID)

print('Tokens (total):\n\n{:,d}\n'.format(len(data)))
print('Text (IDs):\n\n{}\n'.format(data[:10]))
print('Text (Tokens):\n\n{}'.format(list((document_from_id[doc_id], token_from_id[token_id])
                                         for doc_id, token_id in data[:10])))

Tokens (total):

1,855,983

Text (IDs):

[(0, 255), (1, 255), (1, 44), (2, 255), (2, 27), (3, 255), (3, 2796), (4, 255), (4, 653), (5, 255)]

Text (Tokens):

[('0', '!'), ('22935', '!'), ('22935', "'"), ('18235', '!'), ('18235', "''"), ('179257', '!'), ('179257', 'alas'), ('22936', '!'), ('22936', 'brilliant'), ('40532', '!')]


In [31]:
def transform_sentiment(data: List[Tuple[str, float]],
                        key_to_id: Dict[str, int],
                        threshold=0.5) \
    -> List[Tuple[int, int]]:
    return list((key_to_id[doc_ref], int(score > threshold))
                 for doc_ref, score in data)

train_data = transform_sentiment(train_sentiment, document_to_id)
valid_data = transform_sentiment(valid_sentiment, document_to_id)
test_data = transform_sentiment(test_sentiment, document_to_id)

print('Train data size: {:,d}'.format(len(train_data)))
print('Valid data size: {:,d}'.format(len(valid_data)))
print('Test data size: {:,d}'.format(len(test_data)))

Train data size: 8,544
Valid data size: 1,101
Test data size: 2,210


In [32]:
train_data[:10]

[(50444, 1),
 (52284, 1),
 (47441, 1),
 (60951, 0),
 (60905, 1),
 (59623, 1),
 (37093, 1),
 (43737, 1),
 (69284, 1),
 (61884, 1)]

In [33]:
collections.Counter(x[1] for x in train_data).most_common()

[(1, 4300), (0, 4244)]

In [34]:
collections.Counter(x[1] for x in valid_data).most_common()

[(0, 558), (1, 543)]

In [35]:
collections.Counter(x[1] for x in test_data).most_common()

[(0, 1143), (1, 1067)]
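
Because the binary label is `int(score > threshold)`, a phrase scored exactly 0.5 ends up in class 0. The dataset README also describes five classes obtained with the cut-offs [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]. The sketch below contrasts the two mappings; `binary_class` and `fine_class` are hypothetical names, and the notebook itself only uses the binary rule.

```python
import bisect

CUTOFFS = [0.2, 0.4, 0.6, 0.8]  # upper bounds of the first four classes


def binary_class(score: float, threshold: float = 0.5) -> int:
    """Same rule as transform_sentiment: 1 only if strictly above threshold."""
    return int(score > threshold)


def fine_class(score: float) -> int:
    """0..4 = very negative, negative, neutral, positive, very positive."""
    # bisect_left keeps a score equal to a cut-off in the lower class,
    # which matches the right-closed intervals (0.2, 0.4], (0.4, 0.6], ...
    return bisect.bisect_left(CUTOFFS, score)


for score in (0.5, 0.69444, 0.83333):
    print(score, binary_class(score), fine_class(score))
```

A neutral score of 0.5 maps to binary class 0 but to fine-grained class 2, so the binary labels used here count neutral sentences as negative or positive rather than excluding them.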

In [36]:
del phrases, sentiments, sentences, sentence_to_phrase, sentence_phrases
del raw_text, text_tokens, text_length, l_hist, tokens, tokens_freq, tokens_vocab
del train_sentiment, valid_sentiment, test_sentiment
gc.collect()

0

Objects kept after data preparation:

* **`token_to_id: Dict[str, int]`** - token text to index
* **`token_from_id: Dict[int, str]`** - index to token text
* **`document_to_id: Dict[str, int]`** - phrase id (reference) to index
* **`document_from_id: Dict[int, str]`** - index to phrase id (reference)
* **`data: List[Tuple[int, int]]`** - list of tuples (phrase index, token index)
* **`train_data: List[Tuple[int, int]]`** - list of tuples (phrase index, sentiment class)
* **`valid_data: List[Tuple[int, int]]`** - list of tuples (phrase index, sentiment class)
* **`test_data: List[Tuple[int, int]]`** - list of tuples (phrase index, sentiment class)

## Distributed Memory (DM)

**Input**

In [37]:
def count_windows(data: List[Tuple[int, int]],
                  window_size: int) -> int:
    doc_length = collections.Counter(doc_id for doc_id, _ in data).values()
    windows_per_doc = (1 + max(0, length - window_size)
                        for length in doc_length)
    return sum(windows_per_doc)

assert count_windows([(0, 1)], 4) == 1
assert count_windows([(0, 1), (1, 1)], 4) == 1 + 1
assert count_windows([(0, 1), (1, 1), (1, 2), (1, 3), (1, 4)], 4) == 1 + 1
assert count_windows([(0, 1), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5)], 4) == 1 + 1 + 1
assert count_windows([(0, 1), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)], 4) == 1 + 1 + 2

num_examples_4 = count_windows(data, 4)
num_batches_64 = math.ceil(num_examples_4 / 64)
last_batch_64 = num_examples_4 % 64

print('Examples (window_size=4): {:,d}'.format(num_examples_4))
print('Batches (batch_size=64): {:,d}'.format(num_batches_64))
print('Last (batch_size=64): {:,d}'.format(last_batch_64))

Examples (window_size=4): 1,311,252
Batches (batch_size=64): 20,489
Last (batch_size=64): 20
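
The batch arithmetic can be checked in isolation. The sketch below uses integer ceiling division and also handles the case where the example count divides evenly, in which a plain remainder would report 0 instead of a full batch; `batch_plan` is a hypothetical helper.

```python
from typing import Tuple


def batch_plan(num_examples: int, batch_size: int) -> Tuple[int, int]:
    """Return (number of batches, size of the last batch)."""
    if num_examples <= 0:
        return 0, 0
    num_batches = (num_examples + batch_size - 1) // batch_size  # ceiling division
    last_batch = num_examples - (num_batches - 1) * batch_size
    return num_batches, last_batch


print(batch_plan(1311252, 64))  # (20489, 20), the figures printed above
print(batch_plan(128, 64))      # (2, 64): the last batch is full, not empty
```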


In [38]:
def slice_document(data: Deque[Tuple[int, int]],
                   window_size: int,
                   pad_value=NULL_ID) \
    -> Tuple[int, Deque[int], Deque[int]]:

    doc_id, token_id = data.popleft()
    window = collections.deque(maxlen=window_size)
    window.append(token_id)
    tail = collections.deque()
    while data and data[0][0] == doc_id:
        _, token_id = data.popleft()
        if len(window) < window_size:
            window.append(token_id)
        else:
            tail.append(token_id)
    pad_size = window_size - len(window) 
    if pad_size > 0:
        window.extendleft([pad_value] * pad_size)
    return doc_id, window, tail

fake_data = collections.deque([(0, 1),
                               (1, 1), (1, 2),
                               (2, 1), (2, 2), (2, 3), (2, 4),
                               (3, 1), (3, 2), (3, 3), (3, 4), (3, 5)])
assert slice_document(fake_data, 4) == (0, collections.deque([0, 0, 0, 1]), collections.deque())
assert slice_document(fake_data, 4) == (1, collections.deque([0, 0, 1, 2]), collections.deque())
assert slice_document(fake_data, 4) == (2, collections.deque([1, 2, 3, 4]), collections.deque())
assert slice_document(fake_data, 4) == (3, collections.deque([1, 2, 3, 4]), collections.deque([5]))

In [39]:
def examples_generator_dm(data: List[Tuple[int, int]],
                          window_size: int) \
    -> Iterable[Tuple[int, List[int], int]]:
    
    num_examples = count_windows(data, window_size)
    data_tail = collections.deque(data)
    doc_id, window, tail = None, None, None

    for _ in range(num_examples):
        if not tail:
            doc_id, window, tail = slice_document(data_tail, window_size)
        else:
            window.append(tail.popleft())
        _window = list(window)
        yield doc_id, _window[:-1], _window[-1]

assert len(list(examples_generator_dm(data, 4))) == count_windows(data, 4)
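
For comparison, the same windows can be produced without deques by grouping the `(doc_id, token_id)` pairs per document. This is a simplified sketch under the same assumption as the notebook (pairs of one document are contiguous); `windows_dm` is a hypothetical name.

```python
import itertools
from typing import Iterator, List, Tuple

NULL_ID = 0  # padding id, same convention as the notebook


def windows_dm(pairs: List[Tuple[int, int]],
               window_size: int) -> Iterator[Tuple[int, List[int], int]]:
    """Yield (doc_id, context token ids, target token id) per sliding window."""
    for doc_id, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        tokens = [token_id for _, token_id in group]
        # Left-pad short documents so that each yields exactly one window.
        tokens = [NULL_ID] * max(0, window_size - len(tokens)) + tokens
        for start in range(len(tokens) - window_size + 1):
            window = tokens[start:start + window_size]
            yield doc_id, window[:-1], window[-1]


pairs = [(0, 1), (1, 1), (1, 2), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5)]
for example in windows_dm(pairs, 4):
    print(example)
```

Each document contributes `1 + max(0, length - window_size)` examples, which is the quantity `count_windows` sums over all documents.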

In [40]:
def input_dm(data: List[Tuple[int, int]],
             batch_size: int,
             window_size: int,
             shuffle=True) \
    -> Iterable[Tuple[np.ndarray, np.ndarray, np.ndarray]]:
    
    examples = list(examples_generator_dm(data, window_size))
    if shuffle:
        random.shuffle(examples)
    
    num_examples = len(examples)
    while num_examples > 0:
        batch_size_i = min(batch_size, num_examples)
        
        doc_batch = np.empty(shape=(batch_size_i, 1), dtype=np.int32)
        words_batch = \
            np.empty(shape=(batch_size_i, window_size-1), dtype=np.int32)
        target_batch = np.empty(shape=(batch_size_i, 1), dtype=np.int32)
        
        for i in range(batch_size_i):
            doc_id, words, target = examples.pop()
            doc_batch[i, 0] = doc_id
            words_batch[i, :] = words
            target_batch[i, 0] = target
        
        num_examples -= batch_size_i
        yield doc_batch, words_batch, target_batch

In [41]:
batch_size = 64
window_size = 4

n = 0
for _ in input_dm(data, batch_size, window_size):
    n += 1
print('Epoch steps: {:,d}'.format(n))

Epoch steps: 20,489


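Example (two shuffled batches with `batch_size=4` and `window_size=3`; each line reads `phrase id: context words -> target word`):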

In [42]:
batch_size = 4
window_size = 3
num_iters = 2

data_iter = input_dm(data, batch_size, window_size)

for k in range(1, num_iters+1):
    print('Batch {}\n'.format(k))
    doc_batch, words_batch, target_batch = next(data_iter)
    for i in range(batch_size):
        doc_ref = document_from_id[doc_batch[i, 0]]
        words = ' '.join(token_from_id[token_id]
                         for token_id in words_batch[i])
        target = token_from_id[target_batch[i, 0]]
        print('{}: {} -> {}'.format(doc_ref, words, target))
    print()

del data_iter

Batch 1

165942: paint the -> castro
152210: metaphor for -> the
199649: -- a -> dearth
115802: everyday lives -> of

Batch 2

198730: to the -> soggy
53466: far from -> painful
11911: just how -> ridiculous
63847: effective portrait -> of



**Model**

Model building, step by step on toy-sized random inputs:

In [43]:
graph = tf.Graph()
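# NOTE: a bare as_default() call only returns a context manager and has no
# effect; InteractiveSession(graph=graph) below installs the graph as default.
graph.as_default()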
session = tf.InteractiveSession(graph=graph)
session

<tensorflow.python.client.session.InteractiveSession at 0x7f9d9fcd4588>

In [44]:
batch_size = 4
window_size = 3
collection_size = 5
vocabulary_size = 20
embedding_size = 3
num_sampled = 2

In [45]:
X_doc = tf.constant(np.random.randint(low=0,
                                      high=collection_size,
                                      size=(batch_size, 1),
                                      dtype=np.int32))

print(X_doc, '\n')
print(X_doc.eval())

Tensor("Const:0", shape=(4, 1), dtype=int32) 

[[3]
 [0]
 [2]
 [4]]


In [46]:
X_words = tf.constant(np.random.randint(low=0,
                                        high=vocabulary_size,
                                        size=(batch_size, window_size-1),
                                        dtype=np.int32))

print(X_words, '\n')
print(X_words.eval())

Tensor("Const_1:0", shape=(4, 2), dtype=int32) 

[[ 6 12]
 [13  4]
 [11  5]
 [ 3  7]]


In [47]:
y = tf.constant(np.random.randint(low=0,
                                  high=vocabulary_size,
                                  size=(batch_size, 1),
                                  dtype=np.int32))

print(y, '\n')
print(y.eval())

Tensor("Const_2:0", shape=(4, 1), dtype=int32) 

[[18]
 [11]
 [ 7]
 [17]]


In [48]:
# ~ tf.random_uniform(shape=(collection_size, embedding_size),
#                     minval=-1.0, maxval=1.0)
doc_embeddings = tf.Variable(
    2 * np.random.rand(collection_size, embedding_size) - 1, dtype=tf.float32)

doc_embeddings.initializer.run()

print(doc_embeddings, '\n')
print(doc_embeddings.eval())

<tf.Variable 'Variable:0' shape=(5, 3) dtype=float32_ref> 

[[ 0.61695331 -0.83561099  0.10422225]
 [ 0.01217361  0.48313096 -0.78804368]
 [-0.88473046 -0.86139524  0.17132476]
 [-0.99765617  0.31212142 -0.67784834]
 [-0.98943394 -0.74844962 -0.75118273]]


In [49]:
NULL = tf.zeros(shape=(1, embedding_size))

print(NULL, '\n')
print(NULL.eval())

Tensor("zeros:0", shape=(1, 3), dtype=float32) 

[[ 0.  0.  0.]]


In [50]:
# ~ tf.random_uniform(shape=(vocabulary_size - 1, embedding_size),
#                     minval=-1.0, maxval=1.0)
word_embeddings_ = tf.Variable(
    2 * np.random.rand(vocabulary_size - 1, embedding_size) - 1, dtype=tf.float32)

word_embeddings_.initializer.run()

print(word_embeddings_, '\n')
print(word_embeddings_.eval())

<tf.Variable 'Variable_1:0' shape=(19, 3) dtype=float32_ref> 

[[ 0.34873706  0.60267746 -0.51587909]
 [ 0.33118272 -0.82778776 -0.55520707]
 [ 0.4912844   0.34662133  0.85085076]
 [-0.835271    0.17302893 -0.18058257]
 [-0.37621927 -0.85698396 -0.62743801]
 [-0.75230187  0.13368917  0.99833119]
 [-0.33602506  0.35631764 -0.23668386]
 [-0.57253391 -0.05880897  0.78443813]
 [ 0.72282249 -0.38552123 -0.79930961]
 [-0.68327624  0.463929    0.84884042]
 [ 0.18314609 -0.24641821 -0.41466221]
 [-0.2665945  -0.15293165 -0.12184247]
 [ 0.29443887  0.70825493 -0.17377342]
 [-0.04638245  0.14809784  0.05762608]
 [ 0.68977821 -0.82786703  0.30497012]
 [ 0.80665559 -0.7813338   0.80110413]
 [-0.63951135 -0.83063197  0.41020012]
 [-0.85422176  0.95679355 -0.51974726]
 [ 0.32139546 -0.39768055 -0.79400396]]


In [51]:
word_embeddings = tf.concat([NULL, word_embeddings_], axis=0)

print(word_embeddings, '\n')
print(word_embeddings.eval())

Tensor("concat:0", shape=(20, 3), dtype=float32) 

[[ 0.          0.          0.        ]
 [ 0.34873706  0.60267746 -0.51587909]
 [ 0.33118272 -0.82778776 -0.55520707]
 [ 0.4912844   0.34662133  0.85085076]
 [-0.835271    0.17302893 -0.18058257]
 [-0.37621927 -0.85698396 -0.62743801]
 [-0.75230187  0.13368917  0.99833119]
 [-0.33602506  0.35631764 -0.23668386]
 [-0.57253391 -0.05880897  0.78443813]
 [ 0.72282249 -0.38552123 -0.79930961]
 [-0.68327624  0.463929    0.84884042]
 [ 0.18314609 -0.24641821 -0.41466221]
 [-0.2665945  -0.15293165 -0.12184247]
 [ 0.29443887  0.70825493 -0.17377342]
 [-0.04638245  0.14809784  0.05762608]
 [ 0.68977821 -0.82786703  0.30497012]
 [ 0.80665559 -0.7813338   0.80110413]
 [-0.63951135 -0.83063197  0.41020012]
 [-0.85422176  0.95679355 -0.51974726]
 [ 0.32139546 -0.39768055 -0.79400396]]


In [52]:
D_embed = tf.nn.embedding_lookup(doc_embeddings, X_doc)

print(D_embed, '\n')
print(D_embed.eval())

Tensor("embedding_lookup:0", shape=(4, 1, 3), dtype=float32) 

[[[-0.99765617  0.31212142 -0.67784834]]

 [[ 0.61695331 -0.83561099  0.10422225]]

 [[-0.88473046 -0.86139524  0.17132476]]

 [[-0.98943394 -0.74844962 -0.75118273]]]


In [53]:
W_embed = tf.nn.embedding_lookup(word_embeddings, X_words)

print(W_embed, '\n')
print(W_embed.eval())

Tensor("embedding_lookup_1:0", shape=(4, 2, 3), dtype=float32) 

[[[-0.75230187  0.13368917  0.99833119]
  [-0.2665945  -0.15293165 -0.12184247]]

 [[ 0.29443887  0.70825493 -0.17377342]
  [-0.835271    0.17302893 -0.18058257]]

 [[ 0.18314609 -0.24641821 -0.41466221]
  [-0.37621927 -0.85698396 -0.62743801]]

 [[ 0.4912844   0.34662133  0.85085076]
  [-0.33602506  0.35631764 -0.23668386]]]


In [54]:
X_embed = tf.concat([D_embed, W_embed], axis=1)

print(X_embed, '\n')
print(X_embed.eval())

Tensor("concat_1:0", shape=(4, 3, 3), dtype=float32) 

[[[-0.99765617  0.31212142 -0.67784834]
  [-0.75230187  0.13368917  0.99833119]
  [-0.2665945  -0.15293165 -0.12184247]]

 [[ 0.61695331 -0.83561099  0.10422225]
  [ 0.29443887  0.70825493 -0.17377342]
  [-0.835271    0.17302893 -0.18058257]]

 [[-0.88473046 -0.86139524  0.17132476]
  [ 0.18314609 -0.24641821 -0.41466221]
  [-0.37621927 -0.85698396 -0.62743801]]

 [[-0.98943394 -0.74844962 -0.75118273]
  [ 0.4912844   0.34662133  0.85085076]
  [-0.33602506  0.35631764 -0.23668386]]]


In [55]:
# concatenate
X_linear = tf.reshape(X_embed, [-1, window_size * embedding_size])

print(X_linear, '\n')
print(X_linear.eval())

Tensor("Reshape:0", shape=(4, 9), dtype=float32) 

[[-0.99765617  0.31212142 -0.67784834 -0.75230187  0.13368917  0.99833119
  -0.2665945  -0.15293165 -0.12184247]
 [ 0.61695331 -0.83561099  0.10422225  0.29443887  0.70825493 -0.17377342
  -0.835271    0.17302893 -0.18058257]
 [-0.88473046 -0.86139524  0.17132476  0.18314609 -0.24641821 -0.41466221
  -0.37621927 -0.85698396 -0.62743801]
 [-0.98943394 -0.74844962 -0.75118273  0.4912844   0.34662133  0.85085076
  -0.33602506  0.35631764 -0.23668386]]


In [56]:
# average
X_linear_ = tf.reduce_mean(X_embed, axis=1)

print(X_linear_, '\n')
print(X_linear_.eval())

Tensor("Mean:0", shape=(4, 3), dtype=float32) 

[[-0.67218417  0.09762631  0.06621346]
 [ 0.02537374  0.01522429 -0.08337792]
 [-0.35926786 -0.6549325  -0.2902585 ]
 [-0.2780582  -0.01517022 -0.04567194]]
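
The two ways of combining the document vector with the context word vectors differ mainly in the shape they hand to the output layer. The NumPy sketch below reproduces the lookup, concatenation and averaging steps with the same toy sizes; it is an illustration with its own random values, not the TensorFlow graph from the cells above.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, window_size = 4, 3
collection_size, vocabulary_size, embedding_size = 5, 20, 3

doc_embeddings = rng.uniform(-1, 1, size=(collection_size, embedding_size))
word_embeddings = rng.uniform(-1, 1, size=(vocabulary_size, embedding_size))
word_embeddings[0] = 0.0  # row 0 is the NULL padding token

X_doc = rng.randint(0, collection_size, size=(batch_size, 1))
X_words = rng.randint(0, vocabulary_size, size=(batch_size, window_size - 1))

D_embed = doc_embeddings[X_doc]       # (4, 1, 3): one document vector per example
W_embed = word_embeddings[X_words]    # (4, 2, 3): one vector per context word
X_embed = np.concatenate([D_embed, W_embed], axis=1)  # (4, 3, 3)

X_concat = X_embed.reshape(batch_size, window_size * embedding_size)  # (4, 9)
X_mean = X_embed.mean(axis=1)                                         # (4, 3)

print(X_concat.shape, X_mean.shape)
```

With concatenation the output weights need `window_size * embedding_size` columns and the position of each word is preserved; with averaging they need only `embedding_size` columns and word order inside the window is lost.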


In [57]:
# ~ tf.truncated_normal(shape=(vocabulary_size, window_size * embedding_size),
#                       stddev=1.0 / np.sqrt(window_size * embedding_size))
W_linear = tf.Variable(
    np.random.randn(vocabulary_size,
                    window_size * embedding_size) \
        / np.sqrt(window_size * embedding_size),
    dtype=tf.float32)

W_linear.initializer.run()

print(W_linear, '\n')
print(W_linear.eval())

<tf.Variable 'Variable_2:0' shape=(20, 9) dtype=float32_ref> 

[[ 0.50300145  0.0552301   0.01993387  0.03547276  0.20155744 -0.27312559
  -0.02927397 -0.86327475  0.10278616]
 [ 0.17196627 -0.21156923 -0.31027129  0.43497968  0.08492911  0.08177979
  -0.07587215  0.41648802  0.18212952]
 [ 0.44487646 -0.35043812 -0.51195747  0.00614848 -0.04894156 -0.54465908
   0.11943867  0.14217165  0.5565322 ]
 [-0.35984975  0.10350386  0.71107525  0.44209716  0.12714341  0.02990536
   0.31134015  0.18437816 -0.85236979]
 [ 0.31551623 -0.07670244 -0.14918286 -0.08804412 -0.34384787  0.51661211
  -0.18063158  0.16129385  0.32392266]
 [ 0.95518899 -0.17771925 -0.09204181 -0.54479259 -0.72058147 -0.64434087
  -0.01533921  0.10319012  0.0417698 ]
 [-0.04822987 -0.30193558  0.11717546 -0.00480292 -0.41884765  0.14872152
  -0.28904089  0.04287879  0.22426446]
 [-0.53877252  0.19082838 -0.15504326  0.18127279 -0.29014444  0.13743672
  -0.71249241 -0.04261471 -0.17864887]
 [-0.03443231 -0.39198354  0.3187

In [58]:
# ~ tf.zeros(shape=(vocabulary_size,))
b_linear = tf.Variable(np.zeros(vocabulary_size), dtype=tf.float32)

b_linear.initializer.run()

print(b_linear, '\n')
print(b_linear.eval())

<tf.Variable 'Variable_3:0' shape=(20,) dtype=float32_ref> 

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]


In [59]:
sampled_loss = tf.nn.sampled_softmax_loss(weights=W_linear,
                                          biases=b_linear,
                                          inputs=X_linear,
                                          labels=y,
                                          num_sampled=num_sampled,
                                          num_classes=vocabulary_size)

print(sampled_loss, '\n')
print(sampled_loss.eval())

Tensor("Reshape_3:0", shape=(4,), dtype=float32) 

[ 0.1228271   0.46898732  0.17658222  0.19335443]


In [60]:
loss = tf.reduce_mean(sampled_loss)

print(loss, '\n')
print(loss.eval())

Tensor("Mean_1:0", shape=(), dtype=float32) 

0.410072
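
`tf.nn.sampled_softmax_loss` estimates the softmax cross-entropy from the true class plus `num_sampled` randomly drawn negative classes, and each evaluation draws a fresh sample. That is most likely why the mean printed above differs from the mean of the four per-example values printed in the previous cell. For comparison, here is a NumPy sketch of the exact loss over all classes. It uses random stand-in values rather than the tensors above, and `full_softmax_loss` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def full_softmax_loss(weights, biases, inputs, labels):
    """Exact softmax cross-entropy over all classes, one value per example."""
    logits = inputs @ weights.T + biases                 # (batch, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels.ravel()]

rng = np.random.RandomState(0)
vocabulary_size, input_size, batch_size = 20, 9, 4

# Same layout as W_linear and b_linear: one weight row per class.
W = rng.randn(vocabulary_size, input_size) / np.sqrt(input_size)
b = np.zeros(vocabulary_size)
X = rng.uniform(-1.0, 1.0, size=(batch_size, input_size))
y = rng.randint(0, vocabulary_size, size=(batch_size, 1))

per_example = full_softmax_loss(W, b, X, y)
print(per_example.shape)
print(per_example.mean())
```

The exact loss costs one dot product per class and example, which is what the sampled variant avoids during training.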


In [61]:
session.close()
del X_doc, X_words, y, doc_embeddings, NULL, word_embeddings_, word_embeddings
del D_embed, W_embed, X_embed, X_linear, X_linear_, W_linear, b_linear
del sampled_loss, loss
del graph, session
gc.collect()

0

Model function:

In [62]:
def model_dm(collection_size: int,
             vocabulary_size: int,
             embedding_size: int,
             window_size: int,
             num_sampled: int,
             linear_input='concatenate') \
    -> Tuple[List[tf.Tensor], List[tf.Tensor], tf.Tensor]:
    
    X_doc = tf.placeholder_with_default([[0]],
                                        shape=(None, 1),
                                        name='X_doc')
    X_words = tf.placeholder_with_default([[0]*(window_size-1)],
                                          shape=(None, window_size-1),
                                          name='X_words')
    y = tf.placeholder_with_default([[0]],
                                    shape=(None, 1),
                                    name='y')

    doc_embeddings = tf.Variable(
        tf.random_uniform(shape=(collection_size, embedding_size),
                          minval=-1.0, maxval=1.0),
        name='doc_embeddings')
    NULL = tf.zeros(shape=(1, embedding_size))
    word_embeddings_ = tf.Variable(
        tf.random_uniform(shape=(vocabulary_size - 1, embedding_size),
                          minval=-1.0, maxval=1.0))
    word_embeddings = tf.concat([NULL, word_embeddings_], axis=0,
                                name='word_embeddings')

    D_embed = tf.nn.embedding_lookup(doc_embeddings, X_doc)
    W_embed = tf.nn.embedding_lookup(word_embeddings, X_words)
    X_embed = tf.concat([D_embed, W_embed], axis=1)
    
    if linear_input == 'concatenate':
        linear_input_size = window_size * embedding_size
        X_linear = tf.reshape(X_embed, [-1, linear_input_size])
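    elif linear_input == 'average':
        linear_input_size = embedding_size
        X_linear = tf.reduce_mean(X_embed, axis=1)
    else:
        # Fail early instead of hitting a NameError further down.
        raise ValueError(
            "linear_input must be 'concatenate' or 'average', got {!r}"
            .format(linear_input))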
    
    W_linear = tf.Variable(
        tf.truncated_normal(shape=(vocabulary_size, linear_input_size),
                            stddev=1.0 / np.sqrt(linear_input_size)),
        name='W')
    b_linear = tf.Variable(
        tf.zeros(shape=(vocabulary_size,)),
        name='b')

    with tf.name_scope('loss'):
        sampled_loss = tf.nn.sampled_softmax_loss(weights=W_linear,
                                                  biases=b_linear,
                                                  inputs=X_linear,
                                                  labels=y,
                                                  num_sampled=num_sampled,
                                                  num_classes=vocabulary_size)
        loss = tf.reduce_mean(sampled_loss, name='mean')


    inputs = [X_doc, X_words, y]
    embeddings = [doc_embeddings, word_embeddings]
    return inputs, embeddings, loss
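
The word embedding table in `model_dm` is built by stacking a constant zero row (`NULL`) on top of the trainable rows. Because `NULL` is a constant and not a variable, the optimizer has nothing to update for row 0, so that row stays at zero. The NumPy sketch below illustrates the lookup, assuming that word id 0 is the padding token (an assumption based on the model code, not shown in this section).

```python
import numpy as np

vocabulary_size, embedding_size = 5, 3
rng = np.random.RandomState(0)

# Row 0 is the fixed NULL vector; the remaining rows play the role
# of the trainable word_embeddings_ variable.
null_row = np.zeros((1, embedding_size))
trainable_rows = rng.uniform(-1.0, 1.0,
                             size=(vocabulary_size - 1, embedding_size))
word_embeddings = np.concatenate([null_row, trainable_rows], axis=0)

# Two context windows of two words each; id 0 is assumed to be padding.
context_ids = np.array([[0, 2],
                        [3, 0]])
looked_up = word_embeddings[context_ids]

print(looked_up.shape)  # (2, 2, 3)
print(looked_up[0, 0])  # the NULL row
```

With `linear_input='average'`, a padded position still counts in the denominator of the mean, so heavily padded windows produce smaller inputs.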

Example:

In [63]:
batch_size = 4
window_size = 3
vocabulary_size = 20
collection_size = 5
embedding_size = 3
num_sampled = 2

X_doc_batch = np.random.randint(low=0,
                                high=collection_size,
                                size=(batch_size, 1),
                                dtype=np.int32)
X_words_batch = np.random.randint(low=0,
                                  high=vocabulary_size,
                                  size=(batch_size, window_size-1),
                                  dtype=np.int32)
y_batch = np.random.randint(low=0,
                            high=vocabulary_size,
                            size=(batch_size, 1),
                            dtype=np.int32)
data_batch = (X_doc_batch, X_words_batch, y_batch)

with tf.Graph().as_default() as graph, \
    tf.Session(graph=graph) as session:

    inputs, embeddings, loss_op = \
        model_dm(collection_size,
                 vocabulary_size,
                 embedding_size,
                 window_size,
                 num_sampled)

    tf.global_variables_initializer().run()

    data_feed = dict(zip(inputs, data_batch))
    loss, doc_embeddings, word_embeddings = \
        session.run([loss_op,  *embeddings], data_feed)

    print('Average loss:\n\n{:,.3f}\n'.format(loss))
    print('Document embeddings:\n\n{}\n'.format(doc_embeddings))
    print('Word embeddings:\n\n{}\n'.format(word_embeddings))

Average loss:

1.093

Document embeddings:

[[ 0.45113659  0.77803659 -0.67613029]
 [ 0.94764209  0.39480782 -0.2601161 ]
 [-0.49436474 -0.47757936 -0.70696068]
 [ 0.42527127 -0.17255878 -0.35939431]
 [-0.98591185 -0.40841484  0.00539851]]

Word embeddings:

[[ 0.          0.          0.        ]
 [-0.84714127 -0.76014662  0.27584815]
 [ 0.59547353  0.46513414 -0.82155466]
 [ 0.19842935 -0.22104239  0.01252961]
 [ 0.94651079  0.81266451  0.6597259 ]
 [-0.30908179  0.4171741  -0.50931215]
 [-0.86100388 -0.35169339  0.74930382]
 [-0.15820646  0.06676936 -0.19249582]
 [-0.38554835  0.99897957  0.28609705]
 [-0.56098485 -0.32115078 -0.53414893]
 [ 0.76078367  0.12152863 -0.61840367]
 [-0.91746092  0.06317115 -0.78675151]
 [ 0.58433509 -0.75683093 -0.02150941]
 [ 0.93951225 -0.03575969 -0.95609236]
 [-0.05698776 -0.2371254  -0.34965897]
 [-0.19159937 -0.89720178  0.1561265 ]
 [-0.51984644  0.10438085 -0.34959316]
 [ 0.95687032 -0.54687071 -0.19117522]
 [ 0.63521814 -0.57325006  0.05669355]


## Distributed Bag-of-Words (DBOW)

**Input**

In [64]:
def examples_generator_dbow(data: List[Tuple[int, int]],
                            window_size: int) \
    -> Iterable[Tuple[int, List[int]]]:
    
    num_examples = count_windows(data, window_size)
    data_tail = collections.deque(data)
    doc_id, window, tail = None, None, None

    for _ in range(num_examples):
        if not tail:
            doc_id, window, tail = slice_document(data_tail, window_size)
        else:
            window.append(tail.popleft())
        yield doc_id, list(window)

assert len(list(examples_generator_dbow(data, 4))) == count_windows(data, 4)
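
For DBOW, each example pairs a document id with one window of consecutive tokens from that document. The self-contained sketch below shows the idea on toy data. It assumes `data` is a list of `(doc_id, token_id)` pairs grouped by document, and it skips documents shorter than the window. `sliding_windows` is a hypothetical helper and may differ from `slice_document` and `count_windows` used above, for example in how short documents are handled.

```python
import collections
import itertools
from typing import Iterator, List, Tuple

def sliding_windows(data: List[Tuple[int, int]],
                    window_size: int) -> Iterator[Tuple[int, List[int]]]:
    """Yield (doc_id, window) for every full window inside each document."""
    for doc_id, pairs in itertools.groupby(data, key=lambda pair: pair[0]):
        # A bounded deque drops the oldest token when a new one arrives.
        window = collections.deque(maxlen=window_size)
        for _, token_id in pairs:
            window.append(token_id)
            if len(window) == window_size:
                yield doc_id, list(window)

toy_data = [(0, 11), (0, 12), (0, 13), (0, 14),
            (1, 21), (1, 22), (1, 23)]

for doc_id, window in sliding_windows(toy_data, 3):
    print(doc_id, window)
# 0 [11, 12, 13]
# 0 [12, 13, 14]
# 1 [21, 22, 23]
```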

In [65]:
def input_dbow(data: List[Tuple[int, int]],
               batch_size: int,
               window_size: int,
               shuffle=True) \
    -> Iterable[Tuple[np.ndarray, np.ndarray]]:
    
    examples = list(examples_generator_dbow(data, window_size))
    if shuffle:
        random.shuffle(examples)

    num_examples = len(examples)
    while num_examples > 0:
        batch_size_i = min(batch_size, num_examples)
        
        doc_batch = np.ndarray(shape=(batch_size_i, 1), dtype=np.int32)
        target_batch = \
            np.ndarray(shape=(batch_size_i, window_size), dtype=np.int32)

        for i in range(batch_size_i):
            doc_id, words = examples.pop()
            doc_batch[i, 0] = doc_id
            target_batch[i, :] = words
        
        num_examples -= batch_size_i
        yield doc_batch, target_batch

In [66]:
batch_size = 64
window_size = 4

n = 0
for _ in input_dbow(data, batch_size, window_size):
    n += 1
print('Epoch steps: {:,d}'.format(n))

Epoch steps: 20,489


Example:

In [67]:
batch_size = 4
window_size = 3
num_iters = 2

data_iter = input_dbow(data, batch_size, window_size)

for k in range(1, num_iters+1):
    print('Batch {}\n'.format(k))
    doc_batch, target_batch = next(data_iter)
    for i in range(batch_size):
        doc_ref = document_from_id[doc_batch[i, 0]]
        target = ' '.join(token_from_id[token_id]
                          for token_id in target_batch[i])
        print('{} -> {}'.format(doc_ref, target))
    print()

del data_iter

Batch 1

220991 -> can be said
141095 -> hard to understand
112899 -> of whatever idealism
143158 -> or post-production stages

Batch 2

69597 -> sweetness , with
184949 -> like a human
182188 -> does n't galvanize
221968 -> from her dangerous



**Model**

Model building:

In [68]:
graph = tf.Graph()
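# as_default() only returns a context manager, so this call has no effect
# on its own; InteractiveSession(graph=graph) below installs the graph.
graph.as_default()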
session = tf.InteractiveSession(graph=graph)
session

<tensorflow.python.client.session.InteractiveSession at 0x7f9da3baaac8>

In [69]:
batch_size = 4
window_size = 3
collection_size = 5
vocabulary_size = 20
embedding_size = 3
num_sampled = 2

In [70]:
X = tf.constant(np.random.randint(low=0,
                                  high=collection_size,
                                  size=(batch_size, 1),
                                  dtype=np.int32))

print(X, '\n')
print(X.eval())

Tensor("Const:0", shape=(4, 1), dtype=int32) 

[[1]
 [4]
 [4]
 [0]]


In [71]:
y = tf.constant(np.random.randint(low=0,
                                  high=vocabulary_size,
                                  size=(batch_size, window_size),
                                  dtype=np.int32))

print(y, '\n')
print(y.eval())

Tensor("Const_1:0", shape=(4, 3), dtype=int32) 

[[ 0  4  6]
 [ 6  2  3]
 [ 9 11 19]
 [ 9 17 16]]


In [72]:
# ~ tf.random_uniform(shape=(collection_size, embedding_size),
#                     minval=-1.0, maxval=1.0)
doc_embeddings = tf.Variable(
    2 * np.random.rand(collection_size, embedding_size) - 1, dtype=tf.float32)

doc_embeddings.initializer.run()

print(doc_embeddings, '\n')
print(doc_embeddings.eval())

<tf.Variable 'Variable:0' shape=(5, 3) dtype=float32_ref> 

[[-0.61349183  0.77466398  0.45858532]
 [ 0.66857857 -0.04503886 -0.12617138]
 [ 0.60222524 -0.96931243 -0.58015776]
 [-0.5013777  -0.65873313  0.3652117 ]
 [-0.48010617 -0.93881094 -0.16271667]]


In [73]:
D_embed = tf.nn.embedding_lookup(doc_embeddings, X)

print(D_embed, '\n')
print(D_embed.eval())

Tensor("embedding_lookup:0", shape=(4, 1, 3), dtype=float32) 

[[[ 0.66857857 -0.04503886 -0.12617138]]

 [[-0.48010617 -0.93881094 -0.16271667]]

 [[-0.48010617 -0.93881094 -0.16271667]]

 [[-0.61349183  0.77466398  0.45858532]]]


In [74]:
X_linear = tf.squeeze(D_embed, axis=1)

print(X_linear, '\n')
print(X_linear.eval())

Tensor("Squeeze:0", shape=(4, 3), dtype=float32) 

[[ 0.66857857 -0.04503886 -0.12617138]
 [-0.48010617 -0.93881094 -0.16271667]
 [-0.48010617 -0.93881094 -0.16271667]
 [-0.61349183  0.77466398  0.45858532]]


In [75]:
# ~ tf.truncated_normal(shape=(vocabulary_size, embedding_size),
#                       stddev=1.0 / np.sqrt(embedding_size))
W_linear = tf.Variable(
    np.random.randn(vocabulary_size, embedding_size) / np.sqrt(embedding_size),
    dtype=tf.float32)

W_linear.initializer.run()

print(W_linear, '\n')
print(W_linear.eval())

<tf.Variable 'Variable_1:0' shape=(20, 3) dtype=float32_ref> 

[[-0.82627457 -0.39039791  0.81090957]
 [ 0.56913507 -0.41484466 -0.1717644 ]
 [-0.36643881 -0.40700886  0.40026474]
 [ 0.43798104 -0.11199692 -0.38186169]
 [-0.83280349  0.32550862 -0.70675701]
 [-0.17739746 -1.37478101  0.45252252]
 [-0.46652669 -0.27244434 -0.18778431]
 [ 0.53468007 -0.27579314  0.33644667]
 [-1.36255693  0.4423517  -0.28640082]
 [ 0.34280732 -0.45193434 -0.71272874]
 [ 0.32164633  0.69007677 -0.62571734]
 [-1.32908964 -0.05086939  0.12795296]
 [ 0.90208244  0.20637964 -0.16464441]
 [ 0.38290766 -0.06800617 -0.73474473]
 [-0.85834175  0.16015102  0.22423425]
 [ 0.74233687  0.10035369  0.05704785]
 [ 0.01810645  0.74310029 -0.06171733]
 [-0.8623957   0.02693488  0.28534612]
 [ 0.68768942 -0.81597233  0.30073568]
 [-0.01739639  0.08307558 -0.1174571 ]]


In [76]:
# ~ tf.zeros(shape=(vocabulary_size,))
b_linear = tf.Variable(np.zeros(vocabulary_size), dtype=tf.float32)

b_linear.initializer.run()

print(b_linear, '\n')
print(b_linear.eval())

<tf.Variable 'Variable_2:0' shape=(20,) dtype=float32_ref> 

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]


In [77]:
sampled_loss = tf.nn.sampled_softmax_loss(weights=W_linear,
                                          biases=b_linear,
                                          inputs=X_linear,
                                          labels=y,
                                          num_sampled=num_sampled,
                                          num_classes=vocabulary_size,
                                          num_true=window_size)

print(sampled_loss, '\n')
print(sampled_loss.eval())

Tensor("Reshape_2:0", shape=(4,), dtype=float32) 

[ 2.75929332  1.84279549  1.40849543  1.84590399]
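
With `num_true=window_size`, every row of `y` lists several target classes for the same document vector. As far as the TensorFlow implementation goes, each true class then receives a target probability of `1 / num_true`. The NumPy sketch below shows the full-softmax analogue of that target. It is an illustration only: it skips the negative sampling and the sampling corrections of the real op, and `multi_true_softmax_loss` is a hypothetical helper.

```python
import numpy as np

def multi_true_softmax_loss(logits, labels):
    """Cross-entropy against a target spread evenly over each row's labels."""
    num_true = labels.shape[1]
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(labels.shape[0])[:, None]
    # Each listed class contributes with weight 1 / num_true.
    return -log_probs[rows, labels].sum(axis=1) / num_true

logits = np.zeros((2, 20))        # uniform prediction over 20 classes
labels = np.array([[0, 4, 6],
                   [9, 11, 19]])  # three true classes per row

print(multi_true_softmax_loss(logits, labels))  # each entry equals log(20)
```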


In [78]:
loss = tf.reduce_mean(sampled_loss)

print(loss, '\n')
print(loss.eval())

Tensor("Mean:0", shape=(), dtype=float32) 

1.84419


In [79]:
session.close()
del X, y, doc_embeddings, D_embed
del X_linear, W_linear, b_linear, sampled_loss, loss
del graph, session
gc.collect()

0

Model function:

In [80]:
def model_dbow(collection_size: int,
               vocabulary_size: int,
               embedding_size: int,
               window_size: int,
               num_sampled: int) \
    -> Tuple[List[tf.Tensor], tf.Tensor, tf.Tensor]:
    
    X = tf.placeholder_with_default([[0]],
                                    shape=(None, 1),
                                    name='X')
    y = tf.placeholder_with_default([[0]*window_size],
                                    shape=(None, window_size),
                                    name='y')

    doc_embeddings = tf.Variable(
        tf.random_uniform(shape=(collection_size, embedding_size),
                          minval=-1.0, maxval=1.0),
        name='doc_embeddings')

    D_embed = tf.nn.embedding_lookup(doc_embeddings, X)
    X_linear = tf.squeeze(D_embed, axis=1)
    
    W_linear = tf.Variable(
        tf.truncated_normal(shape=(vocabulary_size, embedding_size),
                            stddev=1.0 / np.sqrt(embedding_size)),
        name='W')
    b_linear = tf.Variable(
        tf.zeros(shape=(vocabulary_size,)),
        name='b')

    with tf.name_scope('loss'):
        sampled_loss = tf.nn.sampled_softmax_loss(weights=W_linear,
                                                  biases=b_linear,
                                                  inputs=X_linear,
                                                  labels=y,
                                                  num_sampled=num_sampled,
                                                  num_classes=vocabulary_size,
                                                  num_true=window_size)
        loss = tf.reduce_mean(sampled_loss, name='mean')
        
    inputs = [X, y]
    return inputs, doc_embeddings, loss

Example:

In [81]:
batch_size = 4
window_size = 3
vocabulary_size = 20
collection_size = 5
embedding_size = 3
num_sampled = 2

X_batch = np.random.randint(low=0,
                            high=collection_size,
                            size=(batch_size, 1),
                            dtype=np.int32)
y_batch = np.random.randint(low=0,
                            high=vocabulary_size,
                            size=(batch_size, window_size),
                            dtype=np.int32)
data_batch = (X_batch, y_batch)

with tf.Graph().as_default() as graph, \
    tf.Session(graph=graph) as session:

    inputs, embeddings, loss_op = \
        model_dbow(collection_size,
                   vocabulary_size,
                   embedding_size,
                   window_size,
                   num_sampled)

    tf.global_variables_initializer().run()

    data_feed = dict(zip(inputs, data_batch))
    loss, doc_embeddings = \
        session.run([loss_op, embeddings], data_feed)

    print('Average loss:\n\n{:,.3f}\n'.format(loss))
    print('Document embeddings:\n\n{}\n'.format(doc_embeddings))

Average loss:

1.398

Document embeddings:

[[ 0.94213724  0.11667156  0.86827254]
 [ 0.66263461  0.35906649  0.8484571 ]
 [ 0.31185198  0.13261962 -0.03142118]
 [ 0.10023284  0.80146909 -0.01191282]
 [-0.78691721  0.38137054  0.86561298]]



## Sentiment Analysis

**Input**

In [82]:
num_examples = len(train_data)
num_batches_16 = math.ceil(num_examples / 16)
last_batch_16 = num_examples % 16

print('Examples: {:,d}'.format(num_examples))
print('Batches (batch_size=16): {:,d}'.format(num_batches_16))
print('Last (batch_size=16): {:,d}'.format(last_batch_16))

Examples: 8,544
Batches (batch_size=16): 534
Last (batch_size=16): 0


In [83]:
def input_sentiment(data: List[Tuple[int, int]],
                    batch_size: int,
                    shuffle=True) \
    -> Iterable[Tuple[np.ndarray, np.ndarray]]:
    
    num_examples = len(data)
    data_tail = collections.deque(data)
    if shuffle:
        random.shuffle(data_tail)
    
    while num_examples > 0:
        batch_size_i = min(batch_size, num_examples)
        
        doc_batch = np.ndarray(shape=(batch_size_i, 1), dtype=np.int32)
        target_batch = np.ndarray(shape=(batch_size_i, 1), dtype=np.int32)
        
        for i in range(batch_size_i):
            doc_id, sentiment = data_tail.popleft()
            doc_batch[i, 0] = doc_id
            target_batch[i, 0] = sentiment
        
        num_examples -= batch_size_i
        yield doc_batch, target_batch

In [84]:
batch_size = 16

n = 0
for _ in input_sentiment(train_data, batch_size):
    n += 1
print('Epoch steps: {:,d}'.format(n))

Epoch steps: 534


Example:

In [85]:
batch_size = 4
num_iters = 2

data_iter = input_sentiment(train_data, batch_size)

for k in range(1, num_iters + 1):
    print('Batch {}\n'.format(k))
    doc_batch, target_batch = next(data_iter)
    for i in range(batch_size):
        doc_ref = document_from_id[doc_batch[i, 0]]
        sentiment_class = target_batch[i, 0]
        print('{} -> {}'.format(doc_ref, sentiment_class))
    print()

del data_iter

Batch 1

224012 -> 0
226719 -> 0
63663 -> 1
66646 -> 1

Batch 2

183842 -> 0
66594 -> 1
223497 -> 0
110576 -> 0



**Model**

Model building:

In [86]:
graph = tf.Graph()
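# as_default() only returns a context manager, so this call has no effect
# on its own; InteractiveSession(graph=graph) below installs the graph.
graph.as_default()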
session = tf.InteractiveSession(graph=graph)
session

<tensorflow.python.client.session.InteractiveSession at 0x7f9da4aaafd0>

In [87]:
batch_size = 4
collection_size = 5
embedding_size = 3

In [88]:
X = tf.constant(np.random.randint(low=0,
                                  high=collection_size,
                                  size=(batch_size, 1),
                                  dtype=np.int32))

print(X, '\n')
print(X.eval())

Tensor("Const:0", shape=(4, 1), dtype=int32) 

[[2]
 [4]
 [4]
 [0]]


In [89]:
y = tf.constant(np.random.randint(low=0,
                                  high=2,
                                  size=(batch_size, 1),
                                  dtype=np.int32))

print(y, '\n')
print(y.eval())

Tensor("Const_1:0", shape=(4, 1), dtype=int32) 

[[0]
 [1]
 [1]
 [1]]


In [90]:
embeddings_dm = tf.Variable(
    np.random.randn(collection_size, embedding_size),
    dtype=tf.float32,
    trainable=False)

embeddings_dm.initializer.run()

print(embeddings_dm, '\n')
print(embeddings_dm.eval())

<tf.Variable 'Variable:0' shape=(5, 3) dtype=float32_ref> 

[[ 0.43830919  0.85851675  0.27750978]
 [-1.04084623  0.07217167 -1.79794228]
 [-0.29914698 -0.30388564  0.21007468]
 [ 0.12226102  0.35672107 -0.90264189]
 [-0.24634524 -0.98216307  0.30227593]]


In [91]:
embeddings_dbow = tf.Variable(
    np.random.randn(collection_size, embedding_size),
    dtype=tf.float32,
    trainable=False)

embeddings_dbow.initializer.run()

print(embeddings_dbow, '\n')
print(embeddings_dbow.eval())

<tf.Variable 'Variable_1:0' shape=(5, 3) dtype=float32_ref> 

[[-1.49994326  1.92934251  1.87464392]
 [ 0.1671124   0.17607364 -1.13918853]
 [-0.97666603 -0.20756362 -1.22162235]
 [-0.82695103 -0.28422162 -0.68925583]
 [ 0.49236581  0.86827451 -0.21970809]]


In [92]:
X_dm = tf.nn.embedding_lookup(embeddings_dm, X)

print(X_dm, '\n')
print(X_dm.eval())

Tensor("embedding_lookup:0", shape=(4, 1, 3), dtype=float32) 

[[[-0.29914698 -0.30388564  0.21007468]]

 [[-0.24634524 -0.98216307  0.30227593]]

 [[-0.24634524 -0.98216307  0.30227593]]

 [[ 0.43830919  0.85851675  0.27750978]]]


In [93]:
X_dbow = tf.nn.embedding_lookup(embeddings_dbow, X)

print(X_dbow, '\n')
print(X_dbow.eval())

Tensor("embedding_lookup_1:0", shape=(4, 1, 3), dtype=float32) 

[[[-0.97666603 -0.20756362 -1.22162235]]

 [[ 0.49236581  0.86827451 -0.21970809]]

 [[ 0.49236581  0.86827451 -0.21970809]]

 [[-1.49994326  1.92934251  1.87464392]]]


In [94]:
X_embed = tf.concat([X_dm, X_dbow], axis=2)

print(X_embed, '\n')
print(X_embed.eval())

Tensor("concat:0", shape=(4, 1, 6), dtype=float32) 

[[[-0.29914698 -0.30388564  0.21007468 -0.97666603 -0.20756362 -1.22162235]]

 [[-0.24634524 -0.98216307  0.30227593  0.49236581  0.86827451 -0.21970809]]

 [[-0.24634524 -0.98216307  0.30227593  0.49236581  0.86827451 -0.21970809]]

 [[ 0.43830919  0.85851675  0.27750978 -1.49994326  1.92934251  1.87464392]]]


In [95]:
X_linear = tf.squeeze(X_embed, axis=1)

print(X_linear, '\n')
print(X_linear.eval())

Tensor("Squeeze:0", shape=(4, 6), dtype=float32) 

[[-0.29914698 -0.30388564  0.21007468 -0.97666603 -0.20756362 -1.22162235]
 [-0.24634524 -0.98216307  0.30227593  0.49236581  0.86827451 -0.21970809]
 [-0.24634524 -0.98216307  0.30227593  0.49236581  0.86827451 -0.21970809]
 [ 0.43830919  0.85851675  0.27750978 -1.49994326  1.92934251  1.87464392]]


In [96]:
# ~ tf.truncated_normal(shape=(2 * embedding_size, 1))
W = tf.Variable(
    np.random.randn(2 * embedding_size, 1),
    dtype=tf.float32)

W.initializer.run()

print(W, '\n')
print(W.eval())

<tf.Variable 'Variable_2:0' shape=(6, 1) dtype=float32_ref> 

[[-0.94533986]
 [-0.61754084]
 [-0.02720818]
 [ 0.04461898]
 [ 1.42896974]
 [ 0.06104174]]


In [97]:
# ~ tf.zeros(shape=(1,))
b = tf.Variable(np.zeros(1), dtype=tf.float32)

b.initializer.run()

print(b, '\n')
print(b.eval())

<tf.Variable 'Variable_3:0' shape=(1,) dtype=float32_ref> 

[ 0.]


In [98]:
logits = tf.nn.xw_plus_b(X_linear, W, b)

print(logits, '\n')
print(logits.eval())

Tensor("xw_plus_b:0", shape=(4, 1), dtype=float32) 

[[ 0.04999167]
 [ 2.080477  ]
 [ 2.080477  ]
 [ 1.85240686]]


In [99]:
y_prob = tf.sigmoid(logits)

print(y_prob, '\n')
print(y_prob.eval())

Tensor("Sigmoid:0", shape=(4, 1), dtype=float32) 

[[ 0.51249534]
 [ 0.88899112]
 [ 0.88899112]
 [ 0.86440945]]


In [100]:
y_hat = tf.cast(tf.greater_equal(y_prob, 0.5), tf.int32)

print(y_hat, '\n')
print(y_hat.eval())

Tensor("Cast:0", shape=(4, 1), dtype=int32) 

[[1]
 [1]
 [1]
 [1]]


In [101]:
loss = tf.losses.sigmoid_cross_entropy(y, logits)

print(loss, '\n')
print(loss.eval())

Tensor("sigmoid_cross_entropy_loss/value:0", shape=(), dtype=float32) 

0.274875
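
The loss value can be checked by hand from the logits and labels printed in the cells above. The NumPy sketch below copies those printed values and applies the numerically stable form of the sigmoid cross-entropy, averaged over the batch. It should reproduce the loss of about 0.2749 and the probabilities printed for `y_prob`.

```python
import numpy as np

# Values copied from the printed logits and y above.
logits = np.array([[0.04999167], [2.080477], [2.080477], [1.85240686]])
labels = np.array([[0], [1], [1], [1]])

# Stable form of -z * log(p) - (1 - z) * log(1 - p) with p = sigmoid(x):
# max(x, 0) - x * z + log(1 + exp(-|x|))
per_example = (np.maximum(logits, 0) - logits * labels
               + np.log1p(np.exp(-np.abs(logits))))
loss = per_example.mean()

y_prob = 1.0 / (1.0 + np.exp(-logits))

print(loss)
print(y_prob.ravel())
```

The first example dominates the loss: its label is 0 while the model assigns it a probability just above 0.5.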


In [102]:
acc_tensor, acc_op = tf.metrics.accuracy(labels=y,
                                         predictions=y_hat)

print(acc_tensor)
print(acc_op)

Tensor("accuracy/value:0", shape=(), dtype=float32)
Tensor("accuracy/update_op:0", shape=(), dtype=float32)


In [103]:
*_, acc_total, acc_count = tf.local_variables()

acc_init = tf.variables_initializer([acc_total, acc_count])
acc_init.run()

print(acc_total, '\n')
print(acc_total.eval(), '\n')
print(acc_count, '\n')
print(acc_count.eval())

<tf.Variable 'accuracy/total:0' shape=() dtype=float32_ref> 

0.0 

<tf.Variable 'accuracy/count:0' shape=() dtype=float32_ref> 

0.0


In [104]:
print(acc_tensor.eval())

0.0


In [105]:
print(y.eval(), '\n')
print(y_hat.eval(), '\n')
print(acc_op.eval())

[[0]
 [1]
 [1]
 [1]] 

[[1]
 [1]
 [1]
 [1]] 

0.75


In [106]:
print(acc_tensor.eval())
print(acc_total.eval())
print(acc_count.eval())

0.75
3.0
4.0


In [107]:
acc_init.run()

print(acc_tensor.eval())
print(acc_total.eval())
print(acc_count.eval())

0.0
0.0
0.0
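
The cells above show the three parts of a streaming metric: a value that is read from two counters, an update op that adds a batch to them, and an initializer that resets them. A plain-Python sketch of the same bookkeeping follows. `StreamingAccuracy` is a hypothetical class for illustration, not a TensorFlow API.

```python
class StreamingAccuracy:
    """Running accuracy kept as two counters, like total and count above."""

    def __init__(self):
        self.total = 0.0  # correct predictions seen so far
        self.count = 0.0  # predictions seen so far

    def value(self):
        # Reading the metric does not consume data; 0.0 before any update.
        return self.total / self.count if self.count > 0 else 0.0

    def update(self, labels, predictions):
        # Add one batch to the counters and return the running accuracy.
        self.total += sum(float(l == p) for l, p in zip(labels, predictions))
        self.count += len(labels)
        return self.value()

    def reset(self):
        self.total, self.count = 0.0, 0.0

acc = StreamingAccuracy()
print(acc.value())                             # 0.0
print(acc.update([0, 1, 1, 1], [1, 1, 1, 1]))  # 0.75
print(acc.total, acc.count)                    # 3.0 4.0
acc.reset()
print(acc.value())                             # 0.0
```

Resetting between epochs keeps each epoch's accuracy from being mixed with earlier batches.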


In [108]:
session.close()
del X, y, embeddings_dm, embeddings_dbow
del X_dm, X_dbow, X_embed, X_linear
del W, b, logits, y_prob, y_hat, loss
del acc_tensor, acc_op, acc_total, acc_count, acc_init
del graph, session
gc.collect()

8168

Model function:

In [109]:
def model_sentiment(collection_size: int,
                    embedding_size: int,
                    threshold=0.5) \
    -> Tuple[Tuple[List[tf.Tensor], tf.Operation], List[tf.Tensor], List[tf.Tensor], tf.Tensor]:

    X = tf.placeholder_with_default([[0]], shape=(None, 1), name='X')
    y = tf.placeholder_with_default([[0]], shape=(None, 1), name='y')
    
    embeddings_dm = tf.Variable(tf.zeros(shape=(collection_size, embedding_size)),
                                trainable=False,
                                name='embeddings_dm')
    embeddings_dbow = tf.Variable(tf.zeros(shape=(collection_size, embedding_size)),
                                  trainable=False,
                                  name='embeddings_dbow')
    
    X_dm = tf.nn.embedding_lookup(embeddings_dm, X)
    X_dbow = tf.nn.embedding_lookup(embeddings_dbow, X)
    X_embed = tf.concat([X_dm, X_dbow], axis=2)
    X_linear = tf.squeeze(X_embed, axis=1)

    W = tf.Variable(
        tf.truncated_normal(shape=(2 * embedding_size, 1)),
        name='W')
    b = tf.Variable(
        tf.zeros(shape=(1,)),
        name='b')
    logits = tf.nn.xw_plus_b(X_linear, W, b)
    y_prob = tf.sigmoid(logits)
    y_hat = tf.cast(tf.greater_equal(y_prob, threshold), tf.int32)
    
    loss = tf.losses.sigmoid_cross_entropy(y, logits)
    
    embeddings_dm_input = tf.placeholder(
        tf.float32,
        shape=(collection_size, embedding_size),
        name='embeddings_dm_input')
    embeddings_dbow_input = tf.placeholder(
        tf.float32,
        shape=(collection_size, embedding_size),
        name='embeddings_dbow_input')
    embeddings_init_op = tf.group(
        tf.assign(embeddings_dm, embeddings_dm_input),
        tf.assign(embeddings_dbow, embeddings_dbow_input))
    embeddings_inputs = [embeddings_dm_input, embeddings_dbow_input]
    embeddings_init = (embeddings_inputs, embeddings_init_op)

    inputs = [X, y]
    predictions = [y_prob, y_hat]
    return embeddings_init, inputs, predictions, loss

Example:

In [110]:
batch_size = 4
collection_size = 5
embedding_size = 3

embeddings1 = np.random.randn(collection_size,
                              embedding_size)
embeddings2 = np.random.randn(collection_size,
                              embedding_size)
embeddings = [embeddings1.astype(np.float32),
              embeddings2.astype(np.float32)]

X_batch = np.random.randint(low=0,
                            high=collection_size,
                            size=(batch_size, 1),
                            dtype=np.int32)
y_batch = np.random.randint(low=0,
                            high=2,
                            size=(batch_size, 1),
                            dtype=np.int32)
data_batch = (X_batch, y_batch)

with tf.Graph().as_default() as graph, \
    tf.Session(graph=graph) as session:

    init, inputs, predictions, loss_op = \
        model_sentiment(collection_size, embedding_size)
    
    tf.global_variables_initializer().run()

    init_feed = dict(zip(init[0], embeddings))
    session.run(init[1], init_feed)
    
    data_feed = dict(zip(inputs, data_batch))
    loss, y_prob, y_hat = session.run([loss_op, *predictions],
                                      data_feed)
    
    print('Average loss: {:,.3f}\n'.format(loss))
    
    for i in range(batch_size):
        print('y={}, ŷ={} ({:.2f}%)'.format(y_batch[i, 0],
                                            y_hat[i, 0],
                                            100 * y_prob[i, 0]))

Average loss: 0.981

y=0, ŷ=0 (10.89%)
y=1, ŷ=1 (78.08%)
y=1, ŷ=0 (12.95%)
y=0, ŷ=1 (78.08%)


## Experiments

In [111]:
EXP_DIR = os.path.join(HOME_DIR, 'xxx')

DM_DIR = os.path.join(EXP_DIR, 'pv_dm')
DBOW_DIR = os.path.join(EXP_DIR, 'pv_dbow')
SENT_DIR = os.path.join(EXP_DIR, 'sentiment_linear')

DM_FILE = os.path.join(EXP_DIR, 'pv_dm.txt')
DM_WORDS_FILE = os.path.join(EXP_DIR, 'pv_dm_words.txt')
DBOW_FILE = os.path.join(EXP_DIR, 'pv_dbow.txt')

In [112]:
def remove_dir(path):
    if os.path.isdir(path):
        shutil.rmtree(path)

remove_dir(EXP_DIR)

In [113]:
def opt_adagrad(loss, learning_rate=1.0):
    return tf.contrib.layers.optimize_loss(
        loss=loss,
        optimizer='Adagrad',
        learning_rate=learning_rate,
        global_step=tf.train.get_global_step(),
        summaries=['loss'])
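
Adagrad scales each parameter's step by the inverse square root of its accumulated squared gradients, so frequently updated parameters take smaller steps over time. The NumPy sketch below shows the textbook update on a toy quadratic. `adagrad_step` is a hypothetical helper, and the nonzero starting accumulator of 0.1 is an assumption meant to mirror TensorFlow's default, to the best of my knowledge.

```python
import numpy as np

def adagrad_step(w, grad, accum, learning_rate=1.0):
    """One Adagrad update; returns the new weights and accumulator."""
    accum = accum + grad ** 2
    w = w - learning_rate * grad / np.sqrt(accum)
    return w, accum

# Minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
accum = np.full_like(w, 0.1)  # assumed nonzero start, avoids division by zero

for _ in range(100):
    w, accum = adagrad_step(w, w, accum)

print(w)
```

For embedding tables this per-parameter scaling is convenient: rows of rare documents or words receive few gradients and therefore keep a comparatively large effective learning rate.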

In [114]:
def metrics_average_loss(loss_op, summary_key):
    value, update = tf.metrics.mean(loss_op, name='metrics/average_loss')
    *_, total, count = tf.local_variables()
    reset = tf.variables_initializer([total, count])
    tf.summary.scalar('average_loss', value, [summary_key])
    return value, update, reset
    
def train_embeddings(model_fn, input_fn, opt_fn, num_epochs=1, last_print=True,
                     model_dir='/tmp/embedding_model', remove_model=True):
    if remove_model:
        remove_dir(model_dir)

    EPOCH_SUMMARIES = 'epoch_summaries'

    with tf.Graph().as_default():
        global_step = tf.train.create_global_step()

        inputs, embeddings, loss_op = model_fn()
        train_op = opt_fn(loss_op)

        avg_tensor, avg_op, avg_reset = \
            metrics_average_loss(loss_op, EPOCH_SUMMARIES)

        epoch_summary_op = tf.summary.merge_all(EPOCH_SUMMARIES)

        with tf.train.MonitoredTrainingSession(
            checkpoint_dir=model_dir) as session:

            for epoch in range(1, num_epochs+1):
                #print('Epoch {}...'.format(epoch))

                for data_batch in input_fn():
                    data_feed = dict(zip(inputs, data_batch))
                    session.run([train_op, avg_op], data_feed)

                epoch_summary_proto, step_ = session.run([epoch_summary_op,
                                                          global_step])
                summary_writer = tf.summary.FileWriterCache.get(model_dir)
                summary_writer.add_summary(epoch_summary_proto, step_)
                summary_writer.flush()

                avg_loss = session.run(avg_tensor)
                session.run(avg_reset)
            
            embeddings_ = session.run(embeddings)
        
        tf.summary.FileWriterCache.clear()
    
    if last_print:
        print('Last average loss: {:.4f}'.format(avg_loss))
    return embeddings_

In [115]:
%%time

collection_size = len(document_to_id)
vocabulary_size = len(token_to_id)
embedding_size = 25
window_size = 4
num_sampled = 100
batch_size = 64

model_fn = lambda: model_dm(collection_size,
                            vocabulary_size,
                            embedding_size,
                            window_size,
                            num_sampled,
                            linear_input='average')
input_fn = lambda: input_dm(data,
                            batch_size,
                            window_size)

embeddings_dm, embeddings_dm_words = \
    train_embeddings(model_fn,
                     input_fn,
                     opt_adagrad,
                     num_epochs=1,
                     model_dir=DM_DIR)

Last average loss: 3.8960
CPU times: user 2min 24s, sys: 8.21 s, total: 2min 32s
Wall time: 1min 38s


In [116]:
%%time

collection_size = len(document_to_id)
vocabulary_size = len(token_to_id)
embedding_size = 25
window_size = 4
num_sampled = 100
batch_size = 64

model_fn = lambda: model_dbow(collection_size,
                              vocabulary_size,
                              embedding_size,
                              window_size,
                              num_sampled)
input_fn = lambda: input_dbow(data,
                              batch_size,
                              window_size)

embeddings_dbow = train_embeddings(model_fn,
                                   input_fn,
                                   opt_adagrad,
                                   num_epochs=1,
                                   model_dir=DBOW_DIR)

Last average loss: 4.5411
CPU times: user 1min 53s, sys: 4.05 s, total: 1min 57s
Wall time: 1min 15s


In [117]:
def opt_ftrl(loss, learning_rate=0.1):
    return tf.contrib.layers.optimize_loss(
        loss=loss,
        optimizer='Ftrl',
        learning_rate=learning_rate,
        global_step=tf.train.get_global_step(),
        summaries=['loss'])

In [118]:
def metrics_accuracy(mode, labels, predictions, summary_key):
    value, update = tf.metrics.accuracy(labels=labels,
                                        predictions=predictions,
                                        name='metrics/accuracy/' + mode)
    # The metric's accumulators are the two most recently created local
    # variables, so this line must run right after tf.metrics.accuracy.
    *_, total, count = tf.local_variables()
    reset = tf.variables_initializer([total, count])
    tf.summary.scalar('accuracy/' + mode, value, [summary_key])
    return value, update, reset

def metrics_auc(mode, labels, predictions, summary_key):
    value, update = tf.metrics.auc(labels=labels,
                                   predictions=predictions,
                                   name='metrics/auc/' + mode)
    # tf.metrics.auc creates four confusion-matrix accumulators; all four
    # are reset together, so their order in the unpacking does not matter.
    *_, tp, tn, fp, fn = tf.local_variables()
    reset = tf.variables_initializer([tp, tn, fp, fn])
    tf.summary.scalar('auc/' + mode, value, [summary_key])
    return value, update, reset


def train_sentiment_pv(model_fn, input_fn, opt_fn, embeddings,
                       eval_data, num_epochs=1, last_print=True,
                       model_dir='/tmp/classifier_model', remove_model=True):
    if remove_model:
        remove_dir(model_dir)

    EPOCH_SUMMARIES = 'epoch_summaries'
    
    with tf.Graph().as_default():
        global_step = tf.train.create_global_step()
        
        init, inputs, predictions, loss_op = model_fn()
        train_op = opt_fn(loss_op)

        avg_tensor, avg_op, avg_reset = \
            metrics_average_loss(loss_op, EPOCH_SUMMARIES)
        
        _, y = inputs
        y_prob, y_hat = predictions
        
        auc_tensor, auc_op, auc_reset = \
            metrics_auc('train', y, y_prob, EPOCH_SUMMARIES)
        auc_eval_tensor, auc_eval_op, auc_eval_reset = \
            metrics_auc('eval', y, y_prob, EPOCH_SUMMARIES)

        acc_tensor, acc_op, acc_reset = \
            metrics_accuracy('train', y, y_hat, EPOCH_SUMMARIES)
        acc_eval_tensor, acc_eval_op, acc_eval_reset = \
            metrics_accuracy('eval', y, y_hat, EPOCH_SUMMARIES)

        eval_feed = dict(zip(inputs, eval_data))
        
        epoch_summary_op = tf.summary.merge_all(EPOCH_SUMMARIES)
        
        loop_ops = [train_op, avg_op, auc_op, acc_op]
        eval_ops = [auc_eval_op, acc_eval_op]
        reset_ops = [avg_reset, auc_reset, acc_reset,
                     auc_eval_reset, acc_eval_reset]
        
        with tf.train.MonitoredTrainingSession(
            checkpoint_dir=model_dir) as session:
            
            init_feed = dict(zip(init[0], embeddings))
            session.run(init[1], init_feed)
            
            for epoch in range(1, num_epochs+1):
                #print('Epoch {}...'.format(epoch))
                
                for data_batch in input_fn():
                    data_feed = dict(zip(inputs, data_batch))
                    session.run(loop_ops, data_feed)

                session.run(eval_ops, eval_feed)
                
                epoch_summary_proto, step_ = session.run([epoch_summary_op,
                                                          global_step])
                summary_writer = tf.summary.FileWriterCache.get(model_dir)
                summary_writer.add_summary(epoch_summary_proto, step_)
                summary_writer.flush()

                avg_loss, auc, auc_eval, acc, acc_eval = session.run(
                    [avg_tensor, auc_tensor, auc_eval_tensor,
                     acc_tensor, acc_eval_tensor])

                session.run(reset_ops)

            tf.summary.FileWriterCache.clear()

    if last_print:
        print('Last average loss: {:.3f}'.format(avg_loss))
        print('Last AUC: {:.3f}, eval {:.3f}'.format(auc, auc_eval))
        print('Last accuracy: {:.2f}, eval {:.2f}'.format(
            100 * acc, 100 * acc_eval))
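
The `metrics_*` helpers above each return a `(value, update, reset)` triple: `update` accumulates per-batch statistics, `value` reads the running result, and `reset` re-initializes the accumulators at the end of an epoch. The pure-Python sketch below mirrors that pattern for accuracy. The `StreamingAccuracy` class is hypothetical and only imitates the `total`/`count` bookkeeping; it is not TensorFlow's implementation.

```python
class StreamingAccuracy:
    """Hypothetical stand-in for the (value, update, reset) triple."""

    def __init__(self):
        self.total = 0.0  # number of correct predictions seen so far
        self.count = 0.0  # number of examples seen so far

    def update(self, labels, predictions):
        # Accumulate one batch without discarding earlier batches.
        self.total += sum(1.0 for y, p in zip(labels, predictions) if y == p)
        self.count += len(labels)

    def value(self):
        # Running accuracy over everything accumulated since the last reset.
        return self.total / self.count if self.count else 0.0

    def reset(self):
        # Plays the role of re-running the variables initializer.
        self.total = 0.0
        self.count = 0.0


metric = StreamingAccuracy()
metric.update([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct
metric.update([0, 0], [0, 1])              # 1 of 2 correct
epoch_accuracy = metric.value()            # 4 correct out of 6
metric.reset()                             # start the next epoch from zero
```

Keeping separate accumulators for `'train'` and `'eval'`, as `train_sentiment_pv` does, prevents the evaluation pass from mixing its counts into the training metric.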

In [119]:
def input_split(data: List[Tuple[int, int]]) \
    -> Tuple[np.ndarray, np.ndarray]:
    X, y = zip(*data)
    X = np.reshape(X, (-1, 1))
    y = np.reshape(y, (-1, 1))
    return X, y
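
As a quick illustration of what `input_split` produces, the sketch below applies the same `zip(*...)` transpose and reshape to a toy list. The pairs are made-up values, and treating them as `(document_id, label)` is an assumption based on how the result is fed to the classifier.

```python
import numpy as np

# Hypothetical (document_id, label) pairs; real ids come from the dataset.
pairs = [(10, 1), (11, 0), (12, 1)]

X, y = zip(*pairs)           # transpose: X = (10, 11, 12), y = (1, 0, 1)
X = np.reshape(X, (-1, 1))   # column vector of shape (3, 1)
y = np.reshape(y, (-1, 1))   # column vector of shape (3, 1)
```

Both arrays come out as columns, so each row holds exactly one example.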

In [120]:
%%time

collection_size = len(document_to_id)
embedding_size = 25
batch_size = 16

valid_data_ = input_split(valid_data)

model_fn = lambda: model_sentiment(collection_size,
                                   embedding_size)
input_fn = lambda: input_sentiment(train_data,
                                   batch_size)

train_sentiment_pv(model_fn,
                   input_fn,
                   opt_ftrl,
                   [embeddings_dm, embeddings_dbow],
                   valid_data_,
                   num_epochs=1,
                   model_dir=SENT_DIR)

Last average loss: 0.703
Last AUC: 0.492, eval 0.495
Last accuracy: 49.34, eval 48.68
CPU times: user 2.1 s, sys: 192 ms, total: 2.29 s
Wall time: 1.75 s
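
For reading these numbers: assuming `model_sentiment` is trained with a binary cross-entropy loss (its definition lies outside this section), a classifier that outputs probability 0.5 for every example scores a loss of ln 2 and an AUC of 0.5. The short check below computes that reference value.

```python
import math

# Assumption: binary cross-entropy on a two-class problem.
p = 0.5                     # predicted probability for every example
chance_loss = -math.log(p)  # per-example loss, whichever label is true
```

`chance_loss` is about 0.693, so a loss of 0.703 with AUC and accuracy near 0.5 means the classifier is still at chance level after this single epoch.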


In [121]:
collection_size = len(document_to_id)
embedding_size = 25
batch_size = 16

eval_data = input_split(test_data)

embeddings = [embeddings_dm, embeddings_dbow]

with tf.Graph().as_default() as graph, \
    tf.Session(graph=graph) as session:
    
    init, inputs, predictions, loss_op = \
        model_sentiment(collection_size,
                        embedding_size)

    _, y = inputs
    y_prob, y_hat = predictions

    EPOCH_SUMMARIES = 'epoch_summaries'
    avg_tensor, avg_op, avg_reset = \
        metrics_average_loss(loss_op, EPOCH_SUMMARIES)
    auc_eval_tensor, auc_eval_op, auc_eval_reset = \
        metrics_auc('eval', y, y_prob, EPOCH_SUMMARIES)
    acc_eval_tensor, acc_eval_op, acc_eval_reset = \
        metrics_accuracy('eval', y, y_hat, EPOCH_SUMMARIES)

    eval_feed = dict(zip(inputs, eval_data))

    saver = tf.train.Saver()
    saver.restore(session, tf.train.latest_checkpoint(SENT_DIR))

    session.run(tf.local_variables_initializer())

    init_feed = dict(zip(init[0], embeddings))
    session.run(init[1], init_feed)

    session.run([avg_op, auc_eval_op, acc_eval_op], eval_feed)
    
    avg_loss = session.run(avg_tensor)
    auc_eval = session.run(auc_eval_tensor)
    acc_eval = session.run(acc_eval_tensor)

print('Average loss: {:.3f}'.format(avg_loss))
print('AUC: {:.3f}'.format(auc_eval))
print('Accuracy: {:.2f}'.format(100 * acc_eval))

Average loss: 0.696
AUC: 0.506
Accuracy: 50.54


In [122]:
def save_embeddings(file: str, embeddings: np.ndarray):
    """Write one vector per line as space-separated floats (5 decimals)."""
    with open(file, 'w') as f:
        for embedding in embeddings:
            f.write(' '.join('{:.5f}'.format(k) for k in embedding))
            f.write('\n')

def load_embeddings(file: str) -> np.ndarray:
    """Read vectors written by save_embeddings into a float32 array."""
    with open(file, 'r') as f:
        vectors = [[float(k) for k in line.split()] for line in f]
    return np.asarray(vectors, dtype=np.float32)
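
The text format used here (one vector per line, space-separated, five decimal places) can also be written and read with NumPy's own text I/O. The sketch below is an alternative under that assumption, not the notebook's code; the file name and the sample matrix are made up.

```python
import os
import tempfile

import numpy as np

embeddings = np.array([[0.123456, -1.5], [2.0, 3.25]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'embeddings.txt')  # hypothetical file name
    # One row per line, space-separated, five decimals per value.
    np.savetxt(path, embeddings, fmt='%.5f')
    # ndmin=2 keeps a single-row file two-dimensional.
    restored = np.loadtxt(path, dtype=np.float32, ndmin=2)
```

Either way the round trip is lossy: values are rounded to five decimals, so `restored` matches `embeddings` only to about 1e-5.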

In [123]:
save_embeddings(DM_FILE, embeddings_dm)
print('Embeddings file size: {:,d} bytes'.format(os.stat(DM_FILE).st_size))

Embeddings file size: 50,836,170 bytes


In [124]:
save_embeddings(DM_WORDS_FILE, embeddings_dm_words)
print('Embeddings file size: {:,d} bytes'.format(os.stat(DM_WORDS_FILE).st_size))

Embeddings file size: 4,083,149 bytes


In [125]:
save_embeddings(DBOW_FILE, embeddings_dbow)
print('Embeddings file size: {:,d} bytes'.format(os.stat(DBOW_FILE).st_size))

Embeddings file size: 50,836,650 bytes


In [126]:
del embeddings_dm, embeddings_dm_words, embeddings_dbow
gc.collect()

71857

In [127]:
try:
    embeddings_dm
except NameError:
    print('Loading DM embeddings...')
    %time embeddings_dm = load_embeddings(DM_FILE)
try:
    embeddings_dbow
except NameError:
    print('Loading DBOW embeddings...')
    %time embeddings_dbow = load_embeddings(DBOW_FILE)

Loading DM embeddings...
CPU times: user 3.06 s, sys: 79.6 ms, total: 3.14 s
Wall time: 3.14 s
Loading DBOW embeddings...
CPU times: user 2.97 s, sys: 89.6 ms, total: 3.05 s
Wall time: 3.05 s


In [128]:
%%time

collection_size = len(document_to_id)
embedding_size = 25
batch_size = 16

valid_data_ = input_split(valid_data)

model_fn = lambda: model_sentiment(collection_size,
                                   embedding_size)
input_fn = lambda: input_sentiment(train_data,
                                   batch_size)

train_sentiment_pv(model_fn,
                   input_fn,
                   opt_ftrl,
                   [embeddings_dm, embeddings_dbow],
                   valid_data_,
                   num_epochs=1,
                   model_dir=SENT_DIR,
                   remove_model=False)

Last average loss: 0.696
Last AUC: 0.510, eval 0.491
Last accuracy: 50.98, eval 49.14
CPU times: user 2.15 s, sys: 159 ms, total: 2.31 s
Wall time: 1.8 s


In [129]:
del embeddings_dm, embeddings_dbow
gc.collect()

40011

In [130]:
remove_dir(EXP_DIR)

**Exploration**

In [131]:
def train_dm_fn(data: List[Tuple[int, int]],
                collection_size: int,
                vocabulary_size: int,
                embedding_size: int,
                window_size: int,
                num_sampled: int,
                linear_input: str,
                batch_size: int,
                num_epochs=1,
                model_dir=DM_DIR) \
    -> Callable[..., Tuple[np.ndarray, np.ndarray]]:

    model_fn = functools.partial(model_dm,
                                 collection_size,
                                 vocabulary_size,
                                 embedding_size,
                                 window_size,
                                 num_sampled,
                                 linear_input)
    input_fn = functools.partial(input_dm,
                                 data,
                                 batch_size,
                                 window_size)

    return functools.partial(train_embeddings,
                             model_fn,
                             input_fn,
                             opt_adagrad,
                             num_epochs=num_epochs,
                             model_dir=model_dir,
                             remove_model=False)

def train_dbow_fn(data: List[Tuple[int, int]],
                  collection_size: int,
                  vocabulary_size: int,
                  embedding_size: int,
                  window_size: int,
                  num_sampled: int,
                  batch_size: int,
                  num_epochs=1,
                  model_dir=DBOW_DIR) \
    -> Callable[..., np.ndarray]:

    model_fn = functools.partial(model_dbow,
                                 collection_size,
                                 vocabulary_size,
                                 embedding_size,
                                 window_size,
                                 num_sampled)
    input_fn = functools.partial(input_dbow,
                                 data,
                                 batch_size,
                                 window_size)

    return functools.partial(train_embeddings,
                             model_fn,
                             input_fn,
                             opt_adagrad,
                             num_epochs=num_epochs,
                             model_dir=model_dir,
                             remove_model=False)

def train_sentiment_fn(train_data: List[Tuple[int, int]],
                       eval_data: List[Tuple[int, int]],
                       collection_size: int,
                       embedding_size: int,
                       batch_size: int,
                       num_epochs=2,
                       model_dir=SENT_DIR) \
    -> Callable[..., None]:

    eval_data_ = input_split(eval_data)
    
    model_fn = functools.partial(model_sentiment,
                                 collection_size,
                                 embedding_size)
    input_fn = functools.partial(input_sentiment,
                                 train_data,
                                 batch_size)

    return functools.partial(train_sentiment_pv,
                             model_fn,
                             input_fn,
                             opt_ftrl,
                             eval_data=eval_data_,
                             num_epochs=num_epochs,
                             model_dir=model_dir,
                             remove_model=False)

def run_experiment(name: str,
                   num_iters: int,
                   log_steps: int,
                   data: List[Tuple[int, int]],
                   train_data: List[Tuple[int, int]],
                   eval_data: List[Tuple[int, int]],
                   collection_size: int,
                   vocabulary_size: int,
                   embedding_size: int,
                   window_size: int,
                   num_sampled: int,
                   dm_linear: str,
                   batch_size: int,
                   remove_home=False):

    exp_dir = os.path.join(HOME_DIR, name) 
    if remove_home:
        remove_dir(exp_dir)

    dm_dir = os.path.join(exp_dir, 'pv_dm')
    dbow_dir = os.path.join(exp_dir, 'pv_dbow')
    sent_dm_dir = os.path.join(exp_dir, 'sent_dm')
    sent_dbow_dir = os.path.join(exp_dir, 'sent_dbow')
    sent_dir = os.path.join(exp_dir, 'sent')

    dm_file = os.path.join(exp_dir, 'pv_dm.txt')
    dm_words_file = os.path.join(exp_dir, 'pv_dm_words.txt')
    dbow_file = os.path.join(exp_dir, 'pv_dbow.txt')

    train_dm = train_dm_fn(
        data,
        collection_size,
        vocabulary_size,
        embedding_size,
        window_size,
        num_sampled,
        dm_linear,
        batch_size,
        model_dir=dm_dir)

    train_dbow = train_dbow_fn(
        data,
        collection_size,
        vocabulary_size,
        embedding_size,
        window_size,
        num_sampled,
        batch_size,
        model_dir=dbow_dir)

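    # The classifier is trained with its own fixed batch size of 16,
    # independent of the batch_size used for embedding training.
    train_sentiment = train_sentiment_fn(
        train_data,
        eval_data,
        collection_size,
        embedding_size,
        batch_size=16)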

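    # All-zero matrix passed in place of the second embedding set for the
    # single-model (DM-only / DBOW-only) classifier runs.
    no_embeddings = np.zeros((collection_size, embedding_size),
                             dtype=np.float32)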
    
    for k in range(1, num_iters+1):
        print_step = k == 1 or k == num_iters or k % log_steps == 0
        if print_step: print('[ {} ]\n'.format(k))
        if print_step: print('DM...\n')

        embeddings_dm, embeddings_dm_words = \
            train_dm(last_print=print_step)
        
        if print_step: print('\nDBOW...\n')
        
        embeddings_dbow = train_dbow(last_print=print_step)
        
        embeddings_dm = embeddings_dm \
            / np.linalg.norm(embeddings_dm, axis=1, keepdims=True)
        embeddings_dbow = embeddings_dbow \
            / np.linalg.norm(embeddings_dbow, axis=1, keepdims=True)
        
        if print_step: print('\nSentiment...\n')

        train_sentiment([embeddings_dm / 2, embeddings_dbow / 2],
                        model_dir=sent_dir,
                        last_print=print_step)
        
        if print_step: print('\nSentiment DM-only...\n')
        
        train_sentiment([embeddings_dm, no_embeddings],
                        model_dir=sent_dm_dir,
                        last_print=print_step)
        if print_step: print('\nSentiment DBOW-only...\n')
        
        train_sentiment([embeddings_dbow, no_embeddings],
                        model_dir=sent_dbow_dir,
                        last_print=print_step)
        
        if print_step: print()
    
    save_embeddings(dm_file, embeddings_dm)
    print('DM Embeddings file size: {:,d} bytes'.format(os.stat(dm_file).st_size))
    save_embeddings(dm_words_file, embeddings_dm_words)
    print('DM Word Embeddings file size: {:,d} bytes'.format(os.stat(dm_words_file).st_size))
    save_embeddings(dbow_file, embeddings_dbow)
    print('DBOW Embeddings file size: {:,d} bytes'.format(os.stat(dbow_file).st_size))
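
The `train_*_fn` factories above return `functools.partial` objects, and `run_experiment` later supplies the embeddings positionally and overrides `model_dir` by keyword. The sketch below shows that mechanism in isolation. The `train` function is a hypothetical stand-in for `train_sentiment_pv` and only echoes its arguments.

```python
import functools


def train(model_fn, embeddings, model_dir='/tmp/default', last_print=True):
    # Hypothetical stand-in for train_sentiment_pv: just report its arguments.
    return model_fn(), embeddings, model_dir, last_print


# The factory binds the leading positional argument and a default model_dir.
train_sentiment = functools.partial(train,
                                    lambda: 'model',
                                    model_dir='/tmp/sent')

# Extra positional arguments are appended after the bound ones; keywords
# given at call time override keywords bound in the partial.
result = train_sentiment(['dm', 'dbow'],
                         model_dir='/tmp/exp/sent_dm',
                         last_print=False)
```

This is why one `train_sentiment` callable can serve the combined, DM-only and DBOW-only runs: each call passes its own `model_dir`.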

In [132]:
common_params = dict(num_iters=25,
                     log_steps=5,
                     data=data,
                     train_data=train_data,
                     eval_data=valid_data,
                     collection_size=len(document_to_id),
                     vocabulary_size=len(token_to_id),
                     num_sampled=100,
                     batch_size=64,
                     remove_home=True)
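
Inside `run_experiment`, both embedding matrices are scaled to unit L2 norm per row before they reach the classifier. The sketch below reproduces that step on a toy matrix. The guard for all-zero rows is an assumption added for illustration and is not part of the notebook, which divides by the norm directly.

```python
import numpy as np

embeddings = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]], dtype=np.float32)

# keepdims=True keeps shape (3, 1) so the division broadcasts row by row.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)

# Assumption, not in the notebook: leave all-zero rows untouched instead of
# dividing by zero, which would produce NaN values.
safe_norms = np.where(norms > 0, norms, 1.0)
normalized = embeddings / safe_norms
```

After this step every non-zero row has length 1, so the classifier sees only the direction of each paragraph vector and not its magnitude.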

In [133]:
run_experiment(name='25_2_avg',
               embedding_size=25,
               window_size=2,
               dm_linear='average',
               **common_params)

[ 1 ]

DM...

Last average loss: 3.6848

DBOW...

Last average loss: 4.5622

Sentiment...

Last average loss: 0.693
Last AUC: 0.520, eval 0.506
Last accuracy: 51.23, eval 50.41

Sentiment DM-only...

Last average loss: 0.693
Last AUC: 0.515, eval 0.499
Last accuracy: 50.64, eval 51.32

Sentiment DBOW-only...

Last average loss: 0.693
Last AUC: 0.513, eval 0.505
Last accuracy: 50.49, eval 50.59

[ 5 ]

DM...

Last average loss: 2.3938

DBOW...

Last average loss: 4.1841

Sentiment...

Last average loss: 0.689
Last AUC: 0.555, eval 0.526
Last accuracy: 53.73, eval 51.95

Sentiment DM-only...

Last average loss: 0.690
Last AUC: 0.544, eval 0.521
Last accuracy: 53.11, eval 51.86

Sentiment DBOW-only...

Last average loss: 0.692
Last AUC: 0.532, eval 0.516
Last accuracy: 51.99, eval 50.95

[ 10 ]

DM...

Last average loss: 1.8530

DBOW...

Last average loss: 3.5868

Sentiment...

Last average loss: 0.676
Last AUC: 0.611, eval 0.600
Last accuracy: 58.12, eval 58.67

Sentiment DM-only...

Las

In [134]:
run_experiment(name='25_4_avg',
               embedding_size=25,
               window_size=4,
               dm_linear='average',
               **common_params)

[ 1 ]

DM...

Last average loss: 3.8960

DBOW...

Last average loss: 4.5404

Sentiment...

Last average loss: 0.692
Last AUC: 0.527, eval 0.489
Last accuracy: 51.97, eval 47.23

Sentiment DM-only...

Last average loss: 0.693
Last AUC: 0.520, eval 0.509
Last accuracy: 51.92, eval 50.86

Sentiment DBOW-only...

Last average loss: 0.693
Last AUC: 0.515, eval 0.487
Last accuracy: 50.42, eval 47.87

[ 5 ]

DM...

Last average loss: 2.5777

DBOW...

Last average loss: 4.1952

Sentiment...

Last average loss: 0.690
Last AUC: 0.546, eval 0.497
Last accuracy: 53.20, eval 48.14

Sentiment DM-only...

Last average loss: 0.691
Last AUC: 0.534, eval 0.513
Last accuracy: 52.36, eval 50.86

Sentiment DBOW-only...

Last average loss: 0.692
Last AUC: 0.527, eval 0.486
Last accuracy: 51.92, eval 47.68

[ 10 ]

DM...

Last average loss: 2.0458

DBOW...

Last average loss: 3.7894

Sentiment...

Last average loss: 0.685
Last AUC: 0.572, eval 0.552
Last accuracy: 55.08, eval 54.86

Sentiment DM-only...

Las

In [135]:
run_experiment(name='25_2_concat',
               embedding_size=25,
               window_size=2,
               dm_linear='concatenate',
               **common_params)

[ 1 ]

DM...

Last average loss: 3.3434

DBOW...

Last average loss: 4.5614

Sentiment...

Last average loss: 0.692
Last AUC: 0.530, eval 0.513
Last accuracy: 51.59, eval 51.50

Sentiment DM-only...

Last average loss: 0.693
Last AUC: 0.519, eval 0.489
Last accuracy: 51.36, eval 49.77

Sentiment DBOW-only...

Last average loss: 0.692
Last AUC: 0.524, eval 0.521
Last accuracy: 51.69, eval 51.86

[ 5 ]

DM...

Last average loss: 2.0796

DBOW...

Last average loss: 4.1830

Sentiment...

Last average loss: 0.689
Last AUC: 0.548, eval 0.524
Last accuracy: 53.07, eval 52.41

Sentiment DM-only...

Last average loss: 0.692
Last AUC: 0.530, eval 0.492
Last accuracy: 51.47, eval 49.95

Sentiment DBOW-only...

Last average loss: 0.691
Last AUC: 0.535, eval 0.531
Last accuracy: 52.65, eval 51.86

[ 10 ]

DM...

Last average loss: 1.7241

DBOW...

Last average loss: 3.5776

Sentiment...

Last average loss: 0.683
Last AUC: 0.581, eval 0.582
Last accuracy: 55.56, eval 56.04

Sentiment DM-only...

Las

In [136]:
run_experiment(name='25_4_concat',
               embedding_size=25,
               window_size=4,
               dm_linear='concatenate',
               **common_params)

[ 1 ]

DM...

Last average loss: 2.9098

DBOW...

Last average loss: 4.5406

Sentiment...

Last average loss: 0.692
Last AUC: 0.526, eval 0.502
Last accuracy: 51.76, eval 49.59

Sentiment DM-only...

Last average loss: 0.692
Last AUC: 0.523, eval 0.498
Last accuracy: 51.67, eval 51.04

Sentiment DBOW-only...

Last average loss: 0.693
Last AUC: 0.515, eval 0.510
Last accuracy: 50.76, eval 49.77

[ 5 ]

DM...

Last average loss: 1.3410

DBOW...

Last average loss: 4.1954

Sentiment...

Last average loss: 0.690
Last AUC: 0.543, eval 0.499
Last accuracy: 52.61, eval 49.77

Sentiment DM-only...

Last average loss: 0.691
Last AUC: 0.534, eval 0.498
Last accuracy: 52.19, eval 49.59

Sentiment DBOW-only...

Last average loss: 0.692
Last AUC: 0.527, eval 0.507
Last accuracy: 51.97, eval 51.41

[ 10 ]

DM...

Last average loss: 1.0657

DBOW...

Last average loss: 3.7917

Sentiment...

Last average loss: 0.687
Last AUC: 0.561, eval 0.520
Last accuracy: 54.27, eval 50.68

Sentiment DM-only...

Las