<a id='top'></a><a name='top'></a>
# Chapter 4: Word and Sentence Embeddings

## 4.2 Sentence Embeddings

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/gbih/nlp/blob/main/ja_nlp_book/chp04_4_2_sentence_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

* [Setup](#setup)
* [4.2 Sentence Embeddings](#4.2)
    - [4.2.1 What are Sentence Embeddings?](#4.2.1)
    - [4.2.2 Train Sentence Embeddings with the Doc2Vec Model](#4.2.2)
    - [4.2.3 Search by Japanese Keywords with ElasticSearch](#4.2.3)
    - [4.2.4 Search by Sentence Embeddings with ElasticSearch](#4.2.4)

---
<a name='setup'></a><a id='setup'></a>
# Setup
<a href="#top">[back to top]</a>

In [1]:
from pathlib import Path

data_root = Path("chp04_02")
req_file = data_root / "requirements_4_4_2.txt"

if not data_root.is_dir():
    data_root.mkdir()
else:
    print(f"{data_root} exists.")

chp04_02 exists.


In [2]:
%%writefile {req_file}
fugashi[unidic]==1.2.1
gensim==4.2.0
japanize_matplotlib==1.1.3
watermark==2.3.1

Overwriting chp04_02/requirements_4_4_2.txt


In [3]:
import os
import sys

check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = True if (check1 or check2) else False

if IS_COLAB:
    print("Installing packages")
    !pip install --quiet -r {req_file}
    !python -m unidic download
    print("Packages installed.")
else:
    print("Running locally.")

Running locally.


In [4]:
# Standard Library imports
from importlib.metadata import version
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Suppress TensorFlog log messages
import pathlib
from pathlib import Path
import pprint
pp = pprint.PrettyPrinter(indent=4)
import shlex
import shutil
import subprocess
#from subprocess import Popen, PIPE, STDOUT
from sys import modules
import time

# Third-party imports
from elasticsearch import Elasticsearch
from fugashi import Tagger
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import japanize_matplotlib
import matplotlib.pyplot as plt
import requests
import tensorflow_datasets as tfds
from tqdm import tqdm
from watermark import watermark

def HR():
    print("-"*50)

# Examine all imported packages
print(watermark(iversions=True, globals_=globals(),python=True, machine=False))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

japanize_matplotlib: 1.1.3
requests           : 2.28.1
matplotlib         : 3.6.2
sys                : 3.8.12 (default, Dec 13 2021, 20:17:08) 
[Clang 13.0.0 (clang-1300.0.29.3)]
tensorflow_datasets: 4.6.0



In [5]:
assert version('fugashi') == '1.2.1'
assert version('gensim') == '4.2.0'
assert version('japanize_matplotlib') == '1.1.3'
assert version('unidic_lite') == '1.0.8'

print("Successfully imported specified packages.")

Successfully imported specified packages.


In [6]:
data_file = "elasticsearch-7.13.0-darwin-x86_64.tar.gz"
data_url = f"https://artifacts.elastic.co/downloads/elasticsearch/{data_file}"
data_dir = Path("chp04_02")
data_src = data_dir / data_file
data_path = data_dir / "elasticsearch-7.13.0"

print(f"""
data_file:\t{data_file}
data_url:\t{data_url}
data_dir:\t{data_dir}
data_src:\t{data_src}
data_path:\t{data_path}
""")


data_file:	elasticsearch-7.13.0-darwin-x86_64.tar.gz
data_url:	https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.0-darwin-x86_64.tar.gz
data_dir:	chp04_02
data_src:	chp04_02/elasticsearch-7.13.0-darwin-x86_64.tar.gz
data_path:	chp04_02/elasticsearch-7.13.0



In [7]:
data_wikipedia_path = Path("chp04_shared/wikipedia_data")
print(f"data_wikipedia_path:\t{data_wikipedia_path}")

data_wikipedia_path:	chp04_shared/wikipedia_data


In [8]:
model_dir = data_dir / "models"
model_path = model_dir / "Doc2Vec_model_4_2_2"
print(f"""
model_dir:\t{model_dir}
model_path:\t{model_path}
""")


model_dir:	chp04_02/models
model_path:	chp04_02/models/Doc2Vec_model_4_2_2



---
<a name='4.2'></a><a id='4.2'></a>
# 4.2 Sentence Embeddings
<a href="#top">[back to top]</a>

<a name='4.2.1'></a><a id='4.2.1'></a>
## 4.2.1 What are Sentence Embeddings?
<a href="#top">[back to top]</a>

**Concept**

In addition to word embeddings, we can also use sentence embeddings. These are real-valued vector representations of sentences. They are trained in such a way to capture linguistic and semantic properties of sentences.

Sentence embeddings are a powerful tool that enables finding documents that do not necessarily have lexical overlap with the query. 

**Mechanism**

There are multiple ways to compute sentence embeddings. 

The simplest way is to treat a sentence as a sequence of words, and average the embeddings for those words.

A more sophisticated way is to use the Doc2Vec model, which treats sentences as additional variables to the Word2Vec model. This captures the semantics of sentences by predicting the words that appear inside. 

**Workflow**

Here, we use the Doc2Vec model to train sentence embeddings from Wikipedia articles, and explore how to use the embeddings to index documents with ElasticSearch and retrieve documents semantically with vectors, not just with keywords. 

The main relevant Python libraries are:

* elasticsearch
* fugashi with unidic-lite or unidic
* gensim
* tensorflow
* tensorflow-datasets

**Dataset**

The main dataset we use is the Japanese Wikipedia dump, accessed via Tensorflow Dataset. We compute one vector per document/article.

**Reference**

* [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053)
* gensim.models.doc2vec.Doc2Vec
    - Learn paragraph and document embeddings via the distributed memory and distributed bag of words models from Quoc Le and Tomas Mikolov: “Distributed Representations of Sentences and Documents”.
* model.dv.most_similar
    - Find the top-N most similar keys.
    - Positive keys contribute positively towards the similarity, negative keys negatively.


<a name='4.2.2'></a><a id='4.2.2'></a>
## 4.2.2 Train Sentence Embeddings with the Doc2Vec Model
<a href="#top">[back to top]</a>

### Compute one vector per document/article.

In [9]:
# Load the Japanese Wikipedia dump
wikipedia_src='wikipedia/20200301.ja'
     
print(f"Use {wikipedia_src}")
HR()
      
ds = tfds.load(
    wikipedia_src, 
    split='train', 
    shuffle_files=True,
    data_dir = data_wikipedia_path
)

print(ds)

Use wikipedia/20200301.ja
--------------------------------------------------
<PrefetchDataset element_spec={'text': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>


### Tokenize the text into words (morphemes) with fugashi 

* We are creating TaggedDocument instances, which we need as input for the Doc2Vec model. 
* The sentence embeddings will be trained by creating the Doc2Vec object. 
* The resulting document embeddings are stored in `model.dv` and can be retrieved by invoking the `model.dv.most_similar()` method. 

In [10]:
tagger = Tagger()

documents = []
for i, example in enumerate(tqdm(tfds.as_numpy(ds.take(10_000)))):
    text = example['text'].decode('utf-8')
    tokens = [w.surface for w in tagger(text)]
    documents.append(TaggedDocument(tokens, [i]))
    
tqdm.write("Done tokenizing words.")

100%|████████████████████████████████████| 10000/10000 [00:25<00:00, 392.00it/s]

Done tokenizing words.





In [11]:
# Examine the document titles
for i, x in enumerate(documents[:5]):
    print(i, documents[i][0][:5])

0 ['フィリップ', '・', 'ルイス', '・', 'メイ']
1 ['転送', 'テイラー', '・', 'セント', '・']
2 ['喜多', '了祐', '（', 'きた', 'りょう']
3 ['ひまわり', '温泉', '(', 'ひまわり', 'おんせん']
4 ['高田', 'ドブロク', '事件', '（', 'たか']


In [12]:
# Setup for model and wikipedia directories
if not Path(data_wikipedia_path).is_dir():
    data_wikipedia_path.mkdir(parents=True, exist_ok=False)
else:
    print(f"{data_wikipedia_path} exists.")
    
if not Path(model_dir).is_dir():
    model_dir.mkdir(parents=True, exist_ok=False)
else:
    print(f"{model_dir} exists.")

chp04_shared/wikipedia_data exists.
chp04_02/models exists.


In [13]:
if not Path(model_path).is_file():
    print(f"Creating and training model {model_path}")

    model = Doc2Vec(
        documents,
        vector_size=100,
        window=5,
        min_count=5,
        workers=4
    )
    
    print("Done training")
    model.save(str(model_path))
    
else:
    print(f"{model_path} exists")

chp04_02/models/Doc2Vec_model_4_2_2 exists


In [14]:
# Continue training with the loaded model.

model = Doc2Vec.load(str(model_path))
type(model)

gensim.models.doc2vec.Doc2Vec

### Retrieve document embeddings stored in model.dv KeyedVectors

* `model.dv.most_similar` is used to find the top-N most similar keys.
* `model.dv` KeyedVectors is used to perform operations on the vectors such as vector lookup, distance, similarity etc.

In [15]:
def get_similar_docs(orig_doc_id):
    """
    Previews 10 most similar documents for a given document.
    """

    # Replacement for utility.truncate_by_width()
    truncate_n = 20 
    
    # Show more text for the doc_id so we can infer its content type
    w1 = ''.join(documents[orig_doc_id].words)[:truncate_n*4]
    t1 = 'orig id = '
    print(f"{t1:>10}{orig_doc_id:>4}, ttl = {w1}")
    HR()
    t2 = 'id = '
    
    # Find the top-N most similar keys per document id
    for doc_id, sim in model.dv.most_similar(orig_doc_id): 
        w2 = ''.join(documents[doc_id].words)[:truncate_n]
        print(f"{t2:>10}{doc_id:>4}, ttl = {w2}, sim={sim:.2f}")

In [16]:
get_similar_docs(orig_doc_id=1)

orig id =    1, ttl = 転送テイラー・セント・クレア
--------------------------------------------------
     id = 3747, ttl = 飯田和孝（いいだかずたか、1982年（昭, sim=0.87
     id = 7778, ttl = 転送シャルンホルスト級戦艦, sim=0.86
     id =   35, ttl = 『ONE』（ワン）は、ASKAの4枚目の, sim=0.86
     id =  528, ttl = 転送グレートディヴァイディング山脈, sim=0.86
     id = 6120, ttl = 転送本土#日本, sim=0.85
     id = 1474, ttl = 清水康英（しみずやすひで）は、戦国時代の, sim=0.85
     id = 3606, ttl = 転送マックス・パーキンズ, sim=0.85
     id = 9710, ttl = 高野切（こうやぎれ）は、『古今和歌集』の, sim=0.85
     id =  506, ttl = アズレン(azulene)は10個の炭素, sim=0.84
     id = 5810, ttl = 住吉台（すみよしだい）は、兵庫県神戸市東, sim=0.84


In [17]:
get_similar_docs(orig_doc_id=2)

orig id =    2, ttl = 喜多了祐（きたりょうゆう、1921年-2007年）は、日本の法学者。一橋大学法学部教授・名誉教授・法学博士。北海道小樽市生まれ。研究分野は商法。中華人民共和国国
--------------------------------------------------
     id =  639, ttl = サンティアゴ・アリアス・ナランホ（San, sim=0.90
     id = 2930, ttl = 南郷区（なんごうく）青森県八戸市に設置さ, sim=0.90
     id = 5337, ttl = 川崎市立御幸中学校（かわさきしりつみゆき, sim=0.90
     id = 7179, ttl = 転送明日はきっといい日になる#高橋優によ, sim=0.89
     id = 4544, ttl = 聴覚障害者のみなさんへ（ちょうかくしょう, sim=0.89
     id = 5695, ttl = 大阪東郵便局（おおさかひがしゆうびんきょ, sim=0.89
     id = 4564, ttl = ジムロック（Simrock）は、ドイツ語, sim=0.89
     id = 1322, ttl = みんなのうた年度別放送楽曲一覧（みんなの, sim=0.89
     id = 7876, ttl = 小島吉蔵（吉藏、こじまきちぞう、1885, sim=0.89
     id = 8878, ttl = 蔡元生（WinsonTsai、1969年, sim=0.88


In [18]:
get_similar_docs(orig_doc_id=123)

orig id =  123, ttl = 『トカゲの王』（トカゲのおう）は、入間人間・著、ブリキ・画のライトノベル作品。電撃文庫（アスキー・メディアワークス）より刊行されている。守月史貴・作画で電撃マオ
--------------------------------------------------
     id = 4743, ttl = ラリー・ユージーン・アンダーセン（Lar, sim=0.78
     id = 1166, ttl = 転送Z会, sim=0.68
     id = 4450, ttl = 『DDTVSサイバーエージェント路上プロ, sim=0.68
     id = 8249, ttl = リーク郡（）は、アメリカ合衆国ミシシッピ, sim=0.66
     id = 2145, ttl = 新生パーソナルローン株式会社（シンセイパ, sim=0.66
     id = 3140, ttl = パトロール（Patrol）は、日産車体が, sim=0.66
     id =  230, ttl = ヤマドリタケ(Boletusedulis, sim=0.65
     id = 6129, ttl = 転送衝撃速報!アカルイ☆ミライ, sim=0.65
     id = 7752, ttl = 『LostMaria-名もなき花-』（ロ, sim=0.65
     id = 4111, ttl = 『聖闘士星矢真紅の少年伝説』（セイントセ, sim=0.65


<a name='4.2.3'></a><a id='4.2.3'></a>
## 4.2.3 Search by Japanese Keywords with ElasticSearch
<a href="#top">[back to top]</a>

**Concept**

Sentence embeddings is a powerful technique enabling retrieval of documents that do not necessarily have words in common with the given query.

For large documents, it may be useful to integrate sentence embeddings with dedicated search engines for performance reasons. 

Here, we explore using sentence embeddings to index and retrieve documents using ElasticSearch, a popular search engine library.

**Workflow**

* Download and run ElasticSearch as a subprocess (necessary if we run this inside a Jupyter notebook). 
* Install the plugin for Kuromoji, a popular open source softward for tokenizing Japanese text. This is often used for search.

### Setup Elasticsearch

In [19]:
if not data_src.is_file():
    print(f"Downloading {data_url}")
    subprocess.run(shlex.split(f"wget -q -O {data_src} {data_url}"))
    print("Done.")
else:
    print(f"{data_src} exists.")

chp04_02/elasticsearch-7.13.0-darwin-x86_64.tar.gz exists.


In [20]:
if not data_path.is_dir():
    print(f"Extracting {data_src} to {data_path}")
    
    shutil.unpack_archive(data_src, data_dir)
    # subprocess.run(shlex.split(f"tar -xf {data_src} -C {data_dir}"))
    print("Done.")
else:
    print(f"{data_path} exists")

Extracting chp04_02/elasticsearch-7.13.0-darwin-x86_64.tar.gz to chp04_02/elasticsearch-7.13.0
Done.


In [21]:
if IS_COLAB:
    !chown ‐R daemon:daemon {data_path}

In [22]:
# Setup for elasticsearch
elasticsearch_bin = data_path / "bin"
elasticsearch_plugin = f"{elasticsearch_bin}/elasticsearch-plugin"
elasticsearch = f"{elasticsearch_bin}/elasticsearch"

print(f"""
elasticsearch_bin:\t{elasticsearch_bin}
elasticsearch_plugin:\t{elasticsearch_plugin}
elasticsearch:\t\t{elasticsearch}
""")


elasticsearch_bin:	chp04_02/elasticsearch-7.13.0/bin
elasticsearch_plugin:	chp04_02/elasticsearch-7.13.0/bin/elasticsearch-plugin
elasticsearch:		chp04_02/elasticsearch-7.13.0/bin/elasticsearch



In [23]:
print("Check if any plugins are pre-installed. Should be none initially:")
!{elasticsearch_plugin} list
HR()

print("Install Kuromoji plugin:")
try:
    subprocess.run(shlex.split(f"{elasticsearch_plugin} install analysis-kuromoji"))
except Exception as e:
    print(f"Error:  {e}")
HR()

print("Confirm plugins:")
!{elasticsearch_plugin} list

Check if any plugins are pre-installed. Should be none initially:
--------------------------------------------------
Install Kuromoji plugin:
-> Installing analysis-kuromoji
-> Downloading analysis-kuromoji from elastic
-> Installed analysis-kuromoji
-> Please restart Elasticsearch to activate any plugins installed
--------------------------------------------------
Confirm plugins:
analysis-kuromoji


In [24]:
# Make sure we don't have a previous elasticsearch daemon instance
!ps -ef | grep elasticsearch

  501  5109  5052   0  9:26PM ttys001    0:00.01 /bin/zsh -c ps -ef | grep elasticsearch
  501  5111  5109   0  9:26PM ttys001    0:00.00 grep elasticsearch


### Run ElasticSearch as a subprocess.

In [25]:
# This handle errors more gracefully than es.ping
def readiness_probe(mode="code"):
    try:
        # Query base endpoint. This also retrieves cluster information.
        resp = requests.get('http://localhost:9200/')
    except Exception:
        return False
    else:
        if mode == "code":
            return resp
        elif mode=="text":
            return resp.text
        
# print(readiness_probe(mode="text"))

In [26]:
if IS_COLAB:
    server = subprocess.Popen([elasticsearch], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, preexec_fn=lambda: os.setuid(1))
else:
    server = subprocess.Popen([elasticsearch], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 

# If we use subprocess.run() inside a Jupyter notebook, it will end up hanging on the specific cell, 
# since that process is not detached. So, we use Popen since it is asynchronous, however we have to
# wait for its start-up time. Here we use a pseudo-readiness probe
timeout_n = 40
print("Starting up Elasticsearch service.", end = '')

while not readiness_probe():
    i += 1
    time.sleep(1)
    print(".", end='')
    #readiness_probe()

print(f" done in {i} seconds.")

# Initialize the Python library for ElasticSearch and confirms it's running
print("Initialize Elasticsearch Client")

es = Elasticsearch(
    hosts=["http://localhost:9200"], 
    request_timeout=60, 
    retry_on_timeout=True
)

try:
    print(f"Running ping test: {es.ping()}")
except Exception as e:
    print(f"Error: {e}")

Starting up Elasticsearch service.................................. done in 37 seconds.
Initialize Elasticsearch Client
Running ping test: True




In [27]:
# If need to kill for debugging:
# !pkill -f 'elasticsearch'

In [28]:
# Retrieve information about the cluster.
print(readiness_probe(mode="text"))

{
  "name" : "Georges-MacBook-Air.local",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "YTU83_JvTOSTJ-wmDGTD6Q",
  "version" : {
    "number" : "7.13.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "5ca8591c6fcdb1260ce95b08a8e023559635c6f3",
    "build_date" : "2021-05-19T22:22:26.081971330Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}



In [29]:
es.indices.delete(index='wikipedia', ignore=404)

In [32]:
ES_SETTINGS = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "kuromoji"
            },
            "text": {
                "type": "text",
                "analyzer": "kuromoji"
            }
        }
    }
}

es.indices.create(index="wikipedia", body=ES_SETTINGS)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'wikipedia'}

### Index first 10K Japanese Wikipedia articles with Elasticsearch

In [33]:
print("Start of indexing..")

for example in tqdm(tfds.as_numpy(ds.take(10_000))):
    es.index(
        index="wikipedia",
        body={
            "title": example["title"].decode("utf-8"),
            "text": example["title"].decode("utf-8").replace("\n", "")
        }
    )
    
tqdm.write("Finished indexing.")

Start of indexing..


100%|█████████████████████████████████████| 10000/10000 [05:59<00:00, 27.78it/s]

Finished indexing.





### Run a query

We run the query, "自然言語", and retrieve the most relevant documents. The matched parts will be highlighted.

In [34]:
# Make sure Elasticsearch is running
readiness_probe()

<Response [200]>

In [35]:
query_word = '自然言語'

res = es.search(
    index="wikipedia",
    body={
        "query": {
            "query_string": {"query": query_word}
        },
        "highlight": {
            "fragment_size": 100,
            "fields": {
                "title": {},
                "text": {}
            }
        }
    }
)

# Check the data structure of the returned dict
res

{'took': 1214,
 'timed_out': False,
 '_shards': {'total': 3, 'successful': 3, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 8, 'relation': 'eq'},
  'max_score': 8.326296,
  'hits': [{'_index': 'wikipedia',
    '_type': '_doc',
    '_id': '7UMoAYUBsmPYkYGKPbgW',
    '_score': 8.326296,
    '_source': {'title': '自然硫黄', 'text': '自然硫黄'},
    'highlight': {'text': ['<em>自然</em>硫黄'], 'title': ['<em>自然</em>硫黄']}},
   {'_index': 'wikipedia',
    '_type': '_doc',
    '_id': 'm0MrAYUBsmPYkYGKZs9M',
    '_score': 8.326296,
    '_source': {'title': '自然成立', 'text': '自然成立'},
    'highlight': {'text': ['<em>自然</em>成立'], 'title': ['<em>自然</em>成立']}},
   {'_index': 'wikipedia',
    '_type': '_doc',
    '_id': 'jEMqAYUBsmPYkYGKHsbP',
    '_score': 6.8374434,
    '_source': {'title': 'IETF言語タグ', 'text': 'IETF言語タグ'},
    'highlight': {'text': ['IETF<em>言語</em>タグ'],
     'title': ['IETF<em>言語</em>タグ']}},
   {'_index': 'wikipedia',
    '_type': '_doc',
    '_id': 'tUMqAYUBsmPYkYGK_swN',
    '_sco

In [36]:
# Prettier output
print(f"Result of keyword search: '{query_word}'")
HR()
for doc in res["hits"]["hits"]:
    if doc["_source"]["title"]:
        print(doc["_source"]["title"])
        for line in doc["highlight"]["text"]:
            print("     "+line[:50])

Result of keyword search: '自然言語'
--------------------------------------------------
自然硫黄
     <em>自然</em>硫黄
自然成立
     <em>自然</em>成立
IETF言語タグ
     IETF<em>言語</em>タグ
Lucid (プログラミング言語)
     Lucid (プログラミング<em>言語</em>)
C11 (C言語)
     C11 (C<em>言語</em>)
利根別自然公園
     利根別<em>自然</em>公園
禁止法 (言語学)
     禁止法 (<em>言語</em>学)
斎藤報恩会自然史博物館
     斎藤報恩会<em>自然</em>史博物館


<a name='4.2.4'></a><a id='4.2.4'></a>
## 4.2.4 Search by Sentence Embeddings with ElasticSearch
<a href="#top">[back to top]</a>

Next, we index documents using sentence embedding. 

For this, we specify a field of type `dense_vector` when we create an index with ElasticSearch. 

Here, we create a field called `text_vector` of type `dense_vector` that stores 100-dimensional vectors. 

In [37]:
es.indices.delete(index="wikipedia-vector", ignore=404)

ES_SETTINGS = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "kuromoji"
            },
            "text": {
                "type": "text",
                "analyzer": "kuromoji"
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 100
            }
        }
    }
}

es.indices.create(index="wikipedia-vector", body=ES_SETTINGS)

{'acknowledged': True,
 'shards_acknowledged': True,
 'index': 'wikipedia-vector'}

### Add new dense_vector field to enable sentence embeddings.

`doc2vec_model.dv.get_vector` returns a single unit-normalized vector for a key.

In [38]:
# Wrap enumerate over tqdm
for i, example in enumerate(tqdm(tfds.as_numpy(ds.take(10_000)))):
    es.index(
        index="wikipedia-vector",
        body={
            "title": example["title"].decode("utf-8"),
            "text": example["text"].decode("utf-8").replace("\n", " "),
            "text_vector": model.dv.get_vector(i)
        }
    )

100%|█████████████████████████████████████| 10000/10000 [06:10<00:00, 27.00it/s]


### Search by sentence embeddings

Search by sentence embeddings by specifying a script when querying the index. 

Here, we use a `cosineSimilarity()` function to calculate the similarity between the query vector, and each document vector. 

Elasticsearch returns the same results as from the above `most_similar()` method.

In [39]:
def search_similar_docs(orig_doc_id):
    print(f"orig_doc_id={orig_doc_id}, doc={' '.join(documents[orig_doc_id].words)[:40]}")

    query_vector = model.dv.get_vector(orig_doc_id)
    
    # Our search script
    res = es.search(
        index="wikipedia-vector",
        body = {
            "query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        # Calculate the similarity between query vector and each document vector
                        "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                        "params": {"query_vector": query_vector}
                    }
                }
            }
        }
    )
        
    for doc in res["hits"]["hits"]:
        print(doc["_source"]["title"])
        print("\t" + (doc["_source"]["text"])[:50])
        HR()

In [40]:
# Make sure Elasticsearch is running
readiness_probe()

<Response [200]>

In [41]:
search_similar_docs(orig_doc_id=1)

orig_doc_id=1, doc=転送 テイラー ・ セント ・ クレア
3Dプリンター銃製造事件
	3Dプリンター銃製造事件（スリーディープリンターじゅうせいぞうじけん）とは、2014年に日本で3Dプ
--------------------------------------------------
トミー・リピューマ
	トミー・リピューマ（Tommy LiPuma, 1936年7月5日 - 2017年3月13日）はアメ
--------------------------------------------------
1980年モスクワオリンピックのオーストラリア選手団
	1980年モスクワオリンピックのオーストラリア選手団は、1980年7月19日から8月3日にかけてソビ
--------------------------------------------------
油通し
	転送 揚げ物
--------------------------------------------------
川口ヱリサ
	川口 ヱリサ（かわぐち えりさ、Elisa Kawaguti、1959年6月8日 - ）は、北九州市
--------------------------------------------------
矢内理絵子
	矢内 理絵子（やうち りえこ、1980年1月10日 - ）は、日本将棋連盟所属の女流棋士。埼玉県行田
--------------------------------------------------
藍住町民体育館
	藍住町民体育館（あいずみちょうみんたいいくかん）は、徳島県板野郡藍住町にある体育館である。  藍住町
--------------------------------------------------
南日本造船
	株式会社南日本造船（みなみにっぽんぞうせん）は、大分県大分市に本社を置く今治造船関連造船会社。  概
--------------------------------------------------
京極高正
	京極 高正（きょうごく たかまさ、生没年不詳）は、江戸時代後期の高家旗本。父は京極高以。通称は鋼之丞
--------------------------------------------

In [42]:
search_similar_docs(orig_doc_id=2)

orig_doc_id=2, doc=喜多 了祐 （ きた りょう ゆう 、 1921 年 - 2007 年 ） は 
1576年
	  他の紀年法    干支 : 丙子  日本  天正4年  光永元年（私年号）  皇紀2236年  
--------------------------------------------------
ルカ・ラノッテ
	ルカ・ラノッテ（, 1985年7月30日 - ）は、イタリアミラノ出身の男性(アイスダンス)選手。パ
--------------------------------------------------
ウィリアム・ハント
	ウィリアム・ハント (William Hunt)   ウィリアム・ヘンリー・ハント (画家) - イ
--------------------------------------------------
大阪府立東大阪支援学校
	大阪府立東大阪支援学校（おおさかふりつ ひがしおおさかしえんがっこう）は、大阪府東大阪市中石切町三丁
--------------------------------------------------
民俗採集
	民俗採集（みんぞくさいしゅう）とは、民俗資料の収集のために現地に直接足を運び、聞き書きや参与観察をお
--------------------------------------------------
マイケル・ナナリー
	マイケル・ナナリー（Micael Nunnally、1986年7月18日 - ）は、アメリカ合衆国・
--------------------------------------------------
全日本学生柔道体重別団体優勝大会
	全日本学生柔道体重別団体優勝大会（ぜんにほんがくせいじゅうどうたいじゅうべつだんたいゆうしょういたい
--------------------------------------------------
くちびるモーション
	転送 オリエンタル・ダイヤモンド/くちびるモーション  Category:吉井和哉が制作した楽曲 C
--------------------------------------------------
牛島村
	牛島村（うしのしまそん）は、徳島県麻植郡にあった村。現在の吉野川市

In [43]:
# Stop Elasticserver
!pkill -f 'elasticsearch'
readiness_probe()

False