<a id='top'></a><a name='top'></a>
# Chapter 4: Word and Sentence Embeddings

## 4.2 Sentence Embeddings

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/gbih/nlp/blob/main/ja_nlp_book/chp04_4_2_sentence_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

* [Setup](#setup)
* [4.2 Sentence Embeddings](#4.2)
    - [4.2.1 What are Sentence Embeddings?](#4.2.1)
    - [4.2.2 Train Sentence Embeddings with the Doc2Vec Model](#4.2.2)
    - [4.2.3 Search by Japanese Keywords with ElasticSearch](#4.2.3)
    - [4.2.4 Search by Sentence Embeddings with ElasticSearch](#4.2.4)

---
<a name='setup'></a><a id='setup'></a>
# Setup
<a href="#top">[back to top]</a>

In [None]:
from pathlib import Path

data_root = Path("chp04_02")
req_file = data_root / "requirements_4_4_2.txt"

if not data_root.is_dir():
    data_root.mkdir()
else:
    print(f"{data_root} exists.")

In [None]:
%%writefile {req_file}
elasticsearch==7.13.0
fugashi[unidic]==1.2.1
gensim==4.2.0
japanize_matplotlib==1.1.3
watermark==2.3.1

Writing chp04_02/requirements_4_4_2.txt


In [None]:
import os
import sys

check1 = ('google.colab' in sys.modules)
check2 = (os.environ.get('CLOUDSDK_CONFIG')=='/content/.config')
IS_COLAB = check1 or check2

if IS_COLAB:
    print("Installing packages")
    !pip install --quiet -r {req_file}
    !python -m unidic download
    print("Packages installed.")
else:
    print("Running locally.")

Installing packages
  Building wheel for japanize-matplotlib (setup.py) ... done
  Building wheel for unidic (setup.py) ... done
download url: https://cotonoha-dic.s3-ap-northeast-1.amazonaws.com/unidic-3.1.0.zip
Dictionary version: 3.1.0+2021-08-31
Downloading UniDic v3.1.0+2021-08-31...
unidic-3.1.0.zip: 100% 526M/526M [00:41<00:00, 12.6MB/s]
Finished download.
Downloaded UniDic v3.1.0+2021-08-31 to /usr/local/lib/python3.8/dist-packages/unidic/dicdir
Packages installed.


In [None]:
# Standard Library imports
from importlib.metadata import version
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Suppress TensorFlow log messages
import pathlib
from pathlib import Path
import pprint
pp = pprint.PrettyPrinter(indent=4)
import shlex
import shutil
import subprocess
#from subprocess import Popen, PIPE, STDOUT
from sys import modules
import time

# Third-party imports
from elasticsearch import Elasticsearch
from fugashi import Tagger
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import japanize_matplotlib
import matplotlib.pyplot as plt
import requests
import tensorflow_datasets as tfds
from tqdm import tqdm
from watermark import watermark

def HR():
    print("-"*50)

# Examine all imported packages
print(watermark(iversions=True, globals_=globals(),python=True, machine=True))

Python implementation: CPython
Python version       : 3.8.16
IPython version      : 7.9.0

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.10.133+
Machine     : x86_64
Processor   : x86_64
CPU cores   : 2
Architecture: 64bit

matplotlib         : 3.2.2
pathlib            : 1.0.1
japanize_matplotlib: 1.1.3
requests           : 2.23.0
tensorflow_datasets: 4.6.0
sys                : 3.8.16 (default, Dec  7 2022, 01:12:13) 
[GCC 7.5.0]



In [None]:
assert version('elasticsearch') == '7.13.0'
assert version('fugashi') == '1.2.1'
assert version('gensim') == '4.2.0'
assert version('japanize_matplotlib') == '1.1.3'

print("Successfully imported specified packages.")

Successfully imported specified packages.


In [None]:
if IS_COLAB:
    data_file = "elasticsearch-7.13.0-linux-x86_64.tar.gz"
else:
    data_file = "elasticsearch-7.13.0-darwin-x86_64.tar.gz"
data_url = f"https://artifacts.elastic.co/downloads/elasticsearch/{data_file}"
data_dir = Path("chp04_02")
data_src = data_dir / data_file
data_path = data_dir / "elasticsearch-7.13.0"

print(f"""
data_file:\t{data_file}
data_url:\t{data_url}
data_dir:\t{data_dir}
data_src:\t{data_src}
data_path:\t{data_path}
""")


data_file:	elasticsearch-7.13.0-linux-x86_64.tar.gz
data_url:	https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.0-linux-x86_64.tar.gz
data_dir:	chp04_02
data_src:	chp04_02/elasticsearch-7.13.0-linux-x86_64.tar.gz
data_path:	chp04_02/elasticsearch-7.13.0



In [None]:
data_wikipedia_path = Path("chp04_shared/wikipedia_data")
print(f"data_wikipedia_path:\t{data_wikipedia_path}")

data_wikipedia_path:	chp04_shared/wikipedia_data


In [None]:
model_dir = data_dir / "models"
model_path = model_dir / "Doc2Vec_model_4_2_2"
print(f"""
model_dir:\t{model_dir}
model_path:\t{model_path}
""")


model_dir:	chp04_02/models
model_path:	chp04_02/models/Doc2Vec_model_4_2_2



---
<a name='4.2'></a><a id='4.2'></a>
# 4.2 Sentence Embeddings
<a href="#top">[back to top]</a>

<a name='4.2.1'></a><a id='4.2.1'></a>
## 4.2.1 What are Sentence Embeddings?
<a href="#top">[back to top]</a>

**Concept**

In addition to word embeddings, we can also use sentence embeddings: real-valued vector representations of sentences, trained to capture their linguistic and semantic properties.

Sentence embeddings make it possible to find documents that share little or no vocabulary with the query.

**Mechanism**

There are multiple ways to compute sentence embeddings. 

The simplest way is to treat a sentence as a sequence of words, and average the embeddings for those words.
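
As a minimal sketch of the averaging approach, the snippet below averages made-up 3-dimensional word vectors. The names `average_embedding` and `toy_vectors` are hypothetical and exist only for this illustration; a real pipeline would take the vectors from a trained word embedding model.

```python
from typing import Dict, List, Sequence


def average_embedding(tokens: Sequence[str],
                      word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a known embedding")
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy vectors, invented for illustration only
toy_vectors = {
    "自然": [1.0, 0.0, 2.0],
    "言語": [3.0, 2.0, 0.0],
}

# "処理" has no vector here, so it does not contribute to the average
sentence_vector = average_embedding(["自然", "言語", "処理"], toy_vectors)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

Averaging ignores word order, which is one reason to prefer a model such as Doc2Vec.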

A more sophisticated way is the Doc2Vec model, which extends Word2Vec with an additional trainable vector for each sentence (or document). That vector is trained to help predict the words that appear in the sentence, so it comes to capture the sentence's semantics.

**Workflow**

Here, we use the Doc2Vec model to train sentence embeddings from Wikipedia articles. We then index the documents in Elasticsearch together with their embeddings, so that they can be retrieved by vector similarity and not only by keywords.

The main relevant Python libraries are:

* elasticsearch
* fugashi with unidic-lite or unidic
* gensim
* tensorflow
* tensorflow-datasets

**Dataset**

The main dataset we use is the Japanese Wikipedia dump, accessed via TensorFlow Datasets. We compute one vector per document/article.

**Reference**

* [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053)
* gensim.models.doc2vec.Doc2Vec
    - Learn paragraph and document embeddings via the distributed memory and distributed bag of words models from Quoc Le and Tomas Mikolov: “Distributed Representations of Sentences and Documents”.
* model.dv.most_similar
    - Find the top-N most similar keys.
    - Positive keys contribute positively towards the similarity, negative keys negatively.


<a name='4.2.2'></a><a id='4.2.2'></a>
## 4.2.2 Train Sentence Embeddings with the Doc2Vec Model
<a href="#top">[back to top]</a>

### Compute one vector per document/article.

In [None]:
# Load the Japanese Wikipedia dump
wikipedia_src='wikipedia/20200301.ja'
     
print(f"Use {wikipedia_src}")
HR()
      
ds = tfds.load(
    wikipedia_src, 
    split='train', 
    shuffle_files=True,
    data_dir = data_wikipedia_path
)

print(ds)

Use wikipedia/20200301.ja
--------------------------------------------------
Downloading and preparing dataset 2.95 GiB (download: 2.95 GiB, generated: 5.33 GiB, total: 8.28 GiB) to chp04_shared/wikipedia_data/wikipedia/20200301.ja/1.0.0...


Dl Completed...:   0%|          | 0/66 [00:00<?, ? file/s]

Dataset wikipedia downloaded and prepared to chp04_shared/wikipedia_data/wikipedia/20200301.ja/1.0.0. Subsequent calls will reuse this data.
<PrefetchDataset element_spec={'text': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>


### Tokenize the text into words (morphemes) with fugashi 

* We create `TaggedDocument` instances, which the Doc2Vec model requires as input.
* Constructing the `Doc2Vec` object with these documents trains the sentence embeddings.
* The resulting document embeddings are stored in `model.dv`; `model.dv.most_similar()` returns the nearest neighbors of a given document.

In [None]:
tagger = Tagger()

documents = []
for i, example in enumerate(tqdm(tfds.as_numpy(ds.take(10_000)))):
    text = example['text'].decode('utf-8')
    tokens = [w.surface for w in tagger(text)]
    documents.append(TaggedDocument(tokens, [i]))
    
tqdm.write("Done tokenizing words.")

100%|██████████| 10000/10000 [00:23<00:00, 419.81it/s]

Done tokenizing words.





In [None]:
# Examine the first five tokens of the first five documents
for i, doc in enumerate(documents[:5]):
    print(i, doc.words[:5])

0 ['転送', '伊', '号', '第', '百']
1 ['「', 'taboo', '」', 'は', '、']
2 ['チオシアン', '酸', 'エチル', '（', 'チオシアン']
3 ['北海道', '道', '1154', '号', '本別']
4 ['第', '35', '回', '有馬', '記念']


In [None]:
# Setup for model and wikipedia directories
if not Path(data_wikipedia_path).is_dir():
    data_wikipedia_path.mkdir(parents=True, exist_ok=False)
else:
    print(f"{data_wikipedia_path} exists.")
    
if not Path(model_dir).is_dir():
    model_dir.mkdir(parents=True, exist_ok=False)
else:
    print(f"{model_dir} exists.")

chp04_shared/wikipedia_data exists.


In [None]:
if not Path(model_path).is_file():
    print(f"Creating and training model {model_path}")

    model = Doc2Vec(
        documents,
        vector_size=100,
        window=5,
        min_count=5,
        workers=4
    )
    
    print("Done training")
    model.save(str(model_path))
    
else:
    print(f"{model_path} exists")

Creating and training model chp04_02/models/Doc2Vec_model_4_2_2
Done training


In [None]:
# Load the saved model from disk.

model = Doc2Vec.load(str(model_path))
type(model)

gensim.models.doc2vec.Doc2Vec

### Retrieve document embeddings stored in model.dv KeyedVectors

* `model.dv.most_similar` is used to find the top-N most similar keys.
* `model.dv` KeyedVectors is used to perform operations on the vectors such as vector lookup, distance, similarity etc.
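
To make the idea of a top-N similarity lookup concrete, here is a simplified pure-Python sketch that ranks toy vectors by cosine similarity. It is not gensim's implementation (which is vectorized); the function names and the 2-dimensional `toy_doc_vectors` are assumptions made up for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(key: int,
                 vectors: Dict[int, List[float]],
                 topn: int = 10) -> List[Tuple[int, float]]:
    """Return the topn (key, similarity) pairs closest to `key`, excluding itself."""
    query = vectors[key]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != key]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy document vectors, invented for illustration only
toy_doc_vectors = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
}

ranking = most_similar(0, toy_doc_vectors, topn=2)
print([doc_id for doc_id, _ in ranking])  # [1, 2]
```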

In [None]:
def get_similar_docs(orig_doc_id):
    """
    Previews 10 most similar documents for a given document.
    """

    # Replacement for utility.truncate_by_width()
    truncate_n = 20 
    
    # Show more text for the doc_id so we can infer its content type
    w1 = ''.join(documents[orig_doc_id].words)[:truncate_n*4]
    t1 = 'orig id = '
    print(f"{t1:>10}{orig_doc_id:>4}, ttl = {w1}")
    HR()
    t2 = 'id = '
    
    # Find the top-N most similar keys per document id
    for doc_id, sim in model.dv.most_similar(orig_doc_id): 
        w2 = ''.join(documents[doc_id].words)[:truncate_n]
        print(f"{t2:>10}{doc_id:>4}, ttl = {w2}, sim={sim:.2f}")

In [None]:
get_similar_docs(orig_doc_id=1)

orig id =    1, ttl = 「taboo」は、MUCCの楽曲で、40枚目のシングル。2019年8月21日、なんばHatchにて開催された『MUCCBIRTHDAYCIRCUIT2019「4
--------------------------------------------------
     id = 7000, ttl = 「せかいでいちばん」は、井上苑子の楽曲で, sim=0.86
     id = 7229, ttl = 「youaremysecret」（ユー・, sim=0.84
     id = 9224, ttl = オ・ハヨン(漢字:呉夏栄、ハングル:오하, sim=0.83
     id =  959, ttl = 転送VOICE(Perfumeの曲)Ca, sim=0.82
     id = 4436, ttl = 「melody〜SOUNDSREAL〜」, sim=0.82
     id = 8672, ttl = 『秋コレ～MTR&YTour2015～』, sim=0.82
     id = 6091, ttl = 「WHITEOUT〜memoryofac, sim=0.82
     id = 8086, ttl = 岡田直美（おかだなおみ）は、日本のフリー, sim=0.82
     id =  345, ttl = ポール・ラスト（PaulRust,198, sim=0.82
     id = 7722, ttl = 佐藤泰男（さとうやすお、1953年9月2, sim=0.81


In [None]:
get_similar_docs(orig_doc_id=2)

orig id =    2, ttl = チオシアン酸エチル（チオシアンさんエチル）はチオシアン酸エステルの一種で、化学式で表される有機化合物である。別名エチルロダニド。性質常温ではタマネギ臭を持つ無色
--------------------------------------------------
     id = 5594, ttl = 3,3-ジメチル-1-ブテン（）は、化学, sim=0.88
     id = 7903, ttl = シクロモナス属はシクロモナス科の基準属で, sim=0.85
     id = 7181, ttl = リュウキュウアイ（琉球藍、学名:Stro, sim=0.83
     id = 9835, ttl = 鈴木進悦（すずきしんえつ、1939年（昭, sim=0.81
     id = 1917, ttl = 一次小節（いちじしょうせつ、または一次濾, sim=0.80
     id = 1486, ttl = ホルミルメタノフランデヒドロゲナーゼ（f, sim=0.80
     id = 6014, ttl = PFEファイザーのニューヨーク証券取引所, sim=0.80
     id = 1031, ttl = キサンツレン酸(xanthurenica, sim=0.79
     id = 3966, ttl = 転送ヘキサニトロコバルト(III)酸カリ, sim=0.79
     id = 6915, ttl = アルコールデヒドロゲナーゼ(アズリン)(, sim=0.79


In [None]:
get_similar_docs(orig_doc_id=123)

orig id =  123, ttl = 転送925hPa
--------------------------------------------------
     id =  520, ttl = 転送フリチオフ・ホルムグレーン, sim=0.93
     id = 7766, ttl = 転送IStock, sim=0.93
     id = 9704, ttl = 転送アルフィー, sim=0.93
     id = 4478, ttl = 転送ドロンニング・モード・ランド, sim=0.93
     id = 8687, ttl = 転送フェティシズム, sim=0.92
     id = 5804, ttl = 転送イーサネット, sim=0.92
     id = 1633, ttl = 転送鞴, sim=0.92
     id = 1163, ttl = 転送特産品, sim=0.92
     id = 9723, ttl = 転送アンモライト, sim=0.92
     id = 2601, ttl = 転送サラーバト・ジャング, sim=0.92


<a name='4.2.3'></a><a id='4.2.3'></a>
## 4.2.3 Search by Japanese Keywords with ElasticSearch
<a href="#top">[back to top]</a>

**Concept**

Sentence embeddings are a powerful technique for retrieving documents that do not necessarily share words with the given query.

For large document collections, it may be useful to integrate sentence embeddings with a dedicated search engine for performance reasons.

Here, we explore using sentence embeddings to index and retrieve documents with Elasticsearch, a popular search engine.

**Workflow**

* Download and run Elasticsearch as a subprocess (necessary if we run this inside a Jupyter notebook).
* Install the plugin for Kuromoji, a popular open source tokenizer for Japanese text that is often used for search.

### Setup Elasticsearch

In [None]:
if not data_src.is_file():
    print(f"Downloading {data_url}")
    subprocess.run(shlex.split(f"wget -q -O {data_src} {data_url}"))
    print("Done.")
else:
    print(f"{data_src} exists.")

Downloading https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.13.0-linux-x86_64.tar.gz
Done.


In [None]:
if not data_path.is_dir():
    print(f"Extracting {data_src} to {data_path}")
    
    shutil.unpack_archive(data_src, data_dir)
    # subprocess.run(shlex.split(f"tar -xf {data_src} -C {data_dir}"))
    print("Done.")
else:
    print(f"{data_path} exists")

Extracting chp04_02/elasticsearch-7.13.0-linux-x86_64.tar.gz to chp04_02/elasticsearch-7.13.0
Done.


In [None]:
if IS_COLAB:
    !sudo chown -R daemon:daemon {data_path}

In [None]:
# Setup for elasticsearch
elasticsearch_bin = data_path / "bin"
elasticsearch_plugin = f"{elasticsearch_bin}/elasticsearch-plugin"
elasticsearch = f"{elasticsearch_bin}/elasticsearch"

print(f"""
elasticsearch_bin:\t{elasticsearch_bin}
elasticsearch_plugin:\t{elasticsearch_plugin}
elasticsearch:\t\t{elasticsearch}
""")


elasticsearch_bin:	chp04_02/elasticsearch-7.13.0/bin
elasticsearch_plugin:	chp04_02/elasticsearch-7.13.0/bin/elasticsearch-plugin
elasticsearch:		chp04_02/elasticsearch-7.13.0/bin/elasticsearch



In [None]:
print("Check if any plugins are pre-installed. Should be none initially:")
!{elasticsearch_plugin} list
HR()

print("Install Kuromoji plugin:")
try:
    subprocess.run(shlex.split(f"{elasticsearch_plugin} install analysis-kuromoji"))
except Exception as e:
    print(f"Error:  {e}")
HR()

print("Confirm plugins:")
!{elasticsearch_plugin} list

Check if any plugins are pre-installed. Should be none initially:
--------------------------------------------------
Install Kuromoji plugin:
--------------------------------------------------
Confirm plugins:
analysis-kuromoji


In [None]:
# Make sure we don't have a previous elasticsearch daemon instance
!ps -ef | grep elasticsearch

root         553      75  0 15:09 ?        00:00:00 /bin/bash -c ps -ef | grep elasticsearch
root         555     553  0 15:09 ?        00:00:00 grep elasticsearch


### Run ElasticSearch as a subprocess.

In [None]:
# This handles errors more gracefully than es.ping
def readiness_probe(mode="code"):
    try:
        # Query base endpoint. This also retrieves cluster information.
        resp = requests.get('http://localhost:9200/')
    except Exception:
        return False
    else:
        if mode == "code":
            return resp
        elif mode=="text":
            return resp.text
        
# print(readiness_probe(mode="text"))

In [None]:
if IS_COLAB:
    server = subprocess.Popen([elasticsearch], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, preexec_fn=lambda: os.setuid(1))
else:
    server = subprocess.Popen([elasticsearch], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) 

# If we use subprocess.run() inside a Jupyter notebook, it will end up hanging on the specific cell, 
# since that process is not detached. So, we use Popen since it is asynchronous, however we have to
# wait for its start-up time. Here we use a pseudo-readiness probe
timeout_n = 40
print("Starting up Elasticsearch service.", end='')

for i in range(timeout_n):
    if readiness_probe():
        print(f" done in {i} seconds.")
        break
    time.sleep(1)
    print(".", end='')
else:
    # The loop finished without the probe ever succeeding
    print(f" not ready after {timeout_n} seconds.")

# Initialize the Python client for Elasticsearch
print("Initialize Elasticsearch Client")

es = Elasticsearch(
    hosts=["http://localhost:9200"], 
    request_timeout=60, 
    retry_on_timeout=True
)

Starting up Elasticsearch service.......................... done in 25 seconds.
Initialize Elasticsearch Client
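
The start-up wait above can be factored into a small reusable helper. The sketch below is one possible shape, not part of the original notebook: `wait_until_ready` and `fake_probe` are hypothetical names, and the stand-in probe replaces the real `readiness_probe()` so the example runs without a server.

```python
import time
from typing import Callable, Optional


def wait_until_ready(probe: Callable[[], object],
                     timeout_s: float = 40.0,
                     interval_s: float = 1.0) -> Optional[float]:
    """Poll `probe` until it returns a truthy value.

    Returns the elapsed seconds on success, or None if `timeout_s` passes first.
    """
    start = time.monotonic()
    while True:
        if probe():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


# Stand-in probe that only succeeds on its third call
calls = {"n": 0}


def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


elapsed = wait_until_ready(fake_probe, timeout_s=1.0, interval_s=0.01)
print("ready" if elapsed is not None else "timed out")
```

Returning `None` on timeout lets the caller distinguish a server that never came up from one that started slowly.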


In [None]:
# If need to kill for debugging:
# !pkill -f 'elasticsearch'

In [None]:
# Retrieve information about the cluster.
print(readiness_probe(mode="text"))

{
  "name" : "6ea91558da17",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "5TrcbDV_TqWO2VPZM0HB_g",
  "version" : {
    "number" : "7.13.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "5ca8591c6fcdb1260ce95b08a8e023559635c6f3",
    "build_date" : "2021-05-19T22:22:26.081971330Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}



In [None]:
try:
    print(f"Running ping test: {es.ping()}")
except Exception as e:
    print(f"Error: {e}")

Running ping test: True




In [None]:
es.indices.delete(index='wikipedia', ignore=404)

{'error': {'root_cause': [{'type': 'index_not_found_exception',
    'reason': 'no such index [wikipedia]',
    'resource.type': 'index_or_alias',
    'resource.id': 'wikipedia',
    'index_uuid': '_na_',
    'index': 'wikipedia'}],
  'type': 'index_not_found_exception',
  'reason': 'no such index [wikipedia]',
  'resource.type': 'index_or_alias',
  'resource.id': 'wikipedia',
  'index_uuid': '_na_',
  'index': 'wikipedia'},
 'status': 404}

In [None]:
ES_SETTINGS = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "kuromoji"
            },
            "text": {
                "type": "text",
                "analyzer": "kuromoji"
            }
        }
    }
}

es.indices.create(index="wikipedia", body=ES_SETTINGS)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'wikipedia'}

### Index first 10K Japanese Wikipedia articles with Elasticsearch

In [None]:
print("Start of indexing..")

for example in tqdm(tfds.as_numpy(ds.take(10_000))):
    es.index(
        index="wikipedia",
        body={
            "title": example["title"].decode("utf-8"),
            "text": example["text"].decode("utf-8").replace("\n", "")
        }
    )
    
tqdm.write("Finished indexing.")

Start of indexing..


100%|██████████| 10000/10000 [01:07<00:00, 147.60it/s]

Finished indexing.





### Run a query

We run the query, "自然言語", and retrieve the most relevant documents. The matched parts will be highlighted.

In [None]:
# Make sure Elasticsearch is running
readiness_probe()

<Response [200]>

In [None]:
query_word = '自然言語'

res = es.search(
    index="wikipedia",
    body={
        "query": {
            "query_string": {"query": query_word}
        },
        "highlight": {
            "fragment_size": 100,
            "fields": {
                "title": {},
                "text": {}
            }
        }
    }
)

# Check the data structure of the returned dict
res

{'took': 324,
 'timed_out': False,
 '_shards': {'total': 3, 'successful': 3, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 8, 'relation': 'eq'},
  'max_score': 7.6104784,
  'hits': [{'_index': 'wikipedia',
    '_type': '_doc',
    '_id': 'iaTjBoUBFtScIDlQXw2D',
    '_score': 7.6104784,
    '_source': {'title': '多言語化', 'text': '多言語化'},
    'highlight': {'text': ['多<em>言語</em>化'], 'title': ['多<em>言語</em>化']}},
   {'_index': 'wikipedia',
    '_type': '_doc',
    '_id': 'waTjBoUBFtScIDlQSQmA',
    '_score': 7.597665,
    '_source': {'title': '自然数', 'text': '自然数'},
    'highlight': {'text': ['<em>自然</em>数'], 'title': ['<em>自然</em>数']}},
   {'_index': 'wikipedia',
    '_type': '_doc',
    '_id': 'BKPiBoUBFtScIDlQmO5o',
    '_score': 6.7742214,
    '_source': {'title': '自然単位系', 'text': '自然単位系'},
    'highlight': {'text': ['<em>自然</em>単位系'], 'title': ['<em>自然</em>単位系']}},
   {'_index': 'wikipedia',
    '_type': '_doc',
    '_id': 'zqPiBoUBFtScIDlQ4fhp',
    '_score': 6.54506,
    '_

In [None]:
# Prettier output
print(f"Result of keyword search: '{query_word}'")
HR()
for doc in res["hits"]["hits"]:
    if doc["_source"]["title"]:
        print(doc["_source"]["title"])
        for line in doc["highlight"]["text"]:
            print("     "+line[:50])

Result of keyword search: '自然言語'
--------------------------------------------------
多言語化
     多<em>言語</em>化
自然数
     <em>自然</em>数
自然単位系
     <em>自然</em>単位系
自然の斉一性
     <em>自然</em>の斉一性
兵庫県立人と自然の博物館
     兵庫県立人と<em>自然</em>の博物館
多自然型川づくり
     多<em>自然</em>型川づくり
北海道エコ・動物自然専門学校
     北海道エコ・動物<em>自然</em>専門学校
兵庫県立六甲山自然保護センター
     兵庫県立六甲山<em>自然</em>保護センター
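
The response handling above can also be written as a small function that tolerates hits without a `text` highlight. This is a sketch under assumptions: `summarize_hits` is a hypothetical helper, and `sample_response` is a hand-trimmed copy of two hits from the response shown earlier, so no running server is needed.

```python
import re
from typing import Dict, List

EM_TAG = re.compile(r"</?em>")


def summarize_hits(response: dict, width: int = 50) -> List[Dict[str, object]]:
    """Collect title, score, and plain-text snippets for each hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        # A hit may have no highlight for "text", so fall back to an empty list
        fragments = hit.get("highlight", {}).get("text", [])
        rows.append({
            "title": hit["_source"]["title"],
            "score": hit["_score"],
            # Strip the <em> markers and truncate each fragment
            "snippets": [EM_TAG.sub("", frag)[:width] for frag in fragments],
        })
    return rows


# Trimmed copy of two hits from the search response above
sample_response = {
    "hits": {
        "hits": [
            {"_score": 7.6104784,
             "_source": {"title": "多言語化", "text": "多言語化"},
             "highlight": {"text": ["多<em>言語</em>化"]}},
            {"_score": 7.597665,
             "_source": {"title": "自然数", "text": "自然数"}},
        ]
    }
}

rows = summarize_hits(sample_response)
for row in rows:
    print(row["title"], row["score"], row["snippets"])
```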


<a name='4.2.4'></a><a id='4.2.4'></a>
## 4.2.4 Search by Sentence Embeddings with ElasticSearch
<a href="#top">[back to top]</a>

Next, we index documents together with their sentence embeddings.

For this, we specify a field of type `dense_vector` when we create an index with Elasticsearch.

Here, we create a field called `text_vector` of type `dense_vector` that stores 100-dimensional vectors. 
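
Because Elasticsearch rejects a document whose vector length differs from the declared `dims`, it can help to check vectors before indexing. The following is a minimal sketch; `vector_mapping` and `check_vector` are hypothetical helpers, not part of the Elasticsearch client.

```python
from typing import List, Sequence


def vector_mapping(dims: int) -> dict:
    """Build the mapping entry for a dense_vector field with `dims` dimensions."""
    return {"type": "dense_vector", "dims": dims}


def check_vector(vector: Sequence[float], mapping: dict) -> List[float]:
    """Return the vector as a plain list of floats, or raise if its length is wrong."""
    if len(vector) != mapping["dims"]:
        raise ValueError(
            f"expected {mapping['dims']} dimensions, got {len(vector)}"
        )
    return [float(x) for x in vector]


# 100 matches the vector_size used when training the Doc2Vec model
mapping = vector_mapping(100)
ok_vector = check_vector([0.0] * 100, mapping)
print(len(ok_vector))  # 100
```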

In [None]:
es.indices.delete(index="wikipedia-vector", ignore=404)

ES_SETTINGS = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "kuromoji"
            },
            "text": {
                "type": "text",
                "analyzer": "kuromoji"
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 100
            }
        }
    }
}

es.indices.create(index="wikipedia-vector", body=ES_SETTINGS)

{'acknowledged': True,
 'shards_acknowledged': True,
 'index': 'wikipedia-vector'}

### Add new dense_vector field to enable sentence embeddings.

`model.dv.get_vector(key)` returns the vector stored for a key. By default this is the raw vector; pass `norm=True` to get the unit-normalized one.

In [None]:
# Wrap enumerate over tqdm
for i, example in enumerate(tqdm(tfds.as_numpy(ds.take(10_000)))):
    es.index(
        index="wikipedia-vector",
        body={
            "title": example["title"].decode("utf-8"),
            "text": example["text"].decode("utf-8").replace("\n", " "),
            "text_vector": model.dv.get_vector(i)
        }
    )

100%|██████████| 10000/10000 [01:36<00:00, 103.45it/s]


### Search by sentence embeddings

Search by sentence embeddings by specifying a script when querying the index. 

Here, we use a `cosineSimilarity()` function to calculate the similarity between the query vector, and each document vector. 
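
The `+ 1.0` in the script shifts cosine similarity from the range [-1, 1] to [0, 2], since Elasticsearch does not accept negative scores. The sketch below reproduces that arithmetic in plain Python on toy vectors; `cosine_similarity` and `script_score` here are illustrative stand-ins, not the Elasticsearch implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector: Sequence[float],
                 doc_vector: Sequence[float]) -> float:
    """Mirror of the script: cosine similarity shifted to be non-negative."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


query = [1.0, 0.0]
print(script_score(query, [2.0, 0.0]))   # same direction     -> 2.0
print(script_score(query, [0.0, 3.0]))   # orthogonal         -> 1.0
print(script_score(query, [-1.0, 0.0]))  # opposite direction -> 0.0
```

The shift does not change the ranking, only the absolute score values.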

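When documents are indexed in the same order in which they were tokenized, Elasticsearch should return the same neighbors as the `most_similar()` method above. The outputs below differ from the earlier `most_similar()` results; one likely cause is that the dataset was loaded with `shuffle_files=True`, so a second pass over `ds.take(10_000)` may yield articles in a different order and pair vector `i` with a different article.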

In [None]:
def search_similar_docs(orig_doc_id):
    print(f"orig_doc_id={orig_doc_id}, doc={' '.join(documents[orig_doc_id].words)[:40]}")

    query_vector = model.dv.get_vector(orig_doc_id)
    
    # Our search script
    res = es.search(
        index="wikipedia-vector",
        body = {
            "query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        # Calculate the similarity between query vector and each document vector
                        "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                        "params": {"query_vector": query_vector}
                    }
                }
            }
        }
    )
        
    for doc in res["hits"]["hits"]:
        print(doc["_source"]["title"])
        print("\t" + (doc["_source"]["text"])[:50])
        HR()

In [None]:
# Make sure Elasticsearch is running
readiness_probe()

<Response [200]>

In [None]:
search_similar_docs(orig_doc_id=1)

orig_doc_id=1, doc=「 taboo 」 は 、 MUCC の 楽曲 で 、 40 枚 目 の シング
日比谷アメニス
	転送 日比谷花壇
--------------------------------------------------
リオデジャネイロオリンピックイエメン選手団
	転送 2016年リオデジャネイロオリンピックのイエメン選手団
--------------------------------------------------
鹿屋
	転送 鹿屋市
--------------------------------------------------
MSConfig
	MSConfig (エムエスコンフィグ、Microsoft System Configuration
--------------------------------------------------
紫野斎院
	転送斎院#斎院制度
--------------------------------------------------
小苦蘊経
	転送 苦蘊小経
--------------------------------------------------
早池峰バス
	早池峰バス株式会社（はやちねバス）は、岩手県交通の100%出資により設立されたバス会社。  設立時の
--------------------------------------------------
松井功 (ゴルファー)
	松井 功（まつい いさお　1941年11月2日 - ）は、日本のプロゴルファー、ゴルフ解説者・指導者
--------------------------------------------------
宮城県柴田農林高等学校
	宮城県柴田農林高等学校（みやぎけん しばたのうりこうとうがっこう）は、宮城県柴田郡大河原町字上川原に
--------------------------------------------------
パロアルト
	パロアルト パロアルト (カリフォルニア州) パロアルト研究所 パロアルト郡 (アイオワ州)  Ca
--------------------------------------------------


In [None]:
search_similar_docs(orig_doc_id=2)

orig_doc_id=2, doc=チオシアン 酸 エチル （ チオシアン さん エチル ） は チオシアン 酸 エ
陽明文庫本源氏物語
	陽明文庫本源氏物語（ようめいぶんこほんげんじものがたり）は、五摂家の一つ近衛家のコレクションである陽
--------------------------------------------------
ホマ・ベイ (カウンティ)
	ホマ湾 (カウンティ)(スワヒリ語：Wilaya ya Homa Bay)は、ケニア南西部の旧ニャン
--------------------------------------------------
新座
	新座（にいざ）は、埼玉県新座市の町名。現行行政地名は新座一丁目から三丁目。郵便番号は352-0006
--------------------------------------------------
桑原悠 (政治家)
	桑原 悠（くわばら はるか、1986年8月4日 - ）は、日本の政治家。新潟県津南町長（1期）。元津
--------------------------------------------------
アザンブジャ線
	アザンブジャ線（Linha da Azambuja）は、ポルトガル鉄道が運行するリスボン近郊鉄道の路
--------------------------------------------------
井上文雄
	井上 文雄（いのうえ ふみお）は、日本の映画プロデューサー。横浜放送映画専門学院（現・日本映画大学）
--------------------------------------------------
巨大大仏
	転送巨大仏
--------------------------------------------------
コンピュータ略語一覧
	コンピュータ略語一覧（コンピュータりゃくごいちらん）は、コンピュータの略語を一覧にしたものである。 
--------------------------------------------------
ドミニク・パーセル
	ドミニク・パーセル（Dominic Purcell, 1970年2月17日 - ）は、イギリス生まれ
---------------------------------

In [None]:
# Stop the Elasticsearch server
!pkill -f 'elasticsearch'
readiness_probe()

False