##### Copyright 2019 The TensorFlow Hub Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [1]:
#meta 9/1/2022 G Example - Semantic Search with Approximate Nearest Neighbors and Text Embeddings from TF-Hub
#Goal: start w/ G example, make it work in your env -> next step: make it work with your data
#Origin: recommended by Skander
# refer to https://www.tensorflow.org/hub/tutorials/tf2_semantic_approximate_nearest_neighbors
# refer to https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_semantic_approximate_nearest_neighbors.ipynb
# myColab copy also
# refer to https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_semantic_approximate_nearest_neighbors.ipynb

#infra original: G Colab
#Python 3.7.13
#numpy 1.21.6, tensorflow 2.8.2, tensorflow_hub 0.12.0, annoy 1.17.1, apache-beam 2.41.0
#scikit_learn~=0.23.0  # For gaussian_random_matrix
#$note: tried to run in my WGCloud env, $error with gaussian_random_matrix

#history  
#9/6/2022 RUN SAME EXAMPLE IN GCP
#      this infra: WGC
#      env: anya_semantic_search2 (installed with Conda, env_anya_semantic_search_py3713.yml -> requirements.txt)
#      confirmed Python 3.7.13, pip 21.2.4 (asked for upgrade though)
#      confirmed same packages as in original infra 
#      $note this notebook pip installs a few packages anyway, so maybe requirements.txt has too many listed?
#      cell 14 `Run Pipeline` eventually expects pandas, so did conda install pandas in this env
#      pandas 1.3.5 

#here 
#9/9/2022 RUN SAME EXAMPLE IN GCP
#      $note: myBreakdown to figure out text to embeddings
#      Modified requirements.txt: added pandas 1.3.5


In [2]:
# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Semantic Search with Approximate Nearest Neighbors and Text Embeddings


<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/hub/tutorials/tf2_semantic_approximate_nearest_neighbors"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_semantic_approximate_nearest_neighbors.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/hub/blob/master/examples/colab/tf2_semantic_approximate_nearest_neighbors.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/hub/examples/colab/tf2_semantic_approximate_nearest_neighbors.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
  <td>
    <a href="https://tfhub.dev/google/nnlm-en-dim128/2"><img src="https://www.tensorflow.org/images/hub_logo_32px.png" />See TF Hub model</a>
  </td>
</table>

This tutorial illustrates how to generate embeddings from a [TensorFlow Hub](https://tfhub.dev) (TF-Hub) model given input data, and build an approximate nearest neighbors (ANN) index using the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

When dealing with a large corpus of data, it's not efficient to perform exact matching by scanning the whole repository to find the most similar items to a given query in real-time. Thus, we use an approximate similarity matching algorithm which allows us to trade off a little bit of accuracy in finding exact nearest neighbor matches for a significant boost in speed.
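
To make the trade-off concrete, the following is a minimal sketch of the exact approach that an ANN index avoids: scoring every item in the corpus against the query. The toy vectors and the `exact_nearest` helper are assumptions made up for illustration; they are not part of this tutorial's pipeline.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_nearest(query, corpus, k=2):
    """Hypothetical helper: scan the whole corpus and return the ids of the k most similar items."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(corpus)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy 3-dimensional "embeddings" (made-up values).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
nearest_ids = exact_nearest([1.0, 0.05, 0.0], corpus, k=2)
print(nearest_ids)
```

Every query touches all N vectors, so the cost grows linearly with the corpus size. An ANN index gives up the guarantee of finding the true top-k in exchange for visiting only a small part of the corpus.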

In this tutorial, we show an example of real-time text search over a corpus of news headlines to find the headlines that are most similar to a query. Unlike keyword search, this captures the semantic similarity encoded in the text embedding.

The steps of this tutorial are:
1. Download sample data.
2. Generate embeddings for the data using a TF-Hub model.
3. Build an ANN index for the embeddings.
4. Use the index for similarity matching.

We use [Apache Beam](https://beam.apache.org/documentation/programming-guide/) to generate the embeddings from the TF-Hub model. We also use Spotify's [ANNOY](https://github.com/spotify/annoy) library to build the approximate nearest neighbor index.

### More models
For models that have the same architecture but were trained on a different language, refer to [this](https://tfhub.dev/google/collections/nnlm/1) collection. [Here](https://tfhub.dev/s?module-type=text-embedding) you can find all text embeddings that are currently hosted on [tfhub.dev](https://tfhub.dev).

## mySetup

Required libraries (refer to requirements.txt)

- apache_beam  
- scikit_learn~=0.23.0  for gaussian_random_matrix  
- annoy

Import the required libraries

In [3]:
import os
import sys
import pickle
from collections import namedtuple
from datetime import datetime
import numpy as np
import apache_beam as beam
from apache_beam.transforms import util
import tensorflow as tf
import tensorflow_hub as hub
import annoy
from sklearn.random_projection import gaussian_random_matrix

In [4]:
#python version $my
import sys
print(sys.version)
print(np.__version__)

3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
1.21.6


In [5]:
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('Apache Beam version: {}'.format(beam.__version__))

TF version: 2.8.2
TF-Hub version: 0.12.0
Apache Beam version: 2.41.0


## 1. Download Sample Data

The [A Million News Headlines](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL#) dataset contains news headlines published over a period of 15 years, sourced from the Australian Broadcasting Corp. (ABC). It is a summarised historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more granular focus on Australia.

**Format**: Tab-separated two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.


In [6]:
!wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
!wc -l raw.tsv
!head raw.tsv

--2022-09-10 00:54:45--  https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 54.211.138.37, 54.211.199.229, 3.219.100.164
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|54.211.138.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’


2022-09-10 00:54:47 (32.8 MB/s) - ‘raw.tsv’ saved [57600231/57600231]

1103664 raw.tsv
publish_date	headline_text
20030219	"aba decides against community broadcasting licence"
20030219	"act fire witnesses must be aware of defamation"
20030219	"a g calls for infrastructure protection summit"
20030219	"air nz staff in aust strike for pay rise"
20030219	"air nz strike to affect australian travellers"
20030219	"ambitious olsson wins triple jump"
20030219	"antic delighted with record breaking barca"
20030219	"aussie qualifier stosur wastes four memphis match"
20030219	"aust 

For simplicity, we keep only the headline text and drop the publication date.
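
As an alternative sketch, the same extraction can be done with Python's `csv` module, which handles the tab delimiter and the surrounding quotes. The two sample rows are copied from the `head` output above. Skipping the header row is an assumption about what you may want: the notebook's own cell keeps it, so `headline_text` ends up as the first line of the corpus.

```python
import csv
import io

# Header plus two rows in the same layout as raw.tsv (tab-separated, quoted headline).
sample_tsv = (
    'publish_date\theadline_text\n'
    '20030219\t"aba decides against community broadcasting licence"\n'
    '20030219\t"act fire witnesses must be aware of defamation"\n'
)

reader = csv.reader(io.StringIO(sample_tsv), delimiter='\t')
next(reader)  # skip the header row
headlines = [row[1].strip() for row in reader]  # second column only
print(headlines)
```

To run this on the real file, replace `io.StringIO(sample_tsv)` with `open('raw.tsv', newline='')`.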

In [7]:
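!rm -rf corpus
!mkdir corpus

# Keep only the second column (headline text) and strip the surrounding quotes.
# Note: the header row is not skipped, so 'headline_text' becomes the first line of the corpus.
with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")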

In [8]:
!tail corpus/text.txt

severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide


## 2. Generate Embeddings for the Data.

In this tutorial, we use the [Neural Network Language Model (NNLM)](https://tfhub.dev/google/nnlm-en-dim128/2) to generate embeddings for the headline data. The sentence embeddings can then be used to compute sentence-level semantic similarity. We run the embedding generation process using Apache Beam.

### Embedding extraction method

In [9]:
embed_fn = None

def generate_embeddings(text, model_url, random_projection_matrix=None):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global embed_fn
  if embed_fn is None:
    embed_fn = hub.load(model_url)
  embedding = embed_fn(text).numpy()
  if random_projection_matrix is not None:
    embedding = embedding.dot(random_projection_matrix)
  return text, embedding
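
The `global embed_fn` check in `generate_embeddings` is a lazy, per-process cache: the model is loaded on the first call and reused afterwards. The sketch below isolates that pattern. `fake_load` is a hypothetical stand-in for `hub.load`, and its "embedding" (string lengths) is made up so the example runs without TensorFlow.

```python
_model = None    # per-process cache, playing the role of `embed_fn`
load_count = 0   # counts how often the expensive load actually happens

def fake_load(model_url):
    """Hypothetical stand-in for hub.load: returns a callable 'model'."""
    global load_count
    load_count += 1
    return lambda texts: [len(t) for t in texts]

def embed(texts, model_url):
    """Load the model on first use, then reuse the cached instance."""
    global _model
    if _model is None:
        _model = fake_load(model_url)
    return _model(texts)

first = embed(['a', 'bb'], 'dummy-url')
second = embed(['ccc'], 'dummy-url')
print(first, second, load_count)
```

Because each worker process has its own copy of the global, the load cost is paid once per process rather than once per batch.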


### Convert to tf.Example method

In [10]:
def to_tf_example(entries):
  examples = []

  text_list, embedding_list = entries
  for i in range(len(text_list)):
    text = text_list[i]
    embedding = embedding_list[i]

    features = {
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding.tolist()))
    }
  
    example = tf.train.Example(
        features=tf.train.Features(
            feature=features)).SerializeToString(deterministic=True)
  
    examples.append(example)
  
  return examples

### Beam pipeline

In [11]:
def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  with beam.Pipeline(args.runner, options=options) as pipeline:
    (
        pipeline
        | 'Read sentences from files' >> beam.io.ReadFromText(
            file_pattern=args.data_dir)
        | 'Batch elements' >> util.BatchElements(
            min_batch_size=args.batch_size, max_batch_size=args.batch_size)
        | 'Generate embeddings' >> beam.Map(
            generate_embeddings, args.model_url, args.random_projection_matrix)
        | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
        | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{}/emb'.format(args.output_dir),
            file_name_suffix='.tfrecords')
    )
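
The batching step is why `generate_embeddings` receives a list of sentences and `to_tf_example` unpacks a `(text_list, embedding_list)` pair. The following is a conceptual sketch of fixed-size batching in plain Python. `batch_elements` is a hypothetical helper, not Beam's implementation of `BatchElements`.

```python
from itertools import islice

def batch_elements(items, batch_size):
    """Yield lists of up to batch_size items; the last batch may be shorter."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

batches = list(batch_elements(['h1', 'h2', 'h3', 'h4', 'h5'], 2))
print(batches)
```

Embedding a whole batch in one model call amortises the per-call overhead, which is presumably the reason the pipeline uses a batch size of 1024 rather than mapping over single sentences.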

### Generating Random Projection Weight Matrix

[Random projection](https://en.wikipedia.org/wiki/Random_projection) is a simple, yet powerful technique used to reduce the dimensionality of a set of points which lie in Euclidean space. For a theoretical background, see the [Johnson-Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma).

Reducing the dimensionality of the embeddings with random projection means less time needed to build and query the ANN index.

In this tutorial we use [Gaussian Random Projection](https://en.wikipedia.org/wiki/Random_projection#Gaussian_random_projection) from the [Scikit-learn](https://scikit-learn.org/stable/modules/random_projection.html#gaussian-random-projection) library.
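
As a rough illustration of why this works, the sketch below builds a Gaussian projection matrix directly with NumPy and checks that the distance between two vectors is approximately preserved after projecting from 128 to 64 dimensions. The N(0, 1/n_components) scaling follows the scikit-learn documentation for Gaussian random projection; the variable names, the seed, and the two random test vectors are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_in, n_out = 128, 64  # original and projected dimensions

# Entries drawn from N(0, 1/n_out), so squared lengths are preserved in expectation.
projection = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n_out), size=(n_in, n_out))

# Two made-up 128-dimensional vectors standing in for embeddings.
a = rng.normal(size=n_in)
b = rng.normal(size=n_in)

original_distance = np.linalg.norm(a - b)
projected_distance = np.linalg.norm(a.dot(projection) - b.dot(projection))
ratio = projected_distance / original_distance
print(projection.shape, ratio)
```

A ratio close to 1 means the projected vectors are about as far apart as the originals. A matrix built this way could also serve as a fallback in environments where `gaussian_random_matrix` is not importable from scikit-learn.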

In [12]:
def generate_random_projection_weights(original_dim, projected_dim):
  # gaussian_random_matrix returns shape (n_components, n_features); transpose it
  # so that embedding.dot(matrix) maps original_dim -> projected_dim.
  random_projection_matrix = gaussian_random_matrix(
      n_components=projected_dim, n_features=original_dim).T
  print("A Gaussian random weight matrix was created with shape of {}".format(random_projection_matrix.shape))
  print('Storing random projection matrix to disk...')
  with open('random_projection_matrix', 'wb') as handle:
    pickle.dump(random_projection_matrix, 
                handle, protocol=pickle.HIGHEST_PROTOCOL)
        
  return random_projection_matrix

### Set parameters
If you want to build an index using the original embedding space without random projection, set the `projected_dim` parameter to `None`. Note that this will slow down the indexing step for high-dimensional embeddings.

In [13]:
model_url = 'https://tfhub.dev/google/nnlm-en-dim128/2' #@param {type:"string"}
projected_dim = 64  #@param {type:"number"}

### Run pipeline

In [14]:
import tempfile

output_dir = tempfile.mkdtemp()
original_dim = hub.load(model_url)(['']).shape[1]
random_projection_matrix = None

if projected_dim:
  random_projection_matrix = generate_random_projection_weights(
      original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'model_url': model_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args

2022-09-10 00:54:54.661141: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-09-10 00:54:54.661181: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-09-10 00:54:54.661206: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (vm-6a78437e-2759-4469-95d9-01f80788c108): /proc/driver/nvidia/version does not exist
2022-09-10 00:54:54.661513: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  

A Gaussian random weight matrix was creates with shape of (128, 64)
Storing random projection matrix to disk...
Pipeline args are set.




{'job_name': 'hub2emb-220910-005455',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': '/tmp/tmpjbhlsgbs',
 'model_url': 'https://tfhub.dev/google/nnlm-en-dim128/2',
 'random_projection_matrix': array([[ 2.01236791e-01,  8.93483119e-02,  5.54279942e-02, ...,
         -2.45547516e-01, -1.48356969e-02,  6.53537678e-03],
        [ 1.15784693e-01, -6.95469504e-02, -5.48163731e-02, ...,
          4.99238821e-02,  1.63294169e-03,  2.55165014e-01],
        [ 9.28542953e-02,  2.26072770e-01, -8.48858275e-02, ...,
         -1.42691433e-01, -3.00625879e-02, -5.18828692e-02],
        ...,
        [ 1.38970299e-02, -1.69851754e-01,  3.17877962e-01, ...,
         -1.66205472e-01,  1.74837610e-01, -8.87367495e-02],
        [ 1.43657886e-01, -2.22593564e-01, -9.92955602e-02, ...,
         -6.31882148e-02, -7.41718247e-02, -5.04202318e-02],
        [-1.93989033e-01,  1.03565901e-02, -3.28919670e-01, ...,
          7.81661117e-02, -2.77780273e-06, -1.48648441e-01]])}

In [15]:
print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")

Running pipeline...






CPU times: user 3min 44s, sys: 4min 49s, total: 8min 33s
Wall time: 2min 29s
Pipeline is done.


In [16]:
!ls {output_dir}

emb-00000-of-00001.tfrecords


- Read some of the generated embeddings...  
$note: refer to myBreakdown for details

In [17]:
embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5

# Create a description of the features.
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)
}

def _parse_example(example):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example, feature_description)

dataset = tf.data.TFRecordDataset(embed_file)
for record in dataset.take(sample).map(_parse_example):
  print("{}: {}".format(record['text'].numpy().decode('utf-8'), record['embedding'].numpy()[:10]))


headline_text: [-0.21577713 -0.08679349 -0.09089502 -0.08608533 -0.10421044 -0.00496678
  0.04690553 -0.08057941  0.09314948 -0.22174641]
aba decides against community broadcasting licence: [-0.10749932  0.06211396  0.09754886 -0.09016148  0.01318626  0.08292361
 -0.01953283  0.00580397  0.07598736 -0.11664818]
act fire witnesses must be aware of defamation: [ 0.14852294  0.14426763 -0.01337856  0.0806409  -0.01414887  0.19118302
 -0.05756229 -0.02248628  0.02626099  0.02218925]
a g calls for infrastructure protection summit: [-0.10894778 -0.01743285  0.03230982  0.07510062  0.01461216 -0.06994132
 -0.1418096   0.12035619  0.1696226   0.07280549]
air nz staff in aust strike for pay rise: [-0.03192603 -0.17949837  0.33432406  0.07296742  0.00270896  0.12258769
  0.07110699 -0.05459479  0.16654256 -0.13920893]


## 3. Build the ANN Index for the Embeddings

[ANNOY](https://github.com/spotify/annoy) (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into memory. It is built and used by [Spotify](https://www.spotify.com) for music recommendations. If you are interested, you can also try alternatives to ANNOY such as [NGT](https://github.com/yahoojapan/NGT) or [FAISS](https://github.com/facebookresearch/faiss).

In [18]:
def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.io.gfile.glob(embedding_files_pattern)
  num_files = len(embed_files)
  print('Found {} embedding file(s).'.format(num_files))

  item_counter = 0
  for i, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(i+1, num_files))
    dataset = tf.data.TFRecordDataset(embed_file)
    for record in dataset.map(_parse_example):
      text = record['text'].numpy().decode("utf-8")
      embedding = record['embedding'].numpy()
      mapping[item_counter] = text
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')
  
  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
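
`build_index` defaults to `metric='angular'`. According to the ANNOY README, this is the Euclidean distance between normalised vectors, which equals sqrt(2 * (1 - cos(u, v))). The sketch below computes that quantity in plain Python. `angular_distance` is a hypothetical helper written for illustration, not a function from the `annoy` package.

```python
import math

def angular_distance(u, v):
    """sqrt(2 * (1 - cosine similarity)): 0 for same direction, 2 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

same = angular_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = angular_distance([1.0, 0.0], [0.0, 3.0])  # 90 degrees apart
opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])   # 180 degrees apart
print(same, orthogonal, opposite)
```

The distance depends only on direction, not on vector length, so it ranks neighbors the same way cosine similarity does.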

In [19]:
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)

rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.59 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 10min 47s, sys: 1min 42s, total: 12min 30s
Wall time: 5min 31s


In [20]:
!ls

anya_semantic_search
corpus
corpus2
data
env_anya_semantic_search_py3713.yml
example_tfsimilarity_mnist.ipynb
gcp_POC_model
index
index.mapping
mySemanticSearch_with_ANNandTextEmbeddingsFromTFHub.ipynb
poc_0_data.ipynb
poc_SemanticSearch_qry.ipynb
poc_SemanticSearch_with_ANNandTextEmbeddingsFromTFHub.ipynb
random_projection_matrix
raw.tsv
requirements.txt
src
tensorflow_datasets
tutorials


## 4. Use the Index for Similarity Matching
Now we can use the ANN index to find news headlines that are semantically close to an input query.

### Load the index and the mapping files

In [21]:
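# Pass the metric explicitly; it must match the one used in build_index ('angular').
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')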
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')


Annoy index is loaded.




Mapping file is loaded.


### Similarity matching method

In [22]:
def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items

### Extract embedding from a given query

In [23]:
# Load the TF-Hub model
print("Loading the TF-Hub model...")
%time embed_fn = hub.load(model_url)
print("TF-Hub model is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding = embed_fn([query])[0].numpy()
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding


Loading the TF-Hub model...
CPU times: user 535 ms, sys: 200 ms, total: 735 ms
Wall time: 732 ms
TF-Hub model is loaded.
Loading random projection matrix...
random projection matrix is loaded.


In [24]:
extract_embeddings("Hello Machine Learning!")[:10]

array([ 0.09250556,  0.13992388,  0.10675006, -0.05237758, -0.0437806 ,
       -0.03317126, -0.00183065,  0.12784306, -0.20596743,  0.00081738])

### Enter a query to find the most similar items

In [25]:
#@title { run: "auto" }
query = "confronting global challenges" #@param {type:"string"}

print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

print("")
print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

print("")
print("Results:")
print("=========")
for item in items:
  print(item)

Generating embedding for the query...
CPU times: user 2.1 ms, sys: 939 µs, total: 3.03 ms
Wall time: 2 ms

Finding relevant items in the index...
CPU times: user 610 µs, sys: 0 ns, total: 610 µs
Wall time: 620 µs

Results:
confronting global challenges
conference examines challenges facing major cities
the domestic challenges facing duterte
the challenges facing incoming rba governor
expert demands national anti smoking laws
bluescope ponders global challenges
facilitator sought to address immigrant issues
unified approach helping tackle truancy
futurist discusses dangerous ideas
terror fears dominate global markets


## Want to learn more?

You can learn more about TensorFlow at [tensorflow.org](https://www.tensorflow.org/) and see the TF-Hub API documentation at [tensorflow.org/hub](https://www.tensorflow.org/hub/). Find available TensorFlow Hub models at [tfhub.dev](https://tfhub.dev/) including more text embedding models and image feature vector models.

Also check out the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/) which is Google's fast-paced, practical introduction to machine learning.

In [26]:
mystop

NameError: name 'mystop' is not defined

## myXtra


In [27]:
#original from Colab
#!pip install apache_beam
#!pip install 'scikit_learn~=0.23.0'  # For gaussian_random_matrix.
#!pip install annoy


### myBreakdown
from `embed_file` to `dict{'text', 'embedding'}`

In [28]:
#myBreakdown
embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
print(embed_file)

/tmp/tmpjbhlsgbs/emb-00000-of-00001.tfrecords


In [29]:
#myBreakdown
dataset = tf.data.TFRecordDataset(embed_file) #class tensorflow.python.data.ops.readers.TFRecordDatasetV2

#record raw
for record in dataset.take(1): #record -> 'tensorflow.python.framework.ops.EagerTensor'
  print(record, '\n')

print('--')

#record raw as numpy array
for record in dataset.take(1): #record -> 'tensorflow.python.framework.ops.EagerTensor'
  print(record.numpy(), '\n')

tf.Tensor(b"\n\xb2\x02\n\x94\x02\n\tembedding\x12\x86\x02\x12\x83\x02\n\x80\x02\xae\xf4\\\xbe\xc9\xc0\xb1\xbd+'\xba\xbd\x81M\xb0\xbdIl\xd5\xbdb\xc0\xa2\xbb\x03 @=\xd1\x06\xa5\xbd'\xc5\xbe=~\x11c\xbe\x1b\xa0\xd9<\x90\x86\xeb=\x8e\xa1\xcb=\xac-Y\xbebR\x06>\xd1u\x01>\xd8\x183\xbd\xdd\x8a\x99=D\xd0\xa9>\xc3\x00}\xbcT\xe0M\xbe24N=\x12\x88\xf7\xbc\xffM\xdf\xbd\x18\xea\xed=oyM=\x9c\x9d\xe3\xbdu\x03\xde; \xe3$>\xb5rM=\x96\xab\x10>\x16\xd1\xe2=\xa0\x9bd>\x02\x95\x9d\xba\xd1&\xbd=W\xc1i\xbd\xee\xe7-\xbew\r0\xbc\xad\x84\x04>2\xcc\xa8=U\xf4^\xbe\xfc\xe6\x84\xbd\x97\xfb\x18>\x89\xd3\xd7<,P\xa19\x029\x8d\xbe\x86J\x8c\xbe\xbbf|\xbe\xc3V\x96>T\xeb\xaa\xbd\xb5\x08z=\x9a\xca3>7\x9c\x0c>\x00\x8e\xf3\xbd\xdb\xceP\xbe\r\xc8\x0c\xbe\x93\xfb\x02\xbem\xddD>\x86@\x86\xbdpf\xec\xbd\\\xeb\xe1\xbb\x8d\xea\xbb>QxY\xbd\xe3+X=\n\x19\n\x04text\x12\x11\n\x0f\n\rheadline_text", shape=(), dtype=string) 

--
b"\n\xb2\x02\n\x94\x02\n\tembedding\x12\x86\x02\x12\x83\x02\n\x80\x02\xae\xf4\\\xbe\xc9\xc0\xb1\xbd+'\xba\xbd\x81M

In [30]:
#myBreakdown: what is what

#record -> diff views
for record in dataset.take(1): #record -> 'tensorflow.python.framework.ops.EagerTensor'
  print(record.numpy(), '\n') #class 'bytes'
  print(tf.io.parse_example(record, {'text': tf.io.FixedLenFeature([], tf.string)}), '\n')
  print(tf.io.parse_example(record, {'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)}), '\n')

print('--')
#record -> dict w/ 2 elements
for record in dataset.take(1): #record -> 'tensorflow.python.framework.ops.EagerTensor'
  print(tf.io.parse_example(record, {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)}
    ), '\n')

b"\n\xb2\x02\n\x94\x02\n\tembedding\x12\x86\x02\x12\x83\x02\n\x80\x02\xae\xf4\\\xbe\xc9\xc0\xb1\xbd+'\xba\xbd\x81M\xb0\xbdIl\xd5\xbdb\xc0\xa2\xbb\x03 @=\xd1\x06\xa5\xbd'\xc5\xbe=~\x11c\xbe\x1b\xa0\xd9<\x90\x86\xeb=\x8e\xa1\xcb=\xac-Y\xbebR\x06>\xd1u\x01>\xd8\x183\xbd\xdd\x8a\x99=D\xd0\xa9>\xc3\x00}\xbcT\xe0M\xbe24N=\x12\x88\xf7\xbc\xffM\xdf\xbd\x18\xea\xed=oyM=\x9c\x9d\xe3\xbdu\x03\xde; \xe3$>\xb5rM=\x96\xab\x10>\x16\xd1\xe2=\xa0\x9bd>\x02\x95\x9d\xba\xd1&\xbd=W\xc1i\xbd\xee\xe7-\xbew\r0\xbc\xad\x84\x04>2\xcc\xa8=U\xf4^\xbe\xfc\xe6\x84\xbd\x97\xfb\x18>\x89\xd3\xd7<,P\xa19\x029\x8d\xbe\x86J\x8c\xbe\xbbf|\xbe\xc3V\x96>T\xeb\xaa\xbd\xb5\x08z=\x9a\xca3>7\x9c\x0c>\x00\x8e\xf3\xbd\xdb\xceP\xbe\r\xc8\x0c\xbe\x93\xfb\x02\xbem\xddD>\x86@\x86\xbdpf\xec\xbd\\\xeb\xe1\xbb\x8d\xea\xbb>QxY\xbd\xe3+X=\n\x19\n\x04text\x12\x11\n\x0f\n\rheadline_text" 

{'text': <tf.Tensor: shape=(), dtype=string, numpy=b'headline_text'>} 

{'embedding': <tf.Tensor: shape=(64,), dtype=float32, numpy=
array([-2.15777129e

In [31]:
#myBreakdown

# Create a description of the features.
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)
}

#record -> dict predefined as features
for record in dataset.take(1): #record -> 'tensorflow.python.framework.ops.EagerTensor'
  print(tf.io.parse_single_example(record, feature_description), '\n')

{'embedding': <tf.Tensor: shape=(64,), dtype=float32, numpy=
array([-2.15777129e-01, -8.67934898e-02, -9.08950195e-02, -8.60853270e-02,
       -1.04210444e-01, -4.96678147e-03,  4.69055288e-02, -8.05794075e-02,
        9.31494758e-02, -2.21746415e-01,  2.65656020e-02,  1.15002751e-01,
        9.94292349e-02, -2.12088287e-01,  1.31173640e-01,  1.26425996e-01,
       -4.37248647e-02,  7.49718919e-02,  3.31667066e-01, -1.54420761e-02,
       -2.01051056e-01,  5.03427461e-02, -3.02162506e-02, -1.09035484e-01,
        1.16169155e-01,  5.01646362e-02, -1.11140460e-01,  6.77531445e-03,
        1.61022663e-01,  5.01582213e-02,  1.41279548e-01,  1.10750362e-01,
        2.23249912e-01, -1.20225572e-03,  9.23591927e-02, -5.70691489e-02,
       -1.69830054e-01, -1.07453978e-02,  1.29412368e-01,  8.24207217e-02,
       -2.17728928e-01, -6.48936927e-02,  1.49397239e-01,  2.63459850e-02,
        3.07680457e-04, -2.75825560e-01, -2.74006069e-01, -2.46485636e-01,
        2.93630689e-01, -8.34566653e-02

In [32]:
#myBreakdown

# add parse fn using dict predefined as features
def _parse_example(example):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example, feature_description)

#record -> dict w/ defined fn
for record in dataset.take(1): #record -> 'tensorflow.python.framework.ops.EagerTensor'
  print(_parse_example(record), '\n')

{'embedding': <tf.Tensor: shape=(64,), dtype=float32, numpy=
array([-2.15777129e-01, -8.67934898e-02, -9.08950195e-02, -8.60853270e-02,
       -1.04210444e-01, -4.96678147e-03,  4.69055288e-02, -8.05794075e-02,
        9.31494758e-02, -2.21746415e-01,  2.65656020e-02,  1.15002751e-01,
        9.94292349e-02, -2.12088287e-01,  1.31173640e-01,  1.26425996e-01,
       -4.37248647e-02,  7.49718919e-02,  3.31667066e-01, -1.54420761e-02,
       -2.01051056e-01,  5.03427461e-02, -3.02162506e-02, -1.09035484e-01,
        1.16169155e-01,  5.01646362e-02, -1.11140460e-01,  6.77531445e-03,
        1.61022663e-01,  5.01582213e-02,  1.41279548e-01,  1.10750362e-01,
        2.23249912e-01, -1.20225572e-03,  9.23591927e-02, -5.70691489e-02,
       -1.69830054e-01, -1.07453978e-02,  1.29412368e-01,  8.24207217e-02,
       -2.17728928e-01, -6.48936927e-02,  1.49397239e-01,  2.63459850e-02,
        3.07680457e-04, -2.75825560e-01, -2.74006069e-01, -2.46485636e-01,
        2.93630689e-01, -8.34566653e-02

In [33]:
#myBreakdown - final: how it's done by G professionals

#record -> dict mapped w/ defined fn
for record in dataset.take(1).map(_parse_example): #record -> class dict => class dict w/ predefined features
  print("{}: {}...".format(record['text'].numpy().decode('utf-8'), record['embedding'].numpy()[:10]))

headline_text: [-0.21577713 -0.08679349 -0.09089502 -0.08608533 -0.10421044 -0.00496678
  0.04690553 -0.08057941  0.09314948 -0.22174641]...
