## Import Transformer

First we'll load a pre-trained sentence-similarity model. It is a BERT-style transformer trained on a very large set of text pairs collected from the internet. Each pair has the form (input, output): for example, the input could be a question and the output its answer.

In [4]:
%pip install accelerate transformers torch sentence-transformers huggingface 



Note: you may need to restart the kernel to use updated packages.


In [10]:
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

## Prepare Corpus

We are going to pull the summary from the <a href="https://en.wikipedia.org/wiki/Japan">Japan Wikipedia Page</a>, then prepare it for vector embedding. 

In [11]:
# Set the corpus to the summary of the Japan Wikipedia page
# (triple quotes let the string span two lines)
corpus = """Japan is an island country in East Asia. It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south. Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto. Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.5 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional regions. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.4 million residents. Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first written mention of the archipelago appears in a Chinese chronicle (the Book of Han) finished in the 2nd century AD. Between the 4th and 9th centuries, the kingdoms of Japan became unified under an emperor and the imperial court based in Heian-kyō. Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō) and enforced by a class of warrior nobility (samurai). After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy. In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868.
In the Meiji period, the Empire of Japan adopted a Western-modeled constitution and pursued a program of industrialization and modernization. Amidst a rise in militarism and overseas colonization, Japan invaded China in 1937 and entered World War II as an Axis power in 1941. After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution and began a military alliance with the United States. Under the 1947 constitution, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet. Japan is a highly developed country, and a great power in global politics. Its economy is the world's third-largest by nominal GDP and the fourth-largest by PPP. Although Japan has renounced its right to declare war, the country maintains Self-Defense Forces that rank as one of the world's strongest militaries. After World War II, Japan experienced record growth in an economic miracle, becoming the second-largest economy in the world by 1972 but has stagnated since 1995 in what is referred to as the Lost Decades. Japan has the world's highest life expectancy, though it is experiencing a decline in population. A global leader in the automotive, robotics and electronics industries, the country has made significant contributions to science and technology. The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent comic, animation and video game industries. It is a member of numerous international organizations, including the United Nations (since 1956), OECD, G20 and Group of Seven."""
# Split it into a list of sentences. This is a naive split on every '.',
# so decimals such as 125.5 are cut in two and a trailing empty string is produced.
docs = corpus.split('.')
print(docs)

['Japan is an island country in East Asia', ' It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south', ' Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa', " Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto", ' Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized', " About three-fourths of the country's terrain is mountainous, concentrating its population of 125", '5 million on narrow coastal plains', ' Japan is divided into 47 administrative prefectures and eight traditional regions', ' The Greater Tokyo Area is the most populous metropolitan a
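
Splitting on every `.` also cuts decimal numbers, which is why `125.5 million` appears above as two fragments. Below is a minimal sketch of a slightly more careful splitter, assuming sentences end with `.`, `!` or `?` followed by whitespace. `split_sentences` is a hypothetical helper, not part of the notebook, and it would still mis-split abbreviations such as `U.S. fleet`.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' only when whitespace follows; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Two sentences taken from the corpus above; the first contains a decimal number
sample = (
    "About three-fourths of the country's terrain is mountainous, "
    "concentrating its population of 125.5 million on narrow coastal plains. "
    "Japan is divided into 47 administrative prefectures and eight traditional regions."
)

for sentence in split_sentences(sample):
    print(sentence)
```

With this splitter the decimal stays inside its sentence, and no empty string is left after the final period.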

## Encode Corpus
Encode each sentence into a 384-dimensional vector.

In [12]:
corpus_vector = model.encode(docs)
print("Length of vector:", len(corpus_vector[0]))

print(corpus_vector)

Length of vector: 384
[[ 0.04893158 -0.02619271 -0.00278412 ... -0.04319208  0.07940488
  -0.00894494]
 [ 0.01393695 -0.03502559 -0.04357595 ... -0.02307443  0.02728172
  -0.01315697]
 [ 0.00622562 -0.00944361 -0.02291363 ... -0.01319446  0.06887974
   0.0035376 ]
 ...
 [ 0.01956195  0.05152552 -0.02227121 ...  0.00139931  0.05719159
  -0.01325998]
 [-0.00859475 -0.02735347  0.00834539 ... -0.04587486  0.00562254
   0.03815543]
 [-0.01250338  0.06143889 -0.0067346  ... -0.00193855 -0.05036447
  -0.01904947]]


In [13]:
len(corpus_vector)
corpus_vector[0].shape

(384,)

In [14]:
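from pymongo import MongoClient

# Replace <username> and <password> with your own Atlas credentials
MONGO_CONN = 'mongodb+srv://<username>:<password>@retail-demo.2wqno.mongodb.net/?retryWrites=true&w=majority'
client = MongoClient(MONGO_CONN)

# Collection that will hold the sentence vectors
col = client['sample']['vectest']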


In [15]:
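# Build one document per sentence: the vector goes in field "d", the text in "doc"
a = [{"d": vector.tolist(), "doc": doc} for vector, doc in zip(corpus_vector, docs)]

# Clear the collection, then insert the new documents
col.delete_many({})
col.insert_many(a)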

<pymongo.results.InsertManyResult at 0x2a6850250>
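
The insert above stores every element of `docs`, including the empty string that `split('.')` leaves after the final period. Here is a minimal sketch of building the documents while skipping blank entries and trimming whitespace. `build_documents` is a hypothetical helper, and the toy vectors are plain lists standing in for the model output.

```python
def build_documents(vectors, sentences, vector_field="d", text_field="doc"):
    """Pair each non-empty sentence with its vector as a MongoDB-style document."""
    documents = []
    for vector, sentence in zip(vectors, sentences):
        text = sentence.strip()
        if not text:
            continue  # skip the empty string left after the final '.'
        # NumPy arrays expose .tolist(); plain sequences are copied with list()
        values = vector.tolist() if hasattr(vector, "tolist") else list(vector)
        documents.append({vector_field: values, text_field: text})
    return documents

# Toy stand-ins for docs and corpus_vector
toy_sentences = ["Japan is an island country in East Asia", " Tokyo is the nation's capital", ""]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]

documents = build_documents(toy_vectors, toy_sentences)
print(len(documents))
```

The field names default to `d` and `doc` so the result matches what the search pipeline later expects.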

## Embed Our Query

Next we take a natural-language question and encode it with the same model, so it lands in the same 384-dimensional space as the corpus. The query vector and the corpus vectors are then compared with cosine similarity (`util.cos_sim`) to find the sentences most similar to the question.

In [27]:
# Encode our question into a 384-dimensional vector

# query = "How many islands is Japan comprised of?"
# query = "Which constitution model did Japan adopt?"
# query = "Name the capital of Japan?"
query = "Japan is divided into how many administrative regions?"
query_vector = model.encode(query)
print(query_vector)

[ 2.59677079e-02  7.72756943e-03 -1.71495732e-02  3.39120068e-02
  2.43398529e-02  1.90968369e-03  5.53601533e-02  2.43759509e-02
 -9.57010873e-03  3.55584994e-02 -1.02552883e-02 -2.38108225e-02
  4.03740183e-02 -1.73518267e-02 -1.17360922e-02 -1.22100502e-01
 -2.15227148e-04 -2.63164807e-02 -1.63931046e-02 -2.06916984e-02
 -5.42591400e-02 -2.79657934e-02  1.36672251e-03  6.63102465e-03
 -1.12631684e-02  4.42962125e-02 -2.55481340e-02 -6.01313114e-02
 -2.94108484e-02  1.46279149e-02  5.05252779e-02 -7.75548890e-02
  1.77590293e-03 -8.97731166e-03  1.55914734e-06  1.58840753e-02
  1.10980850e-02  6.24909950e-03  4.98716496e-02  7.87470639e-02
 -3.69348824e-02  9.98970401e-03 -2.91637387e-02  1.72565244e-02
  4.45722463e-03 -1.42990546e-02  5.41606322e-02 -4.28984277e-02
  9.74101871e-02  5.85654425e-03 -8.44711903e-03 -7.69575462e-02
 -5.22977374e-02 -2.46982649e-02 -7.07166940e-02  5.20018265e-02
 -2.29601413e-02 -1.86963938e-02 -6.37296438e-02  1.80762950e-02
 -2.45842356e-02 -4.45502

## Calculate Similarity

In [28]:
# Calculate cosine similarity between the corpus of vectors and the query vector
scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

0.8487468957901001  Japan is divided into 47 administrative prefectures and eight traditional regions
0.5770764350891113 Japan is an island country in East Asia
0.5675115585327148  Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa
0.5423482656478882  Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized
0.5241142511367798  Under the 1947 constitution, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet
0.5068069696426392  Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto
0.4977687895298004  The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37
0.4504493772983551  Japan is a highly developed country, and
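
To make the ranking step concrete, here is a minimal pure-Python sketch of what `util.cos_sim` followed by the sort computes: the dot product of two vectors divided by the product of their lengths, then the highest-scoring entries. `cosine_similarity` and `top_k` are hypothetical helpers, and the 3-dimensional vectors are toy values rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # undefined for zero vectors

def top_k(query_vec, doc_vecs, docs, k=3):
    """Return the k (score, doc) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, v), d) for v, d in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy vectors: same direction, orthogonal, and opposite to the query
toy_docs = ["same direction", "orthogonal", "opposite direction"]
toy_vecs = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
toy_query = [1.0, 0.0, 0.0]

for score, doc in top_k(toy_query, toy_vecs, toy_docs, k=3):
    print(round(score, 3), doc)
```

Cosine similarity ignores vector length, so the first toy vector scores 1.0 even though it is twice as long as the query.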

In [32]:
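# Run the same query in Atlas Search: find the k=3 stored vectors
# in field "d" that are nearest to query_vector
pipeline = [{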
  "$search": {
    "index": "default",
    "knnBeta": {
      "vector": query_vector.tolist(),
      "path": "d",
      # "filter":{
      #           "phrase": {
      #             "path": "doc",
      #             "query": "capital",
      #           }
      #         },
      "k": 3
    }
  }
},
{"$project":{
    "score":{
              '$meta': 'searchScore'
            },
    "doc":1,
    "_id": 0
}}]
res = list(col.aggregate(pipeline))
res




[{'doc': ' Japan is divided into 47 administrative prefectures and eight traditional regions',
  'score': 0.9243733882904053},
 {'doc': 'Japan is an island country in East Asia',
  'score': 0.7885381579399109},
 {'doc': ' Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa',
  'score': 0.7837554216384888}]
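
The `searchScore` values are higher than the cosine similarities printed earlier, although the ranking is the same. They are consistent with the normalization Atlas Search documents for cosine similarity, `score = (1 + cosine) / 2`, which maps the range [-1, 1] onto [0, 1]. The sketch below checks this against the numbers above, assuming the index on `d` was defined with cosine similarity (the index definition is not shown in this notebook). `atlas_cosine_score` is a hypothetical helper name.

```python
def atlas_cosine_score(cosine):
    """Map a cosine similarity in [-1, 1] to a score in [0, 1]."""
    return (1 + cosine) / 2

# (cosine from util.cos_sim, searchScore from Atlas) for the top three sentences
pairs = [
    (0.8487468957901001, 0.9243733882904053),
    (0.5770764350891113, 0.7885381579399109),
    (0.5675115585327148, 0.7837554216384888),
]

for cosine, reported in pairs:
    predicted = atlas_cosine_score(cosine)
    print(f"predicted={predicted:.6f} reported={reported:.6f} diff={abs(predicted - reported):.1e}")
```

For all three sentences the predicted and reported scores differ by less than 1e-6, which is within float32 rounding.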