## Import Transformer

First we'll import our pre-trained sentence similarity model. This one was trained using BERT techniques on a massive set of tuples from the internet. Tuples take the form of input-output. So for example, an input could be a question, and an output could be an answer. 

In [7]:
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

## Prepare Corpus

We are going to pull the summary from the <a href="https://en.wikipedia.org/wiki/Japan">Japan Wikipedia Page</a>, then prepare it for vector embedding. 

In [8]:
# set corpus from first page of wikipedia
corpus = "Japan is an island country in East Asia. It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south. Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto. Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.5 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional regions. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.4 million residents. Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first written mention of the archipelago appears in a Chinese chronicle (the Book of Han) finished in the 2nd century AD. Between the 4th and 9th centuries, the kingdoms of Japan became unified under an emperor and the imperial court based in Heian-kyō. Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō) and enforced by a class of warrior nobility (samurai). After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy. In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868. In the Meiji period, the Empire of Japan adopted a Western-modeled constitution and pursued a program of industrialization and modernization. Amidst a rise in militarism and overseas colonization, Japan invaded China in 1937 and entered World War II as an Axis power in 1941. After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution and began a military alliance with the United States. Under the 1947 constitution, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet. Japan is a highly developed country, and a great power in global politics. Its economy is the world's third-largest by nominal GDP and the fourth-largest by PPP. Although Japan has renounced its right to declare war, the country maintains Self-Defense Forces that rank as one of the world's strongest militaries. After World War II, Japan experienced record growth in an economic miracle, becoming the second-largest economy in the world by 1972 but has stagnated since 1995 in what is referred to as the Lost Decades. Japan has the world's highest life expectancy, though it is experiencing a decline in population. A global leader in the automotive, robotics and electronics industries, the country has made significant contributions to science and technology. The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent comic, animation and video game industries. It is a member of numerous international organizations, including the United Nations (since 1956), OECD, G20 and Group of Seven."
# turn it into an array of sentences
docs = corpus.split('.')
print(docs)

['Japan is an island country in East Asia', ' It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south', ' Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa', " Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto", ' Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized', " About three-fourths of the country's terrain is mountainous, concentrating its population of 125", '5 million on narrow coastal plains', ' Japan is divided into 47 administrative prefectures and eight traditional regions', ' The Greater Tokyo Area is the most populous metropolitan a

## Encode Corpus
encode each array (sentence) into a 384 dimension vector

In [11]:
corpus_vector = model.encode(docs)
print("Length of vector:", len(corpus_vector[0]))

print(corpus_vector)

Length of vector: 384
[[ 0.05527088  0.04808547 -0.00781395 ... -0.01564412 -0.05199255
  -0.0269122 ]
 [ 0.07182305  0.11629473  0.03326566 ...  0.00400936 -0.04030822
   0.09569606]
 [ 0.11922333  0.00596007 -0.01733764 ...  0.02097987 -0.07156345
   0.01953295]
 ...
 [ 0.07631288 -0.05397929 -0.02969839 ... -0.03893645  0.0111805
   0.04070465]
 [-0.02801095 -0.03043354  0.00067352 ... -0.08902537 -0.00195532
   0.02784135]
 [-0.11883843  0.04829871 -0.00254811 ...  0.1264095   0.04654899
  -0.0157173 ]]


We create a function which calculates the `dot_score` distance between a query vector and the corpus vector

In [12]:
# our calculation function
def calculate(query_vector, corpus_vector):
    # Compute dot score between query and all document embeddings
    scores = util.cos_sim((query_vector, corpus_vector)[0].cpu().tolist()
    
    # Combine docs & scores
    doc_score_pairs = list(zip(docs, scores))

    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

    # Output passages & scores
    for doc, score in doc_score_pairs:
        print(score, doc)

SyntaxError: '(' was never closed (215056968.py, line 4)

We take the first couple of paragraphs (summary) from the [Japan Wikipedia Page](https://en.wikipedia.org/wiki/Japan) and convert it into a series of arrays, then encode it as a 384 dimension vector. 

We then take an english-intuitive question, also send that through the same 384 dimension calculation and then the resulting vector query and corpus query are sent through the `calculate` function, where the most similar strings are calculated. 

In [6]:
# Encode our question and documents in 384 dimension
query = "when did china invade japan?"
query_vector = model.encode(query)

calculate(query_vector, corpus_vector)

0.7069158554077148  Amidst a rise in militarism and overseas colonization, Japan invaded China in 1937 and entered World War II as an Axis power in 1941
0.5404084920883179 Japan is an island country in East Asia
0.5016442537307739  Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first written mention of the archipelago appears in a Chinese chronicle (the Book of Han) finished in the 2nd century AD
0.49132198095321655  Japan is a highly developed country, and a great power in global politics
0.47492778301239014  In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868
0.47227388620376587  After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution and began a military alliance with the United States
0.46403104066848755  Although Japan