## Import Transformer

First we'll import our pre-trained sentence similarity model. This one was trained using BERT techniques on a massive set of tuples from the internet. Tuples take the form of input-output. So for example, an input could be a question, and an output could be an answer. 

In [2]:
%pip install sentence-transformers 



Note: you may need to restart the kernel to use updated packages.


In [16]:
from sentence_transformers import SentenceTransformer, util

from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Load the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



## Prepare Corpus

We are going to pull the summary from the <a href="https://en.wikipedia.org/wiki/Japan">Japan Wikipedia Page</a>, then prepare it for vector embedding. 

In [1]:
# set corpus from first page of wikipedia
corpus = "Japan is an island country in East Asia. It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south. Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa. Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto. Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.5 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional regions. The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37.4 million residents. Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first written mention of the archipelago appears in a Chinese chronicle (the Book of Han) finished in the 2nd century AD. Between the 4th and 9th centuries, the kingdoms of Japan became unified under an emperor and the imperial court based in Heian-kyō. Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō) and enforced by a class of warrior nobility (samurai). After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy. In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868. In the Meiji period, the Empire of Japan adopted a Western-modeled constitution and pursued a program of industrialization and modernization. Amidst a rise in militarism and overseas colonization, Japan invaded China in 1937 and entered World War II as an Axis power in 1941. After suffering defeat in the Pacific War and two atomic bombings, Japan surrendered in 1945 and came under a seven-year Allied occupation, during which it adopted a new constitution and began a military alliance with the United States. Under the 1947 constitution, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet. Japan is a highly developed country, and a great power in global politics. Its economy is the world's third-largest by nominal GDP and the fourth-largest by PPP. Although Japan has renounced its right to declare war, the country maintains Self-Defense Forces that rank as one of the world's strongest militaries. After World War II, Japan experienced record growth in an economic miracle, becoming the second-largest economy in the world by 1972 but has stagnated since 1995 in what is referred to as the Lost Decades. Japan has the world's highest life expectancy, though it is experiencing a decline in population. A global leader in the automotive, robotics and electronics industries, the country has made significant contributions to science and technology. The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent comic, animation and video game industries. It is a member of numerous international organizations, including the United Nations (since 1956), OECD, G20 and Group of Seven."
# turn it into an array of sentences
docs = corpus.split('.')
for doc in docs:
    print(doc)

Japan is an island country in East Asia
 It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south
 Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa
 Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto
 Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized
 About three-fourths of the country's terrain is mountainous, concentrating its population of 125
5 million on narrow coastal plains
 Japan is divided into 47 administrative prefectures and eight traditional regions
 The Greater Tokyo Area is the most populous metropolitan area in the world, with mor

## Encode Corpus
encode each array (sentence) into a 384 dimension vector

In [5]:
corpus_vector = model.encode(docs)
print("Length of vector:", len(corpus_vector[0]))

print(corpus_vector)

Length of vector: 384
[[ 0.05527085  0.04808538 -0.00781386 ... -0.01564413 -0.05199257
  -0.02691225]
 [ 0.07182306  0.11629469  0.03326562 ...  0.00400932 -0.0403082
   0.09569602]
 [ 0.11922325  0.00596009 -0.01733767 ...  0.02097983 -0.07156345
   0.0195329 ]
 ...
 [ 0.07631288 -0.05397936 -0.02969839 ... -0.03893653  0.0111805
   0.04070463]
 [-0.02801095 -0.03043353  0.00067352 ... -0.08902533 -0.00195529
   0.02784133]
 [-0.11883838  0.04829862 -0.00254809 ...  0.12640943  0.04654909
  -0.01571736]]


In [6]:
len(corpus_vector)
corpus_vector[0].shape

(384,)

In [3]:
from pymongo import MongoClient
MONGO_CONN = 'mongodb+srv://<username>:<password>@retail-demo.2wqno.mongodb.net/?retryWrites=true&w=majority'
client = MongoClient(MONGO_CONN)

col = client['sample']['vectest']


In [8]:
a = list(map(lambda x:{x[0]:x[1].tolist(), "doc":x[2]},zip(["d"]*28, corpus_vector, docs)))

col.delete_many({})
col.insert_many(a)

<pymongo.results.InsertManyResult at 0x2a103b430>

## Embed Our Query

We then take an english-intuitive question, also send that through the same 384 dimension calculation and then the resulting vector query and corpus query are sent through the `calculate` function, where the most similar strings are calculated. 

In [23]:
# Encode our question and documents in 384 dimension

# query = "How many islands are comprised of Japan?"
# query = "Name a few major cities in japan?"
# query = "Which constitution model did Japan adopt?"
# query = "Which are the important cities in japan?"
query = "Japan is divided into how many administrative regions?"
query_vector = model.encode(query)
print(query_vector)

[ 6.96888566e-02 -2.07897052e-02  1.02820890e-02 -1.01067137e-03
 -2.72650030e-02 -9.10138488e-02 -2.13078246e-03  3.08536552e-02
 -3.66504081e-02 -3.39046009e-02  7.94383362e-02 -8.56574848e-02
 -3.01053412e-02  9.26302001e-02  6.57289997e-02 -6.67824522e-02
 -7.48174936e-02 -1.02015361e-02 -2.52876468e-02 -4.20919806e-02
  2.02686386e-03 -4.83456105e-02  5.34150153e-02 -1.70314442e-02
  4.72873896e-02  2.20477302e-02  5.00186831e-02  1.35236857e-02
  3.43261473e-02 -2.62602512e-03 -6.16401024e-02  4.52699810e-02
  1.01693496e-01  1.86019503e-02  3.19578610e-02 -1.40755596e-02
  2.36791652e-02  2.71681976e-02  2.40688622e-02 -2.27920897e-02
 -7.36001655e-02  3.87528837e-02  7.09293708e-02 -1.67964713e-03
  4.17525433e-02  3.42302881e-02 -2.38649156e-02 -1.88812744e-02
  1.26917008e-02 -3.04418206e-02  6.48808256e-02 -1.91413984e-02
 -2.29218677e-02  1.19591087e-01  4.47782427e-02  1.92081649e-02
  9.12994891e-03  4.85389307e-03 -3.98185998e-02  7.66239092e-02
 -5.01065329e-02 -6.45835

## Calculate Similarity

In [24]:
# Calculate cosine similarity between the corpus of vectors and the query vector
scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

NameError: name 'corpus_vector' is not defined

In [30]:
pipeline = [{
  "$search": {
    "index": "default",
    "knnBeta": {
      "vector": query_vector.tolist(),
      "path": "d",
      # "filter":{
      #           "phrase": {
      #             "path": "doc",
      #             "query": "capital",
      #           }
      #         },
      "k": 3
    }
  }
},
{"$project":{
    "score":{
              '$meta': 'searchScore'
            },
    "doc":1,
    "_id": 0
}}]
res = list(col.aggregate(pipeline))
context = ""
for i in res:
    context += ". "+i['doc']
    print(i['doc'] + "\t score:"+ str(i['score']) )
instruction="Answer is present in the context"



 Japan is divided into 47 administrative prefectures and eight traditional regions	 score:0.9407361149787903
Japan is an island country in East Asia	 score:0.8376696705818176
 Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa	 score:0.8008524179458618


In [31]:
# You can avoid running this cell each time
# Creating a promt template to query



template = """
Answer the question based on the context and adhering to instruciton. If the
question cannot be answered using the information provided answer
with 'I don't know'.
### Context : {context}

### Question: {question}

### Instruction : {instruction}

### Answer:
"""
prompt = PromptTemplate(template=template, input_variables=["question", "context","instruction"])

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path="/Users/ashwin.gangadhar/projects/llma/llama.cpp/models/alpaca/ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

llama.cpp: loading model from /Users/ashwin.gangadhar/projects/llma/llama.cpp/models/alpaca/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5439.94 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 |

In [32]:
llm_chain.run({"question":query,"context":context,"instruction": instruction})

47


llama_print_timings:        load time =  2263.57 ms
llama_print_timings:      sample time =     2.14 ms /     3 runs   (    0.71 ms per token,  1404.49 tokens per second)
llama_print_timings: prompt eval time = 10413.60 ms /   182 tokens (   57.22 ms per token,    17.48 tokens per second)
llama_print_timings:        eval time =    99.87 ms /     2 runs   (   49.93 ms per token,    20.03 tokens per second)
llama_print_timings:       total time = 10551.53 ms


'47'