<a href="https://colab.research.google.com/github/UsmanQT/AI-advisory-system/blob/generate-embeddings-alpaca-colab/Load_Embeddings_Generate_Question_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The first part of this notebook reads the embeddings stored in the Firebase (Firestore) collection "paragraphs" and computes the cosine similarity between the question's embedding and each embedding read from the collection. We take the top-matching paragraphs and concatenate them to form the context. In the second part of the notebook, we use that context to answer the question.
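The retrieval step described above can be sketched on toy data, without Firebase or an embedding model. The function name `top_k_by_cosine`, the two-dimensional vectors, and the paragraph texts are made up for illustration; the notebook itself ranks the stored paragraphs one at a time in a loop.

```python
import numpy as np

def top_k_by_cosine(question_vec, paragraph_vecs, texts, k=2):
    """Return the k (similarity, text) pairs with the highest cosine similarity."""
    q = np.asarray(question_vec, dtype=float)
    P = np.asarray(paragraph_vecs, dtype=float)
    # Cosine similarity of q against every row of P at once
    sims = (P @ q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q))
    # Indices of the k largest similarities, best first
    order = np.argsort(-sims)[:k]
    return [(float(sims[i]), texts[i]) for i in order]

# Toy data (hypothetical; not taken from the "paragraphs" collection)
toy_texts = ["paragraph A", "paragraph B", "paragraph C"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

top = top_k_by_cosine([1.0, 0.1], toy_vecs, toy_texts, k=2)
# Concatenate the best matches to form the context
context = " ".join(text for _, text in top)
print(context)
```

Here the question vector points almost along the first axis, so "paragraph A" ranks first and "paragraph C" second.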

In [None]:
!pip install firebase
!pip install firebase-admin
!pip install langchain
!pip install huggingface_hub
!pip install sentence_transformers > /dev/null

In [9]:
from firebase_admin import credentials
from firebase_admin import firestore
import firebase_admin
import numpy as np
from langchain import HuggingFaceHub
import os
import requests
from langchain.embeddings import HuggingFaceEmbeddings
import pandas as pd

In [29]:
# Set the Hugging Face API token as an environment variable and a global variable.
# Replace the placeholder with your own token; do not commit a real token to the notebook.
hf_token = "ADD-YOUR-HUGGINGFACE-TOKEN"
os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_token
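One way to keep the token out of the notebook is to read it from an environment variable that is set outside the notebook. The sketch below assumes that setup; the helper name `load_hf_token` is hypothetical and not part of any library.

```python
import os

def load_hf_token(env_var="HUGGINGFACEHUB_API_TOKEN"):
    """Read the Hugging Face token from the environment instead of hard-coding it."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message rather than at the first API call
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return token

# Usage (assumes the variable was exported before the notebook was started):
# hf_token = load_hf_token()
```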

In [12]:
# Function to initialize Firebase and get a reference to the database where the embeddings are stored
def initializeFirebase():
  # getSimilarParagraphs reads the global `db`, so the client must be stored globally
  global db
  cred = credentials.Certificate("ADD-YOUR-JSON-FILE.json")
  firebase_admin.initialize_app(cred)

  # Get a reference to your Firestore database
  db = firestore.client()
  return db

In [13]:
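# Function to compute cosine similarity
def cosine_similarity(embedding1, embedding2):
    # Assumes the embeddings are numpy arrays. When embedding1 has shape (1, d) and
    # embedding2 has shape (d,), the result is a length-1 array rather than a scalar.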
    return np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
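Cosine similarity is the dot product of two vectors divided by the product of their norms. As a minimal sketch, the variant below flattens both inputs and returns a plain float, which avoids the `array([...])` wrappers that appear in the ranked results further down. The name `cosine_similarity_scalar` and the zero-vector handling are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def cosine_similarity_scalar(embedding1, embedding2):
    """Cosine similarity as a plain float; inputs may be 1-D or (1, d)-shaped."""
    a = np.asarray(embedding1, dtype=float).ravel()
    b = np.asarray(embedding2, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A zero vector has no direction, so report 0.0 instead of dividing by zero
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

same = cosine_similarity_scalar([[3.0, 4.0]], [3.0, 4.0])      # identical direction
orthogonal = cosine_similarity_scalar([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
print(same, orthogonal)
```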

In [71]:
# Function to extract the context from the paragraphs most similar to the question
def getSimilarParagraphs(modelName, askQuestion):
  embeddings = HuggingFaceEmbeddings(model_name=modelName)
  question = askQuestion
  question_embedding = embeddings.embed_documents(question)
  print("Question: ")
  print(question)
  question_embedding = np.array(question_embedding)
  print("Question's embeddings: ")
  print(question_embedding)

  # Initialize a list to store (similarity, text) pairs
  results = []

  # Query the "paragraphs" collection and read the "text" and "embeddings" fields from each document
  paragraphs_ref = db.collection("paragraphs")
  docs = paragraphs_ref.stream()

  for doc in docs:
      doc_data = doc.to_dict()
      # Skip documents that lack either field, and paragraphs shorter than 70 characters
      if 'embeddings' in doc_data and 'text' in doc_data and len(doc_data['text']) >= 70:
        answer_embeddings = np.array(doc_data['embeddings'])

        similarity = cosine_similarity(question_embedding, answer_embeddings)

        results.append((similarity, doc_data['text']))
  # Sort the results by similarity in descending order
  results.sort(key=lambda x: x[0], reverse=True)

  # Return the top similar paragraphs
  n = 10  # Number of similar paragraphs to retrieve
  top_n_results = results[:n]

  print(f"Top {n} paragraphs: ")
  print(top_n_results)

  context = ""
  for i in top_n_results:
    context = context + i[1].strip()
  context = context.replace('\n',' ')
  return context
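The loop above joins the top paragraphs with no separator, so the end of one paragraph runs straight into the start of the next. A possible alternative, sketched here with a hypothetical helper `build_context` and made-up sample strings, normalizes whitespace and joins the paragraphs with a single space.

```python
def build_context(paragraphs, separator=" "):
    """Join paragraphs into one context string with normalized whitespace."""
    # str.split() with no argument splits on any run of whitespace,
    # including newlines, tabs, and non-breaking spaces (\xa0)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop paragraphs that were empty or whitespace-only
    return separator.join(p for p in cleaned if p)

# Made-up sample paragraphs for illustration
sample = ["First paragraph.\n", "\n\t\tSecond\xa0 paragraph.", "   "]
print(build_context(sample))
```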

In [None]:
# Function call to get the reference of database
initializeFirebase()

In [74]:
# Function call to get the context for the question from the similar paragraphs that are stored in firebase database.
question=["who is Jonathan Engelsma?"]
generatedContext = getSimilarParagraphs(modelName="declare-lab/flan-alpaca-large", askQuestion = question)



Question: 
['who is Jonathan Engelsma?']
Question's embeddings: 
[[-0.01108511  0.00080256  0.02336332 ...  0.04015426 -0.00619658
  -0.0211643 ]]
Top 10 paragraphs: 
[(array([0.34684686]), 'Any questions regarding senior projects, can be directed to the current instructors of the course: Dr. Adams and/or Dr. Engelsma.'), (array([0.34323968]), 'Dr. Kalafut’s teaching and research focuses on networking and security. \xa0He completed his Ph.D. in computer science at Indiana University in 2010, where he focused on cyberfraud detection through infrastructure analysis.'), (array([0.33984828]), 'Undergraduates interested in applying for an ACI Residency should email their CV to Dr. Engelsma at least one semester in advance, indicating their interest in the program.'), (array([0.32985065]), '\nPoint of contact:\xa0 Email Rahat Rafiq ( rafiqr@gvsu.edu ) if you have any queries.'), (array([0.32528429]), '\n\t\t\t\t\t\t\tCome and attend the talk by Dr. Mike Doyle titled "The Visible\nEmbryo Proj

In [75]:
generatedContext

'Any questions regarding senior projects, can be directed to the current instructors of the course: Dr. Adams and/or Dr. Engelsma.Dr. Kalafut’s teaching and research focuses on networking and security. \xa0He completed his Ph.D. in computer science at Indiana University in 2010, where he focused on cyberfraud detection through infrastructure analysis.Undergraduates interested in applying for an ACI Residency should email their CV to Dr. Engelsma at least one semester in advance, indicating their interest in the program.Point of contact:\xa0 Email Rahat Rafiq ( rafiqr@gvsu.edu ) if you have any queries.Come and attend the talk by Dr. Mike Doyle titled "The Visible Embryo Project: Following the connections from chick embryos to Bitcoin.Dr. Zachary DeBruine is an Assistant Professor of Computing at Grand Valley State University within the Applied Computing Institute. Dr. DeBruine is leading research in collaboration with academic and industry partners to develop high-performance machine l

The next part of the notebook uses open-source question-answering models. We feed each model the question and the context (generated in the first part) and get an answer. Three models are tried in turn.

FIRST APPROACH USING consciousAI/question-answering model

In [None]:
!pip install torch
from transformers import pipeline
import torch
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer)

In [77]:
def _generate(query, context, model, device):
    # Load the fine-tuned seq2seq model and its tokenizer
    FT_MODEL = AutoModelForSeq2SeqLM.from_pretrained(model).to(device)
    FT_MODEL_TOKENIZER = AutoTokenizer.from_pretrained(model)
    # `query` is a list of questions; only the first one is used
    input_text = "Extract the answer from question_context. question: " + query[0] + "</s> question_context: " + context

    # Tokenize the prompt, truncating or padding it to exactly 1024 tokens
    input_tokenized = FT_MODEL_TOKENIZER.encode(input_text, return_tensors='pt', truncation=True, padding='max_length', max_length=1024).to(device)

    summary_ids = FT_MODEL.generate(input_tokenized,
                                       max_length=500,
                                       min_length=25,
                                       num_beams=2,
                                       early_stopping=True,
                                   )
    # Decode each generated sequence back to text
    output = [FT_MODEL_TOKENIZER.decode(ids, clean_up_tokenization_spaces=True, skip_special_tokens=True) for ids in summary_ids]

    return str(output[0])

# Use the first GPU when available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else 'cpu'
print('Answer: ')
print(_generate(question, generatedContext, model="consciousAI/question-answering-generative-t5-v1-base-s-q-c", device=device))


Answer: 
Dr. Joshua Engelsma, is an Assistant Professor of Computing in the area of Automated Fingerprint Identification Systems


SECOND APPROACH USING bert-large model

In [78]:
from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer

# Load the pre-trained model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Get an answer
question_ask = "Answer this question based on the context and give a detailed answer. question: "+question[0]
answer = nlp(context=generatedContext, question=question_ask)

# Display the answer and confidence score
print("Answer:", answer['answer'])
print("Confidence Score:", answer['score'])

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Answer: Dr
Confidence Score: 0.16120602190494537


THIRD APPROACH USING facebook/bart-large-cnn model

In [79]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the pre-trained BART model and tokenizer
model_name2 = "facebook/bart-large-cnn"
model2 = BartForConditionalGeneration.from_pretrained(model_name2)
tokenizer2 = BartTokenizer.from_pretrained(model_name2)


# Convert the question-answering task into a text generation task
input_text = f"Answer the question from the context and do not include information unrelated to the question. Question: {question[0]} Context: {generatedContext}"

# Tokenize the input text
input_ids = tokenizer2.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)

# Generate the answer as a summary
summary_ids = model2.generate(input_ids, max_length= 500, num_beams=4, length_penalty=2.0, early_stopping=True)

# Decode the generated answer
answer = tokenizer2.decode(summary_ids[0], skip_special_tokens=True)

# Print the generated answer
print("Answer:")

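# Remove the instruction text only if the model echoed it at the start of the answer.
# str.strip() with an argument would instead strip any of those characters from both ends.
instruction = "Answer the question from the context and do not include information unrelated to the question."
answer = answer.strip().lstrip('"')
if answer.startswith(instruction):
    answer = answer[len(instruction):]
print(answer.strip())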


Answer:
Come and attend the talk by Dr. Joshua Engelsma, where he sheds light on the recent trends and challenges in the area of Automated Fingerprint Identification Systems (AFIS) Dr. Zachary DeBruine is leading research in collaboration with academic and industry partners to develop high-performance machine learning algorithms to analyze big biolog
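A note on `str.strip`: when given an argument, it treats the argument as a set of characters and removes any of them from both ends of the string, not a literal prefix. This may explain why the printed answer above ends in "biolog". The sketch below shows the difference on a made-up string; `remove_prefix` is a hypothetical helper, and on Python 3.9+ the built-in `str.removeprefix` does the same job.

```python
INSTRUCTION = "Answer the question from the context and do not include information unrelated to the question."

def remove_prefix(text, prefix):
    """Drop `prefix` from the start of `text` only when it is really there."""
    return text[len(prefix):] if text.startswith(prefix) else text

# Made-up generated text that echoes the instruction (for illustration only)
generated = INSTRUCTION + " The lab studies biological data"

# Character-set stripping: also eats trailing letters that occur in INSTRUCTION
stripped = generated.strip(INSTRUCTION)
# Literal prefix removal: leaves the rest of the answer intact
cleaned = remove_prefix(generated, INSTRUCTION).strip()

print(repr(stripped))  # 'The lab studies biolog'
print(repr(cleaned))   # 'The lab studies biological data'
```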
