###  Virtual Agent for Frequently Asked Questions

Building a virtual agent that understands the semantics of user utterances has become simple, thanks to transformer-based models and a large collection of open-source libraries.

###### Import libraries for data analysis

In [1]:
import numpy as np
import pandas as pd

###### Import libraries for text mining

In [2]:
from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import en_stopwords
from texthero.preprocessing import remove_digits

###### Import libraries for transformers

In [3]:
from sentence_transformers import SentenceTransformer

###### Import libraries for computing similarities

In [4]:
from torch.nn import CosineSimilarity
import torch

###### Import library for storing into binary file

In [5]:
import pickle

###### Load data source into dataframe

In [6]:
DATA_SOURCE_PATH = r"faqs.csv"
df = pd.read_csv(DATA_SOURCE_PATH, encoding_errors="ignore")
pd.set_option('display.max_colwidth', None)
df

Unnamed: 0,Q,A
0,What is kandi?,"kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items."
1,Have feedback or want to know more?,"We are a passionate set of application focused techies. Wed love to hear from you on your feedback, questions, and any other comments.\nDirect Message us on Twitter Message @OpenWeaverInc\nYou can email us at kandi.support@openweaver.com\nJoin our Discord community here"
2,What components does kandi cover?,kandi helps you select software components across:\nPackages from all package managers and repositories\nSource Code across all major code repositories\nCloud Functions and APIs across all hyperscale cloud providers
3,How do I use kandi?,"kandi provides two simplified experiences to help you choose the right software component to accelerate your application development:\n\n1. Search\nYou can search for the component using natural language to describe your functional and technical requirements, and kandi gets to work by matching these over 430 million knowledge items to show you a shortlist.\nYou can further filter them or refine your query and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\n\n2. Explore\nYou can Explore kandi curated sections across Popular Collections, Hot Tech, and Industry Domains from the Home Page or the Explore Page. These sections list the popular components among your peers, have functional relevance, and positive security, quality, and support scores in the respective areas.\nYou can browse these sections to get industry insights.\nYou can further filter them and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links."
4,How do I shortlist components on kandi?,"You can use the below filters to shortlist components based on your architectural preferences:\n\nLanguages This is an expanding list of languages chosen by popularity amongst kandi users.\nLicenses Licenses are grouped by:\n\nOSS License families, covering Permissive, Weak Copyleft, and Strong Copyleft.\nProprietary license category covering the emerging cloud licenses as well as As-a-Service contracts.\nNo License indicates that the respective repository does not have the license file declared as per the repository managers standard. They could still have a license file declared in a different format or section. Components without a license have all rights reserved, and you may not be able to use them. Hence kandi alerts you when a valid license file is not found.\n\nSupport High support indicates a thriving ecosystem across the author and users, that will help you implement with relative ease.\nComponent Types Component Types are grouped by:\n\nLibraries from package managers and repositories that can be readily installed.\nSource Code that may or may not be associated with a package and are from code repositories.\nCloud Functions and APIs that are provided As-a-Service from cloud providers.\n\nSources This is an expanding list of software component sources chosen by popularity amongst kandi users.\nIndustries This indicates the industry domain that the component has been associated with or could be used in, for specific use cases.\nSecurity This reflects the security score of the software component across reported and code-based vulnerabilities."
5,How do I implement the components that I have selected on kandi?,"The component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\nYou can follow implementation instructions from the software component home page based on the component type."


###### Cleanse data by removing numbers and punctuation
This pre-processing step gets rid of unnecessary text that would otherwise hinder the model's ability to match questions. Techniques like stemming and lemmatisation can also help here.

As we're using sentence embeddings, we don't do extensive pre-processing here. The higher the quality of the dataset, the less pre-processing is needed.

In [7]:
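# Remove digits and punctuation from each question
df['procd_Q'] = df['Q'].pipe(remove_digits).pipe(remove_punctuation)
# Optional further steps, left disabled here (hi_stopwords is not imported above):
#   .pipe(remove_lessthan, length=3)
#   .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))
df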


Unnamed: 0,Q,A,procd_Q
0,What is kandi?,"kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items.",What is kandi
1,Have feedback or want to know more?,"We are a passionate set of application focused techies. Wed love to hear from you on your feedback, questions, and any other comments.\nDirect Message us on Twitter Message @OpenWeaverInc\nYou can email us at kandi.support@openweaver.com\nJoin our Discord community here",Have feedback or want to know more
2,What components does kandi cover?,kandi helps you select software components across:\nPackages from all package managers and repositories\nSource Code across all major code repositories\nCloud Functions and APIs across all hyperscale cloud providers,What components does kandi cover
3,How do I use kandi?,"kandi provides two simplified experiences to help you choose the right software component to accelerate your application development:\n\n1. Search\nYou can search for the component using natural language to describe your functional and technical requirements, and kandi gets to work by matching these over 430 million knowledge items to show you a shortlist.\nYou can further filter them or refine your query and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\n\n2. Explore\nYou can Explore kandi curated sections across Popular Collections, Hot Tech, and Industry Domains from the Home Page or the Explore Page. These sections list the popular components among your peers, have functional relevance, and positive security, quality, and support scores in the respective areas.\nYou can browse these sections to get industry insights.\nYou can further filter them and pick your chosen ones based on scores available on the component listing page.\nClick on the components from the list to review detailed insights such as support, quality, security, and a reference guide covering code snippets, community discussions from the provider, and popular channels.\nThe component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.",How do I use kandi
4,How do I shortlist components on kandi?,"You can use the below filters to shortlist components based on your architectural preferences:\n\nLanguages This is an expanding list of languages chosen by popularity amongst kandi users.\nLicenses Licenses are grouped by:\n\nOSS License families, covering Permissive, Weak Copyleft, and Strong Copyleft.\nProprietary license category covering the emerging cloud licenses as well as As-a-Service contracts.\nNo License indicates that the respective repository does not have the license file declared as per the repository managers standard. They could still have a license file declared in a different format or section. Components without a license have all rights reserved, and you may not be able to use them. Hence kandi alerts you when a valid license file is not found.\n\nSupport High support indicates a thriving ecosystem across the author and users, that will help you implement with relative ease.\nComponent Types Component Types are grouped by:\n\nLibraries from package managers and repositories that can be readily installed.\nSource Code that may or may not be associated with a package and are from code repositories.\nCloud Functions and APIs that are provided As-a-Service from cloud providers.\n\nSources This is an expanding list of software component sources chosen by popularity amongst kandi users.\nIndustries This indicates the industry domain that the component has been associated with or could be used in, for specific use cases.\nSecurity This reflects the security score of the software component across reported and code-based vulnerabilities.",How do I shortlist components on kandi
5,How do I implement the components that I have selected on kandi?,"The component listing and detailed insights page have links to the software component home. Some software components may have multiple providers, and you can access all the links.\nYou can follow implementation instructions from the software component home page based on the component type.",How do I implement the components that I have selected on kandi
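To make the cleansing step above more concrete, here is a minimal sketch of a comparable digit and punctuation removal written with only the standard library. The helper name `cleanse` is hypothetical, and this is not the actual implementation used by `texthero` or `lingualytics`; unlike those functions, it also collapses repeated whitespace.

```python
import re
import string


def cleanse(text):
    """Replace digits and punctuation with spaces, then collapse whitespace.

    Hypothetical helper for illustration only; the library functions used
    in the notebook may treat edge cases differently.
    """
    text = re.sub(r"\d+", " ", text)  # drop runs of digits
    text = re.sub(rf"[{re.escape(string.punctuation)}]+", " ", text)  # drop punctuation
    return " ".join(text.split())  # collapse repeated spaces and trim the ends


print(cleanse("What is kandi?"))
print(cleanse("Top 10 FAQs, answered!"))
```

The same function has to be applied to both the dataset questions and the user query, so that their embeddings are computed from comparably cleaned text.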


###### Load sentence transformer model of your choice for getting sentence embeddings
Choose a model by comparing the available pretrained models at the link below, weighing aspects such as accuracy, speed, and model size.
https://www.sbert.net/docs/pretrained_models.html

In [8]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

###### Find embeddings of sentences and store in a binary file

Storing the embeddings in a binary file lets us load and reuse them later without having to compute them again. We use pickle here to write the binary file. You may also use joblib.

In [9]:
MODEL_PATH = r"models/model_va.pickle"  # the models/ directory must already exist
# Compute embeddings for all the questions in the dataset.
# For a massive dataset, the embeddings can be computed in batches.
q_embs = model.encode(df["procd_Q"].tolist())
with open(MODEL_PATH, "wb") as file:
    pickle.dump(q_embs, file)
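The save step above will fail if the `models/` directory is missing. A minimal sketch of a save and load round trip that creates the directory first could look like this. The helper names `save_embeddings` and `load_embeddings` are hypothetical, and the toy vectors only stand in for real sentence embeddings.

```python
import pickle
import tempfile
from pathlib import Path


def save_embeddings(embeddings, path):
    """Pickle embeddings to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as file:
        pickle.dump(embeddings, file)


def load_embeddings(path):
    """Load embeddings previously written by save_embeddings."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Toy vectors standing in for real sentence embeddings (assumption for illustration)
toy_embs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "models" / "model_va.pickle"
    save_embeddings(toy_embs, target)
    restored = load_embeddings(target)

print(restored == toy_embs)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that you created yourself or otherwise trust.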

###### Load embeddings from binary file into memory

In [10]:
with open(r"models/model_va.pickle", "rb") as file:
    q_embs = pickle.load(file)

###### Predict answer to user query
The user query is cleansed and pre-processed in the same way as the dataset questions, and then the closest matching question from the data source is predicted. The matched question is used to look up the corresponding answer.

In [11]:
def pred_answer(usr_query):
    # Use the same cleansing pipeline that was used for computing embeddings from the dataset
    df_query = pd.DataFrame([usr_query], columns=["usr_query"])
    df_query["clean_usr_q"] = df_query["usr_query"].pipe(remove_digits).pipe(remove_punctuation)
    usr_q_emb = model.encode(df_query["clean_usr_q"].tolist())  # embedding of shape (1, dim)
    cosine_similarity = CosineSimilarity(dim=1)
    # Score the query against every stored question embedding and pick the best match
    scores = cosine_similarity(torch.from_numpy(usr_q_emb), torch.from_numpy(q_embs))
    q_idx = torch.argmax(scores).item()
    return df["A"][q_idx]  # look up the answer of the matched question in the input dataframe
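The matching step above boils down to computing the cosine similarity between the query embedding and each question embedding, then taking the index of the highest score. Here is a rough standard-library sketch of that idea. The helpers `cosine` and `best_match` are hypothetical, and the 2-dimensional toy vectors are made up for illustration; real sentence embeddings have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match(query_vec, candidate_vecs):
    """Return the index of the most similar candidate and all the scores."""
    scores = [cosine(query_vec, vec) for vec in candidate_vecs]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx, scores


# Toy 2-dimensional "embeddings" (assumption for illustration)
toy_q_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, scores = best_match([0.9, 0.1], toy_q_embs)
print(idx, [round(s, 3) for s in scores])
```

Because cosine similarity only depends on the angle between vectors, the query does not need to share words with a stored question. It only needs to land close to it in the embedding space.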

In [12]:
usr_query = "tell me about kandi"

In [13]:
pred_answer(usr_query)

kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items.



###### Simulating Virtual Agent

In [14]:
while True:
    usr_q = input("Ask a query:")
    if usr_q == "exit":
        break
    else:
        print("Answer: ", pred_answer(usr_q))
    print("-----------------")

Ask a query:tell me about kandi
usr query is:  tell me about kandi
kandi (pronounced kandee) is a platform that helps developers pick the right library, package, code samples, APIs, and cloud functions, by analyzing over 430 million knowledge items.
-----------------
Ask a query:components in kandi
usr query is:  components in kandi
kandi helps you select software components across:
Packages from all package managers and repositories
Source Code across all major code repositories
Cloud Functions and APIs across all hyperscale cloud providers
-----------------
Ask a query:exit
usr query is:  exit
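The interactive loop above is hard to exercise automatically because it reads from the keyboard. One possible variation, sketched here under the assumption that the answering function is passed in as a parameter, processes a list of queries until it meets the stop word. The names `run_agent` and `stub_answer` are hypothetical, and the stub merely echoes the query instead of calling the real `pred_answer`.

```python
def run_agent(queries, answer_fn, stop_word="exit"):
    """Answer each query in order, stopping at the first stop word."""
    answers = []
    for query in queries:
        if query.strip().lower() == stop_word:
            break
        answers.append(answer_fn(query))
    return answers


def stub_answer(query):
    """Placeholder for pred_answer so the sketch runs without a model."""
    return f"(answer for: {query})"


answers = run_agent(["tell me about kandi", "exit", "never reached"], stub_answer)
print(answers)
```

In the notebook, `pred_answer` would take the place of `stub_answer`.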
