In [None]:
!pip install annoy > /dev/null
!pip install rank-bm25 > /dev/null
!pip install -U sentence-transformers > /dev/null

# Overview
For this take-home exercise, we hope to give you a sense of the challenges this role will tackle. Posh provides conversational AI solutions for credit unions and banks. Our chatbots and IVR agents are often required to support FAQs. One of the underlying systems that the NLP team manages is the SemFAQ service, which creates a semantic index over FAQs. When users ask a question, we use SemFAQ to match the closest semantically relevant question and return an answer. This take-home exercise simulates a simplified version of our production system. It consists of a set of tasks that will have you build querying logic and evaluate the performance of the system, which should reflect some of the actual challenges of the role. The in-person interview will cover your solutions to this take-home exercise and broadly explore your thoughts on some of the active problems we're trying to solve in this space.

# 1. Setup Indices

In this section we'll create a semantic index and a lexical index for our corpus. This corpus consists of FAQ questions scraped from the Consumer Financial Protection Bureau. The goal of this SemFAQ simulation is to return the document uuid of the FAQ that most closely matches the user question.


In [42]:
import pandas as pd 
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS
import re

# Load corpus for our indices
docs = pd.read_csv("data/index_docs.csv")

# Preview of the corpus
docs.head()

Unnamed: 0.1,Unnamed: 0,title,text,uuid
0,0,How do I find my state's bank regulator?,Take a look at our list of state banking regul...,hj3Rfuw4Ma6B9MN5xKJY8J
1,1,Can I be personally responsible for paying my ...,In most cases you will not be responsible to p...,KitTQDCXSTZNxUhhMwPHpM
2,2,What can I do if I can’t repay my payday loan?,If you’re having trouble repaying your payday ...,gBGm4SsW94nsUbEw2UG5GT
3,3,"If someone dies owing a debt, does the debt go...","No, when someone dies owing a debt, the debt d...",gn6kqZNaUGfDnVBf7Yt9C4
4,4,Can the dealer increase the interest rate afte...,Some dealers will allow the customer to take p...,7T2JvSnbvWNNecQrvZmKs8


In [None]:
# Preprocess Text for BM25 Index
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "ner"])
def clean_lemmatize(text):
    clean_text = re.sub(r'[^\w\s]', '', text.lower())
    toks = [tok.lemma_ for tok in nlp(clean_text) if tok.text.strip() != ""]
    return toks

docs["toks"] = docs["title"].apply(lambda x: clean_lemmatize(x))

### Build and cache Lexical Index using BM25

In [None]:
from rank_bm25 import BM25Okapi
import pickle 

bm25_idx = BM25Okapi(docs["toks"].tolist())

# Use a context manager so the file handle is closed after writing
with open("bm25_idx.pkl", "wb") as f:
    pickle.dump(bm25_idx, f)
print("Finished caching")

### Build and cache semantic index using Annoy

In [3]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

# Generate embeddings using SentenceTransformers
doc_embeddings = model.encode(docs["title"].tolist())

In [4]:
from annoy import AnnoyIndex

# Build simple vector index
dim = 768
idx = AnnoyIndex(dim, "angular")

for i,e in enumerate(doc_embeddings):
    idx.add_item(i, e)
    
print("building index")    
idx.build(1000)

idx.save("semantic_idx.ann")
print("Finished building and caching semantic index")

building index
Finished building and caching semantic index


## Index Code

The code below defines classes for the semantic and lexical indices and provides supporting query methods. Note that documents are indexed in the order they appear in the `docs` dataframe. Both index classes return a numerical index value into the `docs` dataframe. The relevant document uuid can be found by looking up that index in the `docs` dataframe.

In [32]:
import numpy as np 
import pandas as pd  
from annoy import AnnoyIndex
import pickle 
from rank_bm25 import BM25Okapi
import re
from sentence_transformers import SentenceTransformer
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS

from typing import Tuple, List 

class BM25Index():
    def __init__(self, 
                 idx_loc: str="bm25_idx.pkl", 
                 idx_type: str = "bm250kpi"):
        # Use a context manager so the file handle is closed after loading
        with open(idx_loc, "rb") as f:
            self.idx = pickle.load(f)
        self.idx_type = idx_type
        self.nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "ner"])
           
    def query(self, text: str, num_results: int=5) -> List[Tuple[int, float]]:
        """ Query the cached BM25 index and return the top N results. Returns
            a list of tuples, where the first element is the document's numerical
            id and the second element is the raw BM25 score."""
        query_toks = self.preprocess_query(text)
        scores = self.idx.get_scores(query_toks)
        sorted_doc_ids = np.argsort(scores)[::-1][:num_results]
        sorted_scores = [scores[idx] for idx in sorted_doc_ids]
        return list(zip(sorted_doc_ids.tolist(), sorted_scores))  

    def preprocess_query(self, text: str) -> List[str]:
        """ Prepare text for querying BM25 index. Text is cleaned, lemmatized and tokenized.
            Method returns a list of tokens. """
        clean_text = re.sub(r'[^\w\s]', '', text.lower())        
        toks = []
        for tok in self.nlp(clean_text):
            if tok.text not in STOP_WORDS and tok.text.strip() != "":
                toks.append(tok.lemma_)
        return toks

class SemanticIndex():
    def __init__(self, 
                 semantic_weights = "all-mpnet-base-v2",
                 idx_dim=768,
                 cache_loc="semantic_idx.ann"):
        self.model = SentenceTransformer(semantic_weights)

        self.idx = AnnoyIndex(idx_dim, "angular")
        self.idx.load(cache_loc, True)
    
    def cosine_similarity_transform(self, angular_distance: float) -> float:
        """ Convert angular distance into cosine similarity score """
        return (2-(angular_distance**2)) / 2
    
    def query(self, text: str, num_results: int=5) -> List[Tuple[int, float]]:
        """ Query the semantic index and return the top N results. Returns
            a list of tuples, where the first element is the document's numerical
            id and the second element is the cosine similarity score.
        """
        encoded_query = self.model.encode(text)
        doc_idxs, distances = self.idx.get_nns_by_vector(encoded_query, num_results, search_k=-1, include_distances=True)
        scores = [self.cosine_similarity_transform(dist) for dist in distances]
        return list(zip(doc_idxs, scores))
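The `cosine_similarity_transform` method relies on how Annoy documents its `"angular"` metric: the Euclidean distance between the two vectors after each is normalized to unit length, i.e. `d = sqrt(2 * (1 - cos))`. Solving for the cosine gives `cos = 1 - d**2 / 2 = (2 - d**2) / 2`, which is the formula used in the class. The snippet below is a small standalone check of that identity; the vectors and the helper names `unit` and `angular_distance` are made up for illustration and do not touch the cached index.

```python
import math
from typing import List, Sequence


def unit(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between the unit-normalized vectors."""
    return math.dist(unit(u), unit(v))


def cosine_from_angular(distance: float) -> float:
    """Same formula as SemanticIndex.cosine_similarity_transform."""
    return (2 - distance ** 2) / 2


# Hypothetical vectors, chosen only to exercise the formula
u = [1.0, 2.0, 3.0]
v = [2.0, 0.5, 1.0]

# Cosine computed directly vs. recovered from the angular distance
direct_cosine = sum(a * b for a, b in zip(unit(u), unit(v)))
recovered_cosine = cosine_from_angular(angular_distance(u, v))
print(round(direct_cosine, 6), round(recovered_cosine, 6))
```

The two printed values should agree up to floating-point error. A distance of 0 maps to a similarity of 1, and the maximum distance of 2 (opposite vectors) maps to -1.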

## 2. Take Home Exercises
### A.) Task: Implement a naive voting scheme to rank results from both indices

For this exercise we'll walk through creating a `Query` class. This class will contain the logic and the calls needed to query the lexical and semantic indices. The goal is to create a simple class that, given a user query, returns the closest documents based on lexical and semantic similarity.

The `SemanticIndex` and `BM25Index` classes both contain query methods that query the respective indices and return a set of relevant document ids and scores. Your task is to define the logic that selects the best result, given the results from the semantic and lexical indices. To help familiarize you with the code, we'll walk through implementing a simple voting-based logic for answering a query. The next exercise will ask you to improve the query logic, and you will be free to choose whatever methodology and logic you like.

The code below shows how to query the indices and return the relevant document uuids.

In [33]:
# Example of querying semantic index
# create a semantic index object
sem_idx = SemanticIndex()

# Query index. Index will return a list of tuples. First element contains document index and second contains the score. 
results = sem_idx.query("Will my car be taken if I fail to pay my auto loan?")

# We resolve the document ids into uuids and raw titles by looking them up in the docs dataframe.
for result in results:
    print(f"Result idx {result[0]} | Result score {result[1]}")
    print(f"Document title: {docs.iloc[result[0]].title} | Document uuid: {docs.iloc[result[0]].uuid}")
    print("------------")

Result idx 34 | Result score 0.8249342939786217
Document title: If I can’t make my auto loan payments, will my vehicle be repossessed? | Document uuid: 4bMJCVYav2Ux8BqZFWXT4j
------------
Result idx 188 | Result score 0.6808710552168815
Document title: What should I do if I have problems making my auto loan payments? | Document uuid: cUQV8w9vCk3vWXiVzLDkQc
------------
Result idx 104 | Result score 0.6788633475174617
Document title: What happens if I don’t make the payments on my auto lease? | Document uuid: Q5oynF83ykEDD3yyYdg3wk
------------
Result idx 191 | Result score 0.6668141843904039
Document title: What happens to my credit report if I am late making payments on my auto loan or my car is repossessed? | Document uuid: BLgTGfBWEBQzck3cFU7DpQ
------------
Result idx 195 | Result score 0.6559985774376127
Document title: My car has been repossessed, and I was told it will be sold. What can I do? | Document uuid: ktq5ZmiKPwjfjqe2pkfHTW
------------


prefault is set to true, but MAP_POPULATE is not defined on this platform

In [34]:
# The logic for querying the lexical index is similar
bm25_idx = BM25Index()

# Query index. Index will return a list of tuples. First element contains document index and second contains the score.
results = bm25_idx.query("Will my car be taken if I fail to pay my auto loan?")

# We resolve the document ids into uuids and raw titles by looking them up in the docs dataframe.
for result in results:
    print(f"Result idx {result[0]} | Result score {result[1]}")
    print(f"Document title: {docs.iloc[result[0]].title} | Document uuid: {docs.iloc[result[0]].uuid}")
    print("------------")

Result idx 191 | Result score 7.892702494670768
Document title: What happens to my credit report if I am late making payments on my auto loan or my car is repossessed? | Document uuid: BLgTGfBWEBQzck3cFU7DpQ
------------
Result idx 193 | Result score 7.723089911354117
Document title: What is a “no credit check" or “buy here, pay here” auto loan? | Document uuid: h6RhihYnKDVCMnWc2TbAKS
------------
Result idx 169 | Result score 6.637430659552135
Document title: I bought a used car and financed it at the dealership. They told me that the car was in good shape, but it turned out to have serious mechanical problems. The dealer says the warranty doesn't cover any of the problems. I can't afford to pay my loan payment and get the car fixed. What can I do? | Document uuid: 5oxJppq39oC9UeBqvAwmwe
------------
Result idx 208 | Result score 5.46317950375905
Document title: Should I have car insurance lined up before I purchase a vehicle? | Document uuid: CFkfkm2fpDNFV7gZLxJPxE
------------
Resul

Given the example code above, fill out the logic in the `rank_results` method below. Given a list of tuples, each consisting of a document id and a score, the naive voting logic should do the following:
1. group the ids and sum all the scores
2. rerank the documents in descending order based on the summed scores
3. return the deduplicated list of documents and the summed scores
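The three steps above can be sketched as a standalone function. This is one possible reading of the naive scheme, not a reference solution; the function name `naive_vote` is hypothetical, and the toy input uses scores rounded from the example outputs above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def naive_vote(results: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Sum the scores per document id and return (id, summed score)
    pairs sorted from highest to lowest summed score."""
    totals: Dict[int, float] = defaultdict(float)
    for doc_id, score in results:
        totals[doc_id] += score  # step 1: group ids and sum scores
    # steps 2 and 3: one entry per id, ranked by summed score
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


# Toy combined list: two semantic results followed by two lexical results
combined = [(34, 0.82), (191, 0.67), (191, 7.89), (193, 7.72)]
ranked = naive_vote(combined)
print(ranked)
```

In this toy input, document 191 appears in both lists and ranks first, while document 34, the strongest semantic match, ranks last. Raw BM25 scores and cosine similarities are on different scales, so a plain sum is dominated by the lexical index.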


In [35]:
class Query():   
    def __init__(self) -> None:
        self.semantic_idx = SemanticIndex()
        self.lexical_idx = BM25Index()
        self.document_df = pd.read_csv("data/index_docs.csv")
               
    def query(self, text: str) -> str:
        """ Take a query and return the uuid of the document that best
            matches the query text.
        """
        
        # 1. Query indices
        sem_results = self.semantic_idx.query(text)
        lexical_results = self.lexical_idx.query(text)
        
        # 2. Combine the results into a single list 
        sem_results.extend(lexical_results)
        
        # 3. Dedup and rank the results
        ranked_results = self.rank_results(sem_results)
        
        # 4. Return top result uuid
        top_result = ranked_results[0][0]
        return self.document_df.iloc[top_result].uuid 
            
    def rank_results(self, results: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
        # Implement your voting logic here.  
        return results

query_obj = Query()
query_obj.query("What happens to my car if I can't pay my loan?")    

prefault is set to true, but MAP_POPULATE is not defined on this platform

'4bMJCVYav2Ux8BqZFWXT4j'

### B.) Evaluate performance on sample queries

In this section we'll evaluate the performance of the `Query` class on a set of generated queries. Run the code below to run the queries and get the predicted document uuids. You'll notice that there is a `type` column which classifies each query as either paraphrase or keyword. This category designates whether the simulated query was generated using a paraphrasing technique (on the header) or a keyword extraction technique. In our product, we often see users either ask well-formed questions or provide short keyword-style inputs.

In [41]:
queries = pd.read_csv("data/query_sample.csv")[:10]

queries["predicted"] = queries["query"].apply(lambda x: query_obj.query(x))

queries.head()

Unnamed: 0,query,type,gold_uid,predicted
0,information debt collector debt,keyword,RYaR2WYJssw5xcPvPXCJgp,RYaR2WYJssw5xcPvPXCJgp
1,credit union share draft account,keyword,7z2jYhbY4ZiG8DKHh2mSn3,7z2jYhbY4ZiG8DKHh2mSn3
2,can i apply for a credit card with one set of ...,paraphrase,GgS3M8h8998z6nc5mPR5vA,GgS3M8h8998z6nc5mPR5vA
3,if my credit card doesn't work what can i do?,paraphrase,CVcjcDMjxsqCizWrsT3R5u,CVcjcDMjxsqCizWrsT3R5u
4,request credit report child,keyword,CE3RKEmS4ujtA2TfE3v2KP,CE3RKEmS4ujtA2TfE3v2KP


### Task: Analyze results
Analyze the predictions above. What metrics do you think would be helpful in evaluating the accuracy and relevancy of the results being returned? At minimum, implement two metrics in code below and describe how we should interpret those results.
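As a possible starting point rather than the expected answer, the sketch below computes top-1 accuracy overall and broken down by query type. The function names and the toy uuids are hypothetical. Ranking-aware metrics such as mean reciprocal rank would need `Query.query` to return a ranked list instead of a single uuid, so they are left out here.

```python
from collections import defaultdict
from typing import Dict, List


def top1_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of queries whose predicted uuid equals the gold uuid."""
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)


def accuracy_by_type(gold: List[str],
                     predicted: List[str],
                     types: List[str]) -> Dict[str, float]:
    """Top-1 accuracy computed separately for each query type."""
    hits: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for g, p, t in zip(gold, predicted, types):
        counts[t] += 1
        hits[t] += int(g == p)
    return {t: hits[t] / counts[t] for t in counts}


# Toy data standing in for the gold_uid, predicted and type columns
gold = ["a", "b", "c", "d"]
predicted = ["a", "x", "c", "d"]
types = ["keyword", "keyword", "paraphrase", "paraphrase"]

overall = top1_accuracy(gold, predicted)
per_type = accuracy_by_type(gold, predicted, types)
print(overall, per_type)
```

With the notebook's dataframe, the inputs would be `queries["gold_uid"].tolist()`, `queries["predicted"].tolist()`, and `queries["type"].tolist()`. A gap between the keyword and paraphrase accuracies would suggest that one of the two indices is carrying more of the load for that style of input.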

In [None]:
# your code here

* your analysis here *

### 3.) Task: Improve the ranking logic

In this section you will have the opportunity to improve the system above. You're welcome to change any of the logic and infrastructure above (e.g., embeddings, distance metrics). Develop your POC below and provide an updated evaluation on the sample queries. In the in-person interview, we'll walk through your thoughts. If you have ideas that are more complex, it's OK if you are unable to implement them; we can talk through those in the interview as well.

At minimum, we'd like to see some basic improvements to the ranking logic implemented.
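One direction among many, shown only as an illustration: reciprocal rank fusion (RRF) combines the two result lists by rank position rather than by raw score, which sidesteps the scale mismatch between BM25 scores and cosine similarities. The function name, the toy scores, and the constant `k = 60` (a commonly used default for RRF) are assumptions, not part of the exercise. The sketch assumes each input list is already sorted best-first, as both `query` methods above return them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(result_lists: List[List[Tuple[int, float]]],
                           k: int = 60) -> List[Tuple[int, float]]:
    """Fuse several best-first result lists. Each document receives
    1 / (k + rank) from every list it appears in; raw scores are ignored."""
    fused: Dict[int, float] = defaultdict(float)
    for results in result_lists:
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Made-up results, each list sorted best-first
semantic = [(34, 0.82), (188, 0.68), (191, 0.67)]
lexical = [(191, 7.89), (34, 7.72), (193, 6.64)]

fused_ranking = reciprocal_rank_fusion([semantic, lexical])
print([doc_id for doc_id, _ in fused_ranking])
```

Documents that rank well in both lists rise to the top regardless of how large their raw scores are. Whether this actually helps on the sample queries is something the evaluation from section B would need to confirm.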