# Other Embedding Approaches

- First written: 2021-05-10
- Last edited: 2021-05-30

# 1. Current Challenges

There are currently two issues:

1. No good way to evaluate model performance quantitatively.
    - Solution 1: Human evaluation. Recruit volunteers, collect (query, link) pairs, and ask the volunteers to rate each pair from 1 to 5 for relevance (industry practice).
    - Solution 2: Logging. Put the system in the wild and treat clicks as a proxy for relevance, and impressions without clicks as a proxy for non-relevance.
        - The problem is that logging is a whole other issue in itself, and not a lightweight solution at all.

2. Lack of data
    - LSI inferences are made with a small dataset of at most a hundred-odd words.
    - We could possibly increase the amount of data collected.
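
If the human ratings from Solution 1 were collected, one possible way to turn them into a single number per query is nDCG@k. This is only a sketch under assumptions: the notes do not name a metric, the ratings below are made up, and the function names are hypothetical.

```python
import math

def dcg_at_k(ratings, k):
    """Discounted cumulative gain of the first k ratings, taken in ranked order."""
    return sum(
        (2 ** rating - 1) / math.log2(rank + 2)
        for rank, rating in enumerate(ratings[:k])
    )

def ndcg_at_k(ratings, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same ratings."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0

# Hypothetical volunteer ratings (1-5) for the top 5 links returned for one query,
# listed in the order the system ranked them.
ratings_for_query = [5, 3, 4, 1, 2]
score = ndcg_at_k(ratings_for_query, k=5)
print(round(score, 3))
```

Averaging this score over many queries would give one number to compare LSI against the embedding models tried below. Note that with a 1-5 scale the lowest rating still contributes a small gain; shifting ratings to 0-4 first is a design choice worth considering.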

# 2. Algorithm Changes

LSI has its limitations: it works well when comparing two documents of similar length, but not when one side (for example a short query) is sparse on words or intent. A couple of ideas for handling this:

1. Query augmentation:
    - Expand the query by augmenting it with synonyms.
    - Basic form: some kind of w2v "(approximate) nearest neighbour" retrieval, then concatenate the neighbours onto the query.
    - Well practiced in [industry today](https://bytes.grubhub.com/search-query-embeddings-using-query2vec-f5931df27d79), but not sure if it's necessarily the best way.

2. Better-quality embedding representation
    - LSI is good at topic finding and analysis, but less so at retrieval.
    - What if we represented the text using pretrained embeddings instead?
        - Basic ones:
            - w2v
            - GloVe
        - More advanced:
            - BERT-like transformers
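
A rough sketch of the basic query augmentation idea (nearest-neighbour lookup, then concat). The word vectors here are tiny made-up stand-ins for real pretrained w2v / GloVe vectors, the search is exact rather than approximate, and the function names are hypothetical.

```python
import math

# Toy vectors standing in for pretrained w2v / GloVe vectors (values are made up).
word_vectors = {
    "money":     [0.90, 0.10, 0.00],
    "cash":      [0.85, 0.15, 0.05],
    "funds":     [0.80, 0.20, 0.10],
    "school":    [0.10, 0.90, 0.00],
    "education": [0.15, 0.85, 0.10],
    "food":      [0.00, 0.10, 0.90],
    "meals":     [0.05, 0.15, 0.85],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbours(word, vectors, top_n=2):
    """Exact nearest neighbours by cosine similarity, excluding the word itself."""
    if word not in vectors:
        return []  # out-of-vocabulary words are left unexpanded
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:top_n]]

def augment_query(query, vectors, top_n=2):
    """Append each token's nearest neighbours to the end of the query."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for token in tokens:
        for neighbour in nearest_neighbours(token, vectors, top_n):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return " ".join(expanded)

augmented = augment_query("need money for school", word_vectors, top_n=1)
print(augmented)
```

With a real vocabulary, the exact scan in `nearest_neighbours` would likely be swapped for an approximate nearest-neighbour index, which is what the "(approximate)" in the notes hints at.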

In [1]:
import numpy as np
import torch
import pandas as pd
from transformers import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')

In [2]:
# loading data
df_schemes = pd.read_csv('../schemes.csv', encoding='mac_roman')
df_schemes.head()

| Unnamed: 0 | Scheme | Agency | Description | Link | Image | Tag 1 | Tag 2 | Tag 3 |
|---|---|---|---|---|---|---|---|---|
| 0 | Caregivers Support Centre | Caregivers Alliance | Provides support to caregivers of persons with... | https://www.cal.org.sg/caregiver-training | https://chidnast.sirv.com/SchemesSG/CAL.jpg | Caregiver | | |
| 1 | Caregiver Training Program | Caregivers Alliance | Provides training to caregivers of persons wit... | https://www.cal.org.sg/caregiver-training | https://chidnast.sirv.com/SchemesSG/CAL.jpg | Caregiver | | |
| 2 | Food Assistance | A Packet of Rice | A self setup group which distributes meal box ... | https://www.facebook.com/APacketOfRice/ | https://chidnast.sirv.com/SchemesSG/apacketofr... | Low Income | Food | |
| 3 | Family LifeAid | Red Cross Singapore | Identified households receive food vouchers ev... | https://www.redcross.sg/get-assistance/family-... | https://chidnast.sirv.com/SchemesSG/redcross.jpg | Low Income | Food | Education |
| 4 | Financial Assistance | 365 Cancer Prevention Society (365CPS) | Cancer treatment can place a heavy financial b... | https://365cps.org.sg/portfolio/financial-supp... | https://chidnast.sirv.com/SchemesSG/365cps.jpg | Low Income | Healthcare | |


# Trying transformers from huggingface

Link: https://huggingface.co/transformers/notebooks.html

In [3]:
nlp_features = pipeline('feature-extraction')
output = nlp_features(df_schemes['Description'].tolist())
np.array(output).shape   # (Samples, Tokens, Vector Size)

(113, 135, 768)

In [4]:
np.array(output)  # hmm, these are token-level embeddings, not a sentence-level summary

array([[[ 0.38643956, -0.10631615, -0.15597686, ..., -0.21457441,
          0.22321901,  0.00472031],
        [ 0.18545376, -0.37182105, -0.14456783, ...,  0.57555652,
          0.29364479,  0.04755285],
        [ 0.07240835, -0.0616543 , -0.24307226, ...,  0.26000339,
         -0.03196095,  0.02543055],
        ...,
        [ 0.18805318, -0.19116542, -0.0366517 , ...,  0.09192437,
          0.132121  ,  0.28939039],
        [ 0.23229843, -0.1421183 , -0.13016975, ...,  0.05070525,
          0.23810981,  0.46606466],
        [ 0.23064086, -0.29060408, -0.06681173, ...,  0.05968447,
          0.1519468 ,  0.3502115 ]],

       [[ 0.34675306, -0.10325469, -0.11605997, ..., -0.19293618,
          0.20233874,  0.08297873],
        [ 0.2038223 , -0.35885975, -0.12805758, ...,  0.60555607,
          0.22940989,  0.09131481],
        [-0.12431835, -0.06780472, -0.15613925, ...,  0.37775403,
         -0.18495724,  0.20772569],
        ...,
        [ 0.19514582, -0.15275784, -0.04281472, ...,  (output truncated)
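
Since the pipeline output above is token-level, one common way to get a single vector per description is to average the token vectors while ignoring padding. Below is a minimal sketch of that mean pooling on toy arrays. It assumes an attention mask is available (the pipeline output above does not include one; calling the tokenizer directly with padding would be one way to get it), and `mean_pool` is a hypothetical helper name.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sample, ignoring padded positions.

    token_embeddings: (samples, tokens, hidden)
    attention_mask:   (samples, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

# Toy stand-in for the (113, 135, 768) array above: 2 samples, 3 tokens, 4 dims.
toy_tokens = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
toy_mask = np.array([[1, 1, 0],   # third token of sample 0 is padding
                     [1, 1, 1]])
sentence_vectors = mean_pool(toy_tokens, toy_mask)
print(sentence_vectors.shape)
```

Plain averaging over all 135 token positions without a mask would let padding vectors dilute short descriptions, which is why the mask is included here.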

# Trying sentence_transformers instead

Link: https://www.sbert.net/docs/usage/semantic_textual_similarity.html

In [5]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-distilroberta-base-v2')

In [6]:
embeddings = model.encode(df_schemes['Description'].tolist(), convert_to_tensor=True)

# Compute pairwise cosine similarities between all scheme descriptions
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

In [7]:
cosine_scores

tensor([[1.0000, 0.8751, 0.2815,  ..., 0.1324, 0.1757, 0.4162],
        [0.8751, 1.0000, 0.2265,  ..., 0.1660, 0.1694, 0.4121],
        [0.2815, 0.2265, 1.0000,  ..., 0.1074, 0.1156, 0.3095],
        ...,
        [0.1324, 0.1660, 0.1074,  ..., 1.0000, 0.2008, 0.2508],
        [0.1757, 0.1694, 0.1156,  ..., 0.2008, 1.0000, 0.3799],
        [0.4162, 0.4121, 0.3095,  ..., 0.2508, 0.3799, 1.0000]])
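
For intuition about the matrix above: cosine similarity is the dot product of L2-normalised vectors, which is why the diagonal is 1.0 and the matrix is symmetric. A small NumPy sketch on made-up 2-d vectors (not the library's implementation, just the same formula; the function name is hypothetical):

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Three toy "embeddings"; the real ones above have 768 dimensions.
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
scores = cosine_similarity_matrix(toy, toy)
print(np.round(scores, 4))
```

Because the vectors are normalised first, vector length does not matter: `[0.0, 2.0]` scores the same against the others as `[0.0, 1.0]` would.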

In [8]:
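def return_query(sentence, embeddings, df):
    """Rank the rows of df by cosine similarity between sentence and the precomputed embeddings."""
    # relies on the global `model` loaded above
    query_emb = model.encode(sentence, convert_to_tensor=True)
    cos_sim = util.pytorch_cos_sim(query_emb, embeddings)
    # move to CPU before converting, in case the embeddings live on a GPU
    scores = cos_sim.cpu().numpy().flatten()
    cos_sim_series = pd.Series(scores, index=df.index, name='cos_sim')
    return df.join(cos_sim_series).sort_values('cos_sim', ascending=False)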

In [9]:
test_sent = """
my client needs help as their family has lost a breadwinner and they need money to help with daily necessities during COVID. The child is also unable to afford schooling supplies such as computer"""

In [10]:
return_query(test_sent, embeddings, df_schemes).head(5)

| Unnamed: 0 | Scheme | Agency | Description | Link | Image | Tag 1 | Tag 2 | Tag 3 | cos_sim |
|---|---|---|---|---|---|---|---|---|---|
| 81 | ComCare Short-To-Medium-Term Assistance | Ministry of Social and Family Development (MSF) | Short to medium term assistance for those unab... | https://www.msf.gov.sg/Comcare/Pages/Short-to-... | https://chidnast.sirv.com/SchemesSG/msf.jpg | Family | Low Income | | 0.583423 |
| 108 | The Straits Times School Pocket Money Fund | The Straits Times School Pocket Money Fund (SPMF) | To alleviate the financial burden faced by par... | https://www.spmf.org.sg/primary-secondary-stud... | https://chidnast.sirv.com/SchemesSG/stspmf.jpg | Low Income | Education | | 0.554491 |
| 112 | YMCA FACES | Young Men's Christian Association (YMCA) | Short term financial assistance, bridging fund... | https://www.ymca.org.sg/community-services/fin... | https://chidnast.sirv.com/SchemesSG/ymca.jpg | Low Income | Special Needs | Family | 0.548327 |
| 17 | COVID-19 - Family Assistance Fund | Beyond Social Services | Covid 19 Family Assistance Fund Fund will go t... | https://www.beyond.org.sg/faf-faq/ | https://chidnast.sirv.com/SchemesSG/beyond.jpg | Low Income | COVID-19 | Food | 0.532644 |
| 105 | COVID 19 related support | Support Go Where | Key government resource on COVID-19 related as... | https://www.supportgowhere.gov.sg/schemes/?lan... | https://chidnast.sirv.com/SchemesSG/supportgow... | COVID-19 | | | 0.509703 |


# 3. Latency Issues

TBD: need to explore faster models (e.g. DistilBERT).
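
Comparing candidate models would need some way to measure per-query encoding time. A possible timing harness is sketched below; `fake_encode` is a hypothetical stand-in so the snippet runs on its own, and in this notebook it would presumably be replaced by something like `model.encode`.

```python
import statistics
import time

def time_encoder(encode, queries, repeats=5):
    """Return (median, worst) seconds per call of encode(query)."""
    timings = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            encode(query)
            timings.append(time.perf_counter() - start)
    return statistics.median(timings), max(timings)

# Hypothetical stand-in encoder, only here to make the sketch self-contained.
def fake_encode(text):
    return [float(len(token)) for token in text.split()]

median_s, worst_s = time_encoder(fake_encode, ["financial help", "food for elderly"])
print(f"median {median_s * 1e3:.3f} ms, worst {worst_s * 1e3:.3f} ms")
```

Since the scheme embeddings are computed once up front, the per-request cost is mainly the query encoding plus one similarity computation, so timing the encoder alone should already be informative.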

# 4. Data Augmentation

TBD: need to scrape data.