# Other Embedding approaches

- First written: 2021-05-10
- Last edited: 2021-06-18

# 1. Current Challenges

Currently there are 2 issues:

1. No good means of quantitatively evaluating model performance.
    - Solution 1: Human eval. Get a group of volunteers and a set of (query, link) pairs, and ask the volunteers to rate each pair 1-5 for relevance (industry practice).
    - Solution 2: Logging. Put the system in the wild and use clicks as a proxy for relevant results, and impressions without clicks as a proxy for irrelevant ones.
        - The problem is that logging is a whole separate piece of work. Not a lightweight solution at all.

2. Lack of data
    - LSI inferences are made from a small dataset of only 100+ words.
    - We could possibly increase the amount of data collected.
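
For Solution 1 (human eval), one way to collapse the volunteers' 1-5 ratings into a single number per query is nDCG@k. The sketch below uses a linear-gain variant; the function names and the sample ratings are made up for illustration and do not come from any real evaluation run.

```python
import math

def dcg_at_k(ratings, k):
    """Discounted cumulative gain over the first k ratings, in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def ndcg_at_k(ratings, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same ratings."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0

# Hypothetical volunteer ratings (1-5) for the top 5 links returned for one query,
# listed in the order the system ranked them.
ratings_for_query = [5, 3, 4, 1, 2]
score = ndcg_at_k(ratings_for_query, k=5)
print(round(score, 3))
```

Averaging this score over a fixed set of test queries would give one number to compare LSI against any alternative model.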

# 2. Algorithm Changes

LSI has its limitations: it works well when comparing two documents of similar length, but not when one side (e.g. a short query) is sparse on words or intent. A couple of ideas for handling this:

1. Query augmentation
    - Well practised in [industry today](https://bytes.grubhub.com/search-query-embeddings-using-query2vec-f5931df27d79), though not necessarily the best approach.
    - Expand the query by augmenting it with synonyms.
    - Basic form: some kind of w2v "(approximate) nearest neighbour" retrieval, then concatenate the neighbours onto the query.

2. Better-quality embedding representation
    - LSI is good at topic finding and analysis, not so much at retrieval.
    - What if we represented the text using pretrained embeddings instead?
        - Basic ones:
            - w2v
            - GloVe
        - More advanced:
            - BERT-like transformers
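
The basic form of query augmentation described above (nearest-neighbour lookup, then concat) can be sketched as follows. The tiny `toy_vectors` dictionary stands in for a real pretrained w2v/GloVe model and its values are invented; `nearest_words` and `expand_query` are hypothetical helper names, and the lookup is exact rather than approximate.

```python
import numpy as np

# Toy word vectors standing in for a pretrained w2v/GloVe model (made-up values).
toy_vectors = {
    "food":    np.array([0.90, 0.10, 0.00]),
    "meal":    np.array([0.85, 0.20, 0.05]),
    "hungry":  np.array([0.80, 0.30, 0.10]),
    "housing": np.array([0.00, 0.10, 0.95]),
    "rent":    np.array([0.05, 0.20, 0.90]),
}

def nearest_words(word, vectors, top_n=2):
    """Return the top_n words closest to `word` by cosine similarity."""
    target = vectors[word]
    scores = {}
    for other, vec in vectors.items():
        if other == word:
            continue
        scores[other] = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def expand_query(query, vectors, top_n=2):
    """Append the nearest neighbours of each known query word to the query."""
    words = query.lower().split()
    extra = []
    for w in words:
        if w in vectors:
            for neighbour in nearest_words(w, vectors, top_n):
                if neighbour not in words and neighbour not in extra:
                    extra.append(neighbour)
    return " ".join(words + extra)

expanded = expand_query("food help", toy_vectors)
print(expanded)
```

Words missing from the vocabulary (like "help" here) are simply passed through unchanged.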

In [1]:
import numpy as np
import torch
import pandas as pd
from transformers import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')

In [2]:
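# loading data
# NOTE: the 'ñ' that shows up in some titles further down (e.g. "Ministry of Education ñ ...")
# suggests the file may really be cp1252, where that byte is an en dash; not verified here.
df_schemes = pd.read_csv('../schemes2.csv', encoding='mac_roman')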
df_schemes.head()

Unnamed: 0,Title,Agency,Description,Who's it for,What it gives,Link,Background Image Link,Scheme Type,search_booster(WL)
0,Caregiver Training Program,Caregivers Alliance,Provides training to caregivers of persons wit...,Caregivers,Educational programmes for caregivers,https://www.cal.org.sg/caregiver-training,https://chidnast.sirv.com/SchemesSG/CAL.jpg,"Caregiver,Mental Health",Caregivers feeling overwhelmed. Burnout. Elder...
1,Caregivers Support Centre,Caregivers Alliance,Provides support to caregivers of persons with...,Caregivers,Emotional care,https://www.cal.org.sg/caregivers-support-centre,https://chidnast.sirv.com/SchemesSG/CAL.jpg,"Caregiver,Mental Health",Caregivers feeling overwhelmed. Burnout. Elder...
2,Family LifeAid,Red Cross Singapore,Identified households receive food vouchers ev...,"Low income,Need food support","Financial assistance,Food,Educational programmes",https://www.redcross.sg/get-assistance/family-...,https://chidnast.sirv.com/SchemesSG/redcross.jpg,"Low Income,Food,Education","needs help to get food, meal, hungry, have not..."
3,Food Assistance,A Packet of Rice,A self setup group which distributes meal box ...,"Low income,Need food support",Food,https://www.facebook.com/APacketOfRice/,https://chidnast.sirv.com/SchemesSG/apacketofr...,"Low Income,Food","Needs help to get food, meal, hungry, have not..."
4,Assistance,Filos Community Services,"Bursaries, bread distribution and food rations...","Low income,Need food support","Food,Counselling",https://www.filos.sg/services-assistance,https://chidnast.sirv.com/SchemesSG/filos.jpg,"Low Income,Food","Need to buy daily necessities, food, have not ..."


# Trying transformers from huggingface

https://huggingface.co/transformers/notebooks.html

In [3]:
nlp_features = pipeline('feature-extraction')
output = nlp_features(df_schemes['Description'].tolist())
np.array(output).shape   # (Samples, Tokens, Vector Size)

(160, 150, 768)

In [4]:
np.array(output) # hmm, this gives one vector per token, not one per description (sentence-level)

array([[[ 3.46753061e-01, -1.03254691e-01, -1.16059974e-01, ...,
         -1.92936182e-01,  2.02338740e-01,  8.29787254e-02],
        [ 2.03822300e-01, -3.58859748e-01, -1.28057584e-01, ...,
          6.05556071e-01,  2.29409888e-01,  9.13148075e-02],
        [-1.24318346e-01, -6.78047240e-02, -1.56139255e-01, ...,
          3.77754033e-01, -1.84957236e-01,  2.07725689e-01],
        ...,
        [ 1.90674633e-01, -8.14108551e-02, -3.30491364e-01, ...,
          3.47913265e-01,  1.09129354e-01, -2.18179360e-01],
        [ 3.73719096e-01, -5.64273335e-02, -2.42803887e-01, ...,
          2.95044959e-01, -2.26940028e-03, -3.64589065e-01],
        [ 2.22623460e-02,  5.80160730e-02, -1.93460673e-01, ...,
         -2.21969038e-01, -2.56752253e-01, -2.11848006e-01]],

       [[ 3.86439562e-01, -1.06316149e-01, -1.55976862e-01, ...,
         -2.14574412e-01,  2.23219007e-01,  4.72031254e-03],
        [ 1.85453758e-01, -3.71821046e-01, -1.44567832e-01, ...,
          5.75556517e-01,  2.93644786e
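
One common way to turn token-level vectors like these into a single vector per description is mean pooling over the real (non-padding) tokens. The sketch below assumes an `attention_mask` is available from the tokenizer, which the pipeline output above does not expose; the input numbers are made up.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sample, ignoring padded positions.

    token_embeddings: (samples, tokens, dim)
    attention_mask:   (samples, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Made-up example: 2 samples, 3 token slots, 4-dimensional vectors.
tokens = np.arange(24, dtype=float).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],   # last slot of sample 0 is padding
                 [1, 1, 1]])
sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors.shape)
```

The next section sidesteps this step by using a library that returns sentence-level vectors directly.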

# Trying sentence_transformers instead

Link: https://www.sbert.net/docs/usage/semantic_textual_similarity.html

In [5]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-distilroberta-base-v2')

In [6]:
embeddings = model.encode(df_schemes['Description'].tolist(), convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

In [7]:
cosine_scores

tensor([[1.0000, 0.8751, 0.2848,  ..., 0.2585, 0.2083, 0.2076],
        [0.8751, 1.0000, 0.3767,  ..., 0.3165, 0.2771, 0.2854],
        [0.2848, 0.3767, 1.0000,  ..., 0.1353, 0.4993, 0.3467],
        ...,
        [0.2585, 0.3165, 0.1353,  ..., 1.0000, 0.1863, 0.1581],
        [0.2083, 0.2771, 0.4993,  ..., 0.1863, 1.0000, 0.3350],
        [0.2076, 0.2854, 0.3467,  ..., 0.1581, 0.3350, 1.0000]])
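
For intuition about the matrix above: cosine similarity is the dot product of L2-normalised vectors, which is why the diagonal is 1.0 and the matrix is symmetric. A minimal numpy sketch of the same computation on made-up vectors (`cosine_similarity_matrix` is an illustrative name, not a library function):

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Three made-up 2-d vectors.
vectors = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 2.0]])
scores = cosine_similarity_matrix(vectors, vectors)
print(np.round(scores, 4))
```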

In [8]:
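def return_query(sentence, embeddings, df):
    """Rank the rows of df by cosine similarity between sentence and each row's embedding."""
    query_emb = model.encode(sentence, convert_to_tensor=True)
    cos_sim = util.pytorch_cos_sim(query_emb, embeddings)  # shape: (1, len(df))
    # .cpu() keeps this working if the tensors live on a GPU
    cos_sim_series = pd.Series(cos_sim.cpu().numpy().flatten(), name='cos_sim')
    # the join aligns on index, so df must keep its default 0..n-1 index
    return df.join(cos_sim_series).sort_values('cos_sim', ascending=False)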

In [9]:
test_sent = """
my client needs help as their family has lost a breadwinner and they need money to help with daily necessities during COVID. The child is also unable to afford schooling supplies such as computer"""

In [10]:
return_query(test_sent, embeddings, df_schemes).head(5)

Unnamed: 0,Title,Agency,Description,Who's it for,What it gives,Link,Background Image Link,Scheme Type,search_booster(WL),cos_sim
158,COVID-19 Assistance,NuLife Care & Counselling Services,Information on various financial assistance sc...,"Elderly,Families,Low income","Referral,Food",https://nulife.com.sg/covid-our-efforts/,https://chidnast.sirv.com/SchemesSG/nulife.jpg,"COVID-19,Food","COVID, debt, pandemic, poor, hungry, nothing t...",0.584345
18,ComCare Short-To-Medium-Term Assistance,Ministry of Social and Family Development (MSF),Short to medium term assistance for those unab...,Low income,Financial assistance,https://www.msf.gov.sg/Comcare/Pages/Short-to-...,https://chidnast.sirv.com/SchemesSG/msf.jpg,"Family,Low Income","Need money for daily needs, low income, poor",0.583423
102,The Straits Times School Pocket Money Fund,The Straits Times School Pocket Money Fund (SPMF),To alleviate the financial burden faced by par...,Children from low income families,Financial assistance for education,https://www.spmf.org.sg/primary-secondary-stud...,https://chidnast.sirv.com/SchemesSG/stspmf.jpg,"Low Income,Education","Children, disabled, schoolchildren, need compu...",0.554491
111,YMCA FACES,Young Men's Christian Association (YMCA),"Short term financial assistance, bridging fund...","Low income,Low income families,Special needs","Financial assistance for daily needs,Employmen...",https://www.ymca.org.sg/community-services/fin...,https://chidnast.sirv.com/SchemesSG/ymca.jpg,"Low Income,Special Needs,Family","Financial assistance, money for daily expenses...",0.548327
25,COVID-19 - Family Assistance Fund,Beyond Social Services,Covid 19 Family Assistance Fund Fund will go t...,Low income,"Financial assistance,COVID-19 support",https://www.beyond.org.sg/faf-faq/,https://chidnast.sirv.com/SchemesSG/beyond.jpg,"Low Income,COVID-19,Food","Need money for daily needs, low income, poor, ...",0.532644


# 2.1 Does it work on shorter queries?

In [11]:
from collections import Counter
Counter([x for y in df_schemes['Who\'s it for'].apply(lambda x: x.split(',')).tolist() for x in y])

Counter({'Caregivers': 17,
         'Low income': 30,
         'Need food support': 13,
         'Need education assistance': 1,
         'Kidney patients': 4,
         'PWDs': 9,
         'New parents': 1,
         'Cancer patients': 5,
         'HIV patients': 5,
         'Retrenched': 4,
         'In need of employment': 1,
         'Facing financial hardship': 2,
         'Chinese community': 1,
         'Children': 5,
         'Unable to work': 1,
         'Households with children attending student care': 1,
         'Families': 8,
         'Elderly': 20,
         'Families who have just lost a breadwinner': 2,
         'Ex-offenders': 2,
         'Low income families': 24,
         'Children from low income families': 12,
         'In need of support from COVID-19': 1,
         'In need of mortgage assistance': 6,
         'Low income elderly': 1,
         'Special needs children': 5,
         'Youth-at-risk': 5,
         'In debt': 4,
         'Dyslexic children': 1,
         '

In [12]:
return_query("single parent", embeddings, df_schemes).head(5)  # results are a bit sketchy, but a reasonable best attempt

Unnamed: 0,Title,Agency,Description,Who's it for,What it gives,Link,Background Image Link,Scheme Type,search_booster(WL),cos_sim
64,Home Ownership Plus Education (HOPE) Scheme,Ministry of Social and Family Development (MSF),"Assistance for young, low-income parent(s) who...",Low income families,"Financial assistance for daily expenses,Financ...",https://www.msf.gov.sg/assistance/Pages/Home-O...,https://chidnast.sirv.com/SchemesSG/msf.jpg,"Family,Low Income,Education,Housing","Family planning, low income, no money for food...",0.473492
151,Various services,Big Love Child Protection Specialist Centre,Casework management; child protection; home-ba...,Parents,"Casework,Child protection,Educational programm...",https://www.biglove.org.sg/,https://chidnast.sirv.com/SchemesSG/biglove.jpg,"Children,Family","child protection, abuse, family violence, chil...",0.38351
153,Family services,NuLife Care & Counselling Services,Casework management and counselling including ...,Families,"Casework,Counselling,Referral",https://nulife.com.sg/counselling/,https://chidnast.sirv.com/SchemesSG/nulife.jpg,Family,"casework, counselling, family, individual, soc...",0.320471
57,Fresh Start Housing Scheme,Housing and Development Board (HDB),The Fresh Start Housing Scheme (Fresh Start) a...,Low income families,"Financial assistance,Housing assistance",https://www.hdb.gov.sg/cs/infoweb/residential/...,https://chidnast.sirv.com/SchemesSG/hdb.jpg,Housing,"Need money to tide over difficulties, money fo...",0.314973
80,Ministry of Education ñ Financial Assistance S...,Ministry of Education (MOE),Help needy Singaporean families reduce their b...,Children from low income families,Financial assistance for education,https://beta.moe.gov.sg/fees-assistance-awards...,https://chidnast.sirv.com/SchemesSG/moe.jpg,"Low Income,Education","School fees, students, child, children, study,...",0.307593


# 3. Latency issues

TBD. For now there doesn't seem to be a strong need to explore faster models (e.g. distilbert).

In [13]:
%%timeit  # at this point anything below 100ms latency is okay. This works primarily because df_schemes is small
return_query(test_sent, embeddings, df_schemes).head(5)

56.7 ms ± 439 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
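
If the scheme list grows a lot, sorting the whole DataFrame on every query may start to matter. One option, sketched here on made-up scores, is to select only the top k indices first and look up those rows afterwards; `top_k_indices` is a hypothetical helper, and whether this helps in practice would need measuring.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    k = min(k, len(scores))
    if k == 0:
        return np.array([], dtype=int)
    # argpartition finds the k largest without a full sort; only those k are then sorted.
    candidates = np.argpartition(scores, -k)[-k:]
    return candidates[np.argsort(scores[candidates])[::-1]]

# Made-up similarity scores for five schemes.
scores = np.array([0.21, 0.58, 0.05, 0.55, 0.47])
best = top_k_indices(scores, k=3)
print(best)
```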


# 4. Data Augmentation

TBD, need to scrape data. First pass: scrape the text behind each scheme's link and append it to the text we embed.

In [14]:
import requests
from bs4 import BeautifulSoup

In [15]:
# sample 1 case
url = df_schemes.Link.iloc[1]

In [16]:
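def scrape_text(url):
    """Return the visible text of the page at url, or '' if it can't be fetched."""
    try:
        r = requests.get(url, timeout=10)  # timeout so one dead site can't hang the whole run
        if r.status_code == 200:
            soup = BeautifulSoup(r.content, 'html.parser')
            text = soup.get_text() or ''
            return text.strip().replace('\n', '').replace('\t', '').replace('|', '')
    except Exception as e:
        print(e)  # this isn't great for productionalizing but i'm just here to scrape and clean text
    return ""  # also covers non-200 responses, which previously returned None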
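
The scraped text tends to come out with words fused together (e.g. "FundAbout UsOrganis..." in the results further down), possibly because newlines are deleted rather than replaced with spaces, or because adjacent elements had no whitespace between them. A gentler cleaning step is sketched below using only the standard library; `clean_scraped_text` is a hypothetical helper and the sample string is made up. On the BeautifulSoup side, `soup.get_text(separator=' ', strip=True)` is another option worth trying.

```python
import re

def clean_scraped_text(text):
    """Collapse runs of whitespace to single spaces and drop '|' separators."""
    text = text.replace("|", " ")
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample, loosely modelled on the kind of text a page returns.
raw = "COVID-19 Family Assistance Fund\nAbout Us\n\t| Organisation "
cleaned = clean_scraped_text(raw)
print(cleaned)
```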

In [17]:
# don't do this at home
df_schemes['scraped_text'] = df_schemes['Link'].apply(scrape_text)

HTTPSConnectionPool(host='www.filos.sg', port=443): Max retries exceeded with url: /services-assistance (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTPSConnectionPool(host='www.ccf.org.sg', port=443): Max retries exceeded with url: /programmes-and-services/core-services/financial-assistance/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')))
HTTPSConnectionPool(host='beta.moe.gov.sg', port=443): Max retries exceeded with url: /fees-assistance-awards-scholarships/financial-assistance/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1076)')))
Invalid URL 'www.boystown.org.sg': No schema supplied. Perhaps you meant http://www.boystown.org.sg?
Invalid URL 'www.acmi.org.sg': No schema supplied. Perhaps you meant http://www.acmi.org.sg?
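
The two "No schema supplied" errors above come from links stored without `http://` or `https://`. A small normalisation step before scraping could avoid them. This is a sketch: `ensure_scheme` is a hypothetical helper, and defaulting to `https` is an assumption that may not hold for every site.

```python
from urllib.parse import urlparse

def ensure_scheme(url, default_scheme="https"):
    """Prepend a scheme to bare host names such as 'www.example.org'."""
    url = url.strip()
    if not urlparse(url).scheme:
        return f"{default_scheme}://{url}"
    return url

fixed = ensure_scheme("www.boystown.org.sg")
print(fixed)
```

It could be applied with `df_schemes['Link'].apply(ensure_scheme)` before calling `scrape_text`.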


In [18]:
columns_of_interest = ['Description', 'What it gives', 'search_booster(WL)', 'scraped_text']
for c in columns_of_interest:
    # fill missing values first so they don't end up as the literal strings 'nan' / 'None'
    df_schemes[c] = df_schemes[c].fillna('').astype(str)

In [19]:
df_schemes['conjoined_text'] = df_schemes[['Description', 'What it gives', 'search_booster(WL)', 'scraped_text']].agg(' '.join, axis=1)

In [20]:
embeddings_with_more_info = model.encode(df_schemes['conjoined_text'].tolist(), convert_to_tensor=True)

#Compute cosine-similarities
cosine_scores_more_info = util.pytorch_cos_sim(embeddings_with_more_info, embeddings_with_more_info)
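
One caveat: sentence-transformers models truncate input beyond `model.max_seq_length` tokens, so much of the scraped text at the end of `conjoined_text` may never reach the model. One possible workaround, not what this notebook does, is to embed the text in chunks and average the chunk vectors. In the sketch below, `chunk_words`, `embed_long_text` and `fake_encode` are hypothetical names, and `fake_encode` is only a stand-in for `model.encode`.

```python
import numpy as np

def chunk_words(text, chunk_size=100):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed_long_text(text, encode_fn, chunk_size=100):
    """Embed each chunk with encode_fn and average the resulting vectors."""
    chunks = chunk_words(text, chunk_size) or [""]  # still embed something for empty text
    return np.mean([encode_fn(c) for c in chunks], axis=0)

def fake_encode(chunk):
    """Stand-in for model.encode: returns a made-up 2-d vector."""
    return np.array([float(len(chunk.split())), 1.0])

vector = embed_long_text("a b c d e", fake_encode, chunk_size=2)
print(vector)
```

Word counts are only a rough proxy for token counts, so the chunk size would need tuning against the model's actual limit.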

In [21]:
return_query(test_sent, embeddings_with_more_info, df_schemes).head(5)

Unnamed: 0,Title,Agency,Description,Who's it for,What it gives,Link,Background Image Link,Scheme Type,search_booster(WL),scraped_text,conjoined_text,cos_sim
25,COVID-19 - Family Assistance Fund,Beyond Social Services,Covid 19 Family Assistance Fund Fund will go t...,Low income,"Financial assistance,COVID-19 support",https://www.beyond.org.sg/faf-faq/,https://chidnast.sirv.com/SchemesSG/beyond.jpg,"Low Income,COVID-19,Food","Need money for daily needs, low income, poor, ...",COVID-19 Family Assistance FundAbout UsOrganis...,Covid 19 Family Assistance Fund Fund will go t...,0.604705
158,COVID-19 Assistance,NuLife Care & Counselling Services,Information on various financial assistance sc...,"Elderly,Families,Low income","Referral,Food",https://nulife.com.sg/covid-our-efforts/,https://chidnast.sirv.com/SchemesSG/nulife.jpg,"COVID-19,Food","COVID, debt, pandemic, poor, hungry, nothing t...",Covid-19 assistance - Our efforts NuLife ...,Information on various financial assistance sc...,0.579191
102,The Straits Times School Pocket Money Fund,The Straits Times School Pocket Money Fund (SPMF),To alleviate the financial burden faced by par...,Children from low income families,Financial assistance for education,https://www.spmf.org.sg/primary-secondary-stud...,https://chidnast.sirv.com/SchemesSG/stspmf.jpg,"Low Income,Education","Children, disabled, schoolchildren, need compu...",SPMF Primary & Secondary Students Toggle navi...,To alleviate the financial burden faced by par...,0.567199
2,Family LifeAid,Red Cross Singapore,Identified households receive food vouchers ev...,"Low income,Need food support","Financial assistance,Food,Educational programmes",https://www.redcross.sg/get-assistance/family-...,https://chidnast.sirv.com/SchemesSG/redcross.jpg,"Low Income,Food,Education","needs help to get food, meal, hungry, have not...",Family LifeAidAbout usHeritage & Milestones70t...,Identified households receive food vouchers ev...,0.521717
51,Financial Asssistance,Children's Cancer Foundation,CCF Financial Assistance (FA) scheme aims to h...,"Caregivers,Cancer patients",Financial assistance for cancer patients,https://www.ccf.org.sg/programmes-and-services...,https://chidnast.sirv.com/SchemesSG/ccf.jpg,"Low Income,Healthcare",take care of child with cancer,,CCF Financial Assistance (FA) scheme aims to h...,0.507862


In [22]:
return_query("single parent", embeddings_with_more_info, df_schemes).head(5)  # got a hit in number 3-4: surfacing pregnant teens!!!

Unnamed: 0,Title,Agency,Description,Who's it for,What it gives,Link,Background Image Link,Scheme Type,search_booster(WL),scraped_text,conjoined_text,cos_sim
64,Home Ownership Plus Education (HOPE) Scheme,Ministry of Social and Family Development (MSF),"Assistance for young, low-income parent(s) who...",Low income families,"Financial assistance for daily expenses,Financ...",https://www.msf.gov.sg/assistance/Pages/Home-O...,https://chidnast.sirv.com/SchemesSG/msf.jpg,"Family,Low Income,Education,Housing","Family planning, low income, no money for food...",Home Ownership Plus Education (HOPE) Scheme M...,"Assistance for young, low-income parent(s) who...",0.4312
151,Various services,Big Love Child Protection Specialist Centre,Casework management; child protection; home-ba...,Parents,"Casework,Child protection,Educational programm...",https://www.biglove.org.sg/,https://chidnast.sirv.com/SchemesSG/biglove.jpg,"Children,Family","child protection, abuse, family violence, chil...",Big Love Child Protection Specialist Centre – ...,Casework management; child protection; home-ba...,0.301316
115,Pregnancy Crisis and Support,Pregnancy Crisis and Support,"For emotional support, guidance, help and refe...","Teenagers facing pregnancy,Pregnant individual...",Counselling for pregnancy crises,https://www.pregnancycrisis.sg/Home,https://chidnast.sirv.com/SchemesSG/pcs.jpg,Youth-at-Risk,"Teenage Pregnancy, Pregnant, Young Mother, Bab...",,"For emotional support, guidance, help and refe...",0.291026
114,Babes - A Helping Hand for Pregnant Teens,Babes - A Helping Hand for Pregnant Teens,Staff will discuss the various options availab...,"Teenagers facing pregnancy,Youth-at-risk",Counselling for pregnant teens,https://www.babes.org.sg/we-are-here-for-you/o...,https://chidnast.sirv.com/SchemesSG/babes.jpg,Youth-at-Risk,"Teenage Pregnancy, Pregnant, Young Mother, Bab...",Our Services Babes Pregnancy – Teenage Pregn...,Staff will discuss the various options availab...,0.289167
80,Ministry of Education ñ Financial Assistance S...,Ministry of Education (MOE),Help needy Singaporean families reduce their b...,Children from low income families,Financial assistance for education,https://beta.moe.gov.sg/fees-assistance-awards...,https://chidnast.sirv.com/SchemesSG/moe.jpg,"Low Income,Education","School fees, students, child, children, study,...",,Help needy Singaporean families reduce their b...,0.277014


# 5. Evaluation

At this point, with only 100+ schemes to choose from and no tracking information, it's hard to do any kind of evaluation unless we log data and treat clicks and non-clicks as implicit yeses and nos.
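
If click logging does get added, one simple metric to compute from it is mean reciprocal rank (MRR): how high up the clicked scheme sat in the returned list, averaged over searches. The log format and the entries below are entirely hypothetical.

```python
def mean_reciprocal_rank(sessions):
    """sessions: list of (ranked_ids, clicked_id), with clicked_id None if nothing was clicked."""
    total = 0.0
    for ranked_ids, clicked_id in sessions:
        if clicked_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(clicked_id) + 1)
    return total / len(sessions) if sessions else 0.0

# Hypothetical log: (scheme ids in the order shown, id that was clicked)
hypothetical_log = [
    ([64, 151, 153], 64),    # clicked the first result
    ([25, 158, 102], 102),   # clicked the third result
    ([18, 111, 2], None),    # impression with no click
]
mrr = mean_reciprocal_rank(hypothetical_log)
print(round(mrr, 3))
```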

# 6. Productionalization

In [30]:
torch.save(embeddings_with_more_info, '../models/transformer/embeddings.pt')

In [31]:
df_schemes = df_schemes.rename(columns={'Background Image Link': 'Image', 'Title': 'Scheme'})

In [32]:
def return_query(sentence, embeddings, df, limit=50, rel_cap=0.2):
    """Return up to `limit` schemes with relevance >= rel_cap, best first, as a JSON string."""
    query_emb = model.encode(sentence, convert_to_tensor=True)
    cos_sim = util.pytorch_cos_sim(query_emb, embeddings)
    cos_sim_series = pd.Series(cos_sim.cpu().numpy().flatten(), name='Relevance')
    # inner join drops rows below the threshold instead of keeping them with NaN relevance
    ranked = df.join(cos_sim_series[cos_sim_series >= rel_cap], how='inner').sort_values('Relevance', ascending=False)
    return ranked.head(limit)[['Relevance','Scheme','Description', 'Agency', 'Image', 'Link']].to_json(orient="records")

In [33]:
df_schemes.to_csv('../models/transformer/source_data.csv', index=False)  # index=False avoids yet another 'Unnamed: 0' column on reload

In [34]:
# test reloading
df_reload = pd.read_csv('../models/transformer/source_data.csv')
emb = torch.load('../models/transformer/embeddings.pt')

In [35]:
rtn = return_query("single parent", emb, df_reload)

In [38]:
import json
json.loads(rtn)

[{'Relevance': 0.4312001467,
  'Scheme': 'Home Ownership Plus Education (HOPE) Scheme',
  'Description': 'Assistance for young, low-income parent(s) who choose to keep their family small so that they can focus their resources on giving their children a head start, and improve their financial and social situation. Includes grants and support for training, employment, utilities, housing, bursaries, mentoring and family support',
  'Agency': 'Ministry of Social and Family Development (MSF)',
  'Image': 'https://chidnast.sirv.com/SchemesSG/msf.jpg',
  'Link': 'https://www.msf.gov.sg/assistance/Pages/Home-Ownership-Plus-Education-HOPE-Scheme.aspx'},
 {'Relevance': 0.3013159037,
  'Scheme': 'Various services',
  'Description': 'Casework management; child protection; home-based parenting training; building psychological and social resilience for children and young persons; family bonding',
  'Agency': 'Big Love Child Protection Specialist Centre',
  'Image': 'https://chidnast.sirv.com/Schemes