## Sentence Embedding With Hugging Face

When I first tested the model, I used a dataset called `QA_data` with about 3,700 rows. Because I wanted the application to cover a wider range of questions, I combined it with the SQuAD dataset, which contains over 80,000 question-answer pairs drawn from Wikipedia articles.

In [1]:
import pandas as pd
import numpy as np
import json
import csv

In [2]:
dataa = pd.read_csv('Data/QA_data.csv')

In [3]:
data2 = pd.read_csv('Data/train-squad.csv')

In [4]:
data2.head()

Unnamed: 0.1,Unnamed: 0,context,question,id,answer_start,text
0,0,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,When did Beyonce start becoming popular?,56be85543aeaaa14008c9063,269,in the late 1990s
1,1,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,What areas did Beyonce compete in when she was...,56be85543aeaaa14008c9065,207,singing and dancing
2,2,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,When did Beyonce leave Destiny's Child and bec...,56be85543aeaaa14008c9066,526,2003
3,3,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,In what city and state did Beyonce grow up?,56bf6b0f3aeaaa14008c9601,166,"Houston, Texas"
4,4,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,In which decade did Beyonce become famous?,56bf6b0f3aeaaa14008c9602,276,late 1990s


In [5]:
data2.columns

Index(['Unnamed: 0', 'context', 'question', 'id', 'answer_start', 'text'], dtype='object')

In [6]:
# Keep only the question and its answer text, then rename 'text' to match QA_data's 'answers' column
df2 = data2.drop(['Unnamed: 0', 'context', 'id', 'answer_start'], axis=1)
df2 = df2.rename({'text': 'answers'}, axis=1)
df2.head(5)

Unnamed: 0,question,answers
0,When did Beyonce start becoming popular?,in the late 1990s
1,What areas did Beyonce compete in when she was...,singing and dancing
2,When did Beyonce leave Destiny's Child and bec...,2003
3,In what city and state did Beyonce grow up?,"Houston, Texas"
4,In which decade did Beyonce become famous?,late 1990s


In [7]:
df = dataa.drop('url', axis = 1)
df.head(5)

Unnamed: 0,question,answers
0,what is the name of justin bieber brother?,"['Jazmyn Bieber','Jaxon Bieber']"
1,what character did natalie portman play in sta...,['Padmé Amidala']
2,what state does selena gomez?,['New York City']
3,what country is the grand bahama island in?,['Bahamas']
4,what kind of money to take to bahamas?,['Bahamian dollar']


In [8]:
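# Answers are stored as stringified lists such as "['Bahamas']": strip the brackets and quotes.
# Note that the second replace also removes apostrophes that occur inside an answer.
df['answers'] = df['answers'].apply(lambda x: x.replace("[", "").replace("]", ""))
df['answers'] = df['answers'].apply(lambda x: x.replace("'", ""))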
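The string replacements above work for the rows shown, but they would also drop an apostrophe inside an answer. A minimal alternative sketch, assuming every bracketed value is a valid Python list literal, parses the string with `ast.literal_eval` instead. The helper name `parse_answers` is hypothetical and not part of the original notebook.

```python
import ast

def parse_answers(raw):
    """Turn a stringified list such as "['A','B']" into "A,B"."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a Python literal (plain text, NaN, ...): return the value unchanged.
        return raw
    if isinstance(items, (list, tuple)):
        # Join with a bare comma to match the output of the replace-based cell.
        return ",".join(str(item) for item in items)
    return str(items)

print(parse_answers("['Jazmyn Bieber','Jaxon Bieber']"))
print(parse_answers('["Destiny\'s Child"]'))
```

In the notebook this could be applied with `df['answers'] = df['answers'].apply(parse_answers)`; values that are not list literals pass through untouched.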

In [9]:
df.head(5)

Unnamed: 0,question,answers
0,what is the name of justin bieber brother?,"Jazmyn Bieber,Jaxon Bieber"
1,what character did natalie portman play in sta...,Padmé Amidala
2,what state does selena gomez?,New York City
3,what country is the grand bahama island in?,Bahamas
4,what kind of money to take to bahamas?,Bahamian dollar


In [10]:
df_merged = pd.concat([df, df2], ignore_index=True, sort=False)

In [11]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90599 entries, 0 to 90598
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  90599 non-null  object
 1   answers   90596 non-null  object
dtypes: object(2)
memory usage: 1.4+ MB


In [12]:
df_merged.isnull().sum()

Unnamed: 0,0
question,0
answers,3


In [13]:
df_merged.dropna(subset=['answers'], inplace=True)
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90596 entries, 0 to 90598
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  90596 non-null  object
 1   answers   90596 non-null  object
dtypes: object(2)
memory usage: 2.1+ MB


In [None]:
#!pip install sentence-transformers



In [14]:
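from sentence_transformers import SentenceTransformer

# Embed every question in the merged dataset
sentences = df_merged['question'].tolist()

# all-MiniLM-L6-v2 maps each sentence to a 384-dimensional vector
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)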




In [None]:
#!pip install faiss-cpu
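
The `faiss-cpu` install suggests the embeddings are meant for similarity search, although this section does not show that step. As a rough illustration of what such a search computes, here is a brute-force cosine-similarity sketch in plain NumPy. The function name `top_k_cosine` and the toy vectors are assumptions made for the example; an index library such as FAISS would replace this loop-free but exhaustive scan on a dataset of this size.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=3):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query_vec`."""
    matrix = np.asarray(matrix, dtype=np.float32)
    query_vec = np.asarray(query_vec, dtype=np.float32)
    # L2-normalise both sides so the dot product equals cosine similarity.
    # Assumes no all-zero vectors, which would divide by zero here.
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = matrix_norm @ query_norm
    k = min(k, len(scores))
    best = np.argsort(-scores)[:k]  # highest similarity first
    return best, scores[best]

# Toy 3-dimensional vectors stand in for the real 384-dimensional embeddings.
toy = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
idx, sims = top_k_cosine(np.array([1.0, 0.0, 0.0]), toy, k=2)
print(idx, sims)
```

With the real data, the query vector would come from `model.encode([...])[0]` and the returned positions would be looked up with `df_merged.iloc[idx]`, since the positions are row offsets into `embeddings` rather than index labels.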



In [15]:
import pickle
with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'embeddings': embeddings},fOut)
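
To reuse the embeddings later without re-encoding, the pickle can be read back the same way it was written. The sketch below round-trips a small random stand-in array through a temporary file; `demo_embeddings` and the temporary path are placeholders for the example, not names from the notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in array; in the notebook this would be `embeddings` from model.encode.
demo_embeddings = np.random.rand(5, 384).astype(np.float32)

# Write to a throwaway directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "my-embeddings.pkl")
with open(path, "wb") as f_out:
    pickle.dump({"embeddings": demo_embeddings}, f_out)

# Read the dictionary back and pull out the array under the same key.
with open(path, "rb") as f_in:
    restored = pickle.load(f_in)["embeddings"]

print(restored.shape, np.array_equal(restored, demo_embeddings))
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code. For a plain array, `np.save` and `np.load` are a simpler option.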

In [16]:
# The index is written as well, so reading this file back yields an extra 'Unnamed: 0' column
df_merged.to_csv('df_merged.csv')
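
One thing to watch when pairing the saved CSV with the saved embeddings: after `dropna`, the frame's index labels keep their gaps (the `info()` output shows 90,596 entries labelled 0 to 90598), while the rows of `embeddings` are numbered 0 to 90,595 by position. A small sketch of the issue and one possible fix, using a hypothetical four-row frame and placeholder vectors:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "question": ["q0", "q1", "q2", "q3"],
    "answers": ["a0", None, "a2", "a3"],
})
frame = frame.dropna(subset=["answers"])
print(frame.index.tolist())  # labels still skip the dropped row

# Placeholder embeddings: one row per remaining question, in order.
vectors = np.zeros((len(frame), 384), dtype=np.float32)

# Renumber the rows so label i refers to the same question as vectors[i].
aligned = frame.reset_index(drop=True)
assert len(aligned) == vectors.shape[0]
print(aligned.index.tolist())
```

Under this approach the frame would be saved with `aligned.to_csv('df_merged.csv', index=False)`, so the row number in the file matches the row number in the embedding array. Whether that suits the rest of the application depends on how the CSV is read back later.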