# Deep Learning-Based NER Semantic Search

This project uses an NLP model to perform semantic search with named entity recognition (NER). Because the model interprets the content and intent of a query, it can return more relevant results and suggestions, with higher precision than keyword search [1, 2]. Deep learning-based NER achieves this by filtering, extracting and analysing the essential information (known as entities) in large datasets and queries [3, 4].

## 1. Data

The data used to build the search index are articles scraped from Medium, which are loaded from Hugging Face.

In [7]:
#Install datasets module for loading dataset
!pip install datasets



In [8]:
from datasets import load_dataset

# Load the train split of the dataset and convert it into a pandas DataFrame
df = load_dataset("fabiochiu/medium-articles",
                  data_files="medium_articles.csv",
                  split="train").to_pandas()
df.head()

| | title | text | url | authors | timestamp | tags |
|---|---|---|---|---|---|---|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |


In [9]:
#Check for missing rows
df.isna().sum()

title        5
text         0
url          0
authors      0
timestamp    2
tags         0
dtype: int64

**A. Preprocess data**

Preprocessing involves dropping rows with missing values from the loaded dataset, drawing a random sample, and keeping only the first 1000 characters of each article's text, because the embedding model can only handle a limited input length.

The first 1000 characters of each entry in the text column are selected because they should contain the introduction, and the general idea of an article can usually be found in its title and introduction.
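
The truncate-and-join step described above can be sketched for a single article in plain Python. The helper name `build_title_text` and the `max_chars` parameter are hypothetical and only illustrate the idea; the notebook itself does this on whole pandas columns.

```python
def build_title_text(title: str, text: str, max_chars: int = 1000) -> str:
    """Join a title with the first `max_chars` characters of the article body."""
    # Slicing never fails on short strings, so short articles are kept whole
    return f"{title}. {text[:max_chars]}"


long_body = "x" * 5000  # stand-in for a long article body
combined = build_title_text("Mind Your Nose", long_body)
print(len(combined))  # 14 (title) + 2 (". ") + 1000 (truncated body) = 1016
```

Slicing by characters is cheap and model-independent, although the embedding model counts tokens rather than characters, so the two limits only match approximately.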

In [58]:
# Drop rows with missing values and sample 50000 articles
df_processed = df.dropna().sample(50000, random_state=42) #random_state for reproducibility
df_processed.head()

| | title | text | url | authors | timestamp | tags |
|---|---|---|---|---|---|---|
| 163963 | Konsep Perdagangan Adil (Fair Trade) | Sumber:\n\nJournal\n\nTaylor, Jason E, and Boa... | https://medium.com/hipotesa-indonesia/konsep-p... | ['Kim Litelnoni'] | 2019-06-16 01:17:44.009000+00:00 | ['Trade', 'Fair Trade', 'International Relatio... |
| 166367 | Palantir Apollo: Powering SaaS where no SaaS h... | At Palantir, our approach to software has unde... | https://blog.palantir.com/palantir-apollo-powe... | [] | 2020-10-08 13:30:34.138000+00:00 | ['Palantirtech', 'Continuous Delivery', 'Palan... |
| 42341 | ZEROBANK announces the most feasible ICO proje... | June 8th, 2018, Singapore — ZeroBank, the inno... | https://medium.com/zerobank-cash/zerobank-the-... | ['Zerobank - Your Local Currency'] | 2018-07-17 03:23:38.526000+00:00 | ['Sharingeconomy', 'Bitcoin', 'Blockchain', 'Z... |
| 59988 | 7 Reasons Your Pitch Got Rejected | 7 Reasons Your Pitch Got Rejected\n\nCommon pi... | https://medium.com/the-lucky-freelancer/7-reas... | ['Alicia Wilcox'] | 2020-09-10 00:51:31.372000+00:00 | ['Writing', 'Pitch', 'Freelance', 'Freelance W... |
| 123742 | Why Money Mindset Is Important For Writers | Writing, Writer, Money Mindset, Abundance\n\nI... | https://medium.com/books-and-midlife-adventure... | ['Christie Adams - Writer'] | 2021-03-25 19:57:25.881000+00:00 | ['Writers Life', 'Writers On Medium', 'Money M... |


In [59]:
#Select first 1000 characters from text
df_processed["text"] = df_processed["text"].str[:1000]
#Join article title and text into new column
df_processed["title_text"] = df_processed["title"] +". " + df_processed["text"]

In [60]:
df_processed.head()

| | title | text | url | authors | timestamp | tags | title_text |
|---|---|---|---|---|---|---|---|
| 163963 | Konsep Perdagangan Adil (Fair Trade) | Sumber:\n\nJournal\n\nTaylor, Jason E, and Boa... | https://medium.com/hipotesa-indonesia/konsep-p... | ['Kim Litelnoni'] | 2019-06-16 01:17:44.009000+00:00 | ['Trade', 'Fair Trade', 'International Relatio... | Konsep Perdagangan Adil (Fair Trade). Sumber:\... |
| 166367 | Palantir Apollo: Powering SaaS where no SaaS h... | At Palantir, our approach to software has unde... | https://blog.palantir.com/palantir-apollo-powe... | [] | 2020-10-08 13:30:34.138000+00:00 | ['Palantirtech', 'Continuous Delivery', 'Palan... | Palantir Apollo: Powering SaaS where no SaaS h... |
| 42341 | ZEROBANK announces the most feasible ICO proje... | June 8th, 2018, Singapore — ZeroBank, the inno... | https://medium.com/zerobank-cash/zerobank-the-... | ['Zerobank - Your Local Currency'] | 2018-07-17 03:23:38.526000+00:00 | ['Sharingeconomy', 'Bitcoin', 'Blockchain', 'Z... | ZEROBANK announces the most feasible ICO proje... |
| 59988 | 7 Reasons Your Pitch Got Rejected | 7 Reasons Your Pitch Got Rejected\n\nCommon pi... | https://medium.com/the-lucky-freelancer/7-reas... | ['Alicia Wilcox'] | 2020-09-10 00:51:31.372000+00:00 | ['Writing', 'Pitch', 'Freelance', 'Freelance W... | 7 Reasons Your Pitch Got Rejected. 7 Reasons Y... |
| 123742 | Why Money Mindset Is Important For Writers | Writing, Writer, Money Mindset, Abundance\n\nI... | https://medium.com/books-and-midlife-adventure... | ['Christie Adams - Writer'] | 2021-03-25 19:57:25.881000+00:00 | ['Writers Life', 'Writers On Medium', 'Money M... | Why Money Mindset Is Important For Writers. Wr... |


In [61]:
df_processed["title_text"].head()

163963    Konsep Perdagangan Adil (Fair Trade). Sumber:\...
166367    Palantir Apollo: Powering SaaS where no SaaS h...
42341     ZEROBANK announces the most feasible ICO proje...
59988     7 Reasons Your Pitch Got Rejected. 7 Reasons Y...
123742    Why Money Mindset Is Important For Writers. Wr...
Name: title_text, dtype: object

The new column is created successfully.

## 2. Building the model and creating the vector database

The NER model used is a BERT-base model that has been fine-tuned for named entity recognition (dslim/bert-base-NER).

In [62]:
import torch
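# Check whether PyTorch can see a CUDA GPU; only query its name if one is available,
# because torch.cuda.get_device_name() raises an error on CPU-only machines
cuda_available = torch.cuda.is_available()
gpu_name = torch.cuda.get_device_name() if cuda_available else "none"
print(f"GPU Name: {gpu_name} \nCUDA is available: {cuda_available}")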

GPU Name: NVIDIA GeForce RTX 4060 Laptop GPU 
CUDA is available: True


In [38]:
#Instantiate the NER Model
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id= "dslim/bert-base-NER"

# Load tokenizer from Huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Instantiate the NER model from Huggingface
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Define what device to use
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#Load the tokenizer and model into a NER pipeline
nlp = pipeline("ner",
               model=model,
               tokenizer=tokenizer,
               aggregation_strategy="max",
               device=device)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [39]:
text = "Singapore is a country in Southeast Asia"
nlp(text)

[{'entity_group': 'LOC',
  'score': 0.9997849,
  'word': 'Singapore',
  'start': 0,
  'end': 9},
 {'entity_group': 'LOC',
  'score': 0.99670875,
  'word': 'Southeast Asia',
  'start': 26,
  'end': 40}]

The NER pipeline is instantiated successfully and tags both locations in the test sentence.
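
To make the structure of the pipeline output easier to work with, here is a small sketch that groups the predictions by entity label. The sample list is copied from the output above; the helper `group_entities` and its `min_score` threshold are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

text = "Singapore is a country in Southeast Asia"

# Sample predictions copied from the pipeline output above
ner_output = [
    {"entity_group": "LOC", "score": 0.9997849, "word": "Singapore", "start": 0, "end": 9},
    {"entity_group": "LOC", "score": 0.99670875, "word": "Southeast Asia", "start": 26, "end": 40},
]


def group_entities(predictions, min_score=0.9):
    """Group entity strings by label, keeping only confident predictions."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:  # hypothetical confidence cut-off
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


print(group_entities(ner_output))  # {'LOC': ['Singapore', 'Southeast Asia']}

# The start/end fields are character offsets into the original text
for pred in ner_output:
    print(text[pred["start"]:pred["end"]])
```

The `start` and `end` offsets are useful when the surface form in `word` differs from the original text, for example after sub-word tokens have been merged.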

**Create embeddings**

Embeddings are created using SentenceTransformer and will be stored in Pinecone.

In [40]:
!pip install sentence-transformers



In [41]:
from sentence_transformers import SentenceTransformer

# Instantiate model from Huggingface
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)

#Check if retriever is successfully instantiated
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
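
The printed modules show that the retriever averages the token embeddings (pooling_mode_mean_tokens is True) and then normalises the result. Because max_seq_length is 128, input longer than 128 tokens is truncated before pooling. A rough pure-Python sketch of the pooling and normalisation steps follows; it uses made-up 3-dimensional token vectors instead of the real 768-dimensional ones and ignores the attention mask that the real pooling layer uses to skip padding tokens.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


# Toy token embeddings (illustrative values only)
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 0.0]]
sentence_vector = l2_normalize(mean_pool(tokens))
print(sentence_vector)  # approximately [0.667, 0.667, 0.333]
```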

**Initialising Pinecone to store our embeddings**

In [42]:
!pip install pinecone



In [43]:
from pinecone import Pinecone, ServerlessSpec

In [44]:
# Importing of api_key to access pinecone DB
import secret_keys
api_key = secret_keys.API_KEY

In [45]:
# Initialise connection to pinecone
pc = Pinecone(api_key=api_key)

In [65]:
# Create index for NER search
index_name = "ner-search"

#Create Index
pc.create_index(name=index_name,
                dimension=retriever.get_sentence_embedding_dimension(), #dimension is the embedding dimension
                metric="cosine", #cosine similarity metric
                spec=ServerlessSpec(cloud="aws",
                                    region="us-east-1")
                )

In [66]:
#Select index in pinecone
pc_index = pc.Index(index_name)
pc_index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

The output above shows that the empty index has been created successfully in Pinecone.
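
The index is created with the cosine metric. As a reminder of what that metric measures, here is a minimal pure-Python sketch with toy 2-dimensional vectors; it is an illustration only, not how Pinecone computes similarity internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Cosine similarity ignores vector length and compares direction only. Since the retriever already ends with a Normalize() step, its embeddings have unit length, so cosine similarity and the plain dot product give the same ranking here.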

In [67]:
#Create function to prepare data by extracting named entities to add to pinecone
def extract_entities(text_list):
  extracted_fields = nlp(text_list)
  entities_list = []
  for text in extracted_fields:
    named_entity = [fields["word"] for fields in text] #extract the word in each dict
    named_entity = list(set(named_entity)) # remove duplicates of entities
    entities_list.append(named_entity)
  return entities_list

In [68]:
#Checking if function is correctly defined
extract_entities(df_processed['title_text'][:1].tolist())

[['Fair Trade',
  'Lyon',
  'New York',
  'Fairtrade',
  'Journal of Consumer Affairs',
  'Journal Taylor',
  'Konsep Perdagangan Adil',
  'Mark',
  'Indonesia',
  'Bristol',
  'Trade',
  'Sarah',
  'Program',
  'Moberg',
  'Fair Trade and',
  'Trade di Indonesia',
  'Hunt',
  'OEC',
  'Social',
  'Boasson',
  'Tim',
  'Justice',
  'NYU Press',
  'Jason E',
  'Vigdis']]
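
Note that `list(set(...))` removes duplicates but does not preserve the order in which entities appear, and the resulting order can differ between Python sessions. If a stable order is preferred, one possible alternative is sketched below; `unique_in_order` is a hypothetical helper, not part of the original code.

```python
def unique_in_order(words):
    """Remove duplicates while keeping the first occurrence of each word."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(words))


entities = ["Fair Trade", "Indonesia", "Fair Trade", "New York", "Indonesia"]
print(unique_in_order(entities))  # ['Fair Trade', 'Indonesia', 'New York']
```

The order does not affect the `$in` filter used later, so this is mainly a convenience for reading and comparing outputs.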

In [69]:
from tqdm.auto import tqdm

# Encode in batches of 64 to avoid overwhelming the GPU and
# to stay within the limit on data sent to Pinecone per request
batch_size = 64

for i in tqdm(range(0, len(df_processed), batch_size)): #Include progress bar to indicate progress
  # Get the end index of each batch
  i_end = min(i+batch_size, len(df_processed))
  # Extract batch from df_processed
  batch = df_processed.iloc[i:i_end].copy() # copy to avoid pandas SettingWithCopyWarning
  # Generate embeddings for batch
  emb = retriever.encode(batch["title_text"].tolist()).tolist()
  # Extract named entities from batch
  batch["named_entities"] = extract_entities(batch["title_text"].tolist())
  batch.drop("title_text", axis=1, inplace=True) #remove as no longer needed
  # Get metadata
  metadata = batch.to_dict(orient="records")
  # Create unique IDs
  ids = [f"{idx}" for idx in range(i,i_end)]
  # Add all info to upsert list
  to_upsert = list(zip(ids, emb, metadata))
  # Upsert or insert records to pinecone
  up_records = pc_index.upsert(vectors=to_upsert)

# Check how many of the 50000 vectors are in the index (the count may lag behind recent upserts)
pc_index.describe_index_stats()

  0%|          | 0/782 [00:00<?, ?it/s]

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 49984}},
 'total_vector_count': 49984}
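
The batching arithmetic of the loop above can be checked with a short sketch. `batch_ranges` is a hypothetical helper that reproduces only the start and end indices, not the encoding or upserting.

```python
def batch_ranges(n_items, batch_size):
    """Return (start, end) index pairs covering n_items in steps of batch_size."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


ranges = batch_ranges(50_000, 64)
print(len(ranges))   # 782 batches, matching the progress bar total
print(ranges[-1])    # (49984, 50000): the final batch holds only 16 records
print(781 * 64)      # 49984
```

The reported count of 49984 is exactly 781 full batches of 64, which is every batch except the final one of 16. A plausible explanation, which is an assumption and was not verified here, is that the index statistics had not yet caught up with the last upsert when they were read.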

**Creating a search query function**

In [56]:
from pprint import pprint
def pinecone_vector_search(query):
  # Extract named entities from query
  named_entities = extract_entities([query])[0]
  # Create embeddings for query
  xq = retriever.encode(query).tolist()
  # Query pinecone index while applying named entity filter
  xc = pc_index.query(vector=xq, top_k=10, include_metadata=True, #find top 10 related articles by finding the 10 closest distance to query vector
                      filter={"named_entities": {"$in": named_entities}}) #filter named_entities that is within query
  # Extract article titles from search result
  result = [x["metadata"]["title"] for x in xc["matches"]]
  return pprint({"Extracted Named Entities": named_entities, "Result": result})

In [76]:
pinecone_vector_search("What is happening in Singapore?")

{'Extracted Named Entities': ['Singapore'],
 'Result': ['Daily SG: 4 Dec 2014',
            'Foresting and DeCentre Attends Consensus:Singapore 2018',
            'On Capitalism And LGBT Freedoms In Singapore',
            'The Dangerous Influence of “Chinese Privilege” in Singapore',
            'Singapore SMEs must understand these digital trends, or risk '
            'irrelevance',
            'Blockchain Report — 12/6/2018. Summary: Cryptocurrency Scammers '
            'Get…',
            'Singapore’s ‘Fake News’ Crackdown Alarms Tech Giants',
            'Who could be next prime minister',
            'The Case For Women in Tech',
            'Singapore businesses test the use of digital health passports to '
            'check the results of the COVID-19 test for travelers']}


The vector search function prints the titles of the 10 most relevant articles.
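
The logic of the search function is to filter by named entity first and then rank by vector similarity. The toy simulation below illustrates that logic in plain Python. The records, vectors and the `filtered_search` helper are all made up for illustration, and this is not how Pinecone implements filtering internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_search(query_vec, query_entities, records, top_k=10):
    """Keep records sharing at least one entity with the query, then rank by similarity."""
    wanted = set(query_entities)
    candidates = [r for r in records if wanted & set(r["named_entities"])]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return [r["title"] for r in ranked[:top_k]]


# Hypothetical records with 2-dimensional vectors
records = [
    {"title": "A", "vector": [1.0, 0.0], "named_entities": ["Singapore"]},
    {"title": "B", "vector": [0.9, 0.1], "named_entities": ["Italy"]},
    {"title": "C", "vector": [0.0, 1.0], "named_entities": ["Singapore", "Bintan"]},
]

print(filtered_search([1.0, 0.0], ["Singapore"], records))  # ['A', 'C']
```

Record B is very close to the query vector but is excluded because it does not mention the query entity, while record C is returned despite being dissimilar. This shows how the entity filter shapes the result set: any record whose entity list contains the query entity is eligible, whatever its topic.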

## 3. Discussion

Combining deep learning-based NER with vector search enables contextual search, as shown in the output above. This approach can return relevant (and more accurate) results even when the exact keywords are not stated in the search query, moving beyond conventional keyword search. This enables use cases such as businesses querying social media datasets for market research, or searching through internal documents for audit purposes.

However, fine-tuning the model should be explored, as queries like the one shown below do not return the expected results.

In [79]:
pinecone_vector_search("Where are the best spots to tour in Singapore?")

{'Extracted Named Entities': ['Singapore'],
 'Result': ['Affordable Luxury Hotels in Singapore',
            'Singapore SMEs must understand these digital trends, or risk '
            'irrelevance',
            'Daily SG: 4 Dec 2014',
            'What are 5 uneasy situations lesbian couples face in Singapore?',
            'Yoga is Far More than What We See Nowadays',
            'Singapore apartment resale prices analysis',
            'Foresting and DeCentre Attends Consensus:Singapore 2018',
            'Singapore businesses test the use of digital health passports to '
            'check the results of the COVID-19 test for travelers',
            '#14 BINTAN + HAPPY NEW YEAR',
            'Italy Holiday Package From India']}


We could increase the amount of data used for embedding. However, this increases cost and can lead to vectors being stored in different sections, making it harder for the model to retrieve accurate results. To circumvent this limitation, we can plot a knowledge graph and feed its relevant sections to the model, although this can be more time-consuming.

According to a 2023 article by Chan Kwang Wen [5], one approach would be to embed question-answer pairs generated by large language models (LLMs) as key-value pairs to improve the model through contrastive learning. Personally, I feel that this could be an interesting approach to fine-tune the search model, as it should be able to cover a large subset of use cases. However, creativity is a unique trait of humans, and the model might not be able to come up with accurate responses when faced with a totally new query. Hence, much remains to be studied and explored on fine-tuning semantic search models to improve their accuracy.
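
To make the idea of contrastive learning on question-answer pairs more concrete, here is a sketch of one common contrastive objective with in-batch negatives (often called InfoNCE). This is an assumption about what such training could look like; the cited article may use a different formulation. The function name, the `temperature` value and the toy vectors are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(question_vecs, answer_vecs, temperature=0.05):
    """Mean cross-entropy where answer i is the positive for question i
    and every other answer in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(question_vecs):
        logits = [dot(q, a) / temperature for a in answer_vecs]
        # Numerically stable log-sum-exp over all answers in the batch
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


questions = [[1.0, 0.0], [0.0, 1.0]]
matching_answers = [[1.0, 0.0], [0.0, 1.0]]    # each answer aligned with its question
mismatched_answers = [[0.0, 1.0], [1.0, 0.0]]  # answers swapped

print(info_nce_loss(questions, matching_answers) < info_nce_loss(questions, mismatched_answers))  # True
```

The loss is small when each question embedding is closest to its own answer and large otherwise, so minimising it pulls matching pairs together and pushes non-matching pairs apart.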


## 4. Credits

The code in this project is derived from https://www.youtube.com/watch?v=3K94GRjDG2Q, to practise the technique and demonstrate how the model works.

**References:**

1. https://www.coursera.org/articles/what-is-semantic-search

2. https://spotintelligence.com/2023/10/17/semantic-search/

3. https://www.techtarget.com/whatis/definition/named-entity-recognition-NER

4. https://medium.com/etoai/ner-powered-semantic-search-using-lancedb-51051dc3e493

5. https://puddles-of-water.medium.com/improving-the-semantic-search-tool-ef0442f7e972
