[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/search/semantic-search/ner-search/ner-powered-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/search/semantic-search/ner-search/ner-powered-search.ipynb)

# Named Entity Recognition and Semantic Search with Pinecone Indexes and Sentence Transformers

This notebook shows how to use Named Entity Recognition (NER) for hybrid metadata + vector search with Pinecone. We will:

1. Extract named entities from text.
2. Store them in a Pinecone index as metadata (alongside respective text vectors).
3. Extract named entities from incoming queries and use them as a filter, so that the search only covers records containing those entities.

This is particularly helpful if you want to restrict the search scope to records that mention the same named entities as the query.
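
The filter-then-rank idea can be illustrated without any external service. The following is only a toy sketch: the records, their two-dimensional vectors, and the `filtered_search` helper are made up for illustration and are not part of the notebook or of the Pinecone API.

```python
import math

# Toy records: (id, embedding, named entities). All values are invented.
records = [
    ("a", [1.0, 0.0], ["EEUU", "Trump"]),
    ("b", [0.9, 0.1], ["Mar Menor"]),
    ("c", [0.0, 1.0], ["EEUU"]),
]

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def filtered_search(query_vec, query_entities, top_k=5):
    # step 1: keep only records sharing at least one entity with the query
    candidates = [r for r in records if set(r[2]) & set(query_entities)]
    # step 2: rank the remaining records by cosine similarity to the query
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [r[0] for r in ranked[:top_k]]

print(filtered_search([1.0, 0.0], ["EEUU"]))  # ['a', 'c']
```

Record `b` is very close to the query vector but is excluded because it shares no entity with the query, which is exactly the effect of the metadata filter.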

Let's get started.

# Install Dependencies

In [1]:
!pip install sentence_transformers pinecone-client datasets python-dotenv

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting pinecone-client
  Downloading pinecone_client-2.2.2-py3-none-any.whl (179 kB)
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)

## Load the libraries

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from sentence_transformers import SentenceTransformer

import torch

import pinecone

from tqdm.auto import tqdm

import os
from dotenv import load_dotenv

import warnings
warnings.filterwarnings("ignore")

# Load and Prepare Dataset

We use CC-NEWS-ES-titles, a Spanish-language dataset for news title generation, extracted from the CC-NEWS dataset. The texts and titles come from the 2019 and 2020 CC-NEWS data (which is part of Common Crawl). Going by the split sizes reported when the dataset is loaded, it holds about 400K articles in total.

We load only the test split, since indexing all the articles would take a long time. The dataset can be loaded from the Hugging Face dataset hub as follows:

In [3]:
# load the dataset and convert to pandas dataframe
df = load_dataset(
    "LeoCordoba/CC-NEWS-ES-titles",
    split="test"
).to_pandas()

Downloading and preparing dataset cc-news-es-titles/default (download: 624.33 MiB, generated: 614.04 MiB, post-processed: Unknown size, total: 1.21 GiB) to /root/.cache/huggingface/datasets/LeoCordoba___cc-news-es-titles/default/0.0.0/4ce1747fb0af21e9f8f8b47a10039a2ea420c706adcb11d31c0edbbcbb3559f9...


Generating train split:   0%|          | 0/370125 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16092 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/16093 [00:00<?, ? examples/s]

Dataset cc-news-es-titles downloaded and prepared to /root/.cache/huggingface/datasets/LeoCordoba___cc-news-es-titles/default/0.0.0/4ce1747fb0af21e9f8f8b47a10039a2ea420c706adcb11d31c0edbbcbb3559f9. Subsequent calls will reuse this data.


In [4]:
# optional: drop empty rows and sample 50k articles
# (disabled here because the test split has only 16,093 rows)
#df = df.dropna().sample(50000, random_state=32)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16093 entries, 0 to 16092
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   text         16093 non-null  object
 1   output_text  16093 non-null  object
dtypes: object(2)
memory usage: 251.6+ KB


We will use the article title and its text for generating embeddings. For that, we join the article title and the first 1000 characters from the article text.

In [5]:
# select first 1000 characters
df["text"] = df["text"].str[:1000]
# join article title and the text
df["title_text"] = df["output_text"] + ". " + df["text"]
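
The same join can be written as a small function on plain strings. This sketch also collapses the newlines and padding that some titles carry. That cleanup is an extra, optional step that the notebook itself does not perform, and `build_title_text` is a hypothetical helper name.

```python
def build_title_text(title: str, text: str, max_chars: int = 1000) -> str:
    # collapse newlines and repeated spaces in the title (optional cleanup)
    clean_title = " ".join(title.split())
    # keep only the first max_chars characters of the article body
    return f"{clean_title}. {text[:max_chars]}"

# Illustrative strings, shortened so the truncation is visible
example = build_title_text("\n   Alerta meteo  \n", "Sobre el sur de La Pampa", max_chars=8)
print(example)  # Alerta meteo. Sobre el
```

With pandas, a function like this could be applied row by row, although the vectorized string operations used in the cell above are faster on large frames.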

In [6]:
df.head(10)

| | text | output_text | title_text |
|---|---|---|---|
| 0 | Los latinos en Estados Unidos son consumidores... | \n ¿Se está debilitando el ... | \n ¿Se está debilitando el ... |
| 1 | 26.05.2019 \| 10:52Terrassa en Comú (TeC) llega... | Xavi Matilla(TeC):"Una gran mayoría tiene gana... | Xavi Matilla(TeC):"Una gran mayoría tiene gana... |
| 2 | Instagram elimina un video de Madonna por desi... | ¿Por qué Donald Trump anunció que prohibirá Ti... | ¿Por qué Donald Trump anunció que prohibirá Ti... |
| 3 | El sector valora el plan de choque municipal a... | Gestores culturales en Andalucía valoran la "p... | Gestores culturales en Andalucía valoran la "p... |
| 4 | Entender cuál es la estructura y cómo funciona... | Premio internacional para Yamir Moreno, físico... | Premio internacional para Yamir Moreno, físico... |
| 5 | , con el objetivo de firmar dos convenios, uno... | Galmarini firmó convenios con el rector de UBA... | Galmarini firmó convenios con el rector de UBA... |
| 6 | .Lo verdaderamente curioso de todo el fenómeno... | No, Fortnite no se ha acabado para siempre | No, Fortnite no se ha acabado para siempre. .L... |
| 7 | . Así, ‘De Cayetana a Cayetano’ marcó un antes... | Eugenia Martínez de Irujo desvela el verdadero... | Eugenia Martínez de Irujo desvela el verdadero... |
| 8 | El torneo Oficial de la Liga Rionegrina de Fút... | El torneo de la Liga Rionegrina tiene nuevo fo... | El torneo de la Liga Rionegrina tiene nuevo fo... |
| 9 | Sobre el sur de La Pampa y el extremo sur de l... | \n Alerta meteo... | \n Alerta meteo... |


In [None]:
df['title_text'][2]

'¿Por qué Donald Trump anunció que prohibirá TikTok en Estados Unidos?. Instagram elimina un video de Madonna por desinformar sobre el coronavirus"Tengo esa autoridad. Puedo hacerlo con una orden ejecutiva", afirmó el mandatario, quien detalló que planea tomar la decisión este mismo sábado como pronto.A principios de mes, el secretario de Estado de EE.UU., Mike Pompeo, ya dejó entrever que el Gobierno de Trump consideraba restringir el acceso a TikTok en Estados Unidos ante la posibilidad de que\tEstados Unidos supera las 150.000 muertes a causa de la pandemiaEn un evento organizado por el diario The Hill, Pompeo explicó que la Administración está valorando imponer sanciones y aseguró que "en breve" comunicarán al público "la serie de decisiones" que se han tomado.TikTok es una red social desarrollada por ByteDance, con sede en Pekín (China), en la que se comparten videos cortos y que ha logrado un gran éxito entre el público adolescente, pero que a la vez ha levantado\t'

In [23]:
df['title_text'][50]

'La Ópera de Viena dará recitales on-line por el Covid-19. TweetLa Ópera de Viena ha decidido compensar la masiva reducción de la vida social ofreciendo gratis «online» grandes funciones grabadas los últimos años. Con los museos, teatros, centros deportivos o restaurantes cerrados, vieron propicio brindar esta alternativa para ofrecer una vía de esparcimiento a la gente, debido a la expansión del Coronavirus,\tEntre las obras disponibles se cuentan el ciclo de cuatro óperas de «El anillo del nibelungo», de Wagner; una «Tosca» de Puccini con el barítono español Carlos Álvarez; «Romeo y Julieta», dirigida por Plácido Domingo y protagonizada por el peruano Juan Diego Flórez; o el «Falstaff» de Verdi con Zubin Mehta como director.\t'

# Initialize NER Model

To extract named entities, we will use a Spanish NER model fine-tuned from TinyBERT (`mrm8488/TinyBERT-spanish-uncased-finetuned-ner`). The model can be loaded from the Hugging Face model hub as follows:

In [10]:
# set device to GPU if available
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Spanish NER model
model_id = "mrm8488/TinyBERT-spanish-uncased-finetuned-ner"

# load the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(
    model_id
)
# load the NER model from huggingface
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    max_length=1024
)
# load the tokenizer and model into a NER pipeline
nlp = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="max",
    device=device
)

In [11]:
text = "Londres es la capital de Inglaterra y del Reino Unido"
# use the NER pipeline to extract named entities from the text
nlp(text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'LOC',
  'score': 0.94250643,
  'word': 'Londres',
  'start': 0,
  'end': 7},
 {'entity_group': 'LOC',
  'score': 0.9653858,
  'word': 'Inglaterra',
  'start': 25,
  'end': 35},
 {'entity_group': 'LOC',
  'score': 0.8287469,
  'word': 'Reino Unido',
  'start': 42,
  'end': 53}]

Our NER pipeline is working as expected and accurately extracting entities from the text.
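
Each entity comes with a confidence `score`, which could be used to discard uncertain detections before they are stored as metadata. The notebook keeps every entity, so the sketch below is an optional variation. The `select_entities` helper and the `0.9` threshold are assumptions chosen for illustration, applied to the output shown above.

```python
def select_entities(ner_output, min_score=0.9):
    # keep entity strings whose confidence is at least min_score,
    # preserving order and skipping duplicates
    selected = []
    for ent in ner_output:
        if ent["score"] >= min_score and ent["word"] not in selected:
            selected.append(ent["word"])
    return selected

# Output of the NER pipeline for the example sentence (start/end omitted)
sample = [
    {"entity_group": "LOC", "score": 0.94250643, "word": "Londres"},
    {"entity_group": "LOC", "score": 0.9653858, "word": "Inglaterra"},
    {"entity_group": "LOC", "score": 0.8287469, "word": "Reino Unido"},
]
print(select_entities(sample))  # ['Londres', 'Inglaterra']
```

A stricter threshold gives cleaner metadata but fewer entities to filter on, so the right value depends on the data.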

# Initialize the Retriever

A retriever model is used to embed passages and queries. It creates embeddings such that queries and passages with similar meanings are close in the vector space. We will use a sentence-transformer model as our retriever. The model can be loaded as follows:

In [12]:
# load the model from huggingface
retriever = SentenceTransformer(
    'mrm8488/distiluse-base-multilingual-cased-v2-finetuned-stsb_multi_mt-es',
    device=device
)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

# Initialize Pinecone Index

Now we need a Pinecone index. The index stores the vector representations of our passages, which we later retrieve using another vector (the query vector). We first initialize our connection to Pinecone. For this, we need a free [API key](https://app.pinecone.io/); the environment name is shown next to the key in the [Pinecone console](https://app.pinecone.io) under **API Keys**. We initialize the connection like so:

In [13]:
# Load .env file with environment variables
load_dotenv()

# connect to pinecone environment
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west4-gcp-free"  # find next to API key in console
)
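
Reading the key with `os.environ[...]` raises a bare `KeyError` when the variable is missing. A slightly friendlier pattern is sketched below, where `get_required_env` is a hypothetical helper and `DEMO_API_KEY` is a throwaway variable used only so that the snippet runs anywhere.

```python
import os

def get_required_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value

# Example with a dummy variable; in the notebook the name would be PINECONE_API_KEY
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(get_required_env("DEMO_API_KEY"))  # not-a-real-key
```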

Now we can create our vector index. We name it `ner-spanish-search` (feel free to choose any name you prefer). We set the metric to `cosine` and the dimension to `768`, matching the similarity measure we want and the size of the vectors output by the retriever model.

In [14]:
index_name = "ner-spanish-search"

# check if the ner-spanish-search index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=768,
        metric="cosine"
    )

# connect to the ner-spanish-search index we created
index = pinecone.Index(index_name)

# Generate Embeddings and Upsert

We generate embeddings for the `title_text` column we created earlier. Alongside the embeddings, we also include the named entities in the index as metadata. Later we will apply a filter based on these named entities when executing queries.

Let's first write a helper function to extract named entities from a batch of text.

In [15]:
def extract_named_entities(text_batch):
    # extract named entities using the NER pipeline
    extracted_batch = nlp(text_batch)
    entities = []
    # loop through the results and only select the entity names
    for text in extracted_batch:
        ne = [entity["word"] for entity in text]
        entities.append(ne)
    return entities

Now we create the embeddings. We do this in batches of `64` to avoid overwhelming machine resources or API request limits.

In [16]:
# we will use batches of 64
batch_size = 64

# only upsert when the index holds fewer than 100 vectors (i.e. is effectively empty)
index_stats_response = index.describe_index_stats()
if index_stats_response['total_vector_count'] < 100:

    for i in tqdm(range(0, len(df), batch_size)):
        # find end of batch
        i_end = min(i+batch_size, len(df))
        # extract batch (copy it so that adding a column below does not write to a view of df)
        batch = df.iloc[i:i_end].copy()
        # generate embeddings for batch
        emb = retriever.encode(batch["title_text"].tolist()).tolist()
        # extract named entities from the batch
        entities = extract_named_entities(batch["title_text"].tolist())
        # remove duplicate entities from each record
        batch["named_entities"] = [list(set(entity)) for entity in entities]
        batch = batch.drop('title_text', axis=1)
        # get metadata
        meta = batch.to_dict(orient="records")
        # create unique IDs
        ids = [f"{idx}" for idx in range(i, i_end)]
        # add all to upsert list
        to_upsert = list(zip(ids, emb, meta))
        # upsert/insert these records to pinecone
        _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 16093}},
 'total_vector_count': 16093}
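
The batching pattern in the loop above can be isolated into a small generator, which makes the arithmetic easy to check. This is a generic sketch that makes no Pinecone or model calls, and the helper is named `iter_batches` for illustration.

```python
import math

def iter_batches(items, batch_size):
    # yield (start_index, chunk) pairs; the last chunk may be shorter
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

sizes = [(start, len(chunk)) for start, chunk in iter_batches(list(range(150)), 64)]
print(sizes)  # [(0, 64), (64, 64), (128, 22)]

# number of upsert calls needed for the 16,093 records at batch size 64
print(math.ceil(16093 / 64))  # 252
```

Using the start index to build IDs, as the notebook does, keeps IDs stable across reruns, so upserting again overwrites records rather than duplicating them.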

Now we have indexed the articles and relevant metadata. We can move on to querying.

# Querying

First, we will write a helper function to handle the queries.

In [20]:
from pprint import pprint

def search_pinecone(query):
    # extract named entities from the query
    ne = extract_named_entities([query])[0]
    # create embeddings for the query
    xq = retriever.encode(query).tolist()
    # query the pinecone index while applying named entity filter
    xc = index.query(
        xq,
        top_k=5,
        include_metadata=True,
        filter={"named_entities": {"$in": ne}}
    )
    # extract article titles from the search result
    r = [x["metadata"]["output_text"] for x in xc["matches"]]
    return pprint({"Extracted Named Entities": ne, "Result": r})
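
One edge case worth handling is a query in which the NER model finds no entities: an `$in` filter over an empty list is unlikely to match any record. A possible guard, sketched here with a hypothetical `build_entity_filter` helper, is to build the filter only when entities exist and otherwise fall back to an unfiltered search. This assumes the client accepts `filter=None`, which is the default in the client version installed above.

```python
def build_entity_filter(entities):
    # drop empty strings and duplicates while preserving order
    unique = list(dict.fromkeys(e for e in entities if e))
    if not unique:
        # no entities found: signal that no metadata filter should be applied
        return None
    return {"named_entities": {"$in": unique}}

print(build_entity_filter(["EEUU", "EEUU"]))  # {'named_entities': {'$in': ['EEUU']}}
print(build_entity_filter([]))                # None
```

Inside `search_pinecone`, the result would be passed as `filter=build_entity_filter(ne)`.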

Now try a query.

In [21]:
query = "¿Se está debilitando el vínculo de los hispanos en EEUU con la carne de cerdo?"
search_pinecone(query)

{'Extracted Named Entities': ['EEUU'],
 'Result': ['\n'
            '                    ¿Se está debilitando el vínculo de los '
            'hispanos en EEUU con la carne de cerdo? | El Nuevo Herald\n'
            '                ',
            'COVID y desempleo afectan el empadronamiento latino en EEUU',
            'El debate de las mascarillas: ¿hace falta que las lleve todo el '
            'mundo o no?',
            'Estados Unidos.- Sordo lamenta que un "mamarracho" como Trump '
            'ponga en cuestión un sistema democrático y apuesta por Biden',
            'Renovado interés en esclavos de EEUU que huyeron a México']}


In [22]:
query = "Regadio en el Mar Menor"
search_pinecone(query)

{'Extracted Named Entities': ['Regadio', 'Mar Menor'],
 'Result': ['La fiesta del año',
            'Día de la Región, Fitur olerá hoy a gastronomía murciana',
            'Casi 500 hectáreas de regadío ilegal en el Mar Menor multadas en '
            '2018 siguen con riego']}


In [24]:
query = "Opera de Viena dirigida por Plácido Domingo"
search_pinecone(query)

{'Extracted Named Entities': ['Opera de Viena', 'Plácido Domingo'],
 'Result': ['Plácido Domingo recibirá un premio en Austria a su "excepcional '
            'carrera"',
            'Les Arts espera que se reciba a Plácido Domingo "como el gran '
            'artista que es"',
            'Les Arts de València quita el nombre de Plácido Domingo a su '
            'Centro de Perfeccionamiento',
            'José Carreras se presentará en mayo en Colombia con la '
            'Filarmónica de Bogotá',
            'Tokio 2020: En duda presentación de Domingo tras acusaciones']}


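In each case, the search is limited to records whose `named_entities` metadata contains at least one entity extracted from the query, and the matching records are ranked by vector similarity. This combines Pinecone's vector search with a named entity filter that narrows the search scope to relevant records. Keep in mind that the filter compares entity strings exactly, so spelling variants such as `Opera` and `Ópera` are treated as different entities.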
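
Because metadata filters compare strings exactly, entity spellings that differ only in case or accents (for example `Opera de Viena` in a query versus `Ópera de Viena` in an article) will not match. One possible mitigation, which the notebook does not implement, is to normalize entities the same way at indexing time and at query time. The `normalize_entity` helper below is a hypothetical sketch using only the standard library.

```python
import unicodedata

def normalize_entity(name: str) -> str:
    # decompose accented characters, drop the combining marks, then lowercase
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # collapse any repeated whitespace
    return " ".join(stripped.lower().split())

print(normalize_entity("Ópera de Viena") == normalize_entity("Opera de Viena"))  # True
print(normalize_entity("Regadío"))  # regadio
```

This also maps `ñ` to `n`, which may merge distinct Spanish words, so whether the trade-off is acceptable depends on the data.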