# Searching companies by what they do


This notebook shows how we can use embeddings to generate vector representations of companies and then use these vectors to find companies based on what they do.

The goal of this tutorial is not to develop the best possible method but to explore whether embedding company parameters can generate something useful.

I also compared this approach with a keyword search to see how it performs. It was quite interesting to see the results side by side and judge.

## Install OpenSearch

We will use OpenSearch to store the data and search for companies, and the opensearch-py client library (imported as `opensearchpy`) to interact with it.
To make this more reproducible, we will run OpenSearch in Docker.

In [1]:
!docker pull opensearchproject/opensearch:2

2: Pulling from opensearchproject/opensearch
Digest: sha256:1a6d62f4ff2215f66792362a56c64cea29ff6cbe700f95881857f045a6fea3c9
Status: Image is up to date for opensearchproject/opensearch:2
docker.io/opensearchproject/opensearch:2


A lot of the steps in this notebook are manual and can be automated. However, I wanted to keep it simple and show the steps involved in setting up the system.

In [2]:
!docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=opensearch-ABN-ameer-2024!" opensearchproject/opensearch:2

3c9209c06805ce2c51cc34a62d71beec941366fe86d6c53ad68082b723660a3d
docker: Error response from daemon: driver failed programming external connectivity on endpoint peaceful_jepsen (9c9672b940fa799a1e65d1e15f1567e1e9732a1c4c45cf5aef4cff5d0b7d80b7): Bind for 0.0.0.0:9600 failed: port is already allocated.


## Install the required libraries

In [3]:
!pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


# Create index

In [4]:
from opensearchpy import OpenSearch

The client below skips certificate verification and suppresses the related warnings, so we can query OpenSearch without worrying about SSL and certificate issues. This is only appropriate for local development environments.

In [5]:
client = OpenSearch(
    'https://localhost:9200',
    http_auth=('admin', 'opensearch-ABN-ameer-2024!'),
    use_ssl=True,
    verify_certs=False,         # accept the self-signed demo certificate
    ssl_assert_hostname=False,
    ssl_show_warn=False,        # silence the insecure-connection warning
)
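The container can take a little while to start accepting connections. Here is a minimal sketch of a retry loop, assuming the client's `ping()` method is used as the readiness check. The helper name `wait_until_ready` and its defaults are my own and not part of the original notebook.

```python
import time


def wait_until_ready(ping, attempts=30, delay_seconds=2.0, sleep=time.sleep):
    """Call ping() until it returns True or the attempts run out.

    Returns True if the service answered, False otherwise.
    """
    for attempt in range(attempts):
        try:
            if ping():
                return True
        except Exception:
            # Connection errors are expected while the node is still booting.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Usage with the client defined above (requires the container to be running):
# wait_until_ready(client.ping)
```

Passing the check as a callable keeps the helper independent of OpenSearch, so it can be tried out without a running cluster.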

In [6]:
client.indices.create(index='abn', ignore=400)

{'error': {'root_cause': [{'type': 'resource_already_exists_exception',
    'reason': 'index [abn/9OaUcTDqRl2oVjSCQ6uUEA] already exists',
    'index': 'abn',
    'index_uuid': '9OaUcTDqRl2oVjSCQ6uUEA'}],
  'type': 'resource_already_exists_exception',
  'reason': 'index [abn/9OaUcTDqRl2oVjSCQ6uUEA] already exists',
  'index': 'abn',
  'index_uuid': '9OaUcTDqRl2oVjSCQ6uUEA'},
 'status': 400}

Double-check that we can search the index.

In [7]:
client.search(index='abn')

{'took': 6,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 0, 'relation': 'eq'},
  'max_score': None,
  'hits': []}}

## Downloading the data

The data is extracted from the ABN bulk data and is stored in the test_data folder. For the sake of simplicity, I have only extracted the company name and the state of the company. I also kept the data small to make it easier to work with. A sample of the data is available in the test_data folder.

If you would like to test with more data, you can grab the entire dataset from the ABN website and extract it to the test_data folder.

## Preparing the embedding infrastructure

Choosing which model to use can be a daunting task, so I simply picked the best model in the benchmark that has a nice memory footprint and can be used on a local machine. I chose the e5-small-v2 model. I have used it before, and it was easy to download and use directly from within Sentence Transformers. Feel free to use any other model if you prefer.

In [8]:
from sentence_transformers import SentenceTransformer

def get_encoder(model_name="intfloat/e5-small-v2"):
    return SentenceTransformer(model_name)



### Load the model

In [9]:
embedding_model = get_encoder("intfloat/e5-small-v2")

### Prepare opensearch settings

In [10]:
# Define the mapping for the index
index_settings = {
  "settings": {
    "index.knn": True
  }
}

index_mappings = {
    "properties": {
      "text-field": {
        "type": "text"
      },
      "company_embeddings": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      }
    }
}

# Start from a clean index; 404 is ignored in case the index does not exist yet.
client.indices.delete(index='abn', ignore=[400, 404])
client.indices.create(index='abn', ignore=400)

# Close the index while changing its settings and mapping, then reopen it.
client.indices.close(index='abn')
client.indices.put_settings(index='abn', body=index_settings)
client.indices.put_mapping(index='abn', body=index_mappings)
client.indices.open(index='abn')

# Inspect the result; the notebook displays the mapping returned by the last call.
client.indices.get_settings(index='abn')
client.indices.get_mapping(index='abn')

{'abn': {'mappings': {'properties': {'company_embeddings': {'type': 'knn_vector',
     'dimension': 384,
     'method': {'engine': 'lucene',
      'space_type': 'l2',
      'name': 'hnsw',
      'parameters': {}}},
    'text-field': {'type': 'text'}}}}}
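As an alternative to the close/update/open sequence, the same settings and mapping could probably be supplied when the index is created. The sketch below only builds the request body; the helper name `build_index_body` is hypothetical, and the field names and values are copied from the cell above.

```python
def build_index_body(dimension=384):
    """Return a create-index body combining the k-NN setting and the mapping."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "text-field": {"type": "text"},
                "company_embeddings": {
                    "type": "knn_vector",
                    # Must match the output size of the embedding model.
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "lucene",
                    },
                },
            }
        },
    }


index_body = build_index_body()

# With a running cluster, this would replace the create/close/put/open calls:
# client.indices.create(index="abn", body=index_body)
```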


### Ingest data

We will walk through the test_data folder and parse the data. We will then encode the company name and store it in the OpenSearch index.

Note that OpenSearch provides the ability to generate embeddings as part of its pipeline, but for the sake of simplicity, I am generating the embeddings before storing the data in OpenSearch.
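One detail worth noting before ingesting: the index mapping uses `space_type: l2`, while the ingestion cell encodes with `normalize_embeddings=True`. For unit-length vectors the squared L2 distance equals `2 - 2 * cosine_similarity`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The small pure-Python check below uses made-up two-dimensional vectors purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for real 384-dimensional embeddings.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
assert math.isclose(squared_l2(a, b), 2 - 2 * cosine_similarity(a, b))
```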

In [11]:
from abn_bulk_data_parser import walk_and_parse

In [None]:
for company in walk_and_parse("test_data"):
    # Encode the company name into a unit-length embedding vector.
    company['company_embeddings'] = embedding_model.encode(
        company['company_name'], normalize_embeddings=True
    ).tolist()

    # Ideally we would index in batches, but for simplicity we index one document at a time.
    response = client.index(index="abn", body=company)
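For larger datasets, indexing one document at a time becomes slow. Below is a rough sketch of a batched variant, assuming the `helpers.bulk` function from opensearch-py and the same `company_name` field as above. The helper names (`batched`, `make_actions`, `bulk_ingest`) and the batch size are my own choices, not something taken from the original notebook.

```python
def batched(items, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def make_actions(companies, encode_batch, index_name="abn", batch_size=64):
    """Encode company names in batches and yield bulk-index actions."""
    for batch in batched(companies, batch_size):
        vectors = encode_batch([c["company_name"] for c in batch])
        for company, vector in zip(batch, vectors):
            yield {
                "_index": index_name,
                # Copy the record so the caller's dictionaries are not modified.
                "_source": {**company, "company_embeddings": list(vector)},
            }


def bulk_ingest(client, companies, model, **kwargs):
    """Send all companies to OpenSearch with the bulk helper."""
    # Imported here so the helpers above can be used without opensearch-py installed.
    from opensearchpy import helpers

    def encode_batch(names):
        return model.encode(names, normalize_embeddings=True).tolist()

    return helpers.bulk(client, make_actions(companies, encode_batch, **kwargs))


# Usage with the objects defined earlier in the notebook:
# bulk_ingest(client, walk_and_parse("test_data"), embedding_model)
```

Encoding a list of names in one call lets the model process them together, which is usually faster than encoding each name separately.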

In [None]:
client.search_template(index='abn',body=
{
  "source": { 
            "query":{"match":{"company_name": "{{company_name}}"}}},
  "params": {
    "company_name": "TESTER"
  }
}
)

### Preparing some search templates

In [306]:
client.put_script('company_keyword_search_template',
{
  "script": {
   "lang": "mustache",
   "source": { 
            "query":{"match":{"company_name": "{{company_name}}"}}},
   "params": {
    "company_name": "TESTER"
   }
  }
})

{'acknowledged': True}
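Once a template is stored, it can be referenced by its id instead of sending the full source each time. Here is a small sketch of building such a request; the helper name `stored_template_request` is hypothetical and the query term is just an example.

```python
def stored_template_request(template_id, **params):
    """Build a search_template body that refers to a stored template by id."""
    return {"id": template_id, "params": params}


body = stored_template_request(
    "company_keyword_search_template", company_name="catering"
)

# With a running cluster:
# response = client.search_template(index="abn", body=body)
```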

In [None]:
client.put_script('company_knn_search_template',
{
  "script": {
    "lang": "mustache",
    "source": {
      "from": "{{from}}{{^from}}0{{/from}}",
      "size": "{{size}}{{^size}}10{{/size}}",
      "query": {
            "knn": {
                "company_embeddings": {
                    "vector": "{{query_embeddings}}",
                    "k": 10
                }
            }
        }
    },
    "params": {
      "query_embeddings": []
    }
  }
})
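Mustache substitutes `{{query_embeddings}}` as text, and I have not verified how a list parameter is rendered inside the quoted placeholder. While experimenting, it may be simpler to send the k-NN query directly. The sketch below builds the query body; the helper name `build_knn_query` and its defaults are assumptions of mine.

```python
def build_knn_query(query_vector, k=10, size=10, field="company_embeddings"):
    """Build a k-NN search body for the given query embedding."""
    return {
        "size": size,
        "query": {
            "knn": {
                field: {
                    # Plain Python floats serialise cleanly to JSON.
                    "vector": [float(x) for x in query_vector],
                    "k": k,
                }
            }
        },
    }


# With the model and client from earlier cells:
# query_vector = embedding_model.encode("catering", normalize_embeddings=True).tolist()
# response = client.search(index="abn", body=build_knn_query(query_vector))
```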


# Building the app
I have created a simple Streamlit app to compare the search results from the keyword search and the embedding search. The app is available in the st_compare_app.py file. The command below launches the app.

In [None]:
!streamlit run st_compare_app.py

## Search examples

Try the following queries, which I used and found produced interesting results:
- Religious organisations
- Hot Coals Catering
- Catering
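Outside the Streamlit app, the two result lists can also be compared programmatically. The sketch below is not taken from `st_compare_app.py`; it is a hypothetical helper that assumes both responses have the usual `hits.hits[]._source.company_name` shape.

```python
def hit_names(response, field="company_name"):
    """Extract one field from every hit in a search response."""
    return [hit["_source"][field] for hit in response["hits"]["hits"]]


def compare_results(keyword_response, knn_response):
    """Split company names by which search returned them, keeping rank order."""
    keyword_names = hit_names(keyword_response)
    knn_names = hit_names(knn_response)
    keyword_set = set(keyword_names)
    knn_set = set(knn_names)
    return {
        "both": [n for n in keyword_names if n in knn_set],
        "keyword_only": [n for n in keyword_names if n not in knn_set],
        "knn_only": [n for n in knn_names if n not in keyword_set],
    }


# Usage, given two responses obtained from client.search or client.search_template:
# compare_results(keyword_response, knn_response)
```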

# Conclusion
The embedding search is quite interesting and can find similar companies based on what they do, while the keyword search is useful for finding companies by name. The embedding search could be refined further with a more sophisticated model and more data. The keyword search could be improved by making fuller use of the text-search features that engines such as Elasticsearch or OpenSearch provide.

One thing I would find interesting is whether we could obtain a description of each company (for example, by linking to another dataset) and use that as the search query. This could be quite useful for finding companies that do similar things.