## Explore Azure OpenAI Service embeddings and document search

#### This notebook walks through using the Azure OpenAI embeddings API to perform document search: a knowledge base is queried to find the most relevant document.

#### This notebook will cover:
- Installing Azure OpenAI
- Preparing the data for analysis
- Accessing Azure OpenAI with the endpoint and key
- Using the `text-search-curie-doc-001` and `text-search-curie-query-001` models
- Using [Cosine Similarity](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/understand-embeddings#cosine-similarity) to rank search results

##### Prerequisites
- Have access to an Azure OpenAI service endpoint and key
- Have deployed `text-search-curie-doc-001` and `text-search-curie-query-001` models in the resource
- Python 3.7.1 or later

##### Setup

The requirements file can be found [here](https://raw.githubusercontent.com/microsoft/OpenAIWorkshop/main/scenarios/powerapp_and_python/python/embeddings/requirements.txt). Right-click and select `Save As` to download the file. If conda is installed, run the following command to create a new environment:
```bash
conda create --name openaiembeddings --file requirements.txt --channel conda-forge python=3.11
```

Run the following command to activate the newly created environment:
```bash
conda activate openaiembeddings
```
Be sure to select this environment as the kernel for your notebook in your IDE.

If conda isn't installed, run the following command to install the appropriate libraries:
```bash
pip install -r /path/to/requirements.txt
```

##### Import libraries and list models

Update the <code>API_KEY</code> and <code>RESOURCE_ENDPOINT</code> values to match your resource. Both can be found by navigating to the Azure OpenAI service in the Azure Portal, where they are listed under `Keys and Endpoints`.
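Rather than pasting the key into the notebook, the two values can also be read from environment variables. The sketch below is one possible approach and is not part of the original notebook; the variable names `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` and the helper `load_settings` are assumptions chosen for illustration.

```python
import os

def load_settings(env=os.environ):
    """Read the key and endpoint from a mapping (the process environment by default).

    The variable names are hypothetical; use whatever names you export.
    """
    api_key = env.get("AZURE_OPENAI_API_KEY", "")
    # Strip a trailing slash so that appending "/openai/..." never yields "//".
    endpoint = env.get("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
    if not api_key or not endpoint:
        raise ValueError("Set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT first")
    return api_key, endpoint

# Demonstration with an explicit mapping instead of the real environment.
demo_env = {
    "AZURE_OPENAI_API_KEY": "example-key",
    "AZURE_OPENAI_ENDPOINT": "https://example-resource.openai.azure.com/",
}
API_KEY, RESOURCE_ENDPOINT = load_settings(demo_env)
print(RESOURCE_ENDPOINT)
```

In the notebook itself, calling `load_settings()` with no argument would read from the real environment.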

The cell below prints the models that are deployed in the corresponding Azure OpenAI resource. Ensure that both a `text-search-curie-doc-001` and a `text-search-curie-query-001` model are available, and note whether they have different names under the <code>id</code> parameter.

In [47]:
import openai
import re
import requests
import pandas as pd
from openai.embeddings_utils import get_embedding, cosine_similarity
from transformers import GPT2TokenizerFast

API_KEY = ""
RESOURCE_ENDPOINT = "" 

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"

url = openai.api_base + "/openai/deployments?api-version=2022-12-01"

r = requests.get(url, headers={"api-key": API_KEY})

print(r.text)

{
  "data": [
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "text-search-curie-doc-001",
      "owner": "organization-owner",
      "id": "text-search-curie-doc-001",
      "status": "succeeded",
      "created_at": 1679928006,
      "updated_at": 1679928006,
      "object": "deployment"
    },
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "text-search-curie-query-001",
      "owner": "organization-owner",
      "id": "text-search-curie-query-001",
      "status": "succeeded",
      "created_at": 1679928033,
      "updated_at": 1679928033,
      "object": "deployment"
    },
    {
      "scale_settings": {
        "scale_type": "standard"
      },
      "model": "text-davinci-002",
      "owner": "organization-owner",
      "id": "davinci",
      "status": "succeeded",
      "created_at": 1680015024,
      "updated_at": 1680015024,
      "object": "deployment"
    }
  ],
  "object": "list"
}
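The check described above can also be done programmatically. The following sketch parses an abridged copy of the response and reports which deployment `id` belongs to each required model; the helper name `deployment_ids_by_model` is hypothetical, and in the notebook the input would be `r.text` rather than a literal string.

```python
import json

# Abridged copy of the response printed above (only the fields used here).
response_text = """
{
  "data": [
    {"model": "text-search-curie-doc-001", "id": "text-search-curie-doc-001", "status": "succeeded"},
    {"model": "text-search-curie-query-001", "id": "text-search-curie-query-001", "status": "succeeded"},
    {"model": "text-davinci-002", "id": "davinci", "status": "succeeded"}
  ],
  "object": "list"
}
"""

REQUIRED_MODELS = ["text-search-curie-doc-001", "text-search-curie-query-001"]

def deployment_ids_by_model(text):
    """Map each successfully deployed model name to its deployment id."""
    deployments = json.loads(text)["data"]
    return {d["model"]: d["id"] for d in deployments if d["status"] == "succeeded"}

ids = deployment_ids_by_model(response_text)
missing = [m for m in REQUIRED_MODELS if m not in ids]
print("missing models:", missing)
for model in REQUIRED_MODELS:
    if model in ids:
        # The id is the value to pass as `engine` later in the notebook.
        print(model, "->", ids[model])
```

Note how the `text-davinci-002` model is deployed under the id `davinci`: the model name and the deployment id do not have to match.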


##### Read in Data

Read in the CSV file and create a pandas DataFrame. The initial CSV has more columns than needed for the tutorial. The <code>usecols</code> parameter can be used to specify which columns to keep.

In [48]:
df_bills = pd.read_csv("https://raw.githubusercontent.com/microsoft/OpenAIWorkshop/main/scenarios/powerapp_and_python/python/embeddings/bill_sum_data.csv", usecols=['text', 'summary', 'title'])
df_bills

Unnamed: 0,text,summary,title
0,SECTION 1. SHORT TITLE.\n\n This Act may be...,National Science Education Tax Incentive for B...,To amend the Internal Revenue Code of 1986 to ...
1,SECTION 1. SHORT TITLE.\n\n This Act may be...,Small Business Expansion and Hiring Act of 201...,To amend the Internal Revenue Code of 1986 to ...
2,SECTION 1. RELEASE OF DOCUMENTS CAPTURED IN IR...,Requires the Director of National Intelligence...,A bill to require the Director of National Int...
3,SECTION 1. SHORT TITLE.\n\n This Act may be...,National Cancer Act of 2003 - Amends the Publi...,A bill to improve data collection and dissemin...
4,SECTION 1. SHORT TITLE.\n\n This Act may be...,Military Call-up Relief Act - Amends the Inter...,A bill to amend the Internal Revenue Code of 1...
5,SECTION 1. RELIQUIDATION OF CERTAIN ENTRIES PR...,Requires the Customs Service to reliquidate ce...,To provide for reliquidation of entries premat...
6,SECTION 1. SHORT TITLE.\n\n This Act may be...,Service Dogs for Veterans Act of 2009 - Direct...,A bill to require the Secretary of Veterans Af...
7,SECTION 1. SHORT TITLE.\n\n This Act may be...,Race to the Top Act of 2010 - Directs the Secr...,A bill to provide incentives for States and lo...
8,SECTION 1. SHORT TITLE.\n\n This Act may be...,Troop Talent Act of 2013 - Directs the Secreta...,Troop Talent Act of 2013
9,SECTION 1. SHORT TITLE.\n\n This Act may be...,Taxpayer's Right to View Act of 1993 - Amends ...,Taxpayer's Right to View Act of 1993


##### Data Cleaning

Some light data cleaning is performed to remove redundant whitespace and clean up the punctuation, preparing the data for tokenization.

In [49]:
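# s is the input text; sep_token is accepted but not used
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # NOTE: the "." is unescaped, so this removes any character followed by " ,"
    s = re.sub(r". ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    # newlines were already collapsed above; kept as a safeguard
    s = s.replace("\n", "")
    s = s.strip()
    
    return s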

df_bills['text'] = df_bills["text"].apply(lambda x : normalize_text(x))
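The pattern `r". ,"` in `normalize_text` uses an unescaped period, which in a regular expression matches any character, so the call also deletes the character in front of a stray ` ,`. The sketch below contrasts that behavior with a hypothetical variant, `normalize_text_escaped`, that escapes the period. It is an illustration only; the rest of this notebook keeps using the original function.

```python
import re

def normalize_text_original(s):
    # Same steps as normalize_text above, minus the redundant newline removal.
    s = re.sub(r"\s+", " ", s).strip()
    s = re.sub(r". ,", "", s)      # "." matches ANY character here
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    return s.strip()

def normalize_text_escaped(s):
    # Hypothetical variant: only a literal ". ," is removed.
    s = re.sub(r"\s+", " ", s).strip()
    s = re.sub(r"\. ,", "", s)
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    return s.strip()

# Made-up sample text with extra whitespace and stray punctuation.
sample = "SECTION 1.  SHORT TITLE.\n\n This Act , as amended.. applies . ."
print(normalize_text_original(sample))
print(normalize_text_escaped(sample))
```

With the original pattern the word "Act" loses its final letter, while the escaped variant leaves it intact.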

Any rows that are too long for the token limit (~2000 tokens) are removed. 

In [50]:
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
df_bills['n_tokens'] = df_bills["text"].apply(lambda x: len(tokenizer.encode(x)))
df_bills = df_bills[df_bills.n_tokens<2000].copy()
len(df_bills)

Token indices sequence length is longer than the specified maximum sequence length for this model (1480 > 1024). Running this sequence through the model will result in indexing errors


12

After removing the rows that exceed the token limit, 12 rows remain. The DataFrame below shows them; note that they retain their original index labels. The new `n_tokens` column was added in the previous cell.

In [51]:
df_bills

Unnamed: 0,text,summary,title,n_tokens
0,SECTION 1. SHORT TITLE. This Act may be cited ...,National Science Education Tax Incentive for B...,To amend the Internal Revenue Code of 1986 to ...,1480
1,SECTION 1. SHORT TITLE. This Act may be cited ...,Small Business Expansion and Hiring Act of 201...,To amend the Internal Revenue Code of 1986 to ...,1152
2,SECTION 1. RELEASE OF DOCUMENTS CAPTURED IN IR...,Requires the Director of National Intelligence...,A bill to require the Director of National Int...,930
4,SECTION 1. SHORT TITLE. This Act may be cited ...,Military Call-up Relief Act - Amends the Inter...,A bill to amend the Internal Revenue Code of 1...,1048
5,SECTION 1. RELIQUIDATION OF CERTAIN ENTRIES PR...,Requires the Customs Service to reliquidate ce...,To provide for reliquidation of entries premat...,1846
6,SECTION 1. SHORT TITLE. This Act may be cited ...,Service Dogs for Veterans Act of 2009 - Direct...,A bill to require the Secretary of Veterans Af...,872
9,SECTION 1. SHORT TITLE. This Act may be cited ...,Taxpayer's Right to View Act of 1993 - Amends ...,Taxpayer's Right to View Act of 1993,946
12,SECTION 1. FINDINGS. The Congress finds the fo...,Amends the Marine Mammal Protection Act of 197...,To amend the Marine Mammal Protection Act of 1...,1223
14,SECTION 1. SHORT TITLE. This Act may be cited ...,Education and Training for Health Act of 2017 ...,Education and Training for Health Act of 2017,1596
16,SECTION 1. SHORT TITLE. This Act may be cited ...,Andrew Prior Act or Andrew's Law - Amends the ...,Andrew's Law,608
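The gaps in the index (labels 3, 7 and 8 are absent) come from the filter `df_bills.n_tokens<2000`, which drops rows without renumbering the ones that remain. A minimal sketch of the same filter on a plain dictionary follows; the count of 2500 for row 3 is a made-up placeholder, since the real value is not shown.

```python
TOKEN_LIMIT = 2000

# Index label -> token count. Rows 0, 1, 2 and 4 use the counts from the table above;
# the count for row 3 is a hypothetical placeholder above the limit.
n_tokens = {0: 1480, 1: 1152, 2: 930, 3: 2500, 4: 1048}

# Keep only the rows strictly below the limit; the labels are left untouched.
kept = {label: count for label, count in n_tokens.items() if count < TOKEN_LIMIT}
print(list(kept))
```

Because the labels are preserved, a later lookup such as `df_bills.text[0]` still refers to the original first row.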


The following command will return the tokenization result of the first row. 

In [53]:
understand_tokenization = tokenizer.tokenize(df_bills.text[0])
print(understand_tokenization)

['S', 'ECTION', 'Ġ1', '.', 'ĠSH', 'ORT', 'ĠTIT', 'LE', '.', 'ĠThis', 'ĠAct', 'Ġmay', 'Ġbe', 'Ġcited', 'Ġas', 'Ġthe', 'Ġ``', 'National', 'ĠScience', 'ĠEducation', 'ĠTax', 'ĠIn', 'cent', 'ive', 'Ġfor', 'ĠBusiness', 'es', 'ĠAct', 'Ġof', 'Ġ2007', "''.", 'ĠSEC', '.', 'Ġ2', '.', 'ĠCR', 'EDIT', 'S', 'ĠFOR', 'ĠC', 'ER', 'TAIN', 'ĠCONTR', 'IB', 'UT', 'IONS', 'ĠBEN', 'EF', 'IT', 'ING', 'ĠSC', 'IENCE', ',', 'ĠTECH', 'N', 'OLOGY', ',', 'ĠENG', 'INE', 'ER', 'ING', ',', 'ĠAND', 'ĠM', 'ATH', 'EM', 'AT', 'ICS', 'ĠED', 'UC', 'ATION', 'ĠAT', 'ĠTHE', 'ĠELE', 'MENT', 'ARY', 'ĠAND', 'ĠSEC', 'OND', 'ARY', 'ĠSCHOOL', 'ĠLEVEL', '.', 'Ġ(', 'a', ')', 'ĠIn', 'ĠGeneral', '.--', 'Sub', 'part', 'ĠD', 'Ġof', 'Ġpart', 'ĠIV', 'Ġof', 'Ġsub', 'chapter', 'ĠA', 'Ġof', 'Ġchapter', 'Ġ1', 'Ġof', 'Ġthe', 'ĠInternal', 'ĠRevenue', 'ĠCode', 'Ġof', 'Ġ1986', 'Ġ(', 'rel', 'ating', 'Ġto', 'Ġbusiness', 'Ġrelated', 'Ġcredits', ')', 'Ġis', 'Ġamended', 'Ġby', 'Ġadding', 'Ġat', 'Ġthe', 'Ġend', 'Ġthe', 'Ġfollowing', 'Ġnew', 'Ġsection', ':

The following command will return the length of the tokenization output in the above cell. Notice how it's the same as the value in the `n_tokens` column.

In [54]:
len(understand_tokenization)

1480
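In the token list printed earlier, the `Ġ` prefix is how the GPT-2 byte-level tokenizer marks a token that begins with a space. The sketch below rebuilds the start of the text from the first nine tokens of that output; `tokens_to_text` is a hypothetical helper that only handles the space marker, not the tokenizer's full byte mapping.

```python
# First nine tokens copied from the tokenization output above.
tokens = ['S', 'ECTION', 'Ġ1', '.', 'ĠSH', 'ORT', 'ĠTIT', 'LE', '.']

def tokens_to_text(tokens):
    """Join GPT-2 tokens, turning the 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")

text = tokens_to_text(tokens)
print(text)
print(len(tokens), "tokens for", len(text.split()), "words")
```

Upper-case legal headings split into many sub-word pieces, which is one reason token counts run well above word counts for these bills.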

##### Embedding

Before searching, the text documents must be embedded and the resulting embeddings saved. Each document is embedded using a doc model; this notebook uses the `text-search-curie-doc-001` model. The embeddings can be saved locally or stored in a database. As a result, each document has its corresponding embedding vector in the new `curie_search` column.

Note: If your model deployment is *not* named `text-search-curie-doc-001`, please update the <code>engine</code> parameter to the corresponding name.

In [55]:
df_bills['curie_search'] = df_bills["text"].apply(lambda x : get_embedding(x, engine = 'text-search-curie-doc-001'))
df_bills

Unnamed: 0,text,summary,title,n_tokens,curie_search
0,SECTION 1. SHORT TITLE. This Act may be cited ...,National Science Education Tax Incentive for B...,To amend the Internal Revenue Code of 1986 to ...,1480,"[-0.019770914688706398, 0.011169900186359882, ..."
1,SECTION 1. SHORT TITLE. This Act may be cited ...,Small Business Expansion and Hiring Act of 201...,To amend the Internal Revenue Code of 1986 to ...,1152,"[-0.007850012741982937, 0.01001765951514244, 0..."
2,SECTION 1. RELEASE OF DOCUMENTS CAPTURED IN IR...,Requires the Director of National Intelligence...,A bill to require the Director of National Int...,930,"[0.00012103027984267101, 0.011845593340694904,..."
4,SECTION 1. SHORT TITLE. This Act may be cited ...,Military Call-up Relief Act - Amends the Inter...,A bill to amend the Internal Revenue Code of 1...,1048,"[-0.005481021944433451, 0.00856819562613964, -..."
5,SECTION 1. RELIQUIDATION OF CERTAIN ENTRIES PR...,Requires the Customs Service to reliquidate ce...,To provide for reliquidation of entries premat...,1846,"[-0.008310390636324883, -0.004660653416067362,..."
6,SECTION 1. SHORT TITLE. This Act may be cited ...,Service Dogs for Veterans Act of 2009 - Direct...,A bill to require the Secretary of Veterans Af...,872,"[-0.017687108367681503, 0.011164870113134384, ..."
9,SECTION 1. SHORT TITLE. This Act may be cited ...,Taxpayer's Right to View Act of 1993 - Amends ...,Taxpayer's Right to View Act of 1993,946,"[0.0021867561154067516, -0.004219848196953535,..."
12,SECTION 1. FINDINGS. The Congress finds the fo...,Amends the Marine Mammal Protection Act of 197...,To amend the Marine Mammal Protection Act of 1...,1223,"[-0.015813011676073074, 0.009919906966388226, ..."
14,SECTION 1. SHORT TITLE. This Act may be cited ...,Education and Training for Health Act of 2017 ...,Education and Training for Health Act of 2017,1596,"[-0.0150684155523777, 0.005073960404843092, 0...."
16,SECTION 1. SHORT TITLE. This Act may be cited ...,Andrew Prior Act or Andrew's Law - Amends the ...,Andrew's Law,608,"[-0.011593054980039597, 0.022752899676561356, ..."
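Saving the embeddings avoids calling the API again for documents that have not changed. Below is a minimal sketch of a local round trip using JSON from the standard library; the file name `bill_embeddings.json` and the three-element vectors are placeholders, not real model output.

```python
import json
import os
import tempfile

# Placeholder vectors keyed by DataFrame index label (real embeddings are much longer).
embeddings = {
    0: [-0.0198, 0.0112, 0.0031],
    1: [-0.0079, 0.0100, 0.0025],
}

# Write to a temporary directory so the example leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "bill_embeddings.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(embeddings, f)

with open(path, encoding="utf-8") as f:
    # JSON object keys are always strings, so convert them back to integers.
    restored = {int(label): vector for label, vector in json.load(f).items()}

print(restored == embeddings)
```

In the notebook, a mapping of this shape could be obtained with `df_bills['curie_search'].to_dict()`.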


At search time, the search query is embedded using the corresponding query model, `text-search-curie-query-001`. Using cosine similarity, the closest embedding in the database (or in this case, the DataFrame) is found.
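Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The notebook imports `cosine_similarity` from `openai.embeddings_utils`; the pure-Python function below is a stand-in written for illustration, applied to toy two- and three-dimensional vectors.

```python
import math

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
diagonal = cosine([1.0, 0.0], [1.0, 1.0])                  # 45 degrees apart

print(round(same_direction, 6), round(orthogonal, 6), round(diagonal, 6))
```

Scaling a vector does not change the score, which is why the first pair scores as a perfect match even though one vector is twice as long.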

In this sample, the user provides the query: 
> "can I get information on cable company tax revenue" 
>
The query is passed through the <code>search_docs</code> function, which embeds the query with the corresponding query model and finds the closest embeddings among the previously embedded documents. This will return the top 4 results.

Note: If your model deployment is *not* named `text-search-curie-query-001`, please update the <code>engine</code> argument passed to <code>get_embedding</code> in the function below to the corresponding name.

In [56]:
# search the bills for the documents most similar to the user query
def search_docs(df, user_query, top_n=3):
    embedding = get_embedding(
        user_query,
        engine="text-search-curie-query-001"
    )
    df["similarities"] = df.curie_search.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    ).copy()
    return res


res = search_docs(df_bills, "can i get information on cable company tax revenue", top_n=4)
res

Unnamed: 0,text,summary,title,n_tokens,curie_search,similarities
9,SECTION 1. SHORT TITLE. This Act may be cited ...,Taxpayer's Right to View Act of 1993 - Amends ...,Taxpayer's Right to View Act of 1993,946,"[0.0021867561154067516, -0.004219848196953535,...",0.36327
0,SECTION 1. SHORT TITLE. This Act may be cited ...,National Science Education Tax Incentive for B...,To amend the Internal Revenue Code of 1986 to ...,1480,"[-0.019770914688706398, 0.011169900186359882, ...",0.314105
1,SECTION 1. SHORT TITLE. This Act may be cited ...,Small Business Expansion and Hiring Act of 201...,To amend the Internal Revenue Code of 1986 to ...,1152,"[-0.007850012741982937, 0.01001765951514244, 0...",0.297908
18,SECTION 1. SHORT TITLE. This Act may be cited ...,This measure has not been amended since it was...,Veterans Entrepreneurship Act of 2015,1404,"[-0.020315825939178467, 0.0011716989101842046,...",0.295586


The top result has a cosine similarity score of <code>0.36</code> between the query and document. Using this approach, embeddings can be used as a search mechanism across documents in a knowledge base.
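The ranking step in `search_docs` amounts to scoring every stored vector against the query vector and keeping the highest scores. Here is a sketch of that step without pandas, using `heapq.nlargest`; the labels and two-dimensional vectors are invented for the example, and `top_matches` is a hypothetical helper.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(doc_vectors, query_vector, top_n=3):
    """Return (label, score) pairs for the top_n most similar documents."""
    scored = ((label, cosine(vec, query_vector)) for label, vec in doc_vectors.items())
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[1])

# Toy vectors standing in for the `curie_search` column.
doc_vectors = {
    "bill_a": [1.0, 0.0],
    "bill_b": [0.6, 0.8],
    "bill_c": [0.0, 1.0],
}
query_vector = [0.8, 0.6]

for label, score in top_matches(doc_vectors, query_vector, top_n=2):
    print(label, round(score, 2))
```

The scores are only meaningful relative to each other: the ranking matters more than the absolute value of any single score.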

In [57]:
res["summary"][res.index[0]]

"Taxpayer's Right to View Act of 1993 - Amends the Communications Act of 1934 to prohibit a cable operator from assessing separate charges for any video programming of a sporting, theatrical, or other entertainment event if that event is performed at a facility constructed, renovated, or maintained with tax revenues or by an organization that receives public financial support. Authorizes the Federal Communications Commission and local franchising authorities to make determinations concerning the applicability of such prohibition. Sets forth conditions under which a facility is considered to have been constructed, maintained, or renovated with tax revenues. Considers events performed by nonprofit or public organizations that receive tax subsidies to be subject to this Act if the event is sponsored by, or includes the participation of a team that is part of, a tax exempt organization."