# This example uses Azure OpenAI Service embeddings to search for documents


The underlying idea is outlined here: 
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/understand-embeddings 

The same page explains why cosine similarity is used: simply counting the words that two documents share does not scale, because larger documents are likely to share more words even when their topics are completely unrelated.
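
As a small illustration of the cosine similarity measure mentioned above, here is a minimal sketch with hand-made toy vectors (not real embeddings). The helper `cosine_similarity` is defined locally for illustration; it shows that the score depends only on the angle between two vectors, not on their length.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc              # same direction, ten times the magnitude
other_doc = np.array([0.0, 0.0, 3.0])  # orthogonal direction

same_direction = cosine_similarity(short_doc, long_doc)
orthogonal = cosine_similarity(short_doc, other_doc)
print(same_direction, orthogonal)
```

The two vectors pointing in the same direction score (approximately) 1 regardless of magnitude, while the orthogonal pair scores 0.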

This notebook closely follows this tutorial:
https://learn.microsoft.com/en-us/azure/cognitive-services/openai/tutorials/embeddings

Create the `data/clean/csv` folder if it does not exist, then run the following command inside it to download the data used here:
```bash
curl "https://raw.githubusercontent.com/Azure-Samples/Azure-OpenAI-Docs-Samples/main/Samples/Tutorials/Embeddings/data/bill_sum_data.csv" --output bill_sum_data.csv
```



In [5]:
from helpers import get_env
import openai

API_KEY, RESOURCE_ENDPOINT = get_env("azure-openai")
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"
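
The `helpers` module is not shown in this notebook. As a rough idea of what `get_env` might do, here is a hypothetical stand-in that reads the key and endpoint from environment variables. The variable names (`AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`) are assumptions made for this sketch; the real helper may use different names or a `.env` file.

```python
import os

def get_env(service: str):
    """Hypothetical stand-in for helpers.get_env: return (api_key, endpoint).

    Assumes environment variables named <SERVICE>_API_KEY and
    <SERVICE>_ENDPOINT, e.g. AZURE_OPENAI_API_KEY for "azure-openai".
    """
    prefix = service.upper().replace("-", "_")
    try:
        return os.environ[f"{prefix}_API_KEY"], os.environ[f"{prefix}_ENDPOINT"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Environment variable {missing} is not set") from None
```

Keeping credentials in the environment rather than in the notebook avoids committing secrets along with the cell outputs.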

In [6]:
import os
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity  # requires openai<1.0 (module removed in 1.x)
import tiktoken
from num2words import num2words



In [7]:
df=pd.read_csv(os.path.join(os.getcwd(),'data/clean/csv/bill_sum_data.csv'))
df

| | Unnamed: 0 | bill_id | text | summary | title | text_len | sum_len |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 110_hr37 | SECTION 1. SHORT TITLE.\n\n This Act may be... | National Science Education Tax Incentive for B... | To amend the Internal Revenue Code of 1986 to ... | 8494 | 321 |
| 1 | 1 | 112_hr2873 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Small Business Expansion and Hiring Act of 201... | To amend the Internal Revenue Code of 1986 to ... | 6522 | 1424 |
| 2 | 2 | 109_s2408 | SECTION 1. RELEASE OF DOCUMENTS CAPTURED IN IR... | Requires the Director of National Intelligence... | A bill to require the Director of National Int... | 6154 | 463 |
| 3 | 3 | 108_s1899 | SECTION 1. SHORT TITLE.\n\n This Act may be... | National Cancer Act of 2003 - Amends the Publi... | A bill to improve data collection and dissemin... | 19853 | 1400 |
| 4 | 4 | 107_s1531 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Military Call-up Relief Act - Amends the Inter... | A bill to amend the Internal Revenue Code of 1... | 6273 | 278 |
| 5 | 5 | 107_hr4541 | SECTION 1. RELIQUIDATION OF CERTAIN ENTRIES PR... | Requires the Customs Service to reliquidate ce... | To provide for reliquidation of entries premat... | 11691 | 114 |
| 6 | 6 | 111_s1495 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Service Dogs for Veterans Act of 2009 - Direct... | A bill to require the Secretary of Veterans Af... | 5328 | 379 |
| 7 | 7 | 111_s3885 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Race to the Top Act of 2010 - Directs the Secr... | A bill to provide incentives for States and lo... | 16668 | 1525 |
| 8 | 8 | 113_hr1796 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Troop Talent Act of 2013 - Directs the Secreta... | Troop Talent Act of 2013 | 15352 | 2151 |
| 9 | 9 | 103_hr1987 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Taxpayer's Right to View Act of 1993 - Amends ... | Taxpayer's Right to View Act of 1993 | 5633 | 894 |


In [8]:
df_bills = df[['text', 'summary', 'title']].copy()  # copy so later column assignments do not act on a view of df
df_bills

| | text | summary | title |
|---|---|---|---|
| 0 | SECTION 1. SHORT TITLE.\n\n This Act may be... | National Science Education Tax Incentive for B... | To amend the Internal Revenue Code of 1986 to ... |
| 1 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Small Business Expansion and Hiring Act of 201... | To amend the Internal Revenue Code of 1986 to ... |
| 2 | SECTION 1. RELEASE OF DOCUMENTS CAPTURED IN IR... | Requires the Director of National Intelligence... | A bill to require the Director of National Int... |
| 3 | SECTION 1. SHORT TITLE.\n\n This Act may be... | National Cancer Act of 2003 - Amends the Publi... | A bill to improve data collection and dissemin... |
| 4 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Military Call-up Relief Act - Amends the Inter... | A bill to amend the Internal Revenue Code of 1... |
| 5 | SECTION 1. RELIQUIDATION OF CERTAIN ENTRIES PR... | Requires the Customs Service to reliquidate ce... | To provide for reliquidation of entries premat... |
| 6 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Service Dogs for Veterans Act of 2009 - Direct... | A bill to require the Secretary of Veterans Af... |
| 7 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Race to the Top Act of 2010 - Directs the Secr... | A bill to provide incentives for States and lo... |
| 8 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Troop Talent Act of 2013 - Directs the Secreta... | Troop Talent Act of 2013 |
| 9 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Taxpayer's Right to View Act of 1993 - Amends ... | Taxpayer's Right to View Act of 1993 |


In [9]:
# Silence pandas' SettingWithCopyWarning for the column assignments below; see
# https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
pd.options.mode.chained_assignment = None

from helpers import normalize_text

df_bills['text'] = df_bills['text'].apply(normalize_text)
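
`normalize_text` also comes from the local `helpers` module, which is not shown. Judging only from the search output further down, where the `\n\n` runs in the text have become single spaces, a minimal hypothetical stand-in could collapse whitespace as sketched here. The real helper may do more cleaning than this.

```python
import re

def normalize_text(s: str) -> str:
    """Hypothetical stand-in for helpers.normalize_text."""
    # Collapse newlines, tabs and repeated spaces into single spaces.
    s = re.sub(r"\s+", " ", s)
    return s.strip()

raw = "SECTION 1. SHORT TITLE.\n\n    This Act may be cited"
cleaned = normalize_text(raw)
print(cleaned)  # SECTION 1. SHORT TITLE. This Act may be cited
```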

In [10]:
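# cl100k_base is the encoding used by text-embedding-ada-002
tokenizer = tiktoken.get_encoding("cl100k_base")
df_bills['n_tokens'] = df_bills["text"].apply(lambda x: len(tokenizer.encode(x)))
# Keep only documents that fit within the model's input token limit
df_bills = df_bills[df_bills.n_tokens < 8192]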
len(df_bills)

20
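
The filtering step above can be tried without `tiktoken` using a stand-in token counter. This is only a sketch: `count_tokens` is a hypothetical helper that splits on whitespace, so its counts will not match the real tokenizer. It also shows `.copy()` as an alternative to silencing the chained-assignment warning.

```python
import pandas as pd

MAX_TOKENS = 8192  # same threshold as the filter above

def count_tokens(text: str) -> int:
    # Stand-in for len(tokenizer.encode(text)); whitespace splitting is only
    # a rough proxy and will not match tiktoken's counts.
    return len(text.split())

docs = pd.DataFrame({"text": ["short bill text", "word " * 9000]})
docs["n_tokens"] = docs["text"].apply(count_tokens)

# .copy() makes the filtered frame independent of docs, so the later column
# assignment does not trigger pandas' SettingWithCopyWarning.
kept = docs[docs["n_tokens"] < MAX_TOKENS].copy()
kept["checked"] = True
print(len(kept))
```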

In [11]:
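# Computes one embedding per document via the API. Commented out here, but
# search_docs below needs the resulting 'ada_v2' column.
# df_bills['ada_v2'] = df_bills["text"].apply(lambda x : get_embedding(x, engine = 'text-embedding-ada-002'))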

In [15]:
# save df_bills to csv for later use, same path as above
df_bills.to_csv(os.path.join(os.getcwd(),'data/clean/csv/bill_sum_data_ada_v2.csv'))
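
One thing to watch when reusing the saved file: if it contains an embedding column such as `ada_v2`, CSV stores each list as text, so the vectors come back as strings. A minimal sketch of the round trip, using toy vectors and an in-memory buffer instead of the real file:

```python
import ast
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "title": ["doc a", "doc b"],
    "ada_v2": [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]],  # toy vectors, not real embeddings
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # list cells are written as their string form
buffer.seek(0)

loaded = pd.read_csv(buffer)
type_after_load = type(loaded["ada_v2"][0])  # str, not list

# Parse each string back into a NumPy array before computing similarities.
loaded["ada_v2"] = loaded["ada_v2"].apply(lambda s: np.array(ast.literal_eval(s)))
print(type_after_load, loaded["ada_v2"][0])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.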

In [13]:
# Rank the documents by cosine similarity between their embeddings and the embedded user query
def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        engine="text-embedding-ada-002"
    )
    df["similarities"] = df.ada_v2.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res


res = search_docs(df_bills, "Can I get information on cable company tax revenue?", top_n=4)

| | text | summary | title | n_tokens | ada_v2 | similarities |
|---|---|---|---|---|---|---|
| 9 | SECTION 1. SHORT TITLE. This Act may be cited ... | Taxpayer's Right to View Act of 1993 - Amends ... | Taxpayer's Right to View Act of 1993 | 947 | [-0.018460797145962715, -0.024794351309537888,... | 0.819775 |
| 1 | SECTION 1. SHORT TITLE. This Act may be cited ... | Small Business Expansion and Hiring Act of 201... | To amend the Internal Revenue Code of 1986 to ... | 1183 | [-0.041596777737140656, -0.009042778052389622,... | 0.734921 |
| 11 | SECTION 1. SHORT TITLE. This Act may be cited ... | Wall Street Compensation Reform Act of 2010 - ... | A bill to amend the Internal Revenue Code of 1... | 2331 | [-0.047416988760232925, -0.007824325934052467,... | 0.734551 |
| 19 | SECTION 1. SHORT TITLE. This Act may be cited ... | Strengthening the Health Care Safety Net Act o... | To amend title XIX of the Social Security Act ... | 2678 | [-0.017817417159676552, -0.007099970709532499,... | 0.733241 |
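
The ranking logic in `search_docs` can be tried offline, without calling the embedding API. The sketch below is an assumption-laden illustration: `rank_by_similarity` is a hypothetical helper, and the two-dimensional vectors are hand-made rather than real embeddings.

```python
import numpy as np
import pandas as pd

def rank_by_similarity(df, query_vector, top_n=3, column="ada_v2"):
    """Return the top_n rows of df most similar to query_vector (cosine)."""
    query = np.asarray(query_vector, dtype=float)
    matrix = np.array(df[column].tolist(), dtype=float)  # shape: (n_docs, dim)
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # assign() returns a new frame, so the caller's DataFrame is left untouched.
    ranked = df.assign(similarities=scores).sort_values("similarities", ascending=False)
    return ranked.head(top_n)

toy = pd.DataFrame({
    "title": ["cable tax", "health care", "small business"],
    "ada_v2": [[0.9, 0.1], [0.0, 1.0], [0.6, 0.6]],  # hand-made 2-D vectors
})
top = rank_by_similarity(toy, query_vector=[1.0, 0.0], top_n=2)
print(top[["title", "similarities"]])
```

Unlike `search_docs`, this variant does not add a `similarities` column to the input frame, and it computes all scores in one matrix operation instead of one `apply` call per row.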


In [14]:
res["summary"][9]

"Taxpayer's Right to View Act of 1993 - Amends the Communications Act of 1934 to prohibit a cable operator from assessing separate charges for any video programming of a sporting, theatrical, or other entertainment event if that event is performed at a facility constructed, renovated, or maintained with tax revenues or by an organization that receives public financial support. Authorizes the Federal Communications Commission and local franchising authorities to make determinations concerning the applicability of such prohibition. Sets forth conditions under which a facility is considered to have been constructed, maintained, or renovated with tax revenues. Considers events performed by nonprofit or public organizations that receive tax subsidies to be subject to this Act if the event is sponsored by, or includes the participation of a team that is part of, a tax exempt organization."