# Azure AI Search CSV integrated vectorization sample

This Python notebook demonstrates the [integrated vectorization](https://learn.microsoft.com/azure/search/vector-search-integrated-vectorization) and [CSV indexing](https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs) features of Azure AI Search that are currently in public preview. 

Integrated vectorization takes a dependency on indexers and skillsets and the AzureOpenAIEmbedding skill and your Azure OpenAI resorce for embedding.

This example uses a CSV from the `csv_data` folder for chunking, embedding, indexing, and queries.

### Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access).
 
+ Azure AI Search, any tier, but we recommend Basic or higher for this workload. [Enable semantic ranker](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run a hybrid query with semantic ranking.

+ A deployment of the `text-embedding-3-large` model on Azure OpenAI.

+ A deployment of the `gpt-4o` model on Azure OpenAI. 

+ Azure Blob Storage. This notebook connects to your storage account and loads a container with the sample CSV.


### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [1]:
! pip install -r indexers-requirements.txt --quiet


[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


### Load .env file (Copy .env-sample to .env and update accordingly)

In [2]:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os

load_dotenv(override=True) # take environment variables from .env.

# Variables not used here do not need to be updated in your .env file
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.getenv("AZURE_SEARCH_ADMIN_KEY")) if os.getenv("AZURE_SEARCH_ADMIN_KEY") else DefaultAzureCredential()
index_name = os.getenv("AZURE_SEARCH_INDEX", "csv-vec")
blob_connection_string = os.environ["BLOB_CONNECTION_STRING"]
# search blob datasource connection string is optional - defaults to blob connection string
# This field is only necessary if you are using MI to connect to the data source
# https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-credentials-and-connection-strings
search_blob_connection_string = os.getenv("SEARCH_BLOB_DATASOURCE_CONNECTION_STRING", blob_connection_string)
blob_container_name = os.getenv("BLOB_CONTAINER_NAME", "csv-vec")
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.getenv("AZURE_OPENAI_KEY")
azure_openai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
azure_openai_model_name = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL_NAME", "text-embedding-3-large")
azure_openai_model_dimensions = int(os.getenv("AZURE_OPENAI_EMBEDDING_DIMENSIONS", 1024))
# NOTE: The chat deployment should support tool use
# To learn more, please see
# https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-4-and-gpt-4-turbo-models
# https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-35
azure_openai_chat_deployment = os.getenv("AZURE_OPENAI_CHATGPT_DEPLOYMENT", "gpt-4o")
azure_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview")


## Connect to Blob Storage and load documents

Retrieve documents from Blob Storage. You can use the sample documents in the data/documents folder.  

In [3]:
from azure.storage.blob import BlobServiceClient  
import glob

def upload_sample_documents(
        blob_connection_string: str,
        blob_container_name: str,
        use_user_identity: bool = True
    ):
    # Connect to Blob Storage
    blob_service_client = BlobServiceClient.from_connection_string(conn_str=blob_connection_string, credential=DefaultAzureCredential() if use_user_identity else None)
    container_client = blob_service_client.get_container_client(blob_container_name)
    if not container_client.exists():
        container_client.create_container()

    documents_directory = "data"
    csv_files = glob.glob(os.path.join(documents_directory, '*.json'))
    for file in csv_files:
        with open(file, "rb") as data:
            name = os.path.basename(file)
            if not container_client.get_blob_client(name).exists():
                container_client.upload_blob(name=name, data=data)

upload_sample_documents(
    blob_connection_string=blob_connection_string,
    blob_container_name=blob_container_name,
    # Set to false if you want to use credentials included in the blob connection string
    # Otherwise your identity will be used as credentials
    use_user_identity=False
)
print(f"Setup sample data in {blob_container_name}")

Setup sample data in blogdemo


## Create a blob data source connector on Azure AI Search

In [4]:
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SoftDeleteColumnDeletionDetectionPolicy
)

# Create a data source
# NOTE: To remove records from a search index, add a column to the row "IsDeleted" set to "True". The next indexer run will remove this record
# To learn more please visit https://learn.microsoft.com/en-us/azure/search/search-howto-index-one-to-many-blobs
indexer_client = SearchIndexerClient(endpoint, credential)
container = SearchIndexerDataContainer(name=blob_container_name)
data_source_connection = SearchIndexerDataSourceConnection(
    name=f"{index_name}-blob",
    type="azureblob",
    connection_string=search_blob_connection_string,
    container=container,
    data_deletion_detection_policy=SoftDeleteColumnDeletionDetectionPolicy(soft_delete_column_name="IsDeleted", soft_delete_marker_value="True")
)
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

Data source 'json-demo-blob' created or updated


## Create a search index

Vector and nonvector content is stored in a search index.

In [7]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIParameters,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
    SearchIndex
)

# Create a search index
# NOTE: You must adjust these fields based on your CSV Schema.
# There is no chunking of the description or title fields in this sample.
# There is a separate AzureSearch_DocumentKey for the key automatically generated by the indexer
# Learn more at https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)  
fields = [  
    SearchField(name="AzureSearch_DocumentKey",  key=True, type=SearchFieldDataType.String),
    SearchField(name="CompanyName", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False),
    SearchField(name="performance", type=SearchFieldDataType.Double, sortable=True, filterable=True, facetable=False), 
    SearchField(name="prof", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False), 
    SearchField(name="json", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False), 
    SearchField(name="prof_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=azure_openai_model_dimensions, vector_search_profile_name="myHnswProfile"),
]  
  
# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnsw"),
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",  
            vectorizer="myOpenAI",  
        )
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            name="myOpenAI",  
            kind="azureOpenAI",  
            azure_open_ai_parameters=AzureOpenAIParameters(  
                resource_uri=azure_openai_endpoint,  
                deployment_id=azure_openai_embedding_deployment,
                model_name=azure_openai_model_name,
                api_key=azure_openai_key,
            ),
        ),  
    ],  
)  
  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="CompanyName"),
        content_fields=[SemanticField(field_name="prof")]  
    ),  
)

# Create the semantic search with the configuration  
semantic_search = SemanticSearch(configurations=[semantic_config])  
  
# Create the search index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")  


json-demo created


## Create a skillset

Skills drive integrated vectorization. [AzureOpenAIEmbedding](https://learn.microsoft.com/azure/search/cognitive-search-skill-azure-openai-embedding) handles calls to Azure OpenAI, using the connection information you provide in the environment variables.

In [8]:
from azure.search.documents.indexes.models import (
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    SearchIndexerSkillset
)

# Create a skillset  
skillset_name = f"{index_name}-skillset"
  
csv_comb_embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate comb embeddings via Azure OpenAI",  
    context="/document",  
    resource_uri=azure_openai_endpoint,  
    deployment_id=azure_openai_embedding_deployment,  
    model_name=azure_openai_model_name,
    dimensions=azure_openai_model_dimensions,
    api_key=azure_openai_key,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/prof"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="prof_vector"),  
    ],  
)

skills = [csv_comb_embedding_skill]

skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generating embeddings",  
    skills=skills
)
  
client = SearchIndexerClient(endpoint, credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")  


json-demo-skillset created


## Create an indexer

In [10]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping,
    FieldMappingFunction,
    IndexingParameters,
    IndexingParametersConfiguration,
    BlobIndexerParsingMode
)

# Create an indexer  
indexer_name = f"{index_name}-indexer"  
indexer_parameters = IndexingParameters(
        configuration=IndexingParametersConfiguration(
            parsing_mode='jsonArray',
            query_timeout=None,
            first_line_contains_headers=True))

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,
    parameters=indexer_parameters,
    field_mappings=[FieldMapping(source_field_name="AzureSearch_DocumentKey", target_field_name="AzureSearch_DocumentKey", mapping_function=FieldMappingFunction(name="base64Encode"))],
    output_field_mappings=[
        FieldMapping(source_field_name="/document/prof_vector", target_field_name="prof_vector"),
    ]
)  

indexer_client = SearchIndexerClient(endpoint, credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  
  
# Run the indexer  
indexer_client.run_indexer(indexer_name)  
print(f'{indexer_name} is created and running. If queries return no results, please wait a bit and try again.')  


json-demo-indexer is created and running. If queries return no results, please wait a bit and try again.


## Perform a hybrid search

This example shows a hybrid vector search using the vectorizable text query, all you need to do is pass in text and your vectorizer will handle the query vectorization.
Ask a zoo employment related question that can be answered just using the title and description fields

In [6]:
import json
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Pure Vector Search
# Note the target_company json below is also used in the next pane as the user message when sending the call to Azure Open AI
target_company = '[{"CompanyName":"Madeup Inc","State":"NSW","Industry":"Finance","Segment":"Banking","Product 1 Name":"Widget W","Product 1 Performance":85,"Product 2 Name":"","Product 2 Performance":"","Product 3 Name":"Widget Y","Product 3 Performance":74}]'
company_data = json.loads(target_company)[0]
query = f"{company_data['State']} {company_data['Industry']} {company_data['Segment']}"
print(f"Query: {query}")

search_client = SearchClient(endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="prof_vector")
# Use the below query to pass in the raw vector query instead of the query vectorization
# vector_query = RawVectorQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="JsonVector")
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["CompanyName", "prof", "performance", "json"],
    filter="performance ge 60",
    top=20,
    include_total_count=True
)  

print ('Total Documents Matching Query:', results.get_count())
print("-" * 50)  

# Initialize an empty list to store csv_json values
comp_profiles = []

for result in results:
    print(f"Score: {result['@search.score']}")  
    print(f"CompanyName: {result['CompanyName']}")  
    print(f"profile: {result['prof']}")  
    print(f"performance: {result['performance']}")
    print(f"json: {result['json']}")
    comp_profiles.append(result['json'])
    print("-" * 50)  



Query: NSW Finance Banking
Total Documents Matching Query: 404
--------------------------------------------------
Score: 0.03333333507180214
CompanyName: Sky Robotics
profile: NSW Finance Banking
performance: 61.25
json: {"CompanyName":"Sky Robotics","State":"NSW","Industry":"Finance","Segment":"Banking","Product 1 Name":"Widget W","Product 1 Performance":75.0,"Product 2 Name":"Widget X","Product 2 Performance":87.0,"Product 3 Name":null,"Product 3 Performance":null,"Product 4 Name":"Widget Z","Product 4 Performance":21.0,"Product 5 Name":null,"Product 5 Performance":null,"Product 6 Name":null,"Product 6 Performance":null,"Product 7 Name":null,"Product 7 Performance":null,"Product 8 Name":"Gizmo D","Product 8 Performance":62.0}
--------------------------------------------------
Score: 0.032786883413791656
CompanyName: Consulting Apex
profile: NSW Finance Banking
performance: 61.0
json: {"CompanyName":"Consulting Apex","State":"NSW","Industry":"Finance","Segment":"Banking","Product 1 Na

## Send top 20 results to Azure Open AI to make reccomendations

In [16]:
from openai import AzureOpenAI

### Create AOAI Client ###
chat_client = AzureOpenAI(
    api_key = azure_openai_key,  
    api_version=azure_openai_api_version,
    azure_endpoint = azure_openai_endpoint
    )

# Prepare the message content with the company profiles
message_content = "You are an assistant that recommends products to companies.\nYou will receive a company profile and you need to recommend the next product that the company should buy.\nBelow are some examples of similar companies and their products. Use the below to recommend the next products that the company should buy.\n\n"
for i, profile in enumerate(comp_profiles, start=1):
    message_content += f"### Company {i}\n{profile}\n\n"

#message_content += "Recommend the next product that the company should buy using data exclusively from the above text. Take a step-by-step approach in your response, cite product performance examples from the provided data and give reasoning before sharing final answer."
message_content += "Recommend the next product that the company should buy using data exclusively from the above text. You process should involve a step-by-step approach, which the following steps: 1. Analyze the company profile, 2. Identify Relevant Companies, 3. Calculate Average Performance, 4. Recommend two products a primary and secondary, 5. Justify primary and secondary Recommendations."

# Send a chat call to generate an answer
response = chat_client.chat.completions.create(
    model=azure_openai_chat_deployment,
    messages=[
        {"role": "system", "content": message_content},
        {"role": "user", "content": target_company}
    ]
)

print(response.choices[0].message.content)

### Step-by-Step Approach

#### 1. Analyze the Company Profile
- **Company Name**: Madeup Inc
- **State**: NSW
- **Industry**: Finance
- **Segment**: Banking
- **Current Products**: 
  - Widget W (Performance: 85)
  - Widget Y (Performance: 74)

#### 2. Identify Relevant Companies
Identify companies in the same state (NSW), industry (Finance), and segment (Banking):
- Sky Robotics
- Consulting Apex
- Dynamics Arcadia
- Orion Consulting
- Nexus Hyper
- Sea Sphere
- Communications Ignite
- Moon Sea
- Helix Industries
- Applications Pulsar
- Phoenix Management

#### 3. Calculate Average Performance of Possible Products
**Products to Consider** (Average performance based on the products used by relevant companies):
- Widget X
  - Sky Robotics: 87.0
  - Nexus Hyper: 85.0
  - Communications Ignite: 76.0
  - Sea Sphere: 67.0
  - Phoenix Robotics: 68.0
  - Strategies Quest: 49.0
  - Average performance: (87 + 85 + 76 + 67 + 68 + 49) / 6 = 72.0
  
- Widget Z:
  - Sky Robotics: 21.0
  - Consulti