# Cosmos DB as a single source

# Prereqs


In [1]:
#!import code/Setup.cs



# Vector Store Setup

## Database Configuration: Overview


### **Understand Index Type, Vector Data Type, and Distance Functions**

#### **`Vector Index Type`**

This option determines how vectors are indexed within Cosmos DB to optimize search performance.

- **`flat` Index Type**: Use for low-dimensional, exact searches on smaller datasets.
- **`quantizedFlat` Index Type**: Choose when you need to balance performance and storage with acceptable accuracy loss in high-dimensional data.
- **`diskANN` Index Type**: Opt for large-scale, high-dimensional datasets where approximate searches suffice, and speed is critical.

<details>
<summary>
Options
</summary>

- **`flat`**: Stores vectors alongside other indexed properties without additional indexing structures. Supports up to **505 dimensions**.

  **When to Use:**

  - **Low-dimensional data**: Ideal for applications with vectors up to 505 dimensions.
  - **Exact search requirements**: When you need precise search results.
  - **Small to medium datasets**: Efficient for datasets where the index size won't become a bottleneck.

  **Real-World Scenario:**

  - **Customer Segmentation**: A retail company uses customer feature vectors (age, income, purchase history) with dimensions well below 505 to segment customers. Exact matches are important for targeted marketing campaigns.

- **`quantizedFlat`**: Compresses (quantizes) vectors before indexing, improving performance at the cost of some accuracy. Supports up to **4096 dimensions**.

  **When to Use:**

  - **High-dimensional data with storage constraints**: Suitable for vectors up to 4096 dimensions where storage efficiency is important.
  - **Performance-critical applications**: When reduced latency and higher throughput are needed.
  - **Acceptable accuracy trade-off**: Minor losses in accuracy are acceptable for performance gains.

  **Real-World Scenario:**

  - **Mobile Image Recognition**: An app recognizes objects using high-dimensional image embeddings. Quantization reduces the storage footprint and improves search speed, crucial for mobile devices with limited resources.

- **`diskANN`**: Utilizes the DiskANN algorithm for approximate nearest neighbor searches, optimized for speed and efficiency. Supports up to **4096 dimensions**.

  **When to Use:**

  - **Large-scale, high-dimensional data**: Best for big datasets where quick approximate searches are acceptable.
  - **Real-time applications**: When fast response times are critical.
  - **Scalability needs**: Suitable for applications expected to grow significantly.

  **Real-World Scenario:**

  - **Semantic Search Engines**: A search engine indexes millions of documents using embeddings from language models like BERT (768 dimensions). DiskANN allows users to get fast search results by efficiently handling high-dimensional data.
</details>
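
The selection rules above can be sketched as a small helper. This is a minimal sketch, not part of the Cosmos DB SDK: `SuggestIndexType` and the `LargeDatasetThreshold` cutoff are hypothetical names and values introduced for illustration, and only the dimension limits (505 and 4096) come from the text above.

```csharp
using System;

// Dimension limits quoted in the section above.
const int FlatMaxDimensions = 505;
const int MaxDimensions = 4096;

// Hypothetical cutoff for "large-scale"; tune it against your own measurements.
const long LargeDatasetThreshold = 100_000;

// Suggests an index type from the vector length, the expected number of vectors,
// and whether the caller needs exact (non-approximate) results.
string SuggestIndexType(int dimensions, long vectorCount, bool requireExact)
{
    if (dimensions < 1 || dimensions > MaxDimensions)
        throw new ArgumentOutOfRangeException(nameof(dimensions), $"Expected 1..{MaxDimensions}.");

    if (requireExact)
    {
        // flat is the exact option described above, and it stops at 505 dimensions.
        if (dimensions > FlatMaxDimensions)
            throw new ArgumentException($"flat supports at most {FlatMaxDimensions} dimensions.");
        return "flat";
    }

    if (vectorCount >= LargeDatasetThreshold) return "diskANN";
    if (dimensions <= FlatMaxDimensions) return "flat";
    return "quantizedFlat";
}

Console.WriteLine(SuggestIndexType(300, 20_000, requireExact: true));     // flat
Console.WriteLine(SuggestIndexType(2048, 20_000, requireExact: false));   // quantizedFlat
Console.WriteLine(SuggestIndexType(768, 5_000_000, requireExact: false)); // diskANN
```

Treat the threshold as a starting point only; the right cutoff depends on latency targets and request-unit cost for the actual workload.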

---

#### **`Vector Data Type`**

Specifies the data type of the vector components.

- **`float32` Datatype**: Default choice for precision; use when storage is less of a concern.
- **`uint8` and `int8` Datatypes**: Use for storage efficiency, particularly when data can be quantized.

<details>
<summary>Options</summary>

- **`float32`** (default): 32-bit floating-point numbers.

  **When to Use:**

  - **High precision requirements**: Necessary when the application demands precise calculations.
  - **Standard ML embeddings**: Most machine learning models output float32 vectors.

  **Real-World Scenario:**

  - **Scientific Simulations**: In climate modeling, vectors represent complex data where precision is vital for accurate simulations and predictions.

- **`uint8`**: 8-bit unsigned integers.

  **When to Use:**

  - **Memory optimization**: Reduces storage needs when precision can be sacrificed.
  - **Quantized models**: When vectors are output from models that already quantize data.

  **Real-World Scenario:**

  - **Basic Image Features**: Storing color histograms for image retrieval systems, where each bin can be represented with an 8-bit integer.

- **`int8`**: 8-bit signed integers (values from -128 to 127).

  **When to Use:**

  - **Custom quantization schemes**: When using specialized compression techniques that map floating-point values to an 8-bit integer scale.
  - **Edge devices**: Ideal for applications on devices with extreme memory limitations.

  **Real-World Scenario:**

  - **Audio Fingerprinting**: Compressing audio feature vectors for song recognition apps where storage and quick retrieval are essential.
</details>
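
To make the storage trade-off concrete, the sketch below computes the raw payload size of one vector for each data type and shows one possible way to quantize `float32` components to `int8`. The helper names are hypothetical, the quantizer assumes components already lie in [-1, 1], and the sizes ignore JSON and index overhead.

```csharp
using System;
using System.Linq;

// Bytes used by one vector component for each data type named above.
int BytesPerComponent(string dataType) => dataType switch
{
    "float32" => 4,
    "uint8" => 1,
    "int8" => 1,
    _ => throw new ArgumentException($"Unknown data type '{dataType}'.")
};

// Raw payload size of a single vector, ignoring JSON and index overhead.
long RawVectorBytes(int dimensions, string dataType) =>
    (long)dimensions * BytesPerComponent(dataType);

// Hypothetical symmetric quantizer: clamps each component to [-1, 1]
// and scales it onto the int8 range [-127, 127].
sbyte[] QuantizeToInt8(float[] vector) =>
    vector.Select(v => (sbyte)MathF.Round(Math.Clamp(v, -1f, 1f) * 127f)).ToArray();

Console.WriteLine(RawVectorBytes(1536, "float32")); // 6144
Console.WriteLine(RawVectorBytes(1536, "int8"));    // 1536
Console.WriteLine(string.Join(", ", QuantizeToInt8(new[] { -1f, 0f, 0.25f, 1f }))); // -127, 0, 32, 127
```

A 1536-dimension vector shrinks by a factor of four when stored as 8-bit integers, at the cost of the rounding error introduced by the quantizer.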

---
#### **`Dimension Size`**

The length of the vectors being indexed. Valid values range from 1 to 4096; the default is **1536**.
<details>
<summary>Options</summary>


**When to Consider Lower Dimensions (≤ 505):**

- **Simpler models**: Applications using basic embeddings or feature vectors.
- **Flat index type**: Required for the `flat` index type, which supports at most 505 dimensions.

*Real-World Scenario:*

- **Keyword Matching**: Using TF-IDF vectors reduced to a few hundred dimensions for document similarity in a content management system.

**When to Consider Higher Dimensions (506 - 4096):**

- **Complex models**: Deep learning applications with high-dimensional embeddings.
- **Advanced search features**: When richer representations of data are necessary for accuracy.

*Real-World Scenario:*

- **Face Recognition**: Using high-dimensional embeddings (e.g., 2048 dimensions) to represent facial features for security systems.
</details>

---

#### **`Distance Function`**

Determines how similarity between vectors is calculated. Choose based on what similarity means in your application: `cosine` when only orientation matters, `dotproduct` when magnitude also matters, and `euclidean` when straight-line distance in the embedding space is meaningful.

<details>
<summary>Options</summary>

- **`cosine`**: Measures the cosine of the angle between vectors.

  **When to Use:**

  - **Orientation-focused similarity**: When the magnitude is less important than the direction.
  - **Normalized data**: Ideal when vectors are normalized to unit length.

  **Real-World Scenario:**

  - **Document Similarity**: In text analytics, comparing documents based on topic similarity where word counts are normalized.

- **`dotproduct`**: Computes the scalar (dot) product of two vectors.

  **When to Use:**

  - **Magnitude matters**: When both direction and magnitude are significant.
  - **Machine learning models**: Often used in recommendation systems where strength of preferences is important.

  **Real-World Scenario:**

  - **Personalized Recommendations**: Matching users to products by calculating the dot product of user and item embeddings in a collaborative filtering system.

- **`euclidean`**: Calculates the straight-line distance between vectors.

  **When to Use:**

  - **Spatial distance relevance**: When physical distance correlates with similarity.
  - **High-dimensional data**: Suitable for embeddings where both magnitude and direction impact similarity.

  **Real-World Scenario:**

  - **Anomaly Detection**: Identifying outliers in network traffic patterns by measuring Euclidean distances in feature space.
</details>
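
The three functions can be compared on a toy example. The sketch below is a plain in-memory illustration with hypothetical helper names; it is not how Cosmos DB evaluates distances internally.

```csharp
using System;
using System.Linq;

void RequireSameLength(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
}

// Dot product: sensitive to both direction and magnitude.
double Dot(float[] a, float[] b)
{
    RequireSameLength(a, b);
    return a.Zip(b, (x, y) => (double)x * y).Sum();
}

double Norm(float[] a) => Math.Sqrt(Dot(a, a));

// Cosine similarity: direction only, because both norms are divided out.
double Cosine(float[] a, float[] b) => Dot(a, b) / (Norm(a) * Norm(b));

// Euclidean distance: straight-line distance between the two points.
double Euclidean(float[] a, float[] b)
{
    RequireSameLength(a, b);
    return Math.Sqrt(a.Zip(b, (x, y) => Math.Pow((double)x - y, 2)).Sum());
}

var u = new[] { 1f, 0f };
var v = new[] { 2f, 0f }; // same direction as u, twice the magnitude
var w = new[] { 0f, 3f }; // orthogonal to u

Console.WriteLine(Cosine(u, v));    // 1
Console.WriteLine(Dot(u, v));       // 2
Console.WriteLine(Euclidean(u, v)); // 1
Console.WriteLine(Cosine(u, w));    // 0
```

Note the different conventions: for cosine and dot product a larger value means more similar, while for Euclidean distance a smaller value means more similar. Here `u` and `v` are identical under cosine but not under the other two measures, which is exactly the magnitude sensitivity described above.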

---

### **Option Combinations and Preferred Use-Cases**



#### **Combination 1: Low-Dimensional, Exact Searches**

- **`vectorIndexType`**: `flat`
- **`datatype`**: `float32`
- **`dimensions`**: ≤ 505
- **`distanceFunction`**: `cosine`

**Real-World Scenario:**

- **Small-Scale Text Classification**: A startup builds a news categorization tool using word embeddings (300 dimensions). Exact cosine similarity searches ensure accurate article tagging without the overhead of approximate methods.

---

#### **Combination 2: High-Dimensional, Performance-Critical Applications**

- **`vectorIndexType`**: `diskANN`
- **`datatype`**: `float32`
- **`dimensions`**: 768 - 1536
- **`distanceFunction`**: `cosine` or `dotproduct`

**Real-World Scenario:**

- **Real-Time Recommendations**: A streaming service uses user and content embeddings (1024 dimensions) to provide instantaneous movie recommendations. DiskANN accelerates search times, offering a smooth user experience despite the large dataset.

---

#### **Combination 3: Storage-Efficient High-Dimensional Data**

- **`vectorIndexType`**: `quantizedFlat`
- **`datatype`**: `uint8` or `int8`
- **`dimensions`**: 2048
- **`distanceFunction`**: `cosine`

**Real-World Scenario:**

- **Mobile Visual Search**: An app allows users to search for products by uploading photos. High-dimensional image embeddings are quantized to fit the storage constraints of mobile devices, and approximate searches provide quick results.

---

#### **Combination 4: Precision-Critical Scientific Computing**

- **`vectorIndexType`**: `flat`
- **`datatype`**: `float32`
- **`dimensions`**: ≤ 505 (the `flat` limit)
- **`distanceFunction`**: `euclidean`

**Real-World Scenario:**

- **Genomic Data Analysis**: Researchers analyze genetic sequences represented as vectors reduced to at most 505 dimensions so that the exact `flat` index can be used. Precise Euclidean distance calculations are essential for identifying genetic similarities and mutations.

---

#### **Combination 5: Medium-Dimensional Data with Storage Constraints**

- **`vectorIndexType`**: `quantizedFlat`
- **`datatype`**: `uint8`
- **`dimensions`**: 500
- **`distanceFunction`**: `dotproduct`

**Real-World Scenario:**

- **IoT Sensor Data**: A network of sensors generates medium-dimensional vectors representing environmental data. Quantization reduces storage and transmission costs, and dot product calculations help in identifying patterns and anomalies efficiently.
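
As a rough illustration of how one of these combinations maps onto container settings, the sketch below builds the two JSON policy documents for Combination 2 with `System.Text.Json`. The property names follow the Azure Cosmos DB for NoSQL policy format as commonly documented, so verify them against the current documentation before relying on them. The `/ContentEmbedding` path is an assumption based on the record field used later in this notebook, and excluding the vector path from the regular index is a common recommendation rather than a requirement.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;

// Assumed property path; adjust it to match how your serializer names the embedding field.
const string VectorPath = "/ContentEmbedding";

// Container vector embedding policy: describes the vectors stored at the path.
var vectorEmbeddingPolicy = new
{
    vectorEmbeddings = new[]
    {
        new { path = VectorPath, dataType = "float32", distanceFunction = "cosine", dimensions = 1536 }
    }
};

// Indexing policy: regular paths plus the vector index for the same path.
var indexingPolicy = new
{
    includedPaths = new[] { new { path = "/*" } },
    excludedPaths = new[] { new { path = "/\"_etag\"/?" }, new { path = VectorPath + "/*" } },
    vectorIndexes = new[] { new { path = VectorPath, type = "diskANN" } }
};

// The relaxed encoder keeps the quotes in the _etag path readable (\" instead of \u0022).
var options = new JsonSerializerOptions
{
    WriteIndented = true,
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
};

string embeddingJson = JsonSerializer.Serialize(vectorEmbeddingPolicy, options);
string indexingJson = JsonSerializer.Serialize(indexingPolicy, options);
Console.WriteLine(embeddingJson);
Console.WriteLine(indexingJson);
```

Swapping `type`, `dataType`, `dimensions`, and `distanceFunction` for the values listed under any other combination gives the corresponding policy pair.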

## Database Configuration: Implementation

### Setup Data Models

In [3]:
using Microsoft.Azure.Cosmos;
using Microsoft.SemanticKernel.Connectors.AzureCosmosDBNoSQL;
using Microsoft.SemanticKernel.Data;
using IndexKind = Microsoft.Azure.Cosmos.IndexKind;
using System.Collections.ObjectModel; // Collection<T>, used for the indexing policy paths below
using System.Reflection;
using System.Text.Json.Serialization;
using Azure.Core.Serialization;
using Container = Microsoft.Azure.Cosmos.Container;

// public record 10KDocument(
//     [property: VectorStoreRecordKey] string HotelId,
//     [property: VectorStoreRecordData] string HotelName,
//     [property: VectorStoreRecordData] string Description,
//     [property: VectorStoreRecordVector(Dimensions: 4, IndexKind: IndexKind.Hash, DistanceFunction: DistanceFunction.CosineSimilarity), JsonPropertyName("description_embeddings")] ReadOnlyMemory<float>? DescriptionEmbeddings);


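public record PartitionedEntity(string PartitionKey, string Type)
{
    // Overload for derived records (and the JSON serializer) that supply an explicit id.
    // PartitionKey and Type are already set by the chained primary constructor.
    [JsonConstructor]
    public PartitionedEntity(string PartitionKey, string Type, string Id) : this(PartitionKey, Type)
    {
        this.Id = Id;
    }

    // Remains null for records that use the two-argument constructor.
    public string Id { get; set; } = null;
    public string SourceUri { get; set; } = string.Empty;
}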

public record CompanyOfficer(
    int CompanyCIK,
    string FirstName,
    string LastName,
    int? Age,
    string Title,
    int? YearBorn,
    long TotalPay
): PartitionedEntity(CompanyCIK.ToString(), "CompanyOfficer", $"{CompanyCIK}_{FirstName + LastName}");

public record BasicCompanyInfo (
    string Address1,
    string City,
    string State,
    string Zip,
    string Country,
    string Phone,
    string Website,
    string Industry,
    string Sector,
    string LongBusinessSummary,
    ICollection<CompanyOfficer> CompanyOfficers,
    string IrWebsite,
    string Exchange,
    string QuoteType,
    string TickerSymbol,
    string UnderlyingSymbol,
    string ShortName,
    string SecName,
    int CIK,
    string PrimaryExchange,
    ICollection<string> AssociatedCusips
): PartitionedEntity(PartitionKey: CIK.ToString(), Type: "CompanyInfo", Id: CIK.ToString());

var Form10KSections = new Dictionary<string, string>
{
    { "item1", "Business: requires a description of the company’s business, including its main products and services, what subsidiaries it owns, and what markets it operates in" },
    { "item1a", "Risk Factors: includes information about the most significant risks that apply to the company or to its securities" },
    { "item1b", "Unresolved Staff Comments: requires the company to explain certain comments it has received from the SEC staff on previously filed reports that have not been resolved after an extended period of time" },
    { "item2", "Properties: includes information about the company’s significant properties, such as principal plants, mines and other materially important physical properties" },
    { "item3", "Legal Proceedings: requires the company to include information about significant pending lawsuits or other legal proceedings, other than ordinary litigation" },
    { "item7", "Management’s Discussion and Analysis of Financial Condition and Results of Operations (MD&A): gives the company’s perspective on the business results of the past financial year. This section, known as the MD&A for short, allows company management to tell its story in its own words" },
    { "item7a", "Quantitative and Qualitative Disclosures About Market Risk: requires information about the company’s exposure to market risk, such as interest rate risk, foreign currency exchange risk, commodity price risk or equity price risk" },
    { "item8", "Financial Statements and Supplementary Data: requires the company’s audited financial statements" },
    { "item10", "Directors, Executive Officers and Corporate Governance: requires information about the background and experience of the company’s directors and executive officers, the company’s code of ethics, and certain qualifications for directors and committees of the board of directors" },
    { "item11", "Executive Compensation: includes detailed disclosure about the company’s compensation policies and programs and how much compensation was paid to the top executive officers of the company in the past year" },
    { "item15", "Exhibits, Financial Statement Schedules: Many exhibits are required, including documents such as the company’s bylaws, copies of its material contracts, and a list of the company’s subsidiaries" }
};
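// The filing date is formatted explicitly: the default DateTime string is culture-dependent and can contain '/', which is not allowed in a Cosmos DB item id.
public record SecForm10KSection(int CIK, DateTime FilingDate, string SectionName, string SectionShortName, string SectionText, ReadOnlyMemory<float> ContentEmbedding): PartitionedEntity(CIK.ToString(), "10-K", $"{CIK}_{FilingDate:yyyy-MM-dd}_{SectionName}");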
// Shares and Value use long because large 13F positions can exceed the int range.
public record SecForm13FHolding(int CIK, string ManagerName, string SecurityName, long Shares, long Value, string SecurityType, string Cusip, DateTime ReportedDate): PartitionedEntity(Cusip, "13F-HR", $"{Cusip}_{ManagerName}");
public record SecForm13D(int CIK, string ReportingPerson, DateTime FilingDate, string Description): PartitionedEntity(CIK.ToString(), "13D");

public record DailyMarketData(string Symbol, DateTime Date, float Open, float High, float Low, float Close, long Volume): PartitionedEntity(Symbol, "DailyMarketData");
public record NewsArticle(string Headline, string ArticleText, string SourceName, string Uri, DateTime PublishDate): PartitionedEntity(SourceName, "NewsArticle");
public class Todo 
{
    public string Title { get; set; }
    public bool IsDone { get; set; }
    public int Id { get; set; }
    public string PartitionKey { get; set; }

    public string Type { get; set; }

}
public class CosmosWrapper
{
    private CosmosNoSqlService CosmosService;
    public Database DatabaseClient;
    public const string DEFAULT_PARTITION_KEY_PATH = "/PartitionKey";

    public CosmosWrapper()
    {
        this.CosmosService = new CosmosNoSqlService();
        this.DatabaseClient = this.CosmosService.databaseClient;
    }

    public async Task<Container> GetOrCreateContainer(string containerName)
    {
        var containerProperties = new ContainerProperties(containerName, DEFAULT_PARTITION_KEY_PATH){
            IndexingPolicy = new IndexingPolicy(){
                IncludedPaths = new Collection<IncludedPath>(){
                    new IncludedPath(){
                        Path = "/*"
                    }
                },
                ExcludedPaths = new Collection<ExcludedPath>(){
                    new ExcludedPath(){
                        Path = "/\"_etag\"/?"
                    }
                }
            }
        };
        var c = await this.DatabaseClient.CreateContainerIfNotExistsAsync(containerProperties);
        return c.Container;
    }
}
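
Several of the records above build their ids by concatenating field values. Cosmos DB does not allow `/`, `\`, `?`, or `#` in an item id, so values such as dates or names may need sanitizing first. The sketch below shows one way to do that; `SanitizeIdPart` and `FormatDate` are hypothetical helpers, and the sample values are made up.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Characters that Cosmos DB does not allow in an item id.
char[] forbiddenIdChars = { '/', '\\', '?', '#' };

// Hypothetical helper: replaces each forbidden character with '-'.
string SanitizeIdPart(string part) =>
    new string(part.Select(c => forbiddenIdChars.Contains(c) ? '-' : c).ToArray());

// A fixed, culture-independent date format keeps ids stable across machines.
string FormatDate(DateTime date) => date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);

var filingDate = new DateTime(2024, 2, 15);

string cleanId = string.Join("_", new[] { "12345", FormatDate(filingDate), "item1a" }.Select(SanitizeIdPart));
string repairedId = string.Join("_", new[] { "12345", "2/15/2024", "item7" }.Select(SanitizeIdPart));

Console.WriteLine(cleanId);    // 12345_2024-02-15_item1a
Console.WriteLine(repairedId); // 12345_2-15-2024_item7
```

Formatting the date up front is preferable to repairing it afterwards, because the sanitized form of a culture-dependent string still differs from machine to machine.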



# Semantic Search on Cosmos DB

# Structured Database Copilot
NL2SQL - Database query generation

### Considerations
- The model usually builds most of the database query correctly; however, it needs prompt tuning or native functions to produce a reliable `WHERE` clause.

**Example User Stories**
- I have application monitoring or metric data that I want to derive insights from. 
- I want to chat over the entire corpus of ServiceNow or other ICM support ticket information.