# CosmosDB as a single source

# Prereqs


In [81]:
#!import code/Setup.cs


# Vector Store Setup

## Database Configuration: Overview


### **Understand Index Type, Vector Data Type, and Distance Functions**

#### **`Vector Index Type`**

This option determines how vectors are indexed within Cosmos DB to optimize search performance.

- **`flat` Index Type**: Use for low-dimensional, exact searches on smaller datasets.
- **`quantizedFlat` Index Type**: Choose when you need to balance performance and storage with acceptable accuracy loss in high-dimensional data.
- **`diskANN` Index Type**: Opt for large-scale, high-dimensional datasets where approximate searches suffice, and speed is critical.

<details>
<summary>
Options
</summary>

- **`flat`**: Stores vectors alongside other indexed properties without additional indexing structures. Supports up to **505 dimensions**.

  **When to Use:**

  - **Low-dimensional data**: Ideal for applications with vectors up to 505 dimensions.
  - **Exact search requirements**: When you need precise search results.
  - **Small to medium datasets**: Efficient for datasets where the index size won't become a bottleneck.

  **Real-World Scenario:**

  - **Customer Segmentation**: A retail company uses customer feature vectors (age, income, purchase history) with dimensions well below 505 to segment customers. Exact matches are important for targeted marketing campaigns.

- **`quantizedFlat`**: Compresses (quantizes) vectors before indexing, improving performance at the cost of some accuracy. Supports up to **4096 dimensions**.

  **When to Use:**

  - **High-dimensional data with storage constraints**: Suitable for vectors up to 4096 dimensions where storage efficiency is important.
  - **Performance-critical applications**: When reduced latency and higher throughput are needed.
  - **Acceptable accuracy trade-off**: Minor losses in accuracy are acceptable for performance gains.

  **Real-World Scenario:**

  - **Mobile Image Recognition**: An app recognizes objects using high-dimensional image embeddings. Quantization reduces the index's storage footprint and improves search speed, which keeps lookups responsive for mobile clients.

- **`diskANN`**: Utilizes the DiskANN algorithm for approximate nearest neighbor searches, optimized for speed and efficiency. Supports up to **4096 dimensions**.

  **When to Use:**

  - **Large-scale, high-dimensional data**: Best for big datasets where quick approximate searches are acceptable.
  - **Real-time applications**: When fast response times are critical.
  - **Scalability needs**: Suitable for applications expected to grow significantly.

  **Real-World Scenario:**

  - **Semantic Search Engines**: A search engine indexes millions of documents using embeddings from language models like BERT (768 dimensions). DiskANN allows users to get fast search results by efficiently handling high-dimensional data.
</details>
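
The dimension limits listed above can be checked before a container is created. The following is a minimal sketch, not part of the Cosmos DB SDK: `MaxDimensions` and `IsValidIndexChoice` are hypothetical helper names, and the limits are the ones quoted in this section.

```csharp
using System;

// Dimension limits as stated above: flat supports up to 505, the other two up to 4096.
int MaxDimensions(string indexType) => indexType switch
{
    "flat" => 505,
    "quantizedFlat" => 4096,
    "diskANN" => 4096,
    _ => throw new ArgumentException($"Unknown vector index type '{indexType}'.", nameof(indexType))
};

// Returns true when the embedding size fits the chosen index type.
bool IsValidIndexChoice(string indexType, int dimensions) =>
    dimensions > 0 && dimensions <= MaxDimensions(indexType);

Console.WriteLine(IsValidIndexChoice("flat", 300));      // True
Console.WriteLine(IsValidIndexChoice("flat", 1536));     // False
Console.WriteLine(IsValidIndexChoice("diskANN", 1536));  // True
```

A check like this would reject a `flat` index for 1536-dimensional embeddings before any request is sent.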

---

#### **`Vector Data Type`**

Specifies the data type of the vector components.

- **`float32` Datatype**: Default choice for precision; use when storage is less of a concern.
- **`uint8` and `int8` Datatypes**: Use for storage efficiency, particularly when data can be quantized.

<details>
<summary>Options</summary>

- **`float32`** (default): 32-bit floating-point numbers.

  **When to Use:**

  - **High precision requirements**: Necessary when the application demands precise calculations.
  - **Standard ML embeddings**: Most machine learning models output float32 vectors.

  **Real-World Scenario:**

  - **Scientific Simulations**: In climate modeling, vectors represent complex data where precision is vital for accurate simulations and predictions.

- **`uint8`**: 8-bit unsigned integers.

  **When to Use:**

  - **Memory optimization**: Reduces storage needs when precision can be sacrificed.
  - **Quantized models**: When vectors are output from models that already quantize data.

  **Real-World Scenario:**

  - **Basic Image Features**: Storing color histograms for image retrieval systems, where each bin can be represented with an 8-bit integer.

- **`int8`**: 8-bit signed integers (range -128 to 127).

  **When to Use:**

  - **Custom quantization schemes**: When using specialized compression techniques that map floating-point values to an 8-bit integer scale.
  - **Edge devices**: Ideal for applications on devices with extreme memory limitations.

  **Real-World Scenario:**

  - **Audio Fingerprinting**: Compressing audio feature vectors for song recognition apps where storage and quick retrieval are essential.
</details>
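
To make the storage trade-off concrete, here is a sketch of symmetric linear quantization from `float32` to `int8`. The scheme and the helper names `QuantizeToInt8` and `Dequantize` are assumptions chosen for illustration; this section does not prescribe a quantization method, and the sample values are made up.

```csharp
using System;
using System.Linq;

// Symmetric linear quantization: scale so the largest magnitude maps to 127.
sbyte[] QuantizeToInt8(float[] vector, out float scale)
{
    float maxAbs = vector.Max(v => Math.Abs(v));
    scale = maxAbs == 0f ? 1f : maxAbs / 127f;
    float s = scale; // out parameters cannot be captured by a lambda
    return vector.Select(v => (sbyte)Math.Round(v / s)).ToArray();
}

// Approximate reconstruction of the original components.
float[] Dequantize(sbyte[] quantized, float scale) =>
    quantized.Select(q => q * scale).ToArray();

var embedding = new float[] { 0.12f, -0.45f, 0.33f, 1.00f };
sbyte[] compact = QuantizeToInt8(embedding, out float step);
float[] restored = Dequantize(compact, step);

// Each component shrinks from 4 bytes to 1 byte.
Console.WriteLine($"float32 bytes: {embedding.Length * sizeof(float)}, int8 bytes: {compact.Length * sizeof(sbyte)}");
Console.WriteLine(string.Join(", ", compact));
```

The reconstruction error per component is bounded by the quantization step, which is the accuracy cost paid for the fourfold reduction in size.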

---
#### **`Dimension Size`**

The length of the vectors being indexed. The maximum is 4096 (505 for the `flat` index type); the default is **1536**.
<details>
<summary>Options</summary>


**When to Consider Lower Dimensions (≤ 505):**

- **Simpler models**: Applications using basic embeddings or feature vectors.
- **Flat index type**: Required when using the `flat` index type due to its dimension limit.

*Real-World Scenario:*

- **Keyword Matching**: Using low-dimensional TF-IDF vectors for document similarity in a content management system.

**When to Consider Higher Dimensions (506 - 4096):**

- **Complex models**: Deep learning applications with high-dimensional embeddings.
- **Advanced search features**: When richer representations of data are necessary for accuracy.

*Real-World Scenario:*

- **Face Recognition**: Using high-dimensional embeddings (e.g., 2048 dimensions) to represent facial features for security systems.
</details>

---

#### **`Distance Function`**

Determines how similarity between vectors is calculated. Select based on the nature of similarity in your application: `cosine` for orientation, `dotproduct` when magnitude matters, and `euclidean` for spatial relevance.

<details>
<summary>Options</summary>

- **`cosine`**: Measures the cosine of the angle between vectors.

  **When to Use:**

  - **Orientation-focused similarity**: When the magnitude is less important than the direction.
  - **Normalized data**: Ideal when vectors are normalized to unit length.

  **Real-World Scenario:**

  - **Document Similarity**: In text analytics, comparing documents based on topic similarity where word counts are normalized.

- **`dotproduct`**: Computes the scalar (dot) product of two vectors.

  **When to Use:**

  - **Magnitude matters**: When both direction and magnitude are significant.
  - **Machine learning models**: Often used in recommendation systems where strength of preferences is important.

  **Real-World Scenario:**

  - **Personalized Recommendations**: Matching users to products by calculating the dot product of user and item embeddings in a collaborative filtering system.

- **`euclidean`**: Calculates the straight-line distance between vectors.

  **When to Use:**

  - **Spatial distance relevance**: When physical distance correlates with similarity.
  - **High-dimensional data**: Suitable for embeddings where both magnitude and direction impact similarity.

  **Real-World Scenario:**

  - **Anomaly Detection**: Identifying outliers in network traffic patterns by measuring Euclidean distances in feature space.
</details>
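
The three functions can be compared on a small example. This sketch computes them in plain C# to show what each one measures; the helper names and sample vectors are illustrative only and are not taken from the Cosmos DB SDK.

```csharp
using System;
using System.Globalization;
using System.Linq;

double Dot(float[] a, float[] b) => a.Zip(b, (x, y) => (double)x * y).Sum();
double Norm(float[] a) => Math.Sqrt(Dot(a, a));
double Cosine(float[] a, float[] b) => Dot(a, b) / (Norm(a) * Norm(b));
double Euclidean(float[] a, float[] b) => Math.Sqrt(a.Zip(b, (x, y) => Math.Pow(x - y, 2)).Sum());
string F(double v) => v.ToString("F4", CultureInfo.InvariantCulture);

var query = new float[] { 1f, 0f };
var doc = new float[] { 2f, 2f };
var longer = doc.Select(v => v * 2).ToArray(); // same direction as doc, twice the magnitude

Console.WriteLine($"cosine:     {F(Cosine(query, doc))} vs {F(Cosine(query, longer))}");       // 0.7071 vs 0.7071
Console.WriteLine($"dotproduct: {F(Dot(query, doc))} vs {F(Dot(query, longer))}");             // 2.0000 vs 4.0000
Console.WriteLine($"euclidean:  {F(Euclidean(query, doc))} vs {F(Euclidean(query, longer))}"); // 2.2361 vs 5.0000
```

Doubling the magnitude of `doc` leaves the cosine unchanged but changes both the dot product and the Euclidean distance, which is the distinction the options above describe.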

---

### **Option Combinations and Preferred Use-Cases**



#### **Combination 1: Low-Dimensional, Exact Searches**

- **`vectorIndexType`**: `flat`
- **`datatype`**: `float32`
- **`dimensions`**: ≤ 505
- **`distanceFunction`**: `cosine`

**Real-World Scenario:**

- **Small-Scale Text Classification**: A startup builds a news categorization tool using word embeddings (300 dimensions). Exact cosine similarity searches ensure accurate article tagging without the overhead of approximate methods.

---

#### **Combination 2: High-Dimensional, Performance-Critical Applications**

- **`vectorIndexType`**: `diskANN`
- **`datatype`**: `float32`
- **`dimensions`**: 768 - 1536
- **`distanceFunction`**: `cosine` or `dotproduct`

**Real-World Scenario:**

- **Real-Time Recommendations**: A streaming service uses user and content embeddings (1024 dimensions) to provide instantaneous movie recommendations. DiskANN accelerates search times, offering a smooth user experience despite the large dataset.

---

#### **Combination 3: Storage-Efficient High-Dimensional Data**

- **`vectorIndexType`**: `quantizedFlat`
- **`datatype`**: `uint8` or `int8`
- **`dimensions`**: 2048
- **`distanceFunction`**: `cosine`

**Real-World Scenario:**

- **Mobile Visual Search**: An app allows users to search for products by uploading photos. High-dimensional image embeddings are quantized to fit the storage constraints of mobile devices, and approximate searches provide quick results.

---

#### **Combination 4: Precision-Critical Scientific Computing**

- **`vectorIndexType`**: `flat`
- **`datatype`**: `float32`
- **`dimensions`**: ≤ 505 (the `flat` limit)
- **`distanceFunction`**: `euclidean`

**Real-World Scenario:**

- **Genomic Data Analysis**: Researchers analyze genetic sequences represented as feature vectors of at most 505 dimensions. Exact Euclidean distance calculations are essential for identifying genetic similarities and mutations.

---

#### **Combination 5: Medium-Dimensional Data with Storage Constraints**

- **`vectorIndexType`**: `quantizedFlat`
- **`datatype`**: `uint8`
- **`dimensions`**: 500
- **`distanceFunction`**: `dotproduct`

**Real-World Scenario:**

- **IoT Sensor Data**: A network of sensors generates medium-dimensional vectors representing environmental data. Quantization reduces storage and transmission costs, and dot product calculations help in identifying patterns and anomalies efficiently.
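
Each combination maps onto two parts of a container definition: the vector embedding policy and the vector index. As a sketch, the snippet below prints the JSON for Combination 2, assuming the property names `vectorEmbeddingPolicy.vectorEmbeddings` and `indexingPolicy.vectorIndexes` from the Cosmos DB container definition. The `/embedding` path and the choice of 1024 dimensions are illustrative assumptions.

```csharp
using System;
using System.Text.Json;

// Combination 2: diskANN index, float32 components, cosine distance.
// "/embedding" is an assumed property path, not one mandated by Cosmos DB.
var containerSettings = new
{
    vectorEmbeddingPolicy = new
    {
        vectorEmbeddings = new[]
        {
            new { path = "/embedding", dataType = "float32", dimensions = 1024, distanceFunction = "cosine" }
        }
    },
    indexingPolicy = new
    {
        vectorIndexes = new[] { new { path = "/embedding", type = "diskANN" } }
    }
};

string json = JsonSerializer.Serialize(containerSettings, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(json);
```

The same path has to appear in both sections: the embedding policy describes the vector, and the index entry says how that vector is indexed.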

## Implementation using Financial Datasets

### Setup Containers


1. **`CompanyData`**

    - **Data Types**: `BasicCompanyInfo`, `CompanyOfficer`
    - **Partition Key**: `/Cik`
    - **Id**: `/Cik` to ensure there is only 1 basic information document per company
    - **Vector Paths**: ``
    - **Notes**:
        - **Optimized for Company Queries**: Facilitates queries and reports scoped to specific companies.
        - **Rationale**: Embedding officers inside the company document reduces the need for cross-partition queries and improves read performance when retrieving company information along with its officers.


2. **`FinancialFilings`**

    - **Data Types**: `Form10KSection`, `Form13D`
    - **Partition Key**: `/Cik`
    - **Id**: `
    - **Indexing**:
        - **Enable Vector Indexing**: For `Form10KSection` embeddings.
    - **Notes**:
        - **Efficient Semantic Search**: Supports AI-driven searches over financial filings.

3. **`MarketData`**

    - **Data Types**: `DailyMarketData`
    - **Partition Key**: `/Symbol`
    - **Notes**:
        - **High Write Throughput**: Allocate sufficient RU/s to handle frequent updates.

4. **`Holdings`**

    - **Data Types**: `Form13FHolding`
    - **Partition Key**: `/Cusip`
    - **Alternate Partition Key**: `/ManagerName` if queries are more often by manager.
    - **Notes**:
        - **Facilitates Cross-Company Queries**: Efficiently retrieve holdings data for reports.

5. **`NewsArticles`**

    - **Data Types**: `NewsArticle`
    - **Partition Key**: `/PublishDate` (e.g., formatted as `yyyy-MM` for monthly partitions)
    - **Indexing**:
        - **Enable Vector Indexing**: For `ArticleText` embeddings.
    - **Notes**:
        - **Time-Based Partitioning**: Improves performance for time-bound queries.
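
The monthly partition key suggested for `NewsArticles` can be derived from the publish date. This is a small sketch; `ToMonthlyPartitionKey` is a hypothetical helper name, and normalizing to UTC is an assumption made so that articles near a month boundary land in a predictable partition.

```csharp
using System;
using System.Globalization;

// Formats a publish date as the yyyy-MM bucket suggested for NewsArticles.
string ToMonthlyPartitionKey(DateTime publishDate) =>
    publishDate.ToUniversalTime().ToString("yyyy-MM", CultureInfo.InvariantCulture);

var published = new DateTime(2024, 3, 15, 9, 30, 0, DateTimeKind.Utc);
Console.WriteLine(ToMonthlyPartitionKey(published)); // 2024-03
```

All articles from the same month share one partition key value, so a query bounded to a month touches a single logical partition.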


### Classes
- Every document that supports vector search has exactly one embedding field.


In [90]:
#pragma warning disable SKEXP0001,SKEXP0020

/**
* Just for reference, Semantic Kernel has abstracted concepts around vector databases for AI RAG applications.
* This solution could be built using annotations instead and the underlying DB could be changed without changing the application code.
public record 10KDocument(
    [property: VectorStoreRecordKey] string HotelId,
    [property: VectorStoreRecordData] string HotelName,
    [property: VectorStoreRecordData] string Description,
    [property: VectorStoreRecordVector(Dimensions: 4, IndexKind: IndexKind.Hash, DistanceFunction: DistanceFunction.CosineSimilarity), JsonPropertyName("description_embeddings")] ReadOnlyMemory<float>? DescriptionEmbeddings);

*/
[AttributeUsage(AttributeTargets.Class, Inherited = false, AllowMultiple = false)]
public class VectorStoreEntityAttribute: Attribute
{
    public string CollectionName { get; init; }
    public string DocumentType { get; init; }
}

[AttributeUsage(AttributeTargets.Property, Inherited = false, AllowMultiple = false)]
public sealed class VectorStoreIdAttribute(): Attribute;

[AttributeUsage(AttributeTargets.Property, Inherited = false, AllowMultiple = false)]
public sealed class VectorStorePartitionKeyAttribute(): Attribute;

[AttributeUsage(AttributeTargets.Property, Inherited = false, AllowMultiple = false)]
public sealed class VectorStoreDocumentTypeAttribute(): Attribute;

[AttributeUsage(AttributeTargets.Property, Inherited = false, AllowMultiple = false)]
public sealed class VectorStoreEmbeddingAttribute(): Attribute;

[AttributeUsage(AttributeTargets.Property, Inherited = false, AllowMultiple = false)]
public sealed class VectorStoreEmbeddingDataAttribute(): Attribute;

public abstract record VectorStoreEntity
{   

    [JsonIgnore]
    public double SimilarityScore { get; set; } = 0.0;

    [JsonIgnore]
    public double RelevanceScore => (SimilarityScore + 1) / 2;

    public void UpdateTokenCount(Tokenizer tokenizer)
    {
        var tokenCount = this.GetTokenCount(tokenizer);
        this.GetType().GetProperty("Tokens")?.SetValue(this, tokenCount);
    }

    public async Task UpdateEmbedding(ITextEmbeddingGenerationService textEmbeddingService, CancellationToken cancellationToken = default)
    {
        var embedding = await this.GetEmbedding(textEmbeddingService, cancellationToken);
        this.GetType().GetProperty("Embedding")?.SetValue(this, embedding);
    }

    // Concatenates every property marked [VectorStoreEmbeddingData] into the text that gets embedded.
    public string GetContextWindow() => this.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.GetCustomAttribute<VectorStoreEmbeddingDataAttribute>() != null)
            .Select(p => {
                var value = p.GetValue(this);
                return value switch
                {
                    string stringValue => stringValue,
                    IEnumerable<string> stringList => stringList.Aggregate(new StringBuilder(), (sb, item) => sb.AppendLine(item), sb => sb.ToString()),
                    _ => string.Empty
                };
            })
            .Aggregate(new StringBuilder(), (sb, value) => sb.AppendLine(value), sb => sb.ToString());

    public Task<ReadOnlyMemory<float>> GetEmbedding(ITextEmbeddingGenerationService textEmbeddingService, CancellationToken cancellationToken = default)
    {
        var embeddingString = this.GetContextWindow();
        return textEmbeddingService.GenerateEmbeddingAsync(value: embeddingString, cancellationToken: cancellationToken);
    }

    public int GetTokenCount(Tokenizer s_tokenizer) => s_tokenizer.CountTokens(this.GetContextWindow());

    public PartitionKey GetPartitionKey() => this.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance)
        .Where(p => p.GetCustomAttribute<VectorStorePartitionKeyAttribute>() != null)
        .Select(p => p.GetValue(this) as string ?? string.Empty)
        .Where(value => !string.IsNullOrWhiteSpace(value))
        .Aggregate(new PartitionKeyBuilder(), (pk, value) => pk.Add(value), pk => pk.Build());  

    public string GetId() => this.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance)
        .Where(p => p.GetCustomAttribute<VectorStoreIdAttribute>() != null)
        .Select(p => p.GetValue(this) as string)
        .FirstOrDefault();

};

/**
    * Container: Cache
*/
[VectorStoreEntity(CollectionName = "cache")]
public record CacheItem(
    [property: VectorStoreId, VectorStorePartitionKey] string Id, 
    [property: VectorStoreEmbeddingData] string Prompts,
    string Completion, 
    [property: VectorStoreEmbedding] ReadOnlyMemory<float> Embedding): VectorStoreEntity
{
    public int Ttl => CalculateTtl();
    public int CacheHits { get; set; } = 1; // Start with 1 to avoid immediate eviction
    public DateTime CreatedAt { get; set; } = DateTime.UtcNow;
    public void RegisterHit()
    {
        CacheHits++;
    }

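    private int CalculateTtl()
    {
        // Each cache hit extends the item's lifetime by one hour, measured from creation.
        var elapsedTime = DateTime.UtcNow - CreatedAt;
        var baseTtl = (int)TimeSpan.FromHours(1).TotalSeconds;
        // Cosmos DB only accepts a positive ttl (or -1), so clamp to at least one second.
        return Math.Max(1, CacheHits * baseTtl - (int)elapsedTime.TotalSeconds);
    }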
}

// Container Properties
public static ContainerProperties getSemanticCacheContainerProperties() {
    var properties = new ContainerProperties(id: "cache", partitionKeyPath: "/id"){
            IndexingPolicy = new (){
                VectorIndexes = new ()
                {
                    new VectorIndexPath(){
                        Path = "/embedding",
                        Type = VectorIndexType.DiskANN
                    }
                },
            },
            VectorEmbeddingPolicy = new VectorEmbeddingPolicy(new (){
                new Embedding(){
                    Path = "/embedding",
                    Dimensions = 1536,
                    DataType = VectorDataType.Float32,
                    DistanceFunction = Microsoft.Azure.Cosmos.DistanceFunction.Cosine
                }
            }),
            DefaultTimeToLive = (int)TimeSpan.FromDays(1).TotalSeconds
    };
    properties.IndexingPolicy.IncludedPaths.Add(new () { Path = "/*" });

    return properties;
} 

/**
    * Container: ChatThreads
*/

// Using Semantic Kernel's ChatHistory class as a building block for ChatThreads.
[VectorStoreEntity(CollectionName = "chat", DocumentType = "ChatThreadMessage")]
public record ChatThreadMessage(
    [property: VectorStoreId] string Id, 
    [property: VectorStorePartitionKey] string UserId, 
    [property: VectorStorePartitionKey] string ThreadId, 
    ChatMessageContent MessageContent, 
    int Tokens = 0): VectorStoreEntity
{
    public string Type => "ChatThreadMessage";
    public bool Deleted { get; set; } = false;


    public DateTime CreatedAt { get; set; } = DateTime.UtcNow;
    public DateTime LastUpdatedAt { get; set; } = DateTime.UtcNow;

    [JsonIgnore]
    public bool CacheHit { get; set; } = false;

    [VectorStoreEmbeddingData,JsonIgnore]
    public string ContextWindow => MessageContent.Content;
    public static implicit operator ChatMessageContent(ChatThreadMessage v) => v.MessageContent;
};

[VectorStoreEntity(CollectionName = "chat", DocumentType = "ChatThread")]
public record ChatThread(
    [property: VectorStorePartitionKey] string UserId, 
    [property: VectorStorePartitionKey] string ThreadId, 
    string DisplayName = "New Chat", 
    int? Tokens = 0): VectorStoreEntity
{
    public string Type => "ChatThread";
    
    [VectorStoreId]
    public string Id { get; set; } = ThreadId;

    public bool Deleted { get; set; } = false;
    public DateTime CreatedAt { get; set; } = DateTime.UtcNow;
    public DateTime LastUpdatedAt { get; set; } = DateTime.UtcNow;

    [JsonIgnore]
    public IEnumerable<ChatThreadMessage> Messages { get; set; } = new List<ChatThreadMessage>();

    // Walks back from the newest message until the token budget is spent, then restores chronological order.
    public string GetContextWindowWithinLimit(int maxTokens, int currentTokens = 0) => Messages.Reverse().TakeWhile((x, i) => {
        currentTokens += x.Tokens;
        return currentTokens <= maxTokens;
    }).Reverse().Select(m => m.ContextWindow).Aggregate(new StringBuilder(), (sb, value) => sb.AppendLine(value), sb => sb.ToString());

    [VectorStoreEmbeddingData,JsonIgnore]
    public string FullContextWindow => Messages.Select(m => m.ContextWindow).Aggregate(new StringBuilder(), (sb, value) => sb.AppendLine(value), sb => sb.ToString());

    public IEnumerable<string> GetMessageContentsForRole(AuthorRole role) => Messages.Where(x => x.MessageContent.Role == role).Select(x => x.MessageContent.Content);
    public IEnumerable<string> GetAllMessageContents() => Messages.Select(x => x.MessageContent.Content);
    public static implicit operator ChatHistory(ChatThread v) => new ChatHistory(messages: v.Messages.Select(m => (ChatMessageContent)m));
};


// Container Properties
public static ContainerProperties getChatThreadContainerProperties() {
    var properties = new ContainerProperties(id: "chat", partitionKeyPaths: new Collection<string>(){ "/userId", "/threadId" });
    properties.IndexingPolicy.ExcludedPaths.Add(new () { Path = "/*" });

    foreach (var path in new string[] { "/userId/?", "/threadId/?"})
    {
        properties.IndexingPolicy.IncludedPaths.Add(new () { Path = path });
    }

    return properties;
}


/**
    * Container: CompanyInfo
*/

public record CompanyOfficer(
    string FirstName,
    string LastName,
    int? Age,
    string Title,
    int? YearBorn,
    long TotalPay
);

public record SecurityListing(
    string Cusip, 
    string Name,  
    string Exchange, 
    string Symbol, 
    string IsinNumber);

[VectorStoreEntity(CollectionName = "companyInfo", DocumentType = "CompanyInfo")]
public record CompanyInfo(
    int Cik,
    // Only the PartitionKey property below maps to the container's single "/partitionKey" path.
    [property: VectorStoreEmbeddingData] string Sector,
    [property: VectorStoreEmbeddingData] string Industry,
    [property: VectorStoreEmbeddingData] string SubIndustry,
    [property: VectorStoreEmbeddingData] string Cusip6,
    string Lei,
    string CompanyName,
    string Address1,
    string City,
    string State,
    string Zip,
    string Country,
    string Phone,
    string Website,
    [property: VectorStoreEmbeddingData] string LongBusinessSummary,
    [property: VectorStoreEmbeddingData] List<string> ReferenceNames,
    [property: VectorStoreEmbeddingData] List<string> Tickers,
    List<CompanyOfficer> CompanyOfficers,
    List<SecurityListing> SecurityListings,
    string WebsiteUrl,
    [property: VectorStoreEmbedding] ReadOnlyMemory<float>? Embedding = null
): VectorStoreEntity
{
    public string Type { get; set; } = "CompanyInfo";

    [VectorStoreId]
    public string Id { get; set; } = Cik.ToString();

    [VectorStorePartitionKey]
    public string PartitionKey { get; set; } = Cik.ToString();
};

[VectorStoreEntity(CollectionName = "companyInfo", DocumentType = "10-K-section")]
public record Form10KSection(
    int Cik,
    // Only the PartitionKey property below maps to the container's single "/partitionKey" path.
    [property: VectorStoreEmbeddingData] string Sector,
    [property: VectorStoreEmbeddingData] string Industry,
    [property: VectorStoreEmbeddingData] string SubIndustry,
    string SequenceId, 
    DateTime FilingDate, 
    string SectionName, 
    string SectionShortName, 
    [property: VectorStoreEmbeddingData] string SectionText, 
    [property: VectorStoreEmbedding] ReadOnlyMemory<float>? Embedding = null, 
    string Type = "10-K-section"): VectorStoreEntity
    {

    [VectorStoreId]
    public string Id { get; set; } = $"{Cik}_{FilingDate:yyyy-MM-dd}_{Type}_{SectionName}";

    [VectorStorePartitionKey]
    public string PartitionKey { get; set; } = Cik.ToString();

    [Description("The CIK of the company that filed the SEC document.")]
    public string FilerCik { get; set; } = Cik.ToString();

    // EDGAR accession numbers have the form NNNNNNNNNN-YY-NNNNNN: 10-digit filer ID, 2-digit year, 6-digit sequence.
    [JsonIgnore, Description("The Accession Number is a unique identifier assigned by the SEC to each filing.")]
    public string AccessionNumber => $"{FilerCik.PadLeft(10, '0')}-{FilingDate:yy}-{SequenceId.PadLeft(6, '0')}";

    [JsonIgnore]
    public Uri SourceUri => new Uri($"https://www.sec.gov/Archives/edgar/data/{Cik}/{AccessionNumber}.txt");
}

[VectorStoreEntity(CollectionName = "companyInfo", DocumentType = "13D")]
public record Form13D(
    int Cik, 
    // Only the PartitionKey property below maps to the container's single "/partitionKey" path.
    [property: VectorStoreEmbeddingData] string Sector,
    [property: VectorStoreEmbeddingData] string Industry,
    [property: VectorStoreEmbeddingData] string SubIndustry,
    string SequenceId, 
    string ReportingPerson, 
    DateTime FilingDate, 
    string Description, 
    ReadOnlyMemory<float>? Embedding = null, 
    string Type = "13D"): VectorStoreEntity
{

    [VectorStoreId]
    public string Id { get; set; } = $"{Cik}_{FilingDate:yyyy-MM-dd}_{Type}";

    [VectorStorePartitionKey]
    public string PartitionKey { get; set; } = Cik.ToString();

    [Description("The CIK of the entity that filed the SEC document.")]
    public string FilerCik { get; set; } = Cik.ToString();

    // EDGAR accession numbers have the form NNNNNNNNNN-YY-NNNNNN: 10-digit filer ID, 2-digit year, 6-digit sequence.
    [JsonIgnore, Description("The Accession Number is a unique identifier assigned by the SEC to each filing.")]
    public string AccessionNumber => $"{FilerCik.PadLeft(10, '0')}-{FilingDate:yy}-{SequenceId.PadLeft(6, '0')}";

    [JsonIgnore]
    public Uri SourceUri => new Uri($"https://www.sec.gov/Archives/edgar/data/{Cik}/{AccessionNumber}.txt");
}

[VectorStoreEntity(CollectionName = "companyInfo", DocumentType = "13F-HR")]
public record Form13FHR(
    int Cik, 
    // Only the PartitionKey property below maps to the container's single "/partitionKey" path.
    [property: VectorStoreEmbeddingData] string Sector,
    [property: VectorStoreEmbeddingData] string Industry,
    [property: VectorStoreEmbeddingData] string SubIndustry,
    string SequenceId, 
    string Cusip,  
    DateTime FilingDate, 
    int ManagerCik, 
    string ManagerName, 
    string SecurityName, 
    long Shares, // long: large positions can exceed the int range
    long Value, 
    string SecurityType, 
    ReadOnlyMemory<float>? Embedding = null, 
    string Type = "13F-HR"): VectorStoreEntity
{
    [VectorStoreId]
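    // Include the manager and the security so that holdings filed on the same date do not overwrite each other.
    public string Id { get; set; } = $"{Cik}_{ManagerCik}_{FilingDate:yyyy-MM-dd}_{Type}_{Cusip}";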

    [VectorStorePartitionKey]
    public string PartitionKey { get; set; } = Cik.ToString();

    [Description("The CIK of the manager who filed the SEC document.")]
    public string FilerCik { get; set; } = ManagerCik.ToString();

    // EDGAR accession numbers have the form NNNNNNNNNN-YY-NNNNNN: 10-digit filer ID, 2-digit year, 6-digit sequence.
    [JsonIgnore, Description("The Accession Number is a unique identifier assigned by the SEC to each filing.")]
    public string AccessionNumber => $"{FilerCik.PadLeft(10, '0')}-{FilingDate:yy}-{SequenceId.PadLeft(6, '0')}";

    [JsonIgnore]
    public Uri SourceUri => new Uri($"https://www.sec.gov/Archives/edgar/data/{ManagerCik}/{AccessionNumber}.txt");
}

// Container Properties
public static ContainerProperties getCompanyInfoContainerProperties() {
    var properties = new ContainerProperties(id: "companyInfo", partitionKeyPath: "/partitionKey"){
            IndexingPolicy = new (){
                VectorIndexes = new ()
                {
                    new VectorIndexPath(){
                        Path = "/embedding",
                        Type = VectorIndexType.DiskANN
                    }
                },
            },
            VectorEmbeddingPolicy = new VectorEmbeddingPolicy(new (){
                new Embedding(){
                    Path = "/embedding",
                    Dimensions = 1536,
                    DataType = VectorDataType.Float32,
                    DistanceFunction = Microsoft.Azure.Cosmos.DistanceFunction.Cosine
                }
            })
    };
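    // Cosmos DB requires the root path "/*" to appear in either the included or the excluded paths.
    properties.IndexingPolicy.IncludedPaths.Add(new () { Path = "/*" });
    // Keep the vector and the long text fields out of the regular range index.
    properties.IndexingPolicy.ExcludedPaths.Add(new () { Path = "/embedding/*" });
    properties.IndexingPolicy.ExcludedPaths.Add(new () { Path = "/sectionText/*" });
    properties.IndexingPolicy.ExcludedPaths.Add(new () { Path = "/longBusinessSummary/*" });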

    return properties;
} 

/**
    * Container: MarketData
*/
public record DailyStockMarketReport(int Cik, string Symbol, DateTime Date, float Open, float High, float Low, float Close, long Volume) {
    public string Type => "DailyMarketData";
    public string Id { get; set; } = $"{Symbol}_{Date:yyyy-MM-dd}";
    public string PartitionKey { get; set; } = Symbol;
}

public record NewsArticle(string Headline, string ArticleText, string SourceName, string Uri, DateTime PublishDate){
    public string Type => "NewsArticle";
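    // Cosmos DB ids cannot contain '/', '\', '?' or '#', so the URI is percent-encoded.
    public string Id => $"{SourceName}_{System.Uri.EscapeDataString(Uri)}";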
    public string PartitionKey { get; set; } = SourceName;
}

public static ContainerProperties getMarketDataContainerProperties() {
    var properties = new ContainerProperties(id: "marketData", partitionKeyPath: "/partitionKey"){
            IndexingPolicy = new (){
                VectorIndexes = new ()
                {
                    new VectorIndexPath(){
                        Path = "/embedding",
                        Type = VectorIndexType.DiskANN
                    }
                },
            },
            VectorEmbeddingPolicy = new VectorEmbeddingPolicy(new (){
                new Embedding(){
                    Path = "/embedding",
                    Dimensions = 1536,
                    DataType = VectorDataType.Float32,
                    DistanceFunction = Microsoft.Azure.Cosmos.DistanceFunction.Cosine
                }
            })
    };
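    // Cosmos DB requires the root path "/*" to appear in either the included or the excluded paths.
    properties.IndexingPolicy.IncludedPaths.Add(new () { Path = "/*" });
    properties.IndexingPolicy.ExcludedPaths.Add(new () { Path = "/embedding/*" });
    // Skip range indexing of the long article body.
    properties.IndexingPolicy.ExcludedPaths.Add(new () { Path = "/articleText/*" });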

    return properties;
} 

var Form10KSectionDescriptions = new Dictionary<string, string>
{
    { "item1", "Business: requires a description of the company’s business, including its main products and services, what subsidiaries it owns, and what markets it operates in" },
    { "item1a", "Risk Factors: includes information about the most significant risks that apply to the company or to its securities" },
    { "item1b", "Unresolved Staff Comments: requires the company to explain certain comments it has received from the SEC staff on previously filed reports that have not been resolved after an extended period of time" },
    { "item2", "Properties: includes information about the company’s significant properties, such as principal plants, mines and other materially important physical properties" },
    { "item3", "Legal Proceedings: requires the company to include information about significant pending lawsuits or other legal proceedings, other than ordinary litigation" },
    { "item7", "Management’s Discussion and Analysis of Financial Condition and Results of Operations (MD&A): gives the company’s perspective on the business results of the past financial year. This section, known as the MD&A for short, allows company management to tell its story in its own words" },
    { "item7a", "Quantitative and Qualitative Disclosures About Market Risk: requires information about the company’s exposure to market risk, such as interest rate risk, foreign currency exchange risk, commodity price risk or equity price risk" },
    { "item8", "Financial Statements and Supplementary Data: requires the company’s audited financial statements" },
    { "item10", "Directors, Executive Officers and Corporate Governance: requires information about the background and experience of the company’s directors and executive officers, the company’s code of ethics, and certain qualifications for directors and committees of the board of directors" },
    { "item11", "Executive Compensation: includes detailed disclosure about the company’s compensation policies and programs and how much compensation was paid to the top executive officers of the company in the past year" },
    { "item15", "Exhibits, Financial Statement Schedules: Many exhibits are required, including documents such as the company’s bylaws, copies of its material contracts, and a list of the company’s subsidiaries" }
};

public class VectorStore
{
    private readonly Database _database;
    private readonly Tokenizer _tokenizer;
    private readonly ITextEmbeddingGenerationService _textEmbeddingService;

    public VectorStore(Database database, Tokenizer tokenizer, ITextEmbeddingGenerationService textEmbeddingService)
    {
        _database = database;
        _tokenizer = tokenizer;
        _textEmbeddingService = textEmbeddingService;
    }

    public VectorCollection<TRecord> GetContainer<TRecord>() where TRecord : VectorStoreEntity
    {
        var containerName = typeof(TRecord).GetCustomAttribute<VectorStoreEntityAttribute>().CollectionName;
        var container = _database.GetContainer(containerName);
        return new VectorCollection<TRecord>(container, _tokenizer, _textEmbeddingService);
    }

    public async IAsyncEnumerable<string> ListCollectionNamesAsync([EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        const string Query = "SELECT VALUE(c.id) FROM c";

        using var feedIterator = this._database.GetContainerQueryIterator<string>(Query);

        while (feedIterator.HasMoreResults)
        {
            var next = await feedIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false);

            foreach (var containerName in next.Resource)
            {
                yield return containerName;
            }
        }
    }
}

public class VectorCollection<TRecord> where TRecord : VectorStoreEntity
{
    public readonly string DocumentType = typeof(TRecord).GetCustomAttribute<VectorStoreEntityAttribute>().DocumentType;
    private readonly Container _container;
    private readonly Tokenizer _tokenizer;
    private readonly ITextEmbeddingGenerationService _textEmbeddingService;
    private readonly string _idPropertyName;
    private readonly string[] _partitionKeyPropertyNames;
    private readonly string _embeddingPropertyName;
    
    public VectorCollection(Container container, Tokenizer tokenizer, ITextEmbeddingGenerationService textEmbeddingService)
    {
        _container = container;
        _tokenizer = tokenizer;
        _textEmbeddingService = textEmbeddingService;
        _idPropertyName = GetIdPropertyName();
        _partitionKeyPropertyNames = GetPartitionKeyPropertyNames().ToArray();
        _embeddingPropertyName = GetEmbeddedFieldPropertyName();
    }

    public async Task UpsertAsync(TRecord item, bool updateEmbeddingFields = false, CancellationToken cancellationToken = default)
    {
        if(updateEmbeddingFields)
        {
            item.UpdateTokenCount(_tokenizer);
            await item.UpdateEmbedding(_textEmbeddingService);
        }
        
        await _container.UpsertItemAsync(item, cancellationToken: cancellationToken);
    }

    public async Task<TRecord> GetByKeyAsync(string id, PartitionKey partitionKey, CancellationToken cancellationToken = default)
    {
        return await _container.ReadItemAsync<TRecord>(id, partitionKey, cancellationToken: cancellationToken);
    }


    public Task RemoveAsync(TRecord item, CancellationToken cancellationToken = default)
    {
        return _container.DeleteItemAsync<TRecord>(item.GetId(), item.GetPartitionKey(), cancellationToken: cancellationToken);
    }

    // public async IAsyncEnumerable<TRecord> GetBatchAsync(Func<TRecord, bool> predicate, string[] selectFields, CancellationToken cancellationToken = default)
    // {
    //     const string WhereClauseDelimiter = " OR ";
    //     const string SelectClauseDelimiter = ",";

    //     const string RecordKeyVariableName = "rk";
    //     const string PartitionKeyVariableName = "pk";

    //     const string TableVariableName = "x";

    //     var selectClauseArguments = string.Join(SelectClauseDelimiter,
    //         selectFields.Select(field => $"{TableVariableName}.{field}"));
        
    //     var whereClauseArguments = convertFromPredicate(predicate);

    //     // convert the predicate to a SQL query
    //     var queryText = $"SELECT {selectClauseArguments} FROM {TableVariableName} WHERE {whereClauseArguments}";

    //     var queryDefinition = new QueryDefinition(queryText);
    //     // add any parameters to the query
    //     // queryDefinition.WithParameter("@paramName", paramValue);

    //     using var feedIterator = this._container
    //      .GetItemQueryIterator<TRecord>(queryDefinition);

    //     while (feedIterator.HasMoreResults)
    //     {
    //         foreach (var document in await feedIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false))
    //         {
    //             yield return document;
    //         }
    //     }
    // }

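    public async IAsyncEnumerable<TRecord> GetBatchAsync(
        System.Linq.Expressions.Expression<Func<TRecord, TRecord>> projection,
        System.Linq.Expressions.Expression<Func<TRecord, bool>> predicate,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Build the query with LINQ so the filter and projection are translated to SQL and run server-side.
        // Expression trees are required: the Cosmos LINQ provider cannot translate a plain Func<>.
        var queryable = _container.GetItemLinqQueryable<TRecord>()
            .Where(predicate)
            .Select(projection);

        // Fully qualified so the call does not depend on a `using Microsoft.Azure.Cosmos.Linq;` directive.
        using var feedIterator = Microsoft.Azure.Cosmos.Linq.CosmosLinqExtensions.ToFeedIterator(queryable);

        while (feedIterator.HasMoreResults)
        {
            foreach (var document in await feedIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false))
            {
                yield return document;
            }
        }
    }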

    public async IAsyncEnumerable<(TRecord, double)> GetNearestMatchesAsync(
        ReadOnlyMemory<float> embedding,
        string[] fields,
        int limit = 1,
        double minRelevanceScore = 0.0,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        const string VectorVariableName = "@vectors";
        const string LimitVariableName = "@limit";
        const string MinRelevanceVariableName = "@minRelevanceScore";

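        // Prefix every requested field with the given alias, e.g. "p.id, p.text".
        Func<string, string> getSelectItems = (string prefix) => string.Join(", ", fields.Select(field => $"{prefix}.{field}"));

        // Callers receive relevance = (similarity + 1) / 2, so convert the threshold back to the
        // raw similarity range before filtering server-side.
        // Assumes the container uses the cosine distance function (similarity in [-1, 1]).
        double minSimilarityScore = (2 * minRelevanceScore) - 1;

        string queryText = $"""
            SELECT
                TOP {LimitVariableName}
                {getSelectItems("p")},
                p.similarityScore
            FROM
                (SELECT {getSelectItems("s")},
                VectorDistance(s.{_embeddingPropertyName}, {VectorVariableName}, false) AS similarityScore FROM s)
            p
            WHERE
                p.similarityScore >= {MinRelevanceVariableName}
            ORDER BY
                p.similarityScore DESC
            """;

        var queryDefinition = new QueryDefinition(queryText)
            .WithParameter(VectorVariableName, embedding.ToArray())
            .WithParameter(LimitVariableName, limit)
            .WithParameter(MinRelevanceVariableName, minSimilarityScore);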

        using var feedIterator = this._container
         .GetItemQueryIterator<TRecord>(queryDefinition);

        while (feedIterator.HasMoreResults)
        {
            foreach (var document in await feedIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false))
            {
                var relevanceScore = (document.SimilarityScore + 1) / 2;
                if (relevanceScore >= minRelevanceScore)
                {
                    yield return (document, relevanceScore);
                }
            }
        }
    }

    public async IAsyncEnumerable<ItemResponse<TRecord>> DeleteNearestMatchAsync(
        ReadOnlyMemory<float> embedding,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
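        const string VectorVariableName = "@vectors";
        const string SimilarityScoreVariableName = "@similarityScore";
        double similarityScore = 0.99;

        // Project the id together with the partition key properties so that
        // GetId() and GetPartitionKey() can be resolved on the deserialized result.
        var keyFields = new[] { "id" }.Concat(_partitionKeyPropertyNames).Distinct().ToArray();
        Func<string, string> getSelectItems = (string prefix) => string.Join(", ", keyFields.Select(field => $"{prefix}.{field}"));

        // The inner query is aliased as c and the outer query as p; the aliases must be used consistently.
        string queryText = $"""
            SELECT
                TOP 1 {getSelectItems("p")}
            FROM
                (SELECT {getSelectItems("c")},
                VectorDistance(c.{_embeddingPropertyName}, {VectorVariableName}, false) AS similarityScore FROM c)
            p
            WHERE
                p.similarityScore >= {SimilarityScoreVariableName}
            ORDER BY
                p.similarityScore DESC
            """;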

        var queryDefinition = new QueryDefinition(queryText)
            .WithParameter(VectorVariableName, embedding.ToArray())
            .WithParameter(SimilarityScoreVariableName, similarityScore);

        using var feedIterator = this._container
         .GetItemQueryIterator<TRecord>(queryDefinition);

        while (feedIterator.HasMoreResults)
        {
            foreach (var document in await feedIterator.ReadNextAsync(cancellationToken).ConfigureAwait(false))
            {
                yield return await _container.DeleteItemAsync<TRecord>(document.GetId(), document.GetPartitionKey(), cancellationToken: cancellationToken);
            }
        }
    }

    // Resolve the camelCase JSON names of the attributed properties.
    // JsonNamingPolicy.CamelCase.ConvertName returns the bare name, whereas
    // JsonSerializer.Serialize(p.Name, ...) would return the unchanged name wrapped in quotes.
    private string GetEmbeddedFieldPropertyName() => typeof(TRecord).GetProperties(BindingFlags.Public | BindingFlags.Instance)
        .Where(p => p.GetCustomAttribute<VectorStoreEmbeddingAttribute>() != null)
        .Select(p => JsonNamingPolicy.CamelCase.ConvertName(p.Name))
        .First();

    private IEnumerable<string> GetPartitionKeyPropertyNames() => typeof(TRecord).GetProperties(BindingFlags.Public | BindingFlags.Instance)
        .Where(p => p.GetCustomAttribute<VectorStorePartitionKeyAttribute>() != null)
        .Select(p => JsonNamingPolicy.CamelCase.ConvertName(p.Name));

    private string GetIdPropertyName() => typeof(TRecord).GetProperties(BindingFlags.Public | BindingFlags.Instance)
        .Where(p => p.GetCustomAttribute<VectorStoreIdAttribute>() != null)
        .Select(p => JsonNamingPolicy.CamelCase.ConvertName(p.Name))
        .First();
}

public class RagContextBuilder
{
    private readonly VectorCollection<ChatThread> _chatThreadCollection;
    private readonly VectorCollection<ChatThreadMessage> _chatThreadMessageCollection;
    private readonly VectorCollection<CacheItem> _cacheCollection;
    private readonly VectorCollection<CompanyInfo> _companyInfoCollection;
    private string _activeThread { get; set; }

    // These are instance members (not static) because they read the collection fields (fixes CS0120).
    private Specification<ChatThread> allChatThreadsForUser(string userId) => new PropertyEqualSpecification<ChatThread>(x => x.Type, _chatThreadCollection.DocumentType)
        .And(new PropertyEqualSpecification<ChatThread>(x => x.UserId, userId));

    private Specification<ChatThreadMessage> allMessagesForThread(string threadId) => new PropertyEqualSpecification<ChatThreadMessage>(x => x.Type, _chatThreadMessageCollection.DocumentType)
        .And(new PropertyEqualSpecification<ChatThreadMessage>(x => x.ThreadId, threadId));
    
    public RagContextBuilder(Kernel semanticKernel)
    {
        var vectorStore = semanticKernel.GetRequiredService<VectorStore>();
        _chatThreadCollection = vectorStore.GetContainer<ChatThread>();
        _chatThreadMessageCollection = vectorStore.GetContainer<ChatThreadMessage>();
        _cacheCollection = vectorStore.GetContainer<CacheItem>();
        _companyInfoCollection = vectorStore.GetContainer<CompanyInfo>();
    }

}



In [56]:
#pragma warning disable SKEXP0001,SKEXP0020

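var cosmosDb = cosmosNoSqlService.databaseClient;

// Register the VectorStore with the kernel builder's service collection.
skBuilder.Services.AddTransient<VectorStore>((sp) => {
  var tokenizer = sp.GetRequiredService<Tokenizer>();
  var textEmbeddingService = sp.GetRequiredService<ITextEmbeddingGenerationService>();
  return new VectorStore(cosmosDb, tokenizer, textEmbeddingService);
});

// AddTransient returns the service collection, not a Kernel, so build the kernel explicitly.
var kernel = skBuilder.Build();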


# Semantic Search on Cosmos DB
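
The `GetNearestMatchesAsync` method defined earlier maps each raw similarity score to a relevance score with `(similarity + 1) / 2`. The sketch below illustrates that mapping on tiny vectors, assuming the container is configured with the cosine distance function so that the similarity lies in [-1, 1]. `CosineSimilarity` and `ToRelevanceScore` are hypothetical helper names used only for illustration; in the notebook the similarity is computed by Cosmos DB via `VectorDistance`, not client-side.

```csharp
using System;

// Hypothetical helper: cosine similarity of two equal-length, non-zero vectors.
double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
    {
        throw new ArgumentException("Vectors must have the same length.");
    }

    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // Zero-length vectors would divide by zero here; this sketch assumes non-zero input.
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Maps a cosine similarity in [-1, 1] to a relevance score in [0, 1].
double ToRelevanceScore(double similarity) => (similarity + 1) / 2;

var query = new float[] { 1f, 0f };
var same = ToRelevanceScore(CosineSimilarity(query, new float[] { 1f, 0f }));       // same direction: relevance 1
var orthogonal = ToRelevanceScore(CosineSimilarity(query, new float[] { 0f, 1f })); // unrelated: relevance 0.5
var opposite = ToRelevanceScore(CosineSimilarity(query, new float[] { -1f, 0f }));  // opposite direction: relevance 0

Console.WriteLine($"{same} {orthogonal} {opposite}");
```

Under this mapping, a `minRelevanceScore` of 0.5 corresponds to a raw cosine similarity of 0, so thresholds passed to the search should be chosen on the [0, 1] relevance scale.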

# Structured Database Copilot
NL2SQL - Database query generation

### Considerations
- The model is usually good at building most of the database query, but it needs prompt tuning or native functions to improve the `WHERE` clause.
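
One possible shape for such a native function, sketched under assumptions: instead of letting the model write the `WHERE` clause as free text, the model emits structured filters and plain code turns them into a parameterized clause. The function name `BuildWhereClause`, the filter tuple shape, and the field names are all hypothetical and are not part of Semantic Kernel or the Cosmos DB SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Operators this sketch accepts; anything else is rejected.
var allowedOperators = new HashSet<string> { "=", "!=", ">", ">=", "<", "<=" };

// Hypothetical helper: builds a parameterized WHERE clause from structured filters.
// Field names are checked against an allow-list and values are passed as parameters,
// so model output is never concatenated into the query as raw SQL.
(string Clause, Dictionary<string, object> Parameters) BuildWhereClause(
    IEnumerable<(string Field, string Operator, object Value)> filters,
    ISet<string> allowedFields,
    string alias = "c")
{
    var parameters = new Dictionary<string, object>();
    var conditions = new List<string>();

    foreach (var filter in filters)
    {
        if (!allowedFields.Contains(filter.Field))
        {
            throw new ArgumentException($"Unknown field '{filter.Field}'.");
        }
        if (!allowedOperators.Contains(filter.Operator))
        {
            throw new ArgumentException($"Unsupported operator '{filter.Operator}'.");
        }

        var parameterName = $"@p{parameters.Count}";
        parameters[parameterName] = filter.Value;
        conditions.Add($"{alias}.{filter.Field} {filter.Operator} {parameterName}");
    }

    var clause = conditions.Count == 0 ? string.Empty : "WHERE " + string.Join(" AND ", conditions);
    return (clause, parameters);
}

// Usage with illustrative (made-up) ticket fields.
var allowedFields = new HashSet<string> { "severity", "status" };
var filters = new (string Field, string Operator, object Value)[]
{
    ("severity", ">=", 3),
    ("status", "=", "open"),
};

var (clause, parameters) = BuildWhereClause(filters, allowedFields);
Console.WriteLine(clause); // WHERE c.severity >= @p0 AND c.status = @p1
```

The returned parameters can then be attached to a `QueryDefinition` with `WithParameter`, in the same way the vector queries earlier in this notebook pass their values.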

**Example User Stories**
- I have application monitoring or metric data that I want to derive insights from.
- I want to chat over the entire corpus of ServiceNow or other ICM support ticket information.