# Embedding Analysis Notebook

This notebook provides a template for analyzing embeddings using the Embedding Projector. It includes functions for loading and caching embeddings and labels, as well as a dashboard app for visualizing embeddings.

## How to Use This Notebook

1. Run the first code cell to import the necessary libraries and functions.
2. Modify the variables in the "Set Up Variables" section to specify the dataset, model, and sample sizes.
3. Run the cells in the "Load Datasets and Embeddings" section to load the categories and datasets.
4. Run the cells in the "Process Models" section to generate or load the embeddings.
5. Run the cell in the "Visualize Embeddings" section for a quick plot, or follow "Running the Dashboard App" to explore the embeddings interactively.


Note: Dataset caching is handled by the functions imported from `embedding_utils.py`. Change the variables in the "Set Up Variables" section to use a different dataset, embedding model, or sample size.

In [None]:
import os
import sys
import logging
import pandas as pd
import openai
import subprocess
from pathlib import Path
from kedro.framework.project import configure_project

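# Add the directory containing embedding_utils.py to the import path.
# This absolute path is machine-specific; adjust it to match your checkout.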
utils_path = os.path.abspath('/home/wadmin/embed_norm/apps/embed_norm/embed_norm_test')
if utils_path not in sys.path:
    sys.path.append(utils_path)

root_path = subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode().strip()
os.chdir(Path(root_path) / 'pipelines' / 'matrix')

from embedding_utils import (
    process_model,
    # process_model_combinations,
    embedding_models_info,
    parse_list_string,
    load_datasets,
    load_categories,
    load_embeddings_and_labels,
    missing_data_rows_dict,
    generate_candidate_pairs,
    refine_candidate_mappings_with_llm,
    find_additional_mappings_with_curategpt
)

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

## Set Up Variables

Specify your dataset names, seeds, and cache directory.

In [None]:
# Set up variables
cache_dir = root_path + '/apps/embed_norm/cached_datasets'
os.makedirs(cache_dir, exist_ok=True)

# Seeds for sampling
seed1 = 54321  # Seed for positive samples
seed2 = 67890  # Seed for negative samples

# Dataset configuration
# dataset_name = 'rtx_kg2'  # Replace with your dataset name
# nodes_dataset_name = 'ingestion.raw.rtx_kg2.nodes@pandas'  # Replace with your nodes dataset name
# edges_dataset_name = 'ingestion.raw.rtx_kg2.edges@pandas'  # Replace with your edges dataset name

dataset_name = 'rtx_kg2'
nodes_dataset_name = 'integration.int.rtx.nodes'
edges_dataset_name = 'integration.int.rtx.edges'

# Categories to process
categories = ['All Categories']  # Specify categories or use 'All Categories'

# Model configuration
model_name = 'OpenAI'
model_info = embedding_models_info[model_name]

# OpenAI API Key (if using OpenAI embeddings)
openai.api_key = os.getenv("OPENAI_API_KEY")  # Ensure your API key is set

# Adjustable sample size and ratio
total_sample_size = 1000  # Total number of samples
positive_ratio = 0.2     # Ratio of positive samples

# Cache suffix based on sample size and ratio
positive_n = int(total_sample_size * positive_ratio)
negative_n = total_sample_size - positive_n
cache_suffix = f"_pos_{positive_n}_neg_{negative_n}"
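The split arithmetic above can be checked in isolation. The sketch below is only an illustration: `split_counts` and `make_cache_suffix` are hypothetical helpers, not part of `embedding_utils.py`. They mirror the `int(...)` truncation used in the cell, so the positive count is rounded down and the remainder goes to the negatives.

```python
def split_counts(total_sample_size, positive_ratio):
    """Mirror the notebook's arithmetic: truncate the positive count,
    then assign the remainder to the negative samples."""
    positive_n = int(total_sample_size * positive_ratio)
    negative_n = total_sample_size - positive_n
    return positive_n, negative_n


def make_cache_suffix(total_sample_size, positive_ratio):
    """Build the same suffix format as the configuration cell."""
    positive_n, negative_n = split_counts(total_sample_size, positive_ratio)
    return f"_pos_{positive_n}_neg_{negative_n}"


print(make_cache_suffix(1000, 0.2))   # _pos_200_neg_800
print(make_cache_suffix(10, 0.25))    # _pos_2_neg_8 (2.5 is truncated to 2)
```

Because the suffix encodes both counts, changing either `total_sample_size` or `positive_ratio` produces a different suffix, which keeps caches for different sampling settings apart.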

## Load Datasets and Embeddings

Use the provided functions to load datasets and embeddings, handling caching automatically.

In [None]:
# Configure Kedro project
configure_project('matrix')

# Load categories
categories = load_categories(
    cache_dir=f'{cache_dir}/categories',
    dataset_name=dataset_name,
    nodes_dataset_name=nodes_dataset_name
)

# Load datasets
positive_datasets, datasets = load_datasets(
    cache_dir=f'{cache_dir}/datasets',
    dataset_name=dataset_name,
    nodes_dataset_name=nodes_dataset_name,
    edges_dataset_name=edges_dataset_name,
    categories=categories,
    seed1=seed1,
    seed2=seed2,
    total_sample_size=total_sample_size,
    positive_ratio=positive_ratio
)

# # Load embeddings and labels (if cached)
# embeddings_dict, labels_dict = load_embeddings_and_labels(
#     cache_dir=f'{cache_dir}/embeddings',
#     dataset_name=dataset_name,
#     model_name=model_name,
#     categories=categories,
#     seed=seed2,
#     combinations=False,
#     cache_suffix=cache_suffix
# )

# # Similarly for positive datasets
# embeddings_dict_pos, labels_dict_pos = load_embeddings_and_labels(
#     cache_dir=f'{cache_dir}/embeddings',
#     dataset_name=dataset_name,
#     model_name=model_name,
#     categories=categories,
#     seed=seed1,
#     combinations=False,
#     cache_suffix=cache_suffix
# )
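The loaders above handle caching internally. Their implementation is not shown in this notebook, so the following is only a generic sketch of the load-or-compute pattern such functions commonly follow. `load_or_compute` and `expensive_load` are hypothetical names, and pickle is an assumed storage format, not necessarily what `embedding_utils.py` uses.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the object cached at cache_path; on a miss, compute it,
    save it, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open('rb') as f:
            return pickle.load(f)
    result = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open('wb') as f:
        pickle.dump(result, f)
    return result


calls = []  # records how many times the expensive step actually runs


def expensive_load():
    calls.append(1)
    return {'All Categories': [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'datasets' / 'example.pkl'
    first = load_or_compute(path, expensive_load)   # cache miss: computes and saves
    second = load_or_compute(path, expensive_load)  # cache hit: reads from disk

print(first == second, len(calls))  # True 1
```

With this pattern, a stale result can be discarded by deleting the corresponding file from the cache directory.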

## Define Text Representation and Label Generation Functions

Customize how embeddings are generated by defining a function that converts each node row into the text that will be embedded.

In [None]:
# Define custom text representation function
# def node_to_string(row):
#     fields = [row.get('name', ''), row.get('description', '')]
#     text_values = []
#     for field_value in fields:
#         if pd.notnull(field_value):
#             parsed_list = parse_list_string(field_value)
#             text_values.extend(parsed_list)
#     return ' '.join(text_values).strip()

def node_to_string(row, text_fields=None):
    if text_fields is None:
        text_fields = ['all_names:string[]', 'all_categories:string[]']
    global missing_data_rows_dict
    fields = [row.get(field, '') for field in text_fields]
    missing_fields = [field for field, value in zip(text_fields, fields)
                      if pd.isnull(value) or not str(value).strip()]
    for missing_field in missing_fields:
        if missing_field not in missing_data_rows_dict:
            missing_data_rows_dict[missing_field] = []
        missing_data_rows_dict[missing_field].append(row)
    text_values = []
    for field_value in fields:
        parsed_list = parse_list_string(field_value)
        text_values.extend(parsed_list)
    text_representation = ' '.join(text_values).strip()
    if not text_representation:
        logging.warning(f"Empty text representation for row with index {row.name}")
    return text_representation

Define how plot labels are created.

In [None]:
# Define custom label generation function
def label_func(row):
    return f"{row['id:ID']}, {row['name']}, custom label"
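To preview what a label looks like, the label function can be called on a single record. This is a small sketch with a made-up node: the ID and name below are placeholders, and a plain dict stands in for a pandas row because both support `row['column']` lookups.

```python
def label_func(row):
    # Same formatting as the notebook's label function.
    return f"{row['id:ID']}, {row['name']}, custom label"


# Hypothetical node record used only for illustration.
example_row = {'id:ID': 'EXAMPLE:0001', 'name': 'example node'}

print(label_func(example_row))  # EXAMPLE:0001, example node, custom label
```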

## Process Models

Generate and cache embeddings for your datasets using the `process_model` function.

### Generate or Load Embeddings for Datasets

In [None]:

# Process embeddings for the datasets
model_name, embeddings_dict = process_model(
    model_name=model_name,
    model_info=model_info,
    datasets=datasets,
    cache_dir=f'{cache_dir}/embeddings',
    seed=seed2,
    text_representation_func=node_to_string,
    label_generation_func=label_func,
    dataset_name=dataset_name,
    use_ontogpt=False, # not working right now
    cache_suffix=cache_suffix
)
# Reload embeddings after generation
# embeddings_dict, labels_dict = load_embeddings_and_labels(
#     cache_dir=f'{cache_dir}/embeddings',
#     dataset_name=dataset_name,
#     model_name=model_name,
#     categories=categories,
#     seed=seed2,
#     combinations=False,
#     cache_suffix=cache_suffix
# )

# Process embeddings for positive datasets
model_name, embeddings_dict_pos = process_model(
    model_name=model_name,
    model_info=model_info,
    datasets=positive_datasets,
    cache_dir=f'{cache_dir}/embeddings',
    seed=seed1,
    text_representation_func=node_to_string,
    label_generation_func=label_func,
    dataset_name=dataset_name,
    use_ontogpt=False, # not working right now
    cache_suffix=cache_suffix
)
# Reload embeddings after generation
# embeddings_dict_pos, labels_dict_pos = load_embeddings_and_labels(
#     cache_dir=f'{cache_dir}/embeddings',
#     dataset_name=dataset_name,
#     model_name=model_name,
#     categories=categories,
#     seed=seed1,
#     combinations=False,
#     cache_suffix=cache_suffix
# )


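### Generate or Load Embeddings with Combinations (Optional)

Experimental: this generates several embeddings per node, one for each combination of features (for example, each name paired with each category). It is probably of limited use; enhancing the node text representation with an LLM would likely be more effective. The code below is commented out; to run it, also uncomment the `process_model_combinations` import in the first cell.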

In [None]:
# # Define custom text representations for combinations
# def node_to_strings(row):
#     names_field = 'name'
#     categories_field = 'category'
#     names_field_value = row.get(names_field, '')
#     categories_field_value = row.get(categories_field, '')
    
#     names_list = parse_list_string(names_field_value)
#     categories_list = parse_list_string(categories_field_value)
    
#     from itertools import product
#     combinations = list(product(names_list, categories_list))
    
#     text_representations = [' '.join(combination).strip() for combination in combinations]
#     return text_representations

# # Generate embeddings with combinations
# model_name, embeddings_dict = process_model_combinations(
#     model_name=model_name,
#     model_info=model_info,
#     datasets=datasets,
#     cache_dir=cache_dir,
#     seed=seed2,
#     text_representation_func=node_to_strings,
#     label_generation_func=label_func,
#     dataset_name=dataset_name
# )

# Process embeddings for positive datasets with combinations
# model_name, embeddings_dict_pos = process_model_combinations(
#     model_name=model_name,
#     model_info=model_info,
#     datasets=positive_datasets,
#     cache_dir=cache_dir,
#     seed=seed1,
#     text_representation_func=node_to_strings,
#     label_generation_func=label_func,
#     dataset_name=dataset_name
# )

### Visualize Embeddings

In [None]:
# Example visualization code (e.g., using matplotlib or a dashboard app)
# This is a placeholder; replace with your visualization code
import matplotlib.pyplot as plt

# For example, visualize embeddings for 'All Categories'
category = 'All Categories'
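embeddings = embeddings_dict[category]
# `labels_dict` exists only if one of the commented-out
# `load_embeddings_and_labels` calls above has been run; fall back to None otherwise.
labels = labels_dict[category] if 'labels_dict' in globals() else None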

# Use dimensionality reduction for visualization
from sklearn.decomposition import PCA

reduced_embeddings = PCA(n_components=2).fit_transform(embeddings)

plt.figure(figsize=(10, 10))
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], alpha=0.5)
plt.title(f'Embeddings Visualization for {category}')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()

## Running the Dashboard App

After generating the embeddings, you can use the dashboard app to visualize them.

### Instructions to Run the Dashboard App

1. Ensure that the embeddings have been generated and saved in the cache directory.
2. Navigate to the directory containing `app.py`.
3. Run the app using the command:
   ```
   python app.py
   ```
4. Open the provided URL (usually `http://0.0.0.0:3000`) in your web browser. If that address does not load, try `http://localhost:3000` instead.

The dashboard should now display the embeddings and allow you to interact with them.
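Step 1 can be checked before launching the app. The sketch below lists whatever files exist under a cache directory. `list_cached_files` is a hypothetical helper, and the file name in the demo is a placeholder, since the actual cache file names and formats are determined by `embedding_utils.py`.

```python
import tempfile
from pathlib import Path


def list_cached_files(cache_dir):
    """Return all files under cache_dir (searched recursively),
    or an empty list if the directory does not exist."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.rglob('*') if p.is_file())


# Demonstration against a temporary directory standing in for
# f'{cache_dir}/embeddings'.
with tempfile.TemporaryDirectory() as tmp:
    embeddings_dir = Path(tmp) / 'embeddings'
    print(len(list_cached_files(embeddings_dir)))  # 0: nothing generated yet

    embeddings_dir.mkdir()
    (embeddings_dir / 'placeholder.bin').write_bytes(b'\x00')
    print(len(list_cached_files(embeddings_dir)))  # 1
```

In the notebook, calling it with `f'{cache_dir}/embeddings'` and seeing an empty list would suggest the "Process Models" cells have not been run yet.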

## Adding New Models

To add new models, update the `embedding_models_info` dictionary in `embedding_utils.py` with the details of the new model.

### Example: Adding a New Hugging Face Model

```python
# In embedding_utils.py
embedding_models_info = {
    'OpenAI': {
        'type': 'openai',
    },
    'YourModelName': {
        'type': 'hf',
        'tokenizer_name': 'your-model-tokenizer-name',
        'model_name': 'your-model-name'
    },
}
```

After updating, you can regenerate embeddings for the new model using the same process.
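Before regenerating embeddings, it can help to check that a new entry is complete. The following is a sketch only: `validate_model_entry` and `REQUIRED_KEYS` are hypothetical, and the required keys are inferred from the example above rather than from the actual checks in `embedding_utils.py`.

```python
# Keys each model type is assumed to need, inferred from the example entries.
REQUIRED_KEYS = {
    'openai': {'type'},
    'hf': {'type', 'tokenizer_name', 'model_name'},
}


def validate_model_entry(name, info):
    """Raise ValueError if an entry has an unknown type or lacks a required key."""
    model_type = info.get('type')
    if model_type not in REQUIRED_KEYS:
        raise ValueError(f"{name}: unknown model type {model_type!r}")
    missing = REQUIRED_KEYS[model_type] - set(info)
    if missing:
        raise ValueError(f"{name}: missing keys {sorted(missing)}")


embedding_models_info = {
    'OpenAI': {
        'type': 'openai',
    },
    'YourModelName': {
        'type': 'hf',
        'tokenizer_name': 'your-model-tokenizer-name',
        'model_name': 'your-model-name',
    },
}

for name, info in embedding_models_info.items():
    validate_model_entry(name, info)
print("all entries look complete")
```

Catching a missing key at this point is cheaper than discovering it partway through an embedding run.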