# Hugging Face Embedding Model
* This notebook generates embeddings from a corpus of metadata extracted by the notebook metadata logging process (summaries, topics, libraries, and functions)
* The embeddings will be stored as part of the metadata payload
* Uses the Hugging Face embedding model `jinaai/jina-embeddings-v2-base-en`, loaded via `transformers`
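
The storage step mentioned in the bullets above could look roughly like the sketch below. It assumes the payload is a list of dicts like the one loaded from JSON in this notebook; the `Embedding` key, the `attach_embedding` helper, and the three-value stand-in vector are hypothetical and only illustrate the idea. NumPy arrays are not JSON serializable, so the vector is converted to a plain list first.

```python
import json
import numpy as np

def attach_embedding(record: dict, embedding: np.ndarray, key: str = "Embedding") -> dict:
    """Return a copy of `record` with the embedding stored as a JSON-friendly list.

    The key name "Embedding" is an assumption, not a field from the original payload.
    """
    updated = dict(record)
    # np.ndarray is not JSON serializable; tolist() converts to plain Python floats
    updated[key] = np.asarray(embedding, dtype=float).tolist()
    return updated

# Stand-in record and vector for illustration only (the real model returns 768 dimensions)
record = {"File Name": "example.ipynb", "Summary": "Example summary."}
vector = np.array([0.25, -0.5, 1.0], dtype=np.float32)

payload = [attach_embedding(record, vector)]
serialized = json.dumps(payload, indent=2)

# Round trip: the stored list can be turned back into an array when needed
restored = np.array(json.loads(serialized)[0]["Embedding"])
```

Returning a copy leaves the original record untouched, which makes it easier to re-run the cell without accumulating stale keys.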

In [1]:
import json
from transformers import AutoModel
# import pandas as pd 
import numpy as np
import os
from dotenv import load_dotenv



In [2]:
ENV_PATH = "/Users/davidbickham/Desktop/DS_Learning/Projects/Notebook_Metadata_Logging/.env"
JSON_PATH = "/Users/davidbickham/Desktop/DS_Learning/Projects/Notebook_Metadata_Logging/notebook_metadata_logger.json"

In [3]:
# Load from a JSON file
with open(JSON_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

In [4]:
data[0]

{'File Name': 'scikit-learn-regression-uber-eta.ipynb',
 'File Path': '/Users/davidbickham/Desktop/Personal/DeepLearningAI/Retrieval_Augmented_Generation/notebook_metadata_logging/scikit-learn-regression-uber-eta.ipynb',
 'Last Modified': '2025-07-31',
 'Summary': 'This Jupyter notebook performs a comprehensive regression analysis to predict Uber ride ETA (Estimated Time of Arrival) using trip and weather data. The script loads training data and weather information from Google Drive, conducts exploratory data analysis with correlation heatmaps and visualizations, then performs feature engineering by extracting time-based features and merging weather data. The preprocessing pipeline includes standardizing numeric features (pressure measurements) and one-hot encoding categorical features (day of week), followed by training a Random Forest regressor with hyperparameter tuning via GridSearchCV. The notebook concludes with extensive model evaluation using multiple regression metrics (RMSE, 

# Concatenate Metadata into Single String

In [5]:
summary = data[0]['Summary']
summary

'This Jupyter notebook performs a comprehensive regression analysis to predict Uber ride ETA (Estimated Time of Arrival) using trip and weather data. The script loads training data and weather information from Google Drive, conducts exploratory data analysis with correlation heatmaps and visualizations, then performs feature engineering by extracting time-based features and merging weather data. The preprocessing pipeline includes standardizing numeric features (pressure measurements) and one-hot encoding categorical features (day of week), followed by training a Random Forest regressor with hyperparameter tuning via GridSearchCV. The notebook concludes with extensive model evaluation using multiple regression metrics (RMSE, MAE, R²), residual analysis plots, feature importance analysis, and detailed diagnostic visualizations to assess model performance across different feature levels.'

In [6]:
topics = data[0]['Topics']
def extract_text_from_list(metadata_list):
    # Join list items into one comma-separated sentence; an empty or missing list gives ""
    if metadata_list:
        return ", ".join(metadata_list) + "."
    else:
        return ""

extract_text_from_list(topics)

'Supervised Learning, Regression, Time Series Forecasting, Scikit-learn, Feature Engineering.'

In [7]:
def extract_text_from_list(metadata_list):
    if metadata_list:
        return ", ".join(metadata_list) + "."
    else:
        return ""

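def get_embedding_text(summary, *metadata_lists):
    # With no metadata lists, the summary alone is the embedding text
    if not metadata_lists:
        return summary

    # Skip empty lists so they do not add stray separators
    extra_texts = [extract_text_from_list(lst) for lst in metadata_lists if lst]
    embedding_text = " ".join([summary] + extra_texts)

    return embedding_text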

In [8]:
summary = data[0]['Summary']
topics = data[0]['Topics']
functions = data[0]['Functions']
libraries = data[0]['Libraries']

embedding_text = get_embedding_text(summary, topics, functions, libraries)
embedding_text

'This Jupyter notebook performs a comprehensive regression analysis to predict Uber ride ETA (Estimated Time of Arrival) using trip and weather data. The script loads training data and weather information from Google Drive, conducts exploratory data analysis with correlation heatmaps and visualizations, then performs feature engineering by extracting time-based features and merging weather data. The preprocessing pipeline includes standardizing numeric features (pressure measurements) and one-hot encoding categorical features (day of week), followed by training a Random Forest regressor with hyperparameter tuning via GridSearchCV. The notebook concludes with extensive model evaluation using multiple regression metrics (RMSE, MAE, R²), residual analysis plots, feature importance analysis, and detailed diagnostic visualizations to assess model performance across different feature levels. Supervised Learning, Regression, Time Series Forecasting, Scikit-learn, Feature Engineering. categori
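
The same concatenation can be applied to every record rather than only `data[0]`. The sketch below is one possible way to do that, assuming each record uses the `Summary`, `Topics`, `Functions`, and `Libraries` keys shown above and that some records may have missing or empty lists. The `build_embedding_text` helper and the two sample records are hypothetical.

```python
def extract_text_from_list(metadata_list):
    """Join list items into one comma-separated sentence; empty input gives ''."""
    return ", ".join(metadata_list) + "." if metadata_list else ""

def build_embedding_text(record, list_keys=("Topics", "Functions", "Libraries")):
    """Combine the summary with any non-empty metadata lists found in `record`."""
    parts = [record.get("Summary", "")]
    for key in list_keys:
        # `or []` covers both a missing key and an explicit None value
        text = extract_text_from_list(record.get(key) or [])
        if text:
            parts.append(text)
    # Drop empty parts so a missing summary does not leave a leading space
    return " ".join(part for part in parts if part)

# Hypothetical records for illustration; the second has no functions or libraries
records = [
    {"Summary": "Trains a regressor.", "Topics": ["Regression"],
     "Functions": ["fit"], "Libraries": ["sklearn"]},
    {"Summary": "Plots data.", "Topics": ["Visualization"], "Functions": []},
]
texts = [build_embedding_text(r) for r in records]
```

Using `record.get(...)` instead of direct indexing means a record that lacks one of the list fields still produces usable text instead of raising a `KeyError`.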

# Generate Embedding

In [9]:
load_dotenv(dotenv_path=ENV_PATH)
HUGGINGFACE_API_KEY = os.getenv('HUGGINGFACE_API_KEY')

In [10]:
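# !pip install transformers
api_key = HUGGINGFACE_API_KEY  # Your Hugging Face token here

# trust_remote_code=True is required because this model ships its own modeling code
model = AutoModel.from_pretrained(
    'jinaai/jina-embeddings-v2-base-en',
    trust_remote_code=True,
    token=api_key  # `use_auth_token` is deprecated in recent transformers releases
)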



A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-implementation:
- configuration_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-implementation:
- modeling_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



In [11]:
embeddings = model.encode([embedding_text])

In [17]:
print(embeddings.shape)
print(embeddings[0,:].shape)
embeddings

(1, 768)
(768,)


array([[-3.54017466e-01, -9.23667789e-01,  6.21676743e-01,
         5.52692056e-01, -5.18640876e-02,  1.84424341e-01,
         1.88734978e-01, -5.05429029e-01,  3.87434989e-01,
         4.29343164e-01, -6.92087948e-01, -4.16551858e-01,
        -3.03237975e-01, -3.41011226e-01, -7.20602334e-01,
         1.38995183e+00, -9.31221396e-02, -1.61874235e-01,
        -1.50377363e-01,  1.78253472e-01, -2.36591429e-01,
        -5.76801419e-01, -3.53407115e-01,  1.47966102e-01,
         5.06129265e-01,  5.06500788e-02,  5.84479451e-01,
         4.55616951e-01,  3.60024214e-01,  1.58902586e-01,
        -3.96767817e-02, -1.95503190e-01, -1.90983072e-01,
        -1.48323756e-02,  2.52990603e-01, -6.12879209e-02,
        -2.70114601e-01, -1.18243195e-01,  7.03782201e-01,
         5.35114229e-01, -5.80036581e-01,  4.23090875e-01,
        -4.39957716e-02,  3.26594621e-01, -6.77264690e-01,
         3.78832698e-01, -3.84781778e-01, -1.95119992e-01,
        -3.14056963e-01,  3.60297374e-02,  5.25028184e-0

In [21]:
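def get_metadata_embedding(text: str, embedding_model='jinaai/jina-embeddings-v2-base-en'):
    # Note: the model is reloaded on every call; load it once and reuse it when embedding many texts
    model = AutoModel.from_pretrained(
        embedding_model,
        trust_remote_code=True,
        token=api_key  # `use_auth_token` is deprecated in recent transformers releases
    )

    # encode() returns shape (1, 768) for a single input; return the 768-dimensional vector
    embedding = model.encode([text])
    return embedding[0, :]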
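
Because the function above reloads the model on each call, embedding many records one at a time would be slow. A possible alternative, sketched below, is to load the model once and pass its `encode` method to a small batching helper. The `embed_texts` helper and the `fake_encode` stand-in are hypothetical; the stand-in exists only so the sketch runs without downloading the model, and it assumes, as in the cells above, that the encoder accepts a list of strings and returns one row per string.

```python
import numpy as np

def embed_texts(texts, encode_fn):
    """Embed all texts with a single call to `encode_fn` and return a 2-D array."""
    if not texts:
        return np.empty((0, 0))
    embeddings = np.asarray(encode_fn(list(texts)))
    # Guard against an encoder that silently drops or merges inputs
    if embeddings.shape[0] != len(texts):
        raise ValueError("encoder returned a different number of rows than inputs")
    return embeddings

def fake_encode(texts):
    """Stand-in encoder for illustration only: one 4-dimensional row per text."""
    return np.array([[len(t), t.count(" "), 0.0, 1.0] for t in texts], dtype=float)

batch_embeddings = embed_texts(["first summary", "second longer summary"], fake_encode)
# With the real model loaded once, the call would be: embed_texts(texts, model.encode)
```

Passing the encoder as an argument keeps the helper independent of how the model was loaded, so the same code works with the stand-in here and with `model.encode` in the notebook.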

In [22]:
metadata_embedding = get_metadata_embedding(embedding_text)



In [23]:
metadata_embedding

array([-3.54017466e-01, -9.23667789e-01,  6.21676743e-01,  5.52692056e-01,
       -5.18640876e-02,  1.84424341e-01,  1.88734978e-01, -5.05429029e-01,
        3.87434989e-01,  4.29343164e-01, -6.92087948e-01, -4.16551858e-01,
       -3.03237975e-01, -3.41011226e-01, -7.20602334e-01,  1.38995183e+00,
       -9.31221396e-02, -1.61874235e-01, -1.50377363e-01,  1.78253472e-01,
       -2.36591429e-01, -5.76801419e-01, -3.53407115e-01,  1.47966102e-01,
        5.06129265e-01,  5.06500788e-02,  5.84479451e-01,  4.55616951e-01,
        3.60024214e-01,  1.58902586e-01, -3.96767817e-02, -1.95503190e-01,
       -1.90983072e-01, -1.48323756e-02,  2.52990603e-01, -6.12879209e-02,
       -2.70114601e-01, -1.18243195e-01,  7.03782201e-01,  5.35114229e-01,
       -5.80036581e-01,  4.23090875e-01, -4.39957716e-02,  3.26594621e-01,
       -6.77264690e-01,  3.78832698e-01, -3.84781778e-01, -1.95119992e-01,
       -3.14056963e-01,  3.60297374e-02,  5.25028184e-02, -3.30138326e-01,
        1.75367236e-01, -