# Download Dataset and Create Embeddings

This notebook demonstrates how to download a dataset from Hugging Face, process it, and create embeddings using Databricks AI functions. The workflow includes:

1. **Dataset Download**: Load the Stanford IMDB movie review dataset from Hugging Face
2. **Data Processing**: Combine train, test, and unsupervised datasets into a single DataFrame
3. **Unity Catalog Storage**: Save the processed dataset to Unity Catalog
4. **Embedding Generation**: Create vector embeddings using the Databricks `AI_QUERY` function

## Prerequisites

- Access to a Databricks workspace with Unity Catalog enabled
- Permissions to create tables in the specified catalog and schema
- Access to Databricks Foundation Model APIs for embedding generation
- Internet connectivity to download from Hugging Face
- The Hugging Face `datasets` library installed on the cluster (for example, with `%pip install datasets`)

## Dataset Information

We'll be working with the **Stanford IMDB Movie Review Dataset**:
- **Source**: `stanfordnlp/imdb` on Hugging Face
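- **Content**: Movie reviews; the `train` and `test` splits carry sentiment labels, the `unsupervised` split does not
- **Size**: 100,000 movie reviews (25,000 train, 25,000 test, 50,000 unsupervised)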
- **Purpose**: Text classification and sentiment analysis

## Step 1: Configuration

Configure your Unity Catalog destination for the dataset:

In [None]:
UC_CATALOG = "users"
UC_SCHEMA = "alex_miller"
UC_TABLE = "imdb"
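The three settings combine into the three-level name `catalog.schema.table` used in the later steps. As a minimal sketch, the helper below builds that name and backtick-quotes each part, which matters when a name contains characters such as hyphens. `qualified_name` is a hypothetical helper for illustration and is not part of the notebook.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a backtick-quoted three-level Unity Catalog name."""
    parts = (catalog, schema, table)
    for part in parts:
        if not part:
            raise ValueError("catalog, schema, and table must be non-empty")
    # Escape embedded backticks by doubling them, then wrap each part in backticks.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


print(qualified_name("users", "alex_miller", "imdb"))
# `users`.`alex_miller`.`imdb`
```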

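## Step 2: Download Dataset

Load all splits of `stanfordnlp/imdb` from Hugging Face with the `datasets` library:

In [0]: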
from datasets import load_dataset

ds = load_dataset("stanfordnlp/imdb")
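Before processing, it can help to confirm that every expected split was downloaded. The sketch below counts rows per split. It relies only on the loaded object behaving like a mapping from split name to a sized collection, which is how a `DatasetDict` behaves. `split_sizes` is a hypothetical helper, and the stand-in data is for illustration only.

```python
from typing import Dict, Mapping, Sized


def split_sizes(dataset_dict: Mapping[str, Sized]) -> Dict[str, int]:
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}


# In the notebook, call split_sizes(ds) on the loaded dataset.
# A small stand-in with the same shape, for illustration only:
example = {"train": range(3), "test": range(2)}
print(split_sizes(example))
# {'train': 3, 'test': 2}
```

For `stanfordnlp/imdb`, the call should report 25,000 rows each for `train` and `test`, and 50,000 for `unsupervised`.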

## Step 3: Process and Combine Dataset

Process the downloaded dataset by:

1. **Convert to Pandas**: Transform Hugging Face dataset splits into pandas DataFrames
2. **Combine splits**: Merge the train, test, and unsupervised splits into a single dataset
3. **Add unique IDs**: Generate monotonically increasing IDs for each record (unique, but not guaranteed to be consecutive)
4. **Convert to Spark**: Create a Spark DataFrame for efficient processing and storage

The resulting DataFrame contains:
- **text**: Movie review text
- **label**: Sentiment label (0 = negative, 1 = positive, -1 = unlabeled, as used by the `unsupervised` split)
- **id**: Unique identifier for each review


In [0]:
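import pandas as pd
from pyspark.sql import functions as F

# Convert each Hugging Face split to a pandas DataFrame.
train_dataset = ds["train"].to_pandas()
unsupervised_dataset = ds["unsupervised"].to_pandas()  # unlabeled reviews
test_dataset = ds["test"].to_pandas()

# Stack the three splits into one DataFrame with a fresh 0..n-1 index.
all_dataset = pd.concat(
    [train_dataset, unsupervised_dataset, test_dataset], ignore_index=True
)

# Convert to Spark and add a unique (but not consecutive) 64-bit ID per row.
spark_dataframe = spark.createDataFrame(all_dataset).withColumn(
    "id", F.monotonically_increasing_id()
)

display(spark_dataframe)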
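The IDs produced by `monotonically_increasing_id()` are unique but can jump by large amounts between rows. The Spark documentation describes the current layout as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below mimics that layout to show where the gaps come from. `spark_style_id` is a hypothetical name and is not a Spark API.

```python
def spark_style_id(partition_id: int, row_in_partition: int) -> int:
    """Mimic the documented layout: partition ID in the upper 31 bits,
    per-partition record number in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# Two partitions with three rows each, as in the Spark documentation example.
ids = [spark_style_id(p, r) for p in range(2) for r in range(3)]
print(ids)
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

Because the values depend on partitioning, they are only stable once the DataFrame has been written to a table.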

## Step 4: Save Dataset to Unity Catalog

Save the processed dataset to Unity Catalog for persistent storage and easy access across your Databricks workspace. The data is stored in Delta format, providing:

- **ACID transactions**: Reliable data operations
- **Time travel**: Version history and rollback capabilities
- **Schema enforcement**: Data quality and consistency
- **Optimized performance**: Fast queries and analytics

The table will be created at: `{UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}`


In [0]:
spark_dataframe.write.mode("overwrite").saveAsTable(f"{UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}")
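After the write, a quick row-count check confirms that the table holds what was expected. The sketch below takes the Spark session as an argument so the logic stays easy to read. `assert_row_count` is a hypothetical helper, and the expected value of 100,000 assumes the three splits were combined without loss.

```python
def assert_row_count(spark, table_name: str, expected: int) -> int:
    """Raise if the saved table does not hold the expected number of rows."""
    actual = spark.table(table_name).count()
    if actual != expected:
        raise ValueError(f"{table_name}: expected {expected} rows, found {actual}")
    return actual


# In the notebook (uses the `spark` session that Databricks provides):
# assert_row_count(spark, f"{UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}", 100_000)
```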

## Step 5: Create Embeddings using AI_QUERY

Generate vector embeddings for the movie review texts using the Databricks `AI_QUERY` function. This step:

1. **Creates a new table**: `{UC_TABLE}_embeddings` with all original columns plus embeddings
2. **Generates embeddings**: Uses the `databricks-gte-large-en` model to create 1024-dimensional vectors
3. **Processes all records**: Applies the embedding function to each review text

### About the Embedding Model

- **Model**: `databricks-gte-large-en` (General Text Embeddings)
- **Dimensions**: 1024 (suitable for semantic similarity tasks)
- **Use case**: Optimized for English text understanding and similarity search
- **Context window**: Accepts inputs of up to 8,192 tokens

### Expected Output

The embeddings table will contain:
- All original columns (`text`, `label`, `id`)
- New `embeddings` column with 1024-dimensional vectors
- Ready for vector search index creation

In [None]:
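# IF NOT EXISTS makes this cell a no-op when the embeddings table already exists:
# re-running it neither re-embeds the reviews nor picks up changes to the source table.
spark.sql(f"""CREATE TABLE IF NOT EXISTS {UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}_embeddings AS
          SELECT
            *,
            AI_QUERY(
              'databricks-gte-large-en',
              text
            ) AS embeddings
          FROM {UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}""")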
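The statement above hard-codes the endpoint name and the text column. The sketch below builds the same kind of statement from parameters, so the pattern can be reused for another table or endpoint. `embeddings_ctas` and its parameter names are hypothetical, and the table names in the example simply mirror the configuration from Step 1.

```python
def embeddings_ctas(
    source: str,
    target: str,
    endpoint: str = "databricks-gte-large-en",
    text_col: str = "text",
) -> str:
    """Build a CREATE TABLE ... AS SELECT statement that adds an embeddings column."""
    return (
        f"CREATE TABLE IF NOT EXISTS {target} AS\n"
        f"SELECT *, AI_QUERY('{endpoint}', {text_col}) AS embeddings\n"
        f"FROM {source}"
    )


sql = embeddings_ctas("users.alex_miller.imdb", "users.alex_miller.imdb_embeddings")
print(sql)
# In the notebook: spark.sql(sql)
```

The values are interpolated directly into the SQL text, so only pass names you control.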

In [None]:
spark.table(f"{UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}_embeddings").display()

## Next Steps

After completing this notebook, you'll have:

1. **Raw dataset**: `{UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}` - Original IMDB reviews
2. **Embeddings dataset**: `{UC_CATALOG}.{UC_SCHEMA}.{UC_TABLE}_embeddings` - Reviews with vector embeddings

**Continue to**: `02-create-vector-search-index.ipynb` to create a vector search index from your embeddings.

## Important Notes

### Performance Considerations
- **Dataset size**: Embedding ~100k reviews may take 10-15 minutes, depending on endpoint throughput and cluster size
- **Embedding generation**: AI_QUERY processes records in batches automatically
- **Resource usage**: Monitor cluster resources during embedding generation

### Cost Optimization
- **Foundation Model APIs**: Embedding generation incurs costs per token processed
- **Cluster sizing**: Use appropriately sized clusters for your dataset
- **Batch processing**: AI_QUERY automatically optimizes batch sizes
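Because cost scales with tokens processed, a rough token estimate before running the embedding step can be useful. The sketch below divides the total character count by an assumed characters-per-token ratio. The default of 4.0 is a rough rule of thumb for English text, not the model's actual tokenizer, so treat the result as an order-of-magnitude figure. `estimate_tokens` is a hypothetical helper.

```python
from typing import Iterable


def estimate_tokens(texts: Iterable[str], chars_per_token: float = 4.0) -> int:
    """Rough token estimate: total characters divided by an assumed ratio."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Skip None and empty strings; they contribute no characters.
    total_chars = sum(len(t) for t in texts if t)
    return round(total_chars / chars_per_token)


sample = ["A great film.", "Terrible pacing, but a strong cast."]
print(estimate_tokens(sample))
# 12
```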

### Troubleshooting

**Common Issues:**
- **Permission errors**: Ensure you have `USE CATALOG`, `USE SCHEMA`, and `CREATE TABLE` privileges on the target catalog and schema in Unity Catalog
- **Model access**: Verify access to Databricks Foundation Model APIs
- **Memory issues**: This notebook concatenates all splits in pandas on the driver; for datasets too large for driver memory, convert and write the data in chunks
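One way to process data in chunks is to split a sequence of records into fixed-size batches and handle one batch at a time. The sketch below shows this in plain Python. `chunked` is a hypothetical helper, and the batch size is arbitrary.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


print(list(chunked(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

In a notebook, each batch could be converted with `spark.createDataFrame` and written with `mode("append")`, so that only one batch is held in driver memory at a time.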

**Data Quality:**
- Review text should be clean and properly formatted
- Check for any null or empty text values before embedding generation
- Verify embedding dimensions match your expected model output (1024 for gte-large-en)
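The checks above can be expressed as two small predicates. The sketch below uses plain Python to make the logic explicit. Both function names are hypothetical, and the default dimension of 1024 comes from the model description in this notebook.

```python
from typing import Optional, Sequence


def is_embeddable(text: Optional[str]) -> bool:
    """True when the text is non-null and not just whitespace."""
    return text is not None and text.strip() != ""


def has_expected_dim(vector: Optional[Sequence[float]], expected: int = 1024) -> bool:
    """True when an embedding is present and has the expected length."""
    return vector is not None and len(vector) == expected


reviews = ["Loved it.", "   ", None]
print([is_embeddable(r) for r in reviews])
# [True, False, False]
print(has_expected_dim([0.0] * 1024))
# True
```

On the tables themselves, the equivalent Spark SQL conditions would be `text IS NOT NULL AND trim(text) != ''` and `size(embeddings) = 1024`.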

## Resources

- [Databricks AI_QUERY Documentation](https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html)
- [Hugging Face Datasets Library](https://huggingface.co/docs/datasets/)
- [Unity Catalog Documentation](https://docs.databricks.com/en/data-governance/unity-catalog/)
