# Chapter 2a: Text Columns Deep Dive

**Purpose:** Transform TEXT columns (tickets, emails, messages) into numeric features using embeddings and dimensionality reduction.

**When to use this notebook:**
- Your dataset contains TEXT columns (unstructured text data)
- This is detected automatically: the notebook checks the findings for columns typed as ColumnType.TEXT

**What you'll learn:**
- How text embeddings capture semantic meaning
- Why PCA reduces dimensions while preserving variance
- How to choose between fast and high-quality embedding models

**Outputs:**
- PC features (text_pc1, text_pc2, ...) for each TEXT column
- TextProcessingMetadata in findings
- Recommendations for production pipeline

---

## Two Approaches to Text Feature Engineering

| Approach | Method | When to Use |
|----------|--------|-------------|
| **1. Embeddings + PCA** (This notebook) | Sentence-transformers → PCA | General semantic features |
| **2. LLM Labeling** (Future) | LLM on samples → Train classifier | Specific categories needed |

### Approach 1: Embeddings + Dimensionality Reduction (Current)

```
TEXT Column → Embeddings → PCA → pc1, pc2, ..., pcN
```

- **Embeddings**: Dense vectors capturing semantic meaning (texts with similar meaning map to nearby vectors)
- **PCA**: Keeps the smallest number of components N whose cumulative explained variance reaches the target (default 95%)
- **Output**: Numeric features usable with standard ML models
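
To make "similar meaning, nearby vectors" concrete, here is a minimal sketch using cosine similarity, a common way to compare embeddings. The three 4-dimensional vectors are hypothetical stand-ins invented for illustration; a real embedding model produces hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (not produced by any real model).
refund_request = [0.9, 0.1, 0.0, 0.2]
money_back = [0.8, 0.2, 0.1, 0.3]
login_problem = [0.1, 0.9, 0.7, 0.0]

similar = cosine_similarity(refund_request, money_back)
different = cosine_similarity(refund_request, login_problem)
print(f"refund vs money back:  {similar:.2f}")
print(f"refund vs login issue: {different:.2f}")
```

The two refund-related vectors score much closer to 1 than the unrelated pair, which is the property downstream models exploit.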

### Embedding Model Options

| Model | Size | Embedding Dim | Speed | Quality | Best For |
|-------|------|---------------|-------|---------|----------|
| **MiniLM** (default) | 90 MB | 384 | Fast | Good | CPU, quick iteration, small datasets |
| **Qwen3-0.6B** | 1.2 GB | 1024 | Medium | Better | GPU available, production quality |
| **Qwen3-4B** | 8 GB | 2560 | Slow | High | 16GB+ GPU, multilingual, high accuracy |
| **Qwen3-8B** | 16 GB | 4096 | Slowest | Highest | 32GB+ GPU, research, max quality |

**Note:** Models are downloaded on first use (lazy loading). Qwen3 models require GPU for reasonable performance.
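
The table above can be read as a simple decision rule. The helper below is a hypothetical sketch (it is not part of `customer_retention`): it maps available GPU memory to one of the preset names used later in this notebook, and it assumes that any GPU smaller than 16 GB falls back to the 0.6B model.

```python
def choose_preset(gpu_memory_gb=None):
    """Map available GPU memory (in GB) to a preset name; None means CPU only."""
    if gpu_memory_gb is None:
        return "minilm"        # CPU-friendly default
    if gpu_memory_gb >= 32:
        return "qwen3-8b"      # table: 32GB+ GPU
    if gpu_memory_gb >= 16:
        return "qwen3-4b"      # table: 16GB+ GPU
    return "qwen3-0.6b"        # assumption: any smaller GPU

for memory in (None, 8, 24, 48):
    print(memory, "->", choose_preset(memory))
```

Treat the thresholds as a starting point; batch size and text length also affect memory use.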

### Approach 2: LLM Labeling (Future Enhancement)

```
TEXT Column → Sample → LLM Labels → Train Classifier → Apply to All
```

- Use when you need specific categorical labels (sentiment, topic, intent)
- More expensive than embeddings + PCA, but the resulting labels are easier to interpret
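
Since this approach is not implemented yet, the following is only a rough sketch of the four stages. Everything in it is an assumption made for illustration: `stub_llm_label` replaces a real LLM call with a keyword rule so the example runs offline, and the token-count classifier stands in for whatever model would actually be trained.

```python
import random
from collections import Counter, defaultdict

def stub_llm_label(text):
    """Stand-in for a real LLM call (hypothetical); a keyword rule keeps the sketch offline."""
    lowered = text.lower()
    return "billing" if ("invoice" in lowered or "charge" in lowered) else "technical"

def train_token_counts(texts, labels):
    """Tiny classifier: count how often each token appears under each label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts

def predict_label(counts, text):
    """Pick the label whose training tokens overlap most with the text."""
    tokens = text.lower().split()
    scores = {label: sum(token_counts[t] for t in tokens)
              for label, token_counts in counts.items()}
    return max(scores, key=scores.get)

tickets = [
    "Wrong charge on my invoice",
    "App crashes when I log in",
    "Please resend the invoice",
    "Cannot reset my password",
    "Double charge this month",
    "Login page shows an error",
]

# 1. Sample a subset, 2. label it with the (stub) LLM, 3. train, 4. apply to all rows.
sample = random.Random(0).sample(tickets, k=4)
sample_labels = [stub_llm_label(t) for t in sample]
classifier = train_token_counts(sample, sample_labels)
predicted = [predict_label(classifier, t) for t in tickets]
print(list(zip(tickets, predicted)))
```

The cost saving comes from step 2: the LLM only sees the sample, while the cheap classifier handles every other row.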

## 2a.1 Load Previous Findings

In [None]:
from customer_retention.analysis.auto_explorer import ExplorationFindings, TextProcessingMetadata
from customer_retention.analysis.visualization import ChartBuilder, display_figure, display_table, console
from customer_retention.core.config.column_config import ColumnType
from customer_retention.stages.profiling import (
    TextColumnProcessor, TextProcessingConfig, TextColumnResult,
    TextEmbedder, TextDimensionalityReducer,
    EMBEDDING_MODELS, get_model_info, list_available_models
)
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from customer_retention.core.config.experiments import FINDINGS_DIR, EXPERIMENTS_DIR, OUTPUT_DIR, setup_experiments_structure


In [None]:
# === CONFIGURATION ===
from pathlib import Path

# FINDINGS_DIR imported from customer_retention.core.config.experiments

findings_files = [f for f in FINDINGS_DIR.glob("*_findings.yaml") if "multi_dataset" not in f.name]
if not findings_files:
    raise FileNotFoundError(f"No findings files found in {FINDINGS_DIR}. Run notebook 01 first.")

findings_files.sort(key=lambda f: f.stat().st_mtime, reverse=True)
FINDINGS_PATH = str(findings_files[0])

print(f"Found {len(findings_files)} findings file(s)")
print(f"Using: {FINDINGS_PATH}")

findings = ExplorationFindings.load(FINDINGS_PATH)
print(f"\nLoaded findings for {findings.column_count} columns from {findings.source_path}")

In [None]:
# Identify TEXT columns
text_columns = [
    name for name, col in findings.columns.items()
    if col.inferred_type == ColumnType.TEXT
]

if not text_columns:
    print("\u26a0\ufe0f No TEXT columns detected in this dataset.")
    print("   This notebook is only needed when TEXT columns are present.")
    print("   Continue to notebook 03_quality_assessment.ipynb")
else:
    print(f"\u2705 Found {len(text_columns)} TEXT column(s):")
    for col in text_columns:
        col_info = findings.columns[col]
        print(f"   - {col} (Confidence: {col_info.confidence:.0%})")

## 2a.2 Load Source Data

In [None]:
from customer_retention.stages.temporal import load_data_with_snapshot_preference, TEMPORAL_METADATA_COLS

df, data_source = load_data_with_snapshot_preference(findings, output_dir=str(FINDINGS_DIR))
charts = ChartBuilder()

print(f"Loaded {len(df):,} rows x {len(df.columns)} columns")
print(f"Data source: {data_source}")

## 2a.3 Configuration

### Available Embedding Models

Run the cell below to see available models and their specifications. Then configure your choice.

In [None]:
# Display available embedding models
print("Available Embedding Models")
print("=" * 80)
print(f"{'Preset':<15} {'Model':<35} {'Size':<10} {'Dim':<8} {'GPU?'}")
print("-" * 80)

for preset in list_available_models():
    info = get_model_info(preset)
    size = f"{info['size_mb']} MB" if info['size_mb'] < 1000 else f"{info['size_mb']/1000:.1f} GB"
    gpu = "Yes" if info['gpu_recommended'] else "No"
    print(f"{preset:<15} {info['model_name']:<35} {size:<10} {info['embedding_dim']:<8} {gpu}")
    print(f"                {info['description']}")
    print()

print("\nModels are downloaded on first use. Choose based on your hardware and quality needs.")

In [None]:
# === TEXT PROCESSING CONFIGURATION ===
# Choose your embedding model preset:
#   "minilm"     - Fast, CPU-friendly, good for exploration (default)
#   "qwen3-0.6b" - Better quality, needs GPU
#   "qwen3-4b"   - High quality, needs 16GB+ GPU
#   "qwen3-8b"   - Highest quality, needs 32GB+ GPU

EMBEDDING_PRESET = "minilm"  # Change this to try different models

# PCA configuration
VARIANCE_THRESHOLD = 0.95  # Keep components explaining 95% of variance
MIN_COMPONENTS = 2         # At least 2 features per text column
MAX_COMPONENTS = None      # No upper limit (set to e.g., 20 to cap)

# Get model info and create config
model_info = get_model_info(EMBEDDING_PRESET)
config = TextProcessingConfig(
    embedding_model=model_info["model_name"],
    variance_threshold=VARIANCE_THRESHOLD,
    max_components=MAX_COMPONENTS,
    min_components=MIN_COMPONENTS,
    batch_size=32
)

print("Text Processing Configuration")
print("=" * 50)
print(f"  Preset: {EMBEDDING_PRESET}")
print(f"  Model: {config.embedding_model}")
print(f"  Model size: {model_info['size_mb']} MB")
print(f"  Embedding dimension: {model_info['embedding_dim']}")
print(f"  GPU recommended: {'Yes' if model_info['gpu_recommended'] else 'No'}")
print()
print(f"  Variance threshold: {config.variance_threshold:.0%}")
print(f"  Min components: {config.min_components}")
print(f"  Max components: {config.max_components or 'unlimited'}")

if model_info['gpu_recommended']:
    print()
    print("Note: This model works best with GPU. Processing may be slow on CPU.")
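
To show how the three PCA settings interact, here is a small sketch of one plausible selection rule: take the first component count whose cumulative variance reaches the threshold, then apply the minimum and maximum bounds. The function `select_n_components` is hypothetical, and the actual logic inside `TextDimensionalityReducer` may differ.

```python
from itertools import accumulate

def select_n_components(variance_ratios, threshold=0.95,
                        min_components=2, max_components=None):
    """Return how many leading components to keep for the given settings."""
    n = len(variance_ratios)  # fallback: keep everything if the threshold is never reached
    for count, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= threshold:
            n = count
            break
    n = max(n, min_components)          # enforce the lower bound
    if max_components is not None:
        n = min(n, max_components)      # enforce the optional cap
    return min(n, len(variance_ratios)) # never exceed what is available

# Hypothetical per-component variance ratios, sorted in descending order.
example_ratios = [0.6, 0.25, 0.1, 0.05]
print(select_n_components(example_ratios, threshold=0.9))
print(select_n_components(example_ratios, threshold=0.5))
print(select_n_components(example_ratios, threshold=0.9, max_components=2))
```

Note that a cap below the threshold-derived count trades explained variance for fewer features, so the reported explained variance can end up under the target.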

## 2a.4 Text Column Analysis

Before processing, inspect each TEXT column: its fill rate, length statistics, a few sample values, and the distribution of text lengths.

In [None]:
if text_columns:
    for col_name in text_columns:
        print(f"\n{'='*70}")
        print(f"Column: {col_name}")
        print(f"{'='*70}")
        
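        # Fill missing values, then cast to str so the .str accessor also works on mixed-type columns
        text_series = df[col_name].fillna("").astype(str)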
        
        # Basic statistics
        non_empty = (text_series.str.len() > 0).sum()
        avg_length = text_series.str.len().mean()
        max_length = text_series.str.len().max()
        
        print(f"\n\U0001f4ca Statistics:")
        print(f"   Total rows: {len(text_series):,}")
        print(f"   Non-empty: {non_empty:,} ({non_empty/len(text_series)*100:.1f}%)")
        print(f"   Avg length: {avg_length:.0f} characters")
        print(f"   Max length: {max_length:,} characters")
        
        # Sample texts
        print(f"\n\U0001f4dd Sample texts:")
        samples = text_series[text_series.str.len() > 10].head(3)
        for i, sample in enumerate(samples, 1):
            truncated = sample[:100] + "..." if len(sample) > 100 else sample
            print(f"   {i}. {truncated}")
        
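        # Text length distribution (empty texts are excluded from the histogram and the median)
        lengths = text_series.str.len()
        non_empty_lengths = lengths[lengths > 0]
        fig = go.Figure()
        fig.add_trace(go.Histogram(x=non_empty_lengths, nbinsx=50,
                                    marker_color='steelblue', opacity=0.7))
        if not non_empty_lengths.empty:
            median_length = non_empty_lengths.median()
            fig.add_vline(x=median_length, line_dash="solid", line_color="green",
                          annotation_text=f"Median: {median_length:.0f}")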
        fig.update_layout(
            title=f"Text Length Distribution: {col_name}",
            xaxis_title="Character Count",
            yaxis_title="Frequency",
            template="plotly_white",
            height=350
        )
        display_figure(fig)

## 2a.5 Process Text Columns

This step:
1. Generates embeddings using sentence-transformers
2. Applies PCA to reduce dimensions
3. Creates PC feature columns
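
The steps above can be sketched without the library, using NumPy only. This is an illustration under stated assumptions, not the `TextColumnProcessor` implementation: random low-rank vectors stand in for sentence-transformers output, PCA is computed via SVD, and `pca_reduce` is a hypothetical helper name.

```python
import numpy as np

def pca_reduce(embeddings, variance_threshold=0.95):
    """Project embeddings onto the fewest principal components reaching the threshold."""
    centered = embeddings - embeddings.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    # First index where cumulative variance reaches the threshold, as a count.
    n_components = int(np.searchsorted(cumulative, variance_threshold)) + 1
    n_components = min(n_components, len(ratios))
    features = centered @ components[:n_components].T
    return features, ratios[:n_components]

# Step 1 stand-in: 200 fake 384-dim "embeddings" driven by 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 384))
fake_embeddings = latent @ mixing + 0.01 * rng.normal(size=(200, 384))

# Steps 2 and 3: reduce with PCA and name the resulting feature columns.
features, kept_ratios = pca_reduce(fake_embeddings, variance_threshold=0.95)
column_names = [f"text_pc{i + 1}" for i in range(features.shape[1])]
print(features.shape, column_names, f"{kept_ratios.sum():.1%}")
```

Because the synthetic data has only five underlying factors, a handful of components is enough; real text usually needs more.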

In [None]:
if text_columns:
    processor = TextColumnProcessor(config)
    
    print("Processing TEXT columns...")
    print("(This may take a moment for large datasets)\n")
    
    results = []
    df_processed = df.copy()
    
    for col_name in text_columns:
        print(f"\n{'='*70}")
        print(f"Processing: {col_name}")
        print(f"{'='*70}")
        
        df_processed, result = processor.process_column(df_processed, col_name)
        results.append(result)
        
        print(f"\n\u2705 Processing complete:")
        print(f"   Embedding shape: {result.embeddings_shape}")
        print(f"   Components kept: {result.n_components}")
        print(f"   Explained variance: {result.explained_variance:.1%}")
        print(f"   Features created: {', '.join(result.component_columns)}")
    
    print(f"\n\n{'='*70}")
    print("PROCESSING SUMMARY")
    print(f"{'='*70}")
    print(f"\nOriginal columns: {len(df.columns)}")
    print(f"New columns added: {len(df_processed.columns) - len(df.columns)}")
    print(f"Total columns: {len(df_processed.columns)}")

## 2a.6 Visualize Results

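The plots below show how much variance each principal component explains, the cumulative variance against the target threshold, and how rows are distributed across the first two components.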

In [None]:
if text_columns and results:
    for result in results:
        print(f"\n{'='*70}")
        print(f"Results: {result.column_name}")
        print(f"{'='*70}")
        
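        # Explained variance per component.
        # Note: _reducers and _pca are private attributes, so this lookup may break
        # if the library internals change.
        reducer = processor._reducers[result.column_name]
        var_ratios = reducer._pca.explained_variance_ratio_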
        cumulative = np.cumsum(var_ratios)
        
        fig = make_subplots(rows=1, cols=2,
                            subplot_titles=("Variance per Component", "Cumulative Variance"))
        
        fig.add_trace(go.Bar(
            x=[f"PC{i+1}" for i in range(len(var_ratios))],
            y=var_ratios,
            marker_color='steelblue'
        ), row=1, col=1)
        
        fig.add_trace(go.Scatter(
            x=[f"PC{i+1}" for i in range(len(cumulative))],
            y=cumulative,
            mode='lines+markers',
            line_color='green'
        ), row=1, col=2)
        
        fig.add_hline(y=config.variance_threshold, line_dash="dash", line_color="red",
                      annotation_text=f"Target: {config.variance_threshold:.0%}",
                      row=1, col=2)
        
        fig.update_layout(
            title=f"PCA Results: {result.column_name}",
            height=400,
            template="plotly_white",
            showlegend=False
        )
        fig.update_yaxes(title_text="Variance Ratio", row=1, col=1)
        fig.update_yaxes(title_text="Cumulative Variance", row=1, col=2)
        display_figure(fig)
        
        # PC feature distributions
        if len(result.component_columns) >= 2:
            fig = px.scatter(
                df_processed,
                x=result.component_columns[0],
                y=result.component_columns[1],
                title=f"PC1 vs PC2: {result.column_name}",
                opacity=0.5
            )
            fig.update_layout(template="plotly_white", height=400)
            display_figure(fig)

## 2a.7 Update Findings with Text Processing Metadata

In [None]:
if text_columns and results:
    for result in results:
        metadata = TextProcessingMetadata(
            column_name=result.column_name,
            embedding_model=config.embedding_model,
            embedding_dim=result.embeddings_shape[1],
            n_components=result.n_components,
            explained_variance=result.explained_variance,
            component_columns=result.component_columns,
            variance_threshold_used=config.variance_threshold,
            processing_approach="pca"
        )
        findings.text_processing[result.column_name] = metadata
        
        print(f"\u2705 Added metadata for {result.column_name}:")
        print(f"   Model: {metadata.embedding_model}")
        print(f"   Components: {metadata.n_components}")
        print(f"   Explained variance: {metadata.explained_variance:.1%}")
    
    findings.save(FINDINGS_PATH)
    print(f"\nFindings saved to: {FINDINGS_PATH}")
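
For orientation, the record stored per column looks roughly like the sketch below. `TextMetadataSketch` is a hypothetical stand-in that mirrors the field names passed to `TextProcessingMetadata` above; the values are made up, and JSON is used here only because it ships with Python (the findings file itself is YAML).

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class TextMetadataSketch:
    """Hypothetical mirror of the fields passed to TextProcessingMetadata."""
    column_name: str
    embedding_model: str
    embedding_dim: int
    n_components: int
    explained_variance: float
    component_columns: list = field(default_factory=list)
    variance_threshold_used: float = 0.95
    processing_approach: str = "pca"

# Illustrative values only; real values come from the processing results.
example = TextMetadataSketch(
    column_name="ticket_text",
    embedding_model="example-model",
    embedding_dim=384,
    n_components=2,
    explained_variance=0.96,
    component_columns=["ticket_text_pc1", "ticket_text_pc2"],
)

# Round-trip through a plain-text format, as a save followed by a load would.
serialized = json.dumps(asdict(example), indent=2)
restored = TextMetadataSketch(**json.loads(serialized))
print(serialized)
```

Storing the model name and threshold alongside the column names is what lets a later stage reproduce the same features.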

## 2a.8 Generate Recommendations

In [None]:
if text_columns and results:
    print("\n" + "="*70)
    print("PRODUCTION RECOMMENDATIONS")
    print("="*70)
    
    for result in results:
        print(f"\n\U0001f527 {result.column_name}:")
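        print("   Action: embed_reduce (embeddings + PCA)")
        print(f"   Model: {config.embedding_model}")
        print(f"   Variance threshold: {config.variance_threshold:.0%}")
        print(f"   Expected features: {result.n_components}")
        # Show at most three names; add an ellipsis only when some are omitted
        preview = ', '.join(result.component_columns[:3])
        suffix = "..." if len(result.component_columns) > 3 else ""
        print(f"   Feature names: {preview}{suffix}")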
    
    print("\n\U0001f4a1 These recommendations will be used by the pipeline generator.")
    print("   The same processing will be applied in production.")
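
"The same processing" means the PCA fitted on exploration data has to be reused as-is, not refitted, when new rows arrive. A minimal sketch of that fit-once, transform-many pattern follows; `FittedPCA` is a hypothetical NumPy-only class, not the pipeline generator's actual code, and random vectors stand in for embeddings.

```python
import numpy as np

class FittedPCA:
    """Stores the training mean and components so new data gets the same projection."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.components_ = None

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, X):
        # Always center with the training mean, never the mean of the new batch.
        return (X - self.mean_) @ self.components_.T

rng = np.random.default_rng(42)
train_embeddings = rng.normal(size=(100, 16))   # stand-in for exploration data
new_embeddings = rng.normal(size=(5, 16))       # stand-in for production rows

sketch_reducer = FittedPCA(n_components=3).fit(train_embeddings)
train_features = sketch_reducer.transform(train_embeddings)
new_features = sketch_reducer.transform(new_embeddings)
print(train_features.shape, new_features.shape)
```

Refitting on each production batch would change the component axes, so the resulting features would no longer match what the downstream model was trained on.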

---

## Summary

In this notebook, we:

1. **Analyzed** TEXT columns for length and content patterns
2. **Generated embeddings** using sentence-transformers
3. **Applied PCA** to reduce dimensions while preserving variance
4. **Created numeric features** (pc1, pc2, ...) for downstream ML
5. **Updated findings** with processing metadata

## Key Results

| Column | Components | Explained Variance |
|--------|------------|--------------------|
| (Filled by execution) | | |

---

## Next Steps

Continue to **03_quality_assessment.ipynb** to:
- Analyze duplicate records and value conflicts
- Deep dive into missing value patterns
- Analyze outliers with IQR method
- Get cleaning recommendations