Here is the deep dive **exclusively** on the Text Pipeline (TF-IDF + SVD).

### 1. The Logic: From "Words" to "Concepts"

You are converting raw sentences into 50 mathematical "concepts". Here is exactly what happens in that specific branch of your pipeline:

#### Step A: TF-IDF (The "Flavor" Profiler)
*   **What it does:** It turns text into a giant table where columns are words (e.g., "cancer", "pain", "vaccine").
*   **The Math:**
    *   **TF (Term Frequency):** How often does "insulin" appear in *this* trial? (High count = important for this row).
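    *   **IDF (Inverse Document Frequency):** How rare is "insulin" across *all* trials?
        *   "Patient" appears everywhere $\to$ Weight $\approx$ 0.01 (effectively ignored).
        *   "Insulin" appears rarely $\to$ Weight $\approx$ 0.95 (High Signal).
        *   These numbers are illustrative. The actual cell values depend on the IDF formula and on row normalization, but the ordering holds: common words score low, rare words score high.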
*   **Result:** A sparse matrix of **(84,268 rows × 5,000 columns)**. Most cells are zero.
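
To make the TF and IDF arithmetic concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`; your pipeline's actual settings may differ, and the row normalization that scikit-learn applies by default is skipped here to keep the numbers easy to follow.

```python
import math

# Toy corpus: three hypothetical "trials" (illustration only, not real data)
docs = [
    "patient insulin insulin glucose",
    "patient tumor",
    "patient vaccine",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df = number of docs containing the term
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(term, doc):
    # Raw count of the term in this document times its IDF (no row normalization)
    return doc.count(term) * idf(term)

idf_patient = idf("patient")    # appears in every document
idf_insulin = idf("insulin")    # appears in one document
score_insulin_doc0 = tfidf("insulin", tokenized[0])
score_patient_doc0 = tfidf("patient", tokenized[0])

print(f"IDF patient: {idf_patient:.3f}")
print(f"IDF insulin: {idf_insulin:.3f}")
print(f"TF-IDF in doc 0 -> insulin: {score_insulin_doc0:.3f}, patient: {score_patient_doc0:.3f}")
```

Under this formula the common word is never zeroed out; it just ends up with the smallest weight in the row.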

#### Step B: SVD (The Compressor)
*   **What it does:** It looks at those 5,000 columns and realizes that "insulin", "diabetes", "glucose", and "type 2" almost always appear together.
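*   **The Math:** It collapses these correlated words into a single "Topic" (or Component). Each component is a weighted combination of *all* 5,000 words, with the largest weights on the words that tend to co-occur.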
    *   Instead of 4 columns, you get **1 column** (let's call it `Topic_Metabolic`).
    *   If a trial has "insulin" and "glucose", it gets a high score on `Topic_Metabolic`.
*   **Result:** A dense matrix of **(84,268 rows × 50 columns)**.
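
The compression step can be sketched on a tiny hand-made matrix. This is an illustration under assumptions, not your pipeline: the four terms and the counts are invented, and NumPy's full SVD stands in for the truncated solver. The projection `X @ components.T` mirrors how a fitted truncated SVD maps rows onto its components.

```python
import numpy as np

# Hypothetical 4-document x 4-term matrix: two "metabolic" docs, two "oncology" docs
terms = ["insulin", "glucose", "tumor", "metastatic"]
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

# Full SVD; rows of Vt are the term-space directions ("topics"), sorted by singular value
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                      # keep only the two strongest components
components = Vt[:k]        # shape (k, n_terms)
Z = X @ components.T       # shape (n_docs, k): each document's score on each topic

print("Singular values:", np.round(S, 3))
for i, comp in enumerate(components):
    print(f"Component {i} loadings:", dict(zip(terms, np.round(comp, 3))))
print("Document scores:\n", np.round(Z, 3))
```

The first component loads only on *insulin* and *glucose*, the second only on *tumor* and *metastatic*, so four word columns become two topic columns. The sign of each component is arbitrary, which is why only magnitudes are meaningful here.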

---

### 2. The Code: Visualize & Audit the Text Signals

Run this code to see exactly what your SVD learned. It will show you the **Top Words** for the main topics and the **Explained Variance** (how much information was kept).

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

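# 1. Access the specific Text Pipeline steps
# We dig into the fitted pipeline to get the trained objects.
# NOTE: `named_transformers_` exists on a fitted ColumnTransformer. If `pipeline` is a
# Pipeline that wraps one, reach the ColumnTransformer through `pipeline.named_steps` first.
txt_pipe = pipeline.named_transformers_['txt_tags_svd']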
tfidf_step = txt_pipe.named_steps['tfidf']
svd_step = txt_pipe.named_steps['svd']

# Get the vocabulary (The 5,000 words TF-IDF kept)
vocab = tfidf_step.get_feature_names_out()

# ==========================================================
# STAT 1: EXPLAINED VARIANCE (How much info did we keep?)
# ==========================================================
total_var = svd_step.explained_variance_ratio_.sum() * 100
n_comp = svd_step.components_.shape[0]  # number of SVD components actually fitted
print("\n>>> SVD PERFORMANCE METRICS")
print(f"Total Information Retained by {n_comp} Components: {total_var:.2f}%")
print("   (Note: Low totals such as 10-20% are common for sparse text; judge the features by downstream model performance.)")

# Plot the cumulative variance
plt.figure(figsize=(10, 4))
plt.plot(np.cumsum(svd_step.explained_variance_ratio_), marker='.')
plt.title('Cumulative Explained Variance by SVD Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance')
plt.grid(True)
plt.show()

# ==========================================================
# STAT 2: DECODING THE TOPICS (What do the columns mean?)
# ==========================================================
print("\n>>> DECODING THE TOP 6 'HIDDEN' TOPICS")
print("These are the words that drive the values in your new 'truncatedsvd' columns.\n")

def get_top_words_for_topic(topic_idx, n_words=8):
    # Get the row from the V matrix (components_)
    component = svd_step.components_[topic_idx]
    # Sort indices by the absolute weight (magnitude of contribution), largest first
    top_indices = np.argsort(np.abs(component))[::-1][:n_words]
    # Map indices to words
    top_words = [(vocab[i], component[i]) for i in top_indices]
    return top_words

# Create a plot for the top 6 topics
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for i in range(6):
    words_weights = get_top_words_for_topic(i)
    words, weights = zip(*words_weights)
    
    # A single color avoids the seaborn warning raised when `palette` is passed without `hue`
    sns.barplot(x=list(weights), y=list(words), ax=axes[i], color='steelblue')
    axes[i].set_title(f"SVD Component {i} (Topic {i})")
    axes[i].set_xlabel("Weight (signed loading)")

plt.tight_layout()
plt.show()
```

### 3. How to Interpret the Output

#### A. The Variance Plot
*   **Curve Shape:** It should rise steeply at the beginning and then flatten out.
*   **Meaning:** The first few components (0, 1, 2) capture the "Big Themes" (Cancer, Pain, Vaccines). The later components (40-49) capture tiny details.
*   **The %:** If it says **"Total Information Retained: 15%"**, don't panic. Text is messy: a TF-IDF matrix spreads its variance across thousands of rare words, so a low total is normal. What matters is whether the 50 components carry enough signal for prediction, and the downstream model's performance is the real test of that, not the percentage itself.
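
If you want a number to go with the curve, a small helper can report how many components are needed to reach a given cumulative share. This is a sketch: `components_needed` is a hypothetical helper, and the ratios below are made up rather than taken from your model.

```python
import numpy as np

def components_needed(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches `target`.

    Returns None if even all components together fall short of the target.
    """
    cumulative = np.cumsum(ratios)
    if cumulative[-1] < target:
        return None
    # searchsorted gives the first index where cumulative >= target; +1 converts to a count
    return int(np.searchsorted(cumulative, target) + 1)

# Hypothetical explained-variance ratios for five components
ratios = np.array([0.05, 0.04, 0.03, 0.02, 0.01])
k_for_10 = components_needed(ratios, 0.10)
k_for_50 = components_needed(ratios, 0.50)

print("Components needed for 10%:", k_for_10)
print("Components needed for 50%:", k_for_50)
```

With the fitted objects from the audit code, the call would be `components_needed(svd_step.explained_variance_ratio_, 0.10)`.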

#### B. The Bar Charts (The "Fingerprints")
This is the most important check.
*   **Component 0:** Usually the "Average Trial". It might contain generic medical words like *patients, disease, treatment*.
*   **Component 1:** Often the largest specific field. Likely **Oncology** (*tumor, solid, advanced, metastatic*).
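*   **Component 2:** Often a different major field. If Comp 1 was Cancer, Comp 2 might be **Cardiology** or **Pain**. Components are orthogonal to each other, not opposites; the contrast shows up *inside* a component, where words with positive weights pull against words with negative weights.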
*   **Component 3:** Might be **Infectious Disease** (*covid, vaccine, virus*).

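**If you see these clear clusters, your TF-IDF + SVD is doing its job.** It has learned to separate, say, a Cancer trial from a Covid trial without you ever explicitly programming those rules. If the top words look generic or mixed instead, treat that as a cue to review the text cleaning and the vocabulary before trusting the components.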
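
Because component weights are signed, it can help to list the strongest positive and negative words separately when reading a fingerprint. The sketch below uses a hypothetical helper, `split_by_sign`, on a made-up component; which side comes out positive is arbitrary in SVD, so read the two lists as opposing poles rather than as "good" and "bad".

```python
import numpy as np

def split_by_sign(component, vocab, n_words=3):
    """Return the top positive and top negative (word, weight) pairs of one component."""
    order = np.argsort(component)  # ascending: most negative first
    negative = [(str(vocab[i]), float(component[i]))
                for i in order[:n_words] if component[i] < 0]
    positive = [(str(vocab[i]), float(component[i]))
                for i in order[::-1][:n_words] if component[i] > 0]
    return positive, negative

# Made-up vocabulary and loadings, for illustration only
vocab_demo = np.array(["tumor", "metastatic", "patient", "vaccine", "virus"])
component_demo = np.array([0.6, 0.5, 0.05, -0.4, -0.45])

positive, negative = split_by_sign(component_demo, vocab_demo, n_words=2)
print("Positive pole:", positive)
print("Negative pole:", negative)
```

With the fitted objects from the audit code, the equivalent call would be `split_by_sign(svd_step.components_[1], vocab)`.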