20260211
Notes on some NLP tricks
- original blog: https://machinelearningmastery.com/7-advanced-feature-engineering-tricks-using-llm-embeddings/ 

# Semantic Similarity as a Feature

Use anchor phrases as reference points: the cosine similarity between a text and each anchor becomes a feature.

In [1]:
# Not sure why, but always install with the venv's interpreter: ./.venv/bin/python -m pip install sentence_transformers

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
 
# Initialize model and encode anchors
model = SentenceTransformer('all-MiniLM-L6-v2')
anchors = ["billing issue", "login problem", "feature request"]
anchor_embeddings = model.encode(anchors)
 
# Encode a new ticket
new_ticket = ["I can't access my account, it says password invalid."]
ticket_embedding = model.encode(new_ticket)
 
# Calculate similarity features
similarity_features = cosine_similarity(ticket_embedding, anchor_embeddings)
print(similarity_features)  # e.g., [[0.1, 0.85, 0.3]] -> high similarity to "login problem"

[[0.36162615 0.5896069  0.03221758]]
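
To use this as a feature column for many tickets at once, the same computation can be batched. A minimal sketch with NumPy only: the helper name `anchor_similarity_features` and the toy 3-d vectors are made up for illustration and stand in for `model.encode(...)` output. It assumes no embedding is all zeros.

```python
import numpy as np

def anchor_similarity_features(text_embeddings, anchor_embeddings):
    """Cosine similarity of every text to every anchor: shape (n_texts, n_anchors)."""
    texts = np.asarray(text_embeddings, dtype=float)
    anchors = np.asarray(anchor_embeddings, dtype=float)
    # L2-normalise rows so that the dot product equals cosine similarity
    texts = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return texts @ anchors.T

# Toy vectors standing in for real embeddings (made-up values)
toy_anchor_embeddings = np.array([[1.0, 0.0, 0.0],
                                  [0.0, 1.0, 0.0]])
toy_ticket_embeddings = np.array([[2.0, 0.0, 0.0],
                                  [1.0, 1.0, 0.0],
                                  [0.0, 0.0, 3.0]])

sim = anchor_similarity_features(toy_ticket_embeddings, toy_anchor_embeddings)
best_anchor = sim.argmax(axis=1)  # index of the closest anchor per ticket
best_score = sim.max(axis=1)      # how close that anchor is
print(sim.shape)                  # (3, 2): one row per ticket, one column per anchor
```

The third toy ticket is orthogonal to both anchors, so its best score is 0: a low best score is a hint that none of the anchors fits.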


Try a business case (text classification) with a Chinese embedding model

In [None]:
# Not sure why, but always install with the venv's interpreter: ./.venv/bin/python -m pip install sentence_transformers

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
 
# Initialize model and encode anchors
model = SentenceTransformer('shibing624/text2vec-base-chinese')
# Anchors in English: "buying a car", "no mention of buying a car"
anchors = ["买车", "没提到要买车"]
anchor_embeddings = model.encode(anchors)
 
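# Encode a new ticket
# English: "The ET9 display car has arrived at the store; we pick users up from Lanzhou New Area and drive them to the Wanxiangcheng (MixC) store to see the car, 60 km one way"
new_ticket = ["ET9展车已到门店，从兰州新区接送用户到万象城门店看车，单程60公里"]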
ticket_embedding = model.encode(new_ticket)
 
# Calculate similarity features
similarity_features = cosine_similarity(ticket_embedding, anchor_embeddings)
print(similarity_features)

[[0.36438686 0.42217815]]


In [None]:
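# Test: can the anchors tell how many car brands are mentioned? Not accurate. I think these anchors are too specific; a general semantic judgment like the one above gives OK results.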

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
 
# Initialize model and encode anchors
model = SentenceTransformer('shibing624/text2vec-base-chinese')
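# Anchors in English: "mentions multiple car brands", "mentions only one car brand", "mentions no car brand"
anchors = ["提到多个汽车品牌", "仅提到一种汽车品牌", "未提到汽车品牌"]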
anchor_embeddings = model.encode(anchors)
 
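# Encode a new ticket
# English: "The user is often abroad but trades stocks and follows NIO closely, discussing every new model at launch. He had dealt with several advisors but found their response speed and professionalism lacking, and preferred us.
# Before ordering he asked for someone to pick him up; he booked an ET9 experience with Chunchun, who arranged the pickup. He felt well looked after and later locked in the order, satisfied."
new_ticket = ["用户是经常在国外，但是一直有在炒股，比较关注蔚来，每次有新车型上市都会关心讨论。用户也接触过好几个顾问，但是觉得回复速度和专业性会差一些，比较认可我们。下单前，用户提出想要有人能接一下，跟淳淳约了et9体验，帮他协调，可以去接送他，觉得很尊贵，后面满意锁单"]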
ticket_embedding = model.encode(new_ticket)
 
# Calculate similarity features
similarity_features = cosine_similarity(ticket_embedding, anchor_embeddings)
print(similarity_features)

[[0.5112165  0.5099434  0.42966738]]
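
One way to quantify "not accurate" here is the gap between the best and second-best anchor. A small sketch using the scores printed above; treating the margin as a confidence signal is my own assumption, not something from the original blog.

```python
import numpy as np

# Scores copied from the printed output above
scores = np.array([[0.5112165, 0.5099434, 0.42966738]])

ordered = np.sort(scores, axis=1)[:, ::-1]  # sort each row in descending order
margin = ordered[:, 0] - ordered[:, 1]      # gap between best and runner-up
print(margin)                               # roughly 0.0013
```

A margin this small means the top two anchors are practically tied, so the argmax alone should not be trusted. The margin itself can be added as an extra feature next to the raw similarities.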


# Dimensionality Reduction and Denoising
LLM embeddings are high-dimensional (e.g., 384 or 768 dimensions). Reducing dimensions can remove noise, cut computational cost, and sometimes make patterns easier for a downstream model to find.
Because of the “curse of dimensionality”, some models (like Random Forests) may perform poorly when many dimensions are uninformative.

In [None]:
text_dataset = [
    "I was charged twice for my subscription",
    "Cannot reset my password",
    "Would love to see dark mode added",
    "My invoice shows the wrong amount",
    "Login keeps failing with error 401",
    "Please add export to PDF feature",
]

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE  # For visualization, not typically for feature engineering
 
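# 'embeddings' is a numpy array of shape (n_samples, embedding_dim).
# embedding_dim is 768 here because `model` is still text2vec-base-chinese from the previous cell;
# all-MiniLM-L6-v2 would give 384.
embeddings = model.encode(text_dataset)  # encode the whole list in one batched call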
 
# Method 1: PCA - for linear relationships
# n_components must be <= min(n_samples, n_features)
n_components = min(50, len(text_dataset))
pca = PCA(n_components=n_components)
reduced_pca = pca.fit_transform(embeddings)
 
# Method 2: TruncatedSVD - similar, works on matrices from TF-IDF as well
svd = TruncatedSVD(n_components=n_components)
reduced_svd = svd.fit_transform(embeddings)
 
print(f"Original shape: {embeddings.shape}")
print(f"Reduced shape: {reduced_pca.shape}")
print(f"PCA retains {sum(pca.explained_variance_ratio_):.2%} of variance.")

# The code above works because PCA finds axes of maximum variance, often capturing the most significant semantic information in fewer, uncorrelated dimensions.

Original shape: (6, 768)
Reduced shape: (6, 6)
PCA retains 100.00% of variance.


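Note that dimensionality reduction is lossy. Always test whether reduced features maintain or improve model performance. The 100% retained variance above is an artifact of the tiny dataset: 6 centered samples span at most 5 dimensions, so 6 components capture everything. PCA is linear; for nonlinear relationships, consider UMAP (but be mindful of its sensitivity to hyperparameters).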
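
To make the explained-variance number concrete, here is a minimal sketch of how it comes out of an SVD of the centered data, and how to pick the smallest number of components that reaches a target. It uses NumPy only, with random data standing in for real embeddings; the 95% target is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 768))   # random stand-in for 6 embeddings of dimension 768
Xc = X - X.mean(axis=0)         # PCA centers the data first

# Squared singular values are proportional to the variance along each principal axis
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest k whose cumulative explained variance reaches the target
target = 0.95
k = int(np.searchsorted(cumulative, target) + 1)
print(np.round(explained, 3))
print(k)
```

With only 6 samples the last component carries essentially zero variance, which is why keeping all 6 components reports 100%. On a real dataset with far more samples than this, the cumulative curve is what tells you how many components are worth keeping.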

# Cluster Labels and Distances as Features

- Run unsupervised clustering on the embeddings of your text collection to discover natural thematic groups. Use the cluster assignments and the distances to cluster centroids as new categorical and continuous features.
- The problem: your data may have unknown or emerging categories not captured by predefined anchors (remember the semantic similarity trick). Clustering all document embeddings and then using the results as features addresses this.

- (This is exactly what I did in my project: check which texts naturally group together to surface topics, then use those topics to fill in what the anchor list missed.)

In [None]:
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import LabelEncoder
 
# Cluster the embeddings
# n_clusters must be <= n_samples
n_clusters = min(10, len(text_dataset))
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)
 
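# Feature 1: Cluster assignment
# KMeans labels are already integers, so LabelEncoder normally leaves them unchanged;
# one-hot encode them if the downstream model would read the IDs as ordinal
encoder = LabelEncoder()
cluster_id_feature = encoder.fit_transform(cluster_labels)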
 
# Feature 2: Distance to each cluster centroid
distances_to_centroids = kmeans.transform(embeddings)
# Shape: (n_samples, n_clusters)
# 'distances_to_centroids' adds n_clusters new continuous features per sample
 
# Combine with original embeddings or use alone
enhanced_features = np.hstack([embeddings, distances_to_centroids])

- This works because it provides the model with structural knowledge about the data’s natural grouping, which can be highly informative for tasks like classification or anomaly detection.

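- Note: we’re using n_clusters = min(10, len(text_dataset)) because we don’t have much data. With 6 texts this gives 6 clusters, so in practice every text ends up as its own cluster and the features carry no grouping information; the cell only demonstrates the API. Choosing the number of clusters (k) is critical: use the elbow method or domain knowledge. DBSCAN is an alternative for density-based clustering that does not require specifying k.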
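
What the two cluster features look like can be shown without a fitted model. A minimal sketch with NumPy only: `centroid_distance_features` is a made-up helper that computes Euclidean distances to each centroid (the same kind of output `kmeans.transform` gives), and the toy points and centroids are hand-picked.

```python
import numpy as np

def centroid_distance_features(X, centroids):
    """Euclidean distance from each row of X to each centroid: shape (n_samples, k)."""
    diffs = X[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2)

# Toy 2-d points and two hand-picked centroids (made-up values)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
centroids_toy = np.array([[0.0, 0.5], [10.0, 10.0]])

dists = centroid_distance_features(X_toy, centroids_toy)
nearest = dists.argmin(axis=1)                 # hard cluster assignment
one_hot = np.eye(len(centroids_toy))[nearest]  # categorical feature, one column per cluster
features = np.hstack([dists, one_hot])
print(features.shape)                          # (3, 4)
```

The distances are a soft version of the assignment: a text far from every centroid is a candidate for a new or emerging category.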



# Text Difference Embeddings
- For tasks involving pairs of texts (for example, duplicate-question detection and semantic search relevance), the interaction between embeddings is more important than the embeddings in isolation.

- Simply concatenating two embeddings doesn’t explicitly model their relationship. A better approach is to create features that encode the difference and element-wise product between embeddings.

In [7]:
# For pairs of texts (e.g., query and document, ticket1 and ticket2)
texts1 = ["I can't log in to my account"]
texts2 = ["Login keeps failing with error 401"]
 
embeddings1 = model.encode(texts1)
embeddings2 = model.encode(texts2)
 
# Basic concatenation (baseline)
concatenated = np.hstack([embeddings1, embeddings2])
 
# Advanced interaction features
absolute_diff = np.abs(embeddings1 - embeddings2)  # Captures magnitude of disagreement
elementwise_product = embeddings1 * embeddings2     # Captures alignment (like a dot product per dimension)
 
# Combine all for a rich feature set
interaction_features = np.hstack([embeddings1, embeddings2, absolute_diff, elementwise_product])
interaction_features

array([[ 0.3350369 , -0.1173774 ,  0.32774225, ...,  0.01259114,
         1.8002882 ,  0.00377938]], shape=(1, 3072), dtype=float32)

- Why does this work? The difference vector highlights where the two meanings diverge. The element-wise product is large and positive in dimensions where both embeddings point the same way, and negative where they disagree in sign. This design is influenced by Siamese network architectures used in similarity learning.

- With both original embeddings included, this approach quadruples the feature dimension relative to a single embedding (4 × 768 = 3072 above). Apply dimensionality reduction (as above) and regularization to control size and noise.
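
For duplicate detection the order of the two texts should not matter. Both |a - b| and a * b are symmetric, so keeping only those (plus the cosine similarity as one scalar) gives order-independent features at about half the size. A sketch with NumPy only; `symmetric_pair_features` and the toy vectors are made up for illustration, and dropping the raw embeddings is my own variation, not part of the original recipe.

```python
import numpy as np

def symmetric_pair_features(emb_a, emb_b):
    """Return [|a - b|, a * b, cosine] per row pair; swapping a and b gives the same result."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    abs_diff = np.abs(a - b)
    product = a * b
    norms = np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1, keepdims=True)
    cosine = product.sum(axis=1, keepdims=True) / norms  # assumes no all-zero embedding
    return np.hstack([abs_diff, product, cosine])

# Toy 4-d vectors standing in for model.encode(...) output (made-up values)
toy_a = np.array([[1.0, 0.0, 2.0, -1.0]])
toy_b = np.array([[1.0, 1.0, 0.0, 1.0]])

feats_ab = symmetric_pair_features(toy_a, toy_b)
feats_ba = symmetric_pair_features(toy_b, toy_a)
print(feats_ab.shape)                   # (1, 9): 4 + 4 + 1
print(np.allclose(feats_ab, feats_ba))  # True
```

For asymmetric tasks such as query-to-document relevance, the raw embeddings still carry useful information about which side is which, so the full concatenation may be the better choice there.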

# Sentence-Level vs. Word-Level Embedding Aggregation
- The problem we're solving here: a single sentence embedding for a long, multi-topic document can lose fine-grained information.
- LLMs can embed words, sentences, or paragraphs. For longer documents, strategically aggregating word-level embeddings can capture information that a single document-level embedding might miss.

In [None]:
# To address this, use a model that exposes token embeddings (e.g., bert-base-uncased from Transformers, or a sentence-transformers model called with encode(..., output_value="token_embeddings")), then pool the token vectors.

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
 
# Load a model that provides token embeddings.
# Named token_model so it does not overwrite the SentenceTransformer `model` used in earlier cells.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
token_model = AutoModel.from_pretrained(model_name)
 
def get_pooled_embeddings(text, pooling_strategy="mean"):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = token_model(**inputs)
    token_embeddings = outputs.last_hidden_state  # Shape: (batch, seq_len, hidden_dim)
    attention_mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
 
    if pooling_strategy == "mean":
        # Masked mean to ignore padding tokens
        masked = token_embeddings * attention_mask
        summed = masked.sum(dim=1)
        counts = attention_mask.sum(dim=1).clamp(min=1)
        return (summed / counts).squeeze(0).numpy()
    elif pooling_strategy == "max":
        # Very negative number for masked positions
        masked = token_embeddings.masked_fill(attention_mask == 0, -1e9)
        return masked.max(dim=1).values.squeeze(0).numpy()
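    elif pooling_strategy == "cls":
        return token_embeddings[:, 0, :].squeeze(0).numpy()
    # Fail loudly instead of silently returning None for an unknown strategy
    raise ValueError(f"Unknown pooling_strategy: {pooling_strategy!r}")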
 
# Example: Get mean of non-padding token embeddings
doc_embedding = get_pooled_embeddings("A long document about several topics.")

- pooling
    - Why it works: Mean pooling averages out noise, while max pooling highlights the most salient features. For tasks where specific keywords are critical (e.g., sentiment from “amazing” vs. “terrible”), this can be more effective than standard sentence embeddings.

- padding & attention masks & CLS
    - Note that this can be computationally heavier than sentence-transformers. It also requires careful handling of padding and attention masks. The [CLS] token embedding is often fine-tuned for specific tasks but may be less general as a feature.
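
Why the attention mask matters can be seen on a tiny hand-made example. A sketch with NumPy only: the token values are made up, with an exaggerated padding vector so the effect is visible.

```python
import numpy as np

# One "sentence" with 3 positions and hidden size 2; the last position is padding
tokens = np.array([[[1.0, 1.0],
                    [3.0, 3.0],
                    [100.0, 100.0]]])  # shape (batch=1, seq_len=3, hidden=2)
mask = np.array([[1, 1, 0]])           # 1 = real token, 0 = padding

m = mask[..., None].astype(tokens.dtype)  # (1, 3, 1), broadcasts over the hidden dimension

naive_mean = tokens.mean(axis=1)                                        # padding leaks in
masked_mean = (tokens * m).sum(axis=1) / np.clip(m.sum(axis=1), 1, None)
masked_max = np.where(m.astype(bool), tokens, -np.inf).max(axis=1)

print(naive_mean)   # dominated by the padding vector
print(masked_mean)  # [[2. 2.]]
print(masked_max)   # [[3. 3.]]
```

Real padding embeddings are not this extreme, but they still shift the average, and the shift grows with the amount of padding in the batch.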

# Embeddings as Input for Feature Synthesis (AutoML)
- Let automated feature engineering tools treat your embeddings as raw input to discover complex, non-linear interactions you might not consider manually. Manually engineering interactions between hundreds of embedding dimensions is impractical.
- One practical approach is to use scikit-learn’s PolynomialFeatures on reduced-dimension embeddings.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
 
# Start with reduced embeddings to avoid explosion
# n_components must be <= min(n_samples, n_features)
n_components_poly = min(20, len(text_dataset))
pca = PCA(n_components=n_components_poly)
embeddings_reduced = pca.fit_transform(embeddings)
 
# Generate polynomial and interaction features up to degree 2
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
synthesized_features = poly.fit_transform(embeddings_reduced)
 
print(f"Reduced dimensions: {embeddings_reduced.shape[1]}")
print(f"Synthesized features: {synthesized_features.shape[1]}")  # n + n*(n+1)/2 features
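
The feature count in the comment above can be checked by hand, which also shows why reducing first is necessary. A small sketch using only the standard library; `degree2_feature_count` is a made-up helper that assumes degree 2 and include_bias=False.

```python
from itertools import combinations_with_replacement
from math import comb

def degree2_feature_count(n, interaction_only=False):
    """Number of degree-2 polynomial features for n inputs, without the bias column."""
    if interaction_only:
        return n + comb(n, 2)   # originals + pairwise products x_i * x_j with i < j
    return n + comb(n, 2) + n   # plus the squares x_i ** 2, i.e. n + n*(n+1)/2

# Cross-check for n = 6 by listing every degree-2 term explicitly
degree2_terms = list(combinations_with_replacement(range(6), 2))
print(len(degree2_terms) + 6)                           # 27
print(degree2_feature_count(6))                         # 27
print(degree2_feature_count(6, interaction_only=True))  # 21
print(degree2_feature_count(20))                        # 230
print(degree2_feature_count(768))                       # 296064
```

Going from 768 raw dimensions straight to degree 2 would produce close to 300,000 columns, while 20 PCA components produce 230.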
