Elevate your TYPO3 website with intelligent, AI-powered content recommendations using advanced NLP technology.
The Semantic Suggestion extension revolutionizes the way related content is presented on TYPO3 websites. Moving beyond traditional category-based functionalities, this extension employs advanced NLP (Natural Language Processing) and TF-IDF (Term Frequency-Inverse Document Frequency) analysis to create genuinely relevant content connections.
Since version 2.0.0, similarity scores are stored in a dedicated database table (tx_semanticsuggestion_similarities) instead of the TYPO3 cache. A Scheduler task handles calculation and storage, ensuring persistence and performance.
🚀 NEW in v3.0.0: Enhanced with nlp_tools integration for professional-grade text analysis, featuring intelligent language detection for TYPO3 12/13 multilingual sites and optimized German language support.
- 🎯 Highly Relevant Links: TF-IDF vectorization with advanced NLP creates genuinely relevant content connections
- 🌍 Multilingual Intelligence: Smart language detection for TYPO3 12/13 sites with automatic content analysis
- 🇩🇪 German Language Optimized: Professional stemming and handling of compound words (Automobilindustrie ↔ Automobil)
- 📈 Increased User Engagement: Keep visitors on your site longer by offering truly related content
- 🔍 Semantic Cocoon: Contributes to a high-quality semantic network, enhancing SEO and navigation
- ⚡ Intelligent Automation: Reduces manual linking work while improving internal link quality by 30-50%
- The similarity calculation performed by the Scheduler task takes time and grows with the number of pages: expect roughly 10-30 seconds below 100 pages and several minutes above 500 pages, depending on the server.
- Displaying suggestions and statistics (reading from the database) is optimized.
- Use the backend module to assess the performance and relevance of suggestions for your specific setup.
- TF-IDF Vectorization: TF-IDF term weights replace the raw word-frequency vectors of v2.x; cosine similarity is then computed on the weighted vectors
- Intelligent Language Detection: Hybrid approach using TYPO3 configuration + content analysis
- Advanced Text Processing: Stemming, stop word removal, and text normalization via nlp_tools
- TYPO3 12/13 Compatibility: Uses modern Context API and Site Configuration
- Smart Language Detection:
  - Priority 1: Uses TYPO3 site language configuration (de_DE.UTF-8 → de)
  - Priority 2: Falls back to intelligent content analysis when TYPO3 config is uncertain
  - Priority 3: Confidence scoring prevents misdetection of mixed-language content
- Multi-site Ready: Supports complex multilingual TYPO3 setups
- Professional Stemming: Uses Wamania\Snowball for German morphology
- Compound Word Support: Recognizes relationships between "Automobil" and "Automobilindustrie"
- Umlaut Handling: Perfect support for ä, ö, ü, ß characters
- German Stop Words: Optimized list including der, die, das, ein, eine, mit, von, etc.
- TF-IDF Vectors: More accurate similarity scoring than traditional word counting
- Confidence Thresholds: Prevents false positives in language detection
- Fallback Mechanisms: Graceful degradation if NLP processing fails
- Performance Caching: TF-IDF vectors and stemming results are cached
- Introduction
- Features
- Requirements
- Installation
- Language Detection System
- Configuration
- Usage (Frontend)
- Backend Module
- Scheduler Task
- Similarity Logic (TF-IDF Enhanced)
- Multilingual Sites Setup
- Display Customization
- Debugging & Performance
- Migration from v2.x
- Contributing
- License
- Support
- Advanced TF-IDF Analysis: Professional semantic similarity calculation using vectorization
- Database Storage: Persistent similarity scores in the tx_semanticsuggestion_similarities table
- Automated Processing: Scheduler task for background calculation and updates
- Frontend Integration: Clean API for displaying suggestions (title, media, excerpt)
- Backend Analytics: Comprehensive statistics and performance metrics
- Intelligent Language Detection: Hybrid TYPO3 + content analysis approach
- Professional Text Processing: Stemming, stop word removal, text normalization
- German Language Excellence: Optimized for compound words and umlauts
- Multi-site Support: Handles complex TYPO3 multilingual configurations
- Confidence Scoring: Prevents false language detection in mixed content
- Highly Configurable: TypoScript settings for display and Scheduler for analysis scope
- Performance Optimized: TF-IDF vector caching and intelligent fallbacks
- Flexible Exclusions: Page-level exclusions for analysis and/or display
- Debug Mode: Comprehensive logging for development and troubleshooting
- TYPO3: 12.0.0 - 13.9.99
- PHP: 8.1 or higher (TYPO3 12 itself requires PHP 8.1)
- Dependencies:
  - cywolf/nlp-tools (automatically installed via Composer)
  - wamania/php-stemmer (for advanced stemming support)
Composer Installation (recommended)
- Install the extension with dependencies:
    composer require talan-hdf/semantic-suggestion
    composer require cywolf/nlp-tools
- Activate both extensions in the TYPO3 Extension Manager.
- Clear TYPO3 cache:
    ./vendor/bin/typo3 cache:flush
- (Optional) Run unit tests to verify installation:
    ./vendor/bin/phpunit --configuration phpunit.xml.dist --testsuite unit
Manual Installation
- Download the extension from the TER or GitHub.
- Upload the archive to typo3conf/ext/.
- Activate the extension in the Extension Manager.
The extension uses a hybrid approach that combines TYPO3's multilingual configuration with intelligent content analysis:
- 🎯 Priority 1: TYPO3 Site Configuration (for TYPO3 12/13)

    // Uses TYPO3's Context API and Site Configuration
    $site = $request->getAttribute('site');
    $siteLanguage = $site->getLanguageById($languageId);
    $locale = $siteLanguage->getLocale(); // e.g., "de_DE.UTF-8"
    return strtolower(substr($locale->getLanguageCode(), 0, 2)); // → "de"

- 🔍 Priority 2: Content Analysis (when TYPO3 config is uncertain)
  - Extracts character trigrams from text content
  - Compares against language profiles built from stop words
  - Uses confidence scoring to prevent false positives
- ⚖️ Priority 3: Confidence Verification

    // If confidence difference < 30%, trust TYPO3 context
    if (($firstScore - $secondScore) / $firstScore < 0.3) {
        return $this->getTypo3LanguageContext();
    }
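The content-analysis step (Priority 2) can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the nlp_tools implementation: the function names, the tiny stop-word lists, and the scoring rule (share of the text's trigrams found in a profile) are all assumptions made for illustration.

```python
def char_trigrams(text: str) -> list[str]:
    """Return all overlapping 3-character windows of the padded, lowercased text."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(stop_words: list[str]) -> set[str]:
    """Build a language profile: the set of trigrams seen in its stop words."""
    profile: set[str] = set()
    for word in stop_words:
        profile.update(char_trigrams(word))
    return profile


# Hypothetical, deliberately tiny stop-word lists (illustration only).
PROFILES = {
    "de": build_profile(["der", "die", "das", "ein", "eine", "mit", "von", "ist", "und", "sehr"]),
    "en": build_profile(["the", "and", "is", "of", "with", "a", "an", "very", "this", "that"]),
}


def score_languages(text: str) -> dict[str, float]:
    """Score each language by the share of text trigrams present in its profile."""
    grams = char_trigrams(text)
    if not grams:
        return {lang: 0.0 for lang in PROFILES}
    return {
        lang: sum(gram in profile for gram in grams) / len(grams)
        for lang, profile in PROFILES.items()
    }


scores = score_languages("Die deutsche Automobilindustrie ist sehr wichtig")
best = max(scores, key=scores.get)  # language with the highest trigram overlap
```

With realistic stop-word lists the profiles overlap far more than in this toy example, which is why a confidence check on the gap between the two best scores is useful.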
# site/config.yaml
languages:
  - languageId: 0
    locale: 'en_US.UTF-8' # → nlp_tools detects "en"
  - languageId: 1
    locale: 'de_DE.UTF-8' # → nlp_tools detects "de"
  - languageId: 2
    locale: 'fr_FR.UTF-8' # → nlp_tools detects "fr"

Result: nlp_tools uses TYPO3 configuration directly via Site API
Page with:
- sys_language_uid = 0 (English)
- But content: "Die deutsche Automobilindustrie ist sehr wichtig"
Detection results:
- Content analysis: 85% German confidence
- TYPO3 context: English
- Final decision: German (high confidence overrides TYPO3)
Detection scores: French: 45%, German: 42%
Confidence difference: (45-42)/45 = 6.7% < 30%
→ Falls back to TYPO3 language context
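The three scenarios above follow one decision rule, sketched here in Python for illustration only (the extension itself is PHP). The function names are hypothetical. The 0.3 relative-gap rule comes from the confidence check shown earlier; the second-best score of 0.45 in the first call is an assumed value, since the scenario only states the 85% German score.

```python
def language_from_locale(locale: str) -> str:
    """Reduce a locale such as 'de_DE.UTF-8' to its two-letter language code."""
    return locale.split(".")[0].split("_")[0].lower()[:2]


def resolve_language(
    typo3_locale: str,
    content_scores: dict[str, float],
    min_relative_gap: float = 0.3,
) -> str:
    """Pick the content-detected language only when it clearly beats the runner-up."""
    typo3_language = language_from_locale(typo3_locale)
    ranked = sorted(content_scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2 or ranked[0][1] <= 0:
        return typo3_language  # nothing reliable to compare: trust TYPO3
    (best_language, first), (_, second) = ranked[0], ranked[1]
    if (first - second) / first < min_relative_gap:
        return typo3_language  # ambiguous content: trust TYPO3
    return best_language


# German text on a page configured as English: content analysis wins.
clear_case = resolve_language("en_US.UTF-8", {"de": 0.85, "en": 0.45})
# French 45% vs German 42%: the gap is about 6.7%, so the TYPO3 language wins.
ambiguous_case = resolve_language("en_US.UTF-8", {"fr": 0.45, "de": 0.42})
```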
Yes, the extension fully integrates with nlp_tools:
// In PageAnalysisService.php
protected function getCurrentLanguage(): string
{
    // Uses nlp_tools LanguageDetectionService
    $detectedLanguage = $this->languageDetector->detectLanguage('');
    return $detectedLanguage; // Smart hybrid detection
}

The language detection leverages:
- LanguageDetectionService: Trigram analysis and confidence scoring
- TextAnalysisService: Advanced text processing and stemming
- TextVectorizerService: TF-IDF vectorization for similarity calculation
All languages benefit from nlp_tools integration, but with different optimization levels:
| Language | Code | Stemming | Stop Words | TF-IDF | Improvement |
|---|---|---|---|---|---|
| 🇩🇪 German | de | ✅ Advanced (Wamania\Snowball) | ✅ Specialized | ✅ Full | +40-50% |
German Features:
- Professional compound word handling (Automobilindustrie ↔ Automobil)
- Perfect umlaut support (ä, ö, ü, ß)
- Specialized stop words (der, die, das, ein, eine, mit, von, zu...)
| Language | Code | Stemming | Stop Words | TF-IDF | Improvement |
|---|---|---|---|---|---|
| 🇫🇷 French | fr | ✅ Wamania\Snowball | ✅ Specialized | ✅ Full | +30-40% |
| 🇬🇧 English | en | ✅ Wamania\Snowball | ✅ Specialized | ✅ Full | +30-40% |
| 🇪🇸 Spanish | es | ✅ Wamania\Snowball | ✅ Specialized | ✅ Full | +30-40% |
| Language | Code | Stemming | Stop Words | TF-IDF | Improvement |
|---|---|---|---|---|---|
| 🇮🇹 Italian | it | ❌ Basic tokenization | | ✅ Full | +20-30% |
| 🇵🇹 Portuguese | pt | ❌ Basic tokenization | | ✅ Full | +20-30% |
| 🇳🇱 Dutch | nl | ❌ Basic tokenization | | ✅ Full | +20-30% |
| All Others | * | ❌ Basic tokenization | | ✅ Full | +20-25% |
✅ All languages benefit from:
- TF-IDF Vectorization: Professional semantic similarity calculation
- Intelligent Language Detection: Hybrid TYPO3 + content analysis
- Advanced Text Cleaning: Unicode normalization, accent removal
- Performance Caching: TF-IDF vectors and processing results cached
// ALL languages get TF-IDF processing
$tfidfResult = $textVectorizer->createTfIdfVectors([$text1, $text2], $language);
$similarity = $textVectorizer->cosineSimilarity($vector1, $vector2);
// Only de/fr/en/es get advanced stemming
if (in_array($language, ['de', 'fr', 'en', 'es'])) {
    $stemmedWords = $textAnalyzer->stem($text, $language);
} else {
    $stemmedWords = $textAnalyzer->tokenize($text); // Basic tokenization
}

# Scheduler Configuration: minimumSimilarity = 0.15
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 1 # CRITICAL for compound words
proximityThreshold = 0.25 # Lower threshold (compound matching)
minTextLength = 50 # German compound words in short text
analyzedFields {
title = 2.0 # German titles contain key compounds
keywords = 2.5 # German keywords very specific
content = 1.0 # Standard weight
description = 1.2 # Meta descriptions useful
abstract = 1.3 # German abstracts well-structured
}
}
# Scheduler Configuration: minimumSimilarity = 0.2
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 1 # Advanced stemming available
proximityThreshold = 0.3 # Standard TF-IDF threshold
minTextLength = 50 # Standard minimum length
analyzedFields {
title = 1.5 # Titles important but less compound
keywords = 2.0 # Keywords still highly relevant
content = 1.0 # Main content
description = 1.0 # Meta descriptions
abstract = 1.1 # Abstracts helpful
}
}
# Scheduler Configuration: minimumSimilarity = 0.25
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 0 # No advanced stemming, but TF-IDF still helps
proximityThreshold = 0.35 # Higher threshold (less precise without stemming)
minTextLength = 100 # Longer text needed for TF-IDF accuracy
analyzedFields {
title = 1.8 # Rely more on titles without stemming
keywords = 2.2 # Keywords become more important
content = 0.8 # Content less reliable without stemming
description = 1.0 # Standard weight
abstract = 1.0 # Standard weight
}
}
Note: Even without stemming, languages like Italian and Portuguese still get significant improvements from TF-IDF vectorization compared to basic word counting.
🎯 NEW in v3.1: Storage vs Display Quality Configuration for clarity! Separate qualityLevel parameters for storage (Scheduler) and display (TypoScript) with clear explanations.
The configuration separates what gets stored (Scheduler) from what gets displayed (TypoScript) for maximum flexibility:
🎯 CONFIGURATION FLOW:
┌─────────────────────┐ ┌──────────────────┐ ┌─────────────────────┐ ┌──────────────────┐
│ Storage QualityLevel│───▶│ Storage: direct │───▶│ Display QualityLevel│───▶│ Display Filter │
│ (Scheduler Task) │ │ (exact match) │ │ (TypoScript) │ │ (quality) │
│ 0.3 → stores ≥0.3 │ │ │ │ 0.4 → shows ≥0.4 │ │ │
└─────────────────────┘ └──────────────────┘ └─────────────────────┘ └──────────────────┘
CONTROLS DATABASE SAVES SIMILARITIES CONTROLS FRONTEND USER SEES RESULTS
STORAGE EFFICIENCY FOR PRECISION DISPLAY QUALITY FILTERED SUGGESTIONS
Benefits:
- ✅ Clear separation of storage vs display logic
- ✅ Flexible filtering (display can be stricter than storage)
- ✅ Performance optimization (store broad, display selective)
- ✅ Backward compatibility with legacy configurations
- ✅ Self-explanatory values (higher = more selective)
⚠️ DEPRECATED: This section documents the old system for reference. Use the unified qualityLevel instead!
Understanding the configuration hierarchy is critical for proper setup. The extension uses a two-tier system where Scheduler settings always take precedence over TypoScript settings:
🔄 CONFIGURATION FLOW:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ ┌──────────────────┐
│ Scheduler │───▶│ Database │───▶│ TypoScript │───▶│ Frontend │
│ Settings │ │ Storage │ │ Settings │ │ Display │
└─────────────────┘ └──────────────────┘ └─────────────────┘ └──────────────────┘
LEVEL 1 LEVEL 2 LEVEL 3 LEVEL 4
(Analysis & (Persistent (Display & (User Sees
Storage) Data) Filtering) Results)
Level 1, Scheduler settings:
- Controls: What gets analyzed and stored in database
- Authority: Absolute - cannot be overridden by TypoScript
- Key Settings:
  - startPageId: Defines analysis scope
  - excludePages: Pages never analyzed (permanent exclusion)
  - minimumSimilarity: Minimum score to store in database
  - recursiveExclusion: How exclusions are applied

Level 2, database storage:
- Contains: Only similarities ≥ minimumSimilarity from Scheduler
- Source: Generated by Scheduler task execution
- Limitation: TypoScript cannot access data that was never stored

Level 3, TypoScript settings:
- Controls: What gets displayed from existing database data
- Limitation: Can only filter/limit existing data, cannot create new data
- Key Settings:
  - proximityThreshold: Minimum score to display (must be ≥ Scheduler minimumSimilarity)
  - maxSuggestions: Limit displayed results
  - excludePages: Pages excluded from display only (still analyzed/stored)

Level 4, frontend display:
- Shows: Final filtered and limited results
- Source: Database data filtered by TypoScript settings
- Scheduler ALWAYS wins:
    Scheduler excludePages = "42,56"
    TypoScript excludePages = "" (empty)
    → Pages 42,56 will NEVER appear (not analyzed at all)
- TypoScript cannot override Scheduler thresholds:
    Scheduler minimumSimilarity = 0.5
    TypoScript proximityThreshold = 0.3
    → Impossible! No similarity < 0.5 exists in database
- TypoScript can only be MORE restrictive:
    Scheduler minimumSimilarity = 0.3 ✅
    TypoScript proximityThreshold = 0.5 ✅
    → Works: Shows only similarities ≥ 0.5 from stored data ≥ 0.3
🔧 CRITICAL: This is the primary configuration that controls what gets analyzed and stored in the database. TypoScript cannot override these settings!
Create a "Semantic Suggestion: Generate Similarities" task in the TYPO3 Scheduler module with these settings:
⚠️ WARNING: Choose your minimumSimilarity carefully - it cannot be lowered retroactively without re-running the entire analysis!
- startPageId (required): The UID of the root page from which the analysis will begin. This defines the scope of the analysis for this task run. Each task execution is linked to a Start Page ID (stored as root_page_id in the DB).
  - Example: 1 (for site root page)
- excludePages (optional): Comma-separated list of page UIDs that will not be analyzed; their similarities will not be stored.
  - Example: 42,56,78
- minimumSimilarity (required): Threshold (0.0 to 1.0) below which a pair of similar pages will not be saved to the database. This controls storage efficiency.
  - Example: 0.3 (saves only pairs with similarity ≥ 30%)
Recommended scheduling:
- Frequency: Daily or weekly
- Execution time: During off-peak hours (e.g., 2:00 AM)
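As a rough illustration of the constraints on these three task fields, the Python sketch below parses and validates them. It is not the extension's PHP code; `TaskSettings` and `parse_task_settings` are hypothetical names, and the rule that a start page UID must be positive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSettings:
    start_page_id: int
    minimum_similarity: float
    exclude_pages: tuple[int, ...] = ()


def parse_task_settings(
    start_page_id: str, minimum_similarity: str, exclude_pages: str = ""
) -> TaskSettings:
    """Validate raw form values the way the field descriptions above imply."""
    page_id = int(start_page_id)
    if page_id <= 0:
        raise ValueError("startPageId must be a positive page UID")
    threshold = float(minimum_similarity)
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("minimumSimilarity must be between 0.0 and 1.0")
    # Comma-separated UIDs; empty entries and surrounding spaces are ignored.
    excluded = tuple(int(part) for part in exclude_pages.split(",") if part.strip())
    return TaskSettings(page_id, threshold, excluded)


settings = parse_task_settings("1", "0.3", "42, 56,78")
```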
🎨 DISPLAY FILTER: These settings control the frontend display and the analysis algorithm details. They can only filter/limit what was already stored by the Scheduler.
❌ CONSTRAINT: proximityThreshold should be ≥ Scheduler minimumSimilarity. A lower value has no effect, because pairs scoring below minimumSimilarity were never stored.
Define them in your TypoScript Setup file under plugin.tx_semanticsuggestion_suggestions.settings.
plugin.tx_semanticsuggestion_suggestions.settings {
# --- Frontend Display Settings ---
# ⚠️ IMPORTANT: Must be ≥ Scheduler minimumSimilarity
proximityThreshold = 0.25 # TF-IDF optimized threshold (was 0.5 in v2.x)
maxSuggestions = 3 # Maximum number of suggestions to display
excerptLength = 100 # Max length of the text excerpt
excludePages = # Pages to exclude from DISPLAY only (comma-separated UIDs)
# --- Analysis Algorithm Settings (Used by Scheduler task) ---
recencyWeight = 0.2 # Weight of recency in the final score (0.0 to 1.0)
# Fields analyzed and their weights (TF-IDF enhanced)
analyzedFields {
title = 1.5 # Page titles are highly relevant
description = 1.0 # Meta descriptions
keywords = 2.0 # Explicit keywords have high weight
abstract = 1.2 # Page abstracts/summaries
content = 1.0 # Main page content (can be noisy)
}
# --- NLP & Language Settings (v3.0+) ---
enableStemming = 1 # Enable advanced stemming (especially for German)
defaultLanguage = en # Fallback language
minTextLength = 50 # Minimum text length for analysis
confidenceThreshold = 0.3 # Language detection confidence threshold
# --- Debugging ---
debugMode = 0 # Enable debug logs (0 or 1)
}
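The README does not spell out how analyzedFields weights and recencyWeight are combined into one score. The Python sketch below shows one plausible reading, stated as an assumption: a weighted average of per-field similarities, followed by a linear blend with a recency score. The function names and both formulas are hypothetical, not the extension's PHP implementation.

```python
# Weights copied from the TypoScript example above.
FIELD_WEIGHTS = {
    "title": 1.5,
    "description": 1.0,
    "keywords": 2.0,
    "abstract": 1.2,
    "content": 1.0,
}


def weighted_field_similarity(
    field_scores: dict[str, float], weights: dict[str, float] = FIELD_WEIGHTS
) -> float:
    """Weighted average of per-field similarities (assumed combination rule)."""
    used = {
        field: weight
        for field, weight in weights.items()
        if field in field_scores and weight > 0
    }
    total_weight = sum(used.values())
    if total_weight == 0:
        return 0.0
    return sum(field_scores[field] * weight for field, weight in used.items()) / total_weight


def final_score(
    content_similarity: float, recency_similarity: float, recency_weight: float = 0.2
) -> float:
    """Linear blend of content similarity and recency (assumed formula)."""
    return (1 - recency_weight) * content_similarity + recency_weight * recency_similarity


# Identical titles, unrelated content: the title weight 1.5 out of 2.5 dominates.
combined = weighted_field_similarity({"title": 1.0, "content": 0.0})
```

Under this reading, setting a field weight to 0 removes the field from the average entirely, which matches the "abstract = 0" hint used in the large-site configuration.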
# Main plugin configuration
plugin.tx_semanticsuggestion_suggestions {
settings {
proximityThreshold = {$plugin.tx_semanticsuggestion_suggestions.settings.proximityThreshold}
maxSuggestions = {$plugin.tx_semanticsuggestion_suggestions.settings.maxSuggestions}
excludePages = {$plugin.tx_semanticsuggestion_suggestions.settings.excludePages}
excerptLength = {$plugin.tx_semanticsuggestion_suggestions.settings.excerptLength}
recencyWeight = {$plugin.tx_semanticsuggestion_suggestions.settings.recencyWeight}
debugMode = {$plugin.tx_semanticsuggestion_suggestions.settings.debugMode}
analyzedFields {
title = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.title}
description = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.description}
keywords = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.keywords}
abstract = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.abstract}
content = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.content}
}
}
view {
# Paths to your Fluid templates if you wish to customize them
templateRootPaths.10 = EXT:your_extension/Resources/Private/Templates/
partialRootPaths.10 = EXT:your_extension/Resources/Private/Partials/
layoutRootPaths.10 = EXT:your_extension/Resources/Private/Layouts/
}
}
# Reusable TypoScript object
lib.semantic_suggestion = USER
lib.semantic_suggestion {
userFunc = TYPO3\CMS\Extbase\Core\Bootstrap->run
extensionName = SemanticSuggestion
pluginName = Suggestions
vendorName = TalanHdf
view =< plugin.tx_semanticsuggestion_suggestions.view
persistence =< plugin.tx_semanticsuggestion_suggestions.persistence
settings =< plugin.tx_semanticsuggestion_suggestions.settings
}
# Content element integration
tt_content.list.20.semanticsuggestion_suggestions =< plugin.tx_semanticsuggestion_suggestions
plugin.tx_semanticsuggestion_suggestions {
settings {
# 🧠 NLP Features (NEW in v3.0)
enableStemming = 1 # Enable advanced stemming (German compound words)
defaultLanguage = en # Fallback language
minTextLength = 50 # Minimum text length for analysis
confidenceThreshold = 0.3 # Language detection confidence threshold
# 🌐 Language Mapping (TYPO3 12/13)
languageMapping {
0 = en # Language UID 0 → English
1 = fr # Language UID 1 → French
2 = de # Language UID 2 → German
3 = es # Language UID 3 → Spanish
4 = it # Language UID 4 → Italian
5 = pt # Language UID 5 → Portuguese
}
# 🇩🇪 German Language Optimization
proximityThreshold = 0.25 # Lower threshold for TF-IDF (more sensitive)
# 📊 TF-IDF Algorithm Settings
analyzedFields {
title = 1.5 # German titles often contain key compound words
keywords = 2.0 # German keywords are highly indicative
content = 1.0 # Base content weight
description = 1.0
abstract = 1.2
}
}
}
For a German/English bilingual site:
# constants.typoscript
plugin.tx_semanticsuggestion_suggestions.settings {
# 🇩🇪 German language optimization
enableStemming = 1 # Critical for compound words
proximityThreshold = 0.25 # Lower threshold for German compound matching
confidenceThreshold = 0.3
# Language detection mapping
languageMapping {
0 = en # TYPO3 language UID 0 = English
1 = de # TYPO3 language UID 1 = German
}
# German-optimized field weights
analyzedFields {
title = 2.0 # German titles contain key compounds
keywords = 2.5 # German keywords are very specific
content = 1.0 # Standard content weight
}
}
# IMPORTANT: Create separate Scheduler tasks:
# Task 1: English content (startPageId = 1, minimumSimilarity = 0.3)
# Task 2: German content (startPageId = 10, minimumSimilarity = 0.25)
- Analysis Scope: Defined by the Scheduler task's startPageId.
- DB Storage: Controlled by the Scheduler task's minimumSimilarity and excludePages.
- Similarity Calculation: Performed by the PageAnalysisService (called by the Scheduler task), which uses the TypoScript settings analyzedFields and recencyWeight.
- Frontend Display: Reads from the DB and filters/limits based on the TypoScript settings proximityThreshold, maxSuggestions, excludePages.
- Backend Display: Reads from the DB (based on the selected root_page_id) and filters based on the TypoScript proximityThreshold.

Key Points:
- The proximityThreshold (TypoScript) cannot display suggestions with a score lower than the minimumSimilarity (Scheduler) because they were not saved. For the TypoScript setting to be effective, it must be ≥ the Scheduler threshold.
- A page excluded in the Scheduler will never be analyzed/stored. A page excluded only in TypoScript will be analyzed/stored (if not excluded in Scheduler) but not displayed. It's often simpler to keep the excludePages lists synchronized.
- You can create multiple Scheduler tasks with different startPageId values to analyze different sections of the site.
Create a "Semantic Suggestion: Generate Similarities" task with these settings:
- startPageId (required): Root page UID for analysis scope
- qualityLevel (required): Quality threshold (0.1-1.0) for suggestions
  - Storage: Automatically set to qualityLevel - 0.1 (broad data collection)
  - Display: Uses qualityLevel directly (quality suggestions to users)
- excludePages (optional): Pages to exclude from both analysis and display
- recursiveExclusion (optional): Apply exclusions recursively
Recommended Quality Levels:
🇩🇪 German sites: 0.25 (compound word optimization)
🌍 Standard sites: 0.30 (balanced quality/quantity)
📚 Content sites: 0.35 (higher quality threshold)
💎 Premium sites: 0.40 (very selective suggestions)
plugin.tx_semanticsuggestion_suggestions {
settings {
# 🎯 UNIFIED QUALITY CONTROL
qualityLevel = 0.3 # Single parameter for all quality control
# Display Settings
maxSuggestions = 3 # Number of suggestions to show
excludePages = # Additional pages to exclude from display
excerptLength = 100 # Text excerpt length
# NLP Settings (v3.0+)
enableStemming = 1 # Enable advanced text processing
defaultLanguage = en # Fallback language
debugMode = 0 # Debug logging
}
}
# Single quality level controls everything
qualityLevel = 0.3
# Result:
# - Storage threshold: 0.2 (collects broad range)
# - Display threshold: 0.3 (shows quality suggestions)
# - No configuration conflicts possible
# Lower threshold for German compound words
qualityLevel = 0.25
enableStemming = 1
# Result optimized for compound words like:
# "Automobil" ↔ "Automobilindustrie"
# Higher threshold for premium content
qualityLevel = 0.4
# Result:
# - Storage threshold: 0.3 (good range)
# - Display threshold: 0.4 (only excellent suggestions)
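The arithmetic in the three examples above (storage threshold = qualityLevel - 0.1, display threshold = qualityLevel) can be written down as a short Python sketch. `derive_thresholds` is a hypothetical name, and the rounding to two decimals is an assumption added here to avoid floating-point noise such as 0.19999999999999998.

```python
def derive_thresholds(quality_level: float) -> tuple[float, float]:
    """Return (storage_threshold, display_threshold) for a unified qualityLevel."""
    if not 0.1 <= quality_level <= 1.0:
        raise ValueError("qualityLevel must be between 0.1 and 1.0")
    storage = max(0.0, round(quality_level - 0.1, 2))
    return storage, quality_level


standard = derive_thresholds(0.3)   # storage 0.2, display 0.3
premium = derive_thresholds(0.4)    # storage 0.3, display 0.4
```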
The extension automatically migrates old configurations:
Legacy v3.0:
Scheduler: minimumSimilarity = 0.4
TypoScript: proximityThreshold = 0.5
Auto-migrated to v3.1:
qualityLevel = 0.5 (migrated from proximityThreshold)
Internal storage = 0.4 (preserved)
- Identify your old proximityThreshold from TypoScript
- Set qualityLevel to that value
- Remove old parameters:
  - Delete proximityThreshold from TypoScript
  - Delete minimumSimilarity from Scheduler (auto-computed)
  - Merge duplicate excludePages lists
Example Migration:
# OLD v3.0 configuration
plugin.tx_semanticsuggestion_suggestions.settings {
proximityThreshold = 0.35 # OLD: Display threshold
excludePages = 42,56,78 # OLD: Display exclusions only
}
# NEW v3.1 configuration
plugin.tx_semanticsuggestion_suggestions.settings {
qualityLevel = 0.35 # NEW: Unified quality control
excludePages = 42,56,78 # NEW: Unified exclusions (analysis + display)
}
# Scheduler task OLD: minimumSimilarity = 0.25, excludePages = ""
# Scheduler task NEW: qualityLevel = 0.35 (automatically computes storage = 0.25)
Understanding these technical constraints will save you hours of debugging:
❌ WRONG Configuration:
Scheduler: minimumSimilarity = 0.7
TypoScript: proximityThreshold = 0.3
→ Result: Few or NO suggestions displayed (only pairs scoring ≥ 0.7 are stored, so the 0.3 display threshold has no effect)
✅ CORRECT Configuration:
Scheduler: minimumSimilarity = 0.3
TypoScript: proximityThreshold = 0.7
→ Result: Shows high-quality suggestions from broader stored data
Rule: proximityThreshold ≥ minimumSimilarity
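Why the rule holds is easiest to see as a two-stage filter. The Python sketch below is illustrative only: `store` and `display` are hypothetical stand-ins for the Scheduler task and the frontend plugin, and the similarity values are made up.

```python
Pair = tuple[int, int]


def store(computed: dict[Pair, float], minimum_similarity: float) -> dict[Pair, float]:
    """Scheduler stage: keep only pairs at or above the storage threshold."""
    return {pair: score for pair, score in computed.items() if score >= minimum_similarity}


def display(
    stored: dict[Pair, float],
    page_id: int,
    proximity_threshold: float,
    max_suggestions: int = 3,
) -> list[tuple[int, float]]:
    """Frontend stage: filter stored pairs for one page, best score first."""
    hits = [
        (b if a == page_id else a, score)
        for (a, b), score in stored.items()
        if page_id in (a, b) and score >= proximity_threshold
    ]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:max_suggestions]


# Made-up scores for four pages.
computed = {(1, 2): 0.45, (1, 3): 0.32, (1, 4): 0.12, (2, 3): 0.28}
stored = store(computed, minimum_similarity=0.3)

strict = display(stored, page_id=1, proximity_threshold=0.4)  # only page 2
loose = display(stored, page_id=1, proximity_threshold=0.1)   # pages 2 and 3
```

Page 4 never appears in `loose`, even though its score of 0.12 passes the 0.1 display threshold, because the pair was dropped at storage time.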
❌ PROBLEMATIC:
Scheduler: excludePages = "" (empty)
TypoScript: excludePages = "42,56,78"
→ Result: Pages 42,56,78 are analyzed/stored but never displayed (wasted processing)
✅ EFFICIENT:
Scheduler: excludePages = "42,56,78"
TypoScript: excludePages = "" (empty or same list)
→ Result: Pages 42,56,78 are never processed (faster, cleaner)
⚠️ TF-IDF produces LOWER scores than the v2.x word-frequency similarity:
v2.x typical range: 0.3-0.9
v3.0 TF-IDF range: 0.05-0.4
❌ Legacy Configuration:
minimumSimilarity = 0.8 → NO results with TF-IDF
✅ TF-IDF Optimized:
minimumSimilarity = 0.15 → Good range for TF-IDF
proximityThreshold = 0.25 → Quality suggestions
🇩🇪 German sites need LOWER thresholds due to compound word stemming:
❌ Standard Configuration:
proximityThreshold = 0.5 → Misses compound relationships
✅ German Optimized:
minimumSimilarity = 0.15
proximityThreshold = 0.25
enableStemming = 1
→ Captures "Automobil" ↔ "Automobilindustrie" relationships
⚠️ CONSTRAINT: Each language needs separate analysis
❌ Single Task Configuration:
Task 1: startPageId = 1 (analyzing both English + German pages)
→ Result: Language mixing, poor similarity quality
✅ Multi-Task Configuration:
Task 1: startPageId = 1 (English root) - minimumSimilarity = 0.3
Task 2: startPageId = 10 (German root) - minimumSimilarity = 0.25
→ Result: Optimized per-language analysis
Sites with >500 pages:
❌ Permissive Configuration:
minimumSimilarity = 0.05 → Database bloat (millions of records)
maxSuggestions = 10 → Slow frontend queries
✅ Performance Optimized:
minimumSimilarity = 0.25 → Quality storage
proximityThreshold = 0.4 → Fast display
maxSuggestions = 3 → Quick queries
Before deploying, verify these constraints:
- Threshold Check: proximityThreshold ≥ minimumSimilarity
- Exclusion Sync: Scheduler excludePages includes all TypoScript exclusions
- TF-IDF Adjustment: Thresholds lowered from v2.x values (≤ 0.4 typically)
- Language Separation: Each language has its own Scheduler task
- Performance Test: Task execution time acceptable for your server
- Storage Monitoring: Database table tx_semanticsuggestion_similarities size reasonable
Watch for these symptoms:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Zero suggestions displayed | proximityThreshold > all stored similarities | Lower proximityThreshold or minimumSimilarity |
| Poor suggestion quality | Threshold too low | Raise proximityThreshold |
| Missing expected pages | Pages excluded in Scheduler | Check excludePages settings |
| Scheduler timeouts | Large site with low threshold | Raise minimumSimilarity |
| Mixed language results | Single task for multilingual site | Create per-language tasks |
Integrate the plugin into your Fluid templates to display suggestions:
<f:cObject typoscriptObjectPath='lib.semantic_suggestion' />

Or include it directly in TypoScript:
# Include on a page
page.10 =< lib.semantic_suggestion
# Or in a content element
lib.myContent = COA
lib.myContent {
10 =< lib.semantic_suggestion
}
The plugin will read relevant suggestions for the current page from the database, applying filters defined in the TypoScript settings (proximityThreshold, maxSuggestions, excludePages).
A backend module ("Semantic Suggestion" under "Web") allows visualizing the results of the analyses stored in the database.
- Analysis Selection: Choose which analysis to view (based on the startPageId/root_page_id of executed Scheduler tasks).
- Detailed Statistics: Most similar pairs, score distribution, pages with the most links, language statistics.
- Configuration Overview: Reminder of the main parameters used (display threshold, etc.).
- Performance Metrics: Module load time, number of stored pairs for the selected analysis.
The "Semantic Suggestion: Generate Similarities" task is essential for the extension's operation.
- Role: Calculates similarities between pages (using PageAnalysisService) and saves relevant results (above the minimumSimilarity threshold) to the tx_semanticsuggestion_similarities table.
- Configuration: Set the startPageId, excludePages, and minimumSimilarity via the Scheduler interface.
- Frequency: Schedule its execution regularly (e.g., daily, weekly) during off-peak hours to keep suggestions up-to-date without impacting site performance.
- 🎯 Language Detection (NEW in v3.0)
    Text Input → TYPO3 Site Context → Content Analysis → Language: "de"
- 📝 Advanced Text Processing (via nlp_tools)
    Raw Text → Stop Words Removal → Stemming (German) → Clean Tokens
    Example: "Die deutschen Automobilhersteller" → "deutsch automobilherstell" (stemmed)
- 🔢 TF-IDF Vectorization (NEW in v3.0)
    [Text1, Text2] → TF-IDF Vectors → Cosine Similarity Score
    Traditional: Simple word counting
    TF-IDF: Professional semantic analysis
- 💾 Smart Storage & Display
    Similarity Score + Recency Boost → Database → Frontend Filter
    Scheduler threshold: 0.3 → Display threshold: 0.5
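The text-processing step of this pipeline can be imitated with a toy Python sketch. It is not the Snowball stemmer used via nlp_tools: the three suffix rules and the short stop-word list are hypothetical and only chosen so that the README's own example comes out the same way.

```python
import re

# Subset of the German stop words listed earlier in this README.
GERMAN_STOP_WORDS = {"der", "die", "das", "ein", "eine", "mit", "von"}

# Toy suffix rules (assumption). A real Snowball stemmer is far more careful.
SUFFIXES = ("en", "er", "e")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least four characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [toy_stem(token) for token in tokens if token not in GERMAN_STOP_WORDS]


tokens = preprocess("Die deutschen Automobilhersteller")
# tokens == ["deutsch", "automobilherstell"]
```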
// Simple word frequency comparison
$similarity = $dotProduct / ($magnitude1 * $magnitude2);

// Professional semantic analysis
$tfidfResult = $this->textVectorizer->createTfIdfVectors([$text1, $text2], $language);
$similarity = $this->textVectorizer->cosineSimilarity($vector1, $vector2);
// + German stemming, stop word removal, confidence scoring

Performance Impact: 30-50% better accuracy, especially for German compound words.
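For readers who want to see what TF-IDF vectorization followed by cosine similarity computes, here is a minimal, self-contained Python sketch. It is not the nlp_tools TextVectorizerService: the function names are hypothetical, and the smoothed IDF formula (log((1 + N) / (1 + df)) + 1) is an assumption. Some smoothing is needed when only a few texts are compared, because a plain log(N / df) gives every term shared by all documents a weight of zero.

```python
import math
from collections import Counter


def tfidf_vectors(documents: list[list[str]]) -> list[dict[str, float]]:
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Already stemmed, stop-word-free tokens (made-up example pages).
pages = [
    ["deutsch", "automobil", "industri"],
    ["automobil", "herstell"],
    ["rezept", "kuchen"],
]
vectors = tfidf_vectors(pages)
related = cosine_similarity(vectors[0], vectors[1])    # share "automobil": above 0
unrelated = cosine_similarity(vectors[0], vectors[2])  # no shared term: exactly 0.0
```

Rare terms receive higher weights than terms found on most pages, which is what distinguishes this from the plain word-frequency comparison shown first.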
The extension automatically detects languages from your TYPO3 site configuration:
# config/sites/main/config.yaml
languages:
  -
    languageId: 0
    title: 'English'
    hreflang: 'en-US'
    locale: 'en_US.UTF-8' # → Auto-detected as "en"
    iso-639-1: 'en'
  -
    languageId: 1
    title: 'Deutsch'
    hreflang: 'de-DE'
    locale: 'de_DE.UTF-8' # → Auto-detected as "de"
    iso-639-1: 'de'
  -
    languageId: 2
    title: 'Français'
    hreflang: 'fr-FR'
    locale: 'fr_FR.UTF-8' # → Auto-detected as "fr"
    iso-639-1: 'fr'

For optimal performance, create separate Scheduler tasks for each language:
Task 1: "Generate Similarities - English"
- startPageId: 1 (English root)
- minimumSimilarity: 0.3
Task 2: "Generate Similarities - German"
- startPageId: 2 (German root)
- minimumSimilarity: 0.25 # Lower for German (compound words)
Task 3: "Generate Similarities - French"
- startPageId: 3 (French root)
- minimumSimilarity: 0.3
plugin.tx_semanticsuggestion_suggestions.settings {
# German compound words need lower thresholds
proximityThreshold = 0.25
# Enable stemming for better compound word detection
enableStemming = 1
# German titles are often highly descriptive
analyzedFields {
title = 2.0 # Higher weight for German titles
keywords = 2.5 # German keywords are very specific
}
}
Solution: Check your TypoScript language mapping:
plugin.tx_semanticsuggestion_suggestions.settings {
languageMapping {
0 = en
1 = de # Make sure this matches your site language UID
2 = fr
}
}
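The effect of a wrong or missing mapping entry can be illustrated with a tiny Python sketch. It assumes, without confirmation from the extension's code, that an unmapped language UID falls back to defaultLanguage; `language_for_uid` is a hypothetical name.

```python
# Mirrors the languageMapping block above.
LANGUAGE_MAPPING = {0: "en", 1: "de", 2: "fr"}


def language_for_uid(
    language_uid: int,
    mapping: dict[int, str] = LANGUAGE_MAPPING,
    default_language: str = "en",
) -> str:
    """Look up the analysis language for a TYPO3 language UID (assumed fallback)."""
    return mapping.get(language_uid, default_language)


german = language_for_uid(1)    # "de"
unmapped = language_for_uid(7)  # not in the mapping: falls back to "en"
```

If German pages use UID 2 on your site while the mapping says 2 = fr, they would be processed with French rules, which is the kind of mismatch this check is meant to catch.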
Enable debug mode to see detection process:
plugin.tx_semanticsuggestion_suggestions.settings {
debugMode = 1
}
Check logs for:
Language detected via nlp_tools: de
TF-IDF similarity calculated: language=de, vocabularySize=156
Confidence verification: firstScore=0.85, secondScore=0.45, using content analysis
Modify the appearance of suggestions by overriding the plugin's Fluid template (List.html). Configure the paths to your custom templates in TypoScript (see Configuration section).
Enable comprehensive logging to troubleshoot language detection and similarity calculation:
plugin.tx_semanticsuggestion_suggestions.settings {
debugMode = 1
}
Debug logs show:
[INFO] Language detected via nlp_tools: de
[DEBUG] TF-IDF similarity calculated: page1=123, page2=456, similarity=0.75, vocabularySize=245
[DEBUG] German stemming applied: "Automobilindustrie" → "automobilindustr"
[DEBUG] Confidence verification: using TYPO3 context due to low confidence difference
[ERROR] Failed to create TF-IDF vectors: text too short, using fallback calculation
Backend Module shows performance metrics:
- TF-IDF processing time per page pair
- Language detection accuracy
- Vocabulary size per language
- Cache hit rates for stemming/vectorization
Scheduler Task execution time:
- Small sites (<100 pages): ~10-30 seconds
- Medium sites (100-500 pages): ~1-5 minutes
- Large sites (>500 pages): ~5-30 minutes
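The steep growth in execution time is easier to see in terms of page pairs. The sketch below assumes the task compares every page with every other page in the analyzed tree; this README does not state that explicitly, so treat the numbers as an upper-bound illustration rather than a description of the actual implementation.

```python
def page_pairs(page_count):
    """Number of unordered page pairs among page_count pages: n * (n - 1) / 2."""
    return page_count * (page_count - 1) // 2


for pages in (100, 500, 1000):
    print(f"{pages} pages -> {page_pairs(pages)} pairs")
```

Under that assumption, going from 100 to 500 pages multiplies the number of comparisons by roughly 25 (4,950 versus 124,750 pairs), which is why splitting large trees into several tasks pays off.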
# Scheduler Configuration: minimumSimilarity = 0.3 (storage optimization)
plugin.tx_semanticsuggestion_suggestions.settings {
# Performance optimizations
minTextLength = 100 # Skip short content (faster processing)
proximityThreshold = 0.4 # Higher display threshold (faster queries)
maxSuggestions = 3 # Limit results (faster rendering)
# Selective field analysis (reduce processing time)
analyzedFields {
title = 2.0 # Titles are fast to process
keywords = 2.0 # Keywords are lightweight
content = 0.5 # Reduce content weight (can be slow)
description = 1.0 # Meta descriptions are fast
abstract = 0 # Disable if not commonly used
}
# Conservative language settings
confidenceThreshold = 0.4 # Higher confidence reduces processing
enableStemming = 1 # Keep enabled (cached results)
}
# SCHEDULER OPTIMIZATION for large sites:
# Split into multiple tasks:
# Task 1: Pages 1-100 (daily)
# Task 2: Pages 101-200 (every 2 days)
# Task 3: Pages 201+ (weekly)
Ensure TYPO3 cache is properly configured for optimal performance:
# Clear cache after configuration changes
./vendor/bin/typo3 cache:flush
# Monitor cache effectiveness
./vendor/bin/typo3 cache:listGroups

When suggestions aren't working as expected, follow this systematic approach:
# Check if task has run successfully
./vendor/bin/typo3 scheduler:run
# Check TYPO3 logs for task errors
tail -f var/log/typo3_*.log | grep -i semantic

Expected Output:
[INFO] Starting similarity generation task, startPageId: 1, minimumSimilarity: 0.3
[INFO] Similarity generation task completed successfully
-- Check if similarities are stored
SELECT COUNT(*) as total_similarities
FROM tx_semanticsuggestion_similarities;
-- Check score distribution
SELECT
ROUND(similarity_score, 1) as score_range,
COUNT(*) as count,
sys_language_uid
FROM tx_semanticsuggestion_similarities
GROUP BY ROUND(similarity_score, 1), sys_language_uid
ORDER BY score_range DESC;

Healthy Output Example:
score_range | count | sys_language_uid
0.4 | 15 | 0
0.3 | 42 | 0
0.2 | 128 | 0
0.1 | 203 | 0
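A quick way to read such a distribution is to ask how many stored pairs would still be displayed at a given `proximityThreshold`. The following sketch reuses the example figures above; the function name is hypothetical, and because the SQL query rounds scores to one decimal, the result is only an approximation.

```python
def visible_pairs(distribution, proximity_threshold):
    """Sum the pair counts of all score buckets at or above the threshold."""
    return sum(
        count
        for score, count in distribution.items()
        if score >= proximity_threshold
    )


# Score bucket -> pair count, copied from the example output (language 0).
healthy = {0.4: 15, 0.3: 42, 0.2: 128, 0.1: 203}

print(visible_pairs(healthy, 0.3))  # 57
print(visible_pairs(healthy, 0.8))  # 0
```

With this data, a legacy threshold of 0.8 would hide every stored pair, while 0.3 keeps 57 of them.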
# Enable debug mode temporarily
plugin.tx_semanticsuggestion_suggestions.settings {
debugMode = 1
}
Check Debug Logs:
tail -f typo3temp/logs/semantic_suggestion.log

Symptoms:
- Frontend shows empty suggestions list
- Backend module shows 0 similar pairs
Diagnosis Checklist:
✓ Scheduler task executed successfully?
✓ Database contains similarities?
✓ proximityThreshold ≤ stored similarities?
✓ Current page has stored similarities?
Solutions:
- Threshold Too High
  ❌ Current: proximityThreshold = 0.8
  ✅ Fix: proximityThreshold = 0.3 (or lower)
  # Or check what's actually in the database:
  SELECT MAX(similarity_score) FROM tx_semanticsuggestion_similarities;
- Scheduler Never Ran
  # Manual execution
  ./vendor/bin/typo3 scheduler:run --task=<task_id>
  # Check task configuration
  SELECT * FROM tx_scheduler_task WHERE classname LIKE '%Similarities%';
- Wrong Root Page
  ❌ Current: startPageId = 1, viewing page = 42
  ✅ Fix: startPageId = 1, ensure page 42 is a child of page 1
  # Verify page tree relationship
  SELECT pid, title FROM pages WHERE uid = 42;
Symptoms:
- Suggestions shown but irrelevant
- Mixed languages in suggestions
- Very low similarity scores
Solutions:
- TF-IDF Score Adjustment (v3.0+)
  ❌ Legacy: proximityThreshold = 0.7
  ✅ TF-IDF: proximityThreshold = 0.25 # TF-IDF scores are naturally lower
- Language Separation Required
  ❌ Single task: Pages 1-100 (mixed EN/DE content)
  ✅ Multi-task:
     - Task 1: English pages (1-50)
     - Task 2: German pages (51-100)
- Field Weight Optimization
  # Increase weight of reliable fields
  analyzedFields {
    title = 2.0 # Titles are usually accurate
    keywords = 2.5 # Keywords are intentional
    content = 0.5 # Content can be noisy
  }
Symptoms:
- Page A should suggest Page B (they're clearly related)
- Page B exists in database but not suggested to Page A
Diagnosis:
-- Check if relationship exists in database
SELECT similarity_score
FROM tx_semanticsuggestion_similarities
WHERE page_id = A AND similar_page_id = B;
-- Check if excluded somewhere
SELECT exclude_pages FROM tx_scheduler_task WHERE classname LIKE '%Similarities%';

Solutions:
- Page Excluded in Scheduler
  ❌ Scheduler excludePages = "42,56,78" (contains Page B)
  ✅ Remove Page B from exclusions, re-run Scheduler
- Similarity Below Threshold
  -- Find actual similarity score
  SELECT similarity_score FROM tx_semanticsuggestion_similarities
  WHERE page_id = A AND similar_page_id = B;
  -- If score = 0.22 but proximityThreshold = 0.3
  -- Lower the threshold or improve content similarity
- Text Content Insufficient
  Check if Page A or B has minimal text content:
  - Minimum 50 characters required
  - Pure image pages won't generate similarities
  - Check 'minTextLength' setting
Symptoms:
- Task shows "Failed" status
- PHP timeout errors in logs
- Task takes >5 minutes
Solutions:
- Increase Processing Limits
  # In Scheduler task or php.ini
  ini_set('max_execution_time', 300); // 5 minutes
  ini_set('memory_limit', '512M');
- Reduce Analysis Scope
  ❌ Current: startPageId = 1 (1000+ pages)
  ✅ Split:
     - Task 1: startPageId = 1 (pages 1-100)
     - Task 2: startPageId = 101 (pages 101-200)
- Increase Threshold
  ❌ Current: minimumSimilarity = 0.05 (stores everything)
  ✅ Optimized: minimumSimilarity = 0.25 (quality only)
Symptoms:
- English page suggests German pages
- Suggestions ignore language boundaries
Solutions:
- Enable Language Mapping
  plugin.tx_semanticsuggestion_suggestions.settings {
    languageMapping {
      0 = en
      1 = de
      2 = fr
    }
  }
- Check Site Configuration
  # site/config.yaml should have proper locales
  languages:
    - languageId: 0
      locale: 'en_US.UTF-8' # ← Must be properly formatted
    - languageId: 1
      locale: 'de_DE.UTF-8' # ← Must be properly formatted
- Separate Scheduler Tasks
  Instead of: One task analyzing a mixed-language tree
  Use: One task per language branch
# Clear all caches
./vendor/bin/typo3 cache:flush
# Re-run all scheduler tasks
./vendor/bin/typo3 scheduler:run
# Check database table size
echo "SELECT COUNT(*) FROM tx_semanticsuggestion_similarities;" | mysql your_db
# Reset extension configuration (emergency)
./vendor/bin/typo3 configuration:remove --path="EXTENSIONS/semantic_suggestion"
./vendor/bin/typo3 extension:deactivate semantic_suggestion
./vendor/bin/typo3 extension:activate semantic_suggestion

Before going live, verify:
- Scheduler Tasks: All tasks run successfully without timeouts
- Database Check: tx_semanticsuggestion_similarities contains expected data
- Threshold Validation: proximityThreshold ≥ minimumSimilarity
- Language Testing: Each language shows appropriate suggestions
- Performance Test: Frontend loads suggestions in <200ms
- Content Quality: Manual review of suggestion relevance
- Exclusion Review: All intentionally excluded pages work correctly
Use this comprehensive checklist to ensure your configuration is optimal:
- startPageId exists and is accessible: SELECT title FROM pages WHERE uid = [startPageId];
- minimumSimilarity appropriate for TF-IDF: Between 0.1 and 0.4 (not v2.x values like 0.8)
- excludePages list verified: All UIDs exist and are intentionally excluded
- Task execution successful: Check task history and logs for errors
- Multilingual separation: Each language has its own task (recommended)
- Threshold hierarchy respected: proximityThreshold ≥ minimumSimilarity
- TF-IDF thresholds updated: Not using v2.x legacy values (>0.5)
- Language settings match site: languageMapping corresponds to TYPO3 language UIDs
- Field weights optimized: Higher weights for title/keywords, lower for content
- Performance settings: maxSuggestions and minTextLength appropriate for site size
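The threshold rules in this checklist can be checked mechanically. The sketch below is a hypothetical helper, not part of the extension: it encodes only the three rules stated above (display threshold not below storage threshold, `minimumSimilarity` between 0.1 and 0.4, no legacy values above 0.5) and returns human-readable warnings.

```python
def check_thresholds(proximity_threshold, minimum_similarity):
    """Return a list of warnings for the threshold rules in the checklist."""
    warnings = []
    # Pairs below minimumSimilarity are never stored, so a lower display
    # threshold cannot reveal anything extra.
    if proximity_threshold < minimum_similarity:
        warnings.append("proximityThreshold is below minimumSimilarity")
    if not 0.1 <= minimum_similarity <= 0.4:
        warnings.append(
            "minimumSimilarity is outside the 0.1-0.4 range suggested for TF-IDF"
        )
    if proximity_threshold > 0.5:
        warnings.append("proximityThreshold looks like a v2.x legacy value (>0.5)")
    return warnings


print(check_thresholds(0.3, 0.25))  # []
print(check_thresholds(0.8, 0.3))
```

An empty list means the values respect the hierarchy; the second call flags the legacy display threshold.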
-- Verify data exists and has reasonable distribution
SELECT
sys_language_uid,
MIN(similarity_score) as min_score,
MAX(similarity_score) as max_score,
AVG(similarity_score) as avg_score,
COUNT(*) as total_pairs
FROM tx_semanticsuggestion_similarities
GROUP BY sys_language_uid;
-- Check for unexpected language mixing
SELECT DISTINCT root_page_id, sys_language_uid, COUNT(*) as pairs
FROM tx_semanticsuggestion_similarities
GROUP BY root_page_id, sys_language_uid;

- Plugin displays correctly: <f:cObject typoscriptObjectPath='lib.semantic_suggestion' /> works
- Suggestions appear on pages: Test on multiple pages with different content
- Language separation working: German pages don't show English suggestions
- Exclusions effective: Excluded pages don't appear in suggestions
- Performance acceptable: Page load time impact <100ms
- nlp_tools dependency installed: composer show cywolf/nlp-tools
- Stemming working (German sites): Debug logs show stemmed words
- Language detection accurate: Pages analyzed in correct language
- Site configuration proper: Locales formatted as de_DE.UTF-8 (not just de)
| Warning Sign | Quick Test | Solution |
|---|---|---|
| Zero suggestions anywhere | SELECT COUNT(*) FROM tx_semanticsuggestion_similarities; | Lower thresholds or check task |
| Very low similarity scores (all <0.1) | Check debug logs for TF-IDF failures | Verify text length and language detection |
| Mixed language results | Test DE page shows EN suggestions | Separate scheduler tasks |
| Poor suggestion quality | Manual review of top suggestions | Adjust field weights or thresholds |
| Slow performance | Frontend timing >500ms | Increase thresholds or reduce maxSuggestions |
Rate your setup (aim for 80%+ before going live):
Basic Setup (50 points)
- Scheduler task runs (20 pts)
- Database contains data (15 pts)
- Frontend shows suggestions (15 pts)
Optimization (30 points)
- TF-IDF thresholds optimized (10 pts)
- Language separation implemented (10 pts)
- Performance <200ms (10 pts)
Advanced Features (20 points)
- German stemming active (5 pts)
- Debug logging configured (5 pts)
- Exclusions properly managed (5 pts)
- Field weights customized (5 pts)
Score: ___ / 100
Target: 80+ for production deployment
Minimum: 60+ for staging/testing
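The scorecard can also be tallied in a few lines. This sketch copies the point values from the list above; the helper name and the "not ready" label for scores below 60 are assumptions added for illustration.

```python
# Point values copied from the scorecard above.
CHECKS = {
    "Scheduler task runs": 20,
    "Database contains data": 15,
    "Frontend shows suggestions": 15,
    "TF-IDF thresholds optimized": 10,
    "Language separation implemented": 10,
    "Performance <200ms": 10,
    "German stemming active": 5,
    "Debug logging configured": 5,
    "Exclusions properly managed": 5,
    "Field weights customized": 5,
}


def readiness(passed):
    """Return (score, verdict) for the set of check names that passed."""
    score = sum(points for name, points in CHECKS.items() if name in passed)
    if score >= 80:
        verdict = "production"
    elif score >= 60:
        verdict = "staging/testing"
    else:
        verdict = "not ready"  # label is an assumption, not from the scorecard
    return score, verdict


basic = {"Scheduler task runs", "Database contains data", "Frontend shows suggestions"}
print(readiness(basic))  # (50, 'not ready')
```

Passing only the Basic Setup checks yields 50 points, so at least some optimization items are needed before staging.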
The extension includes automatic fallback mechanisms:
- TF-IDF Processing: Falls back to old cosine similarity if nlp_tools fails
- Language Detection: Falls back to TypoScript mapping if site detection fails
- Text Processing: Falls back to basic stop word removal if stemming fails
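The language-detection fallback can be pictured as a two-step lookup. The sketch below is a rough Python illustration under stated assumptions: the function names are hypothetical, the strict `xx_XX.ENCODING` locale pattern is a guess based on the `de_DE.UTF-8` format recommended in this README, and the extension's actual detection logic may differ.

```python
import re

# Assumed locale shape: two lowercase letters, underscore, two uppercase
# letters, optional encoding suffix (for example "de_DE.UTF-8").
LOCALE_PATTERN = re.compile(r"^([a-z]{2})_[A-Z]{2}(\.[\w-]+)?$")


def language_from_locale(locale):
    """Return the two-letter code from a well-formed locale, else None."""
    match = LOCALE_PATTERN.match(locale or "")
    return match.group(1) if match else None


def detect_language(language_id, site_locales, language_mapping):
    """Prefer the site locale; fall back to the TypoScript-style mapping."""
    code = language_from_locale(site_locales.get(language_id))
    if code is not None:
        return code
    return language_mapping.get(language_id)


site_locales = {0: "en_US.UTF-8", 1: "de"}  # language 1 is malformed on purpose
language_mapping = {0: "en", 1: "de", 2: "fr"}

print(detect_language(0, site_locales, language_mapping))  # en (from locale)
print(detect_language(1, site_locales, language_mapping))  # de (from mapping)
```

Keeping `languageMapping` in sync with the site configuration means the fallback still returns the right language when a locale is missing or malformed.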
- Install nlp_tools dependency:
  composer require cywolf/nlp-tools
- Update TypoScript configuration (add new NLP settings):
  plugin.tx_semanticsuggestion_suggestions.settings {
    enableStemming = 1
    languageMapping {
      0 = en
      1 = de
      # ... your language mapping
    }
  }
- Clear TYPO3 cache:
  ./vendor/bin/typo3 cache:flush
- Regenerate similarities (run Scheduler task):
  - All existing similarities will be recalculated with TF-IDF
  - German content should see immediate improvement
  - Check debug logs to verify language detection
- Monitor performance:
  - Check backend module for accuracy improvements
  - Compare similarity scores before/after migration
  - Verify multilingual sites work correctly
If you experience issues, you can disable advanced features:
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 0 # Disable stemming
debugMode = 1 # Enable debug logging
minTextLength = 200 # Increase minimum text length
}
The extension will automatically fall back to v2.x behavior while maintaining database compatibility.
Contributions are welcome! Fork the repository, create a branch, make your changes, and submit a Pull Request.
This project is licensed under the GNU General Public License v2.0 or later. See the LICENSE file.
Contact: Wolfangel Cyril (cyril.wolfangel@gmail.com)
Bugs & Features: GitHub Issues
Documentation & Updates: GitHub Repository
