Codebase Review: ClusterFlick Scripts
Date: February 2026
Reviewer: Senior Software Engineer Assessment
Overview
This is a well-structured data aggregation pipeline for London cinema listings. The codebase scrapes 200+ cinema/venue websites, normalizes the data, matches against TMDB, and produces a unified dataset.
Strengths
1. Clear Architecture & Separation of Concerns
The pipeline follows a logical flow (retrieve → transform → combine → match) with clean separation. Each cinema/source is self-contained with its own attributes.js, retrieve.js, and transform.js. This makes adding new venues straightforward.
2. Excellent Use of Snapshots & HTTP Recording
The testing strategy using PollyJS to record HTTP interactions is solid. It enables deterministic testing of web scrapers without hitting live APIs. The custom ChunkedFsPersister for large HAR files shows thoughtful handling of test artifacts.
3. Robust Title Normalization
The normalize-title.js file contains extensive domain knowledge for matching movie titles. While the approach is brute-force (500+ corrections), it reflects the real-world complexity of cinema data.
4. Multi-Layer Matching Strategy
The TMDB matching logic in get-movie-data.js is sophisticated - it tries director/actor matching, year variations, and falls back to LLM assistance. The layered approach maximizes match rates.
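The layered-fallback approach can be sketched as a strategy chain that tries each matcher in order and degrades gracefully; the strategy functions below are hypothetical stand-ins, not the actual helpers in get-movie-data.js:

```javascript
// Sketch of a layered matcher: try strategies in order, return the
// first hit, and fall through as "unmatched" if all of them miss.
async function matchMovie(title, strategies) {
  for (const strategy of strategies) {
    const result = await strategy(title);
    if (result) return result;
  }
  // Graceful degradation: unmatched movies still flow through.
  return { title, matched: false };
}
```

The order of the array encodes the priority (cheap exact matching first, expensive LLM assistance last), so adding a new layer is a one-line change.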
5. Good Error Handling for External APIs
The withRetry, fetchWithRetry, and runLlmFunction wrappers handle rate limits, transient failures, and model-specific errors gracefully.
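For reference, a minimal retry-with-exponential-backoff wrapper looks roughly like this; it is a sketch of the pattern, not the project's actual withRetry implementation:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a fallible async function with exponential backoff.
// Delays grow as baseDelayMs * 2^attempt between attempts.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError; // surface the final failure to the caller
}
```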
Areas of Concern
1. Technical Debt in normalize-title.js (~785 lines)
File: common/normalize-title.js
This file is a maintainability hazard. The 500+ hardcoded corrections will grow without bound: each newly mismatched title adds another entry.
Recommendations:
- A rules engine with categorized transformations
- LLM-based title normalization as a first pass
- A separate data file for cinema-specific quirks
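A categorized rules engine could look something like the sketch below; the rule categories and example patterns are illustrative, not taken from the codebase:

```javascript
// Declarative, categorized rules instead of one flat list of
// hardcoded corrections. New quirks become data, not code changes.
const titleRules = [
  { category: "formatting", pattern: /\s*\(35mm\)\s*$/i, replace: "" },
  { category: "spelling",   pattern: /\bcolour\b/gi,     replace: "color" },
];

function normalizeTitle(title, rules = titleRules) {
  return rules
    .reduce((t, rule) => t.replace(rule.pattern, rule.replace), title)
    .trim();
}
```

Because each rule carries a category, the rules can live in separate data files per category (or per cinema) and be tested in isolation.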
2. Massive Opt-In List in transform/index.js
File: scripts/transform/index.js (lines 79-310)
The 200+ element optedIn array is manually maintained and will drift.
Recommendations:
- Store opt-in status in each cinema's attributes.js
- Invert to an opt-out list (if most cinemas should be opted in)
- Use a configuration file separate from code
3. Synchronous File I/O in Cache
File: common/cache.js
The cache uses synchronous fs.existsSync, fs.statSync, fs.readFileSync, and fs.writeFileSync. For a data pipeline this likely isn't a bottleneck, but it's inconsistent with the async/await patterns elsewhere.
4. LLM JSON Parsing Fragility
File: common/llm-client.js (lines 51-64)
The manual JSON cleanup regex patterns are brittle. The LLM model (gemini-2.5-flash-lite) should be prompted to return structured JSON, or use Gemini's built-in JSON mode if available.
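With the official @google/generative-ai SDK, JSON mode is a one-line generation-config change; the helper below is a hedged sketch of how llm-client.js could request it, not the project's actual code:

```javascript
// Build a model config that asks Gemini for structured JSON output,
// removing the need for regex cleanup of the response text.
function jsonModelConfig(modelName) {
  return {
    model: modelName,
    generationConfig: { responseMimeType: "application/json" },
  };
}

// Usage (requires an API key; shown for shape only):
// const { GoogleGenerativeAI } = require("@google/generative-ai");
// const model = new GoogleGenerativeAI(key).getGenerativeModel(
//   jsonModelConfig("gemini-2.5-flash-lite")
// );
```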
5. No TypeScript / Type Definitions
The codebase is 100% JavaScript with no JSDoc type annotations. Given the data-heavy nature (movies, performances, venues with specific schemas), TypeScript would catch data shape mismatches at compile time rather than runtime.
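A lower-cost middle ground is JSDoc typedefs checked with `tsc --checkJs`, which adds type safety without a migration; the Performance shape below is a guess at the schema, for illustration only:

```javascript
/**
 * @typedef {Object} Performance
 * @property {string} time  - ISO 8601 start time
 * @property {string} [url] - optional booking link
 */

/**
 * Sort performances chronologically (ISO strings sort lexicographically).
 * @param {Performance[]} performances
 * @returns {Performance[]} a new array sorted by start time
 */
function sortPerformances(performances) {
  return [...performances].sort((a, b) => a.time.localeCompare(b.time));
}
```

With checkJs enabled, passing an object missing `time` to sortPerformances becomes a compile-time error instead of a runtime crash.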
6. Test Coverage Appears Limited
While there are 154 test files, they're primarily snapshot-based integration tests. There's no unit testing visible for:
- get-movie-data.js matching logic
- source-utils.js venue matching
- The combine logic's deduplication algorithm
7. Inconsistent Variable Declarations
File: common/utils.js (line 1)
There's inconsistent use of var vs const. Should be const throughout.
Architectural Suggestions
1. Extract Venue Matching to Dedicated Module
The logic for matching venues across sources (geo distance, postcode fallback, name normalization) is scattered. A dedicated VenueMatcher class would centralize this:
```javascript
class VenueMatcher {
  constructor(knownCinemas, options = {}) { ... }
  findMatch(venueName, coordinates, address) { ... }
}
```
2. Consider a Plugin System for Common Patterns
Several cinema sites share the same booking platform:
- common/curzon.com/
- common/odeon.co.uk/
- common/myvue.com/
- etc.
These could be formalized as "adapters" with configuration rather than separate implementations:
```javascript
// Instead of copying code for each Cineworld location
createCineworldCinema({
  id: 'cineworld-bexleyheath',
  siteId: '8',
  name: 'Cineworld Bexleyheath',
  // ...
});
```
3. Schema Validation Earlier in Pipeline
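One possible shape for that factory, with hypothetical shared-platform helpers stubbed in for illustration (the real adapters would hold the actual scraping logic):

```javascript
// Hypothetical shared-platform helpers, stubbed for illustration:
const retrieveFromPlatform = (platform, siteId) => ({ platform, siteId });
const transformPlatformData = (platform, raw) => raw;

// The adapter idea: per-site config drives shared platform logic,
// so each Cineworld location is data rather than duplicated code.
function createCineworldCinema({ id, siteId, name }) {
  return {
    attributes: { id, name },
    retrieve: () => retrieveFromPlatform("cineworld", siteId),
    transform: (raw) => transformPlatformData("cineworld", raw),
  };
}
```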
The schema validation happens at the end of transform. Consider validating earlier (after individual cinema transforms) to catch issues closer to their source.
4. Structured Logging
Replace console.log with structured logging. The emoji-based output is charming for development but makes log aggregation/parsing difficult in CI or production contexts.
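Even without adopting a logging library, one JSON object per line is enough for aggregators to parse; a minimal sketch of the idea:

```javascript
// Emit one JSON object per line; machine-parseable, grep-friendly.
function formatLogLine(level, message, fields = {}) {
  return JSON.stringify({
    level,
    message,
    time: new Date().toISOString(),
    ...fields,
  });
}

const log = (level, message, fields) =>
  console.log(formatLogLine(level, message, fields));
```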
5. Rate Limiting Centralization
Rate limiting for TMDB, LLM, and other APIs is handled ad-hoc with sleep(). A centralized rate limiter (like bottleneck) would be more robust.
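A minimal centralized limiter (the bottleneck package provides a more robust version of the same idea) might look like the sketch below: calls are serialized through a promise chain with a minimum gap, instead of scattering ad-hoc sleep()s:

```javascript
// Serialize scheduled calls with a minimum interval between starts.
// Note: a rejected call will break the chain in this sketch; a real
// limiter (e.g. bottleneck) handles errors per-call.
function createRateLimiter(minIntervalMs) {
  let queue = Promise.resolve();
  let lastRun = 0;
  return function schedule(fn) {
    queue = queue.then(async () => {
      const wait = lastRun + minIntervalMs - Date.now();
      if (wait > 0) await new Promise((r) => setTimeout(r, wait));
      lastRun = Date.now();
      return fn();
    });
    return queue;
  };
}
```

One limiter instance per upstream API (TMDB, the LLM) then gives a single place to tune limits.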
What's Working Well
- Domain Modeling - The schema is well-designed for cinema data
- Graceful Degradation - Unmatched movies still flow through as "unmatched"
- Historical Data Preservation - The seen timestamp tracking for "new" movies
- Cache Strategy - Daily cache keys prevent stale data while avoiding API overload
- CI Pipeline - Simple but effective GitHub Actions setup
Priority Recommendations
| Priority | Issue | Effort | Description |
|---|---|---|---|
| High | Extract opt-in list from code to config/attributes | Low | Move the 200+ cinema opt-in list to individual attributes.js files |
| High | Add unit tests for matching logic | Medium | Cover get-movie-data.js, source-utils.js, and combine deduplication |
| Medium | Refactor normalize-title.js into categorized rules | Medium | Split corrections into categories (spelling, formatting, cinema-specific) |
| Medium | Use LLM structured output mode | Low | Leverage Gemini's JSON mode to avoid brittle regex parsing |
| Low | Migrate to TypeScript | High | Add type safety for data shapes flowing through the pipeline |
| Low | Add structured logging | Low | Replace console.log with a proper logging library |
Conclusion
This is a well-crafted project that demonstrates solid engineering fundamentals. The main areas for improvement center around reducing manual maintenance burden (the corrections lists) and adding safety nets (TypeScript, more unit tests) for the complex matching logic.