
Review LLM feedback #208

@alistairjcbrown

Codebase Review: ClusterFlick Scripts

Date: February 2026
Reviewer: Senior Software Engineer Assessment


Overview

This is a well-structured data aggregation pipeline for London cinema listings. The codebase scrapes 200+ cinema/venue websites, normalizes the data, matches against TMDB, and produces a unified dataset.


Strengths

1. Clear Architecture & Separation of Concerns

The pipeline follows a logical flow (retrieve → transform → combine → match) with clean separation. Each cinema/source is self-contained with its own attributes.js, retrieve.js, and transform.js. This makes adding new venues straightforward.

2. Excellent Use of Snapshots & HTTP Recording

The testing strategy using PollyJS to record HTTP interactions is solid. It enables deterministic testing of web scrapers without hitting live APIs. The custom ChunkedFsPersister for large HAR files shows thoughtful handling of test artifacts.

3. Robust Title Normalization

normalize-title.js contains extensive domain knowledge for matching movie titles. While the approach is brute-force (500+ corrections), it reflects the real-world complexity of cinema data.

4. Multi-Layer Matching Strategy

The TMDB matching logic in get-movie-data.js is sophisticated: it tries director/actor matching and year variations, and falls back to LLM assistance. The layered approach maximizes match rates.

5. Good Error Handling for External APIs

The withRetry, fetchWithRetry, and runLlmFunction wrappers handle rate limits, transient failures, and model-specific errors gracefully.


Areas of Concern

1. Technical Debt in normalize-title.js (~785 lines)

File: common/normalize-title.js

This file is a maintainability hazard. The 500+ hardcoded corrections will grow without bound, because each new mismatched title adds another entry.

Recommendations:

  • A rules engine with categorized transformations
  • LLM-based title normalization as a first pass
  • A separate data file for cinema-specific quirks
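The first recommendation can be pictured as a small pipeline in which corrections are grouped by category and applied in order. This is only a minimal sketch: the category names, the rule entries, and the applyRules function are hypothetical and are not taken from normalize-title.js.

```javascript
// Hypothetical categorized rules; none of these entries come from normalize-title.js.
const titleRules = [
  {
    category: "formatting",
    // Collapse runs of whitespace and trim the ends.
    apply: (title) => title.replace(/\s+/g, " ").trim(),
  },
  {
    category: "event-suffix",
    // Strip a trailing parenthesised note such as "(35mm)" or "(Q&A)".
    apply: (title) => title.replace(/\s*\([^)]*\)\s*$/, ""),
  },
  {
    category: "spelling",
    // Exact-match corrections live in data, not in control flow.
    corrections: new Map([["amelie", "amélie"]]),
    apply(title) {
      return this.corrections.get(title) ?? title;
    },
  },
];

// Lowercase the raw title, then run every rule in order.
function applyRules(rawTitle, rules = titleRules) {
  return rules.reduce((title, rule) => rule.apply(title), rawTitle.toLowerCase());
}
```

Keeping each category as its own entry means the exact-match corrections could later move into a separate data file without touching the pattern-based rules.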

2. Massive Opt-In List in transform/index.js

File: scripts/transform/index.js (lines 79-310)

The 200+ element optedIn array is manually maintained and will drift as venues are added or removed.

Recommendations:

  • Store opt-in status in each cinema's attributes.js
  • Invert to an opt-out list (if most cinemas should be opted in)
  • Use a configuration file separate from code
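As a rough illustration of the first two options, the list could be derived from per-cinema attributes rather than maintained by hand. The optedIn flag, the cinemas array, and getOptedInIds are assumptions made for this sketch; the real attributes.js shape may differ.

```javascript
// Hypothetical shape: each cinema's attributes.js exports an `optedIn` flag.
const cinemas = [
  { id: "cinema-a", attributes: { name: "Cinema A", optedIn: true } },
  { id: "cinema-b", attributes: { name: "Cinema B", optedIn: false } },
  { id: "cinema-c", attributes: { name: "Cinema C" } }, // flag missing
];

// Derive the list instead of maintaining it by hand.
// `defaultOptedIn` switches between opt-in (false) and opt-out (true) semantics
// for cinemas that do not set the flag.
function getOptedInIds(allCinemas, { defaultOptedIn = false } = {}) {
  return allCinemas
    .filter(({ attributes }) => attributes.optedIn ?? defaultOptedIn)
    .map(({ id }) => id);
}
```

With defaultOptedIn left at false, only cinema-a is selected; setting it to true also includes cinema-c, which is the inverted opt-out behaviour.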

3. Synchronous File I/O in Cache

File: common/cache.js

The cache uses synchronous fs.existsSync, fs.statSync, fs.readFileSync, and fs.writeFileSync. For a data pipeline this likely isn't a bottleneck, but it's inconsistent with the async/await patterns elsewhere.
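If consistency with the rest of the code matters more than the (likely negligible) performance difference, the same checks can be written with Node's promise-based fs API. A minimal sketch, assuming a one-JSON-file-per-key layout; readCache and writeCache are hypothetical names and the real common/cache.js interface may differ.

```javascript
const fs = require("node:fs/promises");
const path = require("node:path");

// Hypothetical helpers; the real common/cache.js API may differ.
async function readCache(cacheDir, key, maxAgeMs) {
  const file = path.join(cacheDir, `${key}.json`);
  try {
    const { mtimeMs } = await fs.stat(file);
    if (Date.now() - mtimeMs > maxAgeMs) return null; // entry is stale
    return JSON.parse(await fs.readFile(file, "utf8"));
  } catch (error) {
    if (error.code === "ENOENT") return null; // cache miss
    throw error;
  }
}

async function writeCache(cacheDir, key, value) {
  await fs.mkdir(cacheDir, { recursive: true });
  await fs.writeFile(path.join(cacheDir, `${key}.json`), JSON.stringify(value));
}
```

Handling ENOENT from stat replaces the separate existsSync check, which also removes the gap between checking for the file and reading it.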

4. LLM JSON Parsing Fragility

File: common/llm-client.js (lines 51-64)

The manual JSON cleanup regex patterns are brittle. Either prompt the model (gemini-2.5-flash-lite) to return structured JSON, or use Gemini's built-in JSON mode if available.
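For reference, the Gemini generateContent API accepts a responseMimeType of "application/json" in generationConfig, which asks the model to return JSON directly. The sketch below only builds the request body and parses a response of the documented shape; whether llm-client.js calls the REST endpoint or an SDK is not known from this review, and both function names are hypothetical.

```javascript
// Request body for Gemini's generateContent endpoint with JSON output requested.
function buildJsonModeRequest(prompt) {
  return {
    contents: [{ role: "user", parts: [{ text: prompt }] }],
    generationConfig: { responseMimeType: "application/json" },
  };
}

// With JSON mode the text part should already be valid JSON, so no regex cleanup;
// a parse failure is surfaced to the caller instead of being patched over.
function parseJsonResponse(response) {
  const text = response?.candidates?.[0]?.content?.parts?.[0]?.text;
  if (typeof text !== "string") throw new Error("No text part in LLM response");
  return JSON.parse(text);
}
```

Letting JSON.parse throw keeps malformed output visible, so the existing retry wrappers can decide whether to retry rather than silently repairing the string.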

5. No TypeScript / Type Definitions

The codebase is 100% JavaScript with no JSDoc type annotations. Given the data-heavy nature (movies, performances, venues with specific schemas), TypeScript would catch data shape mismatches at compile time rather than runtime.
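A lower-effort step than a full migration, offered here as an alternative rather than something the review itself proposes, is JSDoc typedefs checked by the TypeScript compiler via // @ts-check. The field names below are hypothetical and do not reflect the project's actual schema.

```javascript
// @ts-check
/**
 * Hypothetical shapes; the real schema fields may differ.
 * @typedef {{ time: number, screen?: string, bookingUrl: string }} Performance
 * @typedef {{ title: string, year?: number, performances: Performance[] }} Movie
 */

/**
 * @param {Movie[]} movies
 * @returns {number} total number of performances across all movies
 */
function countPerformances(movies) {
  return movies.reduce((total, movie) => total + movie.performances.length, 0);
}
```

With type checking enabled in the editor or in CI, passing an object without a performances array to countPerformances would be reported before the code runs.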

6. Test Coverage Appears Limited

While there are 154 test files, they're primarily snapshot-based integration tests. There's no unit testing visible for:

  • get-movie-data.js matching logic
  • source-utils.js venue matching
  • The combine logic's deduplication algorithm

7. Inconsistent Variable Declarations

File: common/utils.js (line 1)

There's inconsistent use of var vs const. Prefer const throughout, with let only where a binding is reassigned.


Architectural Suggestions

1. Extract Venue Matching to Dedicated Module

The logic for matching venues across sources (geo distance, postcode fallback, name normalization) is scattered. A dedicated VenueMatcher class would centralize this:

class VenueMatcher {
  constructor(knownCinemas, options = {}) { /* ... */ }
  findMatch(venueName, coordinates, address) { /* ... */ }
}
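One possible way to fill in the outline above, combining the three signals the review mentions. Everything here is an assumption for illustration: the { lat, lon } coordinate shape, the 100 m default threshold, the postcode regex, and the order in which the checks run.

```javascript
// Great-circle distance in metres between two { lat, lon } points (haversine formula).
function distanceInMetres(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371000 * Math.asin(Math.sqrt(h));
}

const normalizeName = (name) => name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

// Outward part of a UK postcode, e.g. "SE1" from "SE1 8XT"; undefined if none found.
const outwardCode = (address = "") =>
  address.match(/\b([A-Z]{1,2}\d[A-Z\d]?)\s*\d[A-Z]{2}\b/i)?.[1].toUpperCase();

class VenueMatcher {
  constructor(knownCinemas, options = {}) {
    this.knownCinemas = knownCinemas;
    this.maxDistance = options.maxDistance ?? 100; // metres, assumed default
  }

  findMatch(venueName, coordinates, address) {
    // 1. Geo distance, when coordinates are available on both sides.
    if (coordinates) {
      const near = this.knownCinemas.find(
        (c) => c.coordinates && distanceInMetres(c.coordinates, coordinates) <= this.maxDistance,
      );
      if (near) return near;
    }
    // 2. Postcode fallback: same outward code and same normalized name.
    const name = normalizeName(venueName);
    const code = outwardCode(address);
    return (
      this.knownCinemas.find(
        (c) => code && outwardCode(c.address) === code && normalizeName(c.name) === name,
      ) ?? null
    );
  }
}
```

The geo step returns the first cinema within range regardless of name, so a real implementation would probably want to rank candidates when two venues sit close together.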

2. Consider a Plugin System for Common Patterns

Several cinema sites share the same booking platform:

  • common/curzon.com/
  • common/odeon.co.uk/
  • common/myvue.com/
  • etc.

These could be formalized as "adapters" with configuration rather than separate implementations:

// Instead of copying code for each Cineworld location
createCineworldCinema({
  id: 'cineworld-bexleyheath',
  siteId: '8',
  name: 'Cineworld Bexleyheath',
  // ...
});
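A factory like the one called above might look roughly as follows. This is a hypothetical sketch: the returned shape mirrors the attributes/retrieve/transform split described earlier, while the fetchListings parameter and the venueId field are invented for illustration.

```javascript
// Hypothetical factory; the retrieve/transform bodies stand in for shared platform code.
function createCineworldCinema({ id, siteId, name }) {
  const attributes = { id, name };
  return {
    attributes,
    // Shared retrieval logic, parameterized only by the platform's site ID.
    retrieve: async (fetchListings) => fetchListings(siteId),
    // Shared transform logic: tag every listing with this venue's ID.
    transform: (listings) => listings.map((listing) => ({ ...listing, venueId: id })),
  };
}

const bexleyheath = createCineworldCinema({
  id: "cineworld-bexleyheath",
  siteId: "8",
  name: "Cineworld Bexleyheath",
});
```

Each location then reduces to a small configuration object, and a fix to the platform's parsing logic lands in one place.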

3. Schema Validation Earlier in Pipeline

The schema validation happens at the end of transform. Consider validating earlier (after individual cinema transforms) to catch issues closer to their source.
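A sketch of what per-cinema validation could look like. validateMovie is a deliberately tiny placeholder for the project's real schema validator, and the checked fields are assumptions.

```javascript
// Hypothetical minimal check; the project's real schema validator would replace this.
function validateMovie(movie) {
  const errors = [];
  if (typeof movie.title !== "string" || movie.title === "") {
    errors.push("title must be a non-empty string");
  }
  if (!Array.isArray(movie.performances)) {
    errors.push("performances must be an array");
  }
  return errors;
}

// Validate right after one cinema's transform so failures name the offending source.
function validateCinemaOutput(cinemaId, movies) {
  const problems = movies.flatMap((movie, index) =>
    validateMovie(movie).map((error) => `${cinemaId}[${index}]: ${error}`),
  );
  if (problems.length > 0) throw new Error(problems.join("\n"));
  return movies;
}
```

Because the error message carries the cinema ID and the index of the failing movie, a broken scraper can be identified without bisecting the combined output.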

4. Structured Logging

Replace console.log with structured logging. The emoji-based output is charming for development but makes log aggregation/parsing difficult in CI or production contexts.
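As a dependency-free illustration of the idea (an established logging library would be the more likely choice in practice), a logger can emit one JSON object per line. createLogger and its field names are hypothetical.

```javascript
// Emits one JSON object per line, which log tooling can parse without regexes.
// `write` is injectable so output can be redirected or captured in tests.
function createLogger(context = {}, write = (line) => process.stdout.write(line + "\n")) {
  const log = (level) => (message, fields = {}) =>
    write(JSON.stringify({ time: new Date().toISOString(), level, message, ...context, ...fields }));
  return { info: log("info"), warn: log("warn"), error: log("error") };
}

const logger = createLogger({ script: "transform" });
logger.info("cinema transformed", { cinemaId: "example-cinema", movies: 42 });
```

A human-friendly formatter, emoji included, can still be layered on top for local runs by swapping the write function.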

5. Rate Limiting Centralization

Rate limiting for TMDB, LLM, and other APIs is handled ad-hoc with sleep(). A centralized rate limiter (like bottleneck) would be more robust.
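To make the idea concrete without pulling in a dependency, here is a minimal limiter that serializes tasks and enforces a minimum gap between starts. It is a simplified sketch, not bottleneck's API, and the limiter names and intervals are illustrative rather than taken from the codebase.

```javascript
// Minimal limiter: runs tasks one at a time, with starts at least `minTimeMs` apart.
function createRateLimiter(minTimeMs) {
  let queue = Promise.resolve();
  let lastStart = 0;
  return function schedule(task) {
    const run = queue.then(async () => {
      const wait = lastStart + minTimeMs - Date.now();
      if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
      lastStart = Date.now();
      return task();
    });
    queue = run.catch(() => {}); // a failed task must not block later ones
    return run;
  };
}

// One shared limiter per API (names and intervals here are illustrative).
const limiters = { tmdb: createRateLimiter(50), llm: createRateLimiter(200) };
```

Callers would wrap each request, for example limiters.tmdb(() => fetchSomething()), so the pacing policy lives in one module instead of in scattered sleep() calls.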


What's Working Well

  1. Domain Modeling - The schema is well-designed for cinema data
  2. Graceful Degradation - Unmatched movies still flow through as "unmatched"
  3. Historical Data Preservation - The seen timestamp tracking for "new" movies
  4. Cache Strategy - Daily cache keys prevent stale data while avoiding API overload
  5. CI Pipeline - Simple but effective GitHub Actions setup

Priority Recommendations

Priority | Issue | Effort | Description
High | Extract opt-in list from code to config/attributes | Low | Move the 200+ cinema opt-in list to individual attributes.js files
High | Add unit tests for matching logic | Medium | Cover get-movie-data.js, source-utils.js, and combine deduplication
Medium | Refactor normalize-title.js into categorized rules | Medium | Split corrections into categories (spelling, formatting, cinema-specific)
Medium | Use LLM structured output mode | Low | Leverage Gemini's JSON mode to avoid brittle regex parsing
Low | Migrate to TypeScript | High | Add type safety for data shapes flowing through the pipeline
Low | Add structured logging | Low | Replace console.log with a proper logging library

Conclusion

This is a well-crafted project that demonstrates solid engineering fundamentals. The main areas for improvement center around reducing manual maintenance burden (the corrections lists) and adding safety nets (TypeScript, more unit tests) for the complex matching logic.
