Codebase Review: ClusterFlick Scripts
Date: February 2026
Reviewer: Senior Software Engineer Assessment
Overview
This is a well-structured data aggregation pipeline for London cinema listings. The codebase scrapes 200+ cinema/venue websites, normalizes the data, matches against TMDB, and produces a unified dataset.
Strengths
1. Clear Architecture & Separation of Concerns
The pipeline follows a logical flow (retrieve → transform → combine → match) with clean separation. Each cinema/source is self-contained with its own attributes.js, retrieve.js, and transform.js. This makes adding new venues straightforward.
2. Excellent Use of Snapshots & HTTP Recording
The testing strategy using PollyJS to record HTTP interactions is solid. It enables deterministic testing of web scrapers without hitting live APIs. The custom ChunkedFsPersister for large HAR files shows thoughtful handling of test artifacts.
3. Robust Title Normalization
The normalize-title.js file contains extensive domain knowledge for matching movie titles. While the approach is brute-force (500+ corrections), it reflects the real-world complexity of cinema data.
4. Multi-Layer Matching Strategy
The TMDB matching logic in get-movie-data.js is sophisticated - it tries director/actor matching, year variations, and falls back to LLM assistance. The layered approach maximizes match rates.
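The layered-fallback approach can be sketched as a strategy chain that tries each matcher in order and degrades gracefully; the strategy functions below are hypothetical stand-ins, not the actual helpers in get-movie-data.js:

```javascript
// Sketch of a layered matcher: try strategies in order, return the
// first hit, and fall through as "unmatched" if all of them miss.
async function matchMovie(title, strategies) {
  for (const strategy of strategies) {
    const result = await strategy(title);
    if (result) return result;
  }
  // Graceful degradation: unmatched movies still flow through.
  return { title, matched: false };
}
```

The order of the array encodes the priority (cheap exact matching first, expensive LLM assistance last), so adding a new layer is a one-line change.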
5. Good Error Handling for External APIs
The withRetry, fetchWithRetry, and runLlmFunction wrappers handle rate limits, transient failures, and model-specific errors gracefully.
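For reference, a minimal retry-with-exponential-backoff wrapper looks roughly like this; it is a sketch of the pattern, not the project's actual withRetry implementation:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a fallible async function with exponential backoff.
// Delays grow as baseDelayMs * 2^attempt between attempts.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError; // surface the final failure to the caller
}
```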
Areas of Concern
1. Technical Debt in normalize-title.js (~785 lines)
File: common/normalize-title.js
This file is a maintainability hazard. The 500+ hardcoded corrections will grow without bound: each newly mismatched title adds another entry.
Recommendations:
- A rules engine with categorized transformations
- LLM-based title normalization as a first pass
- A separate data file for cinema-specific quirks
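A categorized rules engine could look something like the sketch below; the rule categories and example patterns are illustrative, not taken from the codebase:

```javascript
// Declarative, categorized rules instead of one flat list of
// hardcoded corrections. New quirks become data, not code changes.
const titleRules = [
  { category: "formatting", pattern: /\s*\(35mm\)\s*$/i, replace: "" },
  { category: "spelling",   pattern: /\bcolour\b/gi,     replace: "color" },
];

function normalizeTitle(title, rules = titleRules) {
  return rules
    .reduce((t, rule) => t.replace(rule.pattern, rule.replace), title)
    .trim();
}
```

Because each rule carries a category, the rules can live in separate data files per category (or per cinema) and be tested in isolation.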
2. Massive Opt-In List in transform/index.js
File: scripts/transform/index.js (lines 79-310)
The 200+ element optedIn array is manually maintained and will drift.
Recommendations:
- Store opt-in status in each cinema's attributes.js
- Invert to an opt-out list (if most cinemas should be opted in)
- Use a configuration file separate from code
3. Synchronous File I/O in Cache
File: common/cache.js
The cache uses synchronous fs.existsSync, fs.statSync, fs.readFileSync, and fs.writeFileSync. For a data pipeline this likely isn't a bottleneck, but it's inconsistent with the async/await patterns elsewhere.
4. LLM JSON Parsing Fragility
File: common/llm-client.js (lines 51-64)
The manual JSON cleanup regex patterns are brittle. The LLM model (gemini-2.5-flash-lite) should be prompted to return structured JSON, or use Gemini's built-in JSON mode if available.
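With the official @google/generative-ai SDK, JSON mode is a one-line generation-config change; the helper below is a hedged sketch of how llm-client.js could request it, not the project's actual code:

```javascript
// Build a model config that asks Gemini for structured JSON output,
// removing the need for regex cleanup of the response text.
function jsonModelConfig(modelName) {
  return {
    model: modelName,
    generationConfig: { responseMimeType: "application/json" },
  };
}

// Usage (requires an API key; shown for shape only):
// const { GoogleGenerativeAI } = require("@google/generative-ai");
// const model = new GoogleGenerativeAI(key).getGenerativeModel(
//   jsonModelConfig("gemini-2.5-flash-lite")
// );
```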
5. No TypeScript / Type Definitions
The codebase is 100% JavaScript with no JSDoc type annotations. Given the data-heavy nature (movies, performances, venues with specific schemas), TypeScript would catch data shape mismatches at compile time rather than runtime.
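A lower-cost middle ground is JSDoc typedefs checked with `tsc --checkJs`, which adds type safety without a migration; the Performance shape below is a guess at the schema, for illustration only:

```javascript
/**
 * @typedef {Object} Performance
 * @property {string} time  - ISO 8601 start time
 * @property {string} [url] - optional booking link
 */

/**
 * Sort performances chronologically (ISO strings sort lexicographically).
 * @param {Performance[]} performances
 * @returns {Performance[]} a new array sorted by start time
 */
function sortPerformances(performances) {
  return [...performances].sort((a, b) => a.time.localeCompare(b.time));
}
```

With checkJs enabled, passing an object missing `time` to sortPerformances becomes a compile-time error instead of a runtime crash.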
6. Test Coverage Appears Limited
While there are 154 test files, they're primarily snapshot-based integration tests. There's no unit testing visible for:
- get-movie-data.js matching logic
- source-utils.js venue matching
- The combine logic's deduplication algorithm
7. Inconsistent Variable Declarations
File: common/utils.js (line 1)
There's inconsistent use of var vs const. Should be const throughout.
Architectural Suggestions
1. Extract Venue Matching to Dedicated Module
The logic for matching venues across sources (geo distance, postcode fallback, name normalization) is scattered. A dedicated VenueMatcher class would centralize this:
```javascript
class VenueMatcher {
  constructor(knownCinemas, options = {}) { ... }
  findMatch(venueName, coordinates, address) { ... }
}
```
2. Consider a Plugin System for Common Patterns
Several cinema sites share the same booking platform:
- common/curzon.com/
- common/odeon.co.uk/
- common/myvue.com/
- etc.
These could be formalized as "adapters" with configuration rather than separate implementations:
```javascript
// Instead of copying code for each Cineworld location
createCineworldCinema({
  id: 'cineworld-bexleyheath',
  siteId: '8',
  name: 'Cineworld Bexleyheath',
  // ...
});
```
3. Schema Validation Earlier in Pipeline
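One possible shape for that factory, with hypothetical shared-platform helpers stubbed in for illustration (the real adapters would hold the actual scraping logic):

```javascript
// Hypothetical shared-platform helpers, stubbed for illustration:
const retrieveFromPlatform = (platform, siteId) => ({ platform, siteId });
const transformPlatformData = (platform, raw) => raw;

// The adapter idea: per-site config drives shared platform logic,
// so each Cineworld location is data rather than duplicated code.
function createCineworldCinema({ id, siteId, name }) {
  return {
    attributes: { id, name },
    retrieve: () => retrieveFromPlatform("cineworld", siteId),
    transform: (raw) => transformPlatformData("cineworld", raw),
  };
}
```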
The schema validation happens at the end of transform. Consider validating earlier (after individual cinema transforms) to catch issues closer to their source.
4. Structured Logging
Replace console.log with structured logging. The emoji-based output is charming for development but makes log aggregation/parsing difficult in CI or production contexts.
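Even without adopting a logging library, one JSON object per line is enough for aggregators to parse; a minimal sketch of the idea:

```javascript
// Emit one JSON object per line; machine-parseable, grep-friendly.
function formatLogLine(level, message, fields = {}) {
  return JSON.stringify({
    level,
    message,
    time: new Date().toISOString(),
    ...fields,
  });
}

const log = (level, message, fields) =>
  console.log(formatLogLine(level, message, fields));
```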
5. Rate Limiting Centralization
Rate limiting for TMDB, LLM, and other APIs is handled ad-hoc with sleep(). A centralized rate limiter (like bottleneck) would be more robust.
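A minimal centralized limiter (the bottleneck package provides a more robust version of the same idea) might look like the sketch below: calls are serialized through a promise chain with a minimum gap, instead of scattering ad-hoc sleep()s:

```javascript
// Serialize scheduled calls with a minimum interval between starts.
// Note: a rejected call will break the chain in this sketch; a real
// limiter (e.g. bottleneck) handles errors per-call.
function createRateLimiter(minIntervalMs) {
  let queue = Promise.resolve();
  let lastRun = 0;
  return function schedule(fn) {
    queue = queue.then(async () => {
      const wait = lastRun + minIntervalMs - Date.now();
      if (wait > 0) await new Promise((r) => setTimeout(r, wait));
      lastRun = Date.now();
      return fn();
    });
    return queue;
  };
}
```

One limiter instance per upstream API (TMDB, the LLM) then gives a single place to tune limits.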
What's Working Well
- Domain Modeling - The schema is well-designed for cinema data
- Graceful Degradation - Unmatched movies still flow through as "unmatched"
- Historical Data Preservation - The seen timestamp tracking for "new" movies
- Cache Strategy - Daily cache keys prevent stale data while avoiding API overload
- CI Pipeline - Simple but effective GitHub Actions setup
Priority Recommendations
| Priority | Issue | Effort | Description |
|---|---|---|---|
| High | Extract opt-in list from code to config/attributes | Low | Move the 200+ cinema opt-in list to individual attributes.js files |
| High | Add unit tests for matching logic | Medium | Cover get-movie-data.js, source-utils.js, and combine deduplication |
| Medium | Refactor normalize-title.js into categorized rules | Medium | Split corrections into categories (spelling, formatting, cinema-specific) |
| Medium | Use LLM structured output mode | Low | Leverage Gemini's JSON mode to avoid brittle regex parsing |
| Low | Migrate to TypeScript | High | Add type safety for data shapes flowing through the pipeline |
| Low | Add structured logging | Low | Replace console.log with a proper logging library |
Conclusion
This is a well-crafted project that demonstrates solid engineering fundamentals. The main areas for improvement center around reducing manual maintenance burden (the corrections lists) and adding safety nets (TypeScript, more unit tests) for the complex matching logic.