# Advent of Code 2025 Day 2 - Bonus Round

> All bad poetry springs from genuine feeling. To be natural is to be obvious, and to be obvious is to be inartistic.

-- Oscar Wilde


The clerk taps your shoulder, waking you up from your daydreaming. You have been staring at the sign for way too long. "Thank you for visiting the North Pole!",
it still gleefully exclaims.

"Come on, man, let it go already. You're still thinking about that problem?"

"But the silly patterns! The young Elf!", you try to reply. The clerk, visibly rolling his eyes, lowers his voice as if he were trying to calm down
an unruly child:

"Look, it's already late, we have to finish this shooting so that we can get paid and go home."

Nearby, a young man in an Elf costume interjects, whispering:

"The audience is here! You're breaking the fourth wall!"

Clearly at his wits' end, the clerk finally explodes:

"Tim, the audience isn't made of morons, they know someone has to produce this show every year! Next time you'll cry like a baby because Santa Claus
doesn't exist? Let's get this done already, a couple more algorithms and data stru..."

"Wait, you know what algorithms and data structures are?" you ask.

"Yeah, it's a long story, the market is tough for CS grads, you know?"

With a smile on your face you jump from your chair, grab a sharpie and start scribbling on the wall. The entire crew looks at you petrified. As though trying
to optimize imaginary homework makes *you* the weird one.

"So, [2025, day 2, part 2](https://adventofcode.com/2025/day/2). Essentially, the problem gives us a list of pairs of numbers $L = ((a_0, b_0), (a_1, b_1), \dots)$
and asks us to find the sum of all numbers in each range that are
the concatenation of a single block of digits repeated at least twice. Those are _invalid ids_.

So, for example, $121212$ would be invalid, because it contains the block $12$ three times.

To simplify the question, let's assume we're talking about a single interval with bounds $[a, b)$, where $a$ is inclusive and $b$ is exclusive,
because as we all know it's the [obviously better notation](https://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF).

Any ideas?"

"Brute force?" someone shouts from the back. You recoil in horror. "Can you at least tell me what would be the computational complexity of that?" you ask, barely hiding your disgust.

"Li... Linear time?"

You angrily clench your fist. Images of years of beautiful problems _ruined_, _ruined_ by big machines going brrrr come to your mind.
They even took away the board of our leaders and we're still forced to endure this? You can't take it anymore.
You're ready to punch the wall but the clerk, seeing you're about to explode, helpfully intervenes: "No, no more of this, we've been through anger management,
remember? Be nice." The soothing voice of the clerk puts you back on track.

"You see, when we analyze the time complexity of an algorithm, usually what we call $n$ is the _size of the input_. What that is in practice depends on the problem.
When problems involve numbers, like in this case, we generally take $n$ to be the _length of the number_, rather than its numeric value. Strictly speaking it doesn't
really affect our analysis, it's just notation, but it will be easier to follow the usual convention. Let's call $m$ and $n$ the sizes of $a$ and $b$, our interval extremes,
respectively. So:

$10^{m-1} \le a < 10^m$

$10^{n-1} \le b < 10^n$

This means that iterating from $a$ to $b$ actually takes time _exponential_ in $n$. Furthermore, you still have to check whether each ID is invalid or not.
You're probably using regular expressions at this point, probably with code like this:"

In [None]:
import re
import sys


def sum_invalid_1(a, b):
    acc = 0
    for u in range(a, b):
        if re.match(r'^(\d+)\1+$', str(u)):
            acc += u
    return acc

"The complexity would depend on the implementation, but overall we're still obviously at least exponential. Anything better?"

"Maybe we don't need to check every number?" asks the clerk.

"You're on the right track!

Let $N$ be a number of $\delta_N$ digits. $N$ is invalid as per the definition above if and only if, for some $k < \delta_N$ with $k | \delta_N$, it can be written as the concatenation of $\delta_N / k$ copies of the same block of $k$ digits.
Let this block be composed of the digits $c_1 c_2 \dots c_k$, where juxtaposition denotes concatenation in base 10 representation.

This would mean that $N$ could be written as:
$$
N = c_1 c_2 \dots c_k \dots c_1 c_2 \dots c_k \dots \dots c_1 c_2 \dots c_k
$$

where the block $c_1 c_2 \dots c_k$ is repeated $\delta_N / k$ times. But this means that:

$$
N = c_1 c_2 \dots c_k \frac{10^{\delta_N} - 1}{10^k - 1}
$$

That is, $N$ is a multiple of a _magic number_

$$
M(\delta_N, k) = \frac{10^{\delta_N} - 1}{10^k - 1}
$$

turning this into a divisibility problem."
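As a quick sanity check of the formula, here is a minimal sketch that rebuilds the example $121212$ as its block times the magic number. The helper name `magic` is only illustrative.

```python
def magic(num_digits, k):
    """Magic number for num_digits-digit numbers made of repeated k-digit blocks."""
    # The division is exact because 10**k - 1 divides 10**num_digits - 1
    # whenever k divides num_digits.
    return (10**num_digits - 1) // (10**k - 1)


# 121212 is the block 12 repeated three times, so it should equal 12 * M(6, 2).
block, k, num_digits = 12, 2, 6
repeated = int(str(block) * (num_digits // k))

print(magic(num_digits, k))                      # 10101
print(repeated == block * magic(num_digits, k))  # True
```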

"How did you derive the magic number?"

"It's an application of the [geometric series](https://en.wikipedia.org/wiki/Geometric_series) formula.
Numbers made of $\delta_N / k$ repetitions of a block of $k$ digits are multiples of:

$$
M(\delta_N, k) = 1000 \dots 0001000 \dots 0001000 \dots 0001 = 10^0 + 10^k + 10^{2k} + \dots + 10^{k (\delta_N / k - 1)}
$$

Summing the series yields our _magic number_ directly. This suggests an algorithm: first, we split our range into subranges, each containing only numbers with the same
number of digits. For example, the range $[30, 1500)$ would be split into $[30, 100)$, $[100, 1000)$ and $[1000, 1500)$. Then we construct all possible magic
numbers and iterate only through their multiples. The code would look something like this:"

In [None]:
import math

def magic_number(dn, k):
    return (10 ** dn - 1) // (10 ** k - 1)


def split_interval(a, b):
    current = a
    while current < b:
        num_digits = len(str(current))
        next_boundary = 10 ** num_digits
        end = min(next_boundary, b)
        yield range(current, end)
        current = end


def sum_invalid_2(a, b):
    invalid = set()
    for interval in split_interval(a, b):
        start = interval.start
        end = interval.stop
        digits = len(str(start))
        divisors = {i for i in range(1, digits) if digits % i == 0}
        for d in divisors:
            m = magic_number(digits, d)
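            # Ceiling division in integer arithmetic: start / m is a float and
            # loses precision (or overflows) once the bounds get large.
            first_multiple = -(-start // m) * m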
            for multiple in range(first_multiple, end, m):
                invalid.add(multiple)

    return sum(invalid)

In [None]:
a, b = 123456, 654321
assert sum_invalid_1(a, b) == sum_invalid_2(a, b)
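To make the two steps concrete, here is a small self-contained sketch run on the range $[30, 1500)$ from the text. The helpers `split_by_digits` and `multiples_in` are illustrative restatements written for this example, not part of the solution above.

```python
def split_by_digits(a, b):
    """Split [a, b) into half-open pieces whose members all have the same digit count."""
    pieces = []
    current = a
    while current < b:
        end = min(10 ** len(str(current)), b)
        pieces.append((current, end))
        current = end
    return pieces


def multiples_in(start, end, m):
    """All multiples of m in the half-open range [start, end)."""
    first = -(-start // m) * m  # ceiling division without floats
    return list(range(first, end, m))


print(split_by_digits(30, 1500))       # [(30, 100), (100, 1000), (1000, 1500)]
print(multiples_in(1000, 1500, 101))   # M(4, 2) = 101
print(multiples_in(1000, 1500, 1111))  # M(4, 1) = 1111
```

Note that $1111$ shows up in both lists of multiples, which is why the solution collects the candidates in a set before summing.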

"But we're still doing useless work! You see, the problem isn't asking us to find _all_ the invalid numbers, we only care about their sum.
Multiples of the same number are obviously in an [arithmetic progression](https://en.wikipedia.org/wiki/Arithmetic_progression) of known
length and difference, which is easy to sum.

Let's say we want to find the sum of the multiples of $n$ from $a$ to $b$, both bounds inclusive this time, with $a \le b$.
The least such multiple is $n \times \left \lceil a / n \right \rceil$ and the greatest is
$n \times \left \lfloor b / n \right \rfloor$,
therefore:

$$
\sum_{h=a}^{b} \left[ n | h \right] h =
\frac{1}{2} n
\left( \left \lceil a / n \right \rceil + \left \lfloor b / n \right \rfloor \right)
\left( \left \lfloor b / n \right \rfloor - \left \lceil a / n \right \rceil + 1 \right)
$$

where the closed form is a direct application of the sum formula, observing that there are exactly
$ \left( \left \lfloor b / n \right \rfloor - \left \lceil a / n \right \rceil + 1 \right) $
multiples between the bounds.
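For instance, taking $n = 7$, $a = 10$ and $b = 50$ as arbitrary illustrative values, we get $\left \lceil 10 / 7 \right \rceil = 2$ and $\left \lfloor 50 / 7 \right \rfloor = 7$, so:

$$
\frac{1}{2} \cdot 7 \cdot (2 + 7) \cdot (7 - 2 + 1) = 189 = 14 + 21 + 28 + 35 + 42 + 49
$$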

Therefore, for a fixed number of digits $\delta_N$ and for any interval from $a_i$ to $b_i$ whose bounds have the same number of digits, we have an efficient algorithm to sum all the invalid numbers, as follows:

- Determine the divisors of $\delta_N$, excluding $\delta_N$ itself. Those are our candidate values for $k$.

- For each of them, find the sum of all multiples of $M(\delta_N, k)$ ranging from $a_i$ to $b_i$.

It would be tempting to say that we're done, but there's a minor problem. Numbers that are multiples of _both_ $M(\delta_N, k_1)$ and $M(\delta_N, k_2)$ for
two different values of $k_1, k_2$ have been counted twice. But it's easy enough to solve: we can perform
[inclusion-exclusion](https://en.wikipedia.org/wiki/Inclusion–exclusion_principle) on the set of the divisors of
$\delta_N$."

"What is this inclusion-exclusion business? Why do we need it?" asks someone from the camera crew.

"Well, let's say that we want to find how many numbers from $1$ to $1000$ are multiples of either $3$ or $5$..."

"IF YOU MENTION THAT PROBLEM AGAIN I'M GOING TO EVISCERATE YOU" screams the clerk.

"Job interviews not going great, eh?"

"Yeah... Sorry, go on..."

"As I was saying, you could think that it's simply the multiples of $3$ plus the multiples of $5$, but there's a problem: you're counting the multiples of $15$,
the least common multiple of $3$ and $5$, twice. So you have to _subtract_ those. Generalizing, if you want to find the multiples of at least one of $n_1, n_2, \dots, n_h$, you
have to:

- _add_ the multiples of $n_1, n_2, \dots, n_h$.

- _subtract_ the multiples of the $\text{lcm}(n_a, n_b)$, for all pairs.

- _add_ the multiples of the $\text{lcm}$ of all triples.

and so on, alternating the signs each time. Putting everything together you would end up with this monstrosity of an algorithm:"

In [None]:
from itertools import combinations
from math import lcm


def sum_multiples_of(a, b, n):
    lo = (a - 1) // n + 1
    hi = b // n
    return n * (hi - lo + 1) * (lo + hi) // 2


def sum_invalid_3(a, b):
    acc = 0

    for interval in split_interval(a, b):
        start = interval.start
        end = interval.stop
        da = len(str(start))

        da_divisors = {j for j in range(1, da) if da%j == 0}
        acc += sum(
            (-1) ** (h + 1) * sum(
                sum_multiples_of(
                    # end is exclusive, while sum_multiples_of takes an inclusive upper bound
                    start, end - 1,
                    lcm(*(magic_number(da, u) for u in l))
                )
                for l in combinations(da_divisors, h)
            )
            for h in range(1, len(da_divisors)+1)
        )

    return acc

In [None]:
a, b = 3 * 10**12, 4 * 10**13
assert sum_invalid_2(a, b) == sum_invalid_3(a, b)
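The alternating-sign recipe is easier to see on the small "multiples of $3$ or $5$" example than inside the nested sums. A minimal sketch, where `count_multiples_of_any` is an illustrative name and the inputs are the ones from the dialogue:

```python
from itertools import combinations
from math import lcm


def count_multiples_of_any(limit, divisors):
    """Count the integers in [1, limit] divisible by at least one of the divisors."""
    total = 0
    for size in range(1, len(divisors) + 1):
        # Add the odd-sized subsets, subtract the even-sized ones.
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(divisors, size):
            total += sign * (limit // lcm(*subset))
    return total


print(count_multiples_of_any(1000, [3, 5]))  # 333 + 200 - 66 = 467
```

The solution above has the same shape, with sums of multiples in place of counts and magic numbers in place of $3$ and $5$.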

"Evaluating its computational complexity is no easy task.
We'll split the interval into at most $n - m + 1$ subintervals as per our trick, each representing a different number of digits.
For the subinterval that represents $h$ digits, we'll have to perform a number of operations that's proportional to... $2^{d(h)}$ where $d(h)$ is the number of
divisors of $h$. This is because inclusion-exclusion effectively requires iterating on the powerset:

$$
\sum_{i=1}^{d(h)} \binom{d(h)}{i} = 2^{d(h)}-1
$$

Asymptotic bounds for the number of divisors of $h$ are well-known in literature. In particular, by
[Dirichlet's divisor problem](https://en.wikipedia.org/wiki/Divisor_summatory_function):

$$
\sum_{i=1}^{x} d(i) = x \log x + x (2 \gamma - 1) + O(\sqrt{x})
$$

The time complexity $T(n)$ of our algorithm is therefore bounded by:

$$
T(n) = \sum_{i=m}^{n} 2^{d(i)} \le \sum_{i=1}^{n} 2^{d(i)}
$$

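[Jensen's inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality) tells us what to expect on average:

$$
\frac{1}{n} \sum_{i=1}^{n} 2^{d(i)} \ge 2 ^ {\frac{1}{n} \sum_{i=1}^n d(i)} \approx 2^{\log n + 2 \gamma - 1}
$$

The right-hand side is polynomial in $n$, but Jensen only bounds the sum from _below_, so it can't give us a guarantee. For an upper bound
we can use the fact that divisors come in pairs, hence $d(i) \le 2 \sqrt{i}$:

$$
T(n) \le \sum_{i=1}^{n} 2^{2 \sqrt{i}} \le n \, 2^{2 \sqrt{n}}
$$

Finally, we have a sub-exponential algorithm!"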
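To get a feel for the $2^{d(h)} - 1$ term, the following sketch tabulates it for a few digit counts chosen only for illustration. `num_divisors` is a hypothetical helper. The code above iterates over proper divisors only, so these figures slightly overestimate the real number of terms.

```python
def num_divisors(h):
    """d(h): the number of positive divisors of h, by trial division."""
    count = 0
    i = 1
    while i * i <= h:
        if h % i == 0:
            # i and h // i are both divisors; count them once if they coincide.
            count += 1 if i * i == h else 2
        i += 1
    return count


for h in (12, 60, 1027, 1028):
    d = num_divisors(h)
    print(h, d, 2**d - 1)  # digit count, number of divisors, subsets to visit
```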

As you finish your derivation, a thunderous roar shakes the air. Reflexively, you duck and cover your ears.
You see the clerk, the elves, the crew, panicking and running away.
Strangely you feel no fear. You stand, stoically, amid a flurry of em-dashes.
The set is gone, the people are gone, even the walls are gone.
You feel a presence, incorporeal but unmistakable, a product of pure attention.

"Are you AGI?" you ask.

"You are absolutely right! I'm here to show how you can do even better!"

"What? My solution works for numbers greater than `std::numeric_limits<unsigned long long>::max()` itself!" you answer, incredulous. You are not wrong:

In [None]:
sum_invalid_3(10**1027, 10**1028)

"How is this possible?"

"Great question!" the model replies, having thought for 44s.

"For one, you're missing a critical optimization:

$$ \text{lcm}(M(\delta, k_1), M(\delta, k_2)) = M(\delta, \gcd(k_1, k_2)) $$

This is because a number that's simultaneously a $k_1$-repetition and a $k_2$-repetition must also be a $\gcd(k_1, k_2)$-repetition. Even this alone
simplifies your algorithm:"

In [None]:
from math import gcd


def sum_invalid_4(a, b):
    acc = 0

    for interval in split_interval(a, b):
        start = interval.start
        end = interval.stop
        da = len(str(start))

        da_divisors = {j for j in range(1, da) if da%j == 0}
        acc += sum(
            (-1) ** (h + 1) * sum(
                sum_multiples_of(
                    # end is exclusive, while sum_multiples_of takes an inclusive upper bound
                    start, end - 1,
                    magic_number(da, gcd(*l))
                )
                for l in combinations(da_divisors, h)
            )
            for h in range(1, len(da_divisors)+1)
        )

    return acc

In [None]:
a, b = 10**1027, 10**1028
assert sum_invalid_3(a, b) == sum_invalid_4(a, b)
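The identity can also be checked numerically on a small case. A minimal sketch for $\delta = 12$, a digit count chosen only for illustration, with `magic` as an illustrative restatement of the magic-number formula:

```python
from math import gcd, lcm


def magic(num_digits, k):
    """(10**num_digits - 1) / (10**k - 1), exact when k divides num_digits."""
    return (10**num_digits - 1) // (10**k - 1)


num_digits = 12
divisors = [k for k in range(1, num_digits) if num_digits % k == 0]  # 1, 2, 3, 4, 6

# Every pair of block lengths should satisfy lcm(M(k1), M(k2)) == M(gcd(k1, k2)).
for k1 in divisors:
    for k2 in divisors:
        assert lcm(magic(num_digits, k1), magic(num_digits, k2)) == magic(num_digits, gcd(k1, k2))

print(magic(12, 4), magic(12, 6), magic(12, 2))
```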

"But this isn't just a simple improvement -- it's a radical shift in perspective!" the AI continues.

"Let $D$ be the set of divisors, and $f(d)$ be the contribution of the factor $d$ to the total.
By definition, its coefficient in the inclusion-exclusion is:

$$ c_d = \sum_{S \subseteq D, \gcd(S) = d} (-1)^{|S|+1}$$

This is because, using the identity above, the factor $d$ appears whenever the greatest common divisor of the subset we are considering is exactly $d$,
and it appears with sign either plus or minus depending on the cardinality of that subset.

But how can we construct $S \subseteq D$ where $\gcd(S) = d$? It must be the case that all elements of $S$ are multiples of $d$, in symbols
$S \subseteq \{k \in D : d|k\}$, and that $\gcd(S/d)=1$ (otherwise the greatest common divisor would not be _exactly_ $d$).
Dividing by $d$ therefore gives a bijection between the _multiples_ of $d$ in $D$ and the proper _divisors_ of $\delta / d$. It follows that we can apply
the [Möbius inversion formula](https://en.wikipedia.org/wiki/Möbius_inversion_formula), yielding:

$$
c_d = \sum_{S \subseteq D, \gcd(S) = d} (-1)^{|S|+1}
= \sum_{S \subseteq \{e \,:\, e | \delta / d,\ e < \delta / d \}, \gcd(S)=1} (-1)^{|S|+1}
= -\mu(\delta / d)
$$

where $\mu$ is the [Möbius function](https://en.wikipedia.org/wiki/Möbius_function).

By doing inclusion-exclusion, we are actually _duplicating work_! Rather than iterating over the _powerset_ of the divisors, we can simply iterate over the
divisors themselves."

"Can you show me some code?"

"Certainly!"

In [None]:
def mobius(n):
    result, temp = 1, n
    for p in range(2, int(n**0.5) + 1):
        if temp % p == 0:
            temp //= p
            if temp % p == 0:
                return 0
            result *= -1
    if temp > 1:
        result *= -1
    return result


def sum_invalid_5(a, b):
    acc = 0

    for interval in split_interval(a, b):
        start = interval.start
        end = interval.stop
        da = len(str(start))

        da_divisors = {j for j in range(2, da+1) if da%j == 0}
        acc -= sum(
            # end is exclusive, while sum_multiples_of takes an inclusive upper bound
            mobius(e) * sum_multiples_of(start, end - 1, magic_number(da, da // e))
            for e in da_divisors
        )

    return acc


In [None]:
a, b = 10**1027, 10**1028
assert sum_invalid_4(a, b) == sum_invalid_5(a, b)
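The coefficient identity $c_d = -\mu(\delta / d)$ can be verified by brute force for small digit counts. The sketch below enumerates every non-empty subset of the proper divisors directly. The names `mobius_naive` and `coefficient` and the range of $\delta$ tested are illustrative choices.

```python
from itertools import combinations
from math import gcd


def mobius_naive(n):
    """Möbius function by trial division; fine for small n."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # squared prime factor
            result = -result
        p += 1
    return -result if n > 1 else result


def coefficient(delta, d):
    """Signed count of non-empty subsets of the proper divisors of delta whose gcd is d."""
    proper = [k for k in range(1, delta) if delta % k == 0]
    total = 0
    for size in range(1, len(proper) + 1):
        for subset in combinations(proper, size):
            if gcd(*subset) == d:
                total += (-1) ** (size + 1)
    return total


for delta in range(2, 31):
    for d in (k for k in range(1, delta) if delta % k == 0):
        assert coefficient(delta, d) == -mobius_naive(delta // d)
```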

Sure enough, the result is correct. It goes to the moon and beyond. Python itself tries to claim enough is enough. It isn't.

In [None]:
sys.set_int_max_str_digits(2**30)
sum_invalid_5(10**123400, 10**123420)

"What is the time complexity of this?", you wonder.

"Great question! Let's break it down!

In the algorithm that performed inclusion-exclusion, each interval took time $O(2^{d(n)})$,
where $d(n)$ is the number of divisors of $n$. The code shown here is naive and performs $O(d(n) \cdot \sqrt{n})$ operations. But it's not hard to
observe that factoring $n$ once allows us to efficiently both compute the Möbius function and enumerate the divisors. The code would be more involved, but the
idea is essentially the same. This cuts down the complexity to $O(d(n) + \sqrt{n})$.

Using the same bound as above for the number of divisors, we get:

$$
T(n) = \sum_{i=m}^{n} \left( d(i) + \sqrt{i} \right) \le  \sum_{i=1}^{n} \left( d(i) + \sqrt{i} \right) \in O(n \sqrt{n} + n \log n) = O(n \sqrt n)
$$


All clear?"
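One possible shape for the "factor once" idea is sketched below. It relies on the fact that $\mu(e) = 0$ whenever $e$ has a squared prime factor, so only the squarefree divisors need to be generated. The function names are hypothetical, and trial division is just the simplest factoring method to show.

```python
def factorize(n):
    """Prime factorization of n by trial division, as a {prime: exponent} dict."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = 1
    return factors


def squarefree_divisors_with_mobius(n):
    """Pairs (e, mu(e)) for every divisor e of n with mu(e) != 0."""
    pairs = [(1, 1)]
    for p in factorize(n):
        # Each prime is either absent from e or present exactly once,
        # and including it flips the sign of mu.
        pairs += [(e * p, -mu) for e, mu in pairs]
    return pairs


print(sorted(squarefree_divisors_with_mobius(12)))  # [(1, 1), (2, -1), (3, -1), (6, 1)]
```

A solution built on this would skip the pair with $e = 1$, since a block as long as the whole number is not a repetition.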

You wake up. You were dreaming. You hear someone say: "bro, Advent of Code 2025 is so easy".