Skip to content

Semiprime Factorization

Stephen Dunn edited this page Mar 17, 2015 · 20 revisions

Introduction

Rapid semiprime factorization is an open problem that sits neatly between mathematics and computer science. It is an interesting problem for a number of reasons, including its simplicity to understand and its importance in both cryptography and number theory. First we will discuss the problem itself, and then present some novel results that using heuristic search techniques.

A semiprime number is the product of any two prime numbers. A prime number is any whole number greater than 1 (natural number) that is evenly divisible only by itself and 1. For example, 21 is not a prime number because it is evenly divisible by 7 and 3. However, 21 is a semiprime number because both 7 and 3 are prime, and 21 is therefore the product of two primes. 41 is another example of a prime number because only 41 * 1 = 41. There are infinitely many prime numbers, as shown by the prime number theorem, and thus there are countably infinite corresponding semiprime numbers that can be constructed from any pair of primes, including the same prime used twice.

For the remainder of this discussion, we will only be considering a subset of the semiprime numbers. That is, we are only considering semiprimes whose least significant digit is: 1, 3, 7, or 9. Why? Because the remaining semiprimes are either even (and therefore trivially divisible by 2), or end in 5 (and likewise are multiples of 5). These numbers don't interest us because their factors are immediately apparent, and consequently they aren't used in security applications. You may be thinking this is just an artifact of the base 10 system and you'd be correct, but be quick to note that an analogous subset principle holds in whatever base you choose.

Problem Statement

Given a semiprime s, how quickly can you find its prime divisors p and q?

By definition, there must exist some prime numbers p and q such that p * q = s. We want to find either p and/or q quickly. Often these values represent the secret keys in security applications, and the inherent difficulty in rapidly factoring s is the basis of many modern cryptographic algorithms. Typically s is chosen to be a very large number, computed as the product of two near-equal length prime numbers. The hope is thatñdue to the problem's intractabilityñunwanted third parties cannot deduce the original factors quickly enough to break the encryption of a communication.

Heuristic Search Approach

This page is not intended as an introduction to heuristic search or cryptography. However, you can read my other medium-length wiki page that covers the basics of A*, a fundamental heuristic search algorithm used in the remainder of this research, and you may find it useful to review that page first before proceeding.

The idea is to use heuristic search methods to perform semiprime factorization. We use standard A* with some (hopefully!) informed heuristics to guide our search for factors in a bottom-up fashion. This is accomplished by finding partial factors and continually adding digits to each putative factor until either a solution is found or the candidate factors become untenable. Let's look at an example.

Suppose we wish to factor the semiprime 143, which is produced by multiplying the two (twin) prime factors 11 and 13. We start by building a "factor tree" from least-significant digit to most, by considering only valid possible candidates, as illustrated here:

143 Tree

The boxed branches are the path to the correct solution. The root node shows the target number 143 for readability, but it should technically be the empty set, since at the start no factors have been generated yet.

At the deepest depth shown (depth = 2) each pair of partial candidates, when multiplied together, produce a value of the form ...43. The other digits are not yet considered and more importantly are not yet fixed, so (naively) we can't say anything about their correctness. For example, take the node showing ...21 * ...83 which produces 1743. Of course, we can immediately filter a node containing a value that is too large (1743 > 143), so in practice we wouldn't bother generating many of the children shown. However, in this simple introductory example we will not concern ourselves with many optimizations. The point here is to realize that as the semiprime value grows arbitrarily large, we will wish to know exactly how many digits have been fixed, or are no longer subject to change by the addition of further digits. For example, later expanding the node to ...921 * ...883 has no effect on the lowest two digits of the resulting product, which are forever fixed as ...43 (and for completeness ...921 * ...883 = ...813243).

TL;DR, to summarize the preceding paragraph as a rule:

The least significant n digits have been fixed at the nth depth in the search tree.

You probably noticed how for even a simple (i.e. small) number like 143, this tree grows quite rapidly. Unless we have strong heuristics to guide the search, we could spend ample time exploring the wrong nodes.

The previous example is shown in familiar base 10 for clarity. The underlying operating system is naturally storing numbers in binary, so let's take a look at how the exact same example looks in its native representation:

143 Tree Binary

Notice that the branching factor has decreased to a maximum of 4 (2^2) from 100 (10^2)! However, the appearance an advantage is illusory, as the the depth has increased to compensate.

Constructing Useful Heuristics

Let's recall our goal is to somehow choose the correct possible-factors-branch to follow, given that we are provided with a semiprime number. Since the distribution of digits within any of the potential prime factors is random, it follows that the distribution within the product semiprime is also random. However, there is some information that we can leverage to our advantage:

  1. Semiprime numbers used in security applications are approximately the same bit-length.
  2. The distribution of digits in prime numbers obeys a normal distribution. That is, the distribution of 0s and 1s tends to be 50/50.
  3. The distribution of digits in semiprime numbers obeys a normal distribution.

We can use these facts to construct multiple heuristics and, as we will see, combine them to produce a heuristic function superior to any of the individual heuristics. Our heuristic guides are therefore constructed according to the following basic principles:

  • Favor branches with approximately equal-length factors.
  • Favor branches that maintain the expected binary distribution of factors.
  • Favor branches whose current partial product has the smallest Hamming distance from the goal

The final metric listed simply measures the number of differences between the currently produced product and the goal. For example, if the goal is 10101001 (169 = 13 * 13) and the current product is 1101001, then the hamming distance = 2. This follows from the difference in length and value in the most significant digit place.

In the source code provided on this project's github, the three principles listed above are implemented as heuristics numbered 1, 2, and 3.

Heuristic 1

Takes the sum of all 0s and 1s across the current candidate factors. Measures the difference between 50% ratio and that sum as 0s/(0s+1s). We use 0s in the numerator because all semiprime numbers terminate in a 1 (because they are odd), and they begin with a 1 (because any leading 0s would be discarded).

Heuristic 2

First calculate the proportion of 0s and 1s in each prime factor separately, then take their sum and calculate the difference from the expected value.

Heuristic 3

Calculate the Hamming distance to the goal.

Experimental Results

The best results on average, across several hundred million semiprimes, are produced by using a combination of all three heuristics, as shown below for the first 17,051 non-trivial semiprimes:

17051 Results

The points are labeled by their heuristic, with 0 indicating no heuristic (i.e. breadth-first search). As we can see, heuristic 2 appears to be a dud, while the best performance is obtained through a combination of heuristics 1 and 3 combined.