Semiprime Factorization

Stephen Dunn edited this page Sep 2, 2016 · 20 revisions

If you're not really a "math person", there is another version of this page available.

Introduction

Rapid semiprime factorization is an open problem that sits neatly between mathematics and computer science. It is an interesting problem for a number of reasons, including its simplicity to understand and its importance in both cryptography and number theory. First we will discuss the problem itself, and then present some novel results that use heuristic search techniques.

A semiprime number is the product of any two prime numbers. A prime number is any whole number greater than 1 (natural number) that is evenly divisible only by itself and 1. For example, 21 is not a prime number because it is evenly divisible by 7 and 3. However, 21 is a semiprime number because both 7 and 3 are prime, and 21 is therefore the product of two primes. 41 is another example of a prime number because only 41 * 1 = 41. There are infinitely many prime numbers, as shown by the prime number theorem, and thus there are countably infinite corresponding semiprime numbers that can be constructed from any pair of primes, including the same prime used twice.

For the remainder of this discussion, we will only be considering a subset of the semiprime numbers. That is, we are only considering semiprimes whose least significant digit is: 1, 3, 7, or 9. Why? Because the remaining semiprimes are either even (and therefore trivially divisible by 2), or end in 5 (and likewise are multiples of 5). These numbers don't interest us because their factors are immediately apparent, and consequently they aren't used in security applications. You may be thinking this is just an artifact of the base 10 system and you'd be correct, but be quick to note that an analogous subset principle holds in whatever base you choose.

Problem Statement

Given a semiprime s, how quickly can you find its prime divisors p and q?

By definition, there must exist some prime numbers p and q such that p * q = s. We want to find either p and/or q quickly. Often these values represent the secret keys in security applications, and the inherent difficulty in rapidly factoring s is the basis of many modern cryptographic algorithms. Typically s is chosen to be a very large number, computed as the product of two near-equal length prime numbers. The hope is that—due to the problem's intractability—unwanted third parties cannot deduce the original factors quickly enough to break the encryption of a communication.

Heuristic Search Approach

This page is not intended as an introduction to heuristic search or cryptography. However, you can read my other medium-length wiki page that covers the basics of A*, a fundamental heuristic search algorithm used in the remainder of this research, and you may find it useful to review that page first before proceeding.

The idea is to use heuristic search methods to perform semiprime factorization. We use standard A* with some (hopefully!) informed heuristics to guide our search for factors in a bottom-up fashion. This is accomplished by finding partial factors and continually adding digits to each putative factor until either a solution is found or the candidate factors become untenable. Let's look at an example.

Suppose we wish to factor the semiprime 143, which is produced by multiplying the two (twin) prime factors 11 and 13. We start by building a "factor tree" from least-significant digit to most, by considering only valid possible candidates, as illustrated here:

143 Tree

The boxed branches are the path to the correct solution. The root node shows the target number 143 for readability, but it should technically be the empty set, since at the start no factors have been generated yet.

At the deepest depth shown (depth = 2) each pair of partial candidates, when multiplied together, produce a value of the form ...43. The other digits are not yet considered and more importantly are not yet fixed, so (naively) we can't say anything about their correctness. For example, take the node showing ...21 * ...83 which produces 1743. Of course, we can immediately filter a node containing a value that is too large (1743 > 143), so in practice we wouldn't bother generating many of the children shown. However, in this simple introductory example we will not concern ourselves with many optimizations. The point here is to realize that as the semiprime value grows arbitrarily large, we will wish to know exactly how many digits have been fixed, or are no longer subject to change by the addition of further digits. For example, later expanding the node to ...921 * ...883 has no effect on the lowest two digits of the resulting product, which are forever fixed as ...43 (and for completeness ...921 * ...883 = ...813243).

TL;DR, to summarize the preceding paragraph as a rule:

The least significant n digits have been fixed at the nth depth in the search tree.

You probably noticed how for even a simple (i.e. small) number like 143, this tree grows quite rapidly. Unless we have strong heuristics to guide the search, we could spend ample time exploring the wrong nodes.

The previous example is shown in familiar base 10 for clarity. The underlying operating system is naturally storing numbers in binary, so let's take a look at how the exact same example looks in its native representation:

143 Tree Binary

Notice that the branching factor has decreased to a maximum of 4 (2^2) from 100 (10^2)! Even more, the true branching factor is only 2, because in practice only 2 branches can possibly produce the desired value. However, the appearance of an advantage is illusory, as the the depth has increased to compensate.

Constructing Useful Heuristics

Let's recall our goal is to somehow choose the correct possible-factors-branch to follow, given that we are provided with a semiprime number. Since the distribution of digits within any of the potential prime factors is random, it follows that the distribution within the product semiprime is also random. However, there is some information that we can leverage to our advantage:

  1. Semiprime numbers used in security applications are approximately the same bit-length.
  2. The distribution of digits in prime numbers obeys a uniform distribution. That is, the distribution of 0s and 1s tends to be 50/50.
  3. The distribution of digits in semiprime numbers obeys a normal distribution.

We can use these facts to construct multiple heuristics and, as we will see, combine them to produce a heuristic function superior to any of the individual heuristics. Our heuristic guides are therefore constructed according to the following basic principles:

  • Favor branches with approximately equal-length factors.
  • Favor branches that maintain the expected binary distribution of factors.
  • Favor branches whose current partial product has the smallest Hamming distance from the goal.

The final metric listed simply measures the number of differences between the currently produced product and the goal. For example, if the goal is 10101001 (169 = 13 * 13) and the current product is 1101001, then the hamming distance = 2. This follows from the difference in length and value in the most significant digit place.

In the source code provided on this project's github, the three principles listed above are implemented as heuristics numbered 1, 2, and 3.

Heuristic 1

Takes the sum of all 0s and 1s across the current candidate factors. Measures the difference between 50% ratio and that sum as 0s/(0s+1s). We use 0s in the numerator because all semiprime numbers terminate in a 1 (because they are odd), and they begin with a 1 (because any leading 0s would be discarded).

Heuristic 2

First calculate the proportion of 0s and 1s in each prime factor separately, then take their sum and calculate the difference from the expected value.

Heuristic 3

Calculate the Hamming distance to the goal. Products that are shorter than the target are not first padded with leading 0s to the appropriate length.

Heuristic 4

Same as 3, except bias toward breadth-first behavior at the start.

Experimental Results

Results below shown for the first 17,051 non-trivial semiprimes using only the first 3 heuristics:

17051 Results

The points are labeled by their heuristic, with 0 indicating no heuristic (i.e. breadth-first search). As we can see, heuristic 2 appears to be a dud, while the best performance was obtained through heuristic 3.

Some deficiencies in the current implementation to consider:

  • The first three heuristics do not take into account the length of the partial factors, meaning that semi-depth-first behavior becomes apparent with longer semiprimes. A better heuristic would take this into account by initially expanding numerous starting branches outward until some ideal depth is reached.
  • The heuristics do not presently take into account the likelihood of particular factor branches based upon the normal distributions of the factors themselves. For example, suppose we are attempting to factor a large semiprime that consists almost entirely of 1s. What is the probability that both factors also consist entirely of 1s? According to the experimentally observed distributions of factors and their corresponding products, the probability of this is very slim.

If we now add our 4th heuristic to the mix, we observe a dramatic improvement:

17051 Results

Heuristic 4 is by far superior thus far. Currently I am experimenting with improving it further and will post results for many more semiprimes soon.

Neat Plots

What happens if we calculate the proportion of 0s / 1s in 2 prime factors, and then compare it to the resulting semiprime's distribution? Here's a plot for the first 3 million semiprimes:

3 Million Semiprimes

Neat, right? Now what happens if we pad the prime numbers with 0s to match the length of their output semiprime? Here's a plot of the first 1.1 million:

1.1 Million Padded Semiprimes

You probably noticed some distinct gaps in those plots. Remember that since we are only looking at particular starting points (primes), there are only so many discrete values possible when we take fractions. So gaps are to be expected. However, there does appear to be a more interesting pattern happening here.

If we take the original plot of primes to semiprimes, but instead look at it in 3 dimensions (axis = p, q, and p*q), then we see something more regular:

1.1 Million Semiprimes in 3D

These pictures struck me as reminiscent of a familiar Linux screen saver that renders triangular fractals, which makes me wonder about some deep relationship between the two:

Linux Screensaver

Okay, I'm ready to help!

Awesome! Thanks for reading this and your willingness to help. In return, I will include your name and organization (if you have one) in the final publication of my work in whatever academic journal receives it, whether or not you actually crack a number. If one of your devices actually is responsible for finding the factors, I will list you first in the "special thanks" section. You have a pretty good chance of cracking a number, seeing as not many people are likely to help me! I will also post a link to the final publication here (probably in an AI or math journal, but it isn't decided yet.)

All you need to do is download and run the semiprime factorization client! If you don't trust me, build it from source.

If you're already running the client, thanks again!