# EuroMillions number analysis
Read in the historical draw data on which to base our new numbers, and perform some data exploration.

In [1]:
import pandas as pd

df = pd.read_excel("lottery-numbers-tracker.xlsx", sheet_name="EuroMillions", header=0)

df.head()


Unnamed: 0,DrawDate,Ball 1,Ball 2,Ball 3,Ball 4,Ball 5,Lucky Star 1,Lucky Star 2
0,2023-09-26 00:00:00,2,6,14,19,23,5,7
1,2023-09-22 00:00:00,3,23,24,34,35,5,8
2,2023-09-19 00:00:00,10,15,31,41,42,2,5
3,2023-09-15 00:00:00,12,14,21,45,48,8,11
4,2023-09-12 00:00:00,5,14,36,40,42,2,11


Assuming there is no bias, each of the 50 main balls should account for 1/50 of all main balls drawn, and each of the 12 Lucky Stars for 1/12 of all Lucky Stars drawn:

In [2]:
probabilities = {i+1: 1/50 for i in range(50)}
print(probabilities)
probabilities_ls = {i+1: 1/12 for i in range(12)}
print(probabilities_ls)

{1: 0.02, 2: 0.02, 3: 0.02, 4: 0.02, 5: 0.02, 6: 0.02, 7: 0.02, 8: 0.02, 9: 0.02, 10: 0.02, 11: 0.02, 12: 0.02, 13: 0.02, 14: 0.02, 15: 0.02, 16: 0.02, 17: 0.02, 18: 0.02, 19: 0.02, 20: 0.02, 21: 0.02, 22: 0.02, 23: 0.02, 24: 0.02, 25: 0.02, 26: 0.02, 27: 0.02, 28: 0.02, 29: 0.02, 30: 0.02, 31: 0.02, 32: 0.02, 33: 0.02, 34: 0.02, 35: 0.02, 36: 0.02, 37: 0.02, 38: 0.02, 39: 0.02, 40: 0.02, 41: 0.02, 42: 0.02, 43: 0.02, 44: 0.02, 45: 0.02, 46: 0.02, 47: 0.02, 48: 0.02, 49: 0.02, 50: 0.02}
{1: 0.08333333333333333, 2: 0.08333333333333333, 3: 0.08333333333333333, 4: 0.08333333333333333, 5: 0.08333333333333333, 6: 0.08333333333333333, 7: 0.08333333333333333, 8: 0.08333333333333333, 9: 0.08333333333333333, 10: 0.08333333333333333, 11: 0.08333333333333333, 12: 0.08333333333333333}


Now we count the occurrences of each ball using NumPy's `unique` function:

In [3]:
import numpy as np
n = df.shape[0] * 5
occurrences = np.concatenate([df["Ball 1"], df["Ball 2"], df["Ball 3"], df["Ball 4"], df["Ball 5"]])

unique, counts = np.unique(occurrences, return_counts=True)
print(unique, len(unique))
running_counts = dict(zip(unique, counts))
print(running_counts)


[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50] 50
{1: 10, 2: 14, 3: 19, 4: 9, 5: 14, 6: 18, 7: 16, 8: 14, 9: 15, 10: 20, 11: 18, 12: 18, 13: 19, 14: 16, 15: 14, 16: 17, 17: 21, 18: 15, 19: 20, 20: 14, 21: 25, 22: 9, 23: 22, 24: 19, 25: 22, 26: 17, 27: 19, 28: 16, 29: 18, 30: 9, 31: 18, 32: 17, 33: 17, 34: 27, 35: 27, 36: 17, 37: 17, 38: 16, 39: 11, 40: 16, 41: 12, 42: 19, 43: 15, 44: 20, 45: 19, 46: 12, 47: 18, 48: 24, 49: 12, 50: 14}
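The counting step above relies on every number having been drawn at least once; a number that never appeared would be missing from `running_counts`. A minimal alternative sketch, assuming all values lie between 1 and the pool size, uses `np.bincount` so that missing numbers get an explicit zero. The helper name `count_occurrences` and the `sample` list (the five draws shown by `df.head()`) are illustrative, not part of the original notebook.

```python
import numpy as np

def count_occurrences(values, pool_size):
    """Return {number: count} for every number 1..pool_size, zeros included.

    Assumes all values are integers between 1 and pool_size.
    """
    flat = np.asarray(values).ravel()
    # Index 0 is unused because ball numbers start at 1
    counts = np.bincount(flat, minlength=pool_size + 1)[1:]
    return {number: int(count) for number, count in enumerate(counts, start=1)}

# The five draws shown by df.head() above
sample = [
    [2, 6, 14, 19, 23],
    [3, 23, 24, 34, 35],
    [10, 15, 31, 41, 42],
    [12, 14, 21, 45, 48],
    [5, 14, 36, 40, 42],
]
sample_counts = count_occurrences(sample, pool_size=50)
print(sample_counts[14], sample_counts[1])
```

With the full data this could be called as `count_occurrences(df[["Ball 1", "Ball 2", "Ball 3", "Ball 4", "Ball 5"]].to_numpy(), 50)`.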


The same counts for the Lucky Star balls:

In [4]:
n_ls = df.shape[0] * 2
occurrences_ls = np.concatenate([df["Lucky Star 1"], df["Lucky Star 2"]])

unique_ls, counts_ls = np.unique(occurrences_ls, return_counts=True)
print(unique_ls, len(unique_ls))
running_counts_ls = dict(zip(unique_ls, counts_ls))
print(running_counts_ls)


[ 1  2  3  4  5  6  7  8  9 10 11 12] 12
{1: 21, 2: 30, 3: 39, 4: 18, 5: 25, 6: 31, 7: 27, 8: 26, 9: 22, 10: 35, 11: 40, 12: 24}


Now we calculate the observed frequency of each number and how it deviates from the unbiased probabilities calculated previously. It turns out the standard deviation is around a quarter of the mean. Dividing it by the square root of the number of draws, in the spirit of the central limit theorem, gives an estimate of the s.d. of the true distribution:

In [5]:
import math
running_probabilities = {i+1: (running_counts[i + 1] / n) for i in range(50)}
print(running_probabilities)
ave = sum([running_probabilities[i + 1] for i in range(50)]) / 50.0
sd = math.sqrt(sum([(ave - running_probabilities[i + 1])**2 for i in range(50)]) / 50.0)
print(ave, sd)
print("s.d. of the true distribution", sd / math.sqrt(len(df)))

{1: 0.011834319526627219, 2: 0.016568047337278107, 3: 0.022485207100591716, 4: 0.010650887573964497, 5: 0.016568047337278107, 6: 0.021301775147928994, 7: 0.01893491124260355, 8: 0.016568047337278107, 9: 0.01775147928994083, 10: 0.023668639053254437, 11: 0.021301775147928994, 12: 0.021301775147928994, 13: 0.022485207100591716, 14: 0.01893491124260355, 15: 0.016568047337278107, 16: 0.020118343195266272, 17: 0.02485207100591716, 18: 0.01775147928994083, 19: 0.023668639053254437, 20: 0.016568047337278107, 21: 0.029585798816568046, 22: 0.010650887573964497, 23: 0.02603550295857988, 24: 0.022485207100591716, 25: 0.02603550295857988, 26: 0.020118343195266272, 27: 0.022485207100591716, 28: 0.01893491124260355, 29: 0.021301775147928994, 30: 0.010650887573964497, 31: 0.021301775147928994, 32: 0.020118343195266272, 33: 0.020118343195266272, 34: 0.03195266272189349, 35: 0.03195266272189349, 36: 0.020118343195266272, 37: 0.020118343195266272, 38: 0.01893491124260355, 39: 0.01301775147928994, 40: 0.

This suggests there is an inherent bias in the sample of around 4 in 10,000, which would be consistent with the probability of a win being a few orders of magnitude larger than a purely theoretical evaluation predicts.
The same calculation for the Lucky Star balls suggests a much larger bias:

In [6]:
running_probabilities_ls = {i+1: (running_counts_ls[i + 1] / n_ls) for i in range(12)}
print(running_probabilities_ls)
ave_ls = sum([running_probabilities_ls[i + 1] for i in range(12)]) / 12.0
sd_ls = math.sqrt(sum([(ave_ls - running_probabilities_ls[i + 1])**2 for i in range(12)]) / 12.0)
print(ave_ls, sd_ls)
print("s.d. of the true distribution", sd_ls / math.sqrt(len(df)))


{1: 0.0621301775147929, 2: 0.08875739644970414, 3: 0.11538461538461539, 4: 0.05325443786982249, 5: 0.07396449704142012, 6: 0.09171597633136094, 7: 0.07988165680473373, 8: 0.07692307692307693, 9: 0.0650887573964497, 10: 0.10355029585798817, 11: 0.11834319526627218, 12: 0.07100591715976332}
0.08333333333333333 0.019877361312373148
s.d. of the true distribution 0.001529027793259473
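For reference, the spread computed above can be set against the spread that sampling noise alone would produce in an unbiased draw. The sketch below is an assumption-laden estimate, not part of the original analysis: it treats the count of a given number over all draws as binomial (the number either appears in a draw or it does not), and the helper name `fair_frequency_sd` is hypothetical. The figure of 169 draws is inferred from the counts above, which sum to 845 main balls and 338 Lucky Stars.

```python
import math

def fair_frequency_sd(draws, picked, pool):
    """Spread of per-number frequencies expected from an unbiased draw.

    draws:  number of draws in the sample
    picked: numbers picked per draw (5 main balls, 2 Lucky Stars)
    pool:   size of the pool (50 main balls, 12 Lucky Stars)
    """
    p = picked / pool                          # chance a given number appears in one draw
    count_sd = math.sqrt(draws * p * (1 - p))  # binomial s.d. of that number's count
    return count_sd / (draws * picked)         # same scale as running_probabilities

print(fair_frequency_sd(169, 5, 50))   # main balls
print(fair_frequency_sd(169, 2, 12))   # Lucky Stars
```

For 169 draws this gives roughly 0.0046 for the main balls and 0.0143 for the Lucky Stars, which can be compared directly with `sd` and `sd_ls`.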


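Now we evaluate how far apart the balls in each draw are, compared with the evenly spaced points of a perfectly unbiased distribution. For every draw we take the six gaps 0 to Ball 1, Ball 1 to Ball 2, and so on up to Ball 5 to 50. `ave_dv` below is the mean gap, which is what the separation would be if the balls were evenly spaced (50/6 ≈ 8.33, since the six gaps always sum to 50). `sd_dv` is how far the gaps deviate from that ideal: the deviations are squared, averaged and square-rooted so that deviations in either direction count alike. It turns out that this deviation (6.41) is around a quarter smaller than the mean gap (8.33), which we read as the numbers tending to be less spread out from each other than in the non-biased case.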

In [7]:
dv = 0
ddv = 0
for i in range(int(n / 5)):
    dv += sum([
        df["Ball 1"][i],
        df["Ball 2"][i] - df["Ball 1"][i],
        df["Ball 3"][i] - df["Ball 2"][i],
        df["Ball 4"][i] - df["Ball 3"][i],
        df["Ball 5"][i] - df["Ball 4"][i],
        50 - df["Ball 5"][i]
    ])
ave_dv = dv / (n * 6 / 5)

for i in range(int(n / 5)):
    ddv += sum([
        (ave_dv - df["Ball 1"][i])**2,
        (ave_dv - (df["Ball 2"][i] - df["Ball 1"][i]))**2,
        (ave_dv - (df["Ball 3"][i] - df["Ball 2"][i]))**2,
        (ave_dv - (df["Ball 4"][i] - df["Ball 3"][i]))**2,
        (ave_dv - (df["Ball 5"][i] - df["Ball 4"][i]))**2,
        (ave_dv - (50 - df["Ball 5"][i]))**2
    ])
sd_dv = math.sqrt(ddv / (n * 6 / 5))
print(ave_dv, sd_dv)


8.333333333333334 6.410461535179594
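The two loops above can also be written as a single array operation. The following is a minimal sketch, assuming the balls in each row are already sorted ascending (as they are in the `df.head()` output); the helper name `gap_stats` is hypothetical, and `sample` holds only the five draws shown earlier, so its output will differ from the full-data figures.

```python
import numpy as np

def gap_stats(balls, top):
    """Mean gap and RMS deviation from it, over all draws at once.

    balls: 2-D array, one row per draw, numbers sorted ascending within a row.
    top:   largest number in the pool (50 for main balls, 12 for Lucky Stars).
    """
    balls = np.asarray(balls)
    # Gaps between neighbours, with 0 and `top` added as end points of every row
    gaps = np.diff(balls, axis=1, prepend=0, append=top)
    mean_gap = gaps.mean()
    rms_dev = np.sqrt(((gaps - mean_gap) ** 2).mean())
    return mean_gap, rms_dev

# The five draws shown by df.head() above
sample = [
    [2, 6, 14, 19, 23],
    [3, 23, 24, 34, 35],
    [10, 15, 31, 41, 42],
    [12, 14, 21, 45, 48],
    [5, 14, 36, 40, 42],
]
mean_gap, rms_dev = gap_stats(sample, top=50)
print(mean_gap, rms_dev)
```

On the full data, `gap_stats(df[["Ball 1", "Ball 2", "Ball 3", "Ball 4", "Ball 5"]].to_numpy(), 50)` should correspond to `ave_dv` and `sd_dv`.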


Repeating the same for the Lucky Star balls shows even more bias: the deviation (2.72) is around a third smaller than the mean gap (4.0):

In [8]:
dv_ls = 0
ddv_ls = 0
for i in range(int(n_ls / 2)):
    dv_ls += sum([
        df["Lucky Star 1"][i],
        df["Lucky Star 2"][i] - df["Lucky Star 1"][i],
        12 - df["Lucky Star 2"][i]
    ])
ave_dv_ls = dv_ls / (n_ls * 3 / 2)

for i in range(int(n_ls / 2)):
    ddv_ls += sum([
        (ave_dv_ls - df["Lucky Star 1"][i])**2,
        (ave_dv_ls - (df["Lucky Star 2"][i] - df["Lucky Star 1"][i]))**2,
        (ave_dv_ls - (12 - df["Lucky Star 2"][i]))**2
    ])
sd_dv_ls = math.sqrt(ddv_ls / (n_ls * 3 / 2))
print(ave_dv_ls, sd_dv_ls)

4.0 2.721091554963393


One way to make sense of this is to consider the case of the Lucky Star balls. In a situation where there is no bias and there are infinitely many samples to choose from, the average values of the two balls will be about 4 and 8, 4 representing the smaller ball and 8 the bigger one. In this case the average outcomes are perfectly evenly spaced and the distance between each ball and its neighbour is 4; that is to say, 0-4, 4-8 and 8-12 all have neighbouring distances of 4.

Now consider our case, where the deviation figure is about 2.72. A gap pattern that produces a similar figure is 0-2, 2-4, 4-12 (gaps of 2, 2 and 8), which gives `sqrt(((4-2)**2 + (4-2)**2 + (4-8)**2) / 3) ~ 2.83`, so unevenly spaced numbers like these are the more likely outcome.
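The arithmetic in this example can be checked with a few lines of code. This is a small sketch for a single pick; the helper name `rms_gap_deviation` is hypothetical, and it measures the same quantity as `sd_dv_ls` but for one set of numbers rather than the whole history.

```python
import math

def rms_gap_deviation(picks, top):
    """RMS deviation of the gaps between sorted picks from the mean gap.

    0 and `top` are used as the end points, so there is one more gap than picks.
    """
    points = [0] + sorted(picks) + [top]
    gaps = [b - a for a, b in zip(points, points[1:])]
    mean_gap = top / len(gaps)  # the gaps always sum to `top`
    return math.sqrt(sum((mean_gap - g) ** 2 for g in gaps) / len(gaps))

print(rms_gap_deviation([2, 4], 12))  # gaps 2, 2, 8: about 2.83
print(rms_gap_deviation([4, 8], 12))  # gaps 4, 4, 4: 0.0
```

The evenly spaced pick gives a deviation of zero, while the uneven pick lands close to the figure measured from the data.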

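Now we count, for each ball, how often it reappears within the 10 preceding draws (rows are ordered newest first, so rows `i + 1` to `i + 10` are the earlier draws):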

In [9]:
rep_counts = {i+1: 0 for i in range(50)}
total_reps = 0
for i in range(int(n / 5) - 10):
    for j in range(10):
        for k in range(5):
            if (any([
                df["Ball {}".format(k + 1)][i] == df["Ball 1"][i + j + 1],
                df["Ball {}".format(k + 1)][i] == df["Ball 2"][i + j + 1],
                df["Ball {}".format(k + 1)][i] == df["Ball 3"][i + j + 1],
                df["Ball {}".format(k + 1)][i] == df["Ball 4"][i + j + 1],
                df["Ball {}".format(k + 1)][i] == df["Ball 5"][i + j + 1]
            ])):
                rep_counts[df["Ball {}".format(k + 1)][i]] += 1
                total_reps += 1
print(rep_counts)
rep_averages = {}
for i in range(50):
    rep_averages[i + 1] = rep_counts[i + 1] / total_reps
print(rep_averages)
rep_ave = sum([rep_averages[i + 1] for i in range(50)]) / 50.0
rep_sd = math.sqrt(sum([(rep_ave - rep_averages[i + 1])**2 for i in range(50)]) / 50.0)
print(rep_ave, rep_sd)

{1: 7, 2: 8, 3: 14, 4: 4, 5: 10, 6: 11, 7: 12, 8: 13, 9: 10, 10: 27, 11: 23, 12: 12, 13: 16, 14: 11, 15: 7, 16: 22, 17: 24, 18: 17, 19: 22, 20: 12, 21: 39, 22: 5, 23: 24, 24: 21, 25: 20, 26: 13, 27: 19, 28: 22, 29: 13, 30: 4, 31: 15, 32: 15, 33: 14, 34: 36, 35: 51, 36: 14, 37: 15, 38: 8, 39: 6, 40: 15, 41: 12, 42: 18, 43: 14, 44: 22, 45: 16, 46: 9, 47: 16, 48: 36, 49: 6, 50: 11}
{1: 0.008631319358816275, 2: 0.009864364981504316, 3: 0.01726263871763255, 4: 0.004932182490752158, 5: 0.012330456226880395, 6: 0.013563501849568433, 7: 0.014796547472256474, 8: 0.016029593094944512, 9: 0.012330456226880395, 10: 0.03329223181257707, 11: 0.02836004932182491, 12: 0.014796547472256474, 13: 0.01972872996300863, 14: 0.013563501849568433, 15: 0.008631319358816275, 16: 0.027127003699136867, 17: 0.029593094944512947, 18: 0.02096177558569667, 19: 0.027127003699136867, 20: 0.014796547472256474, 21: 0.04808877928483354, 22: 0.006165228113440197, 23: 0.029593094944512947, 24: 0.025893958076448828, 25: 0.02

Because `rep_averages` holds each ball's share of all repetitions, the shares sum to 1 and their mean is 1/50 = 0.02 by construction, which is what `rep_ave` evaluates to. However, the standard deviation is very large here, showing a tendency of numbers to repeat themselves within these smaller cycles.
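The repetition count can be expressed more compactly with set intersections. The sketch below is one possible formulation, assuming each draw contains distinct numbers; the helper name `repeat_counts` is hypothetical, and the example uses only the five draws from `df.head()` with a window of 2 so that the result is easy to follow by hand.

```python
def repeat_counts(draws, window):
    """Count, per number, how often it reappears in the next `window` rows.

    draws: list of draws (each an iterable of distinct numbers), in the same
           order as the DataFrame rows.
    """
    draw_sets = [set(draw) for draw in draws]
    counts = {}
    for i in range(len(draw_sets) - window):
        for other in draw_sets[i + 1 : i + 1 + window]:
            # Every number shared by row i and the other row is one repetition
            for number in draw_sets[i] & other:
                counts[number] = counts.get(number, 0) + 1
    return counts

# The five draws shown by df.head() above
sample = [
    [2, 6, 14, 19, 23],
    [3, 23, 24, 34, 35],
    [10, 15, 31, 41, 42],
    [12, 14, 21, 45, 48],
    [5, 14, 36, 40, 42],
]
print(repeat_counts(sample, window=2))
```

Numbers that never repeat are simply absent from the result, so a lookup such as `counts.get(number, 0)` is needed when building per-ball shares.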

Repeating the same operation for lucky star numbers:

In [10]:
rep_counts_ls = {i+1: 0 for i in range(12)}
total_reps_ls = 0
for i in range(int(n_ls / 2) - 10):
    for j in range(10):
        for k in range(2):
            if (any([
                df["Lucky Star {}".format(k + 1)][i] == df["Lucky Star 1"][i + j + 1],
                df["Lucky Star {}".format(k + 1)][i] == df["Lucky Star 2"][i + j + 1],
            ])):
                rep_counts_ls[df["Lucky Star {}".format(k + 1)][i]] += 1
                total_reps_ls += 1
print(rep_counts_ls)
rep_averages_ls = {}
for i in range(12):
    rep_averages_ls[i + 1] = rep_counts_ls[i + 1] / total_reps_ls
print(rep_averages_ls)
rep_ave_ls = sum([rep_averages_ls[i + 1] for i in range(12)]) / 12.0
rep_sd_ls = math.sqrt(sum([(rep_ave_ls - rep_averages_ls[i + 1])**2 for i in range(12)]) / 12.0)
print(rep_ave_ls, rep_sd_ls)


{1: 36, 2: 41, 3: 99, 4: 14, 5: 33, 6: 54, 7: 37, 8: 29, 9: 23, 10: 67, 11: 83, 12: 23}
{1: 0.06679035250463822, 2: 0.07606679035250463, 3: 0.1836734693877551, 4: 0.025974025974025976, 5: 0.061224489795918366, 6: 0.10018552875695733, 7: 0.0686456400742115, 8: 0.05380333951762523, 9: 0.04267161410018553, 10: 0.12430426716141002, 11: 0.15398886827458255, 12: 0.04267161410018553}
0.08333333333333336 0.04611834873950868


We see that the s.d. here is also quite large, which suggests that balls tend to repeat themselves within cycles of 10 draws. The degree to which this is true is still unknown, but it could be evaluated by keeping a running average of this figure and tracking its variance over a longer period.

Now we put together our findings and try to assign a weight to each of the evaluated probabilities, so that new guesses can be scored against them.

It is difficult to guess how much each probability would influence future events, but this should be reviewed and adjusted over time to monitor how the bias in the outcome distributions is shifting. Perhaps a GAN could be optimised to consider all of these factors when guessing new numbers.

In [11]:
from random import randint
candidates = {}
for i in range(0, 1000):
    new_guess = []
    while len(new_guess) < 5:
        new_random = randint(1, 50)
        if new_random not in new_guess:
            new_guess.append(new_random)
    # Sort before the duplicate check, because keys are stored in sorted order
    new_guess.sort()
    if str(new_guess) in candidates:
        continue
    # print(new_guess)
    score = 0
    ave_score = 0
    rep_score = 0
    for j in range(len(new_guess)):
        if abs(running_probabilities[new_guess[j]] - ave) < (1.75*sd):
            ave_score += abs((abs(running_probabilities[new_guess[j]] - ave) - sd))
        else:
            ave_score += abs(running_probabilities[new_guess[j]] - ave)

        if abs(rep_averages[new_guess[j]] - rep_ave) < (1.45*rep_sd):
            rep_score += abs((abs(rep_averages[new_guess[j]] - rep_ave) - rep_sd))
        else:
            rep_score += abs(rep_averages[new_guess[j]] - rep_ave)
    # print(ave_score)
    # print(rep_score)

    guess_ave_dv = sum([
        new_guess[0],
        new_guess[1] - new_guess[0],
        new_guess[2] - new_guess[1],
        new_guess[3] - new_guess[2],
        new_guess[4] - new_guess[3],
        50 - new_guess[4]
    ]) / 6

    guess_ddv = sum([
        (guess_ave_dv - new_guess[0])**2,
        (guess_ave_dv - (new_guess[1] - new_guess[0]))**2,
        (guess_ave_dv - (new_guess[2] - new_guess[1]))**2,
        (guess_ave_dv - (new_guess[3] - new_guess[2]))**2,
        (guess_ave_dv - (new_guess[4] - new_guess[3]))**2,
        (guess_ave_dv - (50 - new_guess[4]))**2
    ])
    guess_sd_dv = math.sqrt(guess_ddv / 6)
    ddv_score = abs(sd_dv - guess_sd_dv) / (n / 5)
    # print(ddv_score)

    candidates[str(new_guess)] = ave_score + rep_score + ddv_score
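The guess generation at the top of this loop can be shortened with `random.sample`, which draws distinct values in one call. This is only a sketch of that step, not of the scoring; the helper name `random_guess` is hypothetical, and returning a sorted tuple is a design choice made here so that the guess can be used directly as a dictionary key instead of its string form.

```python
import random

def random_guess(picked, pool):
    """Return `picked` distinct numbers from 1..pool as a sorted tuple."""
    return tuple(sorted(random.sample(range(1, pool + 1), picked)))

# Collect 20 distinct guesses; a set ignores duplicates automatically
seen = set()
while len(seen) < 20:
    seen.add(random_guess(5, 50))

print(len(seen))
```

Because the tuples are already sorted, two guesses containing the same numbers always compare equal, whatever order they were drawn in.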




In [12]:
candidates_ls = {}
for i in range(0, 1000):
    new_guess_ls = []
    while len(new_guess_ls) < 2:
        new_random = randint(1, 12)
        if new_random not in new_guess_ls:
            new_guess_ls.append(new_random)
    # Sort before the duplicate check, because keys are stored in sorted order
    new_guess_ls.sort()
    if str(new_guess_ls) in candidates_ls:
        continue
    # print(new_guess_ls)
    score = 0
    ave_score = 0
    rep_score = 0
    for j in range(len(new_guess_ls)):
        if abs(running_probabilities_ls[new_guess_ls[j]] - ave_ls) < (sd_ls):
            ave_score += abs((abs(running_probabilities_ls[new_guess_ls[j]] - ave_ls) - sd_ls))
        else:
            ave_score += abs(running_probabilities_ls[new_guess_ls[j]] - ave_ls)

        if abs(rep_averages_ls[new_guess_ls[j]] - rep_ave_ls) < (2*rep_sd_ls):
            rep_score += abs((abs(rep_averages_ls[new_guess_ls[j]] - rep_ave_ls) - rep_sd_ls))
        else:
            rep_score += abs(rep_averages_ls[new_guess_ls[j]] - rep_ave_ls)
    # print(ave_score)
    # print(rep_score)

    guess_ave_dv = sum([
        new_guess_ls[0],
        new_guess_ls[1] - new_guess_ls[0],
        12 - new_guess_ls[1]
    ]) / 3

    guess_ddv = sum([
        (guess_ave_dv - new_guess_ls[0])**2,
        (guess_ave_dv - (new_guess_ls[1] - new_guess_ls[0]))**2,
        (guess_ave_dv - (12 - new_guess_ls[1]))**2
    ])
    guess_sd_dv = math.sqrt(guess_ddv / 3)
    ddv_score = abs(sd_dv_ls - guess_sd_dv) / (n_ls / 2)
    # print(ddv_score)

    candidates_ls[str(new_guess_ls)] = ave_score + rep_score + ddv_score



The 201st to 240th best guesses (lowest scores first) for the ball numbers:

In [13]:
for i in sorted(candidates.items(), key=lambda item: item[1])[200:240]:
    print(i)

('[1, 16, 27, 28, 31]', 0.04822465220008795)
('[5, 25, 31, 40, 50]', 0.048281454340841815)
('[17, 25, 26, 38, 45]', 0.048290491464448776)
('[10, 36, 38, 39, 43]', 0.04837258923186778)
('[1, 17, 22, 27, 40]', 0.04842283897563189)
('[19, 28, 33, 43, 50]', 0.04851817028755132)
('[10, 12, 13, 27, 39]', 0.04853003115318179)
('[4, 19, 20, 23, 27]', 0.04856248089797499)
('[4, 24, 46, 49, 50]', 0.04856331759664642)
('[18, 19, 25, 36, 46]', 0.04858643107864346)
('[4, 17, 23, 26, 41]', 0.048586505045757455)
('[16, 22, 27, 38, 39]', 0.048601320411285606)
('[4, 5, 16, 19, 36]', 0.048640070109278254)
('[10, 12, 28, 39, 47]', 0.04864535079929605)
('[5, 8, 19, 23, 37]', 0.04864570888360313)
('[1, 12, 27, 40, 46]', 0.04869521050925775)
('[4, 6, 7, 27, 38]', 0.048746805748856946)
('[16, 31, 38, 45, 49]', 0.04880806604762517)
('[19, 20, 24, 29, 40]', 0.04888647923951949)
('[4, 15, 18, 22, 28]', 0.048900552650966475)
('[11, 17, 33, 39, 40]', 0.04892382397371297)
('[6, 14, 27, 41, 49]', 0.0490844024378058

Top guesses for the Lucky Star numbers (lowest scores first):

In [14]:
for i in sorted(candidates_ls.items(), key=lambda item: item[1]):
    print(i)

('[9, 12]', 0.026134840602816425)
('[9, 10]', 0.037411984797505536)
('[8, 9]', 0.03846338792559345)
('[8, 12]', 0.04628626387195414)
('[10, 12]', 0.04783487068465428)
('[5, 12]', 0.04884311808554339)
('[5, 9]', 0.052877237850177186)
('[6, 12]', 0.05440256799008534)
('[6, 9]', 0.05558329238539599)
('[8, 10]', 0.05605495897669527)
('[4, 12]', 0.05755065812934637)
('[7, 9]', 0.05826436257496997)
('[1, 9]', 0.059186451829030116)
('[4, 9]', 0.059679061710933096)
('[7, 12]', 0.06218142812715014)
('[2, 9]', 0.06371312626549187)
('[5, 10]', 0.06761541353195585)
('[9, 11]', 0.07159465549609806)
('[6, 10]', 0.0725637010965416)
('[4, 10]', 0.0731226835496209)
('[2, 12]', 0.07577559720324972)
('[5, 8]', 0.07584328279424411)
('[7, 10]', 0.07653932512920644)
('[1, 12]', 0.07707186318883977)
('[6, 8]', 0.07725478348637205)
('[4, 5]', 0.077444995620539)
('[5, 6]', 0.07859750928174308)
('[2, 10]', 0.07930460310546027)
('[7, 8]', 0.07951891140475352)
('[1, 10]', 0.08110109364500914)
('[1, 8]', 0.0824410
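When only the few best candidates are needed, sorting the whole dictionary is not required. A minimal sketch using the standard library's `heapq.nsmallest` follows; the helper name `best_candidates` is hypothetical, and `example_scores` contains four entries from the output above with their scores rounded to four decimals.

```python
import heapq

def best_candidates(candidates, k):
    """Return the k (guess, score) pairs with the lowest scores, best first."""
    return heapq.nsmallest(k, candidates.items(), key=lambda item: item[1])

# Rounded scores taken from the Lucky Star output above
example_scores = {
    "[8, 12]": 0.0463,
    "[9, 10]": 0.0374,
    "[9, 12]": 0.0261,
    "[8, 9]": 0.0385,
}
print(best_candidates(example_scores, 2))
# [('[9, 12]', 0.0261), ('[9, 10]', 0.0374)]
```

The same call works for `candidates` and `candidates_ls`, since both map a guess to a score where lower is treated as better.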

# GAN timeseries generation

In this section we extract the data to be loaded into a neural net, so that its predictions can be compared with the ones evaluated here. Eventually the idea is to use a GAN to enhance this prediction.

In [15]:
df[["Lucky Star 1", "Lucky Star 2", "Ball 1", "Ball 2", "Ball 3", "Ball 4", "Ball 5"]].to_csv("euromillions-dataset.csv", index=False)
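How the exported rows are fed to a network is not specified here. One common way to prepare draw histories for a time-series model is to cut them into overlapping windows of consecutive draws, and the sketch below shows that step only. The helper name `make_windows`, the window length of 4 and the toy array standing in for the CSV are all assumptions made for illustration. Since the DataFrame is ordered newest first, the rows are reversed so that each window runs forward in time.

```python
import numpy as np

def make_windows(rows, length):
    """Stack overlapping windows of `length` consecutive draws.

    rows: 2-D array, one row per draw, oldest draw first.
    Returns an array of shape (n_windows, length, n_columns).
    """
    rows = np.asarray(rows)
    n_windows = len(rows) - length + 1
    if n_windows < 1:
        raise ValueError("not enough draws for a single window")
    return np.stack([rows[i : i + length] for i in range(n_windows)])

# Toy stand-in for the exported CSV: 6 draws x 7 columns, newest first like df
newest_first = np.arange(42).reshape(6, 7)
windows = make_windows(newest_first[::-1], length=4)
print(windows.shape)  # (3, 4, 7)
```

With the real file, `rows` could come from `pd.read_csv("euromillions-dataset.csv").to_numpy()[::-1]`.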