# Objective

Predict 3 of the 6 numbers drawn in an Israeli lottery game. The dataset is based on the results of Israeli general lottery games between September 1968 and May 2022.

[URL](https://medium.com/@polanitzer/how-to-guess-accurately-3-lottery-numbers-out-of-6-using-lstm-model-e148d1c632d6)

## How does it work?

The Israeli lottery, called Loto, is a weekly game in which the participant chooses 6 numbers out of 37 and an additional number out of 7.
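
To put the goal of three correct numbers in context, the short sketch below counts the possible tickets under the rules just described and computes the chance that a single random ticket matches exactly 3 of the 6 main numbers. It assumes the 6-of-37 plus 1-of-7 format stated above; the variable names are illustrative only.

```python
from math import comb

MAIN_POOL, MAIN_PICKS = 37, 6   # six main numbers out of 37 (as stated above)
EXTRA_POOL = 7                  # one additional number out of 7

# Number of distinct tickets: choose the 6 main numbers, then the extra one.
main_combinations = comb(MAIN_POOL, MAIN_PICKS)
total_tickets = main_combinations * EXTRA_POOL

# Probability that a random ticket matches exactly 3 of the 6 drawn main
# numbers (hypergeometric: 3 hits among the 6 drawn, 3 misses among the 31 others).
p_exactly_three = comb(6, 3) * comb(MAIN_POOL - 6, 3) / main_combinations

print(f"Main-number combinations: {main_combinations:,}")   # 2,324,784
print(f"Total distinct tickets:   {total_tickets:,}")       # 16,273,488
print(f"P(exactly 3 of 6):        {p_exactly_three:.4f}")   # 0.0387
```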

### Working the data

In [1]:
import numpy as np 
import pandas as pd  
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional, Dropout

Let's load the dataset:

In [2]:
file = './data/IsraeliLottery.csv'
df = pd.read_csv(file)

In [3]:
df.head()

Unnamed: 0,Game,Date,A,B,C,D,E,F
0,6801,03/09/1968,3,14,18,22,25,33
1,6802,10/09/1968,13,20,23,29,32,34
2,6803,17/09/1968,8,12,26,27,34,38
3,6804,24/09/1968,1,14,17,26,35,39
4,6805,01/10/1968,1,7,8,9,11,30


In [4]:
df.tail()

Unnamed: 0,Game,Date,A,B,C,D,E,F
4042,3467,14/05/2022,9,16,17,21,26,29
4043,3468,17/05/2022,1,4,5,24,31,32
4044,3469,19/05/2022,1,8,18,25,29,30
4045,3470,21/05/2022,3,4,5,15,24,33
4046,3471,24/05/2022,6,10,13,20,23,35


In [5]:
df.shape

(4047, 8)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4047 entries, 0 to 4046
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Game    4047 non-null   int64 
 1   Date    4047 non-null   object
 2   A       4047 non-null   int64 
 3   B       4047 non-null   int64 
 4   C       4047 non-null   int64 
 5   D       4047 non-null   int64 
 6   E       4047 non-null   int64 
 7   F       4047 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 253.1+ KB


There are no null or missing values.

In [7]:
df.describe()

Unnamed: 0,Game,A,B,C,D,E,F
count,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0,4047.0
mean,4695.399061,5.698542,11.428713,17.128737,22.948357,28.795404,34.635285
std,3106.179305,4.635402,5.988067,6.698026,6.820599,6.61636,5.910514
min,1035.0,1.0,2.0,3.0,4.0,7.0,13.0
25%,2046.5,2.0,7.0,12.0,18.0,24.0,31.0
50%,3058.0,4.0,10.0,17.0,23.0,29.0,35.0
75%,8011.5,8.0,15.0,22.0,28.0,33.0,38.0
max,9934.0,31.0,37.0,44.0,46.0,48.0,49.0


Some features, such as "Game" and "Date", will not be useful, so we'll drop them.

In [8]:
df.drop(['Game', 'Date'], axis=1, inplace=True)
df.head()

Unnamed: 0,A,B,C,D,E,F
0,3,14,18,22,25,33
1,13,20,23,29,32,34
2,8,12,26,27,34,38
3,1,14,17,26,35,39
4,1,7,8,9,11,30


Deep learning algorithms expect all input features to vary in a similar way, ideally with a mean of 0 and a variance of 1. We must rescale our data so that it fulfills these requirements.

In [9]:
scaler = StandardScaler().fit(df.values)
transformed_dataset = scaler.transform(df.values)
transformed_df = pd.DataFrame(data=transformed_dataset, index=df.index)

Checking the scaled data:

In [10]:
transformed_df.head()

Unnamed: 0,0,1,2,3,4,5
0,-0.582231,0.429455,0.130094,-0.13906,-0.57371,-0.276708
1,1.575345,1.431572,0.876674,0.887369,0.484404,-0.107497
2,0.496557,0.095416,1.324623,0.594104,0.786722,0.569346
3,-1.013746,0.429455,-0.019223,0.447471,0.937882,0.738557
4,-1.013746,-0.739681,-1.363068,-2.045287,-2.689939,-0.784341
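
As a rough illustration of what the scaling step does, the sketch below reproduces the standardization by hand with NumPy on a toy array. The values and the names `draws`, `scaled`, and `restored` are illustrative, not taken from the dataset. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `df.describe()` reports the sample standard deviation.

```python
import numpy as np

# Toy draws (rows = games, columns = balls); values are illustrative only.
draws = np.array([[3, 14, 18],
                  [13, 20, 23],
                  [8, 12, 26],
                  [1, 14, 17]], dtype=float)

# Per-column mean and population standard deviation (ddof=0).
mean = draws.mean(axis=0)
std = draws.std(axis=0, ddof=0)

scaled = (draws - mean) / std      # what transform() computes
restored = scaled * std + mean     # what inverse_transform() computes

print(scaled.mean(axis=0).round(6))   # approximately 0 for each column
print(scaled.std(axis=0).round(6))    # approximately 1 for each column
```

The same mean and standard deviation must be reused later when new games are scaled or predictions are converted back to ball numbers, which is why the fitted `scaler` object is kept around.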


Defining variables

In [11]:
number_of_games = df.values.shape[0]
number_of_games

4047

In [12]:
# Number of previous games to take into consideration for each prediction
window_length = 7

In [13]:
# Number of balls per game (features)
number_of_features = df.values.shape[1]
number_of_features

6

Let's create X and y for each row in our scaled data. X should have the format that the Keras LSTM model expects: (rows, window size, balls).

In [14]:
X = np.empty([number_of_games - window_length, window_length, number_of_features], dtype=float)

y = np.empty([number_of_games - window_length, number_of_features],dtype=float)

for i in range(number_of_games-window_length):
    X[i] = transformed_df.iloc[i : i + window_length, 0 : number_of_features]

    y[i] = transformed_df.iloc[i+window_length : i+window_length+1, 0 : number_of_features]

Let's check out X shape:

In [15]:
X.shape

(4040, 7, 6)

In [16]:
y.shape

(4040, 6)
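
The windowing loop can also be written without an explicit Python loop. The sketch below is one possible vectorized equivalent using `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later), checked against a loop version on a small toy array. The toy data and the names `X_loop`, `X_vec`, and `y_vec` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy stand-in for the scaled data: 10 games, 6 balls (illustrative values).
data = np.arange(60, dtype=float).reshape(10, 6)
window_length = 7

# Loop version, mirroring the cell above.
n_samples = data.shape[0] - window_length
X_loop = np.empty((n_samples, window_length, data.shape[1]))
y_loop = np.empty((n_samples, data.shape[1]))
for i in range(n_samples):
    X_loop[i] = data[i : i + window_length]
    y_loop[i] = data[i + window_length]

# Vectorized version: sliding_window_view puts the window axis last,
# so transpose to (samples, window, balls) and drop the final window,
# which has no following game to serve as its label.
windows = sliding_window_view(data, window_length, axis=0).transpose(0, 2, 1)
X_vec = windows[:-1]
y_vec = data[window_length:]

print(X_vec.shape, y_vec.shape)  # (3, 7, 6) (3, 6)
```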

Let's check out our first scaled sample (made of seven consecutive games)

In [17]:
X[0]

array([[-0.58223113,  0.429455  ,  0.1300936 , -0.13906023, -0.57371015,
        -0.27670816],
       [ 1.57534546,  1.43157169,  0.87667445,  0.88736948,  0.4844041 ,
        -0.10749723],
       [ 0.49655716,  0.0954161 ,  1.32462296,  0.59410385,  0.78672246,
         0.56934648],
       [-1.01374645,  0.429455  , -0.01922257,  0.44747103,  0.93788164,
         0.7385574 ],
       [-1.01374645, -0.73968114, -1.36306809, -2.04528682, -2.68993865,
        -0.78434094],
       [ 0.2807995 , -0.40564224, -1.06443575, -0.8722243 , -0.87602851,
        -0.10749723],
       [-1.01374645, -0.90670059,  0.87667445,  0.88736948,  0.63556328,
        -0.10749723]])

Let's check out our first scaled label

In [18]:
y[0]

array([-0.79798879,  0.26243555,  0.42872594,  0.1542054 , -0.57371015,
       -1.29197372])

Let's check out the second scaled sample (made of seven consecutive lottery games)

In [19]:
X[1]

array([[ 1.57534546,  1.43157169,  0.87667445,  0.88736948,  0.4844041 ,
        -0.10749723],
       [ 0.49655716,  0.0954161 ,  1.32462296,  0.59410385,  0.78672246,
         0.56934648],
       [-1.01374645,  0.429455  , -0.01922257,  0.44747103,  0.93788164,
         0.7385574 ],
       [-1.01374645, -0.73968114, -1.36306809, -2.04528682, -2.68993865,
        -0.78434094],
       [ 0.2807995 , -0.40564224, -1.06443575, -0.8722243 , -0.87602851,
        -0.10749723],
       [-1.01374645, -0.90670059,  0.87667445,  0.88736948,  0.63556328,
        -0.10749723],
       [-0.79798879,  0.26243555,  0.42872594,  0.1542054 , -0.57371015,
        -1.29197372]])

Let's check out our second scaled label

In [20]:
y[1]

array([-0.58223113, -0.40564224,  1.32462296,  1.32726792,  0.78672246,
        0.23092462])
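
The windows built above cover the whole dataset, so the label of the final sample is the most recent game. A minimal sketch, assuming one wants to check predictions on games the model has not seen during training, is to hold out the last sample or samples chronologically before fitting. The toy arrays and the names `n_holdout`, `X_train`, and `X_test` are hypothetical and not part of the original notebook.

```python
import numpy as np

# Toy arrays with the same layout as X and y above (illustrative values).
n_samples, window_length, n_balls = 20, 7, 6
X = np.random.default_rng(1).normal(size=(n_samples, window_length, n_balls))
y = np.random.default_rng(2).normal(size=(n_samples, n_balls))

# Keep the most recent samples out of training; for time-ordered data the
# split is chronological rather than shuffled.
n_holdout = 1
X_train, y_train = X[:-n_holdout], y[:-n_holdout]
X_test, y_test = X[-n_holdout:], y[-n_holdout:]

print(X_train.shape, X_test.shape)  # (19, 7, 6) (1, 7, 6)
```

Training on `X_train` and `y_train` only would leave `X_test` and `y_test` available for an evaluation on unseen games.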

## Modeling

First, let's initialise the RNN (recurrent neural network)

In [21]:
model = Sequential()

Let's add the input layer and the LSTM layer

In [22]:
model.add(Bidirectional(LSTM(240, input_shape = (window_length, number_of_features), return_sequences = True)))

Let's add a first Dropout layer in order to reduce overfitting

In [23]:
model.add(Dropout(0.2))
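
For intuition, here is a small NumPy sketch of what a dropout layer with rate 0.2 does to its input during training. It is a simplified illustration with toy values, not Keras's internal implementation; the names `activations`, `keep_mask`, and `dropped` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.2                        # same rate as Dropout(0.2) above

activations = np.ones((4, 5))     # toy layer output (illustrative)

# Training time: zero each unit with probability `rate`, then scale the
# survivors by 1 / (1 - rate) so the expected activation is unchanged.
keep_mask = rng.random(activations.shape) >= rate
dropped = activations * keep_mask / (1.0 - rate)

# Inference time: dropout leaves its input untouched.
inference_output = activations

print(dropped)
```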

Let's add a second LSTM layer

In [24]:
# input_shape is only meaningful on the first layer; later layers infer their input
model.add(Bidirectional(LSTM(240, return_sequences=True)))

Let's add a second Dropout layer

In [25]:
model.add(Dropout(0.2))

Then, let’s add a third LSTM layer

In [26]:
model.add(Bidirectional(LSTM(240, return_sequences=True)))

Now, let’s add a fourth LSTM layer

In [27]:
# return_sequences=False: only the final time step is passed on to the dense layers
model.add(Bidirectional(LSTM(240, return_sequences=False)))

Next, let’s add a dense layer

In [28]:
model.add(Dense(59))

Finally, let’s add the last output layer

In [29]:
model.add(Dense(number_of_features))
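
To get a feel for the size of the network just assembled, the sketch below estimates the number of trainable parameters per layer with the standard LSTM and dense-layer formulas. It assumes Keras defaults (bias terms enabled, `Bidirectional` concatenating the forward and backward outputs, so each bidirectional layer emits 2 × 240 = 480 features). The layer labels in the dictionary are made up for readability; dropout layers add no parameters.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Weights of one LSTM direction: 4 gates, each with an input kernel,
    a recurrent kernel, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Weights of a fully connected layer: kernel plus bias."""
    return input_dim * units + units


units, balls = 240, 6
bi_out = 2 * units  # forward and backward outputs concatenated

layer_params = {
    "bidirectional_lstm_1": 2 * lstm_params(balls, units),
    "bidirectional_lstm_2": 2 * lstm_params(bi_out, units),
    "bidirectional_lstm_3": 2 * lstm_params(bi_out, units),
    "bidirectional_lstm_4": 2 * lstm_params(bi_out, units),
    "dense_59": dense_params(bi_out, 59),
    "dense_output": dense_params(59, balls),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:22s} {count:>10,}")
print(f"{'total':22s} {total_params:>10,}")   # 4,655,939
```

Once the model has been built, `model.summary()` reports the actual figures and can be used to confirm the estimate.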

Now, let's compile the RNN

In [30]:
from tensorflow import keras
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.0001), loss='mse', metrics=['accuracy'] )
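
The `'mse'` loss is the mean squared error between the scaled targets and the scaled predictions. The NumPy sketch below shows the computation on toy standardized data, along with a simple reference point: on standardized targets, always predicting 0 (the column mean) gives an MSE of about 1. The random data and the helper `mse` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy standardized targets: 1000 games, 6 balls (random, illustrative).
raw = rng.normal(size=(1000, 6))
y_true = (raw - raw.mean(axis=0)) / raw.std(axis=0)


def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error averaged over all samples and outputs."""
    return float(np.mean((y_true - y_pred) ** 2))


# Baseline: always predict the column mean, which is 0 after standardization.
baseline_loss = mse(y_true, np.zeros_like(y_true))
print(round(baseline_loss, 6))  # 1.0
```

Comparing a training loss against this baseline gives a quick sense of how much the model has learned beyond predicting the average draw.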

Next, let’s train our LSTM model

In [31]:
model.fit(x=X, y=y, batch_size=100, epochs=300, verbose=2)

Epoch 1/300
41/41 - 39s - loss: 0.9241 - accuracy: 0.2515 - 39s/epoch - 948ms/step
Epoch 2/300
41/41 - 15s - loss: 0.8969 - accuracy: 0.2782 - 15s/epoch - 370ms/step
Epoch 3/300
41/41 - 16s - loss: 0.8922 - accuracy: 0.2686 - 16s/epoch - 386ms/step
Epoch 4/300
41/41 - 15s - loss: 0.8933 - accuracy: 0.2775 - 15s/epoch - 367ms/step
Epoch 5/300
41/41 - 15s - loss: 0.8911 - accuracy: 0.2817 - 15s/epoch - 364ms/step
Epoch 6/300
41/41 - 15s - loss: 0.8917 - accuracy: 0.2780 - 15s/epoch - 369ms/step
Epoch 7/300
41/41 - 15s - loss: 0.8895 - accuracy: 0.2775 - 15s/epoch - 357ms/step
Epoch 8/300
41/41 - 15s - loss: 0.8901 - accuracy: 0.2775 - 15s/epoch - 367ms/step
Epoch 9/300
41/41 - 15s - loss: 0.8897 - accuracy: 0.2760 - 15s/epoch - 371ms/step
Epoch 10/300
41/41 - 16s - loss: 0.8904 - accuracy: 0.2743 - 16s/epoch - 382ms/step
Epoch 11/300
41/41 - 15s - loss: 0.8896 - accuracy: 0.2745 - 15s/epoch - 362ms/step
Epoch 12/300
41/41 - 15s - loss: 0.8879 - accuracy: 0.2770 - 15s/epoch - 368ms/step
...

<keras.src.callbacks.History at 0x1c4a8d02440>

## Evaluation

Let's take the results of the last 8 Israeli general lottery games:

In [32]:
to_predict = df.tail(8)
to_predict

Unnamed: 0,A,B,C,D,E,F
4039,1,14,15,17,28,37
4040,3,13,21,24,27,35
4041,17,19,21,23,25,34
4042,9,16,17,21,26,29
4043,1,4,5,24,31,32
4044,1,8,18,25,29,30
4045,3,4,5,15,24,33
4046,6,10,13,20,23,35


Let's remove the last row from the last 8 games

In [33]:
# Reassign instead of dropping in place: to_predict is a slice of df,
# and an in-place drop on it triggers pandas' SettingWithCopyWarning.
to_predict = to_predict.drop(to_predict.index[-1])
to_predict


Unnamed: 0,A,B,C,D,E,F
4039,1,14,15,17,28,37
4040,3,13,21,24,27,35
4041,17,19,21,23,25,34
4042,9,16,17,21,26,29
4043,1,4,5,24,31,32
4044,1,8,18,25,29,30
4045,3,4,5,15,24,33


We now have exactly the last 7 games before the May 24th, 2022 lottery game.

Now, let’s take the results of the May 24th, 2022 lottery game and place them into a variable called “prediction”

In [34]:
prediction = df.tail(1)
prediction

Unnamed: 0,A,B,C,D,E,F
4046,6,10,13,20,23,35


Next, we have to change the format of our last 7 games from dataframe to np.array in order to insert them into our model

In [35]:
to_predict = np.array(to_predict)
to_predict

array([[ 1, 14, 15, 17, 28, 37],
       [ 3, 13, 21, 24, 27, 35],
       [17, 19, 21, 23, 25, 34],
       [ 9, 16, 17, 21, 26, 29],
       [ 1,  4,  5, 24, 31, 32],
       [ 1,  8, 18, 25, 29, 30],
       [ 3,  4,  5, 15, 24, 33]], dtype=int64)

Then, we have to re-scale those 7 games

In [36]:
scaled_to_predict = scaler.transform(to_predict)
scaled_to_predict

array([[-1.01374645,  0.429455  , -0.3178549 , -0.8722243 , -0.12023262,
         0.40013555],
       [-0.58223113,  0.26243555,  0.57804211,  0.1542054 , -0.27139179,
         0.0617137 ],
       [ 2.4383761 ,  1.26455224,  0.57804211,  0.00757259, -0.57371015,
        -0.10749723],
       [ 0.71231482,  0.76349389, -0.01922257, -0.28569304, -0.42255097,
        -0.95355186],
       [-1.01374645, -1.24073948, -1.8110166 ,  0.1542054 ,  0.33324492,
        -0.44591908],
       [-1.01374645, -0.57266169,  0.1300936 ,  0.30083822,  0.03092656,
        -0.78434094],
       [-0.58223113, -1.24073948, -1.8110166 , -1.16548993, -0.72486933,
        -0.27670816]])

Now, let’s predict the results (i.e., the 6 numbers) of the May 24th, 2022 lottery game based on those 7 games:

In [39]:
y_pred = model.predict(np.array([scaled_to_predict]))

print("The predicted numbers in the last lottery game are:", scaler.inverse_transform(y_pred).astype(int)[0])

The predicted numbers in the last lottery game are: [ 5  9 12 19 22 35]
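
Two details of the prediction cell are easy to miss: the model expects a leading batch axis, which is why the 7 games are wrapped in an extra list, and `.astype(int)` truncates the fractional part instead of rounding. The sketch below illustrates both with hypothetical values; `y_pred_unscaled` is a made-up array, not the model's actual output.

```python
import numpy as np

# A single window of 7 games needs a leading batch axis: (7, 6) -> (1, 7, 6).
window = np.zeros((7, 6))
batch = window[np.newaxis, ...]          # same result as np.array([window])
print(batch.shape)                       # (1, 7, 6)

# Hypothetical unscaled prediction for one game (illustrative values only).
y_pred_unscaled = np.array([[4.2, 9.9, 12.6, 19.1, 22.8, 35.4]])

truncated = y_pred_unscaled.astype(int)[0]         # drops the fractional part
rounded = np.rint(y_pred_unscaled).astype(int)[0]  # rounds to the nearest integer

print(truncated)   # [ 4  9 12 19 22 35]
print(rounded)     # [ 4 10 13 19 23 35]
```

Which conversion is preferable is a design choice; the point is only that the two can differ by one for any ball.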


Let’s see what the real results of the May 24th, 2022 lottery game were:

In [40]:
prediction = np.array(prediction)

print("The actual numbers in the last lottery game were:", prediction[0])

The actual numbers in the last lottery game were: [ 6 10 13 20 23 35]
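
To compare the two arrays systematically, the short sketch below counts exact matches and the per-ball absolute difference. The arrays are copied from the two outputs printed above; the names `predicted`, `actual`, `exact_matches`, and `abs_error` are illustrative.

```python
import numpy as np

predicted = np.array([5, 9, 12, 19, 22, 35])   # printed by the prediction cell above
actual = np.array([6, 10, 13, 20, 23, 35])     # printed by the cell above

# Numbers that appear in both the predicted and the actual draw.
exact_matches = np.intersect1d(predicted, actual)

# Position-wise difference (both arrays are sorted in ascending order).
abs_error = np.abs(predicted - actual)

print("Exact matches:", exact_matches)          # [35]
print("Absolute error per ball:", abs_error)    # [1 1 1 1 1 0]
```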


In this run, 1 number out of 6 (35) matches exactly, and each of the other 5 predictions is off by just one. Not bad at all, especially considering that there is not supposed to be any pattern in the data; that is, the numbers are supposed to be 100% random.