# INTRODUCTION

We'll be looking at historical data for the Ontario Lottery 6/49 ([found here](https://www.kaggle.com/datascienceai/lottery-dataset)). The dataset covers 3,665 drawings from 1982 to 2018. In each drawing, 6 numbers are drawn from 1 to 49 without replacement, so no number can appear twice in the same drawing.

The purpose of this project is to estimate the chances of winning this lottery and to check tickets against the historical data. We want to build functions that let the user answer questions like:

1) What is the probability of winning the big prize with a single ticket?

2) What is the probability of winning the big prize if we play 40 different tickets (or any other number)?

3) What is the probability of having at least five (or four, or three, or two) winning numbers on a single ticket?

The last question matters because smaller prizes are still paid out for partial matches (for example, a small percentage of the pot [less than 5%] or a few thousand dollars).

### STEP 1: Creating Core Functions

The first goal is to write the helper functions that the probability calculations rely on throughout: one for factorials and one for combinations.

In [1]:
# Factorial: n! = 1 * 2 * ... * n (returns 1 for n = 0)

def factorial(n):
    value = 1
    for i in range(1, n + 1):
        value *= i
    return value

# Combinations: number of ways to choose k items from n, ignoring order

def combination(n, k):
    numerator = factorial(n)
    denominator = factorial(k) * factorial(n - k)
    return numerator / denominator  # true division, so the result is a float
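
As a cross-check on the hand-written helpers, here is a minimal sketch that uses the standard library instead. It assumes Python 3.8 or later, where `math.comb` is available; the name `n_choose_k` is made up for this example and is not part of the notebook.

```python
import math

def n_choose_k(n, k):
    """Number of ways to choose k items from n, returned as an exact int."""
    return math.comb(n, k)

# Number of possible 6/49 tickets
print(n_choose_k(49, 6))
```

Unlike `combination` above, this returns an integer rather than a float, so it prints without a trailing `.0`.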

### STEP 2: One-ticket Probability

We first want to calculate the probability of winning the big prize with a single ticket. In the 6/49 lottery, a ticket is six numbers chosen from 1 to 49.

To do this, we'll write a function that:

1) takes the user's six numbers, chosen from 1 to 49 without replacement

2) receives those six numbers as a list

3) reports the probability in a friendly way

In [2]:
# Number of possible outcomes

total_combos = combination(49,6)
total_combos

# A single ticket covers 1 of the possible outcomes
odds = 1/total_combos
print(odds)

# Output (multiply by 100 to express the probability as a percentage)
print("The probability of winning the 6/49 lottery off of {} ticket is {}%".format(1, odds*100))


7.151123842018516e-08
The probability of winning the 6/49 lottery off of 1 ticket is 7.151123842018516e-06%


In [3]:
def one_ticket_probability(array):
    length = len(array)
    unique_combo = combination(49, length)
    probability = 1/unique_combo
    percentage = probability*100  # convert to a percentage before printing it with "%"
    sentence = "The probability of winning the 6/49 lottery from a single ticket is {}%".format(percentage)
    return sentence

In [4]:
one_ticket_probability([1,2,3,4,5,6])

'The probability of winning the 6/49 lottery from a single ticket is 7.151123842018516e-06%'
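
`one_ticket_probability` trusts whatever list it is given. A minimal sketch of the checks listed at the start of this step (six numbers, each from 1 to 49, no repeats) could look like the following. `validate_ticket` and its parameters are hypothetical names for illustration, not part of the original notebook.

```python
def validate_ticket(numbers, pool_size=49, picks=6):
    """Hypothetical helper: return the ticket sorted, or raise ValueError."""
    if len(numbers) != picks:
        raise ValueError("a ticket needs exactly {} numbers".format(picks))
    if any(not isinstance(n, int) or not 1 <= n <= pool_size for n in numbers):
        raise ValueError("numbers must be integers from 1 to {}".format(pool_size))
    if len(set(numbers)) != picks:
        raise ValueError("numbers must not repeat")
    return sorted(numbers)

print(validate_ticket([6, 5, 4, 3, 2, 1]))
```

Calling it before `one_ticket_probability` would reject inputs such as `[1, 1, 2, 3, 4, 5]` instead of silently computing a probability for them.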

### STEP 3: Historical Data Check

For the first version of the app, users should also be able to compare their ticket against the historical lottery data and see whether it would ever have won. The dataset contains 3,665 drawings (one per row), dating from 1982 to 2018. For each drawing, the six winning numbers are in the columns NUMBER DRAWN 1 to NUMBER DRAWN 6.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [6]:
data = pd.read_csv("649.csv")
print(data.shape)

(3665, 11)


In [7]:
data.head()

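| | PRODUCT | DRAW NUMBER | SEQUENCE NUMBER | DRAW DATE | NUMBER DRAWN 1 | NUMBER DRAWN 2 | NUMBER DRAWN 3 | NUMBER DRAWN 4 | NUMBER DRAWN 5 | NUMBER DRAWN 6 | BONUS NUMBER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 649 | 1 | 0 | 6/12/1982 | 3 | 11 | 12 | 14 | 41 | 43 | 13 |
| 1 | 649 | 2 | 0 | 6/19/1982 | 8 | 33 | 36 | 37 | 39 | 41 | 9 |
| 2 | 649 | 3 | 0 | 6/26/1982 | 1 | 6 | 23 | 24 | 27 | 39 | 34 |
| 3 | 649 | 4 | 0 | 7/3/1982 | 3 | 9 | 10 | 13 | 20 | 43 | 34 |
| 4 | 649 | 5 | 0 | 7/10/1982 | 5 | 14 | 21 | 31 | 34 | 47 | 45 |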


In [8]:
data.tail()

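| | PRODUCT | DRAW NUMBER | SEQUENCE NUMBER | DRAW DATE | NUMBER DRAWN 1 | NUMBER DRAWN 2 | NUMBER DRAWN 3 | NUMBER DRAWN 4 | NUMBER DRAWN 5 | NUMBER DRAWN 6 | BONUS NUMBER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3660 | 649 | 3587 | 0 | 6/6/2018 | 10 | 15 | 23 | 38 | 40 | 41 | 35 |
| 3661 | 649 | 3588 | 0 | 6/9/2018 | 19 | 25 | 31 | 36 | 46 | 47 | 26 |
| 3662 | 649 | 3589 | 0 | 6/13/2018 | 6 | 22 | 24 | 31 | 32 | 34 | 16 |
| 3663 | 649 | 3590 | 0 | 6/16/2018 | 2 | 15 | 21 | 31 | 38 | 49 | 8 |
| 3664 | 649 | 3591 | 0 | 6/20/2018 | 14 | 24 | 31 | 35 | 37 | 48 | 17 |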


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3665 entries, 0 to 3664
Data columns (total 11 columns):
PRODUCT            3665 non-null int64
DRAW NUMBER        3665 non-null int64
SEQUENCE NUMBER    3665 non-null int64
DRAW DATE          3665 non-null object
NUMBER DRAWN 1     3665 non-null int64
NUMBER DRAWN 2     3665 non-null int64
NUMBER DRAWN 3     3665 non-null int64
NUMBER DRAWN 4     3665 non-null int64
NUMBER DRAWN 5     3665 non-null int64
NUMBER DRAWN 6     3665 non-null int64
BONUS NUMBER       3665 non-null int64
dtypes: int64(10), object(1)
memory usage: 315.0+ KB


In [10]:
data.describe(include='all')

| | PRODUCT | DRAW NUMBER | SEQUENCE NUMBER | DRAW DATE | NUMBER DRAWN 1 | NUMBER DRAWN 2 | NUMBER DRAWN 3 | NUMBER DRAWN 4 | NUMBER DRAWN 5 | NUMBER DRAWN 6 | BONUS NUMBER |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3665.0 | 3665.0 | 3665.0 | 3665 | 3665.0 | 3665.0 | 3665.0 | 3665.0 | 3665.0 | 3665.0 | 3665.0 |
| unique | | | | 3591 | | | | | | | |
| top | | | | 6/30/2012 | | | | | | | |
| freq | | | | 4 | | | | | | | |
| mean | 649.0 | 1819.494952 | 0.030832 | | 7.327694 | 14.568076 | 21.890859 | 28.978445 | 36.162619 | 43.099045 | 24.599454 |
| std | 0.0 | 1039.239544 | 0.237984 | | 5.811669 | 7.556939 | 8.170073 | 8.069724 | 7.19096 | 5.506424 | 14.360038 |
| min | 649.0 | 1.0 | 0.0 | | 1.0 | 2.0 | 3.0 | 4.0 | 11.0 | 13.0 | 0.0 |
| 25% | 649.0 | 917.0 | 0.0 | | 3.0 | 9.0 | 16.0 | 23.0 | 31.0 | 40.0 | 12.0 |
| 50% | 649.0 | 1833.0 | 0.0 | | 6.0 | 14.0 | 22.0 | 30.0 | 37.0 | 45.0 | 25.0 |
| 75% | 649.0 | 2749.0 | 0.0 | | 10.0 | 20.0 | 28.0 | 35.0 | 42.0 | 47.0 | 37.0 |


The dataset has no missing values, and every column has an appropriate data type: the numeric columns are int64 and DRAW DATE is stored as a string (object).

### STEP 4: Compare a Set of Numbers with Past Winning Numbers

The function takes six different numbers as a list and checks them against every past draw. Since there are a lot of moving parts, it's best to break the work down:

In [11]:
# Extract the six winning numbers from one row of the historical data
## Takes a row of the lottery dataframe as input and returns a set containing the 6 winning numbers

def extract_numbers(row):
    row = row.iloc[4:10]  # positions 4-9 are NUMBER DRAWN 1 to NUMBER DRAWN 6
    return set(row.values) # .values gives the underlying array of numbers
    

In [12]:
type(data.iloc[0:10].apply(extract_numbers, axis = 1))

pandas.core.series.Series

In [13]:
# Takes two inputs: a list with the user's numbers and a pandas Series of sets with the winning numbers (built with extract_numbers)

winning_numbers = data.apply(extract_numbers, axis = 1)

def check_historical_occurence(my_numbers, winning_numbers):
    conversion = set(my_numbers)
    times = 0
    
    for each in winning_numbers:
        if each == conversion:
            times += 1
    
    if times == 0: 
        print("Historically speaking, {} have not been winning numbers for 6/49.".format(my_numbers))
    else:
        print("The number of times {} were winning numbers for the 6/49 lottery was {}.".format(my_numbers, times))
        print("That is {:.4f}% of the {:,} historical draws.".format(times/len(winning_numbers)*100, len(winning_numbers)))
        print("Overall, any single ticket has a 1 in {:,.0f} chance of winning the 6/49.".format(combination(49,6)))

In [14]:
# Testing out the function.

check_historical_occurence([11,17,23,25,32,48], winning_numbers)
print("\n")
check_historical_occurence([8,33,36,37,39,41], winning_numbers)

Historically speaking, [11, 17, 23, 25, 32, 48] have not been winning numbers for 6/49.


The number of times [8, 33, 36, 37, 39, 41] were winning numbers for the 6/49 lottery was 1.
That is 0.0273% of the 3,665 historical draws.
Overall, any single ticket has a 1 in 13,983,816 chance of winning the 6/49.
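
To put the historical check in context, here is a rough sketch of how often any one combination should be expected to show up in a dataset of this size. It assumes every draw is independent and uniformly random, which is a modelling assumption rather than something verified against the data; the variable names are illustrative.

```python
import math

draws = 3665                      # rows in the dataset, per data.shape above
total_combos = math.comb(49, 6)   # number of possible tickets

# Expected number of times one specific combination appears in the history
expected_hits = draws / total_combos

# Probability that a specific combination never appeared in any of the draws
p_never = (1 - 1 / total_combos) ** draws

print("Expected appearances: {:.6f}".format(expected_hits))
print("Chance it never appeared: {:.4%}".format(p_never))
```

Under those assumptions, a combination that has never won tells us almost nothing about its future chances: nearly every combination is expected to be absent from the history.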


### STEP 5: Multi-ticket Probability

Lottery addicts usually play more than one ticket per draw, on the reasoning that more tickets mean more chances of winning. To give them a better understanding of their chances, we need a function that calculates the probability of winning for any number of different tickets.

The user inputs the number of different tickets they want to play (BUT NOT THE ACTUAL COMBINATIONS), an integer between 1 and 13,983,816, and the function prints some text indicating the chance of winning the prize.

Because I'm from the province with this lottery, I know that each ticket costs about $3.00, so it's also a good idea to report the total cost of the tickets purchased.
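
The arithmetic behind this step can be written out as follows. Because the n tickets are all different, each one covers a distinct outcome, so their chances simply add up; the symbol n is introduced here for illustration, and the $3.00 price is the approximate figure quoted above.

$$
P(\text{jackpot with } n \text{ tickets}) = \frac{n}{\binom{49}{6}} = \frac{n}{13{,}983{,}816},
\qquad
\text{cost} = 3n \text{ dollars}
$$

For example, n = 6,991,908 (half of all combinations) gives a probability of exactly 50%.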

In [15]:
def multi_ticket_probability(tickets):
    cost = 3*tickets
    possible_outcomes = combination(49,6)
    probability_to_win = tickets/possible_outcomes*100
    
    print("The percentage that you'll win the 6/49 jackpot based on {} ticket(s) purchased is {}%.  Considering that it costs $3.00 per ticket, you'll be spending ${}.".format(tickets, probability_to_win, cost))
    

In [16]:
for each in [1, 10, 100, 10000, 1000000, 6991908, 13983816]:
    multi_ticket_probability(each)
    print("\n")

The percentage that you'll win the 6/49 jackpot based on 1 ticket(s) purchased is 7.151123842018516e-06%.  Considering that it costs $3.00 per ticket, you'll be spending $3.


The percentage that you'll win the 6/49 jackpot based on 10 ticket(s) purchased is 7.151123842018517e-05%.  Considering that it costs $3.00 per ticket, you'll be spending $30.


The percentage that you'll win the 6/49 jackpot based on 100 ticket(s) purchased is 0.0007151123842018516%.  Considering that it costs $3.00 per ticket, you'll be spending $300.


The percentage that you'll win the 6/49 jackpot based on 10000 ticket(s) purchased is 0.07151123842018516%.  Considering that it costs $3.00 per ticket, you'll be spending $30000.


The percentage that you'll win the 6/49 jackpot based on 1000000 ticket(s) purchased is 7.151123842018517%.  Considering that it costs $3.00 per ticket, you'll be spending $3000000.


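The percentage that you'll win the 6/49 jackpot based on 6991908 ticket(s) purchased is 50.0%.  Considering that it costs $3.00 per ticket, you'll be spending $20975724.


The percentage that you'll win the 6/49 jackpot based on 13983816 ticket(s) purchased is 100.0%.  Considering that it costs $3.00 per ticket, you'll be spending $41951448.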

### STEP 6: Fewer Winning Numbers

Since you can win smaller prizes with fewer matching numbers (the minimum is really 3), we want to reflect this. In the app, the user supplies six different numbers and an integer for the number of matches they're interested in. The probability doesn't depend on which six numbers are picked, so the function below takes only the integer and prints the probability of matching exactly that many winning numbers.
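
The count behind this step is the standard hypergeometric one. It is written here with k for the number of matches, a symbol not used in the original notebook: choose which k of the ticket's 6 numbers match, then fill the remaining 6 − k drawn numbers from the 43 that are not on the ticket.

$$
P(\text{exactly } k \text{ matches}) = \frac{\binom{6}{k}\,\binom{43}{6-k}}{\binom{49}{6}}
$$

For k = 5 this is (6 × 43) / 13,983,816 = 258 / 13,983,816, or roughly 0.0018%.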

In [35]:
def probability_less_6(integer):
    if integer < 2 or integer > 5: 
        print("Pick a value between 2 and 5")
    else:
        combination_ticket = combination(6, integer) # Ways to choose which of our 6 numbers are the matching ones
        combination_exact = combination(43, 6-integer) # Ways to fill the rest of the draw from the 43 numbers not on our ticket
        successful_outcomes = combination_ticket * combination_exact
        total_outcomes = combination(49,6)
        probability_num_successful = successful_outcomes/total_outcomes
        percentage = probability_num_successful*100
        
        print("There are {:,.0f} possible draws that match exactly {} of my six numbers. Given that there are {:,.0f} possible 6-number draws, I would expect to have a {:,.3f}% chance of achieving this.".format(successful_outcomes, integer, total_outcomes, percentage))

In [36]:
probability_less_6(5)

There are 258 possible draws that match exactly 5 of my six numbers. Given that there are 13,983,816 possible 6-number draws, I would expect to have a 0.002% chance of achieving this.
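
`probability_less_6` gives the chance of matching exactly the requested number of winning numbers, while the third question in the introduction asks about having at least that many. A minimal sketch of the "at least" version sums the exact-match counts from k up to 6. `probability_at_least` and its parameters are hypothetical names for this example, and `math.comb` assumes Python 3.8 or later.

```python
import math

def probability_at_least(k, picks=6, pool=49):
    """Probability that a ticket matches k or more of the drawn numbers."""
    total = math.comb(pool, picks)
    # For each match count m, choose m of the ticket's numbers and
    # fill the rest of the draw from the numbers not on the ticket.
    favourable = sum(
        math.comb(picks, m) * math.comb(pool - picks, picks - m)
        for m in range(k, picks + 1)
    )
    return favourable / total

for k in (2, 3, 4, 5):
    print("At least {} matches: {:.4%}".format(k, probability_at_least(k)))
```

The result is returned as a probability between 0 and 1 rather than printed as a sentence, so it can be reused in other calculations.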
