# Analyzing Lotto Data

I've just downloaded a CSV file from https://catalog.data.gov/dataset/lottery-take-5-winning-numbers.
I'm now going to read the file into a dataframe and see if there are any correlations between the winning numbers.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
num_to_month = {
    '01' : 'Jan',
    '02' : 'Feb',
    '03' : 'Mar',
    '04' : 'Apr',
    '05' : 'May',
    '06' : 'Jun',
    '07' : 'Jul',
    '08' : 'Aug',
    '09' : 'Sep',
    '10' : 'Oct',
    '11' : 'Nov',
    '12' : 'Dec'
}

In [3]:
df = pd.read_csv("Lottery_Take_5_Winning_Numbers.csv")
df.head()

Unnamed: 0,Draw Date,Winning Numbers,Bonus #
0,03/16/2019,01 04 06 17 31,
1,03/15/2019,10 19 21 24 29,
2,03/14/2019,07 13 32 33 34,
3,03/13/2019,07 12 13 21 37,
4,03/12/2019,01 06 31 35 39,


Let's see how many draws have a non-NaN Bonus # value.

In [4]:
# Note: `== True` matches only values equal to 1, not every non-NaN value.
bonus = df["Bonus #"] == True
sum(bonus)

1

Only 1, so this column is unnecessary for the analysis we're currently doing. In case we'd like to analyze this draw later, here it is at index 5600:

In [5]:
df[bonus]

Unnamed: 0,Draw Date,Winning Numbers,Bonus #
5600,10/30/2003,08 26 27 28 36,1.0
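
A side note on the check above: comparing against `True` only picks up values equal to 1, so any other bonus number would be missed. Here is a minimal sketch of counting non-missing values directly with `notna()`. The tiny frame below is made up for illustration and is not the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; these values are invented for illustration.
toy = pd.DataFrame({"Bonus #": [np.nan, 1.0, np.nan, 7.0]})

equals_one = toy["Bonus #"] == True   # True only where the value equals 1
has_bonus = toy["Bonus #"].notna()    # True wherever any value is present

n_equal_one = int(equals_one.sum())
n_present = int(has_bonus.sum())
print(n_equal_one, n_present)
```

On the real frame, `df["Bonus #"].notna().sum()` would be the analogous count.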


Let's convert the Draw Dates to 3 columns: month, date, year.

In [11]:
def to_month_date_year(row):
    # Split the "MM/DD/YYYY" string into integer month, date, and year.
    lst = row["Draw Date"].split('/')
    return pd.Series(list(map(int, lst)))

dates = df.apply(to_month_date_year, axis=1).rename(index=str, columns={0: "month", 1: "date", 2: "year"})
dates.head()

Unnamed: 0,month,date,year
0,3,16,2019
1,3,15,2019
2,3,14,2019
3,3,13,2019
4,3,12,2019
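
The same split can probably be done without a row-wise `apply`. This is a sketch that assumes every Draw Date follows the MM/DD/YYYY pattern seen in the rows above; it uses `pd.to_datetime` and the `.dt` accessors on a small made-up frame.

```python
import pandas as pd

# Toy stand-in for the "Draw Date" column (MM/DD/YYYY strings).
toy = pd.DataFrame({"Draw Date": ["03/16/2019", "03/15/2019", "10/30/2003"]})

# Parse once, then pull out the three parts as integer columns.
parsed = pd.to_datetime(toy["Draw Date"], format="%m/%d/%Y")
toy_dates = pd.DataFrame({
    "month": parsed.dt.month,
    "date": parsed.dt.day,
    "year": parsed.dt.year,
})
print(toy_dates)
```

Keeping the parsed datetime column around could also make it easier to sort or group by date later.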


Each element in the column of winning numbers is currently a string. Strings aren't well suited to the plots I want, so let's make a column for each number.

In [12]:
def to_int(string_of_nums):
    lst_of_ints = list(map(int, string_of_nums.split()))
    return pd.Series(lst_of_ints)
    
to_int("01 04 06 17 31")     

0     1
1     4
2     6
3    17
4    31
dtype: int64

In [13]:
temp = df["Winning Numbers"].apply(to_int)
winners = pd.DataFrame()
for i in range(1, 6):
    winners[i] = temp[i - 1]
winners.head()

Unnamed: 0,1,2,3,4,5
0,1,4,6,17,31
1,10,19,21,24,29
2,7,13,32,33,34
3,7,12,13,21,37
4,1,6,31,35,39
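
For comparison, here is a sketch of a vectorized way to get the same five columns, assuming every row holds exactly five space-separated numbers. It uses `Series.str.split(expand=True)` on a two-row toy Series rather than the real column.

```python
import pandas as pd

# Toy stand-in for df["Winning Numbers"].
toy = pd.Series(["01 04 06 17 31", "10 19 21 24 29"], name="Winning Numbers")

# Split on whitespace into separate columns, then convert the strings to ints.
toy_winners = toy.str.split(expand=True).astype(int)
toy_winners.columns = range(1, 6)  # label the columns 1..5, as in `winners`
print(toy_winners)
```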


In [14]:
games = pd.concat([dates.reset_index(drop=True),winners.reset_index(drop=True)], axis=1)
games.head()

Unnamed: 0,month,date,year,1,2,3,4,5
0,3,16,2019,1,4,6,17,31
1,3,15,2019,10,19,21,24,29
2,3,14,2019,7,13,32,33,34
3,3,13,2019,7,12,13,21,37
4,3,12,2019,1,6,31,35,39


In [15]:
games.describe()

Unnamed: 0,month,date,year,1,2,3,4,5
count,8007.0,8007.0,8007.0,8007.0,8007.0,8007.0,8007.0,8007.0
mean,6.509304,15.712002,2007.435369,6.641938,13.261771,19.835769,26.596103,33.259023
std,3.461776,8.798263,6.832169,5.189784,6.636895,6.983484,6.598717,5.275412
min,1.0,1.0,1992.0,1.0,2.0,3.0,5.0,7.0
25%,3.0,8.0,2002.0,3.0,8.0,15.0,22.0,31.0
50%,7.0,16.0,2008.0,5.0,12.0,20.0,27.0,35.0
75%,10.0,23.0,2013.0,9.0,18.0,25.0,32.0,37.0
max,12.0,31.0,2019.0,31.0,35.0,37.0,38.0,39.0
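
One caveat before hunting for correlations: in the rows shown above, each draw's numbers appear in ascending order, and the column means in the summary climb from about 6.6 to about 33.3. If the file stores every draw sorted (an assumption I haven't verified for all rows), the five columns will correlate with each other simply because of the sorting. Here is a rough sketch on simulated draws (not the real data) using `DataFrame.corr()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulate 1000 draws of 5 distinct numbers from 1..39, sorted within each draw.
draws = np.sort(
    np.array([rng.choice(np.arange(1, 40), size=5, replace=False) for _ in range(1000)]),
    axis=1,
)
sim = pd.DataFrame(draws, columns=range(1, 6))

# Pairwise Pearson correlation between the five sorted positions.
corr = sim.corr()
print(corr.round(2))
```

Running `games[[1, 2, 3, 4, 5]].corr()` on the real frame would be the analogous check; positive values there would not by themselves say anything about the lottery being predictable.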


Now that we have the dataframe converted to ints, let's make some plots and see if there are any inferences we could make to help us become millionaires! First, I'm going to make a function to count how often each number appears in the input dataframe.

In [72]:
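# Tally of how many times each number 1..39 has been drawn.
counts = {i: 0 for i in range(1, 40)}

def count(ser):
    # Increment the module-level tally for every number in one row.
    for i in ser:
        counts[i] += 1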
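
Because `count` updates a module-level dict, applying it twice would double every total. As a possible alternative, this sketch computes the tally in one pass with `value_counts`, shown on a three-row toy frame shaped like the number columns of `games`:

```python
import pandas as pd

# Toy stand-in for games.iloc[:, 3:].
toy = pd.DataFrame(
    [[1, 4, 6, 17, 31], [10, 19, 21, 24, 29], [1, 6, 31, 35, 39]],
    columns=range(1, 6),
)

# Flatten all five columns, count each number, and fill in zeros for unseen numbers.
toy_counts = (
    pd.Series(toy.to_numpy().ravel())
    .value_counts()
    .reindex(range(1, 40), fill_value=0)
)
print(toy_counts[toy_counts > 0])
```

The result is a Series indexed by the numbers 1 through 39, which can be recomputed safely as many times as needed.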

In [58]:
toy = games.iloc[:, 3:].head()
toy.head()

Unnamed: 0,1,2,3,4,5
0,1,4,6,17,31
1,10,19,21,24,29
2,7,13,32,33,34
3,7,12,13,21,37
4,1,6,31,35,39


In [71]:
games.iloc[:, 3:].apply(count, axis=1)
counts

False

In [None]:
def plot_games(df, lst_of_games):
    #df: a dataframe that is of the form shown under the command "games.head()" above.
    #lst_of_games: the list of games to analyze
    counts = {}
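
The stub above is unfinished. One way it might be completed is sketched below, under two assumptions of mine: that `lst_of_games` holds row positions, and that the goal is a bar chart of how often each number appears in those rows. The name `plot_games_sketch` and the toy frame are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_games_sketch(df, lst_of_games):
    """Bar-plot how often each number 1..39 appears in the selected rows.

    df: a frame shaped like `games` (columns month, date, year, 1..5).
    lst_of_games: row positions to include (an assumed meaning of this argument).
    """
    # Take the five number columns of the chosen rows and flatten them.
    numbers = df.iloc[lst_of_games, 3:].to_numpy().ravel()
    counts = pd.Series(numbers).value_counts().reindex(range(1, 40), fill_value=0)
    ax = counts.plot.bar(figsize=(12, 3))
    ax.set_xlabel("number")
    ax.set_ylabel("times drawn")
    return counts, ax

# Toy frame with the same column layout as `games`.
toy_games = pd.DataFrame(
    [
        [3, 16, 2019, 1, 4, 6, 17, 31],
        [3, 15, 2019, 10, 19, 21, 24, 29],
        [3, 12, 2019, 1, 6, 31, 35, 39],
    ],
    columns=["month", "date", "year", 1, 2, 3, 4, 5],
)
toy_plot_counts, toy_ax = plot_games_sketch(toy_games, [0, 2])
```

Returning the counts alongside the axes keeps the tally available for inspection without reading values off the chart.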
     

In [18]:
games[games["month"] == 1].head()

Unnamed: 0,month,date,year,1,2,3,4,5
44,1,31,2019,13,14,17,20,21
45,1,30,2019,13,19,36,37,39
46,1,29,2019,9,26,29,30,34
47,1,28,2019,8,12,35,36,39
48,1,27,2019,6,13,14,17,38


In [25]:
games.iloc[1]

month       3
date       15
year     2019
1          10
2          19
3          21
4          24
5          29
Name: 1, dtype: int64