![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banner_Top_06.06.18.jpg?raw=true)

# OpenData 101 Example 2 - The Lottery

## Introduction
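Lottery draws are meant to be completely random, which makes historical lottery results a fun data set to explore: if some numbers really do come up more often than others, the data should show it. In this notebook we'll load past Lotto 649 results, count how often each number has been drawn, and plot the result.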



## The Data Set

We'll be using historical Lotto 649 data of the kind published on a Lotto 649 statistics website [at this link](http://www.lotto649stats.com/recent_winning.html). Should you choose to follow that link, you'll find yourself greeted by a table that looks similar to the one below.

![winning lottery numbers](images/lotto.png)

This is fantastic: all the data is right there. However, the question remains: how will we extract that data from the website into a format that will work well for us? Another catch, which isn't obvious from the screenshot, is that this is an incomplete table, and there are several more tables for earlier years available. As this is technically open data, we're free to go through that website (either manually or with a computer) and copy that data down ourselves. However, that might take a lot more time than it's worth, so we should be hesitant to start our analysis from here.

Luckily, it turns out that someone else has already gone through the trouble of capturing the last 30 or so years' worth of data and putting it into a form that is much easier to work with. This data is available [at this link](https://www.kaggle.com/datascienceai/lottery-dataset). However, this is an example of "semi-open data": you're free to download it without restriction, but you have to register for the website in order to access it. We've already gone through that trouble for you, and we're simply loading a local copy we've saved separately. Below are a few lines of code that should be starting to look familiar if you worked through the previous notebook.

In [None]:
# Import our data manipulation library
import pandas as pd 
import matplotlib.pyplot as plt
# So any plots we want will appear in the notebook
%matplotlib inline

In [None]:
# Here we're loading the data set. The 'pd' prefix tells us that 'read_csv' is a function
# coming from the pandas library
lottery = pd.read_csv("data/649.csv")

# This is to show us how many rows in our table we will have, 'len' stands for length
print('Total number of lotteries played:', len(lottery))

# This is to just look at the first five rows of the data set and keep the 
# notebook a little cleaner. 
lottery.head()

The above table is the first five rows of our table which has entries for 3665 separate lotteries. That's a lot of prizes! A natural first question to ask of this data is "are some numbers more popular than others?". If there is a bias for some numbers over others, we'd probably like to play those numbers instead! 

A good way to try that is by simply counting how many times each number appears. For example, how many 1s, 2s, 3s ... etc have been drawn throughout the life of this data set. Let's walk through the steps that we have to go through in order to do that. First, let's make a new data table of just the numbers data. 

In [None]:
# First we define a list of the column names (seen in the table above) whose entries
# we want to count up.
cols = ['NUMBER DRAWN 1','NUMBER DRAWN 2','NUMBER DRAWN 3',
        'NUMBER DRAWN 4','NUMBER DRAWN 5','NUMBER DRAWN 6']

# By passing the list we've defined above, we return only those columns. We then assign those columns
# to a new dataframe called 'numbers'
numbers = lottery[cols]

numbers.head()

Now that we've isolated just the numbers we're interested in, it's a simple matter of counting up all the occurrences of each number. However, before we can do that, we have to first introduce another piece of Python/pandas functionality: the `apply()` function. 

---
### Digression: Understanding Apply()

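The `apply()` function can be thought of as a way to run a function of our choosing on the contents of a dataframe. To see how it works, we'll first build a small example dataframe and a simple function to use with it.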

In [None]:
'''
Here we're creating a new dataframe and filling it with zeroes. 

The first argument is the values to fill (here just zero everywhere) 
index is a list of labels for the row indexes, here just zero through five,
and columns is the column names, again zero through five 
'''

example = pd.DataFrame(0, index=[0,1,2,3,4,5], columns=[0,1,2,3,4,5])
example

In [None]:
'''
Now we're going through defining a function, a process which we need to introduce. 

1. Here the 'def' keyword can be thought of to mean "define"

2. 'add_one' is the name we've given to the function that we're defining. This choice is 
    arbitrary and we could have named it anything we wanted. However, it's often helpful 
    to name a function in a way that is meaningful to help describe the purpose of the function.
    
3. 'x' in parentheses is the name we're assigning to the variable that our function will be 
   taking as input 
   
4. output is the name that we're giving to a variable internal to our function to make calculations

5. return is the keyword that tells our function to return that value once it's been called

6. Notice the consistent indentation after 'def', this is how python knows that the lines of code 
   underneath def are a part of the function 


'''

# This function will take a number x as input, and return that value plus one 
def add_one(x):
    output = x + 1
    return output

# Here we're testing it for expected behavior: Does it indeed 
# return the original number plus one? 
add_one(10)

Now that we've defined a "test" dataframe and a function which will add one to any input, we will demonstrate at a high level what the `apply()` method will do when it is used on a dataframe. 

In [None]:
'''
Notice how we're not supplying any arguments to our add_one function. That is because 
apply() passes each column of the dataframe 'example' to add_one for us!
'''
example.apply(add_one)

Notice how we've added one to every entry of our dataframe in one short line of code! For our purposes, `apply()` can be thought of as a method that 'applies' a function to our data frame. The functionality of `apply()` extends well beyond this, but this should be sufficient to understand the next sections.
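To make the idea a little more concrete, here is a minimal sketch of what `apply()` hands to the function. The dataframe `toy` and the function `value_range` are hypothetical names made up for this illustration. By default the function receives each whole column at once, and with `axis=1` it receives each row instead.

```python
import pandas as pd

# A small, made-up dataframe used only for this illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(values):
    # 'values' is a whole column (or row) of the dataframe, not a single number
    return values.max() - values.min()

# Default behaviour: the function is called once per column
column_ranges = toy.apply(value_range)

# With axis=1 the function is called once per row instead
row_ranges = toy.apply(value_range, axis=1)

print(column_ranges)
print(row_ranges)
```

This is also why `add_one` worked above: adding one to a whole column adds one to every entry in it.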

--------------

### Back to Business

Now that we've seen how to use a function on the data within our dataframe using `apply()`, we can use it to count how many times each number has appeared within the lottery. In this case we're going to use a built-in pandas function, which can be thought of as similar to an Excel function like `SUM`, `AVERAGE`, or `COUNT`. In our case, we're going to use `pd.value_counts`. Here the `pd` prefix specifies that this is a pandas function, and `value_counts` is the name of the function that is useful for us. What `value_counts` does is return a new pandas object holding the count of each unique value. Let's see how to use it with `apply()` to count the frequencies of lottery numbers below.

In [None]:
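"""
Here the function we're passing to "apply" is pd.value_counts. This is a pandas function
that counts the occurrences of unique entries, in this case in each column of the data frame. 
This is the perfect function to count how many times each number has been drawn in the lottery!

Note: newer pandas versions deprecate pd.value_counts. If it warns or fails for you, 
pd.Series.value_counts can be passed to apply instead.
"""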

numbers.apply(pd.value_counts)

In the table above we can see how many times each number has been counted, and in which position! However, this is a good example of where perfectly good data may appear misleading. Why are we seeing a lot of those `NaN` values again? The answer is quite simple. As the values in each draw have been sorted in ascending order across the columns, certain numbers haven't appeared, or can't appear, in certain positions, so there is nothing to count there. For example, the numbers 1-5 can _never_ be present in the 6$^{\text{th}}$ column, because five smaller distinct numbers would have to sit to their left, and that is precisely what we are seeing.
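The same effect can be reproduced on a tiny made-up table. The sketch below uses a hypothetical two-column dataframe called `toy_draws` (not real lottery data) and `pd.Series.value_counts`, which counts the unique values of a single column. A `NaN` shows up wherever a value never occurs in a given column.

```python
import pandas as pd

# Made-up "draws" with two sorted positions per draw
toy_draws = pd.DataFrame({"first": [1, 1, 2],
                          "second": [2, 3, 3]})

# Count the unique values in each column separately
toy_counts = toy_draws.apply(pd.Series.value_counts)

# 3 never appears in the 'first' column and 1 never appears in 'second',
# so those cells are NaN rather than 0
print(toy_counts)
```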


Now, a table of data is fantastic, but it's much easier to communicate something like frequency counts with a bar chart. However, we first have to add up the number of times each number was counted across all six positions! If we were doing this in Excel, such a thing would be quite straightforward: simply type `SUM`, drag a selection across a row, then drag that cell down the side. Easy! Here, I'm going to show you that it's just as easy, possibly even easier, using pandas.

In [None]:
'''
Here we first create a new data frame 'counts', which holds the table of counts we 
computed in the previous cell.

We then create a new "sum_column" in our data frame and fill it with the sum of each 
row in our data frame. The sum of each row is specified by the 'axis=1' argument.
'''

counts = numbers.apply(pd.value_counts)

# Create a new column in our counts data frame called "sum_column"
counts["sum_column"] = counts.sum(axis=1)

counts

The above action has added a new column that holds the sum of the data contained within each row. Notice how the `NaN` entries did not count towards the numeric total in our sum column (this is easiest to see in the first or last rows). Now that we have a convenient column, let's finally visualize the counts we have.
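If the `axis` argument feels abstract, the following small sketch may help. It uses a hypothetical dataframe `toy` with a couple of missing values, and shows that `sum(axis=1)` adds across each row, `sum(axis=0)` adds down each column, and that `NaN` entries are skipped by default.

```python
import pandas as pd

nan = float("nan")

# Made-up counts with missing entries, indexed by the number drawn
toy = pd.DataFrame({"pos1": [2.0, 1.0, nan],
                    "pos2": [nan, 1.0, 2.0]},
                   index=[1, 2, 3])

row_totals = toy.sum(axis=1)     # one total per row (per number drawn)
column_totals = toy.sum(axis=0)  # one total per column (per position)

print(row_totals)
print(column_totals)
```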

In [None]:
'''
Here we're simply making a plot of the number frequencies that we counted in the previous step.

'''

counts['sum_column'].plot(kind = 'bar', figsize = (16,8))
plt.xlabel("Number Drawn", size = 16)
plt.ylabel("Counts", size = 16)
plt.title("Number Frequencies of Lotto 649", size = 18)

From the above figure, it's clear that the counts are not perfectly _uniform_ across the lottery numbers. Of course, the question remains: are any numbers actually more _likely_ to be drawn in the lottery than others, or is this just the unevenness we'd expect from chance? If some numbers were more likely, it would certainly be ideal to choose those numbers instead.
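One rough way to put the bar heights in context is to work out what a perfectly fair lottery would be expected to produce. The sketch below assumes every draw picks 6 distinct numbers out of 49 and uses the 3665 draws reported earlier in this notebook. Under those assumptions, the count for any single number follows a binomial distribution, which gives both an expected count and a typical spread around it.

```python
import math

n_draws = 3665        # number of draws reported earlier in this notebook
numbers_per_draw = 6  # main numbers drawn each time
pool_size = 49        # numbers run from 1 to 49

# Chance that one particular number appears in one particular draw
p = numbers_per_draw / pool_size

# Expected count for each number, and the typical spread (standard deviation)
expected_count = n_draws * p
typical_spread = math.sqrt(n_draws * p * (1 - p))

print(round(expected_count, 1), round(typical_spread, 1))
```

Bars that sit within a couple of spreads of the expected count are about what chance alone would produce.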

Looking back over the entire notebook, it may appear that we did a lot of work to create the graph above. In fact, quite the contrary! In the cell below we've repeated all the code necessary to go from loading the data set to creating the number frequency bar chart of Lotto 649. You'll find that, ignoring white space and comments, it takes only about a dozen lines of code to go from loading the data to creating a formatted graph of wrangled data.



In [None]:
# Below is all the code required to create the histogram of Lotto 649 number draws 
# Libraries
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

# Data loading/wrangling
lottery = pd.read_csv("data/649.csv")

cols = ['NUMBER DRAWN 1','NUMBER DRAWN 2','NUMBER DRAWN 3',
        'NUMBER DRAWN 4','NUMBER DRAWN 5','NUMBER DRAWN 6']
numbers = lottery[cols]

counts = numbers.apply(pd.value_counts)
counts["sum_column"] = counts.sum(axis=1)

# To create the plot again, remove the leading '# ' from each of the four lines
# below. 

# counts['sum_column'].plot(kind = 'bar', figsize = (16,8))
# plt.xlabel("Number Drawn", size = 16)
# plt.ylabel("Counts", size = 16)
# plt.title("Number Frequencies of Lotto 649", size = 18)

## Wrapping Up

Hopefully this notebook has demystified how to incorporate open data within a Jupyter notebook. We have demonstrated the relative simplicity of loading data into an easy-to-work-with data frame, then creating and visualizing summary statistics directly using that data frame. Of course, this is but the beginning of any analysis. If you were in fact interested in understanding any "non-random anomalies", you would need to dive a little deeper into this data set. Perhaps you would perform a test of randomness such as a chi-squared or Kolmogorov-Smirnov test. Perhaps you would try to find any numbers that appear to be "less random" than the others and use those as favorable choices in your own lottery picks. Certainly, it would be surprising if the lottery was subject to some form of bias; however, discovering those possibilities is what working with open data is all about.
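As a starting point for the chi-squared idea mentioned above, here is a minimal sketch of how the test statistic itself is computed. The function name `chi_squared_statistic` and the toy counts are made up for illustration and are not real Lotto 649 data. The statistic simply measures how far the observed counts sit from the counts we would expect if every outcome were equally likely.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for a tiny 4-number "lottery" (not real data)
observed = [48, 52, 55, 45]

# If every number were equally likely, each would be expected equally often
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_squared_statistic(observed, expected)
print(statistic)
```

A larger statistic means the counts are further from uniform. Turning it into a probability requires comparing it against a chi-squared distribution, for example with `scipy.stats.chisquare`. Keep in mind that lottery numbers are drawn without replacement within a draw, so treating the 49 counts as independent is only an approximation.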


---

## Bonus

Below is a function called `lot` that will generate random lottery numbers for you. If you're feeling adventurous, try and create your own (simulated) data set of lottery numbers, or simply run the cell multiple times to generate as many combinations of potential lottery numbers as you'd like. 

In [None]:
import random
def lot(sort = True):
    # Create a list (the [] brackets) holding your first random lottery number
    choice = [float(random.randint(1,49))]
    
    # This is an infinite loop: be careful!
    while True:
        # Try to add a new number to our list provided it isn't already in 
        # our list
        new = float(random.randint(1,49))
        if new not in choice:
            # If it's a number we don't already have, add it to the list
            choice.append(new)
        
        # If we have 6 numbers, we can exit our infinite loop by returning
        # our lottery choices
        if len(choice) == 6:
            if sort:
                return sorted(choice)
            else:
                return choice

# This actually calls our function
lot()
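If you'd like to try building the simulated data set suggested above, one possible approach is sketched below. It is an alternative to calling `lot` repeatedly: the hypothetical helper `simulate_draws` uses `random.sample` from the standard library, which picks distinct values in one step, and `collections.Counter` to tally how often each number comes up. The `seed` parameter is an assumption added here so that results can be reproduced.

```python
import random
from collections import Counter

def simulate_draws(n_draws, seed=None):
    """Return a list of simulated draws, each a sorted list of 6 distinct numbers."""
    rng = random.Random(seed)
    # range(1, 50) covers the numbers 1 through 49
    return [sorted(rng.sample(range(1, 50), 6)) for _ in range(n_draws)]

# Simulate 1000 draws and count how often each number appears overall
draws = simulate_draws(1000, seed=649)
tally = Counter(number for draw in draws for number in draw)

# Show the five most frequently drawn numbers in this simulation
print(tally.most_common(5))
```

Even though every number is equally likely here, the tallies will not be perfectly even, which is worth keeping in mind when reading the bar chart of the real data.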



![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)