In [1]:
import pandas as pd
import duckdb

# Load SQL extension
%load_ext sql

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Register the DuckDB connection with the SQL magic under the alias "duckdb"
%sql conn --alias duckdb

Config,value
feedback,True
autopandas,True
displaylimit,10
displaycon,False


Powerball is a popular lottery game in the United States. Players choose five numbers from 1 to 69 and a Powerball number from 1 to 26. A player wins the jackpot by matching all five numbers plus the Powerball number. Learn more about Powerball on [Wikipedia](https://en.wikipedia.org/wiki/Powerball).

In this bonus exercise, we'll look at New York lottery Powerball data. We'll pull it into a DataFrame by reading the CSV directly from data.ny.gov.

In [None]:
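# Read the Powerball results straight from the published CSV
powerball_df = pd.read_csv("https://data.ny.gov/api/views/d6yy-54nr/rows.csv")

# Normalize column names to snake_case so they are easy to reference in SQL
powerball_df.rename(
    columns={k: k.lower().replace(" ", "_") for k in powerball_df.columns}, inplace=True
)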

The winning numbers consist of five "white balls" drawn from a pool of 69 and one "Powerball" drawn from a pool of 26, which puts the jackpot odds at 1 in 292,201,338 per play. Let's take a look at the data:
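
The quoted odds follow from counting combinations: the number of ways to choose 5 white balls out of 69 (order ignored), multiplied by the 26 possible Powerballs. A quick check using only the standard library:

```python
from math import comb

white_ball_combos = comb(69, 5)   # ways to choose 5 of 69 white balls, order ignored
powerball_choices = 26            # one Powerball out of 26
jackpot_outcomes = white_ball_combos * powerball_choices

print(f"{white_ball_combos:,} white-ball combinations")
print(f"Jackpot odds: 1 in {jackpot_outcomes:,}")
```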

In [None]:
%%sql
SELECT * FROM powerball_df LIMIT 5

Write a query that splits the winning numbers into separate columns. Your query should return a result with the columns `draw_date`, `num1`, `num2`, `num3`, `num4`, `num5`, `num6`, and `multiplier`.
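
To make the target shape concrete, here is the same split sketched in plain Python rather than SQL. It assumes the `winning_numbers` column holds six space-separated, zero-padded values; the sample row and the helper `split_winning_numbers` are made up for illustration. In the SQL cell, DuckDB string functions such as `split_part` or `string_split` can likely play the same role.

```python
# Hypothetical sample row; the real values come from powerball_df.
sample_row = {
    "draw_date": "2020-01-01",
    "winning_numbers": "05 12 33 47 61 09",
    "multiplier": 2,
}

def split_winning_numbers(row):
    """Return a new dict with winning_numbers split into num1..num6."""
    parts = row["winning_numbers"].split(" ")
    result = {"draw_date": row["draw_date"]}
    for position, value in enumerate(parts, start=1):
        result[f"num{position}"] = value  # kept as a string, padding intact
    result["multiplier"] = row["multiplier"]
    return result

split_row = split_winning_numbers(sample_row)
print(split_row)
```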

In [None]:
%%sql


Using the above as a base, write a new query that returns one row per possible number and one column per draw position, where each cell counts how many times that number was drawn in that position. Your response should look like this:

| range_str | num1_ct | num2_ct | num3_ct | num4_ct | num5_ct | num6_ct |
|----------:|--------:|--------:|--------:|--------:|--------:|--------:|
|        01 |     121 |       0 |       0 |       0 |       0 |      54 |
|        02 |     112 |       9 |       0 |       0 |       0 |      51 |
|        03 |     106 |      18 |       1 |       0 |       0 |      52 |
|        04 |      90 |      22 |       0 |       0 |       0 |      64 |
|        05 |      96 |      17 |       0 |       0 |       0 |      59 |

Hints:
- The numbers aren't actually numbers; they're left-padded strings.
- We can't be sure every number has been drawn, so the data alone may not yield a complete "index" (`range_str`). It might be best to generate the index instead.
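
Both hints can be illustrated in plain Python: generate the zero-padded index up front, then count occurrences per position against it. The draws below are made up, and the assumption is that each draw is a string of six space-separated values.

```python
from collections import Counter

# Made-up draws for illustration only: five white balls followed by the Powerball.
sample_draws = ["01 12 33 47 61 09", "01 15 33 50 69 09", "02 12 40 47 68 26"]

# Generate the full index instead of relying on which numbers happen to appear.
range_str = [f"{n:02d}" for n in range(1, 70)]

# One Counter per position (num1 .. num6).
position_counts = [Counter() for _ in range(6)]
for draw in sample_draws:
    for position, value in enumerate(draw.split(" ")):
        position_counts[position][value] += 1

# Every index value gets a row; a number never drawn in a position counts as 0.
table = {key: [counts[key] for counts in position_counts] for key in range_str}

print(table["01"])
print(table["03"])
```

Because the index is generated rather than derived from the data, a number such as `03` that never appears in the sample still gets a row of zeros.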

In [None]:
%%sql

Modify the previous query to return the _most_ common number for each draw position. Your result should look something like this:

| most_popular_num1 | most_popular_num2 | most_popular_num3 | most_popular_num4 | most_popular_num5 | most_popular_num6 |
|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| num_1             | num_2             | num_3             | num_4             | num_5             | num_6             |

In [None]:
%%sql


In addition to returning the most popular number, return the percentage of draws in which that number was drawn.
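
As a sanity check on the arithmetic, the percentage is the count of the most popular number divided by the total number of draws. A minimal sketch for a single position, using made-up values:

```python
from collections import Counter

# Made-up num1 values for illustration only, one per draw.
num1_values = ["01", "01", "02", "05"]

counts = Counter(num1_values)
most_popular, occurrences = counts.most_common(1)[0]

# Share of draws in which the most popular number appeared in this position.
pct_drawn = 100 * occurrences / len(num1_values)

print(most_popular, pct_drawn)
```

Ties are not handled specially here: `most_common` simply returns one of the tied values, so a SQL version may need an explicit tie-breaking rule.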

In [None]:
%%sql

Have the most popular numbers ever been drawn sequentially? (Don't overthink this one.)

In [None]:
%%sql


Now write a query that, for each number, returns the first date on which it was drawn in each position. The output should look something like:

| range_str |      num_1 |      num_2 |      num_3 | num_4 | num_5 |      num_6 |
|----------:|-----------:|-----------:|-----------:|------:|------:|-----------:|
|        01 | 2010-04-24 |        NaT |        NaT |   NaT |   NaT | 2010-02-13 |
|        02 | 2010-05-19 | 2014-01-22 |        NaT |   NaT |   NaT | 2010-07-28 |
|        03 | 2010-07-03 | 2010-05-29 | 2019-02-09 |   NaT |   NaT | 2010-07-03 |
|        04 | 2010-02-24 | 2011-03-12 |        NaT |   NaT |   NaT | 2010-02-06 |
|        05 | 2010-02-10 | 2011-01-26 |        NaT |   NaT |   NaT | 2010-04-24 |
|        06 | 2010-03-13 | 2010-05-26 | 2015-03-28 |   NaT |   NaT | 2010-06-16 |
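
The logic behind this table is a minimum over dates, grouped by position and number. A plain-Python sketch on made-up draws, assuming dates are ISO-formatted strings:

```python
# Made-up (draw_date, winning_numbers) pairs for illustration only.
sample_draws = [
    ("2010-03-13", "06 12 33 47 61 09"),
    ("2010-02-10", "05 12 40 47 68 26"),
    ("2010-04-24", "05 20 33 50 69 09"),
]

# first_seen[(position, number)] -> earliest date that number appeared there.
first_seen = {}
for draw_date, numbers in sample_draws:
    for position, value in enumerate(numbers.split(" "), start=1):
        key = (position, value)
        # ISO 8601 date strings sort chronologically, so min() on strings works.
        first_seen[key] = min(draw_date, first_seen.get(key, draw_date))

print(first_seen[(1, "05")])
print(first_seen.get((2, "05")))  # never drawn in position 2, so no date
```

Combinations that never occurred have no entry, which corresponds to the `NaT` cells in the expected output.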

In [None]:
%%sql


Can you find the first drawn set of numbers by windowing over the previous result?

In [None]:
%%sql


Are those the right numbers?

In [None]:
%%sql
