## Introductory experiment: drawing cards from two stacks

- Imagine a game with **two big stacks of cards**.
- Each stack contains **winning cards** and **blanks**.
- You have to decide **which stack has more wins**.
- How many pairs of cards (one from each stack) do you have to draw to decide?

In [11]:
# setup notebook
from cards_code_altair import *

In [12]:
# create data for virtual experiments
df = repeated_experiments_df(
    p_win_1   = 0.5,
    p_win_2   = 0.4,
    n_cards   = 100,
    n_repeats = 50,
)
df.head()

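The helper `repeated_experiments_df` lives in `cards_code_altair` and is not shown in this notebook. As a rough idea of what such a helper might produce, here is a minimal sketch using only the standard library. The function name `simulate_experiments`, the `seed` parameter, and the row layout are assumptions; the column names (`experiment`, `card_pair`, `stack`, `win`) are taken from the charts further down. Each card is treated as an independent draw, which is reasonable only if the stacks are big enough that drawing does not noticeably change the share of wins.

```python
import random

def simulate_experiments(p_win_1, p_win_2, n_cards, n_repeats, seed=0):
    """Hypothetical stand-in for repeated_experiments_df: one row per drawn card."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rows = []
    for experiment in range(1, n_repeats + 1):
        for card_pair in range(1, n_cards + 1):
            for stack, p_win in ((1, p_win_1), (2, p_win_2)):
                rows.append({
                    "experiment": experiment,
                    "card_pair": card_pair,
                    "stack": stack,
                    "win": int(rng.random() < p_win),  # 1 = winning card, 0 = blank
                })
    return rows

rows = simulate_experiments(p_win_1=0.5, p_win_2=0.4, n_cards=100, n_repeats=50)
print(len(rows))  # 50 repeats * 100 card pairs * 2 stacks
print(rows[0])
```

If pandas is available, `pd.DataFrame(rows)` turns this list into a table with one column per key.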

In [None]:
# transparently export data if needed
# if you have issues viewing the plots, try embedding the data
# this will increase the file size!
data = df_to_datasource(
    df, embed_in_notebook=True, embed_in_slides=False
)

### Draw cards from two different stacks, one from each stack at a time
- What can we know after drawing a certain number of cards?
- When have we drawn enough cards to be certain?

The next three figures are interactive! Unfortunately, they won't work on GitHub; in that case, please keep scrolling.

In [None]:
# Have "Lato", the font used by the "simple" reveal.js theme, available in notebook
from IPython.display import HTML
HTML("""<style>@import url(https://fonts.googleapis.com/css?family=Lato:400,700,400italic,700italic);</style>""")

In [None]:
#https://towardsdatascience.com/consistently-beautiful-visualizations-with-altair-themes-c7f9f889602
def slide_theme():
    # Typography
    font = ["Lato",  "Helvetica Neue", "Helvetica", "Arial", "Sans Serif"]
    # At Urban it's the same font for all text but it's good to keep them separate in case you want to change one later.
    labelFont = font
    sourceFont = font
    base_font_size = 16
    # Axes
    #axisColor = "#000000"
    #gridColor = "#DEDDDD"
    # Colors
    #main_palette = ["#1696d2", 
    #                "#d2d2d2",
    #                "#000000", 
    #                "#fdbf11", 
    #                "#ec008b", 
    #                "#55b748", 
    #                "#5c5859", 
    #                "#db2b27", 
    #               ]
    #sequential_palette = ["#cfe8f3", 
    #                      "#a2d4ec", 
    #                      "#73bfe2", 
    #                      "#46abdb", 
    #                      "#1696d2", 
    #                      "#12719e", 
    #                     ]
    #axisColor = 'black'
    #markColor = 'black'
    return {
        # width and height are configured outside the config dict because they are Chart configurations/properties not chart-elements' configurations/properties.
        #"width": 685, # from the guide
        #"height": 380, # not in the guide
        "config": {
            "title": {
                "fontSize": base_font_size * 1.5,
                "font": font,
                "fontWeight": 'normal',
               #"anchor": "start", # equivalent of left-aligned.
                #"fontColor": "#000000"
            },
            "axisX": {
               # "domain": True,
               # "domainColor": axisColor,
                #"domainWidth": 2,
               # "grid": False,
                "labelFont": labelFont,
                "labelFontSize": base_font_size,
                #"labelAngle": 0, 
                #"tickColor": axisColor,
                #"tickSize": 5, # default, including it just to show you can change it
                #"tickWidth": 2,
                "titleFont": font,
                "titleFontSize": base_font_size,
                "titleFontWeight": 'normal',
                "titlePadding": 10, # guessing, not specified in styleguide
                #"title": "X Axis Title (units)", 
            },
            "axisY": {
                #"domain": False,
                #"domainWidth": 2,
                #"grid": True,
                #"gridColor": gridColor,
                #"gridWidth": 1,
                "labelFont": labelFont,
                "labelFontSize": base_font_size,
                #"labelAngle": 0, 
                #"ticks": False, # even if you don't have a "domain" you need to turn these off.
                #"tickWidth": 2,
                "titleFont": font,
                "titleFontSize": base_font_size,
                "titleFontWeight": 'normal',
                "titlePadding": 10, # guessing, not specified in styleguide
                #"title": "Y Axis Title (units)", 
                ## titles are by default vertical left of axis so we need to hack this 
                #"titleAngle": 0, # horizontal
                #"titleY": -10, # move it up
                #"titleX": 18, # move it to the right so it aligns with the labels 
            },
            #"range": {
            #    "category": main_palette,
            #    "diverging": sequential_palette,
            #},
            "legend": {
                "labelFont": labelFont,
                "labelFontSize": base_font_size,
                #"symbolType": "square", # just 'cause
                #"symbolSize": 100, # default
                "titleFont": font,
                "titleFontSize": base_font_size,
                "titleFontWeight": 'normal',
                #"title": "", # set it to no-title by default
                #"orient": "top-left", # so it's right next to the y-axis
                #"offset": 0, # literally right next to the y-axis.
            },
            #"view": {
            #    "stroke": "transparent", # altair uses gridlines to box the area where the data is visualized. This takes that off.
            #},
            #"background": {
            #    "color": "#FFFFFF", # white rather than transparent
            #},
            ### MARKS CONFIGURATIONS ###
            #"area": {
            #   "fill": markColor,
            #},
            #"line": {
            #   "color": markColor,
            #   "stroke": markColor,
            #   "strokeWidth": 5,
            #},
            #"trail": {
            #   "color": markColor,
            #   "stroke": markColor,
            #   "strokeWidth": 0,
            #   "size": 1,
            #},
            #"path": {
            #   "stroke": markColor,
            #   "strokeWidth": 0.5,
            #},
            #"point": {
            #   "filled": True,
            #},
            "text": {
               "font": sourceFont,
               #"color": markColor,
               "fontSize": base_font_size * .9,
               #"align": "right",
               #"fontWeight": 400,
               #"size": 11,
            }, 
            #"bar": {
            #    "size": 40,
            #    "binSpacing": 1,
            #    "continuousBandSize": 30,
            #    "discreteBandSize": 30,
            #    "fill": markColor,
            #    "stroke": False,
         
            #}
        }
    }
alt.themes.register("statistics_slide_theme", slide_theme)
alt.themes.enable("statistics_slide_theme");

# the three dots menu is useful but a bit distracting in slides
alt.renderers.set_embed_options(actions=False);

In [None]:
# define input selection
input_n_cards = alt.binding(
    input='range',
    min=1,
    max=min(df.card_pair.max(), 40), 
    step=1, 
    name='Draw Card Pairs: '
)
selection = alt.selection_single(
    bind=input_n_cards,
    init={'card_pair': 1}
)

# filter data & plot bar chart
alt.Chart(data).mark_bar().encode(
    alt.X('stack:N', title='Card Stack'),
    alt.Y('sum(win):Q', axis=alt.Axis(title='Number of Wins', tickMinStep=1)),
    color=alt.Color('stack:N', legend=None)
).transform_filter(
    "datum.experiment == 1"
).add_selection(
    selection
).transform_filter(
    (alt.datum.card_pair<=selection.card_pair)
).properties(
    width=200,
    height = 250
).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
).display(renderer='svg')

.

.

**Spoilers below!**

.

.

.

Scroll down when you are finished drawing cards

.

.

.

.

.

.

.

.

.

.

.

.

.



### It can take a while to see which stack is better!

In [None]:
# plot ticks for each win per stack
plot_width = 500
ticks = alt.Chart(data).mark_tick(thickness=1.5).transform_filter(
    "(datum.experiment == 1) && (datum.win > 0)"
).encode(
    alt.X(
        'card_pair:Q', 
        title='Cards drawn per stack', 
        axis=alt.Axis(ticks=False, grid=False, labels=False, title='Wins', orient='top')
    ),
    alt.Y(
        'stack:Q', 
        title='Stack', 
        scale=alt.Scale(domain=[2,0]),
        axis=None
    ),
    color=alt.Color('stack:N', title='Stack'),
).properties(
    width  = plot_width,
    height = 35,
    view=alt.ViewConfig(strokeWidth=0)
)

# plot cumulative wins per stack
lines = alt.Chart(data).mark_line().transform_window(
    cumulative_wins='sum(win)',
    frame=[None, 0],
    groupby=['experiment','stack']
).transform_filter(
    "datum.experiment == 1" 
).encode(
    alt.X('card_pair:Q', title='Cards drawn per stack'),
    alt.Y('cumulative_wins:Q', title='Number of wins'),
    color=alt.Color('stack:N', title='Stack')
).properties(
    width  = plot_width,
    height = 265
)

# combine & show plots
alt.vconcat(ticks, lines).display(renderer='svg')

### Repeat the experiment
- Draw a fixed number of cards from each stack
- Estimate each stack's winning probability as number of wins / number of cards
- Repeat 50 times
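The three steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code behind the figures: the function `estimate_p_win` is hypothetical, and the probabilities 0.5 and 0.4 are the ones passed to `repeated_experiments_df` at the top of the notebook. The loop also counts how often stack 1 looks better than stack 2 for different numbers of cards.

```python
import random

def estimate_p_win(p_win, n_cards, rng):
    """Draw n_cards from one stack and return number of wins / number of cards."""
    wins = sum(rng.random() < p_win for _ in range(n_cards))
    return wins / n_cards

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n_repeats = 50

for n_cards in (5, 25, 100):
    stack_1_ahead = 0
    for _ in range(n_repeats):
        p1 = estimate_p_win(0.5, n_cards, rng)  # stack 1
        p2 = estimate_p_win(0.4, n_cards, rng)  # stack 2
        stack_1_ahead += p1 > p2
    print(f"{n_cards:3d} cards per stack: stack 1 ahead in {stack_1_ahead}/{n_repeats} repetitions")
```

Ties and repetitions where stack 2 happens to come out ahead both count against stack 1 here, so the numbers are a conservative view of how often the better stack is identified.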

### Each repetition yields a different winning probability for each stack. 

In [None]:
plot_height = 225

# define input selection
input_n_cards = alt.binding(
    input='range',
    min=1,
    max=df.card_pair.max().astype(int), 
    step=1, 
    name='Card pairs per Experiment: '
)
selection = alt.selection_single(
    bind=input_n_cards,
    init={'card_pair': 25}
)

# filter data
base = alt.Chart(data).add_selection(
    selection
).transform_filter(
    (alt.datum.card_pair<=selection.card_pair)
).transform_aggregate(
    p_win='mean(win):Q',
    groupby=["stack","experiment"]
)
scale = alt.Scale(domain=[-.1,1.1])

# plot individual experiments
dots = base.mark_point().encode(
    x=alt.X('experiment:Q', title='Experiment'),
    y=alt.Y(
        'p_win:Q', 
        scale=scale, 
        axis=alt.Axis(
            values=np.arange(0,1.1,.2),
            grid=True
        ), 
        title='Winning probability'
    ),
    color='stack:N'
).properties(
    width=400,
    height=plot_height
)

# plot histogram over experiment results
hist = base.mark_bar(opacity=.7).encode(
    y=alt.Y(
        'p_win:Q', 
        bin=alt.Bin(extent=[0,1], step=.1),
        scale=scale, 
        axis=alt.Axis(
            values=np.arange(0,1.1,.2), 
            title=None, 
            labels=False,
            grid=True
        ), 
    ),
    x=alt.X(
        'count(experiment):Q', 
        stack=None, # no stacked bar chart
        title='Number of experiments'
    ),
    color=alt.Color('stack:N', legend=alt.Legend(title='Stack')),
    order='stack:Q'
).properties(
    width=150,
    height=plot_height
)

# define custom label for standard deviation bars
std_text = alt.Chart(
    pd.DataFrame({'x':[4], 'y':[df.win.mean()], 'text':['Mean ± Standard Deviation']})
).mark_text(
    angle=90, baseline='middle'
).encode(
    x='x', y='y', text='text'
)

# plot standard deviation bars
errorbars = alt.layer(
    base.mark_errorbar(extent='stdev', rule=alt.MarkConfig(size=2)).encode(
        y=alt.Y('p_win:Q',
            axis=None,#alt.Axis(orient='right', ticks=False, labels=False, grid=False, style=None), 
            scale=scale,
        ),
        x=alt.X('stack:Q', axis=None),
        color='stack:N',
    ),
    base.mark_point(size=0).encode(
        y=alt.Y('p_win:Q', aggregate='mean', scale=scale),
        x=alt.X('stack:Q', scale=alt.Scale(domain=[1,4])),
    ),
    std_text,
    view=alt.ViewConfig(strokeWidth=0),
).properties(
    width=25,
    height=plot_height
)

# combine & show
alt.concat(dots, hist, errorbars).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
).display(renderer='svg')

- A **histogram** shows how often we observed each outcome.
- The **standard deviation (std.)** over the repetitions measures the uncertainty of the estimate.
- It corresponds to the **standard error** of the mean (here, the estimated winning probability) for a single experiment.
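To make the link between the two quantities concrete, the sketch below compares the standard deviation over simulated repetitions with the textbook standard error of a proportion, sqrt(p * (1 - p) / n). It assumes independent draws with a known winning probability; the function name `standard_error` and the chosen values (p = 0.5, 25 cards, 50 repetitions) are illustrative and only mirror the settings used earlier in the notebook.

```python
import math
import random
import statistics

def standard_error(p_win, n_cards):
    """Standard error of an estimated proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p_win * (1 - p_win) / n_cards)

rng = random.Random(2)  # fixed seed so the sketch is reproducible
p_win, n_cards, n_repeats = 0.5, 25, 50

# One estimated winning probability per repetition
estimates = [
    sum(rng.random() < p_win for _ in range(n_cards)) / n_cards
    for _ in range(n_repeats)
]

print("std. over repetitions:  ", round(statistics.stdev(estimates), 3))
print("analytic standard error:", round(standard_error(p_win, n_cards), 3))
```

With only 50 repetitions the two numbers will not match exactly, but they should be close. In a real experiment the true p is unknown, so the estimated probability is plugged into the formula instead.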

In [None]:
# export to slideshow AFTER saving the notebook
#! export CONVERT=TRUE; jupyter nbconvert --execute 1b_cards_altair.ipynb --to slides --no-input

In [None]:
# view the slides
#! python -m http.server

### Some explanations 
If you pass Altair a DataFrame, all the data will be embedded in the plot. This can blow up the file size quite quickly. Therefore, Altair has a default limit of 5000 rows and raises an error (`MaxRowsError`) beyond that. You can switch the check off using `alt.data_transformers.disable_max_rows()`, but you will have to live with the consequences!

A better way is to load the data from a URL. The only issue is that you then have to pass a URL that is accessible via the Jupyter server. You can find it by right-clicking the file in the Jupyter file browser and opening it in a new browser tab. However, if you intend to export the notebook as slides, this URL ends up in your slides. In that case, you have to specify a URL where you will host the file and where it will be accessible to viewers of the slides.html. Referencing local files doesn't work, even when opening the slides locally (without a web server), because modern browsers block this as a security feature. So the best approach is to check the file in to GitHub (or upload it somewhere else) first and then use that URL as the source for your chart.

A convenient way of working in the notebook is to use Altair's data transformers, which do all that saving and referencing transparently for you. That's really nice for exploring larger datasets, but it won't help with distributing your notebooks.

Finally, you could pre-aggregate everything before plotting and use the default embedding behaviour. That works, but it might rule out certain kinds of interactivity.
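As a rough illustration of the pre-aggregation option, the sketch below builds a hypothetical long-format table (one row per experiment, card pair and stack, with the column names used by the charts above) and reduces it to one row per experiment and stack. It uses only the standard library; the data are simulated here and are not the notebook's actual `df`.

```python
import random
from collections import defaultdict

# Hypothetical long-format data: one row per experiment, card pair and stack
rng = random.Random(3)  # fixed seed so the sketch is reproducible
rows = [
    {"experiment": e, "card_pair": c, "stack": s, "win": int(rng.random() < p)}
    for e in range(1, 51)             # 50 repetitions
    for c in range(1, 101)            # 100 card pairs
    for s, p in ((1, 0.5), (2, 0.4))  # two stacks
]

# Collect the wins per (experiment, stack) group
wins = defaultdict(list)
for row in rows:
    wins[(row["experiment"], row["stack"])].append(row["win"])

# One row per group: the mean of `win` is the estimated winning probability
aggregated = [
    {"experiment": e, "stack": s, "p_win": sum(w) / len(w)}
    for (e, s), w in sorted(wins.items())
]

print(len(rows), "rows before,", len(aggregated), "rows after")
# prints: 10000 rows before, 100 rows after
```

The aggregated table is far below the 5000-row limit, but the `card_pair` column is gone, so a slider that filters on the number of cards drawn has nothing left to filter. With pandas, the same reduction would be `df.groupby(["experiment", "stack"])["win"].mean()`.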