# PSTAT 234 - Data and Uncertainty <a class="tocSkip">

## Sang-Yun Oh <a class="tocSkip">

# Coin Flips Re-visited

%%html
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/486jBqQEYs0?start=65" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

<div>
<img src="images/coinflip.png" width=500>
</div>

- Coin flips: https://www.youtube.com/watch?v=AYnJv68T3MM
- Fair dice: https://www.youtube.com/watch?v=G7zT9MljJ3Y

- Many real-world processes are much more complicated; however, we still try to model them:

    > _Now it would be very remarkable if any system existing in the real world could be 
    exactly represented by any simple model. However, cunningly chosen parsimonious models 
    often do provide remarkably useful approximations. 
    [... example of modeling gas in physics ...]  
    For such a model there is no need to ask the question "Is the model true?". 
    If "truth" is to be the "whole truth" the answer must be "No". 
    The only question of interest is "Is the model illuminating and useful?"._  
    George Box

# Q: Is my model useful? <a class="tocSkip">

# A: It depends on what you want to do with it <a class="tocSkip">

https://www.forbes.com/sites/alexknapp/2012/10/27/scientists-beat-the-house-at-roulette-with-chaos-theory/#6380a4bf710d

![Fallingwater - Frank Lloyd Wright](https://lc-imageresizer-live-s.legocdn.com/resize/lego_21005_prod_pri_1488?width=1128&ratio=1&imageUrl=https%3A%2F%2Fwww.lego.com%2Fr%2Fwww%2Fr%2Fcatalogs%2F-%2Fmedia%2Fcatalogs%2Fproducts%2Fproduct%2520portal%2Fpri_1488%2Flego_21005_prod_pri_1488.jpg%3Fl.r%3D-410479427)
[Fallingwater - Frank Lloyd Wright](https://en.wikipedia.org/wiki/Fallingwater)

# Models and Data

- In probability, underlying processes are represented by __models__ (often distributions)   
    e.g., flipping a coin, rolling a die, etc.
   

- These models are often based on assumptions:  
    e.g., a binomial distribution cannot model edge landings

- How much data is needed to learn a 3-state multinomial random variable?

- Choosing a model is important: simple vs. complex models

- _As model complexity goes up, the required amount of data increases_:  
    e.g., a two-sided coin vs. a 30-sided die

- Deep learning models are highly flexible, so massive amounts of data are needed to train them
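
The coin-versus-die comparison above can be checked with a small simulation. The sketch below is only an illustration under assumptions not in the original slides: both objects are fair, the probabilities are estimated by relative frequencies, and the error is measured by total variation distance (half the sum of absolute differences). The function names are hypothetical.

```python
import random

def total_variation_error(num_sides, num_rolls, rng):
    """Roll a fair die num_rolls times and return the total variation
    distance between the estimated and the true face probabilities."""
    counts = [0] * num_sides
    for _ in range(num_rolls):
        counts[rng.randrange(num_sides)] += 1
    true_prob = 1 / num_sides
    return 0.5 * sum(abs(c / num_rolls - true_prob) for c in counts)

def average_error(num_sides, num_rolls, repeats=200, seed=0):
    """Average the estimation error over many repeated experiments."""
    rng = random.Random(seed)
    errors = [total_variation_error(num_sides, num_rolls, rng)
              for _ in range(repeats)]
    return sum(errors) / repeats

# Compare a two-sided coin with a 30-sided die at several sample sizes.
for num_sides in (2, 30):
    for num_rolls in (10, 100, 1000):
        error = average_error(num_sides, num_rolls)
        print(f"sides={num_sides:2d} rolls={num_rolls:4d} avg error={error:.3f}")
```

With the same number of rolls, the 30-sided die should show a noticeably larger average error than the coin, because its observations are spread over many more parameters.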

## Model Fit to Data

- Data: {Heads, Heads, Tails}

- Model: Bernoulli random variable  
    $$ p(x) = \theta^{x}(1-\theta)^{1-x} $$

- Model fit to data: $\hat\theta = 2/3$

- Does model fit data well? Yes

- Will it generalize well to more coin flips? No
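
A minimal sketch of the fit above, assuming $\hat\theta$ is the maximum-likelihood estimate, which for a Bernoulli model is the sample proportion of heads. The simulation part is an added illustration (a fair coin with $\theta = 0.5$ is an assumption, and `simulate_theta_hat` is a hypothetical helper): it lists every value the estimate can take when only three flips are observed.

```python
import random
from fractions import Fraction

# Heads, Heads, Tails coded as 1, 1, 0.
flips = [1, 1, 0]

# Bernoulli maximum-likelihood estimate: the sample proportion of heads.
theta_hat = Fraction(sum(flips), len(flips))
print("theta_hat =", theta_hat)  # prints: theta_hat = 2/3

def simulate_theta_hat(num_flips, theta, rng):
    """Flip a coin with P(heads) = theta and return the fitted theta_hat."""
    heads = sum(rng.random() < theta for _ in range(num_flips))
    return Fraction(heads, num_flips)

# Refit the model on many independent three-flip data sets from a fair coin.
rng = random.Random(0)
estimates = [simulate_theta_hat(3, 0.5, rng) for _ in range(1000)]
print("distinct estimates:", sorted(set(estimates)))
```

With three flips the estimate can only be 0, 1/3, 2/3, or 1, so it can never equal 0.5 even when the coin is exactly fair. The model reproduces the observed data but says little about future flips.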

## Model Complexity and Data

- How do we choose the complexity of a model?

- Depends on complexity of reality and availability of data

- Dependency on data:

![curvefit](https://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png)

- A fair comparison of methods requires a new set of data (out-of-sample data)

- **Goal: choose a model that is complex enough to fit the current data and still generalizes well to new data**
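
The trade-off can be sketched without any plotting library. Note the substitution: instead of the polynomial fits shown in the figure, the example below fits step functions (per-bin averages), where the number of bins plays the role of model complexity. The true curve, noise level, and sample sizes are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
import random

def true_curve(x):
    # Hypothetical ground truth used only to generate data.
    return math.cos(1.5 * math.pi * x)

def make_data(n, rng, noise_sd=0.1):
    """Draw x uniformly on [0, 1) and observe the curve with Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [true_curve(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_step_function(xs, ys, num_bins):
    """Fit one average per bin; empty bins fall back to the overall mean."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for x, y in zip(xs, ys):
        b = min(int(x * num_bins), num_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    return [s / c if c else overall for s, c in zip(sums, counts)]

def mse(model, xs, ys):
    """Mean squared error of a fitted step function on (xs, ys)."""
    k = len(model)
    return sum((model[min(int(x * k), k - 1)] - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(30, rng)    # data used for fitting
test_x, test_y = make_data(1000, rng)    # out-of-sample data

results = {}
for num_bins in (1, 4, 16, 64):
    model = fit_step_function(train_x, train_y, num_bins)
    results[num_bins] = (mse(model, train_x, train_y),
                         mse(model, test_x, test_y))
    print(f"bins={num_bins:2d} train MSE={results[num_bins][0]:.4f} "
          f"test MSE={results[num_bins][1]:.4f}")
```

Because each bin count refines the previous one, the training error can only go down as bins are added. The out-of-sample error is the number to compare across models, since it is the one that reflects generalization.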

## Data Amount/Quality and Models

- Data represents the real world to the analyst

- If the chosen model is correct but the data doesn't represent reality, the model fit will be bad:  
    e.g., the model allows coin edge-landings, but the data doesn't contain any

- If the chosen model is not rich enough, the model fit cannot capture the full reality:  
    e.g., the model has no edge-landing outcome, but the data contains observed edge-landings

- If we don't have enough data to learn the model reliably, the model will not generalize to new data:  
    e.g., a model fit with three coin flips will not generalize well

# Models in the Real World

- However, building models from the ground up is often practically impossible

- e.g., where do you begin to build a model for [NCAA tournament](https://www.forbes.com/sites/kurtbadenhausen/2019/03/14/2019-march-madness-bracket-contest-prize-1-million-a-year-for-life-from-warren-buffett/#2ceb486a6755)? human behavior?

- Assuming there is a "useful" model, we can generate new data for comparison etc.

- Monte Carlo simulation 

## Monte Carlo simulation

![Monte Carlo Casino](https://asset.montecarlosbm.com/styles/hero_desktop_wide_responsive/public/media/orphea/sbm_casinos_10_2017_place_du_casinosoir_16_id111623_rsz.jpg.jpeg?itok=tjcPjZIV)

### Monte Carlo simulation of card games

- Dealing cards is simple

- Repeatedly play games with simulated hands to determine outcomes

- Calculate probabilities of card game outcomes by counting or averaging
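
As a sketch of these steps, the example below deals random five-card hands and counts how often at least two cards share a rank. The choice of event, the card encoding, and the function names are assumptions made for illustration. The exact probability is available in closed form for this event, so the simulated frequency can be checked against it.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "CDHS"
DECK = [rank + suit for rank in RANKS for suit in SUITS]  # 52 cards

def hand_has_pair(hand):
    """Return True if at least two cards in the hand share a rank."""
    ranks = [card[0] for card in hand]
    return len(set(ranks)) < len(ranks)

def estimate_pair_probability(num_games, hand_size=5, seed=0):
    """Deal num_games random hands and return the fraction with a shared rank."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_games):
        hand = rng.sample(DECK, hand_size)  # deal without replacement
        hits += hand_has_pair(hand)
    return hits / num_games

# Exact value: 1 - P(all five ranks differ).
exact = 1 - (48 * 44 * 40 * 36) / (51 * 50 * 49 * 48)
estimate = estimate_pair_probability(20_000)
print(f"simulated: {estimate:.4f}  exact: {exact:.4f}")
```

For events without a convenient closed form, only the simulated frequency is available, which is where the approach earns its keep.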

### Monte Carlo simulation of complex processes

- Complex process $f$ with random input $X$, i.e., the process of interest is $f(X)$

- Simulate data $X$, apply process $f$

- Calculate some probability or expected value (average)
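
The three steps can be sketched as follows. The process `f` here is a deliberately simple stand-in ($f(x) = x^2$ with $X \sim N(0, 1)$), chosen as an assumption so that the answers are known: $E[f(X)] = 1$ and $P(f(X) > 1) \approx 0.317$. A realistic $f$ would be too complex for such a check.

```python
import random

def f(x):
    # Hypothetical stand-in for a complex process.
    return x ** 2

def monte_carlo(num_draws, seed=0):
    """Simulate X, apply f, and average to estimate E[f(X)] and P(f(X) > 1)."""
    rng = random.Random(seed)
    outputs = [f(rng.gauss(0, 1)) for _ in range(num_draws)]
    expected_value = sum(outputs) / num_draws
    prob_exceeds_one = sum(y > 1 for y in outputs) / num_draws
    return expected_value, prob_exceeds_one

expected_value, prob_exceeds_one = monte_carlo(100_000)
print(f"E[f(X)] estimate: {expected_value:.3f}")
print(f"P(f(X) > 1) estimate: {prob_exceeds_one:.3f}")
```

Both quantities are plain averages over simulated outputs; a probability is just the average of an indicator.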

### Monte Carlo simulation for Data Science?

- Monte Carlo simulation generates fake data. How is this useful?

- If $f$ is fairly accurate (big assumption), we have ground truth

- We learn $\hat f$ from (fake) data

- We can compare $\hat f$ to $f$ without worrying about data availability
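
A minimal sketch of this workflow, under assumptions that are not in the original: the ground truth is a hypothetical linear process $f(x) = 2x + 1$, observations carry Gaussian noise, and $\hat f$ is fit by ordinary least squares.

```python
import random

def f(x):
    # Known ground truth (hypothetical); in practice this is the simulator.
    return 2.0 * x + 1.0

def simulate(n, rng, noise_sd=0.3):
    """Generate fake data: inputs on [0, 1) and noisy outputs of f."""
    xs = [rng.random() for _ in range(n)]
    ys = [f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for a line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)
xs, ys = simulate(5000, rng)
slope_hat, intercept_hat = fit_line(xs, ys)

def f_hat(x):
    return slope_hat * x + intercept_hat

# Because f is known, the learned f_hat can be compared to it directly.
max_gap = max(abs(f_hat(i / 100) - f(i / 100)) for i in range(101))
print(f"slope={slope_hat:.3f} intercept={intercept_hat:.3f} max gap={max_gap:.3f}")
```

The sample size is a free parameter here, so the same comparison can be repeated with less or more data to see how quickly $\hat f$ approaches $f$.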

### Generative Adversarial Network?

- GANs are neural network models used for generating fake data. How is this different?

- Instead of specifying $f$ directly, train $f$ from data

- GANs can adapt to intricate data because the model is more expressive than any analytical $f$

- Can generate additional data where there is no suitable alternate data source

- A useful data-generating tool, but it comes with the dangers of any trained model

### Examples of GANs

- Illustrative example: [GAN Lab](https://poloclub.github.io/ganlab/)

- Finance: [Generating realistic stock market data for deeper financial research](https://news.engin.umich.edu/2020/02/generating-realistic-stock-market-data-for-deeper-financial-research/)

- Economics: [Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations](https://www.nber.org/papers/w26566)

- Particle physics: [Generative Adversarial Networks for Physics Simulation](https://mickypaganini.github.io/atlasML.html)

- Natural language processing: [Text generation](https://paperswithcode.com/task/text-generation)