# Lecture #1: Course Overview
## AM 207: Advanced Scientific Computing
### Stochastic Methods for Data Analysis, Inference and Optimization
### Fall, 2019

<img src="fig/logos.jpg" style="height:150px;">

## Outline
1. What is this course about?
2. Who should take this course?
3. How is this course structured?
4. How do I get help for the course?
5. Where do I find more information?

# What is this course about?

## How Do We Model Patterns in Data?

This is a scatter plot of home prices vs square footage of some homes in southern California.

<img src="fig/fig32.jpg" style="height:350px;">

Can you see any patterns or trends?


## How Do We Model Patterns in Data?

We see that as **square footage** increases, so does **price**. 

<img src="fig/fig32.jpg" style="height:350px;">

But what is a precise, mathematical description of this relationship?

## What is a Model?

Building a model to capture a hypothesized relationship means we predict the value of one group of attributes using another group. 

This prediction problem is called ***regression***. The attribute we are trying to predict (e.g. price) is called the ***outcome*** or the ***target***, denoted by $y$. 

The group of attributes (e.g. square footage) we use to make the prediction is called the ***covariates***, denoted by $x$.

A ***regression model*** is a mathematical function, $f(x)$, that predicts the target. We denote our prediction by $\hat{y} = f(x)$. 

## What is a Model?

We conjectured that the model for this data is a line: $\hat{y} = f(x) = w_1x + w_0$.

<img src="fig/fig33.jpg" style="height:350px;">

But which line fits the data best?

## A Notion of Error

An ***absolute residual*** is the absolute difference between the actual price of a home and the price predicted by the line for a given square footage:
$$
\mathtt{Residual}_i = |y_i - \hat{y}_i|
$$

<img src="fig/fig34.jpg" style="height:350px;">

## How do we quantify the overall error?

1. **(Max absolute deviation)** Count only the biggest "error"
$$
\max_i |y_i - \hat{y}_i| 
$$
2. **(Sum of absolute deviations)** Add up all the "errors"
$$
\sum_i |y_i - \hat{y}_i| 
$$
3. **(Sum of squared errors)** Add up the squares of the "errors"
$$
\sum_i |y_i - \hat{y}_i|^2 
$$
4. **(Mean squared error)** We can also average the squared "errors".
$$
\frac{1}{N}\sum_{i=1}^N |y_i - \hat{y}_i|^2 
$$

Again, $y_i$ is the observed target, $\hat{y}_i$ is the predicted target.
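
To make the four error metrics above concrete, here is a minimal sketch that computes each of them. The numbers in `y_true` and `y_pred` are invented for illustration and are not the home price data.

```python
# Toy data, invented for illustration only (not the course data set).
y_true = [3.0, 5.0, 7.5, 9.0]   # observed targets y_i
y_pred = [2.5, 5.5, 7.0, 10.0]  # predicted targets y-hat_i from some model

# Absolute residuals |y_i - y-hat_i|
abs_residuals = [abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)]

max_abs_dev = max(abs_residuals)                 # 1. max absolute deviation
sum_abs_dev = sum(abs_residuals)                 # 2. sum of absolute deviations
sum_sq_err = sum(r ** 2 for r in abs_residuals)  # 3. sum of squared errors
mean_sq_err = sum_sq_err / len(abs_residuals)    # 4. mean squared error

print(max_abs_dev, sum_abs_dev, sum_sq_err, mean_sq_err)
```

Note that metrics 3 and 4 differ only by the constant factor $1/N$, so they are minimized by the same line.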

## Model Fitting

**Question:** What do we mean by choosing the "best" line, $\hat{y} = w_1x + w_0$? 

The ***model fitting*** process:

1. *Choose* an overall error metric. This metric is called the ***loss function***:
$$
\mathcal{L}(w_0, w_1) = \frac{1}{N}\sum_{i=1}^N |y_i - (w_1x_i + w_0)|^2, \quad\quad \text{(Mean Squared Error Loss)}
$$

2. Set up the problem of finding coefficients or ***parameters***, $w_0, w_1$, such that the loss function is **minimized**:
$$
\mathrm{argmin}_{w_0, w_1}\mathcal{L}(w_0, w_1) = \mathrm{argmin}_{w_0, w_1}\frac{1}{N}\sum_{i=1}^N |y_i - (w_1x_i + w_0)|^2 
$$

3. Choose a method of minimizing the loss function.

**Note:** For linear regression, we can minimize $\mathcal{L}$ analytically. We cannot do this for every model!
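
As an illustration of what minimizing analytically can look like, the sketch below uses the standard closed-form least-squares solution for a single covariate: $w_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $w_0 = \bar{y} - w_1\bar{x}$. The helper names `fit_line` and `mse` and the toy data are assumptions made for this example.

```python
def mse(xs, ys, w0, w1):
    """Mean squared error of the line w1 * x + w0 on the data."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Return (w0, w1) minimizing the mean squared error (closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w1 = s_xy / s_xx
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w0, w1 = fit_line(xs, ys)
print(w0, w1, mse(xs, ys, w0, w1))
```

Any other choice of $w_0, w_1$ gives a mean squared error at least as large as the fitted one on this data.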

## Linear Regression in `sklearn`

```python
# import the LinearRegression model from the sklearn library
from sklearn.linear_model import LinearRegression

# make an instance of the linear regression model
regression = LinearRegression()

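# find the coefficients for the line that minimizes mean squared error
# (x_train must be a 2D array of shape (n_samples, n_features); y_train holds the targets)
regression.fit(x_train, y_train)

# the fitted intercept w_0 and slope(s) w_1 are stored on the model
w_0, w_1 = regression.intercept_, regression.coef_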
```

## What is a Statistical Model?
Perhaps our **choice** of an overall error can be less arbitrary if we explain how we believe the residuals arise.

**Belief:** The theoretical relationship between price and square footage ($x$) is given by $f(x)$. But in real life, due to unpredictable circumstances, observed prices ($y$) differ from $f(x)$ by some random amount, $\epsilon$, called ***noise***:
$$
y = f(x) + \epsilon, \quad \epsilon \sim p(\epsilon)
$$

A ***statistical model*** is one that explicitly accounts for uncertainty or randomness. 

## A Statistical Model for Regression

Let us *assume* that (1) the underlying relationship between price and square footage $x$ is given by $f(x) = w_0 + w_1x$, and (2) the observed price $y$ deviates from $f(x)$ by a random amount that is independent of $x$ and is distributed as $\mathcal{N}(0, 1)$:

$$
y = f(x) + \epsilon, \quad \epsilon \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)
$$

Note that $y$ is now a random variable with distribution $\mathcal{N}(f(x), 1)$, denoted by $p(y|x, w_0, w_1)$.
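
One way to read this model is as a recipe for generating data. The sketch below simulates targets from it; the function name `simulate`, the parameter values $w_0 = 2$, $w_1 = 3$ and the grid of $x$ values are all assumptions chosen for illustration.

```python
import random


def simulate(xs, w0, w1, rng):
    """Draw y = w0 + w1 * x + eps with eps ~ N(0, 1), independently for each x."""
    return [w0 + w1 * x + rng.gauss(0.0, 1.0) for x in xs]


rng = random.Random(0)  # fixed seed so the run is reproducible
xs = [i / 10 for i in range(1000)]
ys = simulate(xs, w0=2.0, w1=3.0, rng=rng)

# Recover the noise terms and check that they look like N(0, 1) draws.
noise = [y - (2.0 + 3.0 * x) for x, y in zip(xs, ys)]
noise_mean = sum(noise) / len(noise)
noise_var = sum((e - noise_mean) ** 2 for e in noise) / len(noise)
print(noise_mean, noise_var)
```

With many samples, the sample mean and variance of the recovered noise should be close to 0 and 1 respectively.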

## How Do We Quantify Fitness?

Given our statistical model, a natural way for quantifying how well $f(x) = w_0 + w_1x$ fits the data is by checking how likely our choice of $w_0$ and $w_1$ makes the observed data, i.e. compute
$$
\mathcal{L}(w_0, w_1) = \prod_{n=1}^N p(y_n|x_n, w_0, w_1).
$$
The function $\mathcal{L}(w_0, w_1)$ is called the ***likelihood function***.
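
As a rough sketch of how this quantity can be computed, the code below evaluates the logarithm of the likelihood, since a product of many small densities quickly underflows in floating point. The function name `log_likelihood`, the toy data and the two candidate parameter settings are assumptions made for this example.

```python
import math


def log_likelihood(xs, ys, w0, w1):
    """Log of prod_n N(y_n; w0 + w1 * x_n, 1), accumulated as a sum of logs."""
    total = 0.0
    for x, y in zip(xs, ys):
        resid = y - (w0 + w1 * x)
        total += -0.5 * math.log(2 * math.pi) - 0.5 * resid ** 2
    return total


# Toy data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]

ll_a = log_likelihood(xs, ys, w0=1.0, w1=2.0)
ll_b = log_likelihood(xs, ys, w0=0.0, w1=0.5)
print(ll_a, ll_b)
print(math.exp(ll_a), math.exp(ll_b))  # the likelihoods themselves
```

Because the logarithm is increasing, comparing log-likelihoods orders parameter settings the same way as comparing likelihoods.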

**Exercise:** suppose we have two models, $f(x) = 2 + 3x$ and $f(x) = 10 - x$. Suppose that $\mathcal{L}(w_0=2, w_1=3) = 10.2$ and $\mathcal{L}(w_0=10, w_1=-1) = 0.002$. Which model is a better fit for the data and why?

## Model Fitting

**Question:** What do we mean by choosing the "best" line, $\hat{y} = f(x) = w_1x + w_0$? 

The ***model fitting*** process:

1. *Choose* a method of estimation for statistical models. For example, set up the problem of finding coefficients or ***parameters***, $w_0, w_1$, such that the likelihood of the data is **maximized**:
$$
\mathrm{argmax}_{w_0, w_1}\mathcal{L}(w_0, w_1) = \mathrm{argmax}_{w_0, w_1}\prod_{n=1}^N p(y_n|x_n, w_0, w_1) 
$$

2. Choose a method of computing the estimate. For example, choose a way to maximize the likelihood.

## Maximum Likelihood and Minimum Mean Square Error

Given our statistical model
$$
y = f(x) + \epsilon, \quad \epsilon \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)
$$
Maximizing the likelihood is equivalent to minimizing the mean squared error:
$$
\mathrm{argmax}_{w_0, w_1}\prod_{n=1}^N p(y_n|x_n, w_0, w_1) \equiv \mathrm{argmin}_{w_0, w_1}\frac{1}{N}\sum_{n=1}^N |y_n - (w_1x_n + w_0)|^2 
$$
$$

*Hint: note that* 
$$\prod_{n=1}^Np(y_n|x_n, w_1, w_0) = \frac{1}{\left(\sqrt{2\pi \cdot 1}\right)^N} \exp\left\{-\frac{\sum_{n=1}^N(y_n - (w_1x_n + w_0))^2}{2 \cdot 1} \right\}$$ 
*and that*
$$\log \prod_{n=1}^N p(y_n|x_n, w_1, w_0) = N\log\left(\frac{1}{\sqrt{2\pi \cdot 1}}\right) - \frac{\sum_{n=1}^N(y_n - (w_1x_n + w_0))^2}{2 \cdot 1} $$
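
The equivalence can also be checked numerically. The sketch below is not a proof. It searches a small, hypothetical grid of candidate parameters and confirms that the pair with the highest log-likelihood is also the pair with the lowest mean squared error, and that the two quantities are related by $\log \text{likelihood} = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\,\mathrm{MSE}$. The toy data and the grid are assumptions made for illustration.

```python
import math

# Toy data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]
n = len(xs)


def mse(w0, w1):
    """Mean squared error of the line w1 * x + w0 on the toy data."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys)) / n


def log_lik(w0, w1):
    """Log-likelihood under y = w0 + w1 * x + eps, eps ~ N(0, 1)."""
    return sum(
        -0.5 * math.log(2 * math.pi) - 0.5 * (y - (w1 * x + w0)) ** 2
        for x, y in zip(xs, ys)
    )


# A small grid of candidate (w0, w1) pairs.
grid = [(w0, w1) for w0 in (0.0, 0.5, 1.0, 1.5, 2.0) for w1 in (1.0, 1.5, 2.0, 2.5, 3.0)]

best_by_likelihood = max(grid, key=lambda w: log_lik(*w))
best_by_mse = min(grid, key=lambda w: mse(*w))
print(best_by_likelihood, best_by_mse)

# The log-likelihood is a decreasing affine function of the MSE.
gap = log_lik(1.0, 2.0) - (-0.5 * n * math.log(2 * math.pi) - 0.5 * n * mse(1.0, 2.0))
```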

## Model Evaluation
After fitting the model (finding coefficients that maximize the likelihood or minimize the loss function), we need to **check the error or residuals of the model**. Why?

<img src="fig/fig36.jpg" style="height:300px;">

Working with statistical models gives us an advantage in model evaluation. Can you see why?
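
One possible residual check is sketched below, under the assumption that the data really were generated by the statistical model above. It simulates data with hypothetical parameters $w_0 = 1$, $w_1 = 2$, fits a line by least squares, and compares the sample mean and variance of the residuals to the assumed noise distribution $\mathcal{N}(0, 1)$.

```python
import random

rng = random.Random(1)  # fixed seed so the run is reproducible

# Simulated data under the assumed model y = 1 + 2x + eps, eps ~ N(0, 1).
xs = [i / 200 for i in range(2000)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Least-squares fit (equivalently, maximum likelihood under this noise model).
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
w1 = s_xy / s_xx
w0 = y_bar - w1 * x_bar

# Residuals of the fitted line and their sample mean and variance.
residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
resid_mean = sum(residuals) / n
resid_var = sum((r - resid_mean) ** 2 for r in residuals) / n
print(resid_mean, resid_var)
```

If the residual variance were far from 1, or the residuals showed a trend in $x$, that would suggest the assumed model does not describe the data well.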

## Model Interpretation

In addition to evaluating our model on training and testing data, we must also examine the coefficients themselves. Why?

<img src="fig/fig35.jpg" style="height:300px;">


## What is a Bayesian Model?
In addition to a statistical model that explains trends $f(x)$ and observation noise $\epsilon$, we also want to incorporate our **prior beliefs** about the model. Finally, we want to obtain a measure of **uncertainty** for our parameter estimates as well as our predictions.

Our Bayesian model for linear regression:
\begin{aligned}
y &= w_0 + w_1x + \epsilon\\
\epsilon &\overset{\text{iid}}{\sim} \mathcal{N}(0, 1)\\
w_0 &\sim p(w_0)\\
w_1 &\sim p(w_1)\\
\end{aligned}

where the prior $p(w_1)$ may express that we want $w_1$ to be non-negative and not too large.

## Model Inference
How do we "learn" the parameters in a Bayesian model?

Bayes' Rule gives us a way to obtain a distribution over $w_0, w_1$ given the data $(x_1, y_1), \ldots, (x_N, y_N)$:

$$
p(w_0, w_1 | x_1, \ldots, x_N, y_1, \ldots, y_N) \propto \underbrace{\left(\prod_{n=1}^N p(y_n|x_n, w_0, w_1)\right)}_{\text{How well params fit the data}} \underbrace{p(w_0)p(w_1)}_{\text{How well the params fit priors}}
$$

The distribution $p(w_0, w_1 | x_1, \ldots, x_N, y_1, \ldots, y_N)$ is called the ***posterior*** and gives the probability density of a pair of parameters $w_0, w_1$ given the observed data.

We see that the posterior density of the parameters is influenced both by how well the parameters fit the data and by how well they fit our prior beliefs.
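
The right-hand side of Bayes' Rule can be evaluated directly, up to the normalizing constant. The sketch below works on the log scale, where the product becomes a sum of a data-fit term and a prior term. The helper names, the toy data and the choice of $\mathcal{N}(0, 1)$ priors for both parameters are assumptions made for illustration, since $p(w_0)$ and $p(w_1)$ are left unspecified above.

```python
import math


def normal_logpdf(value, mean, var):
    """Log density of N(mean, var) evaluated at value."""
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)


def log_unnormalized_posterior(w0, w1, xs, ys, log_prior_w0, log_prior_w1):
    """log likelihood + log prior; equals the log posterior up to a constant."""
    log_lik = sum(normal_logpdf(y, w0 + w1 * x, 1.0) for x, y in zip(xs, ys))
    return log_lik + log_prior_w0(w0) + log_prior_w1(w1)


def standard_normal_log_prior(w):
    """Hypothetical N(0, 1) prior, used here for both parameters."""
    return normal_logpdf(w, 0.0, 1.0)


# Toy data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]

score_near = log_unnormalized_posterior(
    1.0, 2.0, xs, ys, standard_normal_log_prior, standard_normal_log_prior
)
score_far = log_unnormalized_posterior(
    5.0, -1.0, xs, ys, standard_normal_log_prior, standard_normal_log_prior
)
print(score_near, score_far)
```

Parameters that fit the data poorly or that are implausible under the prior both receive a lower score.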

## Bayesian Linear Regression

When we choose normal priors for the parameters in a linear regression model, for example,

\begin{aligned}
y &= w_0 + w_1x + \epsilon\\
\epsilon &\overset{\text{iid}}{\sim} \mathcal{N}(0, 1)\\
w_0 &\sim \mathcal{N}(0, 1)\\
w_1 &\sim \mathcal{N}(0, 0.5)\\
\end{aligned}

the posterior $p(w_0, w_1 | x_1, \ldots, x_N, y_1, \ldots, y_N)$ is again a (multivariate) normal distribution, $\mathcal{N}(\mu, \Sigma)$, and we can derive closed form solutions for $\mu$ and $\Sigma$.

Why is this observation important?
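
For this two-parameter model, the closed form can be sketched in a few lines. With zero-mean normal priors, the standard result is $\Sigma = (\Sigma_0^{-1} + X^\top X / \sigma^2)^{-1}$ and $\mu = \Sigma X^\top y / \sigma^2$, where each row of $X$ is $(1, x_n)$ and $\Sigma_0$ is the prior covariance. The code assumes that the second argument of $\mathcal{N}(\cdot, \cdot)$ above is a variance, so the prior variances are 1 and 0.5. The function name `posterior_normal` and the toy data are hypothetical.

```python
def posterior_normal(xs, ys, prior_var_w0=1.0, prior_var_w1=0.5, noise_var=1.0):
    """Posterior mean and covariance of (w0, w1) under zero-mean normal priors."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Posterior precision matrix: prior precision + X^T X / noise_var.
    a00 = 1.0 / prior_var_w0 + n / noise_var
    a01 = sum_x / noise_var
    a11 = 1.0 / prior_var_w1 + sum_xx / noise_var

    # Invert the symmetric 2x2 precision matrix to get the covariance.
    det = a00 * a11 - a01 * a01
    cov = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]

    # Posterior mean: covariance times X^T y / noise_var.
    b0, b1 = sum_y / noise_var, sum_xy / noise_var
    mu = [cov[0][0] * b0 + cov[0][1] * b1, cov[1][0] * b0 + cov[1][1] * b1]
    return mu, cov


# Toy data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]

mu, cov = posterior_normal(xs, ys)
print(mu)
print(cov)
```

No iterative optimization or sampling is needed here: the whole posterior is described by `mu` and `cov`.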

## Model Evaluation

With a Bayesian model we get a distribution $p(w_0, w_1| \text{Data})$ over likely functions rather than a single function $f(x) = w_0 + w_1x$. How then do we evaluate the "error" of the model?

In the Maximum Likelihood model, we can explicitly check the correctness of our assumptions by checking the distribution of the residuals. How do we criticize a Bayesian model?

## Why is This Hard?

1. Stating that our goal is to maximize likelihood or minimize MSE is easy. Finding the optimal parameters is often very hard (especially if $f(x)$ is not linear, but rather, a complex function represented by a neural network).
<br><br>
2. If we choose more "interesting" or "expressive" priors, or if we choose more complex $f(x)$, then it is often the case that the posterior cannot be computed in closed form.

Both model fitting and inference require sophisticated algorithms derived from a deep theoretical understanding of the models.

## What is AM207?

1. Build statistical (Bayesian and non-Bayesian) models for: continuous, ordinal, categorical and sequential data
2. Study algorithms for model fitting and inference
3. Study paradigms for model evaluation and critique

**Goal:** students become familiar with standard statistical models and modern techniques of inference. At the end of the course you should be able to productively read current machine learning research papers.

**Focus:** computational aspects of inference.

**Related Courses:** Bayesian Inference (Stats), Advanced Machine Learning (CS)

# Who should take this class?

## Suggested Pre-Reqs for AM207
1. Fluency in a high-level programming language (preferably `python`)
2. A multivariable-calculus-based statistics course
3. CS109 A, B: Data Science (strongly recommended)

**Disclaimer:** in the past, students have successfully completed the course having major gaps in their preparation, but the effort it takes to overcome these gaps can be extraordinary.

**Is this course right for me?** Homework #0 reviews the "assumed background" for the course. Although it is longer than a typical homework for this class, it gives a good indication of the type of theoretical and computational tasks that will appear on every homework. Use Homework #0 to gauge your preparation and how time consuming the course will be.

## What technology do you need for AM207
Homework will be completed in Jupyter Notebooks.

You have two options:
1. Download the latest Anaconda `python` 3.x distribution on your personal machine
2. Complete homework using Google Colab - a free cloud computing service that comes with pre-installed machine learning tools. Colab is built on Jupyter Notebooks, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. 

## Lab #1
In this Friday's lab, we will review some important aspects of scientific computing and statistics you need for the course. 

You'll have a chance to get familiar with Jupyter Notebook (or Colab) and various `python` libraries we'll be using frequently.

# How is this course structured?

## Meetings

There are two weekly lectures and one lab.

The lab is a structured instructor office hour, wherein we focus on specific tasks and concepts that supplement the lecture or prepare for the homework.

## Graded Components

1. 10 equally weighted weekly homework assignments
2. 1 group project

Each homework will be a combination of derivations/proofs (theory) and programming (implementation).

The group project involves choosing one pre-approved research paper and producing a tutorial in Jupyter Notebook to demonstrate the concepts and methodologies in the paper.

## Policies

**Grading:** Unreadable formatting or code with syntactic or runtime errors will not be graded. A "right" answer without a (brief) justification will not receive full score. You can drop your lowest HW grade.

**Late HW:** Each student has two late days that can be applied to any one or two homework assignments. Outside of late days, late submissions will not be accepted.

**Collaboration:** Collaboration is strongly encouraged, but copying is strictly not allowed (see policy on Syllabus).

**Attendance:** Attendance is not required but strongly suggested. For FAS students, attendance may be taken into account when determining priority during office hours.

# How do I get help for the course?

## Teaching Staff

**Instructor:** Weiwei Pan

**TFs:** 
- Jianzhun Du
- Meng Dong
- Yuhao Lu
- Shu Xu
- Sijie Sun
- Kela Roberts (Extension School)

## Office Hours

There is one TF office hour every weekday.

There are two instructor office hours: 
1. Tuesday for unstructured Q&A
2. Friday for structured help (Lab)

## Remote Learning Students

Tuesday's instructor office hours will have a remote dial-in option.

The dedicated TF for Extension School students is Kela Roberts, who will answer questions via email (see email policy on Syllabus).

## Piazza

There is a course Piazza to facilitate collaboration amongst students. 

Teaching staff moderate the discussions but are **not responsible for answering questions**! 

If you want help from the staff come to an office hour or schedule a meeting.

# Where can I find more information?

## Course Canvas
There is a course canvas with:
0. meeting times and location; office hour times and location
1. course syllabus and schedule
2. weekly summaries: lecture notes, lab notebooks, homework, homework solutions
3. information about the project (to appear)
4. all course announcements
5. attendance quizzes
6. link to the course piazza