# Harvard University course, CS109 Data Science.  
Slides1: https://github.com/cs109/2015/blob/master/Lectures/07-BiasAndRegression.pdf  
Slides2: https://github.com/cs109/2015/blob/master/Lectures/08-RegressionContinued.pdf  
Slides3: https://github.com/cs109/2015/blob/master/Lectures/09-ClassificationPCA.pdf  
Slides4: https://github.com/cs109/2015/blob/master/Lectures/10-SVMAndEvaluation.pdf

## Bias and Regression

### Bias Types
* **Selection Bias**. Where did the data come from?
* **Publication Bias**. What percentage of scientific discoveries are replicable? If you try to reproduce the results, will you succeed? How true are the discoveries?
* **Non-response Bias**. What if the people who didn't answer the survey questions are an important group of people?
* **Length Bias**. For example, you want to measure the average prison sentence. If you show up at a random point in time, you will most probably see the prisoners who are there for a long time, and you will not meet many people who are serving only a few days or weeks.

### Mathematical/Statistical Definition of Bias

The bias of an estimator is how far off it is on average: 

\begin{equation*}
bias(\hat{\theta}) = E(\hat{\theta}) - \theta   
\end{equation*}

*where $\theta$ is what we are trying to estimate, and $\hat{\theta}$ is its estimator*.

The question may arise: why not subtract the bias? 
Because
* We don't know the bias. We can try to estimate it, but that estimate will have bias as well.
* Of the **Bias-Variance Tradeoff**: reducing the bias can increase the variance. The tradeoff is very often formulated the following way  

\begin{equation*}
MSE(\hat{\theta}) = VAR(\hat{\theta}) + bias^2(\hat{\theta})
\end{equation*}

**MSE** is the **Mean Squared Error**, the most common way to measure how good your estimator is (on average, in terms of squared distance, how far off are you from the truth?).

#### So the goal is not to minimize the bias but instead the more appropriate goal is to minimize MSE.
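The decomposition above can be checked with a small simulation. This is only a sketch under assumptions that are not in the lecture: the data are standard normal, the quantity being estimated is the variance (true value 1), and the helper name `evaluate_estimator` is made up for illustration. It compares the unbiased sample variance (divide by $n-1$) with the biased one (divide by $n$).

```python
import random
import statistics

def evaluate_estimator(estimator, true_value, n_obs=5, n_reps=5000, seed=0):
    """Empirical bias, variance and MSE of `estimator` over repeated N(0, 1) samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        estimates.append(estimator(sample))
    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_value
    variance = statistics.pvariance(estimates, mu=mean_est)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse

# Two estimators of the true variance (which is 1 for N(0, 1) data).
# Both runs use the same seed, so they see exactly the same samples.
unbiased = evaluate_estimator(statistics.variance, true_value=1.0)   # divides by n - 1
biased = evaluate_estimator(statistics.pvariance, true_value=1.0)    # divides by n

for name, (bias, var, mse) in [("unbiased", unbiased), ("biased", biased)]:
    print(f"{name:>8}: bias={bias:+.3f} var={var:.3f} "
          f"mse={mse:.3f} var+bias^2={var + bias ** 2:.3f}")
```

In this setup the biased estimator should end up with the smaller MSE, because its lower variance more than pays for its bias.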

### Fisher Weighting

How should we combine independent, *unbiased* estimators for a parameter into one estimator?

\begin{equation*}
\hat{\theta} = \sum_{i=1}^{k}w_i\hat{\theta_i}
\end{equation*}

The *weights* should sum to 1 but how should they be chosen?

\begin{equation*}
w_i \propto \frac{1}{Var(\hat{\theta_i})}
\end{equation*}

(Inversely proportional to variance)
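A minimal sketch of the weighting rule, assuming the variances of the individual estimators are known. The function names and the numbers are illustrative only.

```python
def fisher_weights(variances):
    """Weights proportional to 1 / variance, normalized to sum to 1."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]

def combine(estimates, variances):
    """Inverse-variance weighted average of independent, unbiased estimates."""
    weights = fisher_weights(variances)
    combined = sum(w * e for w, e in zip(weights, estimates))
    # For independent estimators, Var(sum w_i * theta_i) = sum w_i^2 * Var(theta_i),
    # which simplifies to 1 / sum(1 / Var_i) for these weights.
    combined_variance = 1.0 / sum(1.0 / v for v in variances)
    return combined, combined_variance

# Two made-up estimates of the same parameter; the first is four times more precise.
estimate, variance = combine([10.0, 12.0], [1.0, 4.0])
print(fisher_weights([1.0, 4.0]))           # weights 0.8 and 0.2
print(round(estimate, 3), round(variance, 3))
```

The combined variance (0.8 here) is smaller than the variance of either input, which is the point of combining them.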

### Regression Toward the Mean

Galton investigated the heights of fathers and sons and found that if the father is very tall, the son will be tall but not as tall as his father. And if the father is very short, the son will be short but not as short as the father. 

This is regression toward the mean, and it is very common in various scenarios.

### Linear Model

Often called **OLS**, Ordinary Least Squares

\begin{equation*}
y = X\beta + \epsilon
\end{equation*}

where 
* $y$ is $n \times 1$, what we want to predict
* $X$ is $n \times k$, matrix of data 
* $\beta$ is $k \times 1$, parameters
* $\epsilon$ is $n \times 1$, error terms

**Note**: what makes a linear model linear is that it is linear in the parameters, so we can include $x^2$ or $x^3$ as predictors and a linear combination of them is still a linear model.

#### Coefficients

* For the **sample** (1 predictor)

\begin{equation*}
\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}
\end{equation*}

\begin{equation*}
\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\end{equation*}

* For the **population** (1 predictor)

\begin{equation*}
y = \beta_0 + \beta_1 x + \epsilon
\end{equation*}

\begin{equation*}
E(y) = \beta_0 + \beta_1 E(x)
\end{equation*}

\begin{equation*}
cov(y, x) = \beta_1 cov(x,x)
\end{equation*}
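The sample formulas for $\hat{\beta_0}$ and $\hat{\beta_1}$ translate almost line by line into code. The sketch below uses a made-up five-point dataset; `fit_simple_ols` is an illustrative name, not a library function.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_1 = s_xy / s_xx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.0, 8.1, 9.9]
beta_0, beta_1 = fit_simple_ols(x, y)
print(round(beta_0, 3), round(beta_1, 3))  # 0.24 1.94
```

Note that the slope is the sample covariance of $x$ and $y$ divided by the sample variance of $x$, mirroring the population relation $cov(y, x) = \beta_1 cov(x, x)$.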

**"Explained" Variance** 

\begin{equation*}
var(y) = var(X\hat{\beta}) + var(e)
\end{equation*}

*where e are the residuals, the estimators of errors*.

#### R-squared

\begin{equation*}
R^2 = \frac{var(X\hat{\beta})}{var(y)} = \frac{\sum_{i=1}^{n}(\hat{y_i} - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
\end{equation*}

* $R^2$ is the variance explained by the fitted model divided by the total variance.
* $R^2$ measures goodness of fit, but it doesn't validate the model.
* Adding more predictors can never decrease $R^2$ on the data used for fitting, even if the new predictors are irrelevant.
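To make the variance decomposition and $R^2$ concrete, here is a small sketch on a made-up dataset with one predictor. The helper names are illustrative; the decomposition $var(y) = var(\hat{y}) + var(e)$ is checked numerically and only holds because the line is a least-squares fit with an intercept.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def r_squared(y, y_hat):
    """Explained sum of squares divided by total sum of squares."""
    y_bar = statistics.fmean(y)
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    total = sum((yi - y_bar) ** 2 for yi in y)
    return explained / total

x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.0, 8.1, 9.9]
intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

r2 = r_squared(y, y_hat)
print("R^2:", round(r2, 4))
print("var(y):", round(statistics.pvariance(y), 4))
print("var(y_hat) + var(e):",
      round(statistics.pvariance(y_hat) + statistics.pvariance(residuals), 4))
```

A value of $R^2$ close to 1 here only says the line fits these five points well; it says nothing about how the model behaves on new data.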



#### Collinearity

Avoid having highly correlated predictor variables (collinearity results in instability, high variances in the estimates, and worse interpretability).

## Odds Ratio

* Odds are just a different parametrization of probability: if someone's probability of experiencing an outcome is $p$, then that person's **odds** of the outcome are $p/(1-p)$.
Probability and odds are often used interchangeably. For small $p$ they are almost the same, but if $p$ is large, they are different.
* The **odds ratio** is the ratio of two different people's odds of some outcome. If people in group $A$ have the probability $p_A$ of disease, and people in group $B$ have the probability $p_B$, then the odds ratio of group A vs. group B is 

\begin{equation*}
Odds\ Ratio = \frac{\frac{p_A}{1-p_A}}{\frac{p_B}{1-p_B}} = \frac{(1-p_B)p_A}{(1-p_A)p_B}
\end{equation*}
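A quick numerical illustration of both definitions. The probabilities are made up; the function names are illustrative.

```python
def odds(p):
    """Odds corresponding to a probability p (0 <= p < 1)."""
    return p / (1.0 - p)

def odds_ratio(p_a, p_b):
    """Odds of group A divided by odds of group B."""
    return odds(p_a) / odds(p_b)

# For small p, odds and probability nearly coincide; for large p they diverge.
for p in (0.01, 0.1, 0.5, 0.9):
    print(p, round(odds(p), 4))

# Group A has 20% risk, group B has 10% risk.
print(round(odds_ratio(0.2, 0.1), 4))  # 2.25
```

The odds ratio of 2.25 is not the same as the ratio of probabilities (2.0), which is why the two should not be confused when $p$ is not small.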

## A Logistic Regression Model

\begin{equation*}
logit(p) = ln\big(\frac{p}{1-p}\big) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k
\end{equation*}

* The unknown parameters $\beta$ can be estimated from the data using **MLE** (Maximum Likelihood Estimation)

#### Interpreting the parameters of logistic regression

Let's say we have two patients, A and B, who are identical on all covariates, except that patient A has taken a medicine ($x_{medicine} = 1$) while patient B has not ($x_{medicine} = 0$). The model predicts that  

$logit(p_A) = ln(\frac{p_A}{1-p_A}) = \beta_0 + \beta_{age}x_{age} + ... + \beta_kx_k + \beta_{medicine} \cdot 1$

$logit(p_B) = ln(\frac{p_B}{1-p_B}) = \beta_0 + \beta_{age}x_{age} + ... + \beta_kx_k$

\begin{equation*}
\beta_{medicine} = logit(p_A) - logit(p_B) = ln\bigg(\frac{\frac{p_A}{1-p_A}}{\frac{p_B}{1-p_B}}\bigg)
\end{equation*}

Using this model we can estimate an "adjusted" odds ratio that's the odds ratio for two people with all other known factors held constant:

\begin{equation*}
e^{\hat{\beta}_{medicine}} = \frac{\frac{p_A}{1-p_A}}{\frac{p_B}{1-p_B}}
\end{equation*}
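The following sketch illustrates why the odds ratio is "adjusted". The coefficients are hypothetical, not fitted to any data, and the model has only two covariates (age and a 0/1 medicine indicator) to keep it short. Whatever the age, the odds ratio between the treated and untreated patient comes out as $e^{\beta_{medicine}}$.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
BETA_0 = -1.0
BETA_AGE = 0.03
BETA_MEDICINE = -0.7

def predicted_probability(age, took_medicine):
    """Inverse of the logit: p = 1 / (1 + exp(-linear_predictor))."""
    z = BETA_0 + BETA_AGE * age + BETA_MEDICINE * took_medicine
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

for age in (30, 50, 70):
    p_a = predicted_probability(age, took_medicine=1)
    p_b = predicted_probability(age, took_medicine=0)
    ratio = odds(p_a) / odds(p_b)
    print(age, round(p_a, 3), round(p_b, 3), round(ratio, 4))

print("exp(beta_medicine) =", round(math.exp(BETA_MEDICINE), 4))
```

The predicted probabilities change with age, but the odds ratio in the last column stays constant.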

## Curse of Dimensionality

* Arises when there are many features.
* In high dimensions the space is so sparse that it is difficult to find a nearby neighbor to work with, so we end up extrapolating, which is not reliable.
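One way to see the sparsity is to measure how far away the nearest neighbor is as the number of dimensions grows. The sketch below assumes points drawn uniformly from the unit hypercube and a fixed sample size of 100; both choices are illustrative.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, dim, seed=0):
    """Average Euclidean distance from each point to its closest other point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

for dim in (1, 2, 10, 50):
    print(dim, round(mean_nearest_neighbor_distance(100, dim), 3))
```

With the number of points held fixed, the "nearest" neighbor drifts further and further away as the dimension increases, so a prediction based on it is closer to extrapolation than interpolation.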

## Ridge Regression

Instead of minimizing only the sum of squared residuals (RSS), as in linear regression, Ridge regression minimizes  

\begin{equation*}
\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2= RSS + \lambda\sum_{j=1}^{p}\beta_j^2
\end{equation*}

This shrinks the parameters towards 0.
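For a single predictor ($p = 1$) the ridge objective has a simple closed form, which makes the shrinkage easy to see. Setting the derivatives with respect to $\beta_0$ and $\beta_1$ to zero gives $\hat{\beta_1} = S_{xy} / (S_{xx} + \lambda)$ and $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$. The sketch below covers only this special case, on made-up data.

```python
def fit_ridge_1d(x, y, lam):
    """Ridge fit for one predictor; the intercept is not penalized."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_1 = s_xy / (s_xx + lam)   # lam = 0 recovers ordinary least squares
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.0, 8.1, 9.9]
for lam in (0, 10, 90):
    beta_0, beta_1 = fit_ridge_1d(x, y, lam)
    print(lam, round(beta_1, 3))   # slopes: 1.94, 0.97, 0.194
```

The slope moves toward 0 as $\lambda$ grows but never reaches it exactly.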

## Lasso Regression

Instead of minimizing only the sum of squared residuals (RSS), as in linear regression, Lasso regression minimizes  

\begin{equation*}
\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij}\big)^2 + \lambda\sum_{j=1}^{p}\big|\beta_j\big|= RSS + \lambda\sum_{j=1}^{p}\big|\beta_j\big|
\end{equation*}

This helps induce *sparsity*, reducing the number of variables one has to deal with. Sets some parameters to 0.
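The lasso has no closed form in general. One common way to minimize the objective above is coordinate descent with soft-thresholding, sketched below in plain Python. This is an assumption about the solver, not something the lecture prescribes, and the tiny dataset is made up: $y$ depends strongly on the first predictor and only weakly on the second.

```python
def soft_threshold(rho, t):
    """Shrink rho toward 0 by t, returning exactly 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize RSS + lam * sum(|beta_j|); the intercept is not penalized."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centering removes the intercept from the optimization.
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [yi - y_mean for yi in y]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of predictor j with the residual that ignores predictor j.
            rho = sum(
                Xc[i][j] * (yc[i] - sum(Xc[i][m] * beta[m] for m in range(p) if m != j))
                for i in range(n)
            )
            z = sum(Xc[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    beta_0 = y_mean - sum(b * m for b, m in zip(beta, x_means))
    return beta_0, beta

X = [[1, 1], [2, -1], [3, 1], [4, -1], [5, 1], [6, -1]]
y = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9]   # equals 2 * x1 + 0.1 * x2

_, beta_ols = lasso_coordinate_descent(X, y, lam=0.0)
_, beta_lasso = lasso_coordinate_descent(X, y, lam=4.0)
print([round(b, 3) for b in beta_ols])
print([round(b, 3) for b in beta_lasso])
```

With $\lambda = 4$ the weak second coefficient is set exactly to 0, while the strong first one is only shrunk; ridge would have shrunk both without zeroing either.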

## Nearest Neighbors Classifiers

Predict class of new data point by majority vote of K nearest neighbors.

#### 1-NN Properties

* Simple and quite good for low-dimensional data
* "Rough" decision boundary, may have "islands"
* **Training complexity** for N data points? As I just store everything during training time, to add one new data point, I just add it. So the complexity of adding a new data point is $O(1)$, and for N data points it is $O(N)$
* **Test complexity** for M data points? For every point during test time, I look at every example in training set. Hence for M test data points the complexity is $O(M*N)$
* **Error on training set**? It's 0.
* **Variance**? **Bias**? Variance is very high, Bias is low.

#### k-NN Properties

* Gets rid of "islands"
* If k is too large, the boundary may become too smooth.
* **Lower variance, increased bias.**
* How do we choose the ideal k? Cross-validation

#### Decisions to make

* Choose k based on cross-validation, picking the value with the lowest validation error
* Distance measure
* How to assign a value to a new data point?
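A minimal k-NN classifier, assuming Euclidean distance and a plain majority vote (two of the decisions listed above). The training set is made up and contains one "island": a point labeled `"b"` sitting inside the `"a"` cluster.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Majority vote among the k training points closest to `query`."""
    # Sorting all training points is simple but costs O(N log N) per query;
    # a linear scan for the k smallest distances would match the O(N) in the notes.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b", "b"]

query = (0.6, 0.6)
print(knn_predict(train_X, train_y, query, k=1))  # "b": follows the island
print(knn_predict(train_X, train_y, query, k=3))  # "a": the island is outvoted

# With k = 1 every training point is its own nearest neighbor.
training_errors = sum(
    knn_predict(train_X, train_y, x, k=1) != label
    for x, label in zip(train_X, train_y)
)
print(training_errors)  # 0
```

The example shows both properties from the lists above: 1-NN has zero training error but follows every island, while a larger k smooths the boundary.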

## Dimensionality Reduction

**Basic idea**. Project the high-dimensional data onto a lower-dimensional subspace that best "fits" the data

### Principal Components Analysis (PCA)

The algorithm
* Subtract the mean from the data (center X) 
* (Typically) Scale each dimension by its standard deviation, so that each has unit variance
    * Helps pay less attention to the magnitude of the dimensions
* Compute the covariance matrix $S$, $S = \frac{1}{N}X^TX$
* Compute the k eigenvectors of $S$ with the largest eigenvalues
* These eigenvectors are the k principal components
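The steps above can be sketched in plain Python. Two assumptions go beyond the notes: each dimension is divided by its standard deviation (so that it has unit variance), and the eigenvectors are found by power iteration with deflation, a simple stand-in for a library eigen-solver. The ten 2-D data points are made up.

```python
import math

def standardize(X):
    """Center each column and divide it by its standard deviation."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]

def covariance_matrix(Z):
    """S = (1/N) Z^T Z for already-centered data Z."""
    n, d = len(Z), len(Z[0])
    return [[sum(row[a] * row[b] for row in Z) / n for b in range(d)]
            for a in range(d)]

def mat_vec(S, v):
    return [sum(s * x for s, x in zip(row, v)) for row in S]

def top_eigenpair(S, n_iter=500):
    """Power iteration: repeatedly apply S and renormalize."""
    v = [1.0 / (j + 1) for j in range(len(S))]   # arbitrary starting vector
    for _ in range(n_iter):
        w = mat_vec(S, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(S, v)))
    return eigenvalue, v

def pca(X, k):
    """Return the k leading principal components and their eigenvalues."""
    S = covariance_matrix(standardize(X))
    d = len(S)
    components, eigenvalues = [], []
    for _ in range(k):
        lam, v = top_eigenpair(S)
        components.append(v)
        eigenvalues.append(lam)
        # Deflation: subtract the component just found so the next one surfaces.
        S = [[S[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return components, eigenvalues

X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
components, eigenvalues = pca(X, 2)
print([[round(c, 3) for c in comp] for comp in components])
print([round(lam, 3) for lam in eigenvalues])
```

Because the two standardized dimensions are strongly positively correlated, the first component points along the diagonal and carries most of the variance. The eigenvalues sum to 2, the total variance of two standardized dimensions.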

## Support Vector Machines (SVM)

* Widely used for all sorts of classification problems
* Maximum margin classification

### Kernel Functions

* With kernels we don't need to care about the actual representations of the points in a high-dimensional space. We only need their dot product in that space, which the kernel computes directly from the original points.

\begin{equation*}
K(x, z) = \Phi(x) \cdot \Phi(z)
\end{equation*}

* **Polynomial** (tune $s$): 
\begin{equation*}
K(x, z) = (1 + x \cdot z)^s
\end{equation*}

* **Radial Basis Function** (tune $\gamma$):
\begin{equation*}
K(x, z) = \exp(-\gamma \|x-z\|^2)
\end{equation*}
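To see that a kernel really is a dot product in a larger space, the sketch below compares the polynomial kernel with $s = 2$ on 2-D inputs against an explicit 6-dimensional feature map. The map `phi_degree2` is one standard choice for this specific case and is written out by hand; the RBF kernel is included for completeness, with the squared difference read as a squared Euclidean distance.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, s=2):
    return (1 + dot(x, z)) ** s

def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def phi_degree2(x):
    """Explicit feature map whose dot product equals (1 + x . z)^2 in 2-D."""
    x1, x2 = x
    r = math.sqrt(2)
    return [1.0, r * x1, r * x2, x1 * x1, x2 * x2, r * x1 * x2]

x, z = (1.0, 2.0), (3.0, -1.0)
print(polynomial_kernel(x, z, s=2))                     # computed in 2 dimensions
print(round(dot(phi_degree2(x), phi_degree2(z)), 6))    # same value via 6 dimensions
print(rbf_kernel(x, x))                                 # identical points give 1.0
```

For higher degrees or for the RBF kernel the explicit feature space grows very large (or infinite), while the kernel evaluation stays cheap.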

### Kernel Trick for SVM

* Arbitrarily many dimensions
* Little computational cost
* Maximum margin helps with curse of dimensionality


### Prediction

* We only need the dot product of the new data with the support vectors.
* Prediction speed depends on the number of support vectors.

### Math behind the SVMs

* Andrew Ng's CS229 Machine Learning course notes: http://cs229.stanford.edu/notes/cs229-notes3.pdf
* Andrew Ng's CS229 Machine Learning course on Youtube: https://www.youtube.com/watch?v=UzxYlbK2c7E 

### Tips and Tricks

* SVMs are not scale invariant
* Check if the library you are using normalizes the data by default
* Normalize the data
    - mean 0, std 1
    - map to [0,1] or [-1,1]
* Normalize the test set in the same way, using the statistics (mean and std, or min and max) computed on the training set
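A small sketch of the "mean 0, std 1" option for a single feature, with made-up numbers. The important detail is that the mean and standard deviation are computed on the training data only and then reused for the test data; the function names are illustrative.

```python
import statistics

def fit_standardizer(train_column):
    """Learn the mean and standard deviation from the training data only."""
    return statistics.fmean(train_column), statistics.pstdev(train_column)

def apply_standardizer(column, mean, std):
    """Apply previously learned statistics to any column (train or test)."""
    return [(value - mean) / std for value in column]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)   # reuses the training statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled test values do not have mean 0 and std 1 themselves, and they should not: they are expressed on the scale the model saw during training.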

### Parameters to tune

* Which kernel?
    * RBF kernel is a good default
* Which values for the kernel parameters
    * try exponential sequences
* Which value for C
    * try exponential sequences
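"Exponential sequences" means candidate values spaced by a constant factor (for example powers of 10) rather than a constant step. The sketch below builds such grids for `C` and $\gamma$ and runs an exhaustive search over all pairs. The ranges are arbitrary, and `toy_score` is a hypothetical stand-in for a real cross-validated score.

```python
import itertools
import math

def exponential_grid(low_exp, high_exp, base=10.0):
    """Values base**low_exp, ..., base**high_exp."""
    return [base ** e for e in range(low_exp, high_exp + 1)]

def grid_search(score_fn, c_values, gamma_values):
    """Return the (C, gamma) pair with the highest score."""
    return max(itertools.product(c_values, gamma_values),
               key=lambda pair: score_fn(*pair))

def toy_score(c, gamma):
    # Hypothetical placeholder for a cross-validation score.
    # It is constructed to peak at C = 10 and gamma = 0.01.
    return -(math.log10(c) - 1) ** 2 - (math.log10(gamma) + 2) ** 2

c_values = exponential_grid(-3, 3)       # 0.001 ... 1000
gamma_values = exponential_grid(-4, 1)   # 0.0001 ... 10

best_c, best_gamma = grid_search(toy_score, c_values, gamma_values)
print(len(c_values) * len(gamma_values), "combinations tried")
print(best_c, best_gamma)
```

Once the best region is found on the coarse grid, a finer grid around it can be searched in the same way.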

