# 1    Probability Review and Bayesian Spam Filter

## 1.1 Independence

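Assume X and Y are independent, so their joint distribution factorizes into the product of the marginals. If random variables X and Y are discrete,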


$$\begin{aligned}
E\left[ XY \right] &= \sum_{x} \sum_{y} xy\,P\left( X = x, Y = y \right) \\
 &= \sum_{x} \sum_{y} xy\,P\left( X = x \right) P\left( Y = y \right) \\
 &= \sum_{x} x\,P\left( X = x \right) \sum_{y} y\,P\left( Y = y \right) \\
 &= E\left[ X \right] E\left[ Y \right]
\end{aligned}$$

If random variables X and Y are continuous,

$$\begin{aligned}
E\left[ XY \right] &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} xy\,f_{X,Y}\left( x, y \right)\,dx\,dy \\
 &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} xy\,f_X\left( x \right) f_Y\left( y \right)\,dx\,dy \\
 &= \int_{-\infty}^{+\infty} x\,f_X\left( x \right)\,dx \int_{-\infty}^{+\infty} y\,f_Y\left( y \right)\,dy \\
 &= E\left[ X \right] E\left[ Y \right]
\end{aligned}$$
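The discrete identity can be checked numerically. The following is a minimal sketch, assuming two small hypothetical marginal distributions (`p_x` and `p_y` are made up for illustration) and building the joint distribution as their product, which is exactly the independence assumption.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal pmfs, chosen only for illustration.
p_x = {1: Fraction(1, 2), 2: Fraction(1, 3), 6: Fraction(1, 6)}
p_y = {-1: Fraction(1, 4), 3: Fraction(3, 4)}


def expectation(pmf):
    """E[Z] = sum over z of z * P(Z = z)."""
    return sum(z * p for z, p in pmf.items())


# Independence: P(X = x, Y = y) = P(X = x) * P(Y = y).
e_xy = sum(x * y * p_x[x] * p_y[y] for x, y in product(p_x, p_y))
e_x_times_e_y = expectation(p_x) * expectation(p_y)

print(e_xy == e_x_times_e_y)
```

Exact fractions are used so that the comparison does not depend on floating-point rounding.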

## 1.2 Spam filtering equation

Using Bayes' rule, and expanding $\Pr(W)$ over spam ($S$) and ham ($H$) by the law of total probability,

$$\begin{aligned}
\Pr \left( S|W \right) &= \frac{\Pr \left( S,W \right)}{\Pr \left( W \right)}\\
 &= \frac{\Pr \left( W|S \right)\Pr \left( S \right)}{\Pr \left( W \right)}\\
 &= \frac{\Pr \left( W|S \right)\Pr \left( S \right)}{\Pr \left( W|H \right)\Pr \left( H \right) + \Pr \left( W|S \right)\Pr \left( S \right)}
\end{aligned}$$
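As a small worked instance of the last line, here is a sketch of the posterior computation. The function name `posterior_spam` and the example numbers are hypothetical, and it assumes every mail is either spam or ham, so $\Pr(H) = 1 - \Pr(S)$.

```python
def posterior_spam(p_w_given_s, p_w_given_h, p_s):
    """Pr(S|W) from Pr(W|S), Pr(W|H) and the prior Pr(S).

    Assumes mail is either spam or ham, so Pr(H) = 1 - Pr(S).
    """
    p_h = 1.0 - p_s
    evidence = p_w_given_h * p_h + p_w_given_s * p_s  # Pr(W)
    if evidence == 0.0:
        raise ValueError("Pr(W) is zero; the posterior is undefined")
    return p_w_given_s * p_s / evidence


# Illustrative numbers only: the word appears in 80% of spam and 10% of ham.
p = posterior_spam(p_w_given_s=0.8, p_w_given_h=0.1, p_s=0.5)
print(round(p, 4))
```

When the word is equally likely in spam and ham, the posterior reduces to the prior, which is a quick sanity check on the formula.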

## 1.3 I.I.D. assumption in spam filters

1. Words may be distributed differently in different types of mail, so they are not identically distributed.
2. Correlations between words are ignored (word order, syntactic structure, etc.), so words are treated as independent even when they are not.

## 1.4 Poison the Bayesian spam filter

 I. Randomly add words with $P(H|W) \gg 0.5$, i.e. words that strongly indicate ham.

> (This does not work if the Naive Bayes classifier bases its decision only on words with large $P(S|W) \gg 0.5$.)

 II. Substitute mutated spellings for words (e.g. Advertisement --> Advert!sement).

> (The Naive Bayes classifier will eventually learn that Advert!sement behaves like Advertisement.)

 III. Instead of using words, send the spam as an image.
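The effect of attack I can be sketched with a toy classifier. This is a minimal illustration, assuming a naive Bayes score that multiplies per-word likelihood ratios over all known words. The likelihood tables `P_W_S` and `P_W_H` are hypothetical, not estimated from data.

```python
import math

# Hypothetical per-word likelihoods Pr(W|S) and Pr(W|H), for illustration only.
P_W_S = {"free": 0.30, "winner": 0.20, "meeting": 0.01, "thanks": 0.02}
P_W_H = {"free": 0.03, "winner": 0.01, "meeting": 0.20, "thanks": 0.25}


def spam_probability(words, p_s=0.5):
    """Pr(S | words), treating words as independent given the class."""
    log_odds = math.log(p_s) - math.log(1.0 - p_s)
    for w in words:
        if w in P_W_S and w in P_W_H:  # skip words the filter has never seen
            log_odds += math.log(P_W_S[w]) - math.log(P_W_H[w])
    return 1.0 / (1.0 + math.exp(-log_odds))


original = spam_probability(["free", "winner"])
poisoned = spam_probability(["free", "winner", "meeting", "thanks"])
print(original > 0.5, poisoned > 0.5)
```

With these made-up numbers, padding the message with two ham-like words moves it across the 0.5 threshold. A filter that scores only the most spam-indicative words would not be affected in this way, which matches the caveat under attack I.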


# 2 Regression

## 2.1 Linear Regression: MLE and Least Squares

### 1

Given $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, the density of the noise is

$$\Pr \left( \varepsilon_i \right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp \left( - \frac{\varepsilon_i^2}{2\sigma^2} \right)$$

Substituting $\varepsilon_i = Y_i - \left\langle X_i, w \right\rangle - b$ gives

$$\Pr \left( Y_i|X_i,w,b \right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp \left[ - \frac{\left( Y_i - \left\langle X_i, w \right\rangle - b \right)^2}{2\sigma^2} \right]$$

### 2

Since the observations are independent given $\beta = (w, b)$,

$$\Pr (Y|\beta ) = \prod\limits_{i = 1}^n \Pr (Y_i|\beta ),$$

so


$$\begin{aligned}
\log \Pr (Y|\beta ) &= \sum\limits_{i = 1}^n \log \Pr \left( Y_i|\beta \right) \\
 &= \sum\limits_{i = 1}^n \log \left\{ \frac{1}{\sqrt{2\pi}\,\sigma}\exp \left[ - \frac{\left( Y_i - \left\langle X_i, w \right\rangle - b \right)^2}{2\sigma^2} \right] \right\} \\
 &= \sum\limits_{i = 1}^n \left[ - \log \left( \sqrt{2\pi}\,\sigma \right) - \frac{\left( Y_i - \left\langle X_i, w \right\rangle - b \right)^2}{2\sigma^2} \right] \\
 &= - n\log \left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2}\sum\limits_{i = 1}^n \left( Y_i - \left\langle X_i, w \right\rangle - b \right)^2
\end{aligned}$$
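The closed form in the last line can be compared against a direct sum of per-sample log densities. The sketch below assumes scalar features (so $\langle X_i, w \rangle = w x_i$) and uses a small made-up data set. The function names are hypothetical.

```python
import math


def gaussian_log_likelihood(xs, ys, w, b, sigma):
    """Closed form: -n log(sqrt(2 pi) sigma) - sum of squared residuals / (2 sigma^2)."""
    n = len(ys)
    squared_error = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))
    return -n * math.log(math.sqrt(2.0 * math.pi) * sigma) - squared_error / (2.0 * sigma ** 2)


def log_density(y, mean, sigma):
    """Log of the Gaussian density of y, evaluated directly from the density."""
    density = math.exp(-((y - mean) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)
    return math.log(density)


# Made-up data for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

closed_form = gaussian_log_likelihood(xs, ys, w=1.0, b=0.0, sigma=0.5)
summed = sum(log_density(y, 1.0 * x + 0.0, 0.5) for x, y in zip(xs, ys))
print(abs(closed_form - summed) < 1e-9)
```

Because only the squared-error term depends on `w` and `b`, parameters with smaller residuals always give a larger log-likelihood for a fixed `sigma`.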


### 3

Maximizing $\Pr (Y|\beta )$ is equivalent to maximizing $\log \Pr (Y|\beta )$, since the logarithm is monotonically increasing.

Because $- n\log \left( {\sqrt {2\pi } \sigma } \right)$ does not depend on $w$ or $b$ and $\frac{1}{2\sigma^2} > 0$, the problem is the same as minimizing

$$\sum\limits_{i = 1}^n {{{\left( {{Y_i} - \left\langle {{X_i},w} \right\rangle  - b} \right)}^2}}$$

and we have

$$\mathop {\arg \min }\limits_{\bf{\beta }} {\left( {{\bf{y}} - {\bf{X'\beta }}} \right)^T}\left( {{\bf{y}} - {\bf{X'\beta }}} \right) = \mathop {\arg \min }\limits_{w,b} \sum\limits_{i = 1}^n {{{\left( {{Y_i} - \left\langle {{X_i},w} \right\rangle  - b} \right)}^2}} $$

### 4

Let 

$$\begin{aligned}
\sigma \left( {\bf{\beta }} \right) &= {\left( {{\bf{y}} - {\bf{X'\beta }}} \right)^T}\left( {{\bf{y}} - {\bf{X'\beta }}} \right)\\
 &= {{\bf{y}}^T}{\bf{y}} - {{\bf{y}}^T}{\bf{X'\beta }} - {{\bf{\beta }}^T}{{{\bf{X'}}}^T}{\bf{y}} + {{\bf{\beta }}^T}{{{\bf{X'}}}^T}{\bf{X'\beta }}
\end{aligned}$$

Setting the derivative with respect to ${\bf{\beta }}$ to zero,

$$\frac{{\partial \sigma \left( {\bf{\beta }} \right)}}{{\partial {\bf{\beta }}}} =  - 2{{{\bf{X'}}}^T}{\bf{y}} + 2{{{\bf{X'}}}^T}{\bf{X'\beta }} = 0$$

Assuming ${{\bf{X'}}^T}{\bf{X'}}$ is invertible, we have

$${\bf{\beta }} = {\left( {{{{\bf{X'}}}^T}{\bf{X'}}} \right)^{ - 1}}{{{\bf{X'}}}^T}{\bf{y}}$$
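The closed-form solution can be sketched in NumPy. This assumes that $\mathbf{X'}$ is the data matrix with a column of ones appended, so that $\beta$ stacks $w$ and $b$. The helper name `fit_least_squares` and the data are hypothetical.

```python
import numpy as np


def fit_least_squares(X, y):
    """Solve the normal equations X'^T X' beta = X'^T y.

    Assumes X' is X with a trailing column of ones (for the bias b)
    and that X'^T X' is invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solving the linear system avoids forming the explicit inverse.
    beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    return beta[:-1], beta[-1]  # (w, b)


# Noise-free made-up data generated from y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0
w, b = fit_least_squares(X, y)
print(w, b)
```

`np.linalg.solve` is used instead of `np.linalg.inv` because it is generally more accurate numerically. The result is the same $\beta$ as in the formula above.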

## 2.2 Nonlinear Regression and Regularization 

### 1

Similar to 2.1.1, we have

$$\Pr \left( {{Y_i}|\phi \left( {{X_i}} \right),w,b} \right) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp \left[ { - \frac{{{{\left( {{Y_i} - \left\langle {\phi \left( {{X_i}} \right),w} \right\rangle  - b} \right)}^2}}}{{2{\sigma ^2}}}} \right]$$

Therefore, the log-likelihood becomes

$$\log \Pr (Y|\beta ) =  - n\log \left( {\sqrt {2\pi } \sigma } \right) - \frac{1}{{2{\sigma ^2}}}\sum\limits_{i = 1}^n {{{\left( {{Y_i} - \left\langle {\phi \left( {{X_i}} \right),w} \right\rangle  - b} \right)}^2}} $$

So MLE is the same as minimizing

$${{{\left( {{\bf{y}} - \phi \left( {{\bf{X'}}} \right){\bf{\beta }}} \right)}^T}\left( {{\bf{y}} - \phi \left( {{\bf{X'}}} \right){\bf{\beta }}} \right)},$$

which gives

$${\bf{\beta }} = {\left( {\phi {{\left( {{\bf{X'}}} \right)}^T}\phi \left( {{\bf{X'}}} \right)} \right)^{ - 1}}\phi {\left( {{\bf{X'}}} \right)^T}{\bf{y}}$$


### 2

Let

$$\begin{aligned}
\sigma \left( {\bf{\beta }} \right) &= {\left( {{\bf{y}} - \phi \left( {{\bf{X'}}} \right){\bf{\beta }}} \right)^T}\left( {{\bf{y}} - \phi \left( {{\bf{X'}}} \right){\bf{\beta }}} \right) + \lambda {{\bf{\beta }}^T}{\bf{\beta }}\\
 &= {{\bf{y}}^T}{\bf{y}} - {{\bf{y}}^T}\phi \left( {{\bf{X'}}} \right){\bf{\beta }} - {{\bf{\beta }}^T}\phi {\left( {{\bf{X'}}} \right)^T}{\bf{y}} + {{\bf{\beta }}^T}\phi {\left( {{\bf{X'}}} \right)^T}\phi \left( {{\bf{X'}}} \right){\bf{\beta }} + \lambda {{\bf{\beta }}^T}{\bf{\beta }}
\end{aligned}$$

Taking the derivative and setting it to zero,

$$\begin{aligned}
\frac{{\partial \sigma \left( {\bf{\beta }} \right)}}{{\partial {\bf{\beta }}}} &=  - 2\phi {\left( {{\bf{X'}}} \right)^T}{\bf{y}} + 2\phi {\left( {{\bf{X'}}} \right)^T}\phi \left( {{\bf{X'}}} \right){\bf{\beta }} + 2\lambda {\bf{\beta }}\\
 &=  - 2\phi {\left( {{\bf{X'}}} \right)^T}{\bf{y}} + 2\left( {\phi {{\left( {{\bf{X'}}} \right)}^T}\phi \left( {{\bf{X'}}} \right) + \lambda {\bf{I}}} \right){\bf{\beta }} = 0,
\end{aligned}$$

we have

$${\bf{\beta }} = {\left( {\phi {{\left( {{\bf{X'}}} \right)}^T}\phi \left( {{\bf{X'}}} \right) + \lambda {\bf{I}}} \right)^{ - 1}}\phi {\left( {{\bf{X'}}} \right)^T}{\bf{y}}$$
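A minimal NumPy sketch of the regularized solution follows. It assumes a polynomial feature map for $\phi$ (the text does not fix a particular $\phi$), scalar inputs, and a penalty on every component of $\beta$, including the constant term, as in the objective above. The function names and data are hypothetical.

```python
import numpy as np


def poly_features(x, degree):
    """Assumed feature map: phi(x) = [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)


def fit_ridge(Phi, y, lam):
    """Solve (Phi^T Phi + lam I) beta = Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)


# Noise-free made-up data from y = 1 + 2x - 3x^2.
x = np.linspace(-1.0, 1.0, 9)
y = 1.0 + 2.0 * x - 3.0 * x ** 2
Phi = poly_features(x, degree=2)

beta_mle = fit_ridge(Phi, y, lam=0.0)    # lam = 0 is the unregularized solution
beta_ridge = fit_ridge(Phi, y, lam=1.0)  # lam > 0 shrinks beta toward zero
print(beta_mle, beta_ridge)
```

With `lam=0.0` this reduces to the unregularized solution of part 1. For `lam > 0` the matrix being inverted is positive definite, so the system has a unique solution even when $\phi(\mathbf{X'})^T\phi(\mathbf{X'})$ is singular.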