# Bayes' Theorem

1. Conditional Probability
    1. P(A|B) = $\frac{P(A\cap B)}{P(B)}$
    
    2. suppose the scenario is a bag with 2 black marbles and 3 red marbles, from which two marbles are picked one after the other without replacement
        
        1. event A: the first marble picked is black; event B: the second marble picked is red
        
        2. P(A) = $\frac{2}{5}$, P(B) = $\frac{3}{5}$
        
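        3. the catch here is that B occurs after A, so the second draw has to be evaluated as B|A (event B occurs **given that A has already occurred**); once a black marble is removed, 3 of the 4 remaining marbles are red, so P(B|A) = $\frac{3}{4}$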
        
        4. hence P(A $\cap$ B) = P(A).P(B|A) = $\frac{2}{5}\times\frac{3}{4}$ = $\frac{3}{10}$
        
        5. theoretically, this can be interpreted as constructing the sample space of ordered draws: {(red, red), (red, black), (black, red), (black, black)}. since each red marble is distinct and each black marble is distinct, the space actually becomes {($r_1$, $r_2$), ($r_1$, $r_3$), ($r_2$, $r_1$), ..., ($r_1$, $b_1$), ($r_1$, $b_2$), ($r_2$, $b_1$), ..., ($b_1$, $r_1$), ($b_1$, $r_2$), ..., ($b_1$, $b_2$), ($b_2$, $b_1$)}
        
        and the event that we are looking for is ($b_i$, $r_j$)
        
        this event occurs in $2 \times 3 = 6$ of the $\binom{5}{2} \times 2 = 20$ ordered pairs (the order of picking matters, so each unordered pair counts twice), which evaluates to $\frac{6}{20} = \frac{3}{10}$
        
    3. hence we have P(A|B) = $\frac{P(A\cap B)}{P(B)}$ and P(B|A) = $\frac{P(A\cap B)}{P(A)}$
    
    4. thus $P(A|B).P(B) = P(B|A).P(A)$, which is **Bayes' theorem**
    
2. Bayes' theorem is usually written as \
P(A|B) = $\frac{P(B|A).P(A)}{P(B)}$ , where P(A|B) = posterior probability, **P(B|A) = likelihood**, <font color="red">P(A) = a priori (prior) probability</font>, P(B) = marginal probability.

    1. this is because B is what has already been observed, and P(A) describes how A behaves in isolation, before B is taken into account.
    
    2. what we don't know is how likely A is once B has occurred; that is the posterior P(A|B), whereas the probability of A on its own is termed the **a priori (prior) probability**
    
    3. here B is treated as the independent event, one that happened on its own without being **triggered** by another event, whereas A is evaluated given that B happened first, thus making **A the dependent event on B**.
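
The marble example can be checked by brute-force enumeration. The following is a minimal sketch, assuming the bag holds 2 black and 3 red marbles (the composition that the enumeration $r_1 \cdots r_3$, $b_1$, $b_2$ and the result $\frac{3}{10}$ correspond to). The names `marbles`, `draws`, `prob`, `is_A` and `is_B` are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

# Assumed bag: 2 black and 3 red marbles, each treated as distinct.
marbles = ["b1", "b2", "r1", "r2", "r3"]

# Ordered draws of two marbles without replacement: 5 * 4 = 20 outcomes.
draws = list(permutations(marbles, 2))


def prob(event):
    """Probability of an event given as a predicate over (first, second)."""
    favourable = sum(1 for d in draws if event(d))
    return Fraction(favourable, len(draws))


def is_A(d):
    return d[0].startswith("b")  # A: first marble is black


def is_B(d):
    return d[1].startswith("r")  # B: second marble is red


p_a = prob(is_A)
p_b = prob(is_B)
p_ab = prob(lambda d: is_A(d) and is_B(d))

# Conditional probabilities straight from the definition.
p_b_given_a = p_ab / p_a
p_a_given_b = p_ab / p_b

print(p_a, p_b, p_ab)            # 2/5 3/5 3/10
print(p_b_given_a, p_a_given_b)  # 3/4 1/2

# Bayes' theorem: P(A|B) = P(B|A).P(A) / P(B)
print(p_b_given_a * p_a / p_b == p_a_given_b)  # True
```

Using `Fraction` keeps every probability exact, so the final comparison checks the identity itself rather than a floating-point approximation.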

# Naive Bayes Classifier

1. before assuming anything, a prediction problem can be framed as: given a certain set of features ($x_1\cdots x_d$), predict $y$
    - this corresponds to estimating $P(y \, |\, x_1 \,,\, x_2 \cdots x_d)$
    - which, using Bayes' Theorem, can be written as $\dfrac{P(x_1 \,,\, x_2 \cdots x_d | y).P(y)}{P(x_1 \,,\, x_2 \cdots x_d)}$
        - for 2 events, Bayes' theorem: $P(y|x_1) = \dfrac{P(y,x_1)}{P(x_1)} \rightarrow P(y,x_1) = P(y|x_1).P(x_1) = P(x_1|y).P(y) \Rightarrow \quad P(y|x_1) = \dfrac{P(x_1|y).P(y)}{P(x_1)}$
        - for 3 events, the same step applied twice gives $P(A,B,C) = P(A|B,C).P(B,C) = P(A|B,C).P(C|B).P(B)$; with $A = x_2$, $C = x_1$, $B = y$: $P(y |x_1, x_2) = \dfrac{P(y,x_1,x_2)}{P(x_1, x_2)}$ where $P(y,x_1,x_2) = P(x_2|x_1,y).P(x_1|y).P(y)$
    - using this *chain rule of probability*, $P(y \, |\, x_1 \,,\, x_2 \cdots x_d) = \dfrac{P(x_1 \,,\, x_2 \cdots x_d | y).P(y)}{P(x_1 \,,\, x_2 \cdots x_d)} = \dfrac{P(x_1|y).P(x_2|x_1, y).P(x_3|x_1,x_2,y)\cdots P(x_d|x_1,x_2\cdots x_{d-1},y).P(y)}{P(x_1 \,,\, x_2 \cdots x_d)}$

1. as we know about dependent and independent events, the features $x_1$, $x_2$, ..., $x_d$ play the role of independent variables, and the quantity to be predicted, y, is the dependent variable; the analogy between events and variables carries over directly.
    - independent means that $P(A,B) = P(A).P(B) \Rightarrow P(B|A) = P(B) \,,\, P(A|B) = P(A)$
    - the *naive* assumption applies this to the features within each class: given $y$, knowing the other features tells us nothing more about $x_i$, i.e. $P(x_i|x_1, \cdots, x_{i-1}, y) = P(x_i|y)$

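2. hence, dropping the other features from each conditional factor of the chain rule, $P(y \, |\, x_1 \,,\, x_2 \cdots x_d) = \dfrac{P(x_1|y).P(x_2|y).P(x_3|y)\cdots P(x_d|y).P(y)}{P(x_1 \,,\, x_2 \cdots x_d)}$
    1. this means that the events $x_1$ to $x_d$ have already occurred (been observed) before y is predicted, so the denominator is the same for every candidate value of y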
    
3. hence the posterior for a classification problem is directly proportional to the product of the prior and all the likelihoods (the likelihood of each feature value given that class value).

4. we need to find the y value which maximizes this posterior, since that will be the **most likely** event to occur, or the **most likely class that the object belongs to**.

5. hence the objective is to find the $y$ that maximizes $P(y).\prod\limits_{i=1}^{d}P(x_i|y)$.
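
The objective above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset, the feature meanings (outlook, windy) and the names `train`, `fit` and `predict` are hypothetical, and the probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each sample is (features, label).
# Features are (outlook, windy); the label says whether a game was played.
train = [
    (("sunny", "no"), "play"),
    (("sunny", "yes"), "play"),
    (("rainy", "yes"), "stay"),
    (("rainy", "no"), "play"),
    (("rainy", "yes"), "stay"),
    (("sunny", "yes"), "stay"),
]


def fit(samples):
    """Estimate the counts behind P(y) and P(x_i | y) by relative frequency."""
    class_counts = Counter(label for _, label in samples)
    # feature_counts[(i, y)][v] = number of class-y samples with x_i == v
    feature_counts = defaultdict(Counter)
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def predict(features, class_counts, feature_counts):
    """Return the class maximizing P(y) * prod_i P(x_i | y), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(y)
        for i, value in enumerate(features):
            # likelihood P(x_i | y) as a relative frequency within class y
            score *= feature_counts[(i, label)][value] / n_label
        scores[label] = score
    return max(scores, key=scores.get), scores


class_counts, feature_counts = fit(train)
label, scores = predict(("sunny", "no"), class_counts, feature_counts)
print(label)  # play
```

Note that a feature value never seen with a class makes that class's whole product zero (here, "no" never occurs with "stay"); smoothing the counts is a common remedy.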

# Application in NLP

1. generate a vector encoding for each of the words

2. usually used in text-classification tasks such as mood/sentiment analysis

3. count-vectorized or TF-IDF-vectorized vectors are usually used as the feature representation of each word (sample)

4. the quantity to estimate is P(y = 1 | sentence), where a sentence is a pre-defined sequence of words, i.e. a sequence of vectors

    1. here the term $P(x_i|y)$, for instance w.r.t. the problem of sentiment classification, means the probability of a word occurring, given that the conveyed sentiment is positive (y=1) or negative (y=0).
    
    2. each $x_i$ obviously represents a unique word.
    
5. here, the a priori probability can simply be learnt from the frequencies of the output label y in the *training-data*.
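
The points above can be put together in a small sentiment classifier. The following is a minimal sketch in pure Python, not a reference implementation: the sentences in `train` are made up, `fit` and `predict` are hypothetical names, and two choices go beyond the notes, namely add-one smoothing (so an unseen word does not zero out a class) and summing logarithms instead of multiplying probabilities (to avoid underflow on long sentences).

```python
import math
from collections import Counter

# Hypothetical labelled sentences: y = 1 is positive, y = 0 is negative.
train = [
    ("good great fun", 1),
    ("great acting good plot", 1),
    ("bad boring plot", 0),
    ("boring bad acting", 0),
    ("good fun", 1),
]


def fit(samples):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = {0: Counter(), 1: Counter()}
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return class_counts, word_counts, vocab


def predict(text, class_counts, word_counts, vocab):
    """Return argmax over y of log P(y) + sum_i log P(x_i | y)."""
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)  # log prior from label frequencies
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing: an assumption, not part of the notes above.
            p = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)
print(predict("good plot", *model))      # 1
print(predict("boring acting", *model))  # 0
```

Here the raw word counts play the role of the count-vectorized features, and the prior comes directly from how often each label appears in the training sentences.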