### Regression (Linear Regression) vs. Logistic Regression
#### Regression (Linear Regression)
- **What it does**
  - Predicts a continuous numeric value.
    - Predicting house prices
    - Predicting temperature
    - Predicting sales revenue
- **Output**
  - A real number; it can be any value from −∞ to +∞.
#### Logistic Regression
- **What it does**
  - Predicts a probability of a class (classification problem).
    - Predict whether an email is spam (1) or not spam (0)
    - Predict whether a customer will churn or not churn
- **Output**
  - A probability (0 to 1), typically converted into a class label (0 or 1).
  - p ≥ 0.5 ⇒ Class 1
  - p < 0.5 ⇒ Class 0


### Understanding Odds

- The probability of an outcome is the number of times it occurred divided by the total number of times we ran the experiment.
- The odds of an outcome are the number of times it occurred divided by the number of times it didn't occur.
  - **example 1**:
    - the probability of obtaining 1 when we roll a die is 1/6
    - but the odds are 1/5
  - **example 2**:
    - If a particular horse wins 3 out of every 4 races, then the probability of that horse winning a race is 3/4
    - And the odds are 3/1 = 3

- **Formula**: if the probability of an event is x, then the odds are x / (1 − x).
  - Die example: odds = `(1/6) / (1 - 1/6)` = `(1/6) / (5/6)` = 1/5
  - Horse example: odds = `(3/4) / (1 - 3/4)` = `(3/4) / (1/4)` = 3

**Note:**
  - A probability is a number between **0 and 1**.
  - The corresponding odds are a number between **0 and +∞**.
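
The formula above can be checked with a few lines of Python. This is a minimal sketch: the helper name `odds` is made up for illustration, and `Fraction` is used only so the die and horse examples come out as exact fractions.

```python
from fractions import Fraction

def odds(p):
    """Return the odds p / (1 - p) for a probability p with 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

# Exact fractions reproduce the die and horse examples without rounding.
die_odds = odds(Fraction(1, 6))    # (1/6) / (5/6)
horse_odds = odds(Fraction(3, 4))  # (3/4) / (1/4)

print(die_odds)    # 1/5
print(horse_odds)  # 3
```

A probability of exactly 1 is rejected because the odds would require dividing by zero.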

- Imagine you're looking at 100 emails:
  - 80 are spam (class 1)
  - 20 are not spam (class 0)
- **Probability of spam**:
- ```
    p = number of spam emails / total emails
    p = 80 / 100 = 0.8 (or 80%)
  ```
- **Simple meaning**: "If I pick a random email, there's an 80% chance it's spam."

#### Now, What Are Odds?
- **Odds answer a different question**: "How many times more likely is it to be spam compared to not spam?"
  ```
  Odds = (number of spam) / (number of not spam)
  Odds = 80 / 20 = 4

  Or using probability:
  Odds = p / (1 - p)
  Odds = 0.8 / 0.2 = 4
  ```
- **Interpretation**: "For every 1 legitimate email, there are 4 spam emails." 
- Or: "Spam is **4 times more likely** than legitimate email." 

##### Example 1: Mostly Spam 
  ```
  Out of 100 emails:
  - 90 spam
  - 10 legitimate

  Probability: p = 90/100 = 0.9 (90% spam)
  Odds: 90/10 = 9

  Meaning: "For every 1 legitimate email, there are 9 spam emails"
          "Spam is 9× more likely"
  ``` 

##### Example 2: Balanced (well-filtered inbox)
  ```
  Out of 100 emails:
  - 50 spam
  - 50 legitimate

  Probability: p = 50/100 = 0.5 (50% spam)
  Odds: 50/50 = 1

  Meaning: "For every 1 legitimate email, there's 1 spam email"
          "Equal chance" or "1:1 odds"
  ```         
##### Example 3: Mostly Legitimate (professional inbox)
  ```
  Out of 100 emails:
  - 20 spam
  - 80 legitimate

  Probability: p = 20/100 = 0.2 (20% spam)
  Odds: 20/80 = 0.25

  Meaning: "For every 4 legitimate emails, there's 1 spam email"
          "Spam is 0.25× as likely (or 4× less likely)"
  ```

##### Example 4: Almost All Spam
  ```
    Out of 100 emails:
    - 99 spam
    - 1 legitimate

    Probability: p = 99/100 = 0.99 (99% spam)
    Odds: 99/1 = 99

    Meaning: "For every 1 legitimate email, there are 99 spam emails"
            "Spam is 99× more likely"
  ```  

| Scenario       | Spam | Legit | Probability (p) | Odds | What it means          |
| :------------- | ---: | ----: | --------------: | ---: | :--------------------- |
| Terrible inbox |   99 |     1 |            0.99 |   99 | Spam 99x more likely   |
| Bad inbox      |   90 |    10 |            0.90 |    9 | Spam 9x more likely    |
| Unfiltered     |   80 |    20 |            0.80 |    4 | Spam 4x more likely    |
| Filtered       |   70 |    30 |            0.70 |  2.3 | Spam 2.3x more likely  |
| Balanced       |   50 |    50 |            0.50 |    1 | Equal chance           |
| Good           |   30 |    70 |            0.30 | 0.43 | Legit 2.3x more likely |
| Great          |   20 |    80 |            0.20 | 0.25 | Legit 4x more likely   |
| Excellent      |   10 |    90 |            0.10 | 0.11 | Legit 9x more likely   |
| Perfect        |    1 |    99 |            0.01 | 0.01 | Legit 99x more likely  |
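
The rows of this table can be recomputed directly from the spam and legitimate counts. The sketch below assumes nothing beyond the counts in the table; the helper name `describe` is hypothetical.

```python
def describe(spam, legit):
    """Return (probability of spam, odds of spam) from raw counts."""
    p = spam / (spam + legit)
    odds = spam / legit
    return p, odds

# (spam, legit) counts taken from the table rows.
rows = [(99, 1), (90, 10), (80, 20), (70, 30), (50, 50),
        (30, 70), (20, 80), (10, 90), (1, 99)]

for spam, legit in rows:
    p, odds = describe(spam, legit)
    print(f"{spam:>3} spam, {legit:>3} legit -> p = {p:.2f}, odds = {odds:.2f}")
```

Note that odds below 1 mean the other class is favored: odds of 0.25 for spam are the same statement as odds of 4 for legitimate email.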


#### Key Differences Between Probability and Odds:
- **Probability asks**: "What fraction are spam?"
- **Odds ask**: "What's the ratio of spam to not spam?"
- **Real Conversational Usage:**
  - **Using Probability**:
    - "There's a 90% chance this email is spam"
    - "I'm 80% sure it's spam"
  - **Using Odds:**
    - "The odds are 9 to 1 it's spam"
    - "It's 9 times more likely to be spam than legitimate"
    - "Spam is favored 4:1" 


#### Why Odds Matter in Logistic Regression
- Linear regression produces a continuous number anywhere from −∞ to +∞.
- But we need a probability, which is limited to the range 0 to 1.
- **Odds** computed from that probability range from 0 to +∞. This removes the upper limit, but the scale is still bounded below by 0 and is not symmetric.

#### Logit  
- Take the **logarithm** of odds:
- `logit(p) = log(p / (1-p))`
- **This is the magical transformation**
  
| Probability p | Odds | Log-odds (logit) |
|---------------|------|------------------|
| 0.001 | 0.001 | -6.9 |
| 0.1 | 0.11 | -2.2 |
| 0.5 | 1 | 0 |
| 0.9 | 9 | 2.2 |
| 0.99 | 99 | 4.6 |
| 0.999 | 999 | 6.9 |

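- **Why log-odds?**
  - Taking the log of the odds stretches the range to −∞ to +∞, which matches the output range of a linear regression model.
  - The scale is symmetric around 0: p and 1 − p give log-odds of equal size and opposite sign (for example, 0.1 → -2.2 and 0.9 → 2.2).
  - p = 0.5 → log-odds = 0 (the neutral point).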
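
As a quick check of the logit table, the transformation can be written in a few lines. This is a minimal sketch using the natural logarithm, which is what the table values correspond to.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

# Same probabilities as in the table above.
for p in (0.001, 0.1, 0.5, 0.9, 0.99, 0.999):
    print(f"p = {p:<5}  odds = {p / (1 - p):8.3f}  logit = {logit(p):+.1f}")
```

The symmetry is visible in the output: `logit(0.1)` and `logit(0.9)` differ only in sign.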

#### Model
- Now we can write:
  - **log-odds** = w₁x₁ + w₂x₂ + ... + b
         = wᵀx + b
         = z  (we call this "z" or "net input")

##### Intuitive meaning
- ```
    # Example: Predicting spam email
    z = w1 * num_exclamation_marks + w2 * has_word_free + w3 * sender_unknown + b

    # If z = 2.5 (positive and large):
    #   → High log-odds
    #   → High odds
    #   → High probability of spam

    # If z = -3.0 (negative):
    #   → Low log-odds
    #   → Low odds
    #   → Low probability of spam

    ```
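
To make the net input concrete, here is a small sketch with made-up numbers. The weights, the bias, and the two example emails are hypothetical values chosen only for illustration; a real model would learn the weights from data.

```python
# Hypothetical weights and bias, chosen only for illustration (not learned from data).
weights = {"num_exclamation_marks": 0.5, "has_word_free": 1.5, "sender_unknown": 1.0}
bias = -2.0

def net_input(features):
    """z = w1*x1 + w2*x2 + ... + b, interpreted as the log-odds of spam."""
    return sum(weights[name] * value for name, value in features.items()) + bias

suspicious = {"num_exclamation_marks": 4, "has_word_free": 1, "sender_unknown": 1}
ordinary = {"num_exclamation_marks": 0, "has_word_free": 0, "sender_unknown": 0}

print(net_input(suspicious))  # 2.5
print(net_input(ordinary))    # -2.0
```

A positive z means the model leans toward spam, a negative z means it leans toward not spam, and z = 0 is the neutral point.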

#### Getting the probability back from the log-odds
- We need to **reverse** the transformations to get back to a probability:
- **z (net input = log-odds) → odds → probability**
  - odds = e^z
  - p = odds / (1 + odds) = e^z / (1 + e^z) = 1 / (1 + e^(-z))
- The **inverse of logit** is therefore the **sigmoid function**:
- σ(z) = 1 / (1 + e^(-z))
- This is the famous **S-shaped curve**:

| z (net input) | σ(z) (probability) | Interpretation |
|---------------|--------------------|----------------|
| -10 | 0.00005 | Almost certainly class 0 |
| -2 | 0.12 | Likely class 0 |
| 0 | 0.5 | Equally likely |
| 2 | 0.88 | Likely class 1 |
| 10 | 0.99995 | Almost certainly class 1 |
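
The sigmoid values in the table, and the claim that sigmoid undoes logit, can be verified with a short sketch. The function names are illustrative; only the standard `math` module is assumed.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability back to its log-odds."""
    return math.log(p / (1.0 - p))

# Same z values as in the table above; the last column should return z again.
for z in (-10, -2, 0, 2, 10):
    p = sigmoid(z)
    print(f"z = {z:>3}  sigmoid(z) = {p:.5f}  logit(sigmoid(z)) = {logit(p):.2f}")
```

This plain form is fine for moderate z; for very large negative z, `math.exp(-z)` can overflow, so production code usually uses a numerically safer variant.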


#### Complete picture

    
- **FORWARD (making predictions)**:
    
    ```
    Features (x₁, x₂, ...)
        ↓
        × weights (w₁, w₂, ...)
        ↓
    z = wᵀx + b  (net input, range: -∞ to +∞)
        ↓
    σ(z) = 1/(1 + e^(-z))  (sigmoid function)
        ↓
    p = probability  (range: 0 to 1)

    Predicted class =
        1  if p ≥ 0.5
        0  if p < 0.5
    ```
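
The forward pass can be sketched end to end in plain Python. The weights, bias, and feature vectors below are hypothetical values picked for illustration, and the function names are not from any library.

```python
import math

def predict_proba(x, w, b):
    """Forward pass: features -> net input z -> sigmoid -> probability of class 1."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, w, b, threshold=0.5):
    """Turn the probability into a class label using the p >= 0.5 rule."""
    return 1 if predict_proba(x, w, b) >= threshold else 0

# Hypothetical weights, bias and feature vectors, chosen only for illustration.
w = [0.8, -1.2]
b = 0.1
x_a = [2.0, 0.5]   # z = 1.6 - 0.6 + 0.1 = 1.1
x_b = [0.0, 2.0]   # z = 0.0 - 2.4 + 0.1 = -2.3

for x in (x_a, x_b):
    print(x, round(predict_proba(x, w, b), 3), predict_class(x, w, b))
```

Because sigmoid(0) = 0.5, thresholding p at 0.5 is the same as checking whether z is non-negative.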

### How the model learns in logistic regression

- Logistic regression predicts
  - p = P(Y=1|X), the probability that Y = 1 given X. In words: given a record's feature values X, what is the probability that the record belongs to class 1?
  - `p = P(Y=1|X) = 1 / (1 + e^(-z)) = 1 / (1 + e^(-(w1x1 + w2x2 + w3x3 + ... + wnxn + b)))`, where z is the net input computed from the record.
  - But how do we find the best w's, so that the model's predicted probabilities are as close as possible to the true outcomes (0 or 1)?
  - That's where **likelihood** and **log-likelihood** come in.

#### Likelihood
- Think of likelihood as a score that tells us how good our model's predictions are for the observed data.
  - If the model gives high probabilities to the correct class for every data point, the likelihood is high.
  - If it gives low probabilities to the correct class, the likelihood is low.
- **Likelihood = "How likely is it that the model with these w values generated our observed data?"**
- **We want to maximize this likelihood** → "find the w values that make the data most likely."
##### Mathematically
- Each data point i has
  - yi = the actual class: 1 (positive class) or 0 (negative class)
  - pi = the model's predicted probability that yi = 1
- Now:
  - if yi = 1, we want pi to be high, so that the prediction is correct.
  - if yi = 0, we want 1-pi to be high, so that the prediction is correct.

- **Putting these two cases into one equation**
  - `Li = (pi^yi) * (1-pi)^(1-yi)` → Li is the likelihood of one sample.
  - `L = Π Li = Π (pi^yi) * (1-pi)^(1-yi)` → the likelihood of all samples, taken as a product over i (this treats the samples as independent).
- **Understanding the likelihood function**
  - For each data point, `Li` measures how likely what really happened (the observed yi) is under the model's predicted probability pi.
- **✅ Case 1**: when yi = 1 (actual class is 1)
  - `Li = (pi^yi) * (1-pi)^(1-yi) = (pi^1) * (1-pi)^0 = pi` → the likelihood of the i-th sample is pi.
  - So if the true label is 1, the likelihood is the **predicted probability of being 1**.
  - If pi is high (e.g., 0.9), great → high likelihood.
  - If pi is low (e.g., 0.1), poor → low likelihood.
- **🚫 Case 2**: when yi = 0 (actual class is 0)
  - `Li = (pi^yi) * (1-pi)^(1-yi) = (pi^0) * (1-pi)^1 = 1-pi`
  - So if the true label is 0, the likelihood is the **predicted probability of being 0**, which is 1-pi.
  - If pi (the probability of 1) is small (e.g., 0.1), 1-pi = 0.9 → good.
  - If pi (the probability of 1) is large (e.g., 0.9), 1-pi = 0.1 → bad.
- **So this formula elegantly unifies both cases.**

#### Example:

- **Model 1**

    | i | Actual yi | Model predicted pi = P(y=1∣xi) | Li = pi^yi ∗ (1−pi)^(1−yi) |
    |---|-----------|--------------------------------|----------------------------|
    | 1 | 1 | 0.9 | 0.9^1 ∗ (1−0.9)^0 = 0.9 ✅ |
    | 2 | 0 | 0.2 | 0.2^0 ∗ (1−0.2)^1 = 0.8 ✅ |
    | 3 | 1 | 0.3 | 0.3^1 ∗ (1−0.3)^0 = 0.3 ❌ |

    - **For sample 1**: the actual class is 1 and the model predicted a 0.9 probability of class 1, so the likelihood is 0.9 → high.
    - **For sample 2**: the actual class is 0 and the model predicted a 0.2 probability of class 1, so the likelihood is 1 - 0.2 = 0.8 → high.
    - **For sample 3**: the actual class is 1 and the model predicted a 0.3 probability of class 1, so the likelihood is 0.3, which is low: the model favored the wrong class.

    - The total likelihood for the whole dataset is L = L1 * L2 * L3 = 0.9 * 0.8 * 0.3 = 0.216. That is the probability the model assigns to observing exactly these three labels.
- **Model 2**
  - Suppose we change the weights (w's) and get better predictions, say p = [0.95, 0.05, 0.8].
  - Then Li = [0.95, 0.95, 0.8].
  - The likelihood of the observed labels is now L = 0.95 * 0.95 * 0.8 = 0.722, which is much higher.

- `pi^yi` → the probability the model gave to the true class when that class is 1 (this factor equals 1 when yi = 0)
- `(1-pi)^(1-yi)` → the probability the model gave to the true class when that class is 0 (this factor equals 1 when yi = 1)
- `Li` → the combined likelihood of this data point's observed label
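
The two models from the example can be compared in code. This is a minimal sketch; the function names are illustrative, and `math.prod` requires Python 3.8 or newer.

```python
import math

def sample_likelihood(y, p):
    """Bernoulli likelihood p^y * (1 - p)^(1 - y) for one labelled sample."""
    return (p ** y) * ((1 - p) ** (1 - y))

def dataset_likelihood(labels, probs):
    """Product of the per-sample likelihoods (samples treated as independent)."""
    return math.prod(sample_likelihood(y, p) for y, p in zip(labels, probs))

labels = [1, 0, 1]
model1 = [0.9, 0.2, 0.3]    # predicted P(y=1) from Model 1
model2 = [0.95, 0.05, 0.8]  # predicted P(y=1) from Model 2

print(round(dataset_likelihood(labels, model1), 3))  # 0.216
print(round(dataset_likelihood(labels, model2), 3))  # 0.722
```

Training amounts to searching for the weights whose predicted probabilities give the largest value of this product.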

#### Log-likelihood
- A product of many probabilities quickly becomes an extremely small number, which can underflow to 0 in floating-point arithmetic.
- So we take the logarithm: it turns the product into a sum, which is much easier to handle. Because log is an increasing function, the w's that maximize log L also maximize L.
- `log L = ∑ [ log( (pi^yi) * (1-pi)^(1-yi) ) ]`
  - `= ∑ [ log(pi^yi) + log((1-pi)^(1-yi)) ]`
  - `= ∑ [ yi*log(pi) + (1-yi)*log(1-pi) ]`
- **This is called the log-likelihood.**
- Our goal is to maximize it → find the w's that make this expression as large as possible.
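
Both points, the underflow problem and the product-to-sum conversion, can be seen in a short sketch. The three-sample data reuses the Model 1 and Model 2 probabilities from the likelihood example; the list of 2000 samples with likelihood 0.5 is an artificial case constructed only to trigger underflow.

```python
import math

def log_likelihood(labels, probs):
    """Sum of y*log(p) + (1-y)*log(1-p); requires 0 < p < 1 for every sample."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(labels, probs))

labels = [1, 0, 1]
model1 = [0.9, 0.2, 0.3]
model2 = [0.95, 0.05, 0.8]

ll1 = log_likelihood(labels, model1)  # equals log(0.9 * 0.8 * 0.3)
ll2 = log_likelihood(labels, model2)  # equals log(0.95 * 0.95 * 0.8)
print(round(ll1, 3), round(ll2, 3))

# Artificial case: 2000 samples that each have likelihood 0.5.
many = [0.5] * 2000
product = math.prod(many)                 # too small for a float, becomes 0.0
log_sum = sum(math.log(v) for v in many)  # stays a perfectly ordinary number
print(product, round(log_sum, 2))
```

The better model still wins after taking logs: a larger likelihood always gives a larger (less negative) log-likelihood.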

#### Negative log-likelihood
- In optimization, algorithms like gradient descent are usually written to minimize something (a "loss").
- So instead of maximizing the log-likelihood, we minimize the negative log-likelihood.
- We define our loss function as `Loss = -log L = - ∑ [ yi*log(pi) + (1-yi)*log(1-pi) ]`

- This is known as the **Logistic Loss** or **Binary Cross-Entropy Loss**.
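
The sketch below shows one way this loss could be minimized with plain gradient descent. Several things here are assumptions rather than part of the text above: the toy one-feature dataset, the learning rate, the number of steps, and the gradient formula itself (the standard result that the derivative of the loss with respect to w_j is the sum of (pi - yi) * xij, which is not derived in these notes).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(X, y, w, b):
    """Negative log-likelihood (binary cross-entropy, summed over samples)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

def gradient_step(X, y, w, b, lr):
    """One gradient-descent update using d(loss)/dw_j = sum((p_i - y_i) * x_ij)."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        error = p - y_i
        for j, x_j in enumerate(x_i):
            grad_w[j] += error * x_j
        grad_b += error
    new_w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
    return new_w, b - lr * grad_b

# Toy data (made up): one feature, labels overlap slightly so the weights stay finite.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [0, 0, 1, 0, 1, 1]

w, b = [0.0], 0.0
initial_loss = nll(X, y, w, b)
for _ in range(500):
    w, b = gradient_step(X, y, w, b, lr=0.1)
final_loss = nll(X, y, w, b)

print(round(initial_loss, 3), round(final_loss, 3))
```

With all weights at zero every prediction is 0.5, so the starting loss is 6 * log(2); each update then moves the weights in the direction that lowers the loss.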

### Different approaches during prediction and training (intuition)
- At prediction time, a logistic regression model
  - predicts the probability of class 1; if pi >= 0.5 the sample is assigned class 1, otherwise class 0.
- During training, this 0/1 decision is not used.
  - While learning, logistic regression never converts probabilities into 0/1.
  - Instead, it keeps the full probability pi and tries to make the probability of the true class as high as possible by maximizing the likelihood (or equivalently, minimizing the log loss).
  - This allows the model to learn gradually: it knows "how wrong" it was, so a prediction of 0.4 is treated differently from 0.6 even though both are close to the threshold.
- **Why we don't just compare 0/1 predictions during learning**
  - | Actual (y_i) | Predicted (p_i) | Thresholded prediction | Correct? | Log loss, -log(p_i)     |
    | ------------ | --------------- | ---------------------- | -------- | ----------------------- |
    | 1            | 0.9             | 1                      | ✅        | 0.105 (very small)      |
    | 1            | 0.6             | 1                      | ✅        | 0.511 (moderate)        |
    | 1            | 0.51            | 1                      | ✅        | 0.673 (larger)          |
    | 1            | 0.49            | 0                      | ❌        | 0.713 (slightly larger) |
    | 1            | 0.1             | 0                      | ❌        | 2.303 (very large)      |
 - If you just use 0/1 classification:
   - All ✅ examples get the same score, so there is no difference (and no gradient) between 0.6 and 0.95.
   - You lose information about how confident the model was.
   - But the logistic loss keeps that information and changes smoothly:
     - 0.95 → tiny loss
     - 0.6 → moderate loss
     - 0.1 → huge loss
   - That's why gradient descent works: the loss changes smoothly with the predicted probability.
- **Intuitive Analogy**
  - Imagine training a dart player
    - If you only say "hit" or "miss" (0/1), he gets no idea how close he was.
    - But if you tell him how far he missed, he can adjust gradually.
    - The "how far" feedback = **probabilistic loss**.
    - The "hit/miss" feedback = **classification after training**.
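
The difference between the two kinds of feedback can be made concrete with a short sketch. It evaluates both losses for a sample whose true class is 1, using the predicted probabilities from the table above; the function names are illustrative.

```python
import math

def zero_one_loss(y, p, threshold=0.5):
    """0 if the thresholded prediction matches the label, otherwise 1."""
    predicted = 1 if p >= threshold else 0
    return 0 if predicted == y else 1

def log_loss(y, p):
    """Per-sample logistic loss: -(y*log(p) + (1-y)*log(1-p)), for 0 < p < 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# True class is 1 in every case; only the predicted probability changes.
for p in (0.9, 0.6, 0.51, 0.49, 0.1):
    print(f"p = {p:<4}  0/1 loss = {zero_one_loss(1, p)}  log loss = {log_loss(1, p):.3f}")
```

The 0/1 loss jumps from 0 to 1 between p = 0.51 and p = 0.49 and is flat everywhere else, while the log loss barely changes across that boundary and keeps growing as the prediction gets worse.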