Objective: Learn to extract features from text into numerical vectors, then build a binary classifier for tweets using a logistic regression.

# 1. Supervised ML & Sentiment Analysis
In supervised machine learning, the input $X$ is fed into a prediction function, which produces an estimated output $\hat{Y}$. This estimate is compared against the actual value $Y$ to compute the cost, and the cost is then used to adjust the parameters $\theta$. The process is summarized in the following image.

<img src="../../img/supervised-ml.PNG" alt="Supervised ML" width="70%" height="70%">

To conduct sentiment analysis on a tweet, the text (for example, "I am happy because I am learning NLP") must first be converted into a feature representation. You then train a logistic regression classifier on these features. Once trained, the classifier can be used to categorize the sentiment of a text: it outputs 1 for a positive sentiment or 0 for a negative sentiment.

<img src="../../img/feature-representation.PNG" alt="Feature Representation" width="70%" height="70%">

# 2. Vocabulary & Feature Extraction
A text snippet such as a tweet can be represented as a vector of dimension $V$, where $V$ is the number of words in your vocabulary. Take the tweet 'I am happy because I am learning NLP' as an example: each word that appears in the tweet is marked with a 1 at its index in the vector, and every other index is 0. As $V$ grows, the vector becomes increasingly sparse, and the model has more features and therefore more parameters $\theta$ to train:

   - $\theta$ represents the parameters (weights) of the model that are learned during training.
   - If the number of features is $V$, the model needs to learn a corresponding number of parameters (essentially, one parameter for each feature).
   - As $V$ increases, the number of parameters in $\theta$ increases with it.

Consequently, this can result in longer training and prediction times.
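The sparse representation described above can be sketched in a few lines of Python. This is a minimal illustration only: the two-tweet corpus and the helper names `build_vocabulary` and `to_sparse_vector` are assumptions made for the example, not part of the course material.

```python
def build_vocabulary(tweets):
    """Collect the unique lowercase words of a corpus in first-seen order."""
    vocab = []
    seen = set()
    for tweet in tweets:
        for word in tweet.lower().split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def to_sparse_vector(tweet, vocab):
    """Return a length-V list with 1 where a vocabulary word occurs in the tweet."""
    words = set(tweet.lower().split())
    return [1 if word in words else 0 for word in vocab]


# Hypothetical mini-corpus, used only for illustration.
corpus = [
    "I am happy because I am learning NLP",
    "I hated the movie",
]
vocab = build_vocabulary(corpus)
vector = to_sparse_vector("I am happy because I am learning NLP", vocab)
print(len(vocab), vector)
```

Even with this tiny vocabulary a third of the entries are zero; with a realistic vocabulary of tens of thousands of words, almost every entry would be.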

<img src="../../img/feature-extraction.PNG" alt="Feature Extraction" width="70%" height="70%">

# 3. Feature Extraction with Frequencies 

Let's consider these tweets:

<img src="../../img/tweets.PNG" alt="tweets" width="70%" height="70%">


Each tweet must be transformed into a vector. Previously these vectors had dimension $V$, the size of the vocabulary; here each tweet is instead represented by a vector of dimension 3. To accomplish this, construct a dictionary that maps each (word, class) pair, where the class is positive or negative, to the number of times the word occurs in tweets of that class.

<img src="../../img/dictionary.PNG" alt="dictionary" width="70%" height="70%">

The table shows that certain words such as 'happy' and 'sad' align clearly with one sentiment, whereas others like 'I' and 'am' are more neutral. Using this dictionary and an example tweet, 'I am sad, I am not learning NLP', you can build the corresponding feature vector. The first feature, \[ 8 \], is the sum of the positive-class frequencies of the words in the tweet.

<img src="../../img/dictionary-positives.PNG" alt="dictionary" width="70%" height="70%">

To encode the negative feature, do the same with the negative-class frequencies: \[ 11 \] is the sum of the negative-class frequencies of the words in the tweet.

<img src="../../img/dictionary-negatives.PNG" alt="dictionary" width="70%" height="70%">

As a result, the derived feature vector is \[ 1, 8, 11 \]. In this vector, the \[ 1 \] serves as the bias term, \[ 8 \] is the positive feature, and \[ 11 \] is the negative feature.
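The frequency-based features can be sketched as follows. The corpus here is an assumption (the actual tweets live in the images above); it was chosen so that the example tweet yields the same vector, \[ 1, 8, 11 \]. The helper names are hypothetical, and each distinct word of the tweet is counted once when summing.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, drop commas, and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().replace(",", "").split()


def build_freqs(tweets, labels):
    """Map each (word, label) pair to how often the word occurs in tweets of that class."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tokenize(tweet):
            freqs[(word, label)] += 1
    return freqs


def tweet_features(tweet, freqs):
    """Return [bias, sum of positive frequencies, sum of negative frequencies]."""
    words = set(tokenize(tweet))  # each distinct word contributes once
    positive = sum(freqs.get((word, 1), 0) for word in words)
    negative = sum(freqs.get((word, 0), 0) for word in words)
    return [1, positive, negative]


# Assumed corpus: 1 = positive, 0 = negative.
tweets = [
    "I am happy because I am learning NLP",
    "I am happy",
    "I am sad, I am not learning NLP",
    "I am sad",
]
labels = [1, 1, 0, 0]

freqs = build_freqs(tweets, labels)
features = tweet_features("I am sad, I am not learning NLP", freqs)
print(features)
```

Whatever the vocabulary size, every tweet is reduced to three numbers, which is what keeps the classifier small.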

# 4. Preprocessing

During the preprocessing phase, the necessary steps include:

  - Removing usernames and links from the text.
  - Splitting the text into individual words (tokenisation).
  - Discarding common stop words such as "the", "and", and "for".
  - Applying stemming to reduce words to their root form, for which the Porter Stemmer algorithm is commonly utilized.
  - Transforming all text to lowercase.

Example: "@DrJaneDoe TUNED a great AI model"

After applying the preprocessing steps, the tweet is reduced to:

\[ tun, great, ai, model \].

This illustrates the exclusion of usernames and URLs, word tokenization, stop words removal, stemming of words to their base forms, and conversion to lowercase.
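A rough, standard-library-only sketch of these steps is shown below. It rests on several assumptions: the stop word set is a tiny illustrative list, and `naive_stem` is a toy suffix stripper standing in for the Porter stemmer, so its output will differ from a real stemmer on many words. In practice a library such as NLTK would typically be used for tokenization, stop words, and stemming.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "for", "is", "are", "i", "am"}


def naive_stem(word):
    """Toy suffix stripper standing in for the Porter stemmer (not equivalent to it)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Remove handles and links, lowercase, tokenize, drop stop words, then stem."""
    tweet = re.sub(r"@\w+", "", tweet)          # usernames
    tweet = re.sub(r"https?://\S+", "", tweet)  # links
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("@DrJaneDoe TUNED a great AI model"))
```

Handles and links are removed before tokenizing so that their fragments never reach the stop word filter or the stemmer.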

<img src="../../img/preprocessing.PNG" alt="preprocessing" width="70%" height="70%">

# 5. Synthesizing the process

You begin with a selected piece of text. First you preprocess it, then you extract features, which translates the text into a numerical format as detailed below:

<img src="../../img/process.PNG" alt="process" width="70%" height="70%">

With $m$ tweets, your matrix $X$ will then have dimensions $(m, 3)$, as shown below.

<img src="../../img/matrix.PNG" alt="matrix" width="40%" height="40%">




[access code preprocessing](./Lab/C1_W1_lecture_nb_01_preprocessing.ipynb)

[access code words frequencies](./Lab/C1_W1_lecture_nb_02_word%20frequencies.ipynb)

## Example

Given the tweet:

```plaintext
tweets = ['I am Happy Because i am learning NLP, @deeplearning']
```

i. **Tweet Preprocessing:**
   - Your `process_tweet` function tokenizes, stems, and removes stop words and punctuation from the tweet.
   - In this process, "Happy" is likely stemmed to "happi", and other words like "I", "am", "because", and "learning" are either removed or retained depending on whether they appear in the stop word list.



ii. **Frequency Calculation:**
   - The `extract_features` function then calculates the frequency of each word (after preprocessing) in the context of positive (label 1.0) and negative (label 0.0) sentiments.





# 6. Logistic regression 
[access logistic regression](./

Logistic regression employs the sigmoid function, expressed as

 $h(\mathbf{x}^{(i)}, \theta) = \frac{1}{1 + e^{-\theta^\top \mathbf{x}^{(i)}}} $
 
to output a probability between 0 and 1. During training, the parameters $\theta$ are updated iteratively until the cost is sufficiently minimized. Note that the sigmoid is the prediction function $h$ of logistic regression, not the cost function itself.

<img src="../../img/sig.PNG" alt="sigmoid" width="70%" height="70%">


The classification function in logistic regression, denoted $h$, is the sigmoid function. It depends on the parameters $\theta$ and the feature vector $\mathbf{x}^{(i)}$, where $i$ indexes the observation or data point; in the context of tweets, each observation is an individual tweet. The sigmoid approaches 0 as $\theta^\top \mathbf{x}^{(i)}$ approaches negative infinity, and approaches 1 as it approaches positive infinity.

Classification with logistic regression requires a threshold, typically 0.5. The sigmoid outputs exactly 0.5 when the dot product $\theta^\top \mathbf{x}^{(i)}$ equals zero. Therefore, a prediction is considered positive when the dot product is zero or higher, and negative when it is below zero.
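As a small sketch of the sigmoid and the 0.5 threshold rule, the Python below uses only the standard library. The parameter values in `theta` are made up for illustration, and `predict_label` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid: avoids overflow in exp() for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(theta, x, threshold=0.5):
    """Return 1 (positive) if sigmoid(theta . x) >= threshold, else 0 (negative)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= threshold else 0


theta = [0.0, 0.5, -0.5]  # hypothetical parameters: [bias, positive, negative]
x = [1, 8, 11]            # feature vector [bias, positive sum, negative sum]

print(sigmoid(0.0))             # exactly 0.5 at the decision boundary
print(predict_label(theta, x))  # theta . x = -1.5, so the label is 0
```

Because the sigmoid is monotonic, comparing its output with 0.5 is the same as checking the sign of the dot product.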

To illustrate this within the realm of tweets and sentiment analysis, consider a specific tweet. After preprocessing, you obtain a list where usernames are removed, all text is converted to lowercase, and words like 'tuning' are stemmed to 'tun'.

Using a frequency dictionary, features are extracted to create a vector that includes a bias unit and two features indicating the sum of positive and negative word frequencies in your processed tweet. 
 
With an optimal set of parameters $\theta$, you can then compute the dot product $\theta^\top \mathbf{x}^{(i)}$, for example 4.92. The sigmoid of 4.92 is approximately 0.99, which is above the 0.5 threshold, so you predict a positive sentiment for the tweet.

<img src="../../img/sig2.PNG" alt="sigmoid2" width="70%" height="70%">




## 6.1. Training


Training a logistic regression model involves the following steps:

- Initialize the parameter vector $\theta$.
- Compute the predictions $h$ by applying the sigmoid function to the training examples.
- Calculate the gradient of the cost with respect to $\theta$.
- Update $\theta$ by taking a step in the direction opposite to the gradient.
- Compute the cost.

All steps after the initialization are repeated until the cost is satisfactory.

<img src="../../img/training.PNG" alt="sigmoid" width="70%" height="70%">



Typically, training continues until the cost stabilizes. If you plot the cost against the number of iterations, you would typically see it decrease and then flatten out as training converges.
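The training loop can be sketched in plain Python as below. The dataset, learning rate, and iteration count are assumptions picked so the example runs quickly, and the update uses the standard logistic regression gradient $\frac{1}{m} X^T (h - y)$.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y, eps=1e-12):
    """Average log loss; eps guards against log(0)."""
    total = 0.0
    for x, label in zip(X, y):
        h = predict(theta, x)
        total += label * math.log(h + eps) + (1 - label) * math.log(1 - h + eps)
    return -total / len(X)


def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent; returns the final theta and the cost after each step."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n                      # initialize parameters
    history = []
    for _ in range(num_iters):
        errors = [predict(theta, x) - label for x, label in zip(X, y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # update step
        history.append(cost(theta, X, y))
    return theta, history


# Hypothetical features [bias, positive count, negative count] and labels.
X = [[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]]
y = [1, 1, 0, 0]

theta, history = gradient_descent(X, y, alpha=0.1, num_iters=200)
print(history[0], history[-1])
```

Plotting `history` against the iteration index gives the kind of convergence curve described above.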

<img src="../../img/convergence.PNG" alt="sigmoid" width="50%" height="50%">

## 6.2. Testing

To evaluate your model's performance, use a portion of your dataset known as the validation set to generate predictions. These predictions are derived from the outputs of the sigmoid function. If the output is equal to or greater than 0.5, categorize it as belonging to the positive class; if not, it's classified as negative.

<img src="../../img/train.PNG" alt="Supervised ML" width="70%" height="70%">


When organizing the data $X$, it's common to divide it into three segments: $X_{\text{train}}$, $X_{\text{val}}$, and $X_{\text{test}}$.

The division typically depends on the dataset's size, but a split of 80% for training, 10% for validation, and 10% for testing is often effective.
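One simple way to produce an 80/10/10 split is to shuffle the examples and slice them, as in the sketch below. The function name `split_dataset` and the fixed seed are assumptions for the example; integers stand in for real (tweet, label) pairs.

```python
import random


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of the examples and slice it into train/validation/test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test


examples = list(range(100))  # placeholder for real examples
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when the data is ordered, for example when all positive tweets come first.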

To calculate the accuracy of your model, use the following formula: 


$ \text{Accuracy} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}(\text{pred}(i) == Y_{\text{val}}(i)) $

Here's what this formula represents:

- $\text{Accuracy}$ is the metric you're calculating.
- $\frac{1}{m}$ is the normalization factor, where $m$ is the total number of examples in the validation set.
- $\sum_{i=1}^{m}$ denotes the summation over all $m$ examples.
- $\mathbb{1}(\text{pred}(i) == Y_{\text{val}}(i))$ is an indicator function that returns 1 if the prediction for the $i$-th example, $\text{pred}(i)$, matches the actual value in the validation set, $Y_{\text{val}}(i)$, and 0 otherwise.

In other words, you go over all $m$ validation examples, add one for each correct prediction, and divide the total by $m$ to get the accuracy.
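The accuracy computation is short enough to sketch directly. The probabilities and labels below are invented for the example; `accuracy` is a hypothetical helper that first thresholds the sigmoid outputs at 0.5 and then counts matches.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the true label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


# Hypothetical sigmoid outputs and true labels for five validation tweets.
val_probs = [0.9, 0.3, 0.5, 0.2, 0.7]
val_labels = [1, 0, 1, 1, 0]

print(accuracy(val_probs, val_labels))  # 3 of 5 predictions are correct
```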

## 6.3. Cost function for logistic regression

The cost function for logistic regression is formulated as:

$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}, \theta) + (1 - y^{(i)}) \log (1 - h(x^{(i)}, \theta)) \right] $

<img src="../../img/cost.PNG" alt="Cost function" width="70%" height="70%">

This equation shows that if $y = 1$ and the predicted value is near 0, the cost approaches infinity. Likewise, if $y = 0$ and the prediction is close to 1, the cost also approaches infinity. Conversely, when the prediction agrees with the label (close to 1 for $y = 1$, close to 0 for $y = 0$), the cost approaches zero. The objective in both scenarios is to minimize $J(\theta)$.
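To make these limiting cases concrete, the sketch below evaluates the cost of a single example for a few label and prediction pairs. The function name `example_cost` is an assumption, and `h` must lie strictly between 0 and 1 for the logarithms to be defined.

```python
import math


def example_cost(y, h):
    """Log loss of one example with label y (0 or 1) and predicted probability h."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


print(example_cost(1, 0.99))  # confident and correct: cost close to 0
print(example_cost(1, 0.01))  # confident and wrong: large cost
print(example_cost(0, 0.01))  # confident and correct for the negative class
print(example_cost(0, 0.99))  # confident and wrong for the negative class
```

The cost is symmetric in the two classes: predicting 0.01 for a positive example is penalized exactly as much as predicting 0.99 for a negative one.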

## Math Derivation

To understand why the cost function is structured this way, let's consider a function that unifies both scenarios:

$ P(y | x^{(i)}, \theta) = h(x^{(i)}, \theta)^{y^{(i)}} (1 - h(x^{(i)}, \theta))^{(1 - y^{(i)})} $

Here, when $y = 1$, the function becomes $h(x^{(i)}, \theta)$, and when $y = 0$, it becomes $1 - h(x^{(i)}, \theta)$. This is consistent, since the two probabilities sum to 1.

For $y = 0$, you want $1 - h(x^{(i)}, \theta)$ to be close to 1, meaning $h(x^{(i)}, \theta)$ should be close to 0. For $y = 1$, $h(x^{(i)}, \theta)$ should be close to 1.
 

To model the entire dataset, not just individual examples, we define the likelihood as:

$ L(\theta) = \prod_{i=1}^{m} h(x^{(i)}, \theta)^{y^{(i)}} (1 - h(x^{(i)}, \theta))^{(1 - y^{(i)})} $

The $\prod$ symbol indicates a product of terms, one per example, so a poor prediction on even one example lowers the overall likelihood. The goal is to find the $\theta$ that maximizes $L(\theta)$, which means making $h(x^{(i)}, \theta)$ large when $y^{(i)} = 1$ and small when $y^{(i)} = 0$. As $m$ increases, $L(\theta)$ tends to zero, since both $h(x^{(i)}, \theta)$ and $1 - h(x^{(i)}, \theta)$ are bounded between 0 and 1. To make the expression easier to work with, we take the logarithm, which is monotonically increasing (so it preserves the maximizer) and turns the log of a product into a sum:

$ \log(a \cdot b \cdot c) = \log a + \log b + \log c $

$ \log a^b = b \log a $

Using these identities, we can rewrite the equation as:

$ \log L(\theta) = \log \prod_{i=1}^{m} h(x^{(i)}, \theta)^{y^{(i)}} (1 - h(x^{(i)}, \theta))^{(1 - y^{(i)})} $

$ = \sum_{i=1}^{m} \left[ \log h(x^{(i)}, \theta)^{y^{(i)}} + \log (1 - h(x^{(i)}, \theta))^{(1 - y^{(i)})} \right] $

$ = \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}, \theta) + (1 - y^{(i)}) \log (1 - h(x^{(i)}, \theta)) \right] $

We divide by $m$ to average over the examples:

$ \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}, \theta) + (1 - y^{(i)}) \log (1 - h(x^{(i)}, \theta)) \right] $

Maximizing this average log-likelihood is equivalent to minimizing its negative, just as maximizing $-x^2$ is equivalent to minimizing $x^2$. Therefore, we add a negative sign and minimize the resulting cost function:

$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}, \theta) + (1 - y^{(i)}) \log (1 - h(x^{(i)}, \theta)) \right] $

In vectorized form, this is expressed as:

$ J(\theta) = \frac{1}{m} \cdot (-y^T \log(h) - (1 - y)^T \log(1 - h)) $

This deep dive into the mathematical derivation of the cost function in logistic regression provides a thorough understanding of its rationale and application.

## 6.4. Gradient

The general form of gradient descent is defined as:
  
   $\text{Repeat}\ \{\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)\}$ for all $j$.


- $\theta_j $: This represents an individual parameter in a vector of parameters $\theta $. In the context of logistic regression, $\theta $ is the vector of weights that the algorithm is trying to learn.

- The term "Repeat" indicates that the following process is iterative, meaning it is repeated multiple times to gradually approach the optimal set of parameters.

- $\alpha$: This is the learning rate, a hyperparameter that controls the step size at each iteration while moving towards a minimum of the cost function. It's crucial for balancing the speed and stability of the learning process.

- $\frac{\partial}{\partial \theta_j}J(\theta)$: This is the partial derivative of the cost function $J(\theta)$ with respect to the parameter $\theta_j$. It indicates the direction and rate of increase of the cost function with respect to $\theta_j$.

- $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)$: This is the rule for updating the parameter $\theta_j$. The current value of $\theta_j$ is adjusted by subtracting the product of the learning rate $\alpha$ and the partial derivative of the cost function with respect to $\theta_j$. This subtraction is key: since the gradient points in the direction of steepest increase, subtracting it moves $\theta_j$ in the direction of steepest decrease, which is what we want for minimization.

- "For all $j$": This indicates that the update rule is applied to every parameter in the vector $\theta$. Each parameter $\theta_j$ in the model is updated in this manner.

We can work out the derivative part using calculus to get:
  
   $\text{Repeat}\{\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}(h(x^{(i)},\theta) - y^{(i)})x_j^{(i)}\}$

A vectorized implementation is:
   
   $\theta := \theta - \frac{\alpha}{m} X^T(H(X,\theta) - Y)$

To elaborate on the process of finding the derivative of the sigmoid function $ h(x) $, let's go through the mathematical steps in detail:

**Sigmoid Function Definition**: 
   The sigmoid function is defined as 
   $ h(x) = \frac{1}{1 + e^{-x}} $

**Applying the Chain Rule**: 
   To find the derivative $ h(x)' $, we apply the chain rule to $\frac{1}{1 + e^{-x}} $.

**Derivative Calculation**:
   - The first step is to differentiate the outer function. In our case, the outer function is $ f(u) = \frac{1}{u} $, where $ u = 1 + e^{-x} $. The derivative of $ f(u) $ with respect to $ u $ is $ f'(u) = -\frac{1}{u^2} $.
   - Next, differentiate the inner function $ u = 1 + e^{-x} $. The derivative of $ u $ with respect to $ x $ is $ u' = -e^{-x} $.
   - Now, apply the chain rule: $ h(x)' = f'(u) \cdot u' $.

**Chain Rule Application**: 
   Plugging in the derivatives, we get:
   $ h(x)' = -\frac{1}{(1 + e^{-x})^2} \cdot \left(-e^{-x}\right) $

**Simplification**:
   - Simplifying the expression, we have:
   $ h(x)' = \frac{e^{-x}}{(1 + e^{-x})^2} $
   - Next, notice that $ \frac{1}{1 + e^{-x}} $ is $ h(x) $ itself. Factoring it out, the expression can be rewritten as:
   $ h(x)' = h(x) \cdot \frac{e^{-x}}{1 + e^{-x}} $

**Final Form**:
   - Observe that $ \frac{e^{-x}}{1 + e^{-x}} = 1 - \frac{1}{1 + e^{-x}} = 1 - h(x) $.
   - Therefore, the final form of the derivative is:
   $ h(x)' = h(x)(1 - h(x)) $

The derivative of the sigmoid function $ h(x) $ is $ h(x)(1 - h(x)) $, which is derived using the chain rule. This derivative is a critical component in calculating the gradient for logistic regression, as it's used in the partial derivative of the cost function $ J(\theta) $ with respect to the parameters $ \theta $.
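The identity $ h(x)' = h(x)(1 - h(x)) $ can be sanity-checked numerically by comparing it with a central finite difference. This is only a sketch: the test points and the step size `eps` are arbitrary choices, and `numerical_derivative` is a hypothetical helper.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_derivative(z):
    """Closed form derived above: h(z) * (1 - h(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)


for z in (-2.0, 0.0, 0.5, 3.0):
    print(z, sigmoid_derivative(z), numerical_derivative(sigmoid, z))
```

The derivative peaks at $x = 0$, where it equals 0.25, and shrinks towards zero as the sigmoid saturates on either side.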

To find $ \frac{\partial}{\partial \theta_j} J(\theta) $, we need to differentiate the cost function with respect to each parameter $ \theta_j $.
- Apply the chain rule to differentiate the terms inside the summation. The derivative of the cost function with respect to $ \theta_j $ involves differentiating both $ \log(h(x^{(i)}, \theta)) $ and $ \log(1 - h(x^{(i)}, \theta)) $ with respect to $ \theta_j $.
- For the term $ \log(h(x^{(i)}, \theta)) $, the derivative is:
  $ \frac{\partial}{\partial \theta_j} \log(h(x^{(i)}, \theta)) = \frac{1}{h(x^{(i)}, \theta)} \cdot h(x^{(i)}, \theta)(1 - h(x^{(i)}, \theta)) \cdot x_j^{(i)} $
  where $ h(x^{(i)}, \theta)(1 - h(x^{(i)}, \theta)) \cdot x_j^{(i)} $ is the derivative of $ h(x^{(i)}, \theta) $ with respect to $ \theta_j $, as derived earlier.
- Similarly, for the term $ \log(1 - h(x^{(i)}, \theta)) $, the derivative is:
  $ \frac{\partial}{\partial \theta_j} \log(1 - h(x^{(i)}, \theta)) = \frac{-1}{1 - h(x^{(i)}, \theta)} \cdot h(x^{(i)}, \theta)(1 - h(x^{(i)}, \theta)) \cdot x_j^{(i)} $

**Combining the Derivatives**:

- Now, combine these derivatives into the overall expression for $ \frac{\partial}{\partial \theta_j} J(\theta) $:

  $ \frac{\partial}{\partial \theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \cdot \frac{1}{h(x^{(i)}, \theta)} - (1 - y^{(i)}) \cdot \frac{1}{1 - h(x^{(i)}, \theta)} \right] \cdot h(x^{(i)}, \theta)(1 - h(x^{(i)}, \theta)) \cdot x_j^{(i)} $

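- Multiplying the bracket by $ h(x^{(i)}, \theta)(1 - h(x^{(i)}, \theta)) $ gives $ y^{(i)}(1 - h(x^{(i)}, \theta)) - (1 - y^{(i)}) h(x^{(i)}, \theta) = y^{(i)} - h(x^{(i)}, \theta) $, and the leading minus sign flips the order, so the expression simplifies to: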
  $ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ h(x^{(i)}, \theta) - y^{(i)} \right] x_j^{(i)} $

**Summary**:
- The partial derivative $ \frac{\partial}{\partial \theta_j} J(\theta) $ essentially represents the gradient of the cost function $ J(\theta) $ with respect to each parameter $ \theta_j $. It tells us how much the cost function changes as each parameter $ \theta_j $ changes, which is crucial for the gradient descent algorithm used in training the logistic regression model.
- In gradient descent, these derivatives are used to update each parameter $ \theta_j $ in the direction that minimizes the cost function $ J(\theta) $.

This breakdown shows the mathematical derivation involved in computing the gradients for logistic regression, which are essential for optimizing the model's parameters.

The vectorized version of the gradient:

$ \nabla J(\theta) = \frac{1}{m} \cdot X^T \cdot (H(X, \theta) - Y) $

Where:
- $ \nabla J(\theta) $ is the gradient of the cost function.
- $ m $ is the number of training examples.
- $ X^T $ is the transpose of the feature matrix.
- $ H(X, \theta) $ is the hypothesis function applied to all training examples.
- $ Y $ is the vector of actual labels.
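A common way to gain confidence in a derived gradient is a gradient check: compare the analytic expression with finite differences of the cost. The sketch below does this in plain Python; the data and the point `theta` at which the gradient is evaluated are hypothetical values.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """Average log loss J(theta)."""
    total = 0.0
    for x, label in zip(X, y):
        h = predict(theta, x)
        total += label * math.log(h) + (1 - label) * math.log(1 - h)
    return -total / len(X)


def analytic_gradient(theta, X, y):
    """(1/m) * X^T (h - y), written with explicit sums."""
    m = len(X)
    errors = [predict(theta, x) - label for x, label in zip(X, y)]
    return [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(len(theta))]


def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences of the cost, one parameter at a time."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2.0 * eps))
    return grad


X = [[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]]  # hypothetical features
y = [1, 1, 0, 0]
theta = [0.1, -0.2, 0.3]                          # arbitrary evaluation point

analytic = analytic_gradient(theta, X, y)
numeric = numerical_gradient(theta, X, y)
print(analytic)
print(numeric)
```

If the two vectors disagree by more than a small tolerance, the derivation or its implementation likely contains an error.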



# Takeaway

The cost function for logistic regression is typically defined as a log loss function. The purpose of this cost function is to penalize incorrect predictions more heavily. It is expressed as:

$ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(h(x_i, \theta)) + (1 - y_i) \log(1 - h(x_i, \theta)) \right] $

Where:
- $m$ is the number of training examples.
- $y_i$ is the actual label of the $i$-th training example.
- $h(x_i, \theta)$ is the predicted probability for the $i$-th training example, computed using the logistic function $\frac{1}{1 + e^{-\theta^T x_i}}$.

Two limiting cases follow from this definition:

1. **When $y_i = 1$ and $h(x_i, \theta)$ approaches 0**: The cost approaches infinity. The term $y_i \log(h(x_i, \theta))$ tends to $\log(0)$, that is, to negative infinity (the logarithm of a number very close to 0 is a very large negative number). Because the cost function has a negative sign in front, the cost approaches positive infinity.


2. **When $y_i = 0$ and $h(x_i, \theta)$ approaches 0**: The cost approaches 0. Here, the term $(1 - y_i) \log(1 - h(x_i, \theta))$ tends to $\log(1)$, which is 0, and the other term $y_i \log(h(x_i, \theta))$ is 0 because $y_i$ is 0.




In the context of logistic regression, the sigmoid (or logistic) function maps the dot product of the parameter vector $\theta$ and the feature vector $x_i$, that is $\theta^T x_i$, to a probability value between 0 and 1. The sigmoid function is defined as:

$ h(x_i, \theta) = \frac{1}{1 + e^{-\theta^T x_i}} $

For what values of $\theta^T x_i$ does $h(x_i, \theta) = 0.5$?

Set the equation equal to 0.5 and solve for $\theta^T x_i$:

$ 0.5 = \frac{1}{1 + e^{-\theta^T x_i}} $

Rearranging isolates $e^{-\theta^T x_i}$:

$ e^{-\theta^T x_i} = \frac{1}{0.5} - 1 = 2 - 1 = 1 $

Taking the natural logarithm of both sides gives:

$ -\theta^T x_i = \ln(1) $

Since $\ln(1) = 0$, this simplifies to:

$ -\theta^T x_i = 0 $

Thus, $\theta^T x_i = 0$.

So the sigmoid function $h(x_i, \theta)$ equals 0.5 when $\theta^T x_i = 0$. This is the point where the prediction switches from one class (output below 0.5) to the other (output above 0.5).




In logistic regression for binary classification, suppose $y_i = 1$ for a specific example $i$, and it is known that $h(x_i, \theta) > 0.5$. Consider two statements about this example:

**Our prediction $h(x_i, \theta)$ for this specific training example is less than $1 - y_i$**:

This is false. Since $y_i = 1$, we have $1 - y_i = 0$. Saying that $h(x_i, \theta)$ is less than $1 - y_i$ is therefore the same as saying it is less than 0, which is not possible because $h(x_i, \theta)$ is always between 0 and 1.


**Our prediction $h(x_i, \theta)$ for this specific training example is greater than $1 - h(x_i, \theta)$**:

This is true. Since $h(x_i, \theta)$ is a probability greater than 0.5, its complement $1 - h(x_i, \theta)$ is less than 0.5. Therefore, $h(x_i, \theta)$ is indeed greater than $1 - h(x_i, \theta)$.



Reasonable criteria for deciding when to stop training a model:

- When your accuracy on held-out data (the validation set) is good enough.
- When you plot the cost versus iterations and you see that your loss is converging.










