Objective: Learn to extract features from text into numerical vectors, then build a binary classifier for tweets using logistic regression.

# 1. Supervised ML & Sentiment Analysis
In supervised machine learning, the typical process involves an input variable X that is fed into a prediction function, resulting in an estimated output Ŷ. This estimated output is then compared against the actual value Y to determine the cost. The cost is used to adjust the parameters θ. The process is summarized in the following image.

<img src="../../img/supervised-ml.PNG" alt="Supervised ML" width="70%" height="70%">

To conduct sentiment analysis on a tweet, the text (for example, "I am happy because I am learning NLP") must first be converted into a feature representation. You then train a logistic regression classifier on these features. Once trained, the classifier can be applied to categorize the sentiment of a text: it outputs 1 for a positive sentiment or 0 for a negative sentiment.

<img src="../../img/feature-representation.PNG" alt="Feature Representation" width="70%" height="70%">

# 2. Vocabulary & Feature Extraction
When you have a text snippet, like a tweet, it can be transformed into a vector of dimension V, where V is the total number of words in your vocabulary. Take the tweet 'I am happy because I am learning NLP' as an example. Each word in the tweet is denoted by a 1 at its respective index in the vector, while all other indices are marked with a 0. As V increases, the vector becomes more sparse. The number of features also grows, which in turn requires training V parameters θ:
   - θ represents the parameters or weights of the model that need to be learned during the training process.
   - If the number of features is V, then the model needs to learn a corresponding number of parameters (essentially, one parameter for each feature).
   - As V increases, the total number of parameters in θ also increases.
   
 Consequently, this can result in longer training and prediction times.
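
The sparse representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the course implementation: the two-tweet corpus and the helper names `build_vocabulary` and `to_sparse_vector` are assumptions made up for the example.

```python
def build_vocabulary(tweets):
    """Collect the unique lowercase words of a corpus, in first-seen order."""
    vocabulary = []
    seen = set()
    for tweet in tweets:
        for word in tweet.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def to_sparse_vector(tweet, vocabulary):
    """Return a length-V vector with 1 where a vocabulary word occurs in the tweet."""
    words = set(tweet.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical corpus, used only to have a vocabulary to index into.
corpus = [
    "I am happy because I am learning NLP",
    "I hated the movie",
]
vocabulary = build_vocabulary(corpus)
vector = to_sparse_vector("I am happy because I am learning NLP", vocabulary)

print(len(vocabulary))  # V, the number of features (and of weights to learn)
print(vector)
```

Every new word in the corpus adds one more position to the vector, and most positions stay 0 for any single tweet, which is why the representation becomes sparse and expensive as V grows.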

<img src="../../img/feature-extraction.PNG" alt="Feature Extraction" width="70%" height="70%">

# 3. Feature Extraction with Frequencies 

Let's consider these tweets:

<img src="../../img/tweets.PNG" alt="tweets" width="70%" height="70%">


Each tweet must be transformed into a vector. Previously these vectors had a dimension equal to V, the size of the vocabulary; here each tweet is instead represented by a vector of dimension 3. To accomplish this, construct a dictionary that maps each word together with a class (positive or negative) to the number of times the word occurs in tweets of that class.
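
A frequency dictionary of this kind could be built as follows. This is a sketch under assumptions: the four tweets and their labels are an assumed corpus (the actual tweets are only shown in the image), and `build_freqs` is a hypothetical helper that does no preprocessing beyond lowercasing and dropping commas.

```python
from collections import defaultdict


def build_freqs(tweets, labels):
    """Map each (word, label) pair to how often the word occurs in that class."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tweet.lower().replace(",", "").split():
            freqs[(word, label)] += 1
    return dict(freqs)


# Assumed corpus: 1 marks a positive tweet, 0 a negative one.
tweets = [
    "I am happy because I am learning NLP",
    "I am happy",
    "I am sad, I am not learning NLP",
    "I am sad",
]
labels = [1, 1, 0, 0]

freqs = build_freqs(tweets, labels)
for key in sorted(freqs):
    print(key, freqs[key])
```

Keying on the `(word, label)` pair keeps the positive and negative counts in one dictionary; a pair that never occurs is simply absent and can be read as 0 with `freqs.get(pair, 0)`.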

<img src="../../img/dictionary.PNG" alt="dictionary" width="70%" height="70%">

The table shows that some words, such as 'happy' and 'sad', align clearly with one sentiment, whereas others, like 'I' and 'am', appear neutral. Using this dictionary and the example tweet 'I am sad, I am not learning NLP', you can construct the feature vector. The value \[ 8 \] is the sum of the positive-class frequencies of the words in the tweet.

<img src="../../img/dictionary-positives.PNG" alt="dictionary" width="70%" height="70%">

The negative feature is encoded in the same way: \[ 11 \] is the sum of the negative-class frequencies of the words in the tweet.
<img src="../../img/dictionary-negatives.PNG" alt="dictionary" width="70%" height="70%">

As a result, the derived feature vector is \[1, 8, 11 \]. In this vector, \[ 1 \] serves as the bias term, \[ 8 \] is the positive feature, and \[ 11 \] is the negative feature.
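
The arithmetic behind this vector can be checked with a short sketch. The frequency values below are assumptions chosen to be consistent with the sums 8 and 11 in the text (the actual table is only shown in the image), and `extract_features` here is a hypothetical helper, not necessarily the notebook's function of the same name. Each distinct word of the tweet is counted once.

```python
def extract_features(words, freqs):
    """Return [bias, sum of positive frequencies, sum of negative frequencies]."""
    features = [1, 0, 0]  # the leading 1 is the bias term
    for word in set(words):  # each distinct word contributes once
        features[1] += freqs.get((word, 1), 0)
        features[2] += freqs.get((word, 0), 0)
    return features


# Assumed frequency dictionary: (word, class) -> count, with 1 = positive, 0 = negative.
freqs = {
    ("i", 1): 3, ("i", 0): 3,
    ("am", 1): 3, ("am", 0): 3,
    ("happy", 1): 2,
    ("because", 1): 1,
    ("learning", 1): 1, ("learning", 0): 1,
    ("nlp", 1): 1, ("nlp", 0): 1,
    ("sad", 0): 2,
    ("not", 0): 1,
}

# 'I am sad, I am not learning NLP' after lowercasing and tokenising.
words = ["i", "am", "sad", "i", "am", "not", "learning", "nlp"]
print(extract_features(words, freqs))  # [1, 8, 11]
```

With these counts the positive sum is 3 + 3 + 1 + 1 = 8 and the negative sum is 3 + 3 + 2 + 1 + 1 + 1 = 11.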

# 4. Preprocessing
During the preprocessing phase, the necessary steps include:

  - Removing usernames and links from the text.
  - Splitting the text into individual words (tokenisation).
  - Discarding common stop words such as "the", "and", and "for".
  - Applying stemming to reduce words to their root form, for which the Porter Stemmer algorithm is commonly utilized.
  - Transforming all text to lowercase.

Example: "@DrJaneDoe TUNED a great AI model"

After applying the preprocessing steps, the tweet is reduced to:

\[ tun, great, ai, model \].

This illustrates the exclusion of usernames and URLs, word tokenization, stop words removal, stemming of words to their base forms, and conversion to lowercase.
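
The steps above can be sketched with the standard library alone. Note the assumptions: `preprocess` and `naive_stem` are hypothetical names, the stop-word set is a tiny illustrative sample, and `naive_stem` is a crude suffix stripper standing in for a real Porter stemmer, so its output will differ from Porter's on many words.

```python
import re
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "for", "is", "are", "of", "to"}


def naive_stem(word):
    """Crude suffix stripper, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Remove links and usernames, tokenise, lowercase, drop stop words, stem."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # remove links
    tweet = re.sub(r"@\w+", "", tweet)          # remove usernames
    tokens = tweet.lower().split()              # lowercase and split on whitespace
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]


print(preprocess("@DrJaneDoe TUNED a great AI model"))
# ['tun', 'great', 'ai', 'model']
```

In practice a library tokenizer, stop-word list, and stemmer would replace these hand-rolled pieces; the point here is only the order of the steps.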

<img src="../../img/preprocessing.PNG" alt="preprocessing" width="70%" height="70%">

# 5. Synthesizing the process

You begin with a selected piece of text. First you apply the preprocessing steps, then feature extraction, which translates the text into a numerical format as detailed below:

<img src="../../img/process.PNG" alt="process" width="70%" height="70%">

Your \( X \) will then take on the dimensions \( (m, 3) \), as shown below.

<img src="../../img/matrix.PNG" alt="matrix" width="40%" height="40%">




## Example

In this example, we are given the tweet:

```plaintext
tweets = ['I am Happy Because i am learning NLP, @deeplearning']
```

i. **Tweet Preprocessing:**
   - Your `process_tweet` function tokenizes and stems the tweet, and removes stop words and punctuation.
   - In this process, "Happy" is likely stemmed to "happi", and other words like "I", "am", "because", and "learning" are either removed as stop words or retained, depending on whether they appear in the stop-words list.

   [access code preprocessing](example.ipynb)

ii. **Frequency Calculation:**
   - The `extract_features` function then calculates the frequency of each word (after preprocessing) in the context of positive (label 1.0) and negative (label 0.0) sentiments.

[access code words frequencies](example.ipynb)



# 6. Logistic regression 

Logistic regression employs the sigmoid function, expressed as

$h(\mathbf{x}^{(i)}, \theta) = \frac{1}{1 + e^{-\theta^\top \mathbf{x}^{(i)}}}$

to output a probability between 0 and 1. During training, this function produces the predictions from which the cost is computed; the parameters are then updated and the iteration repeated until the cost is sufficiently minimized. In logistic regression, the prediction function is therefore the sigmoid function.

<img src="../../img/sig.PNG" alt="sigmoid" width="70%" height="70%">


The classification function in logistic regression, denoted as $h$, is based on the sigmoid function and depends on the parameters $\theta$ and the feature vector $\mathbf{x}^{(i)}$, where $i$ denotes each observation or data point. In the context of tweets, this refers to individual tweets. The sigmoid function approaches zero as the product $\theta^\top \mathbf{x}^{(i)}$ approaches negative infinity, and approaches one as this product approaches positive infinity.
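
The limiting behaviour described above is easy to confirm numerically. Below is a minimal sketch of the sigmoid on a scalar input $z = \theta^\top \mathbf{x}^{(i)}$; the two-branch form is an implementation choice (an assumption of this sketch, not something the text prescribes) that avoids overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # safe: z < 0, so exp(z) is in (0, 1)
    return exp_z / (1.0 + exp_z)


for z in (-20.0, -2.0, 0.0, 2.0, 20.0):
    print(z, sigmoid(z))
```

The printed values move from nearly 0 through exactly 0.5 at `z = 0` to nearly 1, matching the shape of the curve in the figure.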

Classification within logistic regression requires setting a threshold, typically 0.5. This threshold corresponds to the point where the dot product $\theta^\top \mathbf{x}^{(i)}$ equals zero. Therefore, a prediction is considered positive when the dot product is zero or higher, and negative when it is below zero.

To illustrate this within the realm of tweets and sentiment analysis, consider a specific tweet. After preprocessing, you obtain a list where usernames are removed, all text is converted to lowercase, and words like 'tuning' are stemmed to 'tun'.

Using a frequency dictionary, features are extracted to create a vector that includes a bias unit and two features indicating the sum of positive and negative word frequencies in your processed tweet. 
 
With an optimal set of parameters $\theta$, you can then calculate the dot product $\theta^\top \mathbf{x}^{(i)}$, for example 4.92. The sigmoid of this value is about 0.99, well above the 0.5 threshold, so you predict a positive sentiment for the tweet.
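
The prediction step can be sketched as follows. The feature vector and parameter values are hypothetical numbers picked so that the dot product comes out at roughly 4.92; they are not taken from a trained model, and `predict` is an illustrative helper name.

```python
import math


def predict(x, theta, threshold=0.5):
    """Return the dot product, the sigmoid probability, and the 0/1 label."""
    z = sum(t * xi for t, xi in zip(theta, x))
    probability = 1.0 / (1.0 + math.exp(-z))
    label = 1 if probability >= threshold else 0
    return z, probability, label


# Hypothetical values: [bias, positive frequency sum, negative frequency sum].
x = [1, 3476, 245]
theta = [0.00003, 0.00150, -0.00120]

z, probability, label = predict(x, theta)
print(round(z, 2), probability, label)
```

Comparing the probability against 0.5 is equivalent to checking whether `z` is at least 0, which is the threshold rule stated earlier.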

<img src="../../img/sig2.PNG" alt="sigmoid2" width="70%" height="70%">




## 6.1. Logistic Regression: Training


To train your logistic regression model, the steps are:
- Initialize your parameter $\theta$, which the sigmoid function needs to produce predictions.
- Calculate the gradient of the cost and use it to update $\theta$.
- Determine the cost.

The gradient and cost steps are repeated until the results are satisfactory.

<img src="../../img/training.PNG" alt="sigmoid" width="70%" height="70%">



Typically, training continues until the cost stabilizes. If you graph the cost against the number of iterations, you would typically see it drop quickly at first and then flatten out as training converges.
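
The training loop can be sketched in plain Python as batch gradient descent. Several things here are assumptions of the sketch rather than details given in the text: the binary cross-entropy cost, the learning rate, the iteration count, and the tiny four-row dataset of `[bias, positive, negative]` features.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, learning_rate=0.1, iterations=200):
    """Batch gradient descent for logistic regression; returns theta and the cost history."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n  # initialize the parameters
    costs = []
    for _ in range(iterations):
        # Predictions with the current parameters.
        h = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) for row in X]
        # Binary cross-entropy cost (assumed choice of cost function).
        cost = -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
                    for yi, hi in zip(y, h)) / m
        costs.append(cost)
        # Gradient of the cost with respect to each parameter.
        gradient = [sum((hi - yi) * row[j] for hi, yi, row in zip(h, y, X)) / m
                    for j in range(n)]
        # Gradient descent update.
        theta = [t - learning_rate * g for t, g in zip(theta, gradient)]
    return theta, costs


# Toy data: two positive and two negative examples.
X = [[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]]
y = [1, 1, 0, 0]

theta, costs = train(X, y)
print(theta)
print(costs[0], costs[-1])
```

Plotting `costs` against the iteration index gives the kind of convergence curve discussed above: the cost starts at about 0.693 (the value for all-zero parameters) and decreases as the parameters improve.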

<img src="../../img/convergence.PNG" alt="sigmoid" width="50%" height="50%">