# Basic Machine Learning in NLP

## Task: Sentiment Analysis

* Sentiment analysis is the task of determining the emotional tone of a piece of text.
* This is especially useful in understanding customer opinions in product reviews, where sentiments are broadly categorized as positive or negative.
* For example, analyzing reviews on an e-commerce site can reveal customer satisfaction levels and preferences, aiding in better decision-making for both businesses and potential buyers.
* Similarly, in the context of book reviews, sentiment analysis can provide insights into readers' reception of the content, style, and storytelling.

## Data Preparation

Before diving into the models, the first step is to acquire and understand the dataset we will be using.

1. **Download Dataset**
* Navigate to the [Large Movie Review Dataset v1.0 page](https://ai.stanford.edu/~amaas/data/sentiment/) provided by Stanford AI Lab.

* Download the dataset titled 'Large Movie Review Dataset v1.0'.

* This dataset is specifically curated for binary sentiment classification, making it ideal for our purposes.

2. **Explore the Dataset**
* Once downloaded, extract the archive and explore its structure. You'll notice it's divided into two main directories: `train` and `test`.
Each of these contains `neg` (negative) and `pos` (positive) subdirectories.

* Familiarize yourself with the data by opening and reading a few sample files from both the neg and pos categories in both the train and test directories.
This will give you a sense of the text data we'll be classifying.






Load the train and test sets of the dataset and store the review texts and their labels in lists.



In [1]:
#
# implement here
#
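One possible way to load a split is sketched below. It assumes the directory layout described above (`<root>/<split>/neg` and `<root>/<split>/pos`, one review per `.txt` file). The function name `load_split` and the label encoding (1 for `pos`, 0 for `neg`) are illustrative choices, not part of the dataset.

```python
from pathlib import Path


def load_split(root, split):
    """Return (texts, labels) for one split; label 1 = pos, 0 = neg (assumed encoding)."""
    texts, labels = [], []
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / label_name
        # Sort so the order of documents is reproducible across runs.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


# Example usage, assuming the archive was extracted to ./aclImdb:
# train_texts, train_labels = load_split("aclImdb", "train")
# test_texts, test_labels = load_split("aclImdb", "test")
```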

## Hand-coded Rules

This section introduces you to the concept of rule-based classification, a fundamental approach in NLP. You'll create a simple sentiment analyzer based on keyword matching.

1. **Prepare Keyword Sets**
* Compile two lists of keywords: one representing positive sentiment ($PS$) and the other representing negative sentiment ($NS$).
* Think about common words that clearly convey positivity or negativity.

In [2]:
# Define lists of positive and negative keywords
# PS =
# NS =


2. **Count Keyword Matches**
* For a given input text $t$, calculate the number of words that appear in the $PS$ set ($n_{PS}^{t}$) and in the $NS$ set ($n_{NS}^{t}$).
* This step involves processing the text to identify and count the matching words.


In [3]:
def count_keywords(text, PS, NS):
  """
  Counts the number of positive and negative keywords in a given text.

  Args:
    text: The text to analyze.
    PS: A set of positive keywords.
    NS: A set of negative keywords.

  Returns:
    A tuple containing the number of positive and negative keywords found.
  """
  pass
  #
  # implement here
  #
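As a point of comparison, here is a minimal sketch of the counting step. It assumes a very simple tokenizer (lowercase, letters and apostrophes only), and the keyword sets are small illustrative examples rather than recommended lists.

```python
import re


def count_keywords(text, PS, NS):
    """Count tokens of `text` that appear in the positive and negative keyword sets."""
    # Assumed tokenization: lowercase, then keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    n_ps = sum(1 for tok in tokens if tok in PS)
    n_ns = sum(1 for tok in tokens if tok in NS)
    return n_ps, n_ns


# Illustrative keyword sets (assumptions, not a curated lexicon).
PS = {"good", "great", "excellent", "wonderful"}
NS = {"bad", "awful", "boring", "terrible"}
print(count_keywords("A great cast, but a boring and awful plot.", PS, NS))  # (1, 2)
```

Using sets rather than lists keeps each membership check constant-time, which matters when scanning thousands of reviews.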


3. **Create Decision Rules**
* Develop your own set of rules to classify the sentiment as Positive or Negative based on $n_{PS}^{t}$ and $n_{NS}^{t}$.
* For example, you might decide that a text is positive if $n_{PS}^{t}$ is greater than $n_{NS}^{t}$ and negative if the opposite is true. Because `classify_sentiment` returns only 1 or 0, you also need a rule for ties (equal counts), such as falling back to a default class.
* Feel free to experiment with different rule combinations to see what works best.

In [4]:
def classify_sentiment(text, PS, NS):
  """
  Classifies the sentiment of a given text based on keyword counts.

  Args:
    text: The text to analyze.
    PS: A set of positive keywords.
    NS: A set of negative keywords.

  Returns:
    1 for "pos" or 0 for "neg" depending on the keyword counts.
  """
  pass
  #
  # implement here
  #
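One simple rule set could look like the sketch below. The `tie_label` parameter is a hypothetical addition that makes the tie-breaking choice explicit, and the tokenizer and keyword sets are again illustrative assumptions.

```python
import re


def classify_sentiment(text, PS, NS, tie_label=0):
    """Return 1 (pos) or 0 (neg) by comparing keyword counts; ties fall back to tie_label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n_ps = sum(tok in PS for tok in tokens)
    n_ns = sum(tok in NS for tok in tokens)
    if n_ps > n_ns:
        return 1
    if n_ns > n_ps:
        return 0
    return tie_label  # equal counts, including the case of no keyword matches


PS = {"good", "great", "excellent"}
NS = {"bad", "awful", "boring"}
print(classify_sentiment("Great acting and a good script.", PS, NS))  # 1
print(classify_sentiment("Boring. Just awful.", PS, NS))              # 0
```

Reviews with no keyword hits all receive `tie_label`, so the choice of default class can noticeably affect accuracy when the keyword lists are short.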

4. **Evaluation**

* Apply your method to the test set and evaluate its performance (for example, its accuracy).

In [5]:
#
# implement here
#
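A minimal evaluation loop might look like this. It assumes the classifier is a function from text to a 0/1 label; `evaluate` and `toy_classifier` are hypothetical names used only for illustration.

```python
def evaluate(classifier, texts, labels):
    """Return the fraction of texts whose predicted label matches the true label."""
    predictions = [classifier(text) for text in texts]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical stand-in for a keyword-based classifier.
def toy_classifier(text):
    return 1 if "great" in text.lower() else 0


texts = ["A great film", "Dull and slow", "Not great at all"]
labels = [1, 0, 0]
print(evaluate(toy_classifier, texts, labels))  # 2 of 3 predictions are correct
```

The third example also shows a typical failure of keyword matching: negation ("not great") is counted as a positive signal.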

## Naive Bayes Classifier with Bag of Words

In this section, we'll step into the realm of machine learning by implementing a Naive Bayes Classifier, a popular and straightforward probabilistic classifier that is often effective in text classification tasks.

### Understand the Naive Bayes Principle
* Naive Bayes is a probabilistic classifier that assumes the features are conditionally independent of each other given the class.
* In the context of text classification, it calculates the probability of a document belonging to a certain class (positive or negative) based on the probabilities of the words it contains.

Assuming we have a document $d$ and we want to classify it into one of the classes $C$ (which in our case are 'positive' or 'negative'), the Naive Bayes classifier calculates the probability of $d$ belonging to a class $c$ in $C$ based on the words it contains.

The equation for this can be written as:

$$
P(c|d) = \frac{P(d|c) \times P(c)}{P(d)}
$$

Where
* $ P(c|d) $ is the posterior probability of class $ c $ given document $ d $.
* $ P(d|c) $ is the likelihood of document $ d $ given class $ c $.
* $ P(c) $ is the prior probability of class $ c $.
* $ P(d) $ is the probability of document $ d $.

For text classification, the independence assumption lets the likelihood factor into per-word terms:
$$
P(d|c) = P(w_1|c) \times P(w_2|c) \times \cdots \times P(w_n|c)
$$

where $ d $ is composed of words $ w_1, w_2, \ldots, w_n $, and each $ P(w_i|c) $ is the probability of word $ w_i $ occurring in documents of class $ c $.
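
In practice, two simplifications are commonly applied to these equations. Since $ P(d) $ is the same for every class, it can be dropped when comparing classes, and since a product of many small probabilities can underflow, the comparison is usually done on logarithms. A sketch of the resulting decision rule, using the same symbols as above ($ \hat{c} $ denotes the predicted class):

$$
\hat{c} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(w_i|c) = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i|c) \right]
$$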


### Feature Extraction: Bag of Words

In this stage, you'll transform your text data into a format suitable for the Naive Bayes Classifier, using the 'Bag of Words' (BoW) model.

Follow these steps to implement the BoW model:

1. **Text Preprocessing**
* Remove punctuation and special characters.
* Convert all text to lowercase.
* Split text into individual words.
* Remove commonly used words that do not contribute to the meaning of the text.
  * e.g., use [NLTK](https://www.nltk.org/)'s stopwords list

In [6]:
#
# implement here
#
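The preprocessing steps above could be sketched as follows. The stop-word set here is a tiny illustrative list so the example runs without downloads; NLTK's `stopwords.words("english")` is a fuller alternative. Stripping HTML tags is an extra step added on the assumption that the raw reviews may contain markup such as `<br />`.

```python
import re

# Tiny illustrative stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "was", "it", "this", "of", "to", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip markup and punctuation, tokenize, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags such as <br />
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in stop_words]


print(preprocess("This movie was GREAT!<br />Loved it."))  # ['movie', 'great', 'loved']
```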

2. **Vocabulary Building and Vectorization**
* List all unique words from the dataset to form a vocabulary.
* Represent each document as a vector, where each dimension corresponds to a word in the vocabulary, and the value is the frequency of the word in the document.

In [7]:
#
# implement here
#
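A small sketch of vocabulary building and count vectorization, assuming documents are already tokenized into lists of words. The names `build_vocabulary` and `vectorize` are illustrative.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each unique word to a column index (sorted for reproducibility)."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {word: idx for idx, word in enumerate(vocab)}


def vectorize(tokens, vocab):
    """Return a count vector; words outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:
            vec[vocab[word]] = count
    return vec


docs = [["great", "movie", "great"], ["boring", "movie"]]
vocab = build_vocabulary(docs)
print(vocab)                      # {'boring': 0, 'great': 1, 'movie': 2}
print(vectorize(docs[0], vocab))  # [0, 2, 1]
```

Dense vectors like these grow with the vocabulary size, so for a full review corpus a sparse representation (for example, keeping the `Counter` per document) may be more practical.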

### Training a Naive Bayes Classifier

**Note**: *Use the training set to train your model*

1. **Probability Calculation**:
   - Calculate the prior probabilities for each class (positive and negative).
   - For each class, count the frequency of each word in the vocabulary.

In [8]:
#
# implement here
#
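The counting step might be sketched like this, assuming tokenized training documents and one label per document. Priors are estimated as the fraction of training documents in each class; `count_statistics` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def count_statistics(tokenized_docs, labels):
    """Return (priors, word_counts) estimated from labeled, tokenized documents."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> frequency
    for tokens, label in zip(tokenized_docs, labels):
        word_counts[label].update(tokens)
    total_docs = len(labels)
    priors = {c: n / total_docs for c, n in class_doc_counts.items()}
    return priors, word_counts


docs = [["great", "movie"], ["great", "fun"], ["boring", "movie"]]
labels = ["pos", "pos", "neg"]
priors, word_counts = count_statistics(docs, labels)
print(priors)                       # prior for "pos" is 2/3, for "neg" 1/3
print(word_counts["pos"]["great"])  # 2
```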


2. **Conditional Probability Calculation with Smoothing**:
   - Compute the conditional probabilities $ P(w_i|c) $ for each word $ w_i $ in your vocabulary and each class $ c $, applying Laplace smoothing to handle zero probabilities.



In [9]:
#
# implement here
#
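With add-$\alpha$ smoothing (Laplace smoothing when $\alpha = 1$), one common form of the estimate is $ P(w_i|c) = \frac{\text{count}(w_i, c) + \alpha}{\sum_{w \in V} \text{count}(w, c) + \alpha |V|} $, where $ V $ is the vocabulary. The sketch below stores log-probabilities, assuming per-class word counts are available as `Counter` objects; the function name is illustrative.

```python
import math
from collections import Counter


def smoothed_log_likelihoods(word_counts, vocabulary, alpha=1.0):
    """Return {class: {word: log P(word | class)}} with add-alpha smoothing."""
    log_probs = {}
    for c, counts in word_counts.items():
        total = sum(counts[w] for w in vocabulary)
        denom = total + alpha * len(vocabulary)
        log_probs[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocabulary}
    return log_probs


word_counts = {
    "pos": Counter({"great": 2, "movie": 1, "fun": 1}),
    "neg": Counter({"boring": 1, "movie": 1}),
}
vocabulary = {"great", "movie", "fun", "boring"}
log_probs = smoothed_log_likelihoods(word_counts, vocabulary)
# "great" never occurs in "neg", yet its probability is (0 + 1) / (2 + 4), not zero.
print(math.exp(log_probs["neg"]["great"]))
```

Without smoothing, a single unseen word would make the whole product $ P(d|c) $ zero for that class.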

3. **Classifier Implementation and Document Classification**:
   - Implement the Naive Bayes formula to calculate $ P(c|d) $ for each document in the test set.
   - Classify each document into the class with the highest calculated probability.


In [10]:
def classify_document(document):
  """
  Classifies a given document using the Naive Bayes classifier.

  Args:
    document: The text document to classify.

  Returns:
    The predicted class ('pos' or 'neg').
  """
  pass
  #
  # implement here
  #
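A possible shape for the classification step is sketched below. Unlike the stub above, this version takes the model parameters as explicit arguments (an assumption made to keep the example self-contained), works in log space, and simply skips words that are not in the vocabulary. The probabilities are made-up toy values.

```python
import math


def classify_document(tokens, priors, log_likelihoods):
    """Return the class with the highest log P(c) + sum of log P(w|c)."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for tok in tokens:
            if tok in log_likelihoods[c]:  # skip out-of-vocabulary words
                score += log_likelihoods[c][tok]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy parameters for illustration only.
priors = {"pos": 0.5, "neg": 0.5}
log_likelihoods = {
    "pos": {"great": math.log(0.6), "boring": math.log(0.1), "movie": math.log(0.3)},
    "neg": {"great": math.log(0.1), "boring": math.log(0.6), "movie": math.log(0.3)},
}
print(classify_document(["great", "movie"], priors, log_likelihoods))  # pos
```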

4. **Performance Evaluation**
   - Evaluate the classifier using metrics like accuracy, precision, recall, and F1 score, comparing the predicted class labels against the true labels of the test set.

In [11]:
#
# implement here
#
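The four metrics can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below treats one class as "positive" for the purpose of precision and recall; the function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Return accuracy, precision, recall, and F1 with respect to the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "pos"]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```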

## Naive Bayes Classifier with TF-IDF

**Term Frequency-Inverse Document Frequency (TF-IDF)** is a numerical statistic that reflects how important a word is to a document in a collection or corpus.

It is often used in text mining and information retrieval to transform the textual data into a format more suitable for machine learning algorithms.

The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

This helps to adjust for the fact that some words appear more frequently in general.

For a given document $ d $ and a class $ c $ in a set of classes $ C $, TF-IDF can be particularly insightful. It allows us to:

1. **Distinguish Important Words in Each Document**: By calculating the TF-IDF score for each word in document $ d $, we can identify which words are particularly characteristic of that document compared to other documents in the corpus.

2. **Understand Word Significance Across Classes**: TF-IDF itself is computed without class labels, but in a classification task the weighted features can help highlight which words are most descriptive of documents belonging to class $ c $, as opposed to those in other classes in $ C $.

3. **Improve Classification Models**: When used as features in machine learning models, TF-IDF vectors provide a more nuanced representation of text data than simple word counts, potentially leading to more accurate predictions in tasks like sentiment analysis or topic classification.

In the steps at the end of this section, we'll use Python and scikit-learn to apply TF-IDF to a text classification problem, leveraging its potential to enhance the performance of our Naive Bayes Classifier.


TF-IDF is calculated using two components: Term Frequency (TF) and Inverse Document Frequency (IDF).

**Term Frequency (TF)**:
TF measures how frequently a term occurs in a document.
It's calculated as:

$$
     \text{TF}(t, d) = \frac{n_{t, d}}{n_{d}}
$$
where $ n_{t, d} $ is the number of occurrences of term $ t $ in document $ d $, and $ n_{d} $ is the total number of terms in document $d$.


**Inverse Document Frequency (IDF)**:
IDF measures how rare, and therefore how informative, a term is across a set of documents or corpus.
It's calculated as:

$$
     \text{IDF}(t, D) = \log \left( \frac{N_D}{n_{D, t}} \right)
$$
where $ N_D $ is the total number of documents in the corpus $ D $, and $ n_{D, t} $ is the number of documents that contain term $ t $.

Finally, the TF-IDF score of a term is the product of its TF and IDF scores:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$

Where:
- $ t $ is a term.
- $ d $ is a document.
- $ D $ is the corpus of documents.
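
To make the formulas concrete, here is a small sketch that computes them directly on a toy corpus of tokenized documents. It assumes the natural logarithm (the formula above does not fix a base) and that the term occurs in at least one document.

```python
import math


def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


corpus = [["great", "movie"], ["boring", "movie"], ["great", "great", "fun", "movie"]]
print(tf_idf("movie", corpus[0], corpus))   # 0.0, since "movie" is in every document
print(tf_idf("boring", corpus[1], corpus))  # 0.5 * log(3), roughly 0.549
```

A word that appears in every document gets a score of zero, which is exactly the down-weighting of common words described above.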


1. **TF-IDF Vectorization**
* Utilize scikit-learn's TfidfVectorizer to transform the text data into TF-IDF features.
* This involves converting the raw documents into a matrix of TF-IDF features.


In [12]:
#
# implement here
#

2. **Train the Naive Bayes Classifier**
* Use scikit-learn's Naive Bayes implementation (like `MultinomialNB`) to train the model on the TF-IDF vectors.
* Fit the model to the training data.

In [13]:
#
# implement here
#

3. **Model Evaluation**
* Evaluate the performance of your model on the testing set.
* Use metrics such as accuracy, precision, recall, and F1 score to understand the effectiveness of your model.

In [14]:
#
# implement here
#
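Steps 2 and 3 could be combined roughly as follows. The texts and labels are toy stand-ins for the real train and test lists, with label 1 assumed to mean positive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

# Toy data for illustration only.
train_texts = ["great wonderful movie", "wonderful fun story",
               "boring awful movie", "awful dull story"]
train_labels = [1, 1, 0, 0]
test_texts = ["wonderful fun", "boring dull"]
test_labels = [1, 0]

vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)
pred = clf.predict(vectorizer.transform(test_texts))

acc = accuracy_score(test_labels, pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    test_labels, pred, average="binary", zero_division=0
)
print(acc, prec, rec, f1)
```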

4. **Parameter Tuning (Optional)**
* Experiment with different parameters of TfidfVectorizer and the Naive Bayes model to optimize performance.

In [15]:
#
# implement here
#

# For example:

# 1. **Parameter Tuning for TfidfVectorizer**
# - `max_features`: Control the maximum number of features in the vocabulary.
# - `min_df`: Set the minimum document frequency below which a word is ignored.
# - `max_df`: Set the maximum document frequency above which a word is ignored.
# - `stop_words`: Specify a list of stop words to be excluded from the vocabulary.
# - `ngram_range`: Specify the range of n-grams to be considered as features.
#
# 2. **Parameter Tuning for Naive Bayes Classifier**
# - `alpha`: Adjust the additive (Laplace/Lidstone) smoothing applied to the word likelihood estimates.
# - `fit_prior`: Enable or disable learning class prior probabilities from the data.
# - `class_prior`: Set prior probabilities for each class manually.
#
# 3. **Grid Search and Cross-Validation**
# - Use a technique like grid search or randomized search to evaluate multiple combinations of parameters systematically.
# - Perform cross-validation to assess the performance of different parameter settings objectively.
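
A grid search over a few of these parameters might be set up as in the sketch below. The data, the parameter values, and the tiny `cv=2` setting are illustrative assumptions; on the real training set, larger grids and more folds would be more typical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only.
texts = ["great wonderful movie", "wonderful fun story", "great fun film", "wonderful great acting",
         "boring awful movie", "awful dull story", "boring dull film", "awful boring acting"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "nb__alpha": [0.1, 1.0],                 # smoothing strength
}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Wrapping the vectorizer and classifier in a `Pipeline` means the vectorizer is refit inside each cross-validation fold, so held-out folds do not influence the vocabulary or IDF weights.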

## Statistical Significance Testing

1. Compare performance between the classifiers built in the previous sections:

* Hand-coded Rule
* BoW
* TF-IDF

2. Are the differences in performance between the classifiers statistically significant?
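
One option for comparing two classifiers evaluated on the same test set is McNemar's test, which looks only at the examples where exactly one of the two classifiers is correct. The sketch below implements the exact (binomial) two-sided version. This is one possible choice, not the only valid one (a paired bootstrap is another), and the predictions shown are made up.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Return (b, c, p_value) for the exact two-sided McNemar test.

    b: examples where A is correct and B is wrong.
    c: examples where A is wrong and B is correct.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(a == t and p != t for t, a, p in triples)
    c = sum(a != t and p == t for t, a, p in triples)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree on correctness
    k = min(b, c)
    # Two-sided p-value under H0: disagreements split 50/50 between b and c.
    p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return b, c, p_value


# Made-up predictions for illustration only.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(mcnemar_exact(y_true, pred_a, pred_b))  # (8, 0, 0.0078125)
```

A small p-value (for example, below 0.05) suggests the difference in error patterns is unlikely to be due to chance alone; with three classifiers, each pair would be tested separately.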