# Sentiment Classification of Amazon Product Reviews

This project develops a sentiment classifier for Amazon product reviews in the food category. The goal is to classify each review as expressing either a **positive** or a **negative** sentiment, encoded as binary labels: +1 for positive and -1 for negative.

The dataset contains written reviews from Amazon customers, with examples ranging from complaints about product quality to enthusiastic endorsements. For instance, negative reviews highlight issues such as lack of flavor or poor texture, whereas positive reviews emphasize satisfaction and high product quality.

To address this classification task, we trained two linear models, **Perceptron** and **Logistic Regression**, on two different text representations:

- **Binary representation**, which encodes whether a word appears (1) or not (0) in a review.
- **Count representation**, which counts the number of times each word appears in a review.

These vectorizations were applied using a bag-of-words approach, transforming the raw text into numerical feature vectors. To evaluate the models and optimize their hyperparameters, we used **grid search** combined with **5-fold cross-validation**.

This project shows that classical linear classifiers, combined with simple text representations, capture sentiment patterns in consumer-generated content reasonably well: the best models reach roughly 77% to 81% accuracy on the test set.


# Overview of Algorithms for Supervised Classification

## 1. Perceptron Classifier

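**The Perceptron** is one of the earliest algorithms for supervised binary classification. It is a linear classifier that updates its weights iteratively, and only when a training example is misclassified. It is usually presented without an explicit loss function, although the update rule corresponds to stochastic gradient descent on the perceptron criterion $\max(0, -y_i(w^\top x_i + b))$.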

### Mathematical Formulation:

The decision function is:

$$
f(x) = \text{sign}(w^\top x + b)
$$

Where:

- $ w \in \mathbb{R}^n $ is the weight vector.
- $ b $ is the bias term.
- $ x \in \mathbb{R}^n $ is the feature vector.

### Update rule (only on misclassification):
$$w \leftarrow w + \eta y_i x_i$$
$$b \leftarrow b + \eta y_i$$

Where:
- $ \eta $ is the learning rate.
- $ x_i \in \mathbb{R}^n $ is the feature vector of the $i$-th training example.
- $ y_i \in \{-1, 1\} $ is the true label.
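The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later; the function name `train_perceptron`, its `max_epochs` argument, and the toy data are all hypothetical.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Train a perceptron with the update rule above; labels must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # A point is misclassified when its signed margin is not positive.
            if y_i * (w @ x_i + b) <= 0:
                w += eta * y_i * x_i
                b += eta * y_i
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: training data is separated
            break
    return w, b

# Toy, linearly separable data (invented for illustration).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
predictions = np.sign(X @ w + b)
print(w, b, predictions)
```

On this toy data a single update on the first example is enough to separate all four points.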

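### Parameters used:
- `max_iter`: Maximum number of passes over the training data (epochs).
- `tol`: Tolerance for the stopping criterion.
- `eta0`: Constant learning rate $ \eta $ by which each update is multiplied; this is the only Perceptron parameter tuned in the grid search.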

---

## 2. Logistic Regression

**Logistic Regression** is a **probabilistic linear classifier** that models the **log-odds** of class membership as a linear function of the features; the **sigmoid function** maps this linear score to a probability. It is commonly used for binary classification.

### Prediction: Sigmoid Probability

The logistic regression model estimates the probability of class $y = 1$ given input $ \mathbf{x} \in \mathbb{R}^n $ as:

$$
P(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
$$

Where:

- $ \mathbf{w} \in \mathbb{R}^n $ is the weight vector
- $ b \in \mathbb{R} $ is the bias (intercept)
- $ \mathbf{x} $ is the input feature vector



### Decision Rule

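The model predicts $ \hat{y} \in \{0, 1\} $ using the threshold rule below, where class 1 corresponds to the positive label (+1) and class 0 to the negative label (-1) used in this project: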

$$
\hat{y} =
\begin{cases}
1 & \text{if } P(y=1 \mid \mathbf{x}) \geq 0.5 \\
0 & \text{otherwise}
\end{cases}
$$
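The prediction formula and the threshold rule translate directly into NumPy. The weights and inputs below are hypothetical values chosen only to illustrate the formulas; they are not learned from the review data.

```python
import numpy as np

def predict_proba(X, w, b):
    """P(y=1 | x) for each row of X under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def predict(X, w, b, threshold=0.5):
    """Apply the threshold rule; returns labels in {0, 1}."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical weights and inputs, for illustration only.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])

probabilities = predict_proba(X, w, b)
labels = predict(X, w, b)
print(probabilities)
print(labels)
```

Because the sigmoid equals 0.5 exactly when $\mathbf{w}^\top \mathbf{x} + b = 0$, thresholding the probability at 0.5 is the same as checking whether the linear score is non-negative.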


### Loss Function: Log Loss (a.k.a. Cross Entropy)

Logistic regression minimizes the **logarithmic loss**, defined for $ n $ training samples as:

$$
\mathcal{L}(\mathbf{w}, b) = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$

Where:

$$
p_i = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_i + b)}}
\quad \text{and} \quad y_i \in \{0, 1\}
$$

### Alternative Form: Logistic Loss for Labels in $\{-1, +1\}$

When labels are encoded as $ y_i \in \{-1, +1\} $, the same loss can be written, up to the constant factor $ \frac{1}{n} $, as the **logistic loss**:

$$
\mathcal{L}_{\text{log}} = \sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\right)
$$

This version is often used in theoretical analyses and optimization.
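The two forms can be checked against each other numerically. The sketch below uses random scores as stand-ins for $\mathbf{w}^\top \mathbf{x}_i + b$ (an assumption made purely for the check) and confirms that the averaged cross entropy with $\{0, 1\}$ labels equals the summed logistic loss with $\{-1, +1\}$ labels divided by $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=8)        # stand-ins for w^T x_i + b
y01 = rng.integers(0, 2, size=8)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # the same labels mapped to {-1, +1}

p = 1.0 / (1.0 + np.exp(-scores))
cross_entropy_mean = -np.mean(y01 * np.log(p) + (1 - y01) * np.log(1 - p))
logistic_sum = np.sum(np.log1p(np.exp(-ypm * scores)))

print(cross_entropy_mean, logistic_sum / len(scores))
```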

### Optimization Problem (with Regularization)

The regularized logistic regression solves the following optimization problem:

$$
\min_{\mathbf{w}, b} \quad \frac{1}{2C} \|\mathbf{w}\|^2 + \sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\right)
$$

Where:

- $ C $ is the inverse of the regularization strength.
- Smaller $ C $ → stronger regularization.
- $ \frac{1}{2C} \|\mathbf{w}\|^2 $ is the **L2 regularization** term.

Alternatively, using the regularization parameter $ \lambda = \frac{1}{C} $, the problem becomes:

$$
\min_{\mathbf{w}, b} \quad \sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\right) + \frac{\lambda}{2} \|\mathbf{w}\|^2
$$
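As a short worked step: multiplying the first objective by the positive constant $ C $ does not change its minimizer, and gives a form in which $ C $ weights the data term instead of dividing the penalty. To our understanding this is the convention scikit-learn documents for its L2-penalized logistic regression, but that should be checked against the user guide for the version in use.

$$
C \left( \frac{1}{2C} \|\mathbf{w}\|^2 + \sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\right) \right)
= \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\right)
$$

In this form a small $ C $ visibly down-weights the data term relative to the penalty, which is why smaller values of $ C $ mean stronger regularization.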

### Maximum Likelihood Interpretation

Logistic regression can also be derived by **maximum likelihood estimation** under the Bernoulli model:

$$
P(y_i \mid \mathbf{x}_i) = \left( \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_i + b)}} \right)^{y_i}
\left( 1 - \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_i + b)}} \right)^{1 - y_i}
$$

Taking the negative log-likelihood of the training set and dividing by $ n $ gives the **cross-entropy loss**, which is exactly the log loss shown earlier.
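Written out, with $ p_i $ denoting the sigmoid probability for sample $ i $ and assuming the samples are independent, the step is:

$$
-\log \prod_{i=1}^{n} P(y_i \mid \mathbf{x}_i)
= -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
= n \, \mathcal{L}(\mathbf{w}, b)
$$

Maximizing the likelihood and minimizing $ \mathcal{L}(\mathbf{w}, b) $ therefore lead to the same parameters.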

### Parameters used

- `C`: Inverse of regularization strength.
- `solver='lbfgs'`: Quasi-Newton method (L-BFGS).
- `max_iter`: Maximum number of solver iterations (set to 1,000,000 in the code so that the solver has room to converge).

# Implementation and Analysis

### Library Dependencies

The following libraries are essential for our analysis:

In [1]:
import numpy as np
import pandas as pd
from preprocessing_data import train_test_data_extract
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics

# Feature Extraction Implementation

In this section, we create two different representations of our text data for sentiment analysis:

## 1. Binary Feature Representation

- Features are encoded as binary (0/1) values
- 0: Word is absent in the review
- 1: Word is present in the review
- Ignores word frequency

## 2. Count Feature Representation

- Features represent word frequencies
- Each cell contains the number of times a word appears
- Preserves information about word frequency

In [2]:
(train_features_binary,
 train_labels_binary,
 test_features_binary,
 test_labels_binary,
 _) = train_test_data_extract(True)

(train_features_count,
 train_labels_count,
 test_features_count,
 test_labels_count,
 dictionary_words) = train_test_data_extract(False)

dictionary_words = list(dictionary_words.keys())

## greed_search_best_model()

### Purpose
Performs a grid search using `GridSearchCV` to find the best hyperparameters for:
- `Perceptron` with different values of `eta0` (learning rate).
- `LogisticRegression` with different values of `C` (inverse of regularization strength).

### What it does:
1. Defines the parameter grid based on the model.
2. Applies 5-fold cross-validation.
3. Returns:
    - A table with the mean and standard deviation of the cross-validation score for each parameter value.
    - The best model found.
    - The best hyperparameters found.
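The same three steps can be tried in isolation on synthetic data. The sketch below is a stand-in only: `make_classification` replaces the review features, and the grid of `C` values is shortened for speed.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the review features (not the Amazon data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

# One row per candidate value of C, with the mean and spread over the 5 folds.
summary = pd.DataFrame({
    "C": [params["C"] for params in search.cv_results_["params"]],
    "mean_test_score": search.cv_results_["mean_test_score"],
    "std_test_score": search.cv_results_["std_test_score"],
})
print(summary)
print(search.best_params_)
```

Note that `mean_test_score` here refers to the held-out fold within cross-validation, not to a separate test set.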

In [3]:
def greed_search_best_model(train_features,
                            train_labels,
                            model):

    """
    Performs grid search with cross-validation to find optimal hyperparameters for a given model.

    The function implements grid search for two types of models:
    1. Perceptron: Searches over different learning rates (eta0)
    2. Logistic Regression: Searches over different regularization strengths (C)

    Args:
        train_features (np.ndarray): Training feature matrix of shape (n_samples, n_features)
        train_labels (np.ndarray): Training labels of shape (n_samples,)
        model (str): Model type, either 'perceptron' or 'logistic_regression'

    Returns:
        tuple: Contains:
            - summary_table (pd.DataFrame): Results of grid search with columns:
                * parameter (eta0/C)
                * mean_test_score
                * std_test_score
            - best_estimator (sklearn estimator): Model with best parameters
            - best_params (dict): Best parameters found

    Notes:
        - Uses 5-fold cross-validation
        - For Perceptron: eta0 ∈ [0.01, 0.05, 0.1, 0.5, 1]
        - For Logistic Regression: C ∈ [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    """

    if model == 'perceptron':
        parameters = {'eta0': [0.01, 0.05, 0.1, 0.5, 1]}
        grid_search = GridSearchCV(Perceptron(), parameters, cv=5)
    elif model == 'logistic_regression':
        parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
        grid_search = GridSearchCV(LogisticRegression(max_iter=1000000), parameters, cv=5)
    else:
        # Fail early with a clear message instead of calling .fit on None.
        raise ValueError(f"Unknown model type: {model!r}")
    grid_search.fit(train_features, train_labels)

    results_df = pd.DataFrame(grid_search.cv_results_)

    parameter = 'eta0' if model == 'perceptron' else 'C'
    summary_table = pd.DataFrame({
        parameter: results_df['param_' + parameter].astype(float),
        'mean_test_score': results_df['mean_test_score'],
        'std_test_score': results_df['std_test_score']
    })

    return summary_table, grid_search.best_estimator_, grid_search.best_params_

## accuracy_scores()

### Purpose
Evaluates a trained model on both the training and the test data using `accuracy_score` from `sklearn.metrics` (the fraction of correctly predicted labels).


### What it does:
$$
\hat{y}_{\text{train}} = \text{model.predict}(x_{\text{train}}), \quad \hat{y}_{\text{test}} = \text{model.predict}(x_{\text{test}})
$$
$$
\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{y}_i = y_i)
$$
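The accuracy formula is simply the mean of an indicator vector. A minimal check on invented labels shows that it agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels and predictions, for illustration only.
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

# Mean of the indicator 1(y_hat_i == y_i): 3 of the 5 predictions match.
manual_accuracy = np.mean(y_pred == y_true)
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)
```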

In [4]:
def accuracy_scores(model,
                    x_train,
                    y_train,
                    x_test,
                    y_test):
    """
    Evaluates model performance on both training and test sets.

    Calculates accuracy scores using sklearn.metrics.accuracy_score for both
    training and test datasets, providing insight into model generalization.

    Args:
        model (sklearn estimator): Trained classification model
        x_train (np.ndarray): Training features
        y_train (np.ndarray): Training labels
        x_test (np.ndarray): Test features
        y_test (np.ndarray): Test labels

    Returns:
        pd.DataFrame: Results DataFrame with columns:
            - 'Dataset': ['Training Data', 'Test Data']
            - 'Accuracy': Corresponding accuracy scores
    """


    y_pred_test = model.predict(x_test)
    y_pred_train = model.predict(x_train)
    train_accuracy = metrics.accuracy_score(y_train, y_pred_train)
    test_accuracy = metrics.accuracy_score(y_test, y_pred_test)
    # Create a DataFrame to display the results
    results = pd.DataFrame({
        'Dataset': ['Training Data', 'Test Data'],
        'Accuracy': [train_accuracy, test_accuracy]
    })

    return results

def analysis_best_model(model, data_type):
    """
    Performs comprehensive model analysis including cross-validation,
    performance evaluation, and feature importance analysis.

    This function combines model training, evaluation, and analysis to provide
    a complete picture of model performance and interpretability.

    Args:
        model (str): Model type ('perceptron' or 'logistic_regression')
        data_type (str): Feature representation type ('binary' or 'count')

    Returns:
        tuple: Contains:
            - results_cv (pd.DataFrame): Cross-validation results
            - results_best_model (pd.DataFrame): Performance metrics on train/test sets
            - best_params (dict): Best hyperparameters found
            - feature_importance (pd.DataFrame): Top 10 most important features with
                their absolute coefficient values

    Note:
        Feature importance is determined by the absolute values of model coefficients,
        which indicate the strength of each feature's influence on the classification.
    """

    if data_type == 'binary':
        train_features = train_features_binary
        train_labels = train_labels_binary
        test_features = test_features_binary
        test_labels = test_labels_binary
    elif data_type == 'count':
        train_features = train_features_count
        train_labels = train_labels_count
        test_features = test_features_count
        test_labels = test_labels_count
    else:
        # Fail early on an unsupported feature representation name.
        raise ValueError(f"Unknown data type: {data_type!r}")


    (results_cv,
    best_model,
    best_params) = greed_search_best_model(train_features,
                                            train_labels,
                                            model)

    results_best_model = accuracy_scores(best_model,
                                        train_features,
                                        train_labels,
                                        test_features,
                                        test_labels)

    coefficients = best_model.coef_[0]

    feature_importance = pd.DataFrame({
        'Feature': dictionary_words,
        'Absolute_Coefficient': np.abs(coefficients)
    })

    feature_importance = feature_importance.sort_values(by='Absolute_Coefficient', ascending=False)

    return results_cv, results_best_model, best_params, feature_importance.head(10)
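Absolute coefficients rank features by strength but hide their direction. As a possible extension (not part of the analysis above), the signed coefficient can be kept alongside the absolute value; the vocabulary and coefficient values in this sketch are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical vocabulary and coefficients, invented for illustration.
words = ["delicious", "awful", "great", "the", "worst"]
coefficients = np.array([1.7, -1.6, 1.4, 0.05, -1.3])

importance = pd.DataFrame({
    "Feature": words,
    "Coefficient": coefficients,
    "Absolute_Coefficient": np.abs(coefficients),
    # With labels in {-1, +1}, a positive weight pushes toward the +1 class.
    "Direction": np.where(coefficients > 0, "positive", "negative"),
}).sort_values(by="Absolute_Coefficient", ascending=False)

print(importance.head(3))
```

With the signed column, a word such as "awful" can be read directly as evidence for the negative class instead of being inferred from its meaning.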


## Analysis of Perceptron Performance on Binary Encoded Data

In [5]:
(results_perceptron_binary_cv,
results_perceptron_binary_best_model,
perceptron_binary_best_params,
perceptron_binary_feature_importance) = analysis_best_model('perceptron', 'binary')

### Grid Search Results for Perceptron (Binary Feature Representation):

In [6]:
results_perceptron_binary_cv

|  | eta0 | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | 0.01 | 0.77825 | 0.011742 |
| 1 | 0.05 | 0.7765 | 0.017073 |
| 2 | 0.1 | 0.78275 | 0.018138 |
| 3 | 0.5 | 0.7855 | 0.00983 |
| 4 | 1.0 | 0.78725 | 0.01035 |


### Best Parameters:

In [7]:
perceptron_binary_best_params

{'eta0': 1}

### Accuracy Results for Best Model on Training and Test Sets for Perceptron (Binary Feature Representation)

In [8]:
results_perceptron_binary_best_model

|  | Dataset | Accuracy |
|---|---|---|
| 0 | Training Data | 0.99875 |
| 1 | Test Data | 0.796 |


### Top 10 most important features

In [9]:
perceptron_binary_feature_importance

|  | Feature | Absolute_Coefficient |
|---|---|---|
| 1483 | disappointment | 26.0 |
| 141 | awful | 26.0 |
| 2038 | worst | 26.0 |
| 2475 | originally | 25.0 |
| 3163 | perfectly | 24.0 |
| 1669 | horrible | 24.0 |
| 153 | disappointed | 23.0 |
| 3360 | reasonable | 23.0 |
| 3006 | including | 22.0 |
| 3834 | newman | 22.0 |


## Analysis of Perceptron Performance on Count Encoded Data

In [10]:
(results_perceptron_count_cv,
results_perceptron_count_best_model,
perceptron_count_best_params,
perceptron_count_feature_importance) = analysis_best_model('perceptron', 'count')

### Grid Search Results for Perceptron (Count Feature Representation):

In [11]:
results_perceptron_count_cv

|  | eta0 | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | 0.01 | 0.779 | 0.019452 |
| 1 | 0.05 | 0.76675 | 0.020242 |
| 2 | 0.1 | 0.76675 | 0.020242 |
| 3 | 0.5 | 0.78025 | 0.020115 |
| 4 | 1.0 | 0.78025 | 0.020115 |


### Best Parameters

In [12]:
perceptron_count_best_params

{'eta0': 0.5}
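In the grid search results for this configuration, `eta0 = 0.5` and `eta0 = 1.0` show the same mean score (0.78025), yet 0.5 is reported as best. One plausible reading, assuming `GridSearchCV` keeps the first of several tied candidates and that the displayed scores are exact ties, can be sketched as:

```python
import numpy as np

eta0_grid = [0.01, 0.05, 0.1, 0.5, 1.0]
# Mean cross-validation scores as reported for the count representation.
mean_scores = [0.779, 0.76675, 0.76675, 0.78025, 0.78025]

# np.argmax returns the first index of the maximum, mimicking a
# "first of the tied candidates wins" rule (an assumption about GridSearchCV).
best_eta0 = eta0_grid[int(np.argmax(mean_scores))]
print(best_eta0)
```

Under that assumption the choice between 0.5 and 1.0 reflects grid order, not a measured difference in performance.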

### Accuracy Results for Best Model on Training and Test Sets for Perceptron (Count Feature Representation)

In [13]:
results_perceptron_count_best_model

|  | Dataset | Accuracy |
|---|---|---|
| 0 | Training Data | 0.9835 |
| 1 | Test Data | 0.772 |


### Top 10 most important features

In [14]:
perceptron_count_feature_importance

|  | Feature | Absolute_Coefficient |
|---|---|---|
| 2038 | worst | 68.0 |
| 607 | delicious | 67.5 |
| 153 | disappointed | 62.0 |
| 1669 | horrible | 60.5 |
| 751 | unfortunately | 57.0 |
| 4618 | ball | 56.0 |
| 329 | perfect | 54.5 |
| 666 | ok | 53.5 |
| 750 | glad | 50.5 |
| 1483 | disappointment | 50.0 |


## Analysis of Logistic Regression Performance on Binary Encoded Data

In [15]:
(results_regression_binary_cv,
results_regression_binary_best_model,
regression_binary_best_params,
regression_binary_feature_importance) = analysis_best_model('logistic_regression',
                                                            'binary')

### Grid Search Results for Logistic Regression (Binary Feature Representation):

In [16]:
results_regression_binary_cv

|  | C | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | 0.001 | 0.73875 | 0.014296 |
| 1 | 0.01 | 0.78875 | 0.01199 |
| 2 | 0.1 | 0.8025 | 0.013229 |
| 3 | 1.0 | 0.804 | 0.012831 |
| 4 | 10.0 | 0.79675 | 0.013661 |
| 5 | 100.0 | 0.79125 | 0.011429 |
| 6 | 1000.0 | 0.789 | 0.01558 |


### Best Parameters

In [17]:
regression_binary_best_params

{'C': 1}

### Accuracy Results for Best Model on Training and Test Sets for Logistic Regression (Binary Feature Representation):

In [18]:
results_regression_binary_best_model

|  | Dataset | Accuracy |
|---|---|---|
| 0 | Training Data | 0.991 |
| 1 | Test Data | 0.81 |


### Top 10 most important features

In [19]:
regression_binary_feature_importance

|  | Feature | Absolute_Coefficient |
|---|---|---|
| 607 | delicious | 1.698068 |
| 141 | awful | 1.582719 |
| 153 | disappointed | 1.579949 |
| 329 | perfect | 1.467616 |
| 1483 | disappointment | 1.462417 |
| 751 | unfortunately | 1.431611 |
| 2038 | worst | 1.415653 |
| 666 | ok | 1.367475 |
| 275 | great | 1.354981 |
| 1025 | disgusting | 1.31404 |


## Analysis of Logistic Regression Performance on Count Encoded Data

In [20]:
(results_regression_count_cv,
results_regression_count_best_model,
regression_count_best_params,
regression_count_feature_importance) = analysis_best_model('logistic_regression',
                                                            'count')

### Grid Search Results for Logistic Regression (Count Feature Representation):

In [21]:
results_regression_count_cv

|  | C | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | 0.001 | 0.73025 | 0.013309 |
| 1 | 0.01 | 0.77525 | 0.008711 |
| 2 | 0.1 | 0.80725 | 0.011922 |
| 3 | 1.0 | 0.796 | 0.010648 |
| 4 | 10.0 | 0.7845 | 0.007969 |
| 5 | 100.0 | 0.784 | 0.009918 |
| 6 | 1000.0 | 0.7855 | 0.013933 |


### Best Parameters

In [22]:
regression_count_best_params

{'C': 0.1}

### Accuracy Results for Best Model on Training and Test Sets for Logistic Regression (Count Feature Representation):

In [23]:
results_regression_count_best_model

|  | Dataset | Accuracy |
|---|---|---|
| 0 | Training Data | 0.94625 |
| 1 | Test Data | 0.798 |


### Top 10 most important features

In [24]:
regression_count_feature_importance

|  | Feature | Absolute_Coefficient |
|---|---|---|
| 607 | delicious | 1.018176 |
| 153 | disappointed | 0.844344 |
| 275 | great | 0.837651 |
| 329 | perfect | 0.720934 |
| 26 | bad | 0.697556 |
| 221 | however | 0.693031 |
| 551 | best | 0.679971 |
| 963 | loves | 0.646232 |
| 754 | favorite | 0.622688 |
| 4 | not | 0.60824 |


# Results Analysis

## Performance Comparison Among Best Models
### Perceptron with Binary Features:
  - Training Accuracy: 0.99875
  - Test Accuracy: 0.796
  - Mean Cross-Validation Score: 0.787
  - STD of Cross-Validation Score: 0.0103

### Perceptron with Count Features:
  - Training Accuracy: 0.9835
  - Test Accuracy: 0.772
  - Mean Cross-Validation Score: 0.780
  - STD of Cross-Validation Score: 0.020

### Logistic Regression with Binary Features:
  - Training Accuracy: 0.991
  - Test Accuracy: 0.810
  - Mean Cross-Validation Score: 0.804
  - STD of Cross-Validation Score: 0.012

### Logistic Regression with Count Features:
  - Training Accuracy: 0.946
  - Test Accuracy: 0.798
  - Mean Cross-Validation Score: 0.807
  - STD of Cross-Validation Score: 0.011
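For a side-by-side view, the reported figures can be collected into one table. The sketch below copies the numbers from the result sections above and adds a train-test gap column as a rough indicator of overfitting; the column names are our own choice.

```python
import pandas as pd

# Figures copied from the four result sections above.
comparison = pd.DataFrame({
    "model": ["Perceptron", "Perceptron",
              "Logistic Regression", "Logistic Regression"],
    "features": ["binary", "count", "binary", "count"],
    "train_accuracy": [0.99875, 0.9835, 0.991, 0.94625],
    "test_accuracy": [0.796, 0.772, 0.810, 0.798],
    "mean_cv_score": [0.78725, 0.78025, 0.804, 0.80725],
})

# Train-test gap: a large gap suggests the model fits the training set
# much better than it generalizes.
comparison["train_test_gap"] = (
    comparison["train_accuracy"] - comparison["test_accuracy"]
)
print(comparison.sort_values("test_accuracy", ascending=False))
```

All four models fit the training data far better than the test data, with Logistic Regression on count features showing the smallest gap.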


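#### *Logistic Regression with regularization parameter C=1.0, trained on binary bag-of-words features, achieved the highest test accuracy of the four models (81.0%). Logistic Regression on count features (C=0.1) had a marginally higher mean cross-validation score (0.807 vs. 0.804), a difference well within one standard deviation, so the two are effectively tied on validation performance.*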