Gabriel Lluch
CS 598 - PSL
660131202

# Project 3 - Report
---


# Section 1: Sentiment Classification Details

## Sentiment Classification Report

### Overview

The sentiment classification process leveraged the provided **OpenAI embeddings** as direct input to a classification model, bypassing traditional natural language processing steps such as tokenization, bag-of-words encoding, or manual feature engineering. Since the embeddings already captured semantic and contextual nuances of the reviews, this approach streamlined sentiment modeling. However, this method traded off interpretability for high performance, as discussed in the next section.

### Data Preparation

Data preparation involved extracting the **1,536-dimensional OpenAI-generated embeddings** from the given training and test splits. These features were standardized using a `StandardScaler` to ensure uniform weighting and stable optimization during training. Importantly, the scaler was fit only on the training set to prevent data leakage. 

**Logistic Regression** with elastic-net regularization was chosen for its simplicity, coefficient-level interpretability, and robustness to high-dimensional data. Hyperparameters, including the inverse regularization strength (`C`) and `l1_ratio`, were tuned via cross-validation (`LogisticRegressionCV`) on the training split. Values of `C = 0.01` and `l1_ratio = 0.1` were found sufficient to meet the performance benchmark while balancing bias and variance across folds.

### Training and Evaluation

Once optimal hyperparameters were identified, they were consistently applied across all splits to ensure straightforward reproducibility and consistent model behavior. The training procedure across each split followed a uniform methodology:
1. Load data.
2. Standardize embeddings.
3. Fit the logistic regression model.
4. Predict probabilities for the test set.

This approach ensured fair comparisons of performance metrics (AUC scores) across all five provided splits.
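The per-split procedure can be sketched as follows. This is a minimal illustration on synthetic data rather than the submitted script: the function name `fit_and_score_split` and the random features are assumptions, while the hyperparameters mirror the values reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler


def fit_and_score_split(X_train, y_train, X_test, y_test):
    """Standardize, fit elastic-net logistic regression, and return test probabilities and AUC."""
    # Fit the scaler on the training features only to avoid leakage.
    scaler = StandardScaler().fit(X_train)
    clf = LogisticRegression(
        penalty="elasticnet", solver="saga",
        C=0.01, l1_ratio=0.1, max_iter=10000, random_state=42,
    )
    clf.fit(scaler.transform(X_train), y_train)
    # Probability of the positive class for each test row.
    probs = clf.predict_proba(scaler.transform(X_test))[:, 1]
    return probs, roc_auc_score(y_test, probs)


# Synthetic stand-in for one split (assumption: 20 features instead of 1,536).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
probs, auc = fit_and_score_split(X[:300], y[:300], X[300:], y[300:])
print(f"AUC on synthetic data: {auc:.3f}")
```

In the real workflow the same function would be called once per split with the embedding columns as features.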

Predictions for each test split were saved into `mysubmission.csv` to adhere to the project’s submission requirements. The simplicity of the approach, combined with the direct use of pre-computed embeddings, facilitated efficient training and inference while achieving high AUC scores across all splits.

## Performance Metrics

| **Split** | **Test AUC** | **Execution Time (s)** |
|-----------|--------------|-------------------------|
| 1         | 0.986928     | 24.05                  |
| 2         | 0.986612     | 27.56                  |
| 3         | 0.986332     | 24.39                  |
| 4         | 0.986937     | 26.87                  |
| 5         | 0.986216     | 26.39                  |
| **Avg**   | **0.986605** | **25.85**              |


## Hardware Specifications

- **Device:** Apple M3 Pro  
- **RAM:** 36GB  
- **Operating System:** macOS 14.3  

---

# Section 2: Model Interpretability 

## Interpretability Approach

To understand which parts of the review text most strongly influence the model’s sentiment predictions, we implemented a **sentence-level interpretability technique** on a subset of test reviews from the first split. We selected **five positive** and **five negative reviews**, ensuring a balanced perspective on how the model perceives both sentiment classes.

The approach relies on an **embedding alignment model** that maps both review-level and sentence-level BERT embeddings into the classifier's input space. We performed two key analyses to interpret sentiment predictions:

1. **Individual Sentence Analysis:** Assigning sentiment scores to individual sentences.
2. **Leave-One-Out Analysis:** Quantifying the importance of each sentence by observing the impact of its removal on the overall sentiment prediction.

---

## 1. Sentence Extraction and Embedding Alignment

### Preprocessing, Tokenization, Encoding Strategy

- To train the **alignment mechanism**, **entire reviews** (not individual sentences) are embedded using **BERT**. These embeddings represent each review in a **768-dimensional space** and are paired with the corresponding **1,536-dimensional OpenAI embeddings**, which serve as the alignment targets.

- Each full review undergoes preprocessing before embedding:
  - **HTML tags** are removed using BeautifulSoup.
  - **URLs** are replaced with placeholders to ensure uniformity.
  - Non-printable characters and excess whitespace are eliminated.
  - The cleaned reviews are embedded as a whole using BERT.

- For interpretability using sentence-level analysis on the selected reviews:
  - The reviews are **tokenized into individual sentences** using **nltk's sentence tokenizer**.
  - Sentence embeddings are generated for each sentence in the selected reviews using BERT. These embeddings are transformed to the OpenAI embedding space using the trained alignment mechanism.
  - This process allows for sentence-specific sentiment scoring, providing a fine-grained understanding of how individual sentences contribute to the overall sentiment.

- For interpretability using **leave-one-out analysis** on the selected reviews:
  - For each sample review, the overall sentiment score is computed using the full review embedding.
  - Each sentence is then omitted one at a time, and the remaining text is re-embedded to compute the updated sentiment score.
  - The change in sentiment score when a sentence is removed quantifies the **influence of that sentence** on the model's prediction.
  - Sentences whose presence causes the largest increase in sentiment score are identified as **key positive contributors**, while those whose presence decreases the score are identified as **key negative drivers**.

This dual approach of **sentence-specific scoring** and **leave-one-out analysis** offers complementary insights into how individual sentences shape the sentiment predictions, enhancing the interpretability of the model's decision-making process.
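The sentence-specific scoring chain (embed, scale, align, scale, classify) might be sketched as below. Everything here is a toy stand-in: `score_sentences` and `toy_embed` are hypothetical names, the 8- and 16-dimensional vectors replace the real 768- and 1,536-dimensional embeddings, and the models are fit on random data purely so the example runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler


def score_sentences(sentences, embed, src_scaler, aligner, clf_scaler, clf):
    """Return P(positive) per sentence via embed -> scale -> align -> scale -> classify."""
    src = np.vstack([embed(s) for s in sentences])        # (n_sentences, d_src)
    aligned = aligner.predict(src_scaler.transform(src))  # (n_sentences, d_tgt)
    return clf.predict_proba(clf_scaler.transform(aligned))[:, 1]


# Toy training data (assumption): 8-dim source vectors, 16-dim target vectors.
rng = np.random.default_rng(0)
src_train = rng.normal(size=(200, 8))
tgt_train = src_train @ rng.normal(size=(8, 16))
labels = (tgt_train[:, 0] > 0).astype(int)

src_scaler = StandardScaler().fit(src_train)
aligner = LinearRegression().fit(src_scaler.transform(src_train), tgt_train)
clf_scaler = StandardScaler().fit(tgt_train)
clf = LogisticRegression(max_iter=1000).fit(clf_scaler.transform(tgt_train), labels)


def toy_embed(sentence):
    """Hypothetical embedder: a deterministic pseudo-random vector per sentence."""
    seed = sum(ord(ch) for ch in sentence)
    return np.random.default_rng(seed).normal(size=8)


scores = score_sentences(["Great film.", "Dull plot."], toy_embed,
                         src_scaler, aligner, clf_scaler, clf)
print(scores)
```

Note that two scalers are involved: one fit on the source embeddings before alignment and one fit on the classifier's own training features.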

#### Sentence Embedding with BERT
- For each tokenized sentence:
  - Sentence embeddings are generated using the **[CLS] token representation** from the pre-trained **`bert-base-uncased`** BERT model.
  - Longer reviews are split into manageable chunks of up to 510 tokens to ensure compatibility with BERT's input size limitations. The resulting embeddings for all chunks are **averaged** to form the final representation.
- BERT embeddings are **768-dimensional**, capturing semantic nuances at the sentence level.

#### Alignment to OpenAI Embedding Space
- Since our sentiment classification model operates in the **OpenAI embedding space (1,536 dimensions)**, the BERT embeddings must be aligned to this higher-dimensional space:
  1. **Scaling BERT Embeddings:** The generated BERT embeddings are scaled using a **StandardScaler**, fit on training data, to normalize feature distributions.
  2. **Linear Regression Mapping:** A **linear regression alignment model** transforms the scaled 768-dimensional BERT embeddings into the 1,536-dimensional OpenAI space.

#### Evaluation of Alignment Approaches
- To determine the optimal alignment approach, we experimented with several models, including:
  - **Linear Models:** Lasso, Ridge, ElasticNet.
  - **Tree-Based Models:** RandomForest, XGBoost.
  - **Neural Network Architectures:** Simple feedforward networks with varying depths and activation functions.

- While more complex models occasionally improved certain metrics, they often severely compromised others, such as cosine similarity or interpretability. In contrast, **basic LinearRegression** offered a straightforward and robust solution, achieving the following on the training data used to fit it:
  - **R² Score:** 0.35
  - **Cosine Similarity:** 0.77
  - **Mean Squared Error (MSE):** 0.00027
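As a rough illustration of how these three alignment metrics can be computed, the sketch below fits a scaler and a linear regression on synthetic embeddings and evaluates the fit. The helper name `alignment_metrics`, the dimensions, and the data are assumptions; the resulting numbers are unrelated to the values reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


def alignment_metrics(y_true, y_pred):
    """Return (R^2, mean row-wise cosine similarity, MSE) for two embedding matrices."""
    dots = np.sum(y_true * y_pred, axis=1)
    norms = np.linalg.norm(y_true, axis=1) * np.linalg.norm(y_pred, axis=1)
    cos_sim = float(np.mean(dots / norms))
    return r2_score(y_true, y_pred), cos_sim, mean_squared_error(y_true, y_pred)


# Synthetic stand-ins (assumption): 8-dim source and 16-dim target embeddings.
rng = np.random.default_rng(0)
source = rng.normal(size=(500, 8))
target = source @ rng.normal(size=(8, 16)) + 0.5 * rng.normal(size=(500, 16))

scaler = StandardScaler().fit(source)
aligner = LinearRegression().fit(scaler.transform(source), target)
predicted = aligner.predict(scaler.transform(source))

r2, cos_sim, mse = alignment_metrics(target, predicted)
print(f"R^2={r2:.3f}  cosine={cos_sim:.3f}  MSE={mse:.4f}")
```

Cosine similarity is averaged over rows, so it measures how well each predicted vector points in the direction of its target regardless of magnitude.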

#### Consistency and Reproducibility
- This notebook contains code to process the data and train the alignment model from scratch; to ensure consistency, however, the trained scaler and alignment model can also be loaded directly from a **GitHub** repository.
- The alignment model provides a semantically meaningful mapping to the OpenAI embedding space, suitable for input into the logistic regression sentiment classifier.

This preprocessing and embedding alignment workflow bridges the dimensional and representational gap between BERT and OpenAI embeddings while ensuring solid performance metrics with a simple and efficient alignment model.

---

## 2. Sentence-Level Sentiment Scoring for Sample Reviews

- After mapping each sentence’s BERT embedding to its corresponding OpenAI embedding, we passed these embeddings through the **trained logistic regression sentiment classifier**.
- This process yielded a **probability score** for each sentence, offering a fine-grained view of its sentiment.
- For visualization, sentences were **color-coded** in the original review text:
  - **Positive Sentences:** Greenish hues.
  - **Negative Sentences:** Red or orange tones.

This color-coding made it intuitive to identify which sentences most influenced the overall sentiment prediction. Positive reviews are dominated by green, while negative reviews are dominated by red.
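A simplified version of the score-to-color mapping could look like the sketch below. The function names, the three-bucket cutoffs (0.4 and 0.6), and the color names are illustrative assumptions rather than the thresholds used in the notebook's own visualization cell.

```python
import html


def score_to_color(score, low=0.4, high=0.6):
    """Map a positive-class probability to a display color (cutoffs are illustrative)."""
    if score > high:
        return "limegreen"   # positive
    if score < low:
        return "tomato"      # negative
    return "lightgrey"       # neutral


def highlight(sentences, scores):
    """Wrap each sentence in a <span> whose background reflects its score."""
    spans = [
        f'<span style="background-color:{score_to_color(p)}">{html.escape(s)}</span>'
        for s, p in zip(sentences, scores)
    ]
    return " ".join(spans)


markup = highlight(["Loved it.", "The ending <dragged>."], [0.93, 0.12])
print(markup)
```

In a notebook, the resulting string can be rendered with `IPython.display.HTML`. Escaping the sentence text keeps stray angle brackets in a review from being interpreted as markup.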

---

## 3. Leave-One-Out Analysis

As a second, complementary way to measure the importance of individual sentences, we performed a **leave-one-out analysis**:

1. Computed the model’s sentiment probability for the full review.
2. For each sentence:
   - Removed the sentence from the review.
   - Re-embedded the modified text using the same **BERT and alignment pipelines**.
   - Re-ran the classifier on the modified embedding to observe changes in sentiment probability.
3. Measured the **difference in sentiment score** before and after removing each sentence.
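The steps above can be sketched generically as follows. The scorer is passed in as a callable, so `score_text` stands for the full embed, align, and classify pipeline; the `toy_score` keyword counter is a hypothetical stand-in used only to make the example runnable.

```python
def leave_one_out_impacts(sentences, score_text):
    """Return the full-text score and (sentence, impact) pairs, most positive first.

    impact = score(full text) - score(text without the sentence), so a positive
    impact means the sentence pushes the prediction toward the positive class.
    """
    full_score = score_text(" ".join(sentences))
    impacts = []
    for i, sentence in enumerate(sentences):
        remaining = " ".join(sentences[:i] + sentences[i + 1:])
        impacts.append((sentence, full_score - score_text(remaining)))
    impacts.sort(key=lambda pair: pair[1], reverse=True)
    return full_score, impacts


def toy_score(text):
    """Hypothetical scorer: keyword counts stand in for the real pipeline."""
    lowered = text.lower()
    return 0.5 + 0.25 * (lowered.count("great") - lowered.count("boring"))


review = ["The cast is great.", "The plot is boring.", "It runs two hours."]
full_score, impacts = leave_one_out_impacts(review, toy_score)
for sentence, impact in impacts:
    print(f"{impact:+.2f}  {sentence}")
```

Each review with `n` sentences requires `n + 1` calls to the scorer, which is the main cost of this analysis when every call involves re-embedding text.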

#### Insights from Leave-One-Out Analysis
- Sentences whose removal caused large **drops in sentiment scores** were identified as key **positive contributors**.
- Sentences whose removal led to an **increase in sentiment scores** were identified as key **negative drivers**.

- As with the individual sentence scoring, we also list the sentences in order of their leave-one-out impact, from most positive to most negative.

---

## 4. Observations and Insights

By combining **color-coded visualizations** and **leave-one-out analysis**, we uncovered valuable insights into the model’s decision-making process.  
Please refer to the end of this file for the color-coded results and the sentences ordered by impact magnitude under both approaches.

### General Trends in Sentiment Scoring
- The **sentence-specific scoring** and **leave-one-out analysis** revealed that positive reviews are typically dominated by **positive sentences**, while negative reviews are dominated by **negative sentences**. This pattern indicates the model’s ability to capture sentiment trends at the sentence level in alignment with the overall review sentiment.

### Misclassifications and Ambiguities in Individual Sentence Analysis
However, there are notable cases where the sentence-level predictions diverge from the review sentiment:

1. **Incorrectly Marked Positive Sentiment in Negative Reviews:**
   - Example: *"Sadly, the film suffers from difficult to believe characters as well as a major plot problem that makes some of the characters seem brain-addled."*
     - This sentence is incorrectly marked as **positive**, likely due to specific phrases or words that the model interprets in isolation without understanding the full negative context.
   - Common Issue: Sentences describing **positive actions or events in a movie** are often marked positive, even when the author’s opinion is negative.

2. **Incorrectly Marked Negative Sentiment in Positive Reviews:**
   - Example: *"Disgusted by the state of karate, Oyama returns to his lone training."*
     - This sentence is incorrectly marked as **negative** in an otherwise positive review. The model’s misclassification likely stems from the word *"disgusted,"* which is contextually negative but not indicative of the review's sentiment. Given that it's merely describing a point in the plot, a score closer to neutral would be more appropriate.
   - Example: *"Just plain old great television."*
     - This sentence is incorrectly marked as **negative**, likely because the model misinterprets *"plain old"* as a negative sentiment phrase. However, the overall context and the word *"great"* clearly signal positivity. This misclassification demonstrates the model’s difficulty in handling nuanced expressions, where certain words may carry sentiment depending on context and phrasing.

### Correctly Classified Sentences
Despite misclassifications, many sentences are correctly classified, aligning with the sentiment of the review:

1. **Negative Sentiment in a Negative Review:**
   - Example: *"No resolution or big twist or anything."*
     - This sentence is correctly identified as **negative**, highlighting dissatisfaction and critique of the subject.
   
2. **Positive Sentiment in a Positive Review:**
   - Example: *"This is a tragic love story and a refreshing entry into the genre."*
     - The sentence is correctly marked as **positive**, emphasizing the author’s praise for the movie.

### Magnitudes and Consistency in Analysis
- While there are differences in **magnitude** and **ordering** between the single-sentence scoring and leave-one-out analysis, the results are **largely consistent** with the overall sentiment predictions.
- Single-sentence scoring provides an isolated sentiment view, whereas leave-one-out analysis highlights the **contribution** of each sentence to the overall sentiment. These two approaches complement each other, offering both granular and holistic insights.

### Leave-One-Out Analysis
Leave-one-out analysis highlights the influence of individual sentences on the overall sentiment of a review. While this approach captures contextual dependencies, it also reveals cases where the sentence impacts diverge from the expected sentiment alignment:

1. **Positive Influence in Negative Reviews:**
   - Example (Review 19238, Negative Review): 
     - *"But at least there's enough neat carnival themes and b-movie monster makeup to keep you watching."*  
       The leave-one-out analysis assigns this sentence a **positive influence** on the overall score. Although the review is negative overall, the sentence concedes that the movie has some entertaining aspects, so a positive contribution is reasonable.
   
   - Example (Review 15808, Negative Review): 
     - *"Everything is so very peachy and swell--the family adores Bergman and things couldn't be more perfect."*  
       This sentence contributes positively to the sentiment, likely due to the phrases "peachy and swell" and "things couldn't be more perfect." However, in the context of the review, these descriptions could be sarcastic, and the sentence's impact should be neutral or even negative.

2. **Negative Influence in Positive Reviews:**
   - Sentences with a neutral or mildly negative tone (e.g., plot descriptions or critical observations) in positive reviews can lower the sentiment score disproportionately. For instance:
     - A sentence like *"The pacing was slow at times, but the story made up for it"* could contribute negatively due to the phrase "slow at times," even though the sentence ends positively.

3. **Ambiguity in Neutral Sentences:**
   - Example (Review 23733, Negative Review): 
     - Neutral sentences like *"Family goes away on vacation and 16-year-old daughter wants independence from parents."* are marked with a **positive or negative influence** depending on how the model interprets the surrounding context. This demonstrates how leave-one-out analysis sometimes amplifies subtle shifts in sentiment, even for sentences that seem purely descriptive.

---

### Explanatory Considerations
- **Model Tendency Toward Keywords Over Context:**
  - The model often emphasizes specific keywords or phrases, which can lead to misclassifications. For example:
    - Words like *"tragic"* or *"disgusted"* are strongly associated with sentiment polarity, even when their contextual meaning may differ.
    - Phrases like *"plain old great television"* are misinterpreted due to an overreliance on the phrase *"plain old"* as a negative cue, despite the positive intent.

- **Contextual Dependencies in Leave-One-Out Analysis:**
  - Leave-one-out analysis reveals that the model incorporates some level of **sentence interdependencies**, as removing a sentence can significantly alter the overall sentiment score.
  - However, this also exposes limitations:
    - Neutral or descriptive sentences may have exaggerated impacts due to their balancing role in the review.
    - Mixed-sentiment reviews, where both positive and negative aspects are discussed, often show sentences contributing in unexpected directions.

---

### Summary of Insights
1. **Alignment of Overall Trends:**
   - Positive reviews are generally dominated by positive sentences, and negative reviews by negative ones. This alignment shows the model’s capability to broadly capture sentiment trends.

2. **Nuances and Misclassifications:**
   - Misclassifications often occur when:
     - Sentences describe **events or actions** (e.g., plot details) that don't directly reflect the author’s sentiment.
     - The model focuses on isolated keywords rather than contextual meaning.
   - For instance, sarcasm or complex expressions are frequently misinterpreted due to a lack of deeper semantic understanding.

3. **Complementary Analyses:**
   - Combining **single-sentence scoring** with **leave-one-out analysis** provides a more holistic view of how individual sentences shape sentiment predictions:
     - Single-sentence scoring highlights isolated sentiment polarity.
     - Leave-one-out analysis captures the contribution of each sentence to the overall sentiment, revealing interdependencies and contextual nuances.

4. **Model Limitations:**
   - The model’s overemphasis on specific words and its inability to fully account for sarcasm, nuanced expressions, or mixed sentiment reviews highlight areas for improvement.

---

### Limitations and Trade-Offs
While the interpretability approach offers valuable insights, it comes with several challenges:

1. **Noise in Embedding Alignment:**
   - Mapping BERT embeddings to OpenAI’s space introduces potential noise, which may slightly distort the semantic fidelity of sentence representations.

2. **Resource Intensity:**
   - Leave-one-out analysis is computationally expensive, requiring multiple re-embedding operations. This could limit scalability in large datasets or real-time applications.

3. **Contextual Shifts:**
   - Removing sentences during leave-one-out analysis can unintentionally shift the review’s overall context, leading to misleading interpretations of sentence contributions.

4. **Handling Mixed Sentiment:**
   - Reviews with balanced positive and negative observations challenge the model, as it struggles to weigh these nuances accurately in both single-sentence scoring and leave-one-out approaches.

---

### Conclusion
The combined use of **sentence-specific scoring** and **leave-one-out analysis** offers a detailed and complementary understanding of sentiment predictions. This approach provides a **clearer window into embedding-based models**, highlighting the textual elements most influential to predictions while exposing key limitations:

- **Strengths:** These methods excel at identifying and visualizing how individual sentences contribute to overall sentiment, providing interpretable insights for model evaluation.
- **Weaknesses:** Reliance on specific keywords limits accuracy in nuanced cases such as sarcasm or mixed sentiment, and the computational cost of leave-one-out analysis limits scalability.

Despite its imperfections, this interpretability framework advances the understanding of sentiment models by uncovering the **underlying decision-making process**. Future enhancements could include:
- Incorporating attention mechanisms to better capture sentence-level dependencies.
- Differentiating factual from opinion-based sentences to reduce misclassification in descriptive contexts.
- Exploring alternative embedding alignment techniques to improve semantic fidelity.

Ultimately, this analysis not only enhances trust in the model’s predictions but also highlights critical areas for refinement, paving the way for more robust and interpretable sentiment classification systems.


In [2]:
# Step 1: Selecting Reviews

import random
import pandas as pd
import numpy as np
# Set random seed
random.seed(42)

# Load data for split 1
train_data = pd.read_csv('data/split_1/train.csv')
test_data = pd.read_csv('data/split_1/test.csv')
test_labels = pd.read_csv('data/split_1/test_y.csv')

# Merge test data and labels
test_data = test_data.merge(test_labels, on='id')

# Separate positive and negative reviews
positive_reviews = test_data[test_data['sentiment'] == 1]
negative_reviews = test_data[test_data['sentiment'] == 0]

# Randomly select 5 positive and 5 negative reviews
selected_positive = positive_reviews.sample(n=5, random_state=42)
selected_negative = negative_reviews.sample(n=5, random_state=42)

# Combine selected reviews
selected_reviews = pd.concat([selected_positive, selected_negative])

# Reorder test columns for alignment
test_data= test_data[train_data.columns]
print(train_data.columns)
print(test_data.columns)


Index(['id', 'sentiment', 'review', 'embedding_1', 'embedding_2',
       'embedding_3', 'embedding_4', 'embedding_5', 'embedding_6',
       'embedding_7',
       ...
       'embedding_1527', 'embedding_1528', 'embedding_1529', 'embedding_1530',
       'embedding_1531', 'embedding_1532', 'embedding_1533', 'embedding_1534',
       'embedding_1535', 'embedding_1536'],
      dtype='object', length=1539)
Index(['id', 'sentiment', 'review', 'embedding_1', 'embedding_2',
       'embedding_3', 'embedding_4', 'embedding_5', 'embedding_6',
       'embedding_7',
       ...
       'embedding_1527', 'embedding_1528', 'embedding_1529', 'embedding_1530',
       'embedding_1531', 'embedding_1532', 'embedding_1533', 'embedding_1534',
       'embedding_1535', 'embedding_1536'],
      dtype='object', length=1539)


In [3]:
# Utility function to load resources from GitHub
import joblib
import requests
from io import BytesIO

def load_from_github(url):
    """
    Load a file from a given GitHub raw content URL.
    """
    try:
        print(f"Downloading from {url}...")
        response = requests.get(url)
        response.raise_for_status()  # Check for HTTP errors
        loaded_object = joblib.load(BytesIO(response.content))
        print("Download and loading successful.")
        return loaded_object
    except Exception as e:
        print(f"Error loading {url}: {e}")
        return None

In [4]:
# Step 2: Splitting Reviews into Sentences

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

sent_tokenize(selected_reviews.iloc[0]['review'])
selected_reviews['sentences'] = selected_reviews['review'].apply(
    lambda x: sent_tokenize(x) if isinstance(x, str) else []
)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gabriellluch/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gabriellluch/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/gabriellluch/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/gabriellluch/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [5]:
# Define basic preprocessing
import string
from bs4 import BeautifulSoup
import re

def preprocess_text(text):
    # Remove HTML tags
    text = clean_html(text)
    # Replace URLs
    text = replace_urls(text)
    # Normalize whitespace
    text = ' '.join(text.split())
    # Remove non-printable characters
    text = ''.join(filter(lambda x: x in string.printable, text))
    return text

def clean_html(text):
    return BeautifulSoup(text, "html.parser").get_text()

def replace_urls(text):
    # Escape the dot so "www." matches literally rather than "www" plus any character
    url_pattern = r'http\S+|www\.\S+'
    return re.sub(url_pattern, '<URL>', text)

In [6]:
# Define BERT model and embedding utilities
from transformers import BertTokenizer, BertModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load pre-trained BERT model and tokenizer; move the model to the selected device
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)

# Sentence level embedding
def get_sentence_embedding(sentence):
    # Keep the inputs on the same device as the model
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Get the [CLS] token embedding (moved back to CPU before converting to NumPy)
    cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    return cls_embedding.flatten()

def get_review_embedding(review_text, tokenizer, bert_model, device, chunk_size=510):
    """
    Generates a comprehensive BERT embedding for a full review by splitting it into chunks.
    """
    tokens = tokenizer.tokenize(review_text)
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    embeddings = []

    for chunk in chunks:
        chunk_text = tokenizer.convert_tokens_to_string(chunk)
        inputs = tokenizer(chunk_text, return_tensors='pt', truncation=True, max_length=512).to(device)
        with torch.no_grad():
            outputs = bert_model(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()
        embeddings.append(cls_embedding)

    if embeddings:
        # Aggregate embeddings (e.g., average)
        review_embedding = np.mean(embeddings, axis=0)
    else:
        # Handle empty reviews
        review_embedding = np.zeros(bert_model.config.hidden_size)

    return review_embedding




In [7]:
# Step 3: Convert reviews to their BERT embeddings
import joblib
bert_train_path = "bert_train"
openai_train_path = "openai_train"

# THIS STEP CONVERTS REVIEWS TO BERT EMBEDDINGS IN ORDER TO TRAIN THE ALIGNMENT MODEL
# IT CAN BE SKIPPED BY LOADING THE TRAINED MODEL DIRECTLY FROM GITHUB
# SET EMBED_REVIEWS = True IF YOU WOULD LIKE TO CREATE THE EMBEDDED REVIEWS LOCALLY

EMBED_REVIEWS = False

if EMBED_REVIEWS:
    # Number of samples to use for mapping
    n_samples = 25000

    # Randomly select reviews from training data
    # mapping_samples = train_data.sample(n=n_samples, random_state=42)

    mapping_samples = train_data.copy()
    bert_embeddings = []

    for review in mapping_samples['review']:
        clean_review = preprocess_text(review)
        bert_emb = get_review_embedding(clean_review, tokenizer, bert_model, device)
        bert_embeddings.append(bert_emb)

    bert_embeddings = np.array(bert_embeddings)
    # Get OpenAI embeddings from training data
    openai_embeddings = mapping_samples.iloc[:, 3:].values

    # Save embedding data to the paths that the next cell reads back
    joblib.dump(bert_embeddings, bert_train_path)
    joblib.dump(openai_embeddings, openai_train_path)


In [8]:
# Load saved embedding data
bert_embeddings = joblib.load(bert_train_path)
openai_embeddings = joblib.load(openai_train_path)

# Verify shapes
print("bert_train shape:", bert_embeddings.shape)
print("openai_train shape:", openai_embeddings.shape)

bert_train shape: (25000, 768)
openai_train shape: (25000, 1536)


In [10]:
# Step 4: Fit the BERT input scaler or load from remote
import requests
from io import BytesIO
from sklearn.preprocessing import StandardScaler


# URL to the scaler file on GitHub
bert_scaler_url = "https://raw.githubusercontent.com/gclluch/psl_project_3/main/bert_scaler.pkl"

FETCH_BERT_SCALER = True

if FETCH_BERT_SCALER:
    bert_scaler = load_from_github(bert_scaler_url)
else:
    # Train a new scaler if loading is not required
    print("Fitting a new scaler...")
    bert_scaler = StandardScaler()
    bert_scaler.fit(bert_embeddings)  # Fit on training data


Downloading from https://raw.githubusercontent.com/gclluch/psl_project_3/main/bert_scaler.pkl...
Download and loading successful.


In [11]:
# Step 5: Train LinearRegression embedding alignment model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics.pairwise import cosine_similarity

# URL to the LinReg alignment file on GitHub
alignment_model_url = "https://raw.githubusercontent.com/gclluch/psl_project_3/main/alignment_model.pkl"

FETCH_ALIGNMENT_MODEL = True
if FETCH_ALIGNMENT_MODEL:
    # Load the trained alignment model from GitHub
    lin_reg = load_from_github(alignment_model_url)
    # Metrics recorded when the model was originally trained
    print("R^2 score of the Linear Regression: 0.3497796544637562")
    print("MSE score of the Linear Regression 0.0002662521497447378")
    print("Cosine Similarity score of the Linear Regression 0.7689253035754772")
else:
    # Apply the scaler
    bert_embeddings_scaled = bert_scaler.transform(bert_embeddings)

    # Fit linear regression model
    lin_reg = LinearRegression()
    lin_reg.fit(bert_embeddings_scaled, openai_embeddings)

    # Named lin_reg_r2 so the imported r2_score function is not shadowed
    lin_reg_r2 = lin_reg.score(bert_embeddings_scaled, openai_embeddings)
    print(f'R^2 score of the Linear Regression: {lin_reg_r2}')
    joblib.dump(lin_reg, 'lin_reg_model.pkl')

    # In-sample MSE and mean row-wise cosine similarity
    lin_reg_predictions = lin_reg.predict(bert_embeddings_scaled)
    lin_reg_mse = mean_squared_error(openai_embeddings, lin_reg_predictions)
    lin_reg_cos_sim = np.mean([
        cosine_similarity([openai_embeddings[i]], [lin_reg_predictions[i]])[0, 0]
        for i in range(len(openai_embeddings))
    ])

    print('MSE score of the Linear Regression', lin_reg_mse)
    print('Cosine Similarity score of the Linear Regression', lin_reg_cos_sim)


Downloading from https://raw.githubusercontent.com/gclluch/psl_project_3/main/alignment_model.pkl...
Download and loading successful.
R^2 score of the Linear Regression: 0.3497796544637562
MSE score of the Linear Regression 0.0002662521497447378
Cosine Similarity score of the Linear Regression 0.7689253035754772


In [12]:
# Step 6: Transforming Sentence Embeddings to OpenAI format

selected_reviews['sentence_embeddings'] = None
for idx, row in selected_reviews.iterrows():
    sentence_embeddings = []
    for sentence in row['sentences']:
        # Use BERT to embed a single sentence (the chunking helper also handles short inputs)
        clean_sentence = preprocess_text(sentence)
        bert_emb = get_review_embedding(clean_sentence, tokenizer, bert_model, device)
        bert_emb_reshaped = bert_emb.reshape(1, -1)

        # Transform BERT -> OpenAI using the learned mapping
        scaled_emb = bert_scaler.transform(bert_emb_reshaped)
        openai_emb = lin_reg.predict(scaled_emb)

        openai_emb_list = openai_emb.tolist()
        sentence_embeddings.append(openai_emb_list)

    # Store the transformed embeddings
    selected_reviews.at[idx, 'sentence_embeddings'] = sentence_embeddings



In [13]:
# Step 7: Sentiment analysis for individual sentences in selected reviews

# URLs for the sentiment model and scaler files on GitHub
model_url = "https://raw.githubusercontent.com/gclluch/psl_project_3/main/sentiment_model.pkl"
scaler_url = "https://raw.githubusercontent.com/gclluch/psl_project_3/main/sentiment_scaler.pkl"

# Load the sentiment model and scaler
logreg_sentiment_model = load_from_github(model_url)
logreg_sentiment_scaler = load_from_github(scaler_url)

# Confirm successful loading
if logreg_sentiment_model and logreg_sentiment_scaler:
    print("Sentiment model and scaler loaded successfully from GitHub.")
else:
    print("Failed to load sentiment model or scaler.")

print("Classifier and scaler loaded successfully.")
print(logreg_sentiment_model)

selected_reviews['sentence_scores'] = None
# For each review
for idx, row in selected_reviews.iterrows():
    sentence_scores = []
    for emb in row['sentence_embeddings']:
        emb_np = np.array(emb)

        # Scale the input embedding
        emb_scaled = logreg_sentiment_scaler.transform(emb_np.reshape(1, -1))

        # Generate sentence sentiment score
        prob = logreg_sentiment_model.predict_proba(emb_scaled)[0, 1]

        sentence_scores.append(prob)

    selected_reviews.at[idx, 'sentence_scores'] = sentence_scores

# selected_reviews['sentence_scores']

Downloading from https://raw.githubusercontent.com/gclluch/psl_project_3/main/sentiment_model.pkl...
Download and loading successful.
Downloading from https://raw.githubusercontent.com/gclluch/psl_project_3/main/sentiment_scaler.pkl...
Download and loading successful.
Sentiment model and scaler loaded successfully from GitHub.
Classifier and scaler loaded successfully.
LogisticRegression(C=0.01, l1_ratio=0.1, max_iter=10000, penalty='elasticnet',
                   random_state=42, solver='saga')


In [14]:
# Step 8: Visualizing the Results of individual sentence sentiment by color

from IPython.display import display, HTML

def highlight_sentences(review_text, sentences, scores):
    highlighted_text = review_text
    for sentence, score in zip(sentences, scores):
        # Map the score to a color: greens for positive, warm colors for negative
        if score > 0.9:
            color = 'chartreuse'  # Very strongly positive
        elif score > 0.8:
            color = 'limegreen'  # Strongly positive
        elif score > 0.6:
            color = 'yellowgreen'  # Slightly to moderately positive
        elif score < 0.1:
            color = 'red'  # Very strongly negative
        elif score < 0.2:
            color = 'orange'  # Strongly negative
        elif score < 0.4:
            color = 'yellow'  # Slightly to moderately negative
        else:
            color = 'grey'  # Neutral (0.4 to 0.6)
        # Wrap the sentence with a span tag
        highlighted_sentence = f'<span style="color:{color}">{sentence}</span>'
        # Replace the sentence in the review text
        highlighted_text = highlighted_text.replace(sentence, highlighted_sentence)
    return highlighted_text


# For each review
for idx, row in selected_reviews.iterrows():
    review_text = row['review']
    sentences = row['sentences']
    scores = row['sentence_scores']
    print(row['sentiment'])
    highlighted_review = highlight_sentences(review_text, sentences, scores)
    display(HTML(f"<p>{highlighted_review}</p>"))
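
One fragile point in the cell above is the use of `str.replace` on the full review: a sentence that occurs twice, or that appears inside another sentence, gets wrapped more than once. A possible alternative, shown here only as a sketch, builds the markup directly from the sentence list and escapes each sentence first. The names `score_to_color` and `render_sentences` are hypothetical and not part of the notebook; the thresholds and colors are copied from the cell above.

```python
import html

# (threshold, color) pairs, checked in order, mirroring the cell above
POSITIVE_BANDS = [(0.9, "chartreuse"), (0.8, "limegreen"),
                  (0.7, "yellowgreen"), (0.6, "yellowgreen")]
NEGATIVE_BANDS = [(0.1, "red"), (0.2, "orange"),
                  (0.3, "yellow"), (0.4, "yellow")]

def score_to_color(score):
    """Return the display color for a sentiment probability in [0, 1]."""
    for lower, color in POSITIVE_BANDS:
        if score > lower:
            return color
    for upper, color in NEGATIVE_BANDS:
        if score < upper:
            return color
    return "grey"  # neutral band, 0.4 to 0.6 inclusive

def render_sentences(sentences, scores):
    """Wrap each sentence in a colored span and join them into one paragraph."""
    spans = [
        f'<span style="color:{score_to_color(score)}">{html.escape(text)}</span>'
        for text, score in zip(sentences, scores)
    ]
    return "<p>" + " ".join(spans) + "</p>"

markup = render_sentences(["Loved it.", "Plot < acting."], [0.95, 0.5])
print(markup)
```

The trade-off is that escaping also turns `<br />` tags inside the reviews into literal text, and joining sentences with a single space discards the original spacing, so this variant suits cases where exact layout matters less than correct per-sentence coloring.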


1
1
1
1
1
0
0
0
0
0


In [15]:
# Print review details with sentences sorted by sentiment scores
for idx, row in selected_reviews.iterrows():
    print(f"Review {idx + 1}:")
    sentiment = "Positive" if row['sentiment'] == 1 else "Negative"
    print(f"Overall Sentiment: {sentiment}")

    # Pair sentences with their scores
    sentences = row['sentences']
    scores = row['sentence_scores']
    sentence_analysis = list(zip(sentences, scores))

    # Sort sentences by score
    sorted_sentences = sorted(sentence_analysis, key=lambda x: x[1], reverse=True)

    print("Sentences sorted by sentiment score:")
    for sentence, score in sorted_sentences:
        print(f"  [{score:.2f}] {sentence}")

    print("-" * 50)  # Separator for readability


Review 21222:
Overall Sentiment: Positive
Sentences sorted by sentiment score:
  [1.00] When television was still a young medium, there was a form of entertainment very prominent on the air that is but a memory today: musical variety.
  [1.00] Some musical shows were weekly series, but others were single, one-time specials, usually showcasing the special talent of the individual performer.
  [0.99] <br /><br />It amazes me that she still had the film debut of FUNNY GIRL yet to come, as well as turns as songwriter, director, and political activist.
  [0.99] In 1966, COLOR ME BARBRA introduced Barbra Streisand in color (hence the title), but copied the format of her first special a year earlier almost to the letter.
  [0.96] This is where we get the raw, uninhibited first looks at Streisand.
  [0.95] Before the Broadway phenomenon of the mid-60's.
  [0.95] Check it out."
  [0.93] It is not until Act 3, believe it or not, that the moment is matched or bettered by another feat: in the conc

In [16]:
# Step 9: Prepare for Leave-one-out: Sentiment Analysis for Full Embedded Reviews

# Add a new column for full review sentiment scores
selected_reviews['full_review_score'] = None

for idx, row in selected_reviews.iterrows():
    # Preprocess the full review
    full_review_text = preprocess_text(row['review'])

    # Embed the full review with BERT and reshape to (1, n_features) for the scaler
    full_review_embedding = get_review_embedding(full_review_text, tokenizer, bert_model, device)
    bert_emb_reshaped = full_review_embedding.reshape(1, -1)

    # Transform BERT -> OpenAI using the learned mapping
    scaled_bert_emb = bert_scaler.transform(bert_emb_reshaped)
    open_ai_review_emb = lin_reg.predict(scaled_bert_emb)

    # Scale the input embedding
    open_ai_emb_scaled = logreg_sentiment_scaler.transform(open_ai_review_emb.reshape(1, -1))
    # Predict sentiment using the logistic regression classifier
    full_review_score = logreg_sentiment_model.predict_proba(open_ai_emb_scaled)[0, 1]

    # Store the score in the DataFrame
    selected_reviews.at[idx, 'full_review_score'] = full_review_score

# Display the full review sentiment scores
# print(selected_reviews[['review', 'sentiment', 'full_review_score']])


In [17]:
# Step 10: Leave-One-Out Analysis for Sentences

# Add a new column to store leave-one-out differences for each sentence
selected_reviews['leave_one_out_differences'] = None

for idx, row in selected_reviews.iterrows():
    leave_one_out_differences = []
    base_score = row['full_review_score']  # Base sentiment score for the full review

    # For each sentence in the review
    for i, sentence in enumerate(row['sentences']):
        # Create the modified review by omitting the current sentence
        modified_review_text = " ".join([s for j, s in enumerate(row['sentences']) if j != i])

        # Preprocess the modified review
        clean_modified_review = preprocess_text(modified_review_text)

        # Embed the modified review with BERT
        modified_review_embedding = get_review_embedding(clean_modified_review, tokenizer, bert_model, device)
        modified_review_embedding = modified_review_embedding.reshape(1, -1)

        # Transform BERT -> OpenAI using the learned mapping
        scaled_modified_emb = bert_scaler.transform(modified_review_embedding)
        openai_modified_emb = lin_reg.predict(scaled_modified_emb)

        # Scale the transformed embedding
        openai_emb_scaled = logreg_sentiment_scaler.transform(openai_modified_emb.reshape(1, -1))

        # Predict sentiment for the modified review
        modified_score = logreg_sentiment_model.predict_proba(openai_emb_scaled)[0, 1]

        # Calculate the difference from the base score
        score_difference = base_score - modified_score
        leave_one_out_differences.append(score_difference)

    # Store the differences in the DataFrame
    selected_reviews.at[idx, 'leave_one_out_differences'] = leave_one_out_differences

# Display the leave-one-out differences
# print(selected_reviews[['review', 'sentiment', 'leave_one_out_differences']])
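
The leave-one-out loop above can be separated from the embedding pipeline, which makes the logic easier to check in isolation. The sketch below assumes a generic `score_fn` that maps text to a probability; `leave_one_out_impacts` and the word-list scorer `toy_score` are hypothetical stand-ins, not the notebook's BERT, linear mapping, and logistic regression chain. Unlike the cell above, it also computes the base score from the joined sentences rather than from the raw review text, so that the base and the reduced texts are built the same way.

```python
def leave_one_out_impacts(sentences, score_fn):
    """Return the base score and, per sentence, base minus score-without-it."""
    base = score_fn(" ".join(sentences))
    impacts = []
    for i in range(len(sentences)):
        reduced = " ".join(s for j, s in enumerate(sentences) if j != i)
        impacts.append(base - score_fn(reduced))
    return base, impacts

# Hypothetical stand-in scorer: share of sentiment words that are positive
POSITIVE = {"great", "love"}
NEGATIVE = {"dull", "bad"}

def toy_score(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

sentences = ["Great cast.", "The middle is dull.", "I love the ending."]
base, impacts = leave_one_out_impacts(sentences, toy_score)

# Rank by magnitude so strongly negative sentences are not pushed to the bottom
ranked = sorted(zip(sentences, impacts), key=lambda pair: abs(pair[1]), reverse=True)
for sentence, impact in ranked:
    print(f"  [Impact: {impact:+.4f}] {sentence}")
```

A positive impact means that removing the sentence lowers the score, so the sentence pushes the review toward positive; a negative impact means the opposite.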


In [20]:
# Print review details with sentences sorted by their leave-one-out impact
for idx, row in selected_reviews.iterrows():
    print(f"Review {idx + 1}:")
    sentiment = "Positive" if row['sentiment'] == 1 else "Negative"
    print(f"Overall Sentiment: {sentiment}")
    print(f"Base Sentiment Score: {row['full_review_score']:.2f}")

    # Retrieve the sentences and their leave-one-out differences
    sentences = row['sentences']
    leave_one_out_differences = row['leave_one_out_differences']

    # Ensure all impacts are floats
    leave_one_out_differences = [float(impact) for impact in leave_one_out_differences]

    # Pair each sentence with its impact (base score minus score without the sentence)
    sentence_analysis = list(zip(sentences, leave_one_out_differences))

    # Sort by signed impact, largest positive first (not by absolute magnitude)
    sorted_sentences = sorted(sentence_analysis, key=lambda x: x[1], reverse=True)

    print("Sentences sorted by their impact on the overall sentiment score:")
    for sentence, impact in sorted_sentences:
        print(f"  [Impact: {impact:.4f}] {sentence}")

    print("-" * 50)  # Separator for readability


Review 21222:
Overall Sentiment: Positive
Base Sentiment Score: 0.99
Sentences sorted by their impact on the overall sentiment score:
  [Impact: 0.0487] Yet there are cuts, dissolves, and tracking shots galore, resulting in one rather spectacular peak moment-- the modern, slightly beatnik-flavored, \Gotta Move.\" After getting lost amongst the modern abstracts, jazz-club bongos begin, with Streisand emerging in a psychedelic gown and glittering eye makeup, doing the catchy staccato tune with almost androgynous sex appeal.
  [Impact: 0.0271] In 3 distinct acts, we get an abstract Streisand (in an after-hours art museum looking at and sometimes becoming the works of art), a comic Streisand working an already adoring audience in a studio circus (populated with many fuzzy and furry animals), and best of all, a singing Streisand in mini-concert format just-- well, frankly, just doing it.
  [Impact: 0.0202] Some musical shows were weekly series, but others were single, one-time specials, usu