In [4]:
import pandas as pd

# Load the dataset
file_path = "/Users/liuliangcheng/Desktop/Duke/IDS_NLP/final/Womens Clothing E-Commerce Reviews.csv.zip"
data = pd.read_csv(file_path)

# Step 1: Remove rows with missing Review Text
cleaned_data = data.dropna(subset=['Review Text']).reset_index(drop=True)

# Step 2: Map Ratings to Sentiment Categories
# (1-2: Dissatisfied, 3: Neutral, 4-5: Satisfied)
cleaned_data['Sentiment'] = cleaned_data['Rating'].map({
    1: 'Dissatisfied',
    2: 'Dissatisfied',
    3: 'Neutral',
    4: 'Satisfied',
    5: 'Satisfied'
})

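# Step 3: Text Preprocessing (optional for BERT)
# Remove every character that is not a letter, digit, or whitespace, then
# strip leading/trailing whitespace. Note that this also drops apostrophes
# ("don't" becomes "dont"); BERT's tokenizer can handle raw punctuation.
cleaned_data['Review Text'] = cleaned_data['Review Text'].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True).str.strip()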

# Step 4: Save the cleaned data
output_path = "/Users/liuliangcheng/Desktop/Duke/IDS_NLP/final/cleaned_reviews.csv"
cleaned_data.to_csv(output_path, index=False)

print(f"Cleaned data saved to {output_path}")


Cleaned data saved to /Users/liuliangcheng/Desktop/Duke/IDS_NLP/final/cleaned_reviews.csv


The cleaned dataset is ready for modeling:

Missing Values:

No missing values in the Review Text column.
No missing values in the Sentiment column.

Total Entries:

The dataset contains 22,641 entries, which should be enough to fine-tune a BERT model for three-class classification.

Sentiment Distribution:

Satisfied: 17,448 entries (about 77.1%)
Neutral: 2,823 entries (about 12.5%)
Dissatisfied: 2,370 entries (about 10.5%)

The distribution is heavily skewed toward the "Satisfied" category. This can be managed during model training with techniques such as class weighting or oversampling.
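The checks summarized above can be reproduced with a few pandas calls. The sketch below is illustrative only: `summarize_sentiment` is a hypothetical helper, not part of the notebook, and the toy frame stands in for `cleaned_data` so the snippet runs on its own.

```python
import pandas as pd


def summarize_sentiment(df: pd.DataFrame) -> dict:
    """Return missing-value counts, row count, and the sentiment distribution."""
    return {
        "missing_review_text": int(df["Review Text"].isna().sum()),
        "missing_sentiment": int(df["Sentiment"].isna().sum()),
        "total": len(df),
        "counts": df["Sentiment"].value_counts().to_dict(),
    }


# Toy frame standing in for cleaned_data (illustrative values only)
toy = pd.DataFrame({
    "Review Text": ["Love it", "It is ok", "Too small", "Great fit"],
    "Sentiment": ["Satisfied", "Neutral", "Dissatisfied", "Satisfied"],
})

summary = summarize_sentiment(toy)
print(summary)
```

Calling `summarize_sentiment(cleaned_data)` in the notebook should give the same kind of summary for the real data.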

In [5]:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
import torch

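# Fix the class order explicitly; compute_class_weight returns one weight
# per entry of 'classes', in this order
classes = np.array(['Satisfied', 'Neutral', 'Dissatisfied'])

# Calculate class weights: n_samples / (n_classes * count of each class)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=cleaned_data['Sentiment'])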

# Convert class weights to a PyTorch tensor
class_weights = torch.tensor(class_weights, dtype=torch.float)

# Output the class weights for confirmation
print("Class weights:", class_weights)


Class weights: tensor([0.4325, 2.6734, 3.1844])


The class weights tensor([0.4325, 2.6734, 3.1844]) follow the order of the classes array (Satisfied, Neutral, Dissatisfied) and reflect the significant class imbalance in our dataset:

Satisfied (weight: 0.4325): Majority class, given the lowest weight.
Neutral (weight: 2.6734): Minority class, given a higher weight.
Dissatisfied (weight: 3.1844): Smallest class, given the highest weight to balance its contribution.
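These values can be checked by hand. With the 'balanced' option, scikit-learn assigns each class the weight n_samples / (n_classes * class_count). A minimal sketch, using the class counts reported earlier in this notebook:

```python
# Class counts reported in the sentiment distribution
counts = {"Satisfied": 17448, "Neutral": 2823, "Dissatisfied": 2370}

n_samples = sum(counts.values())  # 22641
n_classes = len(counts)           # 3

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"{label}: {weight:.4f}")
# Satisfied: 0.4325
# Neutral: 2.6734
# Dissatisfied: 3.1844
```

The smaller a class, the larger its weight, which is why Dissatisfied ends up with the highest value.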
What This Means:

When these weights are passed to the loss function, errors on the minority classes (Neutral and Dissatisfied) are penalized more heavily than errors on the majority class (Satisfied). This encourages the model to pay more attention to underrepresented classes.
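To make the effect concrete, here is a small pure-Python sketch of a per-example weighted negative log-likelihood, -w[y] * log(p[y]). The helper `weighted_nll` and the probability 0.2 are hypothetical and chosen only for illustration. It compares the loss for the same mistake on a Satisfied review and on a Dissatisfied review.

```python
import math

# Weights in the order printed above: Satisfied, Neutral, Dissatisfied
class_weights = {"Satisfied": 0.4325, "Neutral": 2.6734, "Dissatisfied": 3.1844}


def weighted_nll(prob_of_true_class: float, true_label: str) -> float:
    """Per-example weighted negative log-likelihood: -w[y] * log(p[y])."""
    return -class_weights[true_label] * math.log(prob_of_true_class)


# Hypothetical case: the model assigns probability 0.2 to the correct class
p = 0.2
loss_satisfied = weighted_nll(p, "Satisfied")
loss_dissatisfied = weighted_nll(p, "Dissatisfied")

print(f"Satisfied:    {loss_satisfied:.3f}")
print(f"Dissatisfied: {loss_dissatisfied:.3f}")
```

In PyTorch, this weighting is usually applied by passing the tensor as the `weight` argument of `torch.nn.CrossEntropyLoss`. Position i of the tensor is applied to integer label i, so the label encoding used for training would need to follow the same order (assumed here: Satisfied = 0, Neutral = 1, Dissatisfied = 2).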