# 🎓 Building a Sentiment Classifier with Logistic Regression

In this notebook, we will build a **Logistic Regression** classifier to predict the sentiment of customer reviews for women's clothing from an e-commerce website. Sentiment is encoded in the `sentiment` column as `1` (positive) or `-1` (negative).

For this task, we will use the **LogisticRegression** class from the scikit-learn library.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression  # To train the model and predict the sentiment
from sklearn.metrics import accuracy_score  # To calculate the accuracy of the model

In [3]:
# --- Step 1: Load the dataset ---
file_path = 'womens_clothing_ecommerce_reviews.csv'
df = pd.read_csv(file_path)
print("✅ Successfully loaded the dataset.")
print("Dataset preview:")
print(df.head())

✅ Successfully loaded the dataset.
Dataset preview:
                                         Review Text  sentiment
0  Absolutely wonderful - silky and sexy and comf...          1
1  Love this dress!  it's sooo pretty.  i happene...          1
2  I love, love, love this jumpsuit. it's fun, fl...          1
3  This shirt is very flattering to all due to th...          1
4  I love tracy reese dresses, but this one is no...         -1


In [4]:
# Get some basic information about the dataset
print("\nDataset Information:")
df.info()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19818 entries, 0 to 19817
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  19818 non-null  object
 1   sentiment    19818 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 309.8+ KB
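
Before splitting, it can help to check how the two sentiment classes are balanced, since that is what `stratify=y` preserves in the next step. The sketch below uses a small hypothetical DataFrame (`toy_df`, invented for illustration) with the same column names as the real dataset; on the real data you would call `value_counts` on `df['sentiment']` instead.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (same column names, made-up rows).
toy_df = pd.DataFrame({
    "Review Text": ["love it", "great fit", "too small", "perfect", "nice color"],
    "sentiment": [1, 1, -1, 1, 1],
})

# Fraction of reviews in each class (normalize=True returns shares, not counts).
class_share = toy_df["sentiment"].value_counts(normalize=True)
print(class_share)
```
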


In [5]:
# --- Step 2: Split Data into Training and Testing Sets ---
# It's crucial to test our model on data it has never seen before.
# We'll use 80% of the data for training and 20% for testing.
X = df['Review Text']  # A 1D Series is fine here: the text goes to the vectorizer, not directly to the model
y = df['sentiment']

# 'stratify=y' keeps the proportion of positive and negative reviews
# the same in both the training set and the testing set.
# (Here 'y' is the label column defined above, not a plot axis.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nData split into {len(X_train)} training samples and {len(X_test)} testing samples.")


Data split into 15854 training samples and 3964 testing samples.
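
To see what `stratify` does, here is a minimal sketch on hypothetical toy labels (80 positive, 20 negative; the `toy_*` names are invented for illustration). It uses the same `train_test_split` arguments as the cell above and compares the share of positive labels in each split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 positive and 20 negative labels.
toy_X = pd.Series([f"review {i}" for i in range(100)])
toy_y = pd.Series([1] * 80 + [-1] * 20)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Share of positive labels in each split; stratification keeps them aligned.
train_pos_share = (toy_y_train == 1).mean()
test_pos_share = (toy_y_test == 1).mean()
print(f"Positive share - train: {train_pos_share:.2f}, test: {test_pos_share:.2f}")
```

Without `stratify`, the two shares would generally differ from split to split, which makes test metrics harder to compare with training behaviour.
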


In [6]:
# --- Step 3: Feature Engineering with Bag-of-Words ---
# Here, we convert the text reviews into numerical feature vectors.
# Each feature is a count of how many times a word appears in a review.
print("\nConverting text to numerical features using Bag-of-Words...")

# Initialize the vectorizer. `stop_words='english'` removes common
# English words like 'the', 'a', 'is', which don't carry much sentiment.
vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the TRAINING data and transform it into a matrix
X_train_bow = vectorizer.fit_transform(X_train)

# ONLY transform the TESTING data using the already-fitted vectorizer
X_test_bow = vectorizer.transform(X_test)

print("✅ Text successfully converted to feature vectors.")


Converting text to numerical features using Bag-of-Words...
✅ Text successfully converted to feature vectors.


In [7]:
# --- Step 4: Train the Linear Classifier ---
# We'll use Logistic Regression, a reliable linear model for classification.
print("\nTraining the Logistic Regression model...")

# Initialize the model
# max_iter is raised from the default of 100 so the solver has enough iterations to converge
model = LogisticRegression(max_iter=2000)

# Train the model on our Bag-of-Words training data
model.fit(X_train_bow, y_train)

print("✅ Model training complete!")


Training the Logistic Regression model...
✅ Model training complete!


In [8]:
# --- Step 5: Evaluate the Model's Performance ---
# Let's see how accurately our model predicts sentiment on the unseen test data.
print("\nEvaluating model performance on the test set...")

# Make predictions on the test data
y_pred = model.predict(X_test_bow)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"📈 Model Accuracy: {accuracy:.4f} ({accuracy:.2%})")


Evaluating model performance on the test set...
📈 Model Accuracy: 0.9294 (92.94%)
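
Accuracy alone can hide weak performance on the smaller class when the classes are imbalanced. As a sketch of why, the hypothetical example below (made-up labels, not the notebook's results) scores a "model" that always predicts positive: accuracy looks reasonable while recall on the negative class is zero. On the real data, the same functions could be applied to `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 positive, 2 negative, and predictions that are always positive.
toy_y_true = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1]
toy_y_pred = [1] * 10

toy_accuracy = accuracy_score(toy_y_true, toy_y_pred)
# Rows are true classes, columns are predicted classes, both ordered as [-1, 1].
toy_cm = confusion_matrix(toy_y_true, toy_y_pred, labels=[-1, 1])
# Recall for the negative class: share of true negatives that were caught.
toy_neg_recall = recall_score(toy_y_true, toy_y_pred, pos_label=-1)

print(f"Accuracy: {toy_accuracy:.2f}")
print("Confusion matrix:\n", toy_cm)
print(f"Recall on negative class: {toy_neg_recall:.2f}")
```
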


In [10]:
# --- Step 6: Predict Sentiment on New Reviews ---
# This is the fun part! Let's use our trained model on brand new text.
print("\n--- Making Predictions on New Reviews ---")

new_reviews = [
    "This dress is absolutely beautiful and fits perfectly!",
    "The material felt cheap and it was not what I expected.",
    "It's an okay product, not great but not terrible either.",
    "I am so disappointed with this purchase, I will be returning it."
]

# 1. Transform the new reviews into the Bag-of-Words format
new_reviews_bow = vectorizer.transform(new_reviews)

# 2. Predict using our trained model
new_predictions = model.predict(new_reviews_bow)


# 3. Print each review next to its predicted label
for review, prediction in zip(new_reviews, new_predictions):
    print(f"Review: {review}")
    print(f"Predicted Sentiment: {prediction}\n")


--- Making Predictions on New Reviews ---
Review: This dress is absolutely beautiful and fits perfectly!
Predicted Sentiment: 1

Review: The material felt cheap and it was not what I expected.
Predicted Sentiment: -1

Review: It's an okay product, not great but not terrible either.
Predicted Sentiment: -1

Review: I am so disappointed with this purchase, I will be returning it.
Predicted Sentiment: -1
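
A hard label hides how confident the model is, which matters for mixed reviews like the third one above. The sketch below uses `predict_proba` on a hypothetical toy model (all `toy_*` names and texts are invented for illustration) to show class probabilities instead of labels. The same call should work on `model` and `new_reviews_bow` from the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical reviews and labels, not taken from the real dataset.
toy_reviews = [
    "love it, great fit",
    "great dress, love the color",
    "terrible fabric, cheap",
    "cheap and terrible quality",
]
toy_labels = [1, 1, -1, -1]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_model = LogisticRegression(max_iter=2000).fit(
    toy_vectorizer.fit_transform(toy_reviews), toy_labels
)

toy_new = ["great dress", "cheap fabric", "zipper"]
toy_proba = toy_model.predict_proba(toy_vectorizer.transform(toy_new))

# Columns follow toy_model.classes_, which is [-1, 1] here.
for text, (p_neg, p_pos) in zip(toy_new, toy_proba):
    print(f"{text!r}: P(negative)={p_neg:.2f}, P(positive)={p_pos:.2f}")
```

A review made only of words the vectorizer has never seen (here "zipper") becomes an all-zero vector, so its probabilities depend only on the model's intercept. Bag-of-Words also ignores word order, which is one reason mixed or negated phrasing can be hard for this kind of model.
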

