<div align="center" >
    <h1> Sentiment Analysis for Product Reviews</h1>
</div>

* *__Description:__* Build a system that can analyze customer reviews from e-commerce sites to determine the sentiment (positive, negative, neutral).
<br></br>
* *__Tools/Techniques:__* Use NLP libraries like `NLTK` or `spaCy`, and machine learning models like Naive Bayes or transformers like BERT.
<br></br>
* *__Dataset:__* Amazon Customer Reviews, Yelp Reviews.

Step-by-Step Explanation:
1. *__Import Libraries:__*
```python 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, confusion_matrix
import joblib
```
2. *__Load the dataset:__*
```python
file_path = 'Womens-Clothing-E-Commerce-Reviews.csv'  
df = pd.read_csv(file_path)
```
3.  *__Preprocess the data:__*<br></br>
a. Drop rows with missing values in the 'Review Text' column
```python
df.dropna(subset= ['Review Text'], inplace=True)
```
b. Convert text to lowercase and remove punctuation
```python
df['Review Text'] = df['Review Text'].str.lower()
df['Review Text'] = df['Review Text'].str.replace(r'[^\w\s]', '', regex=True)
``` 
c. Map the 'Rating' column to binary labels (0 and 1)
```python
df['Rating'] = df['Rating'].apply(lambda x: 0 if x < 3 else 1)  
# Ratings below 3 are negative (0); ratings of 3 and above are positive (1)
```  
    * `dropna`: Removes rows with missing values in the specified column.
    * `str.lower()`: Converts text to lowercase.
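    * `str.replace()`: Removes punctuation from the text. The pattern must be applied as a regular expression (`regex=True`), because pandas 2.0+ treats it as a literal string by default.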
    * `lambda` function: Converts ratings into binary classes (0 for negative and 1 for positive).
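
To make the effect of the `[^\w\s]` pattern concrete, here is a minimal sketch that uses Python's built-in `re` module on a made-up review string. The sample text and the `strip_punctuation` helper are illustrative and not part of the notebook; in pandas, the same pattern behaves this way only when `str.replace` is called with `regex=True`.

```python
import re

# Hypothetical helper: lowercase the text, then drop every character that is
# neither a word character (\w) nor whitespace (\s).
def strip_punctuation(text: str) -> str:
    return re.sub(r"[^\w\s]", "", text.lower())

sample = "Love this dress! It's sooo pretty."
cleaned = strip_punctuation(sample)
print(cleaned)
```

Note that the apostrophe is removed as well, so "it's" becomes "its".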

4. *__Split the dataset:__*
```python
X_train, X_test, y_train, y_test = train_test_split(df['Review Text'], df['Rating'], test_size=0.2, random_state=42)
```
    * `train_test_split`: Splits the dataset into training and testing sets.
5. *__Convert text data to numerical features using `TF-IDF`:__*
```python
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
```
    + *`TF-IDF` (Term Frequency-Inverse Document Frequency):* A statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus).
    + *Term Frequency (TF): Measures how frequently a term occurs in a document.*
    + *Inverse Document Frequency (IDF): Measures how important a term is. It decreases the weight of terms that appear very frequently in the document set and increases the weight of terms that appear rarely.*
    + *`TfidfVectorizer`: Transforms text data into a numerical representation based on TF-IDF scores.*
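
The definitions above can be sketched in plain Python. This is a simplified illustration on a made-up, pre-tokenized corpus, not the library's implementation. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; tokenization is skipped entirely.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): three tokenized "reviews".
docs = [
    ["love", "this", "dress"],
    ["love", "this", "fabric"],
    ["this", "dress", "tore"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf(doc):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(docs[2])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {w:.3f}")
```

In the third document, "tore" appears in only one document and gets the largest weight, while "this" appears in every document and gets the smallest.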
6. *__Apply `SMOTE` to handle class imbalance:__*
```python
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)
```
    * *`SMOTE` (Synthetic Minority Over-sampling Technique)*: An oversampling technique that balances the class distribution by generating synthetic minority-class samples, each interpolated between existing minority-class samples. Here it is applied to the training data only, so the test set keeps its original class distribution.
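
The interpolation step can be illustrated with a short sketch. This is not the `imblearn` implementation: the real algorithm picks the neighbor from the k nearest minority-class neighbors of each sample, whereas the hypothetical `interpolate` helper below takes the neighbor as given. The feature vectors are made up.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)

# Two made-up minority-class feature vectors (for example, TF-IDF rows).
minority_a = [0.0, 0.5, 0.2]
minority_b = [0.4, 0.1, 0.2]

synthetic = interpolate(minority_a, minority_b, rng)
print(synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two original samples, so the new sample stays inside the region already occupied by the minority class.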

7. *__Build and train the classifier:__*
```python
clf = MultinomialNB()
clf.fit(X_train_resampled, y_train_resampled)
```
   * *Multinomial Naive Bayes: A probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between features. It is designed for discrete features such as word counts; per the scikit-learn documentation, fractional counts such as TF-IDF values may also work in practice, which is how it is used here.*
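
As a rough illustration of the idea, the following sketch implements a word-count Naive Bayes classifier with Laplace smoothing in plain Python. The training data, the `log_posterior` and `predict` helpers, and the choice of `alpha=1.0` are assumptions made for this example; `MultinomialNB` should be preferred in practice.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (tokens, label) with 1 = positive, 0 = negative.
train = [
    (["love", "great", "fit"], 1),
    (["great", "quality"], 1),
    (["poor", "quality", "tore"], 0),
    (["poor", "fit"], 0),
]

vocab = {tok for tokens, _ in train for tok in tokens}
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior(tokens, label, alpha=1.0):
    """Unnormalised log P(label | tokens) with Laplace smoothing (alpha)."""
    log_prob = math.log(class_docs[label] / len(train))  # class prior
    total = sum(word_counts[label].values())
    for tok in tokens:
        if tok not in vocab:
            continue  # ignore words never seen in training
        likelihood = (word_counts[label][tok] + alpha) / (total + alpha * len(vocab))
        log_prob += math.log(likelihood)
    return log_prob

def predict(tokens):
    """Return the label with the highest unnormalised log posterior."""
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict(["great", "fit"]))
print(predict(["poor", "quality"]))
```

The "naive" assumption shows up in the loop: each word contributes its own log-likelihood independently of the other words in the review.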

8. *__Save the model:__*
```python
joblib.dump(clf, 'sentiment_analysis_model_womens_clothing.pkl')
```
  * `Joblib`: *A library for efficient serialization of Python objects.*
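
The saved classifier can only score new text if that text is transformed by the same fitted vectorizer. One possible approach, sketched here under the assumption that both objects fit comfortably in one file, is to store them together. The tiny training set, the dictionary keys, and the `sentiment_bundle.pkl` file name are placeholders chosen for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, only to make the sketch runnable.
texts = ["love this dress", "great fit", "poor quality", "tore after one wash"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Save both fitted objects in one file (the path is a placeholder).
bundle_path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")
joblib.dump({"vectorizer": vectorizer, "model": model}, bundle_path)

# Later, possibly in a different process: load both and predict.
bundle = joblib.load(bundle_path)
features = bundle["vectorizer"].transform(["great dress"])
prediction = bundle["model"].predict(features)
print(prediction)
```

With this layout, prediction code does not need access to the original training data to rebuild the vocabulary.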

9. *__Evaluate the model:__*
```python
y_pred = clf.predict(X_test_tfidf)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```
* Classification Report: Provides metrics such as `precision`, `recall`, and `F1-score` for each class.

  * `Precision`: The number of true positive results divided by the number of all positive results predicted by the classifier.
  * `Recall`: The number of true positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
  * `F1-score`: The harmonic mean of precision and recall, providing a balance between the two metrics.
  * `confusion_matrix`: A table that describes the performance of a classification model. It shows the true positives, true negatives, false positives, and false negatives.

#### Step 1: Setting Up the Environment:
This project uses `pandas`, `numpy`, and `scikit-learn`, plus `imbalanced-learn` for SMOTE and `joblib` for saving the model.

```python
# Install the required libraries
# !pip3 install scikit-learn nltk
!pip3 install imbalanced-learn
```

```
Successfully installed imbalanced-learn-0.12.3
```


#### Step 2: Import Libraries
Let's start by importing the necessary libraries. Although Amazon Customer Reviews is one of the suggested datasets, the walkthrough below loads `Womens-Clothing-E-Commerce-Reviews.csv`; the Amazon data is used in the testing section at the end.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
```


#### Step 3: Load and Inspect the Data

First, we'll load the dataset from the CSV file and inspect its structure.

```python
# Load the dataset
file_path = 'Womens-Clothing-E-Commerce-Reviews.csv'  # Replace with your actual file path
df = pd.read_csv(file_path)

df.head()
```

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


```python
df.columns
```

```
Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')
```

#### Step 4: Data Preprocessing
Preprocess the data by cleaning and preparing the text for sentiment analysis.

```python
# Drop rows with missing values in the 'Review Text' column
df.dropna(subset=['Review Text'], inplace=True)

# Convert text to lowercase and attempt to remove punctuation.
# As written, pandas 2.0+ matches the pattern literally (regex=False by default),
# so punctuation survives in the output below; pass regex=True to strip it.
df['Review Text'] = df['Review Text'].str.lower()
df['Review Text'] = df['Review Text'].str.replace(r'[^\w\s]', '')

# Display the cleaned text
df['Review Text'].head()
```

```
0    absolutely wonderful - silky and sexy and comf...
1    love this dress!  it's sooo pretty.  i happene...
2    i had such high hopes for this dress and reall...
3    i love, love, love this jumpsuit. it's fun, fl...
4    this shirt is very flattering to all due to th...
Name: Review Text, dtype: object
```

```python
df['Review Text'].shape
```

```
(22641,)
```

#### Step 5: Perform Sentiment Analysis
Build a sentiment analysis model from TF-IDF features, SMOTE oversampling, and a Multinomial Naive Bayes classifier.

```python
# Map the 'Rating' column to binary labels:
# ratings below 3 are negative (0), ratings of 3 and above are positive (1)
df['Rating'] = df['Rating'].apply(lambda x: 0 if x < 3 else 1)

# Split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df['Review Text'], df['Rating'], test_size=0.2, random_state=42)

# Convert text data to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Apply SMOTE to handle class imbalance in training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)

# Build the classifier (using Multinomial Naive Bayes in this example)
clf = MultinomialNB()
clf.fit(X_train_resampled, y_train_resampled)

# Evaluate the model
y_pred = clf.predict(X_test_tfidf)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

```
Classification Report:
               precision    recall  f1-score   support

           0       0.37      0.81      0.51       457
           1       0.98      0.85      0.91      4072

    accuracy                           0.84      4529
   macro avg       0.68      0.83      0.71      4529
weighted avg       0.92      0.84      0.87      4529

Confusion Matrix:
 [[ 371   86]
 [ 620 3452]]
```
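
To connect the report to the confusion matrix printed above, the per-class metrics can be recomputed by hand. This is a minimal sketch: the `precision_recall_f1` helper is illustrative, and the matrix values are copied from the output, where rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the evaluation output: rows are true labels,
# columns are predicted labels, in the order [negative (0), positive (1)].
matrix = [[371, 86],
          [620, 3452]]

def precision_recall_f1(matrix, cls):
    """Compute precision, recall and F1 for one class index of a 2x2 matrix."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column total
    actual = sum(matrix[cls])                    # row total
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in (0, 1):
    p, r, f = precision_recall_f1(matrix, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

accuracy = (matrix[0][0] + matrix[1][1]) / sum(map(sum, matrix))
print(f"accuracy={accuracy:.2f}")
```

For the negative class, 371 of the 991 reviews predicted negative are truly negative, which gives the low precision of 0.37, while 371 of the 457 truly negative reviews are found, which gives the recall of 0.81.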


#### Step 6: Save and Use the Model
Save the trained model and use it to predict sentiments on new reviews.

```python
import joblib

# Save the model
joblib.dump(clf, 'sentiment_analysis_model_womens_clothing.pkl')

# Example new reviews
new_reviews = ["This dress is amazing! It fits perfectly and looks elegant.",
               "The quality of this blouse is poor, it tore after just one wash.",
               "Looks nice on me but i dont like it", "Received a borken item"]

# Load the model
loaded_model = joblib.load('sentiment_analysis_model_womens_clothing.pkl')

# Transform the new reviews with the vectorizer fitted during training
new_reviews_processed = tfidf_vectorizer.transform(new_reviews)

# Make predictions
predictions = loaded_model.predict(new_reviews_processed)

# Print the predictions
for review, sentiment in zip(new_reviews, predictions):
    print(f"Review: {review}\nSentiment: {'Positive' if sentiment == 1 else 'Negative'}\n")
```

```
Review: This dress is amazing! It fits perfectly and looks elegant.
Sentiment: Positive

Review: The quality of this blouse is poor, it tore after just one wash.
Sentiment: Negative

Review: Looks nice on me but i dont like it
Sentiment: Positive

Review: Received a borken item
Sentiment: Negative
```



### Complete Code: 

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Load the dataset
file_path = 'Womens-Clothing-E-Commerce-Reviews.csv'  # Replace with your actual file path
df = pd.read_csv(file_path)

# Drop rows with missing values in the 'Review Text' column
df.dropna(subset=['Review Text'], inplace=True)

# Convert text to lowercase and attempt to remove punctuation
# (on pandas 2.0+ the pattern only acts as a regex when regex=True is passed)
df['Review Text'] = df['Review Text'].str.lower()
df['Review Text'] = df['Review Text'].str.replace(r'[^\w\s]', '')

# Map the 'Rating' column to binary labels:
# ratings below 3 are negative (0), ratings of 3 and above are positive (1)
df['Rating'] = df['Rating'].apply(lambda x: 0 if x < 3 else 1)

# Split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df['Review Text'], df['Rating'], test_size=0.2, random_state=42)

# Convert text data to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Apply SMOTE to handle class imbalance in training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)

# Build the classifier (using Multinomial Naive Bayes in this example)
clf = MultinomialNB()
clf.fit(X_train_resampled, y_train_resampled)

# Save the model
joblib.dump(clf, 'sentiment_analysis_model_womens_clothing.pkl')

# Evaluate the model
y_pred = clf.predict(X_test_tfidf)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Example new reviews
new_reviews = ["This dress is amazing! It fits perfectly and looks elegant.",
               "The quality of this blouse is poor, it tore after just one wash.",
               "Looks nice on me but i dont like it", "Received a borken item"]

# Load the model
loaded_model = joblib.load('sentiment_analysis_model_womens_clothing.pkl')

# Transform the new reviews with the vectorizer fitted during training
new_reviews_processed = tfidf_vectorizer.transform(new_reviews)

# Make predictions
predictions = loaded_model.predict(new_reviews_processed)

# Print the predictions
for review, sentiment in zip(new_reviews, predictions):
    print(f"Review: {review}\nSentiment: {'Positive' if sentiment == 1 else 'Negative'}\n")
```

```
Classification Report:
               precision    recall  f1-score   support

           0       0.37      0.81      0.51       457
           1       0.98      0.85      0.91      4072

    accuracy                           0.84      4529
   macro avg       0.68      0.83      0.71      4529
weighted avg       0.92      0.84      0.87      4529

Confusion Matrix:
 [[ 371   86]
 [ 620 3452]]
Review: This dress is amazing! It fits perfectly and looks elegant.
Sentiment: Positive

Review: The quality of this blouse is poor, it tore after just one wash.
Sentiment: Negative

Review: Looks nice on me but i dont like it
Sentiment: Positive

Review: Received a borken item
Sentiment: Negative
```



## Testing on Amazon Ecommerce Data: 

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Load the dataset
file_path = 'amazon_reviews_us_Apparel_v1_00.tsv'  # Replace with your actual file path
df = pd.read_csv(file_path, sep='\t', on_bad_lines='skip')

df.head()
```

| | marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | 32158956 | R1KKOXHNI8MSXU | B01KL6O72Y | 24485154 | Easy Tool Stainless Steel Fruit Pineapple Core... | Apparel | 4.0 | 0.0 | 0.0 | N | Y | ★ THESE REALLY DO WORK GREAT WITH SOME TWEAKING ★ | These Really Do Work Great, But You Do Need To... | 2013-01-14 |
| 1 | US | 2714559 | R26SP2OPDK4HT7 | B01ID3ZS5W | 363128556 | V28 Women Cowl Neck Knit Stretchable Elasticit... | Apparel | 5.0 | 1.0 | 2.0 | N | Y | Favorite for winter. Very warm! | I love this dress. Absolute favorite for winte... | 2014-03-04 |
| 2 | US | 12608825 | RWQEDYAX373I1 | B01I497BGY | 811958549 | James Fiallo Men's 12-Pairs Low Cut Athletic S... | Apparel | 5.0 | 0.0 | 0.0 | N | Y | Great Socks for the money. | Nice socks, great colors, just enough support ... | 2015-07-12 |
| 3 | US | 25482800 | R231YI7R4GPF6J | B01HDXFZK6 | 692205728 | Belfry Gangster 100% Wool Stain-Resistant Crus... | Apparel | 5.0 | 0.0 | 0.0 | N | Y | Slick hat! | I bought this for my husband and WOW, this is ... | 2015-06-03 |
| 4 | US | 9310286 | R3KO3W45DD0L1K | B01G6MBEBY | 431150422 | JAEDEN Women's Beaded Spaghetti Straps Sexy Lo... | Apparel | 5.0 | 0.0 | 0.0 | N | Y | I would do it again! | Perfect dress and the customer service was awe... | 2015-06-12 |


```python
print(df.shape)
# Drop rows with missing values in the 'review_body' column
df.dropna(subset=['review_body'], inplace=True)
df.shape
```

```
(5881874, 15)
(5880879, 15)
```

```python
# Convert text to lowercase and attempt to remove punctuation.
# As written, pandas 2.0+ matches the pattern literally (regex=False by default),
# so punctuation survives in the output below; pass regex=True to strip it.
df['review_body'] = df['review_body'].str.lower()
df['review_body'] = df['review_body'].str.replace(r'[^\w\s]', '')
df['review_body'].head()
```

```
0    these really do work great, but you do need to...
1    i love this dress. absolute favorite for winte...
2    nice socks, great colors, just enough support ...
3    i bought this for my husband and wow, this is ...
4    perfect dress and the customer service was awe...
Name: review_body, dtype: object
```

```python
# Work on a 5% random sample of the reviews
sampled_df = df.sample(frac=0.05, random_state=42)
# Map 'star_rating' to binary labels:
# ratings below 3 are negative (0), ratings of 3 and above are positive (1)
sampled_df['star_rating'] = sampled_df['star_rating'].apply(lambda x: 0 if x < 3 else 1)
```


```python
# Split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(sampled_df['review_body'], sampled_df['star_rating'], test_size=0.2, random_state=42)

# Convert text data to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Apply SMOTE to handle class imbalance in training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train)

# Build the classifier (using Multinomial Naive Bayes in this example)
clf = MultinomialNB()
clf.fit(X_train_resampled, y_train_resampled)

# Save the model
joblib.dump(clf, 'sentiment_analysis_Amazon_womens_clothing.pkl')

# Evaluate the model
y_pred = clf.predict(X_test_tfidf)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Example new reviews
new_reviews = ["This dress is amazing! It fits perfectly and looks elegant.",
               "The quality of this blouse is poor, it tore after just one wash.",
               "Looks nice on me but i dont like it", "Received a borken item"]

# Load the model
loaded_model = joblib.load('sentiment_analysis_Amazon_womens_clothing.pkl')

# Transform the new reviews with the vectorizer fitted during training
new_reviews_processed = tfidf_vectorizer.transform(new_reviews)

# Make predictions
predictions = loaded_model.predict(new_reviews_processed)

# Print the predictions
for review, sentiment in zip(new_reviews, predictions):
    print(f"Review: {review}\nSentiment: {'Positive' if sentiment == 1 else 'Negative'}\n")
```

```
Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.85      0.64      8107
           1       0.97      0.87      0.92     50702

    accuracy                           0.87     58809
   macro avg       0.74      0.86      0.78     58809
weighted avg       0.91      0.87      0.88     58809

Confusion Matrix:
 [[ 6889  1218]
 [ 6681 44021]]
Review: This dress is amazing! It fits perfectly and looks elegant.
Sentiment: Positive

Review: The quality of this blouse is poor, it tore after just one wash.
Sentiment: Negative

Review: Looks nice on me but i dont like it
Sentiment: Negative

Review: Received a borken item
Sentiment: Negative
```



```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

# Score an 80% sample of the Amazon reviews
# (it may overlap the 5% sample used for training)
sampled_df = df.sample(frac=0.8, random_state=38)
new_reviews = sampled_df['review_body'].astype(str)

# Rebuild the TF-IDF vectorizer by refitting it on the training text.
# This relies on X_train from the training cell still being in memory.
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)

# Transform the new reviews into numerical features using the fitted vectorizer
new_reviews_tfidf = tfidf_vectorizer.transform(new_reviews)

# Load the trained sentiment analysis model
loaded_model = joblib.load('sentiment_analysis_Amazon_womens_clothing.pkl')

# Make predictions on the new reviews
predictions = loaded_model.predict(new_reviews_tfidf)

# Count the number of positive and negative predictions
positive_count = int((predictions == 1).sum())
negative_count = int((predictions == 0).sum())

# Print the results
print(f"Number of Positive Reviews: {positive_count}")
print(f"Number of Negative Reviews: {negative_count}")
```

```
Number of Positive Reviews: 3612460
Number of Negative Reviews: 1092243
```
