Capstone Project 3: Opinion Mining from Product Reviews
Objective: Extract opinions and analyze sentiments from product reviews to understand customer satisfaction.
Techniques: Aspect-Based Sentiment Analysis, Word Embeddings, Text Preprocessing.
Tools: Python, NLTK, Scikit-Learn, Gensim.
Dataset: Amazon product reviews or similar datasets from Kaggle.
General Workflow for Each Project:
Data Collection: Obtain the necessary text data from public datasets or through web scraping.
Data Preprocessing: Clean and preprocess the text data, including tokenization, stopword removal, and stemming/lemmatization.
Feature Extraction: Convert text data into numerical representations using techniques like Bag of Words, TF-IDF, or word embeddings.
Model Development: Train machine learning models to achieve the project's objective.
Model Evaluation: Evaluate the performance of the model using metrics like accuracy, precision, recall, and F1-score.
Optimization: Tune hyperparameters to improve model performance.
Documentation: Document the process, results, and insights gained from the project.
API: Pickle the trained model and create a user-testing API with any web framework for demonstration.
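The cells below classify whole reviews; the aspect-based analysis named under Techniques is not implemented in this notebook. As a rough illustration of the idea only, the sketch below scores each aspect by the opinion words found in the same clause. The two lexicons and the function name `aspect_sentiments` are hypothetical, and the rule ignores negation ("not good"), so treat it as a starting point rather than a working method.

```python
import re

# Hypothetical lexicons for illustration; a real project would curate or learn them.
ASPECTS = {"quality", "price", "delivery", "service"}
OPINIONS = {"good": 1, "nice": 1, "fast": 1, "bad": -1, "poor": -1, "slow": -1}

def aspect_sentiments(review):
    """Return {aspect: score}, summing opinion words that share a clause with the aspect."""
    results = {}
    # Split into clauses on "but" and on sentence punctuation.
    for clause in re.split(r"\bbut\b|[.,;!?]", review.lower()):
        tokens = re.findall(r"[a-z]+", clause)
        score = sum(OPINIONS.get(token, 0) for token in tokens)
        for token in tokens:
            if token in ASPECTS:
                results[token] = results.get(token, 0) + score
    return results

print(aspect_sentiments("product quality is good but delivery service is very bad"))
```

A positive score means the clause mentioning the aspect contained more positive than negative opinion words.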

In [1]:
import pandas as pd
import numpy as np

In [2]:
data=pd.read_csv("Dataset-SA.csv")

In [3]:
data

Unnamed: 0,product_name,product_price,Rate,Review,Summary,Sentiment
0,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,super!,great cooler excellent air flow and for this p...,positive
1,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,best budget 2 fit cooler nice cooling,positive
2,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,the quality is good but the power of air is de...,positive
3,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,1,useless product,very bad product its a only a fan,negative
4,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,ok ok product,neutral
...,...,...,...,...,...,...
205047,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,must buy!,good product,positive
205048,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,super!,nice,positive
205049,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,3,nice,very nice and fast delivery,positive
205050,cello Pack of 18 Opalware Cello Dazzle Lush Fi...,1299,5,just wow!,awesome product,positive


In [4]:
data["Review"][0]

'super!'

In [5]:
data.isnull().sum()

product_name         0
product_price        0
Rate                 0
Review           24664
Summary             11
Sentiment            0
dtype: int64

In [6]:
df1=data.dropna(subset=['Review','Summary'])

In [38]:
df1.shape

(180379, 6)

In [7]:
df1.isnull().sum()

product_name     0
product_price    0
Rate             0
Review           0
Summary          0
Sentiment        0
dtype: int64

In [8]:
df1.columns

Index(['product_name', 'product_price', 'Rate', 'Review', 'Summary',
       'Sentiment'],
      dtype='object')

In [9]:
df=df1.drop(['product_name', 'product_price'],axis=1)

In [10]:
df

Unnamed: 0,Rate,Review,Summary,Sentiment
0,5,super!,great cooler excellent air flow and for this p...,positive
1,5,awesome,best budget 2 fit cooler nice cooling,positive
2,3,fair,the quality is good but the power of air is de...,positive
3,1,useless product,very bad product its a only a fan,negative
4,3,fair,ok ok product,neutral
...,...,...,...,...
205047,5,must buy!,good product,positive
205048,5,super!,nice,positive
205049,3,nice,very nice and fast delivery,positive
205050,5,just wow!,awesome product,positive


In [11]:
df["Sentiment"].value_counts() 

Sentiment
positive    147171
negative     24401
neutral       8807
Name: count, dtype: int64

In [12]:
import re

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove URLs (http\S+ already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove user mentions (e.g., @username)
    text = re.sub(r'@\w+', '', text)

    # Remove special characters and punctuation, keeping only letters, digits, and whitespace.
    # Hyphenated words are joined as a result: "mind-blowing" becomes "mindblowing".
    text = re.sub(r'[^a-z0-9\s]', '', text)

    # Collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text
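A quick check of what these steps do to one string may help. The block repeats the cleaning logic so it runs on its own; the sample sentence and the variant `preprocess_keep_word_breaks` are made up for illustration. The variant replaces punctuation with a space instead of deleting it, which keeps the two halves of a hyphenated word apart.

```python
import re

def preprocess_text(text):
    """Mirrors the cleaning steps of the cell above."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def preprocess_keep_word_breaks(text):
    """Hypothetical variant: punctuation becomes a space rather than disappearing."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Super! Visit http://example.com @seller best-cooler  2 fit"
print(preprocess_text(sample))              # super visit bestcooler 2 fit
print(preprocess_keep_word_breaks(sample))  # super visit best cooler 2 fit
```

Which behaviour is preferable depends on the vocabulary: joining produces tokens such as "mindblowing", while splitting yields two common words.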

In [13]:
df['R+S'] = df['Review'] + ' ' + df['Summary']

In [14]:
df['Cleaned_Summary'] = df['R+S'].apply(preprocess_text)

In [15]:
df

Unnamed: 0,Rate,Review,Summary,Sentiment,R+S,Cleaned_Summary
0,5,super!,great cooler excellent air flow and for this p...,positive,super! great cooler excellent air flow and for...,super great cooler excellent air flow and for ...
1,5,awesome,best budget 2 fit cooler nice cooling,positive,awesome best budget 2 fit cooler nice cooling,awesome best budget 2 fit cooler nice cooling
2,3,fair,the quality is good but the power of air is de...,positive,fair the quality is good but the power of air ...,fair the quality is good but the power of air ...
3,1,useless product,very bad product its a only a fan,negative,useless product very bad product its a only a fan,useless product very bad product its a only a fan
4,3,fair,ok ok product,neutral,fair ok ok product,fair ok ok product
...,...,...,...,...,...,...
205047,5,must buy!,good product,positive,must buy! good product,must buy good product
205048,5,super!,nice,positive,super! nice,super nice
205049,3,nice,very nice and fast delivery,positive,nice very nice and fast delivery,nice very nice and fast delivery
205050,5,just wow!,awesome product,positive,just wow! awesome product,just wow awesome product


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer with default settings (pass max_features to cap the vocabulary)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Cleaned_Summary'])

# Check feature shape
print(X.shape)


(180379, 46200)


In [17]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode labels
label_encoder = LabelEncoder()
df['Sentiment_Encoded']= label_encoder.fit_transform(df['Sentiment'])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, df['Sentiment_Encoded'], test_size=0.2, random_state=42)

In [18]:
print(df[['Sentiment', 'Sentiment_Encoded']].head())

  Sentiment  Sentiment_Encoded
0  positive                  2
1  positive                  2
2  positive                  2
3  negative                  0
4   neutral                  1


In [19]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and train model
model = MultinomialNB()
model.fit(X_train, y_train)
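In the cells above, the TF-IDF vocabulary and IDF weights are fitted on every row before the train/test split, so the test reviews influence the features the model is trained on. One way to avoid this, sketched on a tiny made-up corpus (the texts and labels are illustrative, not taken from Dataset-SA.csv), is to split the raw text first and let a Pipeline fit the vectorizer on the training portion only. The `stratify` argument keeps the class proportions the same in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only.
toy_texts = [
    "awesome product", "super nice cooler", "great value", "very good quality",
    "useless product", "very bad quality", "waste of money", "worst purchase",
]
toy_labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text, not the already-vectorized matrix.
text_train, text_test, label_train, label_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

# The vectorizer is fitted inside the pipeline, so it only ever sees text_train.
nb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", MultinomialNB()),
])
nb_pipeline.fit(text_train, label_train)

vocabulary = nb_pipeline.named_steps["tfidf"].vocabulary_
print(len(vocabulary), nb_pipeline.predict(text_test))
```

With this layout the same fitted pipeline object can also be pickled as a single file, since it carries both the vectorizer and the model.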


In [20]:
from sklearn.metrics import classification_report, accuracy_score

# Predict on test set
y_pred = model.predict(X_test)

# Print evaluation metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Accuracy: 0.9033152234172303
              precision    recall  f1-score   support

    negative       0.88      0.68      0.77      4877
     neutral       0.74      0.01      0.02      1695
    positive       0.91      0.99      0.95     29504

    accuracy                           0.90     36076
   macro avg       0.84      0.56      0.58     36076
weighted avg       0.89      0.90      0.88     36076
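The 0.90 accuracy is driven almost entirely by the positive class. Recomputing the averages from the per-class figures in the report makes this visible: the macro average treats the three classes equally, while the weighted average follows the class sizes. The numbers below are copied from the report above; nothing else is assumed.

```python
# Per-class figures copied from the classification report above.
support = {"negative": 4877, "neutral": 1695, "positive": 29504}
recall = {"negative": 0.68, "neutral": 0.01, "positive": 0.99}

total = sum(support.values())

# Macro average: plain mean over classes, regardless of class size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in support) / total

# Accuracy of a model that always answers "positive" on this test set.
majority_baseline = support["positive"] / total

print(f"macro recall:      {macro_recall:.2f}")       # 0.56
print(f"weighted recall:   {weighted_recall:.2f}")    # 0.90
print(f"always 'positive': {majority_baseline:.2f}")  # 0.82
```

A model that always predicts positive would already reach about 0.82 accuracy here, and a neutral recall of 0.01 means nearly all neutral reviews are missed, so macro-averaged scores are the more informative figures for this dataset.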



In [37]:
df["Sentiment"].value_counts() 

Sentiment
positive    147171
negative     24401
neutral       8807
Name: count, dtype: int64

In [21]:
# Undersample every class to the size of the smallest one (neutral, 8807 rows)
minimum = df['Sentiment'].value_counts().min()

df_positive = df[df['Sentiment'] == 'positive'].sample(minimum, random_state=42)
df_negative = df[df['Sentiment'] == 'negative'].sample(minimum, random_state=42)
df_neutral = df[df['Sentiment'] == 'neutral'].sample(minimum, random_state=42)

df_new = pd.concat([df_positive, df_negative, df_neutral], axis=0)

df_new['Sentiment'].value_counts()

Sentiment
positive    8807
negative    8807
neutral     8807
Name: count, dtype: int64
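Because the balanced frame is built before any split, a test set drawn from df_new is balanced as well and no longer reflects the roughly 82% / 14% / 5% class mix of the full data. An alternative, sketched here on a made-up toy frame, is to split first with stratification and undersample only the training rows, so the test set keeps the original proportions. The frame contents and variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame for illustration only: 70 positive, 20 negative, 10 neutral rows.
toy = pd.DataFrame({
    "Cleaned_Summary": [f"review {i}" for i in range(100)],
    "Sentiment": ["positive"] * 70 + ["negative"] * 20 + ["neutral"] * 10,
})

# Split first, keeping the original class proportions in both parts.
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["Sentiment"], random_state=42
)

# Undersample the training rows only, down to the smallest training class.
smallest = int(train_df["Sentiment"].value_counts().min())
train_balanced = train_df.groupby("Sentiment").sample(n=smallest, random_state=42)

print(train_balanced["Sentiment"].value_counts().to_dict())
print(test_df["Sentiment"].value_counts().to_dict())
```

Scores measured on such an untouched test set are usually lower for the minority classes than scores on a balanced test set, but they are closer to what the model would see in use.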

In [22]:
df_new.shape

(26421, 7)

In [36]:
df_new

Unnamed: 0,Rate,Review,Summary,Sentiment,R+S,Cleaned_Summary,Sentiment_Encoded
22276,5,mind-blowing purchase,awesome,positive,mind-blowing purchase awesome,mindblowing purchase awesome,2
184721,5,simply awesome,very nice worthy money,positive,simply awesome very nice worthy money,simply awesome very nice worthy money,2
18188,2,slightly disappointed,not saying too bad but after using 1 week the ...,positive,slightly disappointed not saying too bad but a...,slightly disappointed not saying too bad but a...,2
179306,5,awesome,better than,positive,awesome better than,awesome better than,2
69723,5,awesome,product some crachesh but good,positive,awesome product some crachesh but good,awesome product some crachesh but good,2
...,...,...,...,...,...,...,...
144832,5,terrific purchase,its very soft and light in weight and handygoo...,neutral,terrific purchase its very soft and light in w...,terrific purchase its very soft and light in w...,1
137774,1,bad service from flipkart,product is good but flipkart service is very b...,neutral,bad service from flipkart product is good but ...,bad service from flipkart product is good but ...,1
140477,1,not specified,quality is not bad but short size is very smal...,neutral,not specified quality is not bad but short siz...,not specified quality is not bad but short siz...,1
16980,3,just okay,a decent one with one problemmy unit just heat...,neutral,just okay a decent one with one problemmy unit...,just okay a decent one with one problemmy unit...,1


In [23]:
from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression()
model1.fit(X_train, y_train)

y_pred = model1.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.86      0.86      0.86      4877
           1       0.70      0.39      0.51      1695
           2       0.96      0.98      0.97     29504

    accuracy                           0.94     36076
   macro avg       0.84      0.75      0.78     36076
weighted avg       0.93      0.94      0.93     36076



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
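The warning means the default lbfgs solver hit its iteration cap (max_iter defaults to 100) before converging, so the fitted coefficients may not be fully optimized. A minimal sketch of the usual remedy is to raise max_iter; the value 1000 is an assumption and has not been checked against the full dataset. The sketch also sets class_weight="balanced", one possible alternative to undersampling for the class imbalance. The corpus is a toy one for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus for illustration only (0 = negative, 1 = neutral, 2 = positive).
toy_texts = [
    "very bad product", "useless product", "worst purchase",
    "ok ok product", "fair average", "just okay",
    "awesome product", "super nice", "great cooler",
]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

X_toy = TfidfVectorizer().fit_transform(toy_texts)

# A higher iteration cap gives the solver room to converge;
# class_weight="balanced" reweights classes inversely to their frequency.
logreg = LogisticRegression(max_iter=1000, class_weight="balanced")
logreg.fit(X_toy, toy_labels)

# n_iter_ reports how many iterations were actually used.
print(logreg.n_iter_)
```

If n_iter_ equals max_iter after fitting on the real data, the cap was reached again and a larger value or a different solver is worth trying.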


In [30]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words features feeding an SVM with default settings, trained on the balanced df_new
clf1 = Pipeline([('vectorizer', CountVectorizer()), ('model', SVC())])

X_train,X_test, y_train, y_test = train_test_split(df_new['Cleaned_Summary'], df_new['Sentiment'], test_size=0.2, random_state=42)
clf1.fit(X_train,y_train)

y_pred=clf1.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.80      0.88      0.84      1740
     neutral       0.79      0.76      0.78      1770
    positive       0.91      0.85      0.88      1775

    accuracy                           0.83      5285
   macro avg       0.83      0.83      0.83      5285
weighted avg       0.83      0.83      0.83      5285



In [25]:
from sklearn.ensemble import RandomForestClassifier
clf=Pipeline([('vectorizer',CountVectorizer()),('model',RandomForestClassifier())])

X_train,X_test, y_train, y_test = train_test_split(df_new['Cleaned_Summary'], df_new['Sentiment'], test_size=0.2, random_state=42)
clf.fit(X_train,y_train)


y_pred=clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.83      0.87      0.85      1740
     neutral       0.79      0.68      0.73      1770
    positive       0.82      0.89      0.86      1775

    accuracy                           0.82      5285
   macro avg       0.81      0.82      0.81      5285
weighted avg       0.81      0.82      0.81      5285
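All models above use default settings; the Optimization step from the workflow is not shown in this notebook. One way it could look is a cross-validated grid search over a pipeline, as sketched below. The parameter grid, the macro-F1 scoring choice, and the toy corpus are assumptions for illustration, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus for illustration only: four reviews per class.
toy_texts = [
    "very bad product", "useless product", "worst purchase", "waste of money",
    "ok ok product", "fair average", "just okay", "not bad not good",
    "awesome product", "super nice", "great cooler", "very good quality",
]
toy_labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step names prefix the parameter names: <step>__<parameter>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "model__C": [0.1, 1.0, 10.0],
}

# Macro F1 weights the three classes equally, which suits imbalanced data.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(toy_texts, toy_labels)

print(search.best_params_)
```

On the real data a larger cv value (5 is common) and a wider grid would be more appropriate; cv=2 is used here only because the toy corpus is tiny.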



In [34]:
import pickle

# Save model
with open('model1.pkl', 'wb') as model_file:
    pickle.dump(model1, model_file)

# Save vectorizer
with open('vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)
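The label encoder is not pickled, so code that loads these two files has to map the numeric predictions back to labels itself. The sketch below shows the prediction step an API endpoint could wrap. The helper name `predict_sentiment` is hypothetical, the label mapping is copied from the encoder output shown earlier (negative=0, neutral=1, positive=2), and a toy model stands in for model1.pkl and vectorizer.pkl so the example runs on its own.

```python
import pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Mapping copied from the LabelEncoder output earlier in the notebook.
LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def preprocess_text(text):
    """Same cleaning steps that were applied before training."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def predict_sentiment(text, model, vectorizer):
    """Clean one review, vectorize it, and return the predicted label name."""
    features = vectorizer.transform([preprocess_text(text)])
    return LABELS[int(model.predict(features)[0])]

# Toy stand-ins for the saved model and vectorizer (illustration only).
toy_texts = ["very bad product", "useless product", "ok ok product",
             "fair average", "awesome product", "super nice"]
toy_labels = [0, 0, 1, 1, 2, 2]
toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Round-trip through pickle, as an API process would when loading the files.
model = pickle.loads(pickle.dumps(toy_model))
vectorizer = pickle.loads(pickle.dumps(toy_vectorizer))

print(predict_sentiment("Useless product, very bad!", model, vectorizer))
```

In a web framework of choice, a route handler would read the review text from the request, call `predict_sentiment`, and return the label as JSON. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.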


In [35]:
df['Cleaned_Summary'][3]

'useless product very bad product its a only a fan'
