
# Introduction

Understanding customer sentiment is critical to building beauty and skincare products that resonate with the market. This project applies sentiment analysis, a natural language processing task, to uncover the emotional tone of customer reviews. Using a pre-trained text embedding and a small neural network, we classify each review as positive or negative, taking the reviewer's recommendation flag (`is_recommended`) as the label. We also aim to explore which brands and products have attracted the most positive or negative reviews. The results can help the beauty and skincare industry improve its products and customer satisfaction.

# About the Data

The dataset covers more than 8,000 beauty products from the Sephora online store, with attributes such as product and brand names, prices, ingredients, and ratings. It also contains over a million user reviews of more than 2,000 products in the Skincare category. Each review includes the review text and rating, the reviewer's appearance attributes (skin tone, eye color, skin type, hair color), and helpfulness feedback from other users.

In [6]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sephora-products-and-skincare-reviews/product_info.csv
/kaggle/input/sephora-products-and-skincare-reviews/reviews_500-750.csv
/kaggle/input/sephora-products-and-skincare-reviews/reviews_750-1250.csv
/kaggle/input/sephora-products-and-skincare-reviews/reviews_1250-end.csv
/kaggle/input/sephora-products-and-skincare-reviews/reviews_250-500.csv
/kaggle/input/sephora-products-and-skincare-reviews/reviews_0-250.csv


In [7]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import nltk

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Data Overview


In [9]:
df = pd.read_csv('/kaggle/input/sephora-products-and-skincare-reviews/reviews_0-250.csv')

In [10]:
df.shape

(602130, 19)

In [11]:
df.head()

Unnamed: 0.1,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,0,1741593524,5,1.0,1.0,2,0,2,2023-02-01,I use this with the Nudestix “Citrus Clean Bal...,Taught me how to double cleanse!,,brown,dry,black,P504322,Gentle Hydra-Gel Face Cleanser,NUDESTIX,19.0
1,1,31423088263,1,0.0,,0,0,0,2023-03-21,I bought this lip mask after reading the revie...,Disappointed,,,,,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0
2,2,5061282401,5,1.0,,0,0,0,2023-03-21,My review title says it all! I get so excited ...,New Favorite Routine,light,brown,dry,blonde,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0
3,3,6083038851,5,1.0,,0,0,0,2023-03-20,I’ve always loved this formula for a long time...,Can't go wrong with any of them,,brown,combination,black,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0
4,4,47056667835,5,1.0,,0,0,0,2023-03-20,"If you have dry cracked lips, this is a must h...",A must have !!!,light,hazel,combination,,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0


In [12]:
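# Keep only the columns needed for classification and rename them in one step
# (renaming in place on a column subset can raise SettingWithCopyWarning)
df = df[['review_text', 'is_recommended', 'rating']].rename(
    columns={'is_recommended': 'label', 'review_text': 'text'})
df.head()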

Unnamed: 0,text,label,rating
0,I use this with the Nudestix “Citrus Clean Bal...,1.0,5
1,I bought this lip mask after reading the revie...,0.0,1
2,My review title says it all! I get so excited ...,1.0,5
3,I’ve always loved this formula for a long time...,1.0,5
4,"If you have dry cracked lips, this is a must h...",1.0,5


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 602130 entries, 0 to 602129
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   text    601131 non-null  object 
 1   label   484644 non-null  float64
 2   rating  602130 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 13.8+ MB


In [14]:
df.label.value_counts()

1.0    406094
0.0     78550
Name: label, dtype: int64

In [15]:
# Print each label's share of all rows. Rows with a missing label are still
# counted in len(df), so the two percentages do not sum to 100%.

print("Positive labels percentage", round(df.label.value_counts()[1]/len(df) *100 ,2), "%")
print("Negative labels percentage", round(df.label.value_counts()[0]/len(df) *100 ,2), "%")

Positive labels percentage 67.44 %
Negative labels percentage 13.05 %
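
The two percentages above sum to roughly 80% because rows with a missing `is_recommended` value still count in the denominator. As a minimal sketch on toy data (the helper name `label_shares` is hypothetical and not part of the notebook), `value_counts(normalize=True, dropna=False)` reports the missing share explicitly:

```python
import pandas as pd


def label_shares(labels: pd.Series, include_missing: bool = True) -> pd.Series:
    """Return each label's share in percent.

    With include_missing=True, NaN labels get their own row;
    otherwise shares are computed over labelled rows only.
    """
    shares = labels.value_counts(normalize=True, dropna=not include_missing)
    return (shares * 100).round(2)


# Toy labels: three positive, one negative, one missing
toy_labels = pd.Series([1.0, 1.0, 1.0, 0.0, None])

all_rows = label_shares(toy_labels)                         # missing labels shown
labelled = label_shares(toy_labels, include_missing=False)  # labelled rows only
print(all_rows)
print(labelled)
```

Reporting shares over labelled rows only is usually the more informative view of class balance, since unlabelled rows cannot be used for supervised training anyway.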


In [18]:
# load other dataset files

df2 = pd.read_csv('/kaggle/input/sephora-products-and-skincare-reviews/reviews_500-750.csv')
df2 = df2[['review_text', 'is_recommended', 'rating']].rename(
    columns={'is_recommended': 'label', 'review_text': 'text'})
df2.label.value_counts()

1.0    88410
0.0    16049
Name: label, dtype: int64

In [20]:
df3 = pd.read_csv('/kaggle/input/sephora-products-and-skincare-reviews/reviews_750-1250.csv')
df3 = df3[['review_text', 'is_recommended', 'rating']].rename(
    columns={'is_recommended': 'label', 'review_text': 'text'})
df3.label.value_counts()

1.0    94797
0.0    15624
Name: label, dtype: int64

In [21]:
df_concat = pd.concat([df2,df3], axis = 0)

In [22]:
df_neg = df_concat[df_concat['label'] == 0]
df_neg.label.value_counts()

0.0    31673
Name: label, dtype: int64

In [23]:
df = pd.concat([df, df_neg])
df['label'].value_counts()

1.0    406094
0.0    110223
Name: label, dtype: int64

In [25]:
# Shares are still relative to all rows, including those with a missing label
print("Positive labels", round(df.label.value_counts()[1]/len(df) *100 ,2), "%")
print("Negative labels", round(df.label.value_counts()[0]/len(df) *100 ,2), "%")

Positive labels 64.07 %
Negative labels 17.39 %


# Downsampling the Majority Class

In [26]:
df_neg = df[df['label'] == 0]
df_pos = df[df['label'] == 1].sample(len(df_neg)) # samples a number of rows equal to the length of df_neg

In [30]:
print(df_neg.label.value_counts())
print(df_pos.label.value_counts())

0.0    110223
Name: label, dtype: int64
1.0    110223
Name: label, dtype: int64


In [31]:
df = pd.concat([df_pos, df_neg], axis = 0)
df = shuffle(df)
df.head()

Unnamed: 0,text,label,rating
277637,"Patchy, hard to use, doesn’t mix well, doesn’t...",0.0,1
515136,This product (received as a complimentary samp...,1.0,5
194478,Perfect for under makeup and nighttime use. Sk...,1.0,5
408016,My face has been on fire since last night. Thi...,0.0,1
525716,I was not impressed at all with this cc cream....,0.0,1


In [32]:
print("Positive labels percentage", round(df.label.value_counts()[1]/len(df) *100 ,2), "%")
print("Negative labels percentage", round(df.label.value_counts()[0]/len(df) *100 ,2), "%")

Positive labels percentage 50.0 %
Negative labels percentage 50.0 %
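
The sampling and shuffling steps above are random, so a rerun can select different positive reviews. As a sketch of one way to make the balancing repeatable (the function name `downsample_majority` and the seed value are illustrative assumptions, not the notebook's code), a fixed `random_state` can be passed to `DataFrame.sample`:

```python
import pandas as pd


def downsample_majority(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Sample every class down to the size of the smallest class, then shuffle.

    Rows whose label is missing are dropped, because groupby skips NaN keys.
    """
    n_min = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # sample(frac=1) shuffles all rows; the seed keeps the order repeatable
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: seven positive and three negative reviews
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(10)],
    "label": [1.0] * 7 + [0.0] * 3,
})
balanced = downsample_majority(toy)
print(balanced["label"].value_counts())
```

Fixing the seed trades a little generality for reproducible metrics across runs.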


In [33]:
# checking null values
df.isnull().sum()

text      429
label       0
rating      0
dtype: int64

In [34]:
# drop null values

df = df.dropna()
df = df.reset_index(drop = True)


In [35]:
df.isnull().sum()

text      0
label     0
rating    0
dtype: int64

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220017 entries, 0 to 220016
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   text    220017 non-null  object 
 1   label   220017 non-null  float64
 2   rating  220017 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 5.0+ MB


# Text Preprocessing

In [37]:
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import ToktokTokenizer
from nltk.stem import PorterStemmer

# Build these once; apply() calls preprocess_text for every review
STOPWORDS = set(stopwords.words('english'))
TOKENIZER = ToktokTokenizer()
STEMMER = PorterStemmer()

def preprocess_text(text, remove_digits=True):
    # Removing HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Removing square brackets and their contents (raw strings avoid invalid-escape warnings)
    text = re.sub(r'\[[^]]*\]', '', text)
    
    # Removing special characters (and digits, unless remove_digits is False)
    if remove_digits:
        text = re.sub(r'[^a-zA-Z\s]', '', text)
    else:
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # Lowercasing
    text = text.lower()
    
    # Removing stopwords before stemming: stemming first turns words such as
    # "this" into "thi", which no longer match the stopword list
    tokens = TOKENIZER.tokenize(text)
    filtered_tokens = [token for token in tokens if token not in STOPWORDS]
    
    # Stemming
    filtered_text = ' '.join(STEMMER.stem(token) for token in filtered_tokens)
    
    return filtered_text
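
To see what the bracket and character filters in `preprocess_text` do to a review, the sketch below applies the same two regular expressions to a made-up sentence (the helper `strip_non_letters` and the sample text are illustrative, not taken from the dataset). Note that the character filter also deletes apostrophes, so a contraction such as "doesn’t" collapses to "doesnt" and may no longer match entries in a stopword list.

```python
import re


def strip_non_letters(text: str, remove_digits: bool = True) -> str:
    """Delete every character that is not a letter or whitespace (optionally keep digits)."""
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)


sample = "Patchy, doesn’t mix well. 0/10 [edited]"

# Step 1: drop bracketed segments such as "[edited]"
no_brackets = re.sub(r"\[[^]]*\]", "", sample)

# Step 2: drop punctuation and digits, then lowercase
cleaned = strip_non_letters(no_brackets).lower()
print(cleaned.split())

# Keeping digits leaves "0/10" behind as "010"
with_digits = strip_non_letters(no_brackets, remove_digits=False)
print(with_digits.split())
```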


In [None]:
print('Before preprocessing \n', df['text'][2])

df['text'] = df['text'].apply(preprocess_text)

print('After preprocessing \n', df['text'][2])

Before preprocessing 
 Perfect for under makeup and nighttime use. Skin feels wonderful. Well worth it.


In [None]:
train_df, test_df = train_test_split(df, random_state =42, test_size = 0.10, shuffle = True)

train_df , val_df = train_test_split(train_df, test_size=0.25, random_state= 42)
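
Because the second split is applied to the 90% that remains after the test split, the validation set is 25% of that remainder rather than 25% of the whole dataset. The small helper below (a hypothetical name, shown only to make the arithmetic explicit) computes the resulting fractions:

```python
def split_fractions(test_size: float, val_size: float) -> tuple[float, float, float]:
    """Return (train, val, test) fractions of the full dataset for a two-stage split.

    test_size is taken from the full data; val_size is taken from what remains.
    """
    test = test_size
    val = (1.0 - test_size) * val_size
    train = 1.0 - test - val
    return train, val, test


train_frac, val_frac, test_frac = split_fractions(test_size=0.10, val_size=0.25)
print(f"train={train_frac:.3f}, val={val_frac:.3f}, test={test_frac:.3f}")
# prints: train=0.675, val=0.225, test=0.100
```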


In [None]:
module_url = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"

In [None]:
!pip install git+https://github.com/tensorflow/docs

In [None]:
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots

In [None]:
import tensorflow as tf
import tensorflow_hub as hub

In [None]:
def train_and_evaluate_model(module_url, embed_size, name, trainable=False):
    """Train a small classifier on top of a TF Hub text embedding and return the Keras History.

    `name` is accepted as a label for the run but is not used inside the function.
    """
    # Embedding layer: maps each raw review string to an embed_size-dimensional vector
    hub_layer = hub.KerasLayer(module_url, input_shape=[], output_shape=[embed_size],
                               dtype=tf.string, trainable=trainable)
    model = tf.keras.models.Sequential([
        hub_layer,
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # probability of the positive class
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
                  loss=tf.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])
    model.summary()
    history = model.fit(
        train_df['text'], train_df['label'],
        epochs=100,
        batch_size=32,
        validation_data=(val_df['text'], val_df['label']),
        callbacks=[
            tfdocs.modeling.EpochDots(),
            # Stop once val_loss has not improved for 2 consecutive epochs
            tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, mode='min'),
        ],
        verbose=0)
    return history
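
Although `epochs` is set to 100, the `EarlyStopping` callback with `patience = 2` usually ends training much sooner. The following is a simplified, framework-free sketch of the patience rule (the function `stopping_epoch` and the loss values are made up for illustration; it is not Keras's implementation and ignores options such as `min_delta`):

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0  # improvement: record it and reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Loss improves for two epochs, then stalls for two epochs
stop = stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.30], patience=2)
print(stop)
```

In this toy sequence, training stops at epoch 3 and never reaches the lower loss at epoch 4, which is the trade-off a small patience value makes.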

In [None]:
histories = {}

In [None]:
histories['gnews-swivel-20dim'] = train_and_evaluate_model(module_url, embed_size = 20, name = 'gnews-swivel-20dim')

In [None]:
plt.rcParams['figure.figsize'] = (12, 8)
plotter = tfdocs.plots.HistoryPlotter(metric = 'accuracy')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Accuracy Curves for Models")
plt.show()

# Fine-tuning

In [None]:
histories['gnews-swivel-20dim_finetuned'] = train_and_evaluate_model(module_url, embed_size = 20, name = 'gnews-swivel-20dimfinetuned', trainable = True)

In [None]:
plt.rcParams['figure.figsize'] = (12, 8)
plotter = tfdocs.plots.HistoryPlotter(metric = 'accuracy')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Accuracy Curves for Models")
plt.show()