# Neural Network for Sentiment Analysis

## Preparing the data

Data: the Sephora Products and Skincare Reviews dataset, available at: https://www.kaggle.com/datasets/nadyinky/sephora-products-and-skincare-reviews?resource=download

Inspiration taken from the sentiment analysis task on Kaggle: \
https://www.kaggle.com/code/aashidutt3/sentiment-analysis-sephora-reviews \
last checked on Jan 23, 2024

In [1]:
import pandas as pd
import glob
from sklearn.utils import shuffle
from utils_data_exploration import process_csv_file, print_label_percentages, calculate_percentage_and_count_for_values, write_df_to_file, split_and_write_data
from utils_preprocess_data import preprocess_and_read_csv

In [4]:
# checking the number of files in the dataset
file_paths = []
for filename in glob.glob('./sephora-data/reviews*'):
    print(filename)
    file_paths.append(filename)

./sephora-data/reviews_0-250.csv
./sephora-data/reviews_1250-end.csv
./sephora-data/reviews_750-1250.csv
./sephora-data/reviews_250-500.csv
./sephora-data/reviews_500-750.csv
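`glob.glob` returns paths in an arbitrary, filesystem-dependent order (the listing above is not sorted), so `file_paths[0]` may point to a different file on another machine. A minimal sketch of a deterministic listing, assuming the same `./sephora-data/` layout; the helper name `list_review_files` is hypothetical:

```python
import glob
import os


def list_review_files(data_dir="./sephora-data"):
    """Return the review CSV paths in a stable, lexicographically sorted order."""
    pattern = os.path.join(data_dir, "reviews*")
    return sorted(glob.glob(pattern))


# Prints nothing if the directory does not exist or holds no matching files.
for path in list_review_files():
    print(path)
```

The sort is lexicographic, not numeric, so `reviews_1250-end.csv` comes before `reviews_250-500.csv`; the first element is still `reviews_0-250.csv`.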


In [5]:
# keeping only the data from the first file that is valuable for the task
df = process_csv_file(file_paths[0])
#checking the distribution per rating in the unbalanced dataset
values = [1,2,3,4,5]
calculate_percentage_and_count_for_values(df, 'label', 'rating', values)



For rating value 1: 
 Positive label: 0.91% - count: 252 
 Negative label: 99.09% - count: 27538 

For rating value 2: 
 Positive label: 3.67% - count: 896 
 Negative label: 96.33% - count: 23508 

For rating value 3: 
 Positive label: 35.35% - count: 13166 
 Negative label: 64.65% - count: 24081 

For rating value 4: 
 Positive label: 96.52% - count: 84007 
 Negative label: 3.48% - count: 3026 

For rating value 5: 
 Positive label: 99.87% - count: 307773 
 Negative label: 0.13% - count: 397 
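
The helper `calculate_percentage_and_count_for_values` comes from a local module that is not shown here. As a rough sketch of how the same per-rating breakdown could be computed with plain pandas (an assumption, not the helper's actual implementation), `pd.crosstab` gives both the counts and the row-normalized percentages. The small DataFrame is made-up data:

```python
import pandas as pd

# Toy data for illustration only; the real frame comes from process_csv_file.
toy = pd.DataFrame(
    {
        "rating": [1, 1, 3, 3, 3, 5, 5, 5],
        "label": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, None],
    }
)

# Keep only rows that actually carry a label.
labeled = toy.dropna(subset=["label"])

# Rows are ratings, columns are label values.
counts = pd.crosstab(labeled["rating"], labeled["label"])
# normalize="index" turns each row into fractions that sum to 1.
percentages = pd.crosstab(labeled["rating"], labeled["label"], normalize="index") * 100

print(counts)
print(percentages.round(2))
```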



In [6]:
# Filter to count only rows where 'label' column is not null
filtered_df = df[df['label'].notna()]

# Count the occurrences of each value in 'label' in the filtered DataFrame
value_counts = filtered_df['label'].value_counts()
print(value_counts)

# print updated percentage of both labels present
print_label_percentages(filtered_df)

label
1.0    406094
0.0     78550
Name: count, dtype: int64
Positive labels percentage: 83.79 %
Negative labels percentage: 16.21 %


## Downsizing the data

In [18]:
# downsizing the majority class and also reducing the size of the corpus for experimental purposes
df_neg = filtered_df[filtered_df['label'] == 0].sample(25000)
df_pos = filtered_df[filtered_df['label'] == 1].sample(len(df_neg)) #sampling a number of rows equal to the length of negative labels (df_neg)

In [19]:
df_neg.label.value_counts()

label
0.0    25000
Name: count, dtype: int64

In [20]:
df_pos.label.value_counts()

label
1.0    25000
Name: count, dtype: int64
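
Both `sample` calls above draw different rows on every run, so the counts and example rows shown later will not be reproduced exactly. A minimal sketch of seeded downsampling, assuming a fixed seed is acceptable for the experiment; the helper `balanced_sample`, the seed value 42, and the toy data are all hypothetical:

```python
import pandas as pd


def balanced_sample(df, n_per_class, label_col="label", seed=42):
    """Draw n_per_class rows for each label value, then shuffle the result."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # sample(frac=1) shuffles all rows; the seed keeps the order repeatable.
    return pd.concat(parts).sample(frac=1, random_state=seed)


toy = pd.DataFrame({"text": list("abcdefghij"), "label": [0.0] * 3 + [1.0] * 7})
balanced = balanced_sample(toy, n_per_class=3)
print(balanced["label"].value_counts())
```

Calling `balanced_sample` twice with the same seed returns the same rows in the same order, which makes later results easier to compare across runs.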

In [21]:
#concatenating and shuffling to get final usable dataset
final_df = pd.concat([df_pos, df_neg], axis = 0)
final_df = shuffle(final_df)
final_df.head()

                                                     text  label  rating
388390  has helped so much reduce blackheads and pores...    1.0       5
55194   It’s okay, not the best. I prefer the Clinque ...    0.0       2
206131  Best new addition to my skin care routine this...    1.0       5
391555  This product goes on beautifully! It IS import...    1.0       5
196929  After hearing rave reviews from a couple of fr...    0.0       1


In [22]:
# print percentage of both labels present
print_label_percentages(final_df)

Positive labels percentage: 50.0 %
Negative labels percentage: 50.0 %


In [23]:
#checking if the data contains null values
final_df.isnull().sum()

text      95
label      0
rating     0
dtype: int64

In [24]:
#dropping null values
final_df = final_df.dropna()
final_df = final_df.reset_index(drop = True)

In [25]:
final_df.isnull().sum()

text      0
label     0
rating    0
dtype: int64

In [26]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49905 entries, 0 to 49904
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   text    49905 non-null  object 
 1   label   49905 non-null  float64
 2   rating  49905 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.1+ MB


In [27]:
final_df.label.value_counts()

label
0.0    24954
1.0    24951
Name: count, dtype: int64
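
Because the 95 reviews with missing text are dropped after balancing, the two classes end up slightly unequal (24954 vs. 24951). One way to keep them exactly balanced is to drop the empty reviews before sampling. A minimal sketch on made-up data, assuming the same `text` and `label` column names:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "text": ["good", None, "bad", "fine", "awful", None, "great", "meh"],
        "label": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    }
)

# Remove rows without review text first, so sampling only sees usable rows.
usable = toy.dropna(subset=["text"])

n_per_class = 2  # hypothetical size; the notebook uses 25000
neg = usable[usable["label"] == 0].sample(n_per_class, random_state=0)
pos = usable[usable["label"] == 1].sample(n_per_class, random_state=0)
balanced_df = pd.concat([pos, neg]).reset_index(drop=True)

print(balanced_df["label"].value_counts())
```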

In [28]:
#checking the final distribution per rating in the balanced dataset
calculate_percentage_and_count_for_values(final_df, 'label', 'rating', values)

For rating value 1: 
 Positive label: 0.15% - count: 13 
 Negative label: 99.85% - count: 8831 

For rating value 2: 
 Positive label: 0.91% - count: 68 
 Negative label: 99.09% - count: 7378 

For rating value 3: 
 Positive label: 9.94% - count: 848 
 Negative label: 90.06% - count: 7685 

For rating value 4: 
 Positive label: 84.61% - count: 5185 
 Negative label: 15.39% - count: 943 

For rating value 5: 
 Positive label: 99.38% - count: 18837 
 Negative label: 0.62% - count: 117 



In [29]:
# for rating 3, the majority of reviews carry the negative label ('not recommended') in the cleaned data,
# so there is no need to introduce a neutral label

## Writing the final df into a smaller file for final preprocessing

In [30]:
# writing the final df to a file; this is the dataset used for the sentiment analysis (SA) task
output_file_path = './sephora-data/sa-reviews_smaller.csv'

write_df_to_file(final_df, output_file_path)

df successfully written to ./sephora-data/sa-reviews_smaller.csv


## Preprocessing and splitting files into training, dev and test

In [2]:
preprocessed_df = preprocess_and_read_csv('./sephora-data/sa-reviews_smaller.csv')



In [3]:
preprocessed_df.isnull().sum()

text                 0
label                0
rating               0
preprocessed_text    0
dtype: int64
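
`preprocess_and_read_csv` is defined in a local module that is not shown, so its exact steps are not visible here. Review text collected from the web often contains HTML tags and entities, so one plausible cleaning step is to strip them. The following is a standard-library sketch of that idea (not the author's implementation): it removes tags, decodes entities, lowercases, and collapses whitespace. The function name `clean_review` and the choice of steps are assumptions:

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into "&".
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_review(text):
    """Strip HTML tags, lowercase, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    # Joining with a space keeps words apart where a tag such as <br /> sat
    # between them; it also splits a word that has a tag in the middle.
    plain = " ".join(parser.chunks)
    return re.sub(r"\s+", " ", plain).strip().lower()


print(clean_review("Love it!<br />Would <b>buy</b> again &amp; again."))
```

Applied to a DataFrame, such a function could fill a `preprocessed_text` column with `df["text"].apply(clean_review)`.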

In [4]:
# from the cleaned file, creating the training, dev and test files for the models
split_and_write_data(preprocessed_df,'./sephora-data/sa-reviews')

df successfully written to ./sephora-data/sa-reviews_training.csv
df successfully written to ./sephora-data/sa-reviews_dev.csv
df successfully written to ./sephora-data/sa-reviews_test.csv
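
The split proportions used by `split_and_write_data` are not stated in this notebook. As an illustration only, here is a minimal pandas sketch of a shuffled train/dev/test split, assuming an 80/10/10 ratio and a fixed seed; the function name `split_train_dev_test`, the ratio, and the toy data are hypothetical:

```python
import pandas as pd


def split_train_dev_test(df, train_frac=0.8, dev_frac=0.1, seed=42):
    """Shuffle df and cut it into train/dev/test parts (test gets the remainder)."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled.iloc[:n_train]
    dev = shuffled.iloc[n_train:n_train + n_dev]
    test = shuffled.iloc[n_train + n_dev:]
    return train, dev, test


toy = pd.DataFrame(
    {
        "preprocessed_text": [f"review {i}" for i in range(20)],
        "label": [0.0, 1.0] * 10,
    }
)
train, dev, test = split_train_dev_test(toy)
print(len(train), len(dev), len(test))
```

This split is random but not stratified, so the label ratio can differ slightly between the parts. Each part could then be written out with `DataFrame.to_csv(path, index=False)`.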
