![](https://7kmtorg.com.ua/wp-content/uploads/2020/11/zhenskaya-obuv-optom.jpg)

# <span style="font-size:40px;"><center>Shoes Reviews Analysis</center> </span>

In [7]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')

In [8]:
file = pd.read_csv("Src/Shoes_Data.csv")

We only need two columns from the original dataset:

In [9]:
df = file[["reviews", "reviews_rating"]]
df.head()

Unnamed: 0,reviews,reviews_rating
0,Not happy with product|| It's not as expected....,1.0 out of 5 stars|| 1.0 out of 5 stars|| 3.0 ...
1,Memory cushioning in these shoes is the best f...,5.0 out of 5 stars|| 1.0 out of 5 stars|| 5.0 ...
2,Worth to its amount|| Go for it|| Perfect|| 5 ...,5.0 out of 5 stars|| 5.0 out of 5 stars|| 5.0 ...
3,Sup quality|| Good but not expected|| Awesome 👌.!,5.0 out of 5 stars|| 3.0 out of 5 stars|| 5.0 ...
4,Best|| Satisfied!|| Affordable beauty 😘😘😘😘 the...,5.0 out of 5 stars|| 5.0 out of 5 stars|| 5.0 ...


# <center>Data preparation</center>

We can see that reviews and ratings are separated by "||", so a single row contains several reviews. To analyse them, we need a "1 row - 1 review" format.

In [10]:
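# Flatten every "||"-separated cell into one list entry per review / rating
rew = [part for cell in df["reviews"] for part in cell.split("||")]
rat = [part for cell in df["reviews_rating"] for part in cell.split("||")]

# Pair reviews with ratings by position; zip() silently stops at the shorter
# list, so this relies on every row holding as many ratings as reviews
df = pd.DataFrame(list(zip(rew, rat)),
               columns =['Review', 'Review_rating'])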
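
The cell above collects all reviews and all ratings into two global lists and pairs them by position, so a single row with unequal numbers of reviews and ratings would shift every later pair. Below is a minimal sketch of a per-row alternative, using plain Python lists in place of the DataFrame columns. The sample rows and the name `split_rows` are illustrative assumptions, not part of the dataset or the notebook.

```python
def split_rows(review_cells, rating_cells, sep="||"):
    """Pair reviews and ratings row by row (hypothetical helper)."""
    pairs = []
    for reviews, ratings in zip(review_cells, rating_cells):
        review_parts = [part.strip() for part in reviews.split(sep)]
        rating_parts = [part.strip() for part in ratings.split(sep)]
        # zip() stops at the shorter list, so a mismatch stays inside this row
        pairs.extend(zip(review_parts, rating_parts))
    return pairs


# Made-up sample rows in the same "||"-separated layout
review_cells = ["Not happy with product|| It's not as expected.", "Best|| Satisfied!"]
rating_cells = ["1.0 out of 5 stars|| 1.0 out of 5 stars", "5.0 out of 5 stars|| 5.0 out of 5 stars"]

pairs = split_rows(review_cells, rating_cells)
for review, rating in pairs:
    print(f"{rating!r:22} {review!r}")
```

The resulting list can be passed to `pd.DataFrame(pairs, columns=['Review', 'Review_rating'])`. Note that `strip()` removes the leading spaces left behind by the separator, which the notebook's own cell keeps.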

In [11]:
df.head()

Unnamed: 0,Review,Review_rating
0,Not happy with product,1.0 out of 5 stars
1,It's not as expected.,1.0 out of 5 stars
2,AVERAGE PRODUCT,3.0 out of 5 stars
3,Pic more beautiful,3.0 out of 5 stars
4,Got damage product. But quality is average fo...,3.0 out of 5 stars


Much better, but not enough yet. First, we need to check which symbols are present in the text.

In [12]:
# Getting all unique symbols in text
all_text = str()
for sentence in df['Review'].values:
    all_text += sentence
    
''.join(set(all_text))

'😐w💥😤fG)»🙄ப5📦👏s😭☑nछ😋Y🇮BढपT41🦶✌P2😞t#😍🤎❣ிो👞💸Kள🇪लC7😒😑टक%M8🔥WQू@🥰ेு✔🇨ु🤣DEआरoUइV🏃u😡mत🤘उ🖤💋💯💕😇\'y☺(i😁🏻जिq9चXzk🤑🤫👌🌹₹❤"aA=x’नv🤨/बH्👎😠ॉFZ-JRI!\u200ddवg😘🤩O😊3म_✊😎|⭐Nbखअ😢😅😉😌e்த😶हीगட👇ैई⇢j6😃ायh 🤮🙂झ❌S♂☹😄pr💪.?।🇳😔सc👍😀…+️🤙ं*&👟ड:0💖l💚🏼L😟💰ßद🤟🥾«🌟😂'

We can see that there are plenty of symbols we should get rid of, such as emojis and non-ASCII characters (for example Devanagari and Tamil letters).
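
To see what kind of characters these are, the standard-library `unicodedata` module can report a Unicode category for each one. The sketch below runs on a made-up sample string, and the helper name `non_ascii_categories` is hypothetical. It illustrates that the Devanagari and Tamil letters are ordinary printable letters, so they are outside ASCII rather than unprintable.

```python
import unicodedata


def non_ascii_categories(text):
    """Map every non-ASCII character in text to its Unicode category."""
    return {ch: unicodedata.category(ch) for ch in set(text) if ord(ch) > 127}


sample = "Awesome 👌 ड ட ₹"  # made-up sample, not taken from the dataset
categories = non_ascii_categories(sample)

for ch, category in sorted(categories.items()):
    # Columns: character, category code, official name, str.isprintable()
    print(repr(ch), category, unicodedata.name(ch, "UNNAMED"), ch.isprintable())
```

Category codes starting with `L` are letters and codes starting with `S` are symbols, so a filter could in principle keep the former and drop the latter instead of removing everything outside ASCII.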

In [13]:
import nltk
#nltk.download('stopwords')
#from nltk.corpus import stopwords
#stop = stopwords.words('english')

# Set of stopwords to remove
#stop = set(stop)

# Set of punctuation signs to remove
from string import punctuation

In [14]:
import re

def lower(text):
    return text.lower()

def remove_punctuation(text):
    return text.translate(str.maketrans('','', punctuation))

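# Needs the `stop` set from the commented-out NLTK stopwords lines above;
# the corresponding call in clean_text is disabled as well
def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop])

# Removing all digit characters (the rest of a word such as "size8" is kept)
def remove_digits(text):
    return re.sub(r'\d+', '', text)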

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Removing all non-ASCII characters like "ड", "ட"
def remove_non_printable(text):
    text = text.encode("ascii", "ignore")
    return text.decode()
        
# One function to clean it all
def clean_text(text):
    text = lower(text)
    text = remove_punctuation(text)
    #text = remove_stopwords(text)
    text = remove_digits(text)
    text = remove_emoji(text)
    text = remove_non_printable(text)
    return text
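
One gap in the steps above is whitespace: removing an emoji or a digit leaves the surrounding spaces behind, and splitting on "||" leaves a leading space. Below is a minimal sketch of an extra normalization step, shown on a compact stand-in for the pipeline. The names `normalize_whitespace` and `clean_compact` are assumptions for illustration, not the notebook's functions.

```python
import re
from string import punctuation


def normalize_whitespace(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def clean_compact(text):
    """Condensed stand-in for the cleaning steps, plus whitespace cleanup."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", punctuation))
    text = re.sub(r"\d+", "", text)
    # Dropping every non-ASCII character also removes emojis
    text = text.encode("ascii", "ignore").decode()
    return normalize_whitespace(text)


print(repr(clean_compact(" Awesome 👌.! 5 stars")))
```

With such a step, strings that differ only in leading or repeated spaces would be counted as the same review by `value_counts()`.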

Second, we only need the numeric rating from the "Review_rating" column, not the full string.

In [15]:
# Returns first digit entry in a string
def get_first_digit(text):
    match = re.search(r'\d', text)
    return match[0]
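
`get_first_digit` raises a `TypeError` when a string contains no digit, because `re.search` then returns `None`. Below is a hedged variant that parses the rating as a number and returns `None` for unparsable input. The name `parse_rating` and the assumed "N out of 5" wording are illustrative.

```python
import re

# Assumes ratings are phrased like "3.0 out of 5 stars"
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+out of 5")


def parse_rating(text):
    """Return the star rating as a float, or None if no rating is found."""
    match = RATING_PATTERN.search(text)
    return float(match.group(1)) if match else None


print(parse_rating(" 3.0 out of 5 stars"))
print(parse_rating("no rating here"))
```

The later cells compare ratings as strings such as `'1'`, so they would need adjusting if this numeric form were used instead.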

Applying all formatting functions:

In [16]:
df['Review_rating']=df['Review_rating'].apply(get_first_digit)
df['clean_review']=df['Review'].apply(clean_text)
df.head()

Unnamed: 0,Review,Review_rating,clean_review
0,Not happy with product,1,not happy with product
1,It's not as expected.,1,its not as expected
2,AVERAGE PRODUCT,3,average product
3,Pic more beautiful,3,pic more beautiful
4,Got damage product. But quality is average fo...,3,got damage product but quality is average for


Check how it worked:

In [17]:
all_text_clean = str()
for sentence in df['clean_review'].values:
    all_text_clean += sentence
''.join(set(all_text_clean))

'jwah fxvospnrumcdytgiqlbzek'

Let's find the most frequent reviews:

In [18]:
df["clean_review"].value_counts()

clean_review
 verified purchase                          647
 report abuse                               418
 good                                       280
 good product                               151
 nice                                       118
                                           ... 
 product dammag                               1
 quality must be maintained as per price      1
 qlity bakwaas                                1
its very good                                 1
seems fake                                    1
Name: count, Length: 4711, dtype: int64

"verified purchase" and "report abuse" appear to be automatically generated page text rather than reviews, so we can remove them from the dataset.

In [19]:
df = df[~df.Review.str.contains("Report abuse")]
df = df[~df.Review.str.contains("Verified")]
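
`str.contains` matches substrings, so the two filters above also drop any genuine review that happens to mention "Verified" or "Report abuse". Below is a stricter sketch that removes only exact matches on the cleaned text, written with plain lists. `BOILERPLATE` and `drop_boilerplate` are hypothetical names, and whether exact matching suits this dataset is an assumption to check against the data.

```python
BOILERPLATE = {"verified purchase", "report abuse"}


def drop_boilerplate(clean_reviews):
    """Keep reviews whose stripped text is not an exact boilerplate phrase."""
    return [text for text in clean_reviews if text.strip() not in BOILERPLATE]


# Made-up cleaned reviews; the last one only mentions the word "verified"
sample = [
    " verified purchase",
    "good product",
    " report abuse",
    "verified seller but poor quality",
]
print(drop_boilerplate(sample))
```

On the DataFrame, the equivalent filter would be `df[~df['clean_review'].str.strip().isin(BOILERPLATE)]`.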

In [20]:
df["clean_review"].value_counts()

clean_review
 good                                       280
 good product                               151
 nice                                       118
 value for money                             94
 nice product                                64
                                           ... 
 product quantity is poor                     1
 product dammag                               1
 quality must be maintained as per price      1
 qlity bakwaas                                1
 sub standard quality                         1
Name: count, Length: 4706, dtype: int64

#  <center>Word clouds</center>

In [21]:
# Requires the third-party `wordcloud` package (pip install wordcloud)
from wordcloud import WordCloud

plt.figure(figsize=(40,25))

# One word cloud per star rating, laid out on a 3x2 grid
for position, rating in enumerate(['1', '2', '3', '4', '5'], start=1):
    subset = df[df['Review_rating']==rating]
    text = subset.clean_review.values
    cloud = WordCloud(background_color='pink',colormap="Dark2",collocations=False,width=2500,height=1800).generate(" ".join(text))

    plt.subplot(3, 2, position)
    plt.axis('off')
    plt.title(rating,fontsize=40)
    plt.imshow(cloud)

ModuleNotFoundError: No module named 'wordcloud'

# <center> Sentiment analysis </center>

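We will use the nltk library (its VADER-based `SentimentIntensityAnalyzer`) to perform sentiment analysis of the reviews. For each text it returns four scores: "neg", "neu", "pos" and "compound". The first three are the proportions of the text rated negative, neutral and positive; "compound" is a normalized overall score between -1 and 1.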

In [None]:
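# The analyzer needs the VADER lexicon: nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

sentiments = []

for sentence in df['clean_review'].values:
    scores = sia.polarity_scores(sentence)  # computed once per review
    # Key with the highest score among 'neg', 'neu', 'pos' and 'compound'
    sentiments.append(max(scores, key=scores.get))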
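
Because the dictionary returned by `polarity_scores` also contains the `compound` key, taking the maximum over all four entries can return `"compound"` as a label. A common alternative is to classify on the compound score alone. The sketch below assumes the conventional ±0.05 cut-offs suggested in the VADER documentation and works on score dictionaries, so it runs without NLTK. The function name and the sample scores are made up for illustration.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'pos', 'neg' or 'neu'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Made-up scores in the shape returned by polarity_scores()
example = {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.8}

print(max(example, key=example.get))  # the max-over-keys approach picks 'compound'
print(label_from_compound(example))
```

Restricting the maximum to the `neg`, `neu` and `pos` keys would be another way to avoid the stray label while staying close to the cell above.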

For every review, we keep the label with the highest score:

In [None]:
df["sentiment"] = sentiments
df.head()

Unnamed: 0,Review,Review_rating,clean_review,sentiment
0,Not happy with product,1,not happy with product,neg
1,It's not as expected.,1,its not as expected,neu
2,AVERAGE PRODUCT,3,average product,neu
3,Pic more beautiful,3,pic more beautiful,pos
4,Got damage product. But quality is average fo...,3,got damage product but quality is average for,neu


# <center> EDA </center>

Let's see how well nltk performed by comparing the rating of each review with its detected sentiment:

In [None]:
# One row per (rating, sentiment) pair with its number of reviews
count = df[['Review_rating', 'sentiment']].value_counts().reset_index(name="count")

import plotly.express as px

fig = px.bar(count, x="Review_rating", y="count", color="sentiment", text="sentiment")
fig.update_layout(title_text='Review rating/detected sentiments',  title_x=0.5)
fig.show()

The results of the sentiment analysis look reasonable: reviews with 1-2 stars were mostly detected as negative or neutral, and reviews with 4-5 stars as positive or neutral.
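
The visual impression can be backed by a number. Below is a rough sketch that computes the share of reviews where the detected sentiment agrees with the rating. It assumes that 1-2 stars should map to `neg`, 3 stars to `neu` and 4-5 stars to `pos`; this mapping is an assumption, not something the dataset defines, and the sample values are made up.

```python
# Assumed mapping from star rating (as a string) to the expected label
EXPECTED = {"1": "neg", "2": "neg", "3": "neu", "4": "pos", "5": "pos"}


def agreement_rate(ratings, sentiments):
    """Share of reviews whose sentiment matches the label implied by the rating."""
    pairs = list(zip(ratings, sentiments))
    if not pairs:
        return 0.0
    hits = sum(EXPECTED[rating] == sentiment for rating, sentiment in pairs)
    return hits / len(pairs)


# Made-up sample values, in the string form used by the notebook
ratings = ["1", "1", "3", "3", "5"]
sentiments = ["neg", "neu", "neu", "pos", "pos"]
print(agreement_rate(ratings, sentiments))
```

With the DataFrame, `agreement_rate(df['Review_rating'], df['sentiment'])` would give the overall figure.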

In [None]:
import plotly.graph_objs as go
from plotly.offline import iplot

import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

Let's check the length and word count of the reviews.

Reference: https://www.kaggle.com/parulpandey/eda-and-preprocessing-for-bert

In [None]:
df['review_len'] = df['clean_review'].astype(str).apply(len)
df['review_word_count'] = df['clean_review'].apply(lambda x: len(str(x).split()))

In [None]:
one = df[df['Review_rating']=='1']
two = df[df['Review_rating']=='2']
three = df[df['Review_rating']=='3']
four = df[df['Review_rating']=='4']
five = df[df['Review_rating']=='5']

In [None]:
trace0 = go.Box(
    y=one['review_len'],
    name = 'One star',
    marker = dict(
        color = 'red',
    )
)

trace1 = go.Box(
    y=two['review_len'],
    name = 'Two stars',
    marker = dict(
        color = 'green',
    )
)

trace2 = go.Box(
    y=three['review_len'],
    name = 'Three stars',
    marker = dict(
        color = 'orange',
    )
)

trace3 = go.Box(
    y=four['review_len'],
    name = 'Four stars',
    marker = dict(
        color = 'blue',
    )
)

trace4 = go.Box(
    y=five['review_len'],
    name = 'Five stars',
    marker = dict(
        color = 'purple',
    )
)

data = [trace0, trace1, trace2, trace3, trace4]
layout = go.Layout(
    title = "Length of the reviews", title_x=0.5,
)

fig = go.Figure(data=data,layout=layout)
iplot(fig)

In [None]:
trace0 = go.Box(
    y=one['review_word_count'],
    name = 'One star',
    marker = dict(
        color = 'red',
    )
)

trace1 = go.Box(
    y=two['review_word_count'],
    name = 'Two stars',
    marker = dict(
        color = 'blue',
    )
)

trace2 = go.Box(
    y=three['review_word_count'],
    name = 'Three stars',
    marker = dict(
        color = 'darksalmon',
    )
)

trace3 = go.Box(
    y=four['review_word_count'],
    name = 'Four stars',
    marker = dict(
        color = 'purple',
    )
)

trace4 = go.Box(
    y=five['review_word_count'],
    name = 'Five stars',
    marker = dict(
        color = 'green',
    )
)
data = [trace0, trace1, trace2, trace3, trace4]
layout = go.Layout(
    title = "Word count of the reviews", title_x=0.5,
)

fig = go.Figure(data=data,layout=layout)
iplot(fig)
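
To put numbers next to the box plots, the median word count per rating can be computed directly. Below is a small standard-library sketch on made-up (rating, review) pairs; `median_word_count` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def median_word_count(rated_reviews):
    """Return {rating: median number of whitespace-separated words}."""
    counts = defaultdict(list)
    for rating, review in rated_reviews:
        counts[rating].append(len(review.split()))
    return {rating: median(values) for rating, values in sorted(counts.items())}


# Made-up sample, not taken from the dataset
sample = [
    ("1", "not happy with product"),
    ("1", "got damage product but quality is average"),
    ("5", "good"),
    ("5", "nice product"),
    ("5", "value for money"),
]
print(median_word_count(sample))
```

On the DataFrame, the same summary is `df.groupby('Review_rating')['review_word_count'].median()`.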

Reviews with five stars are not very wordy.

Let's check unigrams and bigrams in the reviews.

In [None]:
def get_top_n_gram(corpus,ngram_range,n=None):
    # Build the n-gram vocabulary, ignoring English stop words
    vec = CountVectorizer(ngram_range=ngram_range,stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Total occurrences of each n-gram across the whole corpus
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Most frequent first; n=None returns the full list
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
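
To make the counting step concrete, here is a simplified pure-Python illustration of what the function computes. It is not how `CountVectorizer` is implemented: it tokenizes on whitespace only and does no stop-word removal, and `top_ngrams` is a hypothetical name.

```python
from collections import Counter


def top_ngrams(corpus, n, top=None):
    """Count word n-grams across all documents, most frequent first."""
    counts = Counter()
    for document in corpus:
        tokens = document.split()
        # Slide a window of n tokens over the document
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(top)


# Made-up sample corpus
sample = ["good product", "very good product", "good quality"]
print(top_ngrams(sample, 1, 2))
print(top_ngrams(sample, 2, 1))
```

As in `get_top_n_gram`, n-grams never span two reviews, because each document is windowed separately.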

In [None]:
pos_unigrams = get_top_n_gram(five['clean_review'],(1,1),10)
neg_unigrams = get_top_n_gram(one['clean_review'],(1,1),10)


df1 = pd.DataFrame(pos_unigrams, columns = ['Text' , 'count'])
df1.groupby('Text').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', yTitle='Count', linecolor='black',color='green', title='Top 10 Unigrams in positive text',orientation='h')

df2 = pd.DataFrame(neg_unigrams, columns = ['Text' , 'count'])
df2.groupby('Text').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', yTitle='Count', linecolor='black', color='red',title='Top 10 Unigrams in negative text',orientation='h')

In [None]:
pos_bigrams = get_top_n_gram(five['clean_review'],(2,2),10)
neg_bigrams = get_top_n_gram(one['clean_review'],(2,2),10)


df1 = pd.DataFrame(pos_bigrams, columns = ['Text' , 'count'])
df1.groupby('Text').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', yTitle='Count', linecolor='black',color='green', title='Top 10 Bigrams in positive text',orientation='h')

df2 = pd.DataFrame(neg_bigrams, columns = ['Text' , 'count'])
df2.groupby('Text').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', yTitle='Count', linecolor='black', color='red',title='Top 10 Bigrams in negative text',orientation='h')

The preprocessed dataset can now be used for a text classification task.

In [None]:
df.head()

Unnamed: 0,Review,Review_rating,clean_review,sentiment,review_len,review_word_count
0,Not happy with product,1,not happy with product,neg,22,4
1,It's not as expected.,1,its not as expected,neu,20,4
2,AVERAGE PRODUCT,3,average product,neu,16,2
3,Pic more beautiful,3,pic more beautiful,pos,19,3
4,Got damage product. But quality is average fo...,3,got damage product but quality is average for,neu,47,8
