<a href="https://colab.research.google.com/github/ashamril/Text-Analytics/blob/master/Sentiment_Analysis_of_Shopee_product_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Shopee product reviews
## English reviews are using 4 pre-trained machine learning models
1. Huggingface Pipeline<br>
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering<br>
https://huggingface.co/transformers/main_classes/pipelines.html
2. VADER<br>
Valence Aware Dictionary and sEntiment Reasoner is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
VADER uses a combination of a sentiment lexicon is a list of lexical features (e.g., words) which are generally labelled according to their semantic orientation as either positive or negative.<br>
https://github.com/cjhutto/vaderSentiment
3. Flair<br>
Flair’s sentiment classifier is based on a character-level LSTM neural network which takes sequences of letters and words into account when predicting
Allows you to apply state-of-the-art natural language processing (NLP) models to sections of text. It works quite differently to the previously mentioned models. 
Flair utilizes a pre-trained model to detect positive or negative comments and print a number in brackets behind the label which is a prediction confidence.<br>
https://github.com/flairNLP/flair
4. Textblob<br>
Textblob’s Sentiment Analysis works in a similar way to NLTK — using a bag of words classifier, but the advantage is that it includes Subjectivity Analysis too (how factual/opinionated a piece of text is)<br>
https://textblob.readthedocs.io/en/dev/
<br>

## Bahasa Melayu reviews are using 1 pre-trained machine learning model
1. Malaya<br>
A Natural-Language-Toolkit library for Bahasa Malaysia, powered by Deep Learning Tensorflow.
Malaya provided basic interface for Pre-trained Transformer encoder models, specific to Malay, local social media slang and Manglish language, called Transformer-Bahasa.<br>
https://malaya.readthedocs.io/en/latest/

In [1]:
%pip -q install tabulate
%pip -q install textblob
%pip -q install google_trans_new

In [2]:
%pip -q install flair

In [3]:
from textblob import TextBlob

In [4]:
%pip -q install malaya

In [5]:
import re
import requests
import pandas as pd
from tabulate import tabulate
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import flair
import malaya
import json
from google_trans_new import google_translator
from transformers import pipeline

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


  'Cannot import beam_search_ops not available for Tensorflow 2, `deep_model` for stemmer will not available to use.'


In [6]:
# Using stacking model for better results
bert = malaya.sentiment.transformer('bert')
tinybert = malaya.sentiment.transformer('tiny-bert')
albert = malaya.sentiment.transformer('albert')
tinyalbert = malaya.sentiment.transformer('tiny-albert')
xlnet = malaya.sentiment.transformer('xlnet')
alxlnet = malaya.sentiment.transformer('alxlnet')

In [7]:
# url = input("Enter the Shopee product URL: ")
# # Example: https://shopee.com.my/Acer-Aspire-3-A315-35-15.6-Laptop-(Celeron-N4500-4GB-256GB-Intel)-(Windows-10-Pro-Basic-Installation)-i.268863847.4038359450

In [8]:
url = 'https://shopee.com.my/-Bundle-Microsoft-Surface-Pro-7-Platinum-(i5-1035G4-Intel-Iris-Plus-Graphics-8GB-128GB-12.3-Windows-10)-i.154568882.4433101145'
#url = 'https://shopee.com.my/Acer-Aspire-3-A315-35-15.6-Laptop-(Celeron-N4500-4GB-256GB-Intel)-(Windows-10-Pro-Basic-Installation)-i.268863847.4038359450'
#url = 'https://shopee.com.my/%E3%80%90EXP-MAY-2021%E3%80%91Nongshim-Shin-Ramyun-Korea-Ramen-5-Pack-%E9%9F%A9%E5%9B%BD%E8%BE%9B%E6%8B%89%E9%9D%A2%E6%B3%A1%E8%8F%9C%E9%9D%A2-Instant-Spicy-Noodle-Soup-i.39931794.2920958018'
#url = 'https://shopee.com.my/Verbatim-New-Slider-USB-2.0-Flash-Drive-Pendrive-16GB-32GB-i.143275520.2312604078'

numberInURL = re.findall('\d+', url)
itemid = numberInURL[-1]
shopid = numberInURL[-2]
print("Shop ID: ", shopid)
print("Item ID: ", itemid)

Shop ID:  154568882
Item ID:  4433101145


In [9]:
ratings_url = 'https://shopee.com.my/api/v2/item/get_ratings?filter=0&flag=1&itemid={item_id}&limit=20&offset={offset}&shopid={shop_id}&type=0'

komen = []
offset = 0

while True:
    data = requests.get(ratings_url.format(shop_id=shopid, item_id=itemid, offset=offset)).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    i = 1
    for i, rating in enumerate(data['data']['ratings'], 1):
      if len(rating['comment']) > 2:
        #print(i, rating['comment'])
        #print('-' * 80)
        komen.append(rating['comment'])
    if i % 20:
        break
    offset += 20

print(komen)

['Memang berbaloi tunggu promo tu !!!\nAnd packaging mmg kemas dan selamat walaupun dkt kotak tu mcm kemik sikit tapi alhamdulillah takde calar pape pun dkt tab 🥺😍. ', 'Barang sampai dalam keadaan yang selamat dan penghantaran yang baik...👍👍👍', 'I got mine. The delivery quite average to east malaysia. Thank you', 'Item was properly sealed and packed with Fragile sticker.\nNo dent on the product.\nReally super fast delivery to East Malaysia.\n\n👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽', 'Arrived 2 days after payment. Serve by Seng Heng. Very good price and original product.', 'Nice! Everything looks good. I was carry surface pro1. Now switch to pro7. Taller than previous but still looks beautiful.ice blue typecover but it more like light grey/sliver instead but it is still nice.thanks.', 'Buy it during the promotion, worth it ', 'Thanks man! Superb fast delivery ! All item was packed nicely 🥺🥺🥺 will buy a pen and mouse later on', 'The servic

In [10]:
#  Language detection using Google translator

corpus_list = komen
dataEN = []
dataMSID = []
for i in corpus_list: 
  t = google_translator().detect(i)
  if t[0] == 'en':
    dataEN.append([t[0], i])
  elif t[0] == 'ms' or t[0] == 'id':
    dataMSID.append([t[0], i])

dfKomenEN = pd.DataFrame(dataEN)
dfKomenEN.columns = ['Language', 'Review']
dfKomenMSID = pd.DataFrame(dataMSID)
dfKomenMSID.columns = ['Language', 'Review']
print("English reviews: \n", tabulate(dfKomenEN, showindex=False, headers=dfKomenEN.columns))
print("")
print("Bahasa Melayu reviews: \n", tabulate(dfKomenMSID, showindex=False, headers=dfKomenMSID.columns))

English reviews: 
 Language    Review
----------  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
en          I got mine. The delivery quite average to east malaysia. Thank you
en          Item was properly sealed and packed with Fragile sticker.
            No dent on the product.
            Really super fast delivery to East Malaysia.

            👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽
en          Arrived 2 days after payment. Serve by Seng Heng. Very good price and original product.
en          Nice! Everything looks good. I was carry surface pro1. Now switch to pro7. Taller than previous but still looks beautiful.ice blue typecover but it more like light grey/sliver instead but it is s

In [11]:
# # Language Detection using TextBlob

# corpus_list = komen
# from textblob import TextBlob 
# import time

# dataEN = []
# dataMSID = []
# for i in corpus_list: 
#   lang = TextBlob(i)
#   time.sleep(1)
#   if lang.detect_language() == 'en':
#     dataEN.append([lang.detect_language(), i])
#   elif lang.detect_language() == 'ms' or lang.detect_language() == 'id':
#     dataMSID.append([lang.detect_language(), i])

# dfKomenEN = pd.DataFrame(dataEN)
# dfKomenEN.columns = ['Language', 'Review']
# dfKomenMSID = pd.DataFrame(dataMSID)
# dfKomenMSID.columns = ['Language', 'Review']
# print("English reviews: \n", tabulate(dfKomenEN, showindex=False, headers=dfKomenEN.columns))
# print("")
# print("Bahasa Melayu reviews: \n", tabulate(dfKomenMSID, showindex=False, headers=dfKomenMSID.columns))

In [12]:
# By default, the model downloaded for this pipeline is called “distilbert-base-uncased-finetuned-sst-2-english”. 
# It uses the DistilBERT architecture and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.
classifier = pipeline('sentiment-analysis')

# Using specific model
#classifier = pipeline('sentiment-analysis', model="bert-base-cased")
#classifier = pipeline('sentiment-analysis', model="xlnet-base-cased")

In [13]:
corpus = dfKomenEN['Review']
corpus_list = corpus.tolist()
corpus_list = [x.replace('\n', '') for x in corpus_list]

In [14]:
def cls_corpus_pipeline(corpus_list):
  data = []
  global df
  global df3
  for sentence in corpus_list: 
    corpus_result = classifier(sentence)
    listToStr = ' '.join([str(elem) for elem in corpus_result])
    listToStr = listToStr.replace('\'', '')
    listToStr = listToStr.replace('}', '')
    listToStr = listToStr.replace(',', '')
    label = listToStr.split()[1]
    score = listToStr.split()[3]
    data.append([label, score, sentence])
    df = pd.DataFrame(data)

  df.columns=['Classification', 'Score', 'Text']
  class_count  = df['Classification'].value_counts().sort_index()
  s = df.Classification
  counts = s.value_counts()
  percent100 = s.value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
  df3 = pd.DataFrame({'Counts': counts, 'Percentage': percent100}).sort_index(ascending=False)

In [15]:
cls_corpus_pipeline(corpus_list)
dfPipeline = df.copy()
dfPipelineCount = df3.copy()

In [16]:
model = SentimentIntensityAnalyzer()

In [17]:
def cls_corpus_vader(corpus_list):
  data=[]
  global df2
  global df3
  for sentence in corpus_list: 
    corpus_result = model.polarity_scores(sentence)
    # Decide whether the text is positive, negative or neutral 
    if corpus_result['compound'] >= 0.05 : 
      cls = "Positive"
    elif corpus_result['compound'] <= -0.05 : 
      cls = "Negative" 
    else : 
      cls = "Neutral" 
    data.append([corpus_result, cls, sentence])
    df = pd.DataFrame(data)

  df2 = pd.DataFrame(df[0].values.tolist(), index=df.index)
  df2.columns=['Negative', 'Neutral', 'Positive', 'Compound']

  df2['Classification'] = df[1]
  df2['Text'] = df[2]
  class_count  = df2['Classification'].value_counts().sort_index()
  s = df2.Classification
  counts = s.value_counts()
  percent100 = s.value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
  dfvader = pd.DataFrame({'Counts': counts, 'Percentage': percent100}).sort_index(ascending=False)

In [18]:
cls_corpus_vader(corpus_list)
dfVader = df2.copy()
dfVaderCount = df3.copy()

In [19]:
flair_sentiment = flair.models.TextClassifier.load('en-sentiment')

2021-04-09 15:47:52,543 loading file /root/.flair/models/sentiment-en-mix-distillbert_4.pt


In [20]:
def cls_corpus_flair(corpus_list):
  data=[]
  global df
  global df3
  for sentence in corpus_list: 
    corpus_result = flair.data.Sentence(sentence)
    flair_sentiment.predict(corpus_result)
    total_sentiment = corpus_result.labels
    total_sentiment = total_sentiment.pop()
    data.append([round(total_sentiment.score, 3), total_sentiment.value, sentence])
  
  df = pd.DataFrame(data)
  df.columns=['Predict', 'Classification', 'Text']

  class_count  = df['Classification'].value_counts().sort_index()
  s = df.Classification
  counts = s.value_counts()
  percent100 = s.value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
  df3 = pd.DataFrame({'Counts': counts, 'Percentage': percent100}).sort_index(ascending=False)

In [21]:
cls_corpus_flair(corpus_list)
dfFlair = df.copy()
dfFlairCount = df3.copy()

In [22]:
def cls_corpus_textblob(corpus_list):
  data=[]
  global df
  global df3
  for sentence in corpus_list: 
    corpus_result = TextBlob(sentence)
    result = round(corpus_result.sentiment.polarity, 3)
    # Decide whether the text is positive, negative or neutral 
    if result > 0 : 
      cls = "Positive"
    elif result < 0 : 
      cls = "Negative" 
    else : 
      cls = "Neutral" 
    data.append([result, cls, sentence])

  df = pd.DataFrame(data)
  df.columns=['Polarity', 'Classification', 'Text']
  class_count  = df['Classification'].value_counts().sort_index()
  s = df.Classification
  counts = s.value_counts()
  percent100 = s.value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
  df3 = pd.DataFrame({'Counts': counts, 'Percentage': percent100}).sort_index(ascending=False)

In [23]:
cls_corpus_textblob(corpus_list)
dfTextblob = df.copy()
dfTextblobCount = df3.copy()

In [24]:
# Available Transformer models
malaya.sentiment.available_transformer()

INFO:root:tested on 20% test set.


Unnamed: 0,Size (MB),Quantized Size (MB),macro precision,macro recall,macro f1-score
bert,425.6,111.0,0.9933,0.9933,0.99329
tiny-bert,57.4,15.4,0.98774,0.98774,0.98774
albert,48.6,12.8,0.99227,0.99226,0.99226
tiny-albert,22.4,5.98,0.98554,0.9855,0.98551
xlnet,446.6,118.0,0.99353,0.99353,0.99353
alxlnet,46.8,13.3,0.99188,0.99188,0.99188


In [25]:
def cls_corpus_malaya(corpus_list):
  #model = malaya.sentiment.transformer('albert')
  data = []
  global df2
  global df3
  for i in corpus_list: 
    corpus_result = malaya.stack.predict_stack([bert, tinybert, albert, tinyalbert, xlnet, alxlnet], [i])
    #corpus_result = model.predict_proba([i])
    for j in corpus_result:
      corpus_result = j.values()
      data.append([corpus_result, i])

  df = pd.DataFrame(data)
  df2 = pd.DataFrame(df[0].values.tolist(), index=df.index)
  df2.columns=['Negative', 'Positive', 'Neutral']
  df2['Classification'] = df2[['Negative','Positive','Neutral']].idxmax(axis=1)
  df2['Text'] = df[1]

  class_count  = df2['Classification'].value_counts().sort_index()
  s = df2.Classification
  counts = s.value_counts()
  percent100 = s.value_counts(normalize=True).mul(100).round(2).astype(str) + '%'
  df3 = pd.DataFrame({'Counts': counts, 'Percentage': percent100})

In [26]:
corpus = dfKomenMSID['Review']
corpus_list = corpus.tolist()
corpus_list = [x.replace('\n', '') for x in corpus_list]
cls_corpus_malaya(corpus_list)
dfMalaya = df2.copy()
dfMalayaCount = df3.copy()

In [27]:
print("Total Number of EN Reviews: ", dfPipeline['Classification'].count())
print("1. EN Model Pipeline: \n", dfPipelineCount)
print("")
print("2. EN Model Vader: \n", dfVaderCount)
print("")
print("3. EN Model Flair: \n", dfFlairCount)
print("")
print("4. EN Model Textblob: \n", dfTextblobCount)
print("")
print("Total Number of BM Reviews: ", dfMalaya['Classification'].count())
print("1. BM Model Malaya: \n", dfMalayaCount)

Total Number of EN Reviews:  31
1. EN Model Pipeline: 
           Counts Percentage
POSITIVE      22     70.97%
NEGATIVE       9     29.03%

2. EN Model Vader: 
           Counts Percentage
POSITIVE      22     70.97%
NEGATIVE       9     29.03%

3. EN Model Flair: 
           Counts Percentage
POSITIVE      21     67.74%
NEGATIVE      10     32.26%

4. EN Model Textblob: 
           Counts Percentage
Positive      25     80.65%
Neutral        1      3.23%
Negative       5     16.13%

Total Number of BM Reviews:  11
1. BM Model Malaya: 
           Counts Percentage
Positive      10     90.91%
Negative       1      9.09%


In [28]:
print("1. EN Model Pipeline: \n", tabulate(dfPipeline, showindex=False, headers=dfPipeline.columns))

1. EN Model Pipeline: 
 Classification       Score  Text
----------------  --------  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
POSITIVE          0.99978   I got mine. The delivery quite average to east malaysia. Thank you
NEGATIVE          0.950041  Item was properly sealed and packed with Fragile sticker.No dent on the product.Really super fast delivery to East Malaysia.👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽
POSITIVE          0.998006  Arrived 2 days after payment. Serve by Seng Heng. Very good price and original product.
POSITIVE          0.999771  Nice! Everything looks good. I was carry surface pro1. Now switch to pro7. Taller than previous but still looks beautiful.ice blue typec

In [29]:
print("2. EN Model Vader: \n", tabulate(dfVader, showindex=False, headers=dfVader.columns))

2. EN Model Vader: 
   Negative    Neutral    Positive    Compound  Classification    Text
----------  ---------  ----------  ----------  ----------------  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     0          0.8         0.2        0.3612  Positive          I got mine. The delivery quite average to east malaysia. Thank you
     0          0.822       0.178      0.5994  Positive          Item was properly sealed and packed with Fragile sticker.No dent on the product.Really super fast delivery to East Malaysia.👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽
     0          0.686       0.314      0.6697  Positive          Arrived 2 days after payment. Serve by Seng Heng. Very good price and 

In [30]:
print("3. EN Model Flair: \n", tabulate(dfFlair, showindex=False, headers=dfFlair.columns))

3. EN Model Flair: 
   Predict  Classification    Text
---------  ----------------  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    0.862  POSITIVE          I got mine. The delivery quite average to east malaysia. Thank you
    0.905  POSITIVE          Item was properly sealed and packed with Fragile sticker.No dent on the product.Really super fast delivery to East Malaysia.👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽
    0.986  POSITIVE          Arrived 2 days after payment. Serve by Seng Heng. Very good price and original product.
    0.996  POSITIVE          Nice! Everything looks good. I was carry surface pro1. Now switch to pro7. Taller than previous but still looks beautiful.ice blue ty

In [31]:
print("4. EN Model Textblob: \n", tabulate(dfTextblob, showindex=False, headers=dfTextblob.columns))

4. EN Model Textblob: 
   Polarity  Classification    Text
----------  ----------------  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    -0.15   Negative          I got mine. The delivery quite average to east malaysia. Thank you
     0.133  Positive          Item was properly sealed and packed with Fragile sticker.No dent on the product.Really super fast delivery to East Malaysia.👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽👍🏽
     0.642  Positive          Arrived 2 days after payment. Serve by Seng Heng. Very good price and original product.
     0.364  Positive          Nice! Everything looks good. I was carry surface pro1. Now switch to pro7. Taller than previous but still looks beautiful.ic

In [32]:
print("1. BM Model Malaya: \n", tabulate(dfMalaya, showindex=False, headers=dfMalaya.columns))

1. BM Model Malaya: 
    Negative     Positive      Neutral  Classification    Text
-----------  -----------  -----------  ----------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5.28512e-06  0.997439     0.000522789  Positive          Memang berbaloi tunggu promo tu !!!And packaging mmg kemas dan selamat walaupun dkt kotak tu mcm kemik sikit tapi alhamdulillah takde calar pape pun dkt tab 🥺😍.
0.000134307  0.94672      0.0132966    Positive          Barang sampai dalam keadaan yang selamat dan penghantaran yang baik...👍👍👍
0.99902      1.53571e-06  0.00015033   Negative          Nasib kurang baik...cover terima yg tak ok. Bila tekan shift + 2 keluar " tidak keluar @. Bila tanya minta pergi center...menyesalnya beli online, xde jaminan daripada kedai.
3.79842e-06  0.999023     0.000374773  Positive          br

In [33]:
# # Testing
# model = malaya.sentiment.transformer('alxlnet')
# string1 = 'Item selamat sampai. Tiada kerosakan. Tak sampai 24 jam order dh smpaii wowww💖 Seller bagus, call bgtau psl stock. Good job'
# model.predict_proba([string1])