# **Scraping Google Play Reviews (Shopee App)**
---

**1. Background**

Shopee is one of the most popular online shopping apps in Indonesia. With millions of active users, the reviews they leave on the Google Play Store are a valuable source of data for understanding their satisfaction, complaints, and expectations of the app. To process this volume of data efficiently, we apply automated sentiment analysis that classifies each review as positive, negative, or neutral. The results can serve as strategic input for app developers and product management.

**2. Research Objectives**

- Group Shopee user reviews from the Google Play Store into sentiment categories (positive, negative, neutral).
- Analyze the common themes within each sentiment category.
- Build a sentiment classification model with accuracy ≥ **85%**.

# **1. Import Libraries**

In [88]:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
seed = 0
np.random.seed(seed)
import matplotlib.pyplot as plt
import seaborn as sns
import csv

import datetime as dt
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

!pip install sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

from wordcloud import WordCloud



In [89]:
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# **2. Scraping Google Play Store**

In [90]:
%pip install google-play-scraper



In [None]:
from google_play_scraper import app, reviews, Sort, reviews_all

In [None]:
scrapreview, _ = reviews(
    'com.shopee.id',
    lang='id',
    country='id',
    sort=Sort.MOST_RELEVANT,
    count=50000
)

print(f"Number of reviews scraped: {len(scrapreview)}")

Number of reviews scraped: 50000


In [None]:
import csv

with open('scrapping_ulasan_.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Review'])
    for review in scrapreview:
        writer.writerow([review['content']])

# **3. Loading Dataset**

In [None]:
app_reviews_df = pd.DataFrame(scrapreview)

In [None]:
app_reviews_df.shape

(50000, 11)

In [None]:
app_reviews_df.head()

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion
0,ed7aae21-32c9-4499-b3c8-8115535da482,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Aplikasi belanja ini sebenarnya bagus, tapi pe...",5,2939,3.55.23,2025-07-25 09:35:31,"Hi kak Gilang Alf4rizy, maaf ya udah bikin km ...",2025-07-25 10:05:45,3.55.23
1,366bafc1-ed1c-400a-a54f-1ac30a823bfd,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Saya sudah lama langganan SPaylater. Sekarang ...,3,185,3.54.23,2025-07-27 03:44:35,"Hai kak Doddy Limited, mohon maaf atas ketidak...",2025-07-27 04:47:07,3.54.23
2,f77c35cc-a5a9-4c1c-ab4a-0c3e8f3af350,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Setelah update ko malah ngawur,, buka Riwayat ...",3,144,3.54.23,2025-07-27 14:29:07,"Hai kak Ahmad Diemy, maaf ya udah buat kecewa....",2025-07-27 15:04:52,3.54.23
3,4c41f0be-ca29-4bcb-b64f-8607948c752a,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,lama² saya kesel juga sm apl ini. awal² lancar...,1,218,3.54.23,2025-07-25 08:41:00,"Hi kak Purnama Sang555, maaf udah bikin ga nya...",2025-07-25 09:17:04,3.54.23
4,035ea8e3-7e7a-40f6-9eeb-0fda8efad466,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Barang sesuai dengan deskripsi, bahan 0k sanga...",5,1316,3.54.23,2025-07-27 12:03:26,,NaT,3.54.23


In [None]:
app_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   reviewId              50000 non-null  object        
 1   userName              50000 non-null  object        
 2   userImage             50000 non-null  object        
 3   content               50000 non-null  object        
 4   score                 50000 non-null  int64         
 5   thumbsUpCount         50000 non-null  int64         
 6   reviewCreatedVersion  45174 non-null  object        
 7   at                    50000 non-null  datetime64[ns]
 8   replyContent          41037 non-null  object        
 9   repliedAt             41037 non-null  datetime64[ns]
 10  appVersion            45174 non-null  object        
dtypes: datetime64[ns](2), int64(2), object(7)
memory usage: 4.2+ MB


# **4. Cleaning Dataset**

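Drop rows that contain missing values, then drop duplicate rows. Because `dropna()` is called without `subset`, a review is removed if any column is empty, including `replyContent`, `repliedAt`, `reviewCreatedVersion`, and `appVersion`, so reviews that never received a developer reply are excluded from the rest of the analysis.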

In [None]:
clean_df = app_reviews_df.copy()

clean_df = clean_df.dropna()
clean_df = clean_df.drop_duplicates()

print("Number of rows and columns after cleaning:", clean_df.shape)

Number of rows and columns after cleaning: (37836, 11)


In [None]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37836 entries, 0 to 49998
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   reviewId              37836 non-null  object        
 1   userName              37836 non-null  object        
 2   userImage             37836 non-null  object        
 3   content               37836 non-null  object        
 4   score                 37836 non-null  int64         
 5   thumbsUpCount         37836 non-null  int64         
 6   reviewCreatedVersion  37836 non-null  object        
 7   at                    37836 non-null  datetime64[ns]
 8   replyContent          37836 non-null  object        
 9   repliedAt             37836 non-null  datetime64[ns]
 10  appVersion            37836 non-null  object        
dtypes: datetime64[ns](2), int64(2), object(7)
memory usage: 3.5+ MB


# **5. Preprocessing Dataset**

Clean and preprocess the review text.

In [None]:
import re
import string
import unicodedata

def normalize_repeated_chars(word):
    # Collapse any character repeated 3+ times into one (e.g. "mantaaaap" → "mantap")
    return re.sub(r'(.)\1{2,}', r'\1', word)

def cleaningText(text):
    # Lowercase, then fold compatibility characters with NFKC (e.g. "²" → "2")
    text = text.lower()
    text = unicodedata.normalize('NFKC', text)

    # Remove mentions, hashtags, RT markers, and links
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)     # mention
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)     # hashtag
    text = re.sub(r'\brt\b\s+', '', text)          # RT as a whole word only, so "support " is kept
    text = re.sub(r'http\S+|www\.\S+', '', text)   # URL/link

    # Remove digits and punctuation
    text = re.sub(r'\d+', '', text)                # digits
    text = re.sub(rf"[{re.escape(string.punctuation)}]", ' ', text)  # punctuation

    # Remove newlines and extra whitespace
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text).strip()

    # Normalize repeated characters (word by word)
    text = ' '.join(normalize_repeated_chars(word) for word in text.split())

    return text
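
To make two of the regular expressions used in the cleaning step concrete, here is a minimal, self-contained sketch. The helper names `squeeze_repeats`, `strip_rt_naive`, and `strip_rt_bounded` are illustrative only and do not appear in the notebook. The sketch shows how three or more repeated characters collapse into one, and why a retweet-marker pattern without word boundaries can damage ordinary words that happen to end in "rt".

```python
import re

def squeeze_repeats(word):
    # Collapse a character repeated three or more times into a single one.
    return re.sub(r'(.)\1{2,}', r'\1', word)

def strip_rt_naive(text):
    # Unanchored: matches "rt" followed by whitespace anywhere, even inside a word.
    return re.sub(r'rt\s+', '', text)

def strip_rt_bounded(text):
    # Word-bounded: only removes "rt" when it stands alone as a token.
    return re.sub(r'\brt\b\s+', '', text)

print(squeeze_repeats('mantaaaap'))          # mantap
print(squeeze_repeats('maaf'))               # maaf (only two repeats, left alone)
print(strip_rt_naive('support cepat'))       # suppocepat
print(strip_rt_bounded('support cepat'))     # support cepat
print(strip_rt_bounded('rt promo gratis'))   # promo gratis
```

Because the repeat rule only fires at three or more occurrences, legitimate double letters such as the "aa" in "maaf" survive.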

In [None]:
def load_slangwords_from_csv(file_path):
    slang_dict = {}
    with open(file_path, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            slang = row["Slang"].strip()
            formal = row["Formal"].strip()
            slang_dict[slang] = formal
    return slang_dict

In [None]:
slangwords = load_slangwords_from_csv("slangwords.csv")

In [None]:
def fix_slangwords(text, slangwords):
    # Replace each slang word with its formal form; unknown words pass through unchanged
    words = text.split()
    fixed_words = []

    for word in words:
        fixed_words.append(slangwords.get(word.lower(), word))

    return ' '.join(fixed_words)

In [92]:
clean_df['text_clean'] = clean_df['content'].apply(cleaningText)
clean_df['text_slangwords'] = clean_df['text_clean'].apply(lambda x: fix_slangwords(x, slangwords))
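
The slang-normalization step can be tried in isolation with a toy dictionary. The sketch below assumes a CSV with `Slang` and `Formal` headers, as read by the loader above; the three sample rows are made up for illustration and are not taken from `slangwords.csv`. The function names mirror the notebook's, but the bodies are condensed.

```python
import csv
from io import StringIO

# Hypothetical stand-in for slangwords.csv, using the same "Slang"/"Formal" headers.
sample_csv = "Slang,Formal\nkarna,karena\ngak,tidak\nbgt,banget\n"

def load_slangwords(handle):
    # Build a slang -> formal lookup from any file-like object.
    return {row["Slang"].strip(): row["Formal"].strip()
            for row in csv.DictReader(handle)}

def fix_slangwords(text, slangwords):
    # Whole-token lookup: tokens missing from the dictionary are kept as they are.
    return ' '.join(slangwords.get(word.lower(), word) for word in text.split())

slang = load_slangwords(StringIO(sample_csv))
print(fix_slangwords("aplikasi gak bisa dibuka karna error", slang))
# aplikasi tidak bisa dibuka karena error
```

The lookup works on whole whitespace-separated tokens, so it relies on punctuation having been stripped beforehand; a token such as "gak," would not match.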

In [None]:
def tokenizingText(text): # Split a string into a list of tokens
    text = word_tokenize(text)
    return text

In [None]:
# Build the stopword set once instead of rebuilding it on every call
listStopwords = set(stopwords.words('indonesian'))
listStopwords.update(stopwords.words('english'))
listStopwords.update(['iya','yaa','gak','nya','na','sih','ku',"di","ga","ya","gaa","loh","kah","woi","woii","woy"])

def filteringText(text): # Remove stopwords from a list of tokens
    return [txt for txt in text if txt not in listStopwords]

In [None]:
def toSentence(list_words): # Convert list of words into sentence
    sentence = ' '.join(word for word in list_words)
    return sentence

In [93]:
clean_df['text_tokenizing'] = clean_df['text_slangwords'].apply(tokenizingText)
clean_df['text_stopword'] = clean_df['text_tokenizing'].apply(filteringText)
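
As a rough illustration of the tokenize-then-filter sequence, the sketch below runs on one already-cleaned sentence. It is a simplified stand-in: `tokenize` uses a plain whitespace split instead of `word_tokenize`, and `toy_stopwords` is a tiny made-up set rather than the NLTK lists used in the notebook.

```python
# Tiny illustrative stopword set; the notebook uses NLTK's Indonesian and English lists.
toy_stopwords = {'ini', 'saya', 'yang', 'di', 'nya'}

def tokenize(text):
    # Whitespace split as a stand-in for nltk.word_tokenize on already-cleaned text.
    return text.split()

def remove_stopwords(tokens, stopword_set):
    # Keep tokens in their original order, dropping anything in the stopword set.
    return [tok for tok in tokens if tok not in stopword_set]

tokens = tokenize('aplikasi ini bagus tapi saya kecewa')
kept = remove_stopwords(tokens, toy_stopwords)
print(kept)             # ['aplikasi', 'bagus', 'tapi', 'kecewa']
print(' '.join(kept))   # aplikasi bagus tapi kecewa
```

Using a `set` for the stopwords keeps each membership check constant-time, which matters when the filter runs over tens of thousands of reviews.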

In [95]:
clean_df['text_akhir'] = clean_df['text_stopword'].apply(toSentence)
clean_df.head()

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion,text_clean,text_slangwords,text_tokenizing,text_stopword,text_akhir,polarity_score,polarity
0,ed7aae21-32c9-4499-b3c8-8115535da482,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Aplikasi belanja ini sebenarnya bagus, tapi pe...",5,2939,3.55.23,2025-07-25 09:35:31,"Hi kak Gilang Alf4rizy, maaf ya udah bikin km ...",2025-07-25 10:05:45,3.55.23,aplikasi belanja ini sebenarnya bagus tapi per...,aplikasi belanja ini sebenarnya bagus tapi per...,"[aplikasi, belanja, ini, sebenarnya, bagus, ta...","[aplikasi, belanja, bagus, performanya, mengec...",aplikasi belanja bagus performanya mengecewaka...,-22,negative
1,366bafc1-ed1c-400a-a54f-1ac30a823bfd,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Saya sudah lama langganan SPaylater. Sekarang ...,3,185,3.54.23,2025-07-27 03:44:35,"Hai kak Doddy Limited, mohon maaf atas ketidak...",2025-07-27 04:47:07,3.54.23,saya sudah lama langganan spaylater sekarang t...,saya sudah lama langganan spaylater sekarang t...,"[saya, sudah, lama, langganan, spaylater, seka...","[langganan, spaylater, tagihannya, berasa, mah...",langganan spaylater tagihannya berasa mahal ce...,-20,negative
2,f77c35cc-a5a9-4c1c-ab4a-0c3e8f3af350,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Setelah update ko malah ngawur,, buka Riwayat ...",3,144,3.54.23,2025-07-27 14:29:07,"Hai kak Ahmad Diemy, maaf ya udah buat kecewa....",2025-07-27 15:04:52,3.54.23,setelah update ko malah ngawur buka riwayat pe...,setelah update ko bahkan berbicara sembarangan...,"[setelah, update, ko, bahkan, berbicara, semba...","[update, ko, berbicara, sembarangan, buka, riw...",update ko berbicara sembarangan buka riwayat p...,-6,negative
3,4c41f0be-ca29-4bcb-b64f-8607948c752a,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,lama² saya kesel juga sm apl ini. awal² lancar...,1,218,3.54.23,2025-07-25 08:41:00,"Hi kak Purnama Sang555, maaf udah bikin ga nya...",2025-07-25 09:17:04,3.54.23,lama saya kesel juga sm apl ini awal lancar ma...,lama saya kesel juga sm apl ini awal lancar ma...,"[lama, saya, kesel, juga, sm, apl, ini, awal, ...","[kesel, sm, apl, lancar, kesini, parah, kali, ...",kesel sm apl lancar kesini parah kali pesanan ...,-8,negative
5,ad7aebea-ae73-4930-81e6-6973ffd94015,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"sedikit masukan untuk shopee. "" pada saat memb...",5,1987,3.54.23,2025-07-27 15:32:32,"Hai kak Tri Wiyono, maaf ya udah buat kecewa. ...",2025-07-27 16:08:08,3.54.23,sedikit masukan untuk shopee pada saat membuka...,sedikit masukan untuk shopee pada saat membuka...,"[sedikit, masukan, untuk, shopee, pada, saat, ...","[masukan, shopee, membuka, aplikasi, tolong, h...",masukan shopee membuka aplikasi tolong halaman...,-4,negative


In [96]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37836 entries, 0 to 49998
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   reviewId              37836 non-null  object        
 1   userName              37836 non-null  object        
 2   userImage             37836 non-null  object        
 3   content               37836 non-null  object        
 4   score                 37836 non-null  int64         
 5   thumbsUpCount         37836 non-null  int64         
 6   reviewCreatedVersion  37836 non-null  object        
 7   at                    37836 non-null  datetime64[ns]
 8   replyContent          37836 non-null  object        
 9   repliedAt             37836 non-null  datetime64[ns]
 10  appVersion            37836 non-null  object        
 11  text_clean            37836 non-null  object        
 12  text_slangwords       37836 non-null  object        
 13  text_tokenizing      

# **6. Labelling**

In [97]:
import csv
import requests
from io import StringIO

def load_lexicon(url, label):
    # Download a two-column CSV (word, integer weight) into a dict
    lexicon = dict()
    response = requests.get(url)

    if response.status_code == 200:
        reader = csv.reader(StringIO(response.text), delimiter=',')
        for row in reader:
            lexicon[row[0]] = int(row[1])
    else:
        print(f"Failed to fetch {label} lexicon data")

    return lexicon

lexicon_positive = load_lexicon('https://raw.githubusercontent.com/angelmetanosaa/dataset/main/lexicon_positive.csv', 'positive')
lexicon_negative = load_lexicon('https://raw.githubusercontent.com/angelmetanosaa/dataset/main/lexicon_negative.csv', 'negative')
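
The lexicon parser above converts the second column with `int(...)`, which raises `ValueError` on a header row or an empty weight. Whether the hosted files contain such rows is not stated in the notebook, so the following is only a precautionary sketch. The name `parse_lexicon` and the sample text are hypothetical.

```python
import csv
from io import StringIO

# Hypothetical lexicon text; the header row and the blank weight are deliberate.
sample = "word,weight\nbagus,3\nkecewa,-4\nrusak,\n"

def parse_lexicon(text):
    # Parse "word,weight" rows into a dict, skipping rows that cannot be used.
    lexicon = {}
    for row in csv.reader(StringIO(text)):
        if len(row) < 2:
            continue  # blank or single-column line
        try:
            lexicon[row[0]] = int(row[1])
        except ValueError:
            continue  # header row or non-integer weight
    return lexicon

print(parse_lexicon(sample))   # {'bagus': 3, 'kecewa': -4}
```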

In [98]:
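def sentiment_analysis_lexicon_indonesia(text, threshold=0):
    # `text` is a list of tokens; each token found in a lexicon adds its weight to the score.
    # Negative-lexicon weights are added as-is, so they are expected to be stored as negative numbers.
    score = 0
    for word in text:
        if (word in lexicon_positive):
            score += lexicon_positive[word]

    for word in text:
        if (word in lexicon_negative):
            score += lexicon_negative[word]

    # Scores within [-threshold, threshold] are labelled neutral
    polarity=''

    if score > threshold:
        polarity = 'positive'
    elif score < -threshold:
        polarity = 'negative'
    else:
        polarity = 'neutral'

    return score, polarity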
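
To see how the lexicon score and the threshold interact, here is a small worked sketch. The two toy lexicons and their weights are invented for illustration and are not the contents of the downloaded files. `score_tokens` is a hypothetical, condensed version of the labelling function that takes the lexicons as arguments.

```python
# Made-up weights for illustration only.
toy_positive = {'bagus': 3, 'cepat': 2}
toy_negative = {'kecewa': -4, 'lambat': -3}

def score_tokens(tokens, positive, negative, threshold=0):
    # Sum the weights of every token found in either lexicon.
    score = sum(positive.get(t, 0) for t in tokens) + sum(negative.get(t, 0) for t in tokens)
    if score > threshold:
        polarity = 'positive'
    elif score < -threshold:
        polarity = 'negative'
    else:
        polarity = 'neutral'
    return score, polarity

print(score_tokens(['aplikasi', 'bagus', 'kecewa'], toy_positive, toy_negative))     # (-1, 'negative')
print(score_tokens(['pengiriman', 'cepat'], toy_positive, toy_negative))             # (2, 'positive')
print(score_tokens(['aplikasi', 'belanja'], toy_positive, toy_negative))             # (0, 'neutral')
print(score_tokens(['aplikasi', 'bagus', 'kecewa'], toy_positive, toy_negative, 1))  # (-1, 'neutral')
```

With the default `threshold=0`, only a score of exactly zero is neutral. Raising the threshold widens the neutral band symmetrically around zero, as the last call shows.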

In [99]:
results = clean_df['text_stopword'].apply(sentiment_analysis_lexicon_indonesia)
results = list(zip(*results))
clean_df['polarity_score'] = results[0]
clean_df['polarity'] = results[1]
print(clean_df['polarity'].value_counts())

polarity
positive    19766
negative    15617
neutral      2453
Name: count, dtype: int64
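
The label counts above can be turned into class shares with a few lines of arithmetic. The sketch below hard-codes the three counts from the output and also computes the accuracy of a trivial classifier that always predicts the largest class, which is a useful reference point for the accuracy target. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {'positive': 19766, 'negative': 15617, 'neutral': 2453}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<9}{pct:5.1f}%")

# Accuracy of always predicting the most frequent label.
majority_baseline = max(counts.values()) / total
print(f"majority-class baseline: {majority_baseline:.3f}")
```

The neutral class is much smaller than the other two, so per-class metrics are worth checking alongside overall accuracy when a model is trained on these labels.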


In [100]:
clean_df.head(100)

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion,text_clean,text_slangwords,text_tokenizing,text_stopword,text_akhir,polarity_score,polarity
0,ed7aae21-32c9-4499-b3c8-8115535da482,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Aplikasi belanja ini sebenarnya bagus, tapi pe...",5,2939,3.55.23,2025-07-25 09:35:31,"Hi kak Gilang Alf4rizy, maaf ya udah bikin km ...",2025-07-25 10:05:45,3.55.23,aplikasi belanja ini sebenarnya bagus tapi per...,aplikasi belanja ini sebenarnya bagus tapi per...,"[aplikasi, belanja, ini, sebenarnya, bagus, ta...","[aplikasi, belanja, bagus, performanya, mengec...",aplikasi belanja bagus performanya mengecewaka...,-22,negative
1,366bafc1-ed1c-400a-a54f-1ac30a823bfd,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Saya sudah lama langganan SPaylater. Sekarang ...,3,185,3.54.23,2025-07-27 03:44:35,"Hai kak Doddy Limited, mohon maaf atas ketidak...",2025-07-27 04:47:07,3.54.23,saya sudah lama langganan spaylater sekarang t...,saya sudah lama langganan spaylater sekarang t...,"[saya, sudah, lama, langganan, spaylater, seka...","[langganan, spaylater, tagihannya, berasa, mah...",langganan spaylater tagihannya berasa mahal ce...,-20,negative
2,f77c35cc-a5a9-4c1c-ab4a-0c3e8f3af350,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"Setelah update ko malah ngawur,, buka Riwayat ...",3,144,3.54.23,2025-07-27 14:29:07,"Hai kak Ahmad Diemy, maaf ya udah buat kecewa....",2025-07-27 15:04:52,3.54.23,setelah update ko malah ngawur buka riwayat pe...,setelah update ko bahkan berbicara sembarangan...,"[setelah, update, ko, bahkan, berbicara, semba...","[update, ko, berbicara, sembarangan, buka, riw...",update ko berbicara sembarangan buka riwayat p...,-6,negative
3,4c41f0be-ca29-4bcb-b64f-8607948c752a,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,lama² saya kesel juga sm apl ini. awal² lancar...,1,218,3.54.23,2025-07-25 08:41:00,"Hi kak Purnama Sang555, maaf udah bikin ga nya...",2025-07-25 09:17:04,3.54.23,lama saya kesel juga sm apl ini awal lancar ma...,lama saya kesel juga sm apl ini awal lancar ma...,"[lama, saya, kesel, juga, sm, apl, ini, awal, ...","[kesel, sm, apl, lancar, kesini, parah, kali, ...",kesel sm apl lancar kesini parah kali pesanan ...,-8,negative
5,ad7aebea-ae73-4930-81e6-6973ffd94015,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"sedikit masukan untuk shopee. "" pada saat memb...",5,1987,3.54.23,2025-07-27 15:32:32,"Hai kak Tri Wiyono, maaf ya udah buat kecewa. ...",2025-07-27 16:08:08,3.54.23,sedikit masukan untuk shopee pada saat membuka...,sedikit masukan untuk shopee pada saat membuka...,"[sedikit, masukan, untuk, shopee, pada, saat, ...","[masukan, shopee, membuka, aplikasi, tolong, h...",masukan shopee membuka aplikasi tolong halaman...,-4,negative
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,7531431d-5e72-4832-9a8c-59af2cf60fc8,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"sebenarnya gamau kasih riview, karna apk nya b...",4,69,3.53.24,2025-07-07 09:18:16,"Hai kak aqila destriani, maaf banget ya udah b...",2025-07-07 10:14:49,3.53.24,sebenarnya gamau kasih riview karna apk nya ba...,sebenarnya gamau kasih riview karena apk nya b...,"[sebenarnya, gamau, kasih, riview, karena, apk...","[gamau, kasih, riview, apk, bagus, bagus, ming...",gamau kasih riview apk bagus bagus minggu buka...,-13,negative
103,9c33d0ef-b1aa-431b-80ac-2dfde6c8c59f,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Niat beli makanan hemat malah keluar duit 2x. ...,1,6,3.54.23,2025-07-27 14:45:26,"Hi Kakak Alifio Putra Tama (Alfi), mohon maaf ...",2025-07-27 15:11:07,3.54.23,niat beli makanan hemat malah keluar duit x vo...,niat beli makanan hemat bahkan keluar duit x v...,"[niat, beli, makanan, hemat, bahkan, keluar, d...","[niat, beli, makanan, hemat, duit, x, voucher,...",niat beli makanan hemat duit x voucher diskon ...,5,positive
104,9dbf7b73-55ac-48ac-818a-3fe517258c36,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,Semenjak di update setiap buka shopee tambah s...,4,205,3.54.23,2025-07-23 00:07:57,"Hi kak Khoriidah Hanuum, maaf bikin kecewa ter...",2025-07-23 01:04:11,3.54.23,semenjak di update setiap buka shopee tambah s...,semenjak di update setiap buka shopee tambah s...,"[semenjak, di, update, setiap, buka, shopee, t...","[semenjak, update, buka, shopee, suka, ngelag,...",semenjak update buka shopee suka ngelag buka s...,2,positive
105,361ac3eb-12bb-4748-a9f3-56179ab74328,Pengguna Google,https://play-lh.googleusercontent.com/EGemoI2N...,"belanja 3x bulan ini, datang barangnya kurang ...",3,43,3.54.23,2025-07-17 09:03:35,"Hai kak, maaf buat resah terkait pembayaran di...",2022-12-12 10:48:23,3.54.23,belanja x bulan ini datang barangnya kurang se...,belanja x bulan ini datang barangnya kurang se...,"[belanja, x, bulan, ini, datang, barangnya, ku...","[belanja, x, barangnya, pengembalian, dana, ha...",belanja x barangnya pengembalian dana harga ba...,-13,negative


In [101]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37836 entries, 0 to 49998
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   reviewId              37836 non-null  object        
 1   userName              37836 non-null  object        
 2   userImage             37836 non-null  object        
 3   content               37836 non-null  object        
 4   score                 37836 non-null  int64         
 5   thumbsUpCount         37836 non-null  int64         
 6   reviewCreatedVersion  37836 non-null  object        
 7   at                    37836 non-null  datetime64[ns]
 8   replyContent          37836 non-null  object        
 9   repliedAt             37836 non-null  datetime64[ns]
 10  appVersion            37836 non-null  object        
 11  text_clean            37836 non-null  object        
 12  text_slangwords       37836 non-null  object        
 13  text_tokenizing      

In [102]:
clean_df.to_csv("shopee_reviews.csv", index=False)