# 01 — Data Scraping & Corpus Construction

## Project Title
Emotion-Aware Comparative User Segmentation of Indonesian E-Wallet Applications Using NLP, Topic Modeling, and Unsupervised Learning

## Objective
- Collect user reviews of Indonesian e-wallet apps from the Google Play Store
- Build a multi-application text corpus for emotion, sentiment, and topic analysis
- Provide a structured dataset for unsupervised learning approaches
- Ensure dataset consistency across applications

## Output
- raw_e_wallet_reviews.csv


###### GOOGLE DRIVE MOUNT

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###### IMPORT LIBRARIES

In [3]:
!pip install google-play-scraper

Collecting google-play-scraper
  Downloading google_play_scraper-1.2.7-py3-none-any.whl.metadata (50 kB)
Downloading google_play_scraper-1.2.7-py3-none-any.whl (28 kB)
Installing collected packages: google-play-scraper
Successfully installed google-play-scraper-1.2.7


In [4]:
# Core libraries
import pandas as pd
import numpy as np
import os
import time
import random
from datetime import datetime

# Scraping library
from google_play_scraper import reviews, Sort

###### PROJECT PATH CONFIGURATION

In [5]:
BASE_PATH = "/content/drive/MyDrive/ewallet_nlp_clustering_project"

NOTEBOOK_PATH = f"{BASE_PATH}/notebooks"
RAW_DATA_PATH = f"{BASE_PATH}/data/raw"
PROCESSED_DATA_PATH = f"{BASE_PATH}/data/processed"
OUTPUT_PATH = f"{BASE_PATH}/outputs"

# Safety check
for path in [NOTEBOOK_PATH, RAW_DATA_PATH, PROCESSED_DATA_PATH, OUTPUT_PATH]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Path not found: {path}")
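
The safety check stops the notebook when any project folder is missing, which is what happens on a fresh Drive. A minimal sketch of a more forgiving variant, assuming it is acceptable to create missing folders automatically. `ensure_dirs` is a hypothetical helper that is not part of the original notebook, and the demo runs against a temporary directory instead of the real Drive path:

```python
import os
import tempfile

def ensure_dirs(paths):
    """Create each directory if missing and return the ones that were created."""
    created = []
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created

# Throwaway base directory standing in for BASE_PATH (assumption for the demo).
demo_base = tempfile.mkdtemp()
demo_paths = [
    os.path.join(demo_base, "notebooks"),
    os.path.join(demo_base, "data", "raw"),
    os.path.join(demo_base, "data", "processed"),
    os.path.join(demo_base, "outputs"),
]
created_first = ensure_dirs(demo_paths)
created_second = ensure_dirs(demo_paths)  # second call finds everything in place
print(len(created_first), len(created_second))
```

Whether to fail fast or to create folders is a design choice: failing fast catches a mistyped `BASE_PATH`, while auto-creation makes a first run smoother.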

###### SCRAPING STRATEGY
- Data source: Google Play Store
- Language: Indonesian (id)
- Country: Indonesia (id)
- Review order: NEWEST (pagination backward)
- Time span: multi-year (not limited to current year)
- Total target: 15,000 reviews
- Distribution: 5,000 reviews per application (equal share across the three apps)
- Resource usage: CPU-only, rate-limited


###### APPLICATION CONFIGURATION

In [6]:
APPS_CONFIG = {
    "DANA": {
        "app_id": "id.dana",
        "target_reviews": 5000
    },
    "OVO": {
        "app_id": "ovo.id",
        "target_reviews": 5000
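    },
    "GoPay": {
        # NOTE: com.gojek.app is the Gojek super-app package, so its reviews
        # may also cover services other than GoPay.
        "app_id": "com.gojek.app",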
        "target_reviews": 5000
    }
}
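
The per-app targets are meant to add up to the 15,000-review total stated in the scraping strategy. A small sketch of a pre-flight check that could catch a mistyped target or a repeated package ID before any request is made. `validate_apps_config` is a hypothetical helper, and `demo_config` simply mirrors the values shown above:

```python
def validate_apps_config(apps_config, expected_total):
    """Raise ValueError if package IDs repeat or targets do not add up."""
    app_ids = [cfg["app_id"] for cfg in apps_config.values()]
    if len(set(app_ids)) != len(app_ids):
        raise ValueError("Duplicate app_id in configuration")
    total = sum(cfg["target_reviews"] for cfg in apps_config.values())
    if total != expected_total:
        raise ValueError(f"Targets sum to {total}, expected {expected_total}")
    return total

demo_config = {
    "DANA": {"app_id": "id.dana", "target_reviews": 5000},
    "OVO": {"app_id": "ovo.id", "target_reviews": 5000},
    "GoPay": {"app_id": "com.gojek.app", "target_reviews": 5000},
}
total_target = validate_apps_config(demo_config, expected_total=15000)
print(total_target)
```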

###### CORE SCRAPING FUNCTION

In [7]:
def scrape_app_reviews(app_name, app_id, target_reviews):
    """
    Scrape Google Play Store reviews using pagination.
    Returns a DataFrame with standardized schema.
    """

    collected_reviews = []
    continuation_token = None

    while len(collected_reviews) < target_reviews:
        result, continuation_token = reviews(
            app_id,
            lang="id",
            country="id",
            sort=Sort.NEWEST,
            count=200,
            continuation_token=continuation_token
        )

        if not result:
            break

        for r in result:
            collected_reviews.append({
                "review_id": r.get("reviewId"),
                "app_name": app_name,
                "review_text": r.get("content"),
                "rating": r.get("score"),
                "review_date": r.get("at"),
                "app_version": r.get("reviewCreatedVersion"),
                "thumbs_up": r.get("thumbsUpCount"),
                "reviewer_name": r.get("userName")
            })

        if continuation_token is None:
            break

        # Polite scraping to avoid rate limit
        time.sleep(random.uniform(0.8, 1.5))

    df = pd.DataFrame(collected_reviews)
    return df.iloc[:target_reviews]
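
The pagination loop has no handling for transient network failures, so a single failed request would abort the whole run. A minimal sketch of a generic retry wrapper with exponential backoff, offered as one possible safeguard. `call_with_retries` is a hypothetical helper, and catching the broad `Exception` is an assumption, since the notebook does not list which errors the scraper raises:

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))

# Demo with a fake fetcher that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return ["review batch"]

delays = []
# Record the requested delays instead of actually sleeping.
batch = call_with_retries(flaky_fetch, sleep=delays.append)
print(batch, delays)
```

In the notebook, the wrapped callable would be a small function or lambda around the existing `reviews(...)` call, leaving the pagination logic unchanged.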

###### SCRAPING EXECUTION PIPELINE

In [8]:
all_reviews = []

for app_name, config in APPS_CONFIG.items():
    print(f"Scraping reviews for {app_name}...")

    df_app = scrape_app_reviews(
        app_name=app_name,
        app_id=config["app_id"],
        target_reviews=config["target_reviews"]
    )

    print(f"Collected {len(df_app)} reviews from {app_name}")
    all_reviews.append(df_app)

raw_reviews_df = pd.concat(all_reviews, ignore_index=True)

print(f"\nTotal reviews collected: {len(raw_reviews_df)}")

Scraping reviews for DANA...
Collected 5000 reviews from DANA
Scraping reviews for OVO...
Collected 5000 reviews from OVO
Scraping reviews for GoPay...
Collected 5000 reviews from GoPay

Total reviews collected: 15000
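
A rough request budget for this run can be derived from the batch size (`count=200`) and the sleep range (0.8 to 1.5 seconds) used in the scraping function. The sketch below assumes every paginated call returns a full batch and that the pause is taken once per page; `estimate_requests` is a hypothetical helper:

```python
import math

def estimate_requests(target_reviews, batch_size):
    """Number of paginated calls needed if every call returns a full batch."""
    return math.ceil(target_reviews / batch_size)

pages_per_app = estimate_requests(5000, 200)   # 5,000 reviews in batches of 200
total_pages = pages_per_app * 3                # three applications
mean_sleep = (0.8 + 1.5) / 2                   # midpoint of the random pause
total_sleep_seconds = total_pages * mean_sleep
print(pages_per_app, total_pages, round(total_sleep_seconds, 2))
```

Under these assumptions the deliberate pauses add up to roughly a minute and a half; network latency comes on top of that.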


###### TECHNICAL DATA VALIDATION

In [9]:
raw_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   review_id      15000 non-null  object        
 1   app_name       15000 non-null  object        
 2   review_text    15000 non-null  object        
 3   rating         15000 non-null  int64         
 4   review_date    15000 non-null  datetime64[ns]
 5   app_version    11796 non-null  object        
 6   thumbs_up      15000 non-null  int64         
 7   reviewer_name  15000 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 937.6+ KB


In [10]:
raw_reviews_df.isnull().sum()

review_id           0
app_name            0
review_text         0
rating              0
review_date         0
app_version      3204
thumbs_up           0
reviewer_name       0
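
Only `app_version` has gaps: 3,204 of 15,000 rows, or 21.36%. It may be worth checking whether the gaps concentrate in one application. A sketch of that breakdown on a toy frame; the rows in `demo` are made-up illustration values, not data from the scrape:

```python
import pandas as pd

demo = pd.DataFrame({
    "app_name": ["DANA", "DANA", "OVO", "OVO", "GoPay", "GoPay"],
    "app_version": ["2.1.0", None, None, None, "4.5.1", "4.5.2"],
})

# Share of rows without an app version, overall and per application.
overall_missing = demo["app_version"].isna().mean()
missing_by_app = demo["app_version"].isna().groupby(demo["app_name"]).mean()
print(overall_missing)
print(missing_by_app)
```

The same two lines applied to `raw_reviews_df` would show how the missing versions are distributed across DANA, OVO, and GoPay.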


In [11]:
raw_reviews_df["app_name"].value_counts()

app_name  count
DANA       5000
OVO        5000
GoPay      5000


###### DATA TYPE NORMALIZATION

In [12]:
raw_reviews_df["review_date"] = pd.to_datetime(raw_reviews_df["review_date"])
raw_reviews_df["rating"] = raw_reviews_df["rating"].astype(int)
raw_reviews_df["thumbs_up"] = raw_reviews_df["thumbs_up"].fillna(0).astype(int)
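
With `review_date` stored as a datetime, the time span actually covered by each application can be measured rather than assumed. A sketch on a toy frame; the dates in `demo` are invented for illustration only:

```python
import pandas as pd

demo = pd.DataFrame({
    "app_name": ["DANA", "DANA", "OVO", "OVO"],
    "review_date": pd.to_datetime(
        ["2024-01-05", "2024-03-01", "2022-06-10", "2024-02-20"]
    ),
})

# Earliest and latest review per app, plus the span in days.
coverage = demo.groupby("app_name")["review_date"].agg(["min", "max"])
coverage["span_days"] = (coverage["max"] - coverage["min"]).dt.days
print(coverage)
```

Because the scrape takes the newest reviews first, a high-volume app may reach its 5,000-review target within a much shorter window than a low-volume one, so the spans can differ noticeably between apps.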

###### DUPLICATE & EMPTY REVIEW HANDLING

In [13]:
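# Drop repeated review IDs, then rows whose text is missing or whitespace-only
raw_reviews_df = raw_reviews_df.drop_duplicates(subset="review_id")
has_text = raw_reviews_df["review_text"].fillna("").str.strip() != ""
raw_reviews_df = raw_reviews_df[has_text].reset_index(drop=True)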
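
One subtlety in this step: comparing a missing value with `""` yields `True`, so a plain `!= ""` filter keeps rows whose text is `NaN`. A sketch of a filter that treats missing and whitespace-only text the same way and reports how many rows were removed. `drop_duplicate_and_empty` is a hypothetical helper and the rows in `demo` are made up:

```python
import pandas as pd

def drop_duplicate_and_empty(df):
    """Return rows with a unique review_id and non-blank review_text."""
    out = df.drop_duplicates(subset="review_id")
    text = out["review_text"].fillna("").astype(str).str.strip()
    return out[text != ""].reset_index(drop=True)

demo = pd.DataFrame({
    "review_id": ["a", "a", "b", "c", "d"],
    "review_text": ["bagus", "bagus", "   ", None, "lambat"],
})
cleaned = drop_duplicate_and_empty(demo)
removed = len(demo) - len(cleaned)
print(cleaned["review_id"].tolist(), removed)
```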

###### DATASET SCHEMA ENFORCEMENT

In [14]:
EXPECTED_COLUMNS = [
    "review_id",
    "app_name",
    "review_text",
    "rating",
    "review_date",
    "app_version",
    "thumbs_up",
    "reviewer_name"
]

assert list(raw_reviews_df.columns) == EXPECTED_COLUMNS, "Unexpected column set or order"
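
A bare `assert` only says that the schema differs, not how. A sketch of a slightly more informative check that separates missing columns, unexpected columns, and ordering problems. `check_schema` is a hypothetical helper, and the column lists in the demo are shortened for illustration (`score_raw` is an invented name):

```python
def check_schema(actual_columns, expected_columns):
    """Return (missing, unexpected, order_ok) for a column list."""
    missing = [c for c in expected_columns if c not in actual_columns]
    unexpected = [c for c in actual_columns if c not in expected_columns]
    order_ok = list(actual_columns) == list(expected_columns)
    return missing, unexpected, order_ok

expected = ["review_id", "app_name", "review_text", "rating"]
actual = ["review_id", "app_name", "rating", "score_raw"]
missing, unexpected, order_ok = check_schema(actual, expected)
print(missing, unexpected, order_ok)
```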

###### SAVE RAW DATASET

In [15]:
output_file = f"{RAW_DATA_PATH}/raw_e_wallet_reviews.csv"
raw_reviews_df.to_csv(output_file, index=False)

print(f"Dataset successfully saved to:\n{output_file}")
print(f"Final total reviews: {len(raw_reviews_df)}")

Dataset successfully saved to:
/content/drive/MyDrive/ewallet_nlp_clustering_project/data/raw/raw_e_wallet_reviews.csv
Final total reviews: 15000
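
CSV does not store column types, so `review_date` comes back as plain text unless it is parsed again on load. A sketch of a read-back check, using an in-memory buffer in place of the file on Drive and a toy frame with invented rows:

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "review_id": ["a", "b"],
    "review_date": pd.to_datetime(["2024-01-05 10:00:00", "2024-03-01 08:30:00"]),
    "rating": [5, 1],
})

buffer = io.StringIO()  # stands in for raw_e_wallet_reviews.csv
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format drops.
reloaded = pd.read_csv(buffer, parse_dates=["review_date"])
print(len(reloaded) == len(demo))
print(reloaded["review_date"].dtype)
```

Downstream notebooks that load `raw_e_wallet_reviews.csv` would need the same `parse_dates` argument, or an explicit `pd.to_datetime` call, before doing any time-based analysis.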


## Scraping Summary

- Total reviews collected: 15,000
- Applications: DANA, OVO, GoPay
- Language: Indonesian
- Source: Google Play Store
- Time span: Multi-year
- Schema: Validated & standardized
- Output file: raw_e_wallet_reviews.csv

This dataset serves as the foundational corpus for subsequent NLP tasks:
sentiment analysis, emotion detection, topic modeling, and clustering.