# **🧠 Real-Time Sentiment Analysis on Malaysian E-Wallet Reviews**

---





**📘 Subject Details**  
**Subject Code:** SECP3133  
**Subject Name:** HIGH PERFORMANCE DATA PROCESSING  
**Session-Sem:** 24/25-2  

---

**👥 Group B - Data Drillers**

| No. | Name                                  | Matric No       |
|-----|---------------------------------------|-----------------|
| 1   | MUHAMMAD ANAS BIN MOHD PIKRI         | A21SC0464       |
| 2   | MULYANI BINTI SARIPUDDIN             | A22EC0223       |
| 3   | ALIATUL IZZAH BINTI JASMAN           | A22EC0136       |
| 4   | THEVAN RAJU A/L JEGANATH             | A22EC0286       |

  - This Colab notebook is part of a real-time big data analytics project focused on sentiment analysis of popular Malaysian e-wallet applications using user reviews from the Google Play Store.

  - The goal is to scrape, clean, and label user reviews (in Bahasa Melayu) from selected e-wallet apps, preparing them for ingestion into a real-time pipeline.

## **📱 Apps Covered**
1. Touch 'n Go (my.com.tngdigital.ewallet)

2. Boost (my.com.myboost)

3. GrabPay (com.grabtaxi.passenger)

4. Setel (com.setel.mobile)

5. ShopeePay (com.shopeepay.my)

## **Google Play Scraping**

### 📦 1. Install google-play-scraper

Installs the library used to scrape app details and user reviews from the Google Play Store; no API key is required.

In [1]:
pip install google-play-scraper


Collecting google-play-scraper
  Downloading google_play_scraper-1.2.7-py3-none-any.whl.metadata (50 kB)
Downloading google_play_scraper-1.2.7-py3-none-any.whl (28 kB)
Installing collected packages: google-play-scraper
Successfully installed google-play-scraper-1.2.7


### 🔗 2. Connect to Google Drive
Mounts Google Drive to save scraped review data and ensure persistent storage across sessions.

In [2]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Set Drive output path
drive_folder = '/content/drive/MyDrive/Project2'  # Change folder name if needed
os.makedirs(drive_folder, exist_ok=True)

OUTPUT_CSV = os.path.join(drive_folder, 'e_wallet_reviews.csv')


Mounted at /content/drive
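
The Drive path used above exists only inside Colab. A minimal sketch of a fallback for running the notebook elsewhere follows; the helper names `running_in_colab` and `resolve_output_dir` and the local folder name `output` are assumptions made for this example and are not part of the original notebook.

```python
import os


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False


def resolve_output_dir(drive_dir="/content/drive/MyDrive/Project2",
                       local_dir="output", in_colab=None):
    """Pick the Drive folder in Colab, otherwise a local folder (hypothetical helper).

    In Colab, Drive must already be mounted with drive.mount('/content/drive').
    """
    if in_colab is None:
        in_colab = running_in_colab()
    folder = drive_dir if in_colab else local_dir
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    return folder

# Possible use in place of the hard-coded folder:
# OUTPUT_CSV = os.path.join(resolve_output_dir(), "e_wallet_reviews.csv")
```

Passing `in_colab` explicitly makes the choice easy to override, for example when testing the scraper on a local machine.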


### 🌐 3. Install langdetect

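Installs the langdetect library for automatic language detection of text. It ships profiles for 55 languages; Malay (`ms`) is not among them, so Malay text is usually reported as the closely related Indonesian (`id`). The scraping cell below selects Malay reviews through `lang='ms'` rather than through langdetect.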
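
A minimal sketch of how a language detector could be used to filter reviews after scraping. The function `keep_language` and the stand-in `toy_detect` are hypothetical names for this example; the detector is passed in as an argument so the sketch runs without any third-party package. Accepting `id` alongside `ms` is an assumption, since Malay and Indonesian text is often assigned the same code.

```python
from typing import Callable, Iterable, List


def keep_language(texts: Iterable[str],
                  detect: Callable[[str], str],
                  accepted=frozenset({"ms", "id"})) -> List[str]:
    """Keep only texts whose detected language code is in `accepted`."""
    kept = []
    for text in texts:
        try:
            code = detect(text)
        except Exception:
            # Detectors such as langdetect raise on empty or symbol-only text.
            continue
        if code in accepted:
            kept.append(text)
    return kept


def toy_detect(text: str) -> str:
    """Stand-in detector for demonstration only; not a real language model."""
    if not text.strip():
        raise ValueError("no features in text")
    return "en" if "the" in text.lower().split() else "id"


sample = ["Aplikasi ini sangat bagus", "The app keeps crashing", "   "]
malay_like = keep_language(sample, toy_detect)
print(malay_like)
```

With the real library, `from langdetect import detect` can be passed in place of `toy_detect`. langdetect results can vary between runs unless `DetectorFactory.seed` is set beforehand.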



In [3]:
pip install langdetect


Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Buil

### 🗃️ 4. Scrape E-Wallet Reviews from Google Play (Bahasa Melayu)

This cell collects user reviews, newest first, for the five e-wallet apps using google-play-scraper, requesting Malay-language reviews with `lang='ms'`.
Each review is labeled from its star rating (4 to 5 is positive, 3 is neutral, 1 to 2 is negative) and appended to `e_wallet_reviews.csv` on Google Drive.


In [4]:
from google_play_scraper import Sort, reviews
import pandas as pd
import os
import time
from datetime import datetime
from google.colab import drive

# Step 1: Mount Google Drive
drive.mount('/content/drive')

# Step 2: Define your e-wallet app packages
apps = {
    'TouchNGo': 'my.com.tngdigital.ewallet',
    'Boost': 'my.com.myboost',
    'Grab': 'com.grabtaxi.passenger',
    'Setel': 'com.setel.mobile',
    'Shopee': 'com.shopeepay.my',
}

# Step 3: Set Drive output file path
drive_folder = '/content/drive/MyDrive/Project2'
os.makedirs(drive_folder, exist_ok=True)
OUTPUT_CSV = os.path.join(drive_folder, 'e_wallet_reviews.csv')

# Step 4: Sentiment labeling function
def label_sentiment(score):
    if score >= 4:
        return 'positive'
    elif score == 3:
        return 'neutral'
    else:
        return 'negative'

# Step 5: Start scraping
start_time = time.time()
total_reviews = 0

# Clear old file on start
if os.path.exists(OUTPUT_CSV):
    os.remove(OUTPUT_CSV)

print("üîç Starting e-wallet reviews scraping (BM only via lang='ms')...\n")

for app_name, package in apps.items():
    print(f"üì± Scraping: {app_name} ({package})")
    continuation_token = None
    batch_count = 0
    app_reviews = 0
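    # Write the CSV header only once: when the output file does not exist yet.
    # Resetting this to True for every app would repeat the header mid-file.
    is_first_batch = not os.path.exists(OUTPUT_CSV)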

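    failed_attempts = 0  # consecutive failed requests for this app

    while True:
        try:
            result, continuation_token = reviews(
                package,
                lang='ms',        # ✅ Only Bahasa Melayu
                country='my',
                sort=Sort.NEWEST,
                count=200,
                continuation_token=continuation_token
            )
            failed_attempts = 0
        except Exception as e:
            failed_attempts += 1
            print(f"❌ Error scraping {app_name}: {e}")
            if failed_attempts >= 3:  # limit of 3 is arbitrary; avoids retrying forever
                break
            time.sleep(2)
            continue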

        if not result:
            break

        batch_count += 1
        print(f"   üì¶ Batch {batch_count}: {len(result)} reviews")

        rows = []
        for r in result:
            review_content = r.get('content')
            review_content = review_content.strip() if review_content else ""

            rows.append({
                'app': app_name,
                'username': r['userName'],
                'review': review_content,
                'rating': r['score'],
                'sentiment': label_sentiment(r['score']),
                'date': r['at'].strftime('%Y-%m-%d %H:%M:%S')
            })

        if rows:
            df = pd.DataFrame(rows)
            df.to_csv(OUTPUT_CSV, mode='a', header=is_first_batch, index=False, encoding='utf-8-sig')
            is_first_batch = False

            print(f"   üíæ Saved {len(rows)} reviews to file.")


            total_reviews += len(rows)
            app_reviews += len(rows)

        if continuation_token is None:
            break

        time.sleep(1)

    print(f"✅ Finished {app_name}: {app_reviews} reviews collected\n")

# End timer
end_time = time.time()
duration = end_time - start_time
print(f"üéâ Scraping complete. Total reviews collected: {total_reviews}")
print(f"‚è±Ô∏è Total execution time: {duration:.2f} seconds")
print(f"üìÑ All data saved to: {OUTPUT_CSV}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
🔍 Starting e-wallet reviews scraping (BM only via lang='ms')...

📱 Scraping: TouchNGo (my.com.tngdigital.ewallet)
   📦 Batch 1: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 2: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 3: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 4: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 5: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 6: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 7: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 8: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 9: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 10: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 11: 200 reviews
   💾 Saved 200 reviews to file.
   📦 Batch 12: 200 reviews
   💾 Saved 2
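
The scraping loop retries a failed request after a short fixed pause. A minimal sketch of a bounded retry with exponential backoff follows; `fetch_with_retry`, the attempt limit, and the delay values are illustrative assumptions rather than the notebook's own code, and the `sleep` parameter exists only so the demonstration runs without waiting.

```python
import time


def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Demonstration with a fake request that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return ["review batch"]


waits = []  # records the requested delays instead of actually sleeping
batch = fetch_with_retry(flaky_fetch, sleep=waits.append)
```

In the scraping loop, `fetch` could be a `lambda` that wraps the existing `reviews(...)` call with the current `continuation_token`, so a persistent error ends the loop for that app instead of repeating indefinitely.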

## **🧹 Data Cleaning & Preprocessing**
This section:

- Removes empty reviews and emojis
- Cleans text by lowercasing it and removing punctuation, digits, and stopwords (NLTK's English list for now)
- Converts the date column and drops rows whose date cannot be parsed
- Saves the final cleaned dataset to Google Drive as `cleaned_reviews.csv`

In [5]:
# ✅ Install required libraries
!pip install emoji

# ✅ Download necessary NLTK data
import nltk
nltk.download('stopwords')

# ✅ Import libraries
import pandas as pd
import re
import emoji
from nltk.corpus import stopwords
from google.colab import drive
import os

# ✅ Step 1: Mount Google Drive
drive.mount('/content/drive')

# ✅ Step 2: Load original BM-only reviews CSV
input_path = '/content/drive/MyDrive/Project2/e_wallet_reviews.csv'
df = pd.read_csv(input_path)

# ✅ Step 3: Drop empty reviews
df.dropna(subset=['review'], inplace=True)

# ✅ Step 4: Get NLTK stopwords (English for now, can customize later)
nltk_stopwords = set(stopwords.words('english'))  # Replace with BM stopwords source if needed

# ✅ Step 5: Cleaning function using NLTK stopwords
def clean_text(text):
    text = text.lower()
    text = emoji.replace_emoji(text, replace='')                    # Remove emojis
    text = re.sub(r'[^a-zA-Z\s]', '', text)                         # Remove punctuation/digits
    text = re.sub(r'\s+', ' ', text).strip()                        # Normalize whitespace
    words = text.split()
    filtered_words = [word for word in words if word not in nltk_stopwords]
    return ' '.join(filtered_words)

# ✅ Step 6: Apply cleaning
df['review'] = df['review'].astype(str).apply(clean_text)

# ✅ Step 7: Clean and convert date format
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# ✅ Step 8: Remove rows with empty review or missing date
df = df[df['review'].str.strip() != '']
df = df[df['date'].notna()]

# ✅ Step 9: Keep only relevant columns
df = df[['app', 'review', 'rating', 'sentiment', 'date']]

# ✅ Step 10: Save cleaned data
output_path = '/content/drive/MyDrive/Project2/cleaned_reviews.csv'
df.to_csv(output_path, index=False, encoding='utf-8-sig')

# ✅ Summary
print(f"✅ Cleaned data saved to Google Drive: {output_path}")
print(f"🧾 Final row count: {len(df)}")


Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Cleaned data saved to Google Drive: /content/drive/MyDrive/Project2/cleaned_reviews.csv
🧾 Final row count: 100779
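
Step 4 of the cleaning cell uses NLTK's English stopword list as a placeholder. The sketch below applies the same cleaning steps with a Malay list swapped in. The eight words in `BM_STOPWORDS` are a small illustrative sample chosen for this example, not a complete or authoritative list, and `clean_text_bm` is a hypothetical name.

```python
import re

# Illustrative sample only; a real run would load a fuller Malay stopword list.
BM_STOPWORDS = {"yang", "dan", "di", "ini", "itu", "untuk", "dengan", "saya"}


def clean_text_bm(text, stopwords=BM_STOPWORDS):
    """Lowercase, keep ASCII letters and whitespace only, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # removes digits, punctuation, and emojis
    words = text.split()                   # split() also collapses repeated whitespace
    return " ".join(word for word in words if word not in stopwords)


cleaned = clean_text_bm("Aplikasi ini SANGAT bagus!!! 😀 123")
print(cleaned)
```

Because the regular expression keeps only ASCII letters, it already strips emojis in this sketch, so the separate `emoji` call is not needed here.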
