# GCash Review Sentiment Analysis

The goal of the "GCash App Review Sentiment Analysis" project is to develop a robust sentiment analysis system that accurately assesses user sentiments expressed in GCash app reviews, providing valuable insights to improve user experience and enhance the app's functionality.

The dataset comes from Kaggle:
[🇵🇭 GCash Google Store App Reviews](https://www.kaggle.com/datasets/bwandowando/globe-gcash-google-app-reviews)

* Data Preprocessing:

Clean and preprocess the collected data, including text normalization, removing noise, and handling missing or duplicate reviews.
* Sentiment Labeling:

Label the reviews with sentiment categories (e.g., positive, negative, neutral), either manually or with automated tools, to create a labeled dataset for supervised learning.
* Feature Extraction:


Prepare a presentation summarizing the project's findings, recommendations, and the impact of sentiment analysis on enhancing the GCash app's user experience.
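
As one way to bootstrap the sentiment labeling step above, star ratings can serve as weak labels. The sketch below is only an illustration: the thresholds (1-2 stars negative, 3 neutral, 4-5 positive), the `label_from_rating` helper, and the toy rows are assumptions, not part of the original notebook.

```python
import pandas as pd

def label_from_rating(rating: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label (assumed thresholds)."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

# Toy rows standing in for the real reviews
toy = pd.DataFrame({
    "review_text": ["very convenient to use..", "Unknown error occurred", "ok lang"],
    "review_rating": [5, 1, 3],
})
toy["sentiment"] = toy["review_rating"].apply(label_from_rating)
print(toy)
```

Labels derived this way are noisy (a 5-star review can still contain a complaint), so they are best treated as a starting point for manual review.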

# Installing Dependencies 

Install the dependencies with pip. In a notebook, run the following cell; in a terminal, drop the leading `%`:

In [2]:
%pip install pandas numpy scikit-learn tensorflow keras matplotlib seaborn spacy textblob wordcloud

Note: you may need to restart the kernel to use updated packages.





# Import Dependencies

In [2]:
# Import data analysis libraries
import pandas as pd
import numpy as np

# Import data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import text processing libraries
import spacy 
from textblob import TextBlob
from wordcloud import WordCloud



# Import Dataset

In [2]:
df = pd.read_csv("Dataset/gcash_review_dataset.zip", compression="zip")

## Get information about Dataset

### Show 5 rows of the dataset

To see what the rows look like:

In [3]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,review_text,review_rating,author_id,author_name,author_app_version,review_datetime_utc,review_likes
0,0,Works fine.. I like the graphics and layout.. ...,5,,A Google user,1.0.1.0,2012-03-26T05:49:59.000Z,4
1,1,"""Unknown error occurred"" always popping up! Ne...",1,,A Google user,1.0.0.0,2012-03-26T10:49:57.000Z,0
2,2,very convenient to use..,5,,A Google user,1.0.1.0,2012-05-08T03:32:34.000Z,0
3,3,"It would really be great if you add ""payable t...",4,,A Google user,1.0.1.0,2012-05-31T13:53:30.000Z,7
4,4,Its working fine with my motorola droid razr. ...,5,,A Google user,1.0.1.0,2012-06-20T13:38:43.000Z,1
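
The `Unnamed: 0` columns in the output above look like index columns written out by an earlier `to_csv` call. A minimal sketch of dropping them, using a toy frame in place of the real data (the `cleaned` name is only illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "review_text": ["Works fine..", "very convenient to use.."],
    "review_rating": [5, 5],
})

# Keep only the columns whose name does not start with "Unnamed"
cleaned = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]
print(cleaned.columns.tolist())
```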


### Show column info and data types

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 585275 entries, 0 to 585274
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Unnamed: 0           585275 non-null  int64 
 1   review_text          584879 non-null  object
 2   review_rating        585275 non-null  int64 
 3   author_id            585242 non-null  object
 4   author_name          585275 non-null  object
 5   author_app_version   456066 non-null  object
 6   review_datetime_utc  585275 non-null  object
 7   review_likes         585275 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 35.7+ MB
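
The summary above lists `review_datetime_utc` as `object`, meaning the timestamps are still plain strings. If time-based analysis is needed later, they could be parsed as sketched below; the toy frame reuses two timestamps from this notebook's outputs, and the conversion itself is a suggestion rather than a step in the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "review_datetime_utc": ["2012-03-26T05:49:59.000Z", "2023-11-06T07:42:43.000Z"],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) timestamps
toy["review_datetime_utc"] = pd.to_datetime(toy["review_datetime_utc"], utc=True)

print(toy.dtypes)
print(toy["review_datetime_utc"].dt.year.tolist())
```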


Check the shape of the dataset

In [5]:
print(f"""
Rows: {df.shape[0]}
Columns: {df.shape[1]}
""")


Rows: 585275
Columns: 8



# Data Exploration

This includes checking the classes and other categorical variables.

## Check the classes

In [8]:
print(df["author_app_version"].value_counts(normalize=True))

author_app_version
5.69.0    0.038391
5.50.0    0.036769
5.43.0    0.036670
5.42.0    0.032432
5.44.1    0.029950
            ...   
2.2.2     0.000002
2.3.1     0.000002
5.13.1    0.000002
5.58.0    0.000002
5.18.0    0.000002
Name: proportion, Length: 199, dtype: float64


Splitting the dataset by app version lets us see how the reviews are distributed across versions and how they change from one version to the next.

In [12]:
# List the distinct values of author_app_version before splitting on them
versions = df["author_app_version"].unique()
versions

array(['1.0.1.0', '1.0.0.0', nan, '1.3.0', '1.4.0', '1.4.1', '1.4.2',
       '1.5.2', '1.6.0', '1.7.0', '1.7.2', '1.7.3', '1.7.4', '1.7.7',
       '2.0.0', '2.0.1', '2.0.2', '2.0.3', '2.0.4', '2.0.5', '2.0.6',
       '2.0.7', '2.0.8', '2.0.10', '2.1.0', '2.1.1', '2.1.3', '2.1.4',
       '2.1.5', '2.2.0', '2.2.1', '2.2.2', '2.2.3', '2.2.4', '2.3.1',
       '2.3.3', '2.3.4', '2.3.5', '2.3.6', '2.3.7', '2.4.0', '2.4.1',
       '2.4.2', '2.4.3', '3.0.0', '3.0.1', '3.0.3', '3.0.4', '3.0.6',
       '3.0.7', '3.0.8', '3.0.9', '3.1.0', '3.1.1', '4.0.1', '4.0.3',
       '4.0.5', '4.1.0', '4.2.0', '4.2.1', '4.3.0', '4.4.0', '4.5.0',
       '4.6.0', '4.7.0', '4.7.1', '5.0.0', '5.1.0', '5.2.0', '5.2.1',
       '5.3.0', '5.4.1', '5.4.0', '5.5.0', '5.6.0', '5.7.0', '5.8.0',
       '5.9.0', '5.9.1', '5.9.2', '5.10.1', '5.11.0', '5.11.1', '5.11.2',
       '5.11.3', '5.12.0', '5.13.0', '5.13.1', '5.13.2', '5.14.0',
       '5.15.0', '5.15.1', '5.15.2', '5.15.3', '5.16.0', '5.16.1',
       '5.17.0', '5.1

Group the dataset by major version

For major versions 1 through 5, we can then see how the reviews are distributed.

In [25]:
# Escape the dot so that it matches a literal "." and not any character
dfv_1 = df[df["author_app_version"].str.contains(r'^1\.', na=False, regex=True)]
dfv_2 = df[df["author_app_version"].str.contains(r'^2\.', na=False, regex=True)]
dfv_3 = df[df["author_app_version"].str.contains(r'^3\.', na=False, regex=True)]
dfv_4 = df[df["author_app_version"].str.contains(r'^4\.', na=False, regex=True)]
dfv_5 = df[df["author_app_version"].str.contains(r'^5\.', na=False, regex=True)]
versions = [dfv_1, dfv_2, dfv_3, dfv_4, dfv_5]
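
A note on the version patterns: in a regular expression, an unescaped `.` matches any character, so a pattern such as `^1.` would also match a two-digit major version. The sketch below shows the difference with the standard `re` module; `10.2.0` is a made-up version string used only for illustration.

```python
import re

# "10.2.0" is hypothetical; the other two appear in the dataset
sample_versions = ["1.0.1.0", "10.2.0", "5.69.1"]

loose = [v for v in sample_versions if re.search(r"^1.", v)]    # dot matches any character
strict = [v for v in sample_versions if re.search(r"^1\.", v)]  # dot matches a literal "."

print(loose)   # ['1.0.1.0', '10.2.0']
print(strict)  # ['1.0.1.0']
```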

In [28]:
# Export each subset to both a zipped CSV and a pickle file for use in the next notebook
for i, version in enumerate(versions):
    version.to_csv(f"Dataset/gcash_review_dataset_v{i+1}.zip", compression="zip", index=False)
    version.to_pickle(f"Dataset/gcash_review_dataset_v{i+1}.pkl")

In [27]:
for i,version in enumerate(versions):
    print(f"{i+1} __ Rows: {version.shape[0]} Columns: {version.shape[1]}")
    

1 __ Rows: 43 Columns: 8
2 __ Rows: 786 Columns: 8
3 __ Rows: 695 Columns: 8
4 __ Rows: 1420 Columns: 8
5 __ Rows: 453122 Columns: 8


This means most of the reviews were written for version 5.x of the GCash app.
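
As an alternative to writing one filter per major version, the major version can be extracted into its own column and counted in a single pass. This is a sketch on toy data; the `major_version` column name is an assumption and does not exist in the original dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "author_app_version": ["1.0.1.0", "2.3.1", "5.69.0", "5.50.0", None],
})

# Capture the digits before the first dot; rows without a version stay missing
toy["major_version"] = toy["author_app_version"].str.extract(r"^(\d+)\.", expand=False)

# value_counts() skips missing values by default
counts = toy["major_version"].value_counts().sort_index()
print(counts)
```

The same column could then drive a `groupby("major_version")` if per-version subsets are needed.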

Next, check how many reviews fall under each rating.

In [9]:
review_rating = pd.DataFrame(df["review_rating"].value_counts())

In [10]:
review_rating["percentage"] = round(review_rating["count"] / len(df) * 100, 2)

In [11]:
review_rating

review_rating,count,percentage
5,321150,54.87
1,160982,27.51
4,40357,6.9
3,34238,5.85
2,28548,4.88


This shows that 54.87% of the reviews give the highest rating (5 stars) and 27.51% give the lowest (1 star).
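
The count and percentage columns can also be built in one step with `value_counts(normalize=True)`. The sketch below uses a small made-up series of ratings, so its numbers are not those of the real dataset.

```python
import pandas as pd

# Made-up ratings for illustration only
ratings = pd.Series([5, 5, 1, 4, 5, 1, 3, 2], name="review_rating")

summary = pd.DataFrame({
    "count": ratings.value_counts(),
    "percentage": (ratings.value_counts(normalize=True) * 100).round(2),
})
print(summary)
```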

### Checking the Review Text

Check for null values

In [12]:
print("Null values:",df["review_text"].isna().sum())

Null values: 396


Removing the null values

In [13]:
df.dropna(subset=["review_text"], inplace=True)
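
The project goals also mention duplicate reviews, which `dropna` does not cover. One possible follow-up, sketched on toy data, removes whitespace-only texts and repeated (author, text) pairs. Treating a repeated pair as a duplicate is an assumption; the original notebook does not define what counts as one.

```python
import pandas as pd

toy = pd.DataFrame({
    "author_id": ["a1", "a1", "b2", "c3"],
    "review_text": ["Good", "Good", "   ", "Good"],
})

# Drop missing or whitespace-only review texts
text = toy["review_text"]
toy = toy[text.notna() & text.str.strip().ne("")]

# Drop repeated (author, text) pairs but keep identical texts from different authors
toy = toy.drop_duplicates(subset=["author_id", "review_text"]).reset_index(drop=True)
print(toy)
```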

Load the reviews for the latest major version (5.x) of the GCash app

In [3]:
df_v5 = pd.read_pickle("Dataset/gcash_review_dataset_v5.pkl")

In [4]:
df_v5.tail(5)

Unnamed: 0.1,Unnamed: 0,review_text,review_rating,author_id,author_name,author_app_version,review_datetime_utc,review_likes
585268,585268,Hatdog,5,107165*********609479,A Google user,5.69.1,2023-11-06T07:39:03.000Z,0
585271,585271,So frustrating this app always says need to re...,2,116162*********376741,A Google user,5.69.3,2023-11-06T07:41:23.000Z,0
585272,585272,Every I open this app you need to upate kakainis,1,106498*********463524,A Google user,5.69.1,2023-11-06T07:41:56.000Z,0
585273,585273,This app is helpful 3yrs ago but sad to say it...,2,111783*********584598,A Google user,5.69.1,2023-11-06T07:42:26.000Z,0
585274,585274,Good,5,104224*********236759,A Google user,5.69.1,2023-11-06T07:42:43.000Z,0


Download the spaCy model for topic modelling

In [None]:
!python -m spacy download en_core_web_sm

Load the English and Tagalog stop-word lists and add a few extra Tagalog stop words

In [5]:
from spacy.lang.en.stop_words import STOP_WORDS as STOP_WORDS_EN
from spacy.lang.tl.stop_words import STOP_WORDS as STOP_WORDS_TL

additional_stopwords = set(['nyo', 'niyo', 'naman', 'yung', 'yun', 'yong'])
STOP_WORDS_TL |= additional_stopwords

In [6]:
def removing_stopwords(text):
    # Lowercase each word and keep it only if it is not an English or Tagalog stop word
    words = [word.lower() for word in text.split(" ")]
    return " ".join(word for word in words if word not in STOP_WORDS_EN and word not in STOP_WORDS_TL)

In [7]:
import re
def remove_special_characters(text):
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    return text

In [8]:
# Remove stop words first, then replace special characters with spaces
df_v5["tokens"] = df_v5["review_text"].astype(str).apply(removing_stopwords).apply(remove_special_characters)

In [9]:
df_v5.reset_index(drop=True, inplace=True)

In [10]:
df_v5["tokens"]

0                  says 3 globe tm numbers globe tm numbers
1         doesn t reply verification trying couple weeks...
2                                                  love it 
3              new version got error access account dislike
4         gcash pakiayos system madaming nangangailangan...
                                ...                        
453117                                               hatdog
453118    frustrating app says need restarts security me...
453119                         open app need upate kakainis
453120    app helpful 3yrs ago sad it s getting worst as...
453121                                                 good
Name: tokens, Length: 453122, dtype: object
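
Fragments such as `doesn t`, `it s`, and the trailing `it` in `love it` survive in the output above because stop words are removed before punctuation is stripped: a word like `it.` is not in the stop-word list until the period is gone. The sketch below compares the two orders using a tiny stand-in stop-word set (an assumption; the notebook uses spaCy's English and Tagalog lists) and hypothetical helper names.

```python
import re

# Tiny stand-in stop-word set for illustration
STOP_WORDS = {"it", "the", "is", "this"}

def strip_special(text: str) -> str:
    """Replace every run of non-alphanumeric characters with one space."""
    return re.sub(r"[^A-Za-z0-9]+", " ", text)

def drop_stopwords(text: str) -> str:
    """Lowercase the text and drop words found in STOP_WORDS."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)

review = "Love it."

stopwords_first = strip_special(drop_stopwords(review))    # order used in the notebook
punctuation_first = drop_stopwords(strip_special(review))  # alternative order

print(repr(stopwords_first))    # 'love it '
print(repr(punctuation_first))  # 'love'
```

Switching the order would change the tokens, and with them every downstream result shown in this notebook.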

In [11]:
sampled_text = df_v5.dropna().sample(1000)["tokens"]

In [12]:
sampled_text

143415                                                 good
109220    worst app use reliable crashes everytime stron...
108859    hope school id voters certificate verification...
449673    sana list valid id barangay id sobra hirap id ...
167370                                                 good
                                ...                        
314853                                                 good
336275    update app account loss wrong didn t remember ...
73746                                                 happy
40309                                                  like
374785                                       hard to follow
Name: tokens, Length: 1000, dtype: object

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
vectorizer = TfidfVectorizer() # Converts the text into TF-IDF vectors

In [17]:
X = vectorizer.fit_transform(sampled_text)

feature_names = vectorizer.get_feature_names_out()

X_array = X.toarray()

In [18]:
len(feature_names)

1551

In [1]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD

In [20]:
lda = LatentDirichletAllocation(n_components=8, max_iter=15, learning_method='online',verbose=True)
data_lda = lda.fit_transform(X)

iteration: 1 of max_iter: 15
iteration: 2 of max_iter: 15
iteration: 3 of max_iter: 15
iteration: 4 of max_iter: 15
iteration: 5 of max_iter: 15
iteration: 6 of max_iter: 15
iteration: 7 of max_iter: 15
iteration: 8 of max_iter: 15
iteration: 9 of max_iter: 15
iteration: 10 of max_iter: 15
iteration: 11 of max_iter: 15
iteration: 12 of max_iter: 15
iteration: 13 of max_iter: 15
iteration: 14 of max_iter: 15
iteration: 15 of max_iter: 15
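
LDA is formulated as a generative model over word counts, so a common alternative is to feed it raw term counts from `CountVectorizer` instead of TF-IDF weights. The sketch below illustrates that variant. The toy corpus, `n_components=2`, and the fixed `random_state` are assumptions, not settings from this notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus for illustration
docs = [
    "good app easy to use",
    "nice app very useful",
    "app crashes after update",
    "cannot login after update",
]

counts = CountVectorizer().fit_transform(docs)

# A fixed random_state makes the topic assignment repeatable between runs
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

print(doc_topics.shape)        # (number of documents, number of topics)
print(doc_topics.sum(axis=1))  # each row is a distribution over topics
```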


In [21]:
nmf = NMF(n_components=8)
data_nmf = nmf.fit_transform(X) 
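
The matrix returned by `fit_transform` has one row per review and one column per topic, so a review's dominant topic is the column with the largest weight. The sketch below shows this on a toy corpus; the `init`, `random_state`, and `max_iter` values are assumptions chosen to make the toy run repeatable.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration
docs = ["good app", "good service", "nice app", "nice wallet"]

toy_X = TfidfVectorizer().fit_transform(docs)

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
doc_topic = toy_nmf.fit_transform(toy_X)  # shape: (n_documents, n_topics)

# Index of the highest-weighted topic for each document
dominant_topic = np.argmax(doc_topic, axis=1)
print(dominant_topic)
```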

In [22]:
# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=8)
data_lsi = lsi.fit_transform(X)

In [27]:
# Print the top_n highest-weighted keywords for each topic
def selected_topics(model, vectorizer, top_n=8):
    # Look up the vocabulary once instead of once per keyword
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx+1))
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])
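
The slice `argsort()[:-top_n - 1:-1]` is compact but easy to misread: `argsort()` orders indices from the smallest weight to the largest, and the negative-step slice walks backwards from the end for `top_n` steps. A small NumPy sketch with made-up weights:

```python
import numpy as np

weights = np.array([0.1, 0.9, 0.3, 0.5])  # made-up topic weights
top_n = 2

order = weights.argsort()    # indices from smallest to largest weight
top = order[:-top_n - 1:-1]  # last top_n indices, largest weight first

print(order.tolist())  # [0, 2, 3, 1]
print(top.tolist())    # [1, 3]
```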

In [28]:
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)

LDA Model:
Topic 1:
[('nice', 75.63118513090986), ('app', 18.226274384450605), ('useful', 18.055818870262165), ('use', 8.5821190516108), ('easy', 5.326639910913145), ('wow', 4.225389787066876), ('application', 3.807223252477295), ('wallet', 3.717365414885776)]
Topic 2:
[('easy', 14.37708032403989), ('love', 12.765035932390946), ('awesome', 12.464659492889556), ('money', 9.607964108756018), ('it', 8.95694657229838), ('send', 7.120572020327362), ('load', 6.75237594461246), ('slow', 5.396162503863429)]
Topic 3:
[('ok', 27.50086867552988), ('happy', 10.625445783329697), ('best', 3.9543213427362143), ('apps', 2.030256884034176), ('thanks', 1.960546007773061), ('gcash', 1.5894763978406963), ('exeptional', 1.4717302365749152), ('gnda', 1.1156410430093893)]
Topic 4:
[('convenient', 13.678572267832644), ('excellent', 10.683542489313936), ('usefull', 4.81845311217723), ('reliable', 4.178778074331734), ('okay', 2.3403067460499596), ('updates', 2.2210620537624823), ('vg', 2.109623605687873), ('low

In [29]:
# Keywords for topics clustered by Non-Negative Matrix Factorization
print("NMF Model:")
selected_topics(nmf, vectorizer)

NMF Model:
Topic 1:
[('good', 3.2939847700036893), ('apps', 0.05995517633865084), ('it', 0.03534485307826172), ('service', 0.024376276300333396), ('application', 0.01831156117990022), ('verry', 0.014738998678006415), ('experience', 0.01301104636552158), ('far', 0.010921781616487505)]
Topic 2:
[('nice', 2.9122544429275963), ('apps', 0.14990385504636236), ('wallet', 0.03983019530532689), ('faster', 0.03516844702854418), ('wow', 0.030770963078818898), ('it', 0.03053868050851193), ('super', 0.025933890577824418), ('verry', 0.020032553274006853)]
Topic 3:
[('great', 2.390528698736892), ('convenient', 0.05407064066127419), ('experience', 0.049649967953861306), ('super', 0.04173488641571222), ('thanks', 0.04050155662733686), ('bills', 0.03767873091038058), ('awesome', 0.03708285290579865), ('ever', 0.031086402156141026)]
Topic 4:
[('ok', 2.2295515749780828), ('nmn', 0.053591193540858624), ('super', 0.0473875022811581), ('100', 0.04349429435260936), ('little', 0.040836138230918084), ('xa', 0.0

In [30]:
# Keywords for topics clustered by Latent Semantic Indexing
print("LSI Model:")
selected_topics(lsi, vectorizer)

LSI Model:
Topic 1:
[('good', 0.9974401370454636), ('app', 0.05711501439577938), ('apps', 0.0217192604787404), ('nice', 0.019041585783440184), ('it', 0.017071570307956713), ('use', 0.01021884607956934), ('easy', 0.008971198853437706), ('service', 0.008024773326834097)]
Topic 2:
[('nice', 0.973258339650495), ('app', 0.21457252193072912), ('apps', 0.052575774036984116), ('it', 0.020337953591376693), ('use', 0.015289930139781712), ('easy', 0.014135785468331566), ('gcash', 0.014061810417448067), ('great', 0.013531982140048875)]
Topic 3:
[('great', 0.9646682392553954), ('app', 0.23459359941281585), ('convenient', 0.03594730787256112), ('awesome', 0.0331202239878512), ('open', 0.021429506651308307), ('gcash', 0.0210761959770177), ('bills', 0.02017667562617699), ('experience', 0.01991938040745996)]
Topic 4:
[('ok', 0.9981859118816803), ('nmn', 0.02413585440435502), ('super', 0.020969452043634953), ('100', 0.019479860084869013), ('gcash', 0.01879181232614891), ('little', 0.018279089483515615),