# ``GCash Review Sentiment Analysis``

The goal of the "GCash App Review Sentiment Analysis" project is to develop a robust sentiment analysis system that accurately assesses user sentiments expressed in GCash app reviews, providing valuable insights to improve user experience and enhance the app's functionality.

The dataset comes from Kaggle:
[🇵🇭 GCash Google Store App Reviews](https://www.kaggle.com/datasets/bwandowando/globe-gcash-google-app-reviews)

* Data Preprocessing: Clean and preprocess the collected data, including text normalization, removing noise, and handling missing or duplicate reviews.
* Sentiment Labeling: Label the reviews with sentiment categories (e.g., positive, negative, neutral), either manually or with automated tools, to create a labeled dataset for supervised learning.
* Feature Extraction: Extract relevant features from the review texts, such as word embeddings, n-grams, and sentiment-related features.
* Model Selection: Choose an appropriate machine learning or deep learning model for sentiment analysis, such as a transformer-based model like BERT or a traditional model like Naive Bayes or a Support Vector Machine.
* Model Training: Train the selected model on the labeled dataset, optimizing hyperparameters and assessing model performance through cross-validation.
* Sentiment Analysis: Apply the trained model to a new set of GCash app reviews, providing sentiment scores and classifications for each review.
* Evaluation: Evaluate the model's performance using metrics like accuracy, precision, recall, and F1-score, and make necessary adjustments to improve accuracy.
* Visualization: Create visualizations, such as sentiment distribution charts, to present the sentiment analysis results in a clear and understandable format.
* Insights and Reporting: Generate insights and recommendations based on the sentiment analysis results, highlighting areas of improvement and user concerns.
* Continuous Monitoring: Implement a system for ongoing sentiment analysis to continuously track and respond to changes in user sentiment over time.
* Feedback Integration: Collaborate with the GCash development team to integrate feedback and suggestions gathered from the sentiment analysis into app improvements and updates.
* Documentation: Create comprehensive documentation of the sentiment analysis methodology, data sources, model details, and results for future reference.
* Presentation: Prepare a presentation summarizing the project's findings, recommendations, and the impact of sentiment analysis on enhancing the GCash app's user experience.
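
The preprocessing and automated labeling steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's actual implementation: the helper names `normalize_text` and `rating_to_sentiment` are hypothetical, the sample reviews are toy data, and the thresholds (1-2 stars negative, 3 neutral, 4-5 positive) are one possible way to let star ratings stand in for sentiment labels.

```python
import re
from typing import Optional


def normalize_text(text: Optional[str]) -> str:
    """Lowercase, strip URLs and punctuation, and collapse whitespace."""
    if text is None:
        return ""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    # Keep only ASCII letters, digits and whitespace. Note that this also
    # removes emoji and accented characters, which may matter for
    # non-English reviews.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a coarse label (assumed thresholds)."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


# Toy (text, rating) pairs used only to exercise the helpers.
reviews = [
    ("Works fine.. I like the graphics and layout..", 5),
    ('"Unknown error occurred" always popping up!', 1),
    ("very convenient to use..", 5),
    ("Okay but sometimes slow", 3),
]

labeled = [
    {"text": normalize_text(text), "sentiment": rating_to_sentiment(rating)}
    for text, rating in reviews
]

for row in labeled:
    print(row)
```

Rating-derived labels are cheap to produce but noisy, since a reviewer can leave a 5-star rating on a complaint, so a manually labeled sample may still be useful for validation.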

# Installing Dependencies 

Install the dependencies with pip by running the following cell (outside a notebook, run the same command in a terminal with `pip install` instead of `%pip install`):

In [1]:
%pip install pandas numpy scikit-learn tensorflow keras matplotlib seaborn spacy textblob wordcloud

Collecting tensorflow
  Downloading tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting keras
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Collecting spacy
  Downloading spacy-3.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Collecting wordcloud
  Downloading wordcloud-1.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Collecting absl-py>=1.0.0 (from tensorflow)
  Downloading absl_py-2.0.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow)
  Downloading flatbuffers-23.5.26-py2.py3-none-any.whl.metadata

# Import Dependencies

In [2]:
# Import data analysis libraries
import pandas as pd
import numpy as np

# Import data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Import machine learning libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Import deep learning libraries
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Dropout, TextVectorization
from keras.callbacks import EarlyStopping


# Import text processing libraries
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from textblob import TextBlob
from wordcloud import WordCloud



2023-11-09 01:24:07.130905: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-09 01:24:07.163838: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-09 01:24:07.163880: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-09 01:24:07.163913: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-09 01:24:07.171035: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-09 01:24:07.171935: I tensorflow/core/platform/cpu_feature_guard.cc:182] This Tens

# Import Dataset

In [3]:
df = pd.read_csv("Dataset/gcash_review_dataset.zip", compression="zip")

## Get information about Dataset

### Show the first 5 rows of the dataset

To see what the rows look like:

In [5]:
df.head(5)

| | Unnamed: 0 | review_text | review_rating | author_id | author_name | author_app_version | review_datetime_utc | review_likes |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Works fine.. I like the graphics and layout.. ... | 5 | NaN | A Google user | 1.0.1.0 | 2012-03-26T05:49:59.000Z | 4 |
| 1 | 1 | "Unknown error occurred" always popping up! Ne... | 1 | NaN | A Google user | 1.0.0.0 | 2012-03-26T10:49:57.000Z | 0 |
| 2 | 2 | very convenient to use.. | 5 | NaN | A Google user | 1.0.1.0 | 2012-05-08T03:32:34.000Z | 0 |
| 3 | 3 | It would really be great if you add "payable t... | 4 | NaN | A Google user | 1.0.1.0 | 2012-05-31T13:53:30.000Z | 7 |
| 4 | 4 | Its working fine with my motorola droid razr. ... | 5 | NaN | A Google user | 1.0.1.0 | 2012-06-20T13:38:43.000Z | 1 |


### Show column info and data types

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 585275 entries, 0 to 585274
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Unnamed: 0           585275 non-null  int64 
 1   review_text          584879 non-null  object
 2   review_rating        585275 non-null  int64 
 3   author_id            585242 non-null  object
 4   author_name          585275 non-null  object
 5   author_app_version   456066 non-null  object
 6   review_datetime_utc  585275 non-null  object
 7   review_likes         585275 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 35.7+ MB
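
The summary above lists `review_datetime_utc` with dtype `object`, meaning the timestamps are stored as plain strings. A minimal sketch of parsing them, assuming the ISO 8601 format seen in the preview rows; the `toy` frame and the `review_year` column are hypothetical and only serve the illustration:

```python
import pandas as pd

# Toy frame with timestamps in the same format as the preview rows.
toy = pd.DataFrame(
    {"review_datetime_utc": ["2012-03-26T05:49:59.000Z", "2012-05-08T03:32:34.000Z"]}
)

# Parse the strings into timezone-aware datetimes (UTC).
toy["review_datetime_utc"] = pd.to_datetime(toy["review_datetime_utc"], utc=True)

# Once parsed, date parts are available through the .dt accessor.
toy["review_year"] = toy["review_datetime_utc"].dt.year

print(toy.dtypes)
print(toy)
```

Parsed timestamps make it straightforward to group reviews by month or year later on, for example when tracking sentiment over time.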


Check the shape of the dataset

In [8]:
print(f"""
Rows: {df.shape[0]}
Columns: {df.shape[1]}
""")


Rows: 585275
Columns: 8



# Data Exploration

This includes checking the class distribution and other categorical variables.

## Check the classes

In [9]:
df["author_app_version"].value_counts()

author_app_version
5.69.0    17509
5.50.0    16769
5.43.0    16724
5.42.0    14791
5.44.1    13659
          ...  
2.2.2         1
2.3.1         1
5.13.1        1
5.58.0        1
5.18.0        1
Name: count, Length: 199, dtype: int64

The five most common versions are all 5.x releases, which suggests that most reviews were written for version 5 of the GCash app.
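
To look at the spread across major versions rather than individual releases, the version strings can be grouped by their leading number. The sketch below uses a small toy Series in place of `df["author_app_version"]`; the variable names `versions`, `major`, and `major_share` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df["author_app_version"]; None marks a missing version.
versions = pd.Series(["5.69.0", "5.50.0", "5.43.0", "2.2.2", "1.0.1.0", None])

# Take the part before the first dot as the major version.
major = versions.str.split(".").str[0]

# Percentage of reviews per major version. value_counts() skips missing
# values by default, so the shares are relative to rows that have a version.
major_share = major.value_counts(normalize=True).mul(100).round(2)

print(major_share)
```

On the real data, the `author_app_version` column has missing values (456066 non-null out of 585275 rows), so these shares would describe only the reviews that report a version.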

Next, check how many reviews fall under each rating.

In [30]:
review_rating = pd.DataFrame(df["review_rating"].value_counts())

In [28]:
review_rating["percentage"] = (review_rating["count"] / len(df) * 100).round(2)

In [29]:
review_rating

| review_rating | count | percentage |
|---|---|---|
| 5 | 321150 | 54.87 |
| 1 | 160982 | 27.51 |
| 4 | 40357 | 6.9 |
| 3 | 34238 | 5.85 |
| 2 | 28548 | 4.88 |


This shows that 54.87% of the reviews give the highest rating (5 stars) and 27.51% give the lowest rating (1 star).
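
The percentages can be checked by hand from the counts in the table above, and grouping the ratings gives a rough picture of class balance. The grouping used here (1-2 stars negative, 3 neutral, 4-5 positive) is an assumption made for illustration; the notebook has not defined sentiment classes at this point, and `sentiment_group` is a hypothetical helper.

```python
# Rating counts copied from the table above.
counts = {5: 321150, 1: 160982, 4: 40357, 3: 34238, 2: 28548}
total = sum(counts.values())

# Share of each rating, in percent, rounded to two decimals.
percentages = {
    rating: round(count / total * 100, 2) for rating, count in counts.items()
}


def sentiment_group(rating: int) -> str:
    """Assumed mapping from star rating to a coarse sentiment group."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


# Aggregate the counts per sentiment group.
group_counts = {}
for rating, count in counts.items():
    group = sentiment_group(rating)
    group_counts[group] = group_counts.get(group, 0) + count

group_share = {
    group: round(count / total * 100, 2) for group, count in group_counts.items()
}

print(total)
print(percentages)
print(group_share)
```

Under this grouping the classes are clearly imbalanced, with neutral reviews forming the smallest group, which is worth keeping in mind when splitting the data and choosing evaluation metrics.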

### Checking the Review Text

Check for null values

In [32]:
print("Null values:",df["review_text"].isna().sum())

Null values: 396
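
One way to handle these rows is to drop reviews without text before any text processing. The sketch below uses a toy DataFrame in place of `df`; the whitespace and duplicate filters are additional assumptions about what counts as an unusable review, not steps taken from the notebook.

```python
import pandas as pd

# Toy frame standing in for the review dataset.
toy = pd.DataFrame(
    {
        "review_text": ["very convenient to use..", None, "very convenient to use..", "   "],
        "review_rating": [5, 1, 5, 3],
    }
)

# Drop rows whose review text is missing.
cleaned = toy.dropna(subset=["review_text"])

# Drop rows whose text contains only whitespace (assumed to be unusable).
cleaned = cleaned[cleaned["review_text"].str.strip() != ""]

# Drop exact duplicates of the same text and rating, then renumber the rows.
cleaned = cleaned.drop_duplicates(subset=["review_text", "review_rating"]).reset_index(drop=True)

print("Null values:", cleaned["review_text"].isna().sum())
print("Rows kept:", len(cleaned))
```

Removing duplicates is a judgment call: short reviews such as one-word praise can legitimately repeat across many users, so it may be preferable to deduplicate only when the author and timestamp also match.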
