1 Sentiment and Thematic Analysis

Introduction: Sentiment Analysis
In this phase of the project, we aim to quantify customer opinions from Google Play reviews of Ethiopian bank apps. After cleaning and normalizing the review text, we apply sentiment analysis to classify feedback as positive, negative, or neutral. This process provides a measurable understanding of user satisfaction and highlights areas for improvement. By combining sentiment scores with review metadata, we can identify trends across banks, app versions, and time, laying the groundwork for deeper thematic and semantic insights.
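
The classification step can be pictured as thresholding a polarity score. The sketch below is illustrative only: `label_polarity` and its 0.05 threshold are assumptions made for this example, not the logic inside `SentimentAnalyzer`, whose implementation is not shown in this notebook.

```python
def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label.

    The threshold is an assumed value chosen for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical polarity scores, one per review
scores = [0.6, -0.4, 0.0, 0.03]
labels = [label_polarity(s) for s in scores]
print(labels)
```

A wider threshold sends more borderline reviews to the neutral class, so the positive, negative, and neutral counts reported later depend on whichever cut-off the analyzer actually uses.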


In [1]:
import sys, os
project_root = os.path.abspath("..")  # adjust if needed
if project_root not in sys.path:
    sys.path.append(project_root)

# Import the SentimentAnalyzer
from src.sentiment import SentimentAnalyzer

In [2]:
analyzer = SentimentAnalyzer()

In [3]:
# Add sentiment to cleaned reviews and save results
df_sentiment = analyzer.add_sentiment()

Sentiment analysis completed. Saved to data/processed/sentiment.csv


In [4]:
agg_sentiment = analyzer.aggregate_by_bank()
agg_sentiment

                          bank  total_reviews  positive_reviews  negative_reviews  neutral_reviews  mean_polarity
0            Bank of Abyssinia            286               121                63              102        0.10274
1  Commercial Bank of Ethiopia            296               146                27              123       0.229599
2                  Dashen Bank            286               171                44               71       0.230363
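
A per-bank summary with these columns could be produced with a pandas `groupby`. This is a minimal sketch under stated assumptions: the column names `bank`, `sentiment`, and `polarity` and the sample rows are hypothetical, and the real `aggregate_by_bank` method may be implemented differently.

```python
import pandas as pd

# Hypothetical per-review data; column names are assumed for illustration
reviews = pd.DataFrame({
    "bank": ["A", "A", "A", "B", "B"],
    "sentiment": ["positive", "negative", "neutral", "positive", "positive"],
    "polarity": [0.5, -0.3, 0.0, 0.4, 0.2],
})

# One row per bank: label counts plus the mean polarity score
summary = (
    reviews.groupby("bank")
    .agg(
        total_reviews=("sentiment", "count"),
        positive_reviews=("sentiment", lambda s: (s == "positive").sum()),
        negative_reviews=("sentiment", lambda s: (s == "negative").sum()),
        neutral_reviews=("sentiment", lambda s: (s == "neutral").sum()),
        mean_polarity=("polarity", "mean"),
    )
    .reset_index()
)
print(summary)
```

Because the banks have slightly different review totals, dividing each count by `total_reviews` gives shares that are easier to compare than the raw counts.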


2 Thematic Analysis

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Load cleaned reviews
df = pd.read_csv("data/processed/cleaned.csv")

# Drop rows where 'cleaned_text' is missing or empty
df = df.dropna(subset=['cleaned_text'])
df = df[df['cleaned_text'].str.strip() != ""]

# Function to tokenize, lemmatize, and remove stopwords/punctuation
def preprocess_text(text):
    # Ensure the input is a string
    text = str(text)
    doc = nlp(text)
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and token.is_alpha]
    return " ".join(tokens)

# Apply preprocessing to cleaned text
df["processed_text"] = df["cleaned_text"].apply(preprocess_text)

# TF-IDF vectorizer: keep the 50 highest-frequency unigrams and bigrams as candidate keywords
vectorizer = TfidfVectorizer(max_features=50, ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df["processed_text"])
# Candidate terms for inspection; the theme counts below use raw token frequencies instead
feature_names = vectorizer.get_feature_names_out()

# Count every token (unigrams only) across all processed reviews
all_keywords = " ".join(df["processed_text"]).split()
keyword_freq = Counter(all_keywords)

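# Map keywords to themes manually. Keywords are matched against lemmatized tokens,
# so inflected entries such as "failed" may match fewer tokens than expected.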
theme_mapping = {
    "Account Access Issues": ["login", "otp", "password", "authentication", "access"],
    "Transaction Performance": ["transfer", "payment", "failed", "slow", "transaction"],
    "User Interface & Experience": ["ui", "interface", "navigation", "layout", "experience"],
    "Customer Support": ["support", "response", "help", "service", "complaint"],
    "Feature Requests": ["feature", "request", "option", "add", "missing"]
}

# Count occurrences of keywords under each theme
theme_counts = {theme: 0 for theme in theme_mapping}
for theme, keywords in theme_mapping.items():
    for kw in keywords:
        theme_counts[theme] += keyword_freq.get(kw, 0)

# Convert to DataFrame for visualization
theme_df = pd.DataFrame(list(theme_counts.items()), columns=["Theme", "Count"])
theme_df = theme_df.sort_values("Count", ascending=False)

print(theme_df)

                         Theme  Count
1      Transaction Performance    107
4             Feature Requests     65
3             Customer Support     51
2  User Interface & Experience     49
0        Account Access Issues     47
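
The counts above are token occurrences, so a single review that repeats "transfer" contributes more than once. As a complementary view, the sketch below counts how many reviews touch each theme at least once. The shortened theme map, the sample reviews, and the helper `themes_for` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Abbreviated, hypothetical theme map and already-processed reviews
theme_keywords = {
    "Transaction Performance": {"transfer", "payment", "transaction"},
    "Account Access Issues": {"login", "otp", "password"},
}
processed_reviews = [
    "transfer fail transfer slow",
    "login otp arrive",
    "good app",
]


def themes_for(text, mapping):
    """Return every theme whose keyword set overlaps the review's tokens."""
    tokens = set(text.split())
    return [theme for theme, keywords in mapping.items() if tokens & keywords]


# Each review contributes at most one count per theme
review_counts = Counter(
    theme
    for text in processed_reviews
    for theme in themes_for(text, theme_keywords)
)
print(review_counts)
```

Counting reviews rather than tokens keeps one long, repetitive review from dominating a theme, and it lets a review belong to several themes at once.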
