### What is Topic Modeling?
Topic modeling is an unsupervised machine learning technique that scans a set of documents, detects word and phrase patterns within them, and automatically groups the words and expressions that best characterize the collection. It is closely related to clustering, except that it builds clusters of words rather than clusters of texts. Each text is then described as a mixture of the topics, with a weight for each one.
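
To make the "mixture of topics" idea concrete, here is a minimal sketch with made-up numbers. The topic names, words and probabilities are illustrative assumptions, not output from a fitted model. Each topic is a probability distribution over words, each document carries one weight per topic, and the probability of seeing a word in the document is the weighted sum over topics.

```python
# Toy illustration of "a text is a mixture of topics".
# All names and numbers below are invented for illustration only.

# Each topic is a probability distribution over a tiny vocabulary.
topics = {
    "support": {"chat": 0.5, "help": 0.4, "card": 0.1},
    "payments": {"chat": 0.1, "help": 0.1, "card": 0.8},
}

# A document is described by one weight per topic; the weights sum to 1.
doc_topic_weights = {"support": 0.7, "payments": 0.3}


def word_probability(word, doc_weights, topic_word_probs):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topic_word_probs[topic].get(word, 0.0)
        for topic, weight in doc_weights.items()
    )


for word in ("chat", "help", "card"):
    print(word, round(word_probability(word, doc_topic_weights, topics), 3))
```

A document dominated by the hypothetical "support" topic ends up favouring "chat" and "help", while still leaving some probability for "card".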

### How does it work?
If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.
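
One possible heuristic for turning weighted topics into tags is a simple threshold. The sketch below is only an assumption about how this could be done: the function name `weights_to_tags`, the labels and the 0.25 cut-off are hypothetical and would need tuning on real data.

```python
def weights_to_tags(topic_weights, threshold=0.25):
    """Return the labels of all topics whose weight reaches the threshold,
    strongest first. Falls back to the single strongest topic if none does."""
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    tags = [label for label, weight in ranked if weight >= threshold]
    return tags or [ranked[0][0]]


# Hypothetical topic weights for one review (invented numbers that sum to 1).
review_weights = {
    "app_functionality": 0.15,
    "registration": 0.01,
    "account_blocked": 0.50,
    "customer_support": 0.34,
}

print(weights_to_tags(review_weights))
```

With these numbers the review receives two tags, because two topics exceed the chosen threshold.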

### Why do you need it?
There are several scenarios when topic modeling can prove useful. Here are some of them:

- Text classification – Topic modeling can improve classification by grouping similar words together into topics rather than using each word as a separate feature.
- Recommender systems – Using a similarity measure between topic structures, we can build a recommender system. For example, a system that recommends articles to readers can suggest articles whose topic structure is similar to that of the articles the user has already read.
- Uncovering themes in texts – Useful, for example, for detecting trends in online publications.
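
The recommender idea above can be sketched with cosine similarity between topic-weight vectors. This is a minimal illustration under stated assumptions: the article names, the weights and the helper `recommend` are hypothetical, and cosine similarity is just one of several similarity measures that could be used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(profile, candidates, top_n=2):
    """Rank candidate articles by similarity of their topic weights to the profile."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(profile, candidates[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical topic weights (3 topics) for unread articles.
articles = {
    "article_a": [0.8, 0.1, 0.1],
    "article_b": [0.1, 0.8, 0.1],
    "article_c": [0.6, 0.3, 0.1],
}

# Hypothetical average topic weights of the articles the user already read.
already_read_profile = [0.7, 0.2, 0.1]

print(recommend(already_read_profile, articles))
```

Articles dominated by the same topic as the reader's history rank first, while `article_b`, which is mostly about a different topic, is left out.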

# 1. Install and load the necessary packages
All the packages needed, from crawling to sentiment analysis, are listed in this section

In [18]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import json
from urllib.request import urlopen
from pandas import json_normalize
from google_play_scraper import app,Sort, reviews
from app_store_scraper import AppStore
from pprint import pprint
import urllib3
import xmltodict
import time
from textblob import TextBlob
import spacy 
import langid 
from nltk.classify.textcat import TextCat 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text 
from sklearn.decomposition import LatentDirichletAllocation

# 2. Load data
This step can be skipped if you scrape the data directly in the same script

In [19]:
# Loading our previously scraped data
play_store_reviews = pd.read_csv('play_store_reviews.csv', index_col=False)
app_store_reviews = pd.read_csv('app_store_reviews.csv', index_col=False)

# Add platform names to each review
app_store_reviews = app_store_reviews.assign(Platform='iOS')
play_store_reviews = play_store_reviews.assign(Platform='Android')

# Select the relevant columns
app_store_reviews = app_store_reviews[['App', 'Rating', 'Comment', 'Platform']]
play_store_reviews = play_store_reviews[['App', 'Rating', 'Comment', 'Platform']]

# Harmonise the App Store app names with the names used in the Play Store data
app_store_reviews['App'] = app_store_reviews['App'].str.replace('revolut', 'Revolut')
app_store_reviews['App'] = app_store_reviews['App'].str.replace('n26-mobile-banking', 'N26')
app_store_reviews['App'] = app_store_reviews['App'].str.replace('monzo-bank', 'MonzoBank')
# Create the final dataset combining reviews from both stores (DataFrame.append was removed in pandas 2.0)
df_reviews = pd.concat([play_store_reviews, app_store_reviews], ignore_index=True)

# Keep only reviews with a meaningful length (at least 15 characters)
df_reviews = df_reviews[df_reviews.Comment.str.len()>=15]

Now we will classify our reviews by the language they are written in

In [20]:
# Get the language id for each review
ids_langid = df_reviews['Comment'].apply(langid.classify)

# Get just the language label (langid.classify returns a (language, score) tuple)
langs = ids_langid.apply(lambda result: result[0])

# Assign the language to each review
df_reviews['Language'] = langs

# How many unique language labels were applied?
print("Number of tagged languages (estimated):")
print(len(langs.unique()))

# Percent of the total dataset in English
print("Percent of data in English (estimated):")
print((sum(langs=="en")/len(langs))*100)

Number of tagged languages (estimated):
77
Percent of data in English (estimated):
90.4166594420266


In [21]:
# 90% of the reviews are in English. The population seems to be well represented in that group
#     We will select English reviews only

df_reviews = df_reviews[df_reviews['Language']=='en']
df_reviews

Unnamed: 0,App,Rating,Comment,Platform,Language
0,Revolut,5.0,Good and efficient,Android,en
1,Revolut,3.0,The transfers take a lot longer to hit your in...,Android,en
2,Revolut,5.0,There is a lot of wasted space in the vaults.....,Android,en
3,Revolut,5.0,Everything you could possibly need from a bank...,Android,en
4,Revolut,5.0,Revolut is a brilliant app that saves you lots...,Android,en
...,...,...,...,...,...
75941,bunq,3.0,The bank is good but the new app became super ...,iOS,en
75942,bunq,1.0,The new version has Terrible accessibility,iOS,en
75943,bunq,1.0,This bank used to have one of the best user ex...,iOS,en
75945,bunq,5.0,Fantastic bank for modern thinking people who ...,iOS,en


# 3. Model
Our goal is to classify bad reviews under meaningful topics

In [22]:
# What are people complaining about? Ratings below 4 and at least 15 characters
reviews = df_reviews[df_reviews['Rating']<=3]
reviews = reviews[['App','Comment']].drop_duplicates()
reviews.dropna(inplace=True)
reviews = reviews.reset_index().drop(columns='index')
print(f'% of total reviews are rated below 4: {len(reviews)/len(df_reviews)*100}')

% of total reviews are rated below 4: 31.60740996433092


In [23]:
# Create document term matrix of the reviews
#   max_df : discard words that occur in more than 95% of the documents
#   min_df : include only words that occur in at least 2 documents

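# Add custom stop words (lowercase, because CountVectorizer lowercases the text before filtering)
my_additional_stop_words = ['app', 'really', 'just', 'n26', 'bunq', 'revolut', 'monzobank', 'monzo', 'bank', 've']
# Recent scikit-learn versions expect stop_words as a list rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(my_additional_stop_words))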

# Create document term matrix with the English and our custom stop words
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words=stop_words)
reviews_cv = cv.fit_transform(reviews['Comment'])


# Note: arriving at a meaningful and interpretable set of topics usually takes several
#       manual iterations over the number of topics and the list of stop words
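
To illustrate what the `max_df` and `min_df` settings do, here is a plain-Python sketch of document-frequency pruning. It only mimics the idea and is not scikit-learn's implementation: the helper name `prune_vocabulary`, the simplified tokenizer and the toy reviews are assumptions made for this example.

```python
import re
from collections import Counter


def prune_vocabulary(documents, min_df=2, max_df=0.95, stop_words=frozenset()):
    """Keep words that occur in at least `min_df` documents and in at most
    a `max_df` fraction of all documents, after removing stop words."""
    # Lowercase each document and keep the set of tokens with 2+ characters.
    tokenized = [
        set(re.findall(r"[a-z0-9]{2,}", doc.lower())) - set(stop_words)
        for doc in documents
    ]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in tokens)
    n_docs = len(documents)
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )


# Invented toy reviews.
toy_reviews = [
    "The app keeps crashing after the update",
    "Card blocked and the support chat is slow",
    "The update broke the login",
    "Support never answered the chat",
]

print(prune_vocabulary(toy_reviews, stop_words={"app"}))
```

In this toy corpus "the" is dropped because it appears in every review, words seen in only one review are dropped as too rare, and "app" is removed as a custom stop word.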



In [24]:
# LDA model with 4 topics and fit the dataset

LDA = LatentDirichletAllocation(n_components=4,random_state=1)
LDA.fit(reviews_cv)

LatentDirichletAllocation(n_components=4, random_state=1)

In [25]:
# Extract the topics and their most represented words

feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    print([feature_names[i] for i in topic.argsort()[-20:]])

topic #0 : 
['better', 'design', 'need', 'transactions', 'interface', 'time', 'banking', 'old', 'used', 'ui', 'don', 'features', 'great', 'good', 'user', 'version', 'like', 'use', 'new', 'update']
topic #1 : 
['login', 'trying', 'support', 'log', 'times', 'use', 'try', 'verification', 'open', 'verify', 'email', 'number', 'doesn', 'id', 'time', 'tried', 'work', 'phone', 'account', 'card']
topic #2 : 
['got', 'using', 'want', 'like', 'blocked', 'months', 'funds', 'days', 'free', 'good', 'closed', 'close', 'transfer', 'pay', 'don', 'reason', 'use', 'card', 'money', 'account']
topic #3 : 
['poor', 'bad', 'problem', 'access', 'use', 'way', 'terrible', 'worst', 'waiting', 'time', 'days', 'contact', 'don', 'help', 'money', 'account', 'chat', 'support', 'customer', 'service']
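
Note that `argsort()` sorts in ascending order, so in the lists above the most heavily weighted word of each topic comes last (for example 'update' for topic #0). The plain-Python sketch below reproduces that logic and reverses it so the strongest word comes first. The weights and the helper name `top_words` are invented for illustration.

```python
def top_words(topic_weights, feature_names, n=3):
    """Return the n highest-weighted words of a topic, strongest first."""
    # Indices sorted by ascending weight, the same order argsort() produces.
    ascending = sorted(range(len(topic_weights)), key=lambda i: topic_weights[i])
    # Take the last n indices (largest weights) and reverse them.
    top_indices = ascending[-n:][::-1]
    return [feature_names[i] for i in top_indices]


# Invented vocabulary and per-word weights for a single topic.
toy_feature_names = ["account", "card", "chat", "update", "version"]
toy_topic_weights = [12.0, 3.5, 0.2, 40.1, 25.3]

print(top_words(toy_topic_weights, toy_feature_names))
```

Listing the strongest words first can make it easier to name a topic at a glance.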


In [26]:
# Merge the results into our initial dataset and save it
topic_reviews = LDA.transform(reviews_cv)

df_topic_reviews = pd.DataFrame(topic_reviews, columns=[
'0_app_functionality',
'1_registration/verification',
'2_financial_products/account_blocked',
'3_customer_support'
])

df_result_low = pd.merge(reviews, df_topic_reviews,  how='inner', left_index=True, right_index=True)

# Add a column to the dataset with the predicted category (maximum score among our categories)
df_result_low['estimated_topic'] = df_result_low[['0_app_functionality','1_registration/verification','2_financial_products/account_blocked','3_customer_support']].idxmax(axis = 1, skipna = True)

# Save results
df_result_low.to_csv("df_result_low.csv")

In [27]:
df_result_low

Unnamed: 0,App,Comment,0_app_functionality,1_registration/verification,2_financial_products/account_blocked,3_customer_support,estimated_topic
0,Revolut,The transfers take a lot longer to hit your in...,0.354785,0.011857,0.621022,0.012336,2_financial_products/account_blocked
1,Revolut,Absolutely disgraceful level of customer suppo...,0.007846,0.007623,0.373257,0.611274,3_customer_support
2,Revolut,"Completely unpleasant experience, first they g...",0.151480,0.013871,0.496314,0.338335,2_financial_products/account_blocked
3,Revolut,Can't log in after an update and they have no ...,0.399220,0.153812,0.041810,0.405157,3_customer_support
4,Revolut,Blocked my money and cannot even get in touch ...,0.035769,0.036142,0.037448,0.890641,3_customer_support
...,...,...,...,...,...,...,...
16477,bunq,They list it no where online or in the App Sto...,0.010991,0.742960,0.235195,0.010855,1_registration/verification
16478,bunq,The bank is good but the new app became super ...,0.851539,0.018094,0.018065,0.112302,0_app_functionality
16479,bunq,The new version has Terrible accessibility,0.847356,0.050277,0.050129,0.052238,0_app_functionality
16480,bunq,This bank used to have one of the best user ex...,0.922208,0.001866,0.074079,0.001847,0_app_functionality
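
Taking the maximum weight always yields a label, even when two topics are almost tied. A possible refinement, sketched below as an assumption rather than as part of the original analysis, is to flag reviews whose best and second-best topics are closer than some margin. The helper `assign_topic` and the 0.10 margin are hypothetical; the weights are copied from rows 3 and 4 of the table above.

```python
def assign_topic(weights, min_margin=0.10):
    """Return (best_label, is_confident), where is_confident is True when the
    best topic beats the runner-up by at least `min_margin`."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    best_label, best_weight = ranked[0]
    runner_up_weight = ranked[1][1]
    return best_label, (best_weight - runner_up_weight) >= min_margin


# Topic weights of rows 3 and 4 as displayed in the table above.
row_3 = {
    "0_app_functionality": 0.399220,
    "1_registration/verification": 0.153812,
    "2_financial_products/account_blocked": 0.041810,
    "3_customer_support": 0.405157,
}
row_4 = {
    "0_app_functionality": 0.035769,
    "1_registration/verification": 0.036142,
    "2_financial_products/account_blocked": 0.037448,
    "3_customer_support": 0.890641,
}

print(assign_topic(row_3))
print(assign_topic(row_4))
```

Row 3 is labelled as customer support by a very narrow margin over app functionality, whereas row 4 is a clear-cut case. Ambiguous reviews like row 3 are natural candidates for manual review.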


# 4. Validation
Let's create a random sample of 10 reviews for each app and manually check if the labels assigned are correct

In [28]:
# Create a random sample dataset and save it
rev = df_result_low[df_result_low['App']=='Revolut'].sample(n=10, random_state=1)
n26 = df_result_low[df_result_low['App']=='N26'].sample(n=10, random_state=1)
MonzoBank = df_result_low[df_result_low['App']=='MonzoBank'].sample(n=10, random_state=1)
bunq = df_result_low[df_result_low['App']=='bunq'].sample(n=10, random_state=1)
sample = pd.concat([rev, n26, MonzoBank, bunq])  # DataFrame.append was removed in pandas 2.0

# Save results
sample.to_excel("sample.xlsx")
sample.head()

Unnamed: 0,App,Comment,0_app_functionality,1_registration/verification,2_financial_products/account_blocked,3_customer_support,estimated_topic
7870,Revolut,Awful. Revolut blocks your account (with money...,0.009234,0.009281,0.383561,0.597924,3_customer_support
3867,Revolut,Worst aap I have existing revolt account. I ca...,0.408627,0.208769,0.349941,0.032663,0_app_functionality
5736,Revolut,There is no Lebanon access! Pitty in 2020 🤨,0.835436,0.056133,0.053344,0.055087,0_app_functionality
5506,Revolut,I have been unable to use my card since no ass...,0.020072,0.938491,0.020249,0.021189,1_registration/verification
2666,Revolut,New interface is difficult and unintuitive.,0.849734,0.050147,0.05004,0.050079,0_app_functionality


In [29]:
# Manually labeled 40 reviews (10 reviews per app) and determined if the categorisation was correct

# Load the result of the previously generated "sample.xlsx" with our manual validation input (IsAccurate)
validation = pd.read_csv('validation.csv') 
validation = validation.reset_index().drop(columns='index')
accuracy = validation['IsAccurate'].sum() / validation['IsAccurate'].count()

# Display random reviews from the validation dataset
validation.sample(n=15, random_state=1)

Unnamed: 0.1,Unnamed: 0,App,Comment,0_app_functionality,1_registration/verification,2_financial_products/account_blocked,3_customer_support,estimated_topic,IsAccurate
2,5736,Revolut,There is no Lebanon access! Pitty in 2020 🤨,0.8354364634,0.05613255263,0.05334360225,0.05508738176,0_app_functionality,0
31,14031,bunq,Still pending to solve an issue with them. Goo...,0.03652024702,0.0364889772,0.03701674017,0.8899740356,3_customer_support,1
3,5506,Revolut,I have been unable to use my card since no ass...,0.02007173833,0.9384906751,0.02024872833,0.02118885828,1_registration/verification,1
21,12132,MonzoBank,Customer service is very poor! I had a payment...,0.02522009182,0.3377759072,0.02681226521,0.6101917358,3_customer_support,1
27,12749,MonzoBank,Unable to login to the account. First time use...,0.1728825158,0.7946011056,0.01611983626,0.01639654236,1_registration/verification,1
29,15820,MonzoBank,"The app all round is fantastic however, the sp...",0.578803028,0.3980787202,0.01130211114,0.01181614066,0_app_functionality,1
22,11832,MonzoBank,Why Monzo not creating account? Its stop on ve...,0.03728913191,0.4849107662,0.2886667974,0.1891333045,1_registration/verification,1
39,16353,bunq,"Wow, they couldn’t have messed up more. Time t...",0.6630851686,0.2715655784,0.03298300186,0.03236625119,0_app_functionality,1
19,8847,N26,"As an Iranian living in Ireland, I have no per...",0.02275283961,0.9300998446,0.0240034594,0.02314385635,1_registration/verification,1
26,11972,MonzoBank,To restricted Cash deposit are to restrictive ...,0.3299969987,0.006944235913,0.6559680339,0.007090731444,2_financial_products/account_blocked,1


In [30]:
# Evaluation of our project
print(f'Project result: The model accuracy (number of correctly labeled reviews / total reviews) is: {accuracy * 100}%')

Project result: The model accuracy (number of correctly labeled reviews / total reviews) is: 80.0%
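
Because the accuracy is estimated from a small manual sample, it carries considerable uncertainty. As a rough sketch, assuming the 80% comes from 32 correct labels out of 40 validated reviews, a 95% Wilson score interval can be computed as follows. The function name `wilson_interval` is our own; the formula is the standard Wilson interval for a binomial proportion.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    return center - half_width, center + half_width


# Assumption: 32 of the 40 manually checked reviews were labeled correctly (80%).
low, high = wilson_interval(32, 40)
print(f"95% interval for the accuracy: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 65% to 90%, so a larger validation sample would be needed to pin the accuracy down more precisely.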


# 5. The extra mile
You may wonder what would happen if we extracted the topics for each app separately: would the reviews show a completely different picture?

In [31]:
# What does the review distribution per app look like? Should we get topics for each app individually?
#     Revolut accounts for the majority of the reviews, so it makes sense to explore this route

reviews['Comment'].groupby(reviews['App']).count()

App
MonzoBank    2365
N26          3758
Revolut      9434
bunq          925
Name: Comment, dtype: int64

In [32]:
# Revolut topic modelling

revolut_reviews = reviews[reviews['App']=='Revolut']
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words=stop_words)
reviews_cv = cv.fit_transform(revolut_reviews['Comment'])

# LDA model with 4 topics
LDA = LatentDirichletAllocation(n_components=4,random_state=1)
LDA.fit(reviews_cv)

# Extract the topics and their most represented words
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    print([feature_names[i] for i in topic.argsort()[-20:]])



topic #0 : 
['good', 'previous', 'features', 'time', 'design', 'change', 'transactions', 'latest', 'screen', 'used', 'old', 'interface', 'don', 'user', 'ui', 'use', 'version', 'like', 'update', 'new']
topic #1 : 
['10', 'time', 'identity', 'email', 'verification', 'free', 'code', 'times', 'verify', 'tried', 'number', 'doesn', 'pay', 'phone', 'work', 'id', 'use', 'money', 'account', 'card']
topic #2 : 
['fees', 'experience', 'premium', 'lot', 'currency', 'poor', 'don', 'works', 'need', 'exchange', 'transfer', 'using', 'better', 'time', 'great', 'support', 'use', 'good', 'customer', 'service']
topic #3 : 
['way', 'weeks', 'reason', 'months', 'funds', 'time', 'locked', 'use', 'help', 'contact', 'blocked', 'days', 'don', 'access', 'customer', 'service', 'chat', 'support', 'money', 'account']


In [33]:
# N26 topic modelling

n26_reviews = reviews[reviews['App']=='N26']
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words=stop_words)
reviews_cv = cv.fit_transform(n26_reviews['Comment'])

# LDA model with 4 topics
LDA = LatentDirichletAllocation(n_components=4,random_state=1)
LDA.fit(reviews_cv)

# Extract the topics and their most represented words
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    print([feature_names[i] for i in topic.argsort()[-20:]])



topic #0 : 
['transfers', 'direct', 'days', 'don', 'transaction', 'funds', 'time', 'support', 'good', 'banking', 'like', 'banks', 'deposit', 'uk', 'pay', 'use', 'card', 'transfer', 'account', 'money']
topic #1 : 
['verification', 'tried', 'need', 'does', 'new', 'error', 'confirm', 'times', 'number', 'account', 'use', 'log', 'time', 'password', 'update', 'doesn', 'support', 'login', 'work', 'phone']
topic #2 : 
['people', 'verification', 'poor', 'rude', 'id', 'banking', 'good', 'transactions', 'terrible', 'make', 'experience', 'card', 'like', 'worst', 'support', 'bad', 'time', 'don', 'customer', 'service']
topic #3 : 
['help', 'id', 'tried', 'passport', 'verification', 'told', 'contact', 'verify', 'days', 'time', 'email', 'chat', 'don', 'money', 'service', 'open', 'support', 'customer', 'card', 'account']


In [34]:
# Monzo Bank topic modelling

monzobank_reviews = reviews[reviews['App']=='MonzoBank']
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words=stop_words)
reviews_cv = cv.fit_transform(monzobank_reviews['Comment'])

# LDA model with 4 topics
LDA = LatentDirichletAllocation(n_components=4,random_state=1)
LDA.fit(reviews_cv)

# Extract the topics and their most represented words
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    print([feature_names[i] for i in topic.argsort()[-20:]])



topic #0 : 
['version', 'let', 'sign', 'help', 'address', 'card', 'work', 'keeps', 'doesn', 'update', 'time', 'try', 'phone', 'link', 'use', 'tried', 'open', 'log', 'email', 'account']
topic #1 : 
['way', 'spending', 'using', 'easy', 'money', 'old', 'used', 'features', 'make', 'security', 'don', 'account', 'card', 'pin', 'great', 'good', 'update', 'like', 'use', 'new']
topic #2 : 
['avoid', 'recommend', 'explanation', 'time', 'like', 'good', 'months', 'people', 'got', 'weeks', 'close', 'use', 'don', 'closed', 'customer', 'card', 'service', 'reason', 'money', 'account']
topic #3 : 
['don', 'waiting', 'help', 'bad', 'doesn', 'trying', 'verification', 'times', 'like', 'work', 'open', 'tried', 'days', 'video', 'chat', 'service', 'id', 'customer', 'time', 'account']


In [36]:
# Bunq topic modelling

bunq_reviews = reviews[reviews['App']=='bunq']
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words=stop_words)
reviews_cv = cv.fit_transform(bunq_reviews['Comment'])

# LDA model with 4 topics
LDA = LatentDirichletAllocation(n_components=4,random_state=1)
LDA.fit(reviews_cv)

# Extract the topics and their most represented words
feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    print([feature_names[i] for i in topic.argsort()[-20:]])



topic #0 : 
['features', 'money', 'good', 'information', 'chat', 'month', 'people', 'card', 'available', 'stay', 'know', 'want', 'way', 'pay', 'free', 'don', 'used', 'like', 'banking', 'account']
topic #1 : 
['free', 'business', 'want', 'like', 'open', 'reason', 'id', 'access', 'pay', 'customer', 'using', 'premium', 'time', 'phone', 'use', 'service', 'don', 'money', 'support', 'account']
topic #2 : 
['accounts', 'things', 'use', 'features', 'v2', 'trees', 'terrible', 'ui', 'used', 'design', 'ux', 'bad', 'user', 'like', 'don', 'banking', 'version', 'new', 'update', 'v3']
topic #3 : 
['tried', 'features', 'think', 'make', 'travel', 'great', 'service', 'experience', 'want', 'premium', 'like', 'pay', 'use', 'time', 'accounts', 'money', 'don', 'account', 'free', 'card']


# Result
Certain topics appear to be more closely tied to specific apps (e.g. UX/UI complaints about the new version for bunq), but overall the topics are similar across the 4 apps
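
One way to make "topics are similar" less subjective is to measure the overlap between the top-word lists of two topics. The sketch below uses the Jaccard index, which is our choice for illustration and was not part of the analysis above. The word lists are copied from the Revolut topic #0 and bunq topic #2 outputs.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two lists have in common (0 = none, 1 = identical)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Top 20 words copied from the per-app outputs above.
revolut_topic_0 = ['good', 'previous', 'features', 'time', 'design', 'change',
                   'transactions', 'latest', 'screen', 'used', 'old', 'interface',
                   'don', 'user', 'ui', 'use', 'version', 'like', 'update', 'new']
bunq_topic_2 = ['accounts', 'things', 'use', 'features', 'v2', 'trees', 'terrible',
                'ui', 'used', 'design', 'ux', 'bad', 'user', 'like', 'don', 'banking',
                'version', 'new', 'update', 'v3']

print(f"Overlap between Revolut topic #0 and bunq topic #2: "
      f"{jaccard(revolut_topic_0, bunq_topic_2):.2f}")
```

Computing this score for every pair of topics across apps would show which themes are shared and which are specific to one app.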