### What is Topic Modeling?
Topic modeling is closely related to clustering, but instead of grouping texts it groups words that tend to occur together into topics. Each text is then described as a mixture of those topics, with a weight for each one.

### How does it work?
If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.
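
As a minimal sketch of one such heuristic, the snippet below keeps every topic whose weight reaches a threshold and falls back to the single strongest topic when none does. The function name `weights_to_tags`, the threshold of 0.3 and the example labels are illustrative assumptions, not part of the analysis that follows.

```python
def weights_to_tags(weights, labels, threshold=0.3):
    """Turn one document's topic weights into a list of tags.

    weights   : per-topic weights for a single document
    labels    : human-readable label for each topic, in the same order
    threshold : hypothetical cut-off; a topic at or above it becomes a tag
    """
    if len(weights) != len(labels):
        raise ValueError("weights and labels must have the same length")
    # Topic indices ordered from strongest to weakest (ties keep original order)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    tags = [labels[i] for i in order if weights[i] >= threshold]
    # Fall back to the single strongest topic so every document gets a tag
    return tags or [labels[order[0]]]


# Hypothetical labels and weights, for illustration only
labels = ["app", "support", "verification", "payments"]
print(weights_to_tags([0.05, 0.55, 0.35, 0.05], labels))
print(weights_to_tags([0.25, 0.25, 0.25, 0.25], labels))
```

A lower threshold yields more tags per text; a higher one brings the result closer to single-label classification.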

### Why do you need it?
There are several scenarios when topic modeling can prove useful. Here are some of them:

- Text classification – Topic modeling can improve classification by grouping similar words into topics, so that topics rather than individual words are used as features.
- Recommender systems – With a similarity measure over topic weights we can build a recommender system. For example, a system that recommends articles to readers can suggest articles whose topic structure is similar to that of the articles the user has already read.
- Uncovering themes in texts – Useful, for example, for detecting trends in online publications.
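
The recommender idea can be sketched with cosine similarity between topic-weight vectors. This is only an illustration under assumptions: the helper names `cosine_similarity` and `recommend`, the article ids and the three-topic mixtures are all made up, and a real system would likely use a different or richer similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two topic-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no topic signal, so treat the pair as dissimilar
    return dot / (norm_a * norm_b)


def recommend(read_vector, candidates, top_n=2):
    """Rank candidate articles by similarity to what the user has read.

    candidates maps an article id to its topic-weight vector.
    """
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(read_vector, candidates[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up topic mixtures over three topics, for illustration only
read_vector = [0.7, 0.2, 0.1]
candidates = {
    "article_a": [0.6, 0.3, 0.1],
    "article_b": [0.1, 0.1, 0.8],
    "article_c": [0.8, 0.1, 0.1],
}
print(recommend(read_vector, candidates))
```

Here `read_vector` stands for the average topic mixture of the articles a user has read; the candidates dominated by the same topic rank first.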

# 1. Install and load the necessary packages
All the packages needed, from crawling to sentiment analysis, are imported in this section.

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import json
from urllib.request import urlopen
from pandas import json_normalize  # top-level import path since pandas 1.0
from google_play_scraper import app, Sort, reviews
from app_store_scraper import AppStore
from pprint import pprint
import urllib3
import xmltodict
import time
from textblob import TextBlob
import spacy 
import langid 
from nltk.classify.textcat import TextCat 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 2. Load data
This step can be skipped if you scrape the data directly in the same script.

In [2]:
# Loading our previously scraped data
play_store_reviews = pd.read_csv('play_store_reviews.csv', index_col=False)
app_store_reviews = pd.read_csv('app_store_reviews.csv', index_col=False)

# Add the platform name to each review
app_store_reviews = app_store_reviews.assign(Platform='iOS')
play_store_reviews = play_store_reviews.assign(Platform='Android')

# Select the relevant columns
app_store_reviews = app_store_reviews[['App', 'Rating', 'Comment', 'Platform']]
play_store_reviews = play_store_reviews[['App', 'Rating', 'Comment', 'Platform']]

# Normalise the App Store app names, then combine both stores into the final dataset
app_store_reviews['App'] = app_store_reviews['App'].str.replace('revolut', 'Revolut')
app_store_reviews['App'] = app_store_reviews['App'].str.replace('n26-mobile-banking', 'N26')
app_store_reviews['App'] = app_store_reviews['App'].str.replace('monzo-bank', 'MonzoBank')
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_reviews = pd.concat([play_store_reviews, app_store_reviews], ignore_index=True)

# Keep only reviews with a meaningful length (at least 15 characters)
df_reviews = df_reviews[df_reviews.Comment.str.len()>=15]

Now we will classify our reviews by the language they are written in.

In [3]:
# Get the language id for each review
ids_langid = df_reviews['Comment'].apply(langid.classify)

# Get just the language label
langs = ids_langid.apply(lambda result: result[0])

# Assign the language to each review
df_reviews['Language'] = langs

# How many unique language labels were applied?
print("Number of tagged languages (estimated):")
print(len(langs.unique()))

# Percent of the total dataset in English
print("Percent of data in English (estimated):")
print((sum(langs=="en")/len(langs))*100)

Number of tagged languages (estimated):
77
Percent of data in English (estimated):
90.4166594420266


In [4]:
# 90% of the reviews are in English. The population seems to be well represented in that group
#     We will select English reviews only

df_reviews = df_reviews[df_reviews['Language']=='en']
df_reviews

Unnamed: 0,App,Rating,Comment,Platform,Language
0,Revolut,5.0,Good and efficient,Android,en
1,Revolut,3.0,The transfers take a lot longer to hit your in...,Android,en
2,Revolut,5.0,There is a lot of wasted space in the vaults.....,Android,en
3,Revolut,5.0,Everything you could possibly need from a bank...,Android,en
4,Revolut,5.0,Revolut is a brilliant app that saves you lots...,Android,en
...,...,...,...,...,...
75941,bunq,3.0,The bank is good but the new app became super ...,iOS,en
75942,bunq,1.0,The new version has Terrible accessibility,iOS,en
75943,bunq,1.0,This bank used to have one of the best user ex...,iOS,en
75945,bunq,5.0,Fantastic bank for modern thinking people who ...,iOS,en


# 3. Model
Our goal is to classify bad reviews under meaningful topics

In [5]:
# What are people complaining about? Ratings below 4 and at least 15 characters
reviews = df_reviews[df_reviews['Rating']<=3]
reviews = reviews[['App','Comment']].drop_duplicates()
reviews.dropna(inplace=True)
reviews = reviews.reset_index(drop=True)
print(f'% of total reviews are rated below 4: {len(reviews)/len(df_reviews)*100}')

% of total reviews are rated below 4: 31.60740996433092


In [6]:
# Create document term matrix of the reviews
#   max_df : discard words that occur in more than 95% of the documents
#   min_df : keep only words that occur in at least 2 documents

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
reviews_cv = cv.fit_transform(reviews['Comment'])
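
To make the effect of the two thresholds concrete, here is a small pure-Python sketch of document-frequency filtering. It mirrors the idea behind `max_df` and `min_df` but is not scikit-learn's implementation: the helper `filter_vocabulary`, its whitespace tokenisation and the toy reviews are assumptions made for illustration.

```python
from collections import Counter


def filter_vocabulary(docs, max_df=0.95, min_df=2):
    """Keep words whose document frequency lies within the two bounds.

    max_df : float, largest allowed share of documents containing the word
    min_df : int, smallest allowed number of documents containing the word
    """
    # Count each word once per document (document frequency, not term frequency)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * len(docs)
    return sorted(w for w, n in doc_freq.items() if min_df <= n <= max_count)


# Toy reviews, for illustration only
docs = [
    "app crashes after update",
    "app support is slow",
    "app card payment failed",
    "app support never answered",
]
print(filter_vocabulary(docs))
```

In this toy corpus "app" appears in every document and is dropped by `max_df`, while words seen in a single document are dropped by `min_df`. Neither kind of word helps separate topics.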

In [7]:
# LDA model with 4 topics

LDA = LatentDirichletAllocation(n_components=4,random_state=1)
LDA.fit(reviews_cv)

LatentDirichletAllocation(n_components=4, random_state=1)

In [8]:
# Extract the topics and their most represented words

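feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    # argsort is ascending, so the last 20 indices are the highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-20:]])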

topic #0 : 
['bunq', 'old', 'bank', 'banking', 'used', 'ui', 'features', 'user', 'great', 'revolut', 'don', 'good', 'just', 'version', 'really', 'like', 'use', 'new', 'update', 'app']
topic #1 : 
['ve', 'help', 'reason', 'chat', 'contact', 'revolut', 'just', 'email', 'use', 'support', 'number', 'don', 'access', 'phone', 'customer', 'service', 'bank', 'app', 'money', 'account']
topic #2 : 
['need', 'just', 'passport', 'n26', 'bank', 'identity', 'time', 'money', 'verification', 'card', 'revolut', 'verify', 'days', 'id', 'account', 'app', 'chat', 'service', 'customer', 'support']
topic #3 : 
['phone', 'monzo', 'does', 'don', 'transfer', 'tried', 'try', 'free', 'just', 'money', 'use', 'revolut', 'doesn', 'time', 'pay', 'bank', 'work', 'account', 'app', 'card']


In [9]:
# Merge the results into our initial dataset and save it
topic_reviews = LDA.transform(reviews_cv)

df_topic_reviews = pd.DataFrame(topic_reviews, columns=[
    '0_app_functionality',
    '1_customer_support/account_blocking',
    '2_validation/verification',
    '3_financial_products',
])

df_result_low = pd.merge(reviews, df_topic_reviews, how='inner', left_index=True, right_index=True)

# Save results
df_result_low.to_csv("df_result_low.csv")
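
Each review now carries four topic weights. One simple way to turn them into a single label, sketched here as an assumption rather than as the method used in this notebook, is to pick the topic with the largest weight. The helper `dominant_topic` is hypothetical; the two weight rows are copied from the sample output in section 4.

```python
TOPIC_LABELS = [
    "0_app_functionality",
    "1_customer_support/account_blocking",
    "2_validation/verification",
    "3_financial_products",
]


def dominant_topic(weights, labels=TOPIC_LABELS):
    """Return the label with the highest weight, together with that weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return labels[best], weights[best]


# Two rows of topic weights copied from the sample output in section 4
rows = [
    [0.849725, 0.050095, 0.050087, 0.050093],
    [0.012351, 0.012295, 0.012452, 0.962902],
]
for row in rows:
    print(dominant_topic(row))
```

On the DataFrame itself, `df_topic_reviews.idxmax(axis=1)` returns the same label for every row. Reviews whose top two weights are close are better treated as belonging to both topics.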

# 4. Validation
Let's create a random sample of 10 reviews for each app and manually check if the labels assigned are correct

In [13]:
# Create a random sample dataset and save it
rev = df_result_low[df_result_low['App']=='Revolut'].sample(n=10, random_state=1)
n26 = df_result_low[df_result_low['App']=='N26'].sample(n=10, random_state=1)
MonzoBank = df_result_low[df_result_low['App']=='MonzoBank'].sample(n=10, random_state=1)
bunq = df_result_low[df_result_low['App']=='bunq'].sample(n=10, random_state=1)
sample = pd.concat([rev, n26, MonzoBank, bunq])  # DataFrame.append was removed in pandas 2.0

# Save results
sample.to_excel("sample.xlsx")

sample.head(15)

Unnamed: 0,App,Comment,0_app_functionality,1_customer_support/account_blocking,2_validation/verification,3_financial_products
7870,Revolut,Awful. Revolut blocks your account (with money...,0.00897,0.582514,0.398772,0.009745
3867,Revolut,Worst aap I have existing revolt account. I ca...,0.705212,0.233274,0.030297,0.031217
5736,Revolut,There is no Lebanon access! Pitty in 2020 🤨,0.055112,0.842648,0.050351,0.051889
5506,Revolut,I have been unable to use my card since no ass...,0.019793,0.020235,0.72546,0.234512
2666,Revolut,New interface is difficult and unintuitive.,0.849725,0.050095,0.050087,0.050093
954,Revolut,Terrible service and fees Charging 3.99 per at...,0.015733,0.44572,0.015353,0.523194
6011,Revolut,Stuck in a loop ordering card and there is no ...,0.023277,0.276665,0.024546,0.675512
2052,Revolut,This app is sucks Im waiting 4 days for someon...,0.010428,0.638667,0.010751,0.340154
14660,Revolut,There’s no way to deposit money via credit car...,0.012351,0.012295,0.012452,0.962902
14375,Revolut,"Since 6 months, they blocked all the accounts !!",0.06617,0.806993,0.063617,0.06322


In [31]:
# Manually checked the 40 sampled reviews (10 per app) and recorded whether the assigned topic was correct
validation = pd.read_csv('validation.csv')
validation = validation.reset_index(drop=True)
accuracy = validation['IsAccurate'].sum() / validation['IsAccurate'].count()

print(f'The model accuracy (number of correctly labeled reviews / total reviews) is: {accuracy * 100}%')

The model accuracy (number of correctly labeled reviews / total reviews) is: 72.5%
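
The notebook does not show how the `IsAccurate` column was filled in. One plausible procedure, sketched here purely as an assumption, is to compare each review's strongest topic with the topic a human reviewer chose and to record 1 for a match and 0 otherwise. The function `accuracy` and the four example labels are hypothetical and are not taken from `validation.csv`.

```python
def accuracy(predicted, manual):
    """Share of reviews whose predicted topic matches the manual label."""
    if len(predicted) != len(manual):
        raise ValueError("both lists must have the same length")
    if not predicted:
        raise ValueError("at least one review is required")
    # 1 when the model's topic agrees with the human label, 0 otherwise
    is_accurate = [int(p == m) for p, m in zip(predicted, manual)]
    return sum(is_accurate) / len(is_accurate)


# Hypothetical labels for four reviews, for illustration only
predicted = ["support", "verification", "app", "payments"]
manual = ["support", "support", "app", "payments"]
print(f"{accuracy(predicted, manual) * 100}%")
```

With 40 reviews, the 72.5% reported above corresponds to 29 reviews judged correct, so a single changed judgement moves the figure by 2.5 percentage points.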


In [32]:
# Display random reviews from the validation dataset
validation.sample(n=15, random_state=1)

Unnamed: 0,App,Comment,0_app_functionality,1_customer_support/account_blocking,2_validation/verification,3_financial_products,IsAccurate
2,Revolut,There is no Lebanon access! Pitty in 2020 🤨,6%,84%,5%,5%,0
31,bunq,Still pending to solve an issue with them. Goo...,4%,4%,89%,4%,0
3,Revolut,I have been unable to use my card since no ass...,2%,2%,73%,23%,1
21,MonzoBank,Customer service is very poor! I had a payment...,3%,92%,3%,3%,1
27,MonzoBank,Unable to login to the account. First time use...,23%,74%,2%,2%,0
29,MonzoBank,"The app all round is fantastic however, the sp...",40%,1%,1%,58%,1
22,MonzoBank,Why Monzo not creating account? Its stop on ve...,3%,90%,3%,3%,0
39,bunq,"Wow, they couldn’t have messed up more. Time t...",53%,41%,3%,3%,1
19,N26,"As an Iranian living in Ireland, I have no per...",2%,50%,46%,2%,1
26,MonzoBank,To restricted Cash deposit are to restrictive ...,27%,1%,39%,33%,1
