# Natural Language Processing of Reviews

In this notebook we look at the most common words, bigrams, and trigrams in the reviews.

## Load the reviews

In [18]:
import pandas as pd

reviews = pd.read_pickle('data/combined.pkl')

In [3]:
# Show the reviews as an interactive data table.
# NOTE: This only works when executed on a Jupyter server, not on GitHub.
import qgrid
qgrid_widget = qgrid.show_grid(reviews, show_toolbar=True)
qgrid_widget


## Most common words

In [5]:
import nltk
from collections import Counter

In [17]:
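# Concatenate all reviews into one string and split it into tokens
text = reviews['content'].str.cat(sep=' ')
tokens = nltk.word_tokenize(text)
stopwords = set(nltk.corpus.stopwords.words('german'))
# NOTE: PorterStemmer targets English; here it is applied to German text as a rough normalization
stemmer = nltk.stem.PorterStemmer()
# Drop stopwords (case-sensitive check) and tokens shorter than 3 characters, then stem
stemmed_tokens = [stemmer.stem(tok) for tok in tokens if tok not in stopwords and len(tok) > 2]
Counter(stemmed_tokens).most_common(10)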

[('app', 8089),
 ('daten', 4157),
 ('leider', 4017),
 ('samsung', 3384),
 ('ich', 3189),
 ('tage', 2947),
 ('funktioniert', 2500),
 ('...', 2464),
 ('gern', 2321),
 ('googl', 2250)]

Bigrams and trigrams give more insight than single words, so we skip further analysis of single words for now.
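
The list above still contains `ich` and the token `...`, and `die` shows up in the n-gram lists further down, although `ich` and `die` are German stopwords. A likely cause is that the stopword check is case-sensitive: capitalized tokens such as `Ich` pass the filter and are lowercased only afterwards by the stemmer. Below is a minimal sketch of a case-insensitive filter using only the standard library. The tiny stopword set, the sample tokens and the function name `filter_tokens` are made up for illustration; in the notebook one would presumably pass the NLTK German stopword list instead.

```python
from collections import Counter

# Hypothetical mini stopword set; the notebook would use the NLTK German list.
STOPWORDS = {"ich", "die", "der", "und", "ist"}

def filter_tokens(tokens, stopwords=STOPWORDS, min_len=3):
    """Lowercase tokens first, then drop stopwords, short tokens and pure punctuation."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if len(lowered) < min_len:
            continue
        if lowered in stopwords:
            continue
        if not any(ch.isalnum() for ch in lowered):
            continue  # drops tokens such as '...', keeps tokens such as '403'
        kept.append(lowered)
    return kept

# Made-up sample tokens, not taken from the review data
sample = ["Ich", "finde", "die", "App", "gut", "...", "Die", "Idee", "ist", "gut"]
filtered = filter_tokens(sample)
counts = Counter(filtered)
```

Lowercasing before the stopword lookup is the only change in behavior that matters here; the punctuation check is an optional extra.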

## Most common bigrams (2 words occurring together)

In [13]:
bigrams = nltk.ngrams(stemmed_tokens, 2)
list(enumerate(Counter(bigrams).most_common(50), start=1))

[(1, (('googl', 'fit'), 1243)),
 (2, (('gespendeten', 'tage'), 899)),
 (3, (('samsung', 'health'), 680)),
 (4, (('gute', 'ide'), 608)),
 (5, (('die', 'app'), 606)),
 (6, (('gespendet', 'tage'), 496)),
 (7, (('gern', 'helfen'), 409)),
 (8, (('leider', 'samsung'), 391)),
 (9, (('samsung', 'gear'), 389)),
 (10, (('appl', 'health'), 388)),
 (11, (('hätte', 'gern'), 384)),
 (12, (('ich', 'gern'), 376)),
 (13, (('seit', 'tagen'), 368)),
 (14, (('anzahl', 'gespendeten'), 358)),
 (15, (('ich', 'app'), 357)),
 (16, (('die', 'ide'), 355)),
 (17, (('fehler', '403'), 351)),
 (18, (('würde', 'gern'), 331)),
 (19, (('appl', 'watch'), 285)),
 (20, (('galaxi', 'watch'), 268)),
 (21, (('samsung', 'smartwatch'), 265)),
 (22, (('tage', 'angezeigt'), 264)),
 (23, (('ide', 'gut'), 262)),
 (24, (('samsung', 'galaxi'), 262)),
 (25, (('ich', 'hoff'), 255)),
 (26, (('app', 'funktioniert'), 246)),
 (27, (('app', 'seit'), 242)),
 (28, (('anzeig', 'gespendeten'), 240)),
 (29, (('daten', 'übertragen'), 235)),
 (30

## Findings

- #1, #3, #9: mention specific devices and health platforms. A follow-up could compare sentiment and analyze the words appearing together with each device name to see which devices perform well.
- #2: "gespendeten tage" (donated days) could refer to a specific bug
- #4, #7: positive attitude towards the app; users would like to help
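
The idea of looking at the words that appear together with a device name can be sketched as a simple context-window count. This is only an illustration under assumptions: the function name `neighbours`, the window size of 2 and the sample tokens are made up, and in the notebook the input would presumably be `stemmed_tokens`.

```python
from collections import Counter

def neighbours(tokens, target, window=2):
    """Count tokens that appear within `window` positions of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Made-up sample tokens, not taken from the review data
sample_tokens = ["leider", "samsung", "health", "funktioniert", "nicht",
                 "googl", "fit", "funktioniert", "gut", "samsung", "gear", "leider"]
samsung_context = neighbours(sample_tokens, "samsung")
```

Calling `samsung_context.most_common(10)` would then list the most frequent context words for that device name, which could be compared across devices.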

## Most common trigrams (3 words occurring together)

In [15]:
trigrams = nltk.ngrams(stemmed_tokens, 3)
list(enumerate(Counter(trigrams).most_common(50), start=1))

[(1, (('anzahl', 'gespendeten', 'tage'), 290)),
 (2, (('anzeig', 'gespendeten', 'tage'), 206)),
 (3, (('hätte', 'gern', 'geholfen'), 160)),
 (4, (('samsung', 'galaxi', 'watch'), 153)),
 (5, (('würde', 'gern', 'helfen'), 143)),
 (6, (('gespendeten', 'tage', 'angezeigt'), 131)),
 (7, (('ich', 'gern', 'helfen'), 117)),
 (8, (('anmeldung', 'googl', 'fit'), 111)),
 (9, (('die', 'ide', 'gut'), 106)),
 (10, (('schade', 'hätte', 'gern'), 104)),
 (11, (('anzahl', 'gespendet', 'tage'), 101)),
 (12, (('gespendeten', 'tage', 'immer'), 100)),
 (13, (('fehler', '403', 'rate_limit_exceed'), 100)),
 (14, (('googl', 'fit', 'möglich'), 95)),
 (15, (('ich', 'hätte', 'gern'), 88)),
 (16, (('verbindung', 'googl', 'fit'), 87)),
 (17, (('gute', 'ide', 'leider'), 83)),
 (18, (('googl', 'fit', 'verbinden'), 83)),
 (19, (('app', 'seit', 'tagen'), 82)),
 (20, (('ich', 'find', 'ide'), 78)),
 (21, (('immer', 'gespendet', 'tage'), 78)),
 (22, (('hätte', 'gern', 'mitgemacht'), 78)),
 (23, (('rate', 'limit', 'exceed'

## Findings

- #1, #2: could again point to a bug, possibly in how the number of donated days is displayed
- #3, #5, #10: users intended to share their data but were not successful
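
The n-grams above are built from one concatenated token stream (all reviews joined with `str.cat`), so some n-grams may span the end of one review and the start of the next. A minimal sketch of counting n-grams per review instead, using only the standard library. The helper names and the sample reviews are made up for illustration, and the sketch assumes each review has already been tokenized and stemmed into its own list.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def count_ngrams_per_review(tokenized_reviews, n):
    """Count n-grams review by review, so no n-gram crosses a review boundary."""
    counts = Counter()
    for tokens in tokenized_reviews:
        counts.update(ngrams(tokens, n))
    return counts

# Made-up sample: one token list per review
tokenized_reviews = [
    ["hätte", "gern", "geholfen"],
    ["gute", "ide"],
    ["hätte", "gern", "mitgemacht"],
]
bigram_counts = count_ngrams_per_review(tokenized_reviews, 2)
trigram_counts = count_ngrams_per_review(tokenized_reviews, 3)
```

Reviews shorter than `n` tokens simply contribute no n-grams.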

## Todo
- compare unigrams / bigrams / trigrams between both stores
- sentiment analysis (positive / negative), compare stores
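
One possible way to approach the first item is to compare relative n-gram frequencies, since the two stores probably differ in the number of reviews. The following is a sketch only: the names `store_a` and `store_b`, the helper functions and the counts are made up, and how the store is recorded in the data is not shown in this notebook.

```python
from collections import Counter

def relative_frequencies(ngram_counts):
    """Convert absolute n-gram counts into shares of the total."""
    total = sum(ngram_counts.values())
    if total == 0:
        return {}
    return {ngram: count / total for ngram, count in ngram_counts.items()}

def compare_stores(counts_a, counts_b):
    """Sort n-grams by the difference in relative frequency (a minus b), largest first."""
    freq_a = relative_frequencies(counts_a)
    freq_b = relative_frequencies(counts_b)
    all_ngrams = set(freq_a) | set(freq_b)
    diffs = {ng: freq_a.get(ng, 0.0) - freq_b.get(ng, 0.0) for ng in all_ngrams}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)

# Made-up bigram counts for two hypothetical stores
store_a = Counter({("googl", "fit"): 6, ("gute", "ide"): 4})
store_b = Counter({("appl", "health"): 5, ("gute", "ide"): 5})
ranking = compare_stores(store_a, store_b)
```

The top of `ranking` holds n-grams that are relatively more frequent in the first store, the bottom those that are more frequent in the second.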