# Create a Dataset for Sentiment Analysis

## Setup

Let's install the required packages and set up the imports:

In [1]:
!pip install -qq google-play-scraper

In [2]:
!pip install -qq -U watermark

In [3]:
%reload_ext watermark
%watermark -v -p pandas,matplotlib,seaborn,google_play_scraper

CPython 3.8.3
IPython 7.16.1

pandas 1.0.5
matplotlib 3.2.2
seaborn 0.10.1
google_play_scraper 0.0.3.0


In [4]:
!pip install requests



In [5]:
import json
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

from pygments import highlight
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter

from google_play_scraper import Sort, reviews, app

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

In [6]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Using cached beautifulsoup4-4.9.1-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.1 soupsieve-2.0.1


In [21]:
keywords = ['nordea', 'DNA', 'posti', 'yle', 'lidl', 'telia', 'elisa', 'pizza', 'mobilepay', 'VR%Matkalla','ilta','Oma','Suomi','mobiilisovellus','OmaMobiili','tori']

In [22]:
import requests
from bs4 import BeautifulSoup
import re

In [23]:
apps_packages_ids = []
# Markers that surround the package id inside each app link
pre_txt = 'href="/store/apps/details?id='
post_txt = '"'
for keyword in tqdm(keywords):
    URL = f'https://play.google.com/store/search?q={keyword}&c=apps'
    page = requests.get(URL)
    soup = BeautifulSoup(page.text, 'html.parser')
    for link in soup.find_all('a'):
        link = str(link)
        # Keep only app detail links and skip the anchors that carry a tabindex
        if pre_txt in link and 'tabindex' not in link:
            start = link.find(pre_txt) + len(pre_txt)
            end = link.find(post_txt, start)
            apps_packages_ids.append(link[start:end])

100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:12<00:00,  1.25it/s]
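
The string slicing above depends on the exact text of each link. As a rough alternative sketch, the package id can be read from the `id` query parameter of the link's `href`. The example uses only the standard library (`html.parser` instead of BeautifulSoup) so that it runs on its own; `AppLinkParser`, `extract_package_ids` and the sample HTML are hypothetical names and data for illustration, not part of the Play Store or of any library used here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class AppLinkParser(HTMLParser):
    """Collects package ids from <a href="/store/apps/details?id=..."> tags."""

    def __init__(self):
        super().__init__()
        self.package_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        parsed = urlparse(href)
        if parsed.path != '/store/apps/details':
            return
        # parse_qs returns a dict of lists, e.g. {'id': ['com.example.app']}
        self.package_ids.extend(parse_qs(parsed.query).get('id', []))


def extract_package_ids(html):
    parser = AppLinkParser()
    parser.feed(html)
    return parser.package_ids


# Hypothetical search-result fragment, made up for this example
sample_html = '''
<a href="/store/apps/details?id=com.example.bank">Bank</a>
<a href="/store/apps/details?id=com.example.mail&amp;hl=fi">Mail</a>
<a href="/store/search?q=posti">Search</a>
'''
found_ids = extract_package_ids(sample_html)
print(found_ids)
```

Parsing the query string keeps the id intact even when other parameters follow it in the link.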


In [24]:
apps_packages_ids = set(apps_packages_ids)

In [25]:
len(apps_packages_ids)

706

## Scraping App Information

Let's scrape the info for each app:

In [26]:
app_infos = []

for ap in tqdm(apps_packages_ids):
    info = app(ap, lang='fi', country='fi')
    # Drop the bundled sample comments; the reviews are scraped separately below
    info.pop('comments', None)
    app_infos.append(info)

100%|████████████████████████████████████████████████████████████████████████████████| 706/706 [07:35<00:00,  1.55it/s]


We got the info for all 706 apps. Let's write a helper function that prints JSON objects a bit better:

In [27]:
def print_json(json_object):
    json_str = json.dumps(
        json_object,
        indent=2,
        sort_keys=True,
        default=str
    )
    print(highlight(json_str, JsonLexer(), TerminalFormatter()))
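
The `default=str` argument in the helper matters when a record holds values that `json` cannot serialize by itself. Here is a minimal sketch of that behaviour, assuming a record with a `datetime` field; the `record` dict is a made-up example, not real scraper output.

```python
import json
from datetime import datetime

# Hypothetical record with a datetime value, for illustration only
record = {'score': 5, 'at': datetime(2020, 7, 1, 12, 30)}

try:
    json.dumps(record)
except TypeError as error:
    # Without a default handler, json refuses the datetime value
    print('Without default=str:', error)

# default=str is called for every value json cannot serialize on its own
json_str = json.dumps(record, indent=2, sort_keys=True, default=str)
print(json_str)
```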

Each app's information contains lots of details, including the number of ratings, the number of reviews and the number of ratings for each score (1 to 5). Let's ignore all of that for now and double-check how many apps we collected:

In [28]:
len(apps_packages_ids)

706

We'll store the app information for later by converting the JSON objects into a Pandas dataframe and saving the result into a CSV file:

In [128]:
app_infos_df = pd.DataFrame(app_infos)
app_infos_df.to_csv('../data/apps.csv', index=None, header=True)

In [29]:
from google_play_scraper import Sort, reviews_all

In [30]:
app_reviews = []

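for ap in tqdm(apps_packages_ids):
    try:
        # Fetch every available review for this app
        result = reviews_all(
            ap,
            lang='fi',
            country='fi',
        )
    except IndexError:
        # Skip apps for which the scraper cannot parse the response
        continue
    # Tag each review with the app it belongs to
    for r in result:
        r['appId'] = ap
    app_reviews.extend(result)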

100%|████████████████████████████████████████████████████████████████████████████████| 706/706 [19:39<00:00,  1.67s/it]
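
The loop above skips any app that raises `IndexError` without keeping track of it. One possible variation, sketched here under assumptions, records the skipped ids so they can be inspected or retried later. `collect_reviews` and `fake_fetch` are hypothetical helpers; the fetcher is passed in as an argument so the example runs without network access.

```python
def collect_reviews(app_ids, fetch_reviews):
    """Gather reviews for every app id and remember which ids failed.

    fetch_reviews is any callable that takes an app id and returns a list
    of review dicts (a stand-in for reviews_all in this sketch).
    """
    collected, failed = [], []
    for app_id in app_ids:
        try:
            result = fetch_reviews(app_id)
        except IndexError:
            failed.append(app_id)
            continue
        # Tag each review with the app it belongs to
        for review in result:
            review['appId'] = app_id
        collected.extend(result)
    return collected, failed


# Hypothetical fetcher used only to exercise the function offline
def fake_fetch(app_id):
    if app_id == 'com.example.broken':
        raise IndexError('no reviews could be parsed')
    return [{'content': 'Hyvä sovellus', 'score': 5}]


all_reviews, failed_ids = collect_reviews(
    ['com.example.bank', 'com.example.broken'], fake_fetch
)
print(len(all_reviews), failed_ids)
```

With the real scraper, the fetcher could be something like `lambda ap: reviews_all(ap, lang='fi', country='fi')`.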


Note that we're adding the app id to each review, so we can tell which app it belongs to.

`repliedAt` and `replyContent` contain the developer response to the review. Of course, they can be missing.
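
Because the reply fields can be missing, code that reads them should not assume they are present. A small sketch of one way to handle that follows; `developer_reply` and the sample reviews are hypothetical and only mirror the two field names mentioned above.

```python
def developer_reply(review):
    """Return (replied_at, reply_text), or None when there is no reply."""
    reply_text = review.get('replyContent')
    if not reply_text:
        return None
    return review.get('repliedAt'), reply_text


# Made-up reviews: one with a reply, one with empty fields, one without them
reviews_sample = [
    {'content': 'Toimii hyvin',
     'replyContent': 'Kiitos palautteesta!',
     'repliedAt': '2020-07-02 09:00:00'},
    {'content': 'Kaatuu jatkuvasti', 'replyContent': None, 'repliedAt': None},
    {'content': 'Ok'},
]
replies = [developer_reply(r) for r in reviews_sample]
print(replies)
```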

How many app reviews did we get?



In [31]:
len(app_reviews)

194472
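
The total alone does not show how the reviews are spread across apps or scores. As a quick sketch, `collections.Counter` can break the count down by the `appId` we added and by the `score` field; the sample reviews below are invented for illustration.

```python
from collections import Counter

# Made-up reviews shaped like the scraped ones (only the fields used here)
sample_reviews = [
    {'appId': 'com.example.bank', 'score': 1},
    {'appId': 'com.example.bank', 'score': 5},
    {'appId': 'com.example.mail', 'score': 4},
]

# Count reviews per app and per score
reviews_per_app = Counter(r['appId'] for r in sample_reviews)
reviews_per_score = Counter(r['score'] for r in sample_reviews)

print(reviews_per_app.most_common())
print(sorted(reviews_per_score.items()))
```

On the real data, the same two lines could be run on `app_reviews` directly.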

Let's save the reviews to a CSV file:

In [33]:
app_reviews_df = pd.DataFrame(app_reviews)
app_reviews_df.to_csv('../data/reviews.csv', index=None, header=True)