# Create a Dataset for Sentiment Analysis

In this notebook, we will create a dataset for sentiment analysis by scraping user reviews for Android apps. We'll convert the app and review information into Pandas DataFrames and save them to CSV files.

Through this step we will:

- Set a goal and inclusion criteria for your dataset
- Get real-world user reviews by scraping Google Play
- Use Pandas to convert and save the dataset into CSV files

## The Goal of the Dataset

You want to hear what users think of your app. Both positive and negative feedback is useful, but negative reviews can reveal critical missing features or service downtime (especially when they become much more frequent).

Fortunately, Google Play offers a vast selection of apps, ratings, and reviews. We can use the [google-play-scraper](https://github.com/JoMingyu/google-play-scraper) package to scrape reviews and app information.

There are many apps you could analyze. However, different app categories have different audiences, domain-specific quirks, and more. Let's start simple.

We want apps that have been around for a while, so that opinions have accumulated organically. We also want to limit the influence of advertising campaigns as much as possible. Since apps are updated often, the time of each review is an important factor.

Ideally, you would collect and use every available review. In practice, the data you can get is often limited (too large, inaccessible, etc.), so we'll do the best we can.

Let's select a few apps from the *Productivity* category that meet these requirements. The package IDs of the chosen apps, all popular in the US, are listed in the Setup section.




## Table of Contents
1. [Setup](#Setup)
2. [Scraping App Information](#Scraping_App_Information)
3. [Scraping App Reviews](#Scraping_App_Reviews)
4. [Summary](#Summary)
5. [References](#References)

## Setup <a id='Setup'></a>

Install the required packages and set up the imports:

In [1]:
!pip install -qq google-play-scraper

In [2]:
!pip install -qq -U watermark

In [3]:
%reload_ext watermark
%watermark -v -p pandas,matplotlib,seaborn,google_play_scraper

Python implementation: CPython
Python version       : 3.9.19
IPython version      : 7.31.1

pandas             : 2.2.2
matplotlib         : 3.9.2
seaborn            : 0.13.2
google_play_scraper: 1.2.7



In [4]:
import json
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

from pygments import highlight
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter

from google_play_scraper import Sort, reviews, app

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

In [5]:
app_packages = [
  'com.anydo',
  'com.todoist',
  'com.ticktick.task',
  'com.habitrpg.android.habitica',
  'cc.forestapp',
  'com.oristats.habitbull',
  'com.levor.liferpgtasks',
  'com.habitnow',
  'com.microsoft.todos',
  'prox.lab.calclock',
  'com.gmail.jmartindev.timetune',
  'com.artfulagenda.app',
  'com.tasks.android',
  'com.appgenix.bizcal',
  'com.appxy.planner'
]

## Scraping App Information <a id="Scraping_App_Information"></a>

Let's scrape the info for each app:

In [6]:
app_infos = []

for ap in tqdm(app_packages):
  info = app(ap, lang='en', country='us')
  info.pop('comments', None)  # drop the bulky comments field; no error if it is absent
  app_infos.append(info)

100%|██████████| 15/15 [00:10<00:00,  1.38it/s]


We now have the information for all 15 apps. Let's write a helper function that pretty-prints JSON objects with syntax highlighting:

In [7]:
def print_json(json_object):
  json_str = json.dumps(
    json_object,
    indent=2,
    sort_keys=True,
    default=str
  )
  print(highlight(json_str, JsonLexer(), TerminalFormatter()))
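
The `default=str` argument is there because scraped dictionaries can hold values that the `json` module cannot encode by itself, such as `datetime` timestamps. A minimal sketch of the difference, using a made-up record rather than real scraper output:

```python
import json
from datetime import datetime

# Made-up record for illustration; not real scraper output.
record = {"appId": "com.example.app", "at": datetime(2024, 1, 2, 3, 4, 5)}

try:
    json.dumps(record)
    failed = False
except TypeError:
    # datetime objects are not JSON serializable by default
    failed = True

# default=str makes json.dumps call str() on anything it cannot encode
json_str = json.dumps(record, indent=2, sort_keys=True, default=str)
print(failed)
print(json_str)
```

Without `default=str`, the helper would raise a `TypeError` on such a record instead of printing it.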

Sample app information from the list:

In [8]:
print_json(app_infos[0])

{
  "adSupported": false,
  "appId": "com.anydo",
  "categories": [
    {
      "id": "PRODUCTIVITY",
      "name": "Productivity"
    }
  ],
  "containsAds": false,
  "contentRating": "Everyone",
  "contentRatingDescription": null,
  "currenc

We'll store the app information for later by converting the JSON objects into a Pandas dataframe and saving the result into a CSV file:

In [9]:
app_infos_df = pd.DataFrame(app_infos)
app_infos_df.to_csv('apps.csv', index=False, header=True)
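
The inclusion criteria from the beginning of the notebook can also be checked against the scraped app information rather than by eye. The sketch below is one hypothetical way to do that. It assumes each app dictionary carries a `released` date string such as "Nov 10, 2011" and a numeric `reviews` count; the function name, the thresholds, and the sample records are all invented for illustration.

```python
from datetime import datetime

def meets_criteria(info, today, min_age_years=3, min_reviews=1000):
    """Illustrative inclusion check: old enough and reviewed often enough."""
    # Assumes 'released' looks like "Nov 10, 2011"; adjust the format if it differs.
    released = datetime.strptime(info["released"], "%b %d, %Y")
    age_years = (today - released).days / 365.25
    return age_years >= min_age_years and info["reviews"] >= min_reviews

# Made-up records for illustration, not real scraper output.
sample_infos = [
    {"appId": "com.example.veteran", "released": "Nov 10, 2011", "reviews": 25000},
    {"appId": "com.example.newcomer", "released": "Mar 1, 2024", "reviews": 4000},
    {"appId": "com.example.niche", "released": "Jan 5, 2015", "reviews": 120},
]

today = datetime(2024, 9, 1)
selected = [i["appId"] for i in sample_infos if meets_criteria(i, today)]
print(selected)
```

Only the first sample record passes both checks; the other two are either too new or have too few reviews.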

## Scraping App Reviews <a id='Scraping_App_Reviews'></a>

To get a balanced dataset, we'll use the scraping package's option to filter by review score and request each score from 1 to 5 separately, asking for 200 reviews with a score of 3 and 100 for every other score. To get a representative sample for each app, we'll fetch every score under two sort orders: most relevant (the reviews Google Play considers most helpful) and newest.

In [10]:
app_reviews = []

for ap in tqdm(app_packages):
  # Request each star rating separately to keep the scores balanced
  for score in range(1, 6):
    for sort_order in [Sort.MOST_RELEVANT, Sort.NEWEST]:
      rvs, _ = reviews(
        ap,
        lang='en',
        country='us',
        sort=sort_order,
        # Ask for more 3-star reviews than for any other score
        count=200 if score == 3 else 100,
        filter_score_with=score
      )
      # Tag each review with the sort order and app it came from
      for r in rvs:
        r['sortOrder'] = 'most_relevant' if sort_order == Sort.MOST_RELEVANT else 'newest'
        r['appId'] = ap
      app_reviews.extend(rvs)

100%|██████████| 15/15 [01:51<00:00,  7.45s/it]
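
Because every score is fetched under two sort orders, the same review can show up in both the `most_relevant` and the `newest` batch. If duplicates matter for your use case, one option is to keep only the first occurrence of each review. The sketch below assumes each review dictionary has a `reviewId` key, which the scraper appears to provide; `dedupe_reviews` is a hypothetical helper and the sample records are made up.

```python
def dedupe_reviews(review_list, key="reviewId"):
    """Keep the first occurrence of each review, preserving order."""
    seen = set()
    unique = []
    for r in review_list:
        if r[key] in seen:
            continue  # already kept an earlier copy of this review
        seen.add(r[key])
        unique.append(r)
    return unique

# Made-up records for illustration, not real scraper output.
sample = [
    {"reviewId": "a1", "sortOrder": "most_relevant"},
    {"reviewId": "b2", "sortOrder": "most_relevant"},
    {"reviewId": "a1", "sortOrder": "newest"},
]
unique_reviews = dedupe_reviews(sample)
print(len(unique_reviews))
```

Whether to deduplicate at all is a design choice: keeping both copies preserves the `sortOrder` tag for each batch, while dropping them avoids counting one opinion twice.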


In [11]:
print_json(app_reviews[0])

{
  "appId": "com.anydo",
  "appVersion": "5.18.2.3",
  "at": "2024-08-21 10:43:17",
  "content": "The new update is terrible. I really dislike the new agenda view. Why would remove the bullet task list from the monthly view on the mobile format? For me, the simplicity of design, logical functionality, and visual appeal set anydo apart from the others. I don't know that I will continue to use it. I wish I could uninstall the update.",
  "repliedAt": "2024-08-23 11:07:55",
  "replyContent": "If you're referring to the monthly

`repliedAt` and `replyContent` contain the developer's response to the review. Both are empty when the developer has not replied.



In [12]:
len(app_reviews)

17670
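
That total is a little below the most the loop could have returned. A quick back-of-the-envelope check, using only the numbers from the scraping loop (15 apps, scores 1 to 5, two sort orders, 200 reviews for a score of 3 and 100 otherwise):

```python
n_apps = 15
n_sort_orders = 2

# Reviews requested per app and sort order: 100 + 100 + 200 + 100 + 100
per_sort_order = sum(200 if score == 3 else 100 for score in range(1, 6))

max_reviews = n_apps * n_sort_orders * per_sort_order
shortfall = max_reviews - 17670  # 17670 is the length reported above

print(per_sort_order, max_reviews, shortfall)
```

The upper bound is 18,000 reviews, so 330 requested reviews did not come back. A plausible explanation is that some apps have fewer reviews for a given score than we asked for, though the output does not confirm this.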

Saving the reviews to a CSV file:

In [13]:
app_reviews_df = pd.DataFrame(app_reviews)
app_reviews_df.to_csv('reviews.csv', index=False, header=True)
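
When the CSV is loaded again later, timestamp columns come back as plain strings unless Pandas is told to parse them, and missing developer replies come back as `NaN`. A small round-trip sketch, using made-up rows and an in-memory buffer in place of `reviews.csv`:

```python
import io
from datetime import datetime

import pandas as pd

# Made-up rows for illustration, not real scraper output.
rows = [
    {"appId": "com.example.app", "score": 1,
     "at": datetime(2024, 8, 21, 10, 43, 17), "replyContent": "Thanks for the feedback"},
    {"appId": "com.example.app", "score": 5,
     "at": datetime(2024, 8, 22, 9, 0, 0), "replyContent": None},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns the 'at' column back into datetimes on load
loaded = pd.read_csv(buffer, parse_dates=["at"])
missing_replies = int(loaded["replyContent"].isna().sum())
print(loaded.dtypes)
print(missing_replies)
```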

## Summary <a id="Summary"></a>

We now have a dataset of 17,670 user reviews from 15 productivity apps.

Next, we're going to use the reviews for sentiment analysis with BERT. But first, we'll have to do some text preprocessing!


## References <a id="References"></a>

- [Google Play Scraper for Python](https://github.com/JoMingyu/google-play-scraper)