# App Market Research

In this project, publicly available data about dog play-date apps on the Google Play Store is scraped and analyzed to assess the potential of such apps in the app market. Possible insights from the analysis include:
* What do people love most and least about such apps?
* How well do such apps generally tend to do financially?
* What regions of the world are such apps currently found in?

In [42]:
# Imports.
import datetime
import json
import re
from glob import glob

import numpy as np
import pandas as pd
import pycountry
from google_play_scraper import app, search
from tqdm import tqdm

from utility_functions import get_now, inspect_function

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [48]:
# Countries based on dogs per capita.
# References: 
# 1. https://www.mappr.co/thematic-maps/world-pet-ownership/
# 2. https://www.petsecure.com.au/pet-care/a-guide-to-worldwide-pet-ownership/

country_names = [ # ISO standard names.
    'United States', 'Brazil', 'China', 'Russian Federation', 'Japan', 
    'Philippines', 'India', 'Argentina', 'United Kingdom', 'France', 
    'Poland', 'Spain', 'Romania', 'Australia', 'Hungary',
    'Czechia', 'South Africa', 'Germany', 'Ethiopia',  'Canada'
]

languages = {'english': 'en'} # English apps only.

## 1. Data Acquisition

In [49]:
search_str = "dog dating" # Search string => kind of app we're interested in.
n_apps = 30 # No. of apps per country (value range = [1, 30]).

In [50]:
# For each of our shortlisted countries ...
start_country = 0
for i in range(start_country, len(country_names)):
    data = []
    country = country_names[i]
    print(f"COUNTRY: {country}")
    # Try to get information about some dog play-date apps.
    country_code = pycountry.countries.get(name=country).alpha_2.lower()
    language_code = languages['english']
    app_ids = [a['appId'] for a in search(search_str, lang=language_code, country=country_code, n_hits=n_apps)]
    # For each app ...
    for j in tqdm(range(len(app_ids))):
        try:
            app_id = app_ids[j]
            # Get details regarding the app.
            app_details = app(app_id, lang=language_code, country=country_code)
            # Add potentially useful new data fields.
            app_details['searchMoment'] = str(datetime.datetime.now())
            app_details['countryCode'] = country_code
            app_details['languageCode'] = language_code
            # Remove less useful data fields (pop() tolerates missing keys,
            # so an absent field no longer discards the whole app).
            for field in ('icon', 'headerImage', 'screenshots',
                          'video', 'videoImage', 'descriptionHTML'):
                app_details.pop(field, None)
            # Add app to data.
            data.append(app_details)
        except Exception as e: 
            print(f"app {j}:", e)
    df = pd.DataFrame(data)
    df.to_csv(f"./Data/{'_'.join(country.lower().split())}.csv", index=False)

COUNTRY: United States


100%|██████████| 30/30 [00:55<00:00,  1.85s/it]


COUNTRY: Brazil


  3%|▎         | 1/30 [00:19<09:20, 19.32s/it]

app 0: Remote end closed connection without response


 63%|██████▎   | 19/30 [01:00<01:45,  9.58s/it]

app 18: App not found. Status code 504 returned.


100%|██████████| 30/30 [01:07<00:00,  2.25s/it]


COUNTRY: China


 77%|███████▋  | 23/30 [01:53<01:22, 11.78s/it]

app 22: App not found. Status code 504 returned.


100%|██████████| 30/30 [01:58<00:00,  3.95s/it]


COUNTRY: Russian Federation


100%|██████████| 30/30 [00:49<00:00,  1.66s/it]


COUNTRY: Japan


 30%|███       | 9/30 [00:36<03:29,  9.97s/it]

app 8: App not found. Status code 504 returned.


100%|██████████| 30/30 [00:51<00:00,  1.70s/it]


COUNTRY: Philippines


100%|██████████| 30/30 [01:32<00:00,  3.10s/it]


COUNTRY: India


 70%|███████   | 21/30 [00:46<01:27,  9.71s/it]

app 20: App not found. Status code 504 returned.


100%|██████████| 30/30 [00:53<00:00,  1.78s/it]


COUNTRY: Argentina


100%|██████████| 30/30 [00:50<00:00,  1.68s/it]


COUNTRY: United Kingdom


  3%|▎         | 1/30 [00:30<14:39, 30.32s/it]

app 0: App not found. Status code 504 returned.


100%|██████████| 30/30 [00:51<00:00,  1.72s/it]


COUNTRY: France


100%|██████████| 30/30 [00:52<00:00,  1.76s/it]


COUNTRY: Poland


100%|██████████| 30/30 [00:53<00:00,  1.77s/it]


COUNTRY: Spain


100%|██████████| 30/30 [00:28<00:00,  1.06it/s]


COUNTRY: Romania


100%|██████████| 30/30 [00:38<00:00,  1.29s/it]


COUNTRY: Australia


100%|██████████| 30/30 [00:55<00:00,  1.86s/it]


COUNTRY: Hungary


100%|██████████| 30/30 [00:56<00:00,  1.89s/it]


COUNTRY: Czechia


 97%|█████████▋| 29/30 [01:03<00:09,  9.73s/it]

app 28: App not found. Status code 504 returned.


100%|██████████| 30/30 [01:04<00:00,  2.13s/it]


COUNTRY: South Africa


100%|██████████| 11/11 [00:10<00:00,  1.04it/s]


COUNTRY: Germany


 87%|████████▋ | 26/30 [01:09<00:38,  9.68s/it]

app 25: App not found. Status code 504 returned.


100%|██████████| 30/30 [01:13<00:00,  2.45s/it]


COUNTRY: Ethiopia


100%|██████████| 30/30 [00:51<00:00,  1.70s/it]


COUNTRY: Canada


 83%|████████▎ | 25/30 [00:49<00:48,  9.65s/it]

app 24: App not found. Status code 504 returned.


100%|██████████| 30/30 [00:54<00:00,  1.81s/it]
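
The log above shows a handful of lookups failing with what look like transient errors (`Status code 504`, `Remote end closed connection`), and each failure drops that app from the country's file. Below is a minimal sketch of retrying with exponential backoff, assuming these errors really are transient. `with_retries` and `flaky_fetch` are hypothetical names used for illustration; neither is part of `google_play_scraper`.

```python
import time

def with_retries(fetch, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fetch(*args, **kwargs), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller log and skip the app.
            sleep(base_delay * 2 ** attempt)  # Waits 1s, 2s, 4s, ... by default.

# Hypothetical stand-in for a lookup that fails twice before succeeding.
calls = {'n': 0}

def flaky_fetch(app_id):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError("Status code 504 returned.")
    return {'appId': app_id}

# The no-op sleep keeps the demonstration instant.
result = with_retries(flaky_fetch, 'com.fetchadate', sleep=lambda seconds: None)
```

In the acquisition loop, the lookup could then be written as `with_retries(app, app_id, lang=language_code, country=country_code)`, with the existing `try`/`except` still catching apps that fail on every attempt.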


In [75]:
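import ast

# Load the per-country files, skipping the combined file written by an earlier run
# (otherwise "./Data/*.csv" would also match all_countries.csv and duplicate rows).
country_data_files = [
    f for f in glob("./Data/*.csv") if not f.endswith("all_countries.csv")
]

df = pd.concat([pd.read_csv(f) for f in country_data_files])

df = df.reset_index(drop=True)

# Split the 'histogram' list string into one column per star rating.
df = pd.concat([
    pd.DataFrame(
        list(df['histogram'].apply(ast.literal_eval)),
        columns=['ratings1', 'ratings2', 'ratings3', 'ratings4', 'ratings5']
    ),
    df.drop(['histogram', 'ratings'], axis=1)
], axis=1)

# Flatten the list of {'name', 'id'} objects into a space-separated string of names.
# ast.literal_eval parses the stored literal without executing arbitrary code.
df['categories'] = df['categories'].apply(
    lambda r: ' '.join(sorted({c['name'].lower().replace(' ', '-') for c in ast.literal_eval(r)}))
)

# Join the list of comments into a single lowercase string.
df['comments'] = df['comments'].apply(lambda r: ' '.join(ast.literal_eval(r)).lower())

df.to_csv("./Data/all_countries.csv", index=False)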

Each app details request returns data with the following fields.
* `title`: Brief title.
* `description`: Description in plain text.
* `descriptionHTML`: Description in HTML format.
* `summary`: Summary of what the app is about.
* `installs`: No. of installs display string.
* `minInstalls`: At least these many installs.
* `realInstalls`: Exactly these many installs.
* `score`: Average user rating out of 5.
* `ratings`: No. of ratings.
* `reviews`: No. of reviews.
* `histogram`: List corresponding to no. of 1, 2, 3, 4 and 5 ratings respectively.
* `price`: Price of install.
* `free`: Whether or not this app is free to install.
* `currency`: Currency that the price is expressed in.
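* `sale`: Whether or not this app is currently on sale (discounted).
* `saleTime`: Time associated with the sale, if any; presumably when it ends.
* `originalPrice`: Price before the sale discount, if the app is on sale.
* `saleText`: Promotional text describing the sale, if any.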
* `offersIAP`: Whether or not this app offers in app purchases.
* `inAppProductPrice`: String indicating prices of purchasable items in the app.
* `developer`: Developer of the app.
* `developerId`: Identification string corresponding to app developer.
* `developerEmail`: Email corresponding to app developer.
* `developerWebsite`: Website corresponding to app developer.
* `developerAddress`: Address corresponding to app developer.
* `privacyPolicy`: Link to the privacy policy of this app.
* `genre`: A string trying to encompass the main category that this app may be put into.
* `genreId`: Identifier string trying to encompass the main category that this app may be put into.
* `categories`: List of {'name', 'id'} objects corresponding to categories that this app may be put into.
* `icon`: Link to app icon image.
* `headerImage`: Link to app header image that shows up as part of a search result.
* `screenshots`: List of links to screenshots of the app.
* `video`: A video associated with this app.
* `videoImage`: Link to a thumbnail image of above video.
* `contentRating`: Age-based content rating of the app (e.g. "Everyone", "Teen").
* `contentRatingDescription`: Additional descriptors attached to the content rating, if any.
* `adSupported`: Whether or not this app supports ads.
* `containsAds`: Whether or not this app contains ads.
* `released`: String corresponding to date of release of this app.
* `updated`: Unix timestamp (in seconds) of the app's most recent update.
* `version`: Current version string.
* `comments`: List of some comments.
* `appId`: App identifier string.
* `url`: Link to this app on Google Play Store.
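
As a quick check on how `histogram` relates to `score`, the average rating can be recomputed from the five star counts. This is a minimal sketch, assuming the histogram arrives as a list string, as it does after a CSV round trip. `mean_rating` is a hypothetical helper, and the counts are those of the Old Friends Dog Game row in the table further down.

```python
import ast

def mean_rating(histogram):
    """Average star rating from counts of 1- to 5-star ratings; None if unrated."""
    total = sum(histogram)
    if total == 0:
        return None
    weighted = sum(stars * count for stars, count in enumerate(histogram, start=1))
    return weighted / total

# Parse the stored list string, then average it.
histogram = ast.literal_eval("[607, 0, 944, 2092, 12688]")
average = mean_rating(histogram)  # Roughly 4.61.
```

Returning `None` for an all-zero histogram keeps unrated apps from being read as zero-star apps.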

The following potentially useful data fields were added to the details of each app.
* `searchMoment`: Date-time string marking when the app details were fetched.
* `countryCode`: Lowercase ISO 3166-1 alpha-2 code of the country whose store was queried.
* `languageCode`: Language code passed to the request (`en` throughout).
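
The two time fields can be combined to estimate how recently an app was updated relative to when it was fetched. This is a minimal sketch, assuming `updated` is a Unix timestamp in seconds (which the values in the table below suggest). `searchMoment` carries no time zone, so treating it as UTC is also an assumption. The sample values are taken from the DogDater row as displayed.

```python
from datetime import datetime, timezone

updated = 1.689945e9  # Unix timestamp in seconds, as shown in the 'updated' column.
search_moment = "2023-08-23 13:28:24.696038"  # Format produced by str(datetime.datetime.now()).

updated_at = datetime.fromtimestamp(updated, tz=timezone.utc)
# Assumption: the recorded moment is interpreted as UTC for the comparison.
fetched_at = datetime.fromisoformat(search_moment).replace(tzinfo=timezone.utc)

# Whole days between the last update and the fetch.
days_since_update = (fetched_at - updated_at).days
```

A time zone mismatch of a few hours rarely changes the day count, so this is adequate for coarse measures such as "updated within the last 90 days".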

The following less useful data fields were removed from the details of each app.
* `icon`, `headerImage`, `screenshots`: Not looking to work with images.
* `video`: Not looking to work with videos.
* `videoImage`: Not looking to work with videos.
* `descriptionHTML`: Redundant since another field with the same description as plain text already exists.

In [76]:
df = pd.read_csv("./Data/all_countries.csv")

In [77]:
df

Unnamed: 0,ratings1,ratings2,ratings3,ratings4,ratings5,title,description,summary,installs,minInstalls,...,containsAds,released,updated,version,comments,appId,url,searchMoment,countryCode,languageCode
0,0,0,0,0,0,DogDater,"Few animals are as social as our dog friends, ...",The App for all dog lovers. Find dogs around t...,"10,000+",10000.0,...,False,"Sep 22, 2021",1.689945e+09,2.4.2,the only people that are popping up for me are...,com.dogdatr.app,https://play.google.com/store/apps/details?id=...,2023-08-23 13:28:24.696038,ar,en
1,0,0,0,0,0,Pawmates: The Dog Meetup App,Featured on 50+ news and media channels includ...,"Match, chat, and meet dogs and their owners ne...","5,000+",5000.0,...,True,"Jul 21, 2022",1.691885e+09,4.12,great app and presents well visually but alway...,com.colin.barkerchat,https://play.google.com/store/apps/details?id=...,2023-08-23 13:28:25.325660,ar,en
2,0,0,0,0,0,Fetchadate,FetchaDate is Where Pet Lovers Meet! The place...,FetchaDate is where single pet lovers meet and...,"1,000+",1000.0,...,False,"Nov 24, 2020",1.680784e+09,2.1.0,not enough users. i set my radius for around 3...,com.fetchadate,https://play.google.com/store/apps/details?id=...,2023-08-23 13:28:26.063477,ar,en
3,607,0,944,2092,12688,Old Friends Dog Game,"Welcome to Old Friends Dog Game, where love ne...",Save cute dogs and help them enjoy life in thi...,"1,000,000+",1000000.0,...,True,"Aug 8, 2021",1.687988e+09,1.20.03,absolutely loving the game so far! i'm relieve...,com.runawayplay.OldFriends,https://play.google.com/store/apps/details?id=...,2023-08-23 13:28:26.755346,ar,en
4,0,0,0,0,0,Dog Hotel Tycoon,"Become a dog hotel tycoon, and create a world ...","Become a dog hotel tycoon, and create a world ...","1,000,000+",1000000.0,...,True,,1.664449e+09,,,com.idle.dog.hotel.tycoon,https://play.google.com/store/apps/details?id=...,2023-08-23 13:28:27.419246,ar,en
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1139,20166,1745,3440,10393,20771,eharmony dating & real love,Are you ready to take dating to the next level...,Your dating app to match with quality singles ...,"5,000,000+",5000000.0,...,False,"Sep 27, 2010",1.692016e+09,10.12.0,"it will have you go through a lengthy sign up,...",com.eharmony,https://play.google.com/store/apps/details?id=...,2023-08-23 13:21:01.928813,us,en
1140,833,138,694,1389,14961,It's a Dog's Love: Romance you,❏Synopsis❏\r\n\r\nYou expect your new job at H...,Experience a heartwarming romance with woman's...,"500,000+",500000.0,...,True,"Jun 14, 2020",1.623339e+09,2.1.10,i've played a dang good amount of these storie...,com.genius.dogboy,https://play.google.com/store/apps/details?id=...,2023-08-23 13:21:02.637253,us,en
1141,4826,1522,3714,7391,33948,Dog Scanner: Breed Recognition,The Dog Scanner app will identify your dog's b...,"See a dog, but don't know its breed? Just take...","5,000,000+",5000000.0,...,True,"Jul 2, 2017",1.672913e+09,12.8.15-G,to say the ads are intrusive is an understatem...,com.siwalusoftware.dogscanner,https://play.google.com/store/apps/details?id=...,2023-08-23 13:21:03.287560,us,en
1142,1709,420,973,1657,22222,"Wild: Hook up, Meet, Dating Me",WILD - The Fast-growing Hookup App for Adult S...,The Online Dating & Hook up App for Local Sing...,"1,000,000+",1000000.0,...,False,"Oct 22, 2017",1.692072e+09,2.8.5,"overall impressions is that at this stage, the...",com.free.hookup.dating.apps.wild,https://play.google.com/store/apps/details?id=...,2023-08-23 13:21:03.924127,us,en


In [78]:
df.columns

Index(['ratings1', 'ratings2', 'ratings3', 'ratings4', 'ratings5', 'title',
       'description', 'summary', 'installs', 'minInstalls', 'realInstalls',
       'score', 'reviews', 'price', 'free', 'currency', 'sale', 'saleTime',
       'originalPrice', 'saleText', 'offersIAP', 'inAppProductPrice',
       'developer', 'developerId', 'developerEmail', 'developerWebsite',
       'developerAddress', 'privacyPolicy', 'genre', 'genreId', 'categories',
       'contentRating', 'contentRatingDescription', 'adSupported',
       'containsAds', 'released', 'updated', 'version', 'comments', 'appId',
       'url', 'searchMoment', 'countryCode', 'languageCode'],
      dtype='object')

In [79]:
df['genre'].value_counts()

genre
Lifestyle            262
Simulation           232
Dating               216
Casual               136
Education             88
Social                78
Communication         40
Role Playing          30
Puzzle                12
Medical               10
Travel & Local         8
Trivia                 6
Entertainment          4
Tools                  4
Strategy               4
Books & Reference      2
Business               2
Shopping               2
Educational            2
Health & Fitness       2
Name: count, dtype: int64
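
Because the same search is repeated for every country, one app can contribute a row per country, so the counts above are per row rather than per app. Below is a minimal sketch of counting each app once, using a small made-up frame in place of `df`. The app IDs and genres are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Illustrative rows only: one app fetched for two countries, another for one.
sample = pd.DataFrame({
    'appId': ['app.one', 'app.one', 'app.two'],
    'countryCode': ['ar', 'us', 'us'],
    'genre': ['Dating', 'Dating', 'Lifestyle'],
})

# Counting rows double-counts apps that appear in several countries.
per_row = sample['genre'].value_counts()

# Keeping one row per appId counts each app once.
per_app = sample.drop_duplicates(subset='appId')['genre'].value_counts()
```

Applied to the real data, the same pattern would be `df.drop_duplicates(subset='appId')['genre'].value_counts()`.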

## 2. Data Storage
Storage will be managed in the cloud using the following `AWS` services.
1. `DynamoDB`: Transactional NoSQL database.
2. `Redshift`: Analytical database (data warehouse).
3. `AWS Data Pipeline`: For data transfer between DynamoDB and Redshift.
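
One practical detail when loading rows into DynamoDB from Python is that the `boto3` resource API rejects Python `float` values and expects `decimal.Decimal` for numbers. This is a minimal sketch of preparing a row, assuming `boto3` is the client library (the plan above does not say). `to_dynamodb_item` is a hypothetical helper, and no AWS call is made here.

```python
import math
from decimal import Decimal

def to_dynamodb_item(row):
    """Convert a dict of scraped fields into values DynamoDB can store."""
    item = {}
    for key, value in row.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # Skip missing values instead of storing NaN.
        if isinstance(value, float):
            value = Decimal(str(value))  # Going through str() avoids binary float artifacts.
        item[key] = value
    return item

# Example row with a float count and a missing version.
item = to_dynamodb_item({
    'appId': 'com.fetchadate',
    'minInstalls': 1000.0,
    'version': float('nan'),
})
```

The resulting dict could then be passed to a call such as `table.put_item(Item=item)` on a `boto3` DynamoDB table resource.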