# App Development Opportunities: Profitable Free App Profiles for both Android and iOS

The aim of the is project is to develop a profile for an app that is likely to be popular in both iOS and Android app markets.

Our company is an app development company, and we create free apps that utilise in-app adveritising, our main source of revenue. To maximise this revenue, we want our next app to be as popular as possible across both the Google Playstore and the Apple App Store.

This project will help to inform our game developers on the types of apps that are likely to have a larger user base.

## Importing and Explorting the Data

The Google Playstore and the Apple App Store are by far the two largest markets for apps. In April 2022, there were 3.4 million apps on Google Playstore, and 2.2 million on the Apple App Store. Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/#:~:text=Number%20of%20apps%20available%20in%20leading%20app%20stores%202021&text=As%20of%20the%20first%20quarter,million%20available%20apps%20for%20iOS)

It would be difficult and costly to source a dataset that comprises of all of these apps, so a sample dataset was sought for this project instead. 

Two free, relevant sources of Apple App Store and Google Playstore data were found:

- Apple Store - [Available for download here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps). This dataset contains approximately 7000 entries and was collected in July 2017.
- Google Playstore - [Avaialble for download here](https://www.kaggle.com/datasets/lava18/google-play-store-apps). This dataset contains approximately 10,000 entries and was lated updated in 2019.

In [1]:
import pandas as pd
from csv import reader

In [2]:
appstore_data = open('AppleStore.csv')
playstore_data = open('googleplaystore.csv')

In [3]:
g_reading_data = reader(playstore_data)
android = list(g_reading_data)

a_reading_data = reader(appstore_data)
apple = list(a_reading_data)

Below, we're going to build a function that makes it easier to explore the two datasets - _explore_data_. It will display the data in a more legible way by adding lines between rows, and describing the number of columns and rows.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
explore_data(apple, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [6]:
explore_data(android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [7]:
apple_header = apple[0]
android_header = android[0]
print(android_header)
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The datasets have some differing categories but overall there are interesting columns here to consider further:
<br><br>
Google Playstore:
- App (name of app)
- Price
- Category
- Rating
- Installs
- Genres

Apple:
- track_name (name of app)
- currency
- price
- rating_count_tot
- user_rating
- prime_genre

## Cleaning the Data

One of the helpful aspects of using open source, free data like these, is that there is an established community of other users who have already identified dataset issues.

### Erroneous Entries
On the community discussion around the Google Playstore dataset, there is indication that there is an error with one of the entries, 10472. It is allegedly missing a column. Find the discussion [here](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015)

In [8]:
print(android[10473]) #adding a digit to accomodate my header

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The above is the erroneous line as reported in the discussion.

In [9]:
len(android[10473]) # checking whether this row has the same number of categories as the header

12

In [10]:
len(android_header)

13

13 columns in the header, 12 in row 10473.

In [11]:
del android[10473] # delete the bad row

In [12]:
print(android[10473]) #confirming row is deleted

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


### Removing Duplicates

#### Part One: Identification
There also seems to be some discussion regarding duplicates in the Google Playstore data, so we will take a look at removing those next if that is the case. See example [here](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409)

One notable example from the discussion is Instagram. We can confirm this by finding Instagram in the dataset.

In [13]:
for row in android:
    name = row[0]
    if name == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


There are a few entries for Instragram and the differences between them seem to be small. Just the Reviews (row=3) section differs between entries. We'll see how many more duplicates there are before deciding what to do next

In [14]:
apps = []
dupe_apps = []

for row in android[1:]:
    name = row[0]
    if name not in apps:
        apps.append(name)
    else:
        dupe_apps.append(name)

In [15]:
print('The number of duplicate apps = ' + str(len(dupe_apps)))
print('The number of unique apps = ' + str(len(apps)))

The number of duplicate apps = 1181
The number of unique apps = 9659


There are >1000 duplicate entries in the dataset that need to be removed. 

#### Part Two: Removal

We will keep only the entry with the highest reviews. The reason for this is that as we see in the Instagram example, the duplicates represent the same app, with Reviews as the only differentiator. As review counts do not go down with time, the entry with the highest review numbers must be the most recent.

To need to identify the duplicate entries to keep, we'll compile a list of these into a dictionary.

The below loop will:
 - Take the app name and number of reviews, and add these to a dict _max_reviews_
 - Check for each app name whether it is already in the dict.
      - If it finds an app that is already in the dict, it will check whether the new entry's review count is higher or lower. If higher, the old entry in the dict will be replaced.
      
The final dict will include only a single entry for each app name, and the value of only the highest review count for that app name.

In [16]:
max_reviews = {}

for app in android[1:]:
    name = app[0]
    num_reviews = float(app[3])
    
    if name in max_reviews and max_reviews[name] < num_reviews:
        max_reviews[name] = num_reviews
    elif name not in max_reviews:
        max_reviews[name] = num_reviews

In [17]:
print(len(max_reviews)) #checking if length of this list matches the number of unique apps listed above

9659


The below code removes the duplicates from the android dataset. The steps are as follows:
- The data is stored as a list, so a new list, android_clean, is created.
- We iterate through the android dataset, checking apps by name and the number of reviews.
    - If the number of reviews for an app matchs the number of reviews in the dict created above, max_reviews, then it is added to the android_clean list.
- A separate list is also created called already_added. When an app is added to the android_clean list, it is also added to the already_added list. Then, if a subsequent version of the same app with the same number of reviews is encountered, the for loop checks it is not yet part of already_added. If the name is already in that lsit, then the entry is not added to android_clean, avoiding further duplicates.

In [18]:
android_clean = []
already_added = []

In [19]:
for row in android[1:]:
    name = row[0]
    num_reviews = float(row[3])
    
    if num_reviews == max_reviews[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

In [20]:
print(len(android_clean)) #checking that the length of the dataset matches the length of unique apps we found earlier

9659


The figure above matches the number of unique apps we found above. 

We're OK to use this dataset now as no other issues have been reported in the community.

In [21]:
android = android_clean

Using similar to code above, we'll do a quick check now to ensure there are no duplicates in the Apple data.

In [22]:
ios_apps = []
ios_dupe_apps = []

for row in apple[1:]:
    app_id = row[0] #unique id of the app
    if app_id not in apps:
        ios_apps.append(app_id)
    else:
        ios_dupe_apps.append(app_id)

In [23]:
print(len(ios_dupe_apps))

0


In [24]:
print(len(ios_apps))

7197


In [25]:
print(len(apple[1:]))

7197


Looks like there are no duplicates in the AppleStore dataset.

## Removing Non-English Apps


As part of our project, we want to constrain the dataset to English language apps. This is because we exclusively build English-language apps, and don't want data from other language groups influencing our analysis. 

Below, we'll first identify the non-English apps by looking for those with non-latin alphabet characters in their titles, using ASCII.  The latin alphabet is constrained to the first 127 characters in ASCII.

This may not exclude all non-English apps (e.g. French, German, Indonesian all use the latin alphabet) but it will remove a bulk of non-English. As we are lacking a language column, this method seems a viable a proxy.

We'll use a for loop in a new function, _isitenglish_, to iterate over the characters in the app name strings and check if they are part of the latin alphabet.

In [26]:
def isitenglish(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    return True

In [27]:
isitenglish('Instagram') #testing the function here and below

True

In [28]:
isitenglish('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [29]:
isitenglish('Docs To Go™ Free Office Suite')

False

In [30]:
isitenglish('Instachat 😜')

False

This code returns some English language apps that just have non latin alphabet characters, such as emojis, trademark symbols. Plenty of relevant characters come later in the ASCII range than 127, so we'd be excluding these too.

Another option? We shall do the same as above - iterate over characters in a string - but only look for those with >3 non latin alphabet characters (i.e. characters outside the range of 0-127 in ASCII).

Not a perfect solution but it should remove most non-English ones and it is efficient.

In [31]:
def isitenglish2(string):
    non_eng = []
    for character in string:
        if ord(character) > 127:
            non_eng.append(character)
    if len(non_eng) > 3:
            return False
    return True

In [32]:
isitenglish2('Instachat 😜')

True

In [33]:
isitenglish2('Docs To Go™ Free Office Suite')

True

In [34]:
isitenglish2('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

The function _isitenglish2_ works as we expected and keeps English language apps with an emoji, but removes clearly non-English apps.

Below, we run it on the Google Playstore data.

In [35]:
android_eng = []
non_eng = []

In [36]:
for row in android[1:]:
    name = row[0]
    result = isitenglish2(name)
    if result == True:
        android_eng.append(row)
    else:
        non_eng.append(row)

In [37]:
print(non_eng[0:10])

[['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up'], ['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up'], ['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up'], ['صور حرف H', 'ART_AND_DESIGN', '4.4', '13', '4.5M', '1,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 27, 2018', '2.0', '4.0.3 and up'], ['L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]', 'LIFESTYLE', '4.0', '45224', '49M', '5,000,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'August 1, 2018', '6.5.1', '4.1 and up'], ['RMEduS - 음성인식을 활용한 R 프로그래밍 실습 시스템', 'FAMILY', 'NaN', '4', '64M', '1+', 'Free', '0', 'Everyone', 'Education', 'July 17, 2018', '1.0.1', '4.4 and up'], ['AJ렌터카 법인 카셰어링', 'MAPS_AND_

The above sample does look like a selection of non-English apps - Russian, Chinese, Korean are all present. We don't see any English titles in the sample, so we can call this a success and do the same with the Apple data.

In [38]:
apple_eng = []
apple_non_eng = []

for row in apple[1:]:
    name = row[1]
    result = isitenglish2(name)
    if result == True:
        apple_eng.append(row)
    else:
        apple_non_eng.append(row)

In [39]:
print(apple_non_eng[20:30])

[['303191318', '同花顺-炒股、股票', '122886144', 'USD', '0.0', '1744', '0', '3.5', '0.0', '10.10.46', '4+', 'Finance', '37', '0', '1', '1'], ['438426078', '聚力视频-蓝光电视剧电影在线热播', '99607552', 'USD', '0.0', '1670', '4', '4.5', '4.0', '6.4.16', '12+', 'Entertainment', '38', '0', '1', '1'], ['906936224', '快看漫画', '63058944', 'USD', '0.0', '1647', '225', '5.0', '5.0', '4.1.1', '17+', 'Book', '38', '4', '2', '1'], ['385285922', '乐视视频-白鹿原,欢乐颂,奔跑吧全网热播', '184689664', 'USD', '0.0', '1590', '6', '4.5', '5.0', '7.1', '17+', 'Entertainment', '38', '0', '2', '1'], ['1058287503', '酷我音乐HD-无损在线播放', '40784896', 'USD', '0.0', '1340', '264', '5.0', '5.0', '4.0.6', '4+', 'Entertainment', '24', '5', '1', '1'], ['373454750', '随手记（专业版）-好用的记账理财工具', '83899392', 'USD', '0.99', '1267', '0', '4.5', '0.0', '10.6.3', '4+', 'Finance', '38', '0', '3', '1'], ['485534181', 'Dictionary ( قاموس عربي / انجليزي + ودجيت الترجمة)', '69448704', 'USD', '3.99', '1112', '7', '4.5', '4.0', '6.4', '4+', 'Reference', '37', '5', '2', '1'], ['5544

In [40]:
print('Total Android apps in English = ' + str(len(android_eng)) + ' and Total Apple apps in English = ' + str(len(apple_eng)))
print('We removed ' + str(len(apple_non_eng)+len(non_eng)) + ' non-English apps in total across both datasets')

Total Android apps in English = 9613 and Total Apple apps in English = 6183
We removed 1059 non-English apps in total across both datasets


In [41]:
android = android_eng
apple = apple_eng

## Removing Paid Apps

As we'll only be making free apps, it makes sense to constraint these data to exlude paid apps. 

To start with, we'll check how pricing is named in the Google Playstore data

In [42]:
print(android_header)
print(android[24]) #randomly selected row to check price string structure

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Animated Photo Editor', 'ART_AND_DESIGN', '4.1', '203', '6.1M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 21, 2018', '1.03', '4.0.3 and up']


In [43]:
android_paid = []
android_free = []
for row in android[1:]:
    cost = row[6]
    if cost != 'Free':
        android_paid.append(row)
    elif cost == 'Free':
        android_free.append(row)        

In [44]:
print('The lenth of android_free = ' + str(len(android_free)) + 
      ' and the length of android_paid = ' + str(len(android_paid)))

The lenth of android_free = 8861 and the length of android_paid = 751


In [45]:
len(android)-1 #minus the header

9612

In [46]:
len(android_free)+len(android_paid) # Sum of paid + free should = total of the dataset

9612

The loop worked to successfully extract only the free Google Playstore apps. We'll use the same code for the Apple Store below.

In [47]:
print(apple_header)
print(apple[55]) #randomly selected row to check price string structure

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['363590051', 'Netflix', '125016064', 'USD', '0.0', '308844', '139', '3.5', '3.0', '9.21.3', '4+', 'Entertainment', '37', '5', '20', '1']


In [48]:
apple_paid = []
apple_free = []
for row in apple[1:]:
    cost = row[4]
    if cost != '0.0':
        apple_paid.append(row)
    elif cost == '0.0':
        apple_free.append(row)        

In [49]:
print('The lenth of apple_free = ' + str(len(apple_free)) + 
      ' and the length of apple_paid = ' + str(len(apple_paid)))

The lenth of apple_free = 3221 and the length of apple_paid = 2961


In [50]:
len(apple)-1 #again removing 1 for the header

6182

In [51]:
len(apple_free)+len(apple_paid) # Sum of paid + free should = total of the dataset as above

6182

Now happy with these data, we'll move on to analysis.

In [52]:
android = android_free
apple = apple_free

## Analysis
Our aim is to build an app that will be popular for both iOS and Android users, allowing us to maximise our in-app advertising revenue.

Toward this end, we need to understand what sort of apps are profitable across both Android and Apple.

A good place to start would be with the Categories/Genre columns in both datasets. This may help us determine an app profile that does well in similar categories across both app Stores.

### Top App Categories in Both Stores

#### Part One

In [53]:
print(android_header)
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The headers for both datasets are above, so we can familiarise ourselves with the relevant columns. For Android, these are columns 1 (Category) and 9 (Genres), and for Apple it is column 11 (prime_genre)

To determine the popularity of each category/genre, we'll use frequency tables.

We'll create a pair of functions that:
- Displays the frequency table of a column as percentages
- Displays the percentages created by the previous function in a sorted, descending order.

In [54]:
def freq_table(dataset, idx):
    cat = {}
    counts = 0
    for row in dataset[1:]:
        counts += 1
        category = row[idx]
        if category in cat:
            cat[category] += 1
        else:
            cat[category] = 1    
            
    pct = {}
    for key in cat:
        percentage = (cat[key] / counts) * 100
        pct[key] = percentage 
    
    return pct

In [55]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [56]:
android_cats = freq_table(android, 1) #testing the first function

In [57]:
print(android_cats)

{'ART_AND_DESIGN': 0.6094808126410836, 'AUTO_AND_VEHICLES': 0.9255079006772009, 'BEAUTY': 0.5981941309255079, 'BOOKS_AND_REFERENCE': 2.144469525959368, 'BUSINESS': 4.5936794582392775, 'COMICS': 0.6207674943566591, 'COMMUNICATION': 3.239277652370203, 'DATING': 1.8623024830699775, 'EDUCATION': 1.162528216704289, 'ENTERTAINMENT': 0.9593679458239277, 'EVENTS': 0.7110609480812641, 'FINANCE': 3.7020316027088036, 'FOOD_AND_DRINK': 1.2415349887133182, 'HEALTH_AND_FITNESS': 3.0812641083521446, 'HOUSE_AND_HOME': 0.8239277652370203, 'LIBRARIES_AND_DEMO': 0.9367945823927766, 'LIFESTYLE': 3.9051918735891653, 'GAME': 9.729119638826186, 'FAMILY': 18.905191873589164, 'MEDICAL': 3.5327313769751694, 'SOCIAL': 2.6636568848758464, 'SHOPPING': 2.2460496613995486, 'PHOTOGRAPHY': 2.945823927765237, 'SPORTS': 3.397291196388262, 'TRAVEL_AND_LOCAL': 2.336343115124153, 'TOOLS': 8.465011286681715, 'PERSONALIZATION': 3.3182844243792324, 'PRODUCTIVITY': 3.8939051918735887, 'PARENTING': 0.6546275395033859, 'WEATHER'

Confirmed the freq_table function works. Now to use display_table

In [58]:
android_categories = display_table(android, 1)

FAMILY : 18.905191873589164
GAME : 9.729119638826186
TOOLS : 8.465011286681715
BUSINESS : 4.5936794582392775
LIFESTYLE : 3.9051918735891653
PRODUCTIVITY : 3.8939051918735887
FINANCE : 3.7020316027088036
MEDICAL : 3.5327313769751694
SPORTS : 3.397291196388262
PERSONALIZATION : 3.3182844243792324
COMMUNICATION : 3.239277652370203
HEALTH_AND_FITNESS : 3.0812641083521446
PHOTOGRAPHY : 2.945823927765237
NEWS_AND_MAGAZINES : 2.799097065462754
SOCIAL : 2.6636568848758464
TRAVEL_AND_LOCAL : 2.336343115124153
SHOPPING : 2.2460496613995486
BOOKS_AND_REFERENCE : 2.144469525959368
DATING : 1.8623024830699775
VIDEO_PLAYERS : 1.7945823927765236
MAPS_AND_NAVIGATION : 1.399548532731377
FOOD_AND_DRINK : 1.2415349887133182
EDUCATION : 1.162528216704289
ENTERTAINMENT : 0.9593679458239277
LIBRARIES_AND_DEMO : 0.9367945823927766
AUTO_AND_VEHICLES : 0.9255079006772009
HOUSE_AND_HOME : 0.8239277652370203
WEATHER : 0.8013544018058691
EVENTS : 0.7110609480812641
PARENTING : 0.6546275395033859
COMICS : 0.620767

In [59]:
android_genres = display_table(android, 9)

Tools : 8.45372460496614
Entertainment : 6.072234762979685
Education : 5.349887133182844
Business : 4.5936794582392775
Productivity : 3.8939051918735887
Lifestyle : 3.8939051918735887
Finance : 3.7020316027088036
Medical : 3.5327313769751694
Sports : 3.4650112866817158
Personalization : 3.3182844243792324
Communication : 3.239277652370203
Action : 3.1038374717832955
Health & Fitness : 3.0812641083521446
Photography : 2.945823927765237
News & Magazines : 2.799097065462754
Social : 2.6636568848758464
Travel & Local : 2.325056433408578
Shopping : 2.2460496613995486
Books & Reference : 2.144469525959368
Simulation : 2.0428893905191874
Dating : 1.8623024830699775
Arcade : 1.8510158013544018
Video Players & Editors : 1.7720090293453723
Casual : 1.7607223476297968
Maps & Navigation : 1.399548532731377
Food & Drink : 1.2415349887133182
Puzzle : 1.1286681715575622
Racing : 0.9932279909706546
Role Playing : 0.9367945823927766
Libraries & Demo : 0.9367945823927766
Auto & Vehicles : 0.925507900677

In [60]:
apple_genres = display_table(apple, 11)

Games : 58.19875776397515
Entertainment : 7.888198757763975
Photo & Video : 4.937888198757764
Education : 3.6645962732919255
Social Networking : 3.260869565217391
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


Let's look at this data more closely.

For Apple:
- The most common genre is Gaming, followed by Entertainment

- Generally the apps categories cover mostly practical topics - News, Finance, Travel, Lifestyle, Education. 
    - However, 4 of the top 5 are fun/entertainment based, except for Education.

- While the majority of the seem to be Games or other fun/leisure types, this doesn't mean they are the most popular or used. They may simply be the easiest to develop, and thus end up with a saturation in the store. Of course, apps of this kind could indeed be very popular. More information will be needed.

For Android:
- At first glance it's not clear what the difference is between the 'genre' and the 'category' columns. 
    - It's possible 'genre' is a bit more granular than categories, based on the size of the list.

- Across both genre and category, top genres include Tools, Family, Education, Entertainment and Games.

- In general in the top 5 across both, practical type categories make up 4 of the entries. The opposite of the Apple Store set.

- Just as above, more info needed before deciding which category is the best to develop in.

### Most Installs by App Category in Apple Store

The above frequency tables tell us which categories have the most apps. But there may be many apps in each category, with only a few installs each - so it's not particularly telling of app popularity. 

We want our app to be popular, so we should consider how many apps in each categories have a significant number of users. This will give us an indication of the popularity of apps in a particular category. We can do this by looking at the following data:

- The Google Playstore data has a column 'installs' which we can use to calculate the number of users per app category.
- For Apple it's more tricky as there is no similar column. Intead, we'll can use the user reviews as a proxy for the number of users. The column is rating_count_tot


#### Part One: Apple Store Category Rating Counts

We're going to use the _freq_table_ function from above to again build a freqency table of genres in the Apple App Store.

We will then use a for loop to:
- Iterate over our Apple data
- For each row, take the app genre and check for that same genre in the frequency table we just made.
    - The loop will add the rating_count_tot value (as _rct_) to a variable named _total_
    - The length of the list of items added to total will also increase with each new item, in variable _len_genre_
- Find the average number of ratings for each genre using the total and the genre length variables.

In [61]:
apple_genre_freq = freq_table(apple_free, 11)

In [143]:
for genre in apple_genre_freq:
    total = 0
    len_genre = 0
    for row in apple:
        genre_app = row[11]
        if genre_app == genre:
            rct = float(row[5])
            total += rct
            len_genre += 1
    avg_num_ratings = round(total/len_genre, 1)
    print(genre, ':', avg_num_ratings)

Games : 22788.7
Music : 57326.5
Social Networking : 43899.5
Reference : 74942.1
Health & Fitness : 23298.0
Weather : 52279.9
Utilities : 18684.5
Travel : 28243.8
Shopping : 26919.7
News : 21248.0
Navigation : 86090.3
Lifestyle : 16485.8
Photo & Video : 28441.5
Entertainment : 14029.8
Food & Drink : 33333.9
Sports : 23008.9
Book : 39758.5
Finance : 31467.9
Education : 7004.0
Productivity : 21028.4
Business : 7491.1
Catalogs : 4004.0
Medical : 612.0


Above is the average number of ratings per app, per category, in the Apple Store. We're using this data as a proxy for installs/popularity of the apps in each category.

#### Part Two: Analysis

The data shows a few categories with particularly high average rating numbers per app. The top 10 are:

1. Navigation (86090)
2. Reference
3. Music
4. Weather
5. Social Networking
6. Book
7. Food & Drink
8. Finance
9. Photo & Video
10. Travel (28243)

We'll take a look at each in turn and get a feel for the app landscape.

In [63]:
#First looking at Navigation
for row in apple:
    if row[11] == 'Navigation':
        print(row[1], ':', row[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


By far the category with the most installs is Navigation. 

The average number of ratings per app for this category 86090, however this seems to be a figure that is dragged down by a few apps with a small number of installs, vs some very popular apps. E.g. the list includes major stakeholders like Google Maps and Waze for driving.

We can confirm this by checking the median of the data, below.

In [64]:
import statistics

In [65]:
median_list = []
for row in apple:
    if row[11] == 'Navigation':
        median_list.append(int(row[5]))

In [66]:
print('The median of the Navigation rating counts is ' + str(statistics.median(median_list)))

The median of the Navigation rating counts is 8196.5


So it looks like it may not be worth our developer's time to build an app in this category, dominated as it is with just a few, extremely popular apps. Let's look now at Reference, the category with the 2nd highest average number of ratings per app, with an average rating number of 74942.

In [67]:
for row in apple:
    if row[11] == 'Reference':
        print(row[1],':', row[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The two most popular apps are the Bible and forms of Dictionaries with very high values, and the average again being pulled down by a host of smaller apps.

Unlike the Navigation category, Reference has a slightly wider pool of moderately successfull apps to compete against these bigger players. For example, Night Sky for identifying constellations with 12122 ratings, and reference apps for specific games like Pokemon GO and Minecraft.

In [68]:
for row in apple:
    if row[11] == 'Music':
        print(row[1],':', row[5])

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

In [69]:
median_list = []
for row in apple:
    if row[11] == 'Music':
        median_list.append(int(row[5]))
        
print('The median of the Music rating counts is ' + str(statistics.median(median_list)))

The median of the Music rating counts is 3850.0


The third most popular category is Music with 57326 average ratings, and the median is 3850, indicating a few very popular apps are dragging the average up.

However again, even more so than the Reference category, we can see the actual list of apps in this category is large and the types of apps is broader. For example, we see music players (the majority of apps), music making apps (edjing Mix:DJ turntable to remix and scratch music and AutoRap). There is even an app just dedicated to the artist Nikki Minaj.

### Recommendation based on Apple data

Given the information above, it may be worth developing an app that combines both Music and Reference material. 
<br>
For example:
- an app with music quizzes (gamification of the reference category)
- music facts (perhaps specific to the user's favourite artists)
- information about upcoming gigs in an area or online mentions, new videos, new tweets, so on. 
- suggests similar artists based on user's preferences

As there are a lot of music playing apps available, one idea is do develop an app that could run alongside these and 'hear' what the user is listening to, providing content based on the current song/artist/genre etc. If we could integrate a player too, then the user would not need to leave our app to change songs etc.

### Most installs by category in Google Playstore

Using the same method as above for Apple, in the Google Playstore data we'll be looking at the 'Installs' column to get the number of installs per category. This will give us an indication of the popularity of apps within that category.

#### Part One: Data collection

In [70]:
android_installs = display_table(android, 5) #installs column

1,000,000+ : 15.733634311512414
100,000+ : 11.557562076749434
10,000,000+ : 10.553047404063205
10,000+ : 10.191873589164786
1,000+ : 8.397291196388263
100+ : 6.918735891647855
5,000,000+ : 6.817155756207676
500,000+ : 5.564334085778781
50,000+ : 4.774266365688487
5,000+ : 4.514672686230249
10+ : 3.5440180586907446
500+ : 3.250564334085779
50,000,000+ : 2.291196388261851
100,000,000+ : 2.1331828442437923
50+ : 1.9187358916478554
5+ : 0.7900677200902935
1+ : 0.5079006772009029
500,000,000+ : 0.27088036117381487
1,000,000,000+ : 0.2257336343115124
0+ : 0.04514672686230248


The above gives us a good indication of the spread in installs/popularity of apps. Some 15% have over 1M downloads, for example. However this data isn't precise enough for our needs nor does it tell us much about the categories of interest. We'll dive deeper.

In [71]:
android_categories = freq_table(android, 1)

In [139]:
for category in android_categories:
    total = 0
    len_category = 0
    for row in android:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+','') # the Android 'Installs' column is a str containing + and , signs
            n_installs = n_installs.replace(',','') # these chars need to be removed so we can convert to float and calculate.  
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_num_installs = round(total/len_genre, 1)
    print(category, ':', avg_num_installs)

ART_AND_DESIGN : 18035183.3
AUTO_AND_VEHICLES : 8846676.8
BEAUTY : 4532841.7
BOOKS_AND_REFERENCE : 277647376.7
BUSINESS : 116150348.3
COMICS : 7495191.7
COMMUNICATION : 1839484366.8
DATING : 23485792.8
EDUCATION : 31475000.0
ENTERTAINMENT : 164910000.0
EVENTS : 2662193.3
FINANCE : 75860522.0
FOOD_AND_DRINK : 35289791.8
HEALTH_AND_FITNESS : 190591400.3
HOUSE_AND_HOME : 16200410.2
LIBRARIES_AND_DEMO : 8832635.0
LIFESTYLE : 82914071.5
GAME : 2239478241.7
FAMILY : 1032315948.3
MEDICAL : 6288724.0
SOCIAL : 914643650.3
SHOPPING : 233389764.2
PHOTOGRAPHY : 776044802.5
SPORTS : 182538447.2
TRAVEL_AND_LOCAL : 482450681.0
TOOLS : 1350173912.3
PERSONALIZATION : 254872648.0
PRODUCTIVITY : 965271552.3
PARENTING : 5245168.3
WEATHER : 60048086.7
VIDEO_PLAYERS : 655288620.0
NEWS_AND_MAGAZINES : 394699376.7
MAPS_AND_NAVIGATION : 83843463.3


#### Part Two: Analysis

From the above, the 10 most installed categories are in order:


1. GAME
2. COMMUNICATION
3. TOOLS
4. FAMILY
5. PRODUCTIVITY
6. SOCIAL
7. PHOTOGRAPHY
8. VIDEO_PLAYERS
9. TRAVEL_AND_LOCAL
10. NEWS_AND_MAGAZINES


In [73]:
for row in android:
    if row[1] == 'COMMUNICATION':
        print(row[0],':', row[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

This is a large category with a lot of apps, however many of them seem to be extremely popular e.g. WhatsApp. We'll try to separate these out to see how much of the landscape of this category they do take up.

In [74]:
for row in android:
    if row[1] == 'COMMUNICATION' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0],':',row[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

That's a lot of apps still, with some big names that are established and difficult to compete against, e.g. WhatsApp, Messenger, Skype.

If we remove these apps, we find number of installs in this category is on average much smaller -- approx. 10x smaller!

In [75]:
under_100_m = []
for row in android:
    n_installs = row[5]
    n_installs = n_installs.replace('+','') # the Android 'Installs' column is a str containing + and , signs
    n_installs = n_installs.replace(',','') # these chars need to be removed so we can convert to float and calculate.  
    n_installs = float(n_installs)
    if (row[1] == 'COMMUNICATION' and n_installs < 100000000):
        under_100_m.append(n_installs)

In [76]:
avg_under_100_m = sum(under_100_m)/len(under_100_m)
print(avg_under_100_m)

3603485.3884615386


So Communication is giving us a misleading idea of how popular it is - it's dominated by a few extremely popular behemoths.

We'll now look at Video Players in a similar way.

In [77]:
for row in android:
    if row[1] == 'VIDEO_PLAYERS':
        print(row[0],':', row[5])

YouTube : 1,000,000,000+
All Video Downloader 2018 : 1,000,000+
Video Downloader : 10,000,000+
HD Video Player : 1,000,000+
Iqiyi (for tablet) : 1,000,000+
Video Player All Format : 10,000,000+
Motorola Gallery : 100,000,000+
Free TV series : 100,000+
Video Player All Format for Android : 500,000+
VLC for Android : 100,000,000+
Code : 10,000,000+
Vote for : 50,000,000+
XX HD Video downloader-Free Video Downloader : 1,000,000+
OBJECTIVE : 1,000,000+
Music - Mp3 Player : 10,000,000+
HD Movie Video Player : 1,000,000+
YouCut - Video Editor & Video Maker, No Watermark : 5,000,000+
Video Editor,Crop Video,Movie Video,Music,Effects : 1,000,000+
YouTube Studio : 10,000,000+
video player for android : 10,000,000+
Vigo Video : 50,000,000+
Google Play Movies & TV : 1,000,000,000+
HTC Service － DLNA : 10,000,000+
VPlayer : 1,000,000+
MiniMovie - Free Video and Slideshow Editor : 50,000,000+
Samsung Video Library : 50,000,000+
OnePlus Gallery : 1,000,000+
LIKE – Magic Video Maker & Community : 50,

In [78]:
for row in android:
    if row[1] == 'VIDEO_PLAYERS' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0],':',row[5])

YouTube : 1,000,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+


In [79]:
under_100_m = []
for row in android:
    n_installs = row[5]
    n_installs = n_installs.replace('+','') # the Android 'Installs' column is a str containing + and , signs
    n_installs = n_installs.replace(',','') # these chars need to be removed so we can convert to float and calculate.  
    n_installs = float(n_installs)
    if (row[1] == 'VIDEO_PLAYERS' and n_installs < 100000000):
        under_100_m.append(n_installs)

In [80]:
avg_under_100_m = sum(under_100_m)/len(under_100_m)
print(avg_under_100_m)

5544878.133333334


Again, this category seems to be misleadingly popular in terms of installs, weighted toward a higher figure of 24727872 by a few very popular apps. The average installs without those popular apps is much smaller at 5544878.

Similar to above, the video player apps market looks to be dominated by established brands which might make it hard for our app to be noticed.

In [81]:
for row in android:
    if row[1] == 'PHOTOGRAPHY':
        print(row[0],':', row[5])

TouchNote: Cards & Gifts : 1,000,000+
FreePrints – Free Photos Delivered : 1,000,000+
Groovebook Photo Books & Gifts : 500,000+
Moony Lab - Print Photos, Books & Magnets ™ : 50,000+
LALALAB prints your photos, photobooks and magnets : 1,000,000+
Snapfish : 1,000,000+
Motorola Camera : 50,000,000+
HD Camera - Best Cam with filters & panorama : 5,000,000+
LightX Photo Editor & Photo Effects : 10,000,000+
Sweet Snap - live filter, Selfie photo edit : 10,000,000+
HD Camera - Quick Snap Photo & Video : 1,000,000+
B612 - Beauty & Filter Camera : 100,000,000+
Waterfall Photo Frames : 1,000,000+
Photo frame : 100,000+
Huji Cam : 5,000,000+
Unicorn Photo : 1,000,000+
HD Camera : 5,000,000+
Makeup Editor -Beauty Photo Editor & Selfie Camera : 1,000,000+
Makeup Photo Editor: Makeup Camera & Makeup Editor : 1,000,000+
Moto Photo Editor : 5,000,000+
InstaBeauty -Makeup Selfie Cam : 50,000,000+
Garden Photo Frames - Garden Photo Editor : 500,000+
Photo Frame : 10,000,000+
Selfie Camera - Photo Edito

In [82]:
for row in android:
    if row[1] == 'PHOTOGRAPHY' and (row[5] == '1,000,000,000+'
                                      or row[5] == '500,000,000+'
                                      or row[5] == '100,000,000+'):
        print(row[0],':',row[5])

B612 - Beauty & Filter Camera : 100,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
Sweet Selfie - selfie camera, beauty cam, photo edit : 100,000,000+
Google Photos : 1,000,000,000+
Retrica : 100,000,000+
Photo Editor Pro : 100,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor : 100,000,000+
Photo Collage Editor : 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100,000,000+
Candy Camera - selfie, beauty camera, photo editor : 100,000,000+
YouCam Perfect - Selfie Photo Editor : 100,000,000+
Camera360: Selfie Photo Editor with Funny Sticker : 100,000,000+
S Photo Editor - Collage Maker , Photo Collage : 100,000,000+
AR effect : 100,000,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100,000,000+
LINE Camera - Photo editor : 100,000,000+
Photo Editor Collage Maker Pro : 100,000,000+


In [83]:
under_100_m = []
for row in android:
    n_installs = row[5]
    n_installs = n_installs.replace('+','') # the Android 'Installs' column is a str containing + and , signs
    n_installs = n_installs.replace(',','') # these chars need to be removed so we can convert to float and calculate.  
    n_installs = float(n_installs)
    if (row[1] == 'PHOTOGRAPHY' and n_installs < 100000000):
        under_100_m.append(n_installs)
        
avg_under_100_m = sum(under_100_m)/len(under_100_m)
print(avg_under_100_m)

7670532.29338843


A similar story for the Photography category seems to be happening as for Communication and Video Players - Google photos, popular selfie and editing tools. Further exploration shows the same for Social Media too, with Facebook, Instagram, Twitter all dominating.

The Entertainment has a wider range of apps including some more niche ones that look popular (Hamilton the offical app, Meme Creator, LOL pics, Colouring books) alongside more dominant examples of Netflix, Youtube Gaming).

The books_reference categories looks the same, with some big names like Google Play Books and the Bible, but plenty of reasonably popular >1m installs nice books, such as free e-book readers. 

In [84]:
for row in android:
    if row[1] == 'BOOKS_AND_REFERENCE':
        print(row[0],':', row[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

As reference is a popular category in the Apple Store too, it might be worth considering this category further. It's notable that Music is not in the Android categories list. Perhaps it is in the Genres column instead?

In [85]:
for row in android[1:]:
    genre = row[9] # genres column
    if genre in 'Music':
        print(row[0],':',row[5], ',', row[1])

Magic Tiles 3 : 50,000,000+ , GAME
Cardi B Piano Game : 10,000+ , GAME
Perfect Piano : 50,000,000+ , GAME
J Balvin Piano Tiles : 500+ , GAME
Magic Tiles - TWICE Edition (K-Pop) : 100,000+ , GAME
Magic Tiles - Blackpink Edition (K-Pop) : 100,000+ , GAME
DJMAX TECHNIKA Q - Music Game : 100,000+ , GAME
Au Mobile: Audition Chính Hiệu : 1,000,000+ , GAME
AU Mobile Indonesia : 1,000,000+ , GAME
Au-allstar for KR : 100,000+ , GAME
Super Dancer VN : 500,000+ , GAME
Love Dance : 1,000,000+ , GAME
Avatar Musik : 1,000,000+ , GAME
RIDE ZERO : 100,000+ , GAME
Cytus : 5,000,000+ , GAME
Just Dance Now : 10,000,000+ , GAME
Piano Free - Keyboard with Magic Tiles Music Games : 50,000,000+ , GAME
Dr Dre - Beatmaker : 10,000+ , GAME


There are a mix of apps here -- many of which are very popoular (over 1m installs), and all of which are categorised as games.

A lot of the names look like they are to do with a similar theme of Piano Simulators/games. E.g. Piano Free - Keyboard with Magic Tiles Music Games, Cardi B Piano Game, Perfect Piano, J Balvin Piano Tiles. A quick search of the playstore shows this type of game is saturated already.

There doesn't seem to be anything that overlaps with the Reference genre at all. Given the popularity and broad landscape of the Reference genre, this could be a good niche in which to build an app.

## Conclusion

In this analysis we've aimed to find a viable category in which to build a free app that will see success in both the Apple Store and the Google Play Store. The aim being that after establishing an audience in the app, we can build in paid elements.


Looking at both the Apple Store and Google Play Store data, it's clear that a shared popular category is Reference material. Music is also a well-downloaded app category, and one that in particular does not seem to already have a saturated base for apps that overlap with Reference.

It's my recommendation that an app that meets these two categories - Music and Reference - may prove popular. An example may be an app that utises music preferences and players from more popular apps - Spotify, Amazon Prime, etc - and allows users to learn about the playing artist while listening to their music. Another option may be to have an app which gamifies finding similar artists - e.g. searching for your favourite, and then playing a game to explore similar artists and learn more about them as they play.