# iOS and Android App Store Analysis

In this project I aim to analyze the applications in the App Store and the Play Store to isolate the characteristics of apps that tend to attract more users, using only the information the stores themselves publish about each app.

We will use two data sets: the 10,000 most popular apps from the Play Store as of August 2018, and 7,000 iOS apps from the App Store.


First, we will load the two data sets.

In [1]:
from csv import reader

# Apple Dataset

opened_file_apple = open('AppleStore.csv')
read_file = reader(opened_file_apple)
apple = list(read_file)

# Android Dataset
opened_file_play = open('googleplaystore.csv')
read_file = reader(opened_file_play)
android = list(read_file)
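As a sketch of a slightly more defensive way to load the files (not the method used in the cell above), the files can be opened in a `with` block so that they are closed automatically, with the encoding stated explicitly. The helper name `load_csv` and the `utf8` encoding are assumptions made for illustration.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    return data[0], data[1:]

# Hypothetical usage with the files from this notebook:
# apple_header, apple_rows = load_csv('AppleStore.csv')
# android_header, android_rows = load_csv('googleplaystore.csv')
```

Returning the header separately avoids having to remember to slice with `[1:]` in every later loop.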



We will now define the `explore_data` function.


In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Here we explore the data sets using the `explore_data()` function. The row count reported for each data set includes the header row.

This gives us an idea of what an entry looks like in both the App Store and the Play Store data.

The header row is printed above the sample rows of each data set.

A full description of the headers can be found here:
Android - https://www.kaggle.com/lava18/google-play-store-apps
iOS - https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

In [3]:
# Apple Analysis - headers printed on top
print(apple[0])
print('\n')

explore_data(apple, 1, 5, True)

print('============================')

# Android Analysis - headers printed on top
print(android[0])
print('\n')

explore_data(android, 2, 6, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring

As our target is to create an app that will attract the most users, we can isolate a few columns that we conjecture to be useful for such a design.

From the App Store we will be looking more closely at:

| Header | Description |
|----------|-------|
|'track_name' | App name|
|'currency' | Currency, in case we need to convert to USD|
|'price'| Price |
|'rating_count_tot'| Total number of ratings|
|'rating_count_ver'| Number of ratings for a particular version|
|'prime_genre'| Genre of the app|

Note: The Play Store gives the total number of installs, but the App Store only gives the total number of ratings.

For the Play Store we will be looking more closely at:

| Header | Description |
|----------|----------------|
|'App' | App name|
|'Category'| App category|
|'Reviews'| Number of reviews|
|'Installs'| Number of installs|
|'Type'| Free or paid |
|'Price'| Price |
|'Genres'| Genre|



The data contains an erroneous row whose values are shifted by one column (its category is missing), so we will remove it.

In [4]:
print(android[10473])
short = len(android[10473])
print('Length of the string is ' + str(short))

print('\n')
# Now We delete

del android[10473]
print(android[10473])
length = len(android[10473])

print('Length of the new string after correction is ' + str(length))


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Length of the string is 12


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
Length of the new string after correction is 13
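The bad row above was identified by its index. As a minimal sketch of how such rows could be located without knowing the index in advance, we can compare the length of every row with the length of the header. The function name `find_bad_rows` and the sample data are made up for illustration.

```python
def find_bad_rows(dataset):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny made-up data set: the last row is missing one field
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
]
print(find_bad_rows(sample))  # [(2, 2)]
```

This only detects rows with a missing or extra field; a row with the right number of fields but wrong values would pass unnoticed.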


# Data Cleaning

We continue the data cleaning: while inspecting the data I noticed that the Play Store data contains duplicate rows for the same app.

To isolate the duplicates, we will create an accompanying list called `duplicate_apps`, which stores the name of every repeated Play Store entry (matched on the app name), separately from the main data set.

Here, for example, we see that there are 2 separate entries for Facebook.

In [5]:
for app in android:
    name = app[0]
    if name == 'Facebook':
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


First we will collect the duplicates in the list `duplicate_apps`.


In [6]:
duplicate_apps = []
uniqe_apps = []

for app in android:
    name = app[0]
    if name in uniqe_apps:
        duplicate_apps.append(name)
    else:
        uniqe_apps.append(name)

print('Number of duplicated apps:', len(duplicate_apps))
print('Examples of duplicated apps: ', duplicate_apps[:10])

Number of duplicated apps: 1181
Examples of duplicated apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


We will now remove the duplicates from the data. They will not be removed at random: for each app we keep the entry with the most reviews, as that is most likely to be the most recent one. Unfortunately, we cannot use the version data because it is not precise enough in the data set, and we are not given exact patch dates either. The duplicates most likely exist because the same app was collected several times during the scrape, with only minor fixes released in between.

In [7]:
#Expected length of data after cleanup
print('Expected length: ', len(android) - len(duplicate_apps))

Expected length:  9660


In [8]:
reviews_max = {}

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

print('The most reviewed Facebook app has reviews : ' +
      str(reviews_max['Facebook']))


9659
The most reviewed Facebook app has reviews : 78158306.0


Now we will use the dictionary that we just created, which holds the highest review count for each unique Play Store app, to build a cleaned-up list.

We will form 2 lists:

`android_clean` stores the new, cleaned data set. It drops all entries with a duplicated name except the one with the highest number of reviews, which we use as a proxy for the most recent entry.

`already_added` stores only app names. It tracks the names already processed, so that an app whose duplicates share the same review count is added only once.


In [9]:
android_clean = []
already_added = []

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

len(android_clean)

    


9659

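We have now created a cleaned-up list of the Android apps.


*Note*: `android_clean` does not include the header row, because the loop above starts at `android[1:]`. Its length of 9659 is therefore one less than the expected length of 9660 computed earlier, which counted the header.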
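The same result can be sketched in a single pass by storing the best row per name in a dictionary instead of storing only the review count. This is an alternative under stated assumptions, not the method used above: the function name `dedupe_by_max_reviews` is hypothetical, the sample rows are shortened to four columns, and the `Box` figures are invented.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two rows tie on review count
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
    ['Box', 'BUSINESS', '4.2', '159872'],   # invented figures
    ['Box', 'BUSINESS', '4.2', '159872'],
]
for row in dedupe_by_max_reviews(sample):
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each name, whereas the two-list approach orders apps by the position of the row that was kept.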


# Removing non-English apps from the data set

We will now remove the apps that are not in English. They are beyond the scope of our analysis, as we would primarily be targeting the English-speaking market.

Our current data set contains apps whose titles are not in English, for example:

In [10]:
print(apple[814][1])
print(apple[6732][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We will attempt to isolate apps whose names contain non-English characters.

A name will only count as non-English if more than 3 of its characters fall outside the ASCII range of 0-127 used for English text. This prevents the exclusion of apps that use, for example, an emoji or an occasional special character in their name.

This method isn't foolproof: for example, an entirely French word that uses no special characters will be counted as 'English'. However, it is a useful filter for narrowing the data down to a more relevant set.


In [11]:
def english(word):
    for letter in word:
        if ord(letter) > 127:
            return False
    return True

print('Instachat 😜 is an example where rejecting every non-ASCII character gives a wrong result: the name contains an emoji, and yet it is in English')
english('Instachat 😜')

# Improved version: reject a name only if it has more than 3 non-ASCII characters.

def english(word):
    n = 0
    for letter in word:
        if ord(letter) > 127:
            n += 1

    if n > 3:
        return False
    else:
        return True

english('lama剧剧😜😜剧剧')
print(english('lama剧剧😜😜剧剧'))
english('dupa')


Instachat 😜 is an example where rejecting every non-ASCII character gives a wrong result: the name contains an emoji, and yet it is in English
False


True

We now have an `english` function that treats a name as English unless it contains more than 3 characters outside the ASCII range.
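To make the threshold easier to reason about, the following sketch separates the counting step from the decision. The names `count_non_ascii` and `is_probably_english` are hypothetical; the two sample names are taken from this notebook.

```python
def count_non_ascii(text):
    """Number of characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_probably_english(text, max_non_ascii=3):
    """Apply the same rule as above: at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii

for name in ['Instachat 😜', '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    print(name, '|', count_non_ascii(name), '|', is_probably_english(name))
```

Exposing the count makes it possible to inspect borderline names (for instance those with exactly 3 or 4 non-ASCII characters) before settling on a threshold.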



In [12]:
android_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if english(name):
        android_english.append(app)       

# we now repeat the process for ios apps

for app in apple[1:]:
    name = app[1]
    if english(name):
        apple_english.append(app)
              
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(apple_english, 0, 3, True)
     

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

Now we will isolate the free apps from each store, starting from the lists filtered above.

For Android, the price is at index 7.

For Apple, the price is at index 4.



In [13]:
android_free = []
apple_free = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)

print(len(android_free))
# We have 8864 free Android apps to work with
        
for app in apple_english:
    price = float(app[4])
    if price == 0:
        apple_free.append(app)
        
print(len(apple_free))
# We have 3222 free Apple apps to work with.

8864
3222


We now have 2 lists: *android_free* (8864 free apps) and *apple_free* (3222 free apps), which hold all the free English apps in the respective stores.



# Data Processing

Now that we have cleaned and filtered all the data we need, we can move on to data processing. We will analyse data from both the Play Store and the App Store, since the goal of the project is to look for an app profile that would work on both.

Initially we would roll out the application on the Play Store and see whether it gains any success on that platform, rather than spending the time and resources to port the app to iOS without testing.




# Genre Analysis
We will first look at the genres of the apps, to see which genres are the most common in the market.

To do so, we will build a frequency table for the *prime_genre* column of the App Store data set, and for the *Genres* and *Category* columns of the Google Play data set.

The function below builds a frequency dictionary (expressed as a percentage of the total number of apps in the data set) for any column of a data set.

We will then sort the results to make them more readable.

In [14]:


# apple genre index is 11 (or -5)

def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
# this part will convert the table frequencies into percentages
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages
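For comparison, here is a sketch of the same percentage table built with `collections.Counter` from the standard library. The name `freq_table_counter` and the sample rows are made up; the function is meant to behave like `freq_table` above, not to replace it.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage of rows per distinct value in column `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # {'Games': 75.0, 'Music': 25.0}
```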





    

We will now use this function to analyse the frequencies of genres for apps in the App Store and the Play Store.

The relevant column indexes for the respective stores are:

For Apple: -5 (for *prime_genre*)

For Android: 1 (for *Category*) and -4 (for *Genres*)

In [15]:
print('Frequency Table for App Store')
print('\n')
print(freq_table(apple_free, -5))
print('\n')
print(freq_table(android_free,1))


Frequency Table for App Store


{'Music': 2.0484171322160147, 'Lifestyle': 1.5828677839851024, 'Medical': 0.186219739292365, 'Utilities': 2.5139664804469275, 'Business': 0.5276225946617008, 'Book': 0.4345127250155183, 'Travel': 1.2414649286157666, 'Shopping': 2.60707635009311, 'Social Networking': 3.2898820608317814, 'Games': 58.16263190564867, 'Navigation': 0.186219739292365, 'Health & Fitness': 2.0173805090006205, 'Photo & Video': 4.9658597144630665, 'Education': 3.662321539416512, 'Finance': 1.1173184357541899, 'Catalogs': 0.12414649286157665, 'Weather': 0.8690254500310366, 'News': 1.3345747982619491, 'Productivity': 1.7380509000620732, 'Food & Drink': 0.8069522036002483, 'Entertainment': 7.883302296710118, 'Sports': 2.1415270018621975, 'Reference': 0.5586592178770949}


{'SHOPPING': 2.2450361010830324, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'PERSONALIZATION': 3.3167870036101084, 'FOOD_AND_DRINK': 1.2409747292418771, 'HOUSE_AND_HOME': 0.8235559566787004, 'DATING': 1.8614620938628

As we can see, the function returns the frequencies in no particular order, so we will implement a sorting function (from highest frequency to lowest) with some formatting to make the data more readable.





In [16]:
# This function uses the freq table function, and sorts the output

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
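The swap into `(value, key)` tuples above exists only so that `sorted` orders by the percentage. As a sketch of an alternative, `sorted` can be given a `key` function and the dictionary items can be sorted directly. The name `sorted_table` and the sample values are made up.

```python
def sorted_table(table):
    """Return (key, percentage) pairs, highest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

example = {'Music': 25.0, 'Games': 60.0, 'Weather': 15.0}  # made-up percentages
for name, percentage in sorted_table(example):
    print(name, ':', percentage)
```

One small difference: when two entries have the same percentage, the tuple-based version orders them by name in reverse alphabetical order, while this version keeps them in the order they appear in the dictionary.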

Now that we have the functions that compute the frequency data and sort it, we will apply them to the data sets.


In [17]:
display_table(apple_free, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


From the frequency distribution we can see that the 5 most common genres in the App Store are Games, Entertainment, Photo & Video, Education and Social Networking. Games win by a large margin, with 58.16% of the apps compared to 7.88% for the runner-up, Entertainment. All other genres also score in single-digit percentages.

The App Store in general seems to have many more applications aimed at fun and entertainment than other types of apps, such as those in the 'Health & Fitness' and 'Utilities' genres.

That being said, this only pertains to the number of apps in the App Store, not necessarily to their popularity in terms of total downloads. We can say that a lot of games are being published. We can't say from the data that social media apps are less popular; perhaps fewer of them are used (e.g. people can play many games, but use Facebook as their sole social media application). The high frequency of a genre can indicate how competitive the publisher space is, as we can clearly see that many more games are released than health apps.

Let's now investigate the Play Store apps by category.

In [18]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

From the *Category* column we can see that the most common groupings are Family (18.9%), Games (9.72%), Tools (8.46%), Business (4.59%) and Lifestyle (3.9%).

In that respect, at least if we look at these categories, games have much less of a foothold in the Play Store than in the App Store, at least in terms of the number of released apps.

That being said, Games are still one of the top categories, although the Family category is much larger. We also see that Business and Tools are much closer to the top, which suggests a more utilitarian profile for the store.



In [19]:
display_table(android_free, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Looking at the Play Store again, this time by the *Genres* column, we see more support for the view that Play Store applications might be more utilitarian as a whole.

The 5 most common genres are: Tools (8.45%), Entertainment (6.07%), Education (5.35%), Business (4.59%) and Productivity (3.89%).

We again see the Entertainment genre (which could encompass games) near the top, but it is an order of magnitude smaller than Games on the App Store. 3 of the top 5 genres are more utilitarian (Tools, Business and Productivity).

Again, the frequency analysis gives us clues about competition and the number of released apps, but since we don't compare against download numbers or reviews, we can't comment on popularity.


# Popularity analysis

We have analysed the frequency of the apps published on the respective stores to understand which genres are most common, which can be a proxy for demand and could allow us to narrow down our search. Now we will turn to the actual downloads of the apps.

To do so, we will compute the average number of downloads per app in each genre, to get an expectation of how an individual app can do in the market.

For the Play Store we can measure this directly, using the number of *Installs* from the data.

For the App Store we do not have install numbers, but we can take the number of ratings as a proxy, from the *rating_count_tot* column.
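The per-genre average described above can be sketched in a single pass over the data by accumulating a running total and a count for each genre. The function name `average_by_group` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Mean of the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]
print(average_by_group(sample, 1, 2))  # {'Games': 200.0, 'Music': 50.0}
```

A nested loop over genres and apps gives the same averages; the single pass simply avoids re-reading the whole data set once per genre.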



# App Store Analysis

In [20]:
genres_ios = freq_table(apple_free, -5)

pop_list = []

for genre in genres_ios:
    total = 0 
    len_genre = 0
    for app in apple_free:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings =  total / len_genre
    print(genre, ':', avg_n_ratings)
    pop_list.append((genre, avg_n_ratings))

print('\n')
print('Sorted list by average popularity per app')
print('\n')

a =  sorted(pop_list, key = lambda x: int(x[1]))

print(*a, sep = "\n")


Music : 57326.530303030304
Lifestyle : 16485.764705882353
Medical : 612.0
Utilities : 18684.456790123455
Business : 7491.117647058823
Book : 39758.5
Travel : 28243.8
Shopping : 26919.690476190477
Social Networking : 71548.34905660378
Games : 22788.6696905016
Navigation : 86090.33333333333
Health & Fitness : 23298.015384615384
Photo & Video : 28441.54375
Education : 7003.983050847458
Finance : 31467.944444444445
Catalogs : 4004.0
Weather : 52279.892857142855
News : 21248.023255813954
Productivity : 21028.410714285714
Food & Drink : 33333.92307692308
Entertainment : 14029.830708661417
Sports : 23008.898550724636
Reference : 74942.11111111111


Sorted list by average popularity per app


('Medical', 612.0)
('Catalogs', 4004.0)
('Education', 7003.983050847458)
('Business', 7491.117647058823)
('Entertainment', 14029.830708661417)
('Lifestyle', 16485.764705882353)
('Utilities', 18684.456790123455)
('Productivity', 21028.410714285714)
('News', 21248.023255813954)
('Games', 22788.6696905016)
(

The 5 genres with the highest average number of ratings per app in the App Store (our proxy for downloads) are as follows:

1. Navigation (86090 ratings per app)
2. Reference (74942 ratings per app)
3. Social Networking (71548 ratings per app)
4. Music (57326 ratings per app)
5. Weather (52279 ratings per app)

This can give us an indication of which types of apps are popular on the App Store. That being said, the averages do not tell us anything about skewness in the data. For example, Google Maps in the Navigation genre will skew the average upward, as it holds a near monopoly in the space.

We can draw a similar conjecture for Social Networking or Music, where people tend to use a small number of popular platforms such as Facebook or Spotify.
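To illustrate how a single dominant app can pull the average up, the sketch below compares the mean with the median on invented rating counts. The numbers are not taken from the data set; they only show the effect.

```python
from statistics import mean, median

# Invented rating counts for one genre: one dominant app and four small ones
rating_counts = [1_000_000, 2_000, 1_500, 800, 700]

print('mean  :', mean(rating_counts))
print('median:', median(rating_counts))
```

Here the mean is driven almost entirely by the largest app, while the median reflects what a typical app in the genre receives. Reporting both per genre would be one way to check the conjecture above.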

The Reference genre seems popular, but without one dominant player. Reference apps include the Bible and other religious texts, game guides and translators: often quick-access knowledge sources that can be used offline. This genre seems to be less concentrated around big players and more scattered, leaving more space for a new app to enter the competitive space.





# Play Store Category Analysis

In this section we will perform the download popularity analysis for the Play Store. This time we have download figures quoted in the data set, so we can use those directly. Let's first have a look at the downloads data:


In [22]:
display_table(android_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


As we can see above, the *Installs* column does not quote an exact download number; instead it gives an open-ended range for each app (e.g. an app has been downloaded 5,000+ or 1,000,000+ times).

Because of this we will have to reduce our measurement precision, but given the sample size, this data should be sufficient to understand which apps are the most popular.

Since the popularity ranking can, for our intents and purposes, be treated as an ordinal metric, we will convert each range to its lower bound.

So we will treat 100,000+ installs as 100,000, 500+ installs as 500, etc.
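The conversion described above can be sketched as a small helper that strips the thousands separators and the trailing plus sign. The name `parse_installs` is hypothetical; the bucket strings are those that appear in the *Installs* column.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to its lower bound."""
    return int(text.replace(',', '').replace('+', ''))

for bucket in ['0', '0+', '500+', '100,000+', '1,000,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```

Because every bucket is mapped to its lower bound, any average computed from these values is an underestimate of the true number of installs.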

First, we build a frequency table of the categories:


In [24]:
freq_table(android_free, 1)

{'ART_AND_DESIGN': 0.6430505415162455,
 'AUTO_AND_VEHICLES': 0.9250902527075812,
 'BEAUTY': 0.5979241877256317,
 'BOOKS_AND_REFERENCE': 2.1435018050541514,
 'BUSINESS': 4.591606498194946,
 'COMICS': 0.6204873646209386,
 'COMMUNICATION': 3.2378158844765346,
 'DATING': 1.861462093862816,
 'EDUCATION': 1.1620036101083033,
 'ENTERTAINMENT': 0.9589350180505415,
 'EVENTS': 0.7107400722021661,
 'FAMILY': 18.907942238267147,
 'FINANCE': 3.7003610108303246,
 'FOOD_AND_DRINK': 1.2409747292418771,
 'GAME': 9.724729241877256,
 'HEALTH_AND_FITNESS': 3.0798736462093865,
 'HOUSE_AND_HOME': 0.8235559566787004,
 'LIBRARIES_AND_DEMO': 0.9363718411552346,
 'LIFESTYLE': 3.9034296028880866,
 'MAPS_AND_NAVIGATION': 1.3989169675090252,
 'MEDICAL': 3.531137184115524,
 'NEWS_AND_MAGAZINES': 2.7978339350180503,
 'PARENTING': 0.6543321299638989,
 'PERSONALIZATION': 3.3167870036101084,
 'PHOTOGRAPHY': 2.944494584837545,
 'PRODUCTIVITY': 3.892148014440433,
 'SHOPPING': 2.2450361010830324,
 'SOCIAL': 2.662454873646

In [30]:
categories_android = freq_table(android_free, 1)

pop_list = []

for category in categories_android:
    total = 0 
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            # Strip '+' and ',' so that e.g. '1,000,000+' becomes 1000000
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs)
            len_category += 1
    avg_n_installs =  total / len_category
    print(category, ':', avg_n_installs)
    pop_list.append((category, avg_n_installs))

print('\n')
print('Sorted list by average popularity per app')
print('\n')

a =  sorted(pop_list, key = lambda x: int(x[1]))

print(*a, sep = "\n")


SHOPPING : 7036877.311557789
HEALTH_AND_FITNESS : 4188821.9853479853
PERSONALIZATION : 5201482.6122448975
FOOD_AND_DRINK : 1924897.7363636363
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMMUNICATION : 38456119.167247385
COMICS : 817657.2727272727
TRAVEL_AND_LOCAL : 13984077.710144928
PHOTOGRAPHY : 17840110.40229885
TOOLS : 10801391.298666667
VIDEO_PLAYERS : 24727872.452830188
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
ENTERTAINMENT : 11640705.88235294
SOCIAL : 23253652.127118643
FINANCE : 1387692.475609756
SPORTS : 3638640.1428571427
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
LIBRARIES_AND_DEMO : 638503.734939759
NEWS_AND_MAGAZINES : 9549178.467741935
ART_AND_DESIGN : 1986335.0877192982
LIFESTYLE : 1437816.2687861272
PRODUCTIVITY : 16787331.344927534
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
WEATHER : 5074486.197183099
GAME : 15588015.603248259
MEDICAL : 120550.61980830671
PARENTING : 54260

# Popularity of Android App Categories
Looking at the *Category* column, we can see that the 5 most popular categories (in terms of average installs per app in the category) are:

1. Communication (38,456,119)
2. Video Players (24,727,872)
3. Social (23,253,652)
4. Photography (17,840,110)
5. Productivity (16,787,331)

Let's investigate the communication category:


In [32]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

By looking at the data we can see that this space is dominated by a handful of very large players, such as WeChat, the messenger app for the Chinese market, and Gmail, the popular Android email client.

However, let's have a look at the Books and Reference category, as we have identified the App Store Reference genre as one of the potential markets we could enter first. It does not seem as popular a category on the Play Store, but it still ranks close to the 10 most popular categories by average installs, and might be less monopolized than others. Let's first explore this category:



In [34]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


We can see that there are still a number of very large competitors in the space, but we can explore further:



In [36]:

for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

Although there are a few dominant apps in the space, there also seems to be a 'long tail' of other applications, i.e. there are other apps in the genre with millions of downloads.
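
One way to put numbers on this 'long tail' is to count how many apps of the category fall into each install bucket. The sketch below is an illustration under stated assumptions rather than part of the original analysis: `installs_distribution`, `installs_to_int` and `sample_rows` are hypothetical names, the rows are made up, and every installs string is assumed to have the form `'1,000,000+'`.

```python
from collections import Counter

def installs_distribution(dataset, category):
    """Count how many apps of `category` fall into each installs bucket."""
    return Counter(app[5] for app in dataset if app[1] == category)

def installs_to_int(installs):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))

# Made-up rows: name at index 0, category at index 1, installs at index 5
sample_rows = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '4.5', '100', '5M', '100,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '4.1', '50', '3M', '1,000,000+'],
    ['Dictionary C', 'BOOKS_AND_REFERENCE', '4.3', '70', '8M', '1,000,000+'],
    ['Guide D', 'BOOKS_AND_REFERENCE', '4.0', '10', '2M', '10,000+'],
    ['Chat E', 'COMMUNICATION', '4.0', '900', '20M', '500,000,000+'],
]

counts = installs_distribution(sample_rows, 'BOOKS_AND_REFERENCE')
# Print the buckets from largest to smallest
ordered_buckets = sorted(counts, key=installs_to_int, reverse=True)
for bucket in ordered_buckets:
    print(bucket, ':', counts[bucket])
```

A category where most of the count sits in the middle buckets, rather than in one or two giant apps, would support the 'long tail' reading.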

In particular, reference works such as dictionaries or religious texts have separate apps with large popularity. However, we can also see that game guides such as 'Stats for Clash Royale' or 'My Little Pony AR Guide' seem to be very popular.
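
To isolate guide-like apps from the rest of the category, one option is a simple keyword match on the app name. This is only a rough sketch: the function `find_by_keywords`, the keyword list and the sample rows are all assumptions made for illustration, and a keyword match will miss guides whose names do not contain these words.

```python
def find_by_keywords(dataset, category, keywords):
    """Return (name, installs) pairs for apps in `category`
    whose name contains any of `keywords` (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [(app[0], app[5]) for app in dataset
            if app[1] == category
            and any(keyword in app[0].lower() for keyword in lowered)]

# Made-up rows: name at index 0, category at index 1, installs at index 5
sample_rows = [
    ['Guide for Example Game', 'BOOKS_AND_REFERENCE', '4.2', '30', '4M', '1,000,000+'],
    ['Example Dictionary', 'BOOKS_AND_REFERENCE', '4.4', '80', '9M', '5,000,000+'],
    ['Stats for Example Game', 'BOOKS_AND_REFERENCE', '4.1', '20', '6M', '500,000+'],
]

guide_keywords = ['guide', 'stats for']  # hypothetical keyword list
guides = find_by_keywords(sample_rows, 'BOOKS_AND_REFERENCE', guide_keywords)
for name, installs in guides:
    print(name, ':', installs)
```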

# Recommendation and conclusion

In this project we have analysed the data in the app stores for iOS and Android devices, looking at the frequencies of apps in each store, and also using the application download and rating data from the respective stores.

The objective of this exercise was to use this data to recommend a profile for an app that would be 'popular'. The conjecture is that the more popular the app, the more revenue it would generate from ad traffic.

We have discovered that the stores have different popular genres, the App Store being more entertainment driven, while the Play Store seems to be more productivity driven. The most popular genres in both stores seem to be quite skewed towards large players, such as Google Maps in the Navigation section of the App Store, or WhatsApp in the Communication section of the Play Store.

The existence of these large players means that a new entrant would have a hard time competing for market share with a new app, especially if we wanted to make it profitable.

As an alternative, we have discovered that the Reference category is among the most popular in the App Store, and one of the top 10 categories in the Play Store. These applications are references such as books, religious texts, translators and game guides. They can be relatively inexpensive to develop, as they are mainly knowledge repositories, provided that the information is easily accessible and easily structured.

As an application, we would propose to create a game guide for a popular mobile game. We can see, for example, that reference apps for 'Clash of Clans' or 'My Little Pony' are popular guides. Ideally, the guide would be for a game that exists on both the iOS and Android platforms, so that porting it from one platform to the other makes sense, as someone playing the game on one platform is likely to look up the guide on the same platform.

Other popular dictionary or book references could be a trickier area to enter, since their content tends to be static: a new entrant to the dictionary space would present similar knowledge to the existing dictionaries, with large players already in place. However, new and popular mobile and PC games keep appearing on the market, so there is a larger window of opportunity to create such an application and become a large player while there are not yet many players in the marketplace at the time of publishing.

There are a number of games coming up in 2020 which could potentially require a reference (commonly these are competitive online games or MMORPGs), such as Cyberpunk 2077 or Final Fantasy VII Remake. Games with a lot of content and high popularity are the targets of these references. Building a presence on the App Store prior to the release of these games can direct users to the early-launched application before the release, for people to use as a reference.

Pre-release content can draw on information readily available online. Wikia content is often openly licensed and available to use, provided that the original sources are cited, which makes compiling the information relatively simple, especially as it is often user-generated content.

In terms of monetization, we would look to implement partnerships to link to game marketplaces, and potentially sales partnerships with the target games.