# Analysis of succesful apps in Google Play and the AppStore

What it takes to make a free app successful? The aim of this analysis is to englight the key factors in making a free app a success, studying 10,000 apps from Google Paly and about 7,000 apss from the App Store, through data collected in 2018.

Both dataset are from Kaggle, info about the Google Play Store Apps dataset can be found [here](https://www.kaggle.com/lava18/google-play-store-apps),
while info about the Apple Store Apps can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

### Technologies and helper functions

This study uses only pure Python and no other libreries.

In [9]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [10]:
def open_dataset(file_name='AppleStore.csv', remove_header = True):
    opened_file = open(file_name, encoding="utf8")
    from csv import reader
    read_file = reader(opened_file)
    data = list(read_file)
    if remove_header:
        return data[1:]
    
    return data

### Data exploration

In [11]:
apple_data_full = open_dataset(remove_header = False)

In [12]:
apple_header = apple_data_full[0]
apple_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [13]:
print(len(apple_header))

16


In [14]:
apple_data = apple_data_full[1:]

In [15]:
print(len(apple_data))

7197


In [16]:
explore_data(apple_data, 0, 5)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']




In [17]:
google_play_full = open_dataset('googleplaystore.csv', remove_header = False)

In [18]:
google_play_header = google_play_full[0]
google_play_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [19]:
print(len(google_play_header))

13


In [20]:
google_play = google_play_full[1:]

In [21]:
print(len(google_play))

10841


In [22]:
explore_data(google_play, 0, 5)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']




## Data cleaning

### Removing row with errors

The Google Play dataset has a problem at row  10472, as stated by [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on Kaggle <br>
The column 'Category' is missing, so there is shift for the remaining data, in the row.

In [23]:
problem_row = google_play[10472]
print(google_play[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [24]:
print(len(problem_row))

12


In [25]:
#removing the row 10472
del google_play[10472]

In [26]:
print(len(google_play))

10840


### Removing duplicates

The Google Play dataset has some duplicate rows, as we can see here for the Intagram App.

In [27]:
for app in google_play:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


There are four Instagram apps: they are the same app, but with data collected at different times, with increasing number of user ratings. <br>
We are going to mantain only the app with the maximum number of user ratings, the most recent one, and remove all the older entries with less user ratings.
The same procedure will be applied to all the duplicate apps in the dataset.

First we are going to count all the duplicates apps

In [28]:
duplicate_apps_gp = []
unique_apps_gp = []

for app in google_play:
    name = app[0]
    if name in unique_apps_gp:
        duplicate_apps_gp.append(name)
    else:
        unique_apps_gp.append(name)

In [29]:
print('Number of duplicate apps: ', len(duplicate_apps_gp))

Number of duplicate apps:  1181


There are 1181 duplicate apps in the Google Play data

In [30]:
##number of rows expected after removal of duplcates
len(google_play) - len(duplicate_apps_gp)

9659

In [31]:
duplicate_apps_apple = []
unique_apps_apple = []

for app in apple_data:
    name = app[0]
    if name in unique_apps_apple:
        duplicate_apps_apple.append(name)
    else:
        unique_apps_apple.append(name)

In [32]:
print('Number of duplicate apple apps: ', len(duplicate_apps_apple))

Number of duplicate apple apps:  0


There are no duplicate apps for Apple.

we are going to create a dictionary for mapping the apps in Google Play withe highest number of user ratings and keep only those app the Google Play data.

In [33]:
#the number of reviews for Google Play data is at index 3
rating_index = 3
app_name_index = 0
reviews_max = {}
for app in google_play:
    name = app[app_name_index]
    user_ratings = float(app[rating_index])
    if name in reviews_max:
        if user_ratings > reviews_max[name] :
            reviews_max[name] = user_ratings
    else:
        reviews_max[name] = user_ratings
        
print(len(reviews_max))        
        

9659


In [34]:
#removing duplicates
google_play_clean = []
already_added = []
for app in google_play:
    name = app[app_name_index]
    user_ratings = float(app[rating_index])
    max_user_rating = reviews_max[name]
    if(max_user_rating == user_ratings and name not in already_added):
        google_play_clean.append(app)
        already_added.append(name)
        
print(len(google_play_clean)) 
print(len(already_added))
   
    

9659
9659


### Removing non-English app names

We want to remove apps not in English, so we'll test the app names for the presence of non-English characters.

In [35]:
def is_app_name_English(app_name):
    non_eng_counter = 0
    for letter in app_name:
        if ord(letter) > 127:
            non_eng_counter += 1
    return non_eng_counter <= 3

In [36]:
#testing function
instagram_name  = 'Instagram'
chinese_name = '爱奇艺PPS -《欢乐颂2》电视剧热播'
tm_name = 'Docs To Go™ Free Office Suite'
emoji_name = 'Instachat 😜'
testing_names= [instagram_name, chinese_name, tm_name, emoji_name]

In [37]:
for name in testing_names:
    print(is_app_name_English(name))

True
False
True
True


In [38]:
google_play_eng = []
apple_eng = []
#the index of the app name is different from google play apps
apple_name_index = 1
for app in google_play_clean:
    
    if is_app_name_English(app[app_name_index]):
        google_play_eng.append(app)
        
for app in apple_data:
      if is_app_name_English(app[apple_name_index]):
        apple_eng.append(app)

In [39]:
print(len(google_play_eng))

9614


In [40]:
print(len(apple_eng))

6183


### Isolating free apps

We are going to consider only free apps, with revenue provided by ads, in this study, so we are going to remove non free apps from both datasets.

In [41]:
gp_price_index = 7
apple_price_index = 4

In [42]:
gp_free = []
apple_free = []

for app in google_play_eng:
    price = (app[gp_price_index]).strip('$')
    price = float(price)
    if price == 0.0:
        gp_free.append(app)
        
for app in apple_eng:
    price = (app[apple_price_index]).strip('$')
    price = float(price)
    if price == 0.0:
        apple_free.append(app)
    

In [43]:
print(len(gp_free))
print(len(apple_free))

8864
3222


### Strategy for finding successful apps features

One of the strategy for building successful apps, consists in building a fast minimal prototype in android, sperimenting it on Google Play and if the app has enough response from user, start building a iOs app for the AppStore. <br>
In order to minimize costs and risks, we need to find the minimal set of characteristics defining the profile of a successful app and this is what we'll  do in the next sections of this analysis.

### Most common genres

In [44]:
gp_genre_index = -4
apple_genre_index = 11

In [45]:
def freq_table(data_set, index):
    frequency_table = {}
    
    for row in data_set[1:]:
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    total_apps = len(data_set)
    percentage_table = {}
    for key in frequency_table:
        freq = frequency_table[key]
        percentage_table[key] = freq/total_apps * 100
        
        
    return percentage_table

In [46]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [47]:
display_table(gp_free,gp_genre_index)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [48]:
display_table(apple_free, apple_genre_index)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2588454376163876
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The most common genre for free apps in the Apple Store is clearly Games, with over 58% of apps followed by nearly 8% of apps dedicated to enterteinment. <br>
The Google Play store shows instead a greater fragmentation of genres, with the most popular Genre, Tools, at only 8.4% followed by Entertainment at 6% .

### Measuring popularity: number of installs

Until now we focused our analysis on the number of apps in both Google Play and the AppStore, considering the frequency of genres, but the choice of the app genre is made by developers, we are more interested in the choice made by users, because the real success of an app is always determined by users. <br>
One way to measure the interests of users for an app is certainly the number of installs of the app. <br>
The two datasets we are considering both miss precise data about the number of installs, but we have data about the number of ratings for the Apple Store and a a category of installs for Google Play. <br>
We'll make an assumption: the number of ratings is proportional to the number of installs (we don't have data to prove it, but it seems to make sense). <br>
We will measure the average number of user ratings for genre.

#### Number of Install for the App Store

In [49]:
apple_genres = freq_table(apple_free, apple_genre_index)

In [50]:
total_genres = len(apple_genres)
total_genres

23

In [51]:
apple_tot_ratings_index = 5
apple_avg_ratings_for_genres ={}
for genre in apple_genres:
    total = 0
    total_user_ratings = 0
    for app in apple_free:
        genre_app = app[apple_genre_index]
        if(genre_app == genre):
            total += 1
            total_user_ratings += float(app[apple_tot_ratings_index])
    apple_avg_ratings_for_genres[genre]  = total_user_ratings/total
 

In [52]:
def display_dict(dictionary):
    for key in dictionary:
        print("{} : {}".format(key, dictionary[key]))

In [53]:
display_dict(apple_avg_ratings_for_genres)

Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Social Networking : 71548.34905660378
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Considering the average number of install per genres, games don't dominate the scene anymore in the AppStore. <br>
Navigation is the most popular genre with over 86000 average install, followed by Music apps and Weather apps. 

#### Number of Installs for Google Play

In [54]:
google_play_categories =  freq_table(gp_free,1)

In [55]:
google_play_categories

{'ART_AND_DESIGN': 0.631768953068592,
 'AUTO_AND_VEHICLES': 0.9250902527075812,
 'BEAUTY': 0.5979241877256317,
 'BOOKS_AND_REFERENCE': 2.1435018050541514,
 'BUSINESS': 4.591606498194946,
 'COMICS': 0.6204873646209386,
 'COMMUNICATION': 3.2378158844765346,
 'DATING': 1.861462093862816,
 'EDUCATION': 1.1620036101083033,
 'ENTERTAINMENT': 0.9589350180505415,
 'EVENTS': 0.7107400722021661,
 'FINANCE': 3.7003610108303246,
 'FOOD_AND_DRINK': 1.2409747292418771,
 'HEALTH_AND_FITNESS': 3.0798736462093865,
 'HOUSE_AND_HOME': 0.8235559566787004,
 'LIBRARIES_AND_DEMO': 0.9363718411552346,
 'LIFESTYLE': 3.9034296028880866,
 'GAME': 9.724729241877256,
 'FAMILY': 18.907942238267147,
 'MEDICAL': 3.531137184115524,
 'SOCIAL': 2.6624548736462095,
 'SHOPPING': 2.2450361010830324,
 'PHOTOGRAPHY': 2.944494584837545,
 'SPORTS': 3.395758122743682,
 'TRAVEL_AND_LOCAL': 2.33528880866426,
 'TOOLS': 8.461191335740072,
 'PERSONALIZATION': 3.3167870036101084,
 'PRODUCTIVITY': 3.892148014440433,
 'PARENTING': 0.65

In [56]:
display_table(gp_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.187274368231048
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [57]:
gp_avg_installs ={}
for category in google_play_categories:
    total = 0
    len_category = 0
    for app in gp_free:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    gp_avg_installs[category] = avg_n_installs
   
    

In [58]:
display_dict(gp_avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

Comunication seem to be the most popular genre for android apps