# Analyzing Popular Apps

This project will analyze popular apps on the App Store and Google Play. The analysis will be valuable for determining which app types are the most popular, and therefore most likely to generate revenue through in-app ads.

These datasets come from Kaggle: the [App Store dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) and the [Google Play dataset](https://www.kaggle.com/lava18/google-play-store-apps/home).

In [1]:
import csv

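# Read each CSV into a list of rows (header row included).
# encoding='utf8' is an assumption about the files; it avoids
# platform-dependent decoding of the non-ASCII app names.
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apple = list(csv.reader(apple_file))
with open('googleplaystore.csv', encoding='utf8') as google_file:
    google = list(csv.reader(google_file))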

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# Separate header
google_header = google[0]
apple_header = apple[0]
google = google[1:]
apple = apple[1:]

In [4]:
print(apple_header)
print(google_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [5]:
explore_data(apple, 0, 3, True)
explore_data(google, 0, 3, True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '

## Apple Headings

(The header starts with an unnamed column at index 0, which is why the indices below start at 1 instead of 0.)

| # | Column | Explanation |
| - | :--- | :--- |
| 1 | id | App ID |
| 2 | track_name | App name |
| 3 | size_bytes | Size in bytes |
| 4 | currency | Currency |
| 5 | price | Price |
| 6 | rating_count_tot | User rating count (all versions) |
| 7 | rating_count_ver | User rating count (current version) |
| 8 | user_rating | Average user rating (all versions) |
| 9 | user_rating_ver | Average user rating (current version) |
| 10 | ver | Latest version |
| 11 | cont_rating | Content rating |
| 12 | prime_genre | Primary genre |
| 13 | sup_devices.num | Number of supported devices |
| 14 | ipadSc_urls.num | Number of screenshots shown for display |
| 15 | lang.num | Number of supported languages |
| 16 | vpp_lic | VPP device-based licensing enabled |

## Google Headings

| Index | Column |
| - | - |
| 0 | App |
| 1 | Category |
| 2 | Rating |
| 3 | Reviews |
| 4 | Size |
| 5 | Installs |
| 6 | Type |
| 7 | Price |
| 8 | Content Rating |
| 9 | Genres |
| 10 | Last Updated |
| 11 | Current Ver |
| 12 | Android Ver |

In [6]:
# Forums say this row is incorrect, so deleting it.
# Run this cell only once: re-running it would delete a different, valid row.
del google[10472]
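
Rather than relying only on the forum report, a malformed row can often be spotted by comparing each row's length with the header's. The following is a minimal sketch on toy data; `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and it only catches rows whose problem is a missing or extra field.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only: the second row is missing one value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]

bad = find_malformed_rows(toy_rows, toy_header)
print(bad)
```

Calling it as `find_malformed_rows(google, google_header)` before the `del` would be one way to check the reported index against the data.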

Check for duplicate entries:

In [7]:
# assumes no header row
def find_dupes(dataset, name_col=0):
    duplicates = []
    seen = set()  # set membership checks are faster than scanning a list

    for app in dataset:
        name = app[name_col]
        if name in seen:
            duplicates.append(name)
        seen.add(name)
    print(len(duplicates))
    print(duplicates[:15])
    return duplicates

In [8]:
apple_dupes = find_dupes(apple, 2)
google_dupes = find_dupes(google)

2
['VR Roller Coaster', 'Mannequin Challenge']
1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


The Google dataset has 1181 duplicate entries and the Apple dataset has 2. For each duplicated app, let's keep only the entry with the most reviews and remove the rest, because we want the most data possible.

In [9]:
# assumes no header row
# identify the max number of reviews for each app
# that is the row that we will keep
def find_max_reviews(dataset, name_col=0, max_col=3):
    reviews_max = {}
    for app in dataset:
        name = app[name_col]
        n_reviews = float(app[max_col])
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
    return reviews_max

google_reviews_max = find_max_reviews(google)
apple_reviews_max = find_max_reviews(apple,2,6)
print(len(apple_reviews_max))
print(len(google_reviews_max))

7195
9659


In [10]:
# remove the rows that do not have the max number of reviews
def remove_duplicates(dataset, reviews_max, name_col=0, max_col=3):
    clean = []
    added = []
    for app in dataset:
        name = app[name_col]
        n_reviews = float(app[max_col])
        if n_reviews == reviews_max[name] and name not in added:
            clean.append(app)
            added.append(name)
    return clean
            
google_clean = remove_duplicates(google, google_reviews_max)
apple_clean = remove_duplicates(apple, apple_reviews_max, 2, 6)

print(len(apple_clean))
print(len(google_clean))

7195
9659
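
The two-step approach above (find the maximum, then filter) can also be written as a single pass that remembers the best row per app name. This is a sketch of an alternative, not the notebook's method; `keep_most_reviewed` is a hypothetical name and the rows below are toy data.

```python
def keep_most_reviewed(dataset, name_col=0, reviews_col=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for app in dataset:
        name = app[name_col]
        if name not in best or float(app[reviews_col]) > float(best[name][reviews_col]):
            best[name] = app
    return list(best.values())

# Toy rows for illustration: 'Box' appears twice with different review counts.
toy = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159880'],
]
deduped = keep_most_reviewed(toy)
print(deduped)
```

One difference to keep in mind: the result is ordered by each name's first appearance, so row order can differ from `remove_duplicates`, although the set of kept rows should match.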


Now that duplicates are removed, let's remove non-English apps.

In [11]:
def has_few_non_english_characters(string):
    strikes = 0
    for c in string:
        if ord(c) > 127:
            strikes += 1
            if strikes > 2:
                return False
    return True    
    
has_few_non_english_characters('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
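
The function tolerates up to two characters outside the ASCII range before rejecting a name. Presumably this is so that English names containing an emoji or a trademark symbol are not discarded. The sketch below repeats the function so it runs on its own and tries it on a few example names chosen for illustration.

```python
def has_few_non_english_characters(string):
    # Same rule as the cell above: allow at most two non-ASCII characters.
    strikes = 0
    for c in string:
        if ord(c) > 127:
            strikes += 1
            if strikes > 2:
                return False
    return True

examples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in examples:
    print(name, '->', has_few_non_english_characters(name))
```

A limitation of this heuristic is that a non-English name written with fewer than three non-ASCII characters still passes.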

In [12]:
def remove_non_english_apps(dataset, name_col=0):
    english_apps = []
    for app in dataset:
        if has_few_non_english_characters(app[name_col]):
            english_apps.append(app)
    return english_apps

google_clean = remove_non_english_apps(google_clean)
apple_clean = remove_non_english_apps(apple_clean,2)

In [13]:
print(len(apple_clean))
print(len(google_clean))

6153
9597


In [14]:
def remove_paid_apps(dataset, price_col):
    free = []
    for app in dataset:
        # Prices may carry a '$' prefix; strip it before converting
        price = app[price_col].replace('$', '')
        if float(price) == 0:
            free.append(app)
    return free

In [15]:
google_clean = remove_paid_apps(google_clean,7)
apple_clean = remove_paid_apps(apple_clean,5)
print(len(apple_clean))
print(len(google_clean))

3201
8848


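Filtering out paid apps leaves 3201 of 6153 App Store apps and 8848 of 9597 Google Play apps, so we conclude that the App Store sample has a much larger share of paid apps (about 48%) than the Google Play sample (about 8%).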
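
The comparison is easier to judge as a share of each store than as raw counts. The arithmetic below uses only the four counts printed by the cells above; the dictionary layout is just for illustration.

```python
# Counts printed by the notebook before and after filtering out paid apps.
counts = {
    'App Store': {'before': 6153, 'free': 3201},
    'Google Play': {'before': 9597, 'free': 8848},
}

paid_share = {}
for store, c in counts.items():
    paid = c['before'] - c['free']
    paid_share[store] = paid / c['before']
    print(f"{store}: {paid} paid apps ({paid_share[store]:.1%})")
```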

Our "validation strategy" for developing apps is to launch an app on the Play Store first and, if it makes money after 6 months, develop an iOS version. Since the app must eventually do well in both stores, we're going to examine the Genres and Category columns (Google Play) and the prime_genre column (App Store) to determine the most common kinds of apps.

In [16]:
def freq_table(dataset, index):
    freq = {}
    for app in dataset:
        key = app[index]
        if key in freq:
            freq[key] += 1
        else:
            freq[key] = 1
    # convert counts to proportions (fractions of 1, not percentages)
    total = len(dataset)
    for key, val in freq.items():
        freq[key] = val / total
    return freq

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [17]:
display_table(apple_clean, 12) # prime_genre

Games : 0.5823180256169946
Entertainment : 0.07841299593876913
Photo & Video : 0.0499843798812871
Education : 0.03686348016244923
Social Networking : 0.033114651671352704
Shopping : 0.02592939706341768
Utilities : 0.024679787566385506
Sports : 0.02155576382380506
Music : 0.020618556701030927
Health & Fitness : 0.020306154326772883
Productivity : 0.017494532958450486
Lifestyle : 0.015620118712902219
News : 0.013433302093095907
Travel : 0.012496094970321775
Finance : 0.010934083099031553
Weather : 0.008747266479225243
Food & Drink : 0.008122461730709154
Reference : 0.005310840362386754
Business : 0.005310840362386754
Book : 0.0037488284910965324
Navigation : 0.0018744142455482662
Medical : 0.0018744142455482662
Catalogs : 0.0012496094970321774


The two most common genres are Games (58%) and Entertainment (8%), so the free English apps on the App Store are dominated by apps made for fun. To recommend a type of app that would have the most users, however, we have to include the number of users in our analysis. Just because there are many game apps doesn't mean anybody downloads them.
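
The table above prints proportions, while the discussion quotes percentages. As a sketch of an alternative, the same frequency table can be built with `collections.Counter` and reported directly in percent; `freq_table_pct` is a hypothetical name and the rows are toy data.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(app[index] for app in dataset)
    total = sum(counts.values())
    return [(key, 100 * n / total) for key, n in counts.most_common()]

# Toy rows for illustration only.
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in freq_table_pct(toy, 1):
    print(f'{genre}: {pct:.1f}%')
```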

In [18]:
display_table(google_clean, 9) # Genres

Tools : 0.08442585895117541
Entertainment : 0.06080470162748644
Education : 0.05357142857142857
Business : 0.04599909584086799
Productivity : 0.03899186256781193
Lifestyle : 0.038765822784810125
Finance : 0.037070524412296565
Medical : 0.035375226039783
Sports : 0.03458408679927667
Personalization : 0.03322784810126582
Communication : 0.032323688969258586
Action : 0.03096745027124774
Health & Fitness : 0.030854430379746837
Photography : 0.029498191681735987
News & Magazines : 0.02802893309222423
Social : 0.02667269439421338
Travel & Local : 0.02328209764918626
Shopping : 0.02249095840867993
Books & Reference : 0.021360759493670885
Simulation : 0.020456600361663652
Dating : 0.018648282097649186
Arcade : 0.01842224231464738
Video Players & Editors : 0.017744122965641953
Casual : 0.01763110307414105
Maps & Navigation : 0.013901446654611212
Food & Drink : 0.012432188065099457
Puzzle : 0.011301989150090416
Racing : 0.009945750452079566
Role Playing : 0.009380650994575045
Libraries & Demo : 

In [19]:
display_table(google_clean, 1) # Category

FAMILY : 0.18942133815551537
GAME : 0.09697106690777577
TOOLS : 0.08453887884267632
BUSINESS : 0.04599909584086799
PRODUCTIVITY : 0.03899186256781193
LIFESTYLE : 0.03887884267631103
FINANCE : 0.037070524412296565
MEDICAL : 0.035375226039783
SPORTS : 0.033905967450271246
PERSONALIZATION : 0.03322784810126582
COMMUNICATION : 0.032323688969258586
HEALTH_AND_FITNESS : 0.030854430379746837
PHOTOGRAPHY : 0.029498191681735987
NEWS_AND_MAGAZINES : 0.02802893309222423
SOCIAL : 0.02667269439421338
TRAVEL_AND_LOCAL : 0.02339511754068716
SHOPPING : 0.02249095840867993
BOOKS_AND_REFERENCE : 0.021360759493670885
DATING : 0.018648282097649186
VIDEO_PLAYERS : 0.017970162748643763
MAPS_AND_NAVIGATION : 0.013901446654611212
FOOD_AND_DRINK : 0.012432188065099457
EDUCATION : 0.01164104882459313
ENTERTAINMENT : 0.009606690777576853
LIBRARIES_AND_DEMO : 0.009380650994575045
AUTO_AND_VEHICLES : 0.009267631103074141
HOUSE_AND_HOME : 0.008024412296564195
WEATHER : 0.007911392405063292
EVENTS : 0.00712025316455

The most common genres on the Play Store are Tools (8%), Entertainment (6%), and Education (5%). Family (19%), Game (10%), and Tools (8%) are the most common categories. I am surprised there is such a surfeit of games on the App Store compared to the Play Store, though it might not be a fair comparison because the Google data splits out its genres so finely.
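
One way to make the Google genres coarser is to group compound labels such as 'Art & Design;Pretend Play' under the part before the semicolon. The sketch below assumes, without confirmation from the dataset's documentation, that the first part is the primary genre; `primary_genre_counts` is a hypothetical helper and the rows are toy data.

```python
from collections import Counter

def primary_genre_counts(dataset, genre_col):
    """Count apps by the part of the genre label before the first ';'."""
    # Assumption: the text before ';' is the app's primary genre.
    return Counter(app[genre_col].split(';')[0] for app in dataset)

# Toy rows for illustration only.
toy = [
    ['App A', 'Art & Design'],
    ['App B', 'Art & Design;Pretend Play'],
    ['App C', 'Casual;Action & Adventure'],
    ['App D', 'Casual'],
]
counts = primary_genre_counts(toy, 1)
print(counts.most_common())
```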

Now let's examine how popular the apps in each category are on average. Only the Google Play data has an install count (the Installs column); the App Store data does not, so we'll use the number of ratings per app (rating_count_tot) as a stand-in for the number of installs.

In [20]:
def find_avg_ratings_per_genre(dataset, genre_col, rat_col):
    totals = {} # genre -> total number of ratings
    counts = {} # genre -> number of apps
    # one pass over the data instead of one pass per genre
    for app in dataset:
        genre = app[genre_col]
        totals[genre] = totals.get(genre, 0) + float(app[rat_col])
        counts[genre] = counts.get(genre, 0) + 1
    rat_per_gen = {}
    for genre in totals:
        rat_per_gen[genre] = totals[genre] / counts[genre]
    return rat_per_gen
        
def display_dictionary_sorted(dic):
    table_display = []
    for key, val in dic.items():
        key_val_as_tuple = (val, key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [21]:
app_rat_per_gen = find_avg_ratings_per_genre(apple_clean, 12, 6)
display_dictionary_sorted(app_rat_per_gen)

Navigation : 86090.33333333333
Reference : 79350.4705882353
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 46384.916666666664
Food & Drink : 33333.92307692308
Finance : 32367.02857142857
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 27230.734939759037
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22910.83100858369
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 19156.493670886077
Lifestyle : 16815.48
Entertainment : 14195.358565737051
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


This analysis shows that Navigation, Reference, and Social Networking apps have by far the most ratings per app in the App Store. The genres with the greatest number of apps (i.e., Games, Entertainment, Photo & Video, and Education) were mediocre performers on this metric. However, I would still hesitate to propose an app category from this analysis. It may simply be that everybody with an iPhone has Apple Maps, for instance, and so they all have an opinion of it. More generally, a genre's average may be driven by one "killer app" (say, Facebook) while the other apps in that genre are non-entities. Or perhaps the ratings are all negative (though I suppose that is an opportunity for disruption).
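
A simple check for the "killer app" concern is to compare each genre's mean with its median: when one app dominates, the mean sits far above the median. The sketch below is an illustration on toy data, not a result from the App Store dataset; `mean_and_median_by_genre` is a hypothetical helper.

```python
from statistics import mean, median

def mean_and_median_by_genre(dataset, genre_col, rat_col):
    """Return {genre: (mean, median)} of the rating counts in rat_col."""
    by_genre = {}
    for app in dataset:
        by_genre.setdefault(app[genre_col], []).append(float(app[rat_col]))
    return {g: (mean(v), median(v)) for g, v in by_genre.items()}

# Toy data: one dominant app inflates the 'Navigation' mean.
toy = [
    ['Big Maps', 'Navigation', '300000'],
    ['Tiny Nav', 'Navigation', '30'],
    ['Compass', 'Navigation', '60'],
    ['Game 1', 'Games', '20000'],
    ['Game 2', 'Games', '24000'],
]
stats = mean_and_median_by_genre(toy, 1, 2)
for genre, (avg, med) in stats.items():
    print(genre, 'mean:', avg, 'median:', med)
```

For the notebook's data the call would be `mean_and_median_by_genre(apple_clean, 12, 6)`.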

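To do the same analysis for the Google Play data, we first clean the 'Installs' column, whose values are strings such as '10,000+', by stripping the commas and plus signs and converting to numbers. Note that the averages in the cell after that are computed from the Reviews column (index 3), the closest counterpart to the App Store rating counts.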

In [22]:
for app in google_clean:
    # '10,000+' -> 10000.0; str() keeps the cell safe to re-run
    installs = str(app[5]).replace(',', '').replace('+', '')
    app[5] = float(installs)
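
With Installs converted to numbers, an average install count per category becomes possible. The sketch below is one hedged way to compute it on toy rows; `parse_installs` and `avg_installs_by_category` are hypothetical helpers. Because values such as '10,000+' are lower bounds of a bucket, the averages are rough approximations at best.

```python
def parse_installs(value):
    """Convert an Installs value such as '10,000+' to a float."""
    return float(str(value).replace(',', '').replace('+', ''))

def avg_installs_by_category(dataset, cat_col=1, installs_col=5):
    """Return {category: average installs}, treating '10,000+' as 10000."""
    totals, counts = {}, {}
    for app in dataset:
        cat = app[cat_col]
        totals[cat] = totals.get(cat, 0) + parse_installs(app[installs_col])
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: totals[cat] / counts[cat] for cat in totals}

# Toy rows for illustration only, laid out like the first six Google columns.
toy = [
    ['App A', 'TOOLS', '4.1', '159', '19M', '10,000+'],
    ['App B', 'TOOLS', '3.9', '967', '14M', '500,000+'],
    ['App C', 'GAME', '4.7', '87510', '8.7M', '5,000,000+'],
]
avg_installs = avg_installs_by_category(toy)
print(avg_installs)
```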

In [23]:
goo_rat_per_gen = find_avg_ratings_per_genre(google_clean, 1, 3)
display_dictionary_sorted(goo_rat_per_gen)

COMMUNICATION : 999089.6118881119
SOCIAL : 965830.9872881356
GAME : 684290.0629370629
VIDEO_PLAYERS : 425350.08176100627
PHOTOGRAPHY : 404081.3754789272
TOOLS : 306550.3034759358
ENTERTAINMENT : 301752.24705882353
SHOPPING : 223887.34673366835
PERSONALIZATION : 181122.31632653062
WEATHER : 173679.5285714286
PRODUCTIVITY : 160634.5420289855
MAPS_AND_NAVIGATION : 143611.27642276423
TRAVEL_AND_LOCAL : 129484.42512077295
SPORTS : 117317.25666666667
FAMILY : 113142.99821002387
NEWS_AND_MAGAZINES : 93088.03225806452
BOOKS_AND_REFERENCE : 88460.62962962964
HEALTH_AND_FITNESS : 78094.9706959707
FOOD_AND_DRINK : 57478.79090909091
EDUCATION : 56293.09708737864
COMICS : 43371.57407407407
FINANCE : 38535.8993902439
LIFESTYLE : 34118.90406976744
HOUSE_AND_HOME : 27113.309859154928
ART_AND_DESIGN : 24699.42105263158
BUSINESS : 24239.727272727272
DATING : 21953.272727272728
PARENTING : 16378.706896551725
AUTO_AND_VEHICLES : 14140.280487804877
LIBRARIES_AND_DEMO : 10925.807228915663
BEAUTY : 7476.2264

In the Google Play Store, communication and social apps have the most reviews per app on average, followed by games, video players, and photography. Again, it would make sense to check whether a few "winner" apps in those categories are skewing the averages, but these categories seem like a good place to start.
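
To see whether a few "winner" apps dominate a category, one option is to measure what share of the category's reviews belongs to its single most-reviewed app. The sketch below uses toy rows; `top_app_share` is a hypothetical helper, and the column indices follow the Google layout (name 0, Category 1, Reviews 3).

```python
def top_app_share(dataset, name_col=0, cat_col=1, reviews_col=3):
    """Return {category: (top app name, its share of the category's reviews)}."""
    totals = {}  # category -> total reviews
    top = {}     # category -> (reviews, name) of the most-reviewed app
    for app in dataset:
        cat, n = app[cat_col], float(app[reviews_col])
        totals[cat] = totals.get(cat, 0) + n
        if cat not in top or n > top[cat][0]:
            top[cat] = (n, app[name_col])
    # Skip categories with no reviews at all to avoid dividing by zero.
    return {cat: (top[cat][1], top[cat][0] / totals[cat])
            for cat in totals if totals[cat] > 0}

# Toy rows for illustration only.
toy = [
    ['Chat Giant', 'COMMUNICATION', '4.0', '9000'],
    ['Small Chat', 'COMMUNICATION', '4.0', '1000'],
    ['Puzzle X', 'GAME', '4.5', '500'],
    ['Puzzle Y', 'GAME', '4.5', '500'],
]
shares = top_app_share(toy)
for cat, (name, share) in shares.items():
    print(cat, ':', name, f'{share:.0%}')
```

A share close to 100% would suggest that the category average mostly describes one app.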

In [24]:
# Just out of curiosity: the same averages by Genres (column 9) instead of Category
ratings_per_genre = find_avg_ratings_per_genre(google_clean, 9, 3)
display_dictionary_sorted(ratings_per_genre)

Adventure;Action & Adventure : 1513269.0
Strategy : 1251841.8148148148
Communication : 999089.6118881119
Social : 965830.9872881356
Casual;Action & Adventure : 942726.4166666666
Card;Action & Adventure : 920571.0
Casual : 837706.0064102564
Sports;Action & Adventure : 730014.0
Arcade : 713174.8282208589
Racing : 597997.1590909091
Action : 544150.7116788321
Puzzle;Action & Adventure : 533895.3333333334
Video Players & Editors : 428992.6050955414
Photography : 404081.3754789272
Tools;Education : 342336.0
Role Playing;Action & Adventure : 322308.3333333333
Tools : 306502.3975903614
Adventure;Education : 288606.0
Adventure : 285217.72881355934
Role Playing : 249256.8313253012
Education;Education : 234564.86666666667
Word : 228272.04347826086
Educational;Action & Adventure : 225802.33333333334
Shopping : 223887.34673366835
Music : 216456.44444444444
Puzzle : 215662.56
Educational;Pretend Play : 214720.125
Sports : 214124.96078431373
Racing;Action & Adventure : 203194.86666666667
Trivia : 193