# Types of Apps with the Most Users

I am analyzing data from the App Store and the Google Play store to learn which kinds of apps attract the most users.  The scenario I have in mind is that I work for a company that makes free apps and gets its revenue from in-app ads.  I want to learn which types of apps have the most users and therefore which would make the most sense for this company to develop.

I'm going to use the App Store data set, which can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv),
and the Google Play store data set, which can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

The first thing I will do is open both files and convert them to lists of rows that I can work with:

In [11]:
from csv import reader

# Read each CSV into a list of rows (row 0 is the header).
# The encoding is set explicitly because some app names contain non-ASCII characters.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_data = list(reader(opened_file))

with open('googleplaystore.csv', encoding='utf8') as opened_file2:
    google_data = list(reader(opened_file2))


# Cleaning the Data
I'm going to start by cleaning up the data.
From the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) about the Google data, I see that one specific row is reported to be missing a value, and I will want to delete it if that's the case.  To double check, I'll print it below (the index is 10473 instead of the 10472 noted in the discussion because I did not delete the header row from my data set).

In [12]:
google_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']
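
A row like this can also be found programmatically instead of relying on a known index.  The sketch below is a minimal example, assuming that row 0 is the header and that every valid row should have the same number of columns as the header.  The function name `find_malformed_rows` and the toy `sample_rows` data are hypothetical and only for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # assumes row 0 is the header
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data (not from the real file): the last row is missing one value
sample_rows = [
    ['App', 'Category', 'Rating'],
    ['Slack', 'BUSINESS', '4.4'],
    ['Broken App', '1.9'],
]
malformed = find_malformed_rows(sample_rows)
print(malformed)
```

On the real data this would be called as `find_malformed_rows(google_data)`.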

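This row has only 12 values, and '1.9' (which looks like a rating) sits where the category should be, so the category value is missing and everything after it is shifted one column to the left.  I will delete this row and then print the row that is now at the same index to make sure the deletion worked: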

In [13]:
del google_data[10473]
google_data[10473]

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']

There are known duplicate entries in the Google data set, so I will address that issue next.  First I will check how many duplicates there are:

In [15]:
duplicate_apps = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of Duplicate Apps', len(duplicate_apps))
print('\n')
print('Examples of Duplicate Apps', duplicate_apps[:10])

Number of Duplicate Apps 1181


Examples of Duplicate Apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
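
The loop above checks `name in unique_apps` against a list, which rescans the list for every app.  As a hedged alternative, the same count can be obtained with `collections.Counter`.  This is a minimal sketch, and `count_duplicates` and `sample_names` are hypothetical names that are not part of the notebook.

```python
from collections import Counter

def count_duplicates(names):
    """Return (number of duplicate entries, sorted names that appear more than once)."""
    counts = Counter(names)
    # Every appearance after the first one counts as a duplicate entry
    n_duplicates = sum(c - 1 for c in counts.values())
    repeated = sorted(name for name, c in counts.items() if c > 1)
    return n_duplicates, repeated

# Toy list of app names for illustration
sample_names = ['Box', 'Slack', 'Box', 'Zenefits', 'Slack', 'Box']
n_dupes, repeated_names = count_duplicates(sample_names)
print(n_dupes, repeated_names)
```

This counts duplicates the same way as the loop above: an app that appears three times contributes two duplicate entries.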


So there are 1,181 duplicates in the data set.  I'm going to look at one of the examples to try to get an idea of which entries to keep and which to delete.

In [18]:
for app in google_data:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


The only difference between the entries is the fourth value (51507, 51507, and 51510), which is the number of reviews the app received.  I would like to use the most recent data to compare the apps, and the entry with the most reviews should be the most recent one, so I will keep the entry with the highest number of reviews and delete the others.

First I'm going to figure out how many unique apps there are so I know how many apps to expect in my cleaned data set (from the code above I learned that there are 1,181 duplicate entries):

In [24]:
print('Expected length of clean data set:', len(google_data[1:]) - 1181)

Expected length of clean data set: 9659


So I expect my clean data set to have 9,659 entries.  I will now create a dictionary where the keys are the names of the apps and the values are the highest number of reviews.  Each app name will be added to the dictionary as the code loops through the data set and the number of reviews will get updated to the highest number.  At the end I will check the length of my dictionary to make sure it is 9,659 like I expect.

In [23]:
reviews_max = {}

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    # Keep the largest review count seen so far for each app name
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Length of Dictionary:', len(reviews_max))

Length of Dictionary: 9659


The dictionary length is what I expected, and I now have a dictionary that maps each unique app name to its highest number of reviews.  Next I will use this dictionary to loop through the original data set and build a cleaned one that keeps only the entry with the highest number of reviews for each app.  Some of the duplicate entries have exactly the same number of reviews, but I still want only one entry per app.  For that reason, while looping through the data set I will also keep a list of the names that have already been added.  That way an entry is added only if it has the highest number of reviews AND its app is not already in the cleaned data set.

When I'm done I will check the length of my clean data to make sure it is 9,659 like I expect.

In [25]:
google_data_clean = []
already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_data_clean.append(app)
        already_added.append(name)

print('Length of Clean Data Set:', len(google_data_clean))

Length of Clean Data Set: 9659
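
The two passes above (first find the maximum, then filter) could also be folded into a single pass by storing the best row seen so far for each name.  The sketch below is one possible way to do that, assuming the name is in column 0 and the review count in column 3 as in the Google data.  `dedupe_keep_max_reviews` is a hypothetical helper, and the toy rows are shortened and partly made up for illustration.

```python
def dedupe_keep_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the most reviews (first one wins ties)."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    # Dictionaries preserve insertion order, so apps stay in first-appearance order
    return list(best.values())

# Toy rows: name, category, rating, reviews (the Box values are made up)
sample_apps = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
deduped = dedupe_keep_max_reviews(sample_apps)
print(deduped)
```

Because the dictionary holds at most one row per name, no separate `already_added` list is needed in this variant.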


Now I have a new clean data set with only unique values.

For this analysis I am only interested in apps that are in English.  I'm going to write a function that decides whether an app name is English by counting the characters that fall outside the ASCII range (code points above 127).  Because some English app names contain symbols or emojis, I'm going to allow names with up to 3 such characters.  This is not a perfect method but should be good enough for my analysis.  I will test my function on a few app names that I already know are English or not English and see if it returns the correct values.


In [33]:
def is_english(string):
    non_english = []
    for character in string:
        if ord(character) > 127:
            non_english.append(character)
            if len(non_english) > 3:
                return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜'))
print(is_english('Docs To Go™ Free Office Suite'))

True
False
True
True
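
For comparison, the same rule can be written more compactly by counting the non-ASCII characters in one expression.  This is only a sketch of an equivalent check. The name `is_english_compact` and the `max_non_ascii` parameter are hypothetical additions, not part of the function above.

```python
def is_english_compact(string, max_non_ascii=3):
    """Return True if the string has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

test_names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
]
results = [is_english_compact(name) for name in test_names]
print(results)
```

Unlike the original function, this version always scans the whole string instead of returning as soon as the fourth non-ASCII character is found, which makes no practical difference for short app names.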


My function worked like I expected!  I will now use this function to loop through both the Google and Apple data and create two new clean data sets to work with.  I will then measure the length of each data set to see how many entries are left.


In [59]:
google_data_clean_eng = []
apple_data_eng = []

for app in google_data_clean:
    name = app[0]
    if is_english(name):
        google_data_clean_eng.append(app)
        
for app in apple_data[1:]:
    name = app[0]
    if is_english(name):
        apple_data_eng.append(app)

print('Length of Google Data:', len(google_data_clean_eng))
print('Length of Apple Data:', len(apple_data_eng))

Length of Google Data: 9614
Length of Apple Data: 7197
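
The loops above assume that the app name is in column 0 of both files.  Column layouts can differ between exports of these data sets, so it is worth confirming each index against the header rows (`apple_data[0]` and `google_data[0]`) before trusting the counts.  The sketch below looks a column up by name rather than hard-coding its position. The `column_index` helper and the `sample_header` row are hypothetical and shown only for illustration.

```python
def column_index(header, column_name):
    """Return the position of column_name in the header row."""
    return header.index(column_name)  # raises ValueError if the column is missing

# Hypothetical header for illustration; print apple_data[0] to see the real one
sample_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price']
name_idx = column_index(sample_header, 'track_name')
price_idx = column_index(sample_header, 'price')
print(name_idx, price_idx)
```

If the name turned out not to be in column 0 of a file, the filter would be testing a different field, and the number of rows after filtering could equal the number before it.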


Of the remaining apps, I am only interested in the ones that are free.  So my last cleaning step will be to loop through each list and keep only the free apps.

In [68]:
google_fullclean = []
apple_fullclean = []

for app in google_data_clean_eng:
    price = app[7]
    if price == '0':
        google_fullclean.append(app)
        
for app in apple_data_eng:
    price = app[5]
    if price == '0':
        apple_fullclean.append(app)

print('Length of Google Data:', len(google_fullclean))
print('Length of Apple Data:', len(apple_fullclean))

Length of Google Data: 8862
Length of Apple Data: 4056
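
The comparison `price == '0'` only matches that exact string.  If a price column used another format for free apps, such as `'0.0'`, or a currency symbol for paid ones, a numeric check would be more forgiving.  The sketch below is a minimal example under that assumption. `is_free` is a hypothetical helper and the sample price strings are illustrative, not taken from the data.

```python
def is_free(price):
    """Return True if a price string represents zero, ignoring a leading dollar sign."""
    cleaned = price.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as not free
        return False

sample_prices = ['0', '0.0', '$4.99', 'Everyone']
checks = [is_free(p) for p in sample_prices]
print(checks)
```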


# Analyzing the Data

I am now going to begin analyzing the data.  The ultimate goal is to create an app that will be profitable in both the Android and iOS markets.  I will therefore start by looking at the most common genres of apps in each data set to see if there are any similarities.  Profits in this scenario depend on in-app ads, so the more people using the app the better.

I'm going to define two functions below.  One is for creating a frequency table of a particular column in the data set and the other one is for displaying that table so the most common category in the column is displayed first and the frequency is displayed as a percentage.


In [124]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        category = row[index]
        if category in table:
            table[category] += 1
        else:
            table[category] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage
    
    return table_percentages
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
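
As a point of comparison, the standard library's `collections.Counter` can build and sort the same kind of table in a few lines.  This is a sketch of an alternative, not a replacement for the functions above. `freq_table_sorted` and the toy `sample_dataset` are hypothetical names.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (category, percentage) pairs, most common category first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(category, count / total * 100)
            for category, count in counts.most_common()]

# Toy data: app name, category
sample_dataset = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'GAME']]
percentages = freq_table_sorted(sample_dataset, 1)
for category, pct in percentages:
    print(category, ':', pct)
```

One difference to be aware of: when two categories have the same count, `most_common()` keeps them in first-seen order, whereas sorting `(percentage, category)` tuples in reverse breaks ties by category name.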

I will now use the functions to look at the 'Category' and 'Genre' columns in the Google Play Store data and also the 'prime_genre' column in the App Store data.

In [101]:
print('Category Column:')
display_table(dataset = google_fullclean, index = 1)

Category Column:
FAMILY : 18.449559918754233
GAME : 9.873617693522906
TOOLS : 8.440532611148726
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994

In [102]:
print('Genre Column:')
display_table(dataset = google_fullclean, index = 9)

Genre Column:
Tools : 8.429248476641842
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7603249830737984
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.

In [103]:
print('prime_genre column:')
display_table(dataset = apple_fullclean, index = 12)

prime_genre column:
Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


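The data above indicates that in the App Store more than half of the apps are games (about 56%), whereas the Google Play store has a more even mix of games and practical apps (its largest category, FAMILY, is about 18% and GAME is about 10%).  These tables only show how many apps exist in each genre, not how many users those apps have, so next I am going to look at which kinds of apps in each store have the most users.  I'm going to use the 'Installs' column in the Google data for this.  The Apple data has no install counts, so there I will use the 'rating_count_tot' column (the total number of user ratings) as a proxy for the number of users.  I'm going to start with the Apple data: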

In [122]:
table = freq_table(dataset = apple_fullclean, index = 12)

for genre in table:
    total = 0
    len_genre = 0
    for row in apple_fullclean:
        genre_app = row[12]
        if genre_app == genre:
            user_ratings = float(row[6])
            total += user_ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, avg_ratings)
        

Productivity 19053.887096774193
Weather 47220.93548387097
Shopping 18746.677685950413
Reference 67447.9
Finance 13522.261904761905
Music 56482.02985074627
Utilities 14010.100917431193
Travel 20216.01785714286
Social Networking 53078.195804195806
Sports 20128.974683544304
Health & Fitness 19952.315789473683
Games 18924.68896765618
Food & Drink 20179.093023255813
News 15892.724137931034
Book 8498.333333333334
Photo & Video 27249.892215568863
Entertainment 10822.961077844311
Business 6367.8
Lifestyle 8978.308510638299
Education 6266.333333333333
Navigation 25972.05
Medical 459.75
Catalogs 1779.5555555555557
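
The nested loop above rescans the whole data set once per genre.  The same averages can be computed in a single pass by accumulating a running total and a count for each genre.  The sketch below shows one way to do this. `average_by_group` and the toy `sample_ratings` rows are hypothetical.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of the numeric column} using a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: genre, number of ratings
sample_ratings = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
averages = average_by_group(sample_ratings, 0, 1)
print(averages)
```

With the column indexes used in this notebook, the call would be `average_by_group(apple_fullclean, 12, 6)`.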


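And now I'll look at the Google data.  The 'Installs' values are strings such as '1,000+', so the code removes the plus sign and the commas before converting them to numbers: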

In [126]:
google_table = freq_table(dataset = google_fullclean, index = 1)
for category in google_table:
    total = 0
    len_category = 0
    for row in google_fullclean:
        category_app = row[1]
        if category == category_app:
            installs = row[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = float(installs)
            total += installs
            len_category += 1
    avg_installs = total / len_category
    print(category, avg_installs)

ART_AND_DESIGN 1905351.6666666667
AUTO_AND_VEHICLES 647317.8170731707
BEAUTY 513151.88679245283
BOOKS_AND_REFERENCE 8767811.894736841
BUSINESS 1712290.1474201474
COMICS 817657.2727272727
COMMUNICATION 38456119.167247385
DATING 854028.8303030303
EDUCATION 3082017.543859649
ENTERTAINMENT 21134600.0
EVENTS 253542.22222222222
FINANCE 1387692.475609756
FOOD_AND_DRINK 1924897.7363636363
HEALTH_AND_FITNESS 4188821.9853479853
HOUSE_AND_HOME 1313681.9054054054
LIBRARIES_AND_DEMO 638503.734939759
LIFESTYLE 1437816.2687861272
GAME 15837565.085714286
FAMILY 2691618.159021407
MEDICAL 120616.48717948717
SOCIAL 23253652.127118643
SHOPPING 7036877.311557789
PHOTOGRAPHY 17805627.643678162
SPORTS 3638640.1428571427
TRAVEL_AND_LOCAL 13984077.710144928
TOOLS 10695245.286096256
PERSONALIZATION 5201482.6122448975
PRODUCTIVITY 16787331.344927534
PARENTING 542603.6206896552
WEATHER 5074486.197183099
VIDEO_PLAYERS 24852732.40506329
NEWS_AND_MAGAZINES 9549178.467741935
MAPS_AND_NAVIGATION 4056941.7741935486
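
The string cleanup inside the loop above can be pulled out into a small helper, which makes it easier to test on its own.  This is a minimal sketch, and `parse_installs` is a hypothetical name.

```python
def parse_installs(installs):
    """Convert an install-count string such as '1,000+' to an integer."""
    return int(installs.replace('+', '').replace(',', ''))

sample_installs = ['1,000+', '10,000,000+', '0']
parsed = [parse_installs(value) for value in sample_installs]
print(parsed)
```

Since a value like '1,000+' only says that an app has at least 1,000 installs, the averages built from these numbers are rough lower bounds rather than exact user counts.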


Both "Social Networking" (about 53,000 ratings per app on the App Store) and "COMMUNICATION" (about 38.5 million installs per app on Google Play, the highest of any category) have very high numbers of users.  This indicates that it would be worthwhile to develop an app that allows for some sort of sharing or interacting with others.  The health and fitness categories also draw a sizable audience in both stores (about 20,000 ratings per app on the App Store and about 4.2 million installs per app on Google Play), although they sit well below the leaders.  A lot of different conclusions can be drawn from this data, and it would make sense to go in a number of different directions.  As one example, it would likely be profitable to develop a health and fitness app that lets users share results or "compete" with friends and family, so that the app combines the health and fitness element with the communication element.