# Profitable App Profiles

The aim of this analysis is to determine what goes into a successful app.  By looking at data associated with free apps, the ultimate goal is to understand which types of apps attract the most users.  The profile is based only on free apps aimed at English-language users.

Here we import the necessary modules and read both the Google Play and Apple Store files into lists of lists:

In [64]:
from csv import reader

# Read each CSV into a list of rows; the with-block closes the file afterwards
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    googleplay_list = list(reader(opened_file))

with open('AppleStore.csv', encoding='utf8') as opened_file2:
    apple_list = list(reader(opened_file2))



This is a quick function that will allow for the viewing of both datasets in a meaningful format:

In [103]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Here is some sample data:

In [104]:
explore_data(googleplay_list, 0, 5)

explore_data(apple_list, 0, 5)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_ratin

Here is some general info about each data set:

In [101]:
print('Google Play Stats:')
print('Number of columns: ' + str(len(googleplay_list[0])))
print('Number of rows: ' + str(len(googleplay_list)))
print('\n')
print('Apple Store Stats:')
print('Number of columns: ' + str(len(apple_list[0])))
print('Number of rows: ' + str(len(apple_list)))




Google Play Stats:
Number of columns: 13
Number of rows: 10842


Apple Store Stats:
Number of columns: 17
Number of rows: 7198


To compare the two datasets, we need columns that capture the same kind of information in both.

Therefore, it makes sense to look at reviews, genre, and possibly size.

More information about the Apple Store columns and what they represent can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).


In [114]:
print(googleplay_list[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


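An error in the Google Play dataset was identified in the discussion section of the dataset, found [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). As the output above shows, the row at index 10473 is missing its Category value, so every later field is shifted one column to the left. This row was deleted from the dataset for a cleaner analysis.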

In [115]:
del googleplay_list[10473]

In [117]:
explore_data(googleplay_list, 10471, 10475)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']
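
Rows like the one deleted above can also be located without relying on the discussion forum. The sketch below assumes that a malformed row shows up as a row whose length differs from the header's; find_malformed_rows is a hypothetical helper name and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected width
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != expected
    ]


# Made-up sample: the last row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Broken Row', '1.9'],
]
print(find_malformed_rows(sample))
```

The returned indexes refer to positions in the full list (header included), so they can be passed to del. When deleting several rows, delete from the highest index down so that earlier deletions do not shift the later ones.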




Also noted in the discussions are instances of duplicate app entries. For example, Instagram appears four times:

In [172]:
for app in googleplay_list:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Duplicates do not seem to be a problem for the Apple Store dataset, where Instagram appears only once:

In [171]:
for app in apple_list:
    name = app[2]
    if name == 'Instagram':
        print(app)

['591', '389801252', 'Instagram', '113954816', 'USD', '0', '2161558', '1289', '4.5', '4', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


In [126]:
single_apps = []
duplicate_apps = []

for app in googleplay_list[1:]:
    name = app[0]
    if name in single_apps:
        duplicate_apps.append(name)
    else:
        single_apps.append(name)
        
print('Number of unique apps: ' + str(len(single_apps)))
print('Number of duplicate apps: ' + str(len(duplicate_apps)))
print('\n')
print(duplicate_apps[0:5])

Number of unique apps: 9659
Number of duplicate apps: 1181


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
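
Because the check name in single_apps scans a list on every iteration, the loop above gets slow on larger datasets. As an alternative sketch (not the method used in this notebook), collections.Counter from the standard library gives the same two counts in one pass; count_duplicates is a hypothetical helper and the sample rows are invented.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of extra duplicate rows)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # every occurrence after the first counts as one duplicate
    n_duplicates = sum(count - 1 for count in counts.values())
    return n_unique, n_duplicates


# Invented rows: 'A' appears three times, 'B' once
sample = [['A'], ['B'], ['A'], ['A']]
print(count_duplicates(sample))
```

With the notebook's data the call would be count_duplicates(googleplay_list[1:]), skipping the header row as the original loop does.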


When deciding which duplicate entry to keep, we can use the Reviews column: the entry with the highest review count should be the most recent one (as seen in the Instagram example).

Below we build a dictionary in which each key is a unique app name and each value is the highest number of reviews recorded for that app:

In [132]:
reviews_max = {}

for app in googleplay_list[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print('Unique apps: ' + str(len(reviews_max)))


Unique apps: 9659


Below we create a cleaned list of lists (cleaned_googleplay_list) containing only the most recent entry for each unique app in the Google Play store.

We do this by looping through the original Google Play dataset (excluding the header row) and adding an app's row to the new list only if it meets two criteria:

- its number of reviews equals the maximum stored for that app in the reviews_max dictionary

- its name has not already been added (checked against the already_added list), which handles duplicates that tie on the maximum review count


In [192]:
cleaned_googleplay_list = []
already_added = []

for app in googleplay_list[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        cleaned_googleplay_list.append(app)
        already_added.append(name)
    
print('Number of apps in cleaned list: ' + str(len(cleaned_googleplay_list)))
print('\n')
print(cleaned_googleplay_list[0:2])
print('\n')
print(cleaned_googleplay_list[4412])


Number of apps in cleaned list: 9659


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]


['中国語 AQリスニング', 'FAMILY', 'NaN', '21', '17M', '5,000+', 'Free', '0', 'Everyone', 'Education', 'June 22, 2016', '2.4.0', '4.0 and up']
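
The two-step approach above (first reviews_max, then the filtered list) could also be folded into a single pass that remembers the best row seen so far for each name. This is only a sketch under the same assumption that more reviews means a more recent entry; dedupe_by_max_reviews is a hypothetical name, the Instagram rows are trimmed-down versions of the duplicates shown earlier, and 'Example App' is made up.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # strictly greater, so the earliest row wins when counts tie
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '10'],
]
for row in dedupe_by_max_reviews(sample):
    print(row)
```

One difference from the notebook's version: the result is ordered by the first appearance of each name rather than by the position of the kept row, so the two lists can contain the same rows in a different order.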


Next we remove any apps that are not in English.  To do this we identify the characters in each app name by their underlying code point, using the built-in **ord()** function.

Example:

In [193]:
print(ord('a'))
print(ord('A'))
print(ord('5'))
print(ord('+'))
print(ord('爱'))
print(ord('ン'))


97
65
53
43
29233
12531


According to the [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system, characters commonly used in the English language fall in range 0 to 127.

Below we will make a simple function that takes in a string and returns a boolean value depending on whether or not the characters in the string are within the English range:

In [194]:
def english_checker(app_name):
    string = str(app_name)
    for character in string:
        if ord(character) > 127:
            return False
    # every character was inside the ASCII range
    return True


print(english_checker('Instagram'))
print(english_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_checker('Docs To Go™ Free Office Suite'))
print(english_checker('Instachat 😜'))
    

True
False
False
False


Unfortunately, this function isn't perfect because, while it returns False for app names containing characters such as 爱 or 奇, it also returns False for emojis such as 😜 and symbols such as ™.

To address this, the function will instead allow up to 3 characters outside the ASCII range to build in some flexibility (I used the three-strikes analogy here).  This version still isn't perfect: it will drop English apps whose names are littered with more than three emojis or symbols.

In [195]:
def english_checker2(string):
    strikes = 0
    for character in string:
        if ord(character) > 127:
            strikes += 1

    # allow up to 3 non-ASCII characters before rejecting the name
    return strikes <= 3
    
print(english_checker2('Docs To Go™ Free Office Suite'))
print(english_checker2('Instachat 😜'))
print(english_checker2('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
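
One consequence of the three-strikes rule is worth checking: a non-English name made of three or fewer characters also passes. The sketch below restates the rule with the threshold as a parameter so this can be seen directly; is_probably_english and max_non_ascii are hypothetical names, not part of the notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if at most max_non_ascii characters fall outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))  # one non-ASCII character
print(is_probably_english('爱奇艺'))  # three non-ASCII characters, still accepted
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # well over the limit
```

Lowering max_non_ascii rejects more foreign-language names at the cost of dropping more English names that contain symbols or emojis, so the threshold is a trade-off rather than a correct value.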


Now this function will be used to create English only lists from the original apple dataset and cleaned google dataset:

In [209]:
english_google = []
english_apple = []

for app in cleaned_googleplay_list:
    name = app[0]
    if english_checker2(name):
        english_google.append(app)

for app in apple_list[1:]:
    name = app[2]
    if english_checker2(name):
        english_apple.append(app)
    
print('English apps in cleaned google dataset: ' + str(len(english_google)))
print('\n')
print('English apps in apple dataset: ' + str(len(english_apple)))

English apps in cleaned google dataset: 9614


English apps in apple dataset: 6183


Now that the data represents unique English-language apps, the next step is to isolate only the free apps. As the header rows below show, the Google dataset's price is at index 7 and the Apple dataset's price is at index 5.

In [210]:
print(googleplay_list[0])
print(apple_list[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
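
Hard-coded positions are easy to get wrong, so one option is to look the index up from the header row itself. A minimal sketch, using the two header rows printed above and list.index from the standard library; the variable names are illustrative.

```python
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']
apple_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# list.index returns the position of the first matching column name
google_price_index = google_header.index('Price')
apple_price_index = apple_header.index('price')

print(google_price_index, apple_price_index)
```

In the notebook the headers are available as googleplay_list[0] and apple_list[0], so the same lookup works without retyping them.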


In [213]:
free_google = []
free_apple = []

for app in english_google:
    price = app[7]
    
    if price == '0':
        free_google.append(app)


for app in english_apple:
    price = float(app[5])
    
    if price == 0:
        free_apple.append(app)
        
print('Number of free, English apps from google: ' + str(len(free_google)))
print('\n')
print('Number of free, English apps from apple: ' + str(len(free_apple)))

        


Number of free, English apps from google: 8864


Number of free, English apps from apple: 3222


## Analysis

At this stage the data is clean, and we can look at the two datasets together to decide which columns matter most.  Ideally the same kind of information should be available in both datasets.

Therefore we will look at the prime_genre column from the Apple Store dataset and the Genres and Category columns from the Google Play dataset.

First, we need a function that builds a frequency table for whichever column we want to look at:


In [223]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    
    return table_percentages
    

Now that we have a function that generates frequency tables as percentages, we can write another function that displays those percentages in descending order:

In [224]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
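
The tuple swap in display_table exists only so that sorted orders the entries by percentage. As an alternative sketch, the same table can be built with collections.Counter and sorted with a key function; sorted_percentages is a hypothetical helper and the rows are made up.

```python
from collections import Counter


def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs, largest percentage first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    pairs = [(value, count / total * 100) for value, count in counts.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up rows with the genre in column 0
sample = [['Games'], ['Games'], ['Games'], ['Weather']]
for genre, percentage in sorted_percentages(sample, 0):
    print(genre, ':', round(percentage, 2))
```

Values with equal percentages keep their first-seen order here, whereas the tuple-based version breaks ties by reverse alphabetical order, so the two displays can differ slightly for tied entries.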

For the google dataset, category is at index 1 and genres is at index 9.  For the apple dataset prime_genre is at index 12.

In [230]:
display_table(free_apple, 12) #prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Based on the results above, games are the most common genre by a landslide, at about 58% of the free English apps.  Entertainment follows at about 8%, and probably serves a similar purpose to games in occupying users' free time.

Other than games, most genres have similar, small shares, so none of them is crowded to anything like the degree that games are.  For instance, how many competing weather apps can there possibly be in such a niche category?

These frequencies also don't take the number of users into account.  There may be more games on the App Store, but social media apps like Facebook may have far more users, and that kind of success cannot be seen through this analysis alone.

In [232]:
display_table(free_google, 1) #category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

For the Category frequencies from Google Play we see something very different.  Family is the most frequent category (about 19%), followed by games (about 10%).  This might mean that games are less popular on Android, or that they are more difficult to make for Android than for iOS.  These categories are also far more evenly distributed than the Apple Store genres.

In [229]:
display_table(free_google, 9) #genre

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The Genres frequencies from the Google dataset split the games category into subtypes such as Racing.  The distribution is still far more balanced than the Apple Store frequencies.  On its own, though, it is not useful for determining which apps are the most popular, since it still lacks information about the number of users and ratings.


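To strengthen the analysis, we should focus on how many users each category has rather than how many apps it contains.  The Apple dataset has no install counts, so below we use the total number of user ratings (rating_count_tot, index 6) as a proxy for the number of users.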


In [240]:
#print(freq_table(free_apple, 12))

apple_freqs = freq_table(free_apple, 12)

for genre in apple_freqs:
    total = 0    # total count of user ratings
    len_genre = 0  # number of apps in each genre
    
    for app in free_apple:
        genre_app = app[12]
        if genre_app == genre:
            ratings = float(app[6])
            total += ratings
            len_genre += 1
        
    avg_users = total / len_genre

    print('App Genre: ' + str(genre))
    print('Average Users: ' + str(avg_users))
    print('\n')
  

App Genre: Productivity
Average Users: 21028.410714285714


App Genre: Weather
Average Users: 52279.892857142855


App Genre: Shopping
Average Users: 26919.690476190477


App Genre: Reference
Average Users: 74942.11111111111


App Genre: Finance
Average Users: 31467.944444444445


App Genre: Music
Average Users: 57326.530303030304


App Genre: Utilities
Average Users: 18684.456790123455


App Genre: Travel
Average Users: 28243.8


App Genre: Social Networking
Average Users: 71548.34905660378


App Genre: Sports
Average Users: 23008.898550724636


App Genre: Health & Fitness
Average Users: 23298.015384615384


App Genre: Games
Average Users: 22788.6696905016


App Genre: Food & Drink
Average Users: 33333.92307692308


App Genre: News
Average Users: 21248.023255813954


App Genre: Book
Average Users: 39758.5


App Genre: Photo & Video
Average Users: 28441.54375


App Genre: Entertainment
Average Users: 14029.830708661417


App Genre: Business
Average Users: 7491.117647058823


App Genre:
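
The nested loops above rescan free_apple once per genre and print the results in dictionary order. A single-pass sketch that accumulates totals and counts per genre and then sorts the averages is shown below; average_by_group is a hypothetical helper and the three sample rows are invented, with the genre in column 0 and the rating count in column 1.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return (group, average) pairs sorted from highest to lowest average."""
    totals = defaultdict(float)  # group -> sum of values
    counts = defaultdict(int)    # group -> number of rows
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = [(group, totals[group] / counts[group]) for group in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


sample = [['Games', '100'], ['Games', '300'], ['Weather', '500']]
print(average_by_group(sample, 0, 1))
```

For the notebook's data the equivalent call would be average_by_group(free_apple, 12, 6), which puts the genres with the most ratings per app at the top of the list.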

Based on this data we can see which genres actually have the most users per app.  It looks as though social networking and navigation have the most users per app, with reference apps also near the top (about 75,000 ratings per app).  This likely means that a few outstanding apps in these genres, such as Facebook or Google Maps, are used almost universally and pull the averages up.

For a company deciding how to invest its resources in a successful app for the **Apple Store**, it may want to take a shotgun approach when creating games, but in other genres spend a long time refining a single app that can hopefully break into niche categories dominated by long-standing popular apps such as Instagram.

Now to look into the user data for the google play store:


In [242]:
display_table(free_google, 5) #installs

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


This generalized data will be fine for this analysis but the extra '+' and ',' characters will need to be removed to convert the values to floats for further calculations.
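
As a small sketch of that conversion, the cleanup can be wrapped in a helper; parse_installs is a hypothetical name. Note that the result is a lower bound, since a value such as '1,000,000+' only says the app has at least that many installs.

```python
def parse_installs(installs):
    """Convert an installs string such as '1,000,000+' to a float."""
    cleaned = installs.replace('+', '').replace(',', '')
    return float(cleaned)


print(parse_installs('1,000,000+'))
print(parse_installs('0'))
```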

In [245]:
google_freqs = freq_table(free_google, 1) #category

for category in google_freqs:
    total = 0  #total installs per category
    len_category = 0 #total apps in each category
    
    for app in free_google:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+', '')
            installs = float(installs.replace(',', ''))
            total += installs
            len_category += 1
            
    avg_installs = total / len_category
    
    print('App Genre: ' + str(category))
    print('Average Number of Installs: ' + str(avg_installs))
    print('\n')
    


App Genre: ART_AND_DESIGN
Average Number of Installs: 1986335.0877192982


App Genre: AUTO_AND_VEHICLES
Average Number of Installs: 647317.8170731707


App Genre: BEAUTY
Average Number of Installs: 513151.88679245283


App Genre: BOOKS_AND_REFERENCE
Average Number of Installs: 8767811.894736841


App Genre: BUSINESS
Average Number of Installs: 1712290.1474201474


App Genre: COMICS
Average Number of Installs: 817657.2727272727


App Genre: COMMUNICATION
Average Number of Installs: 38456119.167247385


App Genre: DATING
Average Number of Installs: 854028.8303030303


App Genre: EDUCATION
Average Number of Installs: 1833495.145631068


App Genre: ENTERTAINMENT
Average Number of Installs: 11640705.88235294


App Genre: EVENTS
Average Number of Installs: 253542.22222222222


App Genre: FINANCE
Average Number of Installs: 1387692.475609756


App Genre: FOOD_AND_DRINK
Average Number of Installs: 1924897.7363636363


App Genre: HEALTH_AND_FITNESS
Average Number of Installs: 4188821.9853479853

From these results we see that communication has the highest average number of installs by a large margin.  This is likely due to messaging apps, including ones that come preinstalled on phones.  Social is the next largest, again likely due to giants like Facebook and LinkedIn.
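
One way to test the suspicion that a few giants drive these averages is to compare each category's mean with its median, which is much less sensitive to extreme values. The sketch below uses statistics.mean and statistics.median from the standard library; installs_summary is a hypothetical helper and the sample rows are invented, not taken from the dataset.

```python
from statistics import mean, median


def installs_summary(dataset, category, category_index=1, installs_index=5):
    """Return (mean, median) installs for the apps in one category."""
    values = [
        float(row[installs_index].replace('+', '').replace(',', ''))
        for row in dataset
        if row[category_index] == category
    ]
    return mean(values), median(values)


# Invented rows laid out like the Google Play data (category at 1, installs at 5)
sample = [
    ['Giant Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat A', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Small Chat B', 'COMMUNICATION', '', '', '', '5,000+'],
]
print(installs_summary(sample, 'COMMUNICATION'))
```

A mean far above the median, as in this invented sample, would suggest that a handful of very large apps account for most of a category's installs. The helper raises statistics.StatisticsError if the category has no rows.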

Combining the data from the Apple Store and the Google Play store gives a more complete picture of which apps may be successful across both platforms.  

Though at first games may have seemed like the most viable option, they are not as popular on the Google Play store.  It would be prudent to include at least some social media or communication aspect in whichever type of app is built, so that users have a means of spreading information about the app and a reason to keep checking and using it.  

Maps and navigation has many users/installs but is likely a difficult category to break into.  One way in may be a novel idea that combines navigation with social media on a flexible platform that allows for videos and pictures:

- update your friends live every 100 miles of your journey with a short video
- a background app that profiles your driving by tracking your speed, length of drives, and locations visited
- an app that is a front-facing GPS but also acts as a dash cam, storing the most recent 5-10 minutes of driving (could be used for legal purposes or as a platform to share unique dash cam events)