# Profitable App Profiles for the App Store and Google Play Markets

This project was requested by a company that builds Android and iOS mobile apps that are free to download and install. The company's business strategy is to make money solely through in-app ads, which means that revenue is mostly influenced by the number of users who use the app.

The Google Play dataset is found [here](https://www.kaggle.com/lava18/google-play-store-apps).
The App Store dataset is found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv).

The goal of this project is to analyze the data to help the developers understand what types of apps are likely to attract more users.

## Opening and Exploring the Data

First, we define a helper function that prints a slice of an opened dataset and, optionally, its dimensions.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
from csv import reader

with open('AppleStore.csv', encoding='utf8') as f:  # App names contain non-ASCII characters
    csv_reader = reader(f)
    a_lines = list(csv_reader)

# Print each column name next to its index
print([str(index) + ': ' + name for index, name in enumerate(a_lines[0])])
print('\n')
    
explore_data(a_lines, 1, 6, True)

['0: id', '1: track_name', '2: size_bytes', '3: currency', '4: price', '5: rating_count_tot', '6: rating_count_ver', '7: user_rating', '8: user_rating_ver', '9: ver', '10: cont_rating', '11: prime_genre', '12: sup_devices.num', '13: ipadSc_urls.num', '14: lang.num', '15: vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Numb

For the App Store, the columns of interest are 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. For context, the documentation for the columns is found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
with open('googleplaystore.csv', encoding='utf8') as f:  # App names contain non-ASCII characters
    csv_reader = reader(f)
    g_lines = list(csv_reader)

# Print each column name next to its index
print([str(index) + ': ' + name for index, name in enumerate(g_lines[0])])
print('\n')
    
explore_data(g_lines, 1, 6, True)

['0: App', '1: Category', '2: Rating', '3: Reviews', '4: Size', '5: Installs', '6: Type', '7: Price', '8: Content Rating', '9: Genres', '10: Last Updated', '11: Current Ver', '12: Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '9

For Google Play, the columns of interest are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'. For context, the documentation for the columns is found [here](https://www.kaggle.com/lava18/google-play-store-apps).

## Cleaning Data
Next, we clean up anomalies in the datasets.

### Removing Inaccurate Data:

One instance of inaccurate data is noted in the Google Play dataset's [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). The check below finds it by looking for rows whose length differs from the header's.

In [5]:
header_length = len(g_lines[0])
for row in g_lines:
    if len(row) != header_length:
        print(row)
        print(g_lines.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


In [6]:
# Remove the malformed row reported by the length check above
del g_lines[10473]
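The length check above can also be sketched with `enumerate`, which yields each row's position directly instead of looking it up with `list.index` (which returns the first *equal* row and could report the wrong position for a duplicated row). This is a minimal sketch on made-up rows; `find_malformed_rows` and `sample` are hypothetical names, not part of the original notebook.

```python
def find_malformed_rows(rows):
    """Return (index, row) pairs whose length differs from the header row's."""
    expected = len(rows[0])
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Made-up rows standing in for g_lines (assumption, for illustration only)
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],          # one value is missing, so the row is short
    ['Good App', 'TOOLS', '4.1'],
]

malformed = find_malformed_rows(sample)
print(malformed)

# Delete from the end so earlier indices stay valid while removing rows
for index, _ in reversed(malformed):
    del sample[index]
print(len(sample))
```

Deleting in reverse order only matters when more than one malformed row is found, but it keeps the approach safe in that case.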

### Removing Duplicates:

Another thing to check for is duplicate entries, which can be identified primarily by app name. 'Instagram' is one example, as shown below.

In [7]:
for row in g_lines:
    name = row[0]
    if name == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


With duplicates confirmed, we next check how extensive the problem is by counting them:

In [8]:
apps = {}
duplicate_count = 0
review_row = 3

for app in g_lines:
    name = app[0]
    if name in apps:
        # Keep the highest review count (app[3] is Reviews); compare as integers, not strings
        if int(apps[name]) < int(app[review_row]):
            apps[name] = app[review_row]
        duplicate_count += 1
    else:
        apps[name] = app[review_row]
        
print('The number of duplicates in the Google Play Store dataset is ' + str(duplicate_count) + '.')

The number of duplicates in the Google Play Store dataset is 1181.


The same duplicate analysis was performed on the App Store dataset and found no duplicates. To deal with the duplicates in the Google Play dataset, we use the number of reviews as a proxy for when a row was entered: the row with the highest number of reviews is taken to be the most recent entry.

In [9]:
cleaned_apps = []

for app in g_lines:  # To select non-duplicates
    name = app[0]
    if apps[name] == -1:  # App is used
        continue
    elif apps[name] == app[review_row]:  # Match on review count
        cleaned_apps.append(app)
        apps[name] = -1

explore_data(cleaned_apps, 1, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9660
Number of columns: 13


This is the expected number of rows (including the header row): the row count before deduplication minus the 1,181 duplicates.
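The "keep the row with the most reviews" rule can be sketched in a single pass. Review counts come out of the CSV as strings, and string comparison orders `'9'` above `'10'`, so the sketch converts to `int` before comparing. The function name, the two-column layout, and the sample rows are assumptions made for illustration.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=1):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = int(row[reviews_idx])  # numeric, not lexicographic, comparison
        if name not in best or reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Made-up rows (assumption): [name, reviews]
sample = [
    ['Chat', '9'],
    ['Chat', '10'],
    ['Notes', '5'],
]

deduped = dedupe_by_max_reviews(sample)
print(deduped)

# Why the int() conversion matters: as strings, '9' sorts after '10'
print('9' < '10')
print(9 < 10)
```

When two duplicate rows share the same highest review count, this sketch keeps the first one it encounters.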

### Removing Non-English Apps:

Aside from duplicates, we should remove non-English apps, since the company's apps are built for an English-speaking audience. First, we write a function to detect them: a name is treated as non-English if it contains more than three non-ASCII characters.

In [10]:
def is_english(string):
    non_ascii = 0
    for letter in string:
        if ord(letter) > 127:  # Highest ascii number
            non_ascii += 1
    if non_ascii > 3:  # In case of emoji or symbol
        return False
    else:
        return True

# To Test
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
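The three-character threshold is a heuristic, and a small sketch can make its trade-off visible. The version below exposes the threshold as a parameter and prints the non-ASCII count for each name. `count_non_ascii` and `is_english_sketch` are hypothetical helpers, and the sample names are chosen only for illustration.

```python
def count_non_ascii(text):
    """Count characters outside the 7-bit ASCII range (code points above 127)."""
    return sum(1 for ch in text if ord(ch) > 127)

def is_english_sketch(text, max_non_ascii=3):
    """Hypothetical variant of is_english with the threshold as a parameter."""
    return count_non_ascii(text) <= max_non_ascii

# Illustrative names (assumption); the last one is a short non-English name
for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜', '爱奇艺']:
    print(name, count_non_ascii(name), is_english_sketch(name))
```

The last name shows the known limitation: a non-English name with three or fewer non-ASCII characters still passes, so a few such apps can remain in the cleaned data.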


Now we will clean our data sets.

In [11]:
google_apps = []
apple_apps = []

for app in cleaned_apps:  # Cleaned Google Apps
    if is_english(app[0]):
        google_apps.append(app)

for app in a_lines:  # Apple Apps
    if is_english(app[1]):
        apple_apps.append(app)
        
explore_data(google_apps, 1, 2, True)
explore_data(apple_apps, 1, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9615
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 6184
Number of columns: 16


### Isolating Free Apps:

Another aspect to look at is the price of the app. Since the company only builds apps that are free to download and install, the data set will have to be cleaned to remove non-free apps.

In [12]:
free_google_apps = []
free_apple_apps = []
google_cost_idx = 7
apple_cost_idx = 4

for app in google_apps:
    if app[google_cost_idx] == '0':
        free_google_apps.append(app)

for app in apple_apps:
    if app[apple_cost_idx] == '0.0':
        free_apple_apps.append(app)
        
explore_data(free_google_apps, 0, 1, True)
explore_data(free_apple_apps, 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 8862
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 3222
Number of columns: 16
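The filter above matches the exact strings `'0'` and `'0.0'`, which works for these two files but depends on how each store formats a zero price. A more tolerant check parses the price as a number first. This is a sketch only: `is_free` is a hypothetical helper, and the `'$4.99'` format for paid apps is an assumption about the data.

```python
def is_free(price_text):
    """Return True when a price string parses to zero; False otherwise."""
    try:
        # Strip whitespace and an optional leading dollar sign before parsing
        return float(price_text.strip().lstrip('$')) == 0.0
    except ValueError:
        # Non-numeric text, such as a header cell, is treated as not free
        return False

for text in ['0', '0.0', '0.00', '$4.99', 'Price']:
    print(repr(text), is_free(text))
```

Returning `False` for unparseable text also drops the header row, which matches the behavior of the string comparison above.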


## Analyzing the Data

Since the end goal is to add an app on both the Google Play and App Store, we should find an app profile that is successful on both markets.

The validation strategy will be the following:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Android comes first because of its larger audience and therefore greater potential for profit, since revenue depends heavily on the number of people using the app.

### Generating a Frequency Table:

First, we use a frequency table to see which genres are most common. The relevant columns are *prime_genre* for the App Store and *Category* and *Genres* for Google Play.

In [13]:
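apple_genre_idx = 11      # 'prime_genre'
google_genre_idx = 1      # Note: column 1 is 'Category', despite the variable name
google_category_idx = 9   # Note: column 9 is 'Genres', despite the variable name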

def freq_table(dataset, index):
    frequencies = {}
    total = len(dataset)
    
    for row in dataset:
        value = row[index]
        if value in frequencies:
            frequencies[value] += 1
        else:
            frequencies[value] = 1
    
    table_percentages = {}
    for key in frequencies:
        percentage = (frequencies[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages 

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1]+': '+'{:.2f}'.format(entry[0]))

In [14]:
display_table(free_apple_apps, apple_genre_idx)

Games: 58.16
Entertainment: 7.88
Photo & Video: 4.97
Education: 3.66
Social Networking: 3.29
Shopping: 2.61
Utilities: 2.51
Sports: 2.14
Music: 2.05
Health & Fitness: 2.02
Productivity: 1.74
Lifestyle: 1.58
News: 1.33
Travel: 1.24
Finance: 1.12
Weather: 0.87
Food & Drink: 0.81
Reference: 0.56
Business: 0.53
Book: 0.43
Navigation: 0.19
Medical: 0.19
Catalogs: 0.12


The most common App Store genre is Games at 58%, with Entertainment the runner-up at 8%.

The general impression is that apps built for fun (games, entertainment, photo and video) far outnumber practical apps in the App Store.

However, a large number of games does not mean a large number of users per game; it only shows that many games have been published.

In [15]:
display_table(free_google_apps, google_genre_idx)

FAMILY: 18.93
GAME: 9.69
TOOLS: 8.45
BUSINESS: 4.59
LIFESTYLE: 3.90
PRODUCTIVITY: 3.89
FINANCE: 3.70
MEDICAL: 3.52
SPORTS: 3.40
PERSONALIZATION: 3.32
COMMUNICATION: 3.24
HEALTH_AND_FITNESS: 3.08
PHOTOGRAPHY: 2.95
NEWS_AND_MAGAZINES: 2.80
SOCIAL: 2.66
TRAVEL_AND_LOCAL: 2.34
SHOPPING: 2.25
BOOKS_AND_REFERENCE: 2.14
DATING: 1.86
VIDEO_PLAYERS: 1.79
MAPS_AND_NAVIGATION: 1.40
FOOD_AND_DRINK: 1.24
EDUCATION: 1.17
ENTERTAINMENT: 0.96
LIBRARIES_AND_DEMO: 0.94
AUTO_AND_VEHICLES: 0.93
HOUSE_AND_HOME: 0.82
WEATHER: 0.80
EVENTS: 0.71
PARENTING: 0.65
ART_AND_DESIGN: 0.64
COMICS: 0.62
BEAUTY: 0.60


The most common Google Play category (the *Category* column) is FAMILY at 19%, with GAME the runner-up at 10%.

No single category dominates Google Play the way Games dominates the App Store.

In [16]:
display_table(free_google_apps, google_category_idx)

Tools: 8.44
Entertainment: 6.07
Education: 5.35
Business: 4.59
Productivity: 3.89
Lifestyle: 3.89
Finance: 3.70
Medical: 3.52
Sports: 3.46
Personalization: 3.32
Communication: 3.24
Action: 3.10
Health & Fitness: 3.08
Photography: 2.95
News & Magazines: 2.80
Social: 2.66
Travel & Local: 2.32
Shopping: 2.25
Books & Reference: 2.14
Simulation: 2.04
Dating: 1.86
Arcade: 1.85
Video Players & Editors: 1.77
Casual: 1.75
Maps & Navigation: 1.40
Food & Drink: 1.24
Puzzle: 1.13
Racing: 0.99
Role Playing: 0.94
Libraries & Demo: 0.94
Auto & Vehicles: 0.93
Strategy: 0.91
House & Home: 0.82
Weather: 0.80
Events: 0.71
Adventure: 0.68
Comics: 0.61
Beauty: 0.60
Art & Design: 0.60
Parenting: 0.50
Card: 0.44
Casino: 0.43
Trivia: 0.42
Educational;Education: 0.39
Educational: 0.37
Board: 0.37
Education;Education: 0.34
Word: 0.26
Casual;Pretend Play: 0.24
Music: 0.20
Puzzle;Brain Games: 0.18
Racing;Action & Adventure: 0.17
Entertainment;Music & Video: 0.17
Casual;Brain Games: 0.14
Casual;Action & Adventure:

In the *Genres* table for Google Play, practical genres such as Tools, Education, and Business are prominent near the top, though none dominates the way Games does in the App Store.

The two stores' genre labels are not directly equivalent, and the number of apps in a genre says nothing about how many users those apps attract, so an app profile cannot be recommended from these counts alone.

### Finding the Genre with the Most Users per App:

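Since the number of apps per genre does not capture the number of users for that genre, we will look at the average number of users per app for each genre. Google Play reports this through the *Installs* column; the App Store dataset has no install counts, so the total number of ratings (*rating_count_tot*) serves as a proxy.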

In [17]:
users_idx = 5

def avg_users(dataset, genre_index, users_index, google=False):
    user_count = {}
    app_frequency = {}
    
    for row in dataset:
        genre = row[genre_index]
        if google:
            users = int(row[users_index].replace(',','').replace('+',''))
        else:
            users = int(row[users_index])
            
        if genre in user_count:
            user_count[genre] += users
            app_frequency[genre] += 1
        else:
            user_count[genre] = users
            app_frequency[genre] = 1
    
    average_users = {}
    for key in user_count:
        # Mean users per app in this genre (a plain average, not a percentage)
        average_users[key] = user_count[key] / app_frequency[key]

    return average_users

def display_table_avg_users(dataset, genre_index, users_index, google=False):
    table = avg_users(dataset, genre_index, users_index, google)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1]+': '+'{:.0f}'.format(entry[0]))

In [18]:
display_table_avg_users(free_apple_apps, apple_genre_idx, users_idx)

Navigation: 86090
Reference: 74942
Social Networking: 71548
Music: 57327
Weather: 52280
Book: 39758
Food & Drink: 33334
Finance: 31468
Photo & Video: 28442
Travel: 28244
Shopping: 26920
Health & Fitness: 23298
Sports: 23009
Games: 22789
News: 21248
Productivity: 21028
Utilities: 18684
Lifestyle: 16486
Entertainment: 14030
Business: 7491
Education: 7004
Catalogs: 4004
Medical: 612


In [19]:
display_table_avg_users(free_google_apps, google_category_idx, users_idx, google=True)

Communication: 38456119
Adventure;Action & Adventure: 35333333
Video Players & Editors: 24947336
Social: 23253652
Arcade: 22888365
Casual: 19630959
Puzzle;Action & Adventure: 18366667
Photography: 17805628
Educational;Action & Adventure: 17016667
Productivity: 16787331
Racing: 15910646
Travel & Local: 14051476
Casual;Action & Adventure: 12916667
Action: 12603589
Strategy: 11199903
Tools: 10683213
Tools;Education: 10000000
Role Playing;Brain Games: 10000000
Lifestyle;Pretend Play: 10000000
Casual;Music & Video: 10000000
Card;Action & Adventure: 10000000
Adventure;Education: 10000000
News & Magazines: 9549178
Music: 9445583
Educational;Pretend Play: 9375000
Word: 9094459
Puzzle;Brain Games: 9013125
Racing;Action & Adventure: 8816667
Books & Reference: 8767812
Puzzle: 8302862
Video Players & Editors;Music & Video: 7500000
Shopping: 7036877
Role Playing;Action & Adventure: 7000000
Casual;Pretend Play: 6957143
Entertainment

In [20]:
display_table_avg_users(free_google_apps, google_genre_idx, users_idx, google=True)

COMMUNICATION: 38456119
VIDEO_PLAYERS: 24727872
SOCIAL: 23253652
PHOTOGRAPHY: 17805628
PRODUCTIVITY: 16787331
GAME: 15560966
TRAVEL_AND_LOCAL: 13984078
ENTERTAINMENT: 11640706
TOOLS: 10682301
NEWS_AND_MAGAZINES: 9549178
BOOKS_AND_REFERENCE: 8767812
SHOPPING: 7036877
PERSONALIZATION: 5201483
WEATHER: 5074486
HEALTH_AND_FITNESS: 4188822
MAPS_AND_NAVIGATION: 4056942
FAMILY: 3694276
SPORTS: 3638640
ART_AND_DESIGN: 1986335
FOOD_AND_DRINK: 1924898
EDUCATION: 1820673
BUSINESS: 1712290
LIFESTYLE: 1437816
FINANCE: 1387692
HOUSE_AND_HOME: 1331541
DATING: 854029
COMICS: 817657
AUTO_AND_VEHICLES: 647318
LIBRARIES_AND_DEMO: 638504
PARENTING: 542604
BEAUTY: 513152
EVENTS: 253542
MEDICAL: 120616


Social and communication apps have some of the highest average user counts on both stores. However, these averages may be skewed by the massive download numbers of a few very popular apps. A closer look at the distribution within each category would indicate which of the top genres above would be easier to break into.
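One simple way to check for that kind of skew is to compare each category's mean with its median: a mean far above the median suggests that a few giant apps are pulling the average up. The sketch below uses the standard-library `statistics` module on made-up rows; the helper names, app names, and install figures are assumptions for illustration only.

```python
from statistics import mean, median

def parse_installs(text):
    """Turn an installs string such as '10,000+' into an integer."""
    return int(text.replace(',', '').replace('+', ''))

def mean_and_median_installs(rows, genre_idx, installs_idx):
    """Map each genre to a (mean, median) pair of installs per app."""
    by_genre = {}
    for row in rows:
        by_genre.setdefault(row[genre_idx], []).append(parse_installs(row[installs_idx]))
    return {genre: (mean(values), median(values)) for genre, values in by_genre.items()}

# Made-up rows (assumption): [name, category, installs]
sample = [
    ['Giant Messenger', 'COMMUNICATION', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '10,000+'],
    ['Tiny Chat', 'COMMUNICATION', '1,000+'],
    ['Reader A', 'BOOKS', '1,000,000+'],
    ['Reader B', 'BOOKS', '500,000+'],
]

summary = mean_and_median_installs(sample, 1, 2)
for genre, (avg, mid) in summary.items():
    print(genre + ': mean ' + '{:.0f}'.format(avg) + ', median ' + '{:.0f}'.format(mid))
```

In the made-up COMMUNICATION rows, one app accounts for nearly all installs, so the mean is in the hundreds of millions while the median is only 10,000.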

### Looking at Favored Genres

Since making revenue via in-app purchases and subscriptions is also a possibility, it is worth exploring which genres users rate most highly.

In [21]:
apple_rating_idx = 7
google_rating_idx = 2

def avg_rating_per_genre(dataset, genre_index, rating_index):
    rating_count = {}
    app_frequency = {}
    
    for row in dataset:
        genre = row[genre_index]
        ratings = float(row[rating_index])
            
        if genre in rating_count:
            rating_count[genre] += ratings
            app_frequency[genre] += 1
        else:
            rating_count[genre] = ratings
            app_frequency[genre] = 1
            
    average_ratings = {}
    for key in rating_count:
        # Mean rating across the apps in this genre
        average_ratings[key] = rating_count[key] / app_frequency[key]

    return average_ratings

def display_table_avg_ratings(dataset, genre_index, rating_index):
    table = avg_rating_per_genre(dataset, genre_index, rating_index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1]+': '+'{:.2f}'.format(entry[0]))

In [22]:
display_table_avg_ratings(free_apple_apps, apple_genre_idx, apple_rating_idx)

Catalogs: 4.12
Games: 4.04
Productivity: 4.00
Business: 3.97
Shopping: 3.97
Music: 3.95
Photo & Video: 3.90
Navigation: 3.83
Health & Fitness: 3.77
Reference: 3.67
Education: 3.64
Food & Drink: 3.63
Social Networking: 3.59
Entertainment: 3.54
Utilities: 3.53
Travel: 3.49
Weather: 3.48
Lifestyle: 3.41
Finance: 3.38
News: 3.24
Book: 3.07
Sports: 3.07
Medical: 3.00


Some free Google apps do not have ratings, so those will have to be cleaned out prior to analysis.

In [23]:
free_cleaned_google_apps = []

for app in free_google_apps:
    if app[google_rating_idx] == 'NaN':
        continue
    else:
        free_cleaned_google_apps.append(app)

print(len(free_cleaned_google_apps))  # The remaining apps cleaned

7564
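The filtering step matters because `float('NaN')` does not raise an error: it parses to a not-a-number value that silently turns any sum it joins into NaN. A more general check, sketched here with a hypothetical `has_rating` helper, parses the text and tests the result with `math.isnan`, which also covers other spellings such as `'nan'` and empty fields.

```python
import math

def has_rating(text):
    """Return True when the text parses to a real (non-NaN) number."""
    try:
        return not math.isnan(float(text))
    except ValueError:
        # Empty or non-numeric text has no usable rating
        return False

for text in ['4.1', 'NaN', 'nan', '', 'Rating']:
    print(repr(text), has_rating(text))

# A single NaN poisons an average if it is not filtered out first
print(math.isnan(sum([4.0, 4.5, float('NaN')])))
```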


In [24]:
display_table_avg_ratings(free_cleaned_google_apps, google_genre_idx, google_rating_idx)

EVENTS: 4.44
BOOKS_AND_REFERENCE: 4.35
EDUCATION: 4.34
PARENTING: 4.34
ART_AND_DESIGN: 4.34
PERSONALIZATION: 4.30
BEAUTY: 4.28
SOCIAL: 4.25
HEALTH_AND_FITNESS: 4.24
GAME: 4.23
WEATHER: 4.23
SHOPPING: 4.23
SPORTS: 4.21
AUTO_AND_VEHICLES: 4.18
PRODUCTIVITY: 4.18
LIBRARIES_AND_DEMO: 4.18
COMICS: 4.18
FAMILY: 4.17
FOOD_AND_DRINK: 4.17
PHOTOGRAPHY: 4.17
MEDICAL: 4.15
HOUSE_AND_HOME: 4.14
FINANCE: 4.13
COMMUNICATION: 4.13
ENTERTAINMENT: 4.12
NEWS_AND_MAGAZINES: 4.10
BUSINESS: 4.10
LIFESTYLE: 4.08
TRAVEL_AND_LOCAL: 4.07
VIDEO_PLAYERS: 4.04
MAPS_AND_NAVIGATION: 4.04
TOOLS: 4.03
DATING: 3.98


Looking at the top ten of both sets, games appear among the highest-rated genres on both stores (second on the App Store and tenth on Google Play). Since free games commonly offer in-app purchases, a game that breaks into a niche and has a good UX and UI could earn revenue both from advertisements and from in-app purchases or subscriptions that remove those advertisements.

## Conclusion

While certain genres in each store look promising for development, further analysis is needed to determine the statistical distribution of users within those genres. Comparing that analysis with a further analysis of ratings can lead to a more targeted recommendation of the space in which to begin planning and developing an application.