Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources with collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, these are two data sets that seem suitable for our purpose:

    A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.
    A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [2]:
from csv import reader

open_file = open('/home/sewoong/DataQuest/Project_1/googleplaystore.csv')
read_file = reader(open_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

open_file = open('/home/sewoong/DataQuest/Project_1/AppleStore.csv')
read_file = reader(open_file)
apple = list(read_file)
apple_header = apple[0]
apple = apple[1:]

data cleaning

we only build apps that are free to download and install, and that are directed toward an English-speaking audience. This means that we'll need to:

    Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
    Remove apps that aren't free.

The Google Play data set has a dedicated discussion section, and we can see that one of the discussions describes an error for a certain row: 10472

In [3]:
del android[10472]

delete dupicate dataset

In [4]:
latest_apps = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in latest_apps and n_reviews > latest_apps[name]:
        latest_apps[name] = n_reviews
    elif name not in latest_apps:
        latest_apps[name] = n_reviews
        
print('Expected length: ', len(android) - 1181)
print('Actual length: ', len(latest_apps))

Expected length:  9659
Actual length:  9659


In [8]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (latest_apps[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [9]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9659
Number of columns:  13


Removing Non-English Apps

In [25]:
def defineEnglish(name):
    tempCount = 0
    for character in name:
        if ord(character) > 127:
            tempCount += 1
     
    if tempCount > 3:
        return False
    else:
        return True

In [28]:
android_english = []
apple_english = []

for app in android_clean:
    name = app[0]
    if defineEnglish(name) == True:
        android_english.append(app)
        
for app in apple:
    name = app[0]
    if defineEnglish(name) == True:
        apple_english.append(app)
        
explore_data(android_english, 0, 3, True)
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

In [30]:
android_free_app = []
apple_free_app = []

for value in android_english:
    if value[7] == '0':
        android_free_app.append(value)

for value in apple_english:
    if value[4] == '0.0':
        apple_free_app.append(value)

print(len(android_free_app))
print(len(apple_free_app))

8864
4056


In [31]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    percentageTable = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        percentageTable[key] = percentage
        
    return percentageTable

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        temp = (table[key], key)
        table_display.append(temp)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ': ', entry[0])

In [33]:
display_table(android_free_app, 1)

FAMILY :  18.907942238267147
GAME :  9.724729241877256
TOOLS :  8.461191335740072
BUSINESS :  4.591606498194946
LIFESTYLE :  3.9034296028880866
PRODUCTIVITY :  3.892148014440433
FINANCE :  3.7003610108303246
MEDICAL :  3.531137184115524
SPORTS :  3.395758122743682
PERSONALIZATION :  3.3167870036101084
COMMUNICATION :  3.2378158844765346
HEALTH_AND_FITNESS :  3.0798736462093865
PHOTOGRAPHY :  2.944494584837545
NEWS_AND_MAGAZINES :  2.7978339350180503
SOCIAL :  2.6624548736462095
TRAVEL_AND_LOCAL :  2.33528880866426
SHOPPING :  2.2450361010830324
BOOKS_AND_REFERENCE :  2.1435018050541514
DATING :  1.861462093862816
VIDEO_PLAYERS :  1.7937725631768955
MAPS_AND_NAVIGATION :  1.3989169675090252
FOOD_AND_DRINK :  1.2409747292418771
EDUCATION :  1.1620036101083033
ENTERTAINMENT :  0.9589350180505415
LIBRARIES_AND_DEMO :  0.9363718411552346
AUTO_AND_VEHICLES :  0.9250902527075812
HOUSE_AND_HOME :  0.8235559566787004
WEATHER :  0.8009927797833934
EVENTS :  0.7107400722021661
PARENTING :  0.6543

Most Popular Apps by Genre on the App Store