# Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

In [1]:
from csv import reader

### The Google Play data set ###
# An explicit encoding avoids the UnicodeDecodeError ('charmap' codec can't
# decode byte 0x90) raised when the platform default codec is not UTF-8.
# This assumes both CSV files are UTF-8 encoded.
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]
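
Opening a CSV file without an explicit encoding falls back to the platform default (for example `cp1252` on many Windows setups), which can fail on bytes it cannot map. The following is a minimal sketch of a loader that pins the encoding. `load_csv` is a hypothetical helper, not part of the original notebook, and UTF-8 is assumed to be the encoding of both files.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file decoded with the given encoding."""
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, the two data sets could be loaded as `android_header, android = load_csv('googleplaystore.csv')` and `ios_header, ios = load_csv('AppleStore.csv')`. The `with` block also closes each file once it has been read.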

In [6]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [None]:
print(ios_header)
print('\n')

explore_data(ios,0,5,True)

In [None]:
print(android_header)
print('\n')

explore_data(android,0,5,True)

# Deleting Wrong Data

In [None]:
print(android_header)
print('\n')
print(android[10472])  # incorrect row: the Category value is missing, so later columns are shifted

In [None]:
del android[10472]  # run this cell only once; running it again would delete a valid row
print(len(android))

# Removing Duplicate Entries
Part One

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:
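
One way to inspect the entries for a single app is to filter the rows by name. The sketch below uses made-up toy rows rather than the real data set. `find_entries` and `sample` are hypothetical names, and the name column is assumed to be at index 0, as in the Google Play data.

```python
def find_entries(dataset, app_name, name_index=0):
    """Return every row whose name column equals app_name."""
    return [row for row in dataset if row[name_index] == app_name]


# Toy rows (App, Category, Rating, Reviews); the values are invented for illustration.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Notes', 'PRODUCTIVITY', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
]

for row in find_entries(sample, 'Instagram'):
    print(row)
```

On the loaded data, `find_entries(android, 'Instagram')` would list the Instagram rows in the same way.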

In [None]:
def duplicate_entries(dataset,position):
    duplicate=[]
    unique=[]
    for item in dataset:
        name=item[position]
        if name in unique:
            duplicate.append(name)
        else:
            unique.append(name)
        
    print('No. of duplicate apps: ',len(duplicate))
    print('\n')
    print('Example of duplicate apps: ',duplicate[0:30])

In [None]:
duplicate_entries(android,0)

In [None]:
reviews_max={}
for item in android:
    name=item[0]
    n_reviews=float(item[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews

print('length of the dictionary: ',len(reviews_max))

In [None]:
android_clean=[]
already_added=[]
for item in android:
    name=item[0]
    n_reviews=float(item[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(item)
        already_added.append(name) 
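
The two cells above keep, for each app name, the row with the highest number of reviews. As a sketch under the same column assumptions (name at index 0, reviews at index 3), the two passes can be folded into one by storing the best row per name in a dictionary. `keep_most_reviewed` and `sample` are hypothetical names, and the toy rows are invented.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two rows tie on review count.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows (App, Category, Rating, Reviews); the values are invented for illustration.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Notes', 'PRODUCTIVITY', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
]

print(keep_most_reviewed(sample))
```

The result is ordered by the first appearance of each name, which may differ from the order produced by the two-pass version. Dictionary lookups also avoid the repeated list scans that `name not in already_added` performs.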

In [None]:
explore_data(android_clean,0,5,True)

# Removing Non-English Apps
Part One

If we explore the data sets long enough, we'll notice that the names of some apps suggest they are not directed toward an English-speaking audience. Below, we try a simple check on a few app names, some English and some not.
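
A common heuristic for this relies on the fact that the characters used in English text mostly fall in the ASCII range (code points 0 to 127). The short sketch below prints the code points of a few characters taken from the example names. `code_points` is a name introduced here for illustration.

```python
# Code points for a few characters taken from the example app names.
# Anything above 127 lies outside the ASCII range.
code_points = {character: ord(character) for character in 'aZ™爱😜'}

for character, code_point in code_points.items():
    print(repr(character), code_point, code_point > 127)
```

Because a single ™ or emoji already exceeds 127, rejecting a name on its first non-ASCII character would also discard some English apps. That is why a threshold on the number of such characters is a gentler filter.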

In [None]:
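def function(string):
    # Return False as soon as any character falls outside the ASCII range.
    for item in string:
        if ord(item) > 127:
            return False
    # Only report True after every character has been checked.
    return True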
print(function('Clash of Clans'))
print(function('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(function('Docs To Go™ Free Office Suite'))
print(function('Instachat 😜'))

In [None]:
def function(string):
    non_ascii = 0
    for item in string:
        if ord(item)>127:
            non_ascii +=1
    if non_ascii>3:
        return False
    else:
        return True
        
print(function('Clash of Clans'))
print(function('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(function('Docs To Go™ Free Office Suite'))
print(function('Instachat 😜'))

In [None]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if function(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if function(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

# Isolating the Free Apps

We only build apps that are free to download and install, and our main source of revenue is in-app ads. Our data sets contain both free and non-free apps, so we isolate the free apps in both data sets for our analysis.
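
The two data sets write a zero price differently (`'0'` on Google Play, `'0.0'` on the App Store), so an exact string comparison has to be adapted to each file. As a hedged alternative, the sketch below compares prices numerically. `is_free` is a hypothetical helper, and the assumption that non-free prices may carry a leading `$` is based on common price formatting rather than on anything shown in this notebook.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Non-numeric text (for example from a malformed row) is treated as not free.
        return False


for price in ['0', '0.0', '$0.00', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```

With such a helper, the same condition (`if is_free(price):`) could be used in both loops.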

In [None]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

### Why do all this?

The aim is to determine the kinds of apps that are likely to attract more users, because our revenue depends heavily on the number of people using our apps.

To minimize risk and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

To get a sense of the most common genres in each market, we build frequency tables for the prime_genre column of the App Store data set and for the Genres and Category columns of the Google Play data set.
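
A frequency table of this kind maps each value in a column to its share of the rows, expressed as a percentage. The following is a minimal sketch on made-up rows using `collections.Counter`. `freq_percentages`, `sample`, and `shares` are hypothetical names introduced for illustration.

```python
from collections import Counter


def freq_percentages(dataset, index):
    """Map each value in the given column to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Toy rows (App, Genre); the values are invented for illustration.
sample = [['A', 'Games'], ['B', 'Games'], ['C', 'Tools'], ['D', 'Games']]
shares = freq_percentages(sample, 1)

# Print the genres from most to least common.
for genre, share in sorted(shares.items(), key=lambda pair: pair[1], reverse=True):
    print(genre, ':', share)
```

Sorting on the percentage with a `key` function avoids building a separate list of `(percentage, value)` tuples.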

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [None]:
display_table(android_final, -4)  # the Genres column, free English apps only

In [None]:
display_table(ios_final, -5)  # the prime_genre column, free English apps only

# Most Popular Apps by Genre on the App Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [None]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
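
The nested loop above scans the whole data set once per genre. As a sketch, the same averages can be accumulated in a single pass with two dictionaries. `average_by_group` and `sample` are hypothetical names, and the toy rows are invented.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return the mean of the value column for each distinct group value."""
    totals = defaultdict(float)  # group -> running sum of values
    counts = defaultdict(int)    # group -> number of rows seen
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows (App, Genre, Ratings); the values are invented for illustration.
sample = [['A', 'Games', '10'], ['B', 'Games', '30'], ['C', 'Tools', '5']]
print(average_by_group(sample, 1, 2))
```

For the App Store data, the equivalent call would be `average_by_group(ios_final, -5, 5)`.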

# Most Popular Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough: most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [None]:
display_table(android_final, 5) # the Installs column

In [None]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
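
The cleanup of the Installs strings (dropping commas and the trailing plus sign) can be wrapped in a small function so it is written only once. `parse_installs` is a hypothetical helper. Note that the result is only a lower bound, since a value such as `'1,000+'` means at least 1,000 installs.

```python
def parse_installs(installs):
    """Convert an Installs string such as '1,000,000+' to a float."""
    return float(installs.replace(',', '').replace('+', ''))


for text in ['100+', '1,000+', '1,000,000,000+']:
    print(text, '->', parse_installs(text))
```

Inside the loop above, `total += parse_installs(app[5])` would replace the three cleanup lines.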

In [None]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

In [None]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])