# Profitable App Profiles for the Apple App Store and Google Play App Store Markets

by Drew P. Worden - 08/26/2021

This project uses two data sets of app data, one from the Apple App Store and one from the Google Play Store. The data is cleaned and adjusted for our use case and then analysed. The purpose is to advise a hypothetical company on which types of free, English-language apps to invest development resources in on both platforms. This means that we need to find an app profile that works well in both markets.

## Data Sources

Google Play Store Apps - https://www.kaggle.com/lava18/google-play-store-apps

Mobile App Store (7200 apps) - https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

## Import Packages

In [2]:
from csv import reader

## Define Helper Functions

In [3]:
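def import_data(path):
    # Use a context manager so the file is closed once it has been read.
    # newline="" is what the csv module recommends for files passed to reader.
    with open(path, encoding="utf8", newline="") as open_file:
        data = list(reader(open_file))
    return data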

This function opens a CSV file and returns its contents as a list of lists, one inner list per row. The first inner list is the header row.
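To make the returned shape concrete, here is a minimal sketch that runs `reader` on an in-memory CSV instead of a file on disk. The two data rows and the name `sample_csv` are made up for illustration and are not taken from either data set.

```python
from csv import reader
from io import StringIO

# Hypothetical CSV content standing in for a real file on disk.
sample_csv = StringIO("App,Price\nExample App,0\nAnother App,$1.99\n")

# Same call that import_data makes on an opened file.
rows = list(reader(sample_csv))

header = rows[0]     # the column names
first_app = rows[1]  # the first data row; every value is a string
print(header)
print(first_app)
```

Every value comes back as a string, which is why later steps have to convert prices, ratings and review counts before comparing them as numbers.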

In [4]:
def explore_data(data, display_header=True, display_sample=True, display_all=False, display_num_columns=True, display_num_rows=True):
    if display_header:
        print("---------- Header ---------")
        print(data[0], "\n")

    if display_sample:
        print("---------- Sample Row ---------")
        print(data[1], "\n")

    if display_all:
        print("---------- All Data ---------")
        print(data, "\n")

    if display_num_rows:
        print("---------- Number of Rows ---------")
        print(len(data), "\n")

    if display_num_columns:
        print("---------- Number of Columns ---------")
        print(len(data[0]))

This function takes in a data set and outputs its header, a sample row, the number of columns, number of rows, and optionally all of the data at once.

## Data Overview

With our helper functions we can get a better look at what data we are working with.

In [5]:
#Open Apple App Store Data
apple_data = import_data("AppleStore.csv")

#Explore Data
explore_data(apple_data)

---------- Header ---------
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

---------- Sample Row ---------
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

---------- Number of Rows ---------
7198 

---------- Number of Columns ---------
16


In [6]:
#Open Google Play App Store Data
google_data = import_data("googleplaystore.csv")

#Explore Data
explore_data(google_data)

---------- Header ---------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

---------- Sample Row ---------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

---------- Number of Rows ---------
10842 

---------- Number of Columns ---------
13


## Remove Incomplete Rows

Luckily for us, there is only one incomplete row across the two data sets. It exists in the Google Play Store data: the genre is missing in row 10473 (the list index, counting the header as row 0).
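One way to locate such a row is to compare each row's length with the header's. The sketch below is an illustration rather than the method used in this notebook; `find_incomplete_rows` and the toy data are hypothetical.

```python
def find_incomplete_rows(data):
    """Return the list indices of rows whose column count differs from the header's."""
    expected = len(data[0])
    return [i for i, row in enumerate(data) if len(row) != expected]


# Toy data: the last row is missing one value.
toy = [
    ["App", "Category", "Rating"],
    ["A", "TOOLS", "4.1"],
    ["B", "4.5"],
]
print(find_incomplete_rows(toy))
```

The returned indices count the header as row 0, so they can be passed directly to `del`.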

In [7]:
del google_data[10473]

In [8]:
explore_data(google_data)

---------- Header ---------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

---------- Sample Row ---------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

---------- Number of Rows ---------
10841 

---------- Number of Columns ---------
13


Our number of rows has been updated to reflect the change.

## Duplicate Data

Unfortunately, the Google Play data set contains duplicate entries. Let's construct a function that takes in a data set and returns a new one that contains only unique entries. Note that we keep exactly one entry per app: the one with the most reviews.

In [9]:
def remove_dups(dataset):
    #Create Unique App Dictionary
    unique_apps = {}
    for app in dataset[1:]:
        app_name = app[0]
        if app_name in unique_apps:
            #Compare review counts as integers, not strings:
            #as strings, "9" > "10" would be True
            unique_app_reviews = int(unique_apps[app_name][3])
            app_reviews = int(app[3])
            if app_reviews > unique_app_reviews:
                unique_apps[app_name] = app
        else:
            unique_apps[app_name] = app

    #Convert Unique App Dictionary to List, keeping the header first
    unique_app_list = [dataset[0]]
    for app in unique_apps:
        unique_app_list.append(unique_apps[app])

    return unique_app_list

If we inspect the data after duplicate removal, we can see that the Google Play data set has been reduced from 10841 rows to 9660 rows (both counts include the header), which leaves 9659 unique apps.

In [10]:
explore_data(google_data)
google_data = remove_dups(google_data)
explore_data(google_data)

---------- Header ---------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

---------- Sample Row ---------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

---------- Number of Rows ---------
10841 

---------- Number of Columns ---------
13
---------- Header ---------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

---------- Sample Row ---------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

---------- Number of Rows ---------
9660 

---------- Number of Columns ---------
13
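As a sanity check on the row counts above (10841 - 9660 = 1181 rows removed), the number of duplicate rows can be counted directly before anything is removed. This is a minimal sketch; `count_duplicates` and the toy data are hypothetical and not part of the original notebook.

```python
def count_duplicates(dataset, name_col=0):
    """Return (duplicate_rows, unique_names) for a data set with a header row."""
    seen = set()
    duplicates = 0
    for row in dataset[1:]:
        name = row[name_col]
        if name in seen:
            duplicates += 1  # an earlier row already used this name
        else:
            seen.add(name)
    return duplicates, len(seen)


# Toy data: "A" appears three times, so two rows are duplicates.
toy = [["App", "Reviews"], ["A", "10"], ["B", "5"], ["A", "12"], ["A", "9"]]
duplicates, unique = count_duplicates(toy)
print(duplicates, unique)
```

The number of data rows should always equal `duplicates + unique`, which gives a quick check that deduplication removed exactly the expected rows.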


## Removal of Non-English Apps

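Both data sets contain apps that are designed for non-English speaking markets. We identify these by characters outside the ASCII range (code points above 127) in the name of the app. Note that this rule is strict: a single such character, for example an emoji or a trademark sign, is enough to drop the app. We will remove these apps since this project is aimed at businesses in English-speaking markets.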

In [11]:
def remove_non_english(dataset, name_col_index):
    #Start the cleaned set with the header row
    clean_set = [dataset[0]]
    for app in dataset[1:]:
        app_name = app[name_col_index]
        english_word = True
        for char in app_name:
            #Code points above 127 are outside the ASCII range
            if ord(char) > 127:
                english_word = False
        if english_word:
            clean_set.append(app)

    return clean_set

Now when we inspect the Apple data set, we can see that it was reduced from 7198 rows to 5708 rows (header included).

In [12]:
explore_data(apple_data)
apple_data = remove_non_english(apple_data, 1)
explore_data(apple_data)

---------- Header ---------
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

---------- Sample Row ---------
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

---------- Number of Rows ---------
7198 

---------- Number of Columns ---------
16
---------- Header ---------
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

---------- Sample Row ---------
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

---------- Number of Rows ---------
5708 

----

Similarly, we can see that the Google data set has been reduced from 9660 rows to 9118 rows (header included).

In [13]:
explore_data(google_data)
google_data = remove_non_english(google_data, 0)
explore_data(google_data)

---------- Header ---------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

---------- Sample Row ---------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

---------- Number of Rows ---------
9660 

---------- Number of Columns ---------
13
---------- Header ---------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

---------- Sample Row ---------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

---------- Number of Rows ---------
9118 

---------- Number of Columns ---------
13


## Removal of Paid Apps

This project focuses on the app profiles of free apps in both app stores. This means we must remove all paid apps from the data sets. Let's define a function and pass both data sets through.

In [14]:
def remove_paid(dataset, price_col):
    free_apps = []
    free_apps.append(dataset[0])
    for app in dataset[1:]:
        app_price = app[price_col]
        if app_price == "0" or app_price == "0.0":
            free_apps.append(app)
    
    return free_apps

In [15]:
apple_data = remove_paid(apple_data, 4)
explore_data(apple_data)

google_data = remove_paid(google_data, 7)
explore_data(google_data)

---------- Header ---------
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

---------- Sample Row ---------
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

---------- Number of Rows ---------
2923 

---------- Number of Columns ---------
16
---------- Header ---------
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

---------- Sample Row ---------
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

---------- Number of Rows ---------
8407 

---------- Number of Columns -------

We can see that the Apple data set was reduced from 5708 rows to 2923 rows. Similarly the Google data set was reduced from 9118 rows to 8407 rows.
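`remove_paid` compares the raw price string with "0" and "0.0". A slightly more defensive variant converts the price to a number first, so that other spellings of zero are also treated as free. This is a sketch under the assumption that paid prices may carry a leading dollar sign such as "$4.99"; `parse_price` and `is_free` are hypothetical helpers.

```python
def parse_price(raw_price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float.

    Assumption: the only non-numeric decoration is an optional leading '$'.
    Any other format raises ValueError instead of being silently kept.
    """
    cleaned = raw_price.strip().lstrip("$")
    return float(cleaned)


def is_free(raw_price):
    return parse_price(raw_price) == 0.0


for price in ["0", "0.0", "0.00", "$4.99"]:
    print(price, is_free(price))
```

For the values shown in the sample rows ("0" and "0.0") both approaches agree. The numeric version only matters if a free app is recorded with a different spelling of zero.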

## Analysis

Now that our data sets are ready for analysis, we can start by building frequency tables for the relevant columns. Let's build a function that creates a frequency table for a given column and returns the result as percentages sorted in ascending order, so the most common values appear at the end of the output.

In [16]:
def freq_table(dataset, index):
    table = {}
    for app in dataset[1:]:
        target_col = app[index]
        if target_col in table:
            table[target_col] += 1
        else:
            table[target_col] = 1
    
    for entry in table:
        table[entry] = (table[entry] / len(dataset[1:])) * 100
        table[entry] = round(table[entry], 2)
        
    return sorted((value,key) for (key,value) in table.items())
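The same table can be built with `collections.Counter` from the standard library. The sketch below is an alternative formulation checked on toy data, not a replacement for the function above; `freq_table_counter` and the toy rows are hypothetical.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, sorted in ascending order."""
    counts = Counter(row[index] for row in dataset[1:])  # skip the header row
    total = sum(counts.values())
    return sorted(
        (round(count / total * 100, 2), key) for key, count in counts.items()
    )


# Toy data: three of the four rows are Games.
toy = [["name", "genre"], ["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
table = freq_table_counter(toy, 1)
print(table)        # ascending: least common first
print(table[::-1])  # reversed: most common first
```

Reversing the list is enough to read the table from the most common value down.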

Now that we can generate frequency tables, let's take a look at a few different columns from the data sets.

In [17]:
freq_table(apple_data, 11)

[(0.1, 'Catalogs'),
 (0.14, 'Navigation'),
 (0.21, 'Medical'),
 (0.27, 'Book'),
 (0.51, 'Business'),
 (0.51, 'Reference'),
 (0.89, 'Food & Drink'),
 (0.89, 'Weather'),
 (1.1, 'Finance'),
 (1.13, 'Travel'),
 (1.33, 'News'),
 (1.47, 'Lifestyle'),
 (1.71, 'Productivity'),
 (1.98, 'Health & Fitness'),
 (2.05, 'Sports'),
 (2.16, 'Music'),
 (2.26, 'Utilities'),
 (2.5, 'Shopping'),
 (3.11, 'Social Networking'),
 (3.83, 'Education'),
 (5.13, 'Photo & Video'),
 (7.53, 'Entertainment'),
 (59.17, 'Games')]

In [18]:
freq_table(google_data, 9)

[(0.01, 'Adventure;Education'),
 (0.01, 'Art & Design;Action & Adventure'),
 (0.01, 'Art & Design;Pretend Play'),
 (0.01, 'Books & Reference;Education'),
 (0.01, 'Card;Action & Adventure'),
 (0.01, 'Card;Brain Games'),
 (0.01, 'Comics;Creativity'),
 (0.01, 'Entertainment;Education'),
 (0.01, 'Health & Fitness;Action & Adventure'),
 (0.01, 'Health & Fitness;Education'),
 (0.01, 'Lifestyle;Pretend Play'),
 (0.01, 'Music & Audio;Music & Video'),
 (0.01, 'Parenting;Brain Games'),
 (0.01, 'Puzzle;Education'),
 (0.01, 'Racing;Pretend Play'),
 (0.01, 'Role Playing;Brain Games'),
 (0.01, 'Simulation;Education'),
 (0.01, 'Strategy;Creativity'),
 (0.01, 'Strategy;Education'),
 (0.01, 'Tools;Education'),
 (0.01, 'Trivia;Education'),
 (0.01, 'Video Players & Editors;Creativity'),
 (0.02, 'Board;Action & Adventure'),
 (0.02, 'Casual;Education'),
 (0.02, 'Education;Brain Games'),
 (0.02, 'Educational;Action & Adventure'),
 (0.02, 'Entertainment;Action & Adventure'),
 (0.02, 'Entertainment;Creativity

In [19]:
freq_table(google_data, 1)

[(0.57, 'COMICS'),
 (0.63, 'BEAUTY'),
 (0.65, 'PARENTING'),
 (0.67, 'ART_AND_DESIGN'),
 (0.71, 'EVENTS'),
 (0.8, 'WEATHER'),
 (0.81, 'HOUSE_AND_HOME'),
 (0.9, 'LIBRARIES_AND_DEMO'),
 (0.94, 'AUTO_AND_VEHICLES'),
 (0.94, 'ENTERTAINMENT'),
 (1.18, 'EDUCATION'),
 (1.2, 'FOOD_AND_DRINK'),
 (1.36, 'MAPS_AND_NAVIGATION'),
 (1.76, 'VIDEO_PLAYERS'),
 (1.83, 'DATING'),
 (2.19, 'BOOKS_AND_REFERENCE'),
 (2.25, 'SHOPPING'),
 (2.31, 'TRAVEL_AND_LOCAL'),
 (2.66, 'SOCIAL'),
 (2.8, 'NEWS_AND_MAGAZINES'),
 (3.01, 'PHOTOGRAPHY'),
 (3.13, 'HEALTH_AND_FITNESS'),
 (3.22, 'COMMUNICATION'),
 (3.26, 'SPORTS'),
 (3.31, 'PERSONALIZATION'),
 (3.63, 'MEDICAL'),
 (3.74, 'FINANCE'),
 (3.89, 'LIFESTYLE'),
 (3.97, 'PRODUCTIVITY'),
 (4.71, 'BUSINESS'),
 (8.57, 'TOOLS'),
 (9.58, 'GAME'),
 (18.83, 'FAMILY')]

From the frequency tables we can make some observations. First, the Apple App Store seems to be dominated by apps that are designed for fun (Games alone account for 59.17% of the free English apps), while the Google Play store shows a more balanced landscape of fun and practical apps.

## Generating the App Profile & Recommendation

To generate an app profile we will use slightly different filters for each store. For the Apple App Store we take the most common category (Games) and keep the apps with a user rating of 5.0.

For Google Play we take the most common category (FAMILY) and keep the apps with a rating of 4.8 or higher.

In [28]:
apple_games_category = []
for app in apple_data[1:]:
    if app[11] == "Games":
        apple_games_category.append(app)
        
apple_recommended_profiles = []
for app in apple_games_category:
    rating = app[7]
    if float(rating) == 5.0:
        apple_recommended_profiles.append(app)

We can now see the highest-rated free games on the Apple App Store. With a user rating of 5.0, these are the best-liked games on the platform, and we recommend them as reference profiles for a new app that aims to be well liked and to reach many users.

In [36]:
for app in apple_recommended_profiles:
    print(app[1])

Head Soccer
Sniper 3D Assassin: Shoot to Kill Gun Game
Geometry Dash Lite
CSR Racing 2
Pictoword: Fun 2 Pics Guess What's the Word Trivia
Iron Force
Sniper Shooter: Gun Shooting Games
PewDiePie's Tuber Simulator
Blackbox - think outside the box
Egg, Inc.
Flight Pilot Simulator 3D: Flying Game For Free
Logos Quiz -Guess the most famous brands, new fun!
Gin Rummy Plus - Multiplayer Online Card Game
Yu-Gi-Oh! Duel Links
Arrow Ambush
Crazy Kitchen
SMILE Inc.
Castle Crush: Epic Strategy Game
Vlogger Go Viral - Clicker Game & Vlog Simulator
Sugar Smash: Book of Life
X-War: Clash of Zombies
War Machines: 3D Multiplayer Tank Shooting Game
Bee Brilliant
War Tortoise
Monster Super League
Stupid Zombies 3
Tank.IO War - Free Tank games of snake
Despicable Bear - Top Beat Action Game
Burrito Bison: Launcha Libre
Dan The Man (Retro Action Platformer)
Super Cat Tales
Slots: DoubleUp Free Slot Games - Slot Machines
Good Knight Story
Grumpy Cat's Worst Game Ever
Crazy Cake Swap
Slots: Hot Vegas Slot Ma
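A rating of 5.0 says little if only a handful of users voted, so it can help to order the recommended profiles by how many ratings they received. The sketch below sorts by the `rating_count_tot` column (index 5 in the header shown earlier). The function name and the toy rows are hypothetical, with invented counts.

```python
def rank_by_rating_count(apps, count_col=5):
    """Sort app rows by total rating count, largest first.

    Assumes the count column holds integer strings, as in the sample row.
    """
    return sorted(apps, key=lambda app: int(app[count_col]), reverse=True)


# Toy rows with the first six Apple columns only; all values are invented.
toy_profiles = [
    ["1", "Game A", "100", "USD", "0.0", "250"],
    ["2", "Game B", "100", "USD", "0.0", "48000"],
    ["3", "Game C", "100", "USD", "0.0", "1200"],
]
for app in rank_by_rating_count(toy_profiles):
    print(app[1], app[5])
```

Applied to `apple_recommended_profiles`, this would put the 5.0-rated games with the largest audiences at the top of the list.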

We can do the same for the Google Play store, using its most common category (FAMILY) and a rating threshold of 4.8.

In [41]:
google_family_category = []
for app in google_data[1:]:
    if app[1] == "FAMILY":
        google_family_category.append(app)
        
google_recommended_profiles = []
for app in google_family_category:
    if float(app[2]) >= 4.8:
        google_recommended_profiles.append(app)

In [42]:
google_recommended_profiles

[['No. Color - Color by Number, Number Coloring',
  'FAMILY',
  '4.8',
  '269194',
  '6.9M',
  '10,000,000+',
  'Free',
  '0',
  'Everyone',
  'Entertainment;Brain Games',
  'August 3, 2018',
  'Varies with device',
  'Varies with device'],
 ['Fuzzy Seasons: Animal Forest',
  'FAMILY',
  '4.8',
  '12137',
  '63M',
  '100,000+',
  'Free',
  '0',
  'Everyone 10+',
  'Simulation;Pretend Play',
  'August 6, 2018',
  '149',
  '4.1 and up'],
 ['Pino chess',
  'FAMILY',
  '4.8',
  '61',
  '25M',
  '1,000+',
  'Free',
  '0',
  'Everyone',
  'Casual;Brain Games',
  'January 26, 2017',
  '0.1.0',
  '4.0 and up'],
 ['Find a Way: Addictive Puzzle',
  'FAMILY',
  '4.8',
  '39480',
  '14M',
  '500,000+',
  'Free',
  '0',
  'Everyone',
  'Puzzle',
  'June 16, 2017',
  '4.1.1',
  '4.1 and up'],
 ['CompTIA Exam Training',
  'FAMILY',
  '4.8',
  '3053',
  '17M',
  '50,000+',
  'Free',
  '0',
  'Everyone',
  'Education',
  'July 3, 2018',
  '1.7',
  '4.2 and up'],
 ['Hungry Hearts Diner: A Tale of Star-C
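The `Installs` column holds strings such as '10,000,000+', so it has to be converted before the profiles can be ordered by reach. This is a minimal sketch, assuming every value consists of digits, commas and a trailing plus sign; `parse_installs` and `rank_by_installs` are hypothetical helpers, and the toy rows are shortened copies of the first three rows of the output above.

```python
def parse_installs(raw_installs):
    """Convert a string like '10,000,000+' to the integer 10000000."""
    return int(raw_installs.replace(",", "").replace("+", ""))


def rank_by_installs(apps, installs_col=5):
    """Sort app rows by install count, largest first."""
    return sorted(
        apps, key=lambda app: parse_installs(app[installs_col]), reverse=True
    )


# First six columns of three rows from the output above.
toy_profiles = [
    ["Pino chess", "FAMILY", "4.8", "61", "25M", "1,000+"],
    ["No. Color - Color by Number, Number Coloring", "FAMILY", "4.8", "269194", "6.9M", "10,000,000+"],
    ["Fuzzy Seasons: Animal Forest", "FAMILY", "4.8", "12137", "63M", "100,000+"],
]
for app in rank_by_installs(toy_profiles):
    print(app[0], app[5])
```

The install figures are lower bounds (the plus sign means "at least"), so the ranking is approximate within each bucket.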

## Conclusion

Based on our analysis, and with the goal of developing one application for both the Apple App Store and the Google Play store, the app profile that does best in terms of user ratings and installs is a short, fun game that teases the brain and helps pass the time. These games are meant to be easy to pick up and put down. The UI should be simple, and the content easy to learn but hard to master. These apps do the best on both platforms.