# Profitable App Profiles for the App Store and Google Play Markets

The purpose of this project is to examine app data from both the App Store and Google Play to determine which kinds of apps are most popular. This will help our developers decide where placing ads will be most beneficial. Our company works exclusively with free, ad-supported, English-language apps, so we analyze only that market. We also publish apps to both Google Play and the App Store, so we need to determine what kind of app we can create for maximum profit on both platforms.

We first clean the data by removing any bad data (as defined below) as well as any irrelevant data, meaning paid apps and non-English apps. We then analyze the remaining data by category to find, across both the App Store and Google Play, a profitable category that is neither overly saturated nor dominated by a few apps, in which to create our next app.

After completing this cleaning and analysis, we determined that it could be profitable to create an app in the Book category. Adding interactive features such as daily quotes or a user forum could increase the time users spend in the app and therefore the likelihood that they click in-app ads, which is how the app generates revenue.

### General Setup Code

In [114]:
from csv import reader
import re

In [115]:
AS_COLS = 16
AS_NAME_IDX = 1
AS_REVIEW_IDX = 5
AS_COST_IDX = 4
AS_GENRE_IDX = 11
AS_RATING_COUNT_IDX = 5

GP_COLS = 13
GP_NAME_IDX = 0
GP_REVIEW_IDX = 3
GP_COST_IDX = 7
GP_CATEGORY_IDX = 1
GP_GENRE_IDX = 9
GP_INSTALLS_IDX = 5
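
The hard-coded indices above can be cross-checked against the header rows of the two CSV files. The sketch below looks each column up by name; the helper `column_index` is a hypothetical name introduced for illustration, and the header lists are copied from the headers printed in the data exploration output.

```python
AS_HEADER = ['id', 'track_name', 'size_bytes', 'currency', 'price',
             'rating_count_tot', 'rating_count_ver', 'user_rating',
             'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
             'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

GP_HEADER = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
             'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
             'Current Ver', 'Android Ver']


def column_index(header, column_name):
    """Return the position of `column_name` in a header row.
    Raises ValueError if the column is not present."""
    return header.index(column_name)


# Look up a few columns by name instead of by position.
as_name_idx = column_index(AS_HEADER, 'track_name')
as_cost_idx = column_index(AS_HEADER, 'price')
gp_installs_idx = column_index(GP_HEADER, 'Installs')
print(as_name_idx, as_cost_idx, gp_installs_idx)
```

Looking columns up by name keeps the constants correct if a future version of either file reorders its columns.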



### Initial Data Exploration

In [116]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [117]:
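def convert_csv_to_list(filename):
    '''
    takes in a csv file and returns that file as a list of rows
    '''
    # the with-block closes the file once it has been read;
    # utf8 is given explicitly because app names contain non-ASCII characters
    with open(filename, encoding='utf8', newline='') as open_file:
        read_file = reader(open_file)
        list_file = list(read_file)
    return list_file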


In [118]:
def app_count_dict(app_data, name_idx, header=False):
    '''
    counts how many times each app name occurs
    set header=True to skip the first (header) row
    '''
    start = 1 if header else 0
    name_counts = {}

    for app in app_data[start:]:
        app_name = app[name_idx]
        name_counts[app_name] = name_counts.get(app_name, 0) + 1

    return name_counts
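
The name-counting helper above could also be written with `collections.Counter` from the standard library. The sketch below is an alternative under assumptions, not a replacement for the project's code: `count_names` is a hypothetical name and the toy rows are trimmed to two columns.

```python
from collections import Counter


def count_names(app_data, name_idx, header=True):
    """Count occurrences of each app name, optionally skipping the header row."""
    rows = app_data[1:] if header else app_data
    return Counter(row[name_idx] for row in rows)


# Toy rows (hypothetical, trimmed to two columns) standing in for the real data.
toy_data = [
    ['App', 'Category'],
    ['Nick', 'ENTERTAINMENT'],
    ['Nick', 'FAMILY'],
    ['Brilliant', 'EDUCATION'],
]
name_counts = count_names(toy_data, name_idx=0)

# Names that occur more than once are duplicate candidates.
repeated = {name: n for name, n in name_counts.items() if n > 1}
print(repeated)
```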

In [119]:
app_store_data = convert_csv_to_list('AppleStore.csv')

In [120]:
explore_data(app_store_data, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


### AppleStore.csv File Definition (raw data)

Number of rows: 7197 (7198 including the header row)   
Number of columns: 16

#### Header Description
id  
track_name - app name   
size_bytes  
currency   
price  
rating_count_tot  
rating_count_ver  
user_rating  
user_rating_ver  
ver  
cont_rating  
prime_genre  
sup_devices.num  
ipadSc_urls.num  
lang.num  
vpp_lic  

In [121]:
google_play_data = convert_csv_to_list('googleplaystore.csv')

In [122]:
explore_data(google_play_data, 0 , 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


### googleplaystore.csv File Definition

Number of rows: 10841 (10842 including the header row)   
Number of columns: 13

### Remove Bad Data

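In this project, "bad data" is defined as a row that is missing at least one column value. We detect such a row by checking whether its length differs from the expected number of columns.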

In [123]:
def check_data(dataset, expected_num_col):
    '''
    Examines a dataset for empty columns
    Returns a list of the rows with missing column data
    '''
    dirty_rows = []
    idx = 0
    for row in dataset:
        if len(row) != expected_num_col:
            row.append(idx)
            dirty_rows.append(row)
        idx += 1
    return dirty_rows
        

In [124]:
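def clean_bad_data(data, expected_nbr_col, update=True):
    '''
    removes any rows of data missing column data
    returns the number of bad rows found
    '''
    data_clean_ctr = 0
    
    dirty_data = check_data(data, expected_nbr_col) # last element of each row is the idx of the bad entry
    
    # delete from the highest idx down so earlier deletions don't shift later indices
    for app in reversed(dirty_data):
        if update: del data[app[-1]]
        data_clean_ctr += 1
    
    return data_clean_ctr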
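
When more than one row is malformed, deleting rows by their stored index is order-sensitive: each deletion shifts every later row down by one. The sketch below uses hypothetical toy rows and a hypothetical helper `delete_rows` to show that deleting from the highest index down keeps the remaining indices valid.

```python
def delete_rows(rows, bad_indices):
    """Delete the rows at `bad_indices` in place, highest index first,
    so that earlier deletions never shift an index still to be used."""
    for idx in sorted(bad_indices, reverse=True):
        del rows[idx]
    return rows


# Toy dataset (hypothetical): rows at index 1 and 3 are missing a column.
toy = [['a', '1'], ['b'], ['c', '3'], ['d'], ['e', '5']]
bad = [i for i, row in enumerate(toy) if len(row) != 2]

cleaned = delete_rows(toy, bad)
print(cleaned)
```

Deleting in ascending order instead would remove index 1 and then index 3 of the already shortened list, which drops the valid row `['e', '5']` and leaves the malformed row `['d']` in place.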

In [125]:
as_rows_deleted = clean_bad_data(app_store_data, AS_COLS)

print("Nbr App Store Rows Deleted: " + str(as_rows_deleted))

Nbr App Store Rows Deleted: 0


In [126]:
gp_rows_deleted = clean_bad_data(google_play_data, GP_COLS)

print("Nbr Google Play Rows Deleted: " + str(gp_rows_deleted))

Nbr Google Play Rows Deleted: 1


### Filter Out Duplicate Apps

We want to remove any duplicates so that repeated entries don't inflate the counts for each app type in the data we analyze.

In [127]:
def check_dups(app_list, name_idx):
    '''
    returns True if there are any duplicate names in app data, otherwise False
    '''
    unique_apps = []
    for app in app_list[1:]:
        app_name = app[name_idx]
        if app_name in unique_apps:
            return True
        else: 
            unique_apps.append(app_name)
    return False

def find_dups(app_list, name_idx):
    '''
    takes in a list of lists of app data and the index of the app name
    returns a list of duplicate apps, nbr duplicates, list of unique apps and nbr unique apps
    '''
    duplicate_apps = []
    unique_apps = []
    
    for app in app_list[1:]:
        app_name = app[name_idx]
        if app_name in unique_apps:
            duplicate_apps.append(app_name)  
        else:
            unique_apps.append(app_name)
    
    return duplicate_apps, len(duplicate_apps), unique_apps, len(unique_apps)

In [128]:
app_store_dups_exist = check_dups(app_store_data, AS_NAME_IDX)
as_dups, as_dups_nbr, as_unique, as_unique_ctr = find_dups(app_store_data, AS_NAME_IDX)
as_dict = app_count_dict(app_store_data, AS_NAME_IDX)

print('App Store Duplication Numbers')
print("Duplicates Exist: " + str(app_store_dups_exist))
print("Nbr Duplicates: " + str(as_dups_nbr))
# print(as_dups[0:3])
print("Nbr Unique: " + str(as_unique_ctr))
# print(as_unique[0:3])


App Store Duplication Numbers
Duplicates Exist: True
Nbr Duplicates: 2
Nbr Unique: 7195


In [129]:
google_play_dups_exist = check_dups(google_play_data, GP_NAME_IDX)
gp_dups, gp_dups_nbr, gp_unique, gp_unique_ctr = find_dups(google_play_data, GP_NAME_IDX)
gp_dict = app_count_dict(google_play_data, GP_NAME_IDX)

print('Google Play Duplication Numbers')
print("Duplicates Exist: " + str(google_play_dups_exist)) 
print("Nbr Duplicates: " + str(gp_dups_nbr))
# print(gp_dups[0:3])
print("Nbr Unique: " + str(gp_unique_ctr))
# print(gp_unique[0:3])

Google Play Duplication Numbers
Duplicates Exist: True
Nbr Duplicates: 1181
Nbr Unique: 9659
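
The Google Play counts above show many repeated names. One possible way to resolve them, offered here as an assumption rather than the approach this project necessarily takes, is to keep the row with the highest review count for each name, on the reasoning that it is likely the most recent snapshot of the app. The helper name `dedupe_by_max_reviews` is hypothetical and the toy rows are trimmed to four columns.

```python
def dedupe_by_max_reviews(apps, name_idx, review_idx):
    """Keep, for each app name, the first row seen with the highest
    review count. Expects rows without the header."""
    best = {}
    for app in apps:
        name = app[name_idx]
        reviews = int(app[review_idx])
        if name not in best or reviews > int(best[name][review_idx]):
            best[name] = app
    return list(best.values())


# Toy rows (trimmed to four columns): name, category, rating, reviews.
toy_rows = [
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967'],
    ['Coloring book moana', 'FAMILY', '3.9', '974'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644'],
]
deduped = dedupe_by_max_reviews(toy_rows, name_idx=0, review_idx=3)
print(deduped)
```

Because exact ties keep the first row seen, rows that are identical in every column collapse to a single entry as well.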


In [130]:
def print_duplicates(app_data, app_dict, name_idx):
    '''
    This function is really just for exploring the duplicates to see what can be done to clean them up.
    All it does is read, organize and print out the data.
    '''
    
    for app_name, app_count in app_dict.items():
        if app_count == 1:
            continue
        print('*' * 25)
        print(app_name + ': '+ str(app_count))
        print('*' * 25)
        
        # print every row whose name column matches the duplicated name
        for app in app_data:
            if app[name_idx] == app_name:
                print(app)
        

In [131]:
print('App Store Duplicates')
print('=' * 20)
print_duplicates(app_store_data, as_dict, AS_NAME_IDX)

App Store Duplicates
*************************
Mannequin Challenge: 2
*************************
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
*************************
VR Roller Coaster: 2
*************************
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


These App Store duplicates are not true duplicates: the entries have different ids, their sizes differ drastically, and they have different versions.

https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409
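
The comparison behind that conclusion can be sketched by listing the columns in which two same-named rows differ. The helper `differing_columns` is a hypothetical name; the header and the two rows are copied from the App Store output above.

```python
AS_HEADER = ['id', 'track_name', 'size_bytes', 'currency', 'price',
             'rating_count_tot', 'rating_count_ver', 'user_rating',
             'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
             'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def differing_columns(row_a, row_b, header):
    """Return the names of the columns whose values differ between two rows."""
    return [col for col, a, b in zip(header, row_a, row_b) if a != b]


row_a = ['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0',
         '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
row_b = ['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0',
         '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

diff = differing_columns(row_a, row_b, AS_HEADER)
print(diff)
```

If the returned list includes `id`, `size_bytes` and `ver`, the two rows most likely describe separate apps that happen to share a name.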

In [132]:
print('Google Play Duplicates')
print('=' * 22)
print_duplicates(google_play_data, gp_dict, GP_NAME_IDX)

Google Play Duplicates
*************************
Coloring book moana: 2
*************************
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['Coloring book moana', 'FAMILY', '3.9', '974', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
*************************
Mcqueen Coloring pages: 2
*************************
['Mcqueen Coloring pages', 'ART_AND_DESIGN', 'NaN', '61', '7.0M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Action & Adventure', 'March 7, 2018', '1.0.0', '4.1 and up']
['Mcqueen Coloring pages', 'FAMILY', 'NaN', '65', '7.0M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Action & Adventure', 'March 7, 2018', '1.0.0', '4.1 and up']
*************************
UNICORN - Color By Number & Pixel Art Coloring: 2
*************************
['UNICORN - Color By Number & Pixe

*************************
['UC Browser Mini -Tiny Fast Private & Secure', 'COMMUNICATION', '4.4', '3648120', '3.3M', '100,000,000+', 'Free', '0', 'Teen', 'Communication', 'July 18, 2018', '11.4.0', '4.0 and up']
['UC Browser Mini -Tiny Fast Private & Secure', 'COMMUNICATION', '4.4', '3648480', '3.3M', '100,000,000+', 'Free', '0', 'Teen', 'Communication', 'July 18, 2018', '11.4.0', '4.0 and up']
['UC Browser Mini -Tiny Fast Private & Secure', 'COMMUNICATION', '4.4', '3648765', '3.3M', '100,000,000+', 'Free', '0', 'Teen', 'Communication', 'July 18, 2018', '11.4.0', '4.0 and up']
*************************
WhatsApp Business: 2
*************************
['WhatsApp Business', 'COMMUNICATION', '4.4', '136662', '32M', '10,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 30, 2018', '2.18.116', '4.0.3 and up']
['WhatsApp Business', 'COMMUNICATION', '4.4', '137144', '32M', '10,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 30, 2018', '2.18.116', '4.0.3 and up']
*************

*************************
['KakaoTalk: Free Calls & Text', 'COMMUNICATION', '4.3', '2546527', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Communication', 'August 3, 2018', 'Varies with device', 'Varies with device']
['KakaoTalk: Free Calls & Text', 'COMMUNICATION', '4.3', '2546527', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Communication', 'August 3, 2018', 'Varies with device', 'Varies with device']
['KakaoTalk: Free Calls & Text', 'COMMUNICATION', '4.3', '2546549', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Communication', 'August 3, 2018', 'Varies with device', 'Varies with device']
*************************
CM Browser - Ad Blocker , Fast Download , Privacy: 2
*************************
['CM Browser - Ad Blocker , Fast Download , Privacy', 'COMMUNICATION', '4.6', '2264916', '6.1M', '50,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 31, 2018', '5.22.18.0006', '4.0 and up']
['CM Browser - Ad Blocker , Fast Downl

*************************
['Girls Live Chat - Free Text & Video Chat', 'DATING', '4.8', '110', '4.9M', '10,000+', 'Free', '0', 'Mature 17+', 'Dating', 'July 9, 2018', '8.2', '4.0.3 and up']
['Girls Live Chat - Free Text & Video Chat', 'DATING', '4.8', '110', '4.9M', '10,000+', 'Free', '0', 'Mature 17+', 'Dating', 'July 9, 2018', '8.2', '4.0.3 and up']
*************************
Random Video Chat: 2
*************************
['Random Video Chat', 'DATING', 'NaN', '3', '16M', '1,000+', 'Free', '0', 'Mature 17+', 'Dating', 'July 15, 2018', '4.20', '4.0.3 and up']
['Random Video Chat', 'DATING', 'NaN', '3', '16M', '1,000+', 'Free', '0', 'Mature 17+', 'Dating', 'July 15, 2018', '4.20', '4.0.3 and up']
*************************
MouseMingle: 2
*************************
['MouseMingle', 'DATING', '2.7', '3', '3.9M', '100+', 'Free', '0', 'Mature 17+', 'Dating', 'July 17, 2018', '1.0.0', '4.4 and up']
['MouseMingle', 'DATING', '2.7', '3', '3.9M', '100+', 'Free', '0', 'Mature 17+', 'Dating', 'July 

*************************
['Lynda - Online Training Videos', 'EDUCATION', '4.2', '8599', '17M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 11, 2018', '4.9.10', '4.1 and up']
['Lynda - Online Training Videos', 'EDUCATION', '4.2', '8599', '17M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 11, 2018', '4.9.10', '4.1 and up']
*************************
Brilliant: 2
*************************
['Brilliant', 'EDUCATION', '4.5', '41185', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Brilliant', 'EDUCATION', '4.5', '41185', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'August 3, 2018', 'Varies with device', 'Varies with device']
*************************
CppDroid - C/C++ IDE: 2
*************************
['CppDroid - C/C++ IDE', 'EDUCATION', '4.1', '29980', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'August 17, 2017', 'V

*************************
['Nick', 'ENTERTAINMENT', '4.2', '123279', '25M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'January 24, 2018', '2.0.8', '4.4 and up']
['Nick', 'ENTERTAINMENT', '4.2', '123279', '25M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'January 24, 2018', '2.0.8', '4.4 and up']
['Nick', 'ENTERTAINMENT', '4.2', '123279', '25M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'January 24, 2018', '2.0.8', '4.4 and up']
['Nick', 'ENTERTAINMENT', '4.2', '123279', '25M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'January 24, 2018', '2.0.8', '4.4 and up']
['Nick', 'FAMILY', '4.2', '123322', '25M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'January 24, 2018', '2.0.8', '4.4 and up']
['Nick', 'FAMILY', '4.2', '123309', '25M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'January 24, 2018', '2.

*************************
['LEGO® TV', 'ENTERTAINMENT', '3.7', '17247', '7.2M', '1,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'August 4, 2018', '4.0.2', '5.0 and up']
['LEGO® TV', 'FAMILY', '3.7', '17250', '7.2M', '1,000,000+', 'Free', '0', 'Everyone 10+', 'Entertainment;Music & Video', 'August 4, 2018', '4.0.2', '5.0 and up']
*************************
HBO GO: Stream with TV Package: 2
*************************
['HBO GO: Stream with TV Package', 'ENTERTAINMENT', '3.8', '87723', '32M', '10,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 19, 2018', '16.0.0.437', '4.1 and up']
['HBO GO: Stream with TV Package', 'FAMILY', '3.8', '87734', '32M', '10,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 19, 2018', '16.0.0.437', '4.1 and up']
*************************
Showtime Anytime: 2
*************************
['Showtime Anytime', 'ENTERTAINMENT', '3.7', '18523', 'Varies with device', '1,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'July 2, 2018', 'V

*************************
['Calm - Meditate, Sleep, Relax', 'HEALTH_AND_FITNESS', '4.6', '111450', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Health & Fitness', 'August 5, 2018', 'Varies with device', 'Varies with device']
['Calm - Meditate, Sleep, Relax', 'HEALTH_AND_FITNESS', '4.6', '111455', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Health & Fitness', 'August 5, 2018', 'Varies with device', 'Varies with device']
*************************
Relax Melodies: Sleep Sounds: 2
*************************
['Relax Melodies: Sleep Sounds', 'HEALTH_AND_FITNESS', '4.5', '233243', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Health & Fitness', 'July 23, 2018', 'Varies with device', 'Varies with device']
['Relax Melodies: Sleep Sounds', 'HEALTH_AND_FITNESS', '4.5', '233243', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Health & Fitness', 'July 23, 2018', 'Varies with device', 'Varies with device']
*************************
Simp

*************************
['Redfin Real Estate', 'HOUSE_AND_HOME', '4.6', '36857', '19M', '1,000,000+', 'Free', '0', 'Everyone', 'House & Home', 'July 25, 2018', '220.0', '5.0 and up']
['Redfin Real Estate', 'HOUSE_AND_HOME', '4.6', '36857', '19M', '1,000,000+', 'Free', '0', 'Everyone', 'House & Home', 'July 25, 2018', '220.0', '5.0 and up']
*************************
realestate.com.au - Buy, Rent & Sell Property: 2
*************************
['realestate.com.au - Buy, Rent & Sell Property', 'HOUSE_AND_HOME', '3.8', '14653', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'House & Home', 'July 16, 2018', 'Varies with device', 'Varies with device']
['realestate.com.au - Buy, Rent & Sell Property', 'HOUSE_AND_HOME', '3.8', '14657', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'House & Home', 'July 16, 2018', 'Varies with device', 'Varies with device']
*************************
Mortgage by Zillow: Calculator & Rates: 2
*************************
['Mortgage by Zil

*************************
['Super Jim Jump - pixel 3d', 'GAME', '4.5', '10393', '18M', '1,000,000+', 'Free', '0', 'Everyone', 'Arcade', 'July 11, 2018', '2.2.3181', '4.0 and up']
['Super Jim Jump - pixel 3d', 'GAME', '4.5', '10434', '18M', '1,000,000+', 'Free', '0', 'Everyone', 'Arcade', 'July 11, 2018', '2.2.3181', '4.0 and up']
['Super Jim Jump - pixel 3d', 'GAME', '4.5', '10460', '18M', '1,000,000+', 'Free', '0', 'Everyone', 'Arcade', 'July 11, 2018', '2.2.3181', '4.0 and up']
*************************
8 Ball Pool: 7
*************************
['8 Ball Pool', 'GAME', '4.5', '14198297', '52M', '100,000,000+', 'Free', '0', 'Everyone', 'Sports', 'July 31, 2018', '4.0.0', '4.0.3 and up']
['8 Ball Pool', 'GAME', '4.5', '14198602', '52M', '100,000,000+', 'Free', '0', 'Everyone', 'Sports', 'July 31, 2018', '4.0.0', '4.0.3 and up']
['8 Ball Pool', 'GAME', '4.5', '14200344', '52M', '100,000,000+', 'Free', '0', 'Everyone', 'Sports', 'July 31, 2018', '4.0.0', '4.0.3 and up']
['8 Ball Pool', 'GA

*************************
['Cooking Fever', 'GAME', '4.5', '3197865', '82M', '100,000,000+', 'Free', '0', 'Everyone', 'Arcade', 'July 12, 2018', '2.8.0', '4.0.3 and up']
['Cooking Fever', 'GAME', '4.5', '3198176', '82M', '100,000,000+', 'Free', '0', 'Everyone', 'Arcade', 'July 12, 2018', '2.8.0', '4.0.3 and up']
*************************
Toon Blast: 3
*************************
['Toon Blast', 'GAME', '4.7', '1351068', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Puzzle', 'July 30, 2018', '3196', '4.1 and up']
['Toon Blast', 'GAME', '4.7', '1351089', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Puzzle', 'July 30, 2018', '3196', '4.1 and up']
['Toon Blast', 'GAME', '4.7', '1351771', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Puzzle', 'July 30, 2018', '3196', '4.1 and up']
*************************
Score! Hero: 3
*************************
['Score! Hero', 'GAME', '4.6', '5418675', '96M', '100,000,000+', 'Free', '0', 'Everyone', 'Sports

*************************
['Honkai Impact 3rd', 'GAME', '4.7', '59017', '82M', '1,000,000+', 'Free', '0', 'Teen', 'Action', 'July 3, 2018', '2.2.1', '4.3 and up']
['Honkai Impact 3rd', 'GAME', '4.7', '59017', '82M', '1,000,000+', 'Free', '0', 'Teen', 'Action', 'July 3, 2018', '2.2.1', '4.3 and up']
*************************
The Game of Life: 2
*************************
['The Game of Life', 'GAME', '4.4', '18621', '63M', '100,000+', 'Paid', '$2.99', 'Everyone', 'Board', 'July 4, 2018', '2.1.2', '4.4 and up']
['The Game of Life', 'GAME', '4.4', '18652', '63M', '100,000+', 'Paid', '$2.99', 'Everyone', 'Board', 'July 4, 2018', '2.1.2', '4.4 and up']
*************************
Angry Birds 2: 2
*************************
['Angry Birds 2', 'GAME', '4.6', '3883589', '57M', '100,000,000+', 'Free', '0', 'Everyone', 'Casual', 'July 26, 2018', '2.21.1', '4.1 and up']
['Angry Birds 2', 'FAMILY', '4.6', '3881752', '57M', '100,000,000+', 'Free', '0', 'Everyone', 'Casual', 'July 26, 2018', '2.21.1', '4.

*************************
['Candy Bomb', 'GAME', '4.4', '42145', '20M', '10,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 4, 2018', '2.9.3181', '4.0.3 and up']
['Candy Bomb', 'FAMILY', '4.4', '42145', '20M', '10,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 4, 2018', '2.9.3181', '4.0.3 and up']
*************************
Jetpack Joyride: 2
*************************
['Jetpack Joyride', 'GAME', '4.4', '4638163', '96M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 19, 2018', '1.10.12', '4.1 and up']
['Jetpack Joyride', 'GAME', '4.4', '4637439', '96M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Arcade', 'July 19, 2018', '1.10.12', '4.1 and up']
*************************
Racing in Car 2: 2
*************************
['Racing in Car 2', 'GAME', '4.3', '234110', '38M', '50,000,000+', 'Free', '0', 'Everyone', 'Racing', 'November 22, 2016', '1.0', '2.3.3 and up']
['Racing in Car 2', 'GAME', '4.3', '234589', '38M', '50,000,000+', 'Free', '0', 

*************************
['Elmo Calls by Sesame Street', 'FAMILY', '3.9', '6903', '25M', '1,000,000+', 'Free', '0', 'Everyone', 'Educational;Pretend Play', 'January 31, 2018', '2.0.7', '2.3 and up']
['Elmo Calls by Sesame Street', 'FAMILY', '3.9', '6903', '25M', '1,000,000+', 'Free', '0', 'Everyone', 'Educational;Pretend Play', 'January 31, 2018', '2.0.7', '2.3 and up']
['Elmo Calls by Sesame Street', 'FAMILY', '3.9', '6903', '25M', '1,000,000+', 'Free', '0', 'Everyone', 'Educational;Pretend Play', 'January 31, 2018', '2.0.7', '2.3 and up']
*************************
Sago Mini Friends: 3
*************************
['Sago Mini Friends', 'FAMILY', '4.4', '13155', '83M', '1,000,000+', 'Free', '0', 'Everyone', 'Education;Pretend Play', 'June 16, 2016', '1.3', '4.0.3 and up']
['Sago Mini Friends', 'FAMILY', '4.4', '13155', '83M', '1,000,000+', 'Free', '0', 'Everyone', 'Education;Pretend Play', 'June 16, 2016', '1.3', '4.0.3 and up']
['Sago Mini Friends', 'FAMILY', '4.4', '13155', '83M', '1,0

*************************
['Paramedic Protocol Provider', 'MEDICAL', '4.5', '171', '20M', '10,000+', 'Paid', '$10.00', 'Everyone 10+', 'Medical', 'September 21, 2017', '1.8.3', '4.1 and up']
['Paramedic Protocol Provider', 'MEDICAL', '4.5', '171', '20M', '10,000+', 'Paid', '$10.00', 'Everyone 10+', 'Medical', 'September 21, 2017', '1.8.3', '4.1 and up']
*************************
Medical ID - In Case of Emergency (ICE): 2
*************************
['Medical ID - In Case of Emergency (ICE)', 'MEDICAL', '4.6', '717', '5.4M', '5,000+', 'Paid', '$5.99', 'Everyone', 'Medical', 'May 31, 2018', '6.5.0', '5.0 and up']
['Medical ID - In Case of Emergency (ICE)', 'MEDICAL', '4.6', '717', '5.4M', '5,000+', 'Paid', '$5.99', 'Everyone', 'Medical', 'May 31, 2018', '6.5.0', '5.0 and up']
*************************
Human Anatomy Atlas 2018: Complete 3D Human Body: 3
*************************
['Human Anatomy Atlas 2018: Complete 3D Human Body', 'MEDICAL', '4.5', '2921', '25M', '100,000+', 'Paid', '$24.99

*************************
['Banfield Pet Health Tracker', 'MEDICAL', '4.2', '1747', '5.9M', '100,000+', 'Free', '0', 'Everyone', 'Medical', 'October 17, 2016', '1.2.2', '4.0 and up']
['Banfield Pet Health Tracker', 'MEDICAL', '4.2', '1747', '5.9M', '100,000+', 'Free', '0', 'Everyone', 'Medical', 'October 17, 2016', '1.2.2', '4.0 and up']
*************************
Nurse Grid: 2
*************************
['Nurse Grid', 'MEDICAL', '4.5', '1686', '15M', '100,000+', 'Free', '0', 'Everyone', 'Medical', 'July 30, 2018', '2.9', '4.4 and up']
['Nurse Grid', 'MEDICAL', '4.5', '1686', '15M', '100,000+', 'Free', '0', 'Everyone', 'Medical', 'July 30, 2018', '2.9', '4.4 and up']
*************************
EMT-B Pocket Prep: 2
*************************
['EMT-B Pocket Prep', 'MEDICAL', '4.5', '2951', '16M', '50,000+', 'Free', '0', 'Everyone', 'Medical', 'July 11, 2018', '4.5.2', '4.4 and up']
['EMT-B Pocket Prep', 'MEDICAL', '4.5', '2948', '16M', '50,000+', 'Free', '0', 'Everyone', 'Medical', 'July 11,

*************************
['LiveMe - Video chat, new friends, and make money', 'SOCIAL', '4.5', '457197', 'Varies with device', '10,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['LiveMe - Video chat, new friends, and make money', 'SOCIAL', '4.5', '456866', 'Varies with device', '10,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
*************************
Amino: Communities and Chats: 2
*************************
['Amino: Communities and Chats', 'SOCIAL', '4.8', '1259075', '62M', '10,000,000+', 'Free', '0', 'Teen', 'Social', 'July 20, 2018', '1.8.19179', '4.0.3 and up']
['Amino: Communities and Chats', 'SOCIAL', '4.8', '1264084', '62M', '10,000,000+', 'Free', '0', 'Teen', 'Social', 'August 7, 2018', '1.8.19179', '4.0.3 and up']
*************************
Phone Tracker : Family Locator: 2
*************************
['Phone Tracker : Family Locator', 'SOCIAL', '4.3', '231325', '

*************************
['We Heart It', 'SOCIAL', '4.5', '637309', 'Varies with device', '10,000,000+', 'Free', '0', 'Teen', 'Social', 'July 18, 2018', 'Varies with device', 'Varies with device']
['We Heart It', 'SOCIAL', '4.5', '637254', 'Varies with device', '10,000,000+', 'Free', '0', 'Teen', 'Social', 'July 18, 2018', 'Varies with device', 'Varies with device']
*************************
Couple - Relationship App: 2
*************************
['Couple - Relationship App', 'SOCIAL', '4.0', '33249', '8.4M', '1,000,000+', 'Free', '0', 'Everyone', 'Social', 'March 5, 2015', '1.8.0', '2.3 and up']
['Couple - Relationship App', 'SOCIAL', '4.0', '33249', '8.4M', '1,000,000+', 'Free', '0', 'Everyone', 'Social', 'March 5, 2015', '1.8.0', '2.3 and up']
*************************
POF Free Dating App: 3
*************************
['POF Free Dating App', 'SOCIAL', '4.2', '1175794', 'Varies with device', '50,000,000+', 'Free', '0', 'Mature 17+', 'Social', 'July 31, 2018', 'Varies with device', 'Va

*************************
['LivingSocial - Local Deals', 'SHOPPING', '4.1', '28523', '29M', '5,000,000+', 'Free', '0', 'Everyone', 'Shopping', 'August 3, 2018', '18.10.157066', '4.4 and up']
['LivingSocial - Local Deals', 'SHOPPING', '4.1', '28523', '29M', '5,000,000+', 'Free', '0', 'Everyone', 'Shopping', 'August 3, 2018', '18.10.157066', '4.4 and up']
['LivingSocial - Local Deals', 'SHOPPING', '4.1', '28523', '29M', '5,000,000+', 'Free', '0', 'Everyone', 'Shopping', 'August 3, 2018', '18.10.157066', '4.4 and up']
*************************
Modcloth – Unique Indie Women's Fashion & Style: 2
*************************
["Modcloth – Unique Indie Women's Fashion & Style", 'SHOPPING', '4.2', '5121', '8.8M', '500,000+', 'Free', '0', 'Everyone', 'Shopping', 'July 13, 2018', '9.0.30', '4.1 and up']
["Modcloth – Unique Indie Women's Fashion & Style", 'SHOPPING', '4.2', '5121', '8.8M', '500,000+', 'Free', '0', 'Everyone', 'Shopping', 'July 13, 2018', '9.0.30', '4.1 and up']
**********************

*************************
['HD Camera for Android', 'PHOTOGRAPHY', '4.1', '351254', '4.0M', '10,000,000+', 'Free', '0', 'Everyone', 'Photography', 'June 29, 2018', '4.6.0.0', '4.2 and up']
['HD Camera for Android', 'PHOTOGRAPHY', '4.1', '351255', '4.0M', '10,000,000+', 'Free', '0', 'Everyone', 'Photography', 'June 29, 2018', '4.6.0.0', '4.2 and up']
*************************
Photo Editor Pro: 2
*************************
['Photo Editor Pro', 'PHOTOGRAPHY', '4.3', '1871416', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Photography', 'December 21, 2017', 'Varies with device', 'Varies with device']
['Photo Editor Pro', 'PHOTOGRAPHY', '4.3', '1871421', 'Varies with device', '100,000,000+', 'Free', '0', 'Everyone', 'Photography', 'December 21, 2017', 'Varies with device', 'Varies with device']
*************************
Camera MX - Free Photo & Video Camera: 2
*************************
['Camera MX - Free Photo & Video Camera', 'PHOTOGRAPHY', '4.3', '244371', 'Varies with de

*************************
['BBC Sport', 'SPORTS', '4.2', '18678', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Sports', 'April 25, 2018', 'Varies with device', 'Varies with device']
['BBC Sport', 'SPORTS', '4.2', '18679', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Sports', 'April 25, 2018', 'Varies with device', 'Varies with device']
*************************
All Football - Latest News & Videos: 2
*************************
['All Football - Latest News & Videos', 'SPORTS', '4.6', '152867', '17M', '10,000,000+', 'Free', '0', 'Everyone', 'Sports', 'August 3, 2018', '3.0.5', '4.1 and up']
['All Football - Latest News & Videos', 'SPORTS', '4.6', '152653', '17M', '10,000,000+', 'Free', '0', 'Everyone', 'Sports', 'August 3, 2018', '3.0.5', '4.1 and up']
*************************
Premier League - Official App: 2
*************************
['Premier League - Official App', 'SPORTS', '4.3', '63580', '24M', '5,000,000+', 'Free', '0', 'Everyone', 'Sports', 'July 

*************************
['Skyscanner', 'TRAVEL_AND_LOCAL', '4.5', '481545', '29M', '10,000,000+', 'Free', '0', 'Everyone', 'Travel & Local', 'August 6, 2018', '5.48', '4.4 and up']
['Skyscanner', 'TRAVEL_AND_LOCAL', '4.5', '481546', '29M', '10,000,000+', 'Free', '0', 'Everyone', 'Travel & Local', 'August 6, 2018', '5.48', '4.4 and up']
['Skyscanner', 'TRAVEL_AND_LOCAL', '4.5', '481546', '29M', '10,000,000+', 'Free', '0', 'Everyone', 'Travel & Local', 'August 6, 2018', '5.48', '4.4 and up']
['Skyscanner', 'TRAVEL_AND_LOCAL', '4.5', '481546', '29M', '10,000,000+', 'Free', '0', 'Everyone', 'Travel & Local', 'August 6, 2018', '5.48', '4.4 and up']
['Skyscanner', 'TRAVEL_AND_LOCAL', '4.5', '481546', '29M', '10,000,000+', 'Free', '0', 'Everyone', 'Travel & Local', 'August 6, 2018', '5.48', '4.4 and up']
*************************
Hotels.com: Book Hotel Rooms & Find Vacation Deals: 2
*************************
['Hotels.com: Book Hotel Rooms & Find Vacation Deals', 'TRAVEL_AND_LOCAL', '4.5', '

*************************
['Nova Launcher', 'PERSONALIZATION', '4.6', '1121805', 'Varies with device', '50,000,000+', 'Free', '0', 'Everyone', 'Personalization', 'May 14, 2018', 'Varies with device', 'Varies with device']
['Nova Launcher', 'PERSONALIZATION', '4.6', '1121805', 'Varies with device', '50,000,000+', 'Free', '0', 'Everyone', 'Personalization', 'May 14, 2018', 'Varies with device', 'Varies with device']
*************************
ZEDGE™ Ringtones & Wallpapers: 4
*************************
['ZEDGE™ Ringtones & Wallpapers', 'PERSONALIZATION', '4.6', '6466641', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Personalization', 'July 19, 2018', 'Varies with device', 'Varies with device']
['ZEDGE™ Ringtones & Wallpapers', 'PERSONALIZATION', '4.6', '6466641', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Personalization', 'July 19, 2018', 'Varies with device', 'Varies with device']
['ZEDGE™ Ringtones & Wallpapers', 'PERSONALIZATION', '4.6', '6466641', 'Varie

*************************
['QR Scanner & Barcode Scanner 2018', 'PRODUCTIVITY', '3.8', '8226', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'July 20, 2018', '1.8.4', '4.4 and up']
['QR Scanner & Barcode Scanner 2018', 'PRODUCTIVITY', '3.8', '8211', '4.1M', '5,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'July 20, 2018', '1.8.4', '4.4 and up']
*************************
Chrome Beta: 2
*************************
['Chrome Beta', 'PRODUCTIVITY', '4.4', '228794', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'August 6, 2018', '68.0.3440.91', 'Varies with device']
['Chrome Beta', 'PRODUCTIVITY', '4.4', '228755', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'July 19, 2018', '68.0.3440.70', 'Varies with device']
*************************
Microsoft Outlook: 2
*************************
['Microsoft Outlook', 'PRODUCTIVITY', '4.3', '3252896', '50M', '100,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'Aug

*************************
['Planner Pro-Personal Organizer', 'PRODUCTIVITY', '4.0', '10270', '8.4M', '1,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'March 23, 2018', '4.4.1', '4.0 and up']
['Planner Pro-Personal Organizer', 'PRODUCTIVITY', '4.0', '10270', '8.4M', '1,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'March 23, 2018', '4.4.1', '4.0 and up']
*************************
Todoist: To-do lists for task management & errands: 3
*************************
['Todoist: To-do lists for task management & errands', 'PRODUCTIVITY', '4.5', '155999', '12M', '10,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'August 5, 2018', '12.8.2', '4.4 and up']
['Todoist: To-do lists for task management & errands', 'PRODUCTIVITY', '4.5', '155999', '12M', '10,000,000+', 'Free', '0', 'Everyone', 'Productivity', 'August 5, 2018', '12.8.2', '4.4 and up']
['Todoist: To-do lists for task management & errands', 'PRODUCTIVITY', '4.5', '155998', '12M', '10,000,000+', 'Free', '0', 'Everyone', 'Prod

*************************
['Fox News – Breaking News, Live Video & News Alerts', 'NEWS_AND_MAGAZINES', '4.5', '249919', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Fox News – Breaking News, Live Video & News Alerts', 'NEWS_AND_MAGAZINES', '4.5', '249919', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']
['Fox News – Breaking News, Live Video & News Alerts', 'NEWS_AND_MAGAZINES', '4.5', '249919', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'News & Magazines', 'July 30, 2018', 'Varies with device', 'Varies with device']
*************************
BBC News: 3
*************************
['BBC News', 'NEWS_AND_MAGAZINES', '4.3', '296781', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone 10+', 'News & Magazines', 'July 24, 2018', 'Varies with device', 'Varies wi

*************************
['Transit: Real-Time Transit App', 'MAPS_AND_NAVIGATION', '4.2', '43269', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Maps & Navigation', 'July 18, 2018', '4.4.7', 'Varies with device']
['Transit: Real-Time Transit App', 'MAPS_AND_NAVIGATION', '4.2', '43252', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Maps & Navigation', 'July 18, 2018', '4.4.7', 'Varies with device']
*************************
Mapy.cz - Cycling & Hiking offline maps: 2
*************************
['Mapy.cz - Cycling & Hiking offline maps', 'MAPS_AND_NAVIGATION', '4.5', '56443', '43M', '1,000,000+', 'Free', '0', 'Everyone', 'Maps & Navigation', 'June 26, 2018', '6.2.0', '4.1 and up']
['Mapy.cz - Cycling & Hiking offline maps', 'MAPS_AND_NAVIGATION', '4.5', '56409', '43M', '1,000,000+', 'Free', '0', 'Everyone', 'Maps & Navigation', 'June 26, 2018', '6.2.0', '4.1 and up']
*************************
Uber: 2
*************************
['Uber', 'MAPS_AND_NAVIGATION',

### How duplicates will be removed

#### App Store Data

No Duplicates Exist

#### Google Play Store

The review count is the column that consistently differs between duplicate rows.  Per the docs, the row with the most reviews is the one to keep, so we will delete all duplicate rows except the one with the highest number of reviews (column 4, index 3).
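
As a compact illustration of the "keep the row with the most reviews" rule, here is a minimal sketch on a few shortened rows taken from the duplicate listing above. The helper name `keep_most_reviewed` and the four-column rows are assumptions made for the example, not part of the notebook's own code.

```python
def keep_most_reviewed(rows, name_idx=0, review_idx=3):
    """Return one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        # review counts are strings, so convert before comparing
        if name not in best or float(row[review_idx]) > float(best[name][review_idx]):
            best[name] = row
    return list(best.values())


# shortened sample rows: [name, category, rating, reviews]
sample_rows = [
    ['BBC Sport', 'SPORTS', '4.2', '18678'],
    ['BBC Sport', 'SPORTS', '4.2', '18679'],
    ['Nova Launcher', 'PERSONALIZATION', '4.6', '1121805'],
    ['Nova Launcher', 'PERSONALIZATION', '4.6', '1121805'],
]

deduped = keep_most_reviewed(sample_rows)
for row in deduped:
    print(row)
```

Rows that tie on review count (the two Nova Launcher rows) collapse to the first one seen. Unlike the two-pass function used in this notebook, this sketch orders the result by each name's first appearance.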

In [133]:
# remove duplicate rows, keeping the row with the most reviews (used for the Google Play data)
def del_duplicates(app_data, name_idx, review_idx, return_count=False, header=False):
    '''
    Keep one row per app name: the first row carrying that app's highest review count.
    Review counts are stored as strings, so they are converted to float before comparing.
    Some duplicate rows share the same review count; those ties are dropped as well,
    but del_ctr only counts rows with strictly fewer reviews than the maximum.
    '''
    start = 1 if header else 0

    # first pass: highest review count seen for each app name
    max_reviews = {}
    for app in app_data[start:]:
        app_name = app[name_idx]
        n_reviews = float(app[review_idx])
        if app_name not in max_reviews or max_reviews[app_name] < n_reviews:
            max_reviews[app_name] = n_reviews

    # second pass: keep the first row that matches the maximum for its name
    unique_app_data = []
    seen_names = set()
    del_ctr = 0
    for app in app_data[start:]:
        app_name = app[name_idx]
        n_reviews = float(app[review_idx])

        if n_reviews < max_reviews[app_name]:
            del_ctr += 1
        elif app_name not in seen_names:
            # rows tied on review count: only the first one is kept
            unique_app_data.append(app)
            seen_names.add(app_name)

    if return_count:
        return unique_app_data, del_ctr

    return unique_app_data


In [134]:
explore_data(google_play_data, 0 , 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In [135]:
non_dup_google_data, deleted_google_data = del_duplicates(google_play_data, GP_NAME_IDX, GP_REVIEW_IDX, True, True)

print('Removing Google Duplicates')
print ('*' * 25)
print('Deleted/Ignored Rows: ' + str(deleted_google_data))
print('Clean Nbr Apps: ' + str(len(non_dup_google_data)))

Removing Google Duplicates
*************************
Deleted/Ignored Rows: 786
Clean Nbr Apps: 9659


### Keep Only English Apps

In [136]:
def filter_english_apps(app_data, name_idx, header = False):
    '''
    Treat an app as English unless its name has more than 3 characters
    outside the ASCII range (code point > 127). This tolerates a few
    emoji or symbols such as ™ in an otherwise English name.
    '''
    start = 1 if header else 0

    english_apps = []

    for app in app_data[start:]:
        app_name = app[name_idx]
        # count the characters that fall outside the ASCII range
        non_ascii_ctr = sum(1 for c in app_name if ord(c) > 127)
        if non_ascii_ctr <= 3:
            english_apps.append(app)

    return english_apps

In [137]:
eng_app_store_data = filter_english_apps(app_store_data, AS_NAME_IDX, True)

print('Cleaning App Store Data')
print('*' * 25)
print('Non Duplicate, English App Count: ' + str(len(eng_app_store_data)))

Cleaning App Store Data
*************************
Non Duplicate, English App Count: 6183


In [138]:
eng_google_play_data = filter_english_apps(non_dup_google_data, GP_NAME_IDX)

print('Cleaning Google Play Data - English Apps')
print('*' * 25)
print('Non Duplicate, English App Count: ' + str(len(eng_google_play_data)))

Cleaning Google Play Data - English Apps
*************************
Non Duplicate, English App Count: 9614


### Keep Only Free Apps
Our company only makes free, ad-supported apps, so paid apps are excluded from the comparison.

In [139]:
def filter_free_apps(app_data, price_idx, header = False):
    if header:
        start = 1
    else:
        start = 0
    
    free_apps = []
    
    for app in app_data[start:]:
        cost = app[price_idx].replace('$', '').replace(',', '')
        if float(cost) == 0:
            free_apps.append(app)
   
    return free_apps

In [140]:
clean_app_store_data = filter_free_apps(eng_app_store_data, AS_COST_IDX)

print('Cleaning App Store Data - Free Apps')
print('*' * 25)
print('App Count: ' + str(len(clean_app_store_data)))

Cleaning App Store Data - Free Apps
*************************
App Count: 3222


In [141]:
clean_google_play_data = filter_free_apps(eng_google_play_data, GP_COST_IDX)

print('Cleaning Google Play Data - Free Apps')
print('*' * 25)
print('App Count: ' + str(len(clean_google_play_data)))

Cleaning Google Play Data - Free Apps
*************************
App Count: 8864


## Start Analysis

### Frequency Table of App Types

Since our company wants apps published on both the App Store and Google Play, we need to find the most popular free apps on *both* platforms, not just a single one, to maximize our profit.

#### Data to Review
- App Store: column 12, index 11 (prime genre)
- Google Play: column 2, index 1 (Category) and column 10, index 9 (Genres)

clean_app_store_data and clean_google_play_data are the cleaned datasets we should now be working with.
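
A frequency table of this kind can also be sketched with `collections.Counter` from the standard library. The function name `genre_percentages` and the toy rows below are assumptions for illustration; the notebook's own `freq_table` and `display_table` follow.

```python
from collections import Counter


def genre_percentages(rows, idx):
    """Return (value, percentage) pairs for column idx, most common first."""
    counts = Counter(row[idx] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# toy rows of the form [name, genre]; illustrative only
toy_rows = [
    ['app 1', 'Games'],
    ['app 2', 'Games'],
    ['app 3', 'Book'],
    ['app 4', 'Games'],
]

table = genre_percentages(toy_rows, 1)
for genre, pct in table:
    print(genre, ':', pct)
```

With the real data the call would take one of the cleaned datasets and the matching index constant, for example `AS_GENRE_IDX`.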

In [142]:
def display_table(dataset_dict):
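    '''
    Function provided by DataQuest, modified to be more generally useful.
    Prints each key with its value as a percentage of the sum of all values,
    sorted from largest to smallest.
    '''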
    table_display = []
    total = 0
    for key in dataset_dict:
        total += dataset_dict[key]
        key_val_as_tuple = (dataset_dict[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', (entry[0]/total)*100)
        
def freq_table(dataset, idx):
    freq_dict = {}
    
    for row in dataset:
        if row[idx] in freq_dict:
            freq_dict[row[idx]] += 1
        else:
            freq_dict[row[idx]] = 1
    return freq_dict

In [143]:
freq_dict = freq_table(clean_app_store_data, AS_GENRE_IDX)
display_table(freq_dict)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


### App Store Category Analysis

Games are by far the most common free apps on the App Store.  Entertainment-oriented genres (Entertainment, Photo & Video, Sports, Music) add up to roughly 17% of free apps, a distant second that is comparable to more practical apps (Utilities, Shopping, etc.) as a group.

Having a large number of apps in a genre does not mean there are a large number of users of those apps.  More analysis would be needed to see how many users are using each genre.

In [144]:
gp_freq_dict = freq_table(clean_google_play_data, GP_GENRE_IDX)
display_table(gp_freq_dict)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

### Google Play Prime Category Analysis

Apps categorized as Family are by far the most common free apps on Google Play.  I'm not sure what that category really covers and would need to look into it.

Tools is the most common genre.

It's hard to analyze the Google Play data without looking more closely at the numbers, since it is unclear how the categories compare to the App Store genres and how the Google Play genres and categories work together.  Tools has the highest genre count while Family is the highest category.  It is possible there are no Family apps with the genre Tools.  More analysis is needed.
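
One way to check how categories and genres overlap is to count each (category, genre) pair. This is a minimal sketch under the assumption that every row carries one category and one genre string; `category_genre_counts` and the toy rows are hypothetical and not part of the notebook.

```python
from collections import Counter


def category_genre_counts(rows, category_idx, genre_idx):
    """Count how many rows carry each (category, genre) pair."""
    return Counter((row[category_idx], row[genre_idx]) for row in rows)


# toy rows of the form [name, category, genre]; values are illustrative only
toy_rows = [
    ['app 1', 'FAMILY', 'Education'],
    ['app 2', 'FAMILY', 'Casual'],
    ['app 3', 'TOOLS', 'Tools'],
    ['app 4', 'FAMILY', 'Education'],
]

pairs = category_genre_counts(toy_rows, category_idx=1, genre_idx=2)
print(pairs[('FAMILY', 'Education')])  # 2
print(pairs[('FAMILY', 'Tools')])      # 0: no such pair in the toy rows
```

On the cleaned data the same call would use `clean_google_play_data`, `GP_CATEGORY_IDX` and `GP_GENRE_IDX`. A count of zero for `('FAMILY', 'Tools')` would confirm that no Family app carries the Tools genre.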




In [146]:
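def average_rating_by_category(dataset, rating_idx, genre_idx):
    '''
    Average the numeric column at rating_idx for each genre.
    With AS_RATING_COUNT_IDX this is the average number of user ratings per app.
    '''
    freq_dict = freq_table(dataset, genre_idx)
    total_dict = {}

    # single pass: sum the column per genre
    for app in dataset:
        genre = app[genre_idx]
        total_dict[genre] = total_dict.get(genre, 0) + float(app[rating_idx])

    avg_rating_dict = {}
    for genre in freq_dict:
        avg_rating_dict[genre] = total_dict[genre] / freq_dict[genre]

    return avg_rating_dict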


In [147]:
avg_category_dict = average_rating_by_category(clean_app_store_data, AS_RATING_COUNT_IDX, AS_GENRE_IDX)
# display_table prints each genre's average as a percentage of the sum of all averages
display_table(avg_category_dict)

Navigation : 12.124787887920634
Reference : 10.55469488748488
Social Networking : 10.0767243249404
Music : 8.073752224694571
Weather : 7.362994045356322
Book : 5.599506478565839
Food & Drink : 4.694682098802674
Finance : 4.4318814538731734
Photo & Video : 4.005649319982865
Travel : 3.9777994914123482
Shopping : 3.79131459241208
Health & Fitness : 3.2812452201134454
Sports : 3.2405265917840644
Games : 3.2095100059071817
News : 2.992528487685598
Productivity : 2.9615987028833115
Utilities : 2.6314809875820204
Lifestyle : 2.3218216551102353
Entertainment : 1.975932893502812
Business : 1.055033811547336
Education : 0.9864267633081144
Catalogs : 0.5639152367462963
Medical : 0.08619283838379953


### App Store Recommendation

General-use apps (Navigation, Reference, etc.) have some of the highest user rating counts and still generally fall around the 50th percentile.  This is probably a good place to make an app, as the App Store is not flooded with these apps (at least in our data snapshot) and users are rating them.
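
Because `display_table` divides each value by the total of all values, the figures in the table above are shares of the summed averages rather than rating counts themselves. A minimal sketch of a helper that lists raw values instead; `sorted_values` and the toy numbers are assumptions for illustration only.

```python
def sorted_values(values_dict):
    """Return (key, value) pairs sorted by value, largest first, without rescaling."""
    return sorted(values_dict.items(), key=lambda item: item[1], reverse=True)


# toy averages, purely illustrative (not taken from the datasets)
toy_avg_rating_counts = {'Genre A': 1200.0, 'Genre B': 86000.0, 'Genre C': 31000.0}

for genre, avg in sorted_values(toy_avg_rating_counts):
    print(genre, ':', round(avg))
```

Passing `avg_category_dict` to such a helper would show the average number of ratings per app in each genre directly.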

In [148]:
def total_installs(dataset, install_idx, category_idx):
    '''
    Return (total installs, average installs) per category.
    Apps with more than 100 million installs are left out of the totals to reduce
    skew, but they still count toward the number of apps the average divides by.
    '''
    freq_dict = freq_table(dataset, category_idx)
    total_install_dict = {category: 0 for category in freq_dict}

    for app in dataset:
        # install counts look like '1,000,000+'
        installs = float(app[install_idx].replace('+', '').replace(',', ''))
        # ignore apps with more than 100 million installs to reduce install skew
        if installs > 100000000:
            continue
        total_install_dict[app[category_idx]] += installs

    avg_install_dict = {}
    for category in freq_dict:
        avg_install_dict[category] = total_install_dict[category] / freq_dict[category]

    return total_install_dict, avg_install_dict


In [149]:
install_dict, avg_install_dict = total_installs(clean_google_play_data, GP_INSTALLS_IDX, GP_CATEGORY_IDX)
# display_table prints each category's average as a percentage of the sum of all averages
display_table(avg_install_dict)

PHOTOGRAPHY : 10.619907107714939
GAME : 9.178803917418241
ENTERTAINMENT : 8.82475025328021
VIDEO_PLAYERS : 6.826335506994931
COMMUNICATION : 6.701092615065035
PRODUCTIVITY : 6.134254102418642
SHOPPING : 5.334615053852674
SOCIAL : 4.779403832514364
TOOLS : 4.6506984453681195
PERSONALIZATION : 3.9432131920304667
WEATHER : 3.8469379611889023
TRAVEL_AND_LOCAL : 3.2766660569105808
MAPS_AND_NAVIGATION : 3.0755435547625924
SPORTS : 2.7584315630680365
BOOKS_AND_REFERENCE : 2.656857460941885
FAMILY : 2.3493207584529676
HEALTH_AND_FITNESS : 1.787070460757275
ART_AND_DESIGN : 1.5058288771837882
FOOD_AND_DRINK : 1.4592535846357082
EDUCATION : 1.3899618214153613
BUSINESS : 1.2980770294215918
NEWS_AND_MAGAZINES : 1.1255137344742832
LIFESTYLE : 1.0900000060456883
FINANCE : 1.052001455012876
HOUSE_AND_HOME : 1.0094330212769
DATING : 0.6474342031053786
COMICS : 0.6198611405116877
AUTO_AND_VEHICLES : 0.4907278070507012
LIBRARIES_AND_DEMO : 0.484045903536828
PARENTING : 0.41134459434894033
BEAUTY : 0.389

In [150]:
def print_google_by_category(query):
    for app in clean_google_play_data:
        if app[GP_CATEGORY_IDX] == query:
            print(app[GP_NAME_IDX], ':', app[GP_INSTALLS_IDX])

In [151]:
print_google_by_category('PHOTOGRAPHY')

TouchNote: Cards & Gifts : 1,000,000+
FreePrints – Free Photos Delivered : 1,000,000+
Groovebook Photo Books & Gifts : 500,000+
Moony Lab - Print Photos, Books & Magnets ™ : 50,000+
LALALAB prints your photos, photobooks and magnets : 1,000,000+
Snapfish : 1,000,000+
Motorola Camera : 50,000,000+
HD Camera - Best Cam with filters & panorama : 5,000,000+
LightX Photo Editor & Photo Effects : 10,000,000+
Sweet Snap - live filter, Selfie photo edit : 10,000,000+
HD Camera - Quick Snap Photo & Video : 1,000,000+
B612 - Beauty & Filter Camera : 100,000,000+
Waterfall Photo Frames : 1,000,000+
Photo frame : 100,000+
Huji Cam : 5,000,000+
Unicorn Photo : 1,000,000+
HD Camera : 5,000,000+
Makeup Editor -Beauty Photo Editor & Selfie Camera : 1,000,000+
Makeup Photo Editor: Makeup Camera & Makeup Editor : 1,000,000+
Moto Photo Editor : 5,000,000+
InstaBeauty -Makeup Selfie Cam : 50,000,000+
Garden Photo Frames - Garden Photo Editor : 500,000+
Photo Frame : 10,000,000+
Selfie Camera - Photo Edito

In [152]:
print_google_by_category('ENTERTAINMENT')

Complete Spanish Movies : 1,000,000+
Pluto TV - It’s Free TV : 1,000,000+
Mobile TV : 10,000,000+
TV+ : 5,000,000+
Digital TV : 5,000,000+
Motorola Spotlight Player™ : 10,000,000+
Vigo Lite : 5,000,000+
Hotstar : 100,000,000+
Peers.TV: broadcast TV channels First, Match TV, TNT ... : 5,000,000+
The green alien dance : 1,000,000+
Spectrum TV : 5,000,000+
H TV : 5,000,000+
StarTimes - Live International Champions Cup : 1,000,000+
Cinematic Cinematic : 1,000,000+
MEGOGO - Cinema and TV : 10,000,000+
Talking Angela : 100,000,000+
DStv Now : 5,000,000+
ivi - movies and TV shows in HD : 10,000,000+
Radio Javan : 1,000,000+
Talking Ginger 2 : 50,000,000+
Girly Lock Screen Wallpaper with Quotes : 5,000,000+
🔥 Football Wallpapers 4K | Full HD Backgrounds 😍 : 1,000,000+
Movies by Flixster, with Rotten Tomatoes : 10,000,000+
Low Poly – Puzzle art game : 1,000,000+
BBC Media Player : 10,000,000+
Amazon Prime Video : 50,000,000+
Adult Glitter Color by Number Book - Sandbox Pages : 1,000,000+
IMDb M

In [153]:
print_google_by_category('BOOKS_AND_REFERENCE')

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

### Google Play Analysis

Even though the number of apps was pretty evenly spread out, we want to exclude categories that have big players or are already oversaturated.  By those standards, we can discard Tools, Entertainment, Social, Productivity, and Photography.  We can also discard categories such as Music, Shopping and Food, since those generally require some association with a paid service.

Books and Reference would be a good bet: it has a few dominant apps (such as the Bible), but it is not oversaturated or overly dominated.
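
One way to put a number on "overly dominated" is the share of a category's installs held by its single largest app. This is a minimal sketch under that assumption. `parse_installs` and `top_app_share` are hypothetical helper names, and the four rows are copied from the BOOKS_AND_REFERENCE listing above. The install figures are open-ended buckets such as 1,000,000+, so the result is only approximate.

```python
def parse_installs(text):
    """Turn an install bucket such as '1,000,000+' into an int lower bound."""
    return int(text.replace('+', '').replace(',', ''))


def top_app_share(rows, installs_idx=1):
    """Fraction of the summed installs that belongs to the largest app."""
    installs = [parse_installs(row[installs_idx]) for row in rows]
    total = sum(installs)
    if total == 0:
        return 0.0
    return max(installs) / total


# rows of the form [name, installs], taken from the listing above
sample_rows = [
    ['Wikipedia', '10,000,000+'],
    ['Cool Reader', '10,000,000+'],
    ['Book store', '1,000,000+'],
    ['Google Play Books', '1,000,000,000+'],
]

share = top_app_share(sample_rows)
print(round(share, 3))
```

Four rows say little about the whole category. Run over every row of a category, a share close to 1 would point to a single dominant app, while a low share would suggest installs are spread across many apps.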