# Profitable App Profiles for the App Store and Google Play Markets
## Introduction

This project explores App Store and Google Play data to help decide what kind of new app to develop.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We are using data from the following two sources:  
[App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)  
[Play Store](https://www.kaggle.com/lava18/google-play-store-apps)

The following code opens and stores the datasets as lists of lists:

In [2]:
from csv import reader

# Read each CSV file into a list of lists; the files are closed automatically
with open('googleplaystore.csv', encoding='utf8') as playStoreFile:
    playStoreList = list(reader(playStoreFile))
with open('AppleStore.csv', encoding='utf8') as appStoreFile:
    appStoreList = list(reader(appStoreFile))

# Remove the row with missing data (see the dataset's discussion section)
del playStoreList[10473]

# Print header rows
print("Play Store Header")
print(playStoreList[0])
print('\n' + 'App Store Header')
print(appStoreList[0])

#print(playStoreList[10473])

Play Store Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

App Store Header
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The Play Store dataset contains duplicate entries. For example, Instagram appears four times:

In [3]:
for app in playStoreList:
    name = app[0]
    if name == "Instagram":
        print(app)


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


## Counting duplicates

We can count the number of duplicates with the following code:

In [4]:
duplicatePlayStoreApps = []
uniquePlayStoreApps = []
for app in playStoreList[1:]:
    if app[0] in uniquePlayStoreApps:
        duplicatePlayStoreApps.append(app[0])
    else:
        uniquePlayStoreApps.append(app[0])

print(len(duplicatePlayStoreApps))
print(len(uniquePlayStoreApps))
        
    

1181
9659
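
As a rough alternative, the same two numbers can be obtained with `collections.Counter` from the standard library. This is only a sketch on made-up sample rows; `sample_rows`, `name_counts`, `unique_count` and `duplicate_count` are illustrative names that are not used elsewhere in this notebook.

```python
from collections import Counter

# Hypothetical sample rows: [name, category, rating, reviews]
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Maps', 'TRAVEL', '4.3', '1000'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

# Count how many times each app name occurs
name_counts = Counter(row[0] for row in sample_rows)

unique_count = len(name_counts)                    # distinct app names
duplicate_count = len(sample_rows) - unique_count  # extra rows beyond the first

print(duplicate_count)
print(unique_count)
```

A `Counter` looks names up by hash, so it avoids scanning a growing list for every row.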


## Removing duplicate apps

When an app has duplicate entries, we keep the one with the highest number of reviews, since a higher review count most likely corresponds to the most recent entry. We do this by building a dictionary keyed by app name and, for each row of playStoreList, comparing its review count with that of the entry already stored in the dictionary.

We then convert the values of the dictionary into a new list. Note the length of the new list matches the length of the uniquePlayStoreApps list above.

In [5]:
playStoreDict = {}
for app in playStoreList[1:]:
    if app[0] in playStoreDict:
        # Review counts are stored as strings, so compare them as integers
        if int(app[3]) > int(playStoreDict[app[0]][3]):
            playStoreDict[app[0]] = app
    else:
        playStoreDict[app[0]] = app

cleanPlayStoreList = list(playStoreDict.values())
print(len(cleanPlayStoreList))

9659
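
The review counts in the CSV are strings, so the comparison only picks the right row if the values are converted to integers first. The following sketch illustrates this on made-up rows; `sample_rows` and `best_rows` are hypothetical names used only here.

```python
# Hypothetical sample rows: [name, category, rating, reviews]
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '9'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '100'],
]

best_rows = {}
for row in sample_rows:
    name = row[0]
    reviews = int(row[3])  # convert before comparing
    # Keep the row with the largest review count seen so far
    if name not in best_rows or reviews > int(best_rows[name][3]):
        best_rows[name] = row

print(best_rows['Instagram'][3])

# Strings compare character by character, so '9' sorts above '66577446'
print('9' > '66577446')
```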


## Removing non-English apps
In the App Store dataset, some app names contain Chinese characters. We only want to analyze apps aimed at English-speaking users, so we will remove apps whose names contain characters outside the ASCII range (code points greater than 127). First we write a function that takes a string and returns `False` if it contains too many characters outside that range, and `True` otherwise.

Some English apps have names with emoji or other symbols whose code points fall outside the range 0 to 127. To avoid discarding them, we only treat a name as non-English if it has more than three such characters.

In [6]:
def IsEnglish(string):
    charCount = 0
    for character in string:
        if ord(character) > 127:
            charCount += 1
    if charCount > 3:
        return False
    return True

print(IsEnglish("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(IsEnglish('Instachat 😜'))

False
True
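
To see why a threshold of three is a reasonable compromise, it can help to count the non-ASCII characters directly. The helper below is a sketch; `count_non_ascii` is a hypothetical name and is not part of the notebook's code.

```python
def count_non_ascii(name):
    """Return how many characters in name have a code point above 127."""
    return sum(1 for character in name if ord(character) > 127)

# An emoji is a single character, so this name stays under the threshold
print(count_non_ascii('Instachat 😜'))

# A name written mostly in Chinese characters goes well over it
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```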


We will now remove apps that have non-English names from the datasets.

In [7]:
cleanAppStoreList = []
for app in appStoreList[1:]:
    if IsEnglish(app[1]):
        cleanAppStoreList.append(app)

englishPlayStoreList = []
for app in cleanPlayStoreList:
    if IsEnglish(app[0]):
        englishPlayStoreList.append(app)
        
print(len(englishPlayStoreList))
print(len(cleanAppStoreList))

9614
6183


## Removing non-free apps

We will now take the datasets that are free of duplicates and non-English apps and remove the non-free apps.

In [8]:
freeCleanPlayStoreList = []
paidCleanPlayStoreList = []
for app in englishPlayStoreList:
    if app[7] == '0':
        freeCleanPlayStoreList.append(app)
    else:
        paidCleanPlayStoreList.append(app)

freeCleanAppStoreList = []
paidCleanAppStoreList = []
for app in cleanAppStoreList:
    if app[4] == '0.0':
        freeCleanAppStoreList.append(app)
    else:
        paidCleanAppStoreList.append(app)

print(len(freeCleanPlayStoreList))
print(len(freeCleanAppStoreList))


8862
3222
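
The checks above compare price strings exactly (`'0'` and `'0.0'`). A slightly more tolerant approach is to convert the price to a number first. The sketch below assumes paid prices may carry a leading `$` (for example `'$4.99'`), which is an assumption about the data rather than something shown above; `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    # Assumption: the only non-numeric character in a price is a '$' sign
    return float(price_text.replace('$', '')) == 0.0

print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
```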


## Exploring app genres

Our end goal is to develop an ad-supported free app that will be successful on both the Play Store and the App Store. To do this, let's explore which genres are popular on each market by generating a frequency table of genres.

First we need functions to generate and sort frequency tables from our datasets. The following function creates a frequency table of percentages from a list of lists:

In [9]:
def freq_table(dataset, index):
    table = {}  # maps each value in the column to its percentage of rows
    datasetLength = len(dataset)
    for row in dataset:
        if row[index] in table:
            table[row[index]] += (100 / datasetLength)
        else:
            table[row[index]] = 100 / datasetLength

    return table
    

The following function sorts the frequency table generated by `freq_table()` in descending order and prints it.

In [14]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
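
For comparison, here is a minimal sketch of the same idea using `collections.Counter`, run on made-up rows. `percentage_table` and `sample_rows` are hypothetical names. This version counts first and divides once per value instead of adding `100 / datasetLength` on every row.

```python
from collections import Counter

# Hypothetical sample rows: [name, category]
sample_rows = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]

def percentage_table(rows, index):
    """Return a dict mapping each value in column `index` to its percentage."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return {value: count / total * 100 for value, count in counts.items()}

table = percentage_table(sample_rows, 1)

# Sort by percentage, largest first, and print in the same 'key : value' style
for value, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', percentage)
```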

## Frequency tables for genres and categories

The Play Store has tags for both genre and category, while the App Store has a single genre column (`prime_genre`). On the App Store, games and entertainment dominate, showing a bias towards apps for fun. The Play Store is a bit more diverse, with significant representation from both fun and productivity-themed apps.

In [13]:
print('Play Store Categories')
display_table(freeCleanPlayStoreList, 1)
print('\n Play Store Genres')
display_table(freeCleanPlayStoreList, 9)
print('\n App Store prime_genre')
display_table(freeCleanAppStoreList, 11)

Play Store Categories
FAMILY : 18.9347777025507
GAME : 9.69307154141295
TOOLS : 8.451816745655742
BUSINESS : 4.592642744301517
LIFESTYLE : 3.904310539381615
PRODUCTIVITY : 3.893026404874732
FINANCE : 3.7011961182577164
MEDICAL : 3.5206499661475843
SPORTS : 3.3965244865718685
PERSONALIZATION : 3.3175355450236856
COMMUNICATION : 3.238546603475503
HEALTH_AND_FITNESS : 3.080568720379137
PHOTOGRAPHY : 2.945159106296538
NEWS_AND_MAGAZINES : 2.7984653577070557
SOCIAL : 2.6630557436244566
TRAVEL_AND_LOCAL : 2.335815842924842
SHOPPING : 2.245542766869776
BOOKS_AND_REFERENCE : 2.1439855563078267
DATING : 1.861882193635745
VIDEO_PLAYERS : 1.7941773865944455
MAPS_AND_NAVIGATION : 1.3992326788535314
FOOD_AND_DRINK : 1.2412547957571658
EDUCATION : 1.1735499887158662
ENTERTAINMENT : 0.959151433085084
LIBRARIES_AND_DEMO : 0.9365831640713173
AUTO_AND_VEHICLES : 0.9252990295644339
HOUSE_AND_HOME : 0.8237418190024836
WEATHER : 0.8011735499887168
EVENTS : 0.7109004739336499
PARENTING : 0.654479801399233

## Popular Apps on the App Store  
We would like to know the average number of users per app genre. The App Store data has no download counts, so we use the total number of user ratings (`rating_count_tot`) as a proxy.

In [20]:
prime_genreFreqTable = freq_table(freeCleanAppStoreList, 11)
for genre in prime_genreFreqTable:
    totalRatings = 0
    totalApps = 0
    for app in freeCleanAppStoreList:
        if app[11] == genre:
            totalRatings += int(app[5])
            totalApps += 1
    avg = round(totalRatings / totalApps)
    print(genre + ", " + str(avg))

Social Networking, 71548
Photo & Video, 28442
Games, 22789
Music, 57327
Reference, 74942
Health & Fitness, 23298
Weather, 52280
Utilities, 18684
Travel, 28244
Shopping, 26920
News, 21248
Navigation, 86090
Lifestyle, 16486
Entertainment, 14030
Food & Drink, 33334
Sports, 23009
Book, 39758
Finance, 31468
Education, 7004
Productivity, 21028
Business, 7491
Catalogs, 4004
Medical, 612
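
The averages are easier to compare when they are sorted. The sketch below computes per-genre averages in a single pass over made-up data and prints them in descending order; `sample_rows`, `totals`, `counts` and `averages` are illustrative names only.

```python
# Hypothetical sample data: (genre, rating count as a string)
sample_rows = [('Games', '100'), ('Games', '300'), ('Weather', '50')]

totals = {}  # genre -> sum of rating counts
counts = {}  # genre -> number of apps
for genre, rating_count in sample_rows:
    totals[genre] = totals.get(genre, 0) + int(rating_count)
    counts[genre] = counts.get(genre, 0) + 1

averages = {genre: round(totals[genre] / counts[genre]) for genre in totals}

# Print the genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre + ', ' + str(avg))
```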


## App Store Takeaways
Navigation (86,090), Reference (74,942) and Social Networking (71,548) apps have the highest average number of ratings per app. A successful social networking app, however, depends heavily on its existing user base, so it may make more sense to target another genre for a new app. Reference apps have a high rating count per app but a relatively low number of apps, which may make this a good genre to target.

## Popular Apps on the Play Store

The Play Store dataset does include the number of installs, but the values are grouped into ranges rather than given as precise numbers (see below). Because of this, we need to remove characters like ',' and '+' before we can compare the values numerically.

In [21]:
display_table(freeCleanPlayStoreList, 5)

1,000,000+ : 15.741367637102615
100,000+ : 11.55495373504876
10,000,000+ : 10.51681336041546
10,000+ : 10.200857594222716
1,000+ : 8.395396073121324
100+ : 6.91717445271956
5,000,000+ : 6.838185511171374
500,000+ : 5.574362446400399
50,000+ : 4.773188896411656
5,000+ : 4.513653802753331
10+ : 3.543218235161351
500+ : 3.249830737982386
50,000,000+ : 2.290679304897309
100,000,000+ : 2.12141728729406
50+ : 1.9183028661701613
5+ : 0.7898894154818334
1+ : 0.5077860528097492
500,000,000+ : 0.2708192281651996
1,000,000,000+ : 0.22568269013766634
0+ : 0.045136538027533285
0 : 0.011284134506883321
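
A small sketch of that conversion, wrapped in a hypothetical helper named `parse_installs`. The result is the lower bound of each range, not the true install count.

```python
def parse_installs(installs_text):
    """Convert an install range such as '1,000,000+' to its lower bound as an int."""
    return int(installs_text.replace('+', '').replace(',', ''))

print(parse_installs('1,000,000+'))
print(parse_installs('0'))
```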


In [32]:
categoryFreqTable = freq_table(freeCleanPlayStoreList, 1)
installFreqTable = {}

for cat in categoryFreqTable:
    totalInstalls = 0
    totalApps = 0
    for app in freeCleanPlayStoreList:
        if cat == app[1]:
            installs = app[5].replace('+', '').replace(',', '') # remove , and + from string
            totalInstalls += int(installs)
            totalApps += 1
    avg = round(totalInstalls / totalApps)
    print(cat + ', ' + str(avg))
            

ART_AND_DESIGN, 1986335
FAMILY, 3694276
AUTO_AND_VEHICLES, 647318
BEAUTY, 513152
BOOKS_AND_REFERENCE, 8767812
BUSINESS, 1712290
COMICS, 817657
COMMUNICATION, 38456119
DATING, 854029
EDUCATION, 1820673
ENTERTAINMENT, 11640706
EVENTS, 253542
FINANCE, 1387692
FOOD_AND_DRINK, 1924898
HEALTH_AND_FITNESS, 4188822
HOUSE_AND_HOME, 1331541
TOOLS, 10682301
LIBRARIES_AND_DEMO, 638504
LIFESTYLE, 1437816
GAME, 15560966
VIDEO_PLAYERS, 24727872
MEDICAL, 120616
SOCIAL, 23253652
SHOPPING, 7036877
PHOTOGRAPHY, 17805628
SPORTS, 3638640
TRAVEL_AND_LOCAL, 13984078
PERSONALIZATION, 5201483
PRODUCTIVITY, 16787331
PARENTING, 542604
WEATHER, 5074486
NEWS_AND_MAGAZINES, 9549178
MAPS_AND_NAVIGATION, 4056942


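From this list we can see that COMMUNICATION has by far the highest average number of installs (about 38.5 million per app), followed by VIDEO_PLAYERS and SOCIAL. These averages are probably skewed by a few extremely popular apps, so further analysis, such as inspecting the top apps in each category, is needed before drawing deeper conclusions.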
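
One simple way to check for this kind of skew is to compare the mean with the median, which is much less sensitive to outliers. The sketch below uses the standard library's `statistics` module on made-up install counts; `sample_installs` is hypothetical data, not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical install counts for one category: one huge app, several small ones
sample_installs = [1_000_000_000, 10_000, 5_000, 1_000, 500]

category_mean = round(mean(sample_installs))
category_median = median(sample_installs)

print(category_mean)    # pulled up by the single very popular app
print(category_median)  # closer to what a typical app in the category gets
```

A large gap between the two values suggests that a category's average is driven by a handful of apps rather than by typical ones.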