Purpose: Identify app profiles that are profitable (more likely to attract users) for developers to make data-driven decisions in kind of apps they build.

Dataset:
- https://www.kaggle.com/datasets/lava18/google-play-store-apps
- https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps

In [1]:
import csv
with open('AppleStore.csv') as file:
    reader = csv.reader(file)
    apple = list(reader)
    apple_header = apple[0]
    apple = apple[1:]
    
with open('googleplaystore.csv') as file:
    reader = csv.reader(file)
    android = list(reader)
    android_header = android[0]
    android = android[1:]

Function used to display data samples:

In [2]:
def display_data(dataset, start, end, row_and_column=False):
    for row in dataset[start:end]:
        print(row)
        print('\n')
    if row_and_column:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]),'\n')

<b>Remove wrong entry (row 10472):</b><br>
10841 is the original number of data samples in Android dataset.

In [3]:
if len(android) == 10841: #so this row is not deleted more than once
    del android[10472] 

<b>Remove duplicates:</b>
<br>Instead of removing duplicates randomly, we want to keep only the latest data entry, which is indicated by the highest review count.<br> A dictionary can be used to store each app’s name along with its highest review count, then only data entry with the name and review number matching those of the dictionary is added into the clean dataset.<br>
First we count the number of duplicates to double check the clean dataset later.

In [4]:
duplicates_android = []
unique_android = []

for row in android:
    name = row[0]
    if name in unique_android:
        duplicates_android.append(row)
    else:
        unique_android.append(name)

duplicates_apple = []
unique_apple = []
for row in apple:
    name = row[0]
    if name in unique_apple:
        duplicates_apple.append(row)
    else:
        unique_apple.append(name)

print('Number of Apple duplicates:', len(duplicates_apple))
print('Number of Android duplicates:', len(duplicates_android))

Number of Apple duplicates: 0
Number of Android duplicates: 1181


In [5]:
reviews_max = {}
android_clean = []
already_added = []

for row in android:
    name = row[0]
    reviews = int(row[3]) #convert string to integer
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name]=reviews #update review number
    elif name not in reviews_max: 
        reviews_max[name]=reviews #add name and review number
        
for row in android:
    name = row[0]
    reviews = int(row[3])
    if reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print('Length of clean Android dataset:',len(android_clean))

Length of clean Android dataset: 9659


<b>Remove non-english apps:</b>
<br>All English text are in the range 0 to 127, according to the ASCII.<br>
Function below only returns False if more than 3 characters in the app name is out of range (so we do not remove app name with one emoji for example).

In [6]:
def english_app(string):
    non_english = 0
    for i in string:
        if ord(i)>127:
            non_english+=1
    if non_english > 3:
        return False
    else:
        return True #after interating through the whole string
    
print(english_app('Instachat 😜'))
print(english_app('爱奇艺PPS'))

True
True


In [7]:
ios_english = []
android_english = []

for row in apple:
    if english_app(row[1]):
        ios_english.append(row)
        
for row in android_clean:
    if english_app(row[0]):
        android_english.append(row)

print('Number of apps in Apple dataset:', len(apple))
print('Number of ios English apps:', len(ios_english))
print('Number of apps in Android unique dataset:', len(android_clean))
print('Number of Android English apps:', len(android_english))
        

Number of apps in Apple dataset: 7197
Number of ios English apps: 6183
Number of apps in Android unique dataset: 9659
Number of Android English apps: 9614


<b>Isolate free apps:</b><br>We only focus on free apps since they obviously attract more users. 
Price data is #4 entry for ios and #7 entry for android.

In [8]:
ios_final = []
android_final = []

for row in ios_english:
    if row[4] == '0.0':
        ios_final.append(row)
        
for row in android_english:
    if row[7] == '0':
        android_final.append(row)
        
print('Final ios apps:', len(ios_final))
print('Final android apps:', len(android_final))

Final ios apps: 3222
Final android apps: 8864


<b>Identify genre with most apps</b>

Genre: #11 for ios and #1, #10 for android. <br>
<br>
To build frequency table, we need two functions:
<br>
 - Function 1: for each genre in dataset, count its number of occurrence, save both to a dictionary, then convert all occurence number to percentage.
<br>
 - Function 2: sorts a dictionary by its values in descending order, stores the result as a list of tuples, print.

In [9]:
def percentage_table(dataset, i):
    table = {} #key:genre, value: occurence
    percentage_table = {} #key:genre, value: occurrence percentage
    
    for row in dataset: #count occurrence
        genre = row[i]
        if genre in table:
            table[genre] +=1
        else:
            table[genre] = 1
    
    for genre in table: #convert occurrence into percentage
        percentage = round(table[genre]/len(dataset)*100, 2)
        percentage_table[genre] = percentage
    
    return percentage_table

def frequency_table(dataset, i): #sort frequency in descending order
    table = percentage_table(dataset, i)
    
    table_tuple = table.items()
    sorted_table = sorted(table_tuple, key=lambda x: x[1], reverse=True)

    for key, value in sorted_table:
        print(f"{key}: {value}%")

<b>Analyze IOS frequency table</b>

In [10]:
print(apple_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [11]:
frequency_table(ios_final, -5)

Games: 58.16%
Entertainment: 7.88%
Photo & Video: 4.97%
Education: 3.66%
Social Networking: 3.29%
Shopping: 2.61%
Utilities: 2.51%
Sports: 2.14%
Music: 2.05%
Health & Fitness: 2.02%
Productivity: 1.74%
Lifestyle: 1.58%
News: 1.33%
Travel: 1.24%
Finance: 1.12%
Weather: 0.87%
Food & Drink: 0.81%
Reference: 0.56%
Business: 0.53%
Book: 0.43%
Navigation: 0.19%
Medical: 0.19%
Catalogs: 0.12%


We can see that among the free ios English apps, more than a half (58.16%) are games. Entertainment apps are close to 8%.

App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes are more rare. 

<b>Analyze Android's frequency table</b>

In [12]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [13]:
print("Android Category Popularity")
frequency_table(android_final, 1)

Android Category Popularity
FAMILY: 18.91%
GAME: 9.72%
TOOLS: 8.46%
BUSINESS: 4.59%
LIFESTYLE: 3.9%
PRODUCTIVITY: 3.89%
FINANCE: 3.7%
MEDICAL: 3.53%
SPORTS: 3.4%
PERSONALIZATION: 3.32%
COMMUNICATION: 3.24%
HEALTH_AND_FITNESS: 3.08%
PHOTOGRAPHY: 2.94%
NEWS_AND_MAGAZINES: 2.8%
SOCIAL: 2.66%
TRAVEL_AND_LOCAL: 2.34%
SHOPPING: 2.25%
BOOKS_AND_REFERENCE: 2.14%
DATING: 1.86%
VIDEO_PLAYERS: 1.79%
MAPS_AND_NAVIGATION: 1.4%
FOOD_AND_DRINK: 1.24%
EDUCATION: 1.16%
ENTERTAINMENT: 0.96%
LIBRARIES_AND_DEMO: 0.94%
AUTO_AND_VEHICLES: 0.93%
HOUSE_AND_HOME: 0.82%
WEATHER: 0.8%
EVENTS: 0.71%
PARENTING: 0.65%
ART_AND_DESIGN: 0.64%
COMICS: 0.62%
BEAUTY: 0.6%


For Google Play: not many apps designed for fun, a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, family category (which accounts for almost 19% of the apps) means mostly games for kids.

Practical apps seem to have a better representation on Google Play compared to App Store. This picture is also confirmed by the frequency table we see for the Genres column:

In [14]:
print("Android Genre Popularity")
frequency_table(android_final, -4)

Android Genre Popularity
Tools: 8.45%
Entertainment: 6.07%
Education: 5.35%
Business: 4.59%
Lifestyle: 3.89%
Productivity: 3.89%
Finance: 3.7%
Medical: 3.53%
Sports: 3.46%
Personalization: 3.32%
Communication: 3.24%
Action: 3.1%
Health & Fitness: 3.08%
Photography: 2.94%
News & Magazines: 2.8%
Social: 2.66%
Travel & Local: 2.32%
Shopping: 2.25%
Books & Reference: 2.14%
Simulation: 2.04%
Dating: 1.86%
Arcade: 1.85%
Video Players & Editors: 1.77%
Casual: 1.76%
Maps & Navigation: 1.4%
Food & Drink: 1.24%
Puzzle: 1.13%
Racing: 0.99%
Libraries & Demo: 0.94%
Role Playing: 0.94%
Auto & Vehicles: 0.93%
Strategy: 0.91%
House & Home: 0.82%
Weather: 0.8%
Events: 0.71%
Adventure: 0.68%
Comics: 0.61%
Art & Design: 0.6%
Beauty: 0.6%
Parenting: 0.5%
Card: 0.45%
Casino: 0.43%
Trivia: 0.42%
Educational;Education: 0.39%
Board: 0.38%
Educational: 0.37%
Education;Education: 0.34%
Word: 0.26%
Casual;Pretend Play: 0.24%
Music: 0.2%
Entertainment;Music & Video: 0.17%
Puzzle;Brain Games: 0.17%
Racing;Action &

Since Genres column is much more granular (it has more categories), but we're only looking for the bigger picture at the moment, we'll only work with Category moving forward.

Conclusion from frequency tables: App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps.

<b>Identify most popular app genre</b>

 - Android dataset: Install number.
 - IOS dataset: Install number is missing so we use rating number instead. We will calculate the average number of user ratings per genre.

In [15]:
ios_genre = percentage_table(ios_final, -5)
max_num_ratings = 0
most_common = ""

for genre in ios_genre: 
    genre_ratings = 0 #number of ratings of each app in genre
    genre_apps = 0 #number of apps in the genre
    for row in ios_final:
        genre_name = row[-5]
        rating_count = row[5]
        if genre_name == genre:
            genre_ratings += float(rating_count) #add app's ratings to genre's ratings
            genre_apps +=1
    avg_num_ratings = round(genre_ratings / genre_apps)
    print(f"{genre}: {avg_num_ratings}")
    if avg_num_ratings > max_num_ratings:
        max_num_ratings = avg_num_ratings
        most_common = genre #not genre_name 

print(f"\nMost downloaded ios genre: {most_common}")

Social Networking: 71548
Photo & Video: 28442
Games: 22789
Music: 57327
Reference: 74942
Health & Fitness: 23298
Weather: 52280
Utilities: 18684
Travel: 28244
Shopping: 26920
News: 21248
Navigation: 86090
Lifestyle: 16486
Entertainment: 14030
Food & Drink: 33334
Sports: 23009
Book: 39758
Finance: 31468
Education: 7004
Productivity: 21028
Business: 7491
Catalogs: 4004
Medical: 612

Most downloaded ios genre: Navigation


Runner-up: Reference

In [16]:
giants = 0
total = 0
struggle = 0

for app in ios_final:
    if app[-5] == 'Navigation':
        total += 1
        if(float(app[5]) >= 10000):
            giants += 1
            print(app[1], ':', app[5])
        elif(float(app[5]) <= 1000):
            struggle +=1
            
print(f"\n{round(giants/total*100)}% have over 10,000 ratings")
print(f"\n{round(struggle/total*100)}% have less than 1000 ratings")

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811

50% have over 10,000 ratings

33% have less than 1000 ratings


<b>Concern</b> 

Only half of the navigation apps does well while over 30% does poorly.

In [17]:
android_category = percentage_table(android_final, 1)
max_installs = 0
most_common = ""

for category in android_category:
    category_installs = 0
    category_apps = 0
    for row in android_final:
        app_category = row[1]
        if app_category == category:
            installs = row[5].replace("+","")
            installs = installs.replace(",","")
            category_installs += float(installs)
            category_apps +=1
    avg_installs = round(category_installs / category_apps) 
    print(f"{category}: {avg_installs}")
    if avg_installs > max_installs:
        max_installs = avg_installs
        most_common = category

print(f"\nMost downloaded Android genre: {most_common}")

ART_AND_DESIGN: 1986335
AUTO_AND_VEHICLES: 647318
BEAUTY: 513152
BOOKS_AND_REFERENCE: 8767812
BUSINESS: 1712290
COMICS: 817657
COMMUNICATION: 38456119
DATING: 854029
EDUCATION: 1833495
ENTERTAINMENT: 11640706
EVENTS: 253542
FINANCE: 1387692
FOOD_AND_DRINK: 1924898
HEALTH_AND_FITNESS: 4188822
HOUSE_AND_HOME: 1331541
LIBRARIES_AND_DEMO: 638504
LIFESTYLE: 1437816
GAME: 15588016
FAMILY: 3695642
MEDICAL: 120551
SOCIAL: 23253652
SHOPPING: 7036877
PHOTOGRAPHY: 17840110
SPORTS: 3638640
TRAVEL_AND_LOCAL: 13984078
TOOLS: 10801391
PERSONALIZATION: 5201483
PRODUCTIVITY: 16787331
PARENTING: 542604
WEATHER: 5074486
VIDEO_PLAYERS: 24727872
NEWS_AND_MAGAZINES: 9549178
MAPS_AND_NAVIGATION: 4056942

Most downloaded Android genre: COMMUNICATION


In [18]:
giants = 0
communication_app = 0

for app in android_final:
    if app[1] == 'COMMUNICATION':
        communication_app += 1
        if(app[5] == '1,000,000,000+' or app[5] == '500,000,000+'or app[5] == '100,000,000+'):
            giants += 1
            print(app[0], ':', app[5])
            
print(f"\nOnly {round(giants/communication_app*100)}% have over 100,000,000 installs")

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

<b>Concern</b> 

This category install number is heavily skewed up by only a few apps that have over one billion installs. 

We see the same pattern for the video players category - dominated by Youtube, Google Play Movies & TV, or MX Player; social apps (Facebook, Instagram, Google+, etc.); photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

These app genres might seem more popular than they really are! 

The game category seems popular, however, the frequency tables showed the market is already pretty saturated.

<b>Solution</b>

Let's focus on app categories that are both popular and not yet saturated with existing apps.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,812. 

We found that this genre has some potential to work well on the App Store, now we need to figure if it's also profitable Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [22]:
giants = 0
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        total +=1
        if (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'or app[5] == '100,000,000+'):
            giants += 1
            print(app[0], ':', app[5])
print(f"{round(giants/total*100)}% are popular")

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+
1% are giants


Since we only see a few that are popular, this market has great potential.

<b>Conclusions</b>
In this project, we analyzed App Store and Google Play mobile apps data to identify an app profile that can be profitable for both markets.

We concluded that a free app with a wide variety of books could be profitable for both the Google Play and the App Store markets. However, nearly 3% of the markets are libraries, adding some special features besides the raw version of the book would be beneficial.