# Profitable App Profiles for the App Store and Google Play Markets
The goal of this project is to analyze app data to find mobile app profiles with high profit potential. I focus only on apps that are free to download and install, so the main source of revenue is in-app ads. Ad revenue depends heavily on the number of users, so developers can use these results to better understand which types of apps are likely to attract more users.



In [1]:
from csv import reader
#the Google Play data set
open_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(open_file)
android_list = list(read_file)
android_header = android_list[0]
android_dataset = android_list[1:]

In [2]:
from csv import reader
#the Apple Store data set
open_file = open("AppleStore.csv", encoding="utf8")
read_file = reader(open_file)
apple_list = list(read_file)
apple_header = apple_list[0]
apple_dataset = apple_list[1:]
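
Both cells above repeat the same open-and-read steps and never close the file handle. A minimal sketch of a shared loader follows; `load_csv` is a hypothetical helper name that does not appear in the original notebook.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file (hypothetical helper)."""
    # The with-block closes the file even if reading fails;
    # newline="" is the setting the csv module documentation recommends.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

With this helper, the two loads could be written as `android_header, android_dataset = load_csv('googleplaystore.csv')` and `apple_header, apple_dataset = load_csv('AppleStore.csv')`.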

In [3]:
def explore_data(dataset, start, end, row_column_number = False): 
    dataset_slice = dataset[start:end]
    for app_row in dataset_slice:
        print(app_row)
        print("\n")
    if row_column_number:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [4]:
print(android_header)
print('\n')
explore_data(android_dataset, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  10841
Number of columns:  13


In [5]:
print(apple_header)
print('\n')
explore_data(apple_dataset, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows:  7197
Number of columns:  16


### **3. Deleting wrong data - Data cleaning** ###

#### Android dataset

In [6]:
print(len(android_dataset))

10841


In [7]:
print(android_dataset[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
del android_dataset[10472]

In [9]:
print(len(android_dataset))

10840


#### Apple dataset (optional)

In [10]:
for index, row in enumerate(apple_dataset):
    # Flag any row whose number of fields differs from the header's
    if len(row) != len(apple_header):
        print(index)

No row index is printed, so the Apple dataset has no rows with missing or extra columns.
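
The same length check can be reused for either data set. The sketch below uses a hypothetical helper, `find_malformed_rows`, and a tiny made-up sample rather than the real data.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Made-up sample: the second row is missing one field.
sample_header = ["App", "Category", "Rating"]
sample_rows = [["A", "TOOLS", "4.1"], ["B", "1.9"], ["C", "GAME", "3.9"]]
print(find_malformed_rows(sample_rows, sample_header))
```

Calling `find_malformed_rows(android_dataset, android_header)` before the deletion above would be one way to locate a row like the one at index 10472 without knowing its position in advance.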

### **4. Removing Duplicate Entries: Part One**

In [11]:
unique_apps = []
duplicate_apps = []

for app in android_dataset:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("A few duplicate apps: ", duplicate_apps[0:20])
print("The total number of duplicate apps: ", len(duplicate_apps))

A few duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
The total number of duplicate apps:  1181


In [12]:
for google_ad in android_dataset:
    if google_ad[0] in duplicate_apps and google_ad[0] == "Google Ads":
        print(google_ad)

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


In the example above, the duplicate entries differ only in the number of reviews, so we can use the review count as the criterion: keep the entry with the most reviews and remove the others.
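
Checking `name in unique_apps` scans a list once per row, which gets slow as the data grows. A possible alternative is `collections.Counter`; the helper name `count_duplicates` and the sample rows below are made up for illustration.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return {name: extra_copies} for names that occur more than once."""
    counts = Counter(row[name_index] for row in dataset)
    return {name: n - 1 for name, n in counts.items() if n > 1}

# Made-up sample rows containing only the name column.
sample = [["Box"], ["Slack"], ["Box"], ["Box"], ["Zoom"], ["Slack"]]
extras = count_duplicates(sample)
print(extras)                # extra copies per duplicated name
print(sum(extras.values()))  # total number of duplicate rows
```

The sum of the extra copies corresponds to what `len(duplicate_apps)` measures in the cell above.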

### **5. Removing Duplicate Entries: Part Two**

In [13]:
expected_value = len(android_dataset)-1181 #The total number of duplicate apps:  1181

In [14]:
#{app_name, review_number}
reviews_max = {}
for app in android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in  reviews_max:
        reviews_max[name] = n_reviews


In [15]:
print(expected_value)
print(len(reviews_max))

9659
9659


We will use the reviews_max dictionary to remove the duplicate apps.
First we create the android_clean list to store the cleaned data set.
The already_added list provides a supplementary condition: if more than one entry of an app shares the highest number of reviews, only the first of those entries is kept.

In [16]:
android_clean = []
already_added = []
for app in android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print("The number of unique apps: ", len(android_clean))

The number of unique apps:  9659
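
The two passes (build `reviews_max`, then filter) could also be folded into one pass that stores the best row per name. This is only a sketch: `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened and partly made up, and the result is ordered by the first appearance of each name, which may differ from the order of `android_clean`.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened sample rows: name, category, rating, reviews (Box values are made up).
rows = [
    ["Google Ads", "BUSINESS", "4.3", "29313"],
    ["Box", "BUSINESS", "4.2", "159872"],
    ["Google Ads", "BUSINESS", "4.3", "29331"],
    ["Google Ads", "BUSINESS", "4.3", "29313"],
]
for row in keep_most_reviewed(rows):
    print(row)
```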


### **6. Removing Non-English Apps: Part 1** ###

In [17]:
def only_English(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(only_English('Instagram'))
print(only_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(only_English('Docs To Go™ Free Office Suite'))
print(only_English('Instachat 😜'))
#Because it rejects any character outside the ASCII range, this version of only_English also rejects English app names that contain symbols such as ™ or emoji.

True
False
False
False


### **7. Removing Non-English Apps: Part 2**

In [18]:
def only_English(string):
    i = 0
    for character in string:
        if ord(character) > 127:
            i+=1
    if i > 3:
        return False
    else:
        return True
    
print(only_English('Docs To Go™ Free Office Suite'))
print(only_English('Instachat 😜'))
print(only_English('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
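
The cells shown here define `only_English` but do not show it being applied to either data set. If the filter were applied, it might look like the sketch below. `filter_english` is a hypothetical helper, `only_English` is restated in a shorter form with the same threshold of three non-ASCII characters, and the name column indices (0 for Google Play, 1 for the App Store) are read off the headers printed earlier.

```python
def only_English(string):
    """Treat a name as English if it has at most 3 non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= 3

def filter_english(dataset, name_index):
    """Return the rows whose name passes only_English (hypothetical helper)."""
    return [row for row in dataset if only_English(row[name_index])]

# Sample built from the names tested above, one name per row.
sample = [
    ["Instagram"],
    ["Docs To Go™ Free Office Suite"],
    ["爱奇艺PPS -《欢乐颂2》电视剧热播"],
]
print(len(filter_english(sample, 0)))
```

On the real data this would be called as `filter_english(android_clean, 0)` and `filter_english(apple_dataset, 1)`. The counts reported in the following sections were produced without this step.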


### **8. Isolating Free apps**
The data sets contain both free and paid apps, so we need to isolate the free ones.
The cleanest lists at this point are android_clean and apple_dataset.

In [19]:
free_android = []
free_apple = []
for app in android_clean:
    if app[7] == "0":
        free_android.append(app)
        
for app in apple_dataset:
    if app[4] == "0.0":
        free_apple.append(app)
        
print("We have ", len(free_android), " free android apps left.")
print("We have ", len(free_apple), " free apple apps left.")

We have  8905  free android apps left.
We have  4056  free apple apps left.
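
Comparing against the exact strings `"0"` and `"0.0"` depends on how each store formats its prices. A slightly more tolerant check could parse the price as a number instead. In this sketch `is_free` is a hypothetical helper, and the leading `$` on paid prices is an assumption about the Google Play column rather than something shown in the rows above.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    # Assumption: paid prices may carry a leading dollar sign, e.g. "$4.99".
    cleaned = price_text.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable price: treat the app as not free rather than guess.
        return False

print(is_free("0"), is_free("0.0"), is_free("$4.99"))
```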


### **9. Most Common Apps by Genre: Part One**
The validation strategy for an app idea comprises three steps:
1. Build an Android app and add it to Google Play.
2. If the app receives many positive reviews and good feedback, develop it further.
3. If it is profitable after 6 months, develop an iOS version and add it to the App Store.

### **10. Most Common Apps by Genre: Part Two**
The freq_table function takes two parameters:
dataset: a list of lists
index: an integer (the index of the column to summarize)
It returns a frequency table as a dictionary for any column we want, with the frequencies expressed as percentages.

We will create two functions: one to build the frequency table, the other to display the percentages in descending order.

In [20]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        key = row[index]
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages # dictionary of percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple) # list of tuples
        
    table_sorted = sorted(table_display, reverse = True) # sorted list of tuples(value, key)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
# display_table function displays the percentages in descending order.
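
For comparison, the counting and sorting can also be delegated to `collections.Counter`. This is a sketch, not the notebook's method: `freq_table_sorted` is a hypothetical name, the sample rows are made up, and ties are ordered by first appearance rather than by key as in `display_table`.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return [(value, percentage), ...] sorted by descending percentage."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs from most to least frequent.
    return [(key, count / total * 100) for key, count in counts.most_common()]

# Made-up sample rows: (name, genre).
sample = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
for genre, pct in freq_table_sorted(sample, 1):
    print(genre, ":", pct)
```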


In [21]:
print(free_android[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [22]:
print(free_apple[0])


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


### **11. Most Common Apps by Genre: Part Three**


**I start by examining the prime_genre column of the App Store data set.**

In [23]:
display_table(free_apple, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


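Games are by far the most numerous free apps (about 56%), followed by Entertainment apps (about 8%). Note that this table measures how many apps exist in each genre, not how many users they have.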

**Next, I will analyze the Category and Genres columns of the Google Play data set.**

In [24]:
display_table(free_android, 1) #Category

FAMILY : 18.97810218978102
GAME : 9.70241437394722
TOOLS : 8.433464345873105
BUSINESS : 4.581695676586187
LIFESTYLE : 3.9303761931499155
PRODUCTIVITY : 3.885457608085345
FINANCE : 3.6833239752947784
MEDICAL : 3.5148792813026386
SPORTS : 3.3801235261089273
PERSONALIZATION : 3.312745648512072
COMMUNICATION : 3.2341381246490735
HEALTH_AND_FITNESS : 3.065693430656934
PHOTOGRAPHY : 2.9421673217293653
NEWS_AND_MAGAZINES : 2.829870859067939
SOCIAL : 2.6501965188096577
TRAVEL_AND_LOCAL : 2.3245367770915215
SHOPPING : 2.2459292532285233
BOOKS_AND_REFERENCE : 2.1785513756316677
DATING : 1.8528916339135317
VIDEO_PLAYERS : 1.7967434025828188
MAPS_AND_NAVIGATION : 1.4149354295339696
FOOD_AND_DRINK : 1.235261089275688
EDUCATION : 1.167883211678832
ENTERTAINMENT : 0.9545199326221224
LIBRARIES_AND_DEMO : 0.9320606400898372
AUTO_AND_VEHICLES : 0.9208309938236946
HOUSE_AND_HOME : 0.8197641774284109
WEATHER : 0.7973048848961257
EVENTS : 0.7074677147669848
PARENTING : 0.6513194834362718
ART_AND_DESIGN : 0

In [25]:
display_table(free_android, -4) #Genres

Tools : 8.422234699606962
Entertainment : 6.086468276249298
Education : 5.390230207748456
Business : 4.581695676586187
Lifestyle : 3.9191465468837734
Productivity : 3.885457608085345
Finance : 3.6833239752947784
Medical : 3.5148792813026386
Sports : 3.4475014037057834
Personalization : 3.312745648512072
Communication : 3.2341381246490735
Action : 3.0881527231892196
Health & Fitness : 3.065693430656934
Photography : 2.9421673217293653
News & Magazines : 2.829870859067939
Social : 2.6501965188096577
Travel & Local : 2.313307130825379
Shopping : 2.2459292532285233
Books & Reference : 2.1785513756316677
Simulation : 2.0662549129702414
Dating : 1.8528916339135317
Arcade : 1.8416619876473892
Video Players & Editors : 1.7742841100505335
Casual : 1.7518248175182483
Maps & Navigation : 1.4149354295339696
Food & Drink : 1.235261089275688
Puzzle : 1.1229646266142617
Racing : 0.9882088714205502
Role Playing : 0.9320606400898372
Libraries & Demo : 0.9320606400898372
Strategy : 0.9208309938236946
Au

Most Google Play apps are designed for practical purposes.
Family apps account for about 19% of the free apps in the data set.

#### => The App Store is dominated by apps designed for fun and entertainment.
#### => Google Play shows a more balanced landscape of practical and fun apps.

### 12. Most popular apps by genre on the App Store
One way to find the most popular genre is to calculate the average number of downloads or installs across all the apps in each genre.
The App Store data set has no install counts, so I will use the rating_count_tot column (the total number of user ratings) as a proxy.

In [29]:
ios_genres = freq_table(free_apple, -5) # frequency table of genres STILL being expressed in percentages
# print(ios_genres)
for genre in ios_genres:
    total = 0 #the variable will store the sum of user ratings of 1 genre
    len_genre = 0 #the variable will store the number of apps in that 1 genre
    for app in free_apple:
        genre_app = app[-5]
        if genre_app == genre:
            n_rating = float(app[5])
            total += n_rating
            len_genre += 1
    avg_rating = total / len_genre
    print(genre, ":", avg_rating)

Social Networking : 53078.195804195806
Photo & Video : 27249.892215568863
Games : 18924.68896765618
Music : 56482.02985074627
Reference : 67447.9
Health & Fitness : 19952.315789473683
Weather : 47220.93548387097
Utilities : 14010.100917431193
Travel : 20216.01785714286
Shopping : 18746.677685950413
News : 15892.724137931034
Navigation : 25972.05
Lifestyle : 8978.308510638299
Entertainment : 10822.961077844311
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Book : 8498.333333333334
Finance : 13522.261904761905
Education : 6266.333333333333
Productivity : 19053.887096774193
Business : 6367.8
Catalogs : 1779.5555555555557
Medical : 459.75


By average number of ratings per app, Reference (about 67,448), Music (about 56,482), and Social Networking (about 53,078) are the most popular genres on the App Store.
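
The nested loop above scans `free_apple` once per genre. A single pass with running totals gives the same averages. In this sketch `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} in one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: (genre, rating count).
sample = [["Games", "100"], ["Games", "300"], ["Music", "50"]]
print(average_by_group(sample, 0, 1))
```

For the App Store data the call would be `average_by_group(free_apple, -5, 5)`, using the same column indices as the cell above.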

### 13. Most popular apps by genre on Google Play

In [33]:
category_android = freq_table(free_android, 1)

for category in category_android:
    total_installs = 0 # This variable will store the sum of INSTALLS specific to 1 genre.
    n_apps = 0 # This variable will store the TOTAL NUMBER OF APPS specific to 1 genre.
    for app in free_android:
        category_app = app[1]
        if category_app ==category:
            n_installs = app[5]
            n_installs = n_installs.replace(',' , '')
            n_installs = n_installs.replace('+' , '')
            total_installs += float(n_installs)
            n_apps += 1
    avg_installs = total_installs/n_apps #average number of installs of 1 genre
    print(category, ":", avg_installs)
            


ART_AND_DESIGN : 1952105.1724137932
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8587351.855670104
BUSINESS : 1708215.906862745
COMICS : 803234.8214285715
COMMUNICATION : 38322625.697916664
DATING : 854028.8303030303
EDUCATION : 1825480.7692307692
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1436126.94
GAME : 15551995.891203703
FAMILY : 3668870.823076923
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7001693.425
PHOTOGRAPHY : 17772018.759541985
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10787009.952063914
PERSONALIZATION : 5183850.806779661
PRODUCTIVITY : 16738957.554913295
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24573948.25
NEWS_AND_MAGAZINES : 9401635.95

The COMMUNICATION category has the highest average number of installs per app (about 38.3 million).
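
The string clean-up for the Installs column is repeated in several cells and could be factored into one function. `parse_installs` is a hypothetical helper name. Because values such as `10,000+` are open-ended buckets, the parsed number is a lower bound, so the averages above are approximate.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to a float."""
    # Drop the thousands separators and the trailing plus sign.
    return float(text.replace(",", "").replace("+", ""))

print(parse_installs("10,000+"))
print(parse_installs("1,000,000,000+"))
```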

In [34]:
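for app in free_android:
    # `and` binds tighter than `or`, so the install checks need parentheses to
    # stay restricted to COMMUNICATION. The output shown below was produced
    # without them, which is why it also lists apps such as Netflix.
    if app[1] == "COMMUNICATION" and (app[5] == "1,000,000,000+" or app[5] == "500,000,000+" or app[5] == "100,000,000+"):
        print(app[0], ":" , app[5])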

OfficeSuite : Free Office + PDF Editor : 100,000,000+
WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Hotstar : 100,000,000+
Talking Angela : 100,000,000+
IMDb Movies & TV : 100,000,000+
Talking Ben the Dog : 100,000,000+
Netflix : 100,000,000+
Period Tracker - Period Calendar Ovulation Tracker : 100,000,000+
Sonic Dash : 100,000,000+
PAC-MAN : 100,000,000+
Roll the Ball® - slide puzzle : 100,000,000+
Piano Ti

In [40]:
under_1M = []  # installs of communication apps below 1,000,000

for app in free_android:
    n_installs = app[5]
    n_installs = n_installs.replace(',' , '')
    n_installs = n_installs.replace('+' , '')
    if app[1] == "COMMUNICATION" and float(n_installs) < 1000000:
        under_1M.append(float(n_installs))

sum(under_1M)/len(under_1M)

46417.45637583893

The average number of installs for communication apps with fewer than 1 million installs (about 46,417) is far lower than the average for the entire category (about 38.3 million). The category average is skewed heavily upward by a few apps that dominate the genre (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), which suggests that the communication app market is highly saturated.

The Social, Photography, and Productivity categories follow the same pattern and also appear saturated.
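
One way to see this kind of skew is to compare the mean with the median, which is far less sensitive to a handful of giant apps. The install counts below are made up purely for illustration, and `summarize` is a hypothetical helper.

```python
from statistics import mean, median

def summarize(values):
    """Return (mean, median) for a list of numbers."""
    return mean(values), median(values)

# Made-up install counts: two giant apps and five small ones.
installs = [1_000_000_000, 500_000_000, 100_000, 50_000, 10_000, 5_000, 1_000]
avg, mid = summarize(installs)
print("mean:", avg)
print("median:", mid)
```

When the mean is several orders of magnitude above the median, as in this sample, a few dominant apps are driving the category average.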

# TO BE CONTINUED ...