# Profitable Apps for the App Store and Google Play Markets


This project analyzes data for iOS and Android mobile apps. The apps are free to download and install, and the main source of revenue is in-app ads, so the number of users largely determines the revenue for any given app. The goal is to analyze the data and help developers understand what kinds of apps are likely to attract more users.


## Opening and Exploring the data

There are currently nearly 2 million apps on the [Apple App Store](https://www.apple.com/app-store/#:~:text=Because%20we%20offer%20nearly%20two,every%20single%20one%20of%20them.) and over 3 million [apps](https://appinventiv.com/blog/google-play-store-statistics/#:~:text=As%20per%20latest%20Google%20Play,Play%20Store%20every%20single%20day.) on the Google Play Store. Collecting data for over 5 million apps requires a significant amount of time and money, so for this project I'll analyze a sample instead. Two existing datasets seem suitable for this purpose:

* A [dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about roughly 10,000 apps on the Google Play Store
* A [dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about roughly 7,200 apps on the App Store

In [63]:
from csv import reader

# This function takes the path to a CSV file and returns its rows as a list of lists
def read_file(filename):
    # The context manager closes the file once it has been read
    with open(filename, encoding='utf8') as opened_file:
        return list(reader(opened_file))


    

In [64]:
# Apple store dataset
apple_store_dataset = read_file('AppleStore.csv')
ios_dataset_header = apple_store_dataset[0]
ios_dataset = apple_store_dataset[1:]


#Google play store dataset
google_store_dataset = read_file('googleplaystore.csv')
android_dataset_header = google_store_dataset[0]
android_dataset = google_store_dataset[1:]



In [65]:
# This function takes a dataset as a list of lists without the header row,
# plus start and end as integers that define the slice of rows to print
def explore_data(dataset, start, end, rows_and_cols=False):
    data_split = dataset[start:end]
    for row in data_split:
        print(row)
        print("\n")
    if rows_and_cols:
        print(f"Number of rows: {len(dataset)}")
        print(f"Number of columns: {len(dataset[0])}")
                      

In [66]:
# Explore data for the android apps 
explore_data(android_dataset, 0, 3, True)
print("\n")
print(android_dataset_header)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The Google Play dataset has 10841 rows and 13 columns. Each row corresponds to one app entry, so the dataset contains 10841 app entries. The columns that seem useful for the analysis are: "App, Category, Rating, Reviews, Type, Price, Content Rating, Genres".

In [67]:
# Explore data for the apple store data set
explore_data(ios_dataset, 0, 3, True)
print("\n")
print(ios_dataset_header)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The App Store dataset has 7197 rows and 16 columns. Each row corresponds to one app, so the dataset contains 7197 apps. The columns that seem useful for this analysis are: "track_name, currency, price, rating_count_tot, user_rating, prime_genre".

## Cleaning Data

The company only builds apps that are free to download and install and that target English speakers. With this in mind, I'll remove data on paid apps and non-English apps. According to the [discussion](http://a.com), row 10472 is missing a column. This row corresponds to the *Life Made WI-Fi Touchscreen Photo Frame* app: its category value is missing, so the rest of the data in the row is shifted into the wrong columns. To keep the analysis correct, I'll remove this row from the dataset.

In [68]:
print(android_dataset[10472])
print(len(android_dataset[10472]))
print("\n")
print(android_dataset_header)
print(len(android_dataset_header))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
13


In [69]:
# Delete row 10472 only while it is still the malformed row, so rerunning the cell doesn't delete the row after it
if len(android_dataset[10472]) == 12 and android_dataset[10472][0] == "Life Made WI-Fi Touchscreen Photo Frame":
    del android_dataset[10472]
    print("yes")
else:
    print("no")

yes


Some apps have multiple entries in the Google Play data. For the analysis, I don't want to count any app more than once, so I'll delete the duplicates and keep one entry per app. The duplicate entries differ in their number of reviews; for example, the duplicate entries for the Instagram app have different review counts, which suggests the data was collected at different times. Instead of removing duplicates randomly, I'll keep the entry with the highest number of reviews for each app, since that entry is likely the most recent.

In [70]:
android_data_dict = {}  #dictionary to store android dataset without duplicates
for row in android_dataset:
    if row[0] not in android_data_dict:
        android_data_dict[row[0]] = int(row[3])
    else:
        if android_data_dict[row[0]] < int(row[3]):
            android_data_dict[row[0]] = int(row[3])
print(len(android_data_dict))
        

9659


The Android dataset originally contains 10841 rows. After removing the *Life Made WI-Fi Touchscreen Photo Frame* row, 10840 remain. Since 1181 of those are duplicates, the number of apps without duplicates should be 10840 - 1181 = 9659. The android_data_dict dictionary, which maps each app name to its highest number of reviews, has a length of 9659, as expected. I'll use this dictionary to remove the duplicates from the actual dataset list.
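The duplicate count quoted above can be checked with a short pass over the app names. The following is a minimal sketch that uses a small made-up list, sample_rows, in place of android_dataset; the helper name count_duplicates and the sample values are assumptions for illustration only.

```python
# Hypothetical sample standing in for android_dataset: [name, category, rating, reviews]
sample_rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
    ["Notes", "PRODUCTIVITY", "4.2", "1500"],
]


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of duplicate rows)."""
    seen = set()
    duplicates = 0
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return len(seen), duplicates


unique_count, duplicate_count = count_duplicates(sample_rows)
print(unique_count, duplicate_count)

# Unique names plus duplicate rows always add up to the total number of rows.
assert unique_count + duplicate_count == len(sample_rows)
```

If the figures in the text hold, running the same function on the real dataset list should report 9659 unique names and 1181 duplicate rows.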

In [71]:
android_clean_dataset = []
already_seen = set()  # names already added, so entries tied on review count are kept only once
for row in android_dataset:
    name = row[0]
    n_reviews = int(row[3])
    if n_reviews == android_data_dict[name] and name not in already_seen:
        android_clean_dataset.append(row)
        already_seen.add(name)
print(len(android_clean_dataset))

9659


The length of the clean list matches the length of android_data_dict: both are 9659, so the duplicates have been successfully removed from the dataset list. Next, since the company's apps are only for English speakers, I'll remove apps aimed at non-English-speaking audiences. To do this, I'll use the built-in ord() function to check whether each character in an app name falls in the ASCII range of 0 to 127. Because some English app names contain emojis or special characters, a name is only treated as non-English when more than three of its characters fall outside that range.
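As a quick illustration of the ord() check, the sketch below counts the non-ASCII characters in a few made-up app names. The helper name count_non_ascii and the sample names are assumptions for illustration and are not taken from the datasets.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for ch in text if ord(ch) > 127)


# Made-up names for illustration only.
samples = [
    "Docs To Go™ Free Office Suite",  # one trademark symbol
    "Instachat 😜",                   # one emoji
    "日本語の天気予報アプリ",            # written entirely in a non-Latin script
]

for name in samples:
    print(name, "->", count_non_ascii(name))
```

A name with a single trademark symbol or emoji stays well under the three-character allowance, while a name written entirely in another script exceeds it. Short non-English names with three or fewer non-ASCII characters would still pass, so the filter is a heuristic rather than a language detector.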

In [72]:
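def is_english_app(app_name):
    count = 0  # Number of non-ASCII characters; tolerates emojis and special characters in English app names
    for letter in app_name:
        if ord(letter) > 127:  # ord() is never negative, so only the upper bound matters
            count += 1
    # Allow up to three non-ASCII characters before treating the name as non-English
    return count <= 3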
    
#Filter google dataset for English apps 
android_eng_dataset = []
for row in android_clean_dataset:
    if is_english_app(row[0]):
        android_eng_dataset.append(row)
print("Number of android apps left: "+ str(len(android_eng_dataset)))

#Filter apple store dataset for English apps
ios_eng_dataset = []
for row in ios_dataset:
    if is_english_app(row[1]):
        ios_eng_dataset.append(row)
print("Number of ios apps left: "+ str(len(ios_eng_dataset)))

Number of android apps left: 9614
Number of ios apps left: 6183


As mentioned in the introduction, we only build apps that are free to download and install, and the main source of revenue is in-app ads. The datasets contain both free and paid apps, so I'll isolate the free apps for the analysis.

In [73]:
#Retrieve list of free android apps
android_free_apps = []
for row in android_eng_dataset:
    if row[7] == '0':
        android_free_apps.append(row)
        
#Retrieve list of free ios apps
ios_free_apps = []
for row in ios_eng_dataset:
    if float(row[4]) == 0.0:
        ios_free_apps.append(row)
        
print("Final number of android apps for analysis: "+ str(len(android_free_apps)))
print("Final number of ios apps for analysis: "+ str(len(ios_free_apps)))

Final number of android apps for analysis: 8864
Final number of ios apps for analysis: 3222


## Analysis 
The goal of the analysis is to determine what kinds of apps attract more users, because revenue depends heavily on the number of people using the apps. The validation strategy for an app idea consists of three steps:
1. Build a minimal Android version of the app and add it to Google Play
2. If the app gets a good response from users, develop it further
3. If the app is profitable after 6 months, build an iOS version of the app and add it to the App Store

Since the end goal is to add the app to both Google Play and the App Store, I need to find app profiles that are successful in both markets. I'll begin the analysis by exploring the most common genres in each market.

In [77]:
# Builds a frequency table from a nested list (dataset) and a column index;
# the values are percentages of the total number of rows
def freq_table(dataset, index):
    freq_dict = {}
    count = 0
    for row in dataset:
        count += 1
        value = row[index]
        if value in freq_dict:
            freq_dict[value] += 1
        else:
            freq_dict[value] = 1
    for key in freq_dict:
        freq_dict[key] = (freq_dict[key]/count) * 100
    return freq_dict


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
    

In [78]:
# Display prime genre frequency table for ios apps
display_table(ios_free_apps, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


For the free English iOS apps, about 58% are in the Games genre, about 8% are in Entertainment and about 5% are in Photo & Video. The general impression is that most of the apps are designed for fun (games, entertainment, photo & video, social networking, sports), while fewer are designed for practical purposes (education with about 4%, utilities with about 3% and productivity with about 2%).

In [80]:
#Display category frequency table for android apps
display_table(android_free_apps, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

The overall impression for the free English Android apps is different from that of the free English iOS apps. No single category accounts for more than 50% of the apps, as Games does on iOS. Also, most of the apps are designed for more practical purposes (e.g. family, tools, business, lifestyle, productivity), unlike the iOS apps. This is also highlighted in the frequency table below for the Genres column of the Android dataset. The difference between the Category and Genres columns is not clear at the moment. The Genres column appears more detailed, but since I'm looking for the big picture, I'll focus on the Category column moving forward.

In [81]:
display_table(android_free_apps, 9) #genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

## Most popular apps by genre on the app store 

I'll determine which genres have the most users by using the total number of user ratings, found in the rating_count_tot column, as a proxy and averaging it per genre.

In [88]:
ios_genres = freq_table(ios_free_apps, 11)
for genre in ios_genres:
    total = 0
    len_genre = 0
    for row in ios_free_apps:
        app_genre = row[11]
        if app_genre == genre:
            ratings = float(row[5])
            total += ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Social networking apps seem popular, but that is because the genre is dominated by apps like Instagram, Facebook and Pinterest. The same is true for the reference genre, where reviews are dominated by apps like the Bible and dictionary apps. If these popular apps were removed from the list, the average number of user ratings for these genres would be much smaller. Since I'm focusing on a high-level analysis, I won't separate out these popular apps.
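To see how much a single dominant app can pull up a genre average, here is a minimal sketch with made-up rating counts. The app names, the numbers and the 100,000 cutoff are all hypothetical; they only illustrate the effect and are not taken from the dataset.

```python
# Made-up rating counts for one genre; the first entry plays the role of a dominant app.
genre_ratings = {
    "Giant Social App": 2_000_000,
    "Small App A": 1_200,
    "Small App B": 800,
    "Small App C": 400,
}


def average(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


def average_without_outliers(ratings, threshold):
    """Average only the apps whose rating count is below the threshold.

    Assumes at least one app falls below the threshold.
    """
    kept = [count for count in ratings.values() if count < threshold]
    return average(kept)


print("All apps:", average(genre_ratings.values()))
print("Without dominant apps:", average_without_outliers(genre_ratings, 100_000))
```

In this toy example the average drops from 500600.0 to 800.0 once the dominant app is excluded, which is the kind of distortion described above.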

The fun and games genres have the most reviews in general. In my earlier analysis, the fun and games genres (games, entertainment, photo & video and social networking) also had the largest supply of apps, which could indicate market saturation.

Other interesting genres that seem popular among users are weather, travel, food & drink, book and finance. The weather genre would not be suitable because people don't spend much time in weather apps, so it would be difficult to generate revenue from in-app ads. Travel wouldn't work for what we're looking for because it would require us to integrate with APIs from hotels and airlines, and this would make the app non-free. Finance, especially personal finance, sounds really interesting, and I would have recommended it because of how popular it is with users, but it requires domain knowledge and the company isn't looking to hire a finance expert.

The remaining popular genres are Food & Drink and Book. Only about 0.4% of the free English apps in the App Store sample fall under the Book genre and about 0.8% fall under Food & Drink. If we go with either genre, there would be room for the company's app to stand out among other apps.

I recommend the Food & Drink genre, with a content-based app rather than a restaurant or food delivery app. The content could include recipes from various countries, recipes for various diets, and instructions for making different drinks.

## Most popular apps by genre on google play

In [89]:
display_table(android_free_apps, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


To analyze apps on Google Play, I'll use the Installs column displayed above. The install counts are open-ended buckets with a + after the number, so the exact number of installs for each app is unknown. To simplify the analysis, I'll remove the plus signs and commas and treat each bucket as its lower bound. For example, 1,000,000+ installs will be analyzed as 1,000,000.
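The conversion described above can be sketched as a small helper. The function name parse_installs is an assumption for illustration; the notebook performs the same replacements inline.

```python
def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to its numeric lower bound."""
    cleaned = installs.replace("+", "").replace(",", "")
    return float(cleaned)


# A few bucket labels as they appear in the Installs column.
for bucket in ["1,000,000+", "500+", "0"]:
    print(bucket, "->", parse_installs(bucket))
```

Because each value is a lower bound, averages computed this way understate the true install counts, but they remain comparable across categories.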

In [92]:
android_freq = freq_table(android_free_apps, 1) #frequency table for google category data
for genre in android_freq:
    total = 0
    len_genre = 0
    for app in android_free_apps:
        if app[1] == genre:
            installs = app[5]
            installs = installs.replace("+", "")
            installs = installs.replace(",", "")
            installs = float(installs)
            total += installs
            len_genre += 1
    average_installs = total/len_genre
    print(genre, ": ", average_installs)
    

ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS 

The communication category seems to be the most popular with over 38 million installs on average, followed by video players with over 24 million, social with over 23 million and photography with over 17 million. The communication category is dominated by a few popular apps (Gmail, Hangouts, Skype, Messenger). This also applies to the video players and photography categories. The companies that dominate these categories are very difficult to compete against, so I'll move on to other categories to explore what could work best in our situation.


Other popular categories include entertainment, games, productivity, tools, shopping and travel. I determined earlier that the fun/entertainment and games categories seem saturated, so I won't recommend them. Other categories averaging over 1 million installs include business, education, finance, art & design, food & drink, health & fitness, lifestyle and family. My recommendation here is to go with the food & drink category and make a content-based app with recipes for food, snacks and drinks, because we're looking for apps that can work in both the Google and Apple markets. There's also a wide range of content that can be created in this category, plus plenty of room for generating revenue through ads. It is important to note that only about 1.2% of the free English apps in the Google Play sample are in the food & drink niche, so there's room for the company's app to stand out among other apps in the market.

## Conclusion

In this project, I analyzed data about free apps for English speakers on the Google Play Store and the Apple App Store, with the goal of recommending a niche that can be profitable in both markets.

I concluded that a content-based app in the Food & Drink genre, one that recommends recipes and teaches users how to cook dishes for different diets and from various countries, would likely be profitable in both the Google Play and App Store markets.