# Analyzing Mobile Apps Data

The goal of this project is to analyze data to help our company's developers understand what types of apps are likely to attract more users. I am working as a data analyst for a fictional company that builds mobile apps. We build apps that are free to download and install, and our main source of revenue is in-app ads shown to users.

For this project, I chose to analyze data for both Android and iOS mobile apps from Google Play and Apple's App Store.

# Exploring the Data

I started by looking for readily available data sets online. 
* [Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps) (apps data as of Sept. 2018)
* [Apple App Store data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) (apps data as of Sept. 2018)

I then opened each data set.


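In [2]:
from csv import reader

## Apple Store Data Set ##
# encoding='utf8' is stated explicitly because app names contain non-ASCII characters
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]      # first row holds the column names
ios_dataset = ios[1:]    # remaining rows hold the apps

## Google Play Data Set ##
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android_dataset = android[1:]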
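Since both files are loaded the same way, the steps could also be wrapped in a small helper. The sketch below is one possible version, not part of the original notebook: the name `load_csv` is hypothetical, and the `utf8` default is an assumption about how the files are encoded.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    # The with-block closes the file even if parsing raises an error.
    # newline='' is what the csv module recommends for files passed to reader().
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

With this helper, loading would look like `ios_header, ios_dataset = load_csv('AppleStore.csv')`.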

 

Next, we will create a function that we can reuse to explore the data. It prints the rows in a given slice and, optionally, the total number of rows and columns.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(ios_header)     
print('\n')
explore_data(ios_dataset, 1, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
print(android_header)
print('\n')
explore_data(android_dataset, 1, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


# Finding and Removing Duplicates

At our company, we are primarily concerned with free apps directed at an English-speaking audience. We also want to remove any duplicate or incorrect data.

In the [Kaggle discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) for the Google Play data set, a user identified an error in one row that needs to be corrected.

In [14]:
print(android_dataset[10472])

Life Made WI-Fi Touchscreen Photo Frame,1.9,19,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up



The row has only 12 fields instead of the expected 13, so its values are shifted out of their columns. It is unclear whether the missing entry is the category or the rating, so to be safe we will delete the entire row.

In [5]:
print(len(android_dataset))
del android_dataset[10472]  # run this cell only once, or a valid row will be deleted too
print(len(android_dataset))

10841
10840
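Rows like this one can also be found programmatically instead of relying on a forum post. The following is a minimal sketch, assuming that a damaged row shows up as a row whose length differs from the header's; the function name `find_misaligned_rows` and the sample rows are made up for illustration.

```python
def find_misaligned_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Tiny illustrative sample, not the real data set
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Broken row', '1.9'],  # category missing, so the columns shift left
]
print(find_misaligned_rows(sample_rows, sample_header))  # prints [1]
```

If several indexes come back, deleting from the highest index down avoids shifting the positions of the rows still to be removed.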


Next, I checked for duplicates within each data set. 

In [6]:
duplicates_android = []
uniques_android = []

for app in android_dataset:
    name = app[0]
    if name in uniques_android:
        duplicates_android.append(name)
    else:
        uniques_android.append(name)
        
print ('Number of Duplicate Apps:', len(duplicates_android))
print ('Number of Unique Apps:', len(uniques_android))
print ('Examples of Duplicates Apps:', duplicates_android[:5])

Number of Duplicate Apps: 1181
Number of Unique Apps: 9659
Examples of Duplicates Apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


In [7]:
duplicates_ios = []
uniques_ios= []

for app in ios_dataset:
    app_id = app[0]  # column 0 is the unique 'id'; the app name is in column 1 ('track_name')
    if app_id in uniques_ios:
        duplicates_ios.append(app_id)
    else:
        uniques_ios.append(app_id)
        
print ('Number of Duplicate Apps:', len(duplicates_ios))
print ('Number of Unique Apps:', len(uniques_ios))
print ('Examples of Duplicates Apps:', duplicates_ios[:5])

Number of Duplicate Apps: 0
Number of Unique Apps: 7197
Examples of Duplicates Apps: []


The Google Play data contains 1,181 duplicate entries, while the App Store data contains none. This matches the consensus in the [discussion threads on Kaggle](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion).

Previewing the duplicate rows shows that the main discrepancy between them is the number of reviews. We can use this to define a criterion for deciding which row to keep.

A higher review count likely means the row is a more recent snapshot of the app, so for each app we will keep the row with the highest number of reviews and remove the rest.

Now that we have a criterion, we can begin isolating and removing the duplicates.

To do this, we will build a dictionary that maps each unique app name to the highest review count found for that app.

Since there are 1,181 duplicates among the 10,840 remaining rows, the dictionary should contain 9,659 entries.

In [8]:
reviews_max = {}

for app in android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(android_dataset) - 1181)
print('Google Play Length:', len(reviews_max))

Expected length: 9659
Google Play Length: 9659


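The dictionary now maps each unique app name to its highest review count. We can use it to rebuild the data set, keeping only the row with the highest number of reviews for each app.

The `already_added` list covers the case where duplicate rows share the same highest review count, so that each app is added only once.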

In [9]:
android_clean = []
already_added = []

for app in android_dataset:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
                

In [10]:
explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
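The same result can be reached in a single pass by storing the best row itself instead of only its review count. This is a sketch of that alternative, not the method used above; `keep_most_reviewed` is a hypothetical name, and the default column indexes assume the Google Play layout (name in column 0, reviews in column 3).

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie,
        # which is the same row the two-loop approach keeps.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())
```

One difference to keep in mind: the rows come back in the order each app name first appears, which can differ from the order in `android_clean`. The set of kept rows should be the same.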


# Removing Non-English Apps

Now that the duplicate entries are removed, I want to narrow the list down to apps made for an English-speaking audience.

To do that, I created a function that checks whether every character in an app name falls within the [ASCII](https://en.wikipedia.org/wiki/ASCII) range (code points 0 to 127), which covers the characters commonly used in English text, such as 'a' or '9'.

In [11]:
def english_app(string):
    for character in string:
        if ord(character) > 127:
            return False  # found a character outside the ASCII range
    return True

print(english_app('Instagram'))
print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app('Docs To Go™ Free Office Suite'))
print(english_app('Instachat 😜'))


True
False
False
False


The function works for the most part, but used as is it would exclude English apps whose names contain special symbols such as emojis or trademark signs, because those fall outside the ASCII range. The function therefore needs to be more lenient.

We will only eliminate apps whose names contain more than 3 non-ASCII characters. This isn't perfect, but it will minimize unnecessary data loss.

In [12]:
def english_app(string):
    not_english = 0
    
    for character in string:
        if ord(character) > 127:
            not_english += 1
        
    if not_english > 3:
        return False
    else:
        return True

print(english_app('Docs To Go™ Free Office Suite'))
print(english_app('Instachat 😜'))
print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False


A few English-language apps may still get filtered out, and a few non-English apps may slip through, but there shouldn't be enough of either to severely impact our analysis.

Now we will loop through each of the data sets and keep only the English apps.
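To see how sensitive the filter is to the cutoff, the count and the threshold can be separated. This is a small sketch under the same assumption as above (non-ASCII means a code point above 127); the names `count_non_ascii` and `is_probably_english` are hypothetical.

```python
def count_non_ascii(string):
    """Number of characters whose code point falls outside ASCII (0-127)."""
    return sum(1 for character in string if ord(character) > 127)

def is_probably_english(string, max_non_ascii=3):
    """True if the string has at most `max_non_ascii` non-ASCII characters."""
    return count_non_ascii(string) <= max_non_ascii

print(count_non_ascii('Docs To Go™ Free Office Suite'))  # prints 1
print(count_non_ascii('Instachat 😜'))                    # prints 1
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # prints False
```

Making the threshold a parameter lets us rerun the filter with, say, `max_non_ascii=5` and compare how many rows survive.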

In [13]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if english_app(name):
        android_english.append(app)
        
for app in ios_dataset:
    name = app[1]
    if english_app(name):
        ios_english.append(app)
        
explore_data(android_english,0,3, True)
print('\n')
explore_data(ios_english,0,3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We are left with 9,614 apps for the Google Play Store and 6,183 apps for the iOS App Store.

# Isolating Free Apps

The last step is to remove paid apps, since our company only builds apps that are free.

In [14]:
free_apps = []
paid_apps = []

for app in android_english:
    app_type = app[6]  # column 6 is 'Type' ('Free' or 'Paid'); the price is in column 7
    if app_type == 'Free':
        free_apps.append(app)
    else:
        paid_apps.append(app)

explore_data(free_apps, 0, 3, True)  # explore_data prints its own output and returns None

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13


In [18]:
ios_free =[]
ios_paid = []

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
    else:
        ios_paid.append(app)

        
        
explore_data(ios_free, 0, 3, True)  # explore_data prints its own output and returns None

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


Our final list has 8,863 apps in the Google Play Store and 3,222 apps in the Apple App Store.

# Analysis: Most Common Apps by Genre

Our goal is to create an app that attracts the most users.

To minimize risk and overhead, our validation strategy for an app idea consists of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We will start our analysis by looking at the most popular genres of apps so we get a general idea of the apps we need to build.

We will do this by first building two functions:

* The first one will build a frequency table with percentages for any column.
* The second one will display those percentages in descending order.

In [15]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages
            
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
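For comparison, the standard library's `collections.Counter` can produce the same kind of table with less code. This is an alternative sketch rather than the approach used in this notebook, and `freq_table_counter` is a hypothetical name.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Percentage frequency table for one column, sorted high to low."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending count order
    return [(value, count / total * 100) for value, count in counts.most_common()]
```

Unlike `display_table`, ties are listed in the order the values were first seen instead of by descending key.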
    

Next, we'll analyze the category column in the Google Play data set.

In [16]:
display_table(free_apps,1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Among the free, English-language apps, the Family category's share (about 19%) is roughly double that of the Game category (about 10%). This is surprising, considering that games are commonly thought to perform best due to their addictive nature.

In [19]:
display_table(ios_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The App Store results differ drastically from the Google Play results: Games make up more than half (about 58%) of the free English apps in the App Store.

On the surface this may seem advantageous, and we could easily conclude that we should build a game because games are the most common type of app. However, a large share of the store also means increased competition and possibly less visibility (a more crowded category gives users more options). It also tells us how many apps exist in a genre, not how many users those apps have.

In order to gauge popularity, we need to look at a different metric.

# Analysis: Most Popular Apps by Genre

To deepen our analysis, we want to gauge popularity by usage. One measure of this is the `Installs` column found in the Google Play data set. The App Store data set lacks a similar column, so as a workaround we will use the total number of user ratings, found in the `rating_count_tot` column.

In [20]:
genres_ios = freq_table(ios_free, -5)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Finance : 31467.944444444445
Social Networking : 71548.34905660378
Games : 22788.6696905016
Utilities : 18684.456790123455
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Book : 39758.5
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Music : 57326.530303030304
Shopping : 26919.690476190477
Education : 7003.983050847458
Medical : 612.0
News : 21248.023255813954
Lifestyle : 16485.764705882353
Weather : 52279.892857142855
Photo & Video : 28441.54375
Travel : 28243.8
Productivity : 21028.410714285714
Catalogs : 4004.0
Navigation : 86090.33333333333
Business : 7491.117647058823


In the App Store, Navigation (about 86,000 ratings per app on average), Reference (about 75,000), and Social Networking (about 72,000) are the most-reviewed genres. Navigation and social networking apps are popular, but they are difficult to build and scale: there is a [long list](https://en.wikipedia.org/wiki/List_of_defunct_social_networking_websites) of defunct social networking services, and building a [good navigation app can be costly](https://stormotion.io/blog/how-much-does-it-cost-to-create-an-app-like-waze/).

Again, we want to build a free app that can easily reach mass adoption and relies mostly on ad revenue. Therefore, the next most-reviewed genres, Music (about 57,000 ratings per app on average) and Weather (about 52,000), are more interesting to look at.

Both of these genres revolve around daily, habitual user behavior, and if we find the right audience segment, we should be able to get more ad clicks.

In [21]:
for app in ios_free:
    if app[-5] == 'Weather':
        print(app[1],':', app[5])

The Weather Channel: Forecast, Radar & Alerts : 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648
WeatherBug - Local Weather, Radar, Maps, Alerts : 188583
MyRadar NOAA Weather Radar Forecast : 150158
AccuWeather - Weather for Life : 144214
Yahoo Weather : 112603
Weather Underground: Custom Forecast & Local Radar : 49192
NOAA Weather Radar - Weather Forecast & HD Radar : 45696
Weather Live Free - Weather Forecast & Alerts : 35702
Storm Radar : 22792
QuakeFeed Earthquake Map, Alerts, and News : 6081
Moji Weather - Free Weather Forecast : 2333
Hurricane by American Red Cross : 1158
Forecast Bar : 375
Hurricane Tracker WESH 2 Orlando, Central Florida : 203
FEMA : 128
iWeather - World weather forecast : 80
Weather - Radar - Storm with Morecast App : 78
Yurekuru Call : 53
Weather & Radar : 37
WRAL Weather Alert : 25
Météo-France : 24
JaxReady : 22
Freddy the Frogcaster's Weather Station : 14
Almanac Long-Range Weather Forecast : 12
TodayAir

It seems that in the weather category, there's a drastic difference between the top 5 apps and the rest of the apps. This may mean the space is harder to break into because a few key players dominate the space. 
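One way to put a number on this concentration is the share of all ratings that goes to the top few apps. The sketch below uses a hypothetical helper, `top_n_share`, and the rating counts copied from the complete rows of the Weather output above; the final, truncated row is left out, so the result is an approximation.

```python
def top_n_share(counts, n):
    """Fraction of the total held by the n largest values in `counts`."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:n]) / total

# Rating counts from the complete rows of the Weather output above
weather_ratings = [
    495626, 208648, 188583, 150158, 144214, 112603, 49192, 45696, 35702,
    22792, 6081, 2333, 1158, 375, 203, 128, 80, 78, 53, 37, 25, 24, 22, 14, 12,
]
share = top_n_share(weather_ratings, 5)
print('Top 5 share of ratings:', round(share * 100, 1), '%')
```

The same function can be applied to the Music list to compare how concentrated the two genres are.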

In [22]:
for app in ios_free:
    if app[-5] == 'Music':
        print(app[1],':', app[5])

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

The Music genre does have a few key players that dominate the list, but those are streaming and song recognition apps. Three of the top 15 apps are karaoke apps, while one is a concert alert app. There are also only 66 apps in this genre (about 2% of the 3,222 free English apps), making it far less crowded than the Games genre.

For the App Store, I recommend that the company build a karaoke app.

Now let's analyze the Google Play store.

In [23]:
display_table(free_apps, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


While this is helpful, the data isn't very precise. For example, we don't know whether an app in the 5,000+ bucket has 5,000 installs or 9,000.

Since our analysis is high-level, we can adopt a simple convention: we will treat the lower bound of each bucket as the install number. For example, we will assume that an app with 5,000+ installs has 5,000 installs, and so on.

To do this, we must first convert the install numbers from strings to floats by removing the commas and plus signs.
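The conversion itself can be sketched as a small function. The name `parse_installs` is hypothetical, and the sketch assumes every value follows the bucket format seen in the frequency table above, such as `'5,000+'`.

```python
def parse_installs(installs):
    """Convert a Google Play install bucket such as '5,000+' to a float."""
    cleaned = installs.replace(',', '').replace('+', '')
    return float(cleaned)

print(parse_installs('5,000+'))      # prints 5000.0
print(parse_installs('1,000,000+'))  # prints 1000000.0
```

Any value that does not match this format would raise a `ValueError`, which is a useful signal that a row needs a closer look.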

In [24]:
freq_table(free_apps,1)

{'ART_AND_DESIGN': 0.6431230960171499,
 'AUTO_AND_VEHICLES': 0.9251946293580051,
 'BEAUTY': 0.5979916506826132,
 'BOOKS_AND_REFERENCE': 2.1437436533904997,
 'BUSINESS': 4.592124562789123,
 'COMICS': 0.6205573733498815,
 'COMMUNICATION': 3.2381812027530184,
 'DATING': 1.8616721200496444,
 'EDUCATION': 1.1621347173643235,
 'ENTERTAINMENT': 0.9590432133589079,
 'EVENTS': 0.7108202640189552,
 'FAMILY': 18.898792733837304,
 'FINANCE': 3.7007785174320205,
 'FOOD_AND_DRINK': 1.241114746699763,
 'GAME': 9.725826469592688,
 'HEALTH_AND_FITNESS': 3.0802211440821394,
 'HOUSE_AND_HOME': 0.8236488773552973,
 'LIBRARIES_AND_DEMO': 0.9364774906916393,
 'LIFESTYLE': 3.9038700214374367,
 'MAPS_AND_NAVIGATION': 1.399074805370642,
 'MEDICAL': 3.5315355974275078,
 'NEWS_AND_MAGAZINES': 2.798149610741284,
 'PARENTING': 0.6544059573507841,
 'PERSONALIZATION': 3.317161232088458,
 'PHOTOGRAPHY': 2.944826808078529,
 'PRODUCTIVITY': 3.8925871601038025,
 'SHOPPING': 2.245289405393208,
 'SOCIAL': 2.66275527473767

In [33]:
categories_free= freq_table(free_apps, 1)
for genre in categories_free:
    total = 0
    len_category = 0
    for app in free_apps:
        category_app = app[1]
        if category_app == genre:
            n_installs = app[5]
            n_installs = n_installs.replace(',','')
            n_installs = n_installs.replace('+','')
            total += float(n_installs)
            len_category += 1      
    avg_n_installs = total/ len_category
    print(genre, ':', avg_n_installs)

NEWS_AND_MAGAZINES : 9549178.467741935
DATING : 854028.8303030303
FAMILY : 3697848.1731343283
BEAUTY : 513151.88679245283
HOUSE_AND_HOME : 1331540.5616438356
BUSINESS : 1712290.1474201474
AUTO_AND_VEHICLES : 647317.8170731707
PHOTOGRAPHY : 17840110.40229885
ENTERTAINMENT : 11640705.88235294
BOOKS_AND_REFERENCE : 8767811.894736841
COMMUNICATION : 38456119.167247385
TRAVEL_AND_LOCAL : 13984077.710144928
SPORTS : 3638640.1428571427
COMICS : 817657.2727272727
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
FINANCE : 1387692.475609756
ART_AND_DESIGN : 1986335.0877192982
VIDEO_PLAYERS : 24727872.452830188
GAME : 15588015.603248259
LIFESTYLE : 1437816.2687861272
LIBRARIES_AND_DEMO : 638503.734939759
PERSONALIZATION : 5201482.6122448975
SOCIAL : 23253652.127118643
HEALTH_AND_FITNESS : 4188821.9853479853
TOOLS : 10801391.298666667
MEDICAL : 120550.61980830671
EVENTS : 253542.22222222222
PARENTING : 542603.6206896552
SHOPPING : 7036877.311557789
MAPS_AND_NAVIGATION : 4056941.77

The Communication category performs best, with an average of about 38 million installs per app.

In [38]:
for app in free_apps:
    if app[1] == 'COMMUNICATION':
        print(app[0],':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

Just from doing a quick browse, we can see that this genre is already saturated. Let's try looking at a less competitive space.

In [39]:
for app in free_apps:
    if app[1] == 'PRODUCTIVITY':
        print(app[0],':', app[5])

Microsoft Word : 500,000,000+
All-In-One Toolbox: Cleaner, Booster, App Manager : 10,000,000+
AVG Cleaner – Speed, Battery & Memory Booster : 10,000,000+
QR Scanner & Barcode Scanner 2018 : 10,000,000+
Chrome Beta : 10,000,000+
Microsoft Outlook : 100,000,000+
Google PDF Viewer : 10,000,000+
My Claro Peru : 5,000,000+
Power Booster - Junk Cleaner & CPU Cooler & Boost : 1,000,000+
Google Assistant : 10,000,000+
Microsoft OneDrive : 100,000,000+
Calculator - unit converter : 50,000,000+
Microsoft OneNote : 100,000,000+
Metro name iD : 10,000,000+
Google Keep : 100,000,000+
Archos File Manager : 5,000,000+
ES File Explorer File Manager : 100,000,000+
ASUS SuperNote : 10,000,000+
HTC File Manager : 10,000,000+
MyMTN : 1,000,000+
Dropbox : 500,000,000+
ASUS Quick Memo : 10,000,000+
HTC Calendar : 10,000,000+
Google Docs : 100,000,000+
ASUS Calling Screen : 10,000,000+
lifebox : 5,000,000+
Yandex.Disk : 5,000,000+
Content Transfer : 5,000,000+
HTC Mail : 10,000,000+
Advanced Task Killer : 50

Again, this is a largely saturated category. Market saturation within the top categories seems inevitable in the Google Play store.

Rather than looking only at saturation, we could look at external data about time spent in apps. According to the [2019 Business of Apps Report](https://www.businessofapps.com/data/app-statistics/) and [App Annie's 2019 Mobile app report](https://www.appannie.com/en/go/state-of-mobile-2019/), communication apps account for 50% of user time spent in apps globally. More time spent in an app means more opportunities for a user to click our ads and produce revenue.

However, this category is dominated by a few big names like WhatsApp and Facebook Messenger, which have far more resources and staff to scale. This raises the question: even if the top-performing apps get users to spend a large amount of time in the app, will we be able to do the same with limited resources?

For this reason, I don't recommend building an app for the Google Play store.



# Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

However, given the saturation of the top Google Play categories, I suggest sticking to Apple's App Store and building a karaoke app, because the Music genre there is smaller and less competitive.