# Profitable App Profiles for the App Store and Google Play Markets

The goal for this project is to analyze data from the App Store and Google Play to help developers understand what type of apps are likely to attract more users. As many free apps receive revenue from ads, the more users engaged with the app, the better.


As of September 2018, there were approximately **2 million iOS apps** available on the App Store, and **2.1 million Android apps** on Google Play.

![Image](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png)

Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

<br>

Collecting data for over 4 million apps requires a significant amount of time and money. Luckily, these are two data sets that seem suitable for the project's goals:

* A [data set](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) containing data about approximately 10,000 Android apps from Google Play, collected in August 2018. 
* A [data set](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)containing data about approximately 7,000 iOS apps from the App Store, collected in July 2017.

<br>

### Exploring the data sets

In [1]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android_data = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios_data = ios[1:]

To make them easier to explore, a function named `explore_data()` can be used repeatedly to print rows in a readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
#First four rows from the Android data set

explore_data(android, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


We can see that our data set contains 10842 rows and 13 columns or categories of information. Some insightful information to use in the analysis includes: "App", "Category", "Rating", "Reviews", "Type", "Price", "Content Rating".

In [4]:
#First four rows from the ios data set

explore_data(ios, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


We can see that our data set contains 7198 rows and 16 columns or categories of information. Some insightful information to use in the analysis includes: "track_name", "price", "rating_count_tot", "user_rating", "content_rating", "prime_genre".

### Deleting inaccurate data

Before beginning our analysis, we need to make sure the data we analyze is accurate, otherwise the results of our analysis will be wrong. This means that we need to:

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

Recall that at our company, we only build **apps that are free** to download and install, and that are directed toward an **English-speaking audience.** This means that we'll need to:

* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.

According to the discussion section of the Google Pay data set, an error has been reported for entry 10472. This entry will be printed below.

In [5]:
print(android_header)
print('\n')

explore_data(android, 10473, 10474, False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




Entry 10472 corresponds to the app, "Life Made WI-Fi Touchscreen Photo Frame". The rating stated is 19 which is incorrect because the maximum rating for a Google Play app is 5. This problem is caused by a missing value in the 'Category' column. This row will be deleted since there is an error in it.

In [6]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10842
10841


### Deleting duplicate entries (part one)

We will check for duplicated entries in the android and ios data set. 

In [7]:
duplicate_android = []
unique_android = []

for row in android[1:]:
    name = row[0]
    if name in unique_android:
        duplicate_android.append(name)
    else:
        unique_android.append(name)

print("Number of duplicate apps:", len(duplicate_android))
print("Example of duplicate apps:", duplicate_android[:8])
print('\n')
print("Number of unique apps:", len(unique_android))

Number of duplicate apps: 1181
Example of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads']


Number of unique apps: 9659


In [8]:
duplicate_ios = []
unique_ios = []

for row in ios[1:]:
    name = row[0]
    if name in unique_ios:
        duplicate_ios.append(name)
    else:
        unique_ios.append(name)

print("Number of duplicate apps:", len(duplicate_ios))
print("Example of duplicate apps:", duplicate_ios[:8])
print('\n')
print("Number of unique apps:", len(unique_ios))

Number of duplicate apps: 0
Example of duplicate apps: []


Number of unique apps: 7197


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. 

Below is an example of the app, "Instagram" which has duplicate entries in the Android data set.



In [9]:
for row in android:
    name = row[0]
    if name == "Instagram":
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


If you examine the rows we printed for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show the data was collected at different times.

We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

We looped through the Google Play data set and found that there are 1,181 duplicates. After we remove the duplicates, we should be left with 9,659 rows.

To remove the duplicates, we will:
* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [10]:
reviews_max = {}

for row in android[1:]:
    name = row[0]
    review = (row[3])
    
    if name in reviews_max and review < reviews_max[name]:
        reviews_max[name] = review
        
    elif name not in reviews_max:
        reviews_max[name] = review
        
print("Expected dictionary length: 9659")
print('\n')
print("Actual:", len(reviews_max))

Expected dictionary length: 9659


Actual: 9659


In [11]:
android_clean = [] #clean data
android_added = [] #just app names

for row in android[1:]:
    name = row[0]
    review = row[3]
    
    #if the app name in the dictionary corresponds to the number of reviews in the data set, then it is the right entry
    if reviews_max[name] == review and name not in android_added:
        android_clean.append(row)
        android_added.append(name)
        
print(len(android_clean))

9659


### Removing non-English Apps

Both data sets have apps with names that suggest they are not directed toward an English-speaking audience. We're not interested in keeping these apps, so we'll remove them. 

One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Behind the scenes, each character we use in a string has a corresponding number associated with it (according to the ASCII system).

Write a function that takes in a string and returns `False` if there's any character in the string that doesn't belong to the set of common English characters, otherwise it returns `True`.

In [12]:
def eng_or_not(string):
    for character in string:
        if ord(character) > 127:
            print(character)
            return False
            
    return True

In [13]:
# Let's test the eng_or_not function

print(eng_or_not("Instagram"))

print("\n")

print(eng_or_not("爱奇艺PPS -《欢乐颂2》电视剧热播"))

print("\n")

print(eng_or_not("Docs To Go™ Free Office Suite"))

print("\n")

print(eng_or_not("Instachat 😜"))


True


爱
False


™
False


😜
False


The function seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.

To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. Our filter function is still not perfect, but it should be fairly effective.

In [14]:
# New function

def eng_or_not(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True
    

In [15]:
# Let's test the new function

print(eng_or_not("爱奇艺PPS -《欢乐颂2》电视剧热播"))

print("\n")

print(eng_or_not("Docs To Go™ Free Office Suite"))

print("\n")

print(eng_or_not("Instachat 😜"))

False


True


True


In [16]:
# Clean and English apps

CE_android = []
CE_ios = []

for row in android_clean:
    name = row[0]
    eng_or_not(name)
    if True:
        CE_android.append(row)

for row in ios[1:]:
    name = row[1]
    eng_or_not(name)
    if True:
        CE_ios.append(row)
        
print("Number of clean, English Android apps:", len(CE_android))
print("Number of clean, English ios apps:", len(CE_ios))

Number of clean, English Android apps: 9659
Number of clean, English ios apps: 7197


### Isolate the free apps

In [17]:
CEF_android = []
CEF_ios = []

for row in CE_android:
    price = row[7]
    if price == "0":
        CEF_android.append(row)
    
for row in CE_ios:
    price = row[4]
    if price == "0.0":
        CEF_ios.append(row)
    
print("Number of clean, English, free Android apps:", len(CEF_android))
print("Number of clean, English, free ios apps:", len(CEF_ios))

Number of clean, English, free Android apps: 8904
Number of clean, English, free ios apps: 4056


As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the "prime_genre" column of the App Store data set, and the "Genres" and "Category" columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:
* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

In [18]:
# Frequency Table function 

def freq_table(data, index):
    a_dic = {}
    total = 0
    for row in data:
        total += 1
        category = row[index]
        if category in a_dic: 
            a_dic[category] +=1
        else:
            a_dic[category] = 1

    percentage = {}
    for item in a_dic:
        percentage[item]= a_dic[item] / total *100
    return percentage


# Display Table function


def display(dataset, index):
    table = freq_table(dataset, index)
    list_of_tuples = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        list_of_tuples.append(key_val_as_tuple)

    table_sorted = sorted(list_of_tuples, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


The sorted() function works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value.

* Takes in two parameters: dataset and index. dataset is expected to be a list of lists, and index is expected to be an integer.
* Generates a frequency table using the freq_table() function (which you're going to write as an exercise).
* Transforms the frequency table into a list of tuples, then sorts the list in a descending order.
* Prints the entries of the frequency table in descending order.

In [19]:
# App Store data set

display(CEF_ios, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


Among the free English apps, more than a half (55.64%) are **games**. The share of **entertainment** apps drop significantly to 8%, followed by **photo and video** apps at 4%, **social networking** apps at 3.53%, and **education** apps at 3.25%.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are more rare. However, the fact that fun apps are the most numerous doesn't also imply that they also have the greatest number of users — the demand might not be the same as the offer.


In [20]:
# Google Play data set

display(CEF_android, 1)

FAMILY : 18.845462713387242
GAME : 9.63611859838275
TOOLS : 8.423180592991914
BUSINESS : 4.5709793351302785
LIFESTYLE : 3.930817610062893
PRODUCTIVITY : 3.8858939802336026
FINANCE : 3.6837376460017968
MEDICAL : 3.526504941599281
SPORTS : 3.4029649595687337
PERSONALIZATION : 3.30188679245283
COMMUNICATION : 3.2457322551662173
HEALTH_AND_FITNESS : 3.054806828391734
PHOTOGRAPHY : 2.9424977538185084
NEWS_AND_MAGAZINES : 2.8301886792452833
SOCIAL : 2.6504941599281224
TRAVEL_AND_LOCAL : 2.324797843665768
SHOPPING : 2.2461814914645104
BOOKS_AND_REFERENCE : 2.178796046720575
DATING : 1.853099730458221
VIDEO_PLAYERS : 1.7857142857142856
MAPS_AND_NAVIGATION : 1.4150943396226416
EDUCATION : 1.2466307277628033
FOOD_AND_DRINK : 1.2353998203054808
ENTERTAINMENT : 1.0332434860736748
LIBRARIES_AND_DEMO : 0.9321653189577718
AUTO_AND_VEHICLES : 0.9209344115004492
HOUSE_AND_HOME : 0.8310871518418688
WEATHER : 0.7973944294699011
EVENTS : 0.7075471698113208
ART_AND_DESIGN : 0.6850853548966757
PARENTING : 0

There are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

![Image](https://camo.githubusercontent.com/9bf24b9efc3d88a3d55f5c09e314987941f0bab5/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f64712d636f6e74656e742f3335302f7079316d385f66616d696c792e706e67)

<br>

Even so, practical apps seem to have a better representation on Google Play compared to App Store. This picture is also confirmed by the frequency table we see for the "Genres" column:


In [21]:
display(CEF_android, -4)

Tools : 8.41194968553459
Entertainment : 6.087151841868823
Education : 5.3908355795148255
Business : 4.5709793351302785
Lifestyle : 3.919586702605571
Productivity : 3.8858939802336026
Finance : 3.6837376460017968
Medical : 3.526504941599281
Sports : 3.447888589398023
Personalization : 3.30188679245283
Communication : 3.2457322551662173
Action : 3.088499550763702
Health & Fitness : 3.054806828391734
Photography : 2.9424977538185084
News & Magazines : 2.8301886792452833
Social : 2.6504941599281224
Travel & Local : 2.3135669362084457
Shopping : 2.2461814914645104
Books & Reference : 2.178796046720575
Simulation : 2.0664869721473496
Dating : 1.853099730458221
Arcade : 1.8418688230008984
Video Players & Editors : 1.7857142857142856
Casual : 1.7520215633423182
Maps & Navigation : 1.4150943396226416
Food & Drink : 1.2353998203054808
Puzzle : 1.1230907457322552
Racing : 0.9883198562443846
Role Playing : 0.9321653189577718
Libraries & Demo : 0.9321653189577718
Strategy : 0.9209344115004492
Auto

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the **App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps.** Now we'd like to get an idea about the kind of apps that have most users.

### Most popular app by genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. 

For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the "rating_count_tot".

In [38]:
genre_freq = freq_table(CEF_ios, -5)

sorted_genre = {}

for genre in genre_freq:
    total = 0
    len_genre = 0
    a_list = []
    for row in CEF_ios:
        app_genre = row[-5]
        rating = float(row[5])
        if row[-5] == genre:
            total += rating
            len_genre += 1
    avg_genre = total/ len_genre
    sorted_genre[genre] = avg_genre
    
display_genre = []
for item in sorted_genre:
    tuple_genre = (sorted_genre[item], item)
    display_genre.append(tuple_genre)
final_table = sorted(display_genre, reverse = True)
for row in final_table:
    print(row[1], ":", row[0])
        

Reference : 67447.9
Music : 56482.02985074627
Social Networking : 53078.195804195806
Weather : 47220.93548387097
Photo & Video : 27249.892215568863
Navigation : 25972.05
Travel : 20216.01785714286
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Health & Fitness : 19952.315789473683
Productivity : 19053.887096774193
Games : 18924.68896765618
Shopping : 18746.677685950413
News : 15892.724137931034
Utilities : 14010.100917431193
Finance : 13522.261904761905
Entertainment : 10822.961077844311
Lifestyle : 8978.308510638299
Book : 8498.333333333334
Business : 6367.8
Education : 6266.333333333333
Catalogs : 1779.5555555555557
Medical : 459.75


Navigation apps have a very high number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [58]:
nav_apps = []
for app in CEF_ios:
    if app[-5] == 'Navigation':
        hi =(app[1], ":", app[5])
        nav_apps.append(hi)
for app in nav_apps[0:4]:
    print(app)


('Waze - GPS Navigation, Maps & Real-time Traffic', ':', '345046')
('Google Maps - Navigation & Transit', ':', '154911')
('Geocaching®', ':', '12811')
('CoPilot GPS – Car Navigation & Offline Maps', ':', '3582')


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seem to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then rework the averages, but we'll leave this level of detail for later.

We will also explore some other genres below:

In [63]:
hf_apps = []
for app in CEF_ios:
    if app[-5] == 'Health & Fitness':
        bye = (app[1], ":", app[5])
        hf_apps.append(bye)
for app in hf_apps[0:4]:
    print(app)


('Calorie Counter & Diet Tracker by MyFitnessPal', ':', '507706')
('Lose It! – Weight Loss Program and Calorie Counter', ':', '373835')
('Weight Watchers', ':', '136833')
('Sleep Cycle alarm clock', ':', '104539')


**Health & Fitness** apps are very relevant in today's society and can appeal to a wide range of audiences and age range. However, this will require fitness professionals for content creation and could be costly. On the contrary, content can be re-used over and over again on these apps as people can repeat the same exercises.

In [64]:
travel_apps = []
for app in CEF_ios:
    if app[-5] == 'Travel':
        bye = (app[1], ":", app[5])
        travel_apps.append(bye)
for app in travel_apps[0:4]:
    print(app)


('Google Earth', ':', '446185')
('Yelp - Nearby Restaurants, Shopping & Services', ':', '223885')
('GasBuddy', ':', '145549')
('TripAdvisor Hotels Flights Restaurants', ':', '56194')


**Travel apps** - these apps seem like a fun idea but will often require a lot of data to be collected on a global scale. Without full, complete information, the app is unlikely to appeal to users.

In [67]:
fin_apps = []
for app in CEF_ios:
    if app[-5] == 'Finance':
        bye = (app[1], ":", app[5])
        fin_apps.append(bye)
for app in fin_apps[0:4]:
    print(app)


('Chase Mobile℠', ':', '233270')
('Mint: Personal Finance, Budget, Bills & Money', ':', '232940')
('Bank of America - Mobile Banking', ':', '119773')
('PayPal - Send and request money safely', ':', '119487')


**Finance apps** — these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

In [68]:
prod_apps = []
for app in CEF_ios:
    if app[-5] == 'Productivity':
        bye = (app[1], ":", app[5])
        prod_apps.append(bye)
for app in prod_apps[0:4]:
    print(app)


('Evernote - stay organized', ':', '161065')
('Gmail - email by Google: secure, fast & organized', ':', '135962')
('iTranslate - Language Translator & Dictionary', ':', '123215')
('Yahoo Mail - Keeps You Organized!', ':', '113709')


**Productivity Apps** - these apps are more straightforward to make compared to the others and if users develop the habit of using the app, they are likely to go on it one to several times a day due to the nature of the app and the need to stay on top with daily tasks.

### Most popular apps on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [72]:
display(CEF_android, 5) 

1,000,000+ : 15.700808625336926
100,000+ : 11.567834681042228
10,000,000+ : 10.489667565139262
10,000+ : 10.276280323450134
1,000+ : 8.434411500449237
100+ : 6.918238993710692
5,000,000+ : 6.7946990116801445
500,000+ : 5.536837376460018
50,000+ : 4.818059299191375
5,000+ : 4.526055705300988
10+ : 3.5377358490566038
500+ : 3.234501347708895
50,000,000+ : 2.279874213836478
100,000,000+ : 2.1226415094339623
50+ : 1.9092542677448336
5+ : 0.7861635220125787
1+ : 0.5166217430368374
500,000,000+ : 0.2695417789757413
1,000,000,000+ : 0.22461814914645103
0+ : 0.04492362982929021
0 : 0.011230907457322553


We don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users, and we don't need perfect precision with respect to the number of users. 

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.

In [80]:
cat_freq = freq_table(CEF_android, 1)
sorted_cat = {}

for category in cat_freq:
    total = 0
    len_cat = 0
    for row in CEF_android:
        genre = row[1]
        downloads = row[5]
        if genre == category:
            downloads = downloads.replace(',', '')
            downloads = downloads.replace('+', '')
            total += float(downloads)
            len_cat += 1
    avg_downloads = total/ len_cat
    sorted_cat[category] = avg_downloads

display_cat = []
for item in sorted_cat:
    tuple_cat = (sorted_cat[item], item)
    display_cat.append(tuple_cat)
final_cat = sorted(display_cat, reverse = True)
for row in final_cat:
    print(row[1], ":", row[0])
    


COMMUNICATION : 38193481.66435986
VIDEO_PLAYERS : 24634790.6918239
SOCIAL : 23253652.127118643
ENTERTAINMENT : 19428913.04347826
PHOTOGRAPHY : 17772018.759541985
PRODUCTIVITY : 16724506.687861271
TRAVEL_AND_LOCAL : 13984077.710144928
GAME : 13022056.468531469
TOOLS : 10800059.298666667
NEWS_AND_MAGAZINES : 9401635.952380951
BOOKS_AND_REFERENCE : 8587351.855670104
SHOPPING : 7001693.425
PERSONALIZATION : 5201142.816326531
WEATHER : 5074486.197183099
FAMILY : 4341651.782479142
SPORTS : 4274688.722772277
HEALTH_AND_FITNESS : 4167457.3602941176
MAPS_AND_NAVIGATION : 3993339.603174603
EDUCATION : 3061711.7117117117
FOOD_AND_DRINK : 1924897.7363636363
ART_AND_DESIGN : 1874132.786885246
BUSINESS : 1700127.9852579853
LIFESTYLE : 1436126.94
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1313681.9054054054
DATING : 854028.8303030303
COMICS : 803234.8214285715
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 513151.88679245283
EVENTS

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [88]:
for row in CEF_android:
    category = row[1]
    downloads = row[5]
    if category == "COMMUNICATION" and (downloads == '1,000,000,000+'
                                        or downloads == '100,000,000+'
                                        or downloads == '500,000,000+'):
                                    
        print(row[0], ":", downloads)

Messenger – Text and Video Chat for Free : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
Firefox Browser fast & private : 100,000,000+
Yahoo Mail – Stay Organized : 100,000,000+
imo beta free calls and text : 100,000,000+
imo free video calls and chat : 500,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Who : 100,000,000+
WeChat : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Android Messages : 100,000,000+
Telegram : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
BBM - Free Calls & Messages : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
WhatsApp Messen

We'll explore some other popular categories below:

In [89]:
for row in CEF_android:
    category = row[1]
    downloads = row[5]
    if category == "PHOTOGRAPHY" and (downloads == '1,000,000,000+'
                                        or downloads == '100,000,000+'
                                        or downloads == '500,000,000+'):
                                    
        print(row[0], ":", downloads)

Sweet Selfie - selfie camera, beauty cam, photo edit : 100,000,000+
YouCam Perfect - Selfie Photo Editor : 100,000,000+
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100,000,000+
LINE Camera - Photo editor : 100,000,000+
Camera360: Selfie Photo Editor with Funny Sticker : 100,000,000+
Photo Editor Pro : 100,000,000+
Retrica : 100,000,000+
Candy Camera - selfie, beauty camera, photo editor : 100,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
Photo Collage Editor : 100,000,000+
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100,000,000+
B612 - Beauty & Filter Camera : 100,000,000+
PicsArt Photo Studio: Collage Maker & Pic Editor : 100,000,000+
S Photo Editor - Collage Maker , Photo Collage : 100,000,000+
Z Camera - Photo Editor, Beauty Selfie, Collage : 100,000,000+
AR effect : 100,000,000+
Google Photos : 1,000,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera : 100,000,000+
Photo Editor Collage Maker Pro : 100,000,000+


**Photography Apps** - Very popular and in high demand but probably complicated to create. Simpler ones could include apps that provide pretty templates and filters for Instagram stories. 

In [90]:
for row in CEF_android:
    category = row[1]
    downloads = row[5]
    if category == "TOOLS" and (downloads == '1,000,000,000+'
                                        or downloads == '100,000,000+'
                                        or downloads == '500,000,000+'):                               
        print(row[0], ":", downloads)

Cache Cleaner-DU Speed Booster (booster & cleaner) : 100,000,000+
Calculator : 100,000,000+
Device Help : 100,000,000+
Account Manager : 100,000,000+
Samsung Calculator : 100,000,000+
Google Korean Input : 100,000,000+
Tiny Flashlight + LED : 100,000,000+
GO Keyboard - Cute Emojis, Themes and GIFs : 100,000,000+
Speedtest by Ookla : 100,000,000+
Applock : 100,000,000+
Google Translate : 500,000,000+
Clean Master- Space Cleaner & Antivirus : 500,000,000+
Lookout Security & Antivirus : 100,000,000+
Gboard - the Google Keyboard : 500,000,000+
Google : 1,000,000,000+
Google Now Launcher : 100,000,000+
SHAREit - Transfer & Share : 500,000,000+
360 Security - Free Antivirus, Booster, Cleaner : 100,000,000+
Samsung Smart Switch Mobile : 100,000,000+
Share Music & Transfer Files - Xender : 100,000,000+
Avast Mobile Security 2018 - Antivirus & App Lock : 100,000,000+
AppLock : 100,000,000+
AVG AntiVirus 2018 for Android Security : 100,000,000+
Security Master - Antivirus, VPN, AppLock, Booster 

** Tools Apps** - A possible opportunity as many people seek tools for convenience. However, engagement rate is likely to be low as these are transactional apps that people often do not have an attachment to.

In [93]:
for row in CEF_android:
    category = row[1]
    downloads = row[5]
    if category == "HEALTH_AND_FITNESS" and (downloads == '1,000,000,000+'
                                        or downloads == '100,000,000+'
                                        or downloads == '500,000,000+'):
                                    
        print(row[0], ":", downloads)

Period Tracker - Period Calendar Ovulation Tracker : 100,000,000+
Samsung Health : 500,000,000+


**Health and Fitness apps** - In comparisson, not as many of these apps with over 100M downloads. This could be an opportunity. But there are also many Health and Fitness apps in the Google Play store with a smaller number of downloads which signigies strong competition.

In [94]:
for row in CEF_android:
    category = row[1]
    downloads = row[5]
    if category == "PRODUCTIVITY" and (downloads == '1,000,000,000+'
                                        or downloads == '100,000,000+'
                                        or downloads == '500,000,000+'):
                                    
        print(row[0], ":", downloads)

Microsoft Excel : 100,000,000+
Google Keep : 100,000,000+
Google Calendar : 500,000,000+
Samsung Notes : 100,000,000+
SwiftKey Keyboard : 100,000,000+
Google Docs : 100,000,000+
Microsoft PowerPoint : 100,000,000+
Google Slides : 100,000,000+
ColorNote Notepad Notes : 100,000,000+
Microsoft OneNote : 100,000,000+
Microsoft OneDrive : 100,000,000+
Cloud Print : 500,000,000+
Adobe Acrobat Reader : 100,000,000+
Microsoft Word : 500,000,000+
Dropbox : 500,000,000+
ES File Explorer File Manager : 100,000,000+
Evernote – Organizer, Planner for Notes & Memos : 100,000,000+
Microsoft Outlook : 100,000,000+
Google Drive : 1,000,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100,000,000+
CamScanner - Phone PDF Creator : 100,000,000+
Google Sheets : 100,000,000+


**Productivity apps** - Dominated by some large software companies like Microsoft, Dropbox, Evernote. Google's applications (if available) are often amongst the most downloaded in every category.

### Conclusion

Consider **creating a productivity app** as this is simpler compared to finance apps, games, photography apps. Create free features/ functions that large players do not currently offer or require users to pay for. Gamifying the productivity app is also a good way to encourage user retention and engagement to increase ad revenue. 

**Advantages** of productivity apps over other apps:
* Does not require a skilled professional to provide specific areas of knowledge.
* Does not require constant content creation 
* Lower competition compared to communications and social networking apps.
* Relatively simpler interface and design.