# Profitable App Profiles for the App Store and Google Play Markets

This project mimics a real-world problem: I play the role of a data analyst at a company that builds Android and iOS mobile apps. My job is to enable our team of developers to make data-driven decisions about the kinds of apps they build.

This project aims to help the developers understand what types of apps are likely to attract more users on Google Play and the App Store. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store. The analysis focuses on free, English-language applications, as this is the market we are interested in.

# Collecting and Exploring Data

According to Dataquest, as of September 2018 there were about 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play.

For this introductory project we will use only a sample of the data that is available publicly on Kaggle:

1- Android apps from Google Play data set, containing almost 11K applications:
https://www.kaggle.com/lava18/google-play-store-apps 

2- iOS apps from the App Store data set, containing approx. 7K applications:
https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

First, I will read in the two data sets.

In [1]:
from csv import reader

## The Android Apps data set ##
opened_file = open("googleplaystore.csv", encoding="Latin-1")
# the default encoding raised an error for this file, so Latin-1 is used
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

## The ios Apps data set ##
opened_file = open("AppleStore.csv", encoding="Latin-1")
# the default encoding raised an error for this file, so Latin-1 is used
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
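
The two reading steps above repeat the same pattern, so they could be wrapped in a small helper. The following is only a sketch: `load_csv` is a hypothetical name, the encoding is left as a parameter, and the demo writes a throwaway file rather than touching the real data sets. Using `with` also closes the file automatically.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf-8"):
    """Read a CSV file and return (header, rows). Hypothetical helper."""
    with open(path, encoding=encoding, newline="") as csv_file:
        rows = list(csv.reader(csv_file))
    return rows[0], rows[1:]


# Demo on a made-up two-line file, not on the real data sets
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as demo_file:
        csv.writer(demo_file).writerows([["App", "Rating"], ["Demo", "4.1"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, the call would look like `load_csv("googleplaystore.csv", encoding="Latin-1")`.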

I'll continue by exploring these two data sets. To make this process easier, let's create a function named explore_data() that can be repeatedly used to print rows in a readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print("\n")
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Now I have an overview of the size and content of the data set: 10841 observations (Google Play applications) and 13 variables.

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In the case of the App Store data set, I have 7197 applications and 16 variables.
Not all column headers are self-explanatory. A more detailed description of each variable can be found at the links provided above.
Not all variables are useful for my analysis either. I will focus on those that may help me answer the question: what type of apps are likely to attract more users?

# Deleting an Erroneous Entry

The Google Play data set has a dedicated discussion section on Kaggle. In one of the threads (https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) a user reported a problem with the rating of one entry, identified as row 10472.
I will verify whether this is really the case by comparing this row with a correct one.

In [4]:
print(android[10472])  # supposedly incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 represents the "Life Made WI-Fi Touchscreen Photo Frame" app. The value in the "Rating" column is 19, which is impossible because the maximum rating for a Google Play app is 5. A value is probably missing: the row has 12 values against 13 column headers, and no category follows the app name, so all subsequent columns have shifted to the left. This observation is unusable, which is why I will delete it.

In [5]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840
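
Rows like this one could also be found programmatically rather than through the discussion board. A minimal sketch, assuming that a shifted row always has a different number of values than the header; `find_malformed_rows` and the sample data are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample; with the real data the call would be
# find_malformed_rows(android_header, android), run before any deletion
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "1.9"],  # the category is missing, so the values shifted left
    ["App C", "GAME", "3.9"],
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, row)
```

This check only catches rows with a wrong length; a row with a wrong value in the right position would still need a separate check.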


# Identifying and Cleaning Duplicates

Scanning the discussion section, I've seen a few debates regarding duplicates in the data set. I'll have a closer look at this issue.

In [6]:
duplicates_app = []
unique_app = []

for app in android:
    name = app[0]
    if name in unique_app:
        duplicates_app.append(name)
    else:
        unique_app.append(name)
        
print("Number of duplicates in apps:", len(duplicates_app))
print("\n")
print("Examples of duplicates:", duplicates_app[:15])

Number of duplicates in apps: 1181


Examples of duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
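
The same count can be obtained with a `set`, which makes each membership test fast instead of scanning a growing list. This is only a sketch on made-up names; `find_duplicates` is a hypothetical helper, not part of the analysis above.

```python
def find_duplicates(names):
    """Return every repeated occurrence of a name, in order of appearance."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates


# Made-up sample; with the real data: find_duplicates(app[0] for app in android)
sample_names = ["Box", "Slack", "Box", "Box", "Slack", "Zoom"]
sample_duplicates = find_duplicates(sample_names)
print(len(sample_duplicates), sample_duplicates)
```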


There are 1181 duplicates in the data set for Android apps. I don't want to count multiple observations for the same application, as this would distort my analysis, so duplicates should be removed from the data set being analyzed. The important question here is: how do I choose which observations to delete and which to keep?

I will take a closer look at a couple of the duplicated apps.

In [7]:
print(android_header)
print("\n")
for app in android:
    name=app[0]
    if name == "Google Ads":
        print(app)
print("\n")
for app in android:
    name=app[0]
    if name == "Slack":
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000

The only differences between observations for the same application appear in the "Reviews" column. This suggests that the data for those applications were collected at different times. We can use this as a criterion for cleaning the duplicates: keep the observation with the highest number of reviews, which should be the most recent and therefore the most reliable one for each application.

To retrieve those unique observations I will:

1- Build a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app.
2- Create a new data set using that dictionary, so that I end up with only one entry per application (the one with the highest number of reviews).

In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

As can be seen above, I counted 1181 duplicates in the data set. Therefore, the length of the dictionary should equal the length of the data set minus the number of those duplicates.

In [9]:
print("Expected length: ", len(android)-1181)
print("Actual length: ", len(reviews_max))

Expected length:  9659
Actual length:  9659


Both numbers are equal, which means that the dictionary was created correctly. Now we can use it to remove the duplicates.
For each application, I will keep the row from the original data set whose review count matches the maximum stored in the dictionary. I also need to account for duplicates that share the same maximum number of reviews, where this check alone would keep several rows.
To achieve this, the if statement combines two conditions: the review count must equal the maximum, and the app name must not have been added already. The clean data should then contain exactly one entry per application, the one with the highest number of reviews.

In [10]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name]==n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

Let's verify that the clean data in fact contains only unique observations.

In [11]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The total number of rows (applications) is 9659, which is consistent with what I expected.
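
The two-step procedure above could also be written as a single pass that stores the best row per app name directly. The sketch below is an alternative under stated assumptions, not the method used in this notebook: `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened to four columns, and the output order follows the first appearance of each name, which may differ from the order produced above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened, made-up rows: name, category, rating, reviews
sample_apps = [
    ["Google Ads", "BUSINESS", "4.3", "29313"],
    ["Slack", "BUSINESS", "4.4", "51507"],
    ["Google Ads", "BUSINESS", "4.3", "29331"],
    ["Slack", "BUSINESS", "4.4", "51507"],
]

sample_clean = keep_most_reviewed(sample_apps)
for row in sample_clean:
    print(row)
```

Because the dictionary lookup replaces the `name not in already_added` list scan, this version avoids re-scanning a list for every row.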

# Discarding Non-English Apps

Since the company for which I am preparing this analysis develops its apps in English, I'd like to analyze only the apps that are directed toward an English-speaking audience.
However, our data sets contain applications in other languages. I will have to identify them and remove them from the data sets. One way to do this is to remove all apps whose name contains a symbol that is not commonly used in English text. English text commonly includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

Luckily, all the characters commonly used in English text are encoded in the ASCII standard. Each character in this standard is assigned a number between 0 and 127. I can use this to build a function that checks whether a given app name contains only ASCII characters.


In [12]:
def is_english(string):
    for character in string:
        if ord(character)>127:
            return False
    return True

print(is_english("Google Ads"))
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播")) #app found in the data set

True
False


The result is correct, so the function seems to serve its purpose. However, some English app names include non-ASCII characters, for example emojis, ™, or em and en dashes. Those characters are not part of the ASCII standard, which means that this function will wrongly label some English apps as foreign (as can be seen in the example below).

In [13]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


To lower the risk of losing information, I will modify the function to allow up to 3 non-ASCII characters in the name. It's not a perfect solution, as it may let some non-English apps into our data set, but at this stage of the analysis I will not spend more time on the classification.

In [14]:
def is_english(string):
    no_ascii = 0
    
    for character in string:
        if ord(character)>127:
            no_ascii += 1
    
    if no_ascii >3:
        return False
    else:
        return True

print(is_english("Google Ads"))
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
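
The threshold of 3 is a judgment call, so it can be useful to expose it as a parameter. A minimal sketch follows; `is_probably_english` and `max_non_ascii` are hypothetical names. The last lines illustrate one caveat, assuming the source file is really UTF-8 (the 'â\x80\x93' sequence in the output above suggests so): a character read as Latin-1 is split into one character per byte, so a single emoji counts as 4 non-ASCII characters.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english("Docs To Go™ Free Office Suite"))
print(is_probably_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_probably_english("Instachat 😜", max_non_ascii=0))  # strict ASCII-only check

# Simulate reading UTF-8 bytes as Latin-1: the emoji becomes 4 characters
mis_decoded = "Instachat 😜".encode("utf-8").decode("latin-1")
print(len("Instachat 😜"), len(mis_decoded))
print(is_probably_english(mis_decoded))
```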


Now I can use the constructed function on both data sets.

In [15]:
android_eng = []
ios_eng = []

for app in android_clean:
    if is_english(app[0]):
        android_eng.append(app)
        
for app in ios:
    if is_english(app[1]):
        ios_eng.append(app)
        
explore_data(android_eng, 0, 3, True)
print("\n")
explore_data(ios_eng, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9500
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12'

I end up with a data set of 9500 Android applications and 6100 iOS applications.

# Retrieving Only Free-of-Charge Apps

Since the company builds only free apps and earns its revenue from in-app ads, I want to focus the analysis on the applications that are free to download and install.
That is why I will isolate the free applications in both Google Play and the App Store.

In [16]:
android_final = []
ios_final = []

for app in android_eng:
    if app[6] == "Free":
        android_final.append(app)
        
for app in ios_eng:
    if app[4] == "0.0":
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8759
3169


Our final data set consists of 8759 observations for android and 3169 observations for ios.
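
The filter above compares raw strings ("Free" and "0.0"), which works here but depends on the exact text in each file. A slightly more defensive sketch converts the price to a number first. `is_free` is a hypothetical helper, and the assumption that paid Android prices carry a leading "$" (for example "$4.99") is not shown in the output above.

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.replace("$", "").strip()  # assumed '$' prefix on paid apps
    return float(cleaned) == 0.0


# Made-up price strings covering both data sets' formats
for price in ["0", "0.0", "0.00", "$4.99", "2.99"]:
    print(price, "->", is_free(price))
```

With the real data, this could be applied to the "Price" column (index 7) for Android and the "price" column (index 4) for iOS.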

# Most Common App Genres on the Android and iOS Markets

The company's revenue depends strongly on the number of people using its apps. That is why the aim of this project, as mentioned before, is to determine the features of apps that are more likely to attract a higher number of users.
For the purpose of this project, let's assume that the company's strategy for an app idea consists of the following steps:

1- Build a minimal Android version of the app, and add it to Google Play.
2- If the app has a good response from users, develop it further.
3- If the app is profitable after six months, also build an iOS version of the app and add it to the App Store.

Since our ultimate aim is to succeed with the app in both the App Store and Google Play, we need to find characteristics that work in both markets. The first step in this analysis is to understand which genres are the most common in each market. For that I will construct frequency tables.

In [17]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

As the first step, I will analyze the frequency table of the Google Play data set:

In [18]:
display_table(android_final, 1) # "Category" column

FAMILY : 18.929101495604524
GAME : 9.658636830688435
TOOLS : 8.48270350496632
BUSINESS : 4.646649160863112
PRODUCTIVITY : 3.9388057997488297
LIFESTYLE : 3.9159721429386916
FINANCE : 3.721886060052517
MEDICAL : 3.5506336339764815
SPORTS : 3.3337138942801694
PERSONALIZATION : 3.288046580659892
COMMUNICATION : 3.2537960954446854
HEALTH_AND_FITNESS : 3.0939604977737187
PHOTOGRAPHY : 2.9797922137230275
NEWS_AND_MAGAZINES : 2.8085397876469917
SOCIAL : 2.6487041899760246
TRAVEL_AND_LOCAL : 2.34044982303916
SHOPPING : 2.249115195798607
BOOKS_AND_REFERENCE : 2.1463637401529856
DATING : 1.8609430300262586
VIDEO_PLAYERS : 1.8038588880009132
MAPS_AND_NAVIGATION : 1.3814362370133577
FOOD_AND_DRINK : 1.23301746774746
EDUCATION : 1.1759333257221143
ENTERTAINMENT : 0.9590135860258021
AUTO_AND_VEHICLES : 0.9247631008105948
LIBRARIES_AND_DEMO : 0.9019294440004566
WEATHER : 0.7877611599497659
HOUSE_AND_HOME : 0.7877611599497659
EVENTS : 0.7192601895193516
ART_AND_DESIGN : 0.6507592190889371
PARENTING : 0

In [19]:
display_table(android_final, -4) # "Genres" column

Tools : 8.471286676561252
Entertainment : 6.085169539901815
Education : 5.388743007192602
Business : 4.646649160863112
Productivity : 3.9388057997488297
Lifestyle : 3.9045553145336225
Finance : 3.721886060052517
Medical : 3.5506336339764815
Sports : 3.4022148647105834
Personalization : 3.288046580659892
Communication : 3.2537960954446854
Action : 3.105377326178788
Health & Fitness : 3.0939604977737187
Photography : 2.9797922137230275
News & Magazines : 2.8085397876469917
Social : 2.6487041899760246
Travel & Local : 2.329032994634091
Shopping : 2.249115195798607
Books & Reference : 2.1463637401529856
Simulation : 2.055029112912433
Dating : 1.8609430300262586
Arcade : 1.8266925448110514
Video Players & Editors : 1.7810252311907753
Casual : 1.735357917570499
Maps & Navigation : 1.3814362370133577
Food & Drink : 1.23301746774746
Puzzle : 1.1416828405069073
Racing : 1.0046808996460783
Role Playing : 0.947596757620733
Auto & Vehicles : 0.9247631008105948
Strategy : 0.9133462724055257
Librari

It is not clear how the "Genres" and "Category" variables were constructed or what the difference between them is. What is obvious at first glance is that the former is much more detailed, with more distinct values. Since I am only interested in the big picture for now, I will focus on the "Category" column.

The leading category is "FAMILY", accounting for almost 19% of the data set. It is followed by "GAME", which has only about half as many applications on Google Play (9.66%), and then by "TOOLS" with 8.48%.

In [20]:
display_table(ios_final, -5) # "prime_genre" column

Games : 58.53581571473651
Entertainment : 7.82581255916693
Photo & Video : 5.0489113284947935
Education : 3.72357210476491
Social Networking : 3.2817923635216157
Shopping : 2.5244556642473968
Utilities : 2.398232881035027
Sports : 2.1773430104133795
Music : 2.0511202272010096
Health & Fitness : 1.9880088355948247
Productivity : 1.7040075733669928
Lifestyle : 1.5462290943515304
News : 1.3253392237298833
Travel : 1.1360050489113285
Finance : 1.1044493531082362
Weather : 0.8520037866834964
Food & Drink : 0.8204480908804039
Reference : 0.5364468286525718
Business : 0.5364468286525718
Book : 0.3786683496371095
Navigation : 0.18933417481855475
Medical : 0.18933417481855475
Catalogs : 0.12622278321236985


The App Store frequency table shows, at first glance, a different market, where the "Games" genre leads with an overwhelming 58.5%. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.72% of the apps are designed for education.

However, let's keep in mind that the number of apps does not directly imply the number of users: demand might not match supply.

It appears that the App Store (the English, free-of-charge subset) is dominated by applications intended for fun and entertainment, while practical apps (education, shopping, utilities, productivity, lifestyle, etc.) are less common.
On Google Play, most applications seem to be designed for practical purposes. Nevertheless, a closer look at the "FAMILY" category shows that it also contains games for kids, so part of it may count toward games and entertainment as well.

In summary, the App Store subset is dominated by apps designed for fun, while Google Play shows a more balanced landscape of practical and for-fun apps.

Now I'd like to explore which apps have the most users.

# Most Popular App Genres on Google Play

To find out which kinds of apps attract the most users, I will calculate the average number of installs per app genre.
For the Google Play data set, I will use the "Installs" column, which represents the number of installs. However, the data for this variable is not precise, as the values are open-ended (100+, 1,000+, 5,000+, etc.).

In [21]:
display_table(android_final, 5)

1,000,000+ : 15.74380637059025
100,000+ : 11.519579860714693
10,000,000+ : 10.606233588309168
10,000+ : 10.20664459413175
1,000+ : 8.368535220915629
100+ : 6.9528484986870644
5,000,000+ : 6.872930699851581
500,000+ : 5.548578604863569
50,000+ : 4.772234273318872
5,000+ : 4.486813563192145
10+ : 3.5163831487612742
500+ : 3.208128781824409
50,000,000+ : 2.2833656810138145
100,000,000+ : 2.1349469117479165
50+ : 1.929444000456673
5+ : 0.7877611599497659
1+ : 0.5137572782281082
500,000,000+ : 0.27400388172165774
1,000,000,000+ : 0.22833656810138142
0+ : 0.04566731362027629


Such data won't give me precise results, as the bins are very wide: I cannot know whether an app with 10,000+ installs has 10,001 installs or closer to 49,000. However, for the purpose of this project, we don't need very precise data. Our aim is only to get an idea of which genres attract more users, without specifying the exact number.

To simplify the analysis, I will treat each bin as equal to its lower bound. That means I will assume that 10,000+ installs corresponds to 10,000 installs, and so on.

First, I will remove the + sign and the comma separators and convert the numbers to floats, so I can perform calculations on them.

In [22]:
categ_android = freq_table(android_final, 1)

for category in categ_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total/len_category
    print(category, ":", avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 654074.8271604938
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8329168.936170213
BUSINESS : 1712290.1474201474
COMICS : 859042.1568627451
COMMUNICATION : 38550548.03859649
DATING : 861409.5521472392
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11767380.952380951
EVENTS : 253542.22222222222
FINANCE : 1365500.4049079753
FOOD_AND_DRINK : 1951283.8055555555
HEALTH_AND_FITNESS : 4219697.055350553
HOUSE_AND_HOME : 1385541.463768116
LIBRARIES_AND_DEMO : 649314.0506329114
LIFESTYLE : 1447458.976676385
GAME : 15571586.690307328
FAMILY : 3718295.0422195415
MEDICAL : 121161.87781350482
SOCIAL : 23628689.23275862
SHOPPING : 7103190.78680203
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3750580.6438356163
TRAVEL_AND_LOCAL : 14120454.07804878
TOOLS : 10902378.834454913
PERSONALIZATION : 5240358.986111111
PRODUCTIVITY : 16787331.344927534
PARENTING : 552875.1785714285
WEATHER : 5212877.101449275
VIDEO_PLAYERS : 24878048.860759493
NEWS_AND_MAGAZ
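
The nested loop above rescans the whole data set once per category. The same averages could be computed in a single pass by accumulating totals and counts per category. The sketch below assumes the lower-bound convention described earlier; `parse_installs` and `average_installs_by_category` are hypothetical helpers, and the rows are made up and shortened to three columns.

```python
from collections import defaultdict


def parse_installs(text):
    """Convert a bin label such as '10,000+' to its lower bound, 10000."""
    return int(text.replace("+", "").replace(",", ""))


def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return {category: average lower-bound installs} in one pass over rows."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up rows: name, category, installs
sample_rows = [
    ["App A", "TOOLS", "10,000+"],
    ["App B", "TOOLS", "1,000,000+"],
    ["App C", "GAME", "500+"],
]

sample_averages = average_installs_by_category(sample_rows, installs_index=2)
for category, average in sample_averages.items():
    print(category, ":", average)
```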

Apps in the "COMMUNICATION" category have the most installs on average: 38,550,548. This value, however, is heavily skewed upward by a few apps with an extreme number of installs, such as WhatsApp, Facebook Messenger, Skype, Google Chrome, and Gmail.
Here is a closer look at the giants in the communication app market:

In [23]:
for app in android_final:
    if app[1] == "COMMUNICATION" and (app[5]=="1,000,000,000+"
                                      or app[5]=="500,000,000+"
                                      or app[5]=="100,000,000+"):
        print(app[0], ":", app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger â Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Me

If I leave out all communication apps with 100 million or more installs, the average drops roughly tenfold:

In [24]:
under_100_mln = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace("+", "")
    n_installs = n_installs.replace(",", "")
    if (app[1] == "COMMUNICATION") and (float(n_installs)<100000000):
        under_100_mln.append(float(n_installs))
sum(under_100_mln)/ len(under_100_mln)

3437620.895348837
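
Because a handful of apps dominate the average, the median could serve as a complementary summary that is less sensitive to extreme values. The sketch below only illustrates the effect on made-up install counts; it is not a result from the data set.

```python
from statistics import mean, median

# Made-up lower-bound install counts: two giants and five small apps
sample_installs = [1_000_000_000, 500_000_000, 1_000_000, 100_000, 10_000, 1_000, 100]

sample_mean = mean(sample_installs)
sample_median = median(sample_installs)

print("mean:  ", sample_mean)    # pulled up by the two giants
print("median:", sample_median)  # the middle value, 100000
```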

A similar pattern can be found in the "VIDEO_PLAYERS" category, which averages 24,878,048 installs. Here the market is led by applications like YouTube, Google Play Movies & TV, and MX Player. The situation repeats itself in the "SOCIAL" category, where the average hides the influence of Facebook, Instagram, Google+, etc., and in "PRODUCTIVITY", led by Microsoft Word, Dropbox, Google Calendar, and Evernote. These categories are dominated by a few enormous companies, and network effects play a significant role: in applications like Facebook, YouTube, Google+, and Instagram, the main value comes from social networks operating on a global scale. This means that it will be almost impossible for a small company such as my client to compete in those markets.

The "GAME" category also seems popular, but as I found in the previous section, this market seems saturated, which is why I will focus on other possibilities. The "BOOKS_AND_REFERENCE" category piqued my interest: the average number of installs for this genre is 8,329,168.

In [25]:
for app in android_final:
    if app[1] == "BOOKS_AND_REFERENCE":
        print(app[0], ":", app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra â free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+

This category contains a wide range of apps, such as software for processing and reading ebooks, various library collections, dictionaries, and tutorials on programming or languages. There are some applications with a very high number of installs that skew the average:

In [26]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Audiobooks from Audible : 100,000,000+


Nevertheless, there are only a few of those applications. In my opinion, the best strategy for my company would be to focus on apps of middling popularity within a given category. Here that would mean somewhere between 1 million and 100 million installs:

In [27]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra â free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+

This segment is dominated by software for processing and reading ebooks, along with various libraries and dictionaries. Given the strong competition, investing in similar applications would not be the best strategy.

It seems that popular books, like the Quran, have several apps built around them, which suggests that such applications may be profitable. However, the market is full of libraries and dictionaries, so a new application would have to offer added value. For example, it could feature summaries of the book, quotes of the day, an audio version, and a discussion panel for readers.
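
The mid-range selection above matches install strings one by one. A minimal sketch of an alternative, assuming every install value keeps the `'1,000,000+'` format seen in the output: convert the string to an integer and filter by a numeric range. The helper names `parse_installs` and `apps_in_install_range` are hypothetical, and the sample rows are illustrative stand-ins that only mimic the column positions (0 = name, 1 = category, 5 = installs) used with `android_final`.

```python
def parse_installs(installs):
    """Convert an install string such as '1,000,000+' to an integer."""
    return int(installs.replace(',', '').replace('+', ''))


def apps_in_install_range(dataset, category, low, high):
    """Return (name, installs) pairs for apps in `category` whose
    install count n satisfies low <= n < high."""
    selected = []
    for app in dataset:
        if app[1] == category and low <= parse_installs(app[5]) < high:
            selected.append((app[0], app[5]))
    return selected


# Illustrative rows only: columns 2-4 are empty placeholders.
sample = [
    ['Google Play Books', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000,000+'],
    ['Wikipedia', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Book store', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
]

mid_range = apps_in_install_range(sample, 'BOOKS_AND_REFERENCE',
                                  1_000_000, 100_000_000)
for name, installs in mid_range:
    print(name, ':', installs)
```

Called on `android_final` with the same bounds, this should select the same install buckets as the cell above (`'1,000,000+'` through `'50,000,000+'`), while making the range easy to change.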

# Most popular genres for Apps in the App Store

Unlike the Google Play data set, the App Store data set has no information on the number of installs. Therefore I will use a proxy: the "rating_count_tot" column, which holds the number of users who rated a given app.

In [28]:
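genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0       # sum of rating counts for this genre
    len_genre = 0   # number of apps in this genre

    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            total += float(app[5])   # app[5] is rating_count_tot
            len_genre += 1

    # Average number of user ratings per app (not an average rating score)
    avg_n_ratings = total / len_genre
    print(genre, ":", avg_n_ratings)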
  

Social Networking : 72916.54807692308
Photo & Video : 28441.54375
Games : 22985.211320754715
Music : 58205.03076923077
Reference : 79350.4705882353
Health & Fitness : 24037.634920634922
Weather : 54215.2962962963
Utilities : 19900.473684210527
Travel : 31358.5
Shopping : 27816.2
News : 21750.071428571428
Navigation : 86090.33333333333
Lifestyle : 16739.34693877551
Entertainment : 14364.774193548386
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 46384.916666666664
Finance : 32367.02857142857
Education : 7003.983050847458
Productivity : 21799.14814814815
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


The genre with the highest average number of user ratings is "Navigation". This is surprising, because the previous analysis showed that this genre accounts for only approx. 0.19% of the applications in the App Store data set.

In [29]:
for app in ios_final:
    if app[-5] == "Navigation":
        print(app[1], ":", app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


This figure is heavily skewed by two leading applications, Waze and Google Maps, which together have almost half a million user ratings.
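
As a quick check of that claim, the sketch below recomputes the genre total from the six rating counts printed above and measures the share held by Waze and Google Maps. The dictionary and variable names are illustrative, and the app names are shortened.

```python
# Rating counts copied from the "Navigation" output above.
navigation_ratings = {
    'Waze': 345046,
    'Google Maps': 154911,
    'Geocaching': 12811,
    'CoPilot GPS': 3582,
    'ImmobilienScout24': 187,
    'Railway Route Search': 5,
}

top_two = navigation_ratings['Waze'] + navigation_ratings['Google Maps']
total = sum(navigation_ratings.values())

print('Ratings for the top two apps:', top_two)
print('Genre average:', total / len(navigation_ratings))
print('Share of the top two apps: {:.1%}'.format(top_two / total))
```

The two apps hold 499,957 of the 516,542 ratings in the genre, roughly 97%, so the genre average says very little about a typical navigation app.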

The same pattern appears in other genres. In "Social Networking", for example, the average number of user ratings is driven by a few extreme values from applications like Facebook, Pinterest, and Skype. A similar situation occurs in the "Music" genre.

The goal of this project is to find popular genres, but that question is difficult to answer when the results are biased by a few huge market players. The averages reflect a skewed distribution in which very few apps account for hundreds of thousands of user ratings. This could be partially corrected by removing the extreme values in each genre and recomputing the averages. However, I won't go into that level of detail, as it is not necessary for the purposes of this project.
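
For reference, the correction described above could be sketched as follows; it is not applied in this analysis. The sketch compares the plain mean with the median and with a mean computed after dropping apps at or above a rating cutoff. The function name `summarize_ratings` and the 100,000 cutoff are assumptions chosen for illustration, and the sample counts are the "Navigation" values printed earlier.

```python
from statistics import mean, median


def summarize_ratings(counts, cutoff=100_000):
    """Return the mean, the median, and the mean of counts below `cutoff`."""
    kept = [c for c in counts if c < cutoff]
    return {
        'mean': mean(counts),
        'median': median(counts),
        # None if every app is at or above the cutoff.
        'mean_below_cutoff': mean(kept) if kept else None,
    }


navigation = [345046, 154911, 12811, 3582, 187, 5]
summary = summarize_ratings(navigation)
for name, value in summary.items():
    print(name, ':', value)
```

The median is less sensitive to a few very large values than the mean, so it may be a better indicator of how a typical app in a genre performs.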

The second-highest average number of ratings belongs to the "Reference" genre:

In [30]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
Jishokun-Japanese English Dictionary & Translator : 0


However, the Bible and Dictionary.com apps skew the average upward. Other genres that seem popular include "Weather", "Book", "Food & Drink", and "Finance". The "Book" and "Reference" genres overlap in substance, which may carry some potential for an app idea: an application built around a popular book that offers more than the text itself. Added value could come from quotes from the book, an audio version, quizzes or games, etc. Moreover, the app could include an in-app dictionary.

The App Store is dominated by "for-fun" apps, mostly represented by industry giants, which would make it very difficult for the company I represent to compete in these genres and turn a profit. A better idea is therefore to explore a market niche with a practical app, such as the book-based one. This idea makes even more sense given that the same genre showed potential on Google Play as well. Recall that my aim is to recommend an app profile with the potential to be profitable on both markets: Android and iOS.

# Conclusions

This project's aim was to conduct a basic data analysis of App Store and Google Play mobile apps in order to recommend an app profile that might be profitable on both markets. I've concluded that a good idea could be a book-related application.
This is a preliminary analysis and there is much more that could be done. Nevertheless, this is a project for pre-intermediate Python users, based on the Dataquest.io Python course.