# Profitable App Profiles for the App Store and Google Play Markets

In this project I will analyze free Android and iOS apps on Google Play and the App Store using Python in Jupyter Notebook. I will combine loops, conditionals, dictionaries, and functions to explore app metadata and usage.

The goal is to identify app categories and features that correlate with high user engagement under an ad-revenue model. These insights will guide developers toward building apps with the greatest potential to attract large user bases.

## Opening and Exploring the Data

In September 2018 the Apple App Store contained nearly two million iOS applications and Google Play hosted just over 2.1 million Android apps.  
![Figure 1: Number of apps in leading app stores in September 2018](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png)  
*Source: Statista*

Collecting information on all four million plus apps would be expensive and time intensive. I will instead work with two publicly available samples. The first includes around ten thousand Android apps from Google Play. The second consists of roughly seven thousand iOS apps from the App Store. I will load these data sets into Python and proceed with an initial exploration of the data.  


In [1]:
from csv import reader

# Open each file with an explicit encoding; the `with` block closes it automatically
with open('/Users/alexsnihur/Desktop/DA Projects/Profitable App Profiles for the App Store and Google Play Markets/AppleStore.csv', encoding='utf8') as open_as:
    app_store = list(reader(open_as))

ios_apps_header = app_store[0]  # first row holds the column names
ios_apps = app_store[1:]        # remaining rows are the apps

with open('/Users/alexsnihur/Desktop/DA Projects/Profitable App Profiles for the App Store and Google Play Markets/googleplaystore.csv', encoding='utf8') as open_gp:
    google_play = list(reader(open_gp))

android_apps_header = google_play[0]
android_apps = google_play[1:]

I will define a function named `explore_data()` that prints a chosen range of rows from any data set, one row at a time. I will add a parameter that, when set to `True`, also prints the total number of rows and columns. This utility will simplify repeated inspection of the data sets.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # prints blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# The App Store data set

print(ios_apps_header)
print('\n')
explore_data(ios_apps, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The iOS data set contains 7197 apps. Some of the more relevant columns for this analysis include `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, and `prime_genre`. Column descriptions are available in the data set documentation: [Kaggle - App Store Data Set](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
# The Google Play data set

print(android_apps_header)
print('\n')
explore_data(android_apps, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play data set includes 10841 apps and 13 columns. For this project, the most useful columns appear to be `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.

## Deleting Wrong Data

Before I start the analysis, I need to make sure the data is accurate and consistent. I will review both datasets and check the original sources, including forums and discussions, to see if there are any reported issues. If I find incorrect, incomplete, or corrupted rows, I will remove them to avoid errors in the analysis.

### Google Play Dataset

In the [discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) section of the Google Play dataset, there is a reported issue with a specific row, which I will try to handle with the code below.

In [5]:
print(android_apps_header)
print(len(android_apps_header))
print('\n')

print(android_apps[10472])
print(len(android_apps[10472]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
13


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12


I printed the header row to check the correct structure of the dataset. It contains 13 columns. Then I printed the row at index 10472, which was flagged in the dataset's discussion section as problematic. I noticed that this row only has 12 columns.

Based on the discussion, the issue is that this row is missing a value in the `Category` column. Because of that, all the values in the row are shifted one position to the left. For example, the value `1.9`, which should be under the `Rating` column, ends up under `Category`. This causes the entire structure of the row to be misaligned.

Although one user suggested manually correcting the row by inserting the missing category (e.g. `Lifestyle`), I decided to delete the row. I made this decision because:

- The row structure is broken and misaligned.
- It is safer to remove incorrect rows than to try to manually fix them without full confidence in the original values.
- It is just one row out of over 10,000, so removing it does not impact the dataset significantly.

This way I make sure the data I analyze is structurally correct and does not introduce errors in later steps.

In [6]:
del android_apps[10472]

print(android_apps[10472])
print(len(android_apps[10472]))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
13


I deleted the row at index `10472` using the `del` statement. Then I printed the new row at the same index and confirmed that it now contains valid data with the correct number of columns. This confirms that the malformed row was successfully removed from the dataset.
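The index `10472` used above comes from the Kaggle discussion. As a cross-check, here is a minimal sketch that locates every row whose length differs from the header instead of relying on a known index. The helper name `find_malformed_rows` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose column count differs from the header."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Toy data (hypothetical): the second row is missing one column.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

bad_indexes = find_malformed_rows(toy_rows, toy_header)
print(bad_indexes)

# Delete from the end so that earlier indexes stay valid during removal.
for i in reversed(bad_indexes):
    del toy_rows[i]
print(len(toy_rows))
```

Deleting in reverse order matters when more than one row is flagged, because each `del` shifts the rows that follow it.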

### App Store Dataset

Now I am going to inspect the App Store dataset to check for structural issues. I want to find any rows that are either missing values or have fewer columns than expected. This helps ensure the dataset is clean and consistent before analysis.

In [7]:
for i, row in enumerate(ios_apps, 1):
    if len(row) != len(ios_apps_header):
        print('Row', i, 'has an unexpected number of columns:', row)
    elif any(cell is None or cell == "" for cell in row):
        print('Row', i, 'has empty values:', row)

I looped through each row in the App Store dataset to check for structural problems. The code did not return any output, which means all rows have the correct number of columns and there are no missing or empty values. The dataset appears to be clean and does not require any corrections at this stage.

## Removing Duplicate Entries: Part One

In this part of the cleaning process, I am going to identify apps that appear more than once in the dataset. Duplicate entries can distort results by giving extra weight to certain apps. My goal here is to detect and list duplicates, understand how they differ, and then define a clear rule for keeping only the best version of each app.

### Google Play Dataset

I am going to check the Google Play dataset for duplicate app entries. Some apps appear more than once, which could affect the accuracy of the analysis. In this step, I want to count how many duplicates there are and print a few examples to confirm that they exist.

In [8]:
android_duplicate = []
android_unique = []

for app in android_apps:
    name = app[0]
    if name in android_unique:
        android_duplicate.append(name)
    else:
        android_unique.append(name)

print('Number of duplicate Android apps:', len(android_duplicate))
print('\n')
print('Examples of duplicate Android apps:', android_duplicate[:10])
print('\n')
print('Number of unique Android apps:', len(android_unique))

Number of duplicate Android apps: 1181


Examples of duplicate Android apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Number of unique Android apps: 9659


I created two lists: `android_duplicate` and `android_unique`. I looped through the dataset and added app names to the correct list depending on whether they had already been seen. In total, I found **1,181 duplicate entries**. I printed out a few of them to confirm. These include apps like *Box*, *Slack*, and *Google My Business*.
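The list-based check above tests membership in a list for every row, which gets slower as the list grows. A minimal sketch of the same count using `collections.Counter` is shown here. The function name `count_duplicates` and the toy rows are assumptions made for illustration.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Return (number of duplicate rows, number of unique names)."""
    counts = Counter(row[name_index] for row in dataset)
    n_unique = len(counts)
    # Every occurrence beyond the first one counts as a duplicate row.
    n_duplicates = sum(counts.values()) - n_unique
    return n_duplicates, n_unique


# Toy rows (hypothetical): 'Slack' appears three times and 'Box' twice.
toy_rows = [['Slack'], ['Box'], ['Slack'], ['Zenefits'], ['Box'], ['Slack']]
print(count_duplicates(toy_rows))
```

This uses the same definition of a duplicate as the cell above: each repeated occurrence after the first is counted once.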

Since duplicate rows may contain outdated or repeated information, I will remove them. But instead of removing them randomly, I will keep only the row with the most reviews, since more reviews likely means the data is more recent.

To better understand what kind of variation exists between duplicate entries, I will print all the rows for a specific app that appeared multiple times. I chose *Slack* as an example.

In [9]:
for row in android_apps:
    name = row[0]
    if name == 'Slack':
        print(row)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


I printed all entries for the app *Slack*. Even though the name is the same, I noticed some variation in the number of reviews. This confirms that these duplicates were likely collected at different times. Again, it supports my decision to keep only the row with the highest number of reviews for each app.

### App Store Dataset

I also want to check if the App Store dataset has any duplicate app entries. I will use the same method I applied to the Google Play dataset.

In [10]:
ios_duplicate = []
ios_unique = []

for app in ios_apps:
    app_id = app[0]  # index 0 is the `id` column in the App Store data set
    if app_id in ios_unique:
        ios_duplicate.append(app_id)
    else:
        ios_unique.append(app_id)

print('Number of duplicate iOS apps:', len(ios_duplicate))
print('\n')
print('Examples of duplicate iOS apps:', ios_duplicate[:10])
print('\n')
print('Number of unique iOS apps:', len(ios_unique))

Number of duplicate iOS apps: 0


Examples of duplicate iOS apps: []


Number of unique iOS apps: 7197


I ran the same process for the App Store data. I created two lists, `ios_duplicate` and `ios_unique`, and looped through the first column of each row. Note that index 0 in the App Store dataset is the `id` column, not `track_name`, so this check compares app IDs rather than app names. The result shows that no `id` appears more than once. That means I do not need to clean anything here related to duplicates, and I can move on to the next step.

## Removing Duplicate Entries: Part Two

In the previous step I identified 1,181 duplicate app entries in the Google Play dataset. I will now remove these duplicates while keeping only one entry per app, specifically the one with the highest number of reviews.

In [11]:
reviews_max = {}

for app in android_apps:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


I created an empty dictionary called `reviews_max`. Each key in the dictionary is a unique app name, and the value is the highest number of reviews for that app. I looped through the dataset, updating the value if the current row had more reviews or adding the app name if it was not yet in the dictionary. The dictionary length was 9659, which matches the number of unique apps.

I will now use the `reviews_max` dictionary to remove duplicate rows and keep only the row with the highest number of reviews for each app.

In [12]:
android_clean = []
already_added = []

for app in android_apps:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))

9659


I created two empty lists: `android_clean` to store the cleaned dataset and `already_added` to track app names already included. I looped through the dataset again and added a row to `android_clean` only if its number of reviews matched the maximum value for that app and the app was not already in `already_added`. This ensured that only one entry per app was kept. The cleaned dataset length is 9659, confirming that all duplicates were removed.
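The two-step approach above can also be sketched as a single pass that stores the best row per app directly in a dictionary. This is an alternative under stated assumptions, not the method used in the notebook: the function name `keep_most_reviewed` is hypothetical, the toy rows are shortened to four columns, and the values in the Box row are made up.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


toy_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # hypothetical values
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
deduped = keep_most_reviewed(toy_rows)
print(deduped)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-step approach orders rows by the position of the row that was kept.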

I will now display part of the cleaned dataset to check its structure and verify the contents.

In [13]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The cleaned dataset contains 9659 rows, which matches the number of unique app names, so each app now has a single entry. The first row has 13 columns, matching the header.

## Removing Non-English Apps

In the previous step I removed duplicate app entries from the Google Play dataset. I will now remove apps that are not intended for an English-speaking audience. I will keep only apps with names containing characters from the standard ASCII range of 0 to 127, which includes letters, numbers, common punctuation, and basic symbols used in English.

I will create a function that checks whether a string contains only common English characters by verifying that each character’s code point, as returned by `ord()`, is 127 or less (the [ASCII](https://en.wikipedia.org/wiki/ASCII) range). If any character has a value greater than 127, the function will return `False`.

In [14]:
def eng_checker(string):
    for char in string:
        if ord(char) > 127:
            return False    
    return True

print(eng_checker('Instagram'))
print(eng_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_checker('Instachat 😜'))

True
False
False


The result is that `eng_checker('Instagram')` returns `True`, indicating it is English. `eng_checker('爱奇艺PPS -《欢乐颂2》电视剧热播')` returns `False`, indicating it is non-English. `eng_checker('Instachat 😜')` also returns `False` because the emoji has a code point greater than 127, so the function incorrectly flags an English app name as non-English.

I will modify the `eng_checker` function to reduce the loss of valid English apps that contain a small number of special characters or emojis. Instead of returning False when it finds a single character outside the ASCII range, the function will now count these characters and only return False if there are more than three of them.

In [15]:
def eng_checker(string):
    limit = 0
    for char in string:
        if ord(char) > 127:
            limit += 1
    if limit > 3:
        return False
    else:
        return True

print(eng_checker('Instagram'))
print(eng_checker('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_checker('Instachat 😜'))

True
False
True


After modifying the function, `eng_checker('Instagram')` returns True because all characters are within the ASCII range, `eng_checker('爱奇艺PPS -《欢乐颂2》电视剧热播')` returns False because it contains more than three characters outside the ASCII range, and `eng_checker('Instachat 😜')` returns True because the single emoji is allowed under the limit of three non-ASCII characters.
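To see why the threshold of three changes the result, it helps to list which characters fall outside the ASCII range for each name. The helper `non_ascii_chars` below is a hypothetical name introduced only for this illustration.

```python
def non_ascii_chars(string):
    """Return the characters in `string` whose code point is above 127."""
    return [char for char in string if ord(char) > 127]


for name in ['Instagram', 'Instachat 😜', '爱奇艺PPS -《欢乐颂2》电视剧热播']:
    flagged = non_ascii_chars(name)
    print(name, '->', len(flagged), 'non-ASCII character(s)')
```

The limit of three is a heuristic. An English name with four or more emojis or symbols would still be dropped, and a short non-English name with three or fewer non-ASCII characters would still be kept.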

I will use the updated `eng_checker` function to filter out non-English apps from both datasets. For each dataset, I will loop through the rows and check if the app name is identified as English. If it is, I will append the entire row to a new list that will store only English apps.

In [16]:
ios_eng = []
android_eng = []

for app in ios_apps:
    if eng_checker(app[1]):  # index 1 is `track_name`
        ios_eng.append(app)

for app in android_clean:
    if eng_checker(app[0]):  # index 0 is `App`
        android_eng.append(app)

print('App Store Dataset:', len(ios_eng))
print('Google Play Dataset:', len(android_eng))

App Store Dataset: 6183
Google Play Dataset: 9614


After running the code, the App Store dataset contains 6183 English apps and the Google Play dataset contains 9614 English apps.

## Isolating the Free Apps

So far in the data cleaning process I have removed inaccurate data, removed duplicate entries, and removed non-English apps. The final cleaning step is to isolate only the free apps. This is because the analysis will focus on apps that are free to download and install, where revenue is generated mainly through in-app ads.

I will loop through each dataset, check the value in the price column, and append the row to a new list if the price is zero.

In [17]:
ios_free = []
android_free = []

for app in ios_eng:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)

for app in android_eng:
    price = app[7]
    if price == '0':
        android_free.append(app)

print('Free iOS Apps:', len(ios_free))
print('Free Android Apps:', len(android_free))

Free iOS Apps: 3222
Free Android Apps: 8864


After filtering, the App Store dataset contains 3222 free apps and the Google Play dataset contains 8864 free apps.
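The filter above compares against the exact strings `'0.0'` and `'0'`, which works for the values seen in these two files. A more tolerant check is sketched below. It assumes that prices may carry a `$` prefix or differ in the number of decimals. The function name `is_free` is hypothetical.

```python
def is_free(price):
    """Return True if a price string represents zero, ignoring '$' and spaces."""
    cleaned = price.replace('$', '').strip()
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Values that cannot be parsed as a number are treated as not free.
        return False


for value in ['0', '0.0', '$0.00', '$4.99', 'Everyone']:
    print(repr(value), is_free(value))
```

With a helper like this, one condition could serve both datasets, at the cost of silently treating unparseable prices as paid.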

## Most Common Apps by Genre

Up to this point I have cleaned the datasets by removing inaccurate data, removing duplicate app entries, removing non-English apps, and isolating the free apps. The next step is to start analyzing the data to identify the kinds of apps that are likely to attract more users, since higher user numbers can increase revenue.

The validation strategy for an app idea follows three steps. First, build a minimal Android version of the app and release it on Google Play. If it receives a good response from users, develop it further. If it becomes profitable after six months, create an iOS version and release it on the App Store.

Since the final goal is to release the app on both platforms, I will look for app profiles that perform well in both markets. To start, I will examine the most common genres for each market by building a frequency table for the `prime_genre` column of the App Store dataset and the `Genres` and `Category` columns of the Google Play dataset.

I will create a function named `freq_table()` that generates a frequency table for any column in a dataset, expressed as percentages. The function takes a dataset (list of lists) and an index (integer) as parameters. It counts the occurrences of each unique value in the specified column, calculates the percentage for each value, and returns a dictionary with these percentages.

In [18]:
def freq_table(dataset, index):
    ft = {}
    total = 0
    for row in dataset:
        total += 1
        row_index = row[index]
        if row_index in ft:
            ft[row_index] += 1
        else:
            ft[row_index] = 1
            
    percentages = {}
    for key in ft:
        percentage = (ft[key] / total) * 100
        percentages[key] = round(percentage, 2)
        
    return percentages

I will also use a helper function named `display_table()` to sort and display the frequency table in descending order. This function calls `freq_table()` to generate the table, converts it into a list of tuples with the percentage first and the category second, sorts the list in reverse order, and prints the results.

In [19]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

These two functions together allow me to quickly generate and view ordered frequency tables for columns like `prime_genre` in the App Store dataset and `Genres` or `Category` in the Google Play dataset.
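For comparison, the same idea can be sketched with `collections.Counter`, which counts and sorts in one step. This is an alternative I am adding for illustration, not the implementation used in this notebook. The name `freq_table_sorted` and the toy rows are assumptions.

```python
from collections import Counter


def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]


# Toy rows (hypothetical): three games and one education app.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
for genre, pct in freq_table_sorted(toy_rows, 1):
    print(genre, ':', pct)
```

Ties may be ordered differently: `most_common()` keeps tied values in the order they were first seen, while `display_table()` sorts tied percentages by name in reverse order.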

I will use the `display_table()` function to view the distribution of app genres in the App Store dataset. I will pass `ios_free` as the dataset and use index 11, which corresponds to the `prime_genre` column.

In [20]:
display_table(ios_free, 11)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


The output shows that `Games` is the dominant genre, making up 58.16% of free English apps in the App Store. `Entertainment` is a distant second at 7.88%, followed by `Photo & Video` and `Education`. Most of the apps appear to be designed for entertainment rather than practical use.

Next, I examine the `Category` column in the Google Play dataset by calling `display_table()` on `android_free` using index 1.

In [21]:
display_table(android_free, 1) #Category

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


The most common categories are `FAMILY` (18.91%), `GAME` (9.72%), and `TOOLS` (8.46%). Many categories are related to practical use, but entertainment-related categories are also present. The distribution is more balanced than in the App Store.

To get a more detailed breakdown, I will also check the `Genres` column (index 9) in the same dataset.

In [22]:
display_table(android_free, 9) #Genres

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;

`Tools` is the top genre at 8.45%, followed by `Entertainment` at 6.07% and `Education` at 5.35%. The genre distribution is highly varied, with many categories appearing in small percentages. This fragmentation suggests there may be opportunities in underrepresented practical genres like `Education` or `Productivity`, where competition is lower but user demand still exists.
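Some `Genres` values combine two labels with a semicolon, such as `Art & Design;Pretend Play`, which is one reason the table is so fragmented. A minimal sketch of counting each label separately follows. The function name `split_genre_counts` and the toy rows are illustrative assumptions.

```python
from collections import Counter


def split_genre_counts(dataset, genres_index=9):
    """Count each genre label once per app, splitting values such as 'A;B'."""
    counts = Counter()
    for row in dataset:
        for genre in row[genres_index].split(';'):
            genre = genre.strip()
            if genre:  # skip empty pieces, e.g. after a trailing ';'
                counts[genre] += 1
    return counts


# Toy rows (hypothetical) with a single column holding the genre string.
toy_rows = [
    ['Art & Design'],
    ['Art & Design;Pretend Play'],
    ['Casual;Pretend Play'],
    ['Tools'],
]
genre_counts = split_genre_counts(toy_rows, genres_index=0)
print(genre_counts.most_common())
```

Because an app can now contribute to more than one label, percentages computed from these counts relative to the number of apps would no longer add up to 100.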

## Most Popular Apps by Genre on the App Store

So far I have looked at which genres are most common in the App Store. Now I want to understand which genres are the most popular in terms of user engagement. Since the App Store dataset does not include install numbers, I will use the total number of user ratings (`rating_count_tot`) as a proxy for measuring popularity.

To do this, I will calculate the average number of user ratings for each genre. This will help identify which types of apps attract more attention from users and may guide the selection of a genre with higher engagement potential.

First, I will generate a frequency table from the `prime_genre` column to get the unique genres. Then I will loop through each genre and use a nested loop to isolate the apps belonging to that genre, sum their user ratings, and count how many apps fall under that category. Finally, I will calculate the average number of user ratings by dividing the total by the number of apps in that genre.

In [23]:
ios_ft = freq_table(ios_free, 11)

for genre in ios_ft:
    total = 0
    len_genre = 0
    for row in ios_free:
        genre_app = row[11]
        user_ratings = float(row[5])
        if genre_app == genre:
            total += user_ratings
            len_genre += 1
    avg_ratings = round((total / len_genre), 2)
    print(genre, ':', avg_ratings)

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0


The results show that `Navigation`, `Reference`, and `Social Networking` have the highest average number of user ratings, indicating that these genres attract more engagement per app. This suggests there may be potential in targeting high-utility or information-focused genres, even though they are less common in the store overall.
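The nested loop above scans the whole dataset once per genre. As a sketch of an alternative, the averages can be collected in one pass and sorted so the top genres are easy to read. The function name `average_by_group` is hypothetical, and the toy rows contain only two apps per genre taken from the listings in this notebook, so the averages differ from the real ones.

```python
def average_by_group(dataset, group_index, value_index):
    """Return (group, average) pairs sorted from highest to lowest average."""
    sums = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        sums[group] = sums.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    averages = [(group, round(sums[group] / counts[group], 2)) for group in sums]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


# Toy rows: [name, genre, rating count]; a small subset for illustration.
toy_rows = [
    ['Waze', 'Navigation', '345046'],
    ['Facebook', 'Social Networking', '2974676'],
    ['Google Maps', 'Navigation', '154911'],
    ['Pinterest', 'Social Networking', '1061624'],
]
for genre, avg in average_by_group(toy_rows, 1, 2):
    print(genre, ':', avg)
```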

To understand why certain genres have a high average number of user ratings, I will inspect the apps in the top three genres: `Navigation`, `Reference`, and `Social Networking`. This will help assess whether the high averages reflect broad user engagement or are driven by a few standout apps.

In [24]:
for app in ios_free:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [25]:
for app in ios_free:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


In [26]:
for app in ios_free:
    if app[11] == 'Social Networking':
        print(app[1], ':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

I started with `Navigation` because it had the highest average number of ratings. The genre contains only six apps, and Waze and Google Maps together account for about 97% of its ratings. These two apps already dominate the genre and are hard to compete with unless we offer a very specific niche or region-based service.

Then I looked at `Reference` apps. This genre also shows high engagement, but the average is heavily skewed by a few popular apps: Bible alone accounts for roughly 73% of the genre's ratings, followed by the Dictionary.com apps. Some of the engagement is also driven by Minecraft-related tools, which may not reflect general demand. Still, this could offer opportunities for targeted educational or utility content.

`Social Networking` also has high engagement but is completely dominated by giants like Facebook, Kik, and WhatsApp. This space is highly saturated and competitive, making it difficult for new apps to break through without a unique value proposition.

Out of the three, `Reference` looks like a more realistic area to explore. It has high user engagement without being completely locked by dominant players. A focused, well-designed reference app for a specific audience could potentially stand out.
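One way to quantify how much a few large apps pull an average up is to compare the mean with the median. The sketch below uses the six `Navigation` rating counts printed earlier. It relies only on the standard `statistics` module, and the variable name `navigation_ratings` is introduced here for illustration.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean:', round(mean(navigation_ratings), 2))
print('Median:', median(navigation_ratings))
```

The mean is about 86,090 while the median is 8,196.5, so the typical app in this genre receives far fewer ratings than the average suggests.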

## Most Popular Apps by Genre on Google Play

Now I want to find out which app genres actually attract the most users on Google Play. Instead of just knowing which categories are common, I want to know which ones get the most engagement. That means looking at how many people are installing apps in each category.

To do this, I loop through each app category, sum the installs, and divide by the number of apps to get the average installs per category. The install numbers are stored as strings with "+" and "," characters (for example `10,000+`), so I remove those characters and convert the values to floats before doing any math. Because each value is an open-ended bucket rather than an exact count, the resulting averages are lower-bound estimates.

In [27]:
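# Frequency table of the category column (index 1); iterating over it yields each category
android_ft = freq_table(android_free, 1)

for category in android_ft:
    total = 0          # running sum of installs for this category
    len_category = 0   # number of apps in this category
    for row in android_free:
        category_app = row[1]
        installs = row[5]
        if category_app == category:
            # Strip '+' and ',' so that '1,000,000+' becomes '1000000'
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            installs = float(installs)
            total += installs
            len_category += 1
    avg_installs = round((total / len_category), 2)
    print(category, ':', avg_installs)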

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.6
FAMILY : 3695641.82
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.4
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.3
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.2
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77


From the output, it is clear that certain categories on Google Play have very high average install numbers. `Communication`, `Video Players`, `Social`, `Photography`, and `Productivity` are at the top. However, most of these averages are pulled up by a handful of giants such as WhatsApp, YouTube, or Facebook, which would not be easy to compete with.

This makes categories like `Productivity` or `Books and Reference` more interesting. They still have solid average install numbers and, at first glance, look less dependent on a small number of extremely popular apps. A good idea in one of these categories could have a better chance to get attention and grow.
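The per-category averages can also be computed in a single pass over the rows instead of rescanning the whole data set once per category. The sketch below is one way to do that, assuming the same column layout as `android_free` (category at index 1, installs at index 5). The helper names `parse_installs` and `average_installs_by_category` and the sample rows are hypothetical and not part of the original notebook.

```python
from collections import defaultdict


def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace('+', '').replace(',', ''))


def average_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Return {category: average installs}, reading each row only once."""
    totals = defaultdict(float)  # sum of installs per category
    counts = defaultdict(int)    # number of apps per category
    for row in rows:
        category = row[category_idx]
        totals[category] += parse_installs(row[installs_idx])
        counts[category] += 1
    return {cat: round(totals[cat] / counts[cat], 2) for cat in totals}


# Hypothetical rows (made-up values) that mimic the layout of android_free.
sample_rows = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'WEATHER', '', '', '', '500,000+'],
]

averages = average_installs_by_category(sample_rows)

# Print categories from highest to lowest average so the leaders appear first.
for category, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(category, ':', avg)
```

Sorting by average makes the ranking easier to read than the unsorted listing, and the single pass avoids one full scan of the data per category.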

To better understand what is driving the high average install numbers in the `Communication` category, I filtered for apps with 100 million installs or more. I want to see if a few major players are dominating this category.

In [28]:
for app in android_free:
    if app[1] == 'COMMUNICATION' and app[5] in ('1,000,000,000+',
                                                '500,000,000+',
                                                '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

The results confirm that a few major apps like WhatsApp, Skype, and Facebook Messenger dominate this category. This explains the high average installs, but it also suggests that breaking into this space would be very difficult due to the level of competition and user loyalty.
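One way to check how much the giants inflate the category average is to recompute it while leaving out every app at or above 100 million installs. The sketch below does this on hypothetical sample rows with made-up values. The function `average_installs` and its `cap` parameter are illustrative names, and the column layout (category at index 1, installs at index 5) is assumed to match `android_free`.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace('+', '').replace(',', ''))


def average_installs(rows, category, cap=None, category_idx=1, installs_idx=5):
    """Average installs for one category.

    When `cap` is given, apps with installs at or above `cap` are skipped.
    Returns 0.0 if no app qualifies.
    """
    values = []
    for row in rows:
        if row[category_idx] != category:
            continue
        installs = parse_installs(row[installs_idx])
        if cap is None or installs < cap:
            values.append(installs)
    return sum(values) / len(values) if values else 0.0


# Hypothetical rows (made-up values) that mimic the layout of android_free.
sample_rows = [
    ['Giant messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Mid-size app', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Small app', 'COMMUNICATION', '', '', '', '500,000+'],
]

with_giants = average_installs(sample_rows, 'COMMUNICATION')
without_giants = average_installs(sample_rows, 'COMMUNICATION', cap=100_000_000)

print('All apps            :', round(with_giants, 2))
print('Below 100M installs :', round(without_giants, 2))
```

If the second figure is far below the first on the real data, the category average is being driven mostly by a few outliers rather than by typical apps.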

I will apply a similar filter to the `Books and Reference` category, this time lowering the threshold to 1 million installs so that mid-sized apps are included. My goal is to check whether the high average installs here are more evenly distributed across multiple apps, which could suggest lower competition and more opportunity.

In [30]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in ('1,000,000,000+',
                                                      '500,000,000+',
                                                      '100,000,000+',
                                                      '50,000,000+',
                                                      '10,000,000+',
                                                      '5,000,000+',
                                                      '1,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Aldiko Book Reader : 10,000,000+
Wattpad 📖 Free Books : 100,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al ka

Unlike `Communication`, this category has a broader range of popular apps. While there are still some large players, the install numbers are more evenly spread. This could indicate a less saturated market with more room to grow, making it a potentially better option for a new app.
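A simple way to put a number on "evenly spread" is to count how many apps of a category fall into each install tier. The sketch below is a minimal illustration on hypothetical sample rows with made-up values. `install_tier_counts` and `tier_value` are names introduced here, and the column layout (category at index 1, installs at index 5) is assumed to match `android_free`.

```python
from collections import Counter


def install_tier_counts(rows, category, category_idx=1, installs_idx=5):
    """Count how many apps of `category` fall into each install tier."""
    return Counter(
        row[installs_idx] for row in rows if row[category_idx] == category
    )


def tier_value(tier):
    """Numeric value of a tier label such as '1,000,000+', used for sorting."""
    return int(tier.replace('+', '').replace(',', ''))


# Hypothetical rows (made-up values) that mimic the layout of android_free.
sample_rows = [
    ['Reader A', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Reader B', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Dictionary C', 'BOOKS_AND_REFERENCE', '', '', '', '10,000,000+'],
    ['Library D', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
    ['Chat E', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
]

counts = install_tier_counts(sample_rows, 'BOOKS_AND_REFERENCE')

# List tiers from largest to smallest to see where most apps sit.
for tier in sorted(counts, key=tier_value, reverse=True):
    print(tier, ':', counts[tier])
```

A category where most apps sit in the middle tiers and only a few reach the top tier is less concentrated than one where the top tiers account for most of the installs.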

Next, I will look into the `Productivity` category to see what kind of apps are behind the high average install numbers.

In [31]:
for app in android_free:
    if app[1] == 'PRODUCTIVITY' and app[5] in ('1,000,000,000+',
                                               '500,000,000+',
                                               '100,000,000+'):
        print(app[0], ':', app[5])

Microsoft Word : 500,000,000+
Microsoft Outlook : 100,000,000+
Microsoft OneDrive : 100,000,000+
Microsoft OneNote : 100,000,000+
Google Keep : 100,000,000+
ES File Explorer File Manager : 100,000,000+
Dropbox : 500,000,000+
Google Docs : 100,000,000+
Microsoft PowerPoint : 100,000,000+
Samsung Notes : 100,000,000+
SwiftKey Keyboard : 100,000,000+
Google Drive : 1,000,000,000+
Adobe Acrobat Reader : 100,000,000+
Google Sheets : 100,000,000+
Microsoft Excel : 100,000,000+
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100,000,000+
Google Slides : 100,000,000+
ColorNote Notepad Notes : 100,000,000+
Evernote – Organizer, Planner for Notes & Memos : 100,000,000+
Google Calendar : 500,000,000+
Cloud Print : 500,000,000+
CamScanner - Phone PDF Creator : 100,000,000+


It is clear that the `Productivity` category is heavily populated by apps from major tech companies like Microsoft and Google. These apps offer essential tools like word processing, cloud storage, and calendar management. Although the install numbers are high, the dominance of a few big players suggests that breaking into this space would require either a niche angle or significantly better functionality.
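The claim about Microsoft and Google can be roughly checked by tallying the app names printed above by their first word. The data set shown here has no publisher column, so the sketch below uses the name prefix as a crude stand-in, which is an assumption. The helper `publisher_guess` is a hypothetical name, and the list is copied from the output above.

```python
from collections import Counter

# App names copied from the filtered PRODUCTIVITY output above.
top_productivity = [
    'Microsoft Word', 'Microsoft Outlook', 'Microsoft OneDrive',
    'Microsoft OneNote', 'Google Keep', 'ES File Explorer File Manager',
    'Dropbox', 'Google Docs', 'Microsoft PowerPoint', 'Samsung Notes',
    'SwiftKey Keyboard', 'Google Drive', 'Adobe Acrobat Reader',
    'Google Sheets', 'Microsoft Excel',
    'WPS Office - Word, Docs, PDF, Note, Slide & Sheet', 'Google Slides',
    'ColorNote Notepad Notes',
    'Evernote – Organizer, Planner for Notes & Memos',
    'Google Calendar', 'Cloud Print', 'CamScanner - Phone PDF Creator',
]


def publisher_guess(name):
    """Guess the publisher from the first word of the app name (a rough proxy)."""
    first_word = name.split()[0]
    return first_word if first_word in ('Microsoft', 'Google') else 'Other'


tally = Counter(publisher_guess(name) for name in top_productivity)

for publisher in ('Microsoft', 'Google', 'Other'):
    print(publisher, ':', tally[publisher])
```

By name prefix alone, 12 of the 22 listed apps belong to Microsoft or Google. The prefix can only undercount, since an app published by either company without the company name in its title ends up under `Other`.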

## Final Thoughts

After cleaning the data and analyzing app genres from both the App Store and Google Play, I found that while entertainment apps like games and social networking are the most common, they are not necessarily the best choices for a new app project. These categories are oversaturated and often dominated by major companies.

Instead, categories like `Navigation`, `Reference`, and `Productivity` on the App Store show strong user engagement based on the average number of user ratings, despite having fewer apps overall. On Google Play, `Books and Reference` performs well in terms of average installs without being entirely dominated by a few apps, while `Productivity` also has high average installs but is led by apps from Microsoft and Google.

If I were to recommend a direction for building a new app, I would focus on one of these practical categories, especially where user demand is high but competition is lower. A well-designed, focused tool within one of these genres could stand out and gain traction more easily than trying to break into crowded entertainment categories.
