# Guided project: Profitable App Profiles for the App Store and Google Play Markets

# Table of contents

[Introduction](#Introduction-*)

* [Exploring two datasets](#Exploring-two-datasets-*)
* [Getting the column names](#Getting-the-column-names-*)

[Data Cleaning](#Data-Cleaning-*)
* [Deleting the errors](#Deleting-the-errors-*)
* [Removing duplicates](#Removing-duplicates-*)
* [Deleting non-English apps](#Deleting-non-English-apps-*)
* [Storing only free apps](#Storing-only-free-apps-*)

[Analyzing the data](#Analyzing-the-data-*)
* [The aim](#The-aim-*)
* [Most common genres](#Most-common-genres-*)
* [Number of installations](#Number-of-installations-*)
* [Choosing an app](#Choosing-an-app-*)

[Conclusions](#Conclusions-*)

## Introduction [*](#Table-of-contents)

The project involves building apps for the App Store and Google Play. These apps will be free to download and install, so the primary source of revenue will consist of in-app ads. The main goal is to analyse and understand what types of apps are likely to attract more users.

In [5]:
from csv import reader

# Read each CSV file into a list of rows; the first row of each list is the header.
# The `with` blocks close the files automatically once they have been read.
with open('AppleStore.csv', encoding="utf8") as opened_file_applestore:
    apps_data_applestore = list(reader(opened_file_applestore))

with open('googleplaystore.csv', encoding="utf8") as opened_file_googleplaystore:
    apps_data_googleplaystore = list(reader(opened_file_googleplaystore))



#### Exploring two datasets [*](#Table-of-contents)
The code above opens two CSV files, `AppleStore.csv` and `googleplaystore.csv`. The `reader` function from the `csv` module parses each file, and the result is converted into a **list of lists** in which every inner list is one row.
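
The loading step can also be wrapped in a small helper that splits off the header row. This is only a minimal sketch: `load_rows` is a hypothetical name, and the in-memory sample stands in for a real file so that the snippet runs on its own.

```python
import csv
import io


def load_rows(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical in-memory sample standing in for a file on disk.
sample = io.StringIO("App,Category,Rating\nInstagram,SOCIAL,4.5\n")
header, rows = load_rows(sample)
print(header)  # ['App', 'Category', 'Rating']
print(rows)    # [['Instagram', 'SOCIAL', '4.5']]
```

With a real file, the same helper could be called inside a `with open('AppleStore.csv', encoding='utf8') as f:` block.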

In [6]:
def explore_data(dataset, start=0, end=1, rows_and_columns=True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print("\n")

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print("\n")

The `explore_data` function prints the rows between the `start` and `end` indexes (`end` is excluded). It can also print the number of rows and columns. Here are the first two data rows of each dataset.

In [7]:
explore_data(apps_data_applestore, 1, 3, True)
explore_data(apps_data_googleplaystore, 1, 3, True) 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13




#### Getting the column names [*](#Table-of-contents)
The column names are stored in the first row of each dataset, so calling `explore_data` with its default arguments (`start=0`, `end=1`) prints them:

In [8]:
explore_data(apps_data_applestore)
explore_data(apps_data_googleplaystore)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10842
Number of columns: 13




Looking at the result, it is worth noting which columns the `AppleStore.csv` and `googleplaystore.csv` files have in common, or at least have close counterparts for. We can use these columns in our later analysis.

For instance: **price**, **genres**, **rating**, **reviews**, **version**, **content rating**.
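
Since the rest of the analysis refers to columns by position, it can help to build a name-to-index lookup from each header row. The sketch below is an optional convenience, not part of the original notebook; `column_indexes` is a hypothetical helper, and the two header lists are copied from the output above.

```python
# Header rows copied from the output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']


def column_indexes(header):
    """Map each column name to its position in the row."""
    return {name: position for position, name in enumerate(header)}


ios_cols = column_indexes(ios_header)
android_cols = column_indexes(android_header)

print(ios_cols['prime_genre'])   # 11
print(android_cols['Genres'])    # 9
print(android_cols['Installs'])  # 5
```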

## Data Cleaning [*](#Table-of-contents)

#### Deleting the errors [*](#Table-of-contents)
We only care about apps that are free to download and designed for an English-speaking audience. The Google Play dataset has a [dedicated discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=undefined) section, and [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error in a certain row (10472). Because our list still includes the header row, we first check which index the faulty entry actually sits at.

In [9]:
apps_data_googleplaystore[10472]

['Xposed Wi-Fi-Pwd',
 'PERSONALIZATION',
 '3.5',
 '1042',
 '404k',
 '100,000+',
 'Free',
 '0',
 'Everyone',
 'Personalization',
 'August 5, 2014',
 '3.0.0',
 '4.0.3 and up']

In [10]:
apps_data_googleplaystore[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

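The row at index 10473 **has** an error: the `Category` value is missing, so the row contains 12 values instead of 13 and every later value is shifted one column to the left. We remove it with the `del` statement.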

In [11]:
del apps_data_googleplaystore[10473]

We did the same for the row at index **9149**, as it was reported by [Sharon Mathys](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176#676953) to have an error.

In [12]:
del apps_data_googleplaystore[9149]
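
Rows with a missing value, like the one at index 10473, can also be located programmatically by comparing each row's length with the header's. This is a minimal sketch on made-up sample data; `find_malformed_rows` is a hypothetical helper, and it only catches rows with a wrong number of columns, not rows with wrong values.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [index for index, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: a header row, one valid row and one row with a missing value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Instagram', 'SOCIAL', '4.5'],
    ['Broken app', '1.9'],
]
print(find_malformed_rows(sample))  # [2]
```

When several rows are deleted by index, deleting from the highest index down keeps the remaining indexes valid.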

#### Removing duplicates [*](#Table-of-contents)
As mentioned earlier, the Google Play dataset has a [dedicated discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=undefined) section, where it is mentioned that there are duplicate entries. Our next goal is to remove them.

First, let us confirm that duplicates exist, using **Instagram** as an example.

In [13]:
def insta_duplicates():
    for app in apps_data_googleplaystore:
        name = app[0]
        if name == "Instagram":
            print(app)

insta_duplicates()

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Let us also find the number of duplicates.

In [14]:
duplicate_apps = []
unique_apps = []

for app in apps_data_googleplaystore:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print(f"The number of duplicate apps is {len(duplicate_apps)}", end="\n\n")
print("Here are a few of them:", end="\n\n")
print(duplicate_apps[:10])


The number of duplicate apps is 1181

Here are a few of them:

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Going back to the output of `insta_duplicates()`, we can see that the **4th** column (number of reviews) has different values for the same app.

In [15]:
insta_duplicates()

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The rows were probably collected at different times, so it is better to keep the row with the **highest number of reviews**, which should reflect the most *recent* data.

We will use a dictionary to remove the duplicates: each key is a **unique app name**, and the corresponding value is the **highest number of reviews** recorded for that app.

In [16]:
reviews_max = {}

for app in apps_data_googleplaystore[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

First, we do **not** include the header row. After creating an empty dictionary, we check whether the app name is already a key in it. If it is not, we store the name as a new key with `n_reviews` as its value. If it is (which means the row is a duplicate), we compare the review counts and keep the higher one.

In [17]:
android_clean = []
already_added = []

for app in apps_data_googleplaystore[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

len(android_clean)

9658

Now that `reviews_max` holds the highest number of reviews for each app, we can get rid of the duplicates. The code loops through the original dataset (`apps_data_googleplaystore`) and extracts the name and number of reviews of each app. A row is kept only if its review count equals the maximum stored in `reviews_max` for that name. The code also records the names it has already kept in `already_added`. Why? Some duplicate rows share the same highest number of reviews, so without the `name not in already_added` condition the `android_clean` list would still contain duplicates.
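
The two passes above can also be merged into a single pass that stores the best row per app name directly. The sketch below is an alternative under stated assumptions, not the notebook's method: `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened to four columns, and the `Box` row uses made-up values.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means ties keep the row that was seen first.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows; the 'Box' values are illustrative only.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

Note that the resulting rows are ordered by the first appearance of each app name, which may differ from the order produced by the two-pass version.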

#### Deleting non-English apps [*](#Table-of-contents)
Our next step is to remove non-English apps from both datasets. Each character in a string has a corresponding number (its code point), which we can get with the `ord()` function.

In [18]:
print(ord("G"))
print(ord("8"))
print(ord("}"))

71
56
125


According to **[ASCII](https://en.wikipedia.org/wiki/ASCII)** (American Standard Code for Information Interchange), the characters commonly used in English text have values between `0` and `127`. So if a character's number is greater than `127`, it is probably not an English character.

In [19]:
def is_english(text):
    wrong_character_count = 0
    for character in text:
        if ord(character) > 127:
            wrong_character_count += 1
            # More than three non-ASCII characters: treat the name as non-English.
            if wrong_character_count > 3:
                return False
    return True

print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))


True
False
True
True


Some app names consist of English words but also contain symbols or emojis that fall outside the **[ASCII](https://en.wikipedia.org/wiki/ASCII)** range. We therefore relax the rule: ***at most `3` characters can be outside of the range***. This condition will still exclude some useful apps, but it should be effective enough.

In [20]:
def filter_non_english_android(dataset):
    new_dataset = []
    for app in dataset[1:]:
        name = app[0]
        if is_english(name):
            new_dataset.append(app)
    return new_dataset

def filter_non_english_ios(dataset):
    new_dataset = []
    for app in dataset[1:]:
        name = app[1]
        if is_english(name):
            new_dataset.append(app)
    return new_dataset


filtered_appstore  = filter_non_english_ios(apps_data_applestore)
filtered_googleplay = filter_non_english_android(android_clean)

print(len(filtered_appstore))
print(len(filtered_googleplay))




6183
9612


#### Storing only free apps [*](#Table-of-contents)
Recall that the app name is at `index=0` in the Google Play dataset, whereas it is at `index=1` in the App Store dataset. Also remember that we have already cleaned `apps_data_googleplaystore` into `android_clean` by removing the duplicates. At this point, we are left with `6183` iOS apps and `9612` Android apps.

In [21]:
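def ios_free(row):
    # The App Store price is stored as a number in column 4.
    price = float(row[4])
    if price > 0:
        return False
    return True

def android_free(row):
    # Column 6 (`Type`) marks each Google Play app as "Free" or "Paid".
    free_or_paid = row[6]
    if free_or_paid == "Paid":
        return False
    return True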

def free_appestore_apps(dataset):
    free_dataset = []
    for app in dataset:
        if ios_free(app):
            free_dataset.append(app)
    return free_dataset

def free_googleplay_apps(dataset):
    free_dataset = []
    for app in dataset:
        if android_free(app):
            free_dataset.append(app)
    return free_dataset

appstore_final = free_appestore_apps(filtered_appstore)
googleplay_final = free_googleplay_apps(filtered_googleplay)

print(len(appstore_final))
print(len(googleplay_final))

3222
8862


The first two functions check whether an app is free. For iOS apps, the price is at `index=4`. For Android apps, we check `index=6` (the `Type` column, which is either `Free` or `Paid`) rather than `index=7` (the `Price` column). Afterwards, we loop through the two datasets and keep only the **free** apps.
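
The same filtering can be expressed more compactly with list comprehensions. This is a sketch rather than a replacement for the functions above: the Facebook and Photo Editor rows are copied from the earlier output, while the two "paid" rows are hypothetical variations created only for the example.

```python
facebook = ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676',
            '212', '3.5', '3.5', '95.0', '4+', 'Social Networking',
            '37', '1', '29', '1']
photo_editor = ['Photo Editor & Candy Camera & Grid & ScrapBook',
                'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0',
                'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0',
                '4.0.3 and up']

# Hypothetical paid variants, made up for illustration.
paid_ios = list(facebook)
paid_ios[1], paid_ios[4] = 'Hypothetical paid app', '2.99'
paid_android = list(photo_editor)
paid_android[0], paid_android[6], paid_android[7] = 'Hypothetical paid app', 'Paid', '$0.99'

ios_rows = [facebook, paid_ios]
android_rows = [photo_editor, paid_android]

# Same conditions as ios_free and android_free above.
free_ios = [row for row in ios_rows if not float(row[4]) > 0]
free_android = [row for row in android_rows if row[6] != 'Paid']

print([row[1] for row in free_ios])      # ['Facebook']
print([row[0] for row in free_android])  # ['Photo Editor & Candy Camera & Grid & ScrapBook']
```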

## Analyzing the data [*](#Table-of-contents)

#### The aim [*](#Table-of-contents)
The aim of this project is to build apps that attract as many users as possible, in order to maximize revenue.
To minimize risk and workload, the strategy consists of the following steps:

1. Build an Android version of the app and publish it on Google Play.
2. If it is successful, we develop it further.
3. If it still delivers good results and gets a good response from users after six months, we build an iOS version of the app.

So far we have been cleaning data. Now is the time to start analyzing it by detecting the most common genre types.

#### Most common genres [*](#Table-of-contents)
First, we need to create a function that does the following:
1. Prints the names of the genres in the dataset.
2. Shows the number of apps in each genre.
3. Displays the total number of genres.

`genres_of_datasets` is a test function; feel free to try it out. It accepts two arguments, `index` and `dataset`, which are the column index and the dataset to analyze. The first loop collects the unique (**non-duplicate**) genres as keys of the `storage` dictionary. The second loop counts how many times each genre appears in the dataset. Finally, the function prints the counts and the total number of unique genres.

In [22]:
def genres_of_datasets(index, dataset):
    duplicate_list = []
    storage = {}
    count = 0
    for app in dataset:
        genre = app[index]
        storage[genre] = 0
        if genre in duplicate_list:
            continue
        else:
            duplicate_list.append(genre)
            count += 1

    for app in dataset:
        genre = app[index]
        for genre_name in duplicate_list:
            if genre_name == genre:
                storage[genre] += 1

    print(storage, f"There are {len(storage)} genres in the dataset", sep="\n\n", end="\n\n")
    
genres_of_datasets(1, googleplay_final)
print("-" * 100, end="\n\n")
genres_of_datasets(11, appstore_final)


#appstore index=11
#googleplay index=1 and index=9



{'ART_AND_DESIGN': 56, 'AUTO_AND_VEHICLES': 82, 'BEAUTY': 53, 'BOOKS_AND_REFERENCE': 190, 'BUSINESS': 407, 'COMICS': 55, 'COMMUNICATION': 287, 'DATING': 165, 'EDUCATION': 103, 'ENTERTAINMENT': 85, 'EVENTS': 63, 'FINANCE': 328, 'FOOD_AND_DRINK': 110, 'HEALTH_AND_FITNESS': 273, 'HOUSE_AND_HOME': 73, 'LIBRARIES_AND_DEMO': 83, 'LIFESTYLE': 346, 'GAME': 862, 'FAMILY': 1675, 'MEDICAL': 313, 'SOCIAL': 236, 'SHOPPING': 199, 'PHOTOGRAPHY': 261, 'SPORTS': 301, 'TRAVEL_AND_LOCAL': 207, 'TOOLS': 750, 'PERSONALIZATION': 294, 'PRODUCTIVITY': 345, 'PARENTING': 58, 'WEATHER': 71, 'VIDEO_PLAYERS': 159, 'NEWS_AND_MAGAZINES': 248, 'MAPS_AND_NAVIGATION': 124}

There are 33 genres in the dataset

----------------------------------------------------------------------------------------------------

{'Social Networking': 106, 'Photo & Video': 160, 'Games': 1874, 'Music': 66, 'Reference': 18, 'Health & Fitness': 65, 'Weather': 28, 'Utilities': 81, 'Travel': 40, 'Shopping': 84, 'News': 43, 'Navigation': 6, 'Lif

We then created the `freq_table` function, a slightly modified version of `genres_of_datasets` that returns the `storage` dictionary instead of printing it. Our goal now is to sort that dictionary by its values to find the most common genres. To do that, we wrote the `display_table` function, which does the following:

1. Uses the `freq_table` function to create a frequency table.
2. Converts the dictionary into a list in which each element is a `(value, key)` tuple.
3. Sorts the tuples in descending order with the built-in `sorted` function.
4. Prints the genres in `genre : number` format.

In [23]:
def freq_table(dataset, index):
    duplicate_list = []
    storage = {}
    for app in dataset:
        genre = app[index]
        storage[genre] = 0
        if genre in duplicate_list:
            continue
        else:
            duplicate_list.append(genre)

    for app in dataset:
        genre = app[index]
        for i in duplicate_list:
            if i == genre:
                storage[genre] += 1
                
    return storage
#appstore index=11
#googleplay index=1 and index=9


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

display_table(appstore_final, 11)
print("\n", "-" * 100, end="\n\n")
display_table(googleplay_final, 1)
print("\n", "-" * 100, end="\n\n")
display_table(googleplay_final, 9)


Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4

 ----------------------------------------------------------------------------------------------------

FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN
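
For comparison, the frequency table and the sorted display can be sketched with `collections.Counter` from the standard library. This is an alternative, not the approach used above; `genre_counts` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def genre_counts(dataset, index):
    """Count how many rows fall into each value of the given column."""
    return Counter(row[index] for row in dataset)


# Made-up sample rows: [app name, category].
sample = [
    ['a', 'GAME'], ['b', 'FAMILY'], ['c', 'GAME'],
    ['d', 'TOOLS'], ['e', 'GAME'], ['f', 'FAMILY'],
]

# most_common() returns (value, count) pairs sorted by count, highest first.
for genre, count in genre_counts(sample, 1).most_common():
    print(genre, ':', count)
# GAME : 3
# FAMILY : 2
# TOOLS : 1
```

One difference to keep in mind: when two genres have the same count, `most_common()` lists them in the order they were first seen, whereas sorting `(value, key)` tuples in reverse breaks ties in reverse alphabetical order.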

Let us analyze the results. We will start with `prime_genre` column (**corresponding to `index=11`**) of the App Store dataset. We can observe that the first few most common app genres are:

1. Games
2. Entertainment
3. Photo & Video
4. Education
5. Social Networking

It can be seen that the vast majority of free English apps are games. There are fewer apps related to health, education, productivity, etc. (apps with **practical purposes**) and more apps designed for entertainment. However, we cannot conclude that games are the most likely to bring the highest revenue: not all of them are successful, and apps from other genres can be at least as successful as games. Nevertheless, games may offer a better start than other genres in terms of bringing in some revenue.


    
Secondly, let us take a look at the `Category` and `Genres` columns (**corresponding to `index=1` and `index=9` respectively**) of the Google Play dataset. As with the App Store dataset, we can look at the most common categories and genres:

For *Categories*:
1. Family
2. Game
3. Tools
4. Business
5. Lifestyle

For *Genres*:
1. Tools
2. Entertainment
3. Education
4. Business
5. Productivity

Here, the picture is a bit different. Family apps take the lion's share. Unsurprisingly, the second most common category is games, which shows that games are popular regardless of the platform. Interestingly, the most common genre in the Google Play dataset is tools. The remaining genres follow a similar, more **balanced** descending pattern in terms of numbers.

#### Number of installations [*](#Table-of-contents)
So far, we have found the most common genres. The picture is still incomplete, because an app's popularity also depends on the number of installations. One way to find out which genres have the most users is to compute the average number of installs per app for each genre. The Google Play dataset provides this information in the `Installs` column, but the App Store dataset has no such column. We can, however, use the `rating_count_tot` column (the total number of user ratings) as a proxy.
The plan is simple. We need to:
1. Isolate the apps of each genre.
2. Add up the number of user ratings for the apps of that genre.
3. Divide the sum by the number of apps belonging to that genre.

In [24]:
appstore_freq_table = freq_table(appstore_final, 11)
for genre in appstore_freq_table:
    total = 0
    len_genre = 0
    for app in appstore_final:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    avg_number_of_user_ratings = total / len_genre
    print(f"{genre} : {avg_number_of_user_ratings}")



Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


Judging by the results, *Navigation*, *Reference* and *Social Networking* apps have the highest average number of user ratings. The least rated apps (by number of ratings, not by rating score) are *Medical* apps.
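
The per-genre averages can also be computed in a single pass over the dataset instead of one pass per genre. The sketch below assumes nothing beyond the standard library; `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the value column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows: [genre, number of ratings].
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)  # {'Games': 200.0, 'Weather': 50.0}
```

For the App Store data, the call would presumably be `average_by_group(appstore_final, 11, 5)`, using the same column indexes as the loop above.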

Moving on to the Google Play dataset. The issue with the `Installs` column is that the values are open-ended (they are given in a format such as **100+, 1,000+, etc.**). We do not know whether an app with 10,000+ installs has 10,000, 40,000 or 53,948 installs. Since we do not have precise figures, we will take the numbers as they are: an app with 100,000+ installs is counted as having 100,000 installs.
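
Converting an `Installs` value into a number only requires stripping the commas and the plus sign. A minimal sketch of that step, with `parse_installs` as a hypothetical helper name:

```python
def parse_installs(text):
    """Turn a string such as '10,000+' into the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))


print(parse_installs('10,000+'))         # 10000
print(parse_installs('1,000,000,000+'))  # 1000000000
```

The result is a lower bound on the real number of installs, which is why the averages that follow should be read as rough estimates.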

In [25]:
category_freq_table = freq_table(googleplay_final, 1)
for category in category_freq_table:
    total = 0
    len_category = 0
    for app in googleplay_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace("+", "")
            n_installs = n_installs.replace(",", "")
            total += int(n_installs)
            len_category += 1
            
    avg_category_installs = total / len_category
    print(f"{category} : {avg_category_installs}")      

ART_AND_DESIGN : 2021626.7857142857
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

On average, communication apps have the most installs: **about 38 million**. However, this figure is misleading, not only because the install counts are imprecise, but also because the average is heavily skewed by a handful of giants such as *WhatsApp, Messenger, Gmail, Telegram and a few others*. We can illustrate that with the following code.

In [26]:
def num_install_in_category(category, dataset):
    print(category + ":")
    for app in dataset:
        category_name = app[1]
        downloads = app[5]
        app_name = app[0]
        if category_name == category and downloads in ("1,000,000,000+", "500,000,000+", "100,000,000+"):
            print(f"{app_name} : {downloads}")
        # "1,000,000,000+", "500,000,000+" and "100,000,000+" are the three largest possible values

num_install_in_category("COMMUNICATION", googleplay_final)
print("\n", "-" * 100, end="\n\n")
num_install_in_category("PHOTOGRAPHY", googleplay_final)
print("\n", "-" * 100, end="\n\n")
num_install_in_category("VIDEO_PLAYERS", googleplay_final)
print("\n", "-" * 100, end="\n\n")
num_install_in_category("PRODUCTIVITY", googleplay_final)


COMMUNICATION:
WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,00

If we removed these apps from the list, the average number of installations would be much smaller. The same pattern can be observed in the *productivity, photography and video players* categories. Let us take a look at categories that might not be heavily dominated by *app giants*, or at least not by many of them.
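
The effect of a few giants on an average can be illustrated with a toy example. The install counts below are made up and the `threshold` of 100 million is an assumption chosen to match the three largest install brackets; the point is only to show how far the mean moves once the largest apps are excluded, and how the median resists the skew.

```python
from statistics import mean, median

# Made-up install counts for one category: two giants and four small apps.
installs = [1_000_000_000, 500_000_000, 100_000, 50_000, 10_000, 5_000]

# Assumed cut-off: apps with 100 million installs or more count as giants.
threshold = 100_000_000
without_giants = [n for n in installs if n < threshold]

print(mean(installs))        # mean dominated by the two giants
print(median(installs))      # median is far less sensitive to them
print(mean(without_giants))  # mean of the remaining apps
```

In this toy data the mean drops from roughly 250 million to 41,250 once the two giants are removed, while the median of the full list is 75,000.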

In [27]:
category_list = []
for app in googleplay_final:
    category = app[1]
    if category not in category_list:
        category_list.append(category)

print(category_list)

for category in category_list:
    num_install_in_category(category, googleplay_final)
    print("\n", "-" * 100, end="\n\n")

['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION']
ART_AND_DESIGN:

 ----------------------------------------------------------------------------------------------------

AUTO_AND_VEHICLES:

 ----------------------------------------------------------------------------------------------------

BEAUTY:

 ----------------------------------------------------------------------------------------------------

BOOKS_AND_REFERENCE:
Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audioboo

#### Choosing an app [*](#Table-of-contents)
The `GAME` category seems interesting, but it will be hard to compete in the gaming industry. We could look at `PERSONALIZATION`, `PHOTOGRAPHY` or `BOOKS_AND_REFERENCE` apps instead. One proposal could be to build a Bible app and add extra features to attract as many users as possible (such as online prayers, daily verses and proverbs, etc.). We could also build an app similar to CapCut or PicsArt, but this seems a more arduous task. Developing a wallpaper app is another alternative, though producing quality pictures would take a lot of resources.

## Conclusions [*](#Table-of-contents)
In this project, we cleaned and analyzed data with the goal of recommending an application that can be profitable on Google Play and the App Store. We decided that our main focus is going to be applications from these categories: **personalization, books and reference, and photography**. It is up to you to decide in which direction you wish to go, but I hope this project was helpful and interesting to follow!