# Profitable App Profiles for the App Store and Google Play Markets

Our goal for this project is to build a data-driven analysis that helps the company's developers, who build Android and iOS mobile apps, make profitable decisions by understanding which types of apps are likely to attract more users.

At the company, we only build apps that are free to download and install, and our main source of revenue is in-app ads. The number of users of our apps therefore has a direct impact on our revenue.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

* A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.
* A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
from csv import reader

# The App Store dataset
# (a with-block closes the file after reading; utf8 is stated explicitly so
# app names with non-ASCII characters load the same way on every platform)
with open('AppleStore.csv', encoding='utf8') as opened_file:
    apple_dataset = list(reader(opened_file))
apple_header = apple_dataset[0]
apple_dataset = apple_dataset[1:]

# The Google Play dataset
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    google_dataset = list(reader(opened_file))
google_header = google_dataset[0]
google_dataset = google_dataset[1:]

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


print(apple_header)        
print("\n")
explore_data(apple_dataset,0,5, True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


The App Store data set consists of 7,197 rows (excluding the header) and 16 columns. Of these, we will use `track_name`, `price`, `size_bytes`, `rating_count_tot`, `user_rating`, `cont_rating`, and `prime_genre` as the main columns for our analysis.

In [2]:
print(google_header)        
print("\n")
explore_data(google_dataset,0,5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

The Google Play data set consists of 10,841 rows (excluding the header) and 13 columns. Of these, we will use `App`, `Price`, `Reviews`, `Size`, `Installs`, `Content Rating`, and `Genres` as the main columns for our analysis.
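Rather than counting column positions by hand, the indices can be looked up from the header rows. The following is a minimal sketch that assumes the two header lists printed above; `column_indices` is a hypothetical helper name and is not part of the original notebook.

```python
# Header rows copied from the outputs above.
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']


def column_indices(header, names):
    """Map each requested column name to its position in the header row."""
    return {name: header.index(name) for name in names}


apple_cols = column_indices(
    apple_header, ['track_name', 'price', 'rating_count_tot', 'prime_genre'])
google_cols = column_indices(
    google_header, ['App', 'Category', 'Reviews', 'Installs', 'Price', 'Genres'])
print(apple_cols)
print(google_cols)
```

Looking indices up by name also guards against typos: `header.index` raises a `ValueError` if a column name does not exist.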

**Deleting Wrong Data**

The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row. 

In [3]:
print(google_header) 
print('\n')
print(google_dataset[10472])
print('\n')
print('Correct Row:','\n', google_dataset[0])
      

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Correct Row: 
 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 contains 12 elements, whereas a correct row has 13. Comparing it with the header, we can see that the value for the `Category` column is missing, which shifts every later value one position to the left. We will therefore delete this row.
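A row like this can also be found without relying on the discussion forum, by comparing each row's length with the header's. This is a minimal sketch on a two-row sample built from the rows printed above; `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(header, rows):
    """Return (index, length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']

# Two-row sample: the first row of the data set and the faulty row shown above.
sample = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
     '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
     'January 7, 2018', '1.0.0', '4.0.3 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

malformed = find_malformed_rows(header, sample)
print(malformed)  # [(1, 12)]
```

Run against the full `google_dataset`, the same check would be expected to report index 10472, assuming no other row is malformed.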

In [4]:
print(len(google_dataset))
del google_dataset[10472]  # don't run this more than once
print(len(google_dataset))

10841
10840


**Removing Duplicate Entries**

**Part One**

Some apps have duplicate entries. We need to remove the duplicate entries and keep only one per app. For instance, Instagram has 4 different entries:



In [5]:
for app in google_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [6]:
google_unique_app = []
google_duplicate_app = []
for row in google_dataset:
    app_name = row[0]
    if app_name in google_unique_app:
        google_duplicate_app.append(app_name)
    else:
        google_unique_app.append(app_name)
print('Numbers of duplicated apps: ', len(google_duplicate_app))
print('\n')
print('Example of duplicate apps: ', google_duplicate_app[:10])

Numbers of duplicated apps:  1181


Example of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


**Part Two**

If we examine the rows we printed for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The differing numbers suggest that the data was collected at different times. We therefore keep the row with the highest number of reviews, since it is likely the most recent.

To remove the duplicates, we will do the following:
* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

* Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [7]:
reviews_max = {}
for row in google_dataset:
    name = row[0]
    reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    elif name not in reviews_max:
        reviews_max[name] = reviews

            

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.
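The expected value can be worked out from the two numbers already printed above, so that it can be compared with the actual length of the dictionary:

```python
rows_after_cleanup = 10840   # length printed after deleting the faulty row
duplicate_entries = 1181     # count reported by the duplicate check
expected_unique = rows_after_cleanup - duplicate_entries
print('Expected length: ', expected_unique)  # Expected length:  9659
```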

In [8]:
print('Actual length: ', len(reviews_max))

Actual length:  9659


Now, let's use the `reviews_max` dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

* We start by initializing two empty lists, `google_clean` and `already_added`.
* We loop through the Google Play data set, and for every iteration:
    * We isolate the name of the app and the number of reviews.
    * We add the current row (`row`) to the `google_clean` list, and the app name (`name`) to the `already_added` list if:
        * The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
        * The name of the app is not already in the `already_added` list. We need this supplementary condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we only checked `reviews == reviews_max[name]`, we would still end up with duplicate entries for some apps.

In [9]:
google_clean = [] #store new dataset
already_added = [] #store app names

for row in google_dataset:
    name = row[0]
    reviews = float(row[3])
    if reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.append(name)

explore_data(google_clean,0,5,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13
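
As an alternative sketch (not the method used in this notebook), the two passes can be folded into one by storing the best row per app in a dictionary. It assumes Python 3.7+, where dictionaries preserve insertion order. `dedupe_by_max_reviews` is a hypothetical helper name, the sample rows are shortened to four columns, and the Box values are illustrative only.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Shortened rows: (name, category, rating, reviews).
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Box', 'BUSINESS', '4.2', '159872'],   # illustrative values
    ['Box', 'BUSINESS', '4.2', '159872'],   # same review count: only one is kept
]

unique_rows = dedupe_by_max_reviews(sample)
print(unique_rows)
```

One design difference to be aware of: each app stays at the position of its first occurrence, so the row order can differ from `google_clean` even though the set of kept apps is the same.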


**Removing Non-English Apps**

**Part One**

Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both datasets have apps with names that suggest they are not designed for an English-speaking audience.

Each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the `ord()` built-in function.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.
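
A short sketch of the idea, using the characters mentioned above plus the trademark sign; `is_ascii_char` is a hypothetical helper name.

```python
def is_ascii_char(character):
    """True when the character's code point falls in the ASCII range 0 to 127."""
    return ord(character) <= 127


for character in ['a', 'A', '爱', '™']:
    label = 'ASCII' if is_ascii_char(character) else 'non-ASCII'
    print(repr(character), ord(character), label)
```

Symbols such as `™` and emoji fall outside the ASCII range even though they often appear in English app names, so a name should not be rejected because of a single non-ASCII character.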

To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [10]:
def eng_chr(string):
    non_eng_chr = 0
    for character in string:
        if ord(character) > 127:
            non_eng_chr += 1
    if non_eng_chr > 3:
        return False
    else:
        return True

print(eng_chr("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(eng_chr('Instagram'))
print(eng_chr('Docs To Go™ Free Office Suite'))
print(eng_chr('Instachat 😜'))

False
True
True
True


In [11]:
google_english = []
apple_english = []

for row in google_clean:
    name = row[0]
    if eng_chr(name):
        google_english.append(row)

for row in apple_dataset:
    name = row[1]
    if eng_chr(name):
        apple_english.append(row)

explore_data(google_english, 0, 3, True)
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

**Isolating Free Apps**

In this project, we are only building apps that are free to download and install, and our main source of revenue consists of in-app ads. Our dataset contains both free and non-free apps; we only need to keep the ones with no cost for our analysis.
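
Free apps appear as `'0'` in the Google Play sample rows and as `'0.0'` in the App Store sample rows, so each store needs its own string comparison. The sketch below parses the price instead. It assumes that paid prices may carry a leading `$` sign, which is not shown in the sample rows above, and `is_free` is a hypothetical helper name.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' is zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free so they can be reviewed manually.
        return False


print(is_free('0'), is_free('0.0'), is_free('$4.99'), is_free('Everyone'))
# True True False False
```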

In [12]:
google_free_dataset = [row for row in google_english if row[7] == '0']
apple_free_dataset = [row for row in apple_english if row[4] == '0.0']
google_dataset = google_free_dataset
apple_dataset = apple_free_dataset
explore_data(google_dataset,0,6,True)
explore_data(apple_dataset,0,6,True)

        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+

**Most Common Apps by Genre** 

**Part One**

So far in the data cleaning process, we've done the following:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps
* Isolated the free apps

As mentioned, our goal is to determine the kinds of apps that are likely to attract more users since our revenue is tied directly to the number of people using our apps.

To find the kind of app that appeals most to users, we need to look at what is most successful on both the Google Play and App Store markets. Our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app and add it to Google Play, because relatively more data and users are available on the Google Play market.

2. If the app gets a good response from users, develop it further, since users of an app that has reached a certain popularity will expect a more polished version.

3. If the app is profitable after six months, build an iOS version and add it to the App Store to expand our market, attract users on the iOS platform, and ultimately generate more revenue.


After inspecting both data sets, we decided to generate frequency tables from the `Category` and `Genres` columns of the Google Play data set and the `prime_genre` column of the App Store data set.


**Part Two**

To help with analyzing the frequency tables, we first build a function that generates frequency tables showing percentages.

In [13]:
# Function to generate frequency tables that show percentages:

def freq_table(dataset, col_index):
    dataset_dict = {}
    for row in dataset:
        key = row[col_index]
        if key in dataset_dict:
            dataset_dict[key] += 1
        else:
            dataset_dict[key] = 1
    
    for key in dataset_dict:
        percentage = (dataset_dict[key] / len(dataset)) * 100
        dataset_dict[key] = '{0:.2f}'.format(percentage)
        
    return dataset_dict


google_category = freq_table(google_dataset, 1)
google_genre = freq_table(google_dataset, 9)
apple_genre = freq_table(apple_dataset, 11)
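
As an alternative sketch, the same percentages can be computed with `collections.Counter` from the standard library, keeping the values as floats and formatting them only when printing. `freq_table_percent` is a hypothetical name, and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table_percent(rows, col_index):
    """Return {value: percentage} for one column, with percentages kept as floats."""
    counts = Counter(row[col_index] for row in rows)
    total = len(rows)
    return {key: count / total * 100 for key, count in counts.items()}


# Illustrative rows only: (name, genre).
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_percent(sample, 1)

for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', '{0:.2f}'.format(pct))
```

Keeping floats in the table means later steps can sort or compare the percentages numerically without converting them back from text.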



Next, we set up a function that takes a data set and a column index and prints each category or genre with its percentage, in descending order.
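
One detail is worth checking when sorting a table like this. If the percentages are stored as formatted strings, Python compares them character by character, so `'7.88'` sorts ahead of `'58.16'`. The sketch below uses three illustrative values to show the difference; converting to `float` for the sort gives numeric order.

```python
as_strings = ['7.88', '58.16', '4.97']

by_text = sorted(as_strings, reverse=True)               # character-by-character
by_value = sorted(as_strings, key=float, reverse=True)   # numeric comparison

print(by_text)   # ['7.88', '58.16', '4.97']
print(by_value)  # ['58.16', '7.88', '4.97']
```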

In [14]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        # freq_table returns percentages as strings; convert them to float
        # so that the sort below is numeric rather than character by character
        key_val_as_tuple = (float(table[key]), key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', '{0:.2f}'.format(entry[0]))

**Part Three**

In [15]:
display_table(apple_dataset, -5)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


After running `display_table` on the App Store data set, we can observe the following:

* The most common genre is **Games**, which accounts for 58.16% of the apps, far ahead of the next most common genre, Entertainment, at 7.88%.

* Most of the apps are designed for entertainment (Games, Social Networking, Shopping, Photo & Video, and so on). This is likely because such apps scale easily and are easy to access; they are meant to reach as many users as possible.

* Based on the App Store data set, building an app in the Games genre looks promising for attracting users, given the large gap between Games and the other genres.


Let's continue by examining the `Category` and `Genres` columns of the Google Play data set:

In [16]:
display_table(google_dataset, 1)

FAMILY : 18.91
GAME : 9.72
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.90
PRODUCTIVITY : 3.89
FINANCE : 3.70
MEDICAL : 3.53
SPORTS : 3.40
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.80
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.40
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.80
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.60


In [17]:
display_table(google_dataset, 9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.70
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.10
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.80
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.40
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.82
Weather : 0.80
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.60
Art & Design : 0.60
Parenting : 0.50
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.20
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.1

As we can see from both the `Category` and `Genres` tables, games no longer dominate on Google Play. Family (18.91%) and Game (9.72%) lead the `Category` table, and the remaining apps are spread fairly evenly across the market between for-fun and practical apps.

Now we are going to analyze which type of apps would attract most users.

**Most Popular Apps by Genre on the App Store**

To find out which genres are the most popular, we calculate the average number of installs for each app genre. For the Google Play data set, this information is in the `Installs` column. The App Store data set has no install counts, so we take the total number of user ratings, found in the `rating_count_tot` column, as a proxy.

We'll start by calculating the average number of user ratings per app genre on the App Store:

In [18]:
for genre in apple_genre:
    total = 0
    len_genre = 0
    for row in apple_dataset:
        genre_app = row[11]
        if genre_app == genre:
            n_ratings = float(row[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = '{0:.2f}'.format(total / len_genre)
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.80
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.90
Book : 39758.50
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.00
Medical : 612.00
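
The nested loop above scans the whole data set once per genre. As an alternative sketch, the same averages can be collected in a single pass with two dictionaries. `average_by_key` is a hypothetical helper name, and the sample rows are shortened versions of the first five App Store rows shown earlier.

```python
def average_by_key(rows, key_idx, value_idx):
    """Return {key: mean of the value column} using a single pass over the rows."""
    totals = {}
    counts = {}
    for row in rows:
        key = row[key_idx]
        totals[key] = totals.get(key, 0.0) + float(row[value_idx])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Shortened rows: (track_name, rating_count_tot, prime_genre).
sample = [
    ['Facebook', '2974676', 'Social Networking'],
    ['Instagram', '2161558', 'Photo & Video'],
    ['Clash of Clans', '2130805', 'Games'],
    ['Temple Run', '1724546', 'Games'],
    ['Pandora - Music & Radio', '1126879', 'Music'],
]

averages = average_by_key(sample, key_idx=2, value_idx=1)
for genre, avg in averages.items():
    print(genre, ':', '{0:.2f}'.format(avg))
```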


On average, Navigation apps have the highest number of user ratings. However, when we look more closely at the apps in this genre, most of the ratings come from Waze and Google Maps, which pull the average far above what a typical Navigation app receives. Below, we list the apps in the `Navigation` genre.

In [19]:
for app in apple_dataset:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
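
Comparing the mean with the median makes the effect of the two largest apps visible. The sketch below uses the six rating counts listed above and the standard-library `statistics` module.

```python
from statistics import mean, median

# Rating counts from the Navigation listing above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

avg = round(mean(navigation_ratings), 2)
mid = median(navigation_ratings)
top_two_share = sum(sorted(navigation_ratings)[-2:]) / sum(navigation_ratings)

print('Mean:  ', avg)   # 86090.33
print('Median:', mid)   # 8196.5
print('Share of ratings from the top two apps: {0:.1%}'.format(top_two_share))
```

The mean matches the Navigation average computed earlier, while the median is roughly a tenth of it, which is what we would expect when two apps account for nearly all of the ratings.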


Although the `Navigation` genre has the highest average number of ratings, our goal is to build apps that generate income through advertising, and an ad-supported app is unlikely to win much support if it is meant to be used while operating a vehicle.

Let's move on to the Google Play data set.
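
The `Installs` values are text buckets such as `'10,000+'`, so they have to be converted to numbers before they can be averaged. A minimal sketch of that conversion follows; `parse_installs` is a hypothetical helper name.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float (1000000.0)."""
    return float(text.replace(',', '').replace('+', ''))


for raw in ['10,000+', '5,000,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Note that each bucket is a lower bound: an app listed as `'10,000+'` is counted as exactly 10,000 installs, so the averages are approximate.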

In [20]:
for category in google_category:
    total = 0
    len_category = 0
    for row in google_dataset:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    avg_n_installs = '{0:.2f}'.format(avg_n_installs)
    print(category, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.60
FAMILY : 3695641.82
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.40
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.30
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.20
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77


After computing the averages from the `Category` column, we find that Communication is the most popular category in terms of number of installs.

Next, we compute the average number of installs for each value in the `Genres` column of the Google Play data set, so that we can make a more thorough observation of the market.

In [21]:
for genre in google_genre:
    total = 0
    len_genre = 0
    for row in google_dataset:
        app_genre = row[9]
        if app_genre == genre:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs)
            len_genre += 1
    avg_n_installs = total / len_genre
    avg_n_installs = '{0:.2f}'.format(avg_n_installs)
    print(genre, ':', avg_n_installs)

Art & Design : 2122850.94
Art & Design;Creativity : 285000.00
Auto & Vehicles : 647317.82
Beauty : 513151.89
Books & Reference : 8767811.89
Business : 1712290.15
Comics : 831873.15
Comics;Creativity : 50000.00
Communication : 38456119.17
Dating : 854028.83
Education : 550185.44
Education;Creativity : 2875000.00
Education;Education : 4759517.00
Education;Pretend Play : 1800000.00
Education;Brain Games : 5333333.33
Entertainment : 5602792.78
Entertainment;Brain Games : 3314285.71
Entertainment;Creativity : 4000000.00
Entertainment;Music & Video : 6413333.33
Events : 253542.22
Finance : 1387692.48
Food & Drink : 1924897.74
Health & Fitness : 4188821.99
House & Home : 1331540.56
Libraries & Demo : 638503.73
Lifestyle : 1412998.34
Lifestyle;Pretend Play : 10000000.00
Card : 3815462.50
Arcade : 22888365.49
Puzzle : 8302861.91
Racing : 15910645.68
Sports : 4596842.62
Casual : 19569221.60
Simulation : 3475484.09
Adventure : 4922785.33
Trivia : 3475712.70
Action : 12603588.87
Word : 9094458.70


According to the Google Play data set, `Communication` tops the list of genres by average number of installs. This fits the broader trend: as phones have become a necessity for everyday tasks, demand for communication apps has grown, since staying in touch now takes only a few taps. In addition, generating ad revenue from a communication app is practical because such apps tend to have strong cross-platform features.

However, the `Communication` figures are skewed. Despite the huge number of installs, the category is dominated by a few big names such as Facebook Messenger, WhatsApp, and Skype. So although users clearly want quick, efficient, and trendy communication platforms, we would face long odds competing against the established players. We will therefore look at alternative ideas.
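
One way to gauge this kind of skew is to recompute the average while leaving out the very largest apps. The sketch below does this on made-up rows; the app names, install counts, and the 100 million cut-off are all assumptions for illustration and are not taken from the data set.

```python
def average_installs(rows, installs_idx=1, cap=None):
    """Average installs, optionally ignoring apps at or above the `cap` threshold."""
    values = [row[installs_idx] for row in rows]
    if cap is not None:
        values = [value for value in values if value < cap]
    return sum(values) / len(values) if values else 0.0


# Hypothetical (name, installs) rows, not taken from the data set.
sample = [
    ('Giant Messenger', 1_000_000_000),
    ('Mid-size Chat', 10_000_000),
    ('Small Chat A', 100_000),
    ('Small Chat B', 50_000),
]

avg_all = average_installs(sample)
avg_capped = average_installs(sample, cap=100_000_000)
print('All apps:     {0:.2f}'.format(avg_all))
print('Below the cap: {0:.2f}'.format(avg_capped))
```

If the capped average for a real category drops sharply, as it does for this toy sample, the headline number says more about a handful of giants than about what a new app could expect.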

Let's turn to a different genre that also has a relatively high number of installs: `Adventure;Action & Adventure`. Below, we list the apps in this genre.

In [24]:
for app in google_dataset:
    if app[9] == 'Adventure;Action & Adventure':
        print(app[0], ':', app[5])

Leo and Tig : 1,000,000+
Transformers Rescue Bots: Hero Adventures : 5,000,000+
ROBLOX : 100,000,000+


`ROBLOX` is a giant among games designed for everyone. Even so, the `Games` genre is usually a versatile and efficient way to generate substantial advertising revenue, since ads can be integrated into the game itself (for example, watching an ad to gain one more life). Furthermore, `Games` is not only popular on Google Play but also dominates the App Store, which opens up more potential for the development team and our clients in terms of both scalability and profit.

**Conclusions**

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

Based on our analysis of the Google Play and App Store data sets, we would like to proceed with the idea of building an app in the `Games` genre: a multiplayer platform where players are placed on an open-world map and must fight each other or cooperate to survive.