# <ins>Profitable App Profiles for the App Store and Google Play Markets</ins>

## Viewing and Exploring our data 

We'll begin by opening both datasets and exploring each of them to identify which columns will be most useful in our analysis. Together, the App Store and Google Play list more than 4 million apps, so to avoid costs and save time in this entry-level project we'll use smaller sample datasets to get a basic idea of which app profiles have the potential to be the most profitable.

The code below first opens both our datasets as a list of lists, excluding the header, and then uses the explore_data function to print out a specified range of rows. The function also allows us to see how many total rows and columns there are in each dataset.  



In [16]:
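from csv import reader

# Open both files; the encoding is stated explicitly (assumed to be utf8) so non-ASCII app names load the same on every platform
opened_file1 = open('AppleStore.csv', encoding='utf8')
opened_file2 = open('googleplaystore.csv', encoding='utf8')
read_file1 = reader(opened_file1)
read_file2 = reader(opened_file2)
apple_dataset = list(read_file1)
google_dataset = list(read_file2)
opened_file1.close()
opened_file2.close()

# Separate the header row from the data rows
apple = apple_dataset[1:]
google = google_dataset[1:]
apple_columns = apple_dataset[0]
google_columns = google_dataset[0]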

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') 

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


print(apple_columns)
print('\n')
explore_data(apple,0,5,True)
print('-------------------------------------------------------------------------\n')
print(google_columns)
print('\n')
explore_data(google,0,5,True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16
---------------

At a quick glance, the columns that might be useful for the purpose of our analysis from the Google dataset are: 'App', 'Category', 'Reviews', 'Installs', 'Price', and 'Genres'.

The columns that might be useful from the Apple dataset are: 'track_name', 'price', 'rating_count_tot', and 'prime_genre'.
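
The loading step at the start of this section can also be wrapped in a small helper. The sketch below is one possible way to do it, not the code used in this project: `load_csv` is a hypothetical name, the `utf8` encoding is an assumption, and the tiny sample file only stands in for the real datasets so the example runs on its own.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    # encoding='utf8' is an assumption; newline='' is what the csv docs recommend.
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the real datasets.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', encoding='utf8', newline='') as f:
    csv.writer(f).writerows([
        ['App', 'Category'],
        ['Box', 'BUSINESS'],
        ['Docs To Go™ Free Office Suite', 'BUSINESS'],
    ])

sample_header, sample_rows = load_csv(sample_path)
print(sample_header)
print(len(sample_rows))
```

Using `with` closes the file even if reading fails, which the plain `open` calls do not guarantee.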

## Data Cleaning

It is often said that data scientists spend about 80% of their time cleaning data. We want to make sure that we are analyzing the right information; otherwise our analysis will be of no use for the task at hand. When a company hires us, it expects results that will have an impact on its business, so everything has to be accurate, and getting there can be time-consuming. In this project, the data cleaning process consists of four tasks:
- Deleting Wrong Data
- Removing Duplicate Entries
- Removing Non-English Apps
- Isolating the Free Apps

Let's get started.

### Deleting Wrong Data

To find errors in a dataset, we can either inspect it manually or search the internet for inconsistencies that others have already reported. If we were working for a big-time client we would surely do both, but for the sake of time, and since this is a simple project, we'll scour the web for information on the datasets.

After properly searching the web, we found that the Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct:


In [17]:
print('----------------------------------------------------\n')
print(google_columns)
print(len(google_columns))

print('\n')
print(google[10472])
print(len(google[10472]))


----------------------------------------------------

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
13


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12


Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and it has only 12 fields against the 13 in the header. As mentioned in the discussion section, the value for the 'Category' column is missing, so every later value has shifted one column to the left and the rating reads as 19. This is clearly off because the maximum rating for a Google Play app is 5. As a consequence, we'll delete this row and then print index 10472 again to verify that a different row now occupies it:

In [19]:
del google[10472]
print(google[10472])
print('\n')
print('Length of Google dataset after deleting row with error: ', len(google))


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Length of Google dataset after deleting row with error:  10840


The Apple store dataset seems to be correct and we didn't find any information on errors, therefore we will now move on to the next step of the data cleaning process.
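
The faulty row above gave itself away by having fewer fields than the header, and that check can be automated. The sketch below is a minimal illustration under our own assumptions: `find_malformed_rows` is a hypothetical helper, and the sample rows (including 'Broken Frame App') are made up for the example.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Box', 'BUSINESS', '4.2'],
    ['Broken Frame App', '19'],  # made-up row with a missing field
    ['Google My Business', 'BUSINESS', '4.4'],
]

bad = find_malformed_rows(sample_header, sample_rows)
print(bad)

# Delete from the end so that earlier indices stay valid.
for i, _ in reversed(bad):
    del sample_rows[i]
print(len(sample_rows))
```

A length check like this only catches rows with missing or extra fields; a row with the right number of fields but wrong values would still need a separate check.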

### Removing Duplicate Entries

We're going to continue cleaning our data by seeing if our datasets have any duplicate apps. Our code below will identify the number of duplicates and with that information we'll know how many apps we have to remove in each dataset:

In [20]:
duplicate_apps_google = []
unique_apps_google = []

for app in google:
    app_name = app[0]
    if app_name in unique_apps_google:
        duplicate_apps_google.append(app_name)
    else:
        unique_apps_google.append(app_name)

print('Number of duplicate apps in Google dataset:', len(duplicate_apps_google))
print('Number of unique apps in Google dataset:', len(unique_apps_google))



Number of duplicate apps in Google dataset: 1181
Number of unique apps in Google dataset: 9659
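
The loop above checks each name against a growing list, which gets slower as the list grows. As an alternative sketch (not the approach used in this notebook), `collections.Counter` can tally the names in one pass. The helper name `count_duplicates` and the sample rows are our own assumptions.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of repeated rows)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence after the first one counts as a duplicate.
    n_duplicates = sum(c - 1 for c in counts.values())
    return n_unique, n_duplicates


sample_rows = [['Box'], ['Box'], ['Box'], ['Slack'], ['Zoom'], ['Zoom']]
n_unique, n_duplicates = count_duplicates(sample_rows)
print('Unique:', n_unique)
print('Duplicates:', n_duplicates)
```

The two numbers always add up to the number of rows, which is the same relationship we rely on later to double-check the unique count.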


In [21]:
duplicate_apps_apple = {}
unique_apps_apple = {}

for app in apple:
    app_id = int(app[0])
    app_name = app[1]
    if app_name in unique_apps_apple and unique_apps_apple[app_name] == app_id:
        duplicate_apps_apple[app_name] = app_id
    else:
        unique_apps_apple[app_name] = app_id

print('Number of duplicate apps in Apple dataset:', len(duplicate_apps_apple))
print('Number of unique apps in Apple dataset:', len(unique_apps_apple))








Number of duplicate apps in Apple dataset: 0
Number of unique apps in Apple dataset: 7195


We see that the Apple dataset has no duplicate entries, so let's just print out some names of duplicates in the Google dataset:


In [22]:
sample_data = duplicate_apps_google[:5]
print(sample_data)


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Now, let's observe these duplicates:

In [23]:
for app_name in sample_data:
    for app in google:
        if app_name in app:
            print(app)
            print('\n')
            

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,

According to the first five duplicates, we can see that the columns are pretty much the same and there is barely any variance, therefore removing duplicates randomly would be justified. 

Let's print a few more rows:

In [24]:
sample_data = duplicate_apps_google[:20]
for app_name in sample_data:
    for app in google:
        if app_name in app:
            print(app)
            print('\n')

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,

If you look at each column carefully, you can see that one column does show a little variance: the reviews column. The number of reviews differs between some duplicate rows, so rather than removing duplicates at random we can use it as the criterion for deciding which row to keep. We will keep the row with the highest number of reviews, since a higher count suggests that the row was collected more recently.

The way we will go about removing the duplicates will be to create a dictionary whose keys will be the name of the unique apps and values will be the highest number of reviews. Before we code our dictionary, let's see how many unique apps there should be:

In [25]:
print('Number of unique apps:' , len(unique_apps_google))

Number of unique apps: 9659


Let's verify this number again with a more mathematical approach using the following code:

In [26]:
print('Number of unique apps:' , len(google) - len(duplicate_apps_google))

Number of unique apps: 9659


Now let's code our dictionary to include our unique apps as keys and our criterion of only using the duplicates with the highest reviews as the values:

In [27]:
reviews_max = {}
for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

To verify that this is correct, all we do is get the length of our dictionary and make sure it equals the number of unique apps: 

In [28]:
print(len(unique_apps_google) == len(reviews_max))

True


Before we move on, our last step in removing duplicates is to make sure we keep only one entry for each app, the one with the highest number of reviews. If you explore the dataset, you can see that some apps have more than one entry with the same highest number of reviews, so we need to make sure we use only one of them. For example, the Box app has three entries, all with the same number of reviews. If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for such apps. Here is the code for our final list, without ANY duplicates:

In [29]:
android_clean = []
already_added = []

for app in google:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

print(len(android_clean))        


9659


We used the reviews_max dictionary here because it holds the highest review count for every app, but several rows can share that count and we want only one of them. Line 7 adds the condition we need to get a truly unique list: a row is appended to the list of lists android_clean only if its review count equals the maximum in reviews_max and its name has not been added already, which is our desired result in this second task.
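
The two passes above (first build reviews_max, then filter) can also be folded into a single pass by storing the best row itself in a dictionary. This is only a sketch of that alternative: `keep_most_reviewed` is a hypothetical helper and the shortened sample rows are for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the row seen first is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

clean = keep_most_reviewed(sample_rows)
for row in clean:
    print(row)
```

The result has one row per app, but the rows come out in order of each name's first appearance, which may differ from the order produced by the two-pass version.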

Let's continue to the next step.

### Removing Non-English Apps

Our company is only interested in analyzing the most popular English apps at this point, so we will narrow our data further to include only apps aimed at an English-speaking audience. To build a list of only English apps, we first have to decide how to identify one. In the ASCII (American Standard Code for Information Interchange) system, the characters commonly used in English text all have code numbers in the range 0 to 127. With this information, we'll first construct a function that checks whether an app name is English:

In [30]:
def isEnglish(string):
    
    non_english_chars = ''
    
    for character in string:
        
        if ord(character) > 127:
            non_english_chars += character
            if len(non_english_chars) > 3:
                return False
    return True

print(isEnglish('Instagram'))
print(isEnglish('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(isEnglish('Docs To Go‚Ñ¢ Free Office Suite'))
print(isEnglish('Instachat üòú'))

True
False
True
True


As you can see, we nested an if statement inside the loop to collect every character that falls outside the ASCII range, that is, every character whose code number is greater than 127. We do this because some English apps in our datasets include characters such as emojis and trademark symbols, and rejecting a name for a single such character would exclude the app itself. To avoid this, our code includes a clause (if len(non_english_chars) > 3) that labels an app as non-English only if its name has more than 3 non-ASCII characters.

Next, we'll use our isEnglish function to create a program that creates a list of lists of english and non-english apps for both datasets:  

In [31]:
english_apps_apple = []
non_english_apps_apple = []
english_apps_google = []
non_english_apps_google = []

for app in apple:
    name = app[1]
    if isEnglish(name):
        english_apps_apple.append(app)
    else:
        non_english_apps_apple.append(app)
        
for app in android_clean:
    name = app[0]
    if isEnglish(name):
        english_apps_google.append(app)
    else:
        non_english_apps_google.append(app)


explore_data(english_apps_google, 0, 10, rows_and_columns=True)
print('\n')
explore_data(non_english_apps_google, 0, 10, rows_and_columns=True)
print('\n')
explore_data(english_apps_apple, 0, 10, rows_and_columns=True)
print('\n')
explore_data(non_english_apps_apple, 0, 10, rows_and_columns=True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,00

We now have two datasets free of errors, duplicates, and non-English apps. Isolating the free apps will be our last step in the data cleaning process.

### Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps so we'll need to isolate only the free apps for our analysis.

In [32]:
free_apps_apple = []
paid_apps_apple = []

free_apps_google = []
paid_apps_google = []

for app in english_apps_apple:
    price = app[4]
    if price == '0.0':
        free_apps_apple.append(app)
    else:
        paid_apps_apple.append(app)
        

for app in english_apps_google:
    price = app[7]
    if price == '0':
        free_apps_google.append(app)
    else:
        paid_apps_google.append(app)


explore_data(free_apps_google, 0, 10, rows_and_columns=True)
print('\n')
explore_data(free_apps_apple, 0, 10, rows_and_columns=True)



['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,00

The code above is simple: it uses two separate for loops to build new datasets from the most recent lists (english_apps_apple and english_apps_google). We iterate through each list, extract the price, and use an if/else clause to identify the free apps and append them to the final lists (free_apps_apple and free_apps_google).
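
The comparisons above match the exact strings '0.0' and '0', which works for these two datasets but depends on how each store formats its prices. A slightly more tolerant sketch converts the text to a number first. The helper `is_free` is hypothetical, and the handling of a leading '$' is an assumption about how paid prices may be written.

```python
def is_free(price_text):
    """True when a price string such as '0', '0.0' or '$0.00' means zero."""
    # Assumption: paid prices may carry a leading '$' that must be removed.
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, '->', is_free(price))
```

This version raises a ValueError on text that is not a number at all, which can be useful as an early warning that a malformed row slipped through the cleaning steps.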

This concludes our data cleaning process for this project. Now let's begin our analysis using our clean and updated datasets.

## Analysis of App Profiles 

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we then develop it further.
- If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.
  
We will need to find app profiles that are successful on both markets in order to reach our end goal. What defines a successful app profile? In our case, there are two things that we should be looking for: commonality and popularity. Therefore, we will find the most common apps by genre first and then move on to finding the most popular apps.  

### Most Common Apps by Genre

Let's analyze both data sets and identify the columns we could use to generate frequency tables to find out what the most common genres in each market are.

In [33]:
print(apple_columns)
print('\n')
print(google_columns)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In the Apple dataset we will use the 'prime_genre' column and for the Google dataset we will use the 'Genres' column. In addition to 'Genres', we will also do an analysis using the 'Category' column of the Google dataset to see if we get differing results using similar columns.

Let's now build a function that generates a frequency table with one entry per genre, using the genre itself as the key and the percentage of apps that belong to that genre as the value. We will also create a function called display_table that prints the table sorted from the highest to the lowest percentage.

In [50]:
def freq_table(dataset, index):
    
    frequency_table = {}
    
    for entry in dataset:
        
        key = entry[index]
        if key in frequency_table:
            frequency_table[key] += 1
        else:
            frequency_table[key] = 1
            
    for key in frequency_table:
        frequency_table[key] = round((frequency_table[key]/len(dataset)) * 100, 4)
        
    
    return frequency_table





In [51]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
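
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library. This is an alternative illustration rather than the functions used in this notebook; `freq_table_pct`, `sorted_table`, and the sample rows are our own names and data.

```python
from collections import Counter


def freq_table_pct(rows, index):
    """Map each value in the given column to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return {key: round(count / total * 100, 4) for key, count in counts.items()}


def sorted_table(rows, index):
    """Return (value, percentage) pairs from highest to lowest percentage."""
    table = freq_table_pct(rows, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


sample_rows = [
    ['A', 'Games'],
    ['B', 'Games'],
    ['C', 'Music'],
    ['D', 'Games'],
]

result = sorted_table(sample_rows, 1)
for genre, pct in result:
    print(genre, ':', pct)
```

Sorting by the percentage alone means that genres with equal percentages may appear in a different order than in display_table, which sorts the (percentage, genre) tuples as a whole.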


Using these two functions we can now print and visualize the most common free apps in both the App Store and Google Play: 

In [52]:
print('Percentage of most common free English Apple apps by genre\n')
display_table(free_apps_apple, 11)

Percentage of most common free English Apple apps by genre

Games : 58.1626
Entertainment : 7.8833
Photo & Video : 4.9659
Education : 3.6623
Social Networking : 3.2899
Shopping : 2.6071
Utilities : 2.514
Sports : 2.1415
Music : 2.0484
Health & Fitness : 2.0174
Productivity : 1.7381
Lifestyle : 1.5829
News : 1.3346
Travel : 1.2415
Finance : 1.1173
Weather : 0.869
Food & Drink : 0.807
Reference : 0.5587
Business : 0.5276
Book : 0.4345
Navigation : 0.1862
Medical : 0.1862
Catalogs : 0.1241


We can see that among the free English apps, the majority (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

Looking at these results, the general impression is that the free apps in the App Store mostly consist of apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.).

Let's continue by examining the 'Genres' and 'Category' columns of the Google Play data set. 

In [55]:
print('\nPercentage of most common free English Google apps by genre\n')
display_table(free_apps_google, 9)



Percentage of most common free English Google apps by genre

Tools : 8.4499
Entertainment : 6.0695
Education : 5.3475
Business : 4.5916
Productivity : 3.8921
Lifestyle : 3.8921
Finance : 3.7004
Medical : 3.5311
Sports : 3.4634
Personalization : 3.3168
Communication : 3.2378
Action : 3.1024
Health & Fitness : 3.0799
Photography : 2.9445
News & Magazines : 2.7978
Social : 2.6625
Travel & Local : 2.324
Shopping : 2.245
Books & Reference : 2.1435
Simulation : 2.042
Dating : 1.8615
Arcade : 1.8502
Video Players & Editors : 1.7712
Casual : 1.7599
Maps & Navigation : 1.3989
Food & Drink : 1.241
Puzzle : 1.1282
Racing : 0.9928
Role Playing : 0.9364
Libraries & Demo : 0.9364
Auto & Vehicles : 0.9251
Strategy : 0.9138
House & Home : 0.8236
Weather : 0.801
Events : 0.7107
Adventure : 0.6769
Comics : 0.6092
Beauty : 0.5979
Art & Design : 0.5979
Parenting : 0.4964
Card : 0.4513
Casino : 0.4287
Trivia : 0.4174
Educational;Education : 0.3949
Board : 0.3836
Educational : 0.3723
Education;Education : 

On Google Play, it seems that not many apps are designed for fun, and it looks as if a good number of apps are designed for practical purposes (tools, education, business, lifestyle, productivity, etc.). 

Before we do a more thorough analysis, let's look at the frequency table using the 'Category' column:

In [56]:
print('\nPercentage of most common free English Google apps by category\n')
display_table(free_apps_google, 1)


Percentage of most common free English Google apps by category

FAMILY : 18.9079
GAME : 9.7247
TOOLS : 8.4612
BUSINESS : 4.5916
LIFESTYLE : 3.9034
PRODUCTIVITY : 3.8921
FINANCE : 3.7004
MEDICAL : 3.5311
SPORTS : 3.3958
PERSONALIZATION : 3.3168
COMMUNICATION : 3.2378
HEALTH_AND_FITNESS : 3.0799
PHOTOGRAPHY : 2.9445
NEWS_AND_MAGAZINES : 2.7978
SOCIAL : 2.6625
TRAVEL_AND_LOCAL : 2.3353
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.1435
DATING : 1.8615
VIDEO_PLAYERS : 1.7938
MAPS_AND_NAVIGATION : 1.3989
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.9589
LIBRARIES_AND_DEMO : 0.9364
AUTO_AND_VEHICLES : 0.9251
HOUSE_AND_HOME : 0.8236
WEATHER : 0.801
EVENTS : 0.7107
PARENTING : 0.6543
ART_AND_DESIGN : 0.6431
COMICS : 0.6205
BEAUTY : 0.5979


Observing the results, we see that many of the entries also appear in the 'Genres' column with similar percentages, but the top entry, 'FAMILY', is present only in the 'Category' column. However, if we go to Google Play and browse the family category, we will see that it is mostly made up of games for kids.

The difference between the Genres and the Category columns is not crystal clear, but we can see that in the Genres column a general games entry is missing and we could have mistakenly concluded that 'Tools' is the most common type of app. This tells us that the frequency table using 'Genres' might be too scattered for a proper analysis, thus we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is mostly made up of apps designed for fun, while Google Play shows a landscape that also puts practical apps as a viable contender for commonality. Now we'd like to get an idea about the kind of apps that have the most users, so let's construct a table to see which apps by genre are the most popular.


### Most Popular Apps by Genre on the App Store

We can calculate popularity by seeing which type of apps have the most users. Looking at the dataset, it makes sense to use the 'Installs' column for Google Play, as it tells us the number of downloads for each app. Unfortunately in the App Store dataset, there is no column that tells us how many installs there are for each app. We'll have to think outside the box here and use the 'rating_count_tot' column. This column gives us the total number of user ratings for each app. The logic here is that the more ratings an app has, the higher the number of users and thus the higher its popularity. 

Let's begin the final part of our project by calculating the average number of user ratings per app by genre in the App Store: 


In [110]:
prime_genre_table = freq_table(free_apps_apple, 11)
table = []

for genre in prime_genre_table:
    total = 0
    len_genre = 0
    
    for app in free_apps_apple:
        genre_app = app[11]
        
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
            
    table_tuple = (total/len_genre, genre)
    table.append(table_tuple)

table_sorted = sorted(table, reverse = True)
    
for entry in table_sorted:
    print('{} : {:,.0f}'.format(entry[1], entry[0]))    
    

Navigation : 86,090
Reference : 74,942
Social Networking : 71,548
Music : 57,327
Weather : 52,280
Book : 39,758
Food & Drink : 33,334
Finance : 31,468
Photo & Video : 28,442
Travel : 28,244
Shopping : 26,920
Health & Fitness : 23,298
Sports : 23,009
Games : 22,789
News : 21,248
Productivity : 21,028
Utilities : 18,684
Lifestyle : 16,486
Entertainment : 14,030
Business : 7,491
Education : 7,004
Catalogs : 4,004
Medical : 612
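
The nested loops above scan the whole dataset once per genre. As a sketch of an alternative, the totals and counts can be accumulated in a single pass with `collections.defaultdict`. The helper `average_by_key` is hypothetical and the sample numbers are made up.

```python
from collections import defaultdict


def average_by_key(rows, key_index, value_index):
    """Average the numeric column at value_index for each value at key_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += float(row[value_index])
        counts[row[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up rows: [name, genre, number of ratings]
sample_rows = [
    ['App one', 'Navigation', '300'],
    ['App two', 'Navigation', '100'],
    ['App three', 'Reference', '1000'],
]

averages = average_by_key(sample_rows, 1, 2)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print('{} : {:,.0f}'.format(genre, avg))
```

With only a few dozen genres the speed difference hardly matters here, but the single-pass form scales better if the number of groups grows.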


We see that 'Navigation' apps lead the way with the highest average number of user ratings. Let's see which apps dominate this genre and exactly how many ratings they have:

In [108]:
for app in free_apps_apple:
    if app[11] == 'Navigation':  # index 11 is 'prime_genre', the same column as app[-5]
        print('{} : {:,}'.format(app[1], int(app[5])))

Waze - GPS Navigation, Maps & Real-time Traffic : 345,046
Google Maps - Navigation & Transit : 154,911
Geocaching® : 12,811
CoPilot GPS – Car Navigation & Offline Maps : 3,582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


When we list the apps that account for most of the ratings, we find that Waze and Google Maps hold the vast majority of them. Since two apps dominate the entire genre and the next most popular app has only 12,811 ratings, we have to ask ourselves whether we really want to spend our time competing with these two giants. Is there anything different we could create that would persuade users to choose our app over these two? The goal of our company is to increase the number of people who use our apps, but unless we have an idea that will disrupt the GPS market, creating a navigation app right now seems like a roll of the dice, and it would more likely be a miss than a hit given how heavily people depend on Waze and Google Maps.

Let's move on and explore the second most reviewed genre, which is 'Reference' and see if there's a potential profitable profile in there:

In [109]:
for app in free_apps_apple:
    if app[11] == 'Reference':  # index 11 is 'prime_genre', the same column as app[-5]
        print('{} : {:,}'.format(app[1], int(app[5])))

Bible : 985,920
Dictionary.com Dictionary & Thesaurus : 200,047
Dictionary.com Dictionary & Thesaurus for iPad : 54,175
Google Translate : 26,786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18,418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17,588
Merriam-Webster Dictionary : 16,849
Night Sky : 12,122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8,535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4,693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1,497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The Bible app has by far the most ratings here, but beyond that this genre is not dominated by only a couple of apps. One reason could be that many different books and reference works are popular, not just a select few. Given this information, we could turn a popular book into an app and add features that none of the existing apps offer. For example, there don't seem to be any medical references listed here, so maybe we could create a medical dictionary for medical professionals or even patients.


### Most Popular Apps by Genre on Google Play

Now that we have identified a potentially popular profile for the App Store, let's analyze Google Play to check whether a book app would suit our company. Remember, our goal is to create a Google Play app first and see if it does well before we create it for the App Store.

We noted earlier that the 'Installs' column can give us an idea of the popularity of an app genre. Keep in mind that 'Installs' has a flaw: it doesn't give an exact number, only the minimum number of installs (for example, '1,000+'). We will therefore treat each value as that minimum and calculate the average number of installs per category.

In [103]:
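# Categories come from the frequency table built on column 1 (Category)
category_table = freq_table(free_apps_google, 1)
table = []

for category in category_table:
    total = 0          # sum of minimum installs in this category
    len_category = 0   # number of apps in this category

    for app in free_apps_google:
        category_app = app[1]

        if category_app == category:
            # Strip '+' and ',' so the Installs string can be converted to a number
            installs = app[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_category += 1

    # Store the average first so that sorting orders the table by average installs
    table_tuple = (total/len_category, category)
    table.append(table_tuple)

table_sorted = sorted(table, reverse = True)

for entry in table_sorted:
    print('{} : {:,.0f}'.format(entry[1], entry[0]))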
           

COMMUNICATION : 38,456,119
VIDEO_PLAYERS : 24,727,872
SOCIAL : 23,253,652
PHOTOGRAPHY : 17,840,110
PRODUCTIVITY : 16,787,331
GAME : 15,588,016
TRAVEL_AND_LOCAL : 13,984,078
ENTERTAINMENT : 11,640,706
TOOLS : 10,801,391
NEWS_AND_MAGAZINES : 9,549,178
BOOKS_AND_REFERENCE : 8,767,812
SHOPPING : 7,036,877
PERSONALIZATION : 5,201,483
WEATHER : 5,074,486
HEALTH_AND_FITNESS : 4,188,822
MAPS_AND_NAVIGATION : 4,056,942
FAMILY : 3,695,642
SPORTS : 3,638,640
ART_AND_DESIGN : 1,986,335
FOOD_AND_DRINK : 1,924,898
EDUCATION : 1,833,495
BUSINESS : 1,712,290
LIFESTYLE : 1,437,816
FINANCE : 1,387,692
HOUSE_AND_HOME : 1,331,541
DATING : 854,029
COMICS : 817,657
AUTO_AND_VEHICLES : 647,318
LIBRARIES_AND_DEMO : 638,504
PARENTING : 542,604
BEAUTY : 513,152
EVENTS : 253,542
MEDICAL : 120,551
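
As a side note, the same averages can be computed in a single pass over the dataset instead of rescanning every app once per category. The sketch below is only an illustration under assumptions: `parse_installs` and `average_installs_by_category` are hypothetical helper names that do not appear elsewhere in this notebook, and `sample_rows` is made-up data laid out like our Google Play rows (name at index 0, category at index 1, installs at index 5).

```python
from collections import defaultdict


def parse_installs(installs):
    """Turn a string such as '1,000,000+' into the float 1000000.0."""
    return float(installs.replace('+', '').replace(',', ''))


def average_installs_by_category(dataset, category_index=1, installs_index=5):
    """Return {category: average minimum installs} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up rows for illustration only; they are not taken from googleplaystore.csv.
sample_rows = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '5,000+'],
    ['App C', 'MEDICAL', '', '', '', '100+'],
]

averages = average_installs_by_category(sample_rows)

# Print categories from highest to lowest average, in the same format as above.
for category, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print('{} : {:,.0f}'.format(category, average))
```

Calling `average_installs_by_category(free_apps_google)` should give the same averages as the nested loops, assuming every 'Installs' value follows the '1,000+' pattern.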


Communication seems to be the dominant genre in Google Play, but let's explore the top apps:

In [122]:
for app in free_apps_google:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Me

As you can see, this genre is saturated with very popular apps that will be hard to compete with unless we have a truly unique idea that disrupts the industry in some way. Let's take a look at the second most popular genre, 'VIDEO_PLAYERS':

In [128]:
for app in free_apps_google:
    if app[1] == 'VIDEO_PLAYERS' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

YouTube : 1,000,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+
Motorola FM Radio : 100,000,000+


Here fewer apps have over 100,000,000 installs, but the dominant app is YouTube, and our company is not currently looking to create a new video viewing platform, so let's move on.
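
The cells in this section that list a genre's top apps repeat the same chained string comparison with only the category name changed. One possible way to avoid the repetition is a small helper that parses the install count and compares it numerically. This is a sketch under assumptions: `popular_apps` and `parse_installs` are hypothetical names introduced here, the column positions (0, 1, 5) mirror the ones used in the cells above, and `sample_rows` is invented data.

```python
def parse_installs(installs):
    """Turn a string such as '500,000,000+' into the integer 500000000."""
    return int(installs.replace('+', '').replace(',', ''))


def popular_apps(dataset, category, min_installs=100_000_000,
                 name_index=0, category_index=1, installs_index=5):
    """Return (name, installs) pairs for apps in `category` with at least `min_installs`."""
    return [
        (row[name_index], row[installs_index])
        for row in dataset
        if row[category_index] == category
        and parse_installs(row[installs_index]) >= min_installs
    ]


# Made-up rows for illustration only; they are not taken from googleplaystore.csv.
sample_rows = [
    ['Chat App', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Call App', 'COMMUNICATION', '', '', '', '100,000,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Clip Player', 'VIDEO_PLAYERS', '', '', '', '500,000,000+'],
]

top_communication = popular_apps(sample_rows, 'COMMUNICATION')
for name, installs in top_communication:
    print(name, ':', installs)
```

With the real data, `popular_apps(free_apps_google, 'VIDEO_PLAYERS')` should list the same apps as the cell above, and the threshold can be changed through `min_installs` without editing the condition.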

After observing two of the most popular genres, we can conclude that most of the top genres are saturated with apps that may be hard to compete with, though this doesn't mean we shouldn't consider creating content for them. If we were working for a company and being paid for a thorough analysis, we would have to explore every category, but for the sake of this project let's go straight to the 'BOOKS_AND_REFERENCE' genre to see whether our medical dictionary idea is viable given the landscape there.

In [129]:
for app in free_apps_google:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


Looking at the apps with the most installs in 'BOOKS_AND_REFERENCE', we can already see potential here, since only a handful of very popular apps dominate the genre. In addition, most of those apps are digital readers or book platforms, such as Google Play Books and Amazon Kindle. As in the App Store, the Bible remains the top individual book here.
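
To compare how crowded each genre is at the top without inspecting the lists one by one, we could count the apps per category that reach a given install threshold. The following is a rough sketch, not part of the original analysis: `count_popular_apps` is a hypothetical helper, the 100,000,000 threshold simply mirrors the cut-off used in the cells above, and `sample_rows` is invented data.

```python
from collections import Counter


def count_popular_apps(dataset, min_installs=100_000_000,
                       category_index=1, installs_index=5):
    """Count, per category, the apps whose minimum installs reach `min_installs`."""
    counts = Counter()
    for row in dataset:
        installs = int(row[installs_index].replace('+', '').replace(',', ''))
        if installs >= min_installs:
            counts[row[category_index]] += 1
    return counts


# Made-up rows for illustration only; they are not taken from googleplaystore.csv.
sample_rows = [
    ['Chat App', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Call App', 'COMMUNICATION', '', '', '', '500,000,000+'],
    ['Reader App', 'BOOKS_AND_REFERENCE', '', '', '', '100,000,000+'],
    ['Small Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '50,000+'],
]

popular_counts = count_popular_apps(sample_rows)

# most_common() lists categories from the most to the fewest popular apps.
for category, count in popular_counts.most_common():
    print(category, ':', count)
```

A category with a low count here has fewer giants to compete with, which is the property we were looking for in 'BOOKS_AND_REFERENCE'.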

## Conclusion

Both the App Store and Google Play markets are vast, and since their inception new apps have been appearing on both by the minute. It can be very challenging to create something that stands out from the rest and eventually brings in a revenue stream. In this project, our analysis suggests that a profitable app profile could be an app that digitizes a book and adds features unique to the app, perhaps a medical dictionary. At this stage, our analysis has only scratched the surface. Our goal was to use Python to pinpoint the type of app that could be profitable for our company. We could dive deeper and explore other genres to find further profiles that pique our interest, but we will leave this for a future project.