# Analysis of profitable apps based on number of downloads

Acme Inc is a company that builds exclusively free apps for Google Play and the App Store. The main source of revenue for these apps is in-app ads. As a general rule, the more downloads an app has, the more revenue these ads generate, hence we need to analyze the main characteristics of the most downloaded apps in Acme's portfolio.

The goal of this project is to provide the main characteristics of their "winner" apps, so that the developers better understand which apps are more likely to attract more users and generate revenue in a more predictable way.

### We have 2 datasets: 'AppleStore.csv' and 'googleplaystore.csv' with data from approximately 7,000 and 10,000 apps respectively.

Our first step is opening each dataset and transforming it into a list of lists to make it more readable:

In [1]:
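from csv import reader

# App Store dataset: read every row (header included) into a list of lists
open_dataset_Apple = open('AppleStore.csv', encoding='utf8')
read_dataset_Apple = reader(open_dataset_Apple)
final_dataset_Apple = list(read_dataset_Apple)
open_dataset_Apple.close()

# Google Play dataset: same steps
open_dataset_Android = open('googleplaystore.csv', encoding='utf8')
read_dataset_Android = reader(open_dataset_Android)
final_dataset_Android = list(read_dataset_Android)
open_dataset_Android.close()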
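
As a side note, the same loading step can be wrapped in a small helper that closes the file automatically and separates the header from the data rows. This is only a sketch: `load_csv` is a hypothetical name that is not used elsewhere in this project, and the tiny demo file stands in for the real datasets.

```python
import csv
import os
import tempfile

def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows) as lists."""
    # newline="" is what the csv module recommends for files given to csv.reader
    with open(path, encoding=encoding, newline="") as csv_file:
        data = list(csv.reader(csv_file))
    return data[0], data[1:]

# Build a tiny demo file so the sketch runs without the real datasets
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as demo_file:
        demo_file.write('App,Installs\nDemo App,"10,000+"\n')
    header, rows = load_csv(demo_path)

print(header)
print(rows)
```

Keeping the header apart from the rows avoids having to remember to skip index 0 in later loops.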

We then need to explore the data. Below we create a function that shows what the data looks like and, optionally, how many rows and columns it has.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We go on to apply the 'explore_data' function to both datasets to get an idea of what we are working with:

In [3]:
explore_data(final_dataset_Apple, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
explore_data(final_dataset_Android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


----
Our function counted the header as one more row, hence we need to subtract 1 from the total number of rows of each dataset. This means that we have:

- One 16 column table with 7197 rows for Apple's dataset
- One 13 column table with 10841 rows for Android's dataset

## We proceed with data cleaning - look for duplicates and missing data:

### Looking for missing data and removing it:

We look for missing data in Android's dataset by comparing the length of the header with the length of every other row:

In [5]:
for row in final_dataset_Android:
    if len(row) != len(final_dataset_Android[0]):
        print(row)
        print(len(row))
        print(final_dataset_Android.index(row))
        
    

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
10473


##### We leave the delete command as Markdown to avoid repeated deletions. It still needs to be run once if we re-run the code from scratch, or the code below will give us errors (could not convert string to float: '3.0M')
----
del final_dataset_Android[10473]

In [6]:
del final_dataset_Android[10473]

We have found and deleted the incorrect row in Android's dataset. Now we look for missing data in Apple's dataset:

In [7]:
for row in final_dataset_Apple:
    if len(row) != len(final_dataset_Apple[0]):
        print(row)
        print(len(row))
        print(final_dataset_Apple.index(row))
    

No results means that there is no missing data for Apple's database.
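
The row-length check can also be sketched with `enumerate`, which yields the position of each row directly. `list.index(row)` returns the position of the first row that is equal to `row`, so it can point at the wrong line when identical rows exist, and it rescans the list each time. The function name and the toy rows below are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected_length = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != expected_length
    ]

toy_dataset = [
    ["App", "Category", "Rating"],
    ["Demo A", "TOOLS", "4.1"],
    ["Demo B", "4.5"],            # the category is missing here
    ["Demo C", "GAME", "3.9"],
]

for index, row in find_malformed_rows(toy_dataset):
    print(index, len(row), row)
```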

### Looking for duplicates:

We start with Android:

In [8]:
non_duplicates_Android = []
duplicates_Android = []
for row in final_dataset_Android:
    if row[0] not in non_duplicates_Android:
        non_duplicates_Android.append(row[0])
    else: 
        duplicates_Android.append(row[0])
    
print(len(duplicates_Android))
print(duplicates_Android)

1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite', 'TurboScan: scan documents and receipts in PDF', 'Tiny Scanner Pro: PDF Doc Scan', 'Docs To Go™ Free Office Suite', 'OfficeSuite : Free Office + PDF Edi

There are 1181 duplicates in the Android dataset. Now we analyze Apple's dataset:

In [9]:
non_duplicates_Apple = []
duplicates_Apple = []
for row in final_dataset_Apple:
    if row[0] not in non_duplicates_Apple:
        non_duplicates_Apple.append(row[0])
    else: 
        duplicates_Apple.append(row[0])

print(len(duplicates_Apple))
print(duplicates_Apple)

0
[]


There are no duplicates in this case.
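
The duplicate search can also be sketched with a `set`, whose membership test does not slow down as the collection grows the way a list lookup does. The helper name, the `has_header` parameter and the toy rows are assumptions made for this example only.

```python
def find_duplicates(dataset, name_index, has_header=True):
    """Return every repeated occurrence of the value at name_index."""
    rows = dataset[1:] if has_header else dataset
    seen = set()        # values met so far
    duplicates = []     # one entry per repeated occurrence
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates

toy_dataset = [
    ["App", "Reviews"],
    ["Box", "10"],
    ["Slack", "5"],
    ["Box", "12"],
    ["Box", "12"],
]

print(find_duplicates(toy_dataset, 0))
```

An app that appears three times contributes two entries to the result, which matches how the cells above count duplicates.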

We proceed to remove all the duplicates from the Android dataset, since it is the one with the most 'not clean' data. For each duplicated app we keep the entry with the highest number of reviews, because we assume that the more reviews an entry has, the more up to date its information is (the number of reviews is at index 3).

To do this, we create an empty dictionary and loop through the dataset: if the app name is already in the dictionary and the number of reviews stored there is lower than or equal to the one in the current row, we update the value; if the app name is not in the dictionary yet, we add it.

In [10]:
reviews_max = {}
for row in final_dataset_Android[1:]:
    name = row[0] 
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] <= n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


As a result we get a dictionary with the names of 9659 unique apps, each mapped to its highest number of reviews.

Now that we have the unique values in the dictionary, we can create a list that contains exclusively these unique apps, with all the information from each of their rows in final_dataset_Android:

In [11]:
android_clean = [] # will store new cleaned dataset
already_added = [] # will store app names and help us keep track of what we added
for row in final_dataset_Android[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print(android_clean[:3])
print(len(android_clean))



[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
9659
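
The two steps above (find the highest review count per app, then keep one matching row) can also be sketched as a single pass that stores the best row seen so far for each name. This is an illustration on made-up rows, not the code used in this project; `keep_most_reviewed` is a hypothetical helper, and its output is ordered by the first appearance of each app name.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=1):
    """Keep, for each app name, the first row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

toy_rows = [
    ["Box", "100"],
    ["Slack", "50"],
    ["Box", "120"],
    ["Box", "120"],
]

print(keep_most_reviewed(toy_rows))
```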


## Finding Non-English app names

### Testing ord() function

The next step of the data cleaning is to check which apps have English names, as some of them have non-English characters (e.g. Chinese, Arabic or Japanese characters).

To identify these, we get the Unicode code point of each character with the function "ord()". The characters commonly used in English text belong to the ASCII range, which goes from 0 to 127, so a code point above 127 points to a non-English character.

We start with a test:

In [12]:
def check_English_characters(sentence):
    for character in sentence:
       
        if ord(character) > 127:
            return False
        
    return True

check_English_characters ('Instachat 😜')

False

A function stops as soon as it executes a return, so a return inside a loop also ends the loop. Since a single non-English character is enough to reject the name, it is fine to put the first return inside the loop. However, if we put the second return inside the loop as well, the function would ALSO stop at the first character without verifying whether the rest are correct or not. That is why we put the second return outside the loop; we are saying: if you looped through the whole string and did not find any incorrect character, then simply return True.
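
To make the point about the position of the return concrete, here is a small side-by-side sketch. Both function names are invented for this comparison; the first one deliberately contains the mistake described above.

```python
def all_ascii_buggy(text):
    for character in text:
        if ord(character) > 127:
            return False
        return True  # wrong place: runs right after checking the first character

def all_ascii(text):
    for character in text:
        if ord(character) > 127:
            return False
    return True      # reached only when no character was rejected

print(all_ascii_buggy("Instachat 😜"))  # the emoji is never reached
print(all_ascii("Instachat 😜"))
```

The buggy version answers after looking at 'I' alone, so it accepts the name even though it contains an emoji.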

### Improving Non-English apps identification with ord()

The instructions make the point that an app like Instachat above returns False as soon as its name contains something unusual, like an emoji, even though the app is in English. In order not to discard such apps, we ask our function to return False only if the name has more than 3 non-English characters:

In [13]:
def check_English_characters_2(sentence2):
    non_english_characters_registry = []
    for character in sentence2:
       
        if ord(character) > 127:
            non_english_characters_registry.append(character)
            if len(non_english_characters_registry) > 3:
                return False
      
    
    return True

check_English_characters_2 ('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
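
The same threshold rule can be sketched by counting the characters instead of collecting them in a list. The name `is_probably_english` and the `max_non_ascii` parameter are assumptions introduced for this example; with the default of 3 it accepts a name with up to 3 characters above code point 127 and rejects it from the fourth one.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True when at most max_non_ascii characters fall outside ASCII."""
    non_ascii_count = sum(1 for character in name if ord(character) > 127)
    return non_ascii_count <= max_non_ascii

for name in ["Instachat 😜", "Docs To Go™ Free Office Suite", "爱奇艺PPS -《欢乐颂2》电视剧热播"]:
    print(name, "->", is_probably_english(name))
```

Unlike the early return in the cell above, this version always scans the whole name, which is a negligible cost for strings as short as app names.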

We proceed to apply this to both datasets: android_clean and final_dataset_Apple. The apps that are in English will be appended to a new list.

In [14]:
android_clean_English = []
for appsnames in android_clean:
    android_temp_clean_English = check_English_characters_2(appsnames[0])
    if android_temp_clean_English == True:
        android_clean_English.append(appsnames)

apple_clean_English = []
for appsnamesapple in final_dataset_Apple[1:]:
    apple_temp_clean_English = check_English_characters_2(appsnamesapple[1])
    if apple_temp_clean_English == True:
        apple_clean_English.append(appsnamesapple)

print(android_clean_English[0:3])
print(len(android_clean_English))
print(apple_clean_English[0:3])
print(len(apple_clean_English))

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
9614
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]
6183


We have new clean datasets (android_clean_English and apple_clean_English) in English with 9614 apps for Android and 6183 apps for Apple.

## Extracting the free apps

Right now both datasets contain free and paid apps. We only need the free ones, so we will isolate them in a similar way as we did with the non-English apps.

In [15]:
android_clean_english_free = []
for appsnames in android_clean_English:
    if appsnames[7] == '0':
        android_clean_english_free.append(appsnames)

apple_clean_english_free = []
for appleappsnames in apple_clean_English:
    if appleappsnames[4] == '0.0':
        apple_clean_english_free.append(appleappsnames)
        
print(len(android_clean_english_free))
print(len(apple_clean_english_free))
    

8864
3222


This leaves us with 8864 free English apps in Android and 3222 free English apps in Apple.
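
The filter above compares price strings ('0' for Google Play, '0.0' for the App Store). A slightly more tolerant check converts the price to a number first. The sketch below assumes that paid Google Play prices may carry a leading '$' (for example '$4.99'); that format is an assumption, since only free Android rows are shown in this notebook, and `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when the price string represents zero."""
    cleaned = price_text.strip().lstrip("$")  # assumed '$' prefix on paid Android apps
    return float(cleaned) == 0.0

for price in ["0", "0.0", "1.99", "$4.99"]:
    print(repr(price), "->", is_free(price))
```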

## Finding the profitable apps

Now that we have both datasets completely clean and only with the apps that we are looking for, we can proceed with a deeper data analysis. As a reminder, our final goal is to find out which are the most profitable kinds of apps. The first step is to have an overview for which are the most common kinds of apps.

We will make the overview by creating a frequency table that gives us the most common app types. But first, we identify the most useful columns using our first function, 'explore_data':

In [16]:
explore_data(final_dataset_Android, 0, 2,)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']




In [17]:
explore_data(final_dataset_Apple, 0, 2,)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']




The most useful columns would be: 
- For Android the columns Category and Genres, since the column type only says 'Free' or 'Paid'. 
- For Apple the column prime_genre

In [18]:
def freq_table(dataset, index):
    table_dictionary = {}
    for row in dataset[1:]:  # skip the header row
        dictionary_key = row[index]
        if dictionary_key not in table_dictionary:
            table_dictionary[dictionary_key] = 1
        else:
            table_dictionary[dictionary_key] += 1
    
    for keys in table_dictionary:      
        table_dictionary[keys] = round((table_dictionary[keys]/len(dataset))*100,2)
        
    return table_dictionary
    
print(freq_table(final_dataset_Apple, 11))

{'Social Networking': 2.32, 'Photo & Video': 4.85, 'Games': 53.65, 'Music': 1.92, 'Reference': 0.89, 'Health & Fitness': 2.5, 'Weather': 1.0, 'Utilities': 3.45, 'Travel': 1.13, 'Shopping': 1.69, 'News': 1.04, 'Navigation': 0.64, 'Lifestyle': 2.0, 'Entertainment': 7.43, 'Food & Drink': 0.88, 'Sports': 1.58, 'Book': 1.56, 'Finance': 1.44, 'Education': 6.29, 'Productivity': 2.47, 'Business': 0.79, 'Catalogs': 0.14, 'Medical': 0.32}


We use 2 separate 'for' loops because the percentages can only be calculated once every row has been counted: the first loop builds the counts and the second one converts them into percentages. Now we use the function that was given to us to display the table with the genres and their proportions in descending order.

In [19]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
#We use the function to display prime_genre, Genres and Category from each of the databases

display_table(final_dataset_Android, 1)
print('--------------------------------')
print('  ')
display_table(final_dataset_Android, 9)
print('--------------------------------')
print('  ')
display_table(final_dataset_Apple, 11)

FAMILY : 18.19
GAME : 10.55
TOOLS : 7.78
MEDICAL : 4.27
BUSINESS : 4.24
PRODUCTIVITY : 3.91
PERSONALIZATION : 3.62
COMMUNICATION : 3.57
SPORTS : 3.54
LIFESTYLE : 3.52
FINANCE : 3.38
HEALTH_AND_FITNESS : 3.15
PHOTOGRAPHY : 3.09
SOCIAL : 2.72
NEWS_AND_MAGAZINES : 2.61
SHOPPING : 2.4
TRAVEL_AND_LOCAL : 2.38
DATING : 2.16
BOOKS_AND_REFERENCE : 2.13
VIDEO_PLAYERS : 1.61
EDUCATION : 1.44
ENTERTAINMENT : 1.37
MAPS_AND_NAVIGATION : 1.26
FOOD_AND_DRINK : 1.17
HOUSE_AND_HOME : 0.81
LIBRARIES_AND_DEMO : 0.78
AUTO_AND_VEHICLES : 0.78
WEATHER : 0.76
ART_AND_DESIGN : 0.6
EVENTS : 0.59
PARENTING : 0.55
COMICS : 0.55
BEAUTY : 0.49
--------------------------------
  
Tools : 7.77
Entertainment : 5.75
Education : 5.06
Medical : 4.27
Business : 4.24
Productivity : 3.91
Sports : 3.67
Personalization : 3.62
Communication : 3.57
Lifestyle : 3.51
Finance : 3.38
Action : 3.37
Health & Fitness : 3.15
Photography : 3.09
Social : 2.72
News & Magazines : 2.61
Shopping : 2.4
Travel & Local : 2.37
Dating : 2.16
Boo
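
For reference, the frequency table and its sorted display can be sketched in a compact form with `collections.Counter` from the standard library. This is an alternative illustration on made-up rows, not the code that produced the output above; `freq_table_percent` is a hypothetical name, and it expects the data rows without the header, so the percentages are relative to the number of apps.

```python
from collections import Counter

def freq_table_percent(rows, index):
    """Map each distinct value at `index` to its share of the rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {key: round(count / total * 100, 2) for key, count in counts.items()}

toy_rows = [
    ["App A", "Games"],
    ["App B", "Games"],
    ["App C", "Music"],
    ["App D", "Games"],
]

table = freq_table_percent(toy_rows, 1)
# Sort by percentage, largest first, before printing
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ":", percentage)
```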

### First observations

For the App Store we can see the following:
- Games and Entertainment seem to be the most common genres, especially Games, which beats every other genre by a landslide. Photo & Video could even be considered part of 'Entertainment'.

- Education and Utilities are the other major genres, which confirms the saying that smartphones are handheld computers.

- Every other genre has such a small share that it can signal one of two things: either these sectors have very high entry barriers (such as Music and Social Networking, both dominated by a few big players), or they are not really worth it. We will see later on.

For the Google Play Store we can see the following:

- Family and Game are the most common categories, which follows a similar pattern to the App Store. However, the top entry in Genres is Tools, and Tools also ranks very high in Category.

- In the App Store we ALSO see Utilities in the top 5, and depending on how you categorize them, some items in the App Store's Education section could also fall under 'Tools' in Google Play.

- This last point could suggest that, apart from Gaming being the obvious popular contender, there is a high volume of tools with some sort of gamification in them.

We cannot recommend anything yet because we do not know the real volume of downloads of these apps, and even though Games looks promising, it is also heavily competitive.

### Getting to the most popular apps

Most popular means more downloads, so more people will see our ads.

#### We start with the App Store.

The App Store does not have a specific column for the volume of downloads, but it has a column for the number of ratings (rating_count_tot), index 5. We need to find out how many ratings the apps of each genre have on average (each genre contains many apps with very different numbers of ratings, so the average will give us a good idea).

To do so, we go through the list of genres that we had previously, since they are all unique, and we ask the code:
- For each unique genre, check which apps in the whole dataset belong to this genre
- In that case (if genre_app == genre), add the rating count to the variable total and add 1 to the number of apps counted for this genre

At the end of the loop we will have the sum of all the rating counts and the total of apps for each genre, for which we calculate the average per genre. 

In [20]:
averagelist_ratings_Apple = []
table_with_genres_Apple = freq_table(final_dataset_Apple, 11) 
for genre in table_with_genres_Apple:
    total = 0 # sum of the rating counts of the apps in this genre
    len_genre = 0 # number of apps in this genre
    for allcolumns in final_dataset_Apple:
        genre_app = allcolumns[11]
        if genre_app == genre:
            n_of_ratings_Apple = float(allcolumns[5])
            total += n_of_ratings_Apple # add this app's rating count to the genre total
            len_genre += 1 # count this app towards the genre
    average_number_of_ratings_Apple = round(total/len_genre,2)
    
    #alternative without sorting: print(genre,':', ' ', average_number_of_ratings_Apple)
   
    tuple_ratings_Apple = (average_number_of_ratings_Apple, genre)
    averagelist_ratings_Apple.append(tuple_ratings_Apple)

# sort once, after every genre has been processed
averagelist_ratings_Apple_sorted = sorted(averagelist_ratings_Apple, reverse = True)
for entry in averagelist_ratings_Apple_sorted:
    print(entry[0], ':', entry[1])

45498.9 : Social Networking
28842.02 : Music
22410.84 : Reference
22181.03 : Weather
18615.33 : Shopping
14352.28 : Photo & Video
14129.44 : Travel
14026.93 : Sports
13938.62 : Food & Drink
13692.0 : Games
13015.07 : News
11853.96 : Navigation
11047.65 : Finance
9913.17 : Health & Fitness
8051.33 : Productivity
7533.68 : Entertainment
6863.82 : Utilities
6161.76 : Lifestyle
5125.44 : Book
4788.09 : Business
2239.23 : Education
1732.5 : Catalogs
592.78 : Medical
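
The per-genre average can also be sketched in a single pass over the rows, accumulating a running total and a count for each genre in dictionaries instead of rescanning the dataset once per genre. The helper name and the toy rows are made up for this illustration, and the rows are assumed to exclude the header.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: average of the numeric column} computed in one pass."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

toy_rows = [
    ["Games", "100"],
    ["Games", "300"],
    ["Music", "50"],
]

averages = average_by_group(toy_rows, group_index=0, value_index=1)
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(average, ":", genre)
```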


The apps with the most ratings are Social Networking apps, but they make up a very small proportion of the store. As we suspected, a small proportion of apps combined with an overwhelming number of ratings suggests that a few giants dominate the genre and that the entry barriers are way too high. Music is the same case.

Entertainment and Education were very high up in app proportions but they have a poor number of ratings.

Photo & Video and Games look like the most promising genres, with a high number of ratings, and we know they had a high proportion of apps.

#### Next up is the Google Play Store

The data on the number of installs is not very precise: instead of a specific number we only get a bracket, such as 100,000+ or 1,000,000+. Since we only want to know which categories have the most installs, we are not very seriously affected by this. We see some of these brackets below:

In [21]:
explore_data(final_dataset_Android, 0, 4,)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




Converting these brackets directly with float() will give us an error because they contain commas ',' and a '+' character. We will need to keep this in mind and remove the commas and the pluses later on with the string method replace().
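
As a quick check of that cleaning step, here is a minimal sketch that first shows the failure and then the conversion on some of the brackets printed above. `parse_installs` is a hypothetical helper name used only in this example.

```python
def parse_installs(text):
    """Turn an install bracket such as '10,000+' into a float."""
    return float(text.replace(",", "").replace("+", ""))

# The raw bracket cannot be converted as it is
try:
    float("10,000+")
except ValueError as error:
    print("direct conversion fails:", error)

for bracket in ["10,000+", "500,000+", "50,000,000+"]:
    print(bracket, "->", parse_installs(bracket))
```

The result is the lower bound of each bracket, so averages built on it underestimate the real number of installs.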

Just like with the App Store for Apple, we use the unique values from the frequency table to compare with the rest of the database and do the iterations based on these.

In [22]:
genres_category_list_Android = freq_table(final_dataset_Android, 1)
averagelist_installs_Android = []
for categorized_genres in genres_category_list_Android:
    total = 0
    len_category = 0
    
    for all_columns_Android in final_dataset_Android:
        category_app = all_columns_Android[1]
        
        if category_app == categorized_genres:
            n_installs_Android = all_columns_Android[5]
            n_installs_Android = n_installs_Android.replace('+', '')
            n_installs_Android = n_installs_Android.replace(',', '')
            n_installs_Android = float(n_installs_Android)
            
            #We reintroduce the clean values to index 5
            
            all_columns_Android[5] = n_installs_Android
            
            #We keep going, assigning the total number of installs per category to the variable "total"
            
            total += n_installs_Android
            len_category += 1
            
    average_installs_Android = round(total/len_category, 2)
    # Alternative to the sorting: print(categorized_genres, ':', ' ', average_installs_Android)
    tuple_installs_Android = (average_installs_Android, categorized_genres)
    averagelist_installs_Android.append(tuple_installs_Android)

# sort once, after every category has been processed
averagelist_installs_Android_sorted = sorted(averagelist_installs_Android, reverse = True)

for installs in averagelist_installs_Android_sorted:
    print(installs[0], ':', installs[1])

    

84359886.95 : COMMUNICATION
47694467.46 : SOCIAL
35554301.26 : VIDEO_PLAYERS
33434177.76 : PRODUCTIVITY
30669601.76 : GAME
30114172.1 : PHOTOGRAPHY
26623593.59 : TRAVEL_AND_LOCAL
26488755.34 : NEWS_AND_MAGAZINES
19256107.38 : ENTERTAINMENT
13585731.81 : TOOLS
12491726.1 : SHOPPING
8318050.11 : BOOKS_AND_REFERENCE
5932384.65 : PERSONALIZATION
5586230.77 : EDUCATION
5286729.12 : MAPS_AND_NAVIGATION
5201959.18 : FAMILY
5196347.8 : WEATHER
4642441.38 : HEALTH_AND_FITNESS
4560350.26 : SPORTS
2395215.12 : FINANCE
2178075.79 : BUSINESS
2156683.08 : FOOD_AND_DRINK
1917187.06 : HOUSE_AND_HOME
1912893.85 : ART_AND_DESIGN
1407443.82 : LIFESTYLE
1129533.36 : DATING
934769.17 : COMICS
741128.35 : LIBRARIES_AND_DEMO
625061.31 : AUTO_AND_VEHICLES
525351.83 : PARENTING
513151.89 : BEAUTY
249580.64 : EVENTS
115026.86 : MEDICAL


Communication (84M), Social (47M) and Video_Players (35M) are the top categories, with Games and Photography being very close to those numbers (30M each). Of all these categories, only Game was also in the top 5 by proportion of apps.

Same as before, a high number of apps combined with a high number of installs per app suggests that a category is promising and does not have a lot of entry barriers. Communication has the highest average, but it constitutes only a little over 3% of the dataset, which suggests that a few apps dominate everything else.

## Conclusions

The data from both stores suggests that the easiest genre to enter is Games. 

- Social and Video are also big players for both stores, but we know very well that Social is overly monopolized (its low proportion to the total of apps and high numbers of installs corroborates it), so that is a no-go. 

- Video also has a high number of installs and not a very high proportion of apps in Android's case. If done smartly, a Video Tool for Social could get us somewhere in both stores, but it has to be catchy and somehow different in order to stand out and not lose to the big players.

## A little help for our development team

All these values are averages, and ideally we would want to keep our development team from wasting their time. A picture is worth a thousand words, so a very good 'picture' for them to have would be the top-rated, top-installed apps for each of our most popular genres.

Looking these up in the Google Play Store and the App Store is very easy from anyone's smartphone. However, we will not always have this kind of ready-made dataset at hand, and if we present these samples to our developers inside this project we will have everything in one place, making it easier for the development team to use this project to corroborate other pieces of data.

For simplicity, we are only going to do this for the top genres of each store:

- App Store: 'Photo & Video' and 'Games'
- Google Play Store: 'GAME' and 'VIDEO_PLAYERS'

Apple will go first. We start by finding out the maximum number of installs (in Apple's case, of ratings), and move from there:

In [23]:
Apple_rating_count_list = []
for apps in final_dataset_Apple[1:]:
    user_rating_count_Apple = float(apps[5])
    Apple_rating_count_list.append(user_rating_count_Apple)
print(max(Apple_rating_count_list))
Apple_rating_count_list = sorted(Apple_rating_count_list, reverse=True)
print(Apple_rating_count_list[0:50])

2974676.0
[2974676.0, 2161558.0, 2130805.0, 1724546.0, 1126879.0, 1061624.0, 985920.0, 961794.0, 878563.0, 824451.0, 706110.0, 698516.0, 679055.0, 677247.0, 669079.0, 612532.0, 567344.0, 541693.0, 522012.0, 508808.0, 507706.0, 503230.0, 495626.0, 481564.0, 479440.0, 464312.0, 446880.0, 446185.0, 426463.0, 418033.0, 417779.0, 416736.0, 414803.0, 405647.0, 405007.0, 402925.0, 397730.0, 395261.0, 393469.0, 391401.0, 386521.0, 373857.0, 373835.0, 373519.0, 370370.0, 360974.0, 359832.0, 354058.0, 351466.0, 345046.0]


In [24]:
sample_for_devs_Apple = []
well_rated_sample_Apple = []
for apps in final_dataset_Apple[1:]:
    ratingApple = float(apps[7])
    
    if ratingApple > 4.0:
        well_rated_sample_Apple.append(apps)
        
for appsApple in well_rated_sample_Apple:
    genre_Apple = appsApple[11]
    if genre_Apple == 'Games' or genre_Apple == 'Photo & Video': 
        sample_for_devs_Apple.append(appsApple)
        
final_sample_with_installs_for_devs_Apple = []
for apps in sample_for_devs_Apple:
    rating_count_Apple_sample = float(apps[5])
    if rating_count_Apple_sample > 250000:
        final_sample_with_installs_for_devs_Apple.append(apps)
        
print(final_sample_with_installs_for_devs_Apple)
print(len(final_sample_with_installs_for_devs_Apple))

[['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'], ['553834731', 'Candy Crush Saga', '222846976', 'USD', '0.0', '961794', '2453', '4.5', '4.5', '1.101.0', '4+', 'Games', '43', '5', '24', '1'], ['343200656', 'Angry Birds', '175966208', 'USD', '0.0', '824451', '107', '4.5', '3.0', '7.4.0', '4+', 'Games', '38', '0', '10', '1'], ['512939461', 'Subway Surfers', '156038144', 'USD', '0.0', '706110', '97', '4.5', '4.0', '1.72.1', '9+', 'Games', '38', '5', '1', '1'], ['362949845', 'Fruit Ninja Classic', '104590336', 'USD', '1.99', '698516', '132', '4.5', '4.0', '2.3.9', '4+', 'Games', '38', '5', '13', '1'], ['359917414', 'Solitaire', '

***Above is a sample of 45 apps that fit the criteria that defines the top performing apps in the most lucrative categories of the App Store.*** 
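
If a ranked short list is preferred over a fixed threshold, the selection can be sketched with `sorted` and a key function. The example below uses a few rows trimmed to (track_name, prime_genre, rating_count_tot) from the App Store rows printed earlier; `top_apps` and its parameters are hypothetical names introduced for this illustration.

```python
def top_apps(rows, genre_index, count_index, genres, n=3):
    """Return the n rows with the largest counts among the given genres."""
    selected = [row for row in rows if row[genre_index] in genres]
    return sorted(selected, key=lambda row: float(row[count_index]), reverse=True)[:n]

toy_rows = [
    ["Facebook", "Social Networking", "2974676"],
    ["Temple Run", "Games", "1724546"],
    ["Instagram", "Photo & Video", "2161558"],
    ["Clash of Clans", "Games", "2130805"],
]

best = top_apps(toy_rows, genre_index=1, count_index=2,
                genres={"Games", "Photo & Video"}, n=2)
for row in best:
    print(row[0], ":", row[2])
```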

We now do exactly the same for Android, starting with figuring out the maximum number of installs in the database.

In [25]:
Android_install_count_list = []
for apps in final_dataset_Android[1:]:
    install_count_Android = float(apps[5])
    Android_install_count_list.append(install_count_Android)
print(max(Android_install_count_list))
Android_rating_count_list = sorted(Android_install_count_list, reverse=True)
print(Android_install_count_list[0:50])

1000000000.0
[10000.0, 500000.0, 5000000.0, 50000000.0, 100000.0, 50000.0, 50000.0, 1000000.0, 1000000.0, 10000.0, 1000000.0, 1000000.0, 10000000.0, 100000.0, 100000.0, 5000.0, 500000.0, 10000.0, 5000000.0, 10000000.0, 100000.0, 100000.0, 500000.0, 100000.0, 50000.0, 10000.0, 500000.0, 100000.0, 10000.0, 100000.0, 100000.0, 50000.0, 100000.0, 100000.0, 10000.0, 100000.0, 500000.0, 5000000.0, 10000.0, 500000.0, 10000.0, 100000.0, 10000000.0, 100000.0, 10000.0, 10000000.0, 100000.0, 100000.0, 100000.0, 100000.0]


In [26]:
sample_for_devs_Android = []
well_rated_sample_Android = []
for apps in final_dataset_Android[1:]:
    ratingAndroid = float(apps[2])
    
    if ratingAndroid > 4.0:
        well_rated_sample_Android.append(apps)
        
for appsAndroid in well_rated_sample_Android:
    genre_Android = appsAndroid[1]
    if genre_Android == 'GAME' or genre_Android == 'VIDEO_PLAYERS': 
        sample_for_devs_Android.append(appsAndroid)

final_sample_with_installs_for_devs_Android = []
for apps in sample_for_devs_Android:
    install_count_Android_sample = float(apps[5])
    if install_count_Android_sample > 100000000:
        final_sample_with_installs_for_devs_Android.append(apps)
        
print(final_sample_with_installs_for_devs_Android)
print(len(final_sample_with_installs_for_devs_Android))

[['Subway Surfers', 'GAME', '4.5', '27722264', '76M', 1000000000.0, 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up'], ['Candy Crush Saga', 'GAME', '4.4', '22426677', '74M', 500000000.0, 'Free', '0', 'Everyone', 'Casual', 'July 5, 2018', '1.129.0.2', '4.1 and up'], ['Temple Run 2', 'GAME', '4.3', '8118609', '62M', 500000000.0, 'Free', '0', 'Everyone', 'Action', 'July 5, 2018', '1.49.1', '4.0 and up'], ['Pou', 'GAME', '4.3', '10485308', '24M', 500000000.0, 'Free', '0', 'Everyone', 'Casual', 'May 25, 2018', '1.4.77', '4.0 and up'], ['Subway Surfers', 'GAME', '4.5', '27723193', '76M', 1000000000.0, 'Free', '0', 'Everyone 10+', 'Arcade', 'July 12, 2018', '1.90.0', '4.1 and up'], ['Pou', 'GAME', '4.3', '10485334', '24M', 500000000.0, 'Free', '0', 'Everyone', 'Casual', 'May 25, 2018', '1.4.77', '4.0 and up'], ['Candy Crush Saga', 'GAME', '4.4', '22428456', '74M', 500000000.0, 'Free', '0', 'Everyone', 'Casual', 'July 5, 2018', '1.129.0.2', '4.1 and up'], ['My Tal

***And finally, above is an example of 28 apps that the developers can use as a sample to develop top lucrative apps for the Google Play Store.***

# *Coded and analyzed by Carlos Aguirrebeitia Gorostiaga*