### Guided Project: Finding profitable app profiles for the App Store and Google Play markets


*This project is the final assignment of the 'Python for Data Science: Fundamentals' course on Dataquest.io.
My goal is to practice everything learned so far.*

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

In this project, we will go through a complete data science workflow:
* We will clarify the goal of our project
* We will collect relevant data
* We will clean the data to prepare it for analysis: remove wrong data, remove duplicate entries, remove non-English apps, and isolate free apps.
* We will analyze the cleaned data.


### Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first try to see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

* [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play
* [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store

Let's start by opening the two data sets and then continue with exploring the data.

In [29]:
from csv import reader

### Google Play dataset ###
opened_dataset = open('googleplaystore.csv')
read_dataset = reader(opened_dataset)
google_file = list(read_dataset)
google_header = google_file[0]    #separate the header from the data
google_data = google_file[1:]


### Apple Store dataset ###
opened_dataset = open('AppleStore.csv')
read_dataset = reader(opened_dataset)
apple_file = list(read_dataset)
apple_header = apple_file[0]    #separate the header from the data
apple_data = apple_file[1:]
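
Both data sets are loaded with the same three steps, so the pattern could be wrapped in a small helper. The sketch below is one possible way to do that. `load_csv` is a hypothetical name that is not part of the original notebook, and `encoding='utf8'` is an assumption about how the files are stored. The `with` block also closes the file once it has been read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file whose first row is the header."""
    # Assumption: the file is UTF-8 encoded; adjust the encoding if it is not.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Hypothetical usage with the file names from the cell above:
# google_header, google_data = load_csv('googleplaystore.csv')
# apple_header, apple_data = load_csv('AppleStore.csv')
```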

To make it easier to explore the two data sets, we'll first write a function named explore_data() that we can use repeatedly to explore rows in a more readable way. 
We'll also add an option for our function to show the number of rows and columns for any data set.

In [30]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:    # Loops through the slice, and for each iteration, prints a row and adds a new line after
        print(row)
        print('\n') 

    if rows_and_columns:    # Prints the number of rows and columns if rows_and_columns is True
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(google_header)    # print the header of the google data
print('\n')
explore_data(google_data, 0, 4, True)    # print the first few rows and number of rows and columns

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


So the Google Play data set has 10841 rows (apps) and 13 columns.
For our analysis, the columns that might be useful are: 
'App' (= name), 'Category' (= type of app), 'Reviews' (= number of reviews), 'Installs' (= number of installs), 'Type' (indicates whether the app is free or paid), 'Price' (should be '0' for free apps), 'Genres'.

In [31]:
print(apple_header)    # print the header of the apple data
print('\n')
explore_data(apple_data, 0, 4, True)    # print the first few rows and number of rows and columns

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


So the App Store data set has 7197 rows (apps) and 16 columns.
For our analysis, the columns that might be useful are: 
'track_name' (= name), 'price' (should be 0.0, no matter what currency), 'rating_count_tot' (= user rating counts for all versions), 'user_rating_ver' (= average user rating value for the current version), 'prime_genre' (= primary genre).

### Deleting Wrong Data
At our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience. This means that we'll need to:

* Detect and delete wrong data
* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [32]:
print(google_header)
print('\n')
print(google_data[10472])
print('\n')
print(google_data[1])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


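When comparing row 10472 to the header and to row 1, which is correct, you can see that the 'Category' value is missing: the row has 12 values instead of 13, so every later value is shifted one column to the left. 
Let's delete this row, and then check the result by comparing the number of rows before and after deleting. 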

In [33]:
print(len(google_data))
del google_data[10472]
print(len(google_data))

10841
10840
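
The faulty row was found through the Kaggle discussion, but the same kind of error could also be searched for directly, because a row with a missing value is shorter than the header. The sketch below is a minimal illustration of that idea on made-up sample rows. `find_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]

# Made-up sample: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(sample_header, sample_rows))  # prints [1]
```

A length check like this is also safe to run more than once, whereas running the `del` statement a second time would remove a different, valid row.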


### Removing Duplicate Entries
If you explore the Google Play data set long enough or look at the discussions section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

In [34]:
for app in google_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [35]:
### Count the number of duplicates in the entire Google Playstore dataset
unique_google_apps = []
duplicate_google_apps = []

for app in google_data:
    name = app[0]
    if name in unique_google_apps:    #if name is already in unique_google_apps, it is a duplicate
        duplicate_google_apps.append(name)
    else:
        unique_google_apps.append(name)
print('Number of duplicate apps: ', len(duplicate_google_apps))

Number of duplicate apps:  1181
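
The loop above checks membership in a list, which gets slower as the list grows. As an alternative sketch, `collections.Counter` from the standard library can count every name in one pass. `count_duplicates` is a hypothetical helper name, and the sample rows are made up for illustration.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Count rows whose name has already appeared earlier in the data."""
    name_counts = Counter(row[name_index] for row in rows)
    # Every occurrence beyond the first one is a duplicate.
    return sum(count - 1 for count in name_counts.values())

sample = [['Instagram'], ['Instagram'], ['Slack'], ['Instagram']]
print(count_duplicates(sample))  # prints 2
```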


So there are 1181 duplicate entries in the Google Play data set. That's quite a lot, and it would distort our analysis.
Instead of removing duplicates randomly, which might lead to unwanted results, let's base the choice on the data. As the Instagram entries above show, the only difference between these rows is the fourth value, the number of reviews ('66577313', '66577446', '66577313', '66509917' in this case).
It's probably safe to assume that the entry with the most reviews is the most recent one. Therefore, let's keep the entry with the most reviews. 
Let's create a dictionary where each key is a unique app name and the value is the highest number of reviews of that app.
For every app in google_data, we check whether its name is already in the reviews_max dictionary.
If it is and this entry has more reviews, we replace the stored number with the higher number of reviews.  
If the name is not in the dictionary yet, we add it (with its number of reviews).

To check this, the length of the dictionary should be equal to the length of google_data minus 1181 (the number of duplicates found above).

In [36]:
reviews_max = {}

for app in google_data:
    name = app[0]
    n_reviews = float(app[3]) 

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(google_data) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


*Remove duplicate rows from the Google Play data set*

Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. 

In the code cell below:
* We initialize two empty lists, google_clean and already_added.
* We loop through the Google Play data set, and for every iteration we isolate the name of the app and the number of reviews.
* If the current app is not yet in the already_added list, and its number of reviews matches the highest number stored for that app in the reviews_max dictionary, we add the current app to the google_clean list and its name to the already_added list. The already_added check is needed because several entries of one app can share the same highest number of reviews.
* We check the length of google_clean, which should be 9659.

In [37]:
google_clean = []    # this will store our new cleaned data set
already_added = []    # this will just store app names

for app in google_data:    # Loop through the Google Play data set
    name = app[0]
    n_reviews = float(app[3])

    if name not in already_added and n_reviews == reviews_max[name]:
        google_clean.append(app)
        already_added.append(name)
                
explore_data(google_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13
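
The two steps above (building reviews_max, then filtering with already_added) could also be combined into one pass that stores the best row per app name in a dictionary. The sketch below assumes the same column positions as the Google Play data (name at index 0, reviews at index 3). `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the most reviews."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace only on a strictly higher count, so the first row wins ties.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Sketch', 'ART_AND_DESIGN', '4.5', '215644'],
]
print(len(keep_most_reviewed(sample)))  # prints 2
```

The rows come back in the order in which each name first appeared, which may differ from the order of google_clean in the cell above.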


### Removing Non-English Apps

We'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

Some examples:

In [38]:
print(apple_data[813][1])
print(apple_data[6731][1])
print(google_clean[4412][0])
print(google_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text. 

For this, we define a function that checks for non-English characters in any string.
* The characters commonly used in English text are all in the ASCII range, with numbers from 0 to 127
* ord() is a built-in function that returns the corresponding number (the Unicode code point) of a character

If the number of any character in any_string is higher than 127, the function returns False. If this never occurs, it returns True.

In [39]:
def check_string(any_string):
    for character in any_string:
        if ord(character) > 127:
            return False
    return True

print(check_string('Instagram'))
print(check_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_string('Docs To Go™ Free Office Suite'))
print(check_string('Instachat 😜'))       

True
False
False
False


This function also excludes app names such as 'Docs To Go™ Free Office Suite' and 'Instachat 😜'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.
This way we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. Our filter function is still not perfect, but it should be effective enough this way.
The tests below show that English names with one non-ASCII character are now accepted, while non-English names are still rejected. 

In [40]:
def check_string(any_string):
    n = 0
    for character in any_string:
        if ord(character) > 127:
            n += 1
            if n > 3:
                return False
    return True

print(check_string('Docs To Go™ Free Office Suite'))
print(check_string('Instachat 😜'))
print(check_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
True
False
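
The threshold of three is a judgment call, so it may be convenient to make it a parameter and experiment with other values. The sketch below is a variant of check_string under that assumption. `is_probably_english` and `max_non_ascii` are hypothetical names that do not appear in the original notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if at most max_non_ascii characters fall outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))  # prints True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # prints False
print(is_probably_english('Instachat 😜', max_non_ascii=0))  # prints False
```

With `max_non_ascii=0` the function behaves like the first, strict version of check_string.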


In [41]:
english_google_apps = []

for app in google_clean:
    name = app[0]
    if check_string(name):
        english_google_apps.append(app)
        
english_apple_apps = []

for app in apple_data:
    name = app[1]    # name is in the second position
    if check_string(name):
        english_apple_apps.append(app)
        
explore_data(english_google_apps, 0, 4, True)
print('\n')
explore_data(english_apple_apps, 0, 4, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '

After filtering both data sets for non-English names, the Google Play data set has 9614 apps remaining,
and the App Store data set has 6183 apps.

### Isolating the free apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

We'll loop through each data set to isolate the free apps in separate lists. 
Then we'll check the length of each data set to see how many apps remain.

In [42]:
google_final = []    

for app in english_google_apps:
    price = app[7]

    if price == '0':
        google_final.append(app)
        
apple_final = []    

for app in english_apple_apps:
    price = app[4]

    if price == '0.0':
        apple_final.append(app)
                
print(len(google_final))
print(len(apple_final))

8864
3222


So we have 8864 Google Play apps and 3222 App Store apps remaining.
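
The comparisons above rely on the exact strings '0' and '0.0', so each data set needs its own check. A single numeric comparison might be more robust. The sketch below assumes that paid prices may carry a leading '$' sign, which is an assumption about the data rather than something shown in the rows above. `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True if a price string such as '0', '0.0' or '$0.00' means free."""
    # Assumption: a currency symbol, if present, is a leading '$'.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Prices that cannot be parsed are treated as not free in this sketch.
        return False

print(is_free('0'))      # prints True
print(is_free('0.0'))    # prints True
print(is_free('$4.99'))  # prints False
```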

### Most Common Apps by Genre
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in each market. 
In the Google Play data, the useful columns are 'Category' and 'Genres'.
In the App Store data, the useful column is 'prime_genre'.
We'll build frequency tables for these columns in our data sets.
We'll build two functions we can use to analyze the frequency tables:
* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order


In [43]:
def freq_table(dataset, index):
    frequency_n = {}
    frequency_percentages = {}
    total = len(dataset)

    for row in dataset:
        a_data_point = row[index]
        if a_data_point in frequency_n:
            frequency_n[a_data_point] += 1
        else:
            frequency_n[a_data_point] = 1
    
    for frequency in frequency_n:
        percentage = (frequency_n[frequency] / total * 100) 
        frequency_percentages[frequency] = percentage
    return frequency_percentages

# function to print a freq table in descending order
# created by dataquest
def display_table(dataset, index):    
    table = freq_table(dataset, index)    
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
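
The same table could be built with `collections.Counter` from the standard library, whose `most_common()` method already returns the values from most to least frequent. The sketch below is an alternative to the two functions above, not a replacement for them. `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, most common values first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.most_common()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', round(percentage, 2))
```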

We'll display the frequency table of the columns prime_genre, Genres, and Category.

apple: prime_genre:

In [44]:
display_table(apple_final, 11)    

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


As can be seen, the most common genre is 'Games', with 58.16%. 
There is a big gap to the runner-up, 'Entertainment', with only 7.88%.  
'Photo & Video' comes third, with almost 5%.
In fourth position is 'Education', the first practical genre, with 3.66%.
This is followed by 'Social Networking' with 3.29% and 'Shopping' with 2.61%.
Number 7 is 'Utilities' with 2.51%.
This implies most free English iOS apps are designed for entertainment. 
'Games' in particular makes up a big part. But we don't know anything about user numbers at this point; maybe 'Games' has a relatively low number of users? 

In [45]:
display_table(google_final, 1)    # google: category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [46]:
display_table(google_final, 9)    # google: genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

So the Google Play data holds both 'Category' and 'Genres'. The two columns overlap, but not completely. 
The biggest category is 'FAMILY' with 18.91%.
This is followed by 'GAME' with 9.72%. The next biggest categories are tools, business, lifestyle, productivity, finance, medical, sports, etc. 
So it seems that a lot more apps here are intended to be practical, instead of just for fun. 

When we look at genres, we see a much longer list. There are a lot of genres, and some are very specific. The biggest genre is 'Tools' with 8.45% and the second biggest is 'Entertainment' with 6.07%, so no single genre dominates. Also, there is no single 'Games' genre; instead there are genres such as 'Action', 'Simulation', 'Arcade', 'Puzzle', 'Racing' and 'Role Playing', each of which is small. The top ten genres consist almost exclusively of practical genres, such as 'Tools', 'Education', 'Business', 'Productivity', 'Lifestyle' and 'Finance'. 
Because the genres are so numerous and so specific, and we're only looking for the bigger picture at the moment, we'll only work with the Category column moving forward.


When we compare the App Store market with the Google Play market, we can say that the App Store holds more apps designed for fun, and the Google Play market holds a mix of practical and entertaining apps.  

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

### Most Popular Apps by Genre on App Store
Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

* Isolate the apps of each genre.
* Sum up the user ratings for the apps of that genre.
* Divide the sum by the number of apps belonging to that genre.

In [47]:
appstore_genres = freq_table(apple_final, 11)

for genre in appstore_genres:
    total = 0
    len_genre = 0
    for row in apple_final:
        genre_app = row[11]
        if genre == genre_app:
            total += float(row[5])    # rating_count_tot
            len_genre += 1
    avg_user_ratings = total / len_genre
    print(genre, ': ', avg_user_ratings)

Lifestyle :  16485.764705882353
Medical :  612.0
Travel :  28243.8
Business :  7491.117647058823
Education :  7003.983050847458
Games :  22788.6696905016
Navigation :  86090.33333333333
Finance :  31467.944444444445
Productivity :  21028.410714285714
Entertainment :  14029.830708661417
Shopping :  26919.690476190477
News :  21248.023255813954
Weather :  52279.892857142855
Sports :  23008.898550724636
Book :  39758.5
Catalogs :  4004.0
Photo & Video :  28441.54375
Social Networking :  71548.34905660378
Food & Drink :  33333.92307692308
Reference :  74942.11111111111
Utilities :  18684.456790123455
Health & Fitness :  23298.015384615384
Music :  57326.530303030304


Navigation :  86090.33333333333
Reference :  74942.11111111111
Social Networking :  71548.34905660378
Music :  57326.530303030304
Weather :  52279.892857142855
Book :  39758.5
Food & Drink :  33333.92307692308


This shows that 'Navigation' apps have the highest average number of ratings. They are followed by 'Reference', 'Social Networking', 'Music', 'Weather', 'Book' and 'Food & Drink'. 
'Games', which was the most common genre earlier, now comes in at number 14 (of 23), so this analysis paints a very different picture from the previous one. 
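
The nested loop above scans apple_final once per genre and prints the averages unsorted. A single pass that accumulates totals and counts per genre, followed by a sort, gives the ranking directly. The sketch below shows one way to do this. `average_by_group` is a hypothetical helper, and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return [(group, mean value)] sorted from highest to lowest mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    averages = [(group, totals[group] / counts[group]) for group in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

sample = [['Games', '10'], ['Games', '30'], ['Music', '50']]
print(average_by_group(sample, 0, 1))  # prints [('Music', 50.0), ('Games', 20.0)]
```

For the App Store data, the call would presumably be `average_by_group(apple_final, 11, 5)`, using the prime_genre and rating_count_tot positions from the cell above.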

Let's dive a little deeper:

In [54]:
print('NAVIGATION:')
for app in apple_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

print('\n')
print('SOCIAL NETWORKING:')        
for app in apple_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

NAVIGATION:
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


SOCIAL NETWORKING:
Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 4

'Navigation' and 'Social Networking' are dominated by a few very big apps, so not as useful for us as they might seem. 
'Weather' apps are usually used for a quick glance only, so also not very useful for our purpose.
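
One way to see how strongly a few giants pull up the average is to compare the mean with the median. The sketch below uses the six Navigation rating counts printed above and the standard library `statistics` module. It is an illustration only and was not part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps printed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(round(mean(navigation_ratings), 2))  # prints 86090.33
print(median(navigation_ratings))          # prints 8196.5
```

The mean is more than ten times the median, which reflects how much Waze and Google Maps dominate this genre.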

When we check out the 'Reference' category a bit more, we see this mainly contains the Bible and dictionaries.

In [55]:
for app in apple_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Let's check out the 'Book' category a bit more:

In [52]:
for app in apple_final:
    if app[-5] == 'Book':
        print(app[1], ':', app[5])

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
ikouhoushi : 0
MangaTiara - love comic reader : 0
謎解き : 0
謎解き2016 : 0


The main app here is Kindle. But an audiobook and podcast app comes second, and third is actually an adult coloring book! 
Maybe we should make an audio and coloring book app that tells stories accompanied by coloring assignments? This would be an entertaining app with a fun element (but not a 'game', which is a genre with many entries where it is probably hard to stand out). 

### Most Popular Apps by Genre on Google Play
Now let's calculate the average number of installs per app genre for the Google Play data set.

In [50]:
playstore_cats = freq_table(google_final, 1)

for category in playstore_cats:
    total = 0
    len_category = 0
    for row in google_final:
        cat_app = row[1]
        if category == cat_app:
            installs = row[5]    # installs
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs) 
            len_category += 1
    avg_installs = total / len_category
    print(category, ': ', avg_installs)

HOUSE_AND_HOME :  1331540.5616438356
DATING :  854028.8303030303
TOOLS :  10801391.298666667
SOCIAL :  23253652.127118643
SPORTS :  3638640.1428571427
PRODUCTIVITY :  16787331.344927534
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
NEWS_AND_MAGAZINES :  9549178.467741935
LIFESTYLE :  1437816.2687861272
BOOKS_AND_REFERENCE :  8767811.894736841
HEALTH_AND_FITNESS :  4188821.9853479853
ART_AND_DESIGN :  1986335.0877192982
LIBRARIES_AND_DEMO :  638503.734939759
SHOPPING :  7036877.311557789
MAPS_AND_NAVIGATION :  4056941.7741935486
GAME :  15588015.603248259
COMICS :  817657.2727272727
PHOTOGRAPHY :  17840110.40229885
FINANCE :  1387692.475609756
BEAUTY :  513151.88679245283
EVENTS :  253542.22222222222
AUTO_AND_VEHICLES :  647317.8170731707
BUSINESS :  1712290.1474201474
PARENTING :  542603.6206896552
VIDEO_PLAYERS :  24727872.452830188
FOOD_AND_DRINK :  1924897.7363636363
TRAVEL_AND_LOCAL :  13984077.710144928
MEDICAL :  120550.61980830671
FAMILY :  3695641.8198090694

These are all big numbers. 'COMMUNICATION' is the biggest category, with about 38 million installs per app on average. This is followed by 'VIDEO_PLAYERS', 'SOCIAL', 'PHOTOGRAPHY', 'PRODUCTIVITY' and then 'GAME'.

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

If we follow our earlier idea of creating an audio and coloring book app, let's check the relevant category, 'BOOKS_AND_REFERENCE'.

In [56]:
for app in google_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

Again, religious books and dictionaries have many installs. However, there are also many audiobook apps here.

### Conclusion
What was our purpose again?

*For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.*

Since we want to create an app for both Android and iOS, books, and more specifically audiobooks, might be a good way to go.

So let's create an audio book (app)!
And add some coloring if we want to make it more special and captivating :)