# Profitable App Profiles for the App Store and Google Play Markets

*DataQuest Guided Project #1*

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We're working as data analysts for a company that builds English-language Android and iOS mobile apps, and our job is to enable our team of developers to make data-driven decisions with respect to the kind of apps they build.

At our company, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that our revenue for any given app is mostly influenced by the number of users that use our app. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Available Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll try to analyze a sample of data instead. To avoid spending resources on collecting new data ourselves, we should first see whether we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our purpose:

  - [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
  - [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
  
Let's start by opening the two data sets:

In [1]:
from csv import reader

# Opening the App Store data #
opened_file = open('AppleStore.csv', encoding='utf8')  # explicit encoding: app names contain non-ASCII text
read_file = reader(opened_file)
apps_data = list(read_file)
opened_file.close()

# Opening the Google Play data #
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
goog_data = list(read_file)
opened_file.close()

print(apps_data[:3])
print()
print(goog_data[:3])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 

Success! Both files can be imported and analysed as required. 

We can also see that each data set has a header row, so we will isolate these from the main body of data for separate analysis:

In [2]:
apps_header = apps_data[0] # the header row for the App Store data
apps_data = apps_data[1:]  # removing the header row from the overall results

goog_header = goog_data[0] # likewise for the Google Play data
goog_data = goog_data[1:]  
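
The open-then-split-the-header steps above can also be wrapped in one reusable helper. The sketch below is a minimal illustration, not part of the original project: `load_csv` is a hypothetical name, and the `utf8` encoding is an assumption about how the files are stored.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file whose first row is a header."""
    # The `with` block closes the file even if reading fails.
    # newline='' is the setting the csv module documentation recommends.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# Hypothetical usage with the files from this project:
# apps_header, apps_data = load_csv('AppleStore.csv')
# goog_header, goog_data = load_csv('googleplaystore.csv')
```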

Now that we know both files are accessible, the `explore_data()` function below lets us print any slice of rows in a more readable way. It works on any data set and can optionally report the number of rows and columns:

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # prints two newlines, leaving two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


To begin with, let's review the App Store data:

In [4]:
print(apps_header)
print()
explore_data(apps_data,0,3,True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


There are 7,197 apps in the App Store data, separated into 16 columns. The columns that seem interesting to us at this stage are: 
   - 'track_name'[1]
   - 'currency'[3]
   - 'price'[4]
   - 'rating_count_tot'[5]
   - 'rating_count_ver'[6]
   - 'user_rating'[7]
   - 'user_rating_ver'[8]
   - 'cont_rating'[10]
   - 'prime_genre'[11]
      
Not all column names are self-explanatory, but details of each can be found in the [data set expanded description](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

Now, let's repeat the process for the Google Play data:

In [5]:
print(goog_header)
print()
explore_data(goog_data,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


There are 10,841 apps in the Google Play Store data, separated into 13 columns. The columns that seem interesting to us at this stage are: 
   - 'App'[0]
   - 'Category'[1]
   - 'Rating'[2]
   - 'Reviews'[3]
   - 'Installs'[5]
   - 'Type'[6]
   - 'Price'[7]
   - 'Content Rating'[8]
   - 'Genres'[9]

## Data Cleansing

Before beginning our analysis, we need to make sure the data we analyse is accurate, otherwise the results of our analysis will be wrong. This means that we need to:

  - Detect inaccurate data, and correct or remove it.
  - Detect duplicate data, and remove the duplicates.
  

### Removing incorrect data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101) describes an error within row 10,472.

Let's review that row and compare it to another:

In [6]:
print(goog_header)
print()
print(goog_data[10472])
print()
print('Number of columns: ',len(goog_data[10472]))
print()
print(goog_data[0])
print()
print('Number of columns: ',len(goog_data[0]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

Number of columns:  12

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

Number of columns:  13


We know there are 13 columns in our data and can see that row 10,472, for the 'Life Made WI-Fi Touchscreen Photo Frame' app, only contains 12. 

Reviewing the data reveals that the second column should be for 'Category' and has an entry of '1.9', whereas the comparison data has 'ART_AND_DESIGN', so it appears that 'Category' is missing from row 10,472 - this is confirmed in the [discussions section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101).

Therefore, we will need to delete this row:


In [7]:
print(len(goog_data))
del goog_data[10472]  # only run once or additional rows will be deleted
print(len(goog_data))

10841
10840


After the update, the length of `goog_data` reduced from 10,841 to 10,840.
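
Rather than relying on a reported row number, a length check against the header can flag rows like this one automatically. The following is a minimal sketch; `find_malformed_rows` is a hypothetical helper and the sample rows are shortened for illustration, not taken verbatim from the data set.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Tiny illustrative sample (three columns instead of the real thirteen)
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # 'Category' is missing
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)
```

Running a check like this before deleting anything would also show whether more than one row is affected.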

The [discussion section](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) for the App Store data set does not include any reports of incorrect data.

Onwards we go....

### Identifying duplicate entries

If we explore the Google Play data set long enough, or look at the [discussions section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), we'll discover some apps have duplicate entries. For instance, Instagram has four entries:

In [8]:
print(goog_header)

for app in goog_data:
    name = app[0]
    if name == 'Instagram':
        print('\n',app)
        

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

 ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

 ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

 ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

 ['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


So, we need to determine how many duplicate names exist in each data set, and make it a repeatable exercise for any column in any data set. 

The below function, `duplicate_entries`, will allow us to pass in any dataset and column index and utilise lists to count how many duplicate entries exist.

First up, we'll pass in the App Store data which contains the names in column index [1]:

In [9]:
unique_entry = []
dupe_entry = []

def duplicate_entries(dataset,column=0):
    for app in dataset:
        name = app[column]
        if name in unique_entry:      # if the name is already in the unique list...
            dupe_entry.append(name)   # ..then add it to the duplicate list..
        else:
            unique_entry.append(name) # ..otherwise add it to the unique list

duplicate_entries(apps_data,1)        # utilising the function

print('Unique entries: ',len(unique_entry))
print()
print('Duplicate entries: ',len(dupe_entry))


Unique entries:  7195

Duplicate entries:  2


The results indicate there are 2 potential duplicate entries to be investigated.

However, the entries in the App Store data each have a unique ID (column index[0]) so if we were to validate against this:


In [10]:
unique_entry = []  # initialising the lists to remove existing entries
dupe_entry = []
duplicate_entries(apps_data)       

print('Unique entries: ',len(unique_entry))
print()
print('Duplicate entries: ',len(dupe_entry))

Unique entries:  7197

Duplicate entries:  0


Then we get no duplicate entries to be concerned about.

To validate the `duplicate_entries` function returns different results per column, we'll run it again based on the 'prime_genre' in column index 11:

In [11]:
unique_entry = []  # initialising the lists to remove existing entries
dupe_entry = []
duplicate_entries(apps_data,11)       

print('Unique entries: ',len(unique_entry))
print()
print('Duplicate entries: ',len(dupe_entry))


Unique entries:  23

Duplicate entries:  7174


We can see a completely different set of results, with there being only 23 unique 'prime_genres' and therefore 7,174 duplicate entries.

We are happy that the function works as required and can now be applied to the Google Play data set which has the name in column index [0]:

In [12]:
unique_entry = []
dupe_entry = []

duplicate_entries(goog_data)   # no column index required as the default is [0]  

print('Unique entries: ',len(unique_entry))
print()
print('Duplicate entries: ',len(dupe_entry))


Unique entries:  9659

Duplicate entries:  1181


So, there are 1,181 potential duplicate apps in the Google Play data set that need to be investigated and removed.
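
The counting approach used above depends on two module-level lists that have to be reset before every run. As a hedged alternative, the sketch below keeps all state inside the function and uses a set for faster membership checks; `count_duplicates` is a hypothetical name, and the sample rows are made up for illustration.

```python
def count_duplicates(dataset, column=0):
    """Return (unique_count, duplicate_count) for the values in one column."""
    seen = set()       # values encountered so far
    duplicates = 0     # number of repeat occurrences
    for row in dataset:
        value = row[column]
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return len(seen), duplicates

# Illustrative rows: 'Instagram' appears three times, 'Box' once
sample = [['Instagram'], ['Instagram'], ['Box'], ['Instagram']]
unique_count, duplicate_count = count_duplicates(sample)
print('Unique entries: ', unique_count)
print('Duplicate entries: ', duplicate_count)
```

Because nothing is shared between calls, the function can be run repeatedly on different columns without re-initialising anything.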

We will require detailed analysis to confirm whether the potential duplications are actually duplicates, and to ensure that only the most recent entry is retained. 

The logic to be applied is to retain the entry with the highest number of reviews and remove the others. In theory, the entry with the highest number of reviews will be the most recent one.

### Removing duplicate entries

To remove the duplicates, we will:

  1. Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
  
  2. Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).
  
Let's create the dictionary:

In [13]:
reviews_max = {}

for app in goog_data:
    name = app[0]
    n_reviews = int(app[3])
  
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews     

If all has gone to plan then we should expect the number of entries in `reviews_max` to equal the previously identified 9,659 unique entries.

In [14]:
print(len(reviews_max))

9659


Boom, looking good! 

The next step is to use `reviews_max` to remove the duplicate entries from `goog_data` and have a clean data set containing only those with the highest number of reviews.

Below:

  - We start by initializing two empty lists, goog_clean and already_added.
  - We loop through the Google Play data set, and for every iteration:
    - We isolate the name of the app and the number of reviews.
    - We add the current row (app) to the goog_clean list, and the app name (name) to the already_added list if:
      - The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
      - The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps.

In [15]:
goog_clean = []
already_added = []

for app in goog_data:
    name = app[0]
    n_reviews = int(app[3])
    
    if reviews_max[name] == n_reviews and name not in already_added:
        goog_clean.append(app)
        already_added.append(name)


Now, let's review the `goog_clean` data and validate the number of rows:

In [16]:
explore_data(goog_clean,0,4,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13


We have 9,659 rows as expected, and all 13 columns are in existence with the data looking as it should. Happy days!
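
For comparison, the two-step removal above can be collapsed into a single pass by storing the best row seen so far for each name. This is only a sketch under stated assumptions: `keep_most_reviewed` is a hypothetical helper, and the sample rows are shortened to four columns using the Instagram review counts shown earlier.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduplicated = keep_most_reviewed(sample)
print(deduplicated)
```

One difference to be aware of: the result is ordered by the first appearance of each name, not by the position of the retained row, so row order may differ from `goog_clean` even though the same rows are kept.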


### Identifying non-English language Apps

If we explore the data sets enough, we'll find they both have apps with names that suggest they are not directed toward an English-speaking audience.

Below are some examples:

In [17]:
print(apps_data[813][1])
print(apps_data[6731][1])
print()
print(goog_clean[4412][0])
print(goog_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜

中国語 AQリスニング
لعبة تقدر تربح DZ


We need to create a function that identifies any non-English characters that appear in the name of an app. The characters commonly used in English text all fall in the [ASCII](http://www.asciitable.com/) range 0 to 127, so we need to identify any characters with a code greater than 127.

The `is_english` function below loops through a string and returns `False` if any character has a code greater than 127; if all are 127 or below, it returns `True`.

Examples are provided to test the function:

In [18]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))


True
False
False
False


It works, after a fashion. 

Although we are indeed identifying non-English characters for exclusion, this also excludes names containing symbols and emojis such as ™ and 😜, as they fall outside the ASCII range and have corresponding numbers over 127. If we were to proceed, we would lose useful data because English apps would be incorrectly labeled as non-English.

To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

Our filter function is still not perfect, but it should be fairly effective.

Let's test it again on the same apps as before:

In [19]:
def is_english(string):
    not_english = 0
    for character in string:
        if ord(character) > 127:
            not_english += 1
          
    if not_english > 3:
        return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


Great, only the '爱奇艺PPS -《欢乐颂2》电视剧热播' app now returns `False`.
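
The cut-off of three non-ASCII characters is a judgment call, so it can be useful to make it a parameter and see how the result changes. The sketch below is an illustrative variant, not the function used in the rest of this project; the name `is_probably_english` and the `max_non_ascii` parameter are assumptions introduced here.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` characters above ASCII 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
results = [is_probably_english(name) for name in names]
print(results)

# With a threshold of zero this behaves like the first, stricter version
strict_results = [is_probably_english(name, max_non_ascii=0) for name in names]
print(strict_results)
```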

Let's apply this approach to the two data sets, keeping only the English apps:

In [20]:
apps_english = []
goog_english = []

for app in apps_data:
    name = app[1]
    if is_english(name):
        apps_english.append(app)

for app in goog_clean:
    name = app[0]
    if is_english(name):
        goog_english.append(app)    

print(apps_header,'\n')
explore_data(apps_english,0,4,True)
print()
print(goog_header,'\n')
explore_data(goog_english,0,4,True)
    

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 6183
Number of columns: 16

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo E

We now have 6,183 remaining from the App Store and 9,614 from Google Play.

### Focusing on Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process. The code below does exactly that:

In [21]:
apps_free = []
goog_free = []

for app in apps_english:
    price = app[4]
    if price == '0.0':
        apps_free.append(app)

for app in goog_english:
    price = app[7]
    if price == '0':
        goog_free.append(app)    

print(apps_header,'\n')
explore_data(apps_free,0,4, True)
print()
print(goog_header,'\n')
explore_data(goog_free,0,4,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 3222
Number of columns: 16

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo E

Scores on the doors: 3,222 remaining from the App Store and 8,864 from Google Play.
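
The filter above compares price strings exactly (`'0.0'` for the App Store, `'0'` for Google Play), which works for these two files but is sensitive to formatting. A more tolerant check is sketched below. It is an illustration only: `is_free` is a hypothetical helper, and the idea that paid prices may carry a leading `$` is an assumption, not something shown in the output above.

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')  # assumption: a '$' prefix may be present
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparseable prices are treated as not free

checks = [is_free('0'), is_free('0.0'), is_free('$4.99'), is_free('Everyone')]
print(checks)
```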

### Genre analysis

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

  1. Build a minimal Android version of the app, and add it to Google Play.
  2. If the app has a good response from users, we develop it further.
  3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.
  
Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

The columns that appear to contain the data we require are:

  - **App Store:** 'prime_genre' [index:11]
  - **Google Play:** 'Category' [index:1] & 'Genres' [index:9]
  
We'll build two functions we can use to analyse the frequency tables:

  - One function to generate frequency tables that show percentages
  - Another function that we can use to display the percentages in a descending order

In [22]:
def freq_table(dataset,index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        selection = row[index]
        if selection in table:
            table[selection] += 1
        else:
            table[selection] = 1
            
    print('Total entries: ',total)
    print('Category count: ',len(table),'\n')
    
    table_percentages = {}
    
    for key in table:
        percentage = round((table[key] / total) * 100,3)
        table_percentages[key] = percentage
    return (table_percentages)

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
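
The same frequency table can be built with the standard library's `collections.Counter`, which handles both the counting and the descending sort. This is a minimal sketch for comparison; `freq_percentages` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def freq_percentages(dataset, index):
    """Return (value, percentage) pairs sorted by descending frequency."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs from most to least frequent
    return [(value, round(count / total * 100, 3))
            for value, count in counts.most_common()]

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_percentages(sample, 1)
for value, percentage in table:
    print(value, ':', percentage)
```

Unlike `display_table`, this version returns the sorted pairs instead of printing them, which leaves the caller free to print, slice or reuse the result.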


First up, we'll examine the `prime_genre` column of the App Store data:

In [23]:
display_table(apps_free,11)  # display_table prints its own output and returns None

Total entries:  3222
Category count:  23 

Games : 58.163
Entertainment : 7.883
Photo & Video : 4.966
Education : 3.662
Social Networking : 3.29
Shopping : 2.607
Utilities : 2.514
Sports : 2.142
Music : 2.048
Health & Fitness : 2.017
Productivity : 1.738
Lifestyle : 1.583
News : 1.335
Travel : 1.241
Finance : 1.117
Weather : 0.869
Food & Drink : 0.807
Reference : 0.559
Business : 0.528
Book : 0.435
Navigation : 0.186
Medical : 0.186
Catalogs : 0.124


Of the 3,222 free English apps, spread across 23 genres, more than half (58.163%) are 'Games'. This is by far the most common genre, with 'Entertainment' next at 7.883%, followed by 'Photo & Video' at 4.966%. Next is 'Education', coming in at only 3.662%, followed by 'Social Networking' at 3.29%.

The general impression is that the vast majority of apps on the App Store (for the free, English language apps) are designed for fun ('Games', 'Entertainment', 'Photo & Video', 'Social Networking', 'Sports', 'Music' etc.) whilst more practical apps ('Education', 'Shopping', 'Utilities', 'Health & Fitness') are less common.

However, the volume of apps in a particular genre does not necessarily correlate with the number of users, so this will require further investigation.

Let's continue by examining the `Category` and `Genres` columns of the Google Play data set (two columns which may be related).

In [24]:
display_table(goog_free,1) # Category

Total entries:  8864
Category count:  33 

FAMILY : 18.908
GAME : 9.725
TOOLS : 8.461
BUSINESS : 4.592
LIFESTYLE : 3.903
PRODUCTIVITY : 3.892
FINANCE : 3.7
MEDICAL : 3.531
SPORTS : 3.396
PERSONALIZATION : 3.317
COMMUNICATION : 3.238
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.944
NEWS_AND_MAGAZINES : 2.798
SOCIAL : 2.662
TRAVEL_AND_LOCAL : 2.335
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.144
DATING : 1.861
VIDEO_PLAYERS : 1.794
MAPS_AND_NAVIGATION : 1.399
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.959
LIBRARIES_AND_DEMO : 0.936
AUTO_AND_VEHICLES : 0.925
HOUSE_AND_HOME : 0.824
WEATHER : 0.801
EVENTS : 0.711
PARENTING : 0.654
ART_AND_DESIGN : 0.643
COMICS : 0.62
BEAUTY : 0.598


The 8,864 free English apps fall into 33 categories, and the spread across them is much more even than in the App Store, with no single category dominating. That said, the 'FAMILY' category is the most common at 18.908%, almost twice the share of the second most common, 'GAME', at 9.725%.

There are not as many apps designed for fun, percentage-wise, as in the App Store, with 'FAMILY' and 'GAME' being the only obvious entries in the top 10. The remainder of the top 10 is made up of more practical-sounding categories: 'TOOLS', 'BUSINESS', 'LIFESTYLE' and 'PRODUCTIVITY', for example, although 'TOOLS' is a little ambiguous.

This theme is confirmed when reviewing 'Genres':


In [25]:
display_table(goog_free,9) # Genres

Total entries:  8864
Category count:  114 

Tools : 8.45
Entertainment : 6.069
Education : 5.347
Business : 4.592
Productivity : 3.892
Lifestyle : 3.892
Finance : 3.7
Medical : 3.531
Sports : 3.463
Personalization : 3.317
Communication : 3.238
Action : 3.102
Health & Fitness : 3.08
Photography : 2.944
News & Magazines : 2.798
Social : 2.662
Travel & Local : 2.324
Shopping : 2.245
Books & Reference : 2.144
Simulation : 2.042
Dating : 1.861
Arcade : 1.85
Video Players & Editors : 1.771
Casual : 1.76
Maps & Navigation : 1.399
Food & Drink : 1.241
Puzzle : 1.128
Racing : 0.993
Role Playing : 0.936
Libraries & Demo : 0.936
Auto & Vehicles : 0.925
Strategy : 0.914
House & Home : 0.824
Weather : 0.801
Events : 0.711
Adventure : 0.677
Comics : 0.609
Beauty : 0.598
Art & Design : 0.598
Parenting : 0.496
Card : 0.451
Casino : 0.429
Trivia : 0.417
Educational;Education : 0.395
Board : 0.384
Educational : 0.372
Education;Education : 0.338
Word : 0.259
Casual;Pretend Play : 0.237
Music : 0.203
Raci

What immediately jumps out is that there are 114 genres, far more than the 23 genres in the App Store or the 33 categories in Google Play.

However, the 114 genres do appear to align with the category figures, with only one obvious fun genre in the top 10: 'Entertainment' in second position with 6.069%. The remainder of the top 10 appear to be for more practical purposes, although the ambiguous 'Tools' reappears at number 1 with 8.45%.

The link between the Google Play category and genre is not immediately clear, but genre appears to be much more granular. As we're only looking for the bigger picture at the moment, we'll only work with the `Category` column moving forward.

In conclusion so far: the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. There is a disparity in numbers between the two datasets so that could be a factor to explore further. 

Next, we'd like to get an idea about the kind of apps that have most users.

### Most popular Apps by Genre - App Store

*From this point onwards, the narrative and approach have been lifted from the solution provided by DataQuest, as it was the techniques I was learning rather than the analytical approach. That said, all of the main formulas are still my own work.*

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column (in both data sets, the relevant column is at index [5]).

Below, we calculate the average number of user ratings per app genre on the App Store:

In [26]:
apps_genres = freq_table(apps_free,11)  # identify each genre

for genre in apps_genres:
    total = 0      # the sum of user ratings per genre
    len_genre = 0  # number of apps per genre
    for row in apps_free:
        genre_app = row[11]        
        if genre_app == genre:
            ratings = int(row[5])
            total += ratings           
            len_genre += 1
    avg_n_users = round(total / len_genre,2)  # calculate average number of ratings
    print(genre,':',avg_n_users,'across',len_genre,'apps')
  
    

Total entries:  3222
Category count:  23 

Photo & Video : 28441.54 across 160 apps
Music : 57326.53 across 66 apps
Finance : 31467.94 across 36 apps
Sports : 23008.9 across 69 apps
Productivity : 21028.41 across 56 apps
Reference : 74942.11 across 18 apps
Utilities : 18684.46 across 81 apps
Food & Drink : 33333.92 across 26 apps
Social Networking : 71548.35 across 106 apps
Book : 39758.5 across 14 apps
Weather : 52279.89 across 28 apps
Navigation : 86090.33 across 6 apps
Entertainment : 14029.83 across 254 apps
News : 21248.02 across 43 apps
Business : 7491.12 across 17 apps
Lifestyle : 16485.76 across 51 apps
Games : 22788.67 across 1874 apps
Catalogs : 4004.0 across 4 apps
Health & Fitness : 23298.02 across 65 apps
Shopping : 26919.69 across 84 apps
Education : 7003.98 across 118 apps
Travel : 28243.8 across 40 apps
Medical : 612.0 across 6 apps
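The nested loop above rescans `apps_free` once per genre. As a minimal sketch (not the notebook's own method), the same averages could be accumulated in a single pass with a dictionary. The helper names `average_by_key` and `make_row` are hypothetical, and `sample_rows` is made-up data that only assumes the genre sits at index 11 and the rating count at index 5, as in the App Store data set.

```python
def average_by_key(rows, key_index, value_index):
    """Return {key: (average, count)} computed in one pass over rows."""
    totals = {}  # key -> [running sum, number of rows]
    for row in rows:
        key = row[key_index]
        value = int(row[value_index])
        if key not in totals:
            totals[key] = [0, 0]
        totals[key][0] += value
        totals[key][1] += 1
    return {key: (round(total / count, 2), count)
            for key, (total, count) in totals.items()}


def make_row(genre, ratings):
    """Build a hypothetical 12-column row holding only the two columns we need."""
    row = [''] * 12
    row[5] = str(ratings)
    row[11] = genre
    return row


# Hypothetical rows for illustration only, not taken from the data set
sample_rows = [
    make_row('Games', 100),
    make_row('Games', 300),
    make_row('Weather', 50),
]

averages = average_by_key(sample_rows, 11, 5)
for genre, (avg, count) in averages.items():
    print(genre, ':', avg, 'across', count, 'apps')
```

With the real data, the call would presumably be `average_by_key(apps_free, 11, 5)`. For a few thousand rows the nested loop is fast enough, so this is mainly a readability choice.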


The list is unordered. Ranked by average number of user ratings, the top ten genres are:

  1. Navigation 86090.33
  1. Reference 74942.11
  1. Social Networking 71548.35
  1. Music 57326.53
  1. Weather 52279.89
  1. Book 39758.5
  1. Food & Drink 33333.92
  1. Finance 31467.94
  1. Photo & Video 28441.54
  1. Travel 28243.8


Games, by far the largest genre by number of apps on the App Store, only ranks 14th, with an average of 22788.67 user ratings per app.
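The ranking above can be reproduced by sorting the genre averages instead of ordering them by hand. The sketch below copies the averages from the printed output into a dictionary (the name `genre_averages` is an assumption for illustration) and sorts them from highest to lowest.

```python
# Averages copied from the output above (average user ratings per genre)
genre_averages = {
    'Photo & Video': 28441.54, 'Music': 57326.53, 'Finance': 31467.94,
    'Sports': 23008.9, 'Productivity': 21028.41, 'Reference': 74942.11,
    'Utilities': 18684.46, 'Food & Drink': 33333.92,
    'Social Networking': 71548.35, 'Book': 39758.5, 'Weather': 52279.89,
    'Navigation': 86090.33, 'Entertainment': 14029.83, 'News': 21248.02,
    'Business': 7491.12, 'Lifestyle': 16485.76, 'Games': 22788.67,
    'Catalogs': 4004.0, 'Health & Fitness': 23298.02, 'Shopping': 26919.69,
    'Education': 7003.98, 'Travel': 28243.8, 'Medical': 612.0,
}

# Sort (genre, average) pairs from highest to lowest average
ranking = sorted(genre_averages.items(), key=lambda item: item[1], reverse=True)

for position, (genre, average) in enumerate(ranking, start=1):
    print(position, genre, average)

# 1-based position of Games in the sorted list
games_rank = [genre for genre, _ in ranking].index('Games') + 1
print('Games rank:', games_rank)
```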

Let's investigate number 1, which is Navigation. Immediately there are doubts, as the genre contains only 6 apps in total, but let's review the data:

In [27]:
for app in apps_free:
    if app[11] == 'Navigation':
        print(app[1],':',app[5]) # display app name and rating volume

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The results are dominated by Waze and Google Maps, with a steep drop-off in user ratings for each of the remaining apps. This does not look like a promising genre to pursue.

It's a similar issue for 'Social Networking' (Facebook, Pinterest, Skype, etc.) and 'Music' (Pandora, Spotify, and Shazam): there are more apps, but a few heavy hitters dominate the number of ratings.
     

In [28]:
for app in apps_free:
    if app[11] == 'Social Networking':
        print(app[1],':',app[5]) # display app name and rating volume
print('------------------------------------------------------------')       
for app in apps_free:
    if app[11] == 'Music':
        print(app[1],':',app[5]) # display app name and rating volume

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by a very small number of apps with hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages, but we'll leave this level of detail for later.
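As a rough illustration of that idea, the sketch below recomputes the Navigation average after dropping apps above a cutoff. The helper name `average_below` and the 100,000 cutoff are assumptions chosen for the example. The rating counts are the six values printed for Navigation above.

```python
def average_below(rating_counts, cap):
    """Average of the counts strictly below cap, or None if none remain."""
    kept = [count for count in rating_counts if count < cap]
    if not kept:
        return None
    return round(sum(kept) / len(kept), 2)


# Rating counts for the six Navigation apps, copied from the output above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

overall = round(sum(navigation_ratings) / len(navigation_ratings), 2)
# The 100,000 cutoff is an assumed threshold, not one defined in the project
without_giants = average_below(navigation_ratings, 100000)

print('all apps:', overall)
print('without apps at or above the cutoff:', without_giants)
```

With Waze and Google Maps excluded, the average falls from 86090.33 to 4146.25, which shows how much the two largest apps drive the genre's apparent popularity.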

The 'Reference' category was second on the list, so let's review it:

In [29]:
for app in apps_free:
    if app[11] == 'Reference':
        print(app[1],':',app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Reference apps have 74,942 user ratings on average across 18 apps, but it's actually the Bible and Dictionary.com that skew the average number of ratings upward.
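One way to see the size of this skew is to compare the mean with the median, which is far less sensitive to a couple of very large values. This is a side calculation rather than part of the original analysis. It uses the 18 rating counts printed above and the standard library's `statistics.median`.

```python
from statistics import median

# Rating counts for the 18 Reference apps, copied from the output above
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]

mean_ratings = round(sum(reference_ratings) / len(reference_ratings), 2)
median_ratings = median(reference_ratings)

print('mean:', mean_ratings)
print('median:', median_ratings)
```

The mean is 74942.11, while the median is 6614.0, so a typical Reference app has far fewer ratings than the genre average suggests.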

However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

  - Weather apps: people generally don't spend too much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

  - Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app therefore requires actual cooking and a delivery service, which is outside the scope of our company.

  - Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

### Most popular Apps by Genre - Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [30]:
display_table(goog_free, 5) # the Installs column


Total entries:  8864
Category count:  21 

1,000,000+ : 15.727
100,000+ : 11.552
10,000,000+ : 10.548
10,000+ : 10.199
1,000+ : 8.394
100+ : 6.916
5,000,000+ : 6.825
500,000+ : 5.562
50,000+ : 4.772
5,000+ : 4.513
10+ : 3.542
500+ : 3.249
50,000,000+ : 2.301
100,000,000+ : 2.132
50+ : 1.918
5+ : 0.79
1+ : 0.508
500,000,000+ : 0.271
1,000,000,000+ : 0.226
0+ : 0.045
0 : 0.011


One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to `int` — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
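In isolation, the cleaning step could look like the sketch below. `parse_installs` is a hypothetical helper name, not something the notebook defines, and the labels are taken from the frequency table above.

```python
def parse_installs(label):
    """Convert an install label such as '1,000,000+' to its int lower bound."""
    cleaned = label.replace(',', '').replace('+', '')
    return int(cleaned)


# Labels taken from the frequency table above
for label in ['1,000,000,000+', '100,000+', '0+', '0']:
    print(label, '->', parse_installs(label))
```

Calling `int('1,000,000+')` directly would raise a `ValueError`, which is why both characters are stripped first.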

In [31]:
goog_cats = freq_table(goog_free,1) # identify each category

for category in goog_cats:
    total = 0     # the sum of installs per category
    len_cat = 0   # number of apps per category
    for row in goog_free:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+','')
            installs = int(installs.replace(',',''))
            total += installs
            len_cat += 1
    avg_installs = round(total/len_cat,2)
    print(category,":",avg_installs,'across',len_cat,'apps')
            

Total entries:  8864
Category count:  33 

COMICS : 817657.27 across 55 apps
VIDEO_PLAYERS : 24727872.45 across 159 apps
EVENTS : 253542.22 across 63 apps
PARENTING : 542603.62 across 58 apps
DATING : 854028.83 across 165 apps
BUSINESS : 1712290.15 across 407 apps
EDUCATION : 1833495.15 across 103 apps
TRAVEL_AND_LOCAL : 13984077.71 across 207 apps
HEALTH_AND_FITNESS : 4188821.99 across 273 apps
GAME : 15588015.6 across 862 apps
ART_AND_DESIGN : 1986335.09 across 57 apps
BEAUTY : 513151.89 across 53 apps
PRODUCTIVITY : 16787331.34 across 345 apps
SOCIAL : 23253652.13 across 236 apps
BOOKS_AND_REFERENCE : 8767811.89 across 190 apps
PHOTOGRAPHY : 17840110.4 across 261 apps
FAMILY : 3695641.82 across 1676 apps
PERSONALIZATION : 5201482.61 across 294 apps
ENTERTAINMENT : 11640705.88 across 85 apps
MEDICAL : 120550.62 across 313 apps
HOUSE_AND_HOME : 1331540.56 across 73 apps
COMMUNICATION : 38456119.17 across 287 apps
LIFESTYLE : 1437816.27 across 346 apps
TOOLS : 10801391.3 across 750 app

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million or 500 million installs:

In [32]:
for app in goog_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

If we removed all the communication apps that have 100 million or more installs, the average would drop roughly tenfold:

In [33]:
under_100_m = []

for app in goog_free:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386


We see the same pattern for the video players category, which is the runner-up with an average of 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [34]:
for app in goog_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [35]:
for app in goog_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Amazon Kindle : 100,000,000+
Wattpad 📖 Free Books : 100,000,000+
Audiobooks from Audible : 100,000,000+


However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [36]:
for app in goog_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
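The middle band above was selected by listing each install label explicitly. A possible alternative, sketched below, is to convert the label to a number and test a range, which avoids missing a label. `in_install_band` is a hypothetical helper. `sample_apps` is a handful of (name, installs) pairs copied from the outputs above, simplified from the real rows, where the name is at index 0 and the installs at index 5.

```python
def in_install_band(label, low, high):
    """True if the label's lower bound falls in the range low <= n < high."""
    n_installs = int(label.replace(',', '').replace('+', ''))
    return low <= n_installs < high


# A few (name, installs) pairs copied from the outputs above
sample_apps = [
    ('Wikipedia', '10,000,000+'),
    ('Google Play Books', '1,000,000,000+'),
    ('Offline English Dictionary', '100,000+'),
    ('Ancestry', '5,000,000+'),
    ('Bible', '100,000,000+'),
]

# Keep apps with at least 1,000,000 and fewer than 100,000,000 installs
mid_band = [name for name, installs in sample_apps
            if in_install_band(installs, 1000000, 100000000)]
print(mid_band)
```

The upper bound is exclusive here so that apps labelled `100,000,000+` stay out of the middle band, matching the four labels used in the cell above.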


This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.