## Profitable App Profiles for the App Store and Google Play Markets

This project looks for the most profitable app profiles in datasets extracted from the App Store and Google Play. For context, the analysis is written from the point of view of a data analyst at a company that builds Android and iOS mobile apps, whose job is to help the developers make data-driven decisions about the kind of apps they build. The company builds only free apps, so its main source of revenue is in-app ads: the more people see and engage with the ads, the more revenue the company receives.

The App Store dataset was downloaded from [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

The Google Play dataset was downloaded from [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).



## 1. Opening and exploring the data


#### 1.1 Opening

Before exploring the data we need to open the files, read them in, and give the datasets names. We will call them `apple_store` and `google_play_store`, respectively:

In [96]:
from csv import reader
# The App Store data set:
opened_file = open('AppleStore.csv', encoding='utf8')
readed_file = reader(opened_file)
apple_store = list(readed_file)
apple_store_header = apple_store[0]
apple_store_noheader = apple_store[1:]
#The Google Play data set:
opened_file = open('googleplaystore.csv', encoding='utf8')
readed_file = reader(opened_file)
google_play_store = list(readed_file)
google_play_store_header = google_play_store[0]
google_play_store_noheader = google_play_store[1:]
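
The same loading step can be wrapped in a small helper that closes the file automatically. This is only a sketch: the names `read_dataset` and `load_dataset` are hypothetical and not part of the original notebook, and the in-memory sample stands in for the real CSV files so that the snippet runs on its own.

```python
import csv
import io

def read_dataset(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

def load_dataset(path, encoding='utf8'):
    """Open a CSV file, read it, and close it again via the with-block."""
    with open(path, encoding=encoding, newline='') as f:
        return read_dataset(f)

# Tiny in-memory sample so the sketch runs without the real files
sample = io.StringIO('id,track_name\n284882215,Facebook\n389801252,Instagram\n')
header, rows = read_dataset(sample)
print(header)
print(len(rows))
```

With the real files in the working directory, `load_dataset('AppleStore.csv')` would return the header and the data rows in one call.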

#### 1.2 Exploring

The data is easier to understand with a helper function that prints a short overview of a dataset. Let's create one:

In [97]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
    # Optionally print the number of rows and columns in the dataset
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now let's take a quick look at the datasets using the `explore_data` function we just created:

In [98]:
print(apple_store_header)
print('\n')
explore_data(apple_store_noheader, 0, 2, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


In [99]:
print(google_play_store_header)
print('\n')
explore_data(google_play_store_noheader, 0, 2, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


You may note that some column names of the App Store dataset are not self-explanatory. A description of each column and other useful information about the datasets are available on Kaggle:
[Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home?select=AppleStore.csv)
[Google Play](https://www.kaggle.com/lava18/google-play-store-apps)

## 2. Data Cleaning

Before analyzing the data we have to check the datasets for errors and duplicates, because incorrect data points would make the results inaccurate. Rows with errors should be corrected or deleted, and duplicates should be removed.



#### 2.1 Errors

In the discussion section of the [Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps), a post reports a wrong entry in row 10472. Let's check it.


In [100]:
print(google_play_store_header)
print('\n')
print(google_play_store_noheader[10472])
print('\n')

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




Printing row 10472 shows a value of 19 in the `Rating` column. Since the maximum rating is 5, this is clearly a mistake. The cause is a missing value in the `Category` column: every entry in this row from `Rating` to `Android Ver` is shifted one position to the left. The column counts confirm it:

In [101]:
print('The number of columns at 10472 row:',len(google_play_store_noheader[10472]))
print('The number of columns at 566 row:',len(google_play_store_noheader[566]))
print('The number of columns at 9985 row:',len(google_play_store_noheader[9985]))

The number of columns at 10472 row: 12
The number of columns at 566 row: 13
The number of columns at 9985 row: 13


Consequently, we remove that row:

In [102]:
print('Number of rows before removing:',len(google_play_store_noheader))
del(google_play_store_noheader[10472])
print('Number of rows after removing:',len(google_play_store_noheader))

Number of rows before removing: 10841
Number of rows after removing: 10840
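
Instead of relying on a hard-coded row index, rows with a wrong number of columns can also be found programmatically. The following is a minimal sketch, not the notebook's method: `find_malformed_rows` is a hypothetical helper, and the three-column sample is shortened for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Shortened, illustrative sample: the second row lacks its Category value
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9'],
]

bad = find_malformed_rows(header, rows)
print(bad)

# Build a filtered copy instead of deleting in place, so that
# re-running the cell cannot remove a second, valid row by accident.
clean = [row for row in rows if len(row) == len(header)]
print(len(clean))
```

Filtering into a new list also avoids a side effect of `del`: running the deletion cell twice would remove whichever row has moved into position 10472.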


#### 2.2 Duplicates.

##### 2.2.1 Checking for duplicates.
Let's check whether our datasets contain duplicates. For this purpose we will create the following function:

In [103]:
def duplicates(dataset, index, Numbers=False, Examples=False):
    unique_apps = []
    duplicate_apps = []
    for row in dataset:
        name = row[index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    if Numbers:    
        print('The number of the unique apps:', len(unique_apps))
        print('The number of the duplicate apps:', len(duplicate_apps))
        
    if Examples:
        print(duplicate_apps[0:15])

In [104]:
print('The Google Play Market dataset: ')
duplicates(google_play_store_noheader, 0, Numbers=True )

The Google Play Market dataset: 
The number of the unique apps: 9659
The number of the duplicate apps: 1181


In [105]:
print('The Apple Store dataset: ')
duplicates(apple_store_noheader, 1, Numbers=True)

The Apple Store dataset: 
The number of the unique apps: 7195
The number of the duplicate apps: 2


The results show that there are *1181* duplicate entries in the Google Play dataset and *2* in the App Store dataset. Here are some examples:

In [106]:
print('Examples of duplicates in Google Play dataset:')
print()
duplicates(google_play_store_noheader, 0, Examples=True) 


Examples of duplicates in Google Play dataset:

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [107]:
print('Examples of duplicates in Apple Store dataset:')
print()
duplicates(apple_store_noheader, 1, Examples=True) 

Examples of duplicates in Apple Store dataset:

['Mannequin Challenge', 'VR Roller Coaster']
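
The same counts can be obtained in a single pass with `collections.Counter`, which avoids the repeated list lookups of the function above. This is a sketch under assumptions: `count_duplicates` is a hypothetical name, and the sample rows are reduced to the app name only.

```python
from collections import Counter

def count_duplicates(rows, name_index):
    """Return (unique names, surplus duplicate rows, names occurring more than once)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first counts as a duplicate row
    n_duplicate_rows = sum(c - 1 for c in counts.values())
    repeated = [name for name, c in counts.items() if c > 1]
    return n_unique, n_duplicate_rows, repeated

# Illustrative sample with app names only
rows = [['Box'], ['Slack'], ['Box'], ['Box'], ['Slack'], ['Zenefits']]
print(count_duplicates(rows, 0))
```

As in the original function, an app that appears three times contributes two duplicate rows.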


As we said earlier, we need to remove the duplicate entries and keep only one entry per app. We could remove duplicate rows at random, but there is probably a better way. Let's first find out why some apps have more than one entry:

In [108]:
for i in google_play_store_noheader:
    name = i[0]
    if name == 'Instagram':
        print(i)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


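The main difference is in the fourth position of each row, which holds the number of reviews. The numbers differ because the data was collected at different times. Presumably, the row with the most reviews is the most recent one, so this will be our criterion for removing duplicates: for each app we keep the row with the highest number of reviews.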

##### 2.2.2 Deleting Duplicate entries

Let's create a dictionary from the Google Play dataset that stores the name of each app (key) with the maximum number of reviews for that app (value).

In [109]:
review_max ={}

for i in google_play_store_noheader:
    name = i[0]
    n_reviews = float(i[3])
    
    if name in review_max and review_max[name] < n_reviews:
        review_max[name] = n_reviews
    
    if name not in review_max:
        review_max[name] = n_reviews

Earlier, we found 1181 duplicate entries. Let's check how many unique entries the dictionary contains:

In [110]:
print('Expected length:', len(google_play_store_noheader)-1181)
print('Actual length:', len(review_max))

Expected length: 9659
Actual length: 9659


Now that we know the dictionary contains the expected number of unique entries, let's build a new dataset.

1. First, we create two new lists: `android_clean` (our new dataset, which will hold one row per app, the one with the highest number of reviews) and `already_added` (the names of the apps already added).

2. Then we iterate over `google_play_store_noheader`.

3. We isolate the name and the number of reviews from each row.

4. Then we check two conditions: that the number of reviews matches the maximum stored in our dictionary, and that the app's name is not yet in the `already_added` list. The second condition handles duplicates that share the same maximum number of reviews; without it, an app with two such rows would be added twice.


In [111]:
android_clean = []
already_added = []

for i in google_play_store_noheader:
    name = i[0]
    n_reviews = float(i[3])
    
    if n_reviews == review_max[name] and name not in already_added:
        android_clean.append(i)
        already_added.append(name)
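
The two passes (dictionary of maxima, then filtering) can be merged into one by storing the whole row in the dictionary. The sketch below is an alternative, not the notebook's approach; `keep_most_reviewed` is a hypothetical helper, and the `Box` row uses made-up values.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict '>' keeps the earlier row when two rows tie on the maximum
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Shortened rows: name, category, rating, reviews (the Box values are made up)
rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in keep_most_reviewed(rows, 0, 3):
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-pass version keeps the position of the row that holds the maximum.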

Now let's explore our new dataset:

In [112]:
explore_data(android_clean, 0, 3, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have exactly as many unique apps as expected.

Let's do the same with the App Store dataset. We first create a dictionary with the highest number of ratings (`rating_count_tot`) for each app, which serves as the criterion for removing duplicates.

In [113]:
appstore_dictionary = {}

for row in apple_store_noheader:
    name = row[1]
    appstore_max_reviews = float(row[5])
    
    if name not in appstore_dictionary:
        appstore_dictionary[name] = appstore_max_reviews
    
    if name in appstore_dictionary and appstore_dictionary[name] < appstore_max_reviews:
        appstore_dictionary[name] = appstore_max_reviews

print('Expected length:', len(apple_store_noheader)-2)
print('Actual length:',len(appstore_dictionary))


Expected length: 7195
Actual length: 7195


Now that the dictionary stores the highest number of ratings for each unique app, let's create a new dataset.

In [114]:
appstore_clean = []
appstore_already_added = []

for row in apple_store_noheader:
    name = row[1]
    n_reviews_appstore = float(row[5])
    
    if  appstore_dictionary[name] == n_reviews_appstore and name not in appstore_already_added:
        appstore_clean.append(row)
        appstore_already_added.append(name)
        

Let's explore our cleaned App Store dataset:

In [115]:
explore_data(appstore_clean, 0, 3, rows_and_columns=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7195
Number of columns: 16


The number of rows (7195) is as expected.

We now have two datasets with duplicates removed: `android_clean` and `appstore_clean`.

#### 2.3  Removing Non-English apps.



If we explore both datasets, we will find apps with non-English names. We should exclude such apps from our analysis because our team develops apps only for an English-speaking audience. Below are some examples of non-English apps:

In [116]:
print(appstore_clean[813][1])
print(appstore_clean[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
激ムズ！和のひとふで書き！ 〜頭をつかう脳トレパズルゲーム〜


中国語 AQリスニング
لعبة تقدر تربح DZ


Each character has a corresponding number (its code point), which the built-in function `ord()` returns. For example, the number for `'A'` is 65, for `'a'` it is 97, and for `'激'` it is 28608.

In [117]:
print(ord('5'))
print(ord('-'))
print(ord('A'))
print(ord('激'))

53
45
65
28608


The characters commonly used in English text all have numbers in the range 0 to 127, according to the [ASCII](https://en.wikipedia.org/wiki/ASCII) standard. Using this fact, we can write a function that iterates over the characters of a name and checks whether each lies in the range 0 to 127.

In [118]:
def is_english(word):    
    for i in word:
        if (ord(i)) > 127:
            return False
    return True
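
For reference, Python 3.7 and later offer the built-in string method `str.isascii()`, which performs the same strict check. The short sketch below compares it with an explicit `ord()` test; `is_ascii_only` is a hypothetical name used only for this comparison.

```python
def is_ascii_only(name):
    """Explicit version: every character must have a code point below 128."""
    return all(ord(ch) < 128 for ch in name)

names = ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播']
for name in names:
    # Both checks should agree on every string
    print(name, is_ascii_only(name), name.isascii())
```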

Now that we have a function that identifies non-English names, let's check it on some examples:

In [119]:
a = 'Instagram'
b = '爱奇艺PPS -《欢乐颂2》电视剧热播'
c = 'Docs To Go™ Free Office Suite'
d = 'Instachat 😜'

print(is_english(a))
print(is_english(b))
print(is_english(c))
print(is_english(d))


True
False
False
False


The function returned `False` for the last two examples, `'Docs To Go™ Free Office Suite'` and `'Instachat 😜'`, because these names contain the characters `'™'` and `'😜'`, which fall outside the ASCII range.

In [120]:
print(ord('™'))
print(ord('😜'))

8482
128540


To avoid discarding such apps, we need to modify our function. The new version returns `False` only if an app name has more than three characters with numbers higher than 127, so it acts as a tolerant filter.

In [121]:
def is_english(name):
    asci = 0
    for i in name:
        if ord(i) > 127: 
            asci += 1 
            
    if asci > 3:
        return False
    else:
        return True
            
        

In [122]:
a = 'Docs To Go™ Free Office Suite'
b = 'Instachat 😜'
c = '爱奇艺PPS -《欢乐颂2》电视剧热播'

print(is_english(a))
print(is_english(b))
print(is_english(c))

True
True
False
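
The threshold of three can be made an explicit parameter, which makes it easy to experiment with stricter or looser filters. This is a sketch only: `is_probably_english` and `max_non_ascii` are hypothetical names, not part of the notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

# Setting the threshold to 0 reproduces the strict ASCII-only check
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```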


Now let's use the function to filter non-English apps out of both datasets:

In [123]:
android_clean_english = []
appstore_clean_english = []

for i in android_clean:
    name = i[0]
    if is_english(name):
        android_clean_english.append(i)

for i in appstore_clean:
    name = i[1]
    if is_english(name):
        appstore_clean_english.append(i)
        
print('Cleaned from duplicates and non-english apps Google Play dataset:','\n')        
explore_data(android_clean_english, 0 , 4, rows_and_columns= True)
print('\n')
print('Cleaned from duplicates and non-english apps Appstore dataset:', '\n')
explore_data(appstore_clean_english,0, 4, rows_and_columns= True)

Cleaned from duplicates and non-english apps Google Play dataset: 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 13


Cleaned from duplicates and non-english apps Appstore dataset: 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5',

We can see that we're left with 9614 Android apps and 6183 iOS apps.

#### 2.4 Filtering free apps

Recall that our company develops only free apps and that the main source of revenue is in-app ads, so we should exclude paid apps from our analysis. We will create two new lists, `android_free` and `appstore_free`, then iterate over the existing datasets and check whether each app's price is zero. If it is, we append the whole row to the new list.


In [124]:
android_free = []
appstore_free = []

for app in android_clean_english:
    price = app[7]
    if price == '0':
        android_free.append(app)

for app in appstore_clean_english:
    price = app[4]
    if price == '0.0':
        appstore_free.append(app)
        

explore_data(android_free,0,3,rows_and_columns = True)
print('\n')
explore_data(appstore_free,0,3,rows_and_columns = True)



['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We can see that we're left with 3220 iOS apps and 8864 Android apps.
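
The two datasets encode a zero price differently (`'0'` and `'0.0'`), so the filter compares against two different strings. A numeric comparison handles both with one helper. The sketch below is hypothetical: `is_free` is not part of the notebook, and the `'$'` prefix on paid prices is an assumption about how the Google Play `Price` column is formatted.

```python
def is_free(price_text):
    """Return True if the price string represents zero."""
    # Assumption: paid prices may carry a leading '$' (for example '$4.99')
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

for price in ['0', '0.0', '$4.99', '2.99']:
    print(price, is_free(price))
```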

## 3. Most common apps

The main aim of our analysis is to determine the kinds of apps that are likely to attract more users, because our revenue depends on the number of people using our apps. Our validation strategy for an app idea consists of the following steps:

1. Build an app for Android and add it to Google Play.
2. If the app gets a good response from users, we develop it further.
3. If the app is still profitable after six months, we build an iOS version and add it to the App Store.

Let's explore our datasets for the most common genres. First, let's create a frequency table for the `prime_genre` column of the `appstore_free` dataset and frequency tables for the `Category` and `Genres` columns of the `android_free` dataset.

In [125]:
def frequency_table(dataset, index):
    freq_table = {}          # absolute counts {value: count}
    total = 0

    for row in dataset:
        total += 1
        genre = row[index]

        if genre in freq_table:
            freq_table[genre] += 1
        else:
            freq_table[genre] = 1

    percentage_table = {}    # relative values {value: percentage}

    for key in freq_table:
        percentage = round(freq_table[key] / total * 100, 2)
        percentage_table[key] = percentage

    return percentage_table  # return the dictionary with percentages

frequency_table(appstore_free, 11)

{'Social Networking': 3.29,
 'Photo & Video': 4.97,
 'Games': 58.14,
 'Music': 2.05,
 'Reference': 0.56,
 'Health & Fitness': 2.02,
 'Weather': 0.87,
 'Utilities': 2.52,
 'Travel': 1.24,
 'Shopping': 2.61,
 'News': 1.34,
 'Navigation': 0.19,
 'Lifestyle': 1.58,
 'Entertainment': 7.89,
 'Food & Drink': 0.81,
 'Sports': 2.14,
 'Book': 0.43,
 'Finance': 1.12,
 'Education': 3.66,
 'Productivity': 1.74,
 'Business': 0.53,
 'Catalogs': 0.12,
 'Medical': 0.19}

The entries are not sorted by percentage. We want a frequency table in descending order, from the highest percentage to the lowest. Called on a dictionary, the built-in function `sorted()` only considers and returns the keys. It does work well with tuples, though, so let's write a function that turns the dictionary into a list of `(percentage, key)` tuples and sorts it in descending order.


In [126]:
def display_table(dataset, index):
    table1 = frequency_table(dataset, index)      #type = dictionary
    table_display = []                            #empty list
    
    for key in table1:                    #appending each tuple to new list
        tuple1 = (table1[key], key)
        table_display.append(tuple1)
        
    table_sorted = sorted(table_display, reverse = True)  #list of tuples sorted
    
    for entry in table_sorted:
        print(entry[1],':',entry[0],'%')
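
As an alternative to building `(value, key)` tuples by hand, `sorted()` can sort the dictionary's items directly when given a `key` function. The sketch below uses three percentages from the App Store table; `sort_by_value` is a hypothetical helper name.

```python
def sort_by_value(table):
    """Return (key, value) pairs sorted by value, highest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

shares = {'Social Networking': 3.29, 'Photo & Video': 4.97, 'Games': 58.14}

for genre, share in sort_by_value(shares):
    print(genre, ':', share, '%')
```

The two approaches can order ties differently: the tuple version breaks ties by the key in reverse alphabetical order, while this version keeps tied entries in their original dictionary order.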

In [127]:
display_table(appstore_free, 11)    #appstore + genre

Games : 58.14 %
Entertainment : 7.89 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.52 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.34 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


Among free English apps in the App Store, the most common genres are Games (58.14%), Entertainment (7.89%) and Photo & Video (4.97%). Most apps are therefore designed for entertainment rather than for practical purposes. However, the fact that fun apps are the most numerous does not imply that they also have the greatest number of users: demand might not match supply.

In [128]:
display_table(android_free,1)       #Android + Category

FAMILY : 18.91 %
GAME : 9.72 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.9 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.53 %
SPORTS : 3.4 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %


Compared with the App Store, Google Play has a larger share of apps for practical purposes: alongside Family (18.91%) and Game (9.72%), the largest categories include Tools (8.46%) and Business (4.59%). The categories are also more diversified (no category has a share above 20%).

In [129]:
display_table(android_free, -4)   #android + genre

Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.59 %
Productivity : 3.89 %
Lifestyle : 3.89 %
Finance : 3.7 %
Medical : 3.53 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.1 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.25 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.93 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.6 %
Parenting : 0.5 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music : 0.2 %
R

The frequency tables we analyzed showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

## 4. The most popular Apps by Genre on the App Store

Another way to determine which genres are most popular is to compute the average number of installs per app for each genre. The Google Play dataset has an `Installs` column with this information, but the App Store dataset does not. As a proxy, we can use the total number of user ratings, stored in the `rating_count_tot` column.

We will write a function that computes the average number of ratings per genre:




In [130]:
def average_number_of_reviews(dataset, index_of_genre, index_of_number_of_ratings):
    frequency_table_genre = frequency_table(dataset,index_of_genre)
    
    dict1 = {}
    
    for genre in frequency_table_genre:
        
        Sum_of_total_ratings = 0
        Number_of_apps_in_genre = 0
        
        for app in dataset:
            genre1 = app[index_of_genre]
            number_ratings = float(app[index_of_number_of_ratings])
            
            if genre == genre1: 
                
                Sum_of_total_ratings += number_ratings
                Number_of_apps_in_genre += 1
        
        avg_number = round(Sum_of_total_ratings/ Number_of_apps_in_genre, 2)
        
        dict1[genre]=avg_number
    
    
    list1 = []
    for genre in dict1:
        value = (dict1[genre],genre)
        list1.append(value)
    
    sorted_list = sorted(list1, reverse = True)
    for i in sorted_list:
        print(i[1],':',i[0])

    
    
print ('Average number of reviews in App Store:','\n')  
average_number_of_reviews(appstore_free, 11, 5)


Average number of reviews in App Store: 

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22812.92
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0
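
The function above scans the whole dataset once per genre. The averages can also be computed in a single pass by accumulating totals and counts per genre. This is a sketch under assumptions: `average_by_group` is a hypothetical helper, and the sample rows are shortened to name, genre, and rating count.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Shortened rows: name, genre, rating count
rows = [
    ['Waze - GPS Navigation, Maps & Real-time Traffic', 'Navigation', '345046'],
    ['Google Maps - Navigation & Transit', 'Navigation', '154911'],
    ['Bible', 'Reference', '985920'],
]

print(average_by_group(rows, 1, 2))
```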


The _Navigation_ genre has the highest average number of ratings. Let's explore this genre:

In [131]:
for app in appstore_free:
    if app[11] == 'Navigation':
        print(app[1],':',app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Two apps (Waze and Google Maps) account for almost all of the ratings. These two giants skew the average upward, so it says little about a typical Navigation app. We should discount such giants in our analysis.
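
One common way to see how strongly a few large apps pull the average is to compare the mean with the median, which is far less sensitive to outliers. The sketch below uses the six Navigation rating counts printed above; using the median here is a suggestion, not a step of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(round(mean(navigation_ratings), 2))   # 86090.33, matches the table above
print(median(navigation_ratings))           # 8196.5
```

The median is roughly a tenth of the mean, which illustrates how much Waze and Google Maps dominate the genre.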

In [132]:
for app in appstore_free:
    if app[11] == 'Social Networking':
        print(app[1],':', app[5])

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

We find the same pattern in the Social Networking genre, where the giants are Facebook, Pinterest, Skype, etc. The genre also looks over-saturated. Let's explore the Reference genre:

In [133]:
for app in appstore_free:
    if app[11] == 'Reference':
        print(app[1],':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The Reference genre looks less saturated. Only two apps, Bible and Dictionary.com, have a very large number of ratings, so there seems to be room for a new app in this genre. One idea is a companion app for a popular MMORPG where users can find guides, descriptions of characters, their skills and abilities, and everything else gamers need to play that particular online game.

## 5. The most popular Apps by Genre on the Google Play market




The Google Play dataset has an `Installs` column that tells us which apps have the most installs. Let's create a frequency table for this column:

In [134]:
display_table(android_free, 5)

1,000,000+ : 15.73 %
100,000+ : 11.55 %
10,000,000+ : 10.55 %
10,000+ : 10.2 %
1,000+ : 8.39 %
100+ : 6.92 %
5,000,000+ : 6.83 %
500,000+ : 5.56 %
50,000+ : 4.77 %
5,000+ : 4.51 %
10+ : 3.54 %
500+ : 3.25 %
50,000,000+ : 2.3 %
100,000,000+ : 2.13 %
50+ : 1.92 %
5+ : 0.79 %
1+ : 0.51 %
500,000,000+ : 0.27 %
1,000,000,000+ : 0.23 %
0+ : 0.05 %
0 : 0.01 %


The data in that column is not precise: it shows open-ended buckets such as `100,000+` rather than exact install counts. To work around this, we will count `100,000+` installs as `100000` (stripping the `+` and `,` characters). This still gives us a good idea of which apps are more popular.
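The format change described above can be sketched as a small helper. The name `parse_installs` is an assumption introduced here for illustration and is not part of the original notebook; it strips the `+` and `,` characters and converts what is left to a number, so the result is the lower bound of the bucket rather than an exact install count.

```python
def parse_installs(installs):
    """Convert an installs label such as '100,000+' to a float.

    Hypothetical helper (not part of the original notebook). The value
    returned is the lower bound of the bucket, not an exact install count.
    """
    cleaned = installs.replace('+', '').replace(',', '')
    return float(cleaned)


print(parse_installs('100,000+'))        # 100000.0
print(parse_installs('1,000,000,000+'))  # 1000000000.0
print(parse_installs('0'))               # 0.0
```

Labels without a `+` or a comma, such as `0`, pass through the same code path unchanged.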

Let's now find the most popular Google Play categories by average number of installs.

In [135]:
a = frequency_table(android_free, 1)   # frequency table of the category column

list1 = []                             # (average installs, category) pairs

for category in a:
    total = 0            # sum of installs over all apps in this category
    len_category = 0     # number of apps in this category

    for app in android_free:
        category_app = app[1]

        if category == category_app:
            installs = app[5]
            installs = installs.replace('+', '')   # '100,000+' -> '100,000'
            installs = installs.replace(',', '')   # '100,000'  -> '100000'
            total += float(installs)
            len_category += 1

    avg_installs = total / len_category
    list1.append((avg_installs, category))

# Sort from the highest to the lowest average
list1 = sorted(list1, reverse=True)

for i in list1:
    print(i[1], ':', round(i[0], 2))

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16787331.34
GAME : 15588015.6
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
WEATHER : 5074486.2
HEALTH_AND_FITNESS : 4188821.99
MAPS_AND_NAVIGATION : 4056941.77
FAMILY : 3695641.82
SPORTS : 3638640.14
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1833495.15
BUSINESS : 1712290.15
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 120550.62
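
The averages above are computed by looping over the whole data set once per category. As an alternative sketch, the same figures could be accumulated in a single pass with two dictionaries. The function name `average_installs_by_category` and the sample rows below are assumptions made up for illustration; they are not part of the original notebook or of the real data set.

```python
def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return (average installs, category) pairs, highest average first.

    Hypothetical helper: makes one pass over the rows instead of one pass
    per category.
    """
    totals = {}   # category -> sum of installs
    counts = {}   # category -> number of apps
    for row in rows:
        category = row[category_index]
        installs = float(row[installs_index].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    averages = [(totals[c] / counts[c], c) for c in totals]
    return sorted(averages, reverse=True)


# Made-up rows that only mimic the column layout (name at 0, category at 1,
# installs at 5); they are not real Google Play records.
sample_rows = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '3,000+'],
    ['App C', 'GAME', '', '', '', '10,000+'],
]

for avg, category in average_installs_by_category(sample_rows):
    print(category, ':', round(avg, 2))
```

With the real data one would pass `android_free` instead of `sample_rows`; the output format matches the listing above.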


Three categories stand out as the most popular: COMMUNICATION, VIDEO_PLAYERS, and SOCIAL. On average, an app in the `COMMUNICATION` category has 38,456,119 installs, but this number is skewed upward by a few giants such as WhatsApp, Skype, and Gmail. Let's see which apps those are:

In [136]:
for app in android_free:
    category = app[1]
    installs = app[5]
    if category == 'COMMUNICATION' and (installs == '1,000,000,000+'
                                    or installs == '500,000,000+'
                                    or installs == '100,000,000+'):
        print(app[0],':',app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

Competing with such apps is unrealistic, so let's exclude them and recompute the average:


In [154]:
under_100m = []

for app in android_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace(',','')
    n_installs = n_installs.replace('+','')
    n_installs = float(n_installs)
    if category == 'COMMUNICATION' and n_installs < 100000000:
        under_100m.append(n_installs)

        
print('the average number of installs in "COMMUNICATION" genre:',round(sum(under_100m)/len(under_100m),2))


the average number of installs in "COMMUNICATION" genre: 3603485.39
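
As a rough cross-check on how strongly a few giants pull the mean upward, one could also compare the mean with the median, which is far less sensitive to extreme values. The sketch below is an assumption-laden illustration: `summarize_installs` is a hypothetical helper, the `cap` default mirrors the 100,000,000 threshold used above, and the sample values are made up rather than taken from the data set.

```python
from statistics import mean, median


def summarize_installs(install_counts, cap=100_000_000):
    """Compare the mean of all values with the median and a capped mean.

    Hypothetical helper (not part of the original notebook).
    """
    below_cap = [n for n in install_counts if n < cap]
    return {
        'mean_all': mean(install_counts),
        'median_all': median(install_counts),
        'mean_below_cap': mean(below_cap) if below_cap else None,
    }


# Made-up install counts: four ordinary apps and one giant.
sample = [1_000.0, 10_000.0, 100_000.0, 1_000_000.0, 1_000_000_000.0]
summary = summarize_installs(sample)

for key, value in summary.items():
    print(key, ':', value)
```

In this toy sample the single giant lifts the mean to roughly 200 million, while the median stays at 100,000, which is the same kind of gap the recomputed average above reveals for `COMMUNICATION`.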


In [156]:
38456119.17 - 3603485.39

34852633.78

The difference is 34,852,633.78 installs. If we explore other popular categories such as VIDEO_PLAYERS and SOCIAL, we find the same pattern: each is dominated by a few giants that are difficult to compete with. Let's look at the BOOKS_AND_REFERENCE category.

In [168]:
list3 = []     # (installs, app name) pairs for the category
total = 0      # running sum of installs
len1 = 0       # number of apps in the category

for app in android_free:
    category = app[1]
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')   # strip thousands separators
    n_installs = n_installs.replace('+', '')   # strip the trailing '+'
    n_installs = float(n_installs)
    if category == 'BOOKS_AND_REFERENCE':
        total += n_installs
        len1 += 1
        tuple3 = (n_installs, app[0])
        list3.append(tuple3)

print('Average number of installs in BOOKS_AND_REFERENCE:', total/len1, '\n')

# Sort from the most to the fewest installs
list4 = sorted(list3, reverse=True)

for app in list4:
    print(app[1], ':', app[0])

Average number of installs in BOOKS_AND_REFERENCE: 8767811.894736841 

Google Play Books : 1000000000.0
Wattpad 📖 Free Books : 100000000.0
Bible : 100000000.0
Audiobooks from Audible : 100000000.0
Amazon Kindle : 100000000.0
Wikipedia : 10000000.0
Spanish English Translator : 10000000.0
Quran for Android : 10000000.0
Oxford Dictionary of English : Free : 10000000.0
NOOK: Read eBooks & Magazines : 10000000.0
Moon+ Reader : 10000000.0
JW Library : 10000000.0
HTC Help : 10000000.0
FBReader: Favorite Book Reader : 10000000.0
English Hindi Dictionary : 10000000.0
English Dictionary - Offline : 10000000.0
Dictionary.com: Find Definitions for English Words : 10000000.0
Dictionary - Merriam-Webster : 10000000.0
Dictionary : 10000000.0
Cool Reader : 10000000.0
Aldiko Book Reader : 10000000.0
Al-Quran (Free) : 10000000.0
Al'Quran Bahasa Indonesia : 10000000.0
Al Quran Indonesia : 10000000.0
Read books online : 5000000.0
English to Hindi Dictionary : 5000000.0
Ebook Reader : 5000000.0
Dictionary 

There are giants here too, such as Google Play Books and Bible. Let's leave them out and look only at the apps in the `1,000,000+` to `50,000,000+` install buckets:

In [173]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0],':',app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

This category is dominated by dictionaries, e-book readers, and translators, so we should look for a different kind of app.

We also notice several apps built around the Quran, which suggests that building an app around a popular book can be profitable.
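
One way to back up these observations is to count how many app names contain a given keyword. The sketch below is only illustrative: `count_keyword` is a hypothetical helper, and `names` is a small hand-picked sample copied from the listing above, not the full category.

```python
def count_keyword(app_names, keyword):
    """Count the app names that contain `keyword`, ignoring case.

    Hypothetical helper (not part of the original notebook).
    """
    keyword = keyword.lower()
    return sum(1 for name in app_names if keyword in name.lower())


# A few names copied from the BOOKS_AND_REFERENCE listing above.
names = [
    'Wikipedia',
    'Cool Reader',
    'FBReader: Favorite Book Reader',
    'Moon+ Reader',
    'Aldiko Book Reader',
    'Dictionary - WordWeb',
    'Al-Quran (Free)',
    'Al Quran Indonesia',
    "Al'Quran Bahasa Indonesia",
    'All Language Translator Free',
]

for keyword in ['reader', 'dictionary', 'translator', 'quran']:
    print(keyword, ':', count_keyword(names, keyword))
```

Run against the app names in `android_free` for this category, the same counts would give a rough measure of how crowded each niche is.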