# What apps attract more users

This is a walkthrough project, with inspiration and some code taken from the DataQuest 'Data Analyst in Python' course. Code or text taken from the course is indicated as such.

This project is written from the perspective of an analyst working for an app development company that produces free, English-language apps for Google Play and the App Store. As the apps are free to download, the main source of revenue for a given app is in-app ads. This means that the more users an app has, the more profitable it becomes.

----------------------------------------------------------------

*From the course:*

This project aims to analyse data to help developers understand what types of apps are likely to attract more users.

To do this, data about mobile apps on Google Play and the App Store will be collected and analysed. Due to the volume of apps on these two stores, a sample will be collected instead of analysing all apps.

There are around 10000 Android apps in the [Google Play data set](https://www.kaggle.com/lava18/google-play-store-apps) which was collected in August 2018.

There are around 7000 iOS apps in the [App Store data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) which was collected in July 2017.

The linked pages also document the data sets, including the column headings used.

----------------------------------------------------------------

We shall begin by opening both the data sets into the notebook.

In [1]:
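import csv

# Assign the Apple Store data as a list of lists
# (the with statement closes the file once it has been read)
with open('AppleStore.csv', encoding='utf8') as opened_file:
    app_store_apps = list(csv.reader(opened_file))
# Assign the Google Play data as a list of lists
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    google_play_apps = list(csv.reader(opened_file))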

Before analysing the data, we shall explore it using the below `explore_data()` function which prints rows in a readable way.

This code and documentation is taken from the DataQuest course.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row) # Print each row
        print('\n') # adds a new line after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The `explore_data()` function:

* Takes in four parameters:
    * `dataset`, which is expected to be a list of lists.
    * `start` and `end`, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
    * `rows_and_columns`, which is expected to be a Boolean and has `False` as a default argument.
* Slices the data set using `dataset[start:end]`.
* Loops through the slice, and for each iteration, prints a row and adds a new line after that row using `print('\n')`.
    * The `\n` in `print('\n')` is a special character and won't be printed. Instead, the `\n` character adds a new line, and we use `print('\n')` to add some blank space between rows.
* Prints the number of rows and columns if `rows_and_columns` is `True`.
    * `dataset` shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).

----------------------------------------------------------------

First checking for headers in the data sets:

In [3]:
print(app_store_apps[0])
print('\n')
print(google_play_apps[0])

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Now creating lists for each without the headers to be able to use the `explore_data()` function properly:

In [4]:
ios = app_store_apps[1:]
android = google_play_apps[1:]

print(ios[0])
print('\n')
print(android[0])

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Now exploring the data sets using the `explore_data()` function:

In [5]:
print('The App Store data set:')
print('\n')
print(app_store_apps[0])
print('\n')
explore_data(ios, 0, 5, rows_and_columns=True)
print('\n')
print('The Google Play data set:')
print('\n')
print(google_play_apps[0])
print('\n')
explore_data(android, 0, 5, rows_and_columns=True)

The App Store data set:


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1

From the `'AppleStore.csv'` data, the 'track_name' column (index 2) will be useful to identify the different apps. The 'price' column (index 5) will be useful to identify which apps are free, as those are the apps we are interested in. The 'rating_count_tot' column (index 6) will be useful to determine the number of people who have rated the app and to allow a suitably weighted comparison with other apps. The 'user_rating' column (index 8) will be useful to determine which apps are the most popular. The 'prime_genre' column (index 12) will be useful to determine the type of app, which allows comparisons between app types and thus serves our goal of understanding which types of apps are most popular.

From the `'googleplaystore.csv'` data, the 'App' column (index 0) will be useful to identify the different apps. The 'Category' column (index 1) will be useful to determine the type of app which will allow comparisons of which app type is most popular. The 'Rating' column (index 2) will be useful to determine the popularity of each app. Either the 'Reviews' column (index 3) or the 'Installs' column (index 5) will be useful to provide a weighted comparison between the apps and 'Installs' could act as another measure of popularity. The 'Price' column (index 7) will be useful to determine which apps are free as those are the apps we will be interested in.
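The column indices listed above can also be looked up from the header rows instead of being counted by hand. The following is a minimal sketch using the two header rows printed earlier; `column_index` is a hypothetical helper name and not part of the course.

```python
# Header rows copied from the output of the header check above
ios_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']


def column_index(header, column_name):
    """Return the position of column_name in a header row (hypothetical helper)."""
    return header.index(column_name)


for name in ['track_name', 'price', 'rating_count_tot', 'user_rating', 'prime_genre']:
    print(name, ':', column_index(ios_header, name))

print('\n')

for name in ['App', 'Category', 'Rating', 'Reviews', 'Installs', 'Price']:
    print(name, ':', column_index(android_header, name))
```

Looking indices up by name would raise a `ValueError` for a misspelt column, which makes this kind of mistake easier to spot than a wrong hard-coded number.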

----------------------------------------------------------------

*From the course:*

Before analysing the data, it is important to ensure the data being analysed is accurate so that the analysis isn't wrong. There are several things to be done to clean the data:

* Detect inaccurate data and correct or remove it.
* Detect duplicate data and remove the duplicates.

As the company builds free apps that are in English it is pertinent to remove any non-English apps and any apps that aren't free to download.

---

*From the course:*

Read the discussion sections of the provided data sets to look for any reports of incorrect data, and remove the affected rows using the `del` statement.

In [6]:
# For the Google Play data, error reported on row 10472 (with no header):

print(android[10472])
print(google_play_apps[10473])

print(len(android[0]))
print(len(android[10472]))



['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
13
12


Row 10472 of the Google Play data is missing its category, which shifts the remaining values one column to the left. This makes the row incompatible with the analysis, so it needs to be removed.
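Rows like this one could also be found without relying on a reported error, by comparing the length of every row with the length of the header. This is a rough sketch on a tiny made-up sample; `find_malformed_rows` is a hypothetical helper, not something provided by the course.

```python
def find_malformed_rows(dataset, expected_length):
    """Return (index, row length) pairs for rows with an unexpected number of columns."""
    malformed = []
    for index, row in enumerate(dataset):
        if len(row) != expected_length:
            malformed.append((index, len(row)))
    return malformed


# Tiny made-up sample: the second row is missing its category
sample = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'FAMILY', '4.0'],
]

print(find_malformed_rows(sample, expected_length=3))
```

On the data in this notebook, the call would presumably look like `find_malformed_rows(android, len(google_play_apps[0]))`.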

In [7]:
# This code has been commented out to prevent further rows being accidentally deleted.

## Delete the incorrect rows
# del android[10472]
# del google_play_apps[10473]

# print(android[10472])
# print(google_play_apps[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


According to another report, row 9148 (without the header row) has an incorrect app type for a free app, which causes inaccuracies in certain cases.

In [8]:
print(android[9148])
print(google_play_apps[9149])

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']
['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']


Instead of removing this row, it can be corrected:

In [9]:
# Replace the incorrect app type with 'Free'

android[9148][6] = 'Free'
google_play_apps[9149][6] = 'Free'

print(android[9148])
print(google_play_apps[9149])

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'Free', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']
['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'Free', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']


There appears to be no incorrect data in the Apple data but there are some duplicates reported.

*From the course:*

Both data sets have some duplicate data. It is therefore useful to count the number of duplicates and to separate the duplicates from the unique apps. 

When deciding which of the duplicates to use, the one with the highest number of reviews is likely to be the most recent and as such, this is the duplicate that shall be kept. 

In [10]:
# Proving the existence of duplicates in the data
# Code inspired by course

def dup_check(data_set, name_index):
    
    unique_names = []
    duplicate_names = []

    for row in data_set:
        # If the app is not a duplicate, add it to the  unique list
        if row[name_index] not in unique_names:
            unique_names.append(row[name_index])
            
        # If it is a duplicate, add it to the duplicate list
        else:
            duplicate_names.append(row[name_index])
    
    return unique_names, duplicate_names


In [11]:
# Checking for duplicates in the Apple Store and Google Play data

unique_apple, duplicate_apple = dup_check(ios, 1) # By id not name

unique_android, duplicate_android = dup_check(android, 0)

print(duplicate_apple)
print('\n')
print(len(duplicate_android))
print('\n')
print(duplicate_android)


[]


1181


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite', 'TurboScan: scan documents and receipts in PDF', 'Tiny Scanner Pro: PDF Doc Scan', 'Docs To Go™ Free Office Suite', 'OfficeSuite : Free Office + 

From the above, it is confirmed that there are no duplicates in the Apple Store data set and 1181 in the Google Play data set.


In [12]:
print(len(unique_android))

9659


*From the course:*

To account for the duplicates, a dictionary of the unique apps will be created, with each value corresponding to the highest number of reviews for the given app.

This information will then be used to create a new data set which will have no duplicates.
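The same idea can be sketched in a single pass by storing the best row seen so far for each app name. This is only an illustrative alternative on made-up rows (the review counts are invented), and `keep_most_reviewed` is a hypothetical name; the order of the resulting rows may differ from the approach used in this notebook.

```python
def keep_most_reviewed(data_set, name_index, reviews_index):
    """Keep one row per app name: the first row seen with the highest review count."""
    best_rows = {}
    for row in data_set:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Store the row if the app is new or this row has strictly more reviews
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Made-up rows in the form [name, category, rating, reviews]
sample = [
    ['Box', 'BUSINESS', '4.2', '159518'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

for row in keep_most_reviewed(sample, 0, 3):
    print(row)
```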

---

To do this for both data sets, a function has been created based on the course guidance to produce a dictionary with the maximum number of reviews for each app and then a second function which removes duplicates, leaving the row with the largest number of reviews.

In [13]:
# Alternate to course guidance to allow for both data sets

# Define function to produce max reviews dictionary

def max_reviews(data_set, name_index, reviews_index):
    
    reviews_max = {}

    for row in data_set:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        
        # Check if the app is a duplicate and if so compare to the already added duplicate
        # if greater n_reviews than the one added, replace it
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
            
        # If not in the dictionary, add it
        elif name not in reviews_max:
            reviews_max[name] = n_reviews

    
    return reviews_max

In [18]:
# Alternate to course guidance to allow for both data sets

# Define function to remove duplicates

def remove_duplicates(data_set, name_index, reviews_index, mx_rvw_dct):

    clean = []
    already_added = [] # To keep track of the apps added in case of duplicate having same number of reviews twice

    for row in data_set:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        
        # Check if the number of reviews for the app is the max number of reviews and hasn't been added already
        if n_reviews == mx_rvw_dct[name] and name not in already_added:
            clean.append(row)
            already_added.append(row[name_index])
        
    return clean

In [None]:
# Commented out due to above alternative
# # With course guidance

# reviews_max = {}

# for row in android:
#     name = row[0]
#     n_reviews = float(row[3])
    
#     if name in reviews_max and reviews_max[name] < n_reviews:
#         reviews_max[name] = n_reviews
#     elif name not in reviews_max:
#         reviews_max[name] = n_reviews


# # Check to see if dictionary has worked
# x = 0
    
# for row in android:
#     if x <= 3:
#         print (row[0], reviews_max[row[0]])
#         x +=1

# print('\n')
# print(len(reviews_max) == len(unique_android))

In [None]:
# Commented out due to above alternative
# # With course guidance
# # Use dict. to remove duplicates

# android_clean = []
# already_added = [] # To keep track of the apps added in case of duplicate having same number of reviews twice

# for row in android:
#     name = row[0]
#     n_reviews = float(row[3])
    
#     if n_reviews == reviews_max[name] and name not in already_added:
#         android_clean.append(row)
#         already_added.append(row[0])

# print(len(android_clean) == len(unique_android))
# print('\n')
# print(android_clean[:3])

Problem with max_reviews function to fix...

In [21]:
# Use the functions to remove duplicates in the Google Play data set

android_reviews_max = max_reviews(android, 0, 3)

print(len(android_reviews_max))

clean_android = remove_duplicates(android, 0, 3, android_reviews_max)

print(len(clean_android))

# Check same length as unique list

print(len(clean_android) == len(unique_android))
print('\n')
print(clean_android[:3])

9659
9659
True


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


*From the course*

The aim is to understand the popularity of apps aimed at an English-speaking audience, so it is a good idea to remove apps with names that suggest a non-English-speaking audience.

One way to do this is to remove apps whose names contain symbols not commonly seen in English text. The characters commonly used in English text all correspond to numbers in the range 0 to 127 in the ASCII system. Based on this number range and using the `ord()` built-in function, we can build a function that detects whether a character belongs to this range.
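As a quick illustration of the range check, the sketch below prints the code of a few characters and whether each falls within 0 to 127. The `is_ascii` helper is a hypothetical name used only for this example.

```python
def is_ascii(character):
    """Return True if the character's code is within the ASCII range 0 to 127."""
    return ord(character) <= 127


# A mix of common English characters and characters outside the ASCII range
for character in ['a', 'Z', '5', '+', '爱', '™']:
    print(repr(character), ord(character), is_ascii(character))
```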

---

In [25]:
# With course guidance
# Function that returns False if the name contains more than 3 characters
# that aren't common in English text

def eng_check(name):
    
    non_ascii = 0
    
    for letter in name:
        # Count the letter if its code is outside the ASCII range (> 127)
        # Modified to allow for up to 3 non-English characters
        if ord(letter) > 127:
            non_ascii += 1
    
    # Modified
    # If ran through all letters and <= 3 non-English, return True
    # Else return False
    return non_ascii <= 3

In [23]:
# Check the function works

print(eng_check('Instagram'))
print(eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_check('Docs To Go™ Free Office Suite'))
print(eng_check('Instachat 😜'))

True
False
False
False


In [24]:
print(eng_check('™'))
print(eng_check('😜'))

False
False


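Clearly, emojis and superscript symbols such as '™' (and therefore, presumably, subscripts) are not recognised as English characters. These outputs come from the first version of the function, which returned `False` for any name containing even one character outside the ASCII range. With more time, the function could be expanded to allow for such characters.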
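One possible expansion, sketched here as an assumption rather than anything from the course, is to count only non-ASCII characters that Unicode classifies as letters, so that symbols and emojis are ignored. The name `eng_check_letters` and its `max_foreign_letters` parameter are hypothetical. Note that accented Latin letters such as 'é' would still be counted by this sketch.

```python
import unicodedata


def eng_check_letters(name, max_foreign_letters=0):
    """Hypothetical variant of eng_check(): count only non-ASCII letters."""
    foreign_letters = 0
    for character in name:
        # Unicode categories starting with 'L' are letters; symbols such as
        # the trade mark sign and emojis belong to other categories
        if ord(character) > 127 and unicodedata.category(character).startswith('L'):
            foreign_letters += 1
    return foreign_letters <= max_foreign_letters


print(eng_check_letters('Instagram'))
print(eng_check_letters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_check_letters('Docs To Go™ Free Office Suite'))
print(eng_check_letters('Instachat 😜'))
```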

*From the course*

To minimise the impact of the above observation (made independently of the course), the function can be modified to return `False` only if a name contains more than 3 non-English characters.

In [26]:
# Check modified function

print(eng_check('Instagram'))
print(eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_check('Docs To Go™ Free Office Suite'))
print(eng_check('Instachat 😜'))

True
False
True
True


In [27]:
# Course suggestion but no guidance
# Create a function to clean data of non-English apps

def eng_filter(data_set, name_index):
    
    eng_data_set = []
    
    # Loop through each row
    for row in data_set:
        
        # If name is English, add to new data set
        
        if eng_check(row[name_index]):
            eng_data_set.append(row)
    
    return eng_data_set

In [30]:
# Check the new function works

eng_ios = eng_filter(ios, 2)
eng_android = eng_filter(clean_android, 0)

print(len(eng_ios))
print(len(eng_android))

# Counters of number of english and non-English apps in each cleaned list
apple_eng_apps = 0
apple_non_apps = 0
google_eng_apps = 0
google_non_apps = 0

# Checking names
for row in eng_ios:
    if eng_check(row[2]):
        apple_eng_apps += 1
    else:
        apple_non_apps += 1
        
for row in eng_android:
    if eng_check(row[0]):
        google_eng_apps += 1
    else:
        google_non_apps += 1
        
print(apple_non_apps)
print(google_non_apps)


6183
9614
0
0


This check shows that the functions have worked.

---

*From the course*

The final stage of cleaning is to isolate the free apps.

In [38]:
# Course suggestion but no guidance
# Function to check for and isolate free apps

def is_free(data_set, price_index):
    
    free_apps = []
    paid_apps = []
    
    for row in data_set:
        # If the app costs 0.0 (ie is free) add to free_apps list
        if row[price_index] == '0.0' or row[price_index] == '0':
            free_apps.append(row)
        
        # Otherwise add to paid_apps list
        else:
            paid_apps.append(row)
    
    return (free_apps, paid_apps)

In [39]:
# Apply function and print a few rows to check if it has worked

free_ios, paid_ios = is_free(eng_ios, 5)

free_android, paid_android = is_free(eng_android, 7)

print(len(free_ios), len(free_android))
print('\n')
print(free_ios[:3])
print('\n')
print(free_android[:3])

3222 8864


[['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']]


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0

The number of apps left in each list is consistent with the solution programme from DataQuest, which suggests that the cleaning code is correct.

---

*From the course:*

As mentioned in the introduction, this project aims to determine the kinds of apps that are likely to attract more users, as the company's revenue is highly influenced by the number of people using its apps. The company develops apps with the aim of launching on both Google Play and the Apple Store. As such, it is important to analyse both markets.

To do this, we will first determine what the most common app genres are on each market. This will be done by building frequency tables.

In [40]:
print(app_store_apps[0])
print('\n')
print(google_play_apps[0])

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


From the headers, it can be seen that the `'prime_genre'` column (index 12) from the Apple Store data set and the `'Genres'` column (index 9) and `'Category'` column (index 1) from the Google Play data set will be needed to identify the most common genres.

*From the course*

Two functions will be built, one to build frequency tables which show percentages and then a second using the `sorted()` function to sort the frequency table into descending order.

However, the `sorted()` function only sorts the keys of a dictionary, so a list of tuples will be created, where each tuple contains a dictionary key along with its corresponding value. This second function has been provided by the course, as indicated in the comment at the top of its code cell.
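The behaviour of `sorted()` described here can be seen on a small dictionary. The percentages below are illustrative values only, and the variable names are chosen for this sketch.

```python
# Illustrative percentages for three genres
table = {'Games': 58.16, 'Education': 3.66, 'Entertainment': 7.88}

# Sorting the dictionary directly only sorts its keys (alphabetically here)
print(sorted(table))

# Putting the value first in each tuple makes sorted() order by value
table_display = [(table[key], key) for key in table]
table_sorted = sorted(table_display, reverse=True)
print(table_sorted)
```

Tuples are compared element by element, which is why placing the percentage first yields a list ordered by percentage.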

In [42]:
# Course suggestion but no guidance

def freq_table(data_set, genre_index):
    
    table = {}
    
    # Create frequency table
    for row in data_set:
        
        genre = row[genre_index]
        
        if genre in table:
            table[genre] += 1
        
        else:
            table[genre] = 1
    
    # Convert to percentages
    for key in table:
        
        table[key] /= (len(data_set) / 100)
    
    return table
    

The `display_table()` function:

* Takes in two parameters: `data_set` and `genre_index`. `data_set` is expected to be a list of lists and `genre_index` is expected to be an integer.

* Generates a frequency table using the `freq_table()` function.

* Transforms the frequency table into a list of tuples, then sorts the list into descending order.

* Prints the entries of the frequency table in descending order.

In [50]:
# Code provided by the course

def display_table(data_set, genre_index):
    
    table = freq_table(data_set, genre_index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        # Rounding added independently
        print(entry[1], ':', format(entry[0], '.2f'))
 

In [51]:
# Display sorted tables

print('Apple Store genres:')
print('\n')
display_table(free_ios, 12)
print('\n')
print('Google Play genres:')
print('\n')
display_table(free_android, 9)
print('\n')
print('Google Play categories:')
print('\n')
display_table(free_android, 1)

Apple Store genres:


Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Google Play genres:


Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.70
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.10
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.80
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.40
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & 

*With guidance from the course*

For **free** apps on the Apple Store aimed at **English-speaking audiences**:

* The most common genre is 'Games' at 58.16% of the market, followed by 'Entertainment' at 7.88% of the market.

* It appears that other than 'Games' there are a number of different genres that all have a relatively small market share.

* Over 65% of apps on the App Store for the defined market are for entertainment purposes. This suggests that an entertainment app, and in particular a gaming app, could be a good app profile. However, because there are so many gaming apps, each individual game may have fewer users than apps of other genres, since users are spread across many alternatives, whereas genres with fewer apps concentrate their users on fewer apps. As such, an entertainment or social networking app may be a better profile than a gaming app due to less competition.

For **free** apps on Google Play aimed at **English-speaking audiences**:

* The most common category is 'Family' at 18.91% of the market, followed by 'Game' at 9.72%. In the more detailed genres list, 'Tools' is the largest with 8.45% of the market, followed by 'Entertainment' at 6.07%.

* Contrary to the Apple Store data, there is no one genre with a much larger number of apps than the others but again, there are a number of genres which all hold a small percentage of the market.

* As with the Apple Store data for this market, entertainment type apps do hold a large market share, but utility and lifestyle type apps also hold a large market share with no clear leader.

* It is not possible to recommend an app profile based off this analysis of the Google Play data as there is no clear market leader and this does not take into account the number of users, just the number of apps in each genre.

As noted in the Google Play analysis, it isn't possible to confidently recommend an app profile from this analysis alone, as it only accounts for the number of apps in each genre and not for the total number of users of each genre or the average number of users per app in a genre.

---

*From the course*

Now we want to find out which genres are the most popular. To do this, the average number of users for each app genre should be calculated. For the Google Play data set, the `'Installs'` column (index 5) can be used. However, there is no such information in the Apple Store data set, so the `'rating_count_tot'` column (index 6) will be used as an approximation of the number of users, assuming the number of ratings is proportional to the number of users of each app.

To do this, the number of users of each genre will be divided by the number of apps in each genre.
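The division described above can be sketched on a few made-up rows. This version accumulates totals and counts in a single pass, which is an assumption about one convenient way to do it rather than the course's method; `average_by_genre` is a hypothetical name.

```python
def average_by_genre(data_set, genre_index, users_index):
    """Return {genre: average of the users column} in a single pass over the rows."""
    totals = {}
    counts = {}
    for row in data_set:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0) + float(row[users_index])
        counts[genre] = counts.get(genre, 0) + 1
    # Average = total users in the genre / number of apps in the genre
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Made-up rows in the form [name, genre, rating count]
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]

print(average_by_genre(sample, 1, 2))
```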

In [61]:
# With course guidance
# Edited to allow for sorting

# Create frequency table for the genres
ios_genres = freq_table(free_ios, 12)

ios_users_genres = {}

# Loop over unique genres
for genre in ios_genres:
    
    total = 0
    len_genre = 0
    for row in free_ios:
        # Check to see if the app is in the genre
        # If so, add the number of users to total, and increase len_genre by 1
        if row[12] == genre:
            total += float(row[6])
            len_genre += 1
    
    avg = total / len_genre
    
    ios_users_genres[genre] = avg

# Using code from display_table()
users_table_display = []
    
for key in ios_users_genres:
    key_val_as_tuple = (ios_users_genres[key], key)
    users_table_display.append(key_val_as_tuple)

users_table_sorted = sorted(users_table_display, reverse = True)

for entry in users_table_sorted:
    # Rounding added independently
    print(entry[1], ':', format(entry[0], '.2f'))
    
    

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.50
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.80
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.90
Games : 22788.67
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.00
Medical : 612.00


In [65]:
# With guidance from soln programme
# Check for skewness

# Print the name and rating count of every app in each genre of interest
for genre in ['Navigation', 'Reference', 'Social Networking', 'Music', 'Book']:
    print(genre)
    print('\n')

    for row in free_ios:
        if row[12] == genre:
            print(row[2], ':', row[6])

    print('\n')



Navigation


Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


Reference


Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Bes

*With guidance from soln programme*

It can be seen that 'Navigation' apps have the highest average number of users, followed by 'Reference' and 'Social Networking'. It is important to check for skew in the data, as this gives an indication of how easy it is to enter the market competitively. For example, in 'Navigation', Waze and Google Maps are both incredibly popular and skew the average, while the other 'Navigation' apps have very few users. Similar scenarios can be seen in the other genres printed above.
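One way to put a number on this skew is to compare the mean with the median. The following is a minimal sketch that uses the six 'Navigation' rating counts printed above; the variable names are illustrative and are not part of the original analysis.

```python
import statistics

# Rating counts for the six free 'Navigation' apps, copied from the output above
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

mean_ratings = statistics.mean(navigation_ratings)
median_ratings = statistics.median(navigation_ratings)

# Share of all 'Navigation' ratings held by the two largest apps
top_two = sorted(navigation_ratings, reverse=True)[:2]
top_two_share = sum(top_two) / sum(navigation_ratings)

print('Mean   :', format(mean_ratings, '.2f'))     # 86090.33
print('Median :', format(median_ratings, '.2f'))   # 8196.50
print('Top two share :', format(top_two_share, '.1%'))  # 96.8%
```

A mean that is roughly ten times the median, with two apps holding almost all of the ratings, is consistent with the skew described above.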

With the skewness in mind, it could be recommended to produce an app in the 'Book' genre. It has a relatively large average usership, and this overlaps with the earlier suggestion to produce an entertainment-type app.

*From the course*

For the Google Play data set, it is possible to use the `'Installs'` column. However, instead of an exact number of installs, the column is organised into open-ended groups. For the purpose of this exercise, the number of installs will be taken as the lower end of the group. For example, for the 50,000+ group, the number of installs will be taken to be 50,000.

To perform the analysis, each group will have to be converted from a string to a float, which requires first removing the commas and plus signs.

To do this, the `str.replace(old, new)` method can be used.
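As a minimal sketch of this conversion, the two replacements can be chained before calling `float()`. The helper name `installs_to_float` is hypothetical and is only used here for illustration.

```python
def installs_to_float(installs):
    """Convert an install group such as '50,000+' to its lower bound as a float."""
    cleaned = installs.replace('+', '').replace(',', '')
    return float(cleaned)

print(installs_to_float('50,000+'))         # 50000.0
print(installs_to_float('1,000,000,000+'))  # 1000000000.0
```

Because `str.replace()` returns a new string rather than changing the original, the result has to be assigned or passed on, as is done here.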

As this is a more general analysis, we will use the `'Category'` column rather than the `'Genres'` column, which is more granular.

In [66]:
# With course guidance

android_genres = freq_table(free_android, 1)

android_users_genres = {}

# Loop over unique genres
for genre in android_genres:
    
    total_android = 0
    len_genre_android = 0
    for row in free_android:
        # If the app is in this genre, add its installs to total_android
        # and increase len_genre_android by 1
        if row[1] == genre:
            # Remove '+' and ',' so the string can be converted to a float
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total_android += float(installs)
            len_genre_android += 1
    
    avg_android = total_android / len_genre_android
    
    android_users_genres[genre] = avg_android

# Using code from display_table()
users_table_display_android = []
    
for key in android_users_genres:
    key_val_as_tuple_android = (android_users_genres[key], key)
    users_table_display_android.append(key_val_as_tuple_android)

users_table_sorted_android = sorted(users_table_display_android, reverse = True)

for entry in users_table_sorted_android:
    # Rounding added independently
    print(entry[1], ':', format(entry[0], '.2f'))

COMMUNICATION : 38456119.17
VIDEO_PLAYERS : 24727872.45
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.40
PRODUCTIVITY : 16787331.34
GAME : 15588015.60
TRAVEL_AND_LOCAL : 13984077.71
ENTERTAINMENT : 11640705.88
TOOLS : 10801391.30
NEWS_AND_MAGAZINES : 9549178.47
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
WEATHER : 5074486.20
HEALTH_AND_FITNESS : 4188821.99
MAPS_AND_NAVIGATION : 4056941.77
FAMILY : 3695641.82
SPORTS : 3638640.14
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1833495.15
BUSINESS : 1712290.15
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 120550.62


*With guidance from soln programme*

If the top genres here are tested for skewness, the results are similar to those found for the App Store data set.
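One possible way to run such a test is to recompute a genre's average while leaving out the very largest apps. The sketch below assumes the same row layout as `free_android` (category at index 1, installs at index 5). The helper `average_installs`, the `max_installs` threshold and the sample rows are all made up for illustration and are not taken from the data set.

```python
def average_installs(rows, category, max_installs=None):
    """Average installs for a category, optionally ignoring apps above max_installs."""
    total = 0.0
    count = 0
    for row in rows:
        if row[1] != category:
            continue
        installs = float(row[5].replace('+', '').replace(',', ''))
        # Skip the giants when a threshold is given
        if max_installs is not None and installs > max_installs:
            continue
        total += installs
        count += 1
    return total / count if count else 0.0

# Made-up rows for illustration only (name, category, ..., installs at index 5)
sample_rows = [
    ['Giant Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '10,000+'],
    ['Tiny Dialer', 'COMMUNICATION', '', '', '', '50,000+'],
]

full_average = average_installs(sample_rows, 'COMMUNICATION')
capped_average = average_installs(sample_rows, 'COMMUNICATION', max_installs=100000000)

print(format(full_average, '.2f'))
print(format(capped_average, '.2f'))  # 30000.00
```

If the capped average for a genre is far below the full average, a handful of very large apps are driving the figure, which is the same pattern seen in 'Navigation' on the App Store.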

Again, the `'BOOKS_AND_REFERENCE'` genre seems to have a fairly high average usership and a relatively high number of apps, and it fits the suggestion of an entertainment-type app.

Choosing this genre also helps with developing a single app for both the App Store and Google Play.

This suggests that an ideal type of app to build would be one in the book genre, i.e. an e-reader type app.