**Analysing Free and Paid Apps in both the Google Play and Apple App Store**

We analyse the characteristics (price, ratings, category, etc.) of applications on both the Google Play and Apple App Stores. We then attempt to find trends that define both successful and less successful apps to be used in the development of our own company's apps.

In [1]:
# Define a reading function for displaying and exploring the data sets

def explore_data(dataset, start, end, rows_and_columns=False):
    """Print the rows dataset[start:end].

    If rows_and_columns is True, also print the number of rows and columns.
    """
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
    



In [2]:
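from csv import reader

# Read each CSV file into a list of rows; row 0 of each list is the header
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apple_apps_data = list(reader(apple_file))

with open('googleplaystore.csv', encoding='utf8') as google_file:
    google_apps_data = list(reader(google_file))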

In [3]:
print('Apple Store Sample')
explore_data(apple_apps_data, 0, 3)

print('Google Play Store Sample')
explore_data(google_apps_data, 0, 3)


Apple Store Sample
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Google Play Store Sample
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyon

Data Cleanup
-------------------

We begin the cleanup of our data by removing erroneous data, duplicates, etc.

**Google Play Store**

In [4]:
explore_data(google_apps_data, 10473, 10474)
# Remove the 'Life Made' row, which is missing its 'Category' entry.
# Run this cell only once: re-running it would delete a valid row.
del google_apps_data[10473]
explore_data(google_apps_data, 10473, 10474)


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
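
The row at index 10473 stands out because it has 12 fields where the header has 13. A minimal sketch of how such rows could be located automatically, assuming a malformed row always differs in length from the header. The helper `find_malformed_rows` and the toy data are hypothetical and not part of the original notebook:

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])  # the header defines the expected column count
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Toy data (hypothetical): the second data row is missing its category field
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

A length check like this would not catch rows where a field is wrong but present, so it complements manual inspection rather than replacing it.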




We define a function that finds the duplicate entries within a given dataset. By iterating through every row, we first check if the name of the row's app has appeared before by checking if it is in the 'unique_names' list. If it is not, we append it to the list and move on to the next row. Otherwise, we check if it is already registered in the 'duplicates' dictionary. If yes, we add one to the duplicate entries count. If no, we register it as a new entry with an initial count of 1.


In [5]:
def find_duplicates(data_set, number=False, store='google'):
    """Return a dictionary mapping each duplicated app to its number of extra entries.

    If number is True, return ONLY the total number of duplicate entries.
    """
    unique_names = []
    duplicates = {}
    
    for row in data_set:
        if store == 'google':
            name = row[0]  # 'App' column
        elif store == 'apple':
            name = row[0]  # 'id' column (the unique app identifier, not 'track_name')
            
        if name not in unique_names:
            unique_names.append(name)
        else:
            if name in duplicates:
                duplicates[name] += 1
            else:
                duplicates[name] = 1
    
    # Return only the total number of duplicate entries
    if number:
        return sum(duplicates.values())
    
    return duplicates


In [6]:
google_apps_duplicates = find_duplicates(google_apps_data)
print('The following are the duplicate apps in our Google Play Apps dataset:\n')
print(google_apps_duplicates)

google_apps_duplicates_count = find_duplicates(google_apps_data, number=True)
print('The total number of duplicate entries is:', google_apps_duplicates_count)


The following are the duplicate apps in our Google Play Apps dataset:

{'Quick PDF Scanner + OCR FREE': 2, 'Box': 2, 'Google My Business': 2, 'ZOOM Cloud Meetings': 1, 'join.me - Simple Meetings': 2, 'Zenefits': 1, 'Google Ads': 2, 'Slack': 2, 'FreshBooks Classic': 1, 'Insightly CRM': 1, 'QuickBooks Accounting: Invoicing & Expenses': 2, 'HipChat - Chat Built for Teams': 1, 'Xero Accounting Software': 1, 'MailChimp - Email, Marketing Automation': 1, 'Crew - Free Messaging and Scheduling': 1, 'Asana: organize team projects': 1, 'Google Analytics': 1, 'AdWords Express': 1, 'Accounting App - Zoho Books': 1, 'Invoice & Time Tracking - Zoho': 1, 'Invoice 2go — Professional Invoices and Estimates': 1, 'SignEasy | Sign and Fill PDF and other Documents': 1, 'Genius Scan - PDF Scanner': 1, 'Tiny Scanner - PDF Scanner App': 1, 'Fast Scanner : Free PDF Scan': 1, 'Mobile Doc Scanner (MDScan) Lite': 1, 'TurboScan: scan documents and receipts in PDF': 1, 'Tiny Scanner Pro: PDF Doc Scan': 1, 'Docs To 

The total number of duplicate entries is: 1181
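
The same counts can also be obtained with `collections.Counter` from the standard library. This is a sketch of an alternative, not the method used above; `count_duplicates` and the sample rows are hypothetical:

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Map each repeated name to the number of extra entries it has."""
    counts = Counter(row[name_index] for row in dataset)
    # An app seen n times has n - 1 duplicate entries
    return {name: n - 1 for name, n in counts.items() if n > 1}


sample = [['Slack'], ['Box'], ['Slack'], ['Slack'], ['Zoom']]
duplicates = count_duplicates(sample)
print(duplicates)
print(sum(duplicates.values()))
```

Because `Counter` uses hashing, this avoids the repeated list lookups of the `unique_names` approach, which may matter on larger datasets.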


Having identified the duplicates in the dictionary above, we now remove the duplicate entries for each app from our dataset. For each app we keep only the most recent entry, which we identify by its number of reviews. We assume that reviews are not removed over time, so the entry with the most reviews is the latest one.


In [7]:
def find_max_review_entry(data_set):
    """function that returns a dictionary with keys = name of app
    and value = the highest review count among all its entries"""
    reviews_max = {}
    for row in data_set[1:]:
        name = row[0]
        n_reviews = float(row[3])
        if name not in reviews_max:
            reviews_max[name] = n_reviews
        elif n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews
                
    return reviews_max

def remove_duplicates(data_set):
    """function to remove duplicates from a data_set and
    keep only the entry with the highest review count and return the
    cleaned dataset"""

    cleaned_list, already_added = [], []
    
    max_reviews_list = find_max_review_entry(data_set)
    
    for row in data_set[1:]:
        name = row[0]
        n_reviews = float(row[3])
        if (name not in already_added) and (n_reviews == max_reviews_list[name]):
            cleaned_list.append(row)
            already_added.append(name)
    
    return cleaned_list
        


In [8]:
print(len(find_max_review_entry(google_apps_data)))

9659


In [9]:
cleaned_google_apps_data = remove_duplicates(google_apps_data)
print(len(cleaned_google_apps_data))

9659
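
The `already_added` check matters when two entries of the same app tie on review count: without it, both rows would pass the `n_reviews == max_reviews_list[name]` test. A small sketch on toy data illustrating the same rule, using a hypothetical single-pass helper `keep_max_reviews` rather than the functions above:

```python
def keep_max_reviews(rows, name_index=0, reviews_index=1):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample = [
    ['Slack', '100'],
    ['Box', '50'],
    ['Slack', '250'],
    ['Box', '50'],  # tie with the earlier 'Box' row
]
deduped = keep_max_reviews(sample)
print(deduped)
```

Unlike `remove_duplicates`, this version orders apps by first appearance rather than by the position of the retained row.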


Since our company only develops apps targeting an English-speaking audience, we want to analyse only apps aimed at the same audience. We use an English-language filter that checks the name of each app to decide whether that app is targeted at English speakers. If so, we keep that data point in our dataset.


In [10]:
def check_english(string):
    """Check whether a string is (probably) English.

    Returns False if more than 3 characters are outside the ASCII range.
    """
    
    count = 0
    
    for char in string:
        if ord(char) > 127:  # ord() is never negative, so one bound suffices
            count += 1
            if count == 4:
                return False
            
    return True

def filter_dataset_english(data_set, store='google'):
    """filters out and removes all apps that DON'T target
    an English-speaking audience"""
    
    filtered_dataset = []
    
    for row in data_set:
        if store == 'google':
            name = row[0]
        elif store == 'apple':
            name = row[1]
            
        if check_english(name):
            filtered_dataset.append(row)
    
    return filtered_dataset
    



In [11]:
cleaned_google_apps_data = filter_dataset_english(cleaned_google_apps_data, store='google')
print(len(cleaned_google_apps_data))


9614


Our company produces apps that are free to download and install, and earns its revenue from in-app ads, so we want to filter out every app in our dataset that is not free. We prepare a filter function that returns a list of the free apps.


In [12]:
def filter_dataset_free(data_set, store='google'):
    """Return only the free apps in data_set.

    Set store='google' for Google Play data or store='apple' for Apple Store data.
    """
    
    filtered_dataset = []
    
    if store == 'google':
        for row in data_set:
            price = row[7]
            if price == '0':
                filtered_dataset.append(row)
        
        return filtered_dataset

    elif store == 'apple':
        for row in data_set:
            price = row[4]
            if price == '0.0':
                filtered_dataset.append(row)
        
        return filtered_dataset    
                

In [13]:
cleaned_google_apps_data = filter_dataset_free(cleaned_google_apps_data, store='google')
print(len(cleaned_google_apps_data))

8864
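
Comparing against the exact strings `'0'` and `'0.0'` works for these two files but depends on how each store formats its prices. A slightly more tolerant sketch, assuming paid prices may carry a leading `$` (as in `'$4.99'`); `is_free` is a hypothetical helper and not part of the notebook:

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices (e.g. a misplaced field) are treated as not free
        return False


print(is_free('0'), is_free('0.0'), is_free('$4.99'), is_free('Everyone'))
```

With a check like this, one function could serve both stores without branching on the price format.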


**Apple App Store**

We reuse the functions prepared for the Google Play Store cleanup; no additional preparation is needed. Note that for the Apple dataset, `find_duplicates` compares the first column, which holds each app's `id`.

In [14]:
apple_apps_duplicates = find_duplicates(apple_apps_data, store='apple')
print('The following are the duplicate apps in our Apple Store Apps dataset:\n')
print(apple_apps_duplicates)

apple_apps_duplicates_count = find_duplicates(apple_apps_data, number=True, store='apple')
print('The total number of duplicate entries is:', apple_apps_duplicates_count)


The following are the duplicate apps in our Apple Store Apps dataset:

{}
The total number of duplicate entries is: 0


To maximise the market reach of our app, we need to ensure that it fits the popular app profiles on both Google Play and the Apple App Store. To validate our app idea, we will first produce a lightweight version for the Android platform and gather user responses during the six months after its initial release. If the response is good, we will then build and publish an iOS version.

This allows us to abandon the idea with minimal loss if the app does not perform as expected.

In [15]:
cleaned_apple_apps_data = filter_dataset_english(apple_apps_data[1:], store='apple')
print(len(cleaned_apple_apps_data))



6183


In [16]:
cleaned_apple_apps_data = filter_dataset_free(cleaned_apple_apps_data, store='apple')
print(len(cleaned_apple_apps_data))

3222


## Creating a Frequency Table

We now use our cleaned-up data to build frequency tables that make the trends in the data easier to see.

In [19]:
def freq_table(dataset, index):
    """function that returns a dictionary (freq table) mapping each value
    in column `index` to the number of rows it appears in"""

    # Named `table` so the local variable does not shadow the function name
    table = {}

    for row in dataset:
        entry = row[index]
        if entry not in table:
            table[entry] = 1
        else:
            table[entry] += 1

    return table

def display_table(dataset, index):
    """prints the frequency table for column `index`, most frequent first"""
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
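

Raw counts are hard to compare across two stores of different sizes. A sketch of a variant that reports each entry's share of the dataset as a percentage; `freq_table_percent` is a hypothetical name and the toy rows are made up for illustration:

```python
def freq_table_percent(dataset, index):
    """Return {value: percentage of rows holding that value in column `index`}."""
    counts = {}
    for row in dataset:
        counts[row[index]] = counts.get(row[index], 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


sample = [['a', 'Tools'], ['b', 'Tools'], ['c', 'Games'], ['d', 'Tools']]
shares = freq_table_percent(sample, 1)
# Sort by share, largest first
for value, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(value, ':', round(share, 1))
```

Percentages would let the Google Play and Apple App Store genre rankings be read side by side even though the two cleaned datasets contain different numbers of apps.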

In [23]:
print('Google Play Store Genre Rankings\n')
display_table(cleaned_google_apps_data, 9)
print('\nGoogle Play Store Category Rankings\n')
display_table(cleaned_google_apps_data, 1)
print('\nApple App Store Prime Genre Rankings\n')
display_table(cleaned_apple_apps_data, 11)

Google Play Store Genre Rankings

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adv