# Profitable App Profiles for App Store and Google Play Markets

The goal of this project is to analyze apps in the App Store and Google Play markets to determine the characteristics of profitable apps. Our company builds Android and iOS mobile apps, and our job is to provide our developers with the information necessary to build profitable apps.

Our company only builds apps that are free to download and are in English. Our main source of revenue is in-app ads. Advertising companies pay us based on how many users see their ads, which means our revenue depends largely on how many users our apps have. We set out on this analysis to come back with actionable insights that allow our developers to build apps that attract the most users possible.

## Initial Data Exploration 

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.  
Analyzing over 4 million apps is out of scope for this project because it would require too many resources. We will instead analyze a sample of data, aiming to use data that is free to acquire.

We will use two data sets:  
A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.  
A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

In [4]:
from csv import reader 

In [12]:
# Google Play data set
# encoding='utf8' is set explicitly because app names contain non-ASCII characters
open_android = open('googleplaystore.csv', encoding='utf8')
read_android = reader(open_android)
android = list(read_android)
android_header = android[0]
android = android[1:]

# App Store data set
open_ios = open('AppleStore.csv', encoding='utf8')
read_ios = reader(open_ios)
ios = list(read_ios)
ios_header = ios[0]
ios = ios[1:]
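
The file handles opened above are never closed. As a minimal sketch of an alternative, assuming a hypothetical helper named `read_dataset` (not part of the original notebook), the same `csv.reader` logic can be wrapped in a function that takes an open file object, so it can be used inside a `with` block. The in-memory sample below stands in for the real CSV files:

```python
from csv import reader
import io

def read_dataset(file_obj):
    """Split an open CSV file object into (header, rows)."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for a CSV file, used here so the sketch runs on its own.
sample = io.StringIO('App,Price\nDemo App,0\n')
sample_header, sample_rows = read_dataset(sample)
print(sample_header)
print(sample_rows)
```

With the real files this could be called as `with open('AppleStore.csv', encoding='utf8') as f: ios_header, ios = read_dataset(f)`, which closes the file as soon as the rows have been read.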

To automate the process of exploring the data, we will create a function `explore_data` that prints rows in a more digestible format. We will also add an option to show the number of rows and columns for any data set.

In [13]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [16]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


We have 7197 iOS apps in this data set, and the columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Not all column names are self-explanatory in this case, but details about each column can be found in the data set documentation.

In [18]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.
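
The two data sets store the app name and the price at different positions, so hard-coded indices are easy to mix up. A small sketch, using the two headers printed above, looks the positions up by column name instead. The `*_idx` variable names are illustrative and not part of the original notebook:

```python
ios_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# list.index returns the position of the first matching column name.
ios_name_idx = ios_header.index('track_name')
ios_price_idx = ios_header.index('price')
android_name_idx = android_header.index('App')
android_price_idx = android_header.index('Price')

print(ios_name_idx, ios_price_idx)          # 2 5
print(android_name_idx, android_price_idx)  # 0 7
```

Note that index 0 of an iOS row is the unnamed row-number column, not the app name.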

## Deleting Wrong Data

The Google Play data set has a dedicated discussion section, and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [19]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and we can see that its rating is 19. This is clearly wrong because the maximum rating for a Google Play app is 5. As mentioned in the discussions section, the problem is caused by a missing value in the 'Category' column, which shifts every later value one column to the left. As a consequence, we'll delete this row.

In [20]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))


10841
10840
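
The faulty row printed above has 12 values while the header has 13, so rows of this kind can also be found without knowing their index in advance. The following is a minimal sketch on made-up sample rows (the `header`, `rows`, `bad_indices`, and `clean_rows` names are illustrative), comparing each row's length with the header's:

```python
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken Row', '1.9'],                      # one value is missing
    ['Coloring book', 'ART_AND_DESIGN', '3.9'],
]

# Indices of rows whose length does not match the header.
bad_indices = [i for i, row in enumerate(rows) if len(row) != len(header)]
# Building a new list is safe to rerun, unlike `del` on a fixed index.
clean_rows = [row for row in rows if len(row) == len(header)]

print(bad_indices)      # [1]
print(len(clean_rows))  # 2
```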


## Remove Duplicate Entries

The Google Play data set contains multiple entries for the same app. For example, the Instagram app has four entries.

In [21]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can count how many duplicate apps there are by iterating over the data set and saving app names that appear more than once in a separate list.

In [23]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:5])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
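
The check `name in unique_apps` scans a list on every iteration. As an alternative sketch (not the approach used in this notebook), `collections.Counter` can tally the names in one pass. The sample names below are made up for illustration:

```python
from collections import Counter

sample_names = ['Instagram', 'Box', 'Instagram', 'Zoom', 'Box', 'Instagram']
name_counts = Counter(sample_names)

# Every occurrence beyond the first counts as a duplicate entry,
# which matches how duplicate_apps is filled in the loop above.
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated = sorted(name for name, count in name_counts.items() if count > 1)

print(n_duplicates)  # 3
print(repeated)      # ['Box', 'Instagram']
```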


Rather than removing the duplicate apps randomly, we will decide based on how many reviews each duplicate has. The entry with the highest number of reviews should be the most recent one, so this is the one we will keep.

In [24]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
len(reviews_max)

9659

Now that we know the highest review count for each app, we can remove the duplicates from our data.

In [26]:

android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

len(android_clean)

9659

Our `android_clean` list holds one entry per app, the one with the highest number of reviews. Our `already_added` list holds the names of the apps we have already added. Its purpose is to prevent duplicate entries with the same number of reviews from being added more than once, which is why we check `name not in already_added`.

The number of unique apps should equal the total number of rows minus the number of duplicate entries.

In [27]:
len(android) - len(duplicate_apps)

9659
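
The two loops above (one to find the maximum review count, one to pick the rows) can also be combined. The following is a sketch of a single-pass variant, not the notebook's method: a dictionary keyed by app name keeps the best row seen so far. The sample rows are shortened, and the 'Box' values are made up for illustration:

```python
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]

best_by_name = {}
for row in sample_rows:
    name = row[0]
    n_reviews = float(row[3])
    # Keep the first row seen with the highest review count for each name.
    if name not in best_by_name or n_reviews > float(best_by_name[name][3]):
        best_by_name[name] = row

deduplicated = list(best_by_name.values())
print(len(deduplicated))
print(best_by_name['Instagram'][3])
```

One difference in behavior: the resulting list is ordered by the first appearance of each name, whereas `android_clean` is ordered by the position of each kept row.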

## Removing Non-English Apps

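We also need to remove apps in any language other than English because our company does not make those apps. We can do this by referring to the ASCII system: the characters we typically use in English text have code points no greater than 127, so we can check app names for characters above this number. Because some English app names contain a few such characters (for example emoji or the ™ symbol), we only treat a name as non-English when it has more than three of them.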

In [32]:
# Returns False if an app name has more than 3 characters with code points greater than 127
def check_chars(string):
    non_ascii = 0
    for char in string:
        if ord(char) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

In [36]:
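ios_english = []
android_english = []
for app in ios:
    name = app[2]  # 'track_name' is at index 2; index 0 holds the row number
    if check_chars(name):
        ios_english.append(app)
for app in android_clean:
    name = app[0]
    if check_chars(name):
        android_english.append(app)


In [37]:
len(ios_english)

7197

Note that 7197 is the size of the full iOS data set: this output was recorded when the iOS loop read `app[0]`, the row-number column, so no iOS app was filtered out. With the name read from `track_name` as above, the iOS count here and the free iOS count in the next section should be regenerated.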

In [38]:
len(android_english)

9614

## Removing Apps that are not Free

We also need to remove the apps that cost money to download since our company only produces free apps. We can do this by checking the price column of each data set (`price` for iOS, `Price` for Android).

In [46]:
print(ios_english[:10])

[['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'], ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'], ['6', '283619399', 'Shanghai Mahjong', '10485713', 'USD', '0.99', '8253', '5516', '4', '4', '1.8', '4+', 'Games', '47', '5', '1', '1'], ['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '

In [49]:
ios_final = []
android_final = []

for app in ios_english:
    price = app[5]
    if price == '0':
        ios_final.append(app)

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

print(len(ios_final))
print(len(android_final))

4056
8864
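
The filter above compares the price with the exact string `'0'`, which would miss a free app stored as `'0.0'`. A more tolerant check could parse the number first. This is a sketch under assumptions: the helper name `is_free` is hypothetical, and the `'$'`-prefixed format is assumed for paid apps rather than shown in the outputs above:

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free so they can be inspected.
        return False

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('3.99'))   # False
print(is_free('$4.99'))  # False
```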
