# Profitable App Profiles for the App Store and Google Play Markets

My aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. I am analysing data for a company that builds Android and iOS mobile apps, and my job is to enable our team of developers to make data-driven decisions about the kinds of apps they build.

This company only builds apps that are free to download and install, and its main source of revenue is in-app ads. This means that the revenue for any given app is mostly influenced by the number of people who use it. My goal for this project is to analyse the data to help developers understand which types of apps are likely to attract more users.

# Opening and exploring data

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

# Open both data sets (UTF-8 encoded) and read every row into a list of lists
open_file_apple = open('C:/Users/amaar/Desktop/DataQuest/Projects/Project1/AppleStore.csv', encoding = 'utf8')
open_file_google = open('C:/Users/amaar/Desktop/DataQuest/Projects/Project1/googleplaystore.csv', encoding = 'utf8')
read_file_apple = reader(open_file_apple)
read_file_google = reader(open_file_google)
data_apple = list(read_file_apple)
data_google = list(read_file_google)

# Close the files now that all rows are held in memory
open_file_apple.close()
open_file_google.close()

ios_header = data_apple[0]
ios = data_apple[1:]

android_header = data_google[0]
android = data_google[1:]

In [3]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


There are 7,197 iOS apps in the data set. The columns of interest are: 'track_name', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'prime_genre'. Not all column names are self-explanatory; details can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [4]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


There are 10,841 Android apps in the data set. The columns of interest are: 'App', 'Category', 'Rating', 'Reviews', 'Installs', 'Type', 'Price', 'Genres'.
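
The rows are plain lists, so each column of interest has to be reached by position. As a minimal sketch (the `column_index` name is my own, not part of the data set), a dictionary built from the header makes those positions explicit. The header and the sample row are copied from the output above.

```python
# Header and first row copied from the Google Play output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
sample_row = ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN',
              '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone',
              'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

# Hypothetical helper: map each column name to its position in a row.
column_index = {name: position for position, name in enumerate(android_header)}

columns_of_interest = ['App', 'Category', 'Rating', 'Reviews',
                       'Installs', 'Type', 'Price', 'Genres']
for name in columns_of_interest:
    position = column_index[name]
    print(name, '-> index', position, '->', sample_row[position])
```

The same lookup would work for the iOS header, since it is also a list of column names.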

# Deleting incorrect data

One of the threads in the discussion section of the Google Play data set highlights an error in row 10472. I will print this row and compare it with the header and with another row.

In [5]:
print(android_header)
print('\n')
print(android[0])
print('\n')
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The 'Rating' value in row 10472 (the 'Life Made WI-Fi Touchscreen Photo Frame' app) is 19. This is clearly incorrect, as the maximum rating is 5. The cause is a missing 'Category' value, which shifts every later value one column to the left (see the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion/164101)).

This row will be removed.
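
Because the missing value shifts the row, the row is also one item shorter than the header (12 values against 13 columns). A minimal sketch of a check based on that observation follows; `find_malformed_rows` is a hypothetical helper name, and the two rows are copied from the output above. It only catches rows whose length is wrong, not every kind of bad data.

```python
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']

rows = [
    # A well-formed row (13 values).
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
     '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
     'January 7, 2018', '1.0.0', '4.0.3 and up'],
    # The faulty row: the category is missing, so only 12 values remain.
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

def find_malformed_rows(rows, header):
    """Return the indices of rows whose length differs from the header's."""
    return [index for index, row in enumerate(rows) if len(row) != len(header)]

malformed = find_malformed_rows(rows, header)
print('Indices of malformed rows:', malformed)
```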

In [6]:
print(len(android))
del android[10472]
print(len(android))

10841
10840


The known incorrect row has now been removed.

# Removing Duplicate Entries

Looking through the Google Play data, I found duplicate entries. For example, there are four entries for Instagram.

In [7]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


There are 1,181 instances of duplicate apps in total.

In [8]:
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('No. of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps: ', duplicate_apps[:15])
print('\n')
print('No. of unique apps: ', len(unique_apps))

No. of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


No. of unique apps:  9659
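
The counts above can be cross-checked with `collections.Counter` from the standard library. This is only a sketch on a toy list of names (not the real data set), showing how the number of unique names and the number of surplus duplicate rows fall out of the tally.

```python
from collections import Counter

# Toy list of app names standing in for the first column of the data set.
names = ['Instagram', 'Box', 'Instagram', 'Slack', 'Box', 'Instagram']

counts = Counter(names)                   # name -> number of occurrences
n_unique = len(counts)                    # one entry per distinct name
n_duplicates = len(names) - n_unique      # rows beyond the first of each name
repeated = {name: n for name, n in counts.items() if n > 1}

print('No. of unique apps:', n_unique)
print('No. of duplicate apps:', n_duplicates)
print('Apps that appear more than once:', repeated)
```

A `Counter` looks names up by hash, so it avoids scanning a list for every row.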


As we do not want to count duplicate entries when we analyse the data, we must remove them so as to keep only one entry per app.

The only difference between the Instagram rows is the number of reviews. This is significant because a higher review count suggests that the row was collected more recently (even though the 'Last Updated' column contains the same date). Therefore, instead of removing duplicate rows at random, we will keep only the row with the highest number of reviews for each app.

To remove the duplicates, I will:
- create a dictionary where each app's name is a key and the value is the highest number of reviews for that app
- create a new data set using the dictionary, keeping only one entry per app: the one with the highest number of reviews
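
As a compact illustration of the same idea, the sketch below does both steps in one pass by storing the best row seen so far for each name. It is an alternative under stated assumptions, not the approach used in the cells that follow: `keep_most_reviewed` is a hypothetical helper, the rows are shortened to four columns, and the 'Example App' row is made up for illustration.

```python
# Toy rows: name, category, rating, reviews. The Instagram review counts come
# from the output above; the 'Example App' row is invented for illustration.
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '100'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # name -> best row seen so far
    for row in rows:
        name = row[name_col]
        n_reviews = int(row[reviews_col])
        # Strict '>' means the first row wins when review counts are tied.
        if name not in best or n_reviews > int(best[name][reviews_col]):
            best[name] = row
    return list(best.values())

cleaned = keep_most_reviewed(toy_rows)
for row in cleaned:
    print(row)
```

One difference to be aware of: the result is ordered by the first appearance of each name, which may not match the order produced by a two-pass approach.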

In [9]:
reviews_max = {}
for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [10]:
print('expected length of data set: ', len(android) - 1181)
print('Actual length of data set', len(reviews_max))

expected length of data set:  9659
Actual length of data set 9659


In [11]:
android_clean = []
already_added = []

for apps in android:
    name = apps[0]
    n_reviews = float(apps[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(apps)
        already_added.append(name)

To confirm that there are only 9,659 rows, we will explore the cleaned data set.

In [12]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


Just as calculated above, we have 9,659 rows.

No duplicate entries were found in the App Store data.

However, when looking at the data for both the App Store and Google Play, I discovered apps with non-English names. Because the apps we develop are aimed at an English-speaking audience, these apps must be removed.

In [13]:
def english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

To test the function above, we will run it on four test cases.

In [14]:
print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))
print(english('Instachat 😜'))

True
False
False
False
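
The last two results are `False` even though those names read as English, because '™' and '😜' fall outside the ASCII range (0 to 127) that the function checks. A small sketch makes the offending characters visible; `non_ascii_characters` is a hypothetical helper added here for illustration only.

```python
def non_ascii_characters(string):
    """Return the characters whose code point is outside the ASCII range (0-127)."""
    return [character for character in string if ord(character) > 127]

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    offenders = non_ascii_characters(name)
    # Show each offending character next to its code point.
    print(name, '->', [(character, ord(character)) for character in offenders])
```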
