# Profitable App Profiles for the App Store and Google Play Markets

The goal of this project is to analyze app store data to help our developers understand what types of apps are likely to attract more users.

I work for a company that builds Android and iOS mobile apps, which are available on Google Play and the App Store. The company primarily builds apps that are free to download and install; its main source of revenue is in-app ads. This means that revenue is driven by the number of users who use the apps: the more users who see and engage with the ads, the better.

* The App Store dataset can be downloaded directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

* The Google Play dataset can be downloaded directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)



In [130]:
!pip3 install autopep8

Collecting autopep8
  Using cached https://files.pythonhosted.org/packages/58/e5/a06510cfa9caaaffebf74084f8886a93a96273dbf22a1c920ca1af00997a/autopep8-1.5.5-py2.py3-none-any.whl
Collecting pycodestyle>=2.6.0 (from autopep8)
  Using cached https://files.pythonhosted.org/packages/10/5b/88879fb861ab79aef45c7e199cae3ef7af487b5603dcb363517a50602dd7/pycodestyle-2.6.0-py2.py3-none-any.whl
Collecting toml (from autopep8)
  Using cached https://files.pythonhosted.org/packages/44/6f/7120676b6d73228c96e17f1f794d8ab046fc910d781c8d151120c3f1569e/toml-0.10.2-py2.py3-none-any.whl
Installing collected packages: pycodestyle, toml, autopep8
Successfully installed autopep8-1.5.5 pycodestyle-2.6.0 toml-0.10.2


In [131]:
# Download the App Store and Google Play datasets

import os
import requests

urls = [
    "https://dq-content.s3.amazonaws.com/350/AppleStore.csv",
    "https://dq-content.s3.amazonaws.com/350/googleplaystore.csv",
]

# Create the target directory if it does not exist yet
os.makedirs('datasets', exist_ok=True)

for url in urls:
    # Use the last part of the URL as the local file name
    file_name = url.split("/")[-1]
    path = os.path.join('datasets', file_name)

    download = requests.get(url)
    download.raise_for_status()  # stop early on a failed download
    with open(path, 'w') as temp_file:
        temp_file.write(download.text)
    
    

In [132]:
# open the App store and Google Play datasets
from csv import reader
import pprint

apple_file = 'datasets/AppleStore.csv'
google_file = 'datasets/googleplaystore.csv'

def open_file(file):
    # Return the CSV rows as a list of lists, or an empty list if the file cannot be opened
    try:
        with open(file) as opened_file:
            return list(reader(opened_file))
    except OSError as e:
        print("Failure to open file", e)
        return []

apple_data = open_file(apple_file)  
google_data = open_file(google_file)


The explore_data function prints a slice of a dataset in a more readable way. It can also print the number of rows and columns in the dataset.

In [133]:
# explore the datasets
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))
    
explore_data(apple_data, 0, 5)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']




In [134]:
explore_data(apple_data, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


Documentation describing the columns of the iOS App Store dataset and the Google Play dataset can be found [here](https://kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) and [here](https://kaggle.com/lava18/google-play-store-apps), respectively.

In [135]:
explore_data(google_data, 0, 5)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




In [136]:
explore_data(google_data, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite â\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13
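
The sequence `â\x80\x93` in the third app name above is what an en dash looks like when its UTF-8 bytes are decoded as Latin-1, which suggests the text was decoded with the wrong encoding somewhere between download and load. The sketch below only reproduces that effect on a single string; that the original name contained an en dash is an inference, not something the source states.

```python
# Reproduce the garbling: UTF-8 bytes read back as Latin-1
original = 'U Launcher Lite – FREE'
garbled = original.encode('utf8').decode('latin-1')
print(repr(garbled))

# Reversing the two steps recovers the intended text
repaired = garbled.encode('latin-1').decode('utf8')
print(repaired == original)
```

Saving the raw bytes (`download.content`, with the file opened in `'wb'` mode) and reading them back with `encoding='utf8'` is one way this might be avoided. Garbled names contain more non-ASCII characters than the originals, so the row counts reported later in this notebook apply to the data as loaded here.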


In [137]:
# store the header row of each dataset

header_apple = apple_data[0]
header_google = google_data[0]

# Data cleaning

Check both datasets to ensure that every row has the same number of columns as its header.

In [138]:
def check_dataset(dataset):
    # Report every row whose length differs from the header's
    header_length = len(dataset[0])
    for count, row in enumerate(dataset):
        if len(row) != header_length:
            print("row:", count, "has", len(row), "columns instead of", header_length)
            
check_dataset(apple_data)

In [139]:
check_dataset(google_data)

row: 10473 has 12 columns instead of 13


This shows that row 10473 of the Google Play dataset (counting the header as row 0) has 12 columns instead of 13.

In [140]:
# delete the row with the missing column(s)

del google_data[10473]

In [141]:
# number of rows in dataset

len(google_data)

10841
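
Deleting by a hard-coded index works once, but rerunning that cell would delete a different, valid row. A minimal sketch of a rerun-safe alternative rebuilds the list and keeps only rows whose length matches the header. `drop_malformed_rows` is a hypothetical helper, not part of the original notebook, and the rows are toy data.

```python
def drop_malformed_rows(dataset):
    # Keep the header plus every row with the same number of columns
    header_length = len(dataset[0])
    return [row for row in dataset if len(row) == header_length]

# Toy data: the third row is missing a column
sample = [['App', 'Category', 'Rating'],
          ['Box', 'BUSINESS', '4.2'],
          ['Broken row', '1.9']]
cleaned = drop_malformed_rows(sample)
print(cleaned)
```

Applying the function a second time leaves the result unchanged, which is the property the `del` statement lacks.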

Check the Google Play dataset for duplicate app entries.


In [142]:
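def check_duplicates(dataset):
    # Split the rows (header skipped) into repeated entries and first-seen app names.
    # The lists are local so that rerunning the cell does not double-count.
    duplicate_apps = []
    unique_apps = []
    for row in dataset[1:]:
        app_name = row[0]
        if app_name in unique_apps:
            duplicate_apps.append(row)
        else:
            unique_apps.append(app_name)
    return duplicate_apps, unique_apps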

dup_app, uniqs = check_duplicates(google_data)  
print("number of duplicates of google dataset:", len(dup_app))
print("\n")
print("examples of duplicate apps in google dataset:", dup_app[1:10])
number_of_duplicates = len(dup_app)
    

number of duplicates of google dataset: 1181


examples of duplicate apps in google dataset: [['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device'], ['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up'], ['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up'], ['join.me - Simple Meetings', 'BUSINESS', '4.0', '6989', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 16, 2018', '4.3.0.508', '4.4 and up'], ['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device'], ['Zenefits', 'BUSINESS', '4.2', '296', '14M', '5

The number of duplicate entries in the Google Play dataset is {{number_of_duplicates}}.

The samples above, from the Google Play dataset, clearly show duplicate entries.
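
As a cross-check, the same kind of count can be sketched with `collections.Counter` from the standard library. The names below are toy data and `count_duplicates` is a hypothetical helper.

```python
from collections import Counter

def count_duplicates(names):
    # Map each repeated name to the number of extra occurrences
    counts = Counter(names)
    return {name: n - 1 for name, n in counts.items() if n > 1}

sample_names = ['Box', 'ROBLOX', 'Box', 'Telegram', 'Box', 'ROBLOX']
extra = count_duplicates(sample_names)
print(extra)                # {'Box': 2, 'ROBLOX': 1}
print(sum(extra.values()))  # 3
```

Here each value is the number of entries beyond the first, so an app listed three times is reported as 2.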

In [143]:
# extract the names of the duplicated apps
app_names = []

for row in dup_app:
    name = row[0]
    app_names.append(name)

new_apps = sorted(app_names)

    

In [144]:
# count frequency of names occurring

apps = {}
for name in new_apps:
    apps[name] = new_apps.count(name)
    
pprint.pprint(apps)
   

{'10 Best Foods for You': 1,
 '1800 Contacts - Lens Store': 1,
 '2017 EMRA Antibiotic Guide': 1,
 '21-Day Meditation Experience': 1,
 '365Scores - Live Scores': 1,
 '420 BZ Budeze Delivery': 1,
 '8 Ball Pool': 6,
 '8fit Workouts & Meal Planner': 1,
 '95Live -SG#1 Live Streaming App': 1,
 'A Manual of Acupuncture': 1,
 'A&E - Watch Full Episodes of TV Shows': 3,
 'AAFP': 1,
 'ABC News - US & World News': 1,
 'AC - Tips & News for Androidâ\x84¢': 1,
 'AP Mobile - Breaking News': 1,
 'ASCCP Mobile': 1,
 'ASOS': 1,
 'Accounting App - Zoho Books': 1,
 'AccuWeather: Daily Forecast & Live Weather Reports': 1,
 'Acorns - Invest Spare Change': 1,
 'AdWords Express': 1,
 'Ada - Your Health Guide': 1,
 'Adobe Acrobat Reader': 2,
 'Adobe Photoshop Express:Photo Editor Collage Maker': 1,
 'Adult Dirty Emojis': 2,
 'Advanced Comprehension Therapy': 1,
 'Agar.io': 1,
 'Airbnb': 1,
 'Airway Ex - Intubate. Anesthetize. Train.': 1,
 'AliExpress - Smarter Shopping, Better Living': 3,
 'All Football - Lat

 'Swamp Attack': 1,
 'Sway Medical': 1,
 'TED': 3,
 'TO-FU Oh!SUSHI': 1,
 'Tagged - Meet, Chat & Dating': 1,
 'Talkatone: Free Texts, Calls & Phone Number': 1,
 'Talking Ben the Dog': 1,
 'Talking Tom Gold Run': 3,
 'Talkray - Free Calls & Texts': 1,
 'Tango - Live Video Broadcast': 1,
 'Tapatalk - 100,000+ Forums': 1,
 'Target - now with Cartwheel': 2,
 'Tee and Mo Bath Time Free': 1,
 'Teladoc Member': 1,
 'Telegram': 2,
 'Telegram X': 1,
 'Telemundo Now': 1,
 'Temple Run 2': 5,
 'Text Free: WiFi Calling App': 2,
 'Text free - Free Text + Call': 1,
 'TextNow - free text + calls': 1,
 'Textgram - write on photos': 1,
 'The 5th Stand': 1,
 'The CW': 3,
 'The Coupons App': 3,
 'The Emirates App': 1,
 'The Game of Life': 1,
 'The NBC App - Watch Live TV and Full Episodes': 1,
 'The Simsâ\x84¢ FreePlay': 1,
 'The Wall Street Journal: Business & Market News': 1,
 'Thomas & Friends: Race On!': 1,
 'TickTick: To Do List with Reminder, Day Planner': 1,
 'Tiny Scanner - PDF Scanner App': 1,
 '

In [145]:
#what is the highest number of duplicates ?
max(apps.values())

8

In [146]:
# which app do these duplicates belong to
for k, v in apps.items():
    if v == 8:
        print("k:", k)

k: ROBLOX


In [147]:
apps['ROBLOX']

8

In [148]:
for row in google_data[1:]:
    name = row[0]
    if name == 'ROBLOX':
        print(row)

['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'FAMILY', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1

In [149]:
for row in google_data[1:]:
    name = row[0]
    if name == 'Box':
        print(row)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


We need a way to remove the duplicate entries and keep only the most current one. One way to do this is to keep the entry with the highest number of reviews, on the assumption that the highest count reflects the most recent data, and discard the others.

In [150]:
# show length of dataset once duplicates are removed (exclude the header)

non_dup_google = len(google_data[1:]) - number_of_duplicates

Length of google dataset once duplicates have been removed: {{non_dup_google}}

Create a dictionary where each key is a unique app name and each value is the highest number of reviews recorded for that app.

In [151]:
# extract the maximum number of reviews recorded for each app name in the Google Play dataset

reviews_max = {}
for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        # a later entry has more reviews than the one stored so far
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        # first time this app name is seen
        reviews_max[name] = n_reviews
        
        
    
        

In [152]:
# compare the lengths of the reviews_max dictionary and the duplicates free (unique) dataset. Are they the same ?

len(reviews_max) == non_dup_google

True

In [153]:
# the already added list will be a check to confirm that all the unique app names 
# in the dataset have been added and cleaned

android_clean, already_added = [], []
for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(row)
        already_added.append(name) 

In [154]:
len(android_clean)

9659

In [155]:
len(already_added)

9659
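
The two-pass approach above can also be sketched as a single pass that stores, for each app name, the row with the most reviews seen so far. This is an alternative under the same assumptions (name in column 0, review count in column 3), shown on toy rows. `keep_most_reviewed` is a hypothetical name.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    # best maps app name -> row with the highest review count so far
    best = {}
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample = [['ROBLOX', 'GAME', '4.5', '4447388'],
          ['Box', 'BUSINESS', '4.2', '159872'],
          ['ROBLOX', 'FAMILY', '4.5', '4449910'],
          ['ROBLOX', 'GAME', '4.5', '4447346']]
deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

When two entries tie on review count, this sketch keeps the first one encountered, which matches the behaviour of the `already_added` check.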

At the company, the apps are developed in English because we cater to an English-speaking audience. We therefore have to make sure our datasets contain only English apps.

We can check each character in an app name to see whether it is commonly used in English text. Using ord(), a built-in Python function, we get each character's code point and check whether it falls within the ASCII (American Standard Code for Information Interchange) range, that is, 127 or less. Characters in that range cover the letters, digits and punctuation used in English. If the number returned is greater than 127, the character is treated as non-English.

We will write a function that checks each string and returns False if any character in it is non-English.
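
For reference, a short sketch of what `ord()` returns for an ASCII letter, the trademark sign, a Chinese character and an emoji, all of which appear in app names used in this section:

```python
chars = ['a', '™', '爱', '😜']
code_points = {char: ord(char) for char in chars}

# Print each character, its code point, and whether it is in the ASCII range
for char, code_point in code_points.items():
    print(repr(char), code_point, code_point <= 127)
```

Only the first character falls within the ASCII range; the other three have code points well above 127.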

In [156]:
len(android_clean[7940][0])

6

In [157]:
for char in 'Docs To Go™ Free Office Suite':
    if ord(char) <= 127:
        print(True)
    else:
        print(False)

True
True
True
True
True
True
True
True
True
True
False
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True


In [158]:
text = 'Instachat 😜'

for char in text:
    if ord(char) > 127:
        print(False)
    

False


In [159]:

# Change the original check so that a string is flagged as non-English (False) only when it
# contains 3 or more non-ASCII characters; otherwise return True. This allows for names with
# an emoji or a trademark sign, which are not evidence of a non-English app

def is_english(text):
    count = 0
    for char in text:
        if ord(char) > 127:
            count += 1
    if count >= 3:
        return False
    
    return True

In [160]:
is_english('angela')

True

In [161]:
is_english(android_clean[7940][0])

True

In [162]:
print(is_english(
'爱奇艺PPS -《欢乐颂2》电视剧热播'))

False


In [163]:
print(is_english('Instagram'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True
True


In [164]:
# keep only the English apps in both the android and apple datasets

english_apple_app, english_android_app = [], []

for row in apple_data[1:]:
    name = row[1]
    if is_english(name):
        english_apple_app.append(row)

In [165]:
for row in android_clean:
    name = row[0]
    if is_english(name):
        english_android_app.append(row)

In [166]:
explore_data(english_apple_app, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 5794
Number of columns: 16


In [167]:
explore_data(english_android_app, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


Number of rows: 9268
Number of columns: 13


In [168]:
print(header_google)
print(header_apple)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

In [169]:
free_apps_google, free_apps_apple = [], []

for row in english_apple_app:
    if row[4] == '0.0':
        free_apps_apple.append(row)
        
for row in english_android_app:
    if row[6] == 'Free':
        free_apps_google.append(row)

In [170]:
# explore the google and apple datasets that should now contain english free apps

explore_data(free_apps_apple, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 2970
Number of columns: 16


In [171]:
explore_data(free_apps_google, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'April 26, 2018', '1.1', '4.0.3 and up']


Number of rows: 8542
Number of columns: 13


So now there are {{len(free_apps_google)}} free apps in the Google Play dataset and {{len(free_apps_apple)}} in the App Store dataset.

# What sort of apps could be profitable?

The goal of this project is to determine the types of apps that attract the most users, as our revenue is determined by the number of people using our apps.

So far the following has been done to the two datasets:

* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps
* Isolated the free apps

The validation strategy for better understanding our customers, and the types of app that would appeal to them, is:

* Build a minimal Android version of the app, and add it to the Google Play store
* If there is a good response to the app, then develop it further
* If the app is profitable after 6 months, then an iOS version will be built and added to the App Store

Look through the columns of both datasets to see which ones can be used to identify the most common genres in each app market.

As you can see after inspecting the header and a few rows of the App Store dataset, the prime_genre column is a good candidate for a frequency table:
**{{header_apple}}**

In the Google Play dataset, the Genres and Category columns could be used in the same way:     **{{header_google}}**

We will generate frequency tables with a function that takes a dataset and a column index as arguments and returns the share of each value as a percentage.

In [172]:
def freq_table(dataset, index):
    # Return a dictionary mapping each value in the column to its percentage share
    table = {}
    length_of_dataset = len(dataset)
    for row in dataset:
        col = row[index]
        if col in table:
            table[col] += 1/length_of_dataset * 100
        else:
            table[col] = 1/length_of_dataset * 100
    return table


def display_table(dataset, index):
    # Print the column's values and percentages, highest percentage first
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse=True)

    for entry in table_sorted:
        print(entry[1], ':', entry[0])

# prime_genre
display_table(free_apps_apple, -5)
print('')

# Category
display_table(free_apps_google, 1)
print('')

# Genres
display_table(free_apps_google, -4)



Games : 59.22558922559027
Entertainment : 7.542087542087555
Photo & Video : 5.084175084175092
Education : 3.80471380471381
Social Networking : 3.0976430976431013
Shopping : 2.491582491582494
Utilities : 2.255892255892258
Sports : 2.1548821548821566
Music : 2.121212121212123
Health & Fitness : 1.9528619528619544
Productivity : 1.6835016835016845
Lifestyle : 1.4478114478114483
News : 1.346801346801347
Travel : 1.111111111111111
Finance : 1.111111111111111
Weather : 0.8754208754208751
Food & Drink : 0.8754208754208751
Reference : 0.5387205387205388
Business : 0.5387205387205388
Book : 0.2693602693602694
Medical : 0.20202020202020204
Navigation : 0.16835016835016836
Catalogs : 0.10101010101010101

FAMILY : 19.129009599625167
GAME : 9.353781315851197
TOOLS : 8.522594240224867
BUSINESS : 4.659330367595447
PRODUCTIVITY : 3.945211894170009
LIFESTYLE : 3.945211894170009
FINANCE : 3.79302271130885
MEDICAL : 3.629126668227602
SPORTS : 3.418403184265997
PERSONALIZATION : 3.254507141184749
COM
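
Adding `1/length_of_dataset * 100` once per row accumulates floating-point rounding error, which likely explains the long tail of digits in values such as 59.22558922559027. A minimal alternative sketch counts first and divides once per value. `freq_table_counter` is a hypothetical name and the rows are toy data.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    # Count each value once, then convert the counts to percentages
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(pct, 2))
```

Rounding at display time, as in the last line, keeps the stored percentages at full precision while making the printed table easier to read.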

For the prime_genre column of the App Store dataset, we want to answer the following questions:

* What is the most common genre?
* What is the runner-up?

The most common genre by far is Games (59%), while the second most common is Entertainment (about 7.5%). Photo & Video apps account for 5% and Social Networking apps make up 3%.

There are [six](https://duckma.com/en/blog/types-of-mobile-apps) main types of mobile apps: lifestyle, social media, utility, games/entertainment, productivity, and news/information apps.

As can be seen from the App Store dataset, more than 75% of the apps are dedicated to leisure (games, entertainment, social networking, photo and video, sports, music, health and fitness), with the rest serving practical purposes such as education, shopping, productivity, business, news, travel and lifestyle.
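
The 75% figure can be checked by summing the leisure genres listed above, using the percentages from the prime_genre frequency table rounded to two decimals:

```python
# Percentages copied from the prime_genre frequency table (rounded)
leisure = {
    'Games': 59.23,
    'Entertainment': 7.54,
    'Photo & Video': 5.08,
    'Social Networking': 3.10,
    'Sports': 2.15,
    'Music': 2.12,
    'Health & Fitness': 1.95,
}
leisure_share = sum(leisure.values())
print(round(leisure_share, 2))  # 81.17
```

The seven genres together account for roughly 81% of the free English apps in the App Store dataset.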

Examination of the Genres and Category columns of the Google Play dataset shows the opposite pattern. Just 15% of apps in this dataset are dedicated to recreation (games/entertainment, health and fitness, social), with the remaining 85% serving more utilitarian purposes (family, productivity, business, education, travel, news, finance, personalization). The largest category is Family (19%).

I would argue, judging from this analysis, that an app profile for the Google Play store should be utilitarian. Further investigation to determine which genres/categories have the most users would be needed to confirm this. Also, since Android is the most popular smartphone operating system in the world, with a user base of [2.5 billion](https://www.theverge.com/2019/5/7/18528297/google-io-2019-android-devices-play-store-total-number-statistic-keynote), it would make sense to build for this market first.
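
One way to begin that investigation is to average the `Installs` column per category. The sketch below assumes install counts are strings such as '10,000+' (as in the rows printed earlier), with the category in column 1 and installs in column 5. `avg_installs_by_category` is a hypothetical helper and the rows are toy data.

```python
def avg_installs_by_category(rows, category_idx=1, installs_idx=5):
    # Strip the '+' and ',' characters before converting to a number
    totals, counts = {}, {}
    for row in rows:
        category = row[category_idx]
        installs = float(row[installs_idx].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

sample = [
    ['App A', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+'],
    ['App B', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+'],
    ['App C', 'GAME', '4.5', '4447388', '67M', '100,000,000+'],
]
averages = avg_installs_by_category(sample)
print(averages)
```

Because values such as '10,000+' are lower bounds rather than exact counts, averages computed this way are only a rough indication of relative popularity.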





