# Determining what makes an app downloadable

In most cases, whether an app earns money through in-app purchases or an upfront price, the number of downloads corresponds to profitability. The aim of this project is to find the data points that affect downloadability, so that people who want to build their own apps can analyse what to include to maximise downloads.

---

# The data

In [1]:
opened_ios = open('AppleStore.csv', encoding ='utf8')
opened_playstore = open('GooglePlayStore.csv', encoding ='utf8')
from csv import reader
read_ios = reader(opened_ios)
read_playstore = reader(opened_playstore)
ios_data = list(read_ios)
playstore_data = list(read_playstore)
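
The cell above leaves both files open. As a minimal sketch of an alternative, assuming the same two CSV files, a small helper (the name `load_csv` is hypothetical, not part of the original notebook) can read a file and close it again:

```python
from csv import reader


def load_csv(path):
    """Read a CSV file into a list of rows and close the file afterwards."""
    # newline='' is what the csv module recommends so that quoted fields
    # containing line breaks are parsed correctly.
    with open(path, encoding='utf8', newline='') as f:
        return list(reader(f))
```

With this helper, the loading step could be written as `ios_data = load_csv('AppleStore.csv')` and `playstore_data = load_csv('GooglePlayStore.csv')`.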

## A Sample of the data

This is a sample of the data available.

I will print the header and the first four rows of data.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print the rows dataset[start:end] with blank lines between them,
    # optionally followed by the dimensions of the data set.
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds blank lines between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print('Sample of App store data set')
print('\n')
explore_data(ios_data, 0, 5)

print('Sample of Play store data set')
print('\n')
explore_data(playstore_data, 0, 5)

Sample of App store data set


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Sample of Play store data set


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Androi

---

# Data cleaning

In the [discussion board](https://www.kaggle.com/lava18/google-play-store-apps/discussion) for the Play Store data set, users report that one row contains an error.

The row in question is at index 10473 (counting the header row as index 0).

I am going to print the header row, the row before the suspect row, and the suspect row itself.


In [4]:
print(playstore_data[0])
print('\n')
print(playstore_data[10472])
print('\n')
print(playstore_data[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


From the above we can see that the suspect row has only 12 values instead of 13: the category is missing, so every later value is shifted one column to the left. As a result the Rating column holds '19', which is not a possible rating. Hence we will delete this row of data.

In [5]:
print(playstore_data[10473])
# Delete the row only if it is still the faulty one, so re-running the cell is safe.
if playstore_data[10473][2] == '19':
    del playstore_data[10473]
print(playstore_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
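
The faulty row was found through the discussion board, but a shifted row like this one can also be detected by comparing each row's length with the header's. The following is a minimal sketch of that check; the function name `find_malformed_rows` and the sample data are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset[1:], start=1)
        if len(row) != expected
    ]


# Tiny illustrative data set: the last row is missing its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some app', 'TOOLS', '4.2'],
    ['Broken app', '1.9'],
]
print(find_malformed_rows(sample))  # [(2, ['Broken app', '1.9'])]
```

Running it on `playstore_data` before the deletion should flag any row with the wrong number of columns, including the one removed above.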


## Checking for any duplicate data

In [6]:
unique_apple_apps = []
unique_play_apps = []
duplicate_apple_apps = []
duplicate_play_apps = []

# ios_data still contains the header row, so it is counted as one unique entry
for app in ios_data:
    if app[0] in unique_apple_apps:
        duplicate_apple_apps.append(app[0])
    else:
        unique_apple_apps.append(app[0])
print('Checking duplicate apps in Apple Data')
print('---')
print('\n')
print('Number of duplicate apple apps: ', len(duplicate_apple_apps))
print('\n')
print('Number of unique apple apps: ', len(unique_apple_apps))
print('\n')

# playstore_data still contains the header row, so it is counted as one unique entry
for app in playstore_data:
    if app[0] in unique_play_apps:
        duplicate_play_apps.append(app[0])
    else:
        unique_play_apps.append(app[0])
print('Checking duplicate apps in Play Store Data')
print('---')
print('\n')
print('Number of duplicate play store app:', len(duplicate_play_apps))
print('\n')
print('Number of unique play store app: ', len(unique_play_apps))


Checking duplicate apps in Apple Data
---


Number of duplicate apple apps:  0


Number of unique apple apps:  7198


Checking duplicate apps in Play Store Data
---


Number of duplicate play store app: 1181


Number of unique play store app:  9660
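
Checking membership in a list gets slow as the list grows, because every lookup scans the whole list. As a sketch of the same duplicate check with a set, assuming the rows are passed without the header row (the helper name `count_duplicates` is hypothetical):

```python
def count_duplicates(rows, key_index):
    """Return (number of unique keys, number of duplicate rows)."""
    seen = set()
    duplicates = []
    for row in rows:
        key = row[key_index]
        if key in seen:
            duplicates.append(key)  # every repeat after the first counts
        else:
            seen.add(key)
    return len(seen), len(duplicates)


# Illustrative rows only: 'A' appears three times, 'B' once.
sample_rows = [['A', '1'], ['B', '2'], ['A', '3'], ['A', '4']]
print(count_duplicates(sample_rows, 0))  # (2, 2)
```

Calling it as `count_duplicates(playstore_data[1:], 0)` would leave the header row out of the counts.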


# Criterion for removing duplicate data

When assessing the data, several fields look as if they could tell duplicate entries apart (Last Updated, Current Ver), but for some apps they do not. In some cases there are multiple entries that all share the same Last Updated date and whose current version is "Varies with device".

For this reason I have decided to keep only the entry with the highest number of reviews for each app.

## Establishing what data is truly unique

We will follow these steps:

1. Create a dictionary that stores the highest review count for each app.
2. Compare each row of the Android data against that dictionary.
3. Build a clean version of the Android data set from the matching rows.
4. Make sure that all remaining apps are English, as that is the language of interest here.


**Dictionary to store unique apps**

In [7]:
check_reviews = {}

for app in playstore_data[1:]:
    if app[0] in check_reviews and check_reviews[app[0]] < float(app[3]):
        # the app is already in check_reviews with a smaller review count: update the stored value
        check_reviews[app[0]] = float(app[3])
    elif app[0] not in check_reviews:
        # the app is not yet in check_reviews: add it
        check_reviews[app[0]] = float(app[3])


**Compare and store clean data**

In [8]:
playstore_data_cleaned = []
duplicate_values = []

for app in playstore_data[1:]:
    if float(app[3]) == check_reviews[app[0]] and app[0] not in duplicate_values:
        # keep the row if its review count matches the maximum stored in check_reviews and the app has not been added yet
        playstore_data_cleaned.append(app)
        duplicate_values.append(app[0])

playstore_data_cleaned.insert(0, playstore_data[0])
# insert header row to cleaned data
explore_data(playstore_data_cleaned, 0, 10)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Every
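
The two passes above can also be folded into one by storing the best row itself instead of only its review count. This is a sketch rather than the method used in this notebook; the name `keep_most_reviewed` and the default column indexes (0 for the app name, 3 for the reviews) are assumptions taken from the Play Store header.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the first row seen is kept.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Illustrative rows only (name, category, rating, reviews).
sample_rows = [
    ['A', 'TOOLS', '4.0', '10'],
    ['B', 'TOOLS', '4.1', '5'],
    ['A', 'TOOLS', '4.0', '30'],
]
print(keep_most_reviewed(sample_rows))
# [['A', 'TOOLS', '4.0', '30'], ['B', 'TOOLS', '4.1', '5']]
```

The result is ordered by the first appearance of each app name, which can differ slightly from the row order produced by the two-pass version.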

**Check that the data is English only**

In [9]:
def english_checker(text):
    # Counts the characters whose code point is above 127 (outside the ASCII range).
    # A name with fewer than 3 such characters is still treated as English.
    counter = 0

    for character in text:
        if ord(character) > 127:
            counter = counter + 1

    return counter < 3


We will create separate lists for the English and non-English apps. The header rows are skipped here and added back after the final split.

In [16]:

playstore_non_english_apps = [] 
playstore_english_apps = [] 
appstore_non_english_apps = [] 
appstore_english_apps = [] 

for apps in playstore_data_cleaned[1:]:
    if english_checker(apps[0]):
        playstore_english_apps.append(apps)
    else:
        playstore_non_english_apps.append(apps)


for app in ios_data[1:]:
    if english_checker(app[1]):
        appstore_english_apps.append(app)
    else:
        appstore_non_english_apps.append(app)
print('playstore')
explore_data(playstore_english_apps, 0, 20)
print('appstore')
explore_data(appstore_english_apps, 0, 20)

playstore
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M'

For our use case we would like to have the free apps separate from the paid apps. This is the last step in our data cleaning.

In [24]:
playstore_english_free_apps = []
playstore_english_paid_apps = []
appstore_english_free_apps = []
appstore_english_paid_apps = []

for app in playstore_english_apps:
    if app[6] == "Free":
        playstore_english_free_apps.append(app)
    elif app[6] == "Paid":
        playstore_english_paid_apps.append(app)

for app in appstore_english_apps:
    # append the row being tested, not a variable left over from the previous loop
    if float(app[4]) == 0.0:
        appstore_english_free_apps.append(app)
    elif float(app[4]) > 0.0:
        appstore_english_paid_apps.append(app)


playstore_english_free_apps.insert(0, playstore_data[0])
playstore_english_paid_apps.insert(0, playstore_data[0])
appstore_english_free_apps.insert(0, ios_data[0])
appstore_english_paid_apps.insert(0, ios_data[0])

print("Our final cleaned Android data set")
print("\n")
print("Free apps")
print("\n")
explore_data(playstore_english_free_apps, 0, 5)
print("\n")
print("\n")
print("Paid apps")
print("\n")
explore_data(playstore_english_paid_apps, 0, 5)
print("\n")
print("\n")
print("---")
print("\n")
print("\n")
print("Our final cleaned Apple data set")
print("\n")
print("Free apps")
print("\n")
explore_data(appstore_english_free_apps, 0, 5)
print("\n")
print("\n")
print("Paid apps")
print("\n")
explore_data(appstore_english_paid_apps, 0, 5)
print("\n")
print("\n")


Our final cleaned Android data set


Free apps


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']






Paid apps


['App', 'Category', 'Rating', 'Revie

# Data Analysis

For the purposes of our analysis we want to see what makes a free app profitable and successful. For that reason we will analyse various metrics to determine whether successful apps have anything in common.

We first need to determine what criteria classify an app as successful or unsuccessful. To do this we need to sample apps that we know are successful.

Since our App Store data has no data points on installs, we will apply the conclusions drawn from the Play Store data to App Store apps as well.

We will make a list of all the Play Store install values and find the maximum. Note that these values are strings such as '10,000+', so `max` compares them character by character rather than numerically.

In [34]:
playstore_installs = []

for apps in playstore_english_free_apps[1:]:
    playstore_installs.append(apps[5])

print(max(playstore_installs))

500,000,000+


Next we will count the apps with 500,000,000+ installs, so that we can look at what they have in common.

In [35]:
playstore_install_max = {}

for apps in playstore_english_free_apps[1:]:
    if apps[5] == '500,000,000+':
        playstore_install_max[apps[0]] = apps[5]

print(len(playstore_install_max))

24
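
One way to compare install counts numerically is to strip the separators and the trailing plus sign before comparing. This is a minimal sketch under the assumption that every Installs value looks like '10,000+'; the helper name `parse_installs` and the sample values are made up for illustration, not results from the data set.

```python
def parse_installs(value):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(value.replace(',', '').replace('+', ''))


# Illustrative values only.
sample_installs = ['10,000+', '500,000,000+', '1,000,000,000+', '5,000+']

# Plain max compares strings character by character.
print(max(sample_installs))                      # 500,000,000+
# With a numeric key the largest bucket wins.
print(max(sample_installs, key=parse_installs))  # 1,000,000,000+
```

Applied to `playstore_installs`, `max(playstore_installs, key=parse_installs)` would return the largest install bucket by value, which may differ from the string maximum printed above.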
