<u1><h1>Analysing successful free apps in the App Store and Google Play Store</h1></u1>

<h2>Introduction</h2>
<p>Today we have two datasets of mobile apps from the App Store and the Google Play Store. Each dataset contains important information about every app (e.g. Category, Description, Number of Downloads, User Rating).</p>

<h2> Goal of analysis </h2>
<p>The goal of our analysis is to deduce the driving force behind the success of free mobile apps in both mobile markets and how we can replicate that success. To measure success, we will use the number of installs of each app and, where that is not available, the user rating of each app. We will use some old-school methods to mix it up a bit: we omit packages such as pandas and matplotlib and rely on textual output.</p>

<h2> Data Collection </h2>
<p>The data comes from the following Kaggle repositories:
<ul>
  <li>About 10,000 Android apps on the Google Play Store: <a href="https://www.kaggle.com/lava18/google-play-store-apps/home">Dataset</a>
</li>
  <li>About 7,000 iOS apps from the App Store: <a href="https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home">Dataset</a>
</li>
</ul></p>

<h2>Importing Data</h2>

In [54]:
# import statements
from csv import reader

# retrieve datasets
def retrieve_data(dataset, header):
    # the with block closes the file once it has been read
    with open(dataset, encoding = "utf8") as open_file:
        list_data = list(reader(open_file))

    # return header separately if available
    if (header):
        return list_data[0], list_data[1:]

    return list_data

# convert the dataset csv files into list of lists, including header
android_header, android_apps = retrieve_data('googleplaystore.csv', True)
ios_header, ios_apps = retrieve_data('AppleStore.csv', True)

<h2> Exploring the dataset </h2>
<p>Before we get too excited, let's explore what variables we are working with by printing a few rows from the start of both datasets, along with the header to clarify what each column represents.</p>

In [55]:
# print rows start_row to end_row of the dataset (zero-based, end_row included)
def print_data(dataset, start_row, end_row):
    subset = dataset[start_row:end_row+1]
    for row in subset:
        print(row, '\n')
        
# get the number of rows and columns of the dataset
def dimensions(dataset, header):
    return len(dataset), len(header)

# explore android dataset
print("-----Android data (Header)-----")
print(android_header, '\n')

print("-----Android data (Dimensions)-----")
rows, cols = dimensions(android_apps, android_header)
print("Number of rows:", rows, 
      "\nNumber of columns:", cols, 
      "\n")

print("-----Android data (First 3 rows)-----")
print_data(android_apps,1,3)

# explore ios dataset
print("-----IOS data (Header)-----")
print(ios_header, '\n')

print("-----IOS data (Dimensions)-----")
rows, cols = dimensions(ios_apps, ios_header)
print("Number of rows:", rows, 
      "\nNumber of columns:", cols, 
      "\n")

print("-----IOS data (First 3 rows)-----")
print_data(ios_apps,1,3)

-----Android data (Header)-----
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

-----Android data (Dimensions)-----
Number of rows: 10841 
Number of columns: 13 

-----Android data (First 3 rows)-----
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

-----IOS data (Header)-----
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_r

<b>Comment:</b> Here, we see that the Android dataset has the following 13 variables as its columns: 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'. It contains 10841 rows of Android app data.

Furthermore, the iOS dataset contains 16 variables: 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'. It contains 7197 rows of iOS app data.
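The cells that follow address columns by position (for example, index 5 for 'Installs'). A small lookup from column name to position can make those indices easier to check. This is a minimal sketch that copies the Android header printed above; `column_index` and `header_example` are names introduced here for illustration only.

```python
# header copied from the Android output above
header_example = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# hypothetical helper: map each column name to its position in a row
def column_index(header):
    return {name: position for position, name in enumerate(header)}

android_columns = column_index(header_example)
print(android_columns['Installs'])        # 5
print(android_columns['Content Rating'])  # 8
```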

<h2> Data Wrangling </h2>
Here, we will have to convert the datasets into a more suitable form in order to proceed with our analysis. This may include detecting and removing duplicates, non-English apps (for simplicity) and/or incorrect data for each app.
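Non-English apps are not filtered in the cells that follow. One simple heuristic, sketched here as an assumption rather than as part of this analysis, is to count the characters that fall outside the ASCII range and tolerate a small number of them (for symbols such as dashes or emoji). The helper name `is_probably_english` and the threshold of 3 are both illustrative choices.

```python
# hypothetical helper: treat a name as English if it has at most
# max_non_ascii characters outside the ASCII range (an assumed threshold)
def is_probably_english(name, max_non_ascii=3):
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

# two app names from the Android output above, plus an illustrative
# non-English string that is not taken from either dataset
print(is_probably_english('Sketch - Draw & Paint'))                               # True
print(is_probably_english('U Launcher Lite – FREE Live Cool Themes, Hide Apps'))  # True
print(is_probably_english('中文应用商店示例'))                                      # False
```

A heuristic like this can misclassify names, so the threshold would need checking against the actual data before any rows are dropped.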

First, let's make sure that every row has as many values as there are columns: 13 for the Android dataset and 16 for the iOS dataset.

In [56]:
# check every row of data to see if there is any row whose number of
# values does not match the number of columns

# Android Dataset
print("-----Android dataset with missing data in the rows-----")
for n_index, row in enumerate(android_apps):
    if (len(row) != len(android_header)):
        print("Row Index:", n_index)
        print(row)

# IOS Dataset
print("\n-----IOS dataset with missing data in the rows-----")
for n_index, row in enumerate(ios_apps):
    if (len(row) != len(ios_header)):
        print("Row Index:", n_index)
        print(row)
        

-----Android dataset with missing data in the rows-----
Row Index: 10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']

-----IOS dataset with missing data in the rows-----


<b>Comment:</b> Here, we see that the only row with missing values is in the Android dataset, at row index 10472. On closer examination, the Category value is missing (it should be the second element). We can fill it in manually from the app's <a href="https://play.google.com/store/apps/details?id=com.lifemade.internetPhotoframe&hl=en_AU">Google Play listing</a>, which shows that it is listed under the 'Lifestyle' category.

In [57]:
# fill in the Category value for the row with the missing value (manually)
# the row is at index 10472 and Category is the second element (index 1)
android_apps[10472][1] = 'Lifestyle'
print("Check the Android 10472th row for the updated value")
print_data(android_apps,10472,10472)

Check the Android 10472th row for the updated value
['Life Made WI-Fi Touchscreen Photo Frame', 'Lifestyle', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 
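The output above no longer contains the rating '1.9': assigning to index 1 replaces the value that was already there instead of shifting the other values along, so the row still has 12 values. A minimal sketch of the alternative, using `list.insert` on a copy of the row as it was printed by the length check (the names `malformed_row` and `fixed_row` are introduced here for illustration):

```python
# the malformed row exactly as printed by the length check above
malformed_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
                 '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
                 '1.0.19', '4.0 and up']

# insert() shifts every later value one position to the right
fixed_row = list(malformed_row)
fixed_row.insert(1, 'Lifestyle')

print(len(fixed_row))  # 13, matching the number of Android columns
print(fixed_row[1:3])  # ['Lifestyle', '1.9']
```

The figures reported in the rest of this notebook were produced with the assignment version, so they would need to be re-run if this approach were used instead.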



Now we should locate any duplicates and remove them from both datasets. We treat rows with exactly the same app name as duplicates, on the assumption that each app name should be unique within its app store. We must also decide which row to keep, since duplicates may differ in other columns: for example, one row may have a more recent number of reviews, and it would be wise to keep the most recent metrics for accuracy. To do this, we sort each dataset by app name and by number of reviews, then keep the first row we see for each app.

In [58]:
# keep the first row seen for each app name and count the rows dropped
def remove_duplicates(dataset, name_index):
    unique_rows = []
    seen_names = set()
    duplicates_removed = 0
    for row in dataset:
        app_name = row[name_index]
        if (app_name not in seen_names):
            seen_names.add(app_name)
            unique_rows.append(row)
        else:
            duplicates_removed += 1
    return unique_rows, duplicates_removed

print("----------")
# ANDROID DATASET
# two stable sorts: the final order is by number of reviews (x[3], descending),
# with ties left in app name order (x[0], ascending)
# note: every value is still a string, so x[3] is compared as text, not as a number
android_apps.sort(key = lambda x: x[0])
android_apps.sort(key = lambda x: x[3], reverse = True)

# remove duplicates in Android dataset
unique_android_apps, duplicates_removed = remove_duplicates(android_apps, 0)
print("Number of Android app duplicates removed:", duplicates_removed)

# remove all paid Android apps (row[6] is the Type column)
free_android_apps = [row for row in unique_android_apps if row[6] == 'Free']
print("Number of free Android apps:", len(free_android_apps))

print("----------")
# IOS DATASET
# x[1] is app name (ascending order) and x[5] is number of ratings (descending)
ios_apps.sort(key = lambda x: x[1])
ios_apps.sort(key = lambda x: x[5], reverse = True)

# remove duplicates in IOS dataset
unique_ios_apps, duplicates_removed = remove_duplicates(ios_apps, 1)
print("Number of IOS app duplicates removed:", duplicates_removed)

# remove all paid IOS apps (row[4] is the price column)
free_ios_apps = [row for row in unique_ios_apps if float(row[4]) == 0.0]
print("Number of free IOS apps:", len(free_ios_apps))
print("----------")

----------
Number of Android app duplicates removed: 1181
Number of free Android apps: 8902
----------
Number of IOS app duplicates removed: 2
Number of free IOS apps: 4054
----------


<b>Comment:</b> Using our criterion for duplicate removal, the Android dataset contained far more duplicates (1181), which were removed. On the other hand, only two duplicates were present in the iOS dataset, and these were also removed. Additionally, there are 8902 free apps from the Google Play Store but only 4054 free apps from the App Store in these datasets. Now that we have unique sets for both, we can begin to analyse the cleansed datasets in detail.
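The sort in the cell above compares review counts as strings, so '99' is ordered ahead of '1000'. If the aim is to keep the entry with the most reviews for each app, one option is to compare the counts as integers. The sketch below shows that idea on made-up rows that contain only a name and a review count; `keep_most_reviewed` is a hypothetical helper and was not used to produce the counts above.

```python
# string comparison is character by character, so this prints True
print('99' > '1000')

# hypothetical helper: for each app name keep the row with the highest
# review count, comparing the counts as integers
def keep_most_reviewed(rows, name_index, reviews_index):
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# made-up [name, reviews] rows, for illustration only
sample_rows = [['App A', '99'], ['App A', '1000'], ['App B', '5']]
unique_rows = keep_most_reviewed(sample_rows, 0, 1)
print(unique_rows)  # [['App A', '1000'], ['App B', '5']]
```

The number of duplicates removed does not depend on the sort order, but which of the duplicate rows survives does.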

<h2>Data Analysis</h2>
<p>So far, we have filled in the row with missing data and located and removed the duplicates in both datasets. Now we can decide which columns may contribute to the success of an app. Let's start with the Android dataset. Its columns are:</p>

In [59]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


<p>Here, we would like to analyse the variables Reviews, Content Rating and Genres against the target variable, Installs, since we use the number of installs as the measure of an app's success. For each explanatory variable, we first group its values by installs milestone: we create a dictionary in which each key is a value of Installs (e.g. '1,000+') and each value is the list of Reviews, Content Rating or Genres values of the apps at that milestone. We can then summarise each list, for example as an average or as a frequency table.</p>

In [60]:
# function to group the values of one column by the installs variable
# key_index is the index of the column that contains the installs variable
# value_index is the index of the column that contains the explanatory variable
# (header is accepted to match the calls below but is not used)
def installs_table(dataset, header, key_index, value_index):
    freq = {}
    for row in dataset:
        key = row[key_index]
        value = row[value_index]
        if (key not in freq):
            freq[key] = [value]
        else:
            freq[key].append(value)
    return freq

# Android dataset
android_installs_reviews = installs_table(free_android_apps, android_header, 5, 3) 
android_installs_content = installs_table(free_android_apps, android_header, 5, 8)
android_installs_genres = installs_table(free_android_apps, android_header, 5, 9)

# display the average number of reviews for each installs milestone
print("-----Average number of reviews per Android app installs milestone-----")
for key in sorted(android_installs_reviews.keys()):
    numlist = [int(x) for x in android_installs_reviews[key]]
    print("Installs:", key, "||| Average Reviews:", int(sum(numlist)/len(numlist)))
print("--------------------------------------------------------------")

-----Average number of reviews per Android app installs milestone-----
Installs: 0+ ||| Average Reviews: 0
Installs: 1+ ||| Average Reviews: 0
Installs: 1,000+ ||| Average Reviews: 25
Installs: 1,000,000+ ||| Average Reviews: 32191
Installs: 1,000,000,000+ ||| Average Reviews: 20011994
Installs: 10+ ||| Average Reviews: 0
Installs: 10,000+ ||| Average Reviews: 240
Installs: 10,000,000+ ||| Average Reviews: 347985
Installs: 100+ ||| Average Reviews: 4
Installs: 100,000+ ||| Average Reviews: 2533
Installs: 100,000,000+ ||| Average Reviews: 4055292
Installs: 5+ ||| Average Reviews: 0
Installs: 5,000+ ||| Average Reviews: 70
Installs: 5,000,000+ ||| Average Reviews: 101360
Installs: 50+ ||| Average Reviews: 2
Installs: 50,000+ ||| Average Reviews: 745
Installs: 50,000,000+ ||| Average Reviews: 1220515
Installs: 500+ ||| Average Reviews: 9
Installs: 500,000+ ||| Average Reviews: 8955
Installs: 500,000,000+ ||| Average Reviews: 9852672
--------------------------------------------------------

<b>Comment:</b> Here, we see that the top 3 installs milestones are 1,000,000,000+, 500,000,000+ and 100,000,000+, with an average of 20011994, 9852672 and 4055292 reviews respectively. The average number of reviews also increases with every installs milestone, so there is a clear positive trend between the number of installs and the number of reviews, which is what we would expect.
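The milestones above are listed in string order ('1,000+' before '10+'), which makes the trend harder to read. One way to list them from smallest to largest is to sort with a key that converts each label to an integer. A minimal sketch, where `installs_to_int` is a helper name introduced here and the labels are a subset of those in the output above:

```python
# hypothetical helper: '1,000,000+' -> 1000000
def installs_to_int(label):
    return int(label.replace(',', '').replace('+', ''))

# a few of the milestone labels from the output above
milestones = ['1,000+', '1,000,000+', '10+', '100,000+', '5+', '500+']
ordered = sorted(milestones, key=installs_to_int)
print(ordered)
# ['5+', '10+', '500+', '1,000+', '100,000+', '1,000,000+']
```

The same key could be passed to the `sorted()` calls in the cells of this section, assuming every key follows the 'number plus sign' pattern shown in the output.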

Now, let's investigate the trend between the content rating and the number of installs for each Android app:

In [61]:
# function to create a frequency dictionary
# for each frequency make a pair such as [value, frequency of value]
# i.e. ['Everyone', 27] means that the term 'Everyone' has occurred 27 times
def freq_table(dataset):
    freq = {}
    for item in dataset:
        if (item in freq):
            freq[item] += 1
        else:
            freq[item] = 1
    
    # convert freq dict into a sorted list
    total_list = []
    for key in freq:
        key_freq = [key, freq[key]]
        total_list.append(key_freq)
    total_list.sort(key = lambda x: x[1], reverse = True)
    return total_list

print("-----Frequency for each content rating per Android app installs milestone-----")
# display frequency for content rating for each install milestone
for key in sorted(android_installs_content.keys()):
    key_list = android_installs_content[key]
    freq_list = freq_table(key_list)
    print(key, 'Installs', '\n', freq_list, '\n')
print("------------------------------------------------------------------------------\n")

-----Frequency for each content rating per Android app installs milestone-----
0+ Installs 
 [['Everyone', 3], ['Teen', 1]] 

1+ Installs 
 [['Everyone', 40], ['Mature 17+', 3], ['Teen', 3]] 

1,000+ Installs 
 [['Everyone', 661], ['Teen', 57], ['Mature 17+', 21], ['Everyone 10+', 11]] 

1,000,000+ Installs 
 [['Everyone', 1093], ['Teen', 174], ['Mature 17+', 67], ['Everyone 10+', 63], ['Adults only 18+', 1]] 

1,000,000,000+ Installs 
 [['Everyone', 11], ['Teen', 8], ['Everyone 10+', 1]] 

10+ Installs 
 [['Everyone', 267], ['Teen', 34], ['Mature 17+', 10], ['Everyone 10+', 4]] 

10,000+ Installs 
 [['Everyone', 791], ['Teen', 70], ['Mature 17+', 34], ['Everyone 10+', 18]] 

10,000,000+ Installs 
 [['Everyone', 690], ['Teen', 142], ['Mature 17+', 51], ['Everyone 10+', 49]] 

100+ Installs 
 [['Everyone', 536], ['Teen', 58], ['Mature 17+', 16], ['Everyone 10+', 6]] 

100,000+ Installs 
 [['Everyone', 843], ['Teen', 100], ['Mature 17+', 55], ['Everyone 10+', 33]] 

100,000,000+ Installs

<b>Comment:</b> At every installs milestone shown above, the most frequent content rating for Android apps is 'Everyone'. This means that the most installed and the least installed apps have one thing in common: both groups are dominated by apps that cater to all demographics (the 'Everyone' rating).

Lastly, let's observe the trend between the genre of the app and the number of installs, but this time let's look at just the top 3 genres for each installs milestone for simplicity:
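Raw counts are hard to compare across milestones because each milestone has a different number of apps. Expressing each count as a share of its milestone can help. The sketch below assumes the `[value, count]` pair format returned by `freq_table()`; `to_percentages` is a hypothetical helper and the counts are copied from two of the milestones printed above.

```python
# hypothetical helper: turn [value, count] pairs into [value, percentage] pairs
def to_percentages(freq_list):
    total = sum(count for _, count in freq_list)
    return [[value, round(100 * count / total, 1)] for value, count in freq_list]

# counts copied from the 1,000,000,000+ milestone in the output above
top_milestone = [['Everyone', 11], ['Teen', 8], ['Everyone 10+', 1]]
top_shares = to_percentages(top_milestone)
print(top_shares)
# [['Everyone', 55.0], ['Teen', 40.0], ['Everyone 10+', 5.0]]

# counts copied from the 1,000+ milestone in the output above
low_milestone = [['Everyone', 661], ['Teen', 57], ['Mature 17+', 21], ['Everyone 10+', 11]]
low_shares = to_percentages(low_milestone)
print(low_shares[0])
# ['Everyone', 88.1]
```

On these two milestones, 'Everyone' is the largest group in both, although its share is smaller at 1,000,000,000+ installs than at 1,000+ installs.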

In [62]:
print("-----Frequency for each genre per Android app installs milestone-----")
# display frequency of the top 3 genres for each install milestone
for key in sorted(android_installs_genres.keys()):
    key_list = android_installs_genres[key]
    freq_list = freq_table(key_list)
    print(key, 'Installs', '\n', freq_list[0:3], '\n')
print("------------------------------------------------------------------------------\n")

-----Frequency for each genre per Android app installs milestone-----
0+ Installs 
 [['Business', 1], ['Social', 1], ['Art & Design', 1]] 

1+ Installs 
 [['Business', 5], ['Tools', 4], ['Medical', 4]] 

1,000+ Installs 
 [['Education', 82], ['Tools', 69], ['Entertainment', 64]] 

1,000,000+ Installs 
 [['Tools', 100], ['Entertainment', 71], ['Finance', 57]] 

1,000,000,000+ Installs 
 [['Communication', 6], ['Social', 3], ['Video Players & Editors', 2]] 

10+ Installs 
 [['Business', 40], ['Medical', 25], ['Productivity', 20]] 

10,000+ Installs 
 [['Tools', 84], ['Education', 78], ['Finance', 67]] 

10,000,000+ Installs 
 [['Tools', 82], ['Action', 64], ['Photography', 56]] 

100+ Installs 
 [['Business', 74], ['Medical', 45], ['Entertainment', 44]] 

100,000+ Installs 
 [['Tools', 95], ['Entertainment', 78], ['Education', 56]] 

100,000,000+ Installs 
 [['Tools', 22], ['Arcade', 19], ['Photography', 18]] 

5+ Installs 
 [['Business', 17], ['Medical', 8], ['Sports', 8]] 

5,000+ Inst

<b>Comment:</b> As we see above, the trend is not as clear as in the previous two tables, but two genres stand out from the rest: 'Tools' and 'Business'. 'Tools' takes the top spot at most of the high installs milestones, while 'Business' leads several of the low ones.

Now we are halfway through the analysis, and we still have to take a close look at the App Store dataset. Our first move will be to take a quick look back at the columns of the dataset:

In [63]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Here, we see that there is no Installs column that counts the number of iOS installations for each app, so we will have to resort to another metric to measure the success of an app. We can use 'user_rating' as the defining factor for a successful app: the higher the rating, the better the app.

Again, we will compare this against the genre of each app, given by the 'prime_genre' column, and the content rating of each app, given by the 'cont_rating' column. This gives us an analysis similar to the one for the Android dataset, as we can see how the content rating and the genre influence the success of an app.

In [64]:
# IOS dataset
# we can reuse the installs_table() function defined earlier but we are using the user rating instead
ios_installs_content = installs_table(free_ios_apps, ios_header, 7, 10) 
ios_installs_genre = installs_table(free_ios_apps, ios_header, 7, 11)

print("-----Frequency for each content rating per IOS app with respect to the user rating-----")
# display frequency of the top 3 content ratings for each user rating
for key in sorted(ios_installs_content.keys()):
    key_list = ios_installs_content[key]
    freq_list = freq_table(key_list)
    print(key, 'User Rating', '\n', freq_list[0:3], '\n')
print("-------------------------------------------------------------------------------------\n")

-----Frequency for each content rating per IOS app with respect to the user rating-----
0.0 User Rating 
 [['4+', 329], ['17+', 152], ['12+', 116]] 

1.0 User Rating 
 [['4+', 13], ['17+', 4], ['9+', 2]] 

1.5 User Rating 
 [['4+', 18], ['12+', 4], ['17+', 3]] 

2.0 User Rating 
 [['4+', 36], ['12+', 12], ['17+', 8]] 

2.5 User Rating 
 [['4+', 73], ['12+', 20], ['17+', 19]] 

3.0 User Rating 
 [['4+', 135], ['12+', 38], ['17+', 25]] 

3.5 User Rating 
 [['4+', 247], ['12+', 54], ['17+', 45]] 

4.0 User Rating 
 [['4+', 545], ['12+', 139], ['9+', 94]] 

4.5 User Rating 
 [['4+', 918], ['12+', 274], ['9+', 180]] 

5.0 User Rating 
 [['4+', 150], ['12+', 46], ['9+', 32]] 

-------------------------------------------------------------------------------------



<b>Comment:</b> Just like the Android apps, the most dominant content rating at every user rating is '4+', which is roughly equivalent to the 'Everyone' rating in the Google Play Store. Also, at the highest user rating, '5.0', the top 3 content ratings are '4+', '12+' and '9+' respectively, which suggests that apps with a higher content rating such as '17+' are not as successful. This is also evident in the Android dataset: among the apps that rack up 100,000,000+ installs, 'Mature 17+' is also the least popular rating.

In [65]:
print("-----Frequency for each genre per IOS app with respect to the user rating-----")
# display frequency of the top 3 genres for each user rating
for key in sorted(ios_installs_genre.keys()):
    key_list = ios_installs_genre[key]
    freq_list = freq_table(key_list)
    print(key, 'User Rating', '\n', freq_list[0:3], '\n')
print("-------------------------------------------------------------------------------------\n")

-----Frequency for each genre per IOS app with respect to the user rating-----
0.0 User Rating 
 [['Games', 364], ['Entertainment', 50], ['Book', 41]] 

1.0 User Rating 
 [['Games', 10], ['Sports', 3], ['Lifestyle', 2]] 

1.5 User Rating 
 [['Finance', 6], ['Games', 5], ['Lifestyle', 4]] 

2.0 User Rating 
 [['Entertainment', 15], ['Games', 11], ['Sports', 7]] 

2.5 User Rating 
 [['Games', 28], ['Entertainment', 24], ['Sports', 11]] 

3.0 User Rating 
 [['Games', 67], ['Entertainment', 45], ['Social Networking', 15]] 

3.5 User Rating 
 [['Games', 146], ['Entertainment', 45], ['Education', 25]] 

4.0 User Rating 
 [['Games', 496], ['Entertainment', 65], ['Education', 32]] 

4.5 User Rating 
 [['Games', 978], ['Entertainment', 75], ['Photo & Video', 74]] 

5.0 User Rating 
 [['Games', 150], ['Photo & Video', 21], ['Entertainment', 14]] 

-------------------------------------------------------------------------------------



<b>Comment:</b> Unlike the Android apps, where the most installed apps were dominated by the 'Business' and 'Tools' genres, successful iOS apps are mostly in the 'Games' genre, followed by 'Entertainment'.
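'Games' is also the most frequent genre at the lowest user ratings in the output above, so frequency alone partly reflects how many games there are. A complementary view is the average user rating of each genre. The sketch below is an illustration on made-up rows, not a result from the dataset; `average_by_group` is a hypothetical helper.

```python
# hypothetical helper: average of a numeric column, grouped by another column
def average_by_group(rows, group_index, value_index):
    totals = {}
    for row in rows:
        group = row[group_index]
        total, count = totals.get(group, (0.0, 0))
        totals[group] = (total + float(row[value_index]), count + 1)
    return {group: total / count for group, (total, count) in totals.items()}

# made-up [genre, user_rating] rows, for illustration only
sample_rows = [['Games', '4.5'], ['Games', '3.5'], ['Finance', '2.0']]
averages = average_by_group(sample_rows, 0, 1)
print(averages)
# {'Games': 4.0, 'Finance': 2.0}
```

With the real data, the call would presumably be `average_by_group(free_ios_apps, 11, 7)`, since 'prime_genre' and 'user_rating' are at indices 11 and 7 of the iOS header.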

<h2>Conclusion of Analysis</h2>
<p>From our Android apps dataset, we saw a clear positive relationship between the number of reviews left on an app and its number of installs. Although we knew this implicitly, our analysis showed that it was in fact the case. Also, the most installed Android apps were found to have a content rating suitable for every demographic, and these apps mostly fell under the 'Business' or 'Tools' genres. We may assume that most of these installs come from older generations, as these are the genres that adults are more likely to download and install.</p>

<p>For our iOS dataset, we measured success by the user rating of each app in the App Store. We compared this metric against the content rating and the genre to see whether any trends or relationships were present. We found that the most common content rating among successful iOS apps was '4+', which basically caters to all demographics except children under 4 years old, although this may simply be because a lower content rating cannot be selected when uploading an app to the App Store. Also, these successful apps mostly fall under the 'Games' and 'Entertainment' genres, which contrasts with the 'Business' and 'Tools' genres that were successful in the Google Play Store.</p>

<h2>Final Takeaway</h2>
<p>Thus, we return to the original motivation for this analysis: to find the driving force behind the success of free mobile apps in both app stores and how we can use these ideas for our own apps. We can conclude that, to be successful in both app markets, we should cater to the wider audience; that is, we should not create apps that are limited to a certain age range, in order to attract more users. For the Google Play Store specifically, we should try to make apps in the 'Business' and 'Tools' genres, as these are the apps that attract the most installs. For the App Store, we should aim for apps in the 'Games' or 'Entertainment' genres if we want to produce a more successful app.</p>