## Profitable App Profiles: App Store and Google Play Markets

The aim of this project is to provide an analysis for a company that builds Android and iOS mobile apps. The company's developers build apps that are free to download and install, and the primary source of revenue is in-app ads. This means that revenue is influenced by the number of users who use each app.

We intend to analyze data to help our developers identify which types of apps attract more users...

### Import Datasets of iOS and Android apps

In [None]:
from csv import reader

# Open each file with an explicit encoding; the with-block closes it for us.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]  # the first row holds the column names
ios = ios[1:]

with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

For easier data exploration, the following function prints a slice of rows from the 'ios' or 'android' list of lists. Optionally, it also prints the number of rows and columns in the dataset.

In [None]:
def explore_data(dataset, start, end, count_rows_columns = False):
    sliced = dataset[start:end]
    for row in sliced:
        print(row,'\n')
    
    if count_rows_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 3)  # prints the rows itself and returns None

Explore the iOS dataset as well...

In [None]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3)  # prints the rows itself and returns None

### Deleting Incorrect data



The Google Play data set has a dedicated discussion section here <https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015> and we can see that one of the discussions outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct.

In [None]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

Row 10472 corresponds to an app with a rating of 19. This is incorrect because the maximum rating for a Google Play app is 5. As mentioned in the discussion section, the problem is caused by a missing value in the 'Category' column, so we will delete this row.

In [None]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))
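
The faulty row above was found through the discussion section. As a minimal sketch of a more general check, assuming that a missing value leaves a row with fewer fields than the header, we could compare each row's length against the header's length. The function name `find_malformed_rows` and the sample rows are hypothetical and only serve as an illustration.

```python
# Hypothetical miniature dataset: the header has 4 columns, the last row only 3.
header = ['App', 'Category', 'Rating', 'Reviews']
rows = [
    ['App A', 'TOOLS', '4.1', '159'],
    ['App B', 'GAME', '4.5', '967'],
    ['App C', '19', '3.0M'],  # one value is missing, so the columns are shifted
]

def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    bad_indexes = []
    for index, row in enumerate(dataset):
        if len(row) != len(header):
            bad_indexes.append(index)
    return bad_indexes

malformed = find_malformed_rows(rows, header)
print(malformed)
```

A check like this only catches rows with a shifted column count; a row with the right number of fields but a wrong value would still need a separate validation.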

### Looking for identical entries

The following code snippet counts the duplicate app entries in the Google Play dataset.

In [None]:
identical_apps = []
unique_apps = []

for app in android:
    app_name = app[0]
    if app_name in unique_apps:
        identical_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print('Number of identical apps entries: ', len(identical_apps))
print('Examples of identical apps: ', identical_apps[:15] )    
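
The loop above checks membership in a list, which rescans the list for every app. As a sketch of the same counting logic with a set, where membership checks are typically much faster, we could write the following. The `app_names` list is hypothetical sample data standing in for the first column of `android`.

```python
# Hypothetical app names standing in for the first column of `android`.
app_names = ['Instagram', 'ZOOM Cloud Meetings', 'Instagram', 'Slack', 'Instagram']

seen = set()      # names encountered so far
duplicates = []   # every repeated occurrence, in order of appearance

for name in app_names:
    if name in seen:
        duplicates.append(name)
    else:
        seen.add(name)

print('Number of duplicate entries:', len(duplicates))
print('Number of unique names:', len(seen))
```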

Let's look at two of the apps that have multiple entries: "ZOOM Cloud Meetings" and "Instagram"

In [None]:
for app in android:
    app_name = app[0]
    if app_name == 'ZOOM Cloud Meetings' or app_name == 'Instagram':
        print(app)

We notice that while "ZOOM Cloud Meetings" has a duplicate entry, the "Instagram" entries differ in the <b>User Reviews</b> column, which suggests the data was collected at different times.

To remove the duplicates, we will create a dictionary of key-value pairs where each key is a unique app name and the corresponding value is the highest number of reviews recorded for that app.

In [None]:
reviews_max = {}
for app in android:
    name = app[0]  # app name
    n_reviews = float(app[3])  # number of reviews
    # Keep the highest review count seen so far for each app name.
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

Let's cross-check the expected number of unique apps against the actual number of entries in the dictionary...

In [None]:
print("Expected number of unique apps: ", len(android) - len(identical_apps))
print("Actual number of unique apps: ", len(reviews_max))

Using the dictionary that we created above, we can remove the duplicate rows in the dataset that we intend to use for further analysis.

In [None]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added :
        android_clean.append(app)
        already_added.append(name)
print("No. of rows in the cleaned android dataset: ", len(android_clean))
print('\n')
explore_data(android_clean, 0, 5, True)
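
The two-step approach above first builds a dictionary of maximum review counts and then filters the rows. As an alternative sketch, assuming we only need one row per app name, a single pass can keep the best row directly. The `sample` rows and their review counts are hypothetical, and only the layout (name at index 0, reviews at index 3) follows the Google Play dataset.

```python
# Hypothetical rows: app name at index 0, number of reviews at index 3.
sample = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'BUSINESS', '4.4', '30'],
    ['App B', 'BUSINESS', '4.4', '30'],
]

best_row = {}  # app name -> row with the highest review count seen so far
for row in sample:
    name = row[0]
    n_reviews = float(row[3])
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

sample_clean = list(best_row.values())
for row in sample_clean:
    print(row)
```

When two rows tie on the review count, this sketch keeps the first one it meets, which matches the behavior of the `already_added` check above.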

### Removing Non-English Apps

Since the development team builds English-language apps, our analysis is directed toward apps for an English-speaking audience.

As the discussions linked earlier point out, the app name is stored in a specific column of each dataset, and some rows have names that are not in English.

<b>Part - I : Building a function that detects English vs Non-English characters </b>

Characters commonly used in English text fall within the ASCII range, which covers code points 0 to 127. Any character with a code point above 127 can be treated as a special character.

The built-in ord() function returns the Unicode code point of a character, so we can use it to check every character in an app name.

In [None]:
def is_english(app_name):
    for character in app_name:
        if ord(character) > 127:
            return False
    return True

print("\n Q. Is 爱奇艺PPS -《欢乐颂2》电视剧热播 an app in English? ")
print("A. ",is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

print("\n Q. Is Instachat 😜 an app in English? ")
print("A. ",is_english('Instachat 😜'))

print("\n Q. Is Instagram an app in English? ")
print("A. ",is_english('Instagram'))

We notice that even a single character beyond 127 causes a name to be classified as non-English: the Unicode code point of 😜 is 128540, which you can check using ord('😜').
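
To make the ASCII boundary concrete, here is a small sketch that prints the code point of a few characters and whether each falls within the 0 to 127 range. The chosen characters are only examples.

```python
# ord() returns the Unicode code point of a single character.
code_points = {}
for character in ['a', 'Z', '™', '😜']:
    code_points[character] = ord(character)

for character, code_point in code_points.items():
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    print(repr(character), code_point, label)
```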

If we use the above function to eliminate apps, we might end up removing apps which have special characters like ™ since the app will be incorrectly labeled as non-English.
Hence, to minimize the impact of data loss, we will only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

Modifying the is_english() function further...

In [None]:
def is_english(app_name):
    count_non_ascii = 0
    for character in app_name:
        if ord(character) > 127:
            count_non_ascii += 1
    if count_non_ascii > 3:
        return False
    else:
        return True

print("\n Q. Is Docs To Go™ Free Office Suite an app in English? ")
print("A. ",is_english('Docs To Go™ Free Office Suite'))

print("\n Q. Is Instachat 😜 an app in English? ")
print("A. ",is_english('Instachat 😜'))

print("\n Q. Is Instagram an app in English? ")
print("A. ",is_english('Instagram'))

Now, we can use the function to filter both the cleaned Android dataset and the iOS dataset.

In [None]:
android_english = []
ios_english = []

for app in android_clean:
    if is_english(app[0]):
        android_english.append(app)

for app in ios:
    if is_english(app[1]):
        ios_english.append(app)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

We are left with 9614 Android apps and 6183 iOS apps...

### Isolating the Free Apps

We only focus on apps that are free to download and install, and our main source of revenue consists of in-app ads. 

Since our data sets contain both free and non-free apps, we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [None]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
explore_data(android_final, 5, 10, True)
print('\n')
explore_data(ios_final, 5, 10, True)

We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
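
The filter above compares price strings exactly ('0' for Google Play, '0.0' for the App Store). As a hedged sketch of a more tolerant check, assuming prices may also appear with a leading `$` sign, we could parse the price numerically. The helper name `is_free` is hypothetical and not part of the original analysis.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' is zero."""
    cleaned = price.strip().replace('$', '')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Treat unparseable prices as not free rather than guessing.
        return False

for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```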

So far, we spent a good amount of time on cleaning data, and:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Isolated the free apps

Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

* Build a minimum viable product, and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of **the most common genres for each market**. For this, we'll build frequency tables for the *prime_genre* column of the App Store dataset and for the *Genres* and *Category* columns of the Google Play dataset.

### Most Common Apps by Genre

We'll build two functions that will help us with the following:
* to generate frequency tables that show percentages
* to display the percentages in a descending order using the built-in sorted() function

The sorted() function doesn't work too well with dictionaries because it only considers and returns the dictionary keys.
However, the sorted() function works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key along with its corresponding dictionary value. To ensure the sorting works right, the dictionary value comes first, and the dictionary key comes second.
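
The behavior described above can be seen in a short sketch. The `table` dictionary holds hypothetical percentages chosen only for illustration.

```python
# Hypothetical frequency table (percentages).
table = {'Games': 58.0, 'Entertainment': 8.0, 'Education': 4.0}

# Sorting the dictionary directly only looks at the keys.
keys_only = sorted(table)
print(keys_only)

# Put the value first so that tuples are compared by percentage before name.
as_tuples = [(value, key) for key, value in table.items()]
by_percentage = sorted(as_tuples, reverse=True)
print(by_percentage)
```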

In [None]:
def freq_table(dataset, index):
    frequency_table = {}
    
    for row in dataset:
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    
    for key in frequency_table:
        frequency_table[key] = (frequency_table[key]/len(dataset)) * 100
    return frequency_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(ios_final, -5)
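
As an alternative sketch of the same idea, the standard library's `collections.Counter` can do the counting, and `sorted()` with a `key` argument avoids building value-first tuples. The function name `freq_table_counter` and the sample rows are hypothetical.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# Hypothetical rows with the genre in the last column.
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_counter(rows, -1)
# Sort the (genre, percentage) pairs by percentage, highest first.
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```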

We notice that about 58% of the apps are games, close to 8% are entertainment apps, and close to 5% are video apps.

We can examine the Genres and Category columns of the Google Play dataset.

In [None]:
display_table(android_final, 1) # Category

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

In [None]:
display_table(android_final, -4) #Genres

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have most users.

***

### Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [None]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0 #sum of user ratings (number of ratings)
    len_genre = 0 #number of apps specific to each genre
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
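
The nested loop above walks the whole dataset once per genre. As a sketch of an equivalent single-pass computation, we could accumulate a running total and a count per genre in two dictionaries. The function name `average_by_group` and the sample rows are hypothetical.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column} in a single pass."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows: name, number of ratings, genre.
rows = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Music'],
]

averages = average_by_group(rows, -1, 1)
for genre, avg_n_ratings in averages.items():
    print(genre, ':', avg_n_ratings)
```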

On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [None]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) #print name and number of ratings

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking, or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then reworking the averages, but we'll leave this level of detail for later.
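
To illustrate how a few giants can distort an average, here is a small sketch on hypothetical rating counts. It compares the mean with the median, a different statistic that is less sensitive to outliers, and also recomputes the mean after dropping apps above an arbitrary threshold. Neither the numbers nor the 100,000 cutoff come from the dataset.

```python
from statistics import mean, median

# Hypothetical rating counts for one genre: two giants and several small apps.
rating_counts = [300000, 150000, 9000, 4000, 1500, 200, 10, 0]

overall_mean = mean(rating_counts)
overall_median = median(rating_counts)

# Drop the giants (arbitrary cutoff) and average what is left.
without_giants = [count for count in rating_counts if count < 100000]
trimmed_mean = mean(without_giants)

print('Mean:', overall_mean)
print('Median:', overall_median)
print('Mean without giants:', trimmed_mean)
```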

Reference apps have 74,942 user ratings on average, but the Bible and Dictionary.com apps skew this average upward:

In [None]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

* Weather apps: people generally don't spend too much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.
* Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.
* Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

### Most Popular Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [None]:
display_table(android_final, 5) # the Installs column

One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to `float` — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).

In [None]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
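
The cleaning step in the loop above is reused in later cells, so it could be factored into a small helper. This is a minimal sketch, and the name `parse_installs` is hypothetical.

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))

for raw in ['100+', '1,000,000+', '0']:
    print(raw, '->', parse_installs(raw))
```

Note that the result is the lower bound of each open-ended bucket, in line with the decision above to treat 100,000+ as 100,000 installs.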

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [None]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

If we removed all the communication apps that have over 100 million installs, the average would be reduced roughly ten times:

In [None]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

The book and reference genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc. It seems there's still a small number of extremely popular apps that skew the average:

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

However, it looks like there are only a few very popular apps, so this market still shows potential. Let's try to get some app ideas based on the kind of apps that are somewhere in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])
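
The cell above lists every install bucket between 1,000,000 and 100,000,000 by hand. As a sketch of the same selection expressed as a numeric range, we could convert the install strings first and compare numbers. The function name `installs_between` and the sample rows are hypothetical, and only the column positions (category at index 1, installs at index 5) follow the Google Play dataset.

```python
def installs_between(dataset, category, low, high):
    """Return rows of `category` whose install lower bound is in [low, high)."""
    selected = []
    for app in dataset:
        n_installs = float(app[5].replace(',', '').replace('+', ''))
        if app[1] == category and low <= n_installs < high:
            selected.append(app)
    return selected

# Hypothetical rows: name, category, rating, reviews, size, installs.
sample = [
    ['Hypothetical Reader', 'BOOKS_AND_REFERENCE', '4.5', '1000', '10M', '5,000,000+'],
    ['Hypothetical Giant', 'BOOKS_AND_REFERENCE', '4.6', '9000', '12M', '1,000,000,000+'],
    ['Hypothetical Notes', 'PRODUCTIVITY', '4.2', '800', '8M', '10,000,000+'],
    ['Hypothetical Dictionary', 'BOOKS_AND_REFERENCE', '4.1', '50', '5M', '1,000+'],
]

mid_range = installs_between(sample, 'BOOKS_AND_REFERENCE', 1_000_000, 100_000_000)
for app in mid_range:
    print(app[0], ':', app[5])
```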

This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.

We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.

## Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.