## Profitable App Profiles for the App Store and Google Play Markets

The main purpose of this project is to analyze and explore the apps currently available in the world's two main app markets: _App Store_ and _Google Play_.

This analysis can reveal which kinds of apps attract the most users, which is useful guidance in the early stages of app development.

### Opening and exploring Data

In [4]:
# Function to open a data set and return the header and the remaining rows separately #

from csv import reader

def open_dataset(dataset):
    # utf8 is assumed for both files; the context manager closes the file after reading
    with open(dataset, encoding='utf8') as opened_file:
        data_set = list(reader(opened_file))
    return data_set[0], data_set[1:]

header_apple, apps_data_apple = open_dataset('AppleStore.csv')
header_google, apps_data_google = open_dataset('googleplaystore.csv')

In [5]:
# Function to explore the data. The dataset must be passed without its header row #

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
#explore_data(apps_data_apple, 0, 5, True)
#explore_data(apps_data_google, 0, 5, True)

In [6]:
# Information about the columns of the dataset #
print(header_apple)
print(header_google)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


More information about the datasets, including a description of each column, is available at the following links:
- [Google Play Market](https://www.kaggle.com/lava18/google-play-store-apps)
- [Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

### Data cleaning

Before the analysis, the data needs to be cleaned. Our efforts will focus on the following tasks:

- Remove non-English apps.
- Remove non-free apps.
- Detect other inaccurate data.
- Detect duplicated data.

In [7]:
# Detection of inaccurate/incomplete data #

def detect(dataset, header):
    # A row whose length differs from the header has missing or extra fields
    for index, row in enumerate(dataset):
        if len(row) != len(header):
            print(row)
            print(index)
        
detect(apps_data_google, header_google)
detect(apps_data_apple, header_apple)


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [8]:
# Deletion of wrong data #

def delete(dataset, index):
    del dataset[index]

delete(apps_data_google, 10472)

In [9]:
# Detection of duplicated entries #

def duplicated(dataset, index):
    duplicated_apps = []
    unique_apps = []
    for row in dataset:
        name = row[index]
        if name in unique_apps:
            duplicated_apps.append(name)
        else:
            unique_apps.append(name)
    print('Number of duplicated entries: ', len(duplicated_apps))
    print('Examples of duplicated entries: ', duplicated_apps[:15])

duplicated(apps_data_google, 0)
duplicated(apps_data_apple, 0)
    

Number of duplicated entries:  1181
Examples of duplicated entries:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
Number of duplicated entries:  0
Examples of duplicated entries:  []


The criterion for removing duplicated entries is the following: keep the most recent row, that is, the one with the highest number of reviews.

In [10]:
# Deletion of duplicated data #

def delete_duplicated(dataset, index_name, index_review_number):
    # Highest review count seen for each app name
    reviews_max = {}
    for row in dataset:
        name = row[index_name]
        n_reviews = float(row[index_review_number]) # compare as numbers, not as strings
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    
    dataset_clean = []
    already_added = set()
    for row in dataset:
        name = row[index_name]
        n_reviews = float(row[index_review_number])
        # Keep only the first row that carries the highest review count
        if (reviews_max[name] == n_reviews) and (name not in already_added):
            dataset_clean.append(row)
            already_added.add(name)
    return dataset_clean

apps_data_google_cl = delete_duplicated(apps_data_google, 0, 3)
apps_data_apple_cl = delete_duplicated(apps_data_apple, 0, 5)

In [11]:
print('Number of rows Google Play dataset: ', len(apps_data_google_cl))
print('Number of rows Apple Store dataset: ', len(apps_data_apple_cl))

Number of rows Google Play dataset:  9659
Number of rows Apple Store dataset:  7197


We proceed now to remove the non-English apps from each of the stores.

In [12]:
# Detection of non-English apps #

def check_string(a_str):
    count = 0
    for character in a_str:
        number = ord(character) # return an integer representing the Unicode code point of that character
        if number > 127:
            count += 1
    if count > 3:
        return False
    return True
        
#print(check_string('Instagram'))
#print(check_string('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
#print(check_string('Docs To Go‚Ñ¢ Free Office Suite'))
#print(check_string('Instachat üòú'))

# Deletion of non-English apps #

def filter_non_en(dataset, index_name):
    data = []
    for row in dataset:
        name = row[index_name]
        is_english = check_string(name)
        if is_english:
            data.append(row)
    return data

apps_data_google_gd = filter_non_en(apps_data_google_cl, 0)
apps_data_apple_gd = filter_non_en(apps_data_apple_cl, 1)

In [13]:
print('Number of rows Google Play dataset: ', len(apps_data_google_gd))
print('Number of rows Apple Store dataset: ', len(apps_data_apple_gd))

Number of rows Google Play dataset:  9614
Number of rows Apple Store dataset:  6183


Now, the focus is to remove the non-free apps from both of the stores.

In [14]:
# Deletion of non-free apps #

def filter_non_free(dataset, index_price):
    dataset_clean = []
    for row in dataset:
        price = row[index_price]
        if price == '0' or price == '0.0':
            dataset_clean.append(row)
    return dataset_clean

apps_data_google_final = filter_non_free(apps_data_google_gd, 7)
apps_data_apple_final = filter_non_free(apps_data_apple_gd, 4)

In [15]:
print('Number of rows Google Play dataset: ', len(apps_data_google_final))
print('Number of rows Apple Store dataset: ', len(apps_data_apple_final))

Number of rows Google Play dataset:  8862
Number of rows Apple Store dataset:  3222


### Analyzing Data

Our main purpose in this project is to determine the kinds of apps that are likely to attract more users, because revenue is highly influenced by the number of people using the apps.

With that in mind, we can start the analysis by getting a sense of the most common genres in each market. For this, we need to build frequency tables for a few columns in the data sets.

#### Most Common Apps by Genre

In [18]:
# Completion of frequency tables with percentages #

def freq_table(dataset, index_column):
    table = {}
    for row in dataset:
        key = row[index_column]
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
    total_apps = len(dataset)
    for key in table:
        # Convert the raw count into a percentage of all apps
        table[key] = table[key] / total_apps * 100
    return table

# Display the frequency tables #

def display_table(dataset, index_column):
    table = freq_table(dataset, index_column)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True) # Sort by percentage, highest first
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        

print('The frequency table for the most common genres in the Apple Store is:' + '\n')
display_table(apps_data_apple_final, 11)
print('\n')
print('The frequency table for the most common genres in the Google Play Store is:' + '\n')
display_table(apps_data_google_final, 9)
print('\n')
print('The frequency table for the most common categories in the Google Play Store is:' + '\n')
display_table(apps_data_google_final, 1)
print('\n')

The frequency table for the most common genres in the Apple Store is:

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


The frequency table for the most common genres in the Google Play Store is:

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.893026404

We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our data set.

The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not match the supply.

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and it seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) means mostly games for kids.

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.
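
One way to check the granularity claim is to count the distinct values in each column. The sketch below uses a hypothetical helper, `count_distinct`, on a handful of invented two-column rows (category first, genre second). In the notebook the equivalent calls would use `apps_data_google_final` with column indexes 1 and 9.

```python
# Hypothetical helper (not in the notebook): number of distinct values in a column.
def count_distinct(rows, index_column):
    return len({row[index_column] for row in rows})

# Invented rows for illustration only: [category, genre]
toy_rows = [
    ['FAMILY', 'Education;Pretend Play'],
    ['FAMILY', 'Casual;Brain Games'],
    ['GAME', 'Arcade'],
    ['GAME', 'Puzzle'],
    ['TOOLS', 'Tools'],
]

print('Distinct categories:', count_distinct(toy_rows, 0))
print('Distinct genres:', count_distinct(toy_rows, 1))
```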

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

#### Most Popular Apps by Genre on the App Store

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre.

This information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

In [23]:
# Calculation of the average number of user ratings per app genre on the Apple Store #

def average_number(dataset, index_unique, index_search):
    table_unique = freq_table(dataset, index_unique)
    for genre in table_unique:
        total = 0 # sum of user ratings (number of ratings, not the actual ratings)
        len_genre = 0 # number of apps
        for row in dataset:
            genre_app = row[index_unique]
            if genre_app == genre: 
                total += float(row[index_search])
                len_genre += 1
        average_number = total / len_genre
        print('App Genre: ', genre, '- User ratings: ', average_number)
        
print('The most popular apps by genre on the Apple Store are:' + '\n')
average_number(apps_data_apple_final, 11, 5)

The most popular apps by genre on the Apple Store are:

App Genre:  Business - User ratings:  7491.117647058823
App Genre:  Catalogs - User ratings:  4004.0
App Genre:  Utilities - User ratings:  18684.456790123455
App Genre:  Weather - User ratings:  52279.892857142855
App Genre:  Music - User ratings:  57326.530303030304
App Genre:  News - User ratings:  21248.023255813954
App Genre:  Finance - User ratings:  31467.944444444445
App Genre:  Social Networking - User ratings:  71548.34905660378
App Genre:  Lifestyle - User ratings:  16485.764705882353
App Genre:  Shopping - User ratings:  26919.690476190477
App Genre:  Education - User ratings:  7003.983050847458
App Genre:  Food & Drink - User ratings:  33333.92307692308
App Genre:  Photo & Video - User ratings:  28441.54375
App Genre:  Sports - User ratings:  23008.898550724636
App Genre:  Travel - User ratings:  28243.8
App Genre:  Games - User ratings:  22788.6696905016
App Genre:  Productivity - User ratings:  21028.410714285714
Ap

On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user ratings together.

The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then recomputing the averages, but we'll leave this level of detail for later.

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com which skew the average number of ratings upward. However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.
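
As a rough sketch of that deferred step, the function below recomputes the per-genre averages while skipping any app whose rating count exceeds a chosen cap. The function name, the `max_count` parameter, and the toy rows are all assumptions made for this illustration. In the notebook the call would use `apps_data_apple_final` with column indexes 11 and 5, and the cap itself is a judgment call.

```python
# Hypothetical helper (not in the notebook): per-genre average of rating counts,
# optionally ignoring apps with more than max_count ratings.
def average_ratings_by_genre(rows, index_genre, index_count, max_count=None):
    totals = {}
    counts = {}
    for row in rows:
        n_ratings = float(row[index_count])
        if max_count is not None and n_ratings > max_count:
            continue  # treat this app as an outlier and leave it out
        genre = row[index_genre]
        totals[genre] = totals.get(genre, 0) + n_ratings
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}

# Invented rows for illustration only: [genre, rating count]
toy_rows = [
    ['Navigation', '300000'],
    ['Navigation', '5000'],
    ['Navigation', '1000'],
    ['Weather', '20000'],
    ['Weather', '10000'],
]

print(average_ratings_by_genre(toy_rows, 0, 1))
# {'Navigation': 102000.0, 'Weather': 15000.0}
print(average_ratings_by_genre(toy_rows, 0, 1, max_count=100000))
# {'Navigation': 3000.0, 'Weather': 15000.0}
```

In the toy data a single very popular app lifts the navigation average from 3,000 to 102,000, which is the kind of distortion described above. A genre whose apps all exceed the cap simply disappears from the result.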

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

- Weather apps: people generally don't spend too much time in-app, and the chances of making profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.
- Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.
- Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

#### Most Popular Apps by Genre on Google Play

We have data about the number of installs for the Google Play market, so we should be able to get a clearer picture of genre popularity. However, the install numbers aren't precise: most values are open-ended (100+, 1,000+, 5,000+, etc.). For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000.

However, we don't need very precise data for our purposes; we only want to find out which app genres attract the most users. We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, an app with 1,000,000+ installs has 1,000,000 installs, and so on.
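
The convention described above amounts to stripping the commas and the trailing plus sign and reading what remains as a number. A minimal sketch, using a hypothetical helper named `parse_installs` (the notebook does the same replacements inline and converts with `float`):

```python
# Hypothetical helper (not in the notebook): read an open-ended install bucket
# such as '100,000+' as its lower bound.
def parse_installs(installs):
    return int(installs.replace(',', '').replace('+', ''))

for raw in ['100+', '100,000+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Every app is therefore counted at the lower bound of its bucket, so the resulting averages underestimate the true install numbers.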

In [24]:
# Calculation of the average number of installs per app genre on the Google Play Store #

def installs_number(dataset, index_unique, index_search):
    table_unique = freq_table(dataset, index_unique)
    for category in table_unique:
        total = 0 # sum of installs specific to each genre
        len_category = 0 # number of apps specific to each genre
        for row in dataset:
            category_app = row[index_unique]
            if category_app == category: 
                n_installs = row[index_search]
                n_installs = n_installs.replace('+', '')
                n_installs = n_installs.replace(',', '')
                total += float(n_installs)
                len_category += 1
        average_number = total / len_category
        print('App Genre: ', category, '- Number of installs: ', average_number)
        
print('The most popular apps by genre on the Google Play Store are:' + '\n')
installs_number(apps_data_google_final, 1, 5)

The most popular apps by genre on the Google Play Store are:

App Genre:  EDUCATION - Number of installs:  1820673.076923077
App Genre:  COMMUNICATION - Number of installs:  38456119.167247385
App Genre:  PERSONALIZATION - Number of installs:  5201482.6122448975
App Genre:  ENTERTAINMENT - Number of installs:  11640705.88235294
App Genre:  DATING - Number of installs:  854028.8303030303
App Genre:  MEDICAL - Number of installs:  120616.48717948717
App Genre:  TRAVEL_AND_LOCAL - Number of installs:  13984077.710144928
App Genre:  SPORTS - Number of installs:  3638640.1428571427
App Genre:  FAMILY - Number of installs:  3694276.334922527
App Genre:  FOOD_AND_DRINK - Number of installs:  1924897.7363636363
App Genre:  SHOPPING - Number of installs:  7036877.311557789
App Genre:  LIBRARIES_AND_DEMO - Number of installs:  638503.734939759
App Genre:  GAME - Number of installs:  15560965.599534342
App Genre:  PARENTING - Number of installs:  542603.6206896552
App Genre:  HEALTH_AND_FITNESS -

On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs. If we removed all the communication apps that have over 100 million installs, the average would be roughly ten times lower.

We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).
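
A comparison like this can be checked by listing the apps at or above a threshold and recomputing the category average without them. The sketch below is one possible way to do it; the function name, the three-column row layout, and the app names are invented for illustration. Because buckets are read at their lower bound, an app listed as 100,000,000+ counts as exactly 100,000,000, so the sketch treats "at or above the threshold" as a giant.

```python
# Hypothetical helpers (not in the notebook).
def to_number(installs):
    # '1,000,000+' is read as 1000000.0, the lower bound of the bucket
    return float(installs.replace(',', '').replace('+', ''))

def averages_with_and_without_giants(rows, category, threshold):
    giants = []        # names of apps at or above the threshold
    all_installs = []  # installs of every app in the category
    for name, row_category, installs in rows:
        if row_category != category:
            continue
        n_installs = to_number(installs)
        all_installs.append(n_installs)
        if n_installs >= threshold:
            giants.append(name)
    below = [n for n in all_installs if n < threshold]
    # Assumes the category has at least one app on each side of the threshold
    avg_all = sum(all_installs) / len(all_installs)
    avg_below = sum(below) / len(below)
    return giants, avg_all, avg_below

# Invented rows for illustration only: [name, category, installs]
toy_rows = [
    ['Chat Giant', 'COMMUNICATION', '1,000,000,000+'],
    ['Small Chat A', 'COMMUNICATION', '1,000,000+'],
    ['Small Chat B', 'COMMUNICATION', '500,000+'],
    ['Notes App', 'PRODUCTIVITY', '10,000+'],
]

giants, avg_all, avg_below = averages_with_and_without_giants(
    toy_rows, 'COMMUNICATION', 100000000)
print('Giants:', giants)
print('Average with giants:', avg_all)
print('Average without giants:', avg_below)
```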

Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.

The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.

The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

### Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.